There are two main models for reading an XML document: the event-based model and the tree-based model. The Simple API for XML (SAX) is an event-based API for reading XML documents.
The event-based model for reading XML works as follows. Two actors interact in the process: the parser, a program that reads the XML document, and the client application, which invokes the parser and waits for the information the parser collects.
The parser reads the XML document from beginning to end. Each time it encounters a new piece of information (the start of the document, a start-tag, an end-tag, character data, a processing instruction, the end of the document), it notifies the client application of the event and passes along the relevant information (such as the tag name or the text of the character data). The client application may save this information in a data structure or process it further. Hence, reading is realized as a stream of events from the parser to the client. Streaming is a well-known technique for handling large amounts of data. For instance, movies are often streamed across the Internet.
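The stream of events described above can be made visible with a small handler that records each notification in the order it arrives. The following is a minimal sketch, not part of the original material: the class name EventTrace and the method trace are hypothetical, and the parser is obtained through the standard JAXP SAXParserFactory so that the document can be supplied as an in-memory string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records every SAX event as a line of text.
class EventTrace {

    // Parses the XML text and returns the events in the order the parser reported them.
    static List<String> trace(String xml) throws Exception {
        final List<String> events = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                events.add("startDocument");
            }
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("startElement " + qName);
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("endElement " + qName);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                events.add("characters '" + new String(ch, start, length) + "'");
            }
            @Override
            public void processingInstruction(String target, String data) {
                events.add("processingInstruction " + target);
            }
            @Override
            public void endDocument() {
                events.add("endDocument");
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : trace("<greeting><to>Alan</to></greeting>")) {
            System.out.println(event);
        }
    }
}
```

The client never asks the parser for the next item; it only reacts to the calls it receives, which is why the order of the recorded lines mirrors the order of the markup in the document.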
There are two major advantages of this model:
In particular, the event-based model makes it possible to process XML documents that are too large to be stored entirely in main memory.
On the other hand, event-based applications are harder to program than tree-based ones, since they are programmed in a stack-like style rather than in a recursive style. Indeed, a SAX parser visits the XML document tree in preorder, and this visit cannot be changed by the client application.
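To illustrate the stack-like style, here is a minimal sketch of a handler that reconstructs the path from the root to each element. Because the parser reports only the current start-tag or end-tag, the client has to remember the open elements itself, pushing on a start-tag and popping on an end-tag. The names ElementPaths and PathHandler are hypothetical and serve only this example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: lists the path of every element in document order.
class ElementPaths {

    static class PathHandler extends DefaultHandler {
        // Names of the elements that are currently open, root first.
        private final Deque<String> open = new ArrayDeque<String>();
        // One entry per element, in the (preorder) order of the start-tags.
        final List<String> paths = new ArrayList<String>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            open.addLast(qName);                      // push on start-tag
            StringBuilder path = new StringBuilder();
            for (String name : open) {
                path.append('/').append(name);
            }
            paths.add(path.toString());
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            open.removeLast();                        // pop on end-tag
        }
    }

    // Parses the XML text and returns the element paths.
    static List<String> paths(String xml) throws Exception {
        PathHandler handler = new PathHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.paths;
    }

    public static void main(String[] args) throws Exception {
        for (String path : paths("<a><b><c/></b><d/></a>")) {
            System.out.println(path);
        }
    }
}
```

A tree-based program could obtain the same paths by a recursive descent over the child nodes; in the event-based version the explicit stack plays the role of the recursion stack.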
SAX is a de facto standard originally proposed by David Megginson. At the time of writing, the current version of SAX is SAX2, which is the version described here. SAX1 methods are still supported by all major parsers but are deprecated and should not be used. SAX was originally defined as a Java API and is primarily intended for parsers written in Java. However, it has also been implemented in other major object-oriented languages. We will describe the implementation of the API contained in Java platform 5.0.
SAX contains two major interfaces that do most of the job: XMLReader and ContentHandler. The XMLReader parses the XML document and notifies the events by calling methods of the associated ContentHandler. In order to parse a document with the SAX interface you have to:
XMLReader parser = XMLReaderFactory.createXMLReader();
parser.setFeature("http://xml.org/sax/features/namespaces", false);
parser.setFeature("http://xml.org/sax/features/validation", true);
parser.setContentHandler(handler);
parser.parse("Turing.xml");
The following Java program SAXStats illustrates the technique. It computes the following statistics on an input XML document: number of elements, number of attributes, total length of character data (including ignorable whitespace), and height of the XML tree (defined as the maximum nesting level of elements in the XML document):
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class SAXStats {

    public static void main(String[] args) {
        // name of the XML document to parse
        String filename = args[0];
        XMLReader parser = null;

        // create a parser instance
        try {
            parser = XMLReaderFactory.createXMLReader();
        } catch (SAXException se) {
            // No default SAX parser is available
            System.out.println(se.getMessage());
            System.exit(1);
        }

        // create a default handler
        StatsHandler handler = new StatsHandler();
        // register the default content handler with the parser
        parser.setContentHandler(handler);

        // parse the document
        try {
            parser.parse(filename);
        } catch (SAXParseException spe) {
            // Document is not well-formed
            System.out.println(spe.getMessage());
            System.exit(1);
        } catch (SAXException se) {
            // Some other general parse error occurred
            System.out.println(se.getMessage());
            System.exit(1);
        } catch (IOException ioe) {
            // Some IO error occurred
            System.out.println(ioe.getMessage());
            System.exit(1);
        }
    }

    /*
     * Extends DefaultHandler to count elements, attributes, text length
     * and tree height in the XML file and prints these numbers.
     */
    public static class StatsHandler extends DefaultHandler {
        // used to store the counts
        private int numElements, numAttributes, numChars, height, maxHeight;

        // This method is invoked when the parser encounters the document start
        public void startDocument() {
            numElements = 0;
            numAttributes = 0;
            numChars = 0;
            height = -1;
            maxHeight = 0;
        }

        // This method is invoked when the parser encounters a start-tag
        public void startElement(String uri, String localname, String qname,
                                 Attributes attributes) {
            numElements++;
            numAttributes += attributes.getLength();
            height++;
            if (height > maxHeight) {
                maxHeight = height;
            }
        }

        // This method is invoked when the parser encounters an end-tag
        public void endElement(String uri, String localname, String qname) {
            height--;
        }

        // This method is invoked when the parser encounters any plain text
        // within an element
        public void characters(char[] text, int start, int length) {
            numChars += length;
        }

        // This method is invoked when the parser encounters the document end
        public void endDocument() {
            System.out.println("Number of elements: " + numElements);
            System.out.println("Number of attributes: " + numAttributes);
            System.out.println("Number of characters of plain text: " + numChars);
            System.out.println("Tree height: " + maxHeight);
        }
    }
}
DefaultHandler is a helper class that implements the commonly used SAX handler interfaces (in particular ContentHandler) by defining empty implementations for all of their methods. It is easier to subclass DefaultHandler and override only the desired methods (as we did above) than to implement all the interface methods from scratch.
An alternative method to parse a document using the SAX model is to use the SAX parser factory and SAX parser objects contained in the JAXP package javax.xml.parsers.*. The method works as follows:
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
SAXParser parser = factory.newSAXParser();
parser.parse(new File("Turing.xml"), handler);
The SAXStats2 class uses the above described parsing method to produce some basic statistics on a given XML document.
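For readers who want to see the JAXP steps assembled into one program, here is a minimal sketch. It is not the SAXStats2 class itself: the names JaxpStatsSketch and Counter are hypothetical, the counting logic simply mirrors the StatsHandler shown earlier, and validation is left switched off (an assumption made so that documents without a DTD can be processed without validity errors being reported).

```java
import java.io.File;
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a JAXP-based statistics program.
class JaxpStatsSketch {

    // Same counts as StatsHandler; the root element is at height 0.
    static class Counter extends DefaultHandler {
        int elements, attributes, chars, height = -1, maxHeight;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements++;
            attributes += atts.getLength();
            height++;
            if (height > maxHeight) {
                maxHeight = height;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            height--;
        }

        @Override
        public void characters(char[] text, int start, int length) {
            chars += length;
        }
    }

    // Creates a namespace-aware, non-validating parser through the JAXP factory.
    static SAXParser newParser() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser();
    }

    // Convenience entry point for XML held in a string.
    static Counter count(String xml) throws Exception {
        Counter counter = new Counter();
        newParser().parse(new InputSource(new StringReader(xml)), counter);
        return counter;
    }

    public static void main(String[] args) throws Exception {
        Counter counter;
        if (args.length > 0) {
            counter = new Counter();
            newParser().parse(new File(args[0]), counter);   // the handler is passed to parse()
        } else {
            counter = count("<a x='1'><b>hi</b></a>");
        }
        System.out.println("Number of elements: " + counter.elements);
        System.out.println("Number of attributes: " + counter.attributes);
        System.out.println("Number of characters of plain text: " + counter.chars);
        System.out.println("Tree height: " + counter.maxHeight);
    }
}
```

Note the difference from the XMLReader approach: with JAXP there is no separate setContentHandler step, since the handler is supplied directly as the second argument of parse.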