
Simple API for XML

There are two main models for reading an XML document: the event-based model and the tree-based model. The Simple API for XML (SAX) is an event-based API for reading XML documents.

The event-based model for reading XML works as follows. There are two interacting actors in the process: the parser, a program that reads the XML document, and the client application, which invokes the parser and waits for the information it collects.

The XML document is read by the parser from beginning to end. Each time the parser encounters a new piece of information (the start of the document, a start-tag, an end-tag, character data, a processing instruction, the end of the document), it notifies the client program of the event by sending the relevant information (such as the tag name or the text of the character data). The client application may save this information in some data structure or process it further. Hence, the reading process is realized as a stream of events from the parser to the client. Streaming is a well-known method for reading large data. For instance, movies are often streamed across the Internet.
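The stream of events can be made visible with a small client that simply records each notification in the order it arrives. The following is a minimal sketch, not part of the original text: the class name EventTrace and the sample document are hypothetical, and the parser is obtained through the JAXP factory bundled with the JDK so that the example can read XML from a string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// EventTrace is a hypothetical name used only for this illustration.
class EventTrace {

  // Parses the given XML text and returns the events in the order received.
  static List<String> trace(String xml) throws Exception {
    final List<String> events = new ArrayList<String>();
    // The client: each callback below is one kind of event sent by the parser
    DefaultHandler client = new DefaultHandler() {
      @Override
      public void startDocument() {
        events.add("startDocument");
      }
      @Override
      public void startElement(String uri, String localName, String qName,
                               Attributes attributes) {
        events.add("startElement " + qName);
      }
      @Override
      public void characters(char[] text, int start, int length) {
        // A parser may deliver one run of text in several chunks
        events.add("characters \"" + new String(text, start, length) + "\"");
      }
      @Override
      public void endElement(String uri, String localName, String qName) {
        events.add("endElement " + qName);
      }
      @Override
      public void endDocument() {
        events.add("endDocument");
      }
    };
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), client);
    return events;
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical sample document
    String xml = "<person><name>Alan Turing</name></person>";
    for (String event : trace(xml)) {
      System.out.println(event);
    }
  }
}
```

The client never sees the document as a whole: it only sees one event at a time, and anything it wants to remember it must store itself.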

There are two major advantages of this model:

In particular, the event-based model makes it possible to process XML documents that, because of space constraints, cannot be stored entirely in main memory.

On the other hand, event-based applications are harder to program than tree-based ones, since they are programmed in a stack-like style rather than in a recursive style. Indeed, a SAX parser visits the XML document tree in preorder, and this visit cannot be changed by the client application.
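To give an idea of the stack-like style, the sketch below keeps an explicit stack of the currently open elements, pushing on each start-tag and popping on each end-tag, so that the client always knows where it is in the tree. This is only an illustration under stated assumptions: the class name PathTracker and the method elementPaths are hypothetical, and the parser is obtained through the JAXP factory bundled with the JDK.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// PathTracker is a hypothetical name used only for this illustration.
class PathTracker {

  // Returns the path from the root to each element, in the order visited.
  static List<String> elementPaths(String xml) throws Exception {
    final Deque<String> open = new ArrayDeque<String>(); // open elements
    final List<String> paths = new ArrayList<String>();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName,
                               Attributes attributes) {
        open.addLast(qName); // push: we have entered a new element
        StringBuilder path = new StringBuilder();
        for (String name : open) {
          path.append('/').append(name);
        }
        paths.add(path.toString());
      }
      @Override
      public void endElement(String uri, String localName, String qName) {
        open.removeLast(); // pop: we have left the innermost element
      }
    };
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler);
    return paths;
  }

  public static void main(String[] args) throws Exception {
    // Prints /a, /a/b, /a/c and /a/c/d, one per line
    for (String path : elementPaths("<a><b/><c><d/></c></a>")) {
      System.out.println(path);
    }
  }
}
```

In a tree-based program the same information would come for free from the recursion; here the client has to maintain it by hand.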

SAX is a de facto standard originally proposed by David Megginson. At the time of writing, the current version of SAX is SAX2, which is the version described here. SAX1 methods are still supported by all major parsers but are deprecated (you shouldn't use them). SAX was originally defined as a Java API and is primarily intended for parsers written in Java. However, it is implemented in other major object-oriented languages as well. We will describe the implementation of the API contained in Java platform 5.0.

SAX contains two major interfaces that do most of the job: XMLReader and ContentHandler. The XMLReader parses the XML document and notifies the events by calling methods of the associated ContentHandler. In order to parse a document with the SAX interface you have to:

  1. implement the methods of the ContentHandler interface for the events that you intend to handle;
  2. create an XMLReader object (a SAX parser) as follows:
    XMLReader parser = XMLReaderFactory.createXMLReader();
    
  3. optionally, configure the parser. This configuration influences the way the parser parses the document. Two relevant features that you can turn on or off here are namespace awareness (on by default) and validation. For instance, to turn off namespace awareness and turn on validation, write:
    parser.setFeature("http://xml.org/sax/features/namespaces", false);    
    parser.setFeature("http://xml.org/sax/features/validation", true);    
    
  4. associate the implementation of the content handler to the SAX parser:
    parser.setContentHandler(handler);
    
  5. parse the document as follows:
    parser.parse("Turing.xml");
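The effect of the namespace feature in step 3 can be observed in the arguments passed to startElement. The sketch below is an illustration only: the class name NamespaceFeatureDemo, the method describeRoot and the namespace URI http://example.org/ns are hypothetical. It follows the five steps above, except that it reads the document from a string through an InputSource. Note that on recent JDKs XMLReaderFactory still works but is marked as deprecated.

```java
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// NamespaceFeatureDemo is a hypothetical name used only for this illustration.
class NamespaceFeatureDemo {

  // Reports the names the parser passes for the root element.
  static String describeRoot(String xml, boolean namespaces) throws Exception {
    final String[] result = new String[1];
    // Step 1: handle only the event of interest
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName,
                               Attributes attributes) {
        if (result[0] == null) { // keep only the first (root) element
          result[0] = "uri=" + uri + ", localName=" + localName
              + ", qName=" + qName;
        }
      }
    };
    // Step 2: create the parser
    XMLReader parser = XMLReaderFactory.createXMLReader();
    // Step 3: configure it
    parser.setFeature("http://xml.org/sax/features/namespaces", namespaces);
    // Step 4: associate the handler
    parser.setContentHandler(handler);
    // Step 5: parse
    parser.parse(new InputSource(new StringReader(xml)));
    return result[0];
  }

  public static void main(String[] args) throws Exception {
    String xml = "<p:doc xmlns:p=\"http://example.org/ns\"/>";
    System.out.println("namespaces on:  " + describeRoot(xml, true));
    System.out.println("namespaces off: " + describeRoot(xml, false));
  }
}
```

With namespace awareness on, the parser reports the namespace URI and the local name doc; with it off, the client should rely on the qualified name p:doc instead.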
    

The following Java program SAXStats illustrates the technique. It computes the following statistics on an input XML document: number of elements, number of attributes, total length of character data (including ignorable whitespace), and height of the XML tree (defined as the maximum nesting level of elements in the XML document):

import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class SAXStats {
  
  public static void main(String[] args) {
    
    // name of the XML document to parse
    String filename = args[0];
    XMLReader parser = null;
    // Create a parser instance 
    try {
      parser = XMLReaderFactory.createXMLReader();
    } catch (SAXException se) {
      // No default SAX parser is available
      System.out.println(se.getMessage());
      System.exit(1);
    }
    // create a default handler
    StatsHandler handler = new StatsHandler();
    // register the default content handler with the parser
    parser.setContentHandler(handler);
    // parse the document
    try {
      parser.parse(filename);
    } catch (SAXParseException spe) {
      // Document is not well-formed
      System.out.println(spe.getMessage());
      System.exit(1);
    } catch (SAXException se) {
      // Some other general parse error occurred
      System.out.println(se.getMessage());
      System.exit(1);
    } catch (IOException ioe) {
      // Some IO error occurred
      System.out.println(ioe.getMessage());
      System.exit(1);
    }
  }

  /* Extends DefaultHandler to count elements, attributes, text length and 
  tree height in the XML file and prints these numbers. */
  public static class StatsHandler extends DefaultHandler {
      
    // used to store the counts
    private int numElements, numAttributes, numChars, height, maxHeight; 
          
    // This method is invoked when the parser encounters the document start
    public void startDocument() {
      numElements = 0;
      numAttributes = 0;
      numChars = 0;
      height = -1;
      maxHeight = 0;
    }
    
    // This method is invoked when the parser encounters a start-tag 
    public void startElement(String uri, String localname, String qname, 
                             Attributes attributes) {
      numElements++;
      numAttributes += attributes.getLength();
      height++;
      if (height > maxHeight) {
        maxHeight = height;
      }
    }
    
    // This method is invoked when the parser encounters an end-tag 
    public void endElement(String uri, String localname, String qname) {
      height--;
    }


    // This method is invoked when the parser encounters any plain text within an 
    // element
    public void characters(char[] text, int start, int length) {
      numChars += length;
    }
    
    // This method is invoked when the parser encounters the document end
    public void endDocument() {
      System.out.println("Number of elements: " + numElements);
      System.out.println("Number of attributes: " + numAttributes);
      System.out.println("Number of characters of plain text: " + numChars);
      System.out.println("Tree height: " + maxHeight);
    }
  }
}

DefaultHandler is a helper class that implements the commonly used SAX handler interfaces (in particular ContentHandler) by defining empty implementations for all of their methods. It is easier to subclass DefaultHandler and override only the desired methods (as we did above) than to implement all the interface methods from scratch.

An alternative method to parse a document using the SAX model is to use the SAX parser factory and SAX parser objects contained in the JAXP package javax.xml.parsers.*. The method works as follows:

  1. implement the methods of the ContentHandler interface for the events that you intend to handle;
  2. obtain a parser factory for creating SAX parsers as follows:
    SAXParserFactory factory = SAXParserFactory.newInstance();
    
  3. optionally, configure the parser factory. This configuration influences the way the parsers it creates parse the document. Two relevant features that you can turn on or off here are namespace awareness and validation (in this case, neither feature is on by default). For instance, to turn on both namespace awareness and validation, write:
    factory.setNamespaceAware(true);    
    factory.setValidating(true);
    
  4. use the factory to create a SAX parser:
    SAXParser parser = factory.newSAXParser();
    
  5. finally, parse an existing document; the parser calls back the content handler methods that you defined:
    parser.parse(new File("Turing.xml"), handler);
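The five JAXP steps above can be put together as in the following sketch. It is not the SAXStats2 class mentioned below, only a minimal illustration that counts elements. The class name JaxpElementCount and the method countElements are hypothetical, and the document is read from an InputStream so that the same method works on files and on in-memory data.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// JaxpElementCount is a hypothetical name used only for this illustration.
class JaxpElementCount {

  // Step 1: a handler that reacts only to start-tags
  static class CountHandler extends DefaultHandler {
    int numElements = 0;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
      numElements++;
    }
  }

  // Counts the elements of the XML document read from the given stream.
  static int countElements(InputStream in)
      throws ParserConfigurationException, SAXException, IOException {
    // Step 2: obtain a parser factory
    SAXParserFactory factory = SAXParserFactory.newInstance();
    // Step 3: configure the factory (namespace awareness is off by default)
    factory.setNamespaceAware(true);
    // Step 4: create the parser
    SAXParser parser = factory.newSAXParser();
    // Step 5: parse, passing the handler directly to the parse method
    CountHandler handler = new CountHandler();
    parser.parse(in, handler);
    return handler.numElements;
  }

  public static void main(String[] args) throws Exception {
    // args[0] is the name of the XML document to parse
    InputStream in = new FileInputStream(args[0]);
    try {
      System.out.println("Number of elements: " + countElements(in));
    } finally {
      in.close();
    }
  }
}
```

Unlike the XMLReader approach, the handler is not registered in a separate step: it is passed as an argument to parse.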
    

The SAXStats2 class uses the parsing method described above to produce some basic statistics on a given XML document.

Caffè XML - Massimo Franceschet