
Document Object Model

The Document Object Model (DOM) is a tree-based API for reading XML documents. A DOM parser reads the entire XML document and builds a tree data structure that represents it. After parsing has completed successfully, the client application can navigate the tree to find the desired pieces of information.

The advantages of a tree-based model are the following:

On the other hand, a DOM parser has to store the entire document in memory, and this is not always feasible. Moreover, the client application cannot start working until the whole document has been read and transformed into its tree representation.

Hence, DOM and SAX are complementary technologies, and the programmer must take care to choose the right model according to the needs of the application.

DOM is a W3C standard. It is a set of interfaces defined in a neutral language: the Interface Definition Language (IDL) notation proposed by the Object Management Group. Versions of the DOM are defined as levels, and the current version is DOM Level 3. We will describe the implementation of DOM Level 3 contained in Java platform 5.0.

To parse a document using the DOM model, follow these steps:

  1. obtain a document builder (i.e., parser) factory for creating DOM parsers as follows:
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    
  2. optionally, configure the document builder factory. This configuration will influence the way the parser will parse the document. Two relevant properties that you can set here are namespace awareness and validation (the default in both cases is false):
    factory.setNamespaceAware(true);    
    factory.setValidating(true);
    
  3. use the factory to create a DOM parser:
    DocumentBuilder parser = factory.newDocumentBuilder();
    
  4. you can now obtain a new empty document from the parser using the method newDocument() or parse an existing document as follows:
    Document document = parser.parse(new File("Turing.xml"));
    

The obtained Document object is the DOM tree associated with the XML document you have provided. In particular, this object is the root node of the tree. The tree representation is essentially that used by XPath, with some small changes. The client application can now navigate and modify the tree. Moreover, it can serialize the tree to obtain the corresponding XML document (serialization and parsing are inverse operations).
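As a minimal sketch of the four steps above and of the parsing/serialization round trip, the following program parses a small document held in a string and writes the tree back out. The class name ParseAndSerialize and the sample document are illustrative assumptions. Serialization here goes through the identity Transformer of the javax.xml.transform API, which is one possible way to write out a DOM tree and not necessarily the one intended by this page. The multi-catch syntax requires Java 7 or later.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ParseAndSerialize {

    // Steps 1-4: create a factory, configure it, create a parser, parse.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder parser = factory.newDocumentBuilder();
            return parser.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("cannot parse the document", e);
        }
    }

    // Writes the tree back to XML text using an identity transformation.
    static String serialize(Document document) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("cannot serialize the document", e);
        }
    }

    public static void main(String[] args) {
        Document document = parse("<person><name>Alan Turing</name></person>");
        System.out.println(document.getDocumentElement().getNodeName());
        System.out.println(serialize(document));
    }
}
```

To parse a file instead of a string, pass a File to parser.parse(...) as in step 4.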

All nodes in a DOM document tree (including the Document object itself) implement the Node interface, which provides basic methods for traversing and manipulating the tree. The following methods can be used to navigate the tree:

getParentNode()
Returns the parent node.
getChildNodes()
Returns a list of child nodes. The returned NodeList object can be iterated through using getLength(), which returns the number of nodes in the list, and item(int index), which returns the node at position index (the first index is 0). The list is read-only and live, that is, it reflects changes to the document tree. Notice that child nodes may have different types (element, text, comment, processing instruction, and more) but do not include attributes, which may be retrieved with the getAttributes() method below.
getFirstChild()
Child nodes are ordered in document order. This method returns the first child node.
getLastChild()
Returns the last child node.
getNextSibling()
Returns the next sibling node.
getPreviousSibling()
Returns the previous sibling node.
getAttributes()
Returns a list of attribute nodes. The returned NamedNodeMap object can be iterated through using getLength(), which returns the number of nodes in the list, and item(int index), which returns the node at position index (the first index is 0). You can also get an attribute node by name using the getNamedItem(String name) method. A NamedNodeMap is live, that is, it reflects changes to the document tree. Notice that attributes are not child nodes of the element they belong to and, in contrast to the XPath data model, an attribute node has no parent: getParentNode() returns null for it, and the element that carries the attribute is obtained with the getOwnerElement() method of the Attr interface.
getOwnerDocument()
Returns the Document node, that is the tree root.
hasChildNodes()
Determines whether a node has children or not.
hasAttributes()
Determines whether a node has attributes or not (only element nodes may have attributes).
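The navigation methods above can be sketched as follows. The class name NavigateTree, the helper parse, and the sample document are hypothetical and serve only as illustration. The example walks the children through the sibling chain, lists the attributes through the NamedNodeMap, and shows that an attribute node has no parent.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class NavigateTree {

    // Hypothetical helper: parses XML held in a string with the default configuration.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("cannot parse the document", e);
        }
    }

    // Names of the element children, found by following the sibling chain.
    static List<String> childElementNames(Node node) {
        List<String> names = new ArrayList<>();
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    // Attributes as name=value strings; getAttributes() is null for non-element nodes.
    static List<String> attributes(Node node) {
        List<String> result = new ArrayList<>();
        NamedNodeMap map = node.getAttributes();
        if (map != null) {
            for (int i = 0; i < map.getLength(); i++) {
                Node attr = map.item(i);
                result.add(attr.getNodeName() + "=" + attr.getNodeValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Node person = parse(
                "<person id=\"p1\"><name>Alan</name><!-- note --><born>1912</born></person>")
                .getDocumentElement();
        System.out.println(childElementNames(person));                        // [name, born]
        System.out.println(attributes(person));                               // [id=p1]
        System.out.println(person.getParentNode() == person.getOwnerDocument()); // true
        System.out.println(person.getAttributes().item(0).getParentNode());   // null
    }
}
```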

Moreover, the Node interface contains methods to get useful information from a node, like the type, the name and the value:

getNodeType()
Returns the type of the node. The main node types are the following: element (identified by the constant ELEMENT_NODE defined in the Node interface), attribute (ATTRIBUTE_NODE), text (TEXT_NODE), and document (DOCUMENT_NODE).
getNodeName()
Returns the name of the node. This is the element tag name for element nodes and the attribute name for attribute nodes. If the document uses namespaces, you may use getNamespaceURI(), getLocalName(), and getPrefix() methods to obtain more specific information about the node name.
getNodeValue()
Returns the value of the node. This is the attribute value for attribute nodes, the character data for text nodes, and null for element nodes.
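To illustrate how type, name and value vary with the kind of node, here is a small sketch. The class name DescribeNode, the method describe and the sample document are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DescribeNode {

    // Hypothetical helper: parses XML held in a string with the default configuration.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("cannot parse the document", e);
        }
    }

    // Builds a one-line description from the node type, name and value.
    static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                return "element " + node.getNodeName();
            case Node.ATTRIBUTE_NODE:
                return "attribute " + node.getNodeName() + "=" + node.getNodeValue();
            case Node.TEXT_NODE:
                return "text \"" + node.getNodeValue() + "\"";
            case Node.DOCUMENT_NODE:
                return "document";
            default:
                return "other " + node.getNodeName();
        }
    }

    public static void main(String[] args) {
        Document document = parse("<name lang=\"en\">Alan</name>");
        Element name = document.getDocumentElement();
        System.out.println(describe(document));                        // document
        System.out.println(describe(name));                            // element name
        System.out.println(describe(name.getAttributes().item(0)));    // attribute lang=en
        System.out.println(describe(name.getFirstChild()));            // text "Alan"
        System.out.println(name.getNodeValue());                       // null
    }
}
```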

Finally, there are methods for inserting, deleting, and replacing nodes from the tree:

appendChild(Node newChild)
Appends newChild node as the last child of the current node. It returns the inserted node. If newChild is already in the tree, it is first removed.
insertBefore(Node newChild, Node refChild)
Inserts newChild node before refChild node in the list of children of the current node. It returns the inserted node. If newChild is already in the tree, it is first removed.
removeChild(Node oldChild)
Removes oldChild node from the list of children of the current node and returns it.
replaceChild(Node newChild, Node oldChild)
Replaces oldChild node with newChild node from the list of children of the current node and returns the replaced node.
cloneNode(boolean deep)
Returns a copy of the current node. The copy includes attributes and their values, but this method does not copy any children (including text children) it contains unless deep is true.
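The effect of the modification methods can be followed step by step in the sketch below. The class name ModifyTree, the helpers parse, childNames and demo, and the sample document are hypothetical. The sample document contains no whitespace between tags, so the root has exactly three element children.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ModifyTree {

    // Hypothetical helper: parses XML held in a string with the default configuration.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("cannot parse the document", e);
        }
    }

    // Joins the names of the children of a node, in document order.
    static String childNames(Node node) {
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(children.item(i).getNodeName());
        }
        return sb.toString();
    }

    // Applies each method once and records the children after every step.
    static List<String> demo() {
        Element list = parse("<list><a/><b/><c/></list>").getDocumentElement();
        Node a = list.getFirstChild();
        Node c = list.getLastChild();
        List<String> log = new ArrayList<>();

        list.appendChild(a);                         // a is already in the tree: it is moved
        log.add(childNames(list));                   // b c a
        list.insertBefore(c, list.getFirstChild());  // c moves in front of b
        log.add(childNames(list));                   // c b a
        Node removed = list.removeChild(a);          // returns a, now detached
        log.add(childNames(list));                   // c b
        list.replaceChild(removed, c);               // a takes the place of c
        log.add(childNames(list));                   // a b
        // a shallow clone has no children, a deep clone copies the whole subtree
        log.add(String.valueOf(list.cloneNode(false).getChildNodes().getLength())); // 0
        log.add(String.valueOf(list.cloneNode(true).getChildNodes().getLength()));  // 2
        return log;
    }

    public static void main(String[] args) {
        for (String step : demo()) {
            System.out.println(step);
        }
    }
}
```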

For each node type (element, attribute, and so on) there is a specialized interface corresponding to its type. The most important of such interfaces are Element, Document, and Text, that are described below.

The Element interface represents element nodes. In particular, it allows you to get, set and remove the attributes of an element. The most useful methods of the interface are the following:

getElementsByTagName(String name)
Returns a NodeList of descendant elements, with respect to the current node, with the given name.
hasAttribute(String name)
Determines whether the current element has an attribute with the given name.
getAttribute(String name)
Returns the attribute value (a string) of the named attribute for the current element.
setAttribute(String name, String value)
Sets the attribute value of the named attribute to the given value.
removeAttribute(String name)
Removes the named attribute from the current element.
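A short sketch of the Element methods follows. The class name ElementAttributes, the helpers parse and demo, and the sample document are illustrative assumptions. Note that getAttribute returns the empty string, not null, when the attribute is missing, and that getElementsByTagName searches all descendants, not only the children.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ElementAttributes {

    // Hypothetical helper: parses XML held in a string with the default configuration.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("cannot parse the document", e);
        }
    }

    // Reads, sets and removes attributes, recording the result of each step.
    static List<String> demo() {
        Element person = parse(
                "<person id=\"p1\"><name>Alan</name><friend><name>Alonzo</name></friend></person>")
                .getDocumentElement();
        List<String> log = new ArrayList<>();

        log.add(String.valueOf(person.hasAttribute("id")));   // true
        log.add(person.getAttribute("id"));                   // p1
        log.add("[" + person.getAttribute("born") + "]");     // []: missing attribute gives ""
        person.setAttribute("born", "1912");
        log.add(person.getAttribute("born"));                 // 1912
        person.removeAttribute("id");
        log.add(String.valueOf(person.hasAttribute("id")));   // false
        // both name elements are found, including the one nested inside friend
        log.add(String.valueOf(person.getElementsByTagName("name").getLength())); // 2
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```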

The Document interface represents the DOM tree root. Notice that the tree root is a virtual node not present in the document. In particular, this node is the parent of the document element node, that is the first element of the XML document. This interface is important mainly because it contains factory methods to create nodes of the various types. The most useful methods of the interface are the following:

getElementsByTagName(String name)
Returns a NodeList of document elements with the given name.
getElementById(String elementId)
Returns a single Element corresponding to the unique element in the document that has an attribute of type ID with value elementID. The attribute of type ID must be declared in the DTD.
getDocumentElement()
Returns the document element, that is, the only child element of the root node.
getXmlEncoding(), getXmlStandalone(), getXmlVersion()
Returns the encoding, the standalone, and the version values defined in the XML declaration of the document.
createElement(String tagName)
Creates and returns an element node with the specified tag name.
createTextNode(String data)
Creates and returns a text node with the specified character data.
importNode(Node importedNode, boolean deep)
Returns a node imported from another document to the current document, without altering or removing the source node from the original document; this method creates a new copy of the source node. The returned node has no parent. This method does not copy any children (including text children) the imported node contains unless deep is true. You can use this method to add foreign nodes to the current document.
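The factory methods and importNode can be combined as in the following sketch, which builds a new document from scratch and then adds an element taken from a second document. The class name BuildDocument, the helpers newParser, parse and build, and the element names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class BuildDocument {

    // Hypothetical helper: creates a parser with the default configuration.
    static DocumentBuilder newParser() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("cannot create a parser", e);
        }
    }

    // Hypothetical helper: parses XML held in a string.
    static Document parse(String xml) {
        try {
            return newParser().parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("cannot parse the document", e);
        }
    }

    static Document build() {
        // start from an empty document and create the nodes with the factory methods
        Document target = newParser().newDocument();
        Element people = target.createElement("people");
        target.appendChild(people);
        Element person = target.createElement("person");
        person.appendChild(target.createTextNode("Alan Turing"));
        people.appendChild(person);

        // a node of another document must be imported before it can be inserted;
        // deep = true copies its text child as well, and the source is left untouched
        Document source = parse("<person>Alonzo Church</person>");
        Node imported = target.importNode(source.getDocumentElement(), true);
        people.appendChild(imported);
        return target;
    }

    public static void main(String[] args) {
        Element people = build().getDocumentElement();
        System.out.println(people.getNodeName());                              // people
        System.out.println(people.getChildNodes().getLength());                // 2
        System.out.println(people.getLastChild().getFirstChild().getNodeValue()); // Alonzo Church
    }
}
```

Appending source.getDocumentElement() directly, without importNode, would fail because the node belongs to a different document.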

The Text interface represents text nodes, that are strings of plain text without markup. These nodes may be manipulated with the following methods:

getData()
Returns the character data of the node.
getLength()
Returns the length of the character data of the node.
substringData(int offset, int count)
Extracts a range of data from the node.
setData(String data)
Sets the character data of the node to the given string.
appendData(String data)
Appends the given string to the node character data.
insertData(int offset, String data)
Inserts the given string at the given offset in the node character data.
deleteData(int offset, int count)
Deletes a range of data from the node.
replaceData(int offset, int count, String data)
Replaces a range of data with the given string.
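The following sketch applies the Text methods to a single text node and records the character data after each step. The class name TextData, the helpers parse and demo, and the sample text are assumptions made for the example. Offsets are counted from 0.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class TextData {

    // Hypothetical helper: parses XML held in a string with the default configuration.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("cannot parse the document", e);
        }
    }

    // Edits the only text child of the root and logs the data after each step.
    static List<String> demo() {
        Document document = parse("<name>Alan Turing</name>");
        Text text = (Text) document.getDocumentElement().getFirstChild();
        List<String> log = new ArrayList<>();

        log.add(text.getData());                    // Alan Turing
        log.add(String.valueOf(text.getLength()));  // 11
        log.add(text.substringData(0, 4));          // Alan
        text.appendData(" OBE");
        log.add(text.getData());                    // Alan Turing OBE
        text.insertData(5, "M. ");
        log.add(text.getData());                    // Alan M. Turing OBE
        text.deleteData(14, 4);
        log.add(text.getData());                    // Alan M. Turing
        text.replaceData(0, 4, "A.");
        log.add(text.getData());                    // A. M. Turing
        return log;
    }

    public static void main(String[] args) {
        for (String step : demo()) {
            System.out.println(step);
        }
    }
}
```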

The following Java program DOMStats parses an input XML document and computes some statistics on it (number of elements, number of attributes, total length of character data and document tree height). It is the tree-based version of the event-based program SAXStats presented earlier:

import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;

public class DOMStats {

  // used to store the counts
  public static int numElements, numAttributes, numChars, height; 
  
  /* counts elements, attributes, text length and returns tree height
  of the tree rooted at the given node. */
  public static int domCount(Node node) {
    
    int type = node.getNodeType();
    
    if (type == Node.TEXT_NODE) {
      // getLength() is inherited by Text from the CharacterData interface
      numChars += ((Text) node).getLength();
      return 0;
    }
    
    if (type == Node.ELEMENT_NODE) {
      numElements++;
      numAttributes += node.getAttributes().getLength();
      NodeList children = node.getChildNodes();
      int outdegree = children.getLength();
      int h = 0;
      int hMax = 0;
      boolean leaf = true;
      for (int i = 0; i < outdegree; i++) {
        h = domCount(children.item(i));
        if (h > hMax) {
          hMax = h;
        }
        if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
          leaf = false;
        } 
      }
      if (leaf) {
        return 0;
      }
      else {
        return (1 + hMax); 
      }
    }
    // in all other cases
    return 0;
  }
  
  
  public static void main(String[] args) {
    
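    // name of the XML document to parse (required command-line argument)
    if (args.length != 1) {
      System.err.println("Usage: java DOMStats <file.xml>");
      System.exit(1);
    }
    String filename = args[0];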
    
    DocumentBuilderFactory factory = null;
    // create a parser factory instance 
    try {
      factory = DocumentBuilderFactory.newInstance();
    } catch (FactoryConfigurationError fce) {
      // The implementation is not available or cannot be instantiated
      System.out.println(fce.getMessage());
      System.exit(1);
    }
    
    DocumentBuilder parser = null;
    // use the factory to create a parser instance
    try {
      parser = factory.newDocumentBuilder();
    } catch (ParserConfigurationException pce) {
      // a parser cannot be created which satisfies the requested configuration
      System.out.println(pce.getMessage());
      System.exit(1);
    }
    
    // parse the document
    Document document = null;
    try {
      document = parser.parse(new File(filename));
    } catch (SAXException se) {
      // Some general parse error occurred. Might be thrown because DocumentBuilder
      // class reuses several classes from the SAX API.
      System.out.println(se.getMessage());
      System.exit(1);
    } catch (IOException ioe) {
      // Some IO error occurred
      System.out.println(ioe.getMessage());
      System.exit(1);
    } catch (IllegalArgumentException iae) {
      // filename is null
      System.out.println(iae.getMessage());
      System.exit(1);
    }
    
    numElements = 0;
    numAttributes = 0;
    numChars = 0;
    // computes statistics
    height = domCount(document.getDocumentElement());
    
    System.out.println("Number of elements: " + numElements);
    System.out.println("Number of attributes: " + numAttributes);
    System.out.println("Number of characters of plain text: " + numChars);
    System.out.println("Tree height: " + height);
    
  }
}

Caffè XML - Massimo Franceschet