The Document Object Model (DOM) is a tree-based API for reading and manipulating XML documents. A DOM parser reads the entire XML document and builds a tree data structure that represents it. Once parsing has completed successfully, the client application can navigate the tree to find the desired pieces of information.
The advantages of a tree-based model are the following:
On the other hand, a DOM parser has to store the entire document in memory, and this is not always feasible. Moreover, the client application cannot start working until the whole document has been read and transformed into its tree representation.
Hence, DOM and SAX are complementary technologies, and the programmer must take care to choose the right model according to the needs of the application.
DOM is a W3C standard. It is a set of interfaces defined in a language-neutral notation: the Interface Definition Language (IDL) of the Object Management Group. Versions of the DOM are defined as levels, and the current version is DOM Level 3. We will describe the implementation of DOM Level 3 contained in the Java platform 5.0.
To parse a document using the DOM model, follow these steps:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
DocumentBuilder parser = factory.newDocumentBuilder();
Document document = parser.parse(new File("Turing.xml"));
The resulting Document object is the DOM tree associated with the XML document you have provided. More precisely, this object is the root node of the tree. The tree representation is essentially the one used by XPath, with some small changes. The client application can now navigate and modify the tree. It can also serialize the tree to obtain a corresponding XML document (serialization and parsing are inverse operations).
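Serialization can be sketched with the identity transformation of the javax.xml.transform package, which is one of several ways to write a DOM tree back to text. The class name SerializeSketch and the sample XML string are assumptions made for illustration, and checked exceptions are rethrown unchecked only to keep the sketch short.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name and sample data, used only for illustration.
class SerializeSketch {

    // Parses XML text into a DOM tree; failures are rethrown unchecked.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    // Serializes a DOM tree to XML text using an identity transformation.
    static String serialize(Document document) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            // leave out the <?xml ...?> declaration to keep the output compact
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("cannot serialize DOM tree", e);
        }
    }

    public static void main(String[] args) {
        Document document = parse("<person id=\"1\"><name>Alan Turing</name></person>");
        String xml = serialize(document);
        System.out.println(xml);
        // the serialized text can be parsed again into a new tree
        System.out.println(parse(xml).getDocumentElement().getTagName());
    }
}
```

DOM Level 3 also defines its own Load and Save module; the transformation API is used here only because it is compact.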
All nodes in a DOM document tree (including the Document object itself) implement the Node interface, which provides basic methods for traversing and manipulating the tree. The following methods can be used to navigate the tree:
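As a minimal sketch of navigation, the following code walks the children of a node through getFirstChild and getNextSibling, and climbs towards the root through getParentNode, all of which are methods of the standard org.w3c.dom.Node interface. The class NavigationSketch, its helper methods, and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name and sample data, used only for illustration.
class NavigationSketch {

    // Parses XML text into a DOM tree; failures are rethrown unchecked.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    // Collects the names of the children of a node, left to right,
    // by following the first-child and next-sibling links.
    static List<String> childNames(Node parent) {
        List<String> names = new ArrayList<String>();
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            names.add(child.getNodeName());
        }
        return names;
    }

    // Counts the ancestors of a node by following parent links
    // until the Document node (whose parent is null) has been passed.
    static int depth(Node node) {
        int depth = 0;
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            depth++;
        }
        return depth;
    }

    public static void main(String[] args) {
        Document document = parse("<person><name>Alan</name><born>1912</born></person>");
        Node person = document.getDocumentElement();
        System.out.println(childNames(person));
        System.out.println(depth(person.getLastChild()));
    }
}
```

The same loop could be written with getChildNodes and an index; following sibling links simply avoids building a NodeList.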
Moreover, the Node interface contains methods to get useful information about a node, such as its type, name, and value:
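The sketch below illustrates the standard accessors getNodeType, getNodeName, and getNodeValue on a few kinds of node. The class NodeInfoSketch, the describe helper, and the sample document are hypothetical. Note that getNodeValue returns null for element and document nodes, so the value shows up as the text "null" for them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name and sample data, used only for illustration.
class NodeInfoSketch {

    // Parses XML text into a DOM tree; failures are rethrown unchecked.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    // Formats the type, name, and value of a node as one line of text.
    static String describe(Node node) {
        String kind;
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:  kind = "document";  break;
            case Node.ELEMENT_NODE:   kind = "element";   break;
            case Node.ATTRIBUTE_NODE: kind = "attribute"; break;
            case Node.TEXT_NODE:      kind = "text";      break;
            default:                  kind = "other";     break;
        }
        return kind + " name=" + node.getNodeName() + " value=" + node.getNodeValue();
    }

    public static void main(String[] args) {
        Document document = parse("<person id=\"1\">Alan</person>");
        Node person = document.getDocumentElement();
        System.out.println(describe(document));
        System.out.println(describe(person));
        System.out.println(describe(person.getAttributes().getNamedItem("id")));
        System.out.println(describe(person.getFirstChild()));
    }
}
```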
Finally, there are methods for inserting, deleting, and replacing nodes in the tree:
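A minimal sketch of tree modification follows, using the standard Node methods insertBefore, appendChild, removeChild, and replaceChild. The class TreeEditSketch, its helpers, and the element names are hypothetical; new nodes are created with the factory methods of the owning Document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name and sample data, used only for illustration.
class TreeEditSketch {

    // Parses XML text into a DOM tree; failures are rethrown unchecked.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    // Returns the names of the children of a node, left to right.
    static List<String> childNames(Node parent) {
        List<String> names = new ArrayList<String>();
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            names.add(child.getNodeName());
        }
        return names;
    }

    // Applies one insertion, one append, one removal, and one replacement
    // to the children of the document element.
    static void edit(Document document) {
        Element list = document.getDocumentElement();
        Node first = list.getFirstChild();
        list.insertBefore(document.createElement("a"), first);   // new node goes before 'first'
        list.appendChild(document.createElement("d"));           // new node becomes the last child
        list.removeChild(first);                                  // detach the original first child
        list.replaceChild(document.createElement("x"), list.getLastChild()); // (newChild, oldChild)
    }

    public static void main(String[] args) {
        Document document = parse("<list><b/><c/></list>");
        edit(document);
        System.out.println(childNames(document.getDocumentElement()));
    }
}
```

Starting from the children b and c, the four calls leave the children a, c, and x, in that order.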
For each node type (element, attribute, and so on) there is a specialized interface. The most important of these interfaces are Element, Document, and Text, which are described below.
The Element interface represents element nodes. In particular, it lets you get, set, and remove the attributes of an element. The most useful methods of the interface are the following:
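As an illustration, the sketch below reads, changes, and removes attributes through the standard Element methods getAttribute, setAttribute, removeAttribute, and hasAttribute, and looks up descendants with getElementsByTagName. The class ElementSketch, the updateAttributes helper, and the attribute names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical class name and sample data, used only for illustration.
class ElementSketch {

    // Parses XML text into a DOM tree; failures are rethrown unchecked.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    // Overwrites the "id" attribute and drops the "lang" attribute.
    static void updateAttributes(Element person) {
        person.setAttribute("id", "42");   // creates the attribute if it is missing
        person.removeAttribute("lang");    // does nothing if the attribute is missing
    }

    public static void main(String[] args) {
        Element person = parse("<person id=\"1\" lang=\"en\"><name>Alan</name></person>")
                .getDocumentElement();
        System.out.println(person.getTagName() + " id=" + person.getAttribute("id"));
        updateAttributes(person);
        System.out.println("id=" + person.getAttribute("id"));
        System.out.println("has lang: " + person.hasAttribute("lang"));
        // counts the descendant elements named "name"
        System.out.println(person.getElementsByTagName("name").getLength());
    }
}
```

getAttribute returns an empty string, not null, when the attribute is absent, so hasAttribute is the safer way to test for presence.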
The Document interface represents the root of the DOM tree. Notice that the tree root is a virtual node that does not appear in the document text. It is the parent of the document element node, that is, the outermost element of the XML document. This interface is important mainly because it contains factory methods to create nodes of the various types. The most useful methods of the interface are the following:
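The factory role of Document can be sketched by building a small tree from scratch. The code uses the standard methods DocumentBuilder.newDocument, createElement, createTextNode, and getDocumentElement; the class DocumentFactorySketch, the build helper, and the element names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class name and sample data, used only for illustration.
class DocumentFactorySketch {

    // Builds the tree for <person id="1"><name>Alan Turing</name></person>
    // without parsing any text.
    static Document build() {
        Document document;
        try {
            document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("cannot create a DocumentBuilder", e);
        }
        // nodes are created by the document, then attached explicitly
        Element person = document.createElement("person");
        person.setAttribute("id", "1");
        Element name = document.createElement("name");
        name.appendChild(document.createTextNode("Alan Turing"));
        person.appendChild(name);
        // the element appended to the Document node becomes the document element
        document.appendChild(person);
        return document;
    }

    public static void main(String[] args) {
        Document document = build();
        Element root = document.getDocumentElement();
        System.out.println(root.getTagName());
        System.out.println(root.getFirstChild().getFirstChild().getNodeValue());
    }
}
```

A node created by one Document belongs to that document; moving it into another tree requires importNode or adoptNode.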
The Text interface represents text nodes, which are strings of plain text without markup. These nodes may be manipulated with the following methods:
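A short sketch of text manipulation follows. It uses getData, appendData, and getLength, which Text inherits from the CharacterData interface, and splitText, which Text defines itself. The class TextSketch, its helper methods, and the sample content are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical class name and sample data, used only for illustration.
class TextSketch {

    // Parses XML text into a DOM tree; failures are rethrown unchecked.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    // Appends a space and a surname to the text node and returns the new character count.
    static int appendSurname(Text text, String surname) {
        text.appendData(" " + surname);
        return text.getLength();
    }

    // Splits the node at its first space. The returned node holds the tail and
    // is inserted as the next sibling of the original node.
    // Returns null if the text contains no space.
    static Text splitAtFirstSpace(Text text) {
        int space = text.getData().indexOf(' ');
        return space < 0 ? null : text.splitText(space);
    }

    public static void main(String[] args) {
        Element name = parse("<name>Alan</name>").getDocumentElement();
        Text text = (Text) name.getFirstChild();
        System.out.println(appendSurname(text, "Turing"));
        Text tail = splitAtFirstSpace(text);
        System.out.println("[" + text.getData() + "][" + tail.getData() + "]");
        System.out.println(name.getChildNodes().getLength());
    }
}
```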
The following Java program DOMStats parses an input XML document and computes some statistics on it (number of elements, number of attributes, total length of character data and document tree height). It is the tree-based version of the event-based program SAXStats presented earlier:
import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;

public class DOMStats {

    // used to store the counts
    public static int numElements, numAttributes, numChars, height;

    /* counts elements, attributes, and text length, and returns the height
       of the tree rooted at the given node. */
    public static int domCount(Node node) {
        int type = node.getNodeType();
        if (type == Node.TEXT_NODE) {
            // method getLength() is inherited by Text from the CharacterData interface
            numChars += ((Text) node).getLength();
            return 0;
        }
        if (type == Node.ELEMENT_NODE) {
            numElements++;
            numAttributes += node.getAttributes().getLength();
            NodeList children = node.getChildNodes();
            int outdegree = children.getLength();
            int hMax = 0;
            // an element is a leaf if none of its children is an element
            boolean leaf = true;
            for (int i = 0; i < outdegree; i++) {
                int h = domCount(children.item(i));
                if (h > hMax) {
                    hMax = h;
                }
                if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                    leaf = false;
                }
            }
            if (leaf) {
                return 0;
            } else {
                return 1 + hMax;
            }
        }
        // in all other cases (comments, processing instructions, and so on)
        return 0;
    }

    public static void main(String[] args) {
        // the name of the XML document to parse is the only argument
        if (args.length != 1) {
            System.out.println("Usage: java DOMStats <file.xml>");
            System.exit(1);
        }
        String filename = args[0];

        // create a parser factory instance
        DocumentBuilderFactory factory = null;
        try {
            factory = DocumentBuilderFactory.newInstance();
        } catch (FactoryConfigurationError fce) {
            // the implementation is not available or cannot be instantiated
            System.out.println(fce.getMessage());
            System.exit(1);
        }

        // use the factory to create a parser instance
        DocumentBuilder parser = null;
        try {
            parser = factory.newDocumentBuilder();
        } catch (ParserConfigurationException pce) {
            // a parser cannot be created which satisfies the requested configuration
            System.out.println(pce.getMessage());
            System.exit(1);
        }

        // parse the document
        Document document = null;
        try {
            document = parser.parse(new File(filename));
        } catch (SAXException se) {
            // some general parse error occurred; it is a SAX exception because
            // DocumentBuilder reuses several classes from the SAX API
            System.out.println(se.getMessage());
            System.exit(1);
        } catch (IOException ioe) {
            // some I/O error occurred
            System.out.println(ioe.getMessage());
            System.exit(1);
        } catch (IllegalArgumentException iae) {
            // the File passed to parse() is null
            System.out.println(iae.getMessage());
            System.exit(1);
        }

        numElements = 0;
        numAttributes = 0;
        numChars = 0;

        // compute the statistics, starting from the document element
        height = domCount(document.getDocumentElement());

        System.out.println("Number of elements: " + numElements);
        System.out.println("Number of attributes: " + numAttributes);
        System.out.println("Number of characters of plain text: " + numChars);
        System.out.println("Tree height: " + height);
    }
}