XPath Performance Test

XPath-PT represents a major revision of the XPath benchmark proposed in [1]. In the following we highlight the main points of XPath-PT:

queries have a natural interpretation over documents generated with the popular XML benchmark XMark [2]. Hence they simulate realistic query needs of a potential user of the the auction site;
queries are divided into groups according to the intrinsic computational complexity of the corresponding evaluation problem. XPath language can be stratified in a number of fragments for which different complexity bounds are known [3]. Comparing the theoretical computational complexity of the query evaluation problem with the actual amount of resources consumed during query evaluation might be, at least, a stimulating and instructive exercise;
queries are defined to challenge data scalability of the XML processing system, that is the performance of the system as the data complexity (document size) grows. In particular, the queries talk about document sections (like open and closed auctions, items, people, descriptions) that become bigger when the XMark document scaling factor increases. Moreover, the results of the queries are small compared to the size of the target document. This avoids that the time taken to serialize the results (that may be relevant) obfuscates the pure query processing time;
some of the queries are defined to challenge query scalability of the XML processing system, that is the performance of the system as the query complexity grows. These queries are parametric with respect to a given factor that influences the query complexity, like length or nesting degree of predicates, and for each parametric query a generator is provided;
finally, the test includes some instances of queries using transitive closure of path expressions. A typical example is reachability: find all the objects that are reachable from a given object trough an arbitrary path in an object graph. Transitive closure of path expressions is beyond the expressive power of XPath. The addition of transitive closure to XPath has been investigated from a theoretical point of view in [4, 5]. However, a practical counterpart consisting of different approaches, implementations, and optimizations to tackle the intrinsic complexity of transitive closure in XPath is still missing, and a benchmark might help to start this endaveour.

XPath-PT consists of 6 groups of queries. Each group contains queries belonging to a particular XPath language. The 6 XPath languages form an inclusion chain in terms of expressivity and complexity. The XPath languages that we consider are as follows:

XPath-A: This fragment contains so-called unary tree pattern queries. These queries use only child and descendant axes, node tests equal to * or to a tag name, and filters (predicates). Conjunctive and disjunctive Boolean operators are allowed, but negation is not. Relational and arithmetic operators and functions are disallowed.
XPath-B: This fragment contains so-called core or navigational XPath queries. This fragment extends XPath-A by allowing all XPath axes and negation.
XPath-C: This fragment contains relational XPath queries. This fragment extends XPath-B by allowing all relational operators (=, !=, >, <, >=, <=) and the id() function.
XPath-D: This fragment contains arithmetic XPath queries. This fragment extends XPath-C by allowing all arithmetic operators (+, -, *, div, mod) and functions sum() and count().
XPath-E: This fragment contains all XPath 1.0 queries. In particular, it extends XPath-D by allowing all functions (like position() and contains()).
XPath-F: This fragment contains all regular XPath queries. It extends XPath-E with idref() and closure() functions. While idref(), the natural counterpart of id(), is defined in XPath 2.0, closure() is new to XPath and XQuery (though it has been added to EXSLT, and extension of XSLT). The closure function takes two arguments: closure(C, path), where C is a node set and path is a location path in XPath 1.0 (an expression that computes a node set). The closure function computes the transitive closure of path starting at C, that is, the set of nodes that are reached in one or more applications of path setting C as the initial context set. For instance, closure(/, child::*) returns all descendants of the root, while closure(/, child::a) returns all descendants of the root that can be reached trough a path labelled with a.