XQuery Full Text

XQuery Full Text extends XQuery with full-text search. The following examples can be run against Shakespeare's play A Midsummer Night's Dream, marked up in XML by Jon Bosak.

The lines containing the word hath:

//LINE[ . contains text "hath" ]

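By contrast, the following query uses the standard contains() function and retrieves the lines containing the substring hath, matched case-sensitively and anywhere in the text, including inside a longer word: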

//LINE[contains(., "hath")]

Note that with contains text the search string is tokenized before it is compared with the tokenized input string. Several normalizations take place during tokenization: for example, differences between upper and lower case are ignored, diacritics (umlauts, accents, etc.) are removed, and an optional, language-dependent stemming algorithm is applied. In addition, special characters such as whitespace and punctuation marks are ignored. Hence the following search produces the same result as the first query:

//LINE[ . contains text "...Hath!" ]
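The normalization described above can be modelled outside XQuery. The following Python sketch is only an illustration under simplifying assumptions (tokens are runs of letters and digits, no stemming); the function names tokenize and contains_text are hypothetical, the sample strings are made up, and this is not the tokenizer of any particular XQuery processor.

```python
import re
import unicodedata


def tokenize(text):
    """Split text into normalized tokens (illustrative model only)."""
    # Decompose accented characters and drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lower-case, then keep runs of letters and digits; punctuation
    # and whitespace only act as separators.
    return re.findall(r"[^\W_]+", stripped.lower())


def contains_text(line, query):
    """True if the query tokens occur as consecutive tokens of the line."""
    words, wanted = tokenize(line), tokenize(query)
    if not wanted:
        return False
    return any(words[i:i + len(wanted)] == wanted
               for i in range(len(words) - len(wanted) + 1))


sample = "Hath she spoken?"               # made-up sample line
print(tokenize("...Hath!"))               # ['hath']
print(contains_text(sample, "...Hath!"))  # True: tokens are compared
print("hath" in sample)                   # False: substring test is case-sensitive
```

The last two lines show the difference between the two queries above: the token comparison ignores case and punctuation, while a plain substring test does not.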

The lines containing both the words death and abjure:

//LINE[ . contains text "death" ftand "abjure" ]

The lines containing either the word death or the word abjure:

//LINE[ . contains text "death" ftor "abjure" ]

The lines containing the word death but not the word abjure:

//LINE[ . contains text "death" ftand ftnot "abjure" ]

The lines containing the word Demetrius with exactly this capitalization:

//LINE[ . contains text "Demetrius" using case sensitive ]

The lines containing the word Orléans with this accent:

//LINE[ . contains text "Orléans" using diacritics sensitive]

The lines containing the word hate or related words by stemming:

//LINE[ . contains text "hate" using stemming using language "en"]

The lines matching the phrase my very love, where the words listed in the given stop word file are ignored:

//LINE[ . contains text "my very love" using stop words at "http://files.basex.org/etc/stopwords.txt"]

The lines containing a word that includes love followed by at least one more character (such as lovely or beloved), using wildcards:

//LINE[ . contains text ".*love.+" using wildcards]

. matches a single arbitrary character
.? matches either zero or one character
.* matches zero or more characters
.+ matches one or more characters
.{min,max} matches between min and max characters
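Because these wildcards share their notation with regular expressions, their effect on a single token can be tried out with a small Python sketch. The helpers wildcard_to_regex and token_matches are hypothetical names, and two points are assumptions rather than facts from the text above: a backslash is taken to make the next character literal, and matching is done case-insensitively against the whole token.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a full-text wildcard pattern into a compiled Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Assumption: a backslash makes the next character literal.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
        elif ch == ".":
            # A dot may be followed by one qualifier: ?, *, + or {min,max}.
            qualifier = re.match(r"[?*+]|\{\d+,\d+\}", pattern[i + 1:])
            suffix = qualifier.group(0) if qualifier else ""
            parts.append("." + suffix)
            i += 1 + len(suffix)
        else:
            # Every other character stands for itself.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def token_matches(pattern, token):
    """True if the whole token matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(token) is not None


for word in ["love", "lovely", "beloved", "glove", "gloves"]:
    print(word, token_matches(".*love.+", word))
```

With the pattern .*love.+ the words lovely, beloved and gloves match, while love and glove do not, because .+ requires at least one character after love.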

The lines containing the word love using a given thesaurus:

//LINE[ . contains text "love" using thesaurus at "thesaurus.xml"]

The lines containing the word hate using fuzzy (approximate) search:

//LINE[ . contains text "hate" using fuzzy]

Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the length of the query term by 4, with a minimum of 1 error.
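The error threshold can be made concrete with a short Python sketch. It is a minimal model, not the code of any actual implementation: the names levenshtein, max_errors and fuzzy_matches are hypothetical, and rounding the division by 4 down to an integer is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,           # deletion
                               current[j - 1] + 1,        # insertion
                               previous[j - 1] + cost))   # substitution or match
        previous = current
    return previous[-1]


def max_errors(term):
    """Allowed errors: term length divided by 4, but never fewer than 1."""
    # Assumption: the division is rounded down.
    return max(1, len(term) // 4)


def fuzzy_matches(term, token):
    """True if the token is within the allowed edit distance of the query term."""
    return levenshtein(term.lower(), token.lower()) <= max_errors(term)


print(max_errors("hate"))             # 1
print(fuzzy_matches("hate", "hath"))  # True: one substitution
print(fuzzy_matches("hate", "love"))  # False: three substitutions
```

Under these assumptions a four-letter term such as hate tolerates one error, so it also finds hath and hat, while a nine-letter term would tolerate two.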

The lines containing the word love at least twice:

//LINE[ . contains text "love" occurs at least 2 times ]

The lines containing both words death and abjure at a distance of at least two words:

//LINE[ . contains text "death" ftand "abjure" distance at least 2 words ]

Other distance variants are at most, exactly, and from x to y.
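How a distance constraint is evaluated can be sketched in Python on a list of tokens. This is a simplified model for two single words, with hypothetical helper names gaps and within_distance. It assumes that the distance is the number of tokens lying between the two matches (adjacent words have distance 0) and that one pairing inside the requested range is enough for the line to qualify.

```python
def gaps(tokens, first, second):
    """Number of tokens between every pairing of the two words."""
    return [abs(i - j) - 1
            for i, t in enumerate(tokens) if t == first
            for j, u in enumerate(tokens) if u == second and j != i]


def within_distance(tokens, first, second, at_least=0, at_most=None):
    """True if some pairing has a gap inside the requested range."""
    return any(gap >= at_least and (at_most is None or gap <= at_most)
               for gap in gaps(tokens, first, second))


# Sample token list, already lower-cased and split into words.
tokens = "either to die the death or to abjure".split()

print(gaps(tokens, "death", "abjure"))                                   # [2]
print(within_distance(tokens, "death", "abjure", at_least=2))            # True
print(within_distance(tokens, "death", "abjure", at_most=1))             # False
```

In this model, exactly n corresponds to at_least=n together with at_most=n, and from x to y corresponds to at_least=x together with at_most=y.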

The lines containing both words hate and love in the same sentence (use same paragraph to require the same paragraph instead):

//LINE[ . contains text "hate" ftand "love" same sentence ]

Sentences are delimited by end-of-sentence markers (., !, ?, etc.), and newline characters are treated as paragraph delimiters.

The lines that contain the word love sorted in decreasing order of relevance:

for $hit score $score in //LINE[ . contains text "love" ]
order by $score descending
return <hit score='{ format-number($score, "0.00")}'>{$hit}</hit>

The keyword score introduces a variable that receives the score value. This value is guaranteed to be between 0 and 1: a higher value means a more relevant hit. The computation of relevance is implementation-dependent. A few output lines follow:

<hit score="0.62">
  <LINE>Sweet love,--</LINE>
</hit>
<hit score="0.45">
  <LINE>my sweet love?</LINE>
</hit>
<hit score="0.45">
  <LINE>Asleep, my love?</LINE>
</hit>
<hit score="0.36">
  <LINE>Love takes the meaning in love's conference.</LINE>
</hit>