Skip to content

Primer: XPath Query Evaluation

Aliza Stone edited this page Apr 3, 2017 · 10 revisions

This document is a primer on XPath evaluation basics to help understand how they are implemented and optimized in CommCare.

XPath Query Evaluation

At a fundamental and native level an XPath query follows a straightforward pattern.

The expression we're concerned with starts with a reference to data somewhere in the session, which is an expression which refers to one of the session's data models.

/data/child/node[@value = 'filtered']/output

A reference doesn't necessarily refer to a value. A reference is conceptually a way to refer to any number of instance elements named output which follow the path to get there.

For clarity in the evaluation process it is helpful to distinguish between a reference which refers to an abstract set of nodes (like the one above), and "fully qualified" reference to a single node, like

/data[1]/child[1]/node[4]/output[1]

The process by which the value of "unqualified" reference is requested and is covered to a usable value is what we're referring to here as evaluating a query.

Query evaluation takes an input tree reference like

/data/child/node[@value = 'filtered']/output

and converts it to a nodeset, which is a set of fully qualified references that match that query, like

/data[1]/child[1]/node[1]/output[1]
/data[1]/child[1]/node[4]/output[1]

Often only a single node is a member of the nodeset, in which case the result of a query is the contents of the requested node. In other cases the nodeset will have multiple elements, and are only useful as inputs to functions like count(), sum(), or join().

Evaluation Context lifecycle

In CommCare, XPath expressions are evaluated against an evaluation context, which provides a frame for evaluating a query. Among other things, the context provides

  • The relative context of a query - IE: "What should a relative query (../some_node) evaluate to right now?"
  • The "original" context of a query - IE: If we are in a nested query, what should current() evaluate to?
  • The body of instances available for evaluation
  • XPath variables (if relevant)

In order to keep state isolated, the operation to "expand" a reference into a nodeset of fully qualified references is performed within an evaluation context.

Any time a new context is needed (for instance, inside of a predicate expression where ../some_node needs to refer to a new relative reference), a new derivative evaluation context is created and used to expand and evaluate the new expressions.

Query Evaluation Process Example

The native process for performing a query is to walk each "step" of the provided query depth-first against the live instance, and including a fully qualified reference to each element which matches.

To evaluate the XPath expression

/data/state[@population > 1000]/city[@type='city'][@capital = 'yes']/name

against a sample body

<data>
  <state population="1500">
    <city capital="no" type="town">
      <name>Woburn</name>
    </city>
    <city capital="yes" type="city">
      <name>Boston</name>
    </city>
    <city capital="no" type="city">
      <name>Cambridge</name>
    </city>
  </state>
  <state population="500">
    <city capital="no" type="city">
      <name>Albuquerque</name>
    </city>
    <city capital="yes" type="city">
      <name>Santa Fe</name>
    </city>
  </state>
  <state population="2000">
    <city capital="yes" type="city">
      <name>Albany</name>
    </city>
    <city capital="no" type="city">
      <name>New York</name>
    </city>
  </state>
</data>

we start at the root (/), and look for matches. Since we find one (data), we continue processing with just that node as our working set

/data[1]/

state

On the next step state[@population > 1000] we need to both check for children with matching names, and also evaluate a filter condition.

First the engine generates a candidate set of references which match the name pattern

/data[1]/state[1]
/data[1]/state[2]
/data[1]/state[3]

Then the engine identifies which of the candidate references is compatible with the filter condition, striking out any references which fail to meet the filter

/data[1]/state[1]/@population > 1000 = true
/data[1]/state[2]/@population > 1000 = false
/data[1]/state[3]/@population > 1000 = true

Once all candidate references are evaluated, we move on to the next step with our current working set

Current Working Set:
/data[1]/state[1]/
/data[1]/state[3]/

Remaining query to evaluate:
city[@type='city'][@capital = 'yes']/name

We'll evaluate each remaining step in the reference (city[@type='city'][@capital = 'yes']/name) against this working set. If at any point it is empty, an empty nodeset is returned and evaluation stops.

city

To evaluate the next step city[@type='city'][@capital = 'yes'], the engine once again generates a candidate set of matching references, based on the current instance.

/data[1]/state[1]/city[1]
/data[1]/state[1]/city[2]
/data[1]/state[1]/city[3]
/data[1]/state[3]/city[1]
/data[1]/state[3]/city[2]

and once again processes the filter expressions against those candidate references.

Predicates are evaluated depth-first and only if the previous predicate passed for that reference.

/data[1]/state[1]/city[1]/@type = 'town' > false

/data[1]/state[1]/city[2]/@type = 'city' > true
/data[1]/state[1]/city[2]/@capital = 'yes' > true

/data[1]/state[1]/city[3]/@type = 'city' > true
/data[1]/state[1]/city[3]/@capital = 'no'  > false

/data[1]/state[3]/city[1]/@type = 'city' > true
/data[1]/state[3]/city[1]/@capital = 'yes' > true

/data[1]/state[3]/city[2]/@type = 'city' > true
/data[1]/state[3]/city[2]/@capital = 'yes' > false

moving forward with the resulting working nodeset below

Current Working Set:
/data[1]/state[1]/city[2]
/data[1]/state[3]/city[1]

Remaining query to evaluate:
name

name

Finally, the last step (name) is applied to each working reference, resulting in the final nodeset

/data[1]/state[1]/city[2]/name[1]
/data[1]/state[3]/city[1]/name[1]

which would produce an error if referenced alone, IE:

<output value="/data/state[@population > 1000]/city[@type='city'][@capital = 'yes']/name"/>

but can be used with an expression like

<output value="join(", ", /data/state[@population > 1000]/city[@type='city'][@capital = 'yes']/name)"/>

to produce

Boston, Albany