Skip to content

Performance: XPath Query Evaluation

Cal Ellowitz edited this page Jun 9, 2021 · 15 revisions

Pre-Requisites

Much of this writeup won't make sense without having read the Virtual Instances document and having an understanding of how those instances work.

It is also generally useful to understand what an XPath filter (predicate) expression is (like /data/weights[@scale = 'kg'])) at a high level, and specifically understanding the notion of the evaluation context (IE: That inside of the predicate expression you are referring to a relative, nested evaluation frame).

You should almost certainly read the XPath Query Evaluation Primer before reading this document.

Overview

In CommCare, data is read and manipulated with XPath.

XPath Expressions can contain functions, operators, and other element to produce a result, like

min(/data/inputs[@priority > 5][0] + 15, /data/fallback)

The majority of performance considerations in CommCare come from XPath expressions which reference data models (like /data/fallback, especially when referring to potentially large datasets, like cases or fixtures, and most specifically when referring to datasets with filters (like [@priority > 5][0]).

This document will work through the way that those queries are performed, what performance considerations exist in the code, and how our optimizations work.

Performance Challenges

There are primary performance concerns with the native process for XPath query evaluations in practice with CommCare specifically, each of which have their own relevance for optimizations

The reliance on linear scans of potentially long working nodesets while evaluating queries

Example: instance('casedb')/casedb/case[@case_id='unique_id']/case_name

requires a full linear scan of each <case> element by default.

The ease with which nested queries can produce highly non-linear runtimes

Example: Fetching a case's siblings

  count(instance('casedb')/casedb/case
                                        [
                                        index/parent 
                                        = 
                                        instance('casedb')/casedb/case
                                                                      [
                                                                      @case_id
                                                                      =
                                                                      'unique_id'
                                                                      ]
                                                                      /index/parent
                                        ]
        )

requires a full scan of the case database for every case in the database O(N^2), despite one set being a reference to a constant.

  1. Redundancy in evaluations

When optimizing for situations like nested queries there is a significant amount of redundancy in the processing, even if the final results are cached.

  1. The high cost of transactional I/O when using Virtual Instances.

Ideally large chunks of virtual instances could be read from the database at once (IE: All patient cases) rather than once at a time, but the scans are performed in a linear way.

Example: After a fairly simple index optimization, the query

instance('casedb')/case[@case_type='patient'][patient_status = 'new']

can eliminate required reads for the first predicate, but the remaining query will still require a fetch for a potentially large, divided set of remaining case records.

Optimization Structure

The primary entry point for optimizations occurs in virtual instances at the root level.

Along with other optimizations which occur along the way (opportunistic caching, etc), the majority of complexity in the performance code occurs when the evaluation context is about to expand the child step of a query against a root level element.

For example:

Current Working Set:
instance('casedb')/casedb[1]/

Remaining query to evaluate:
case[@case_type='patient'][@status='open'][is_enrolled='yes']/case_name

Before the evaluation context asks each member of the working set for all "candidate" children to evaluate for processing, it provides the element referenced by the current fully qualified reference the opportunity to instead provide a list of children (in the form of their 'multiplicity', or their child index in a fully qualified reference) which match one or more of the queries.

In this example the Evaluation Context informs the casedb root node that it is looking for which of the node's children are named case and match the predicate set [@case_type='patient'][@status='open'], to which the node would reply with a set of fully qualified references

Current Working Set:
instance('casedb')/casedb[1]/case[23]/
instance('casedb')/casedb[1]/case[24]/
instance('casedb')/casedb[1]/case[30]/
instance('casedb')/casedb[1]/case[42]/
instance('casedb')/casedb[1]/case[49]/

Remaining query to evaluate:
[is_enrolled='yes']/case_name

which it matched however it can, along with any predicates which could not be optimized.

If any predicates remain to be evaluated, the evaluation context filters the current working set that is provided linearly by those predicates and then proceeds to the next step as usual.

One quite notable consideration to make regarding this structure is that the order of the predicates can be important.

Assume for instance that the engine is able to optimize both the [@case_type='patient'] and the [@status='open'] predicate in the <casedb> for instance, but not a third predicate [unoptimized = 'yes'].

If those predicates were provided in the expression

instance('casedb')/casedb/case[@case_type='patient'][@status='open'][unoptimized = 'yes']/value

the engine would be able to optimize the first two in a constant time lookup, and then proceed to do a linear scan over Open Patient cases to evaluate which meet the [unoptimized = 'yes'] condition.

If, however, the predicates were provided in the expression

instance('casedb')/casedb/case[unoptimized = 'yes'][@case_type='patient'][@status='open']/value

The latter two optimizations would not be effective, since a linear scan is performed in the first filter.

If the non-optimized filter appears in the middle

instance('casedb')/casedb/case[@case_type='patient'][unoptimized = 'yes'][@status='open']/value

The @case_type optimization triggers, but @status is processed with the same linear scan as [unoptimized = 'yes'].

Basic optimizations

Key/Value Set Optimizations

The most basic optimizations in the engine identify key/value pair matches which can be accomplished through an indexed lookup.

These lookups must be in the format

[MODEL_KEY = XPATH_VALUE_EXPRESSION]

or

[MODEL_KEY != XPATH_VALUE_EXPRESSION]

Where MODEL_KEY is a single reference to an indexed value on the requested model, and XPATH_VALUE_EXPRESSION is evaluated to produce a lookup into that index.

Performance:

Generally speaking the performance of key/value queries should be logarithmic to the size of their dataset (as is consistent with a dB index), and each query will execute a single DB request.

Models and keys supporting Key/Value Lookups:

casedb/case

  • @case_type
  • @status
  • @case_id
  • @owner_id <- Since 2.36
  • index/INDEX_NAME
  • @external_id
  • @state <- Only in formplayer
  • @category <- Only in formplayer`

ledgerdb/ledger

  • @entity-id

Indexed Fixtures

  • Named indexes as provided by the <schema> block

Example:

instance('casedb')/casedb/case[ @case_id = /data/selected_case/id ]/output_value

Multi-Key Optimizations

In the case database, any keys other than index/* can actually be combined into a single request instead of a request per-predicate. This both saves a DB query, and also prevents the need to read two large query response sets from the db, requiring only the ids of cases which meet both conditions to be returned.

This is only currently meaningful on expressions of the following pattern (pairing @case_type and @status)

instance('casedb')/casedb/case[@case_type='patient'][@status='open']

but is a primary optimization of the system, as that pattern is followed in most of the high-volume contexts used by apps

Set Member Optimizations

A set member optimization matches an key value to one of a set of potential matching values. This is currently implemented for Case indexes (IE: index/*) which sometimes used in contexts where the index needs to be matched to one of many values.

This optimization follows the pattern

[selected('value_one value_two value_three', index/*)]

or

[selected(/path/to/space_separated_list, index/*)]

where the second argument is a valid key (currently only case indexes are supported), and the first argument is a space separated set of strings of which the index may be one.

Advanced Optimizations: Query Planner

There are more complex situations which are optimized by the engine, and generally speaking future additions and changes will be shifted into the Query Planner, including backporting the existing optimizations into that system.