Performance: External Data Instances

"External Data Instances" are CommCare's way of providing structure for data which is not contained entirely within an XForm. They are the foundation of dynamic "user level" data in the system, like access to cases, ledgers, fixtures, and the current session details.

By exposing this data through an "Instance" structure, the system allows the re-use of the XPath engine to build complex queries and transformations of that data, but the document format of an XML instance becomes quite challenging to scale within the limited memory space of a CommCare runtime. The case "database" on many mobile apps contains tens of thousands of elements, and a location tree could need to support an entire nation or region.

To ensure performance, CommCare exposes these instance "documents" through an "external data instance" interface, which lets the engine navigate the document as if it were in memory while the document is actually being built dynamically.

Reference Sample

For the rest of this document we'll refer to this sample external data instance, which is an example casedb instance:

<casedb>
  <case case_id="2kfk23432mf24tt" case_type="example" status="open">
    <case_name>Example Case</case_name>
    <value_one>10</value_one>
    <index>
      <parent type="child">992ffk23432mf24tt</parent>
    </index>
  </case>
...
</casedb>

Structure

External data instances appear to the engine using the same interfaces as a normal "XML" instance element (represented by an AbstractTreeElement in code). Each external data instance is broken down into three kinds of element:

Root - The root of the instance tree (example: <casedb>)
Child - The element one step down from the root (example: <case case_id="2kfk23432mf24tt" ...>)
Body - Any children of the child element (e.g. <value_one>, or the <index> subtree)

Each of these three elements is handled by the system in a different way. The root and child elements are special implementations of tree elements which contain custom code. The body elements are generally expected to be native TreeElement instances, that is, vanilla XML elements, but they are loaded dynamically by the root and child elements.

In practice, optimized external data instances are currently structured so that each child node represents one unit of storage (generally one row in a DB, although a unit can be split up a bit).
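
As a rough illustration of the three roles, the sketch below models them in Python and expands one storage row into the body subtree of the reference sample. It is not CommCare's implementation: the class names Root, Child and Body, the row_to_body helper, and the layout of the row dictionary are all assumptions made for this example. Only the values come from the reference sample.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Body:
    """Plain in-memory XML element; stands in for a vanilla TreeElement."""
    name: str
    text: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["Body"] = field(default_factory=list)


@dataclass
class Child:
    """One step below the root; maps to one unit of storage (one row)."""
    db_id: int


@dataclass
class Root:
    """Top of the instance tree, e.g. <casedb>."""
    name: str
    children: List[Child] = field(default_factory=list)


def row_to_body(row: dict) -> Body:
    """Expand one (hypothetical) storage row into the <case> body subtree."""
    case = Body("case", attributes={
        "case_id": row["case_id"],
        "case_type": row["case_type"],
        "status": row["status"],
    })
    for name, value in row["properties"].items():
        case.children.append(Body(name, text=value))
    index = Body("index")
    for name, (index_type, target) in row["indices"].items():
        index.children.append(
            Body(name, text=target, attributes={"type": index_type}))
    case.children.append(index)
    return case


# The row layout is assumed; the values are taken from the reference sample.
sample_row = {
    "case_id": "2kfk23432mf24tt",
    "case_type": "example",
    "status": "open",
    "properties": {"case_name": "Example Case", "value_one": "10"},
    "indices": {"parent": ("child", "992ffk23432mf24tt")},
}

root = Root("casedb", [Child(db_id=0)])
case_body = row_to_body(sample_row)
```

Here the root and child carry almost no data of their own, while everything a query can actually read lives in the body tree built from a single row.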

Process

From a performance standpoint there are three "steps" involved in an external data instance:

Connect - This is when an instance connector is declared (e.g. <instance id="casedb" src="jr://instance/casedb">). At this time the external data instance is requested and brought into the session, but nothing is read from storage. At this stage the root element exists, but its children are not loaded.
Prime - The first time data is requested from the instance, the root of the instance is responsible for priming any data it needs from storage to begin servicing requests.
- Generally speaking this step involves creating one child element per row in the database and connecting it with the data it will need to load itself. The way in which that is executed is often platform specific.
Query - Any time a property of a child element is requested, that child element may need to dynamically fetch data from storage to make itself ready to process requests. This generally involves building out body elements and then responding to requests normally.
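
The three steps above can be sketched as follows. This is a minimal Python model under stated assumptions, not the real code: FakeStorage, LazyRoot and LazyChild are hypothetical names, and the storage log exists only to show when storage is touched.

```python
class FakeStorage:
    """In-memory stand-in for the case database; records every access."""

    def __init__(self, rows):
        self._rows = rows          # {db_id: {property: value}}
        self.log = []

    def ids(self):
        self.log.append("ids")
        return sorted(self._rows)

    def read(self, db_id):
        self.log.append(f"read {db_id}")
        return dict(self._rows[db_id])


class LazyChild:
    def __init__(self, storage, db_id):
        self._storage = storage
        self.db_id = db_id
        self._body = None          # built on first query

    def get(self, name):
        # Query: fetch the record only when a property is first requested.
        if self._body is None:
            self._body = self._storage.read(self.db_id)
        return self._body.get(name)


class LazyRoot:
    def __init__(self, storage):
        # Connect: keep a handle to storage but do not touch it yet.
        self._storage = storage
        self._children = None

    def children(self):
        # Prime: on first use, create one placeholder child per row.
        if self._children is None:
            self._children = [
                LazyChild(self._storage, i) for i in self._storage.ids()
            ]
        return self._children


storage = FakeStorage({
    0: {"case_id": "2kfk23432mf24tt", "case_name": "Example Case"},
    3: {"case_id": "hypothetical-id", "case_name": "Second Case"},  # made up
})

root = LazyRoot(storage)             # Connect
log_after_connect = list(storage.log)
kids = root.children()               # Prime
log_after_prime = list(storage.log)
name = kids[0].get("case_name")      # Query
log_after_query = list(storage.log)
```

In this model the log is empty after connecting, holds a single ID query after priming, and gains one record read only when the first property is requested.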

Example

Load

A user syncs a set of CaseXML Transactions into their local environment. Those transactions are parsed into a database locally with one row per case.

Connect

The session defines an instance connector

<instance id="casedb" src="jr://instance/casedb">

to read cases.

A CaseInstanceTreeElement is created as the root of the tree and sits in the session with no memory requested/used yet.

Session
 - Root (<casedb>)

Prime

Inside of a detail, a config makes a request for

instance('casedb')/casedb/case[@case_id = '2kfk23432mf24tt']/case_name

The root identifies that it hasn't been primed, queries the database for just the IDs of all elements, and creates a new child object for each. The child objects are not the full element (like a <case>). They are placeholders that track that the Nth child of the <casedb> root matches up to some virtual node, and they can load that node on demand.

Session
 - Root (<casedb>)
   - Child [db_id: 0]
   - Child [db_id: 3]
   - Child [db_id: 4]
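
A priming query of this kind might look like the sketch below, which uses Python's sqlite3 module against an in-memory database. The table name, its columns, and the prime function are assumptions for illustration, and the real storage layer is platform specific. The point is only that priming reads row IDs and nothing else.

```python
import sqlite3

# Hypothetical schema: one row per case, as described under "Load".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cases (id INTEGER PRIMARY KEY, case_id TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO cases (id, case_id, payload) VALUES (?, ?, ?)",
    [
        (0, "2kfk23432mf24tt", "<case>...</case>"),
        (3, "hypothetical-id-3", "<case>...</case>"),  # made-up IDs
        (4, "hypothetical-id-4", "<case>...</case>"),
    ],
)


def prime(connection):
    """Return one placeholder per row, reading only the row IDs."""
    rows = connection.execute("SELECT id FROM cases ORDER BY id")
    # Each placeholder remembers its row ID; its body is not loaded yet.
    return [{"db_id": db_id, "body": None} for (db_id,) in rows]


children = prime(conn)
```

The resulting placeholders correspond to the three Child entries in the session diagram above, each with no body attached.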

Query

The engine requests to read the attribute @case_id from the first child to scan for matches to the predicate.

The child element checks to see whether it has loaded data. Since it hasn't, it queries the database for the full record at id 0, and then creates a full XML tree to hold that case's details. It then dispatches all requests for its properties and attributes to that loaded tree.

Session
 - Root (<casedb>)
   - Child [db_id: 0]
     - Body (TreeElement "<case>")
       - Body (TreeElement "<case_name>")
       ...
   - Child [db_id: 3]
   - Child [db_id: 4]

As the scan queries each child, it loads them in turn:

Session
 - Root (<casedb>)
   - Child [db_id: 0]
     - Body (TreeElement "<case>")
       - Body (TreeElement "<case_name>")
       ...
   - Child [db_id: 3]
     - Body (TreeElement "<case>")
       - Body (TreeElement "<case_name>")
       ...
   - Child [db_id: 4]

The body elements are left in a soft cache in most cases, tied to the root element, to support rapid operations on the same element.

After each child element is scanned, its body elements are eligible for garbage collection, allowing the requests to scan through the full document regardless of size of the body elements or scope of the full doc.
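
The caching behaviour can be approximated in Python as shown below. Python has no soft references, so this sketch substitutes a weakref.WeakValueDictionary, which drops an entry as soon as nothing else holds the body (a soft cache would normally keep it until memory is tight). CaseBody and CachingRoot are hypothetical names, and the loads counter stands in for storage reads.

```python
import gc
import weakref


class CaseBody:
    """Stand-in for a loaded <case> body tree."""

    def __init__(self, db_id):
        self.db_id = db_id


class CachingRoot:
    def __init__(self):
        # Weak values: an entry disappears once nothing else holds the body.
        self._cache = weakref.WeakValueDictionary()
        self.loads = 0

    def body_for(self, db_id):
        body = self._cache.get(db_id)
        if body is None:
            self.loads += 1            # simulated storage read
            body = CaseBody(db_id)
            self._cache[db_id] = body
        return body

    def cached_ids(self):
        return sorted(self._cache.keys())


root = CachingRoot()
first = root.body_for(0)
again = root.body_for(0)       # served from the cache while `first` is alive
same_object = first is again
loads_while_held = root.loads

del first, again               # the scan moves on; nothing holds the body
gc.collect()
cached_after_scan = root.cached_ids()

root.body_for(0)               # must be rebuilt from storage
loads_after_reload = root.loads
```

Repeated access to the same element costs one load while the body is in use, and the body can be reclaimed once the scan has moved past it, at the price of a reload if it is needed again.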

Analysis / Notes

One important aspect covered elsewhere is that the root element is also responsible for providing optimizations which index into the external data instance. This is the core of query optimizations and is a complex topic.

Advantages

Most elements in the structure can be lazily loaded, so many large documents can be requested in a context without a huge hit while each of them is made ready.
This structure isolates I/O requests into chunkable portions: the initial scoping query of DB rows, and each individual record request.
The structure provides good locality/isolation for caching

Disadvantages / Future Work

The "object oriented" nature of the instance body still requires one object per element, which is somewhat wasteful (although in profiling this effect is quite minimal).
The initial request for the size and row ID set can scale quite poorly.
There is still a lot of wasted effort munging around the resulting TreeElements, despite them having highly homogeneous structures.

Gotchas / Cautions

The in-memory lifecycle of the "body" elements is expected to be managed solely by the external data instance. This means that if any other code retains references to those elements it will result in the memory failing to free up.
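
The effect can be demonstrated with a small Python sketch, assuming a garbage-collected runtime. BodyElement and is_freed are hypothetical names, and a weak reference is used only as a probe to observe whether the element has been freed.

```python
import gc
import weakref


class BodyElement:
    """Stand-in for a body TreeElement owned by the data instance."""


def is_freed(ref):
    """Report whether the object behind a weak reference has been collected."""
    gc.collect()
    return ref() is None


# Case 1: only the instance held the element, and it has released it.
owned = BodyElement()
probe = weakref.ref(owned)
del owned
freed_when_unshared = is_freed(probe)

# Case 2: other code kept its own reference (the gotcha).
owned = BodyElement()
retained_elsewhere = [owned]   # e.g. a list kept by a caller
probe = weakref.ref(owned)
del owned
freed_when_retained = is_freed(probe)
```

In the second case the element stays in memory for as long as the outside list exists, even though the instance itself has let go of it.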

References (Names subject to change)

StorageBackedTreeRoot - The common code for root elements as defined here
StorageBackedChildElement - The common root for child elements as defined here
