
XMLStorageLayer

bgreenwood edited this page Jul 5, 2012 · 3 revisions

In Release 8, ANDS introduced an XML Storage Layer as an initial attempt to alleviate a bottleneck identified in the previous version: its dependence on the Postgres RDBMS for access to registry objects. The majority of ANDS systems use RIF-CS (an XML schema) as the base format for delivering records via presentation layers (web pages and servers). Unfortunately, the initial design of the registry depends on a highly normalised relational table structure, which makes reconstructing the XML structures very inefficient (particularly when multiple records are requested).

ANDS intends to deprecate the current database structure in Release 9, instead opting to maintain a minimal set of tables describing the registry object and its relationships separately from the content of the object itself. The first step towards this is storing the record's content as XML. In Release 8, this is referred to as the "caching layer" (as the XML Storage Layer is simply a flat-file version of the database storage). From Release 9, other more resource-efficient options will be evaluated for storage of the RIFCS.

The majority of functions relating to the XML Storage Layer are contained in /orca/_functions/orca_cache_functions.php and the caching layer is invoked by importRegistryObjects() in /orca/_functions/orca_import_functions.php.

Setting up the ORCA Caching Layer

  1. Create a new directory on your server where the cache files will be stored (default is /var/www/orca_cache/). Note: the directory MUST be writable by your web server user.

  2. Add the following block to your global_config.php:

// CACHING OPTIONS
$gCacheEnabled = true; // is caching enabled?
//======================================
define('eCACHE_DIRECTORY', '/var/www/orca_' . 'cache'); // what directory do we store our cache files in? (relative)
define('eCACHE_CURRENT_NAME','current'); // what filename do we use for the latest version symlink
define('eCACHE_PERMISSION',0775); // octal representation of default directory permissions
$gCacheExtended = true; // are we caching "extended RIFCS" (rich) or plain vanilla? (not backwards compatible)
define('eCACHE_ENABLED', $gCacheEnabled);

  3. Run the SYNC_DATASOURCE task from the Background Task Manager. (Note: HOURLY_REGISTRY_MAINTENANCE should run once before SYNC_DATASOURCE to ensure the database is updated to R8 before the cache is generated!)

  4. Refer to your task manager log (default: /var/log/andstaskmgr.log) if the cache is not populated as expected.
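
If the cache stays empty, a common cause is a directory that is missing or not writable. The following is a minimal sketch of a pre-flight check, not part of ORCA: checkCacheDirectory() is a hypothetical helper, and the fallback path is only there so the snippet runs without global_config.php loaded. It reports on the user running the script, so run it as the web server user for a meaningful result.

```php
<?php
// Hypothetical helper (not an ORCA function): returns a list of problems
// found with the cache directory, or an empty array if it looks usable.
function checkCacheDirectory(string $dir, int $mode = 0775): array
{
    $problems = [];

    // Try to create the directory (and any missing parents) if it is absent.
    if (!is_dir($dir) && !@mkdir($dir, $mode, true)) {
        $problems[] = "cannot create directory: $dir";
        return $problems;
    }

    // The user running this script must be able to write into it.
    if (!is_writable($dir)) {
        $problems[] = "directory is not writable: $dir";
    }

    return $problems;
}

// Use the configured directory when available; otherwise fall back to a
// throwaway location (an assumption made only so the sketch is runnable).
$dir = defined('eCACHE_DIRECTORY')
    ? eCACHE_DIRECTORY
    : sys_get_temp_dir() . '/orca_cache_check';

$problems = checkCacheDirectory($dir);
echo $problems ? implode("\n", $problems) . "\n" : "cache directory OK: $dir\n";
```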

How are the XML files stored?

The cache structure will be created as follows:

        -> (data source key hash)
            -> (registry object key hash)
                -> current      // a symlink to the most current copy of the record
                -> (timestamp)  // file containing RIFCS (filename is the unix timestamp when the record was imported)

Every time a record is imported, a new file is created in the appropriate directory and the current symlink is updated to point to the latest version. Because this flat-file structure relies on the file system, every file occupies at least one block of your storage medium (4KB on the ANDS systems), however small the record. This can create quite large cache stores for large registries (on the ANDS production system the cache is over 800MB in size).
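
The write-then-relink behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the code in orca_cache_functions.php: the function name is hypothetical, sha1() is only a stand-in for whatever hash the registry uses for keys, and the symlink is given a relative target.

```php
<?php
// Sketch only: the function name, the use of sha1() for the key hashes and
// the relative symlink target are assumptions, not the ORCA implementation.
function sketchWriteCacheVersion(
    string $cacheRoot,
    string $dataSourceKey,
    string $registryObjectKey,
    string $xml,
    ?int $timestamp = null
): string {
    $timestamp = $timestamp ?? time();

    // <cache root>/<data source key hash>/<registry object key hash>/
    $dir = $cacheRoot . '/' . sha1($dataSourceKey) . '/' . sha1($registryObjectKey);
    if (!is_dir($dir) && !mkdir($dir, 0775, true)) {
        throw new RuntimeException("cannot create $dir");
    }

    // One file per import, named after the unix timestamp of the import.
    $file = $dir . '/' . $timestamp;
    if (file_put_contents($file, $xml) === false) {
        throw new RuntimeException("cannot write $file");
    }

    // Repoint "current" at the file just written. Older versions stay on disk.
    $link = $dir . '/current';
    if (is_link($link)) {
        unlink($link);
    }
    if (!symlink((string) $timestamp, $link)) {
        throw new RuntimeException("cannot create symlink $link");
    }

    return $file;
}
```

Calling the function twice for the same record with different timestamps leaves two version files in the record's directory, with current resolving to the later one. Because old versions are never removed in this sketch, each re-import adds at least one more file system block to the cache.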

The extRif: namespace

The data files contain RIFCS as well as additional information that is regularly used throughout the ANDS System that isn't present in RIFCS schema elements (such as formatted display titles, the key hashes, the search weighting of the record and the registry object status and data source). These objects are stored in the cached file using a different XML namespace (extRif:).

This namespace and its schema are not controlled by a definition, but this schema diagram gives an indication of the fields currently used by ANDS.

You may want to extend this in your own application if you want to run complex business logic only once, when the record is first ingested, and then simply display the stored result in the presentation layer. The extRif namespace also provides a method for mapping variables from the registry across to the indexer (provided you adjust the SOLR schema.xml and /orca/_xsl/rif2solr.xsl accordingly).
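
Because the extra fields live in their own namespace, they can be separated from the plain RIF-CS with a namespace-aware query. The sketch below uses PHP's DOM extension to drop every element in that namespace. The namespace URI is a placeholder (the real URI is not given on this page), and sketchStripExtRif() is a hypothetical name, not a registry function.

```php
<?php
// Sketch only: the namespace URI below is a placeholder; substitute the URI
// that your cached files actually declare for the extRif prefix.
const SKETCH_EXTRIF_NS = 'http://example.org/extRif';

function sketchStripExtRif(string $xml, string $extRifNs = SKETCH_EXTRIF_NS): string
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        throw new InvalidArgumentException('input is not well-formed XML');
    }

    // Match on the namespace URI, not the prefix, so the query still works
    // if a file binds the namespace to a different prefix.
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('x', $extRifNs);

    // Collect the matches first, then remove them from their parents.
    $nodes = iterator_to_array($xpath->query('//x:*'));
    foreach ($nodes as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialise the root element only (no XML declaration).
    return $doc->saveXML($doc->documentElement);
}
```

This removes elements only. Attributes in the namespace, and the namespace declaration itself, are left in place and would need separate handling if they matter to your consumer.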

Algorithmic process

When a record is ingested

  • The record is first imported into the registry database (as normal)
  • Quality checks are run against the record
  • Quality checks are run against all other records in the data source (a new record could potentially improve the quality level of records which relate to that registry object key)
  • All records in that data source are "cached" using the GENERATE_CACHE task which calls /orca/_functions/orca_export_functions.php's getRegistryObjectXMLFromDB() and then parses the output as extRif (param $forSOLR = true)
  • The cache for that data source is updated using the writeCache functions
  • A task is added to reindex the data source the next time SOLR is ready to reindex

When a record is requested

  • getRegistryObject() will first check that the cache is enabled and that the record is cached:
    • If so, it will retrieve the XML from the cache and display it (stripping the extRif: fields if $forSOLR = false)
    • If not, it will retrieve the RIFCS from the database and write it to the cache before displaying it
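
The request path above is a read-through cache. A minimal sketch of that control flow follows. It is not the real getRegistryObject(): the function name and the three callables are hypothetical stand-ins for the registry's cache and database functions, and skipping the cache write when caching is disabled is an assumption.

```php
<?php
// Sketch only: sketchGetRegistryObject() and its callable parameters are
// hypothetical stand-ins for the registry's own cache and database functions.
function sketchGetRegistryObject(
    string $key,
    bool $cacheEnabled,
    callable $readCache,   // fn(string $key): ?string, null on a cache miss
    callable $writeCache,  // fn(string $key, string $xml): void
    callable $loadFromDb   // fn(string $key): string
): string {
    if ($cacheEnabled) {
        $xml = $readCache($key);
        if ($xml !== null) {
            return $xml;          // cache hit: no database access needed
        }
    }

    $xml = $loadFromDb($key);     // cache miss, or caching disabled

    if ($cacheEnabled) {
        $writeCache($key, $xml);  // populate the cache for the next request
    }

    return $xml;
}

// Usage, with an in-memory array standing in for the cache directory.
$cache = [];
$dbCalls = 0;
$read  = function (string $k) use (&$cache) { return $cache[$k] ?? null; };
$write = function (string $k, string $x) use (&$cache) { $cache[$k] = $x; };
$load  = function (string $k) use (&$dbCalls) {
    $dbCalls++;
    return "<registryObject><key>$k</key></registryObject>";
};

sketchGetRegistryObject('abc', true, $read, $write, $load); // miss: loads, then caches
sketchGetRegistryObject('abc', true, $read, $write, $load); // hit: served from $cache
echo "database lookups: $dbCalls\n"; // prints "database lookups: 1"
```

The sketch leaves out the extRif handling: in the flow described above, the extRif: fields would be stripped from the returned XML when $forSOLR = false.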