Moving toward BIOM Format 2.0.0

Jai Ram Rideout edited this page Oct 29, 2013 · 5 revisions

The current BIOM Format 1.0.0, which is based on JSON, breaks down with large tables (greater than 2 GB of uncompressed on-disk storage) because Python's json library cannot handle tables of this size. Even if we worked around this by using a different JSON parsing library or writing our own, staying with a JSON-based format has several problems:

  • it is not easily chunked / subsettable; we have to load all parsed contents into memory at once
  • JSON wasn't really meant for files this large; many other JSON tools, parsers, etc. can't handle these data sizes either
  • as file sizes increase, an ASCII-based file format will cost us more and more compared to a binary one
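
As a rough illustration of the first point, here is a minimal sketch using only Python's standard json module. The document below is a simplified stand-in for a sparse BIOM 1.0.0 table (only a few fields are shown), and the helper `counts_for_sample` is a hypothetical name introduced for this example. Even though only one sample is wanted, the whole document has to be parsed first:

```python
import json

# Simplified stand-in for a sparse BIOM 1.0.0 document (not a complete file).
doc = json.dumps({
    "shape": [3, 2],
    "rows": [{"id": "OTU1"}, {"id": "OTU2"}, {"id": "OTU3"}],
    "columns": [{"id": "S1"}, {"id": "S2"}],
    "data": [[0, 0, 5], [1, 1, 3], [2, 0, 7]],  # [row, column, value] triples
})


def counts_for_sample(json_text, sample_id):
    """Return {observation id: count} for a single sample (hypothetical helper)."""
    # json.loads has no way to parse only part of the text: the whole
    # table becomes Python objects in memory before we can filter it.
    table = json.loads(json_text)
    col = [c["id"] for c in table["columns"]].index(sample_id)
    row_ids = [r["id"] for r in table["rows"]]
    return {row_ids[r]: v for r, c, v in table["data"] if c == col}


s1_counts = counts_for_sample(doc, "S1")
```

For a three-row table this cost is invisible, but the same full parse is required for a multi-gigabyte file.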

Because BIOM tables will likely keep growing for the foreseeable future, the next major version of the BIOM Format needs to be based on a file format (and supporting libraries) that will allow us to efficiently read, write, and interact with tables of this magnitude.

Requirements

Must have

  • platform-independent file format
  • subsettable (i.e., able to pull chunks of data into memory instead of the whole table)
  • readers/writers available in languages other than Python (as with JSON), or a format simple enough that readers/writers are easy to implement
  • extensible enough to allow (at least) the representation of the core contingency table plus metadata
  • compatible licensing terms (we need to keep BIOM under BSD)

Nice to have

  • binary format
  • Python 2 and 3 compatibility (possibly a requirement)

Other considerations

  • project activity, lifetime, support, etc.
  • prior adoption: how are others using the file format / libraries?
  • ease of installation: how heavy are the dependencies that we'll be adding to BIOM?
  • availability: are these libraries available on Linux and OS X?

Candidate file formats / libraries

  • HDF5
    • Pros
      • very well-established file format / library
      • good support in multiple languages
      • parallel I/O capabilities
      • open format
      • various seamless (built-in) compression techniques
      • hierarchical/structured format
      • extensible and very flexible
      • chunking
      • built-in checksums
      • binary format
      • many tools to visualize HDF5 files (e.g., hdfview)
      • random access
    • Cons
      • not as much support for queries, transactions, etc. (not an RDBMS)
    • Python support
      • PyTables
      • h5py
  • SQLite
    • Pros
      • SQL queries, transactions, etc.
    • Cons
      • performance: for what we'll be doing with BIOM, HDF5 will likely perform better (see this link for more info)
    • Python support
      • sqlite3 (part of the Python standard library)
      • APSW
      • probably others
  • Pros
    • widely used for sparse and dense matrices
    • support in multiple languages
    • very simple format
  • Cons
    • ASCII-based
    • not as easily extensible (i.e. we'd have to develop our own extension of the format to store metadata)
    • does not allow random access or parallel I/O out of the box
  • Python support
    • SciPy (scipy.io)
    • NumPy (numpy.genfromtxt)
    • could write our own (pretty much what we're doing now)
  • Pros
    • binary
    • structured
    • extensible
    • support in multiple languages
  • Cons
    • seems very similar to XML/JSON, which raises some red flags; someone on Stack Overflow mentioned it wasn't really meant for large file sizes
  • Python support
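
To make the "SQL queries, transactions" advantage of the sqlite3-backed candidate above more concrete, here is a minimal sketch using the standard-library sqlite3 module. The schema (a single `counts` table holding one row per nonzero entry) is an assumption made for illustration and is not a proposed BIOM layout:

```python
import sqlite3

# In-memory database; a real table would live in a file on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per nonzero entry of the contingency table.
conn.execute(
    "CREATE TABLE counts (observation TEXT, sample TEXT, value REAL)"
)

# Using the connection as a context manager wraps the inserts in a
# transaction: they are committed together, or rolled back on error.
with conn:
    conn.executemany(
        "INSERT INTO counts VALUES (?, ?, ?)",
        [("OTU1", "S1", 5), ("OTU2", "S2", 3), ("OTU3", "S1", 7)],
    )

# Subsetting is a query; only the matching rows are returned to Python.
s1_rows = conn.execute(
    "SELECT observation, value FROM counts "
    "WHERE sample = ? ORDER BY observation",
    ("S1",),
).fetchall()

conn.close()
```

This shows the query convenience only; it says nothing about how such a layout would perform on large tables, which is the concern raised under Cons.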

HDF5 support in Python

Since HDF5 is looking like the best candidate so far, this section contains info about packages that provide HDF5 support in Python. The two main contenders are PyTables and h5py. Both packages are mature, active projects, relatively easy to install, and have great documentation and support. Both projects also support Python 2 and 3, and are under the New BSD license.

PyTables

Pros:

  • SQL-like queries (could be very elegant)
  • ViTables to visualize PyTables files

Cons:

  • HDF5 files will have some extra content/metadata in them, which makes it harder to define exactly what the BIOM Format 2.0.0 file format looks like for those who aren't going to use PyTables to parse/write their files
  • heavier dependency than h5py

Tested with:

  • Ubuntu 12.10 (64-bit)
  • HDF5 install (sudo apt-get install libhdf5-7 libhdf5-dev)
  • Python 2.7.3
  • numpy 1.7.1 (pip install numpy)
  • numexpr 2.2.1 (requires numpy 1.6 or greater) (pip install numexpr)
  • Cython 0.18 (pip install Cython)
  • PyTables 3.0.0 (pip install tables)

h5py

Pros:

  • clean, simple, numpy-like API, as well as lower-level C-like API
  • lighter-weight dependency than PyTables
  • a bit simpler to use than PyTables
  • HDF5 files are cleaner/simpler than those created by PyTables, which adds some extra content/metadata to its files; this makes it easier to formally define what a BIOM Format 2.0.0 table looks like for those who will read or write it with other tools

Cons:

  • no support for SQL-like queries

Tested with:

  • same deps as for PyTables, minus numexpr
  • h5py 2.2.0 (pip install h5py)
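
As a rough illustration of the numpy-like API noted above, together with the chunking, compression, and random access listed among the HDF5 advantages, here is a minimal h5py sketch. The dataset names (`data`, `observation_ids`), the chunk shape, and the file name are assumptions made for this example; they are not a proposed BIOM Format 2.0.0 layout:

```python
import os
import tempfile

import h5py
import numpy as np

# Small dense stand-in for a contingency table: 4 observations x 3 samples.
counts = np.arange(12, dtype="float64").reshape(4, 3)
obs_ids = np.array([b"OTU1", b"OTU2", b"OTU3", b"OTU4"])

path = os.path.join(tempfile.mkdtemp(), "table_sketch.h5")

with h5py.File(path, "w") as f:
    # Chunked, gzip-compressed storage; compression is applied per chunk
    # and is transparent to readers.
    f.create_dataset("data", data=counts, chunks=(2, 3), compression="gzip")
    f.create_dataset("observation_ids", data=obs_ids)

with h5py.File(path, "r") as f:
    # Slicing a dataset reads only the requested region from disk and
    # returns an ordinary numpy array.
    subset = f["data"][1:3, :]
    subset_ids = f["observation_ids"][1:3]
```

Here `subset` holds rows 1 and 2 of the table without the rest of the dataset being loaded, which is the subsetting behavior listed under the "Must have" requirements.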