
BIOM Format 2.0.0 Code Sprint Goals

Jai Ram Rideout edited this page Nov 15, 2013 · 12 revisions

This page lists potential goals, priorities, and to-do items for the BIOM Format 2.0.0 code sprint, which is taking place in Flagstaff, AZ from 12/2/2013 - 12/6/2013.

Prior to the code sprint

Here are some things that each developer needs to do before the code sprint begins.

Dependencies

Each developer should have the following dependencies installed. I have listed the commands that I (@jrrideout) ran on my Ubuntu 12.10 laptop to install each dependency:

  • HDF5 libraries (sudo apt-get install libhdf5-7 libhdf5-dev)
  • numpy (preferably a modern version >= 1.6; tested with 1.7.1 and 1.8.0) (pip install numpy)
  • h5py (tested with 2.2.0) (pip install h5py)
  • scipy (preferably the newest available version; tested with 0.13.0) (pip install scipy)
  • numexpr (tested with 2.2.1) (pip install numexpr)
  • Cython (tested with 0.18) (pip install Cython)
  • PyTables (tested with 3.0.0) (pip install tables)
  • biom-format repository (fork and clone the repo; ensure that all unit tests are passing and that you're set up to submit pull requests)
  • protobiom repository (fork and clone the repo)

Note: this is not a finalized list of dependencies; the actual dependency list will probably be much shorter. The goal here is to install beforehand some of the dependencies we're likely to need. This will cut down on the time needed to set up development environments for everyone and give us more time to troubleshoot any installation issues that might arise.

Reading

Read up on h5py, one of the HDF5 APIs available in Python. The quick start guide is really helpful and pretty concise. Also read up on HDF5 itself: in addition to poking through the website, their intro guide is pretty good.
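As a warm-up, the core h5py idioms from the quick start guide fit in a few lines. This is just a toy sketch; the file name, dataset name, and attribute below are purely illustrative:

```python
import h5py
import numpy as np

# Create a file, write a dataset, and tag it with an attribute
# (file/dataset names here are made up for illustration).
with h5py.File("demo.h5", "w") as f:
    dset = f.create_dataset("counts", data=np.arange(10))
    dset.attrs["units"] = "observations"

# Reopen and read; datasets slice like numpy arrays.
with h5py.File("demo.h5", "r") as f:
    total = f["counts"][:].sum()
    units = f["counts"].attrs["units"]
```

Note that slicing with `[:]` pulls the dataset into memory as a numpy array, while partial slices (e.g., `f["counts"][2:5]`) read only the requested region from disk.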

Code sprint goals

The following are potential goals to accomplish during the code sprint. This list will need to be discussed, voted on, and prioritized as we probably won't have time to get to everything.

Define 2.0.0 file format

First, before doing any coding, we need to define what the new file format will be:

  • what stays the same?
  • what existing pieces need to change?
  • what new data do we want to store, if any?
  • what about table types (e.g., OTU table, taxon table, metabolite table, etc.)? There is currently no structural distinction between table types, just a controlled vocabulary of recognized types
  • do we want richer descriptions of metadata? Metadata is currently extremely generic and has no requirements or validation
  • should we have an attribute that indicates whether the table is in absolute or relative abundances?

An artifact of this goal will be a new format specification that can be posted on the BIOM website (similar to the 1.0.0 spec).
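To make the discussion concrete, here is one hypothetical sketch of what a 2.0.0 table could look like in HDF5. The group/dataset names, attributes, and CSR layout below are purely illustrative, not a proposed spec:

```python
import h5py
import numpy as np
from scipy.sparse import csr_matrix

# Toy OTU-like table stored in CSR form (values are made up).
table = csr_matrix(np.array([[0, 2, 0],
                             [1, 0, 3]]))

with h5py.File("table.h5", "w") as f:
    # Top-level attributes could carry format metadata.
    f.attrs["format-version"] = "2.0.0"
    f.attrs["type"] = "OTU table"
    # Store the three CSR arrays as datasets (names are illustrative).
    f.create_dataset("matrix/data", data=table.data)
    f.create_dataset("matrix/indices", data=table.indices)
    f.create_dataset("matrix/indptr", data=table.indptr)

# Reading back reconstructs the sparse matrix without ever densifying.
with h5py.File("table.h5", "r") as f:
    loaded = csr_matrix((f["matrix/data"][:],
                         f["matrix/indices"][:],
                         f["matrix/indptr"][:]),
                        shape=(2, 3))
```

One nice property of a layout like this: because HDF5 datasets support partial reads, a consumer could slice out a subset of rows via `indptr` without loading the whole table.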

BIOM Python API changes

Performance improvements

The current API is very generic: a table method usually takes a Python function as an argument and applies that function to perform sorting, filtering, transforming, etc. This is elegant and makes the BIOM API very general-purpose, but it hits us pretty hard performance-wise. The biom.table.Table methods usually iterate over rows or columns, applying the user-supplied function to each row/column, and each row/column has to be converted from sparse format into a dense numpy row vector before the function can be applied. Iterating over a numpy array (or a Python sequence) is already pretty slow; coupling this with sparse -> dense conversions further aggravates the problem. This is the main reason we're currently not seeing huge performance benefits from the new scipy sparse matrix backend that I added to BIOM recently. For operations that can hit the underlying backend matrix directly, we see huge gains (e.g., summing the table). For all other operations, the gains aren't nearly as striking because we're spending most of the time iterating and converting.

One solution I've been toying with (see the HDF5 prototype in protobiom) is to avoid iterating and converting from sparse to dense wherever possible. The prototype loads all sparse data into a scipy sparse matrix and uses vectorized or slicing operations to do things like sorting, converting to relative abundances, etc. The idea is to always use the scipy sparse matrix API for table operations because it is highly efficient. This works very well in practice: these operations can be applied to the EMP table in seconds rather than minutes, hours, or days. The drawback is that the BIOM backends will need to become more complicated than they currently are; each backend will need more specialized methods for sorting, filtering, etc. When these changes happen, I'd like to remove the pure-Python CSMat backend in favor of the scipy backend. I also think we should implement a base class for backends at this point, so that we have a dense (numpy) backend and a sparse scipy backend sharing a common interface.
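To make the cost difference concrete, here's a small sketch (not code from protobiom; the table values are made up) contrasting the per-row densifying pattern with a vectorized scipy operation, using relative-abundance conversion as the example:

```python
import numpy as np
from scipy.sparse import csr_matrix

table = csr_matrix(np.array([[1.0, 3.0, 0.0],
                             [0.0, 2.0, 2.0]]))

# Slow pattern the current API encourages: densify each row,
# apply a Python-level function, and rebuild the result.
slow = np.vstack([row.toarray()[0] / row.sum() for row in table])

# Vectorized alternative: scale all rows at once, staying sparse.
row_sums = np.asarray(table.sum(axis=1)).ravel()
fast = table.multiply(1.0 / row_sums[:, np.newaxis]).tocsr()

assert np.allclose(fast.toarray(), slow)
```

On a toy table the two are indistinguishable, but on something the size of the EMP table the per-row loop pays Python-function-call and sparse -> dense conversion overhead on every row, while the vectorized version runs entirely inside scipy.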

Switching to HDF5 helps us a lot in terms of reading and writing data, but if tables can't be manipulated efficiently once they are loaded, the BIOM project will continue to be plagued by scaling issues. Thus, I think this goal should be high-priority (though it is not trivial).
