BIOM Format 2.0.0 Code Sprint Goals

This page lists potential goals, priorities, and to-do items for the BIOM Format 2.0.0 code sprint, which is taking place in Flagstaff, AZ from 12/2/2013 to 12/6/2013.

Prior to the code sprint

Here are some things that each developer needs to do before the code sprint begins.

Dependencies

Each developer should have the following dependencies installed. I have listed the commands that I (@jrrideout) ran on my Ubuntu 12.10 laptop to install each dependency:

HDF5 libraries (sudo apt-get install libhdf5-7 libhdf5-dev)
numpy (preferably a modern version >= 1.6; tested with 1.7.1 and 1.8.0) (pip install numpy)
h5py (tested with 2.2.0) (pip install h5py)
scipy (preferably the newest available version; tested with 0.13.0) (pip install scipy)
numexpr (tested with 2.2.1) (pip install numexpr)
Cython (tested with 0.18) (pip install Cython)
PyTables (tested with 3.0.0) (pip install tables)
biom-format repository (fork and clone the repo; ensure that all unit tests are passing and that you're set up to submit pull requests)
protobiom repository (fork and clone the repo)

Note: this is not a finalized list of dependencies. The actual dependency list will probably be much shorter. The goal here is to install the dependencies we are likely to need ahead of time. This will cut down on the time needed to set up development environments for everyone and give us more time to troubleshoot any installation issues that might arise.
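To confirm that the Python dependencies above are importable before the sprint, something like the following could be run. This is only a sketch: the function name check_dependencies is made up for illustration, and it assumes each package exposes a __version__ attribute (it reports "unknown" otherwise). It does not check the HDF5 system libraries or the two repositories.

```python
import importlib

# Import names for the Python packages listed above
# (PyTables is imported as "tables").
DEPENDENCIES = ["numpy", "h5py", "scipy", "numexpr", "Cython", "tables"]


def check_dependencies(names):
    """Return {name: version string, or None if the import fails}.

    Hypothetical helper for illustration; not part of biom-format.
    """
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # A missing package raises ImportError; a broken install
            # (e.g., mismatched HDF5 libraries) may raise something else.
            found[name] = None
        else:
            found[name] = getattr(module, "__version__", "unknown")
    return found


if __name__ == "__main__":
    for name, version in check_dependencies(DEPENDENCIES).items():
        print("%-8s %s" % (name, version or "NOT INSTALLED"))
```

Compare the printed versions against the "tested with" versions listed above; anything reported as NOT INSTALLED still needs to be set up.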

Reading

Read up on h5py, which is one of the HDF5 APIs available in Python. The quick start guide is really helpful and pretty concise. Also read up on HDF5 itself: besides poking through the website, have a look at their intro guide, which is pretty good.

Code sprint goals

The following are potential goals to accomplish during the code sprint. This list will need to be discussed, voted on, and prioritized, as we probably won't have time to get to everything.

Define 2.0.0 file format

what stays the same?
what existing pieces need to change?
what new data do we want to store, if any?
what about table types (e.g., OTU table, taxon table, metabolite table)? There are currently no distinctions between table types (just a controlled vocabulary of recognized table types)
do we want richer descriptions of metadata? Metadata is currently extremely generic and has no requirements/validation
should we have an attribute that indicates whether the table is in absolute or relative abundances?

Update BIOM Python API

the current API is very generic: a table method usually takes a Python function as an argument and applies that function to do sorting, filtering, transforming, and so on. This is elegant and makes the BIOM API very general-purpose, but this high-level generic API carries a heavy performance cost. The cost arises because the biom.table.Table methods usually iterate over rows or columns, applying the user-supplied function to each row/column. Each row/column has to be converted from sparse format into a dense numpy row vector before the function can be applied. Iterating over a numpy array (or a Python sequence) is already pretty slow; coupling this with sparse -> dense conversions further aggravates the problem. One solution I've been toying with
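The per-row densification cost described above can be illustrated with a toy example. This is a sketch under loud assumptions: rows are stored as plain {column_index: value} dicts rather than in BIOM's actual sparse backend, and the function names (densify, filter_rows_generic, filter_rows_by_min_sum) are made up for illustration. It is not the biom.table.Table API, and it is not necessarily the solution the author had in mind.

```python
def densify(sparse_row, n_cols):
    """Expand a {column_index: value} row into a dense list of length n_cols."""
    dense = [0.0] * n_cols
    for col, value in sparse_row.items():
        dense[col] = value
    return dense


def filter_rows_generic(rows, n_cols, keep):
    """Generic pattern: densify every row, then call the user's function.

    Work per row grows with n_cols, even when the row is mostly zeros.
    """
    return [i for i, row in enumerate(rows) if keep(densify(row, n_cols))]


def filter_rows_by_min_sum(rows, min_sum):
    """Specialized alternative: sum only the stored nonzero values.

    Work per row grows with the number of nonzeros; nothing is densified.
    """
    return [i for i, row in enumerate(rows) if sum(row.values()) >= min_sum]


# Three rows by five columns; zeros are simply not stored.
rows = [{0: 3.0, 4: 1.0}, {2: 0.5}, {1: 2.0, 3: 2.0}]

generic = filter_rows_generic(rows, 5, lambda dense: sum(dense) >= 2.0)
specialized = filter_rows_by_min_sum(rows, 2.0)
print(generic, specialized)
```

Both calls select the same rows, but the generic version pays for a full-width dense vector and a Python function call per row. The trade-off is flexibility: the specialized function handles only one kind of filter.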
