
BIOM Format 2.0.0 Code Sprint Goals


This page lists potential goals, priorities, and to-do items for the BIOM Format 2.0.0 code sprint, which is taking place in Flagstaff, AZ from December 2-6, 2013 (hosted by the Caporaso Lab).

Prior to the code sprint

In order to make the best use of the time available to us during the code sprint, there are a number of things that each developer needs to do before the code sprint begins (preferably a few days in advance).

Dependencies

Each developer should have the following dependencies installed. I have listed the commands that I (@jrrideout) ran on my Ubuntu 12.10 laptop to install each dependency:

  • HDF5 libraries (sudo apt-get install libhdf5-7 libhdf5-dev)
  • numpy (preferably a modern version >= 1.6; tested with 1.7.1 and 1.8.0) (pip install numpy)
  • h5py (tested with 2.2.0) (pip install h5py)
  • scipy (preferably the newest available version; tested with 0.13.0) (pip install scipy)
  • numexpr (tested with 2.2.1) (pip install numexpr)
  • Cython (tested with 0.18) (pip install Cython)
  • PyTables (tested with 3.0.0) (pip install tables)
  • biom-format repository (fork and clone the repo; ensure that all unit tests are passing and that you're set up to submit pull requests, e.g. as you would with QIIME development)
  • protobiom repository (fork and clone the repo)

It may be helpful (but not required) to use virtualenv and virtualenvwrapper to create an isolated Python development environment; you can then install the dependencies within the virtual environment.

Note: this is not a finalized list of dependencies. The actual dependency list will probably be much shorter. The goal here is to get some of the likely dependencies installed beforehand that we may need. This will cut down on the time needed to set up development environments for everyone and give us more time to troubleshoot any installation issues that might arise.

Reading

Read up on h5py, which is one of the HDF5 APIs available in Python. The quick start guide is really helpful and pretty concise. Also read up on HDF5 itself: in addition to poking around the website, the intro guide is pretty good.
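
To get a feel for the API ahead of time, here is a minimal, standalone h5py sketch (the file, group, and dataset names are made up for illustration):

```python
import h5py
import numpy as np

# Create an HDF5 file containing a group, a compressed dataset, and an
# attribute (all names here are arbitrary).
with h5py.File('example.h5', 'w') as f:
    grp = f.create_group('observation')
    grp.create_dataset('counts', data=np.arange(10), compression='gzip')
    grp.attrs['description'] = 'toy data'

# Read it back; datasets slice like numpy arrays, without loading the
# entire dataset into memory.
with h5py.File('example.h5', 'r') as f:
    first_five = f['observation/counts'][:5]
    print(first_five, f['observation'].attrs['description'])
```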

If you haven't already been following the 2.0.0 discussion on the biom-format issue tracker, I recommend reading through it and adding any comments of your own. I also put together a wiki page with some requirements and notes about the 2.0.0 transition here, which I recommend taking a quick pass through.

Familiarize yourself with the existing biom-format codebase

For those who are not already familiar with the existing biom-format codebase, please take some time to look through the repository to become familiar with the project layout, as well as the class designs and APIs. I also recommend reading through the current documentation/tutorials, which can be built with the following commands:

cd doc/
make html
open _build/html/index.html

Code sprint goals

The following are potential goals to accomplish during the code sprint. This list will need to be discussed, voted on, and prioritized as we probably won't have time to get to everything.

Define 2.0.0 file format

First, before doing any coding, we need to define what the new file format will be:

  • what stays the same?
  • what existing pieces need to change?
  • what new data do we want to store, if any?
  • what about table types (e.g., OTU table, taxon table, metabolite table)? There are currently no distinctions between table types beyond a controlled vocabulary of recognized types
  • do we want richer descriptions of metadata? Metadata is currently extremely generic and has no requirements/validation
  • should we have an attribute that indicates whether the table is in absolute or relative abundances?

An artifact of this goal will be a new format specification that can be posted on the BIOM website (similar to the 1.0.0 spec).
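
Purely to make that discussion concrete, here is one hypothetical sketch of what a 2.0.0 file could look like when written with h5py; none of these group/dataset names, attributes, or layout choices are decided:

```python
import h5py
import numpy as np

# Hypothetical 2.0.0 layout: per-axis ids, a CSR-style sparse count matrix,
# and table-level attributes. Every name below is a strawman for discussion.
with h5py.File('draft_table.biom', 'w') as f:
    f.attrs['format-version'] = '2.0.0'
    f.attrs['type'] = 'OTU table'        # controlled vocabulary?
    f.attrs['generated-by'] = 'protobiom'

    f.create_dataset('observation/ids', data=np.array([b'OTU1', b'OTU2']))
    f.create_dataset('sample/ids', data=np.array([b'S1', b'S2', b'S3']))

    # CSR representation of a 2 x 3 count matrix.
    f.create_dataset('data/values', data=np.array([2.0, 1.0, 3.0]))
    f.create_dataset('data/indices', data=np.array([1, 0, 2]))
    f.create_dataset('data/indptr', data=np.array([0, 1, 3]))
```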

Decide on compatibility support

Now that there will be multiple versions of the BIOM file format (1.0.0 and 2.0.0), we need to decide what support we will continue to provide for 1.0.0 files. At a minimum, there will need to be bidirectional conversion routines between "classic" (TSV) tables, 1.0.0 files, and 2.0.0 files. I think we'll want these conversion routines to be available at the Python API level as well as via the command line (e.g., similar to biom convert). It is extremely important to make these conversion routines as intuitive and easy to use as possible. There has been a lot of confusion on the QIIME forum around converting between file formats with the current biom convert command, which leads to frustrated users and an increased support burden. Adding another file format to the mix creates even more potential for things to go wrong here.

In addition to conversion routines, we need to decide whether we'll continue to provide native support for 1.0.0 files in the Python API. For example, should the BIOM table parser seamlessly accept 1.0.0 and 2.0.0 files as input? What about writing? Should there be deprecation warnings if a 1.0.0 file is supplied?
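
As a rough sketch of what a format-agnostic conversion layer could look like at the Python level (the function names and the parser/writer registries below are invented for illustration, not part of biom-format):

```python
# Hypothetical conversion layer; none of these names exist in biom-format.
def detect_format(path):
    """Return 'classic', '1.0.0', or '2.0.0' by sniffing the file header."""
    with open(path, 'rb') as f:
        head = f.read(8)
    if head.startswith(b'\x89HDF'):   # HDF5 magic number -> 2.0.0
        return '2.0.0'
    if head.startswith(b'{'):         # JSON -> 1.0.0
        return '1.0.0'
    return 'classic'                  # otherwise assume TSV

def convert(in_path, out_path, to):
    """Convert between classic/1.0.0/2.0.0 without the user naming formats."""
    table = PARSERS[detect_format(in_path)](in_path)  # hypothetical registry
    WRITERS[to](table, out_path)                      # hypothetical registry
```

Auto-detecting the input format, as sketched above, would remove one common source of confusion with biom convert: users having to tell the tool which direction they're converting.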

Improved handling of metadata

There are improvements related to the handling of metadata that may be good to take care of during the sprint. A number of these improvements were suggested by @zaneveld. In particular, it'd be great to have a way to export metadata from a table (including support for multiple columns).

It may also be beneficial to port the metadata mapping file functionality and documentation from QIIME (see this issue for more details).
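
As a rough sketch of what multi-column metadata export could look like (the function and its arguments are hypothetical; the real version would presumably hang off the Table API):

```python
import csv

def export_metadata(ids, metadata, columns, out_path):
    """Hypothetical helper: write selected metadata columns to TSV.

    ids: sample or observation ids
    metadata: list of dicts, parallel to ids
    columns: the metadata keys to export
    """
    with open(out_path, 'w') as f:
        writer = csv.writer(f, delimiter='\t', lineterminator='\n')
        writer.writerow(['#ID'] + list(columns))
        for id_, md in zip(ids, metadata):
            writer.writerow([id_] + [md.get(c, '') for c in columns])
```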

Port BIOM table scripts from QIIME

There are a number of scripts in QIIME that allow one to perform general BIOM table operations, such as sorting, filtering, merging, and splitting. These scripts should be moved from QIIME into the biom-format project and renamed/generalized where necessary (e.g., changing references to OTUs to observations). Many of the scripts are listed in this QIIME tutorial.

BIOM Python API changes

Performance improvements

The current API is very generic: a table method usually takes a Python function as an argument and applies that function to perform, e.g., sorting, filtering, or transforming. This is elegant and makes the BIOM API very general-purpose, but it ends up hitting us pretty hard performance-wise. The biom.table.Table methods usually iterate over rows or columns, applying the user-supplied function to each row/column, and each row/column has to be converted from sparse format into a dense numpy row vector before the function can be applied. Iterating over a numpy array (or a Python sequence) is already pretty slow; coupling this with sparse -> dense conversions further aggravates the problem. This is the main reason we're currently not seeing huge performance benefits from the new scipy sparse matrix backend that I added to BIOM recently. For operations that can hit the underlying backend matrix directly, we see huge gains (e.g., summing the table). For all other operations, the gains aren't nearly as striking because most of the time is spent iterating and converting.

One solution I've been toying with (see the HDF5 prototype in protobiom) is to avoid iterating and converting from sparse to dense wherever possible. The prototype loads all sparse data into a scipy sparse matrix and uses vectorized or slicing operations to do things like sorting and converting to relative abundances. The idea is to always use the scipy sparse matrix API to do table operations because it is highly efficient. This works very well in practice: these operations can be applied to the EMP table in seconds as opposed to minutes, hours, or days. The drawback is that we'll need to make the BIOM backends more complicated than they currently are, since each backend will need more specialized methods for sorting, filtering, and so on. When these changes happen, I'd like to remove the pure-Python CSMat backend in favor of the scipy backend. I also think we should implement a base class for backends at this point, so that we'll have a dense (numpy) backend and a sparse (scipy) backend that share a common interface.
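
To illustrate the difference between the two patterns, here's a standalone toy comparison using scipy (not the actual BIOM code):

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

# Toy observation x sample count matrix.
counts = csr_matrix(np.array([[0., 2., 0.],
                              [1., 0., 3.],
                              [4., 1., 0.]]))

# Slow pattern: iterate over samples, densifying each column.
rel_slow = np.empty(counts.shape)
for j in range(counts.shape[1]):
    col = counts[:, j].toarray().ravel()   # sparse -> dense conversion
    rel_slow[:, j] = col / col.sum()

# Fast pattern: a single vectorized operation that stays sparse throughout;
# multiplying by a diagonal matrix scales each column by 1/column_sum.
sample_sums = np.asarray(counts.sum(axis=0)).ravel()
rel_fast = counts.dot(diags(1.0 / sample_sums))

assert np.allclose(rel_slow, rel_fast.toarray())
```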

Switching to HDF5 helps us a lot in terms of reading and writing data, but if tables can't be manipulated efficiently once they are loaded, the BIOM project will still continue to be plagued with scaling issues. Thus, I think this goal should be high-priority (but it is not trivial).

Fix API inconsistencies

Some of the current Table methods take an axis argument, which is either 'samples' or 'observations' (similar to axes in numpy/scipy). For example, Table.reduce. Other methods encode the axis in their names, e.g., Table.sortBySampleId/Table.sortByObservationId. I think it'd be good to choose one or the other in order to present a more consistent API to users. My (@jrrideout) preference is to use an axis argument as it is conceptually similar to numpy/scipy and cuts down on the number of methods that a Table has. This goal could likely be taken care of while working on the performance improvements goal.
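
As a sketch of the two styles (method signatures are simplified, and the axis-based form is a proposal, not the current API):

```python
# Current style: the axis is encoded in the method name.
table.sortBySampleId()
table.sortByObservationId()

# Proposed style: one method with an axis argument, as in numpy/scipy.
table.sort(axis='sample')
table.sort(axis='observation')
```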

Remove multiple inheritance from Table inheritance hierarchy

The current hierarchy defines an abstract base class, Table, with SparseTable and DenseTable deriving from it. A separate set of classes, such as OTUTable, PathwayTable, etc., each specify a different type of table. The concrete table classes then use multiple inheritance to derive from either SparseTable or DenseTable plus one of the "type" tables, e.g., SparseOTUTable or DensePathwayTable.

This inheritance hierarchy leads to 14 different derived classes, plus 10 abstract classes. I think the hierarchy could be redesigned so that the matrix type (i.e., sparse/dense) isn't encoded into the class name, which would cut down on the number of classes. After all, a Table shouldn't be concerned with whether its backing data matrix is in sparse or dense format; it can use the backend interface the same way in either case. I understand why this design decision was made originally, but if all backends have a unified interface, we no longer need to encode matrix type into the inheritance hierarchy and can instead use aggregation. I think this will ultimately be a more flexible design, and it is also in line with the GoF guideline to favor composition over inheritance.
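
A rough sketch of the aggregation-based design (class and method names are illustrative, not a finalized API):

```python
class DataBackend(object):
    """Common interface that sparse and dense backends both implement."""
    def sum(self, axis):
        raise NotImplementedError

class ScipySparseBackend(DataBackend):
    def __init__(self, matrix):
        self._matrix = matrix            # a scipy.sparse matrix
    def sum(self, axis):
        return self._matrix.sum(axis=axis)

class Table(object):
    """A single Table class; the matrix representation is delegated to a
    backend (aggregation) instead of being encoded in the class name."""
    def __init__(self, backend, table_type):
        self._backend = backend
        self.table_type = table_type     # e.g., 'OTU table'

    def sum(self, axis):
        return self._backend.sum(axis)
```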

Drop dense table support

After discussing with the rest of the code sprint group, we agreed that it may be worth dropping dense table support altogether. This will greatly reduce the complexity of the codebase, and will slightly simplify the file format. There will be performance (memory and runtime) costs if large dense BIOM tables are represented in sparse structures, but there will still be overall scaling issues (mainly memory) even if we support loading the data into a dense numpy array. Thus, we are sacrificing some smaller performance gains in favor of a much smaller and simpler codebase.

One of the most heavily used table types is the OTU table. These tables are continuing to grow in size (particularly in the number of observations) while remaining very sparse. Other table types are more dense (e.g., PICRUSt metagenome tables are usually around 75% dense), but their number of observations is staying relatively stable (in the thousands). Preliminary investigation of metabolite tables suggests that these also have fairly few observations (in the hundreds). Thus, it makes sense to tailor BIOM to support large sparse matrices, at a small performance cost for the dense table cases.
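
As a back-of-the-envelope illustration of the trade-off (the table dimensions and density below are made up, but representative of a large OTU table):

```python
# Rough memory estimate for a hypothetical 10^6 observation x 10^4 sample
# OTU table at 1% density, with float64 values.
n_obs, n_samples, density = 10**6, 10**4, 0.01
nnz = int(n_obs * n_samples * density)

dense_bytes = n_obs * n_samples * 8              # every cell stored
# CSR: an 8-byte value plus a 4-byte column index per nonzero,
# plus one 4-byte row pointer per row.
sparse_bytes = nnz * (8 + 4) + (n_obs + 1) * 4

print(dense_bytes / 1e9, 'GB dense')    # ~80 GB
print(sparse_bytes / 1e9, 'GB sparse')  # ~1.2 GB
```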