Roadmap
This document describes a possible roadmap for future developments of the KlustaSuite.
Most of these developments are expected to occur in the second half of 2014.
The goal of spike sorting is to extract single-unit spiking activity from raw extracellular recordings. Those recordings are now increasingly acquired through high-density multielectrode arrays. The number of recording channels is currently growing at a fast rate. Whereas probes with tens of channels are routinely used today, probes containing several hundreds or even thousands of channels are expected in the coming months and years. This deluge of data requires novel algorithmic and software developments.
In our group, we have developed a software suite tackling the spike sorting problem with a particular set of algorithms. We also designed a new file format based on HDF5.
Our software suite is designed around a rigid workflow: filtering, spike detection, waveform extraction, automatic clustering, and manual clustering. This workflow has been widely used for years in many research teams.
However, other workflows exist and might prove more powerful. For example, retinal data is increasingly being sorted with algorithms based on template matching. These methods allow extracted waveforms to be semi-automatically matched to the original data. They often yield greater sorting quality, notably by mitigating the problem of spike overlap. Whether this approach can be beneficial on cortical data remains to be tested.
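To give an intuition of the principle (this is a toy illustration, not any specific published algorithm), here is a single-channel sketch of template matching:

```python
# Toy single-channel illustration of template matching, for intuition only.
import numpy as np

rng = np.random.RandomState(0)
template = -np.hanning(40) * np.exp(-np.arange(40) / 10.)  # toy spike waveform
trace = .1 * rng.randn(20000)
trace[1000:1040] += 3. * template    # embed two spikes of different amplitudes
trace[5000:5040] += 2. * template

# Slide the template along the trace and find where it best explains the data.
score = np.correlate(trace, template, mode='valid')
t0 = int(np.argmax(score))           # best-matching position (here, 1000)

# Subtract the best-fitting scaled template ("peeling"): repeating this on the
# residual is what mitigates the spike overlap problem.
amplitude = score[t0] / np.dot(template, template)
trace[t0:t0 + len(template)] -= amplitude * template
```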
Further approaches and workflows might also be considered. Thus, spike sorting remains an open research question today. A generic, fully automatic, real-time method for spike sorting is out of reach in the medium term, but we might get closer in the coming years. Yet, progress in this respect is hindered by the variety of workflows, algorithms, experimental protocols, file formats, implementations, and programming languages used by the various research teams around the world.
Our own software suite is hard to extend, which makes such experimentation difficult.
For these reasons, we propose to develop the foundations of a modern open framework for large-scale electrophysiology. We will focus on the Python language, because we believe it is a very strong candidate for such a framework. Instead of providing a monolithic graphical application, we will create reusable components in Python (high-performance visualization views, graphical widgets, algorithms). These components will be organized around the IPython notebook. This innovative platform provides a modern, dynamic, Web-based framework for performing computational experiments in a reproducible way.
In the following, we will motivate and describe all those elements.
Python has the following strengths:
- Python is an open-source language, which is a strong benefit compared with commercial, closed-source alternatives.
- Python provides a high-performance environment for numerical computing.
- There are many solid and actively maintained libraries for scientific computing.
- Scientific Python has a very dynamic and active community, particularly in neuroscience. It is supported by many researchers, research institutions, and industries around the world.
- The language itself is expressive, multi-paradigm and easy to learn.
- Python can very easily integrate with non-Python code and libraries (C/C++, FORTRAN, etc.).
- MATLAB users can move to Python relatively easily, since those two languages share many programming concepts and syntax paradigms.
Python also has a few weaknesses. We describe below how we plan to address them.
- Two incompatible versions of Python coexist at this time: Python 2 and Python 3. Python 2 is minimally maintained, whereas all recent developments concern Python 3 only. Today, most scientific computing libraries are compatible with both Python 2 and Python 3. However, many people are still using Python 2, some libraries remain incompatible with Python 3, and some computing environments have not upgraded to Python 3 yet. For all these reasons, it is not advisable to support only one branch of Python. Fortunately, several solid solutions exist to write a single codebase compatible with both Python 2 and Python 3 (notably `six.py`, a popular module that we choose to rely upon; see the sketch after this list). Robust software engineering methods (testing suite, code coverage, continuous integration) make it possible to ensure full compatibility of the code.
- Although Python is multi-platform, distributing a Python application is known to be particularly painful. Many incompatible packaging systems have been developed for Python, and none of them is perfect. However, there are currently many efforts in the direction of an "ultimate" packaging system for Python, and things are really improving. One of the most promising efforts is conducted by a company, Continuum Analytics. They have developed an open-source system, conda, for building and distributing multi-platform Python packages. This system might be combined with methods for building standalone installers and executables. Finally, Python programs might be able to run in the browser in the near future, which would considerably simplify the distribution of Python-based applications.
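As an illustration, here is a minimal sketch of how `six` can be used to write version-agnostic code (a generic example, not code from our suite):

```python
# Minimal sketch: a single codebase running on Python 2 and Python 3 with six.
import six

def describe(d):
    # six.iteritems dispatches to dict.iteritems on Python 2
    # and dict.items on Python 3.
    for key, value in six.iteritems(d):
        # six.text_type is unicode on Python 2 and str on Python 3.
        print(six.text_type(key) + ': ' + six.text_type(value))

describe({'channels': 32, 'sample_rate': 20000})
```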
We want to create a flexible framework for large-scale electrophysiology. There are two complementary goals:
- Provide an effective computing environment for experimenting with new ideas, algorithms, implementations, workflows, and user interfaces in the context of large-scale electrophysiology (including spike sorting).
- Provide a Web-based framework for building a user-friendly graphical interface for spike sorting (including the manual stage and human supervision).
The core idea is that the exact same framework can be used both for research experiments on spike sorting algorithms and for user-facing graphical interfaces. In the first case, one would essentially write code interactively and visualize the results on request. In the second case, users would not necessarily have to write code: they would open their datasets and execute the different steps by clicking on buttons or by typing extremely simple commands (like `run_filtering()`).
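To make this concrete, here is a hypothetical sketch of what such commands could look like; all function names and parameters are illustrative, not an existing API:

```python
# Hypothetical sketch: each workflow step is a plain Python function that an
# end-user can call from the notebook. All names here are illustrative.
import numpy as np
from scipy.signal import butter, filtfilt

def run_filtering(raw, rate=20000., low=500., high=9000.):
    """Band-pass filter raw traces, shaped (n_channels, n_samples)."""
    b, a = butter(3, [low / (rate / 2.), high / (rate / 2.)], btype='band')
    return filtfilt(b, a, raw, axis=1)

def run_detection(filtered, threshold=4.5):
    """Return the sample indices where any channel crosses the threshold,
    expressed in multiples of a robust estimate of the noise SD."""
    noise_sd = np.median(np.abs(filtered), axis=1)[:, None] / .6745
    return np.nonzero((filtered < -threshold * noise_sd).any(axis=0))[0]

# In a notebook session, an end-user would only type something like:
raw = .05 * np.random.randn(32, 20000)   # placeholder for a real recording
spikes = run_detection(run_filtering(raw))
```

A researcher could swap any of these functions for a custom implementation without changing the rest of the workflow.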
Having the same architecture for both cases is a huge advantage:
- The path from research experiments to user testing is drastically shortened. Once a new algorithm is implemented, it can be immediately tested by end-users who do not necessarily have any programming background.
- Advanced end-users can perform their own experiments, and they can customize their workflows at will.
- It saves the many man-hours required to build a monolithic GUI that would be hard to extend. These man-hours can be spent experimenting with new algorithms and workflows.
- It becomes vastly easier to attract external contributors. It is very hard for new contributors to work on the code of a monolithic GUI; by contrast, it is much easier to contribute to a modular framework, since the exact same framework is used for coding and testing.
- More philosophically, it is becoming increasingly clear that desktop GUIs belong to the past, whereas Web applications are the future. It seems hazardous to build a future-proof framework on a dying concept.
Technically, implementing this framework is made possible thanks to IPython 2.0, released in April 2014.
According to the official documentation, the IPython notebook is "a web-based interactive computational environment where you can combine code execution, text, mathematics, plots, visualizations and rich media into a single document". Since version 2.0, it is possible to create interactive widgets. This new feature is essential for building an interactive framework that combines code and visualizations.
The IPython notebook has a distributed architecture with a server and a client, which may or may not run on the same computer.
- The server runs an IPython kernel. The kernel executes all the Python code, runs the algorithms, has access to the data, etc.
- The client is a Web browser. It exposes an interactive Web page that combines text areas for user input (such as Python code), code outputs, and rich elements like interactive visualizations and graphical widgets. The widgets include user controls like buttons, sliders, checkboxes, dropdown lists, etc. These controls can be extended at will, and it is even possible to create entirely new controls in HTML and JavaScript.
In the client, Javascript elements can synchronize transparently with the Python kernel through a robust architecture based on the Model-View-Controller (MVC) design pattern (implemented in the backbone.js library). This architecture has been very carefully designed and implemented by the IPython developers. It can offer a rock-solid foundation for our interactive spike sorting framework.
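As a minimal illustration of this synchronization, here is a sketch using the IPython 2.x widgets API (the widget classes have been renamed in later versions):

```python
# Minimal sketch of Python/JavaScript synchronization with IPython 2.x widgets.
# The slider lives in the browser as a backbone.js model; its `value` trait
# is mirrored in the Python kernel.
from IPython.html import widgets
from IPython.display import display

slider = widgets.IntSliderWidget(min=0, max=100, value=50,
                                 description='Threshold')

def on_value_change(name, new):
    # Runs in the kernel every time the user moves the slider in the browser.
    print('New threshold: %d' % new)

slider.on_trait_change(on_value_change, 'value')
display(slider)
```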
This architecture offers many advantages compared to a monolithic Qt-based GUI application like the one we developed in our suite.
- No more Qt. Although extremely powerful, Qt is not particularly easy to use, and it is essentially a technology from the past. It is a huge dependency (hundreds of MB) that can be very hard to build and install on different operating systems. There are also low-level bugs (like segmentation faults) in the Python wrappers, which can be hard to debug and work around.
- No more threading. With a monolithic GUI implemented in a single process, threading is necessary for long-lasting computations that would otherwise block the UI. Threading is tricky to get right, and it can lead to stochastic bugs that are quite hard to debug. As the famous quote says: A programmer had a problem. He thought to himself, "I know, I'll solve it with threads!". has Now problems. two he
- Remote work becomes possible. One can work on computer A even if the data and the whole Python software are on computer B (the data can even be on a third computer C). This is because the IPython notebook is based on network protocols (implemented with ZeroMQ, Tornado, and WebSockets).
- Separation of concerns. The hard work related to low-level details (protocols, interoperability, etc.) is taken care of by other developers, in the context of popular and well-established projects (notably IPython and ZeroMQ).
- Reproducibility. The Web page that constitutes the IPython notebook can be saved in a language-agnostic file format based on JSON. This makes it possible to save all the steps performed during a spike sorting session.
- Sharing. An IPython notebook can be shared easily. It is not necessary to have Python installed to read a notebook (only to write one): a Web browser is enough. All code, text, outputs, and interactive visualizations are saved in the notebook.
- The client may run on mobile devices. Touch-friendly graphical interfaces for spike sorting on mobile devices come for free with this architecture, because the Web platform is available on many devices: computers, smartphones, tablets, etc.
- The Web browser is becoming the most popular platform for rich user applications. There are many libraries, and considerable investment in this platform by all major players in the technology industry (Apple, Google, Microsoft, etc.).
- There is ongoing work in the direction of bringing the whole Python scientific stack to the browser. Notably, Google is working on building Python (including the scientific libraries and the IPython notebook) with PNaCl. This would bring a fully standalone Python distribution to Google Chrome.
Although we could rely on existing libraries for visualization purposes (matplotlib, d3.js, etc.), we will need to develop our own solution, because the amount of data to display can be too large for standard visualization libraries. The idea is to use the power of the Graphics Processing Unit (GPU) for fast interactive visualization of big data.
We have been working on this technology for the last two years. Today, development takes place within a collaborative project named Vispy. There are currently 5 core developers and 2 Google Summer of Code students. Although still in its infancy, this library is about to reach the level of maturity required for building our framework. Most of these developments happen independently of our spike sorting framework.
One GSoC student is currently working on bringing Vispy to the IPython notebook, and proofs of concept have already been developed. This work will be one of the foundations of our framework. Meanwhile, other visualization libraries can also be used for small or medium datasets (like Bokeh or mpld3).
There are several methods to bring Vispy to the IPython notebook:
- VNC-like approach: the server does all the rendering. It generates PNG images dynamically and sends them to the browser; conversely, the browser captures user events (like mouse movements) and sends them to Python.
- WebGL-based approach: the server implements the visualization logic, but the rendering step is done entirely in the browser through WebGL. The server sends OpenGL commands and data buffers to the browser instead of bitmap images.
The first approach is suited to big datasets and light clients, whereas the second is suited to smaller datasets and more capable clients. Both approaches will be implemented in Vispy.
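As a rough illustration of the VNC-like approach, the sketch below fakes the server-side rendering with matplotlib; a real implementation would render on the GPU with Vispy and stream the frames over WebSockets:

```python
# Rough sketch of the VNC-like approach: the kernel renders each frame to a
# PNG and pushes it to the browser. Rendering is faked with matplotlib here.
import io
import numpy as np
import matplotlib
matplotlib.use('Agg')                      # headless, server-side rendering
import matplotlib.pyplot as plt
from IPython.display import Image, display

def render_frame(pan=0.):
    """Render one frame of a toy visualization and return PNG bytes."""
    t = np.linspace(0., 1., 1000)
    fig, ax = plt.subplots(figsize=(5, 2))
    ax.plot(t, np.sin(2 * np.pi * (t + pan)))
    buf = io.BytesIO()
    fig.savefig(buf, format='png')
    plt.close(fig)
    return buf.getvalue()

# Each user event (e.g. a mouse drag updating `pan`) triggers a new frame:
display(Image(data=render_frame(pan=.25)))
```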
Although we are developing our own file format, we expect our framework to be agnostic with respect to the file format. In particular, the NEO framework should be usable in this context.
The idea is to provide a very simple common interface for accessing various types of data (a sketch is given after the list below). Existing tools and libraries to consider include:
- Spyke and co
- NEO (focused on data)
- OpenElectrophy
- Stimfit
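As an illustration of this common-interface idea, here is a hypothetical sketch; all class and method names below are illustrative, not an existing API:

```python
# Hypothetical sketch of a format-agnostic data interface. Every backend
# (HDF5, NEO objects, raw binary files...) would expose the same minimal API.
import numpy as np

class BaseRecording(object):
    """Common interface shared by all file-format backends."""
    sample_rate = None   # Hz
    n_channels = None

    def traces(self, start, stop):
        """Return raw data as a (n_channels, n_samples) array."""
        raise NotImplementedError

class SyntheticRecording(BaseRecording):
    """Toy backend returning white noise, useful for tests and demos."""
    sample_rate = 20000.
    n_channels = 32

    def traces(self, start, stop):
        return np.random.randn(self.n_channels, stop - start)

# Downstream code (filtering, detection, visualization) only sees the common
# interface, whatever the underlying file format:
rec = SyntheticRecording()
chunk = rec.traces(0, 1000)
```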