[Feature] compare/replace cell matching code with scipy.optimize implementation #77
I do agree that we shouldn't re-implement stuff and should reuse as much as possible. The main reason I re-implemented it is that every existing implementation that solves this problem expects a cost matrix input of size NxM. For our example with 250k cells, this would have required a 200+GB matrix for float32. There may be a way to hack a fake numpy array that calls a function to compute the distance for each element access.

I'll be honest that while I spent some time reading the algorithm and the ideas behind it, and I think I understand it at a cursory level, I don't understand it deeply enough to properly debug it. However, given this kind of algorithm and where it comes from, I also expect it should be set-and-forget as long as no one changes any of the logic!?
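To make the memory constraint concrete, here is a rough back-of-the-envelope check, plus a hypothetical sketch of the "fake numpy array" idea (the class name, fields, and distance metric are illustrative assumptions, not code from the repository):

```python
import numpy as np

# Back-of-the-envelope: a dense 250k x 250k float32 cost matrix.
n = m = 250_000
print(n * m * np.dtype(np.float32).itemsize / 1e9)  # -> 250.0 (GB)

# Hypothetical lazy "cost matrix" that computes each distance on access
# instead of storing it. Existing solvers expect a real dense NxM array,
# so something like this would only help if they accepted array-likes.
class LazyCostMatrix:
    def __init__(self, cells_a, cells_b):
        self.cells_a = np.asarray(cells_a)  # (N, 3) cell coordinates
        self.cells_b = np.asarray(cells_b)  # (M, 3) cell coordinates
        self.shape = (len(self.cells_a), len(self.cells_b))

    def __getitem__(self, idx):
        i, j = idx
        # Euclidean distance between cell i in A and cell j in B.
        return float(np.linalg.norm(self.cells_a[i] - self.cells_b[j]))
```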
I opened an issue with scipy: scipy/scipy#20729.
This is a fair point. As long as no one changes any of the logic and
But we should question point 1. in that case. 😆
Writing down some related thoughts that might help down the line (this is probably not a priority for the core team right now)
It's a fair worry that issues with the algorithm could lead to misjudging changes in e.g. cellfinder. Perhaps we should think of the algorithm as another input to consider when looking at changes in cell counting, rather than expecting the histogram to look a certain way!?
The issue there is that you'd still need a large hard drive just for this computation, and it would probably take a while to compute and write the cost data to disk. But even then, given that the algorithm needs to constantly query the cost of different indices, having the CPU look each value up on a hard drive, even a fast SSD, would significantly reduce performance. Compared to in-memory access, I don't think it would be fast enough!?
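For illustration, a disk-backed version of that idea might look like the sketch below (the function name, chunk size, and Euclidean cost are assumptions for the example, not the project's actual code). Even if it worked, every random access the solver makes would still hit the disk, which is exactly the performance concern raised above.

```python
import numpy as np

def build_cost_memmap(cells_a, cells_b, path="cost.dat", chunk=256):
    """Write an NxM float32 cost matrix to disk with np.memmap, chunk by chunk."""
    n, m = len(cells_a), len(cells_b)
    cost = np.memmap(path, dtype=np.float32, mode="w+", shape=(n, m))
    for start in range(0, n, chunk):
        block = np.asarray(cells_a[start:start + chunk], dtype=np.float32)
        # Pairwise Euclidean distances for this chunk of rows; the chunk
        # size bounds peak memory at roughly chunk * m * 3 floats.
        diff = block[:, None, :] - np.asarray(cells_b, dtype=np.float32)[None, :, :]
        cost[start:start + chunk] = np.sqrt((diff ** 2).sum(axis=-1))
    cost.flush()
    return cost
```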
I added the zero-distance optimization you mentioned to the original PR!
I think with the improvements made to #74 after this was opened, we can close this as not planned (for now). Our cell matching is now tailored to BrainGlobe's needs (it is fast, it allows "pre-matching", and it is tested). We can revisit this if the scipy issue is addressed.
Is your feature request related to a problem? Please describe.
The core developer team is not confident we'll be able to maintain the cell matching code long-term (mainly because I don't feel like I entirely understand each line 😅!)
Describe the solution you'd like
Unless it's a lot less performant, we should consider replacing the core logic of our cell matching functionality with `scipy.optimize.linear_sum_assignment`, to reduce the maintenance burden. This is not exactly the same algorithm, but it is related and useful for the same class of problems, AFAICT. A solution like this would also fit well with the idea of BrainGlobe bringing the neuroanatomical context to the Python ecosystem, rather than re-implementing parts of it.
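For reference, here is a minimal sketch of how `scipy.optimize.linear_sum_assignment` is typically used; the toy coordinates are made up for illustration. Note that it requires the dense NxM cost matrix up front, which is the memory constraint discussed in the comments above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Toy data: two small sets of 3-D cell coordinates.
cells_a = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
cells_b = np.array([[0.1, 0.0, 0.0], [0.0, 5.2, 0.0], [9.8, 0.1, 0.0]])

# Dense NxM matrix of pairwise Euclidean distances.
cost = cdist(cells_a, cells_b)

# Optimal one-to-one assignment minimising the total matching cost.
rows, cols = linear_sum_assignment(cost)
print(list(zip(rows, cols)))        # matched index pairs
print(cost[rows, cols].sum())       # total cost of the matching
```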
Describe alternatives you've considered
\
Additional context
\