
[Feature] compare/replace cell matching code with scipy.optimize implementation #77

Closed
alessandrofelder opened this issue May 16, 2024 · 5 comments
Labels: enhancement (New feature or request)

@alessandrofelder (Member)

Is your feature request related to a problem? Please describe.
The core developer team is not confident we'll be able to maintain the cell matching code long-term (mainly because I don't feel like I entirely understand each line 😅 !)

Describe the solution you'd like
Unless it is much less performant, we should consider replacing the core logic of our cell matching functionality with scipy.optimize.linear_sum_assignment, to reduce the maintenance burden. It is not exactly the same algorithm, but it is related and useful for the same class of problems, AFAICT.
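For context, a minimal sketch of what this could look like (hypothetical array names and sizes; this is not our existing code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Hypothetical (z, y, x) cell positions; sizes are made up for illustration.
cells = np.random.rand(100, 3) * 1000
other_cells = np.random.rand(120, 3) * 1000

# Dense N x M Euclidean distance matrix between the two sets.
cost = cdist(cells, other_cells)

# Optimal one-to-one assignment minimising the total summed distance.
row_ind, col_ind = linear_sum_assignment(cost)
total_distance = cost[row_ind, col_ind].sum()
```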

A solution like this would also fit well with the idea of BrainGlobe bringing the neuroanatomical context to the Python ecosystem, rather than re-implementing parts of it.

Describe alternatives you've considered

Additional context

alessandrofelder added the enhancement label May 16, 2024
@matham (Contributor) commented May 16, 2024

I do agree that we shouldn't re-implement things and should reuse as much as possible.

The main reason I re-implemented it is that every existing implementation that solves this problem expects a cost matrix input of size NxM. For our example with 250k cells, this would have required a 200+GB matrix for float32.
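As a rough back-of-envelope check of that figure (assuming roughly 250k cells in each set):

```python
# Size of a dense float32 cost matrix for ~250k cells per set.
n = m = 250_000
bytes_needed = n * m * 4          # 4 bytes per float32 entry
print(bytes_needed / 1e9)         # 250.0, i.e. roughly 250 GB
```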

There may be a way to hack a fake numpy array that calls a function that computes the distance for each call to cost[i, j], but I can't imagine that will be very performant. Perhaps we can try to get scipy to accept a ufunc instead of requiring a cost matrix.
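To illustrate the "fake array" idea (purely hypothetical; note that scipy.optimize.linear_sum_assignment currently converts its input to a dense array internally, so something like this alone would not avoid materialising the full matrix):

```python
import numpy as np

class LazyCost:
    """Hypothetical on-the-fly cost lookup: computes each distance per query
    instead of storing a dense NxM matrix."""

    def __init__(self, cells, other_cells):
        self.cells = np.asarray(cells, dtype=np.float32)
        self.other_cells = np.asarray(other_cells, dtype=np.float32)
        self.shape = (len(self.cells), len(self.other_cells))

    def __getitem__(self, idx):
        i, j = idx
        # Euclidean distance between cell i and other_cell j, computed on demand.
        return float(np.linalg.norm(self.cells[i] - self.other_cells[j]))
```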

I'll be honest: while I spent some time reading the algorithm and the ideas behind it, and I think I understand it at a cursory level, I don't understand it deeply enough to properly debug it. However, given what this kind of algorithm is and where it comes from, I also expect it to be set-and-forget as long as no one changes any of the logic!?

@matham (Contributor) commented May 16, 2024

I opened an issue with scipy: scipy/scipy#20729.

@alessandrofelder (Member Author)

However, given what this kind of algorithm is and where it comes from, I also expect it to be set-and-forget as long as no one changes any of the logic!?

This is a fair point, as long as no one changes any of the logic and numba doesn't evolve too weirdly; and even that should be fine. The hypothetical scenario that worries me is:

  1. we make some changes to the cell detection code that we are sure (!) will not change the cell matching
  2. the cell matching reports changes
  3. we start suspecting a bug in the cell matching

But we should question point 1. in that case. 😆

For our example with 250k cells, this would have required a 200+GB matrix for float32.

Writing down some related thoughts that might help down the line (this is probably not a priority for the core team right now):

  • Not an expert, but I'm wondering whether a Dask array would help us avoid keeping a huge matrix in memory here?
  • (optimisation) Could we exclude cells that have an exact match (distance 0) by putting one set of cells in a set and checking whether each element of other_cells is in it, and then apply the Hungarian method only to the remaining, non-zero-distance cells? (See the sketch after this list.)
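A rough sketch of that pre-matching idea (hypothetical helper; assumes coordinates are exact, e.g. integer voxel positions, so equality is well defined):

```python
import numpy as np

def split_exact_matches(cells, other_cells):
    """Pair up cells with identical coordinates first; return the exact matches
    and the two sets of leftovers to pass to the (more expensive) solver."""
    remaining_other = {tuple(c): j for j, c in enumerate(other_cells)}
    exact, unmatched_cells = [], []
    for i, c in enumerate(cells):
        j = remaining_other.pop(tuple(c), None)
        if j is None:
            unmatched_cells.append(i)
        else:
            exact.append((i, j))       # distance-0 pair, no optimisation needed
    unmatched_other = sorted(remaining_other.values())
    return exact, unmatched_cells, unmatched_other
```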

@matham (Contributor) commented May 23, 2024

It's a fair worry that issues with the algorithm would lead to misjudging changes in e.g. cellfinder. Perhaps we should think of the algorithm's output as one more input to consider when looking at changes in cell counting, rather than insisting that the histogram must look a certain way!?

dask array

The issue there is that you'd still need a large hard drive just for this computation, and it would probably take a while to compute and write the cost data to disk. But even then, given that the algorithm constantly queries the cost at different indices, having the CPU look values up on a hard drive, even a fast SSD, would significantly reduce performance compared to in-memory access. I don't think it would be fast enough!?

Optimization

I added the zero-distance optimization you mentioned to the original PR!

@alessandrofelder (Member Author)

I think with the improvements made to #74 after this was opened, we can close this as not planned (for now). Our cell matching is now tailored to BrainGlobe's needs (it is fast, it allows "pre-matching", and it is tested). We can revisit this if the scipy issue is addressed.

alessandrofelder closed this as not planned Jun 6, 2024