Add cell matching support #74
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main      #74      +/-   ##
==========================================
+ Coverage   90.38%   91.68%   +1.30%
==========================================
  Files          35       35
  Lines        1393     1623     +230
==========================================
+ Hits         1259     1488     +229
- Misses        134      135       +1
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Thanks for this @matham, much appreciated! I will let @alessandrofelder review this, and these questions are for him really:
Thanks a lot @matham - I had a more simplistic version of something similar in a local script (no global optimisation or guaranteed 1-1 pairing, just brute-force finding the minimum distance from each point to any point in the other set) that I could/should have shared - apologies! This is definitely very useful functionality!!!

I am just wondering whether it's worth at some point replacing the core logic of

I am satisfied the implementation matches the Wikipedia one AFAICT, and testing it on local data seems to give reasonable outputs (see below), so I'll approve once I've done some of the comparisons Adam suggested.

I think this would benefit from some negative tests (checking e.g. that expected errors are thrown for invalid input) as sanity checks, and also to keep the coverage at 90%+ (something @K-Meech has worked really hard to achieve) - I can work on this (#76).
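For illustration, a minimal sketch of the kind of negative test meant here - the exact exception type is an assumption, not necessarily what `match_cells` actually raises:

```python
import pytest

from brainglobe_utils.cells.cells import Cell, match_cells


def test_match_cells_rejects_bad_threshold():
    # assumed behaviour: a negative threshold should be rejected
    cells = [Cell((0, 0, 0), Cell.UNKNOWN)]
    with pytest.raises(ValueError):
        match_cells(cells, cells, threshold=-1)
```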
At today's developer meeting we decided that
I'll track this in brainglobe/brainglobe.github.io#187
I've started doing some comparisons today, but will finish them tomorrow with a clearer mind :)
I've compared cell candidate detection results before (…). I ran on cells with z coordinate in slices 1000-1050 (somewhere in the middle) of a cFOS-stained brain. The result:

The distribution of good matches with distance > 0 is:

So far, so good. But then I ran the script below:

```python
from pathlib import Path

from brainglobe_utils.cells.cells import match_cells
from brainglobe_utils.IO.cells import get_cells, save_cells

if __name__ == "__main__":
    pre_optimisation_cells = get_cells(str(Path("/media/ceph-neuroinformatics/.../cells.xml")))
    post_optimisation_cells = get_cells(str(Path("/media/ceph-neuroinformatics/.../other_cells.xml")))

    # only consider cells in slices 1000 - 1050
    z_min, z_max = 1000, 1050
    old_cells = [cell for cell in pre_optimisation_cells if cell.z > z_min and cell.z < z_max]
    new_cells = [cell for cell in post_optimisation_cells if cell.z > z_min and cell.z < z_max]
    print(f"There are {len(old_cells)} old cells")
    print(f"There are {len(new_cells)} new cells")

    # minimize total euclidean distance for a 1-1 pairing between old and new.
    # pairings with distance > 5 are considered "bad"
    good_threshold = 5
    no_match_in_new, matching, no_match_in_old = match_cells(old_cells, new_cells, threshold=good_threshold)
    print(f"There are {len(no_match_in_new)} old cells without a good match in new cells")
    print(f"There are {len(no_match_in_old)} new cells without a good match in old cells")

    old_cells_with_no_match_in_new = [old_cells[i] for i in no_match_in_new]
    new_cells_with_no_match_in_old = [new_cells[i] for i in no_match_in_old]
    save_cells(new_cells_with_no_match_in_old, str(Path.home() / "no_match_in_old_2.xml"))
    save_cells(old_cells_with_no_match_in_new, str(Path.home() / "no_match_in_new_2.xml"))
```

to visualise "bad" matches. I've probably done something extremely silly in the script, or misunderstood how this should work, but I'm not comfortable merging this until I know what's going on 🤔 - pointers/thoughts about how to interpret this appreciated!
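For reference, the "distribution of good matches with distance > 0" mentioned above can be computed from the returned `matching` roughly like this, reusing the variables from the script above (a sketch only; the matplotlib plotting is my addition, not part of the PR):

```python
import numpy as np
import matplotlib.pyplot as plt

from brainglobe_utils.cells.cells import to_numpy_pos

# positions of the matched pairs, in the same order as `matching`
old_pos = to_numpy_pos([old_cells[i] for i, _ in matching])
new_pos = to_numpy_pos([new_cells[j] for _, j in matching])

# Euclidean distance of each matched pair
distances = np.linalg.norm(old_pos - new_pos, axis=1)

plt.hist(distances[distances > 0], bins=50)
plt.xlabel("distance between matched cells")
plt.show()
```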
Your code seems correct. Can you share the cell XMLs so I can look at them? Perhaps there's something wrong with the threshold computation. But I also wonder if the axes were somehow flipped during visualization.
Also, here's the scipy code for this - it's quite similar.
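For context, the scipy routine referred to here is presumably `scipy.optimize.linear_sum_assignment`; a minimal sketch of how it solves the same assignment problem on a pairwise distance matrix (coordinates are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# (N, 3) and (M, 3) arrays of cell coordinates
pos1 = np.array([[0, 0, 0], [12, 0, 0]], dtype=float)
pos2 = np.array([[10, 0, 0], [22, 0, 0]], dtype=float)

cost = cdist(pos1, pos2)  # pairwise Euclidean distances
rows, cols = linear_sum_assignment(cost)  # global-minimum 1-1 pairing
total_cost = cost[rows, cols].sum()
print(list(zip(rows, cols)), total_cost)
```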
Slight adaptation of the script above (avoiding the cell filtering by z coordinate):

```python
from pathlib import Path

import napari

from brainglobe_utils.cells.cells import match_cells, to_numpy_pos
from brainglobe_utils.IO.cells import get_cells, save_cells

if __name__ == "__main__":
    old_cells = get_cells(str(Path.home() / "cells-z-1000-1050.xml"))
    new_cells = get_cells(str(Path.home() / "other-cells-z-1000-1050.xml"))

    good_threshold = 5
    no_match_in_new, matching, no_match_in_old = match_cells(old_cells, new_cells, threshold=good_threshold)
    print(f"There are {len(no_match_in_new)} old cells without a good match in new cells")
    print(f"There are {len(no_match_in_old)} new cells without a good match in old cells")

    old_cells_with_no_match_in_new = [old_cells[i] for i in no_match_in_new]
    new_cells_with_no_match_in_old = [new_cells[i] for i in no_match_in_old]
    save_cells(new_cells_with_no_match_in_old, str(Path.home() / "no_match_in_old-z-1000-1050.xml"))
    save_cells(old_cells_with_no_match_in_new, str(Path.home() / "no_match_in_new-z-1000-1050.xml"))

    viewer = napari.Viewer()
    viewer.add_points(
        to_numpy_pos(new_cells_with_no_match_in_old),
        edge_color="yellow",
        face_color="yellow",
        symbol="ring",
        opacity=0.5,
        name="no match in old",
    )
    viewer.add_points(
        to_numpy_pos(old_cells_with_no_match_in_new),
        edge_color="red",
        face_color="red",
        symbol="ring",
        opacity=0.5,
        name="no match in new",
    )
    napari.run()
```

Should be reproducible with the attached files saved to the home directory.
I think I figured out the issue, and it has to do with us finding the global minimum and then removing those above the threshold. This results in the following problem:

Say you have 4 points on a line. Set 1 is at x = 0 and x = 12; set 2 is at x = 10 and x = 22. The globally optimal pairing is 0↔10 and 12↔22 (total distance 20, versus 24 for the alternative), but both of those pairs have distance 10, so with a threshold of 5 both get discarded and every point ends up "unmatched" - even though the points at 12 and 10 are only 2 apart.

This is exactly the issue you noticed: pairs of cells that are right on top of each other, but not matched to each other, are each matched with other cells to minimize the global cost. Removing the pairs above the threshold then leaves the "unmatched" sets containing cells that are right on top of each other.

This can be tested with the following code:

```python
import numpy as np

from brainglobe_utils.cells.cells import Cell, match_cells

cells1 = [
    Cell((0, 0, 0), Cell.UNKNOWN),
    Cell((12, 0, 0), Cell.UNKNOWN),
]
cells2 = [
    Cell((10, 0, 0), Cell.UNKNOWN),
    Cell((22, 0, 0), Cell.UNKNOWN),
]

missing_c1, good_matches, missing_c2 = match_cells(cells1, cells2, threshold=np.inf)
assert not missing_c1
assert not missing_c2
assert good_matches == [[0, 0], [1, 1]]

missing_c1, good_matches, missing_c2 = match_cells(cells1, cells2, threshold=5)
# before the fixes the following asserts were correct
# assert missing_c1 == [0, 1]
# assert missing_c2 == [0, 1]
# assert not good_matches
# with the final fixes this is correct
assert missing_c1 == [0, ]
assert missing_c2 == [1, ]
assert good_matches == [[1, 0]]
```

I changed it to account for the threshold during the matching, so this issue doesn't occur. I haven't yet tested it on my whole brain, to be sure, but will do soon.
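For illustration, one standard way to make the assignment itself threshold-aware is to pad the cost matrix with per-point "no match" slots priced at the threshold, so the solver only pairs two cells if that is cheaper than leaving both unmatched. This is a sketch of the general idea using scipy, not necessarily how this PR implements the fix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def match_with_threshold(pos1, pos2, threshold):
    """Assign pos1 to pos2, allowing "no match" at a cost of `threshold`."""
    n, m = len(pos1), len(pos2)
    big = 1e9  # effectively forbids a dummy slot pairing with the wrong point
    cost = cdist(pos1, pos2)
    padded = np.full((n + m, m + n), big)
    padded[:n, :m] = cost
    padded[:n, m:] = np.where(np.eye(n), threshold, big)    # pos1[i] unmatched
    padded[n:, :m] = np.where(np.eye(m), threshold, big).T  # pos2[j] unmatched
    padded[n:, m:] = 0.0  # dummy-to-dummy pairings are free
    rows, cols = linear_sum_assignment(padded)
    return [(i, j) for i, j in zip(rows, cols) if i < n and j < m]


pos1 = np.array([[0.0, 0, 0], [12, 0, 0]])
pos2 = np.array([[10.0, 0, 0], [22, 0, 0]])
print(match_with_threshold(pos1, pos2, threshold=5))   # [(1, 0)]
print(match_with_threshold(pos1, pos2, threshold=50))  # [(0, 0), (1, 1)]
```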
Also, I added tests so there's full coverage. But it needs to be run with
I'd naively suggest implementing the same solution we have in cellfinder here, i.e. running both.
I think I follow your reasoning and agree with it, but I can still reproduce the above problem (lots of orange points) locally 🤔, so I am confused. I get a much higher number of "bad" matches this time though (~40k instead of 6k), FWIW. Let me know how your "real-life" testing goes :)
So after more looking, there are multiple things going on, some of which are not bugs, but one of which was a bug...

There was a bug where a variable was not properly reset. This was fixed in the first commit after your last comment. When translating

Sadly, with this fix the algorithm became a lot slower, in particular when using a small threshold (see below for some reasons behind this), which is quite in line with it being O(N³). So I also added an optional initial step (ON by default) that extracts all zero-distance pairs before running the optimization step (a rough sketch of this idea follows at the end of this comment). And after that I also added the option to use scipy as the optimization backend. This is only practical when we do the zero-distance extraction, because scipy cannot handle the full dataset due to memory requirements. However, with scipy we cannot use the threshold during the matching, because scipy doesn't take it as an option, so we can only apply it afterwards.

With that, I compared the matching with and without the threshold, and with and without scipy. I found that without the threshold the total cost of the matched pairs was the same - which is good, as that is what we are minimizing - although the matched pairs were not identical, which makes sense. These images, on your two cell sets, show the result:

Besides all this, there are complexities with this algorithm that may result in some of what you saw. This algorithm is perhaps not precisely what we want: it minimizes the total cost summed over all pairs, not the distance of any individual pair. Some examples of how this may appear in the algorithm:

For comparison, here's how the previous histogram looks if we don't take the sqrt.

So, I'm not sure what you think about the code now. But my sense is that, after first extracting all the zero-distance pairs, for whatever remains we care the most about the overall histogram, because that gives us a sense of how bad the changes in e.g. cellfinder were. And while we would prefer a local optimization algorithm (one that e.g. always prioritizes zero-distance pairs) rather than a global one, I don't know of such an algorithm. So perhaps the global optimization algorithm is good enough?

The following are points to consider about the API:
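Referring back to the zero-distance extraction step described above, a rough illustrative sketch of the idea - the function and its details are mine, not the PR's actual code:

```python
def split_exact_pairs(pos1, pos2):
    """Pull out identical coordinates first, so the O(n^3) optimisation
    only has to run on the (much smaller) remainder.

    pos1, pos2: sequences of (x, y, z) tuples.
    """
    # map each coordinate in pos2 to the indices at which it occurs
    lookup = {}
    for j, p in enumerate(pos2):
        lookup.setdefault(tuple(p), []).append(j)

    exact_pairs = []
    used2 = set()
    remaining1 = []
    for i, p in enumerate(pos1):
        candidates = lookup.get(tuple(p), [])
        j = next((j for j in candidates if j not in used2), None)
        if j is None:
            remaining1.append(i)
        else:
            exact_pairs.append((i, j))
            used2.add(j)
    remaining2 = [j for j in range(len(pos2)) if j not in used2]
    return exact_pairs, remaining1, remaining2


pairs, rest1, rest2 = split_exact_pairs(
    [(0, 0, 0), (5, 5, 5)], [(5, 5, 5), (9, 9, 9)]
)
# pairs == [(1, 0)], rest1 == [0], rest2 == [1]
```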
Thanks for your extensive and very clear explanations!
> So, I'm not sure what you think about the code now. But my sense is that, after first extracting all the zero-distance pairs, for whatever remains we care the most about the overall histogram, because that gives us a sense of how bad the changes in e.g. cellfinder were. And while we would prefer a local optimization algorithm (one that e.g. always prioritizes zero-distance pairs) rather than a global one, I don't know of such an algorithm. So perhaps the global optimization algorithm is good enough?
Absolutely, I agree with this.
I can reproduce your histograms locally and am now satisfied this works as expected 🎉
My suggestions for the API:

I think it's a bit confusing that the same argument `threshold` works differently when we use scipy versus when we don't. So maybe it should be different API calls altogether, something like

`match_cells(cells, other, threshold=inf, prematch=True)  # default to what we expect is the most common use case`

probably using the sqrt to keep global optimisation tight, and

`match_cells_scipy(cells, other, threshold=inf)  # always pre-match`

I am wondering whether we need to maintain our scipy version here, so maybe I'd suggest getting rid of scipy altogether (given that we can't have it as a complete replacement of `match_cells` due to our memory requirements, at least for now?). So I think my opinion would be to

- remove the option to use scipy
- always use sqrt

in the interest of simplicity.
If we wanted to go the extra mile, we could write a regression test with the data here that compares our cost with the cost reported by the scipy implementation (see the sketch after this comment). We could do this in a separate issue and PR, or add it on here. @matham, let me know:

- whether you'd like to work on this or let the core team try :)
- whether you'd like to unconvince me of the (not very strongly held) opinion above.

Thanks again for this very useful contribution!
I will merge this once I know how you would like to proceed.
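A possible shape for that regression test - a sketch only: the way the data is loaded and the total cost is recomputed are illustrative assumptions, not the final test (the real test would fetch the XML files from GIN):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

from brainglobe_utils.cells.cells import match_cells, to_numpy_pos
from brainglobe_utils.IO.cells import get_cells


def test_cost_matches_scipy():
    old_cells = get_cells("cells-z-1000-1050.xml")
    new_cells = get_cells("other-cells-z-1000-1050.xml")

    # our implementation, no threshold so as many cells as possible are matched
    _, matching, _ = match_cells(old_cells, new_cells, threshold=np.inf)
    old_pos = to_numpy_pos(old_cells)
    new_pos = to_numpy_pos(new_cells)
    our_cost = sum(
        np.linalg.norm(old_pos[i] - new_pos[j]) for i, j in matching
    )

    # scipy's reference solution on the same distance matrix
    cost_matrix = cdist(old_pos, new_pos)
    rows, cols = linear_sum_assignment(cost_matrix)
    scipy_cost = cost_matrix[rows, cols].sum()

    assert np.isclose(our_cost, scipy_cost)
```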
Thanks for the review! I'm with you on the API and I can make the changes (remove scipy, keep sqrt).

About adding the cells you shared as a test to compare scipy to us: each xml file is 12MB. Is that (not) too big to add to the repo? Everyone cloning will have to download it. I was kinda running into this issue on the pytorch branch, because I added images to make sure the 2d/3d filtering is similar, which easily added a few MBs. And I also have 3 25MB images from a very bright brain that are super helpful for regression testing the full process and helped me figure out where there wasn't parity. But that definitely seemed too big to include in the repo. I'm not sure if you have encountered/considered this before!?
Co-authored-by: Alessandro Felder <[email protected]>
Good point. I will update brainglobe/cellfinder#282 with our long-term plan for this (and other BrainGlobe repositories). Short version: we already have too much data in this repo that should get moved to the GIN platform and downloaded with pooch.

I'll put the XML files on GIN now, and happy to do the same with your images if you could temporarily share them via Dropbox/OneDrive/Google Drive/...?
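For reference, fetching a test fixture with pooch looks roughly like this - the URL is a placeholder, not the real GIN registry entry:

```python
import pooch

# placeholder URL - the real value would point at the GIN repository
cells_xml = pooch.retrieve(
    url="https://gin.g-node.org/BrainGlobe/test-data/raw/master/cells.xml",
    known_hash=None,  # skip hash checking for this sketch
    progressbar=True,
)
print(cells_xml)  # local path to the cached download
```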
I've made a sketch PR to your fork @matham for the regression test on the XML files (which are now on GIN), with how I imagine this working - hope it helps!
Thank you!! 🎉
I can add the test itself too, if you like?
Add test fixtures for cell matching regression test
Ok, removed scipy as an option and filled in the scipy test. Thanks for the pooch PR, I've never used it so that was helpful!
Adding tests that pass is the most fun part 😁
```yaml
# hash on conftest in case url changes
key: ${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('**/conftest.py') }}
```
Suggested change:

```yaml
# hash on conftest in case url changes, or we add files to the pooch registry
key: ${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('**/conftest.py') }}
```
very cool idea to do this 😍
matham#2 may help add more of the tests discussed above. The data you shared with us is on GIN now :)
Description
What is this PR
While working on the pytorch changes, I wanted a way of comparing how my changes affect cell detection, because the filtering in pytorch is potentially slightly different. So I needed a way to compare two sets of cells by distance.
This PR adds support for this. It uses the Hungarian method to compute the best pairing between two groups of cells such that the total Euclidean distance across all pairs is minimized. I really hope someone hasn't already implemented this in the repos - I searched but didn't find anything 😅
The algorithm has time complexity O(n³), which is seemingly the fastest known for this problem. I adapted the code directly from Wikipedia.
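In practice the new helper is used like this - a minimal sketch based on the usage shown earlier in the thread; the coordinates and the expected output in the comment are illustrative:

```python
from brainglobe_utils.cells.cells import Cell, match_cells

cells_a = [Cell((0, 0, 0), Cell.UNKNOWN), Cell((12, 0, 0), Cell.UNKNOWN)]
cells_b = [Cell((1, 0, 0), Cell.UNKNOWN), Cell((50, 0, 0), Cell.UNKNOWN)]

# indices of cells_a with no good match, the matched (i, j) pairs,
# and indices of cells_b with no good match
unmatched_a, pairs, unmatched_b = match_cells(cells_a, cells_b, threshold=5)
print(unmatched_a, pairs, unmatched_b)  # expected: [1] [[0, 0]] [1]
```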
References
None
How has this PR been tested?
I tested it by matching two sets of 250k cells, one from the main branch and one from my working changes, and it took about 1.5 hours.
I also added tests.
Is this a breaking change?
No.
Does this PR require an update to the documentation?
No.
Checklist:
For the actual example, to see how it works I compared my changes to `main`. The matching was able to pair 245952 cells, with 8426 left unpaired due to the unequal number of cells found between the two sets. Looking at the histogram of the distance between all the pairs we get this image, which is pretty good IMO:

I also visualized the histogram in x, y, and z of the 8426 unmatched cells to see if something stood out, but nothing did: