Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement, test py::object as taxon type #47

Merged
merged 82 commits into from
Dec 5, 2023
Merged

Implement, test py::object as taxon type #47

merged 82 commits into from
Dec 5, 2023

Conversation

mmore500
Copy link
Collaborator

This will allow the user to use arbitrary types as their taxa, including numpy stuff, instead of just strings as previously implemented.

Probably could get a speed boost by specializing for string, float, int,
etc. instead of using py::object but this works!
Because <format> isn't available on compilers in CI
TODO: Empirical should support quote escaping csv entries containing
"," so that we don't have to url encode taxon info representations
containg ","
@codecov-commenter
Copy link

codecov-commenter commented Nov 23, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (e96c8a4) 100.00% compared to head (6c9fcb7) 100.00%.

Additional details and impacted files
@@            Coverage Diff            @@
##              main       #47   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            1         1           
  Lines            5         5           
=========================================
  Hits             5         5           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mmore500
Copy link
Collaborator Author

mmore500 commented Nov 23, 2023

It looks like Empirical CSV engine doesn't support quote-escaping csv content containing "," at the moment? Is this on the roadmap? (Quote escapes are part of the CSV standard https://datatracker.ietf.org/doc/html/rfc4180)

If we add quote escapes to Empirical csv engine, we can have a much nicer serialized representation of containers (lists, arrays, etc)

@mmore500
Copy link
Collaborator Author

Ready for review/merge

@emilydolson
Copy link
Owner

This is mostly looking good, but things are messed up when the actual taxon or taxon info is a string (super common use case that definitely needs to be supported before we can merge this). I added a fix to support loading from file when the info type is a string, but there's still an issue when the taxon calculation function returns a string (demonstrated in #57).

@emilydolson
Copy link
Owner

Oops, no I was making a dumb mistake in my tests. With my fix to de-serialization, everything seems fine when info is a string. The only remaining problem is that we still need to override py::object equals if we don't want the super unintuitve behavior where equality is based on pointer equality rather than info equality.

@emilydolson
Copy link
Owner

New equals operator is working except for numpy arrays, because apparently == for numpy arrays returns an array of Trues and Falses, not a single True/False

@mmore500
Copy link
Collaborator Author

mmore500 commented Dec 4, 2023

Oops, no I was making a dumb mistake in my tests. With my fix to de-serialization, everything seems fine when info is a string. The only remaining problem is that we still need to override py::object equals if we don't want the super unintuitve behavior where equality is based on pointer equality rather than info equality.

Agreed here. Wonder if we could do a try catch with numpy.array_equal as the first choice override for py::object == (very nicely, this works as regular == for non-numpy objects) with fallback to use plain builtins.__eq__.

@emilydolson
Copy link
Owner

That sounds like a pretty expensive thing to do every time we check equality. I'm experimenting with an alternative where we cache an == operator when a python object is constructed, but it's not going great and honestly still feels like a lot of overhead to support what I'm not convinced is a common use case. Is there reason to think using raw NumPy arrays as taxon info is likely to happen a lot? (e.g. is this important for your long-term parallelization plans?)

In general, I don't want us to be responsible for adding lots of additional checks to support objects with unconventional == operators. I would generally think that if a user wants to use an object like that as taxon info, we should let them be responsible for wrapping it in a class with an == operator that returns a bool. But if you think there's reason to make an exception for NumPy, I can keep trying

@emilydolson
Copy link
Owner

Okay, I did get the constructor thing working (in #57). It requires reaching into python in the constructor, grabbing the correct equals operator, and then storing that in the py::object. Two slight issues:

  1. Bad things happen if you mix taxon_info types that have incompatible equals operators (pretty sure that's always going to be true, but it throws a less obvious error than would be ideal)
  2. We're storing a ton of copies of the same equals operator. Ideally, we should just have one (maybe a static variable?) that gets stored once on construction of a Systematics manager.

That said, these are both sufficiently minor that I think I'd be okay with merging at this point (probably adding these as issues)

@emilydolson
Copy link
Owner

Oh shoot, looks like that breaks if numpy isn't installed. I'll go back to my original question of how important numpy support is.

@emilydolson
Copy link
Owner

Okay I fixed it for realsies now. Going to go ahead and merge it into this branch and merge this branch into master.

@emilydolson emilydolson merged commit 66989d4 into main Dec 5, 2023
49 checks passed
@emilydolson emilydolson deleted the taxontype branch December 5, 2023 02:00
@mmore500
Copy link
Collaborator Author

mmore500 commented Dec 5, 2023

Beautiful!

think we could get rid of memory overhead you mention by using a static variable initialized using an immediately invoked lambda that does the try catch block you have. Agree that existing solution 100% works fine for the moment.

@emilydolson
Copy link
Owner

Potentially? I tried something like that really quickly but there's something messy about having py::objects as static variables (they don't get destructed at the right time relative to the Python infrastructure and end up segfaulting?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants