
Any thoughts on supporting something better than pickling? #45

Open
nima opened this issue Feb 15, 2021 · 11 comments

@nima

nima commented Feb 15, 2021

Pickling is about as slow as it can get; some benchmarks here.

Any future plans for adding support for something faster, like msgpack, or even json?

I realize that some choices would require cachier-wrapped functions to restrict the data they return. But given how slow pickle is, and that the point of caching is at least in part to speed things up (and especially not to slow them down), I think this would be a good feature.

Any thoughts on this?

@shaypal5
Collaborator

shaypal5 commented Feb 15, 2021

No thoughts at the moment.

I'd love to hear suggestions, and later on contributions. :)

I think any switch should be partial, and not restrict the data returned.
Meaning: if the types in play are X, Y, and Z, use msgpack; otherwise, pickle; etc.
So optimize where you can.
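The "optimize where you can" idea might look something like this sketch (a hypothetical dispatcher, not anything in cachier; json stands in for msgpack so it runs on the stdlib alone):

```python
import json
import pickle

# Hypothetical dispatcher: use the faster format for types it supports,
# fall back to pickle for everything else. json stands in for msgpack.
# The fast path is limited to scalars because json silently coerces
# non-string dict keys, which would corrupt round-trips for containers.
FAST_TYPES = (str, int, float, bool, type(None))

def dumps(obj):
    if isinstance(obj, FAST_TYPES):
        return b"j" + json.dumps(obj).encode()  # tag the format used
    return b"p" + pickle.dumps(obj)  # pickle path for everything else

def loads(data):
    tag, payload = data[:1], data[1:]
    return json.loads(payload) if tag == b"j" else pickle.loads(payload)
```

The one-byte format tag keeps the two paths unambiguous on read, so a cache written with mixed formats still decodes correctly.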

Recall that the main use case of this package is to save very long calls (minutes and upwards).
It'll be cool to optimize stuff, but not at the expense of usability.

@blakeNaccarato

I know this is an old issue, but my request is somewhat similar, regarding alternative backend support; I'll open a separate issue if you recommend it! I use https://github.com/uqfoundation/dill to pickle a wider range of objects, and I was hoping cachier could enable swapping out the pickle module for alternative implementations (which expose the same loads/dumps API surface).

So I went ahead and hacked it in, literally adding a dill requirement to setup.py and replacing every `import pickle` with `import dill as pickle`, and it happened to work for my personal use case! See https://github.com/blakeNaccarato/cachier/tree/dill for the hacked-in dill support.

This is a very quick hack, not the intended final feature implementation, and more importantly I haven't run any tests whatsoever, except for the one place I'm using the caching right now: a SimpleNamespace housing the namespace from a ploomber-engine Jupyter notebook.
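The module-swap hack described above is essentially this; the try/except lets the snippet run with or without dill installed, since dill deliberately mirrors pickle's loads/dumps surface:

```python
# Sketch of the hack: dill is a drop-in for pickle, so aliasing the module
# is all the fork changes. Falls back to stdlib pickle if dill is absent.
try:
    import dill as pickle  # handles lambdas, closures, and more
except ImportError:
    import pickle  # same loads/dumps API, narrower type support

blob = pickle.dumps({"answer": 42})
assert pickle.loads(blob) == {"answer": 42}
```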

@lordjabez
Contributor

lordjabez commented Nov 11, 2023

If there were a better back-end in place, I know I'd benefit from my typical use case: caching API calls for cost and performance savings (e.g. to OpenAI, the way I did in this gist).

@blakeNaccarato

blakeNaccarato commented Nov 13, 2023

I'm interested in helping to implement this, while heeding these recent words of wisdom from a Parsl maintainer.

Two new questions I'm [...] asking when people want to contribute big components. Who is going to maintain this component in a year's time? What does your PR break for other [app] users who aren't using your new component and don't want to? They're both questions that try to address this big leap from "runs in my fork in my environment" to "released into software other people already use".

So my dill hack definitely works in my fork, but I think the level of maintenance available to the project will determine the total new feature surface area added here to support dill, or other arbitrary backends.

I think an extension interface makes sense, where the user code can supply functions which conform to a certain protocol or shape, and then cachier can invoke those functions when building the cache file.

The first idea that comes to mind is allowing custom loads and dumps methods to be supplied by the user, for backends which conform to serialization/pickle-like protocols. This wouldn't account for all conceivable extensions to the kinds of serialization desired by users, but an overwrought extension interface would probably be too much maintenance burden, and maybe the YAGNI principle applies here.

Also maybe interfacing at the loads/dumps level isn't even the right "layer" for such an extension interface, so I'm open to suggestions.

So for the "who's going to maintain this in a year's time" question, I think I could manage to maintain it if it's a simple loads/dumps interface, only as long as the specification of that interface is specific enough to carve out a small feature set. I wouldn't be comfortable implementing this as an async-capable interface, for instance.

And for the "what does it break" for those who "aren't using" this or "don't want to" question: I think this can be implemented such that pickling remains the default, and custom serialization only takes effect when additional serialization primitives are passed to the @cachier decorator (or, alternatively, set as module options?), so default usage is not broken.
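A minimal sketch of that non-breaking interface, using a toy stand-in for the cachier decorator (the names and in-memory cache are placeholders, not cachier's real API):

```python
import functools
import pickle

# Toy stand-in for the proposed interface: loads/dumps default to pickle,
# so existing users see no change; custom callables only take effect when
# explicitly passed. Not cachier's real decorator.
def cachier(loads=pickle.loads, dumps=pickle.dumps):
    def deco(func):
        cache = {}  # stand-in for the on-disk cache file

        @functools.wraps(func)
        def wrapper(*args):
            if args not in cache:
                cache[args] = dumps(func(*args))  # user-supplied dumps
            return loads(cache[args])  # user-supplied loads
        return wrapper
    return deco

@cachier()  # default: plain pickle, nothing breaks
def square(x):
    return x * x
```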


Also, while on the topic of extension: I think the hash-function extension concept could be neatly extended by designating "handlers" for certain object types. Instead of overriding the entire hash method, users could supply handlers that take an object and return a Hashable, which would be fed into a built-in pre-hasher stage that applies all registered handlers. Sorry, this is a different concept, but I encountered it at the same time as the caching-backend concept and wanted to mention it. It would probably be a different issue and feature effort altogether.
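The per-type handler idea could be sketched roughly like this (all names are hypothetical; the pre-hasher normalizes arguments via registered handlers before hashing):

```python
import hashlib
import pickle

# Hypothetical per-type "hash handler" registry: users register one
# handler per type, and a built-in pre-hasher applies them before the
# hash, instead of replacing the whole hash function.
HANDLERS = {}

def handler(typ):
    def register(fn):
        HANDLERS[typ] = fn
        return fn
    return register

@handler(set)
def _hash_set(obj):
    return tuple(sorted(obj))  # canonical hashable stand-in for a set

def pre_hash(args):
    # Apply a registered handler where one exists, pass through otherwise.
    normalized = tuple(HANDLERS.get(type(a), lambda x: x)(a) for a in args)
    return hashlib.md5(pickle.dumps(normalized)).hexdigest()
```

With this, two calls passing equal-but-differently-ordered sets hash to the same cache key.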

@lordjabez
Contributor

I like the loads/dumps idea! As long as default behavior isn't affected, and if you wanted to take a crack at an implementation, I don't see why we couldn't incorporate it.

@shaypal5
Collaborator

I like the loads/dumps idea as well! I suggest you give this a go in a thoroughly documented and tested PR, and we'll see how it goes. I'm opening a new issue regarding your idea for per-type hashers.

@blakeNaccarato

blakeNaccarato commented Nov 27, 2023

Alright, I think I can start formulating this. I'll start off in a draft PR this coming weekend, and after this reply I'll confine further discussion to the eventual draft PR. I have questions about the implementation, but it's probably easier to ask the user-facing questions first.

User-facing API

It seems sensible that a new decorator keyword argument should receive the custom serialization/deserialization logic from the user, which will be used in caching. I think the new argument should receive custom load-like and dump-like Callables from the user, maybe a loader/dumper named tuple for ergonomics?
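One possible shape for that argument, sketched with placeholder names (cachier's actual API may differ):

```python
import pickle
from typing import Callable, NamedTuple

# Placeholder shape for the proposed keyword: a named tuple bundling the
# coupled loader and dumper. Class and field names are illustrative only.
class Serializer(NamedTuple):
    loads: Callable[[bytes], object]
    dumps: Callable[[object], bytes]

DEFAULT = Serializer(loads=pickle.loads, dumps=pickle.dumps)

# Intended usage, assuming a hypothetical `serializer` keyword:
# @cachier(serializer=Serializer(loads=dill.loads, dumps=dill.dumps))
```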

External/internal boundary

User-supplied loader/dumper pairs are expected to be coupled, and caches created by a given pair are expected to be reused only with that same pair in future runs. This might be signaled by embedding that information in the cache file somehow, to detect (and warn? or raise?) when a different loader/dumper is used, or else to silently create a different cache file. Customization of cache file naming can get hairy quickly; if we're not careful, a bunch of specifics will end up embedded in the cache file name, which seems fragile.

Both pickle and dill have Pickler- and Unpickler-like objects with load and dump methods, allowing customization of the pickling protocol and such. The user-facing API of cachier will have to pass down these customized user-supplied callables into cachier internals. I see that the Mongo core also uses pickle, so should its pickler/unpickler be customized by this new argument as well? That's probably a question for later.

New internals

I guess the minimally invasive initial implementation would introduce a new core, e.g. byop_core.py ("bring your own pickler"; totally a placeholder that needs renaming), parametrized on the user-supplied load/dump methods. Then, if that works well, pickle_core.py and the other cores can be refactored to derive from it in a non-breaking way; otherwise, the significantly duplicated logic across pickle_core.py and byop_core.py can be left to diverge.

To be a bit more specific, it looks like byop_core.py would duplicate pickle_core.py almost exactly, but with additional parametrization on the load and dump methods used, perhaps in a ByopCore (name pending) subclass of _BaseCore. Optionally, then pickle_core.py could be refactored to get an instance of _ByopCore parametrized on the pickle-specific loader/dumper.

Generalizing existing internals

All of this would probably mean generalizing the pickle.load and pickle.dump calls (currently specifically called via module import) and embedding them in the logic of _BaseCore or a new abstract subclass of it. New tests covering this would also be needed, implying at least the test logic will require a new dependency (e.g. dill) to provide a real-world loader/dumper alternative. Or at least a superficially-customized loader/dumper from the standard library could be thrown at it.
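Such a superficially customized stdlib pair could be as small as pickle pinned to an explicit protocol, e.g.:

```python
import functools
import pickle

# A trivially customized loader/dumper pair for tests, no dill required:
# stdlib pickle with the protocol pinned to 5 (Python 3.8+). loads needs
# no change because pickle auto-detects the protocol on read.
test_dumps = functools.partial(pickle.dumps, protocol=5)
test_loads = pickle.loads
```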

Documentation

There's also docs changes needed for this, which would document the new parameter and probably a guided example. I would probably use dill as an example, and an example with the pickling protocol set to 5 instead of the default of 4, but not require it in project dependencies.

Definition of acceptable user-supplied loaders and dumpers

In addition to this, a specification as to what constitutes acceptable user-supplied loaders/dumpers should be pointed to, e.g. the data stream format laid out in the pickle standard library documentation. The narrowness of that specification determines how useful this feature is, though. If this customization amounts to "you can supply anything as long as it's pickle", then it doesn't really satisfy the "something better than pickling" part of the OP's feature request.

Maybe there's a way to say, "User-supplied loader/dumpers should conform to <certain spec>. Technically, any loader/dumper that reads and writes bytes will do, but functionality is not guaranteed for exotic deviations from <certain spec>." I don't know, I get the feeling I'm opening a can of worms here, the whole point of this feature is to make cachier's serialization/deserialization logic more user-customizable, but there is always going to be a degree of "try at your own risk" here.

Conclusion

Just let me know what you think about the details given here, and I'll start gradually working towards an implementation. I'm working on this in hobby time, so it probably won't progress at breakneck speed; on the plus side, there will be ample time to think on these details.

@shaypal5
Collaborator

Random thoughts:

User-facing API: Yup, a new keyword is in order. And a named tuple makes sense.

External/internal boundary: Regarding how to embed information about the de/serializing package in the file, what about the file extension?

About the mongo core - yes, I think that is for later.

Generalizing existing internals: Good points overall.

Documentation: Sounds good.

Definition of acceptable user-supplied loaders and dumpers: I think you worry too much about this. We should accept only drop-in replacements for pickle, and responsibility for proper use is on the user, like everywhere in Python. :)

Conclusion: Sounds good. Work at it at your own pace and let me know when you need advice.

@blakeNaccarato

Thanks @shaypal5 for redirecting my nervous energy about specification and scope 😅. I'm becoming more interested in contributing lately, and my unfamiliarity with external codebases and processes manifests as overthinking it all. Just continue to nudge me if I'm thinking too much about a particular thing.

Regarding embedding loader/dumper information in cache filenames/extensions, should the signature be a hash, or loader/dumper function names? Embedding some kind of hash may be unwieldy and overly-specific, but embedding just the loader/dumper function names may be brittle (e.g. their implementation changes but names stay the same).

Maybe a minimal approach is to not encode the custom loader/dumper into the cache file at all, and just let decoding fail if the user changed the loader/dumper between runs. The user can decide how to handle that exception, e.g. reset the cache.

This would avoid adding more exception handling or logic to the cache read path, which is probably being called by user code in hot loops. The only issue with this minimal approach could be that data successfully unpickles but is silently corrupted by reading with a slightly different loader.

@shaypal5
Collaborator

Regarding file extensions: I think function names might not be unique. A hash sounds sensible; there should already be hashing somewhere in there.

Thinking about the use cases:

  1. In 80-99% of the cases, the custom de/serializer will be a function imported from a package providing a drop-in replacement for pickle. For those cases, something like dillv0311 for dill v3.11 (zero-padding to differentiate it from v31.1) would be cool and unique, even though probably no one would ever look at those, so it's arguably a waste of time.
  2. In the rest of the cases, we'll get a custom wrapper function someone wrote around pickle's or dill's de/serialization methods. In that case, a hash of the function's code itself sounds appropriate.
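Case 2 might be sketched like this (a hypothetical helper; it hashes the function's compiled code rather than its source text, so it works even when the source file isn't available):

```python
import hashlib

# Hypothetical helper: a short tag derived from a wrapper function's
# compiled bytecode and constants, so a changed implementation yields a
# different cache-file tag even when the function name stays the same.
def serializer_tag(fn):
    code = fn.__code__
    raw = code.co_code + repr(code.co_consts).encode()
    return hashlib.md5(raw).hexdigest()[:8]
```

This is only an approximation of "the function's code" (it ignores e.g. globals the function closes over), but it illustrates the idea.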

Regarding silent failure: I think your point is crucial; that approach is too minimal. dill's API (or any other such package's) makes no promise that deserialization explicitly and gracefully fails on every possibly data-breaking version incompatibility.

So I basically agree with your analysis and stand by your hunch to go the safe route and do this properly. :)

@blakeNaccarato

blakeNaccarato commented Nov 30, 2023

Great, I'll gradually work these things into the PR. I'll implement the naming scheme in a way that addresses these points. I think you're already embedding a hash in the filename when separate_files is enabled, so I'll make sure this plays nicely with that machinery, and follows similar patterns. I need to remember we can always tweak things at review time, so it need not be perfect. Thanks for the welcoming environment you're facilitating here, I'm looking forward to taking a crack at this.
