
Enhancement possibility? --> Pipe cache #14

Open
rickhg12hs opened this issue Sep 11, 2020 · 3 comments
@rickhg12hs

There is some overhead to create pipes. For some use cases it may be advantageous to cache pipes or even partial pipes. Would it be possible to cache pipes automatically? ... or by some switch, etc.?

Here you can see the "penalty" associated with creating pipes.

$ ipython3
Python 3.7.5 (default, Oct 17 2019, 12:21:00) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pipetools import pipe, X, foreach

In [2]: def my_func(count=10000, predef=False):
   ...:     if not predef:
   ...:         for k in range(count):
   ...:             a = range(10) > pipe | foreach(X**2) | sum
   ...:     else:
   ...:         my_pipe = pipe | foreach(X**2) | sum
   ...:         for k in range(count):
   ...:             a = range(10) > my_pipe
   ...:     return a
   ...: 

In [3]: %timeit my_func()
202 ms ± 8.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit my_func(predef=True)
59.5 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit for k in range(10000): a=sum([x**2 for x in range(10)])
29.9 ms ± 962 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
@0101
Owner

0101 commented Sep 14, 2020

It might be possible. It still wouldn't be as fast as doing it manually, because there would be some overhead of creating a key and looking it up in the cache.

Also it's quite easy to do it explicitly, as in your predef example. If you know you'll be reusing a pipe a few thousand times, you might as well give it a name. So I'm not sure it's worth adding caching that would complicate the code and potentially introduce some tricky bugs.

@rickhg12hs
Author

It would be nice if there was a way to do incremental data analysis where intermediate results could be cached and checked/viewed. For example,

In [6]: big_data_list > pipe | transformation_that_takes_a_long_time
[1234.5678,
...
 9876.5432]

Then hit "up-arrow" on the keyboard and just add the next transformation/aggregation at the end of the previous line...

In [7]: big_data_list > pipe | transformation_that_takes_a_long_time | another_long_transformation
[0.098,
...
 0.987]

... where big_data_list > pipe | transformation_that_takes_a_long_time isn't recalculated.

Is there an easy way to do this without manually storing intermediate analysis results?

@0101
Owner

0101 commented Sep 15, 2020

Well that's another story - I thought you only wanted to cache the pipe itself, not the result of calling it. For this you'd have to make some sort of cached_pipe object that would behave that way, or control it by some flag (because that would not be a good default behavior). Also in case of big_data_list it might be tricky to create a good cache key. At the moment I don't see any easy way to accomplish this, but it could be a nice coding exercise. I'm open to ideas if you have any.
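For the result-caching idea, a hedged sketch using the standard library's `functools.lru_cache` (here `slow_transform` is a hypothetical stand-in for a slow step; nothing in this block is a pipetools API). It also shows the hashability constraint behind the "good cache key" concern: a list like `big_data_list` has to be frozen into a tuple before it can serve as a key.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def slow_transform(data):
    # Stand-in for transformation_that_takes_a_long_time.
    # lru_cache requires hashable arguments, so callers must pass
    # a tuple, not a list - this is the tricky cache-key problem.
    return tuple(x * 2 for x in data)

frozen = tuple([1, 2, 3])   # freeze the list into a hashable key
slow_transform(frozen)      # computed on the first call
slow_transform(frozen)      # served from the cache on the second
```

A `cached_pipe` object in pipetools would essentially need to wrap each step this way, plus decide how to freeze arbitrary inputs, which is why there is no obvious general solution.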

Also there's usually some sort of magic variable (in IPython it's _) which always holds the result of the previous expression. So you can do:

In [6]: big_data_list > pipe | transformation_that_takes_a_long_time
...
In [7]: _ > pipe | another_long_transformation # or another_long_transformation(_)
...

Which might be a decent workaround for you.
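Outside IPython, the same workaround is simply binding the intermediate result to a name. A small sketch with hypothetical stand-ins for the transformations in the example above:

```python
# Hypothetical stand-ins for the slow steps in the example.
def transformation_that_takes_a_long_time(data):
    return [x * 10.0 for x in data]

def another_long_transformation(data):
    return [x / 4.0 for x in data]

big_data_list = [1, 2, 3]

# Pay for the expensive step once and keep the result...
intermediate = transformation_that_takes_a_long_time(big_data_list)

# ...then iterate freely on the cheap follow-up step without recomputing it.
final = another_long_transformation(intermediate)
```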
