
Sync/Async API design feedback #4

Open
normanrz opened this issue Jun 7, 2023 · 16 comments

@normanrz
Member

normanrz commented Jun 7, 2023

I've been working on making zarrita work with async stores and on using concurrency throughout the library. However, I think that many users will still want to use synchronous methods. So, the implementation uses only async methods internally and provides sync methods by running the async methods on an event loop on a separate thread and blocking the main thread (inspired by fsspec).
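
For illustration, here is a minimal sketch of that fsspec-style wrapper pattern. The sync helper and module-level loop are mine, not zarrita's actual internals:

import asyncio
import threading

# One event loop runs forever on a daemon thread; sync callers submit
# coroutines to it and block on the returned future.
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()

def sync(coro):
    # Run `coro` on the background loop and block until it completes.
    return asyncio.run_coroutine_threadsafe(coro, _loop).result()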

I am looking for feedback on how to design the API to accommodate both sync and async functions. Here are the main options that came to my mind:

1. Separate classes

The class methods create either sync or async variants of the Array class. Users need to decide upfront whether to use sync or async methods.

# sync
a = zarrita.Array.create_sync(
    store,
    'array',
    shape=(6, 10),
    dtype='int32',
    chunk_shape=(2, 5),
)
a[:, :] = np.ones((6, 10), dtype='int32') # set
a[:, :] # get
a.reshape((10, 10))
assert isinstance(a, zarrita.ArraySync)

# async
a = await zarrita.Array.create_async(
    store,
    'array',
    shape=(6, 10),
    dtype='int32',
    chunk_shape=(2, 5),
)
await a[:, :].set(np.ones((6, 10), dtype='int32')) # set
await a[:, :].get() # get
await a.reshape((10, 10))
assert isinstance(a, zarrita.Array)

2. Separate methods and properties

Both sync and async methods are available through the same class. There are still separate create and create_async class methods, because creating an array is async under the hood (i.e. it writes metadata to storage).

# sync
a = zarrita.Array.create(
    store,
    'array',
    shape=(6, 10),
    dtype='int32',
    chunk_shape=(2, 5),
)

# async
a = await zarrita.Array.create_async(
    store,
    'array',
    shape=(6, 10),
    dtype='int32',
    chunk_shape=(2, 5),
)

2a. Property-based async

This is a sync-first API, with the async methods available through the async_ property.

# sync
a[:, :] = np.ones((6, 10), dtype='int32') # set
a[:, :] # get
a.reshape((10, 10))

# async
await a.async_[:, :].set(np.ones((6, 10), dtype='int32')) # set
await a.async_[:, :].get() # get
await a.async_.reshape((10, 10))

2b. Async methods

Similar to 2a, but with _async-suffixed methods. This feels unpleasant because the slice syntax [:, :] cannot be used.

# sync
a[:, :] = np.ones((6, 10), dtype='int32') # set
a[:, :] # get
a.reshape((10, 10))

# async
await a.set_async((slice(None), slice(None)), np.ones((6, 10), dtype='int32')) # set
await a.get_async((slice(None), slice(None))) # get
await a.reshape_async((10, 10))

3. Async-first API

Implemented through future objects, inspired by tensorstore.

# sync
a[:, :].set(np.ones((6, 10), dtype='int32')).result() # set
a[:, :].get().result() # get
a.reshape((10, 10)).result()

# async
await a[:, :].set(np.ones((6, 10), dtype='int32')) # set
await a[:, :].get() # get
await a.reshape((10, 10))
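
As a rough sketch of how such future objects could support both calling styles (hedged: WriteFuture and the background _loop are assumptions building on the wrapper sketch above, not an actual zarrita API):

import asyncio

class WriteFuture:
    def __init__(self, coro, loop):
        # Schedule the operation immediately on the background loop.
        self._future = asyncio.run_coroutine_threadsafe(coro, loop)

    def result(self):
        # Sync path: block the calling thread until the operation finishes.
        return self._future.result()

    def __await__(self):
        # Async path: adapt the concurrent future to the caller's loop.
        return asyncio.wrap_future(self._future).__await__()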
@d-v-b

d-v-b commented Jun 8, 2023

What's the use of a[:, :] in these examples? E.g., in example 3, is there a difference between
a[:, :].set(np.ones((6, 10), dtype='int32')).result() and
a.set(np.ones((6, 10), dtype='int32')).result()?

@normanrz
Member Author

normanrz commented Jun 8, 2023

I just use [:, :] as a stand-in for arbitrary selections. It would select the entire array and then either provide the data directly (for the sync methods) or a handle with get/set methods. In example 3, there wouldn't be an a.set(...).result() method. Setting would only be available through the selection.

@d-v-b

d-v-b commented Jun 8, 2023

In example 3, there wouldn't be an a.set(...).result() method. Setting would only be available through the selection.

Why impose this restriction?

@normanrz
Member Author

normanrz commented Jun 8, 2023

I guess providing an a.set(...).result() method would be handling a special case. For many use cases, users would rather want to set a subrange of the array, e.g. a[:2, :2].set(np.ones((2, 2), dtype='int32')).result(). But of course, such a special-case method could be added for the [:, :] case.

@d-v-b

d-v-b commented Jun 8, 2023

I think I understand -- a alone is not set-able or get-able, whereas a[selector] returns a set-able / get-able object.

I'm mentally weighing the pros and cons of the a[].set().result() syntax vs just having a.set(selector, values).result(), sitting below a synchronous numpy-like array API.

@normanrz
Member Author

normanrz commented Jun 9, 2023

I think I understand -- a alone is not set-able or get-able, whereas a[selector] returns a set-able / get-able object.

Exactly.

I'm mentally weighing the pros and cons of the a[].set().result() syntax vs just having a.set(selector, values).result(), sitting below a synchronous numpy-like array API.

The problem with a.set(selector, values).result() is that the selectors can get very verbose compared to the bracket syntax, e.g. [:, :] => (slice(None), slice(None)), [1:4, 4:] => (slice(1, 4), slice(4, None)), or [:] => (slice(None),).
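
To illustrate the equivalence: Python hands __getitem__ exactly those slice tuples, so the bracket syntax is pure sugar over the verbose form (Array here is a minimal stub, not zarrita's class):

class Array:
    def __getitem__(self, selector):
        if not isinstance(selector, tuple):
            selector = (selector,)  # a[:] arrives as a bare slice
        return selector

a = Array()
assert a[:, :] == (slice(None), slice(None))
assert a[1:4, 4:] == (slice(1, 4), slice(4, None))
assert a[:] == (slice(None),)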

@jbms

jbms commented Jun 21, 2023

Given that you will need syntax that differs from NumPy for async anyway, it seems to me that from a user perspective there is little benefit in having a separate async Array class. Instead, I would say option 2 or 3 is better; that is what is done in tensorstore as well.

However, if you allow both sync and async through the same class, then that kind of implies that sync and async methods might be called concurrently, and async methods might be called concurrently from multiple event loops. That would require that the class (especially any underlying key-value stores that it uses) not hold any references to any particular event loop, and be designed to support this type of concurrent use. That would probably make it challenging to re-use existing asyncio libraries for I/O. You could just ban use of the same Array from multiple event loops and ban mixing sync/async operations, but that may be undesirable. In general, though, because array processing often involves CPU-heavy operations that release the GIL, supporting multi-threaded access would be useful. tensorstore does not have this problem because it internally uses its own thread pools and merely interoperates with asyncio.

It is convenient to support the normal NumPy syntax for reading and writing, because then it can in some cases just work with existing libraries that expect an array-like object. However, you could also support that use case with an adapter object available through a property; a sketch of that idea follows.
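
A hedged sketch of such an adapter, assuming an async-first Array core and a blocking sync helper like the one sketched earlier in the thread (all names are illustrative):

class SyncView:
    def __init__(self, array):
        self._array = array

    def __getitem__(self, selector):
        # Delegate to the async selection handle and block on the result.
        return sync(self._array[selector].get())

    def __setitem__(self, selector, value):
        sync(self._array[selector].set(value))

An async-first Array could then expose, say, a property returning SyncView(self), so existing code that expects NumPy-style indexing keeps working.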

@jstriebel

Style-wise, my personal favorites are 3 and 2.

3 might be confusing for those not used to async, but it is the cleanest API IMO, unifying both worlds nicely. Also, futures allow some degree of parallelism from non-async code in Python. Additionally, I find it nice that slices of an array can be represented, and actual get or set access is done explicitly. However, this would be a major difference from the current zarr-python implementation.

If the barrier of entry should be as low as possible (for beginners or people coming from zarr-python), I'd tend to option 2a or 2b.

However, if you allow both sync and async through the same class, then that kind of implies that sync and async methods might be called concurrently, and async methods might be called concurrently from multiple event loops. That would require that the class (especially any underlying key-value stores that it uses) not hold any references to any particular event loop, and be designed to support this type of concurrent use.

This might be circumvented by allowing users to set the event loop for sync access directly on the array. Then any async access should also happen within the same event loop (which can be asserted). This is also a useful feature on its own IMO, just to have more control over the event loop zarrita would be using internally for synchronous access.
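
A minimal sketch of that pinning idea (hypothetical names throughout; not an existing zarrita feature):

import asyncio

class Array:
    def __init__(self, loop: asyncio.AbstractEventLoop):
        self._loop = loop  # the loop also used for sync access

    async def _check_loop(self):
        # Any async access must happen on the pinned loop.
        assert asyncio.get_running_loop() is self._loop, (
            "this Array is bound to a different event loop"
        )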

@martindurant

I did some playing with this and zarr-python in https://github.com/martindurant/async-zarr

@normanrz
Member Author

normanrz commented Jul 7, 2023

@martindurant async-zarr looks great! Have you also thought about how to design the "write" interface in async?

@martindurant

To be sure: it was just an experimental POC, and a while ago now. I think "write" would look very much the same, though -- of course assuming you already have the data in memory. I had not thought at all about streaming in either direction.

@clbarnes
Contributor

clbarnes commented Jul 8, 2023

I am not convinced about the suitability of asyncio for zarr at all. I'm not a heavy asyncio user; I've made a number of attempts to incorporate it into my workflows, but if my understanding of it is correct (not a given), I have always run into issues when trying to reframe my problems in an asyncio context. The main issues are:

  • It depends on the async nature of the very bottom of the stack. There are async HTTP implementations (so cloud storage is fine), but many OSs and local file systems don't have actual async APIs: the workaround in aiofiles is to wrap threads to pretend they're async.
  • It poisons the whole stack. If zarr depends on async at the bottom, then either every level (where someone might want to interpose a codec or transformer or store) also needs a sync version (possibly which just calls asyncio.run on the async version), or users need to annotate every interaction and every function containing an interaction and so on and so on with async and think in those terms.
  • Code running in an event loop must not block the event loop. This means that we can't do anything compute-heavy while loading chunks (e.g. de/compression), and crucially, neither can the user. I do my best to run my IO-bound and compute-bound tasks at the same time so that nothing's waiting around, but this is not possible in asyncio: you must do all your asyncio stuff without any heavy CPU usage, then all your compute stuff outside of the event loop. That means an idling CPU (and filling RAM) while you're waiting for IO, then idling IO channels as you chew through what you previously loaded. I think that even codecs which release the GIL are no use here: there is no parallelism under asyncio, only concurrency.
  • Possibly more of a personal thing, but lazy evaluation of awaitables (i.e. they only get scheduled when you await them) is an ergonomic pain to me. a = an_async_fn(); b = another_async_fn(); await a; await b runs in serial, unless you manually wrap each coroutine in create_task(), or use as_completed or gather around a group of them (see the sketch below).
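
A small self-contained example of that scheduling behaviour (timings approximate):

import asyncio

async def fetch(i):
    await asyncio.sleep(1)
    return i

async def serial():
    a = fetch(1)
    b = fetch(2)
    return [await a, await b]  # ~2 s: b is only scheduled once awaited

async def concurrent():
    return await asyncio.gather(fetch(1), fetch(2))  # ~1 s: run together

asyncio.run(concurrent())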

The stdlib docs recommend running blocking IO-bound tasks (like file access) in a thread pool, and blocking CPU-bound tasks (like heavy maths) in a process pool, alongside the asyncio you're doing. But transferring data in and out of a separate interpreter for multiprocessing is itself very slow, and means we can't stream from the IO channel directly into a codec.

@martindurant

@clbarnes: I think there may be some misunderstanding here. zarr-python already allows for async operation on reading/writing batches of chunks via fsspec, using only blocking calls in the outward-facing API -- there is no poisoning of the stack. All of it works exceedingly well with thread and process parallelism, particularly as implemented by dask. It took some magic in fsspec to make this happen (or, look at the async implementations in rfsspec, which operate without asyncio in a Rust event loop), and there is no asyncio at all for local file access.

However, zarr is fundamentally an IO library more than an encoding/compute library. It is reasonable for people to want to call it from within an async context and to want the various metadata operations on remote files to operate async (since they are extremely latency-bottlenecked). Even a compute library like dask might reasonably want to be able to say "wait on these IO operations, and in the meantime I can use this thread to do other real work". My POC shows that it is relatively simple to provide the normal blocking API as usual in zarr-python and also provide an alternative async class hierarchy for those that want it. This is (part of) one of the options being discussed in this issue.

@clbarnes
Contributor

clbarnes commented Jul 8, 2023

However, zarr is fundamentally an IO library more than an encoding/compute library.

Agreed, although you've got to do something with the data you're IO-ing, and zarr is designed to IO a lot of data, which does imply a lot of "somethings" to do!

look at the async implementations in rfsspec, which operate without asyncio in a rust event loop

Yes, Rust does sidestep some of the issues with Python's asyncio because you can freely hand data between threads, spin them up and down without needing to fork an interpreter, and so on.

@normanrz
Member Author

normanrz commented Jul 8, 2023

I think that even codecs which release the GIL are no use here: there is no parallelism under asyncio, only concurrency.

I want to mention that there is asyncio.to_thread, which allows scheduling blocking code (that releases the GIL, such as blosc) in a thread pool. This does allow for parallelism. zarrita uses to_thread for all codecs.
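
A minimal sketch of that pattern (asyncio.to_thread requires Python 3.9+; zlib stands in here for a GIL-releasing codec such as blosc):

import asyncio
import zlib

async def decode_chunk(raw: bytes) -> bytes:
    # Offload the blocking, GIL-releasing decompression to the default
    # thread pool so several chunks can decompress in parallel.
    return await asyncio.to_thread(zlib.decompress, raw)

async def main():
    chunks = [zlib.compress(b"\x00" * 100_000) for _ in range(8)]
    return await asyncio.gather(*(decode_chunk(c) for c in chunks))

asyncio.run(main())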

@martindurant

Well, also remember that holding the GIL means other (thread) parallelism, for instance with dask, can't happen; if you release it, it can.

Do we have any estimate of how much zarr interaction happens via dask?
