This project explores strategies for fast GroupBy reductions with dask.array. It used to be called dask_groupby
It was motivated by
- Dask Dataframe GroupBy blogpost
- numpy_groupies in Xarray issue
(See a presentation about this package, from the Pangeo Showcase).
This work was funded in part by
- NASA-ACCESS 80NSSC18M0156 "Community tools for analysis of NASA Earth Observing System Data in the Cloud" (PI J. Hamman, NCAR),
- NASA-OSTFL 80NSSC22K0345 "Enhancing analysis of NASA data with the open-source Python Xarray Library" (PIs Scott Henderson, University of Washington; Deepak Cherian, NCAR; Jessica Scheick, University of New Hampshire), and
- NCAR's Earth System Data Science Initiative.
It was motivated by very very many discussions in the Pangeo community.
There are two main functions
flox.groupby_reduce(dask_array, by_dask_array, "mean")
"pure" dask array interfaceflox.xarray.xarray_reduce(xarray_object, by_dataarray, "mean")
"pure" xarray interface; though work is ongoing to integrate this package in xarray.
See the documentation for details on the implementation.
flox
implements all common reductions provided by numpy_groupies
in aggregations.py
.
It also allows you to specify a custom Aggregation (again inspired by dask.dataframe),
though this might not be fully functional at the moment. See aggregations.py
for examples.
mean = Aggregation(
# name used for dask tasks
name="mean",
# operation to use for pure-numpy inputs
numpy="mean",
# blockwise reduction
chunk=("sum", "count"),
# combine intermediate results: sum the sums, sum the counts
combine=("sum", "sum"),
# generate final result as sum / count
finalize=lambda sum_, count: sum_ / count,
# Used when "reindexing" at combine-time
fill_value=0,
# Used when any member of `expected_groups` is not found
final_fill_value=np.nan,
)