
Roadmap towards UC8 #124

Closed
LukeWeidenwalker opened this issue Jun 19, 2023 · 10 comments

@LukeWeidenwalker (Contributor) commented Jun 19, 2023

@clausmichele @aljacob I wanted to follow up on our call last Friday with a roadmap of concrete steps we'll need to finally close UC8.

Last year, because of the way the architecture was set up, there was no local execution environment, so we could only test the whole workflow within the entire running system; that made it difficult to collaborate on this outside of EODC. With the rewrites to the parser and the processes, this should now be much easier, because we can essentially prototype the entire workflow without needing to have it exposed in a fully-fledged backend. As discussed on Friday, I'm summing up here what's still missing, so that @clausmichele can start supporting us again!

Subissues:

  • Support the geometries parameter in load_collection (or load_stac) by loading sparse arrays (https://github.com/pydata/sparse) (Implement load_stac #120, Support geometries in load_stac #121)
    • The problem with this use case was always that we had no way to load only the data that we'd crop to later in aggregate_spatial. I'm attaching my proof-of-concept notebook for how to get around this by leveraging pydata/sparse arrays (super crude - @clausmichele please skim this quickly and let me know if you can work with this already, or if I need to tidy it up and add some more comments!); a condensed sketch of the idea is included below this list. This has to be done at the point where data is loaded initially, because otherwise any operations between load_collection and aggregate_spatial will run over an extent much larger than necessary.
    • @clausmichele: If you prototype this in load_stac, the rest of the pipeline should be unlocked to be worked on, even if this isn't live at EODC yet!
  • Rewrite aggregate_spatial to work with sparse arrays and produce vector cubes with https://github.com/xarray-contrib/xvec (Port aggregate_spatial to work with sparse arrays and produce vector cubes with xvec #119)
    • When we worked on this last year, we used geopandas + dask-geopandas to represent vector cubes somewhat haphazardly. This is actually not a good idea in the long term, because we'd have to handle raster and vector cubes separately everywhere (think if isinstance(data, gpd.GeoDataFrame): statements all over the place) - which would be annoying and make this unnecessarily hard to maintain!
    • Xvec has since come out, which I find quite promising, because it should allow us to support vector cubes within the xarray paradigm (see the second sketch below this list). It has the added benefit that the implementation is directly motivated by Edzer's concept of the vector cube, so we should already be quite closely aligned conceptually. It should also be easier to keep processes dask-friendly that way, rather than relying on the extra dependency of geopandas/dask-geopandas.
    • Disclaimer: I haven't tried xvec yet, so all of this (especially how it behaves with dask) still needs to be confirmed!
  • Adapt fit_regr_random_forest to work on xvec vector cubes (Adapt fit_regr_random_forest to work on xvec vector cubes #122)
    • There have also been some developments in the spec for the ML processes, which we should adopt. (The second sketch below this list ends with a minimal fit on such a cube.)
  • Adapt pred_regr_random_forest to work with the new setup (Test pred_regr_random_forest to work this new setup #123)
    • This should already work (there's been an integration test for it for a long time), but it still needs to be tested.
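
For reference, here's a condensed version of the idea from the attached notebook - not the notebook itself, just a minimal sketch with made-up grid and polygon values: mask the raster to the polygon footprints at load time and store it as a pydata/sparse array, so memory is only spent on pixels that aggregate_spatial will actually use.

```python
import numpy as np
import shapely
import sparse
import xarray as xr

# Hypothetical grid and geometries standing in for a load_collection result.
x = np.linspace(3, 18, 500)
y = np.linspace(43, 51, 300)
data = np.random.rand(len(y), len(x)).astype("float32")
polygons = [shapely.box(4, 44, 6, 46), shapely.box(15, 49, 17, 50.5)]

# Boolean mask: True where a pixel centre falls inside any polygon.
xx, yy = np.meshgrid(x, y)
mask = shapely.contains_xy(shapely.union_all(polygons), xx, yy)

# Keep only the masked pixels; everything else becomes the fill value,
# which sparse.COO does not store.
dense = np.where(mask, data, np.nan)
coo = sparse.COO.from_numpy(dense, fill_value=np.nan)

cube = xr.DataArray(coo, coords={"y": y, "x": x}, dims=("y", "x"))
print(cube.data.density)  # fraction of pixels actually stored
```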

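And a second sketch of where this could go: an xvec-indexed vector cube as the aggregate_spatial output, flattened into a feature table for the random-forest processes. Untested (per the disclaimer above), and all data and names here are purely illustrative.

```python
import numpy as np
import shapely
import xarray as xr
import xvec  # noqa: F401  (registers the .xvec accessor)
from sklearn.ensemble import RandomForestRegressor

geoms = [shapely.box(4, 44, 6, 46), shapely.box(15, 49, 17, 50.5)]
bands = ["B02", "B03", "B04"]

# Vector cube: one aggregated value per (geometry, band), indexed by a
# shapely-aware GeometryIndex so it stays inside the xarray paradigm.
cube = xr.DataArray(
    np.random.rand(len(geoms), len(bands)),
    coords={"geometry": geoms, "bands": bands},
    dims=("geometry", "bands"),
).xvec.set_geom_indexes("geometry", crs=4326)

# Flatten to a (samples, features) table for scikit-learn; the target
# values here are made up.
X = cube.values
y = np.array([0.3, 0.7])
model = RandomForestRegressor(n_estimators=100).fit(X, y)
```
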
Let me know if you have any questions on this general plan! As I explained on Friday, we don't have bandwidth to tackle these issues ourselves for at least another 2-week increment, so any progress you're able to make on this will help us out a lot and will be much appreciated. I'll try to make myself available to support and review any pull requests in a timely manner!

cc @SerRichard @ValentinaHutter @christophreimer @bschumac

test_sparse_multipolygons.zip

@LukeWeidenwalker added this to the Support for UC8 milestone Jun 19, 2023
@clausmichele (Member)

@LukeWeidenwalker my colleagues are working on the aggregate functionality: ks905383/xagg#48
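
For context, xagg's documented two-step workflow looks roughly like this (a sketch with fabricated data, untested here): compute pixel/polygon overlap weights once, then aggregate any variable on the same grid with those weights.

```python
import numpy as np
import geopandas as gpd
import shapely
import xarray as xr
import xagg

# Made-up grid and a single illustrative polygon.
ds = xr.Dataset(
    {"temp": (("lat", "lon"), np.random.rand(10, 20))},
    coords={"lat": np.linspace(43, 51, 10), "lon": np.linspace(3, 18, 20)},
)
gdf = gpd.GeoDataFrame(
    {"name": ["field_1"]}, geometry=[shapely.box(5, 44, 8, 47)], crs=4326
)

weightmap = xagg.pixel_overlaps(ds, gdf)    # area weights per polygon
aggregated = xagg.aggregate(ds, weightmap)  # weighted mean per polygon
print(aggregated.to_dataframe())
```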

@SerRichard (Collaborator)

Some open points from the coordination!

  • Test xagg on larger-scale datasets

  • Working with projected coordinate systems?

  • Is loading of a multipolygon a hard requirement?

  • fit_regr_random_forest is unblocked if we know we will be using xvec as the output of aggregate_spatial

@ValentinaHutter (Collaborator)

> Some open points from the coordination!
>
>   • Test xagg on larger-scale datasets

Concerning larger scale datasets:

@clausmichele (Member)

@masawdah implemented filter_spatial in this PR: #170
It is dask-based; we tested it with full S2 tiles. This allows us to crop a large input dataset to the requested polygons and load only the necessary subset.
Once this is approved, we will reuse this process to implement aggregate_spatial using the xvec data structure as output.

@ValentinaHutter (Collaborator) commented Nov 14, 2023

I now did some tests with filter_spatial and can confirm that it works with our datasets as well. When testing it with one polygon, the dataset gets cropped to the bounds of the polygon and gets smaller. When two polygons that are further away from each other are used, it also works, but there are lots of NaN values in the area between the two polygons (which is as expected) - see the illustration below. This is the most important detail for the aggregate_spatial implementation, I think. I will now continue with the implementation of fit_regr_random_forest, hope that's okay!
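
Not the actual filter_spatial code from #170 - just a small rioxarray-based illustration of the behaviour described above: clipping to two distant polygons keeps the union's bounding box and fills the area between them with NaN.

```python
import numpy as np
import shapely
import xarray as xr
import rioxarray  # noqa: F401  (registers the .rio accessor)

# Made-up raster covering the whole extent.
da = xr.DataArray(
    np.random.rand(80, 150),
    coords={"y": np.linspace(51, 43, 80), "x": np.linspace(3, 18, 150)},
    dims=("y", "x"),
).rio.write_crs("EPSG:4326")

polygons = [shapely.box(4, 44, 6, 46), shapely.box(15, 49, 17, 50.5)]
clipped = da.rio.clip(polygons, crs="EPSG:4326", drop=True)

# Cropped to the polygons' overall bounds; pixels between them are NaN.
print(clipped.shape, float(np.isnan(clipped).mean()))
```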

@masawdah (Member)

Hi @ValentinaHutter, aggregate_spatial is implemented in this PR: #194

As discussed with @clausmichele, if you plan to proceed with the implementation of fit_regr_random_forest, it is important to verify that the output of aggregate_spatial is compatible with the requirements of the other processes. Currently, it takes the form of an xvec vector cube as an xarray Dataset.

@ValentinaHutter (Collaborator)

Thanks a lot for the quick reply! We'll have a look at the PR and test it on our side ASAP!

@ValentinaHutter (Collaborator)

There is now a general implementation of aggregate_spatial in openeo-processes-dask.

However, we found that the spatio-temporal extent for UC8 needs to be treated differently. This is because UC8 starts with load_collection on bbox = [3, 43, 18, 51] and temporal_extent = ["2018-05-01", "2018-09-01"]. This fails for the datasets that are provided by EODC in the EODC backend. The error is raised as soon as dask tries to create a lazy array, for either having too much data in one chunk or having a task graph that is too large. The kernel crashes even before aggregate_spatial is started.
Therefore, we needed to reimplement aggregate_spatial internally to work with subsets of load_collection outputs; the general idea is sketched below.
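
Since the internal re-implementation isn't shared yet, this is only a rough sketch of the general idea: instead of building one huge lazy array over the full bbox, load a small subset per polygon so each dask graph stays manageable. `load_subset` is a hypothetical stand-in for a load_collection call restricted to a bounding box.

```python
import numpy as np
import shapely
import xarray as xr

def load_subset(bounds):
    # Hypothetical stand-in for load_collection restricted to `bounds`;
    # here it just fabricates a small array for that extent.
    minx, miny, maxx, maxy = bounds
    x = np.linspace(minx, maxx, 50)
    y = np.linspace(miny, maxy, 50)
    return xr.DataArray(
        np.random.rand(50, 50), coords={"y": y, "x": x}, dims=("y", "x")
    )

polygons = [shapely.box(4, 44, 6, 46), shapely.box(15, 49, 17, 50.5)]

# One small load + aggregate per polygon, concatenated into a vector-cube-
# like result, instead of one giant load over bbox = [3, 43, 18, 51].
aggregates = xr.concat(
    [load_subset(p.bounds).mean(dim=("x", "y")) for p in polygons],
    dim="geometry",
).assign_coords(geometry=[p.wkt for p in polygons])
```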

Let me know, if anything is unclear or you need a more detailed description of the issue.

@clausmichele (Member)

@ValentinaHutter in my opinion the re-implementation/adaptation would also be important to share, so that we understand the internal logic. Additionally, if someone else faces the same issue, it would be important to have a possible solution tracked here.

@ValentinaHutter (Collaborator)

I will try to sum it up in the next few days - I think this might be a nice thing to discuss at our first openeo-processes-dask meeting next Monday :)
