Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregate spatial #194

Merged
merged 36 commits into from
Nov 22, 2023
Merged

Aggregate spatial #194

merged 36 commits into from
Nov 22, 2023

Conversation

masawdah
Copy link
Member

@masawdah masawdah commented Nov 14, 2023

Hi team,
We have completed the implementation of the aggregate_spatial process. You can use the code snippet below to test it. The current output format is an x-vector cube as an xarray dataset, but it can also be configured as a DatArray, depending on the requirements of processes utilizing the output, such as fit_regr_random_forest.

import xarray as xr
import pandas as pd
from shapely.geometry import MultiPolygon, Polygon, mapping
import geopandas as gpd
import json
from openeo.local import LocalConnection
local_conn = LocalConnection("./")


json_file = {
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {},
      "geometry": {
        "coordinates": [
          [
            [
              11.143181900944768,
              46.682468440575775
            ],
            [
              11.143181900944768,
              46.67892170969685
            ],
            [
              11.14948639229209,
              46.67892170969685
            ],
            [
              11.14948639229209,
              46.682468440575775
            ],
            [
              11.143181900944768,
              46.682468440575775
            ]
          ]
        ],
        "type": "Polygon"
      }
    },
    {
      "type": "Feature",
      "properties": {},
      "geometry": {
        "coordinates": [
          [
            [
              11.181891477817032,
              46.69812317718774
            ],
            [
              11.176847884739885,
              46.697171913851975
            ],
            [
              11.173695639065556,
              46.69284777838911
            ],
            [
              11.178234872836669,
              46.69250183259447
            ],
            [
              11.181891477817032,
              46.69812317718774
            ]
          ]
        ],
        "type": "Polygon"
      }
    }
  ]
}

bbox =  [10.93311286, 46.66029087, 11.41500808, 46.85360212]


url = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a"
spatial_extent =  {"west": bbox[0], "south": bbox[1], "east": bbox[2], "north": bbox[3], 'crs':'EPSG:4326'}

temporal_extent = ["2022-06-01", "2022-06-30"]
bands = ["B04","B03","B02","B08"] # B04 = red, B03 = green, B02 = blue, B08 = near infrared
properties = {"eo:cloud_cover": dict(lt=30)}


datacube = local_conn.load_stac(
    url=url,
    spatial_extent=spatial_extent,
    temporal_extent=temporal_extent,
    bands=bands,
    properties=properties,
)

# Geometries - Polygon
datacube = datacube.filter_spatial(json_file)

datacube = datacube.aggregate_spatial(json_file, reducer='mean')
datacube

Here if the output is xr.Dataset

image

And this is as DataArray

image

@clausmichele
Copy link
Member

@masawdah in the example you mention a json_file but it's not defined.
Anyway, it would be good to provide a simple geometry in the code so that it's self contained and it can be executed without further inputs.

@masawdah
Copy link
Member Author

@masawdah in the example you mention a json_file but it's not defined. Anyway, it would be good to provide a simple geometry in the code so that it's self contained and it can be executed without further inputs.

updated with geometries and bbox.

) -> VectorCube:
raise NotImplementedError
t_dim = data.openeo.temporal_dims
t_dim_name = t_dim[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Line 196, you check the length of t_dim, so it might make sense to also check for the length here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make sense to also add

t_dim_name = None if len(t_dim) == 0 else t_dim[0]

otherwise this line fails when the temporal dimension is not available. This would also need to be considered in Line 231 and 243.

@ValentinaHutter
Copy link
Collaborator

LGTM, thank you for working on this!

@ValentinaHutter
Copy link
Collaborator

If you want to run the pre commit manually, you could pip install pre-commit and then try pre-commit run --all-files. :)


## Create VectorCube
vec_cube = xr.Dataset(
data_vars=data_vars, coords=dict(geometry=df.geometry, time=times)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of naming the temporal dimension "time" per default, it makes sense to name it after the temporal dimension in the input data, what do you think?
I think, this could be achieved by something like

coords= {"geometry": df.geometry, t_dim: times}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it makes sense to keep the name as it is. Unfortunately, I currently don't have access to my VM due to internal maintenance. However, you can go ahead and implement the new modifications.

The name of time dimension has already been extracted in the beginning with t_dim_name = t_dim[0], so we can reuse it here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was looking into it, but you were faster with your updates, I now also added a test for a dataset without time.

Copy link

codecov bot commented Nov 22, 2023

Codecov Report

Attention: 31 lines in your changes are missing coverage. Please review.

Comparison is base (c09c805) 80.20% compared to head (a689516) 79.42%.

Files Patch % Lines
...es_dask/process_implementations/cubes/aggregate.py 63.52% 31 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #194      +/-   ##
==========================================
- Coverage   80.20%   79.42%   -0.78%     
==========================================
  Files          30       30              
  Lines        1384     1468      +84     
==========================================
+ Hits         1110     1166      +56     
- Misses        274      302      +28     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ValentinaHutter ValentinaHutter merged commit 665a877 into Open-EO:main Nov 22, 2023
3 of 5 checks passed
@ValentinaHutter
Copy link
Collaborator

Thanks for working on this! I will now release a new openeo-processes-dask version and continue testing on larger scales!

Comment on lines +150 to +159
mask = rasterio.features.geometry_mask(
geometry_array, out_shape=(y_dim_size, x_dim_size), transform=transform
)

if t_dim is not None:
mask = np.expand_dims(mask, axis=data_dims.index(t_dim))
if b_dim is not None:
mask = np.expand_dims(mask, axis=data_dims.index(b_dim))

masked_data = data * mask

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I got here from xarray-contrib/xvec#52 where @masawdah opened a PR with a variation of this.

The same code as is here was there as well but I believe it is wrong. mask is a boolean array with False where the geometry_array has been burned in, the rest is True. Line 159 results in an array where all the values are preserved except those where the geometry is, where the values are now 0 (not missing!).

Any reduction of this masked array it therefore likely wrong.

Even if you passed invert=True to rasterio.features.geometry_mask to get True indicating the location of geometry and False the area that should be taken away, the line 159 still keeps 0 where the geometry is not, therefore mean, median etc will be wrong. You would need to create a mask resulting in missing values here I believe.

I may be wrong as this is clearly reviewed PR but I don't think that is the case. The tests here also do not verify any correctness of values and this code indeed produces correctly formed cube. Just with completely wrong results.

I have refactored the logic in that xvec PR and verified the result against other implementation so you can check how it is done there. Or we may be able to get the built-in functionality of xvec to the state where you can use it directly, if you dare to depend on version 0.2.0 of it :).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martinfleis Thanks for pointing this out. Actually, this work was done simultaneously with xarray-contrib/xvec#52 . The intention behind it was to replace this effort with a built-in functionality of xvec once it is ready.

I saw your refactoring for the implementation, and it looks great. Let's double-check once again to confirm the best :)

@clausmichele
Copy link
Member

@masawdah @ValentinaHutter, I also noticed something that needs to be adjusted.
You can run the first example adding datacube.execute() at the end and inspect the result.

  1. The result is an xarray.Dataset, but we agreed to use DataArray as a common data format.
  2. The result is not Dask based, somewhere the load/computation of the data is being triggered. We need to avoid that and do it only when explicitly requested.

@masawdah
Copy link
Member Author

@masawdah @ValentinaHutter, I also noticed something that needs to be adjusted. You can run the first example adding datacube.execute() at the end and inspect the result.

  1. The result is an xarray.Dataset, but we agreed to use DataArray as a common data format.
  2. The result is not Dask based, somewhere the load/computation of the data is being triggered. We need to avoid that and do it only when explicitly requested.

I'm not sure if I missed something about the output format but it mentioned here #124 (comment) xarray.dataset. but it would be easy to switch.

@clausmichele
Copy link
Member

Please be aware that now the conversion from xarray.Dataset to xarray.DataArray is using the .to_dataarray() method, the .to_array() is deprecated. See here: https://docs.xarray.dev/en/stable/generated/xarray.Dataset.to_dataarray.html#xarray.Dataset.to_dataarray

image

@masawdah
Copy link
Member Author

@clausmichele @ValentinaHutter I have tested the new implementation of the built-in functionality of xvec for spatial aggregation, and it performed well. It successfully handled the points mentioned by Michele.

We can replace the existing functionality with this new implementation upon its release.

Thanks @martinfleis

image

@clausmichele
Copy link
Member

@masawdah there's a new release of xvec including your contribution https://github.com/xarray-contrib/xvec/releases/tag/v0.2.0
You can create a new PR now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants