load_stac: stac_items #501

Open
jdries opened this issue Apr 16, 2024 · 4 comments

Comments

@jdries
Contributor

jdries commented Apr 16, 2024

Proposed Process ID: load_stac
Proposed Parameter Name: stac_items
Optional: yes, default: None

Context

load_stac is very popular for loading user-defined data, but it requires the STAC JSON to be available via an HTTP URL.
In many cases such a URL is not available, so the user has to rely on a third-party service (e.g. GitHub) to upload the STAC JSON.
I also see a use case for systems that require signed URLs for data access, where the user first needs to sign the URLs using a secret key.

Description

stac_items, if provided, is an array of valid STAC Item objects. The backend will load all assets in the provided items.

Data Type

array of objects

Additional changes

The other parameters would no longer need to be present if stac_items is provided directly.
An alternative is of course to turn this into a separate 'load_stac_items' process with a single parameter?
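
For illustration, a process graph node using the proposed parameter might look like the sketch below. The stac_items parameter is only the proposal from this issue and is not part of the current load_stac specification; the helper name load_stac_node, the item contents and the example.com URL are hypothetical placeholders.

```python
import json

# Hypothetical minimal STAC Item; the URL and all values are placeholders.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "ndvi-example",
    "geometry": None,
    "properties": {"datetime": "2024-04-16T00:00:00Z"},
    "links": [],
    "assets": {
        "ndvi": {
            # Absolute URL, so no self link is needed to resolve it.
            "href": "https://example.com/data/ndvi.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    },
}


def load_stac_node(stac_items):
    """Build a load_stac node that passes items inline (proposed, not in the spec)."""
    return {
        "process_id": "load_stac",
        "arguments": {"stac_items": stac_items},
        "result": True,
    }


node = load_stac_node([item])
print(json.dumps(node, indent=2))
```

The node is plain JSON, so the process graph stays self-contained and no item has to be uploaded anywhere first.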

@clausmichele
Member

@jdries so, if I understand correctly, you would like to pass the STAC Items directly as JSON/text in the process graph instead of a URL? It could be a good idea!

From what I understand, if we integrate it into load_stac, a user can provide:

  • A STAC Collection or Item URL: the data will be loaded as usual, also using the provided query filters.
  • An array of objects (STAC Items), which can be generated client-side, giving more control over what gets loaded.

@m-mohr
Member

m-mohr commented May 16, 2024

This can quickly become problematic. Many STAC Items don't use absolute URLs, and then you can't load the data if the self URL isn't set. Usually you can fall back to the Item URL if no self URL is given, but here that URL is not available as a fallback. Also, the JSON size can explode quickly if people start to pass thousands of Items.

Generally, I think I'd prefer a separate process if at all.
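
The two concerns above (relative asset hrefs without a self link, and payload size) could be checked client-side before items are sent inline. This is a minimal sketch using only the Python standard library; the function names and the example items are hypothetical, and it assumes relative hrefs are resolved against the item's self link.

```python
import json
from urllib.parse import urljoin, urlparse


def self_href(item):
    """Return the href of the item's 'self' link, or None if there is none."""
    for link in item.get("links", []):
        if link.get("rel") == "self":
            return link.get("href")
    return None


def is_absolute(href):
    """Treat any href with a URL scheme (https, s3, ...) as absolute."""
    return bool(urlparse(href).scheme)


def resolve_assets(item):
    """Return (resolved, unresolvable): absolute hrefs by asset key,
    and the keys whose relative href has no absolute self link to resolve against."""
    base = self_href(item)
    resolved, unresolvable = {}, []
    for key, asset in item.get("assets", {}).items():
        href = asset["href"]
        if is_absolute(href):
            resolved[key] = href
        elif base and is_absolute(base):
            resolved[key] = urljoin(base, href)
        else:
            unresolvable.append(key)
    return resolved, unresolvable


# Hypothetical items: same relative asset href, with and without a self link.
item_without_self = {"links": [], "assets": {"ndvi": {"href": "./ndvi.tif"}}}
item_with_self = {
    "links": [{"rel": "self", "href": "https://example.com/items/item.json"}],
    "assets": {"ndvi": {"href": "./ndvi.tif"}},
}

print(resolve_assets(item_without_self))
print(resolve_assets(item_with_self))

# Size of the JSON payload if both items were embedded in a process graph.
payload_bytes = len(json.dumps([item_without_self, item_with_self]).encode("utf-8"))
print(payload_bytes)
```

An item whose assets end up in the unresolvable list could not be loaded when passed inline, because there is no Item URL to fall back to.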

@jdries
Contributor Author

jdries commented May 17, 2024

It's indeed limited to cases where you use absolute URLs and don't send thousands of items.
The use case is really a user who wants to point to a small number of files that are online somewhere but don't have a corresponding STAC Item online.
In general, not all of our users have a STAC API or HTTP service at hand to which they can quickly upload some items.
The process graph also becomes more self-contained if it just includes the STAC metadata.

In fact, our new load_stac sample somewhat illustrates it:
https://github.com/Open-EO/openeo-community-examples/blob/main/python/LoadStac/load-stac-item-example.ipynb
At one point it says 'make sure you upload your item'; that step is the tricky part.

@m-mohr
Member

m-mohr commented May 17, 2024

The place that would allow users to do that is the openEO /files endpoints. The original intention was that users could upload any related files there, such as GeoJSON, STAC, etc. Due to the lack of implementations, we didn't push this through the processes either, but maybe we should, to encourage it.

The other thing with the STAC example you linked to: Creating a STAC Item for this purpose seems "overkill".
You could easily just capture all information you need in a simpler format, I believe, i.e. just a list of assets:

{
    "ndvi": {
        "href": tiff_url,
        "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        "eo:bands": [ # REQUIRED: define the bands in the eo extension for openEO to be able to load it
            {
                "name": "NDVI-band",
            }
        ],
        "proj:epsg": src.crs.to_epsg(),
        "proj:shape": src.shape, # Caveat: this is [height, width] and not [width, height] if you want to set them yourself
        "proj:bbox": proj_bounds,
    }
}

I assume you don't need the geometry and the projected bbox is enough, but not sure.

Do we have a consensus across providers on what the STAC Items need to contain to be read (and maybe which optional fields allow for more efficiency)?

And then I'm wondering, why not just: load_url(tiff_url, "GTiff", {bands: ["NDVI-band"], ...})?
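
Written out as a process graph node, that load_url alternative might look like the sketch below. The url, format and options arguments follow the load_url process definition; the helper name load_url_node and the URL are hypothetical, and whether a given back-end accepts a bands option for GTiff is an assumption taken from the call above.

```python
import json


def load_url_node(url, fmt, options=None):
    """Build a load_url node; which options are supported depends on the back-end."""
    return {
        "process_id": "load_url",
        "arguments": {"url": url, "format": fmt, "options": options or {}},
        "result": True,
    }


# Placeholder URL; "bands" as an option is an assumption, not a documented GTiff option.
tiff_url = "https://example.com/data/ndvi.tif"
node = load_url_node(tiff_url, "GTiff", {"bands": ["NDVI-band"]})
print(json.dumps(node, indent=2))
```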
