This document summarizes the process of generating STAC Collections from datasets provided in the InterTwin DataLake, which is accessible using rucio. The central idea is to provide a STAC JSON that can be used in downstream analytic pipelines for the different thematic use cases in the InterTwin project. This STAC JSON is expected to contain links to Cloud-Optimized GeoTIFFs hosted on publicly accessible S3 storage, as well as an alternative link to the original datasets available in the InterTwin DataLake. The steps involved in generating these STAC JSON files and interacting with the DataLake are outlined below.
- Rucio installation for a Debian-based OS
- Pre-requisites for accessing datasets in the data lake
- Downloading specified datasets
- Generating STAC Collections using Raster2STAC
- Extending STAC JSON to contain a link to the InterTwin DataLake
- Loading datasets with downstream packages
Full documentation on how to interact with the data lake using rucio, along with a detailed introduction to important rucio terminology, is provided here. Nonetheless, we highlight some of the specific requirements for using rucio on a Debian-based OS (such as Ubuntu, which is in use at EURAC Research at the time of writing).
In your development environment, install rucio with pip:
pip install rucio-clients
This provides both the Rucio client CLI and the Rucio client Python API. However, rucio uses Gfal to download and upload data, and Gfal is not compatible with Debian, so these operations need to run in a containerized environment. The recommended Docker image for interacting with the InterTwin data lake is dvrbanec/rucio-client:latest, which requires you to mount the configuration file to run effectively:
docker run \
  -v /tmp/rucio.cfg:/opt/rucio/etc/rucio.cfg \
  --name=rucio-client \
  -it -d dvrbanec/rucio-client:latest
The details for setting up the configuration are provided in the next section.
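Because the container only works when the configuration file is mounted, it can help to check that file before launching. The following is a minimal Python sketch, not part of the project tooling: `build_docker_run` is a hypothetical helper that assembles the same `docker run` arguments shown above and stops early if the host path is not a regular file. With `-v`, Docker creates an empty directory at a missing bind-mount source, which tends to hide the mistake.

```python
from pathlib import Path


def build_docker_run(cfg_path, image="dvrbanec/rucio-client:latest", name="rucio-client"):
    """Return the argument list for launching the rucio client container.

    Hypothetical helper: it mirrors the docker command from this README
    and only adds a check that the configuration file exists on the host.
    """
    cfg = Path(cfg_path).resolve()
    if not cfg.is_file():
        # Without this check, `docker run -v` would silently create a
        # directory at the missing host path.
        raise FileNotFoundError(f"rucio config not found: {cfg}")
    return [
        "docker", "run",
        "-v", f"{cfg}:/opt/rucio/etc/rucio.cfg",
        f"--name={name}",
        "-it", "-d", image,
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` once the configuration file is in place.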
In order to access the data in the InterTwin datalake, the following pre-requisites should be met.
- Register and request access to the interTwin dev VO (dev.intertwin.eu) with your EGI Check-in credentials here.
  Once signed in with your EGI credentials, go to People --> Enroll. Search for "Join dev.intertwin.eu VO", click "Begin", and request access to the VO from there. Please note that access approval depends on the availability of the administrator. See this documentation for more details.
- Set up your rucio configuration in rucio.cfg. Here is a sample configuration we used:

  [client]
  rucio_host = https://rucio-intertwin-testbed.desy.de
  auth_host = https://rucio-intertwin-testbed-auth.desy.de
  ca_certs = /etc/ssl/certs/ca-bundle.crt
  # your EGI check-in account name
  account = <YOUR_ACCOUNT_NAME>
  auth_type = oidc
  auth_token_file_path = /tmp/rucio_oauth.token
  oidc_scope = openid profile offline_access eduperson_entitlement

  [download]
  transfer_timeout = 3600000
  preferred_impl = xrootd, rclone
- Install the certificates needed to validate rucio access to the InterTwin DataLake; see the compatible files in the provided Dockerfile.
- Run your container and start using rucio commands. Remember to also mount the path where you would like your datasets to be stored (the container-side path of a mount must be absolute):

  docker run \
    -v /tmp/rucio.cfg:/opt/rucio/etc/rucio.cfg \
    -v /data_path:/data_path \
    --name=rucio-client \
    -it -d dvrbanec/rucio-client:latest

  Then open a shell in the container and use rucio commands:

  docker exec -it rucio-client /bin/bash
  $ rucio ping
- Authenticate with rucio. Run the following command, and Rucio will give you a link to authenticate yourself:

  rucio whoami

  Follow the instructions at the link. At the end you will get a code; copy it back to Rucio in the terminal, and you will be authenticated.
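The configuration step above can also be scripted. The sketch below uses only Python's standard `configparser` to build a configuration with the same values as the sample and to flag entries that are still unset. The helper names and the choice of which keys count as required (`REQUIRED_CLIENT_KEYS`) are assumptions made for illustration, not something defined by rucio.

```python
import configparser

# Assumption: keys this sketch treats as mandatory in the [client] section.
REQUIRED_CLIENT_KEYS = ("rucio_host", "auth_host", "account", "auth_type")


def make_rucio_config(account):
    """Build a ConfigParser holding the sample settings from this README."""
    cfg = configparser.ConfigParser()
    cfg["client"] = {
        "rucio_host": "https://rucio-intertwin-testbed.desy.de",
        "auth_host": "https://rucio-intertwin-testbed-auth.desy.de",
        "ca_certs": "/etc/ssl/certs/ca-bundle.crt",
        "account": account,
        "auth_type": "oidc",
        "auth_token_file_path": "/tmp/rucio_oauth.token",
        "oidc_scope": "openid profile offline_access eduperson_entitlement",
    }
    cfg["download"] = {
        "transfer_timeout": "3600000",
        "preferred_impl": "xrootd, rclone",
    }
    return cfg


def missing_client_keys(cfg):
    """Return required [client] keys that are absent, empty, or placeholders."""
    missing = []
    for key in REQUIRED_CLIENT_KEYS:
        if not cfg.has_option("client", key):
            missing.append(key)
            continue
        value = cfg.get("client", key).strip()
        if not value or value.startswith("<"):
            missing.append(key)
    return missing
```

To write the file that gets mounted into the container, something like `with open("/tmp/rucio.cfg", "w") as fh: make_rucio_config("my_account").write(fh)` should work, where `my_account` stands in for your EGI Check-in account name.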
To download a specific dataset from the InterTwin data lake, you only need its Data IDentifier (DID). Then run:
rucio get <DID>
To perform other operations with rucio, such as uploading data and creating datasets, see the full documentation here and here.
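For scripted downloads, the command above can be wrapped in a few lines of Python. This is a minimal sketch that assumes it runs inside the container, where the `rucio` CLI is installed and already authenticated. `split_did` and `rucio_get` are hypothetical helper names; the only rucio-specific facts used are that a DID has the form `scope:name` and that `rucio get` downloads into the current working directory.

```python
import subprocess
from pathlib import Path


def split_did(did):
    """Split a DID of the form 'scope:name' into (scope, name)."""
    scope, sep, name = did.partition(":")
    if not sep or not scope or not name:
        raise ValueError(f"expected a DID like 'scope:name', got {did!r}")
    return scope, name


def rucio_get(did, dest_dir="."):
    """Run `rucio get <DID>` with dest_dir as the working directory."""
    split_did(did)  # fail early on a malformed DID
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the rucio command fails.
    return subprocess.run(["rucio", "get", did], cwd=dest, check=True)
```

Pointing `dest_dir` at the mounted data path keeps the downloaded files available on the host after the container stops.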
See the sample notebook RUCIO_STAC.ipynb for details on generating a STAC collection using Raster2STAC.
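Once a collection exists, each item is expected to carry an alternative link back to the original dataset in the data lake. The sketch below shows one way this could look, using only the standard library and treating a STAC item as a plain dictionary. It is not the project's implementation: `add_datalake_link` is a hypothetical helper, the use of the `alternate` relation type is an assumption, and the `href` is whatever URL or identifier you decide should point to the data lake copy.

```python
import copy


def add_datalake_link(item, href, title="Original dataset in the InterTwin data lake"):
    """Return a copy of a STAC item dict with an extra 'alternate' link.

    The input item is left untouched, and calling this twice with the
    same href does not add a duplicate link.
    """
    out = copy.deepcopy(item)
    link = {"rel": "alternate", "href": href, "title": title}
    links = out.setdefault("links", [])
    if link not in links:
        links.append(link)
    return out
```

The resulting dictionary can be written back with `json.dump`. Keeping the function free of side effects makes it easy to apply to every item of a collection in a loop.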
Coming soon!