Lazy access of jp2 file in private S3 bucket works, but subsequent computation outside of the session fails with HTTP response 403 #816

Open
konstntokas opened this issue Oct 29, 2024 · 2 comments
Labels
question Further information is requested

Comments

@konstntokas

Code Sample, a copy-pastable example if possible

The code sample below raises a warning with HTTP response code 403. Note that the key and secret for the bucket can be obtained from CDSE.

import rasterio
import rioxarray


uri = (
    "s3://eodata/Sentinel-2/MSI/L2A/2020/07/05/S2B_MSIL2A_20200705T101559_N0214_R065"
    "_T32TMT_20200705T135630.SAFE/GRANULE/L2A_T32TMT_A017394_20200705T101917/"
    "IMG_DATA/R10m/T32TMT_20200705T101559_B02_10m.jp2"
)
session = rasterio.session.AWSSession(
    aws_unsigned=False,
    endpoint_url="eodata.dataspace.copernicus.eu",
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
)
with rasterio.env.Env(session=session, AWS_VIRTUAL_HOSTING=False):
    ds = rioxarray.open_rasterio(uri, chunks=dict(x=1024, y=1024))

mean = ds.mean()
print(mean)
print(mean.compute())

The output is given in the following cell. When the dask graph is computed in the last line, print(mean.compute()), the actual data needs to be accessed, which raises the warning. In larger examples it raises the error 'Aborting load due to failure while reading'.

<xarray.DataArray ()> Size: 8B
dask.array<mean_agg-aggregate, shape=(), dtype=float64, chunksize=(), chunktype=numpy.ndarray>
Coordinates:
    spatial_ref  int64 8B 0
Warning 1: HTTP response code on https://eodata.s3.us-east-2.amazonaws.com/Sentinel-2/MSI/L2A/2020/07/05/S2B_MSIL2A_20200705T101559_N0214_R065_T32TMT_20200705T135630.SAFE/GRANULE/L2A_T32TMT_A017394_20200705T101917/IMG_DATA/R10m/T32TMT_20200705T101559_B02_10m.jp2.msk: 403
Warning 1: HTTP response code on https://eodata.s3.us-east-2.amazonaws.com/Sentinel-2/MSI/L2A/2020/07/05/S2B_MSIL2A_20200705T101559_N0214_R065_T32TMT_20200705T135630.SAFE/GRANULE/L2A_T32TMT_A017394_20200705T101917/IMG_DATA/R10m/T32TMT_20200705T101559_B02_10m.jp2.MSK: 403
<xarray.DataArray ()> Size: 8B
array(802.02040494)
Coordinates:
    spatial_ref  int64 8B 0

Process finished with exit code 0

When the last line is executed within the rasterio Env, it works just fine.

import rasterio
import rioxarray


uri = (
    "s3://eodata/Sentinel-2/MSI/L2A/2020/07/05/S2B_MSIL2A_20200705T101559_N0214_R065"
    "_T32TMT_20200705T135630.SAFE/GRANULE/L2A_T32TMT_A017394_20200705T101917/"
    "IMG_DATA/R10m/T32TMT_20200705T101559_B02_10m.jp2"
)
session = rasterio.session.AWSSession(
    aws_unsigned=False,
    endpoint_url="eodata.dataspace.copernicus.eu",
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
)
with rasterio.env.Env(session=session, AWS_VIRTUAL_HOSTING=False):
    ds = rioxarray.open_rasterio(uri, chunks=dict(x=1024, y=1024))

mean = ds.mean()
print(mean)

with rasterio.env.Env(session=session, AWS_VIRTUAL_HOSTING=False):
    print(mean.compute())
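
If every deferred computation has to run inside an Env, the pattern from the second sample could be factored into a small helper. The following is only a sketch: `compute_within` and `env_factory` are hypothetical names that are not part of rasterio or rioxarray, and the helper only assumes that the lazy object has a `.compute()` method.

```python
from contextlib import AbstractContextManager
from typing import Any, Callable


def compute_within(env_factory: Callable[[], AbstractContextManager], lazy: Any) -> Any:
    """Evaluate a lazy (dask-backed) object while a fresh context is active.

    env_factory is called on every use, so a new context manager is entered
    for each computation. `lazy` only needs a .compute() method.
    """
    with env_factory():
        return lazy.compute()


# Hypothetical usage with the names from the sample above:
# result = compute_within(
#     lambda: rasterio.env.Env(session=session, AWS_VIRTUAL_HOSTING=False),
#     mean,
# )
```

Passing a factory instead of an already created Env means each call enters its own context. This does not remove the need for the Env; it only keeps the session options in one place.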

Problem description

When the dask graph is computed in the last line, print(mean.compute()), the actual data needs to be accessed, which raises the warning. In larger examples it raises 'Aborting load due to failure while reading'.

Expected Output

The access credentials should somehow be saved when opening the data. Otherwise, a new environment has to be created and applied for each computation on the actual data.
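
One possible way to make the settings outlive a single Env is to export them as process environment variables, since GDAL also reads its configuration options from the environment. The sketch below is an untested assumption for this dataset and is not confirmed anywhere in this thread. `export_s3_config` is a hypothetical helper; the variable names are GDAL's /vsis3/ configuration options.

```python
import os


def export_s3_config(endpoint, access_key, secret_key,
                     virtual_hosting=False, environ=None):
    """Write GDAL's S3-related configuration options into an environment mapping.

    By default the process environment (os.environ) is modified. Pass a
    dict as `environ` to inspect the values without touching the process.
    """
    target = os.environ if environ is None else environ
    target["AWS_S3_ENDPOINT"] = endpoint
    target["AWS_ACCESS_KEY_ID"] = access_key
    target["AWS_SECRET_ACCESS_KEY"] = secret_key
    # GDAL expects a boolean-like string for this option.
    target["AWS_VIRTUAL_HOSTING"] = "TRUE" if virtual_hosting else "FALSE"
    return target


# Dry run into a plain dict; the placeholder credentials mirror the samples above.
settings = export_s3_config(
    "eodata.dataspace.copernicus.eu", "xxx", "xxx", environ={}
)
print(sorted(settings))
```

To have any effect, the call would need to run without the `environ` argument and before the first dataset is opened. Note that this makes the credentials visible to the whole process.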

Environment Information

python -c "import rioxarray; rioxarray.show_versions()"
rioxarray (0.17.0) deps:
rasterio: 1.4.1
xarray: 2024.6.0
GDAL: 3.9.3
GEOS: 3.13.0
PROJ: 9.5.0
PROJ DATA: /home/konstantin/micromamba/envs/xcube-stac/share/proj
GDAL DATA: /home/konstantin/micromamba/envs/xcube-stac/share/gdal

Other python deps:
scipy: 1.14.1
pyproj: 3.7.0

System:
python: 3.12.7 | packaged by conda-forge | (main, Oct 4 2024, 16:05:46) [GCC 13.3.0]
executable: /home/konstantin/micromamba/envs/xcube-stac/bin/python
machine: Linux-6.8.0-47-generic-x86_64-with-glibc2.35

Installation method

  • micromamba

Conda environment information (if you installed with conda):


Environment (micromamba list):
$ micromamba list | grep -E "rasterio|xarray|gdal"
  gdal                              3.9.3           py312h1299960_0          conda-forge
  libgdal                           3.9.3           ha770c72_0               conda-forge
  libgdal-core                      3.9.3           hd5b9bfb_0               conda-forge
  libgdal-fits                      3.9.3           h2db6552_0               conda-forge
  libgdal-grib                      3.9.3           hc3b29a1_0               conda-forge
  libgdal-hdf4                      3.9.3           hd5ecb85_0               conda-forge
  libgdal-hdf5                      3.9.3           h6283f77_0               conda-forge
  libgdal-jp2openjpeg               3.9.3           h1b2c38e_0               conda-forge
  libgdal-kea                       3.9.3           h1df15e4_0               conda-forge
  libgdal-netcdf                    3.9.3           hf2d2f32_0               conda-forge
  libgdal-pdf                       3.9.3           h600f43f_0               conda-forge
  libgdal-pg                        3.9.3           h5e77dd0_0               conda-forge
  libgdal-postgisraster             3.9.3           h5e77dd0_0               conda-forge
  libgdal-tiledb                    3.9.3           h4a3bace_0               conda-forge
  libgdal-xls                       3.9.3           h03c987c_0               conda-forge
  rasterio                          1.4.1           py312h8456570_0          conda-forge
  rioxarray                         0.17.0          pyhd8ed1ab_0             conda-forge
  xarray                            2024.10.0       pyhd8ed1ab_0             conda-forge



Details about micromamba and system (micromamba info):
$ micromamba info

       libmamba version : 1.5.8
     micromamba version : 1.5.8
           curl version : libcurl/8.6.0 OpenSSL/3.2.1 zlib/1.2.13 zstd/1.5.5 libssh2/1.11.0 nghttp2/1.58.0
     libarchive version : libarchive 3.7.2 zlib/1.2.13 bz2lib/1.0.8 libzstd/1.5.5
       envs directories : /home/konstantin/micromamba/envs
          package cache : /home/konstantin/micromamba/pkgs
                          /home/konstantin/.mamba/pkgs
            environment : xcube-stac (active)
           env location : /home/konstantin/micromamba/envs/xcube-stac
      user config files : /home/konstantin/.mambarc
 populated config files : 
       virtual packages : __unix=0=0
                          __linux=6.8.0=0
                          __glibc=2.35=0
                          __archspec=1=x86_64-v3
               channels : 
       base environment : /home/konstantin/micromamba
               platform : linux-64


@konstntokas konstntokas added the bug Something isn't working label Oct 29, 2024
@snowman2 snowman2 added question Further information is requested and removed bug Something isn't working labels Oct 30, 2024
@snowman2
Member

That sounds correct. The data is lazy loaded from disk. If you load in all of the data inside the session, then this likely won't be an issue.

@konstntokas
Author

I understand that. However, the idea is to lazily load the data in a reading routine and only load it later, for example when plotting. Otherwise, every operation that loads the data has to be performed within the environment session.

I downloaded one tile of the dataset and stored it in one of our private S3 buckets. When accessing this file, it works as expected: the lazy loading is done within the environment session, and operations that load the data can be done outside of the Env.

import rasterio
import rioxarray


uri = "s3://xxx/L2A_T33SXA_20150715T094306_B02_10m.jp2"

session = rasterio.session.AWSSession(
    aws_unsigned=False,
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
)
with rasterio.env.Env(session=session):
    ds = rioxarray.open_rasterio(uri, chunks=dict(x=1024, y=1024))

print(ds)
mean = ds.mean()
print(mean)
print(mean.compute())

The only difference from the example above is that there I need to set an endpoint_url and AWS_VIRTUAL_HOSTING=False. Do you think this can have an impact?
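
One hint that the endpoint setting matters is the URL in the warning from the first post: the request went to an amazonaws.com host rather than to the configured endpoint. The check below is a minimal sketch of that comparison. `hits_endpoint` is a hypothetical helper and the URL is copied from the warning output.

```python
from urllib.parse import urlsplit


def hits_endpoint(url: str, endpoint: str) -> bool:
    """Return True if the URL's host is the endpoint or a subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    endpoint = endpoint.lower()
    return host == endpoint or host.endswith("." + endpoint)


# URL taken from the first "Warning 1" line in the issue description.
WARNING_URL = (
    "https://eodata.s3.us-east-2.amazonaws.com/Sentinel-2/MSI/L2A/2020/07/05/"
    "S2B_MSIL2A_20200705T101559_N0214_R065_T32TMT_20200705T135630.SAFE/"
    "GRANULE/L2A_T32TMT_A017394_20200705T101917/IMG_DATA/R10m/"
    "T32TMT_20200705T101559_B02_10m.jp2.msk"
)

print(urlsplit(WARNING_URL).hostname)  # eodata.s3.us-east-2.amazonaws.com
print(hits_endpoint(WARNING_URL, "eodata.dataspace.copernicus.eu"))  # False
```

The mismatch suggests that the endpoint and virtual-hosting options were no longer in effect when the warning was emitted outside the Env. It does not by itself explain why the private-bucket example works without them.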
