Opening s3 file with rasterio 1.4 immediately loads the data to the machine #824

Open
phofl opened this issue Nov 1, 2024 · 4 comments
Labels
question Further information is requested

Comments

@phofl
Contributor

phofl commented Nov 1, 2024

Code Sample, a copy-pastable example if possible

A "Minimal, Complete and Verifiable Example" will make it much easier for maintainers to help you:
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

We used to open a connection to the file stored on S3 and then pass it to open_rasterio. A change in rasterio 1.4 now immediately loads all the data onto the local machine and then embeds it completely in the dask graph, making the graph explode in size:

import s3fs
import rioxarray

s3 = s3fs.S3FileSystem()
counties = rioxarray.open_rasterio(
    s3.open("s3://nwm-250m-us-counties/Counties_on_250m_grid.tif"),
    chunks="auto",
)

This now hangs and downloads the whole 1 GB to my local machine before it completes, i.e. it doesn't keep things lazy.

See rasterio/rasterio#3232 for context

Problem description

This behavior is sub-optimal: it slows us down on the client and then creates a huge dask graph. The new recommended way of dealing with this is as follows:

import s3fs
import rioxarray
import fsspec
import rasterio

s3 = s3fs.S3FileSystem()
counties = rioxarray.open_rasterio(
    "s3://nwm-250m-us-counties/Counties_on_250m_grid.tif",
    chunks="auto",
    opener=fsspec.open
)

Expected Output

Ideally, the opener would be set automatically whenever fsspec is installed. Alternatively, rioxarray could raise a warning when rasterio >= 1.4 is used and a remote/local file object is passed that runs into this condition, so that users know what to do.
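To sketch what such auto-selection might look like (purely hypothetical; choose_opener and the scheme list below are not part of rioxarray, just an illustration of the idea):

```python
# Hypothetical sketch: pick fsspec.open for remote paths when fsspec is
# installed, and fall back to None (GDAL's native handling) otherwise.
import importlib.util
from urllib.parse import urlparse

REMOTE_SCHEMES = {"s3", "gs", "az", "http", "https"}

def is_remote(path: str) -> bool:
    """Return True if the path looks like a remote URL."""
    return urlparse(str(path)).scheme in REMOTE_SCHEMES

def choose_opener(path: str):
    """Return fsspec.open for remote paths when fsspec is importable,
    otherwise None so rasterio/GDAL opens the path natively."""
    if is_remote(path) and importlib.util.find_spec("fsspec") is not None:
        import fsspec
        return fsspec.open
    return None
```

With something like this in place, open_rasterio could default to `opener=choose_opener(filename)` without changing behavior for local paths.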

Environment Information

rioxarray (0.17.1.dev0) deps:
rasterio: 1.4.2
xarray: 2024.10.0
GDAL: 3.9.3
GEOS: 0.0.0
PROJ: 9.4.1
PROJ DATA: /Users/patrick/mambaforge/envs/rioxarray/share/proj
GDAL DATA: /Users/patrick/mambaforge/envs/rioxarray/lib/python3.12/site-packages/rasterio/gdal_data

Other python deps:
scipy: 1.14.1
pyproj: 3.7.0

System:
python: 3.12.7 | packaged by conda-forge | (main, Oct 4 2024, 15:57:01) [Clang 17.0.6 ]
executable: /Users/patrick/mambaforge/envs/rioxarray/bin/python
machine: macOS-14.3.1-arm64-arm-64bit

Installation method

  • conda, pypi, from source, etc...

Conda environment information (if you installed with conda):


Environment (conda list):
$ conda list | grep -E "rasterio|xarray|gdal"


Details about conda and system ( conda info ):
$ conda info

@phofl phofl added the bug Something isn't working label Nov 1, 2024
@snowman2 snowman2 added question Further information is requested and removed bug Something isn't working labels Nov 1, 2024
@snowman2
Member

snowman2 commented Nov 1, 2024

Ideally, the opener would be set automatically whenever fsspec is installed

Not all users want this behavior as GDAL handles S3 files natively. I think the current behavior is fine with the workaround you have identified.
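For context, GDAL exposes S3 objects through its /vsis3/ virtual filesystem, which is how an s3:// URL can be read without any Python-side opener. A minimal sketch of that URL translation (the to_vsis3 helper is illustrative, not a rasterio API):

```python
# Illustrative helper: rewrite an s3:// URL as a GDAL /vsis3/ virtual path,
# the form GDAL uses for native S3 access (credentials come from the
# environment, e.g. AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
def to_vsis3(url: str) -> str:
    prefix = "s3://"
    if not url.startswith(prefix):
        raise ValueError(f"not an s3 URL: {url!r}")
    return "/vsis3/" + url[len(prefix):]
```

Because GDAL handles this path form itself, reads stay lazy and windowed without fsspec being involved at all.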

This is the current recommended approach, correct?

import rioxarray
import s3fs

s3 = s3fs.S3FileSystem()
counties = rioxarray.open_rasterio(
    "s3://nwm-250m-us-counties/Counties_on_250m_grid.tif",
    chunks="auto",
    opener=s3.open,
)

Or, using fsspec directly:

import fsspec
import rioxarray

counties = rioxarray.open_rasterio(
    "s3://nwm-250m-us-counties/Counties_on_250m_grid.tif",
    chunks="auto",
    opener=fsspec.open
)

@phofl
Contributor Author

phofl commented Nov 1, 2024

Hm that makes sense. Didn't think about GDAL.

Can we at least raise a warning, pointing to the workaround, if an already-opened remote file is passed that is now automatically loaded into memory? I am mostly worried about footguns for users that run into this the same way I did.

@snowman2
Member

snowman2 commented Nov 1, 2024

I am not aware of a good way to do that as users may actually want to load the whole file into memory.

@phofl
Contributor Author

phofl commented Nov 1, 2024

This is definitely an anti-pattern if chunks is passed as well, and something you should never do. That case seems sensible enough to warn about.
