Extracting valid values from large raster #501

samfav · 2022-04-08T00:59:17Z

Hi,
I have a large (40000x40000) GeoTIFF file in which approximately 2% of the pixels have a valid (non-NODATA) value. I want to extract those values into a columnar dataset, along with the values at the same locations in other aligned rasters.
The end result is a columnar dataset of valid points.

Here is the code I have written to perform this task:

da = rioxarray.open_rasterio(f"{BASE_PATH}large_tiff.tif")

da.where(da!=NODATA).stack(z=('x', 'y')).dropna('z').reset_index('z').to_netcdf(f"{BASE_PATH}tiff_selection.nc", engine='netcdf4')

The issue with this code is that it fills up my memory. I have tried, among other things, using different chunk sizes when opening the TIFF and dropping the NODATA values inside the where call, to no avail.
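
For a sense of scale, the back-of-the-envelope arithmetic below compares the cost of a dense intermediate array with the cost of the valid points alone. The 4-byte and 8-byte item sizes are assumptions (the dtype that `where` produces depends on the source dtype), and any index overhead added by `stack` is not counted.

```python
# Rough memory estimate for a 40000 x 40000 raster with ~2% valid pixels.
# Item sizes are assumptions: 4 bytes (float32) and 8 bytes (float64).
WIDTH = HEIGHT = 40_000
VALID_PERCENT = 2

n_pixels = WIDTH * HEIGHT
n_valid = n_pixels * VALID_PERCENT // 100  # integer math avoids float rounding


def gigabytes(n_items: int, itemsize: int) -> float:
    """Size of n_items values of itemsize bytes each, in GB (10**9 bytes)."""
    return n_items * itemsize / 1e9


dense_f32 = gigabytes(n_pixels, 4)
dense_f64 = gigabytes(n_pixels, 8)
valid_f64 = gigabytes(n_valid, 8)

print(f"pixels: {n_pixels:,}  valid: {n_valid:,}")
print(f"dense float32: {dense_f32:.1f} GB, dense float64: {dense_f64:.1f} GB")
print(f"valid values only (float64): {valid_f64:.3f} GB")
```

Under these assumptions the valid values themselves are small; it is the dense full-size intermediates that are expensive.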

Has anyone performed a similar task using rioxarray and could point me to the right way of doing this?

Ultimately I want the data to be stored as Parquet, but that's another problem.
Thanks for your help,

Sam

snowman2 (Maintainer) · 2022-04-08T13:34:18Z

rioxarray.open_rasterio loads the data lazily by default. You can take advantage of this by reading the data in windows (or simply by slicing out subsets of the array).
To keep memory from filling up, I recommend passing the cache=False kwarg when opening the file.

import pandas
import rioxarray

da = rioxarray.open_rasterio(f"{BASE_PATH}large_tiff.tif", cache=False, mask_and_scale=True)

windows = [...]
out_data = []
for window in windows:
    subset = da.rio.isel_window(window)
    out_data.append(
        subset
        .drop_vars("spatial_ref")  # drop the CRS coordinate while this is still an xarray object
        .to_dataframe(name="value")  # a DataArray needs a name; "value" is an arbitrary choice
        .reset_index()
        .dropna(subset=["value"])  # keep only valid (non-NaN) pixels
    )

pandas.concat(out_data).to_parquet(f"{BASE_PATH}tiff_selection.parquet")
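
The `windows = [...]` list is left open in the answer above. One possible way to fill it is to tile the raster into fixed-size blocks. The sketch below is an assumption, not part of the original answer: `iter_tiles` and the 4096-pixel block size are hypothetical, and it yields plain `(col_off, row_off, width, height)` tuples so that it runs without any geospatial library.

```python
from typing import Iterator, Tuple

Tile = Tuple[int, int, int, int]  # (col_off, row_off, width, height)


def iter_tiles(width: int, height: int, block: int = 4096) -> Iterator[Tile]:
    """Yield non-overlapping tiles that together cover a width x height raster."""
    if block <= 0:
        raise ValueError("block must be positive")
    for row_off in range(0, height, block):
        tile_h = min(block, height - row_off)  # the last row of tiles may be shorter
        for col_off in range(0, width, block):
            tile_w = min(block, width - col_off)  # the last column may be narrower
            yield (col_off, row_off, tile_w, tile_h)


tiles = list(iter_tiles(40_000, 40_000))
print(len(tiles), tiles[0], tiles[-1])
```

Each tuple can be turned into a window with `rasterio.windows.Window(col_off, row_off, width, height)` before it is passed to `da.rio.isel_window`.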

samfav Apr 8, 2022
Author

Thanks for the reply, I hadn't checked the cache argument.
Similar to what you propose, I ended up reading the file line by line (with window size (1,X)) and aggregating the selected points from each line at the end.

I was wondering whether it was possible to achieve a similar result without having to iterate over the windows myself. But I guess it is better to be explicit and in control of what happens in the code :)
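
The line-by-line approach described above can be sketched without any raster library: scan one row at a time, keep only the valid pixels, and look up the same positions in an aligned raster. This is an illustrative assumption about the workflow, not the code that was actually used. `iter_valid_points`, `NODATA = -9999`, and the toy rows are hypothetical; in practice each row would come from a windowed read and the comparison would be vectorised.

```python
from typing import Iterable, Iterator, Sequence, Tuple

NODATA = -9999  # hypothetical NODATA marker


def iter_valid_points(
    rows: Iterable[Sequence[float]],
    aligned_rows: Iterable[Sequence[float]],
    nodata: float = NODATA,
) -> Iterator[Tuple[int, int, float, float]]:
    """Yield (row, col, value, aligned_value) for every valid pixel.

    Both iterables are consumed one row at a time, so only a single row
    of each raster needs to be in memory at once.
    """
    for r, (row, other) in enumerate(zip(rows, aligned_rows)):
        for c, value in enumerate(row):
            if value != nodata:
                yield (r, c, value, other[c])


# Toy stand-ins for the primary raster and one aligned raster.
primary = [[NODATA, 5, NODATA], [NODATA, NODATA, NODATA], [7, NODATA, 9]]
aligned = [[10, 11, 12], [13, 14, 15], [16, 17, 18]]

points = list(iter_valid_points(primary, aligned))
print(points)
```

The resulting tuples map directly onto the columns of the final table (position, value, and one column per aligned raster).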
