-
Hey again! Sorry for the back-to-back questions, hope it's not too much trouble! I've been updating a lot of our old code to use rioxarray instead of rasterio, which has led to a few questions. We have a notebook that loops through 46 Landsat images, opens them, clips them, and masks out cloud values. When we use rasterio to do this, the whole notebook can run in under one second. With rioxarray, it takes over 30 seconds (I timed them). The slow part seems to be opening and clipping all those files in rioxarray. By commenting out line by line, these lines seem to be the ones that slow down the code the most.
I realize these are somewhat complex functions, but the same thing was done in rasterio very quickly. Do you have any hint as to why this may be taking so long? Any help would be appreciated! If you need more details I can provide those too.
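A lighter-weight way to pin down the slow step than commenting out lines one by one is to wrap each candidate step in a small timer. A stdlib-only sketch (the `timer` helper and the label are hypothetical, not from the notebook):

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    # Print the wall-clock time spent inside the `with` block.
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.3f} s")

# Hypothetical usage around a suspect step, e.g. the open/clip calls:
with timer("open + clip"):
    total = sum(range(1_000_000))  # stand-in for the real work
```

Wrapping each stage (open, clip, mask) separately makes it easy to see which one dominates the 30 seconds.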
-
There are several ways to speed things up:

1. Pass `from_disk=True` when clipping. This will reduce the amount of data loaded in (rioxarray 0.2+).
2. You don't need `.apply(mapping)` anymore; the geometries can be passed to `clip` directly.
3. Use `rioxarray.set_options(export_grid_mapping=False)`.

Putting it all together:

```python
import os
from glob import glob

import earthpy.spatial as es
import geopandas as gpd
import numpy as np
import pandas as pd
import pyproj
import rioxarray as rxr

# Use for rioxarray < 0.3
# pyproj.set_use_global_context(True)

# Get a list of each site directory
path = os.path.join("ndvi-automation", "sites")
sites = glob(path + "/*/")

# Define columns and list used to create the pandas dataframe
col_names = ["site", "date", "mean_ndvi"]
output = []

for site in sites:
    # Get site name
    site_name = os.path.basename(os.path.normpath(site))

    # Define directories
    landsat_dir = os.path.join(site, "landsat-crop")
    vector_dir = os.path.join(site, "vector")

    # Define site boundary
    site_boundary_path = os.path.join(vector_dir, site_name + "-crop.shp")
    site_bound = gpd.read_file(site_boundary_path)

    # Get list of landsat scene directories
    cropped_directories = sorted(glob(os.path.join(landsat_dir, "LC08*")))

    # Calculate NDVI for each directory
    for directory in cropped_directories:
        # Get date from directory name
        dir_head, dir_tail = os.path.split(directory)
        date = dir_tail[10:18]

        # Open bands (band 4 = red, band 5 = NIR for Landsat 8)
        raster_list = sorted(glob(os.path.join(directory, "*band*.tif")))
        with rxr.set_options(export_grid_mapping=False):
            band_4_crop_data = rxr.open_rasterio(
                raster_list[3], masked=True,
            ).rio.clip(
                site_bound.geometry, from_disk=True,
            ).squeeze()
            band_5_crop_data = rxr.open_rasterio(
                raster_list[4], masked=True,
            ).rio.clip(
                site_bound.geometry, from_disk=True,
            ).isel(band=0)

        # Mask out values outside the valid range
        band_4_crop_data = band_4_crop_data.where(
            ~((band_4_crop_data < 0) | (band_4_crop_data > 10000))
        )
        band_5_crop_data = band_5_crop_data.where(
            ~((band_5_crop_data < 0) | (band_5_crop_data > 10000))
        )

        # Calculate NDVI for the site using the masked data
        ndvi_landsat = es.normalized_diff(
            band_5_crop_data.values.astype("f4"),
            band_4_crop_data.values.astype("f4"),
        )

        # Add site, date, and mean NDVI to the output list
        output.append([site_name, date, np.nanmean(ndvi_landsat)])

# Convert the list to a pandas dataframe and save to csv
ndvi_ts_unclean = pd.DataFrame(output, columns=col_names)
out_path_no_clean = os.path.join("outputs", "ndvi_2017_not_clean.csv")
ndvi_ts_unclean.to_csv(out_path_no_clean)

ndvi_df_unclean = pd.read_csv(
    out_path_no_clean,
    parse_dates=["date"],
    na_values=[-9999],
    index_col=["date"],
)
```
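For reference, `earthpy.spatial.normalized_diff(b1, b2)` computes the standard normalized difference `(b1 - b2) / (b1 + b2)`, which is NDVI when `b1` is the NIR band (band 5) and `b2` is the red band (band 4). A plain-NumPy sketch of the same calculation (the function name here is made up for illustration):

```python
import numpy as np

def normalized_diff_sketch(b1, b2):
    # (b1 - b2) / (b1 + b2), computed in float32 to avoid integer division
    b1 = np.asarray(b1, dtype="f4")
    b2 = np.asarray(b2, dtype="f4")
    return (b1 - b2) / (b1 + b2)

nir = np.array([5000.0, 3000.0])  # band 5 (NIR), scaled reflectance
red = np.array([1000.0, 1000.0])  # band 4 (red), scaled reflectance
print(normalized_diff_sketch(nir, red))  # per-pixel NDVI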
-
Hey! I just tried out the changes; sadly they didn't make too big of a difference. Before the changes the notebook took 27 seconds to run; after the changes it took 25. So faster! But not by a noticeable amount.
-
OK, let me try to summarize the changes that I see here.
I also see this approach.
Is it better to use squeeze or to select a band? squeeze seems to work nicely! In short, does my comment above summarize the tweaks to make things faster?
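On the squeeze-vs-select question: for a single-band raster, both remove the length-1 band dimension and yield the same values. This can be illustrated with a plain NumPy array standing in for the (band, y, x) data (a sketch, not rioxarray itself):

```python
import numpy as np

# A stand-in for a single-band raster with shape (band=1, y=2, x=3)
data = np.arange(6, dtype="f4").reshape(1, 2, 3)

squeezed = data.squeeze()  # drops ALL length-1 dimensions
selected = data[0]         # selects band 0, like .isel(band=0)

print(squeezed.shape, selected.shape)      # (2, 3) (2, 3)
print(np.array_equal(squeezed, selected))  # True
```

One difference worth noting: `squeeze` drops every size-1 dimension, while selecting the band only removes the band dimension, so selecting is the more explicit choice when other dimensions could also have length 1.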
-
I created a summary notebook that shows (significant!!) speed gains using from_disk and the pyproj global context (at the bottom of the notebook there are timed sections). I do plan to add this as a lesson to our website as well!! Thank you for the help. I read through the pyproj documentation but still don't understand why that global context option is "dangerous". Is there something else that I can read to better understand what the dangers are? Does it somehow use more processing power (threads?)? I may just not understand what it's doing.