-
Hey again! Sorry for the back-to-back questions, hope it's not too much trouble! I've been updating a lot of our old code to use rioxarray instead of rasterio, which has led to a few questions. We have a notebook that loops through 46 Landsat images, opens them, clips them, and masks out cloud values. When we use rasterio to do this, the whole notebook can run in under one second. With rioxarray, it takes over 30 seconds (I timed them). The slow part seems to be opening and clipping all those files in rioxarray. By commenting out line by line, these lines seem to be the ones that slow down the code the most.
I realize these are somewhat complex functions, but the same thing was done in rasterio very quickly. Do you have any hint as to why this may be taking so long? Any help would be appreciated! If you need more details I can provide those too.
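A lighter-weight way to pin down the slow step than commenting out lines one by one is to wrap each candidate step in a small timer. A stdlib-only sketch (the `timer` helper and the label are hypothetical, not from the notebook):

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    # Print the wall-clock time spent inside the `with` block.
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.3f} s")

# Hypothetical usage around a suspect step, e.g. the open/clip calls:
with timer("open + clip"):
    total = sum(range(1_000_000))  # stand-in for the real work
```

Wrapping each stage (open, clip, mask) separately makes it easy to see which one dominates the 30 seconds.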
-
There are several ways to speed things up:

1. Pass `from_disk=True` when clipping. This will reduce the amount of data loaded in (rioxarray 0.2+).
2. You don't need `.apply(mapping)` anymore; the geometries can be passed to `clip` directly.
3. Use `rioxarray.set_options(export_grid_mapping=False)`.

Putting it all together:

```python
import os
from glob import glob

import earthpy.spatial as es
import geopandas as gpd
import numpy as np
import pandas as pd
import pyproj
import rioxarray as rxr

# Use for rioxarray < 0.3
# pyproj.set_use_global_context(True)

# Get a list of each site directory
path = os.path.join("ndvi-automation", "sites")
sites = glob(path + "/*/")

# Define columns and list used to create the pandas dataframe
col_names = ["site", "date", "mean_ndvi"]
output = []

for site in sites:
    # Get site name
    site_name = os.path.basename(os.path.normpath(site))

    # Define directories
    landsat_dir = os.path.join(site, "landsat-crop")
    vector_dir = os.path.join(site, "vector")

    # Define site boundary
    site_boundary_path = os.path.join(vector_dir, site_name + "-crop.shp")
    site_bound = gpd.read_file(site_boundary_path)

    # Get list of landsat scene directories
    cropped_directories = sorted(glob(os.path.join(landsat_dir, "LC08*")))

    # Calculate NDVI for each directory
    for directory in cropped_directories:
        # Get date from directory name
        dir_head, dir_tail = os.path.split(directory)
        date = dir_tail[10:18]

        # Open bands (band 4 = red, band 5 = NIR for Landsat 8)
        raster_list = sorted(glob(os.path.join(directory, "*band*.tif")))
        with rxr.set_options(export_grid_mapping=False):
            band_4_crop_data = rxr.open_rasterio(
                raster_list[3], masked=True,
            ).rio.clip(
                site_bound.geometry, from_disk=True,
            ).squeeze()
            band_5_crop_data = rxr.open_rasterio(
                raster_list[4], masked=True,
            ).rio.clip(
                site_bound.geometry, from_disk=True,
            ).isel(band=0)

        # Mask out values outside the valid range
        band_4_crop_data = band_4_crop_data.where(
            ~((band_4_crop_data < 0) | (band_4_crop_data > 10000))
        )
        band_5_crop_data = band_5_crop_data.where(
            ~((band_5_crop_data < 0) | (band_5_crop_data > 10000))
        )

        # Calculate NDVI for the site using the masked data
        ndvi_landsat = es.normalized_diff(
            band_5_crop_data.values.astype("f4"),
            band_4_crop_data.values.astype("f4"),
        )

        # Add site, date, and mean NDVI to the output list
        output.append([site_name, date, np.nanmean(ndvi_landsat)])

# Convert the list to a pandas dataframe and save to csv
ndvi_ts_unclean = pd.DataFrame(output, columns=col_names)
out_path_no_clean = os.path.join("outputs", "ndvi_2017_not_clean.csv")
ndvi_ts_unclean.to_csv(out_path_no_clean)

ndvi_df_unclean = pd.read_csv(
    out_path_no_clean,
    parse_dates=["date"],
    na_values=[-9999],
    index_col=["date"],
)
```
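For reference, `earthpy.spatial.normalized_diff(b1, b2)` computes the standard normalized difference `(b1 - b2) / (b1 + b2)`, which is NDVI when `b1` is the NIR band (band 5) and `b2` is the red band (band 4). A plain-NumPy sketch of the same calculation (the function name here is made up for illustration):

```python
import numpy as np

def normalized_diff_sketch(b1, b2):
    # (b1 - b2) / (b1 + b2), computed in float32 to avoid integer division
    b1 = np.asarray(b1, dtype="f4")
    b2 = np.asarray(b2, dtype="f4")
    return (b1 - b2) / (b1 + b2)

nir = np.array([5000.0, 3000.0])  # band 5 (NIR), scaled reflectance
red = np.array([1000.0, 1000.0])  # band 4 (red), scaled reflectance
print(normalized_diff_sketch(nir, red))  # per-pixel NDVI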
-
Hey! I just tried out the changes; sadly they didn't make too big of a difference. Before the changes the notebook took 27 seconds to run; after the changes it took 25. So faster! But not by a noticeable amount.
-
OK, let me try to summarize the changes that I see here.
I also see this approach.
Is it better to use squeeze or to select a band? squeeze seems to work nicely! In short, does my comment above summarize the tweaks to make things faster?
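On the squeeze-vs-select question: for a single-band raster, both remove the length-1 band dimension and yield the same values. This can be illustrated with a plain NumPy array standing in for the (band, y, x) data (a sketch, not rioxarray itself):

```python
import numpy as np

# A stand-in for a single-band raster with shape (band=1, y=2, x=3)
data = np.arange(6, dtype="f4").reshape(1, 2, 3)

squeezed = data.squeeze()  # drops ALL length-1 dimensions
selected = data[0]         # selects band 0, like .isel(band=0)

print(squeezed.shape, selected.shape)      # (2, 3) (2, 3)
print(np.array_equal(squeezed, selected))  # True
```

One difference worth noting: `squeeze` drops every size-1 dimension, while selecting the band only removes the band dimension, so selecting is the more explicit choice when other dimensions could also have length 1.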
-
I created a summary notebook that shows (significant!!) speed gains using from_disk and the pyproj global context (at the bottom of the notebook there are timed sections). I do plan to add this as a lesson to our website as well!! Thank you for the help. I read through the pyproj documentation but still don't understand why that global context option is "dangerous". Is there something else that I can read to better understand what the dangers are? Does it somehow use more processing power (threads?)? I may just not understand what it's doing.