
[WIP] Add daymet #213

Draft: wants to merge 47 commits into master

Conversation

yuvipanda (Contributor)

No description provided.

@yuvipanda yuvipanda marked this pull request as draft October 28, 2022 18:34

@TomAugspurger (Contributor)

Thanks Yuvi! 🙌

I'm curious to see how this runs on the Beam branch. The NA-daily job was running really slowly when I tried it with pangeo-forge-recipes main (executed on a Dask cluster), and I haven't had a chance to investigate why.

@yuvipanda (Contributor, Author)

@TomAugspurger yay! Also, can I bring you onto the pangeo-forge Slack somehow, maybe? :)

@TomAugspurger also, I'm curious whether we can get this data via https://cmr.earthdata.nasa.gov/search/concepts/C2031536952-ORNL_CLOUD.html instead? Is Earthdata Login what's holding that back?

@yuvipanda (Contributor, Author)

@TomAugspurger OK, I just tried to get this purely from CMR for the daily run, and I'm running into:


Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 983, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "/Users/yuvipanda/.local/share/virtualenvs/pangeo-forge-runner/lib/python3.9/site-packages/apache_beam/transforms/core.py", line 1877, in <lambda>
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/executors/beam.py", line 14, in _no_arg_stage
    fun(config=config)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py", line 530, in prepare_target
    with open_chunk(chunk_key, config=config) as ds:
  File "/srv/conda/envs/notebook/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py", line 414, in open_chunk
    ds = xr.concat(dsets, config.concat_dim, **config.xarray_concat_kwargs)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/core/concat.py", line 243, in concat
    return _dataset_concat(
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/core/concat.py", line 558, in _dataset_concat
    raise ValueError(f"{name!r} is not present in all datasets.")
ValueError: 'dayl' is not present in all datasets.

@yuvipanda (Contributor, Author)

Which makes sense, as I think it's one file per region, per year, per variable.

@yuvipanda (Contributor, Author)

I'm making this into one recipe per region per variable, chunked across time. From conversations with ORNL folks, it would also be exciting to have this chunked across lat/lon, so you can get historical info for a single 'pixel', like https://daymet.ornl.gov/single-pixel/.

In this case, I'd imagine it would be the same recipes as otherwise, just chunked by lat/lon?
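
As a rough sketch of what that could look like (assuming XarrayZarrRecipe's target_chunks option; the y/x dimension names and chunk sizes below are illustrative, not tuned):

from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Hypothetical per-variable, per-region file list (e.g. gathered from CMR).
file_urls = [
    "https://data.ornldaac.earthdata.nasa.gov/protected/daymet/Daymet_Daily_V4/data/daymet_v4_daily_hi_dayl_2010.nc",
    "https://data.ornldaac.earthdata.nasa.gov/protected/daymet/Daymet_Daily_V4/data/daymet_v4_daily_hi_dayl_2011.nc",
]

pattern = pattern_from_file_sequence(file_urls, "time", nitems_per_file=365)

# Small spatial chunks and long time chunks, so a single-'pixel' time series is
# cheap to read back from the Zarr store.
recipe = XarrayZarrRecipe(
    pattern,
    target_chunks={"time": 365, "y": 64, "x": 64},
)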

@yuvipanda (Contributor, Author)

Hmm, I should probably fold the regions in and just make one recipe per variable?

@yuvipanda (Contributor, Author)

I swear my commit messages are usually of better quality :) I'll squash and whatnot before this is final.

@yuvipanda (Contributor, Author)

Now this PR reads the data file list via CMR!
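
For reference, the underlying CMR granule search can be sketched with plain requests against the public endpoint (first page only, no error handling; the actual listing in this PR goes through pangeo_forge_cmr):

import requests

CMR_GRANULES_URL = "https://cmr.earthdata.nasa.gov/search/granules.json"

def get_granule_urls(short_name="Daymet_Daily_V4_1840", page_size=2000):
    """Return data-file URLs for the granules in a CMR collection (first page only)."""
    resp = requests.get(CMR_GRANULES_URL, params={"short_name": short_name, "page_size": page_size})
    resp.raise_for_status()
    entries = resp.json()["feed"]["entry"]
    # Each granule entry lists several links; the data file is the one whose rel ends in "/data#".
    return [
        link["href"]
        for entry in entries
        for link in entry.get("links", [])
        if link.get("rel", "").endswith("/data#")
    ]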

@yuvipanda (Contributor, Author)

I think maybe the workers ran out of memory here? I bumped up the node size used in Dataflow from n1-highmem-2 to n1-highmem-8.

@yuvipanda (Contributor, Author)

I see a lot of this in the logs:

[2022-10-28T21:44:17.124660+00:00] [docker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187] /srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/core/indexing.py:1380: PerformanceWarning: Slicing is producing a large chunk. To accept the large
[2022-10-28T21:44:17.124708+00:00] [docker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187] chunk and silence this warning, set the option
[2022-10-28T21:44:17.124716+00:00] [docker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187]     >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
[2022-10-28T21:44:17.124722+00:00] [docker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187]     ...     array[indexer]
[2022-10-28T21:44:17.124727+00:00] [docker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187] 
[2022-10-28T21:44:17.124732+00:00] [docker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187] To avoid creating the large chunks, set the option
[2022-10-28T21:44:17.124737+00:00] [docker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187]     >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
[2022-10-28T21:44:17.124742+00:00] [docker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187]     ...     array[indexer]
[2022-10-28T21:44:17.124750+00:00] [docker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187]   value = value[(slice(None),) * axis + (subkey,)]

Given that this is happening at the same time as:

[2022-10-28T21:44:17.746232+00:00] [worker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187] Opening input with Xarray Index({DimIndex(name='time', index=121, sequence_len=156, operation=<CombineOp.CONCAT: 2>)}): 'https://data.ornldaac.earthdata.nasa.gov/protected/daymet/Daymet_Daily_V4/data/daymet_v4_daily_na_dayl_2010.nc'
[2022-10-28T21:44:17.746490+00:00] [worker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187] Opening 'https://data.ornldaac.earthdata.nasa.gov/protected/daymet/Daymet_Daily_V4/data/daymet_v4_daily_na_dayl_2010.nc' from cache
[2022-10-28T21:44:19.788892+00:00] [worker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187] Opening input with Xarray Index({DimIndex(name='time', index=122, sequence_len=156, operation=<CombineOp.CONCAT: 2>)}): 'https://data.ornldaac.earthdata.nasa.gov/protected/daymet/Daymet_Daily_V4/data/daymet_v4_daily_hi_dayl_2010.nc'
[2022-10-28T21:44:19.789208+00:00] [worker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187] Opening 'https://data.ornldaac.earthdata.nasa.gov/protected/daymet/Daymet_Daily_V4/data/daymet_v4_daily_hi_dayl_2010.nc' from cache
[2022-10-28T21:44:20.046337+00:00] [worker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187] Opening input with Xarray Index({DimIndex(name='time', index=123, sequence_len=156, operation=<CombineOp.CONCAT: 2>)}): 'https://data.ornldaac.earthdata.nasa.gov/protected/daymet/Daymet_Daily_V4/data/daymet_v4_daily_hi_dayl_2011.nc'
[2022-10-28T21:44:20.046594+00:00] [worker:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187] Opening 'https://data.ornldaac.earthdata.nasa.gov/protected/daymet/Daymet_Daily_V4/data/daymet_v4_daily_hi_dayl_2011.nc' from cache

I suspected this was maybe because the source nc file was too large, but it is barely a megabyte. Maybe the destination is too large.

This is the first recipe I'm really writing so a lot of learning!
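
As an aside, the option named in that PerformanceWarning can be set globally in the recipe module; a minimal sketch (whether it actually helps with the memory behaviour here is unclear):

import dask

# Ask dask to split large chunks produced by slicing rather than keeping them whole.
dask.config.set({"array.slicing.split_large_chunks": True})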

@yuvipanda (Contributor, Author)

Yep, definitely running out of memory:

[2022-10-28T21:55:42.920516+00:00] [system:dayl-bba2fd66df972afa41b1-10281438-er4b-harness-6187] Out of memory: Killed process 6534 (python) total-vm:102006872kB, anon-rss:45809068kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:90608kB oom_score_adj:900

Not exactly sure why; probably something about the chunking.

@TomAugspurger (Contributor) left a comment:


> @TomAugspurger also, I'm curious whether we can get this data via https://cmr.earthdata.nasa.gov/search/concepts/C2031536952-ORNL_CLOUD.html instead? Is Earthdata Login what's holding that back?

More me not understanding how earthdata login works. Glad to see you've got it working, and I'm pleased to learn about pangeo_forge_cmr.

recipes[var] = XarrayZarrRecipe(
    pattern_from_file_sequence(
        var_files[var],
        # FIXME: Leap years?!
@TomAugspurger (Contributor), Oct 29, 2022:


They just drop December 31st on leap years :)

}

# Get the GPM IMERG Late Precipitation Daily data
shortname = 'Daymet_Daily_V4_1840'

Is the "1840" here specific to daily - North America? And other regions will have different numerical IDs here?


Yes, it is daily, but all spatial areas are distributed under that DOI. The granule-level file name distinguishes the region: na, pr, hi. We just updated the dataset, and 1840 is now 2129: https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=2129


new short name: Daymet_Daily_V4R1_2129

@derekocallaghan (Contributor)

Hi @yuvipanda, quick question about this recipe. I saw that you're specifying aiohttp.BasicAuth:

https://github.com/yuvipanda/staged-recipes/blob/5a50ed2ed1212e77a7615cbbbc285c7163dce465/recipes/daymet/recipe.py#L15

client_kwargs = {
    'auth': aiohttp.BasicAuth(username, password),
    'trust_env': True,
}

...

            fsspec_open_kwargs=dict(
                client_kwargs=client_kwargs
            ),

I wanted to check whether this was successfully serialized when generating the recipe hash (pangeo-forge/pangeo-forge-recipes#429).

@andersy005 encountered a problem where aiohttp client_kwargs couldn't be serialized in a feedstock recipe run; a couple of workarounds are suggested here: pangeo-forge/eooffshore_ics_ccmp_v02_1_nrt_wind-feedstock#3 (comment)

I'd have thought there'd be a similar issue with BasicAuth. I've replicated it locally (see pangeo_forge_recipes.serialization), and it doesn't seem to be serializable:

In [4]: import inspect
   ...: from collections.abc import Collection
   ...: from dataclasses import asdict
   ...: from enum import Enum
   ...: from hashlib import sha256
   ...: from json import dumps
   ...: from typing import Any, List, Sequence

In [5]: def either_encode_or_hash(obj: Any):
   ...:     """For objects which are not serializable with ``json.dumps``, this function defines
   ...:     type-specific handlers which extract either a serializable value or a hash from the object.
   ...:     :param obj: Any object which is not serializable to ``json``.
   ...:     """
   ...: 
   ...:     if isinstance(obj, Enum):  # custom serializer for FileType, CombineOp, etc.
   ...:         return obj.value
   ...:     elif hasattr(obj, "sha256"):
   ...:         return obj.sha256.hex()
   ...:     elif inspect.isfunction(obj):
   ...:         return inspect.getsource(obj)
   ...:     elif isinstance(obj, bytes):
   ...:         return obj.hex()
   ...:     raise TypeError(f"object of type {type(obj).__name__} not serializable")

In [11]: ba = aiohttp.BasicAuth(login="test", password="test")

In [12]: either_encode_or_hash(ba)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [12], in <cell line: 1>()
----> 1 either_encode_or_hash(ba)

Input In [5], in either_encode_or_hash(obj)
     13 elif isinstance(obj, bytes):
     14     return obj.hex()
---> 15 raise TypeError(f"object of type {type(obj).__name__} not serializable")

TypeError: object of type BasicAuth not serializable

@yuvipanda (Contributor, Author)

@derekocallaghan yeah, I'm definitely running into that too! However, this isn't an actual serialization issue, but something to do with the hashing (which IMO should probably just be removed?). To unblock myself temporarily while I ramp up on writing more recipes, I've applied the following patch to pangeo-forge-recipes :D

diff --git a/pangeo_forge_recipes/serialization.py b/pangeo_forge_recipes/serialization.py
index c7fb42a..d2c8828 100644
--- a/pangeo_forge_recipes/serialization.py
+++ b/pangeo_forge_recipes/serialization.py
@@ -22,7 +22,8 @@ def either_encode_or_hash(obj: Any):
         return inspect.getsource(obj)
     elif isinstance(obj, bytes):
         return obj.hex()
-    raise TypeError(f"object of type {type(obj).__name__} not serializable")
+    return bytes.fromhex("6ac3c336e4094835293a3fed8a4b5fedde1b5e2626d9838fed50693bba00af0e")
+    # raise TypeError(f"object of type {type(obj).__name__} not serializable")
 
 
 def dict_to_sha256(dictionary: dict) -> bytes:

@derekocallaghan (Contributor)

Hi @yuvipanda, yep, the hashing in serialization.py has raised some issues recently. It may be the case that we can exclude client_kwargs, or even fsspec_open_kwargs, from the hashing (see below), as I guess they could contain arbitrary values/objects. E.g. this possible workaround:

FilePattern has its own sha256() implementation, and although fsspec_open_kwargs (and client_kwargs) is currently included, perhaps it could be excluded (similar to the pattern format_function and combine dims, which are currently excluded), or particular kwargs (e.g. any aiohttp instances) could be dropped from fsspec_open_kwargs prior to serialization. I guess timeout/credential values aren't necessary for a recipe hash?

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/3441d94a290d296b8f62638df15bb60993e86b1d/pangeo_forge_recipes/patterns.py#L299

    # we exclude the format function and combine dims from ``root`` because they determine the
    # index:filepath pairs yielded by iterating over ``.items()``. if these pairs are generated in
    # a different way in the future, we ultimately don't care.
    root = {
        "fsspec_open_kwargs": pattern.fsspec_open_kwargs,
        "query_string_secrets": pattern.query_string_secrets,
        "file_type": pattern.file_type,
        "nitems_per_file": {
            op.name: op.nitems_per_file  # type: ignore
            for op in pattern.combine_dims
            if op.name in pattern.concat_dims
        },
    }

@yuvipanda (Contributor, Author)

@derekocallaghan I think the recipe hash should be an allow-list, including only the specific things it wants to track, rather than excluding specific things. I'm not exactly sure what this hash is actually used for right now; do you know?
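
As a very rough sketch of what I mean by an allow-list (purely illustrative, not what pangeo-forge-recipes does today; the field names are placeholders):

from hashlib import sha256
from json import dumps

# Hash only an explicit allow-list of fields; everything else (e.g. an aiohttp.BasicAuth
# buried in fsspec_open_kwargs) is simply ignored.
ALLOWED_KEYS = {"file_type", "nitems_per_file"}

def allowlist_sha256(fields: dict) -> bytes:
    kept = {k: v for k, v in fields.items() if k in ALLOWED_KEYS}
    # default=str keeps enums and other odd values hashable without custom encoders.
    return sha256(dumps(kept, sort_keys=True, default=str).encode("utf-8")).digest()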

@yuvipanda (Contributor, Author)

There are two separate serialization issues here, though: one related to Beam serialization, and one related to hashing to get a hash ID for the recipe. They probably both need different solutions as well.

@derekocallaghan (Contributor)

Yeah, previously the hash was created on demand, where BaseRecipe.sha256() was called from BaseRecipe.get_execution_context(), which itself was called from XarrayZarrRecipe.prepare_target(), and when generating a default job name in pangeo_forge_runner.commands.bake.Bake. I couldn't see where it was used apart from that at the time.

Agree that an allow_list is preferable.

With your hash workaround above, does the subsequent Beam-related pickling/serialization work?

@yuvipanda (Contributor, Author)

@derekocallaghan I think so. I'm running it with the local direct runner and was able to generate a full series of one particular variable just for HI!


I've just pushed my latest changes. I'm trying to get a couple steps running for all of the regions and variables.

I'm producing one recipe per variable per region, partly to see if I can get that to work before trying to merge the variables into one dimension. @jbusecke made me realize we can't actually easily combine the three regions into one!

@yuvipanda (Contributor, Author)

I'm testing this locally in the following way:

  1. Make a local_runner_config.py file with the following:
import pathlib

HERE = pathlib.Path(__file__).parent
c.TargetStorage.fsspec_class = "fsspec.implementations.local.LocalFileSystem"
c.TargetStorage.root_path = f"file://{HERE}/storage/output/{{job_id}}"

c.InputCacheStorage.root_path = f"file://{HERE}/storage/cache"
c.InputCacheStorage.fsspec_class = c.TargetStorage.fsspec_class

c.MetadataCacheStorage.root_path = f"file://{HERE}/storage/metadata/{{job_id}}"
c.MetadataCacheStorage.fsspec_class = c.TargetStorage.fsspec_class

c.Bake.bakery_class = "pangeo_forge_runner.bakery.local.LocalDirectBakery"
  2. Install pangeo-forge-runner from master
  3. Install pangeo-forge-recipes from master, with the patch indicated.
  4. Run it from this repo with: pangeo-forge-runner bake --repo . --feedstock-subdir=recipes/daymet -f local_runner_config.py --prune

@yuvipanda (Contributor, Author)

Currently it fails with the following when trying to run the NA files:

Worker: severity: WARN timestamp {   seconds: 1667987210   nanos: 63492059 } message: "Variable tmin of 92123153000 bytes is 184.25 times larger than specified maximum variable array size of 500000000 bytes. Consider re-instantiating recipe with `subset_inputs = {\"time\": 185}`. If `len(ds[\"time\"])` < 185, substitute \"time\" for any name in ds[\"tmin\"].dims with length >= 185 or consider subsetting along multiple dimensions. Setting PANGEO_FORGE_MAX_MEMORY env variable changes the variable array size which will trigger this warning." instruction_id: "bundle_1608" transform_id: "Start|cache_input|Reshuffle_000|prepare_target|Reshuffle_001|store_chunk|Reshuffle_002|finalize_target|Reshuffle_003/store_chunk/Execute" log_location: "/Users/yuvipanda/code/pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py:621" thread: "Thread-14" 
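
Following that hint, a minimal sketch of what re-instantiating with subset_inputs might look like (pattern stands in for the existing NA file pattern; 185 is just the value the warning computed for tmin):

from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Split each yearly input into 185 pieces along time, so no single variable chunk
# exceeds the ~500 MB default limit mentioned in the warning.
recipe = XarrayZarrRecipe(
    pattern,
    subset_inputs={"time": 185},
)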
