[WIP] Add daymet #213
base: master
Conversation
hacked from https://github.com/TomAugspurger/daymet-recipe |
Thanks Yuvi! 🙌 I'm curious to see how this runs on the Beam branch. The NA-daily job was running really slowly when I tried it with pangeo-forge-recipes main (executed on a Dask cluster), and I haven't had a chance to investigate why. |
@TomAugspurger yay! Also, can I bring you onto the pangeo-forge Slack somehow, maybe? :) @TomAugspurger also, I'm curious if we can get this data via https://cmr.earthdata.nasa.gov/search/concepts/C2031536952-ORNL_CLOUD.html instead? Is Earthdata Login what is holding that back? |
@TomAugspurger ok, I just tried to get this purely from CMR for the daily run, and running into:
|
Which makes sense, as I think it's one file per region, per year, per variable |
I'm making this into one recipe per region per variable, chunked across time. From conversations with ORNL folks, it would also be exciting to have this be chunked across lat/lon, so you can get historical info for a single 'pixel', like https://daymet.ornl.gov/single-pixel/. In this case, I'd imagine it would be the same recipes as otherwise, but just chunked by lat/lon? |
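The two access patterns above could be sketched as alternative Zarr target-chunk layouts. The chunk sizes here are illustrative assumptions, not values from the recipe:

```python
# Two chunking layouts for the same variable; sizes are illustrative.
# "Map" chunks favor reading whole-region snapshots; "pixel" chunks favor
# reading the full history of one location, like daymet.ornl.gov/single-pixel.
map_chunks = {"time": 365, "y": 500, "x": 500}
pixel_chunks = {"time": 365 * 40, "y": 32, "x": 32}

def chunk_megabytes(chunks, dtype_size=4):
    # float32 elements -> in-memory size of one chunk, in MB
    n = 1
    for size in chunks.values():
        n *= size
    return n * dtype_size / 1e6

print(round(chunk_megabytes(map_chunks)), round(chunk_megabytes(pixel_chunks)))
# 365 60
```

Either layout keeps chunks in the tens-to-hundreds-of-MB range; the difference is purely which queries touch few chunks.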
Hmm, I should probably fold region inside, and just make one recipe per variable? |
I swear my commit messages are usually of better quality :) I'll squash and what not before final. |
Now this PR reads the data list via CMR! |
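For reference, a minimal sketch of the CMR granule search this relies on. The endpoint is NASA's public CMR search REST API; the parsing below assumes the standard `granules.json` feed/entry shape, and the sample href is a hypothetical file name, not a real granule:

```python
from urllib.parse import urlencode

CMR = "https://cmr.earthdata.nasa.gov/search/granules.json"

def granule_search_url(short_name, provider="ORNL_CLOUD", page_size=2000):
    # Build a CMR granule-search URL for a collection short name
    params = {"short_name": short_name, "provider": provider, "page_size": page_size}
    return f"{CMR}?{urlencode(params)}"

def data_links(response_json):
    # Extract the data-download links from a granules.json response;
    # CMR marks them with a rel ending in "/data#"
    links = []
    for entry in response_json["feed"]["entry"]:
        for link in entry.get("links", []):
            if link.get("rel", "").endswith("/data#"):
                links.append(link["href"])
    return links

# Offline demonstration with a minimal mocked response (hypothetical href):
sample = {"feed": {"entry": [{"links": [
    {"rel": "http://esipfed.org/ns/fedsearch/1.1/data#",
     "href": "https://daac.ornl.gov/daymet/daymet_v4_daily_hi_tmax_2020.nc"}]}]}}
print(data_links(sample))
```

CMR paginates, so a real fetch would loop with `page_num` (or the `CMR-Search-After` header) until fewer than `page_size` entries come back.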
I think maybe the workers ran out of memory here? I bumped up the size of the node being used in Dataflow from n1-highmem-2 to n1-highmem-8. |
I see a lot of this in the logs:
Given that this is happening at the same time as:
I suspected this was maybe because the source nc file was too large, but it is barely a meg. Maybe the destination is too large? This is the first recipe I'm really writing, so a lot of learning! |
Yep, definitely running out of memory:
Not exactly sure why, probably something about the chunking. |
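One way to sanity-check the chunking theory: a source .nc file can be tiny on disk (compressed), while the in-memory size of a chunk is set by its element count. A rough sketch, where the NA grid shape is an assumption based on the approximate 1 km Daymet grid:

```python
def chunk_gigabytes(nt, ny, nx, dtype_size=4):
    # float32 array: uncompressed bytes held in worker memory for one chunk
    return nt * ny * nx * dtype_size / 1e9

# Assumed approximate NA daily grid, with a full year in a single chunk:
# ~92 GB, far beyond the 13 GB of an n1-highmem-2 worker
print(round(chunk_gigabytes(365, 8075, 7814)))
```

If the write side concatenates a year of NA data into one chunk, a worker would need that whole array in memory at once, which would explain the OOMs regardless of how small each input file is.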
@TomAugspurger also, i'm curious if we can get this data via https://cmr.earthdata.nasa.gov/search/concepts/C2031536952-ORNL_CLOUD.html instead? Is earthdata login what is holding that back?
More me not understanding how Earthdata Login works. Glad to see you've got it working, and I'm pleased to learn about pangeo_forge_cmr.
recipes/daymet/recipe.py
Outdated
```python
recipes[var] = XarrayZarrRecipe(
    pattern_from_file_sequence(
        var_files[var],
        # FIXME: Leap years?!
```
They just drop December 31st on leap years :)
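In other words, Daymet uses a fixed 365-day calendar, so `nitems_per_file` can be a constant. A quick sketch of the difference from the Gregorian calendar:

```python
import calendar

DAYMET_DAYS_PER_YEAR = 365  # Dec 31 is dropped on leap years

def days_dropped(year):
    # How many calendar days Daymet omits for a given year
    gregorian = 366 if calendar.isleap(year) else 365
    return gregorian - DAYMET_DAYS_PER_YEAR

print(days_dropped(2019), days_dropped(2020))  # 0 1
```

So the FIXME resolves itself: every yearly file has exactly 365 time steps, leap year or not.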
recipes/daymet/recipe.py
Outdated
```python
}

# Get the GPM IMERG Late Precipitation Daily data
shortname = 'Daymet_Daily_V4_1840'
```
Is the "1840" here specific to daily - North America? And other regions will have different numerical IDs here?
Yes, it is daily, but all spatial areas are distributed under that DOI. The granule-level file name will distinguish the difference: na, pr, hi. We just updated, and 1840 is now 2129. https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=2129
new short name: Daymet_Daily_V4R1_2129
Hi @yuvipanda, quick question about this recipe. I saw that you're specifying:

```python
client_kwargs = {
    'auth': aiohttp.BasicAuth(username, password),
    'trust_env': True,
}
...
fsspec_open_kwargs=dict(
    client_kwargs=client_kwargs
),
```

I wanted to check whether this was successfully serialized when generating the recipe hash (pangeo-forge/pangeo-forge-recipes#429). @andersy005 encountered a problem where […], and I'd have thought that there'd be a similar issue with the `aiohttp.BasicAuth` instance here:

```python
In [4]: import inspect
   ...: import aiohttp
   ...: from collections.abc import Collection
   ...: from dataclasses import asdict
   ...: from enum import Enum
   ...: from hashlib import sha256
   ...: from json import dumps
   ...: from typing import Any, List, Sequence

In [5]: def either_encode_or_hash(obj: Any):
   ...:     """For objects which are not serializable with ``json.dumps``, this function defines
   ...:     type-specific handlers which extract either a serializable value or a hash from the object.
   ...:
   ...:     :param obj: Any object which is not serializable to ``json``.
   ...:     """
   ...:     if isinstance(obj, Enum):  # custom serializer for FileType, CombineOp, etc.
   ...:         return obj.value
   ...:     elif hasattr(obj, "sha256"):
   ...:         return obj.sha256.hex()
   ...:     elif inspect.isfunction(obj):
   ...:         return inspect.getsource(obj)
   ...:     elif isinstance(obj, bytes):
   ...:         return obj.hex()
   ...:     raise TypeError(f"object of type {type(obj).__name__} not serializable")

In [11]: ba = aiohttp.BasicAuth(login="test", password="test")

In [12]: either_encode_or_hash(ba)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [12], in <cell line: 1>()
----> 1 either_encode_or_hash(ba)

Input In [5], in either_encode_or_hash(obj)
     13 elif isinstance(obj, bytes):
     14     return obj.hex()
---> 15 raise TypeError(f"object of type {type(obj).__name__} not serializable")

TypeError: object of type BasicAuth not serializable
```
|
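One possible way out, as a sketch rather than the actual pangeo-forge-recipes fix: give `either_encode_or_hash` a handler that digests the credentials instead of serializing them. A stand-in namedtuple is used here so the example needs no aiohttp install; `aiohttp.BasicAuth` is itself a `(login, password, encoding)` namedtuple.

```python
from collections import namedtuple
from hashlib import sha256

# Stand-in for aiohttp.BasicAuth, for a self-contained example
BasicAuth = namedtuple("BasicAuth", ["login", "password"])

def either_encode_or_hash(obj):
    # Simplified version with only the new branch; the real function
    # also handles Enum, bytes, functions, and objects exposing .sha256
    if isinstance(obj, BasicAuth):
        # Never put the secret itself into the hash input; a digest keeps
        # the recipe hash stable without leaking credentials into logs
        return sha256(f"{obj.login}:{obj.password}".encode()).hexdigest()
    raise TypeError(f"object of type {type(obj).__name__} not serializable")

print(len(either_encode_or_hash(BasicAuth("test", "test"))))  # 64 hex chars
```

The trade-off: changing the password changes the recipe hash, which may or may not be what you want for cache invalidation.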
@derekocallaghan yeah, am definitely running into that too! However, this isn't an actual serialization issue but something to do with the hashing (which IMO should probably just be removed?). To unblock myself temporarily while I ramp up on writing more recipes, I've applied the following patch to pangeo-forge-recipes :D
|
Hi @yuvipanda, yep, the hashing in […] currently excludes fields like this:

```python
# we exclude the format function and combine dims from ``root`` because they determine the
# index:filepath pairs yielded by iterating over ``.items()``. if these pairs are generated in
# a different way in the future, we ultimately don't care.
root = {
    "fsspec_open_kwargs": pattern.fsspec_open_kwargs,
    "query_string_secrets": pattern.query_string_secrets,
    "file_type": pattern.file_type,
    "nitems_per_file": {
        op.name: op.nitems_per_file  # type: ignore
        for op in pattern.combine_dims
        if op.name in pattern.concat_dims
    },
}
```
|
@derekocallaghan I think the recipe hash should be an allow_list, including only the specific things it wants to track, rather than excluding specific things. I am not exactly sure what this hash is actually used for right now; do you know? |
There are two separate serialization issues here, though: one is related to Beam serialization, and one is related to hashing to get a hash id for the recipe. They probably both need different solutions as well. |
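The allow_list approach could be sketched as a stable digest over only the fields one opts in. The field names below are illustrative, not the actual pangeo-forge-recipes implementation:

```python
from hashlib import sha256
from json import dumps

# Only these fields are allowed to influence the recipe hash
ALLOWED = ("fsspec_open_kwargs", "file_type", "nitems_per_file")

def pattern_hash(fields: dict) -> str:
    # Filter to allow-listed, JSON-serializable fields; sort_keys makes
    # the byte stream (and so the digest) deterministic
    tracked = {k: v for k, v in fields.items() if k in ALLOWED}
    return sha256(dumps(tracked, sort_keys=True).encode()).hexdigest()

a = pattern_hash({"file_type": "netcdf4", "nitems_per_file": {"time": 365}})
b = pattern_hash({"nitems_per_file": {"time": 365}, "file_type": "netcdf4",
                  "query_string_secrets": {"token": "s3cret"}})
print(a == b)  # True: key order and untracked secrets don't change the hash
```

This sidesteps the `BasicAuth` TypeError entirely: anything not on the allow list, serializable or not, simply never reaches `json.dumps`.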
Yeah, previously the hash was created on demand, where […]. Agree that an allow_list is preferable. With your hash workaround above, does the subsequent Beam-related pickling/serialization work? |
@derekocallaghan I think so. I'm running it with the local direct runner and was able to generate a full series of one particular variable just for HI! I've just pushed my latest changes. I'm trying to get a couple of steps running for all of the regions and variables. I'm producing one recipe per variable per region, partly to see if I can get that to work before trying to merge the variables into one dimension. @jbusecke made me realize we can't actually easily combine the three regions into one! |
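The fan-out described above, one recipe per region per variable since the three grids can't be concatenated, might look like this sketch. The region and variable lists are assumptions based on the Daymet documentation, and the dict values stand in for real recipe objects:

```python
REGIONS = ["na", "hi", "pr"]  # North America, Hawaii, Puerto Rico
VARIABLES = ["tmax", "tmin", "prcp", "srad", "vp", "swe", "dayl"]

# One recipe id per (region, variable); a real version would map each id to
# an XarrayZarrRecipe built from the CMR file list for that pair
recipes = {f"daymet-{region}-{var}": (region, var)
           for region in REGIONS for var in VARIABLES}

print(len(recipes))  # 3 regions x 7 variables = 21
```

Keeping region in the recipe id makes the "can't combine regions" constraint explicit: each region stays its own Zarr store on its own grid.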
I'm testing this locally the following way:

```python
import pathlib

HERE = pathlib.Path(__file__).parent

c.TargetStorage.fsspec_class = "fsspec.implementations.local.LocalFileSystem"
c.TargetStorage.root_path = f"file://{HERE}/storage/output/{{job_id}}"
c.InputCacheStorage.root_path = f"file://{HERE}/storage/cache"
c.InputCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
c.MetadataCacheStorage.root_path = f"file://{HERE}/storage/metadata/{{job_id}}"
c.MetadataCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
c.Bake.bakery_class = "pangeo_forge_runner.bakery.local.LocalDirectBakery"
```
|
Currently it fails with the following when trying to run the NA files:
|