Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bake fails without MetadataCacheStorage in config #46

Open
derekocallaghan opened this issue Nov 9, 2022 · 2 comments
Open

Bake fails without MetadataCacheStorage in config #46

derekocallaghan opened this issue Nov 9, 2022 · 2 comments

Comments

@derekocallaghan
Copy link
Contributor

Currently, the Bake command requires MetadataCacheStorage in the associated config. E.g.:

    bakery_config.MetadataCacheStorage.root_path = (
        bakery_config.MetadataCacheStorage.root_path.format(subpath=subpath)
    )

If a config is used that doesn't require a metadata cache (e.g. the CCMP recipe doesn't require it, and appeared to run faster without it, at least locally), it fails as follows:

In [41]: bconfig
Out[41]: 
{'Bake': {'bakery_class': 'pangeo_forge_runner.bakery.local.LocalDirectBakery',
  'recipe_id': 'eooffshore_ics_ccmp_v02_1_nrt_wind',
  'feedstock_subdir': 'recipes/eooffshore_ics_ccmp_v02_1_nrt_wind',
  'repo': 'https://github.com/eooffshore/staged-recipes',
  'ref': '663f30c95c406b9efe012b9bae66fa1f386b539b',
  'job_name': 'CCMP'},
 'LocalDirectBakery': {'num_workers': 1},
 'TargetStorage': {'fsspec_class': 'fsspec.implementations.local.LocalFileSystem',
  'root_path': './ccmp.zarr'},
 'InputCacheStorage': {'fsspec_class': 'fsspec.implementations.local.LocalFileSystem',
  'root_path': './input-cache/'},
 'prune': True}

In [42]: Bake(config=Config(bconfig)).start()
[Bake] Target Storage is FSSpecTarget(LocalFileSystem(, root_path="./ccmp.zarr")

[Bake] Input Cache Storage is CacheFSSpecTarget(LocalFileSystem(, root_path="./input-cache/")

[Bake] Metadata Cache Storage is MetadataTarget(AbstractFileSystem(, root_path="")

[Bake] Picked Git content provider.

[Bake] Cloning into '/tmp/tmplbp6ohs8'...

[Bake] HEAD is now at 663f30c Added datetime import to ics_wind_speed_direction()

[Bake] Parsing recipes...
[Bake] Baking only recipe_id='eooffshore_ics_ccmp_v02_1_nrt_wind'
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Input In [42], in <cell line: 1>()
----> 1 Bake(config=Config(bconfig)).start()

File ~/gitrepos/pangeo-forge/pangeo-forge-runner/pangeo_forge_runner/commands/bake.py:145, in Bake.start(self)
    139 else:
    140     job_name = self.job_name
    142 recipe.storage_config = StorageConfig(
    143     target_storage.get_forge_target(job_name=job_name),
    144     input_cache_storage.get_forge_target(job_name=job_name),
--> 145     metadata_cache_storage.get_forge_target(job_name=job_name),
    146 )
    148 pipeline_options = bakery.get_pipeline_options(
    149     job_name=job_name,
    150     # FIXME: Bring this in from meta.yaml?
    151     container_image=self.container_image,
    152 )
    154 # Set argv explicitly to empty so Apache Beam doesn't try to parse the commandline
    155 # for pipeline options - we have traitlets doing that for us.

File ~/gitrepos/pangeo-forge/pangeo-forge-runner/pangeo_forge_runner/storage.py:54, in StorageTargetConfig.get_forge_target(self, job_name)
     48 def get_forge_target(self, job_name: str):
     49     """
     50     Return correct pangeo-forge-recipes Target
     51 
     52     If {job_name} is present in `root_path`, it is expanded with the given job_name
     53     """
---> 54     return self.pangeo_forge_target_class(
     55         self.fsspec_class(**self.fsspec_args),
     56         root_path=self.root_path.format(job_id=job_name),
     57     )

File <string>:5, in __init__(self, fs, root_path)

File /data/anaconda/anaconda3/envs/forgerunner/lib/python3.9/site-packages/pangeo_forge_recipes/storage.py:130, in FSSpecTarget.__post_init__(self)
    129 def __post_init__(self):
--> 130     if not self.fs.isdir(self.root_path):
    131         self.fs.mkdir(self.root_path)

File /data/anaconda/anaconda3/envs/forgerunner/lib/python3.9/site-packages/fsspec/spec.py:641, in AbstractFileSystem.isdir(self, path)
    639 """Is this entry directory-like?"""
    640 try:
--> 641     return self.info(path)["type"] == "directory"
    642 except IOError:
    643     return False

File /data/anaconda/anaconda3/envs/forgerunner/lib/python3.9/site-packages/fsspec/spec.py:601, in AbstractFileSystem.info(self, path, **kwargs)
    584 """Give details of entry at path
    585 
    586 Returns a single dictionary, with exactly the same information as ``ls``
   (...)
    598 directory, or something else) and other FS-specific keys.
    599 """
    600 path = self._strip_protocol(path)
--> 601 out = self.ls(self._parent(path), detail=True, **kwargs)
    602 out = [o for o in out if o["name"].rstrip("/") == path]
    603 if out:

File /data/anaconda/anaconda3/envs/forgerunner/lib/python3.9/site-packages/fsspec/spec.py:339, in AbstractFileSystem.ls(self, path, detail, **kwargs)
    300 def ls(self, path, detail=True, **kwargs):
    301     """List objects at path.
    302 
    303     This should include subdirectories and files at that location. The
   (...)
    337     dicts if detail is True.
    338     """
--> 339     raise NotImplementedError

NotImplementedError: 

Here's where it's used in Bake:

metadata_cache_storage = MetadataCacheStorage(parent=self)

Is a metadata cache always required, or can it be configurable e.g. via the recipe or meta.yaml?

@yuvipanda
Copy link
Collaborator

Good question, @derekocallaghan! Perhaps it's not always required, at which point we should hide it behind a conditional. I can see that perhaps Cache is also not always required. Will need some digging into the pangeo-forge-recipes codebase maybe?

Or we can just make it optional and see if something fails? Would you feel like submitting a PR? :D

@derekocallaghan
Copy link
Contributor Author

Hi @yuvipanda, I've been looking through the pangeo-forge-recipes code, and it seems that metadata caching is currently configurable via the recipe implementation. E.g.:

pattern = FilePattern(
    make_url,
    ConcatDim(name="time", keys=dates, nitems_per_file=NITEMS_PER_FILE),
)
    @property
    def nitems_per_input(self) -> Dict[str, Union[int, None]]:
        """Dictionary mapping concat dims to number of items per file."""
        nitems = {}  # type: Dict[str, Union[int, None]]
        for op in self.combine_dims:
            if isinstance(op, ConcatDim):
                if op.nitems_per_file:
                    nitems[op.name] = op.nitems_per_file
                else:
                    nitems[op.name] = None
        return nitems
    @property
    def concat_sequence_lens(self) -> Dict[str, Optional[int]]:
        """Dictionary mapping concat dims to sequence lengths.
        Only available if ``nitems_per_input`` is set on the dimension."""
        return {
            dim_name: (nitems * self.dims[dim_name] if nitems is not None else None)
            for dim_name, nitems in self.nitems_per_input.items()
        }
        self.concat_dim = self.file_pattern.concat_dims[0]
        self.cache_metadata = any(
            [v is None for v in self.file_pattern.concat_sequence_lens.values()]
        )

It looks like the storage metadata cache location passed to pangeo-forge-runner is ignored if ConcatDim.nitems_per_file is set in the recipe pattern. I guess this means that we can assume it'll always be provided in the Bake config as is currently the case, and I can close this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants