Generalize ReplayMover #12

timothyas opened this issue Dec 27, 2023 · 0 comments

#5 adds the scripts used to transfer the 1 degree and 1/4 degree FV3 Replay datasets. There are a number of hard-coded values that I used in order to get the wheels turning, but we have now discussed ways that this can be generalized. Here are some examples pointed out by @frolovsa and @danielabdi-noaa:

Runtime arguments

Scripts like examples/replay/move_quarter_degree.py have hard-coded runtime, SLURM, and user-related options. One way around this would be to read in a YAML file with all of these specifications. For starters:

    mover = ReplayMoverQuarterDegree(
        n_jobs=15,
        config_filename="config-0.25-degree.yaml",
        storage_options={"token": "/contrib/Tim.Smith/.gcs/replay-service-account.json"},
        main_cache_path="/lustre/Tim.Smith/tmp-replay/0.25-degree",
    )

as well as the SLURM options:

    slurm_dir = "slurm/replay-0.25-degree"
    txt = "#!/bin/bash\n\n" +\
        f"#SBATCH -J rqd{job_id:03d}\n"+\
        f"#SBATCH -o {slurm_dir}/{job_id:03d}.%j.out\n"+\
        f"#SBATCH -e {slurm_dir}/{job_id:03d}.%j.err\n"+\
        f"#SBATCH --nodes=1\n"+\
        f"#SBATCH --ntasks=1\n"+\
        f"#SBATCH --cpus-per-task=30\n"+\
        f"#SBATCH --partition=compute\n"+\
        f"#SBATCH -t 120:00:00\n\n"

could easily be put in a YAML file with a syntax like:

mover:
  n_jobs: 15
  config_filename: config-0.25-degree.yaml
  storage_options: 
...
slurm:
  cpus-per-task: 30
  partition: compute
  ntasks: 1
...
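For illustration, the `slurm:` section could then drive a small helper that renders the `#SBATCH` header, rather than hard-coding it. This is just a sketch: `build_slurm_header` and its keys are hypothetical, not an existing ufs2arco API, and the YAML itself could be loaded with e.g. `yaml.safe_load`.

```python
def build_slurm_header(job_id, slurm_dir, options):
    """Render the SLURM preamble from a dict (e.g. the ``slurm:`` section
    of a YAML config) instead of hard-coded f-strings.

    All names here are illustrative.
    """
    lines = [
        "#!/bin/bash",
        "",
        f"#SBATCH -J rqd{job_id:03d}",
        f"#SBATCH -o {slurm_dir}/{job_id:03d}.%j.out",
        f"#SBATCH -e {slurm_dir}/{job_id:03d}.%j.err",
    ]
    # any long-form option in the config becomes a --key=value directive
    lines += [f"#SBATCH --{key}={val}" for key, val in options.items()]
    return "\n".join(lines) + "\n"

header = build_slurm_header(
    job_id=3,
    slurm_dir="slurm/replay-0.25-degree",
    options={"nodes": 1, "ntasks": 1, "cpus-per-task": 30, "partition": "compute"},
)
```

New options then come from the YAML file alone, with no code changes.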

Then the conda-related arguments:

        f"source /contrib/Tim.Smith/miniconda3/etc/profile.d/conda.sh\n"+\
        f"conda activate ufs2arco\n"+\
        f'python -c "{the_code}"'

could be generalized with os.getlogin() as suggested by @danielabdi-noaa, or we could also just put that in a YAML file. My preference is for the latter, since we'd have other things in a YAML file anyway, and it removes the restrictions that (1) we all have miniconda3, (2) it's in the same relative path, and (3) the conda environment name is the same.

Replay Mover script options

Here's a list of specifics for the subset of replay data we wanted:

  1. xcycles
    @property
    def xcycles(self):
        """These are the DA cycle timestamps, which are every 6 hours. There is one s3 directory per cycle for replay."""
        cycles = pd.date_range(start="1994-01-01", end="1999-06-13T06:00:00", freq="6h")
        return xr.DataArray(cycles, coords={"cycles": cycles}, dims="cycles")

This makes assumptions about the start date, end date, and the 6-hour frequency.

  2. xtime
    @property
    def xtime(self):
        """These are the time stamps of the resulting dataset, assuming we are grabbing fhr00 and fhr03"""
        time = pd.date_range(start="1994-01-01", end="1999-06-13T09:00:00", freq="3h")
        iau_time = time - timedelta(hours=6)
        return xr.DataArray(iau_time, coords={"time": iau_time}, dims="time", attrs={"long_name": "time", "axis": "T"})

This relies first on the assumptions in xcycles, and then on the option that we're grabbing the fhr00 and fhr03 files, as well as the adjusted timing due to the IAU. I don't know how to generalize this, though; I just created it after gaining an understanding of the mapping from cycle to fhr timestamps. Note that similar assumptions are baked into add_time_coords.
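One possible direction, sketched with the stdlib rather than the pandas/xarray calls above: parameterize the cycle range, the forecast hours grabbed per cycle, and the IAU offset. All names and signatures here are hypothetical, not the existing ReplayMover interface.

```python
from datetime import datetime, timedelta

def cycle_times(start, end, cycle_hours=6):
    """DA cycle timestamps: a generalized version of the hard-coded
    6-hourly pd.date_range in xcycles."""
    out, t = [], start
    while t <= end:
        out.append(t)
        t += timedelta(hours=cycle_hours)
    return out

def output_times(start, end, fhrs=(0, 3), iau_hours=6, cycle_hours=6):
    """Resulting dataset timestamps: one entry per requested forecast hour
    per cycle, shifted back by the IAU offset (as in xtime)."""
    return [
        c + timedelta(hours=fhr) - timedelta(hours=iau_hours)
        for c in cycle_times(start, end, cycle_hours)
        for fhr in fhrs
    ]
```

With `fhrs=(0, 3)` and `iau_hours=6` this reproduces the current behavior, but other forecast-hour subsets drop out for free.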

  3. The property ods_kwargs is specific to s3, and could be moved so that it's passed in like the runtime YAML options above.
  4. Inside the method move_single_dataset is the assumption that we're only grabbing two timestamps per DA cycle (it doesn't matter that they are fhr00 and fhr03, though); see the line defining tslice.
  5. The cached_path staticmethod is specific to the replay data, but I think this is the one thing that can't be generalized, since it is specific to the dataset.
  6. It specifies that it wants an fv3dataset, and this should obviously be generalized. This also pertains to the "region" definition in these lines:
                        "time": tslice,
                        "pfull": slice(None, None),
                        "grid_yt": slice(None, None),
                        "grid_xt": slice(None, None),
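That region mapping could instead be derived from the dataset's own dimension names, so nothing FV3-specific is hard-coded. A sketch (`region_for` is a hypothetical helper; `dims` would come from something like `xds.dims`):

```python
def region_for(dims, tslice):
    """Build the ``region`` mapping for ``to_zarr`` from a dataset's
    dimension names, instead of hard-coding pfull/grid_yt/grid_xt.

    dims   : iterable of dimension names (e.g. from xds.dims)
    tslice : the slice of the time dimension being written
    """
    region = {"time": tslice}
    # every non-time dimension gets the full slice
    region.update({dim: slice(None, None) for dim in dims if dim != "time"})
    return region

region = region_for(["time", "pfull", "grid_yt", "grid_xt"], slice(0, 2))
```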