Generalize ReplayMover #12

timothyas opened this issue Dec 27, 2023 · 0 comments

#5 adds the scripts used to transfer the 1 degree and 1/4 degree FV3 Replay datasets. There are a number of hard-coded values that I used in order to get the wheels turning, but we have now discussed ways that this can be generalized. Here are some examples pointed out by @frolovsa and @danielabdi-noaa:

Runtime arguments

Scripts like examples/replay/move_quarter_degree.py have hard-coded runtime, SLURM, and user-related options. One way around this would be to read in a YAML file with all of these specifications. For starters:

    mover = ReplayMoverQuarterDegree(
        n_jobs=15,
        config_filename="config-0.25-degree.yaml",
        storage_options={"token": "/contrib/Tim.Smith/.gcs/replay-service-account.json"},
        main_cache_path="/lustre/Tim.Smith/tmp-replay/0.25-degree",
    )

as well as the SLURM options:

    slurm_dir = "slurm/replay-0.25-degree"
    txt = "#!/bin/bash\n\n" +\
        f"#SBATCH -J rqd{job_id:03d}\n"+\
        f"#SBATCH -o {slurm_dir}/{job_id:03d}.%j.out\n"+\
        f"#SBATCH -e {slurm_dir}/{job_id:03d}.%j.err\n"+\
        f"#SBATCH --nodes=1\n"+\
        f"#SBATCH --ntasks=1\n"+\
        f"#SBATCH --cpus-per-task=30\n"+\
        f"#SBATCH --partition=compute\n"+\
        f"#SBATCH -t 120:00:00\n\n"

could easily be put in a YAML file with a syntax like:

mover:
  n_jobs: 15
  config_filename: config-0.25-degree.yaml
  storage_options: 
...
slurm:
  cpus-per-task: 30
  partition: compute
  ntasks: 1
...
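For illustration, the `slurm:` section could then drive a small helper that renders the `#SBATCH` header, rather than hard-coding it. This is just a sketch: `build_slurm_header` and its keys are hypothetical, not an existing ufs2arco API, and the YAML itself could be loaded with e.g. `yaml.safe_load`.

```python
def build_slurm_header(job_id, slurm_dir, options):
    """Render the SLURM preamble from a dict (e.g. the ``slurm:`` section
    of a YAML config) instead of hard-coded f-strings.

    All names here are illustrative.
    """
    lines = [
        "#!/bin/bash",
        "",
        f"#SBATCH -J rqd{job_id:03d}",
        f"#SBATCH -o {slurm_dir}/{job_id:03d}.%j.out",
        f"#SBATCH -e {slurm_dir}/{job_id:03d}.%j.err",
    ]
    # any long-form option in the config becomes a --key=value directive
    lines += [f"#SBATCH --{key}={val}" for key, val in options.items()]
    return "\n".join(lines) + "\n"

header = build_slurm_header(
    job_id=3,
    slurm_dir="slurm/replay-0.25-degree",
    options={"nodes": 1, "ntasks": 1, "cpus-per-task": 30, "partition": "compute"},
)
```

New options then come from the YAML file alone, with no code changes.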

Then the conda-related arguments:

        f"source /contrib/Tim.Smith/miniconda3/etc/profile.d/conda.sh\n"+\
        f"conda activate ufs2arco\n"+\
        f'python -c "{the_code}"'

could be generalized with os.getlogin() as suggested by @danielabdi-noaa, or we could also just put that in a YAML file. My preference is for the latter, since we'd have other things in a YAML file anyway, and it removes the restrictions that (1) we all have miniconda3, (2) it's in the same relative path, and (3) the conda environment name is the same.

Replay Mover script options

Here's a list of specifics for the subset of replay data we wanted:

  1. xcycles
    @property
    def xcycles(self):
        """These are the DA cycle timestamps, which are every 6 hours. There is one s3 directory per cycle for replay."""
        cycles = pd.date_range(start="1994-01-01", end="1999-06-13T06:00:00", freq="6h")
        return xr.DataArray(cycles, coords={"cycles": cycles}, dims="cycles")

This makes assumptions about the start date, end date, and the 6-hour frequency.

  2. xtime
    @property
    def xtime(self):
        """These are the time stamps of the resulting dataset, assuming we are grabbing fhr00 and fhr03"""
        time = pd.date_range(start="1994-01-01", end="1999-06-13T09:00:00", freq="3h")
        iau_time = time - timedelta(hours=6)
        return xr.DataArray(iau_time, coords={"time": iau_time}, dims="time", attrs={"long_name": "time", "axis": "T"})

This relies first on the assumptions in xcycles, and then on the option that we're grabbing the fhr00 and fhr03 files, as well as the adjusted timing due to the IAU. I don't know how to generalize this, though; I just created it after gaining an understanding of the mapping from cycle to fhr timestamps. Note that similar assumptions are baked into add_time_coords.
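One possible direction, sketched with the stdlib rather than the pandas/xarray calls above: parameterize the cycle range, the forecast hours grabbed per cycle, and the IAU offset. All names and signatures here are hypothetical, not the existing ReplayMover interface.

```python
from datetime import datetime, timedelta

def cycle_times(start, end, cycle_hours=6):
    """DA cycle timestamps: a generalized version of the hard-coded
    6-hourly pd.date_range in xcycles."""
    out, t = [], start
    while t <= end:
        out.append(t)
        t += timedelta(hours=cycle_hours)
    return out

def output_times(start, end, fhrs=(0, 3), iau_hours=6, cycle_hours=6):
    """Resulting dataset timestamps: one entry per requested forecast hour
    per cycle, shifted back by the IAU offset (as in xtime)."""
    return [
        c + timedelta(hours=fhr) - timedelta(hours=iau_hours)
        for c in cycle_times(start, end, cycle_hours)
        for fhr in fhrs
    ]
```

With `fhrs=(0, 3)` and `iau_hours=6` this reproduces the current behavior, but other forecast-hour subsets drop out for free.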

  3. The property ods_kwargs is specific to s3, and could be moved so that it's passed in like the runtime YAML options above.
  4. Inside the method move_single_dataset is the assumption that we're only grabbing two timestamps per DA cycle (it doesn't matter that they are fhr00 and fhr03, though); see the line defining tslice.
  5. The cached_path staticmethod is specific to the replay data, but I think this is the one thing that can't be generalized, since it is specific to the dataset.
  6. It specifies that it wants an fv3dataset, and this should obviously be generalized. This also pertains to the "region" definition in these lines:
                        "time": tslice,
                        "pfull": slice(None, None),
                        "grid_yt": slice(None, None),
                        "grid_xt": slice(None, None),
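That region mapping could instead be derived from the dataset's own dimension names, so nothing FV3-specific is hard-coded. A sketch (`region_for` is a hypothetical helper; `dims` would come from something like `xds.dims`):

```python
def region_for(dims, tslice):
    """Build the ``region`` mapping for ``to_zarr`` from a dataset's
    dimension names, instead of hard-coding pfull/grid_yt/grid_xt.

    dims   : iterable of dimension names (e.g. from xds.dims)
    tslice : the slice of the time dimension being written
    """
    region = {"time": tslice}
    # every non-time dimension gets the full slice
    region.update({dim: slice(None, None) for dim in dims if dim != "time"})
    return region

region = region_for(["time", "pfull", "grid_yt", "grid_xt"], slice(0, 2))
```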