Cesm3.0 add marbl #269

Merged: 9 commits merged into ESCOMP:cesm3.0-alphabranch on Aug 9, 2024

Contributor

@mnlevy1981 mnlevy1981 commented Aug 7, 2024

Description of changes

Updates necessary to run MOM6 with MARBL; also removes cheyenne-specific and POP-specific PE layouts

Specific notes

Requires FMS tag fi_240807, CMEPS tagged after ESCOMP/CMEPS#493, and MOM interface tag mi_240805

Fixes: #268 (support turning MARBL tracers on in MOM6)

User interface changes?: No

Testing performed (automated tests and/or manual tests): Once I'm done making changes, I'll add a comment with the testing I've done + take this out of draft mode

fischer-ncar and others added 7 commits July 29, 2024 16:39
Also changed existing MOM6 layouts to explicitly look for non-MARBL configuration

still running some timing experiments to see what layout fits S and XL on derecho

I'm having trouble getting 20 SYPD in MOM6 with MARBL enabled; 25 nodes gives me 19.9 myears / day in the ocean but 18.16 SYPD overall. Increasing to 27 nodes or 30 nodes both slow the model down, possibly due to increased communication or possibly due to the machine being busy? We might want to adjust the XL layout in a future alpha tag.

Until we decide to turn MARBL on by default, the compsets should not change. However, I created a temporary BLT1850_MARBL compset to make it easier to run with MARBL while we do final testing (also, I added an ERI test for that compset so we aren't surprised by anything when we update the compset definitions)
@mnlevy1981 mnlevy1981 marked this pull request as ready for review August 8, 2024 19:56
@mnlevy1981
Contributor Author

I added a new BLT1850_MARBL compset, and also an ERI test for it (in prealpha). I also updated PE layouts to increase NTASKS_OCN for MOM6%MARBL-BIO compsets.

Unrelated to MARBL, I also removed PE layouts if any of the following applied:

  1. The layout was for running on cheyenne
  2. The layout was for running on POP grids
  3. The layout was for running on the old MOM6 tx0.66v1 grid

@mnlevy1981
Contributor Author

I should note that the BLT1850_MARBL compset is temporary -- once we have a chance to do some comparisons with BLT1850 and fix any issues that may arise, we want all the allactive compsets to use MARBL by default. At that point, we'll remove BLT1850_MARBL.

Contributor

@jedwards4b jedwards4b left a comment

If you are satisfied with the load balancing of MARBL compsets, please add them to the cesm3 timing table at https://cseg.cgd.ucar.edu/timing/login/

@mnlevy1981
Contributor Author

mnlevy1981 commented Aug 8, 2024

Oh, and the XL pelayout needs to be retuned -- MOM6 can't get 20 SYPD with MARBL turned on, likely due to I/O issues when running with larger core counts. Reducing the number of CAM tasks might keep throughput the same but decrease overall cost?

The existing XL layout without MARBL claims 20 SYPD / 7100 pe-hrs/simyear, though I saw 17.9 SYPD / 8000 pe-hrs/simyear when I ran (during the tutorial, so the computer was busier than usual). Turning MARBL on, I saw 18.3 SYPD and 11500 pe-hrs/simyear, but even when the machine was cooperating I never reached 20 SYPD in ocean-only testing.

@mnlevy1981
Contributor Author

At @jedwards4b's suggestion, I ran a series of PFS tests for the different pre-defined pe layout sizes:

| pecount | SYPD (no MARBL) | pe-hrs / sim-year (no MARBL) | SYPD (MARBL) | pe-hrs / sim-year (MARBL) |
| --- | --- | --- | --- | --- |
| S | 2.43 | 5057 | 2.61 | 7075 |
| M | 9.48 | 5833 | 9.14 | 8740 |
| L | 11.87 | 6210 | 11.39 | 9708 |
| XL | 18.05 | 8000 | 15.82 | 13202 |
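To make the MARBL overhead easier to compare across layouts, the short sketch below computes the relative change in throughput and cost for each row. The numbers are copied from the table above; the dictionary layout and the `marbl_overhead` helper are hypothetical names used only for illustration and are not part of CIME or the PFS test output.

```python
# Values copied from the PFS results table in this comment.
# layout: (SYPD no MARBL, pe-hrs/sim-year no MARBL, SYPD MARBL, pe-hrs/sim-year MARBL)
results = {
    "S": (2.43, 5057, 2.61, 7075),
    "M": (9.48, 5833, 9.14, 8740),
    "L": (11.87, 6210, 11.39, 9708),
    "XL": (18.05, 8000, 15.82, 13202),
}


def marbl_overhead(row):
    """Return (throughput change %, cost change %) when MARBL is enabled."""
    sypd_base, cost_base, sypd_marbl, cost_marbl = row
    throughput_change = 100.0 * (sypd_marbl - sypd_base) / sypd_base
    cost_change = 100.0 * (cost_marbl - cost_base) / cost_base
    return throughput_change, cost_change


# Hypothetical summary keyed by layout name.
overhead = {name: marbl_overhead(row) for name, row in results.items()}

for name, (d_sypd, d_cost) in overhead.items():
    print(f"{name:>2}: throughput {d_sypd:+.1f}%, cost {d_cost:+.1f}%")
```

Each of these is a single PFS run, so small differences (such as the S layout running slightly faster with MARBL) may be within machine noise.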

The XL results are interesting -- the ocean run time is still less than ATM + max(LND+ROF, ICE), but the CPL COMM time increased by 38.4 s, which accounts for the entire 37.0 s increase in total run time.

| Component | Run Time (s) (no MARBL) | Run Time (s) (MARBL) | diff (s) |
| --- | --- | --- | --- |
| TOT | 262.250 | 299.201 | 36.951 |
| CPL | 17.793 | 17.203 | -0.590 |
| ATM | 169.303 | 174.386 | 5.083 |
| LND | 51.998 | 55.270 | 3.272 |
| ICE | 39.824 | 48.156 | 8.332 |
| OCN | 169.424 | 203.312 | 33.888 |
| ROF | 3.672 | 5.157 | 1.484 |
| GLC | 0.027 | 0.027 | 0.000 |
| WAV | 0.000 | 0.000 | 0.000 |
| ESP | 0.000 | 0.000 | 0.000 |
| CPL COMM | 45.571 | 83.982 | 38.411 |
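The comparison of the ocean against ATM + max(LND+ROF, ICE) can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the expression quoted in the comment is a fair stand-in for the time the other components need while the ocean runs concurrently; the `ocean_slack` helper and the dictionary keys are hypothetical names, and the run times are copied from the XL component table.

```python
# Component run times (seconds) for the XL layout, copied from the table.
run_time = {
    "no_marbl": {"ATM": 169.303, "LND": 51.998, "ICE": 39.824,
                 "OCN": 169.424, "ROF": 3.672},
    "marbl": {"ATM": 174.386, "LND": 55.270, "ICE": 48.156,
              "OCN": 203.312, "ROF": 5.157},
}


def ocean_slack(t):
    """Seconds by which ATM + max(LND+ROF, ICE) exceeds the ocean run time.

    A positive value means the ocean finishes first (under the assumed
    concurrency layout) and is not the limiting component.
    """
    others = t["ATM"] + max(t["LND"] + t["ROF"], t["ICE"])
    return others - t["OCN"]


for case, times in run_time.items():
    print(f"{case}: ocean slack = {ocean_slack(times):.3f} s")
```

Under that assumption the ocean still has slack in both cases, although enabling MARBL shrinks it considerably.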

Before I add any of this to the timing table, I'd like to wait and see how throughput compares (at least for the medium size layout) in the CESM3 development runs. Looking at these results, I wouldn't be surprised if we need to tweak the layouts a little bit...

@jedwards4b
Contributor

Looking at the timing table for XL:

    TOT Run Time:     262.250 seconds       13.112 seconds/mday        18.05 myears/wday 
    CPL Run Time:      17.793 seconds        0.890 seconds/mday       266.07 myears/wday 
    ATM Run Time:     169.303 seconds        8.465 seconds/mday        27.96 myears/wday 
    LND Run Time:      51.998 seconds        2.600 seconds/mday        91.05 myears/wday 
    ICE Run Time:      39.824 seconds        1.991 seconds/mday       118.88 myears/wday 
    OCN Run Time:     169.424 seconds        8.471 seconds/mday        27.94 myears/wday 
    ROF Run Time:       3.672 seconds        0.184 seconds/mday      1289.11 myears/wday 
    GLC Run Time:       0.027 seconds        0.001 seconds/mday    175994.30 myears/wday 
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    CPL COMM Time:     45.571 seconds        2.279 seconds/mday       103.89 myears/wday 

The ocean is waiting some 56s per coupling interval.
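For readers unfamiliar with the columns in the timing output above, the sketch below shows how the three numbers on each line relate to one another. It assumes a 365-day model year (a no-leap calendar), which reproduces the printed values; the function names are hypothetical and are not taken from the CIME timing scripts.

```python
SECONDS_PER_WALLCLOCK_DAY = 86400.0
DAYS_PER_MODEL_YEAR = 365.0  # assumption: no-leap calendar


def myears_per_wday(seconds_per_mday):
    """Convert seconds per model day into model years per wallclock day."""
    return SECONDS_PER_WALLCLOCK_DAY / (seconds_per_mday * DAYS_PER_MODEL_YEAR)


def model_days(run_time_s, seconds_per_mday):
    """Estimate the run length in model days from the first two columns."""
    return run_time_s / seconds_per_mday


# TOT line from the XL timing output: 262.250 s total, 13.112 s/mday.
tot_rate = myears_per_wday(13.112)
tot_days = model_days(262.250, 13.112)
print(f"TOT: {tot_rate:.2f} myears/wday over about {tot_days:.0f} model days")
```

Applying `myears_per_wday` to the ATM and OCN lines (8.465 and 8.471 seconds/mday) gives values that round to the printed 27.96 and 27.94 myears/wday.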

@jedwards4b
Copy link
Contributor

@mnlevy1981 You'll need to resolve the conflict before I can merge - I can't push to your fork.

@jedwards4b jedwards4b merged commit 2200ac6 into ESCOMP:cesm3.0-alphabranch Aug 9, 2024
Development

Successfully merging this pull request may close these issues.

Updates needed now that MOM6 can run with MARBL tracers