Update gdas.cd #2978

Open
wants to merge 12 commits into
base: develop

guillaumevernieres
Contributor

@guillaumevernieres guillaumevernieres commented Oct 3, 2024

Description

Updates the gdas.cd hash.
@RussTreadon-NOAA will submit a PR in the GDASApp; we'll update the gdas.cd hash in this branch after the GDASApp PR is merged.
In the meantime, could somebody review the few simple code changes that are needed to run with the new hash?

  • Depends on GDASApp PR #1310
  • Depends on g-w issue #3012

Type of change

  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO
  • Does this change require an update to any of the following submodules? YES

How has this been tested?

Run subset of g-w CI on Hera, Hercules, Orion, and WCOSS2 (Cactus)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • This change is covered by an existing CI test or a new one has been added

CoryMartin-NOAA
CoryMartin-NOAA previously approved these changes Oct 3, 2024
aerorahul
aerorahul previously approved these changes Oct 4, 2024
Contributor

@aerorahul aerorahul left a comment

looks good to me

@RussTreadon-NOAA
Contributor

@guillaumevernieres , the hash for sorc/gdas.cd can be updated. GDASApp PR #1310 was merged into develop at 9d95c9d.

@guillaumevernieres guillaumevernieres changed the title Feature/update hashes Update gdas.cd Oct 7, 2024
@guillaumevernieres
Contributor Author

@guillaumevernieres , the hash for sorc/gdas.cd can be updated. GDASApp PR #1310 was merged into develop at 9d95c9d.

Thanks @RussTreadon-NOAA , working on it this morning.

@guillaumevernieres guillaumevernieres marked this pull request as ready for review October 7, 2024 13:21
CoryMartin-NOAA
CoryMartin-NOAA previously approved these changes Oct 7, 2024
aerorahul
aerorahul previously approved these changes Oct 7, 2024
@aerorahul added the CI-Hera-Ready, CI-Orion-Ready, and CI-Wcoss2-Ready labels (CM use only: PR is ready for CI testing on the named platform) on Oct 7, 2024
@emcbot added the CI-Orion-Building, CI-Hera-Building, and CI-Wcoss2-Building labels (bot use only: CI testing is cloning/building on the named platform) and removed the CI-Orion-Ready, CI-Hera-Ready, and CI-Wcoss2-Ready labels on Oct 7, 2024
@CoryMartin-NOAA
Contributor

interestingly I found this Unidata/netcdf-c#2158

@guillaumevernieres
Contributor Author

@DavidNew-NOAA , @CoryMartin-NOAA , @danholdaway , @guillaumevernieres : any ideas to try on the WCOSS2 failures for jobs gdas_aeroanlgenb and gdas_snowanl? It certainly seems like an FMS / FMS2 issue. @DavidNew-NOAA made a fix which is in the most recent Cactus test ... but the jobs still fail. Is there another way to tackle this problem?

Alternatively, do we accept that 96C48_hybatmaerosnowDA is broken on WCOSS2 and move forward with this PR?

No idea @RussTreadon-NOAA for the WCOSS2 problem. I'm finishing up testing a subset of the g-w ci, it's already past what failed in the (semi) automated ci, so far so good.
I vote for merging this if the marine DA stuff works (I'm not biased!).

@CoryMartin-NOAA
Contributor

Is it complaining about the name ending with an integer? Shouldn't the string be irrelevant?

@CoryMartin-NOAA
Contributor

CoryMartin-NOAA commented Oct 16, 2024

Why is it referencing SOCA? eg:
libsoca.so 00001509F3989C80 fms_netcdf_domain 548 fms_netcdf_domain_io.F90

Is SOCA building FMS separately? Is it linking to two FMS versions somehow?

Not sure if this is related:
https://github.com/JCSDA-internal/soca/blob/4d7ef21e74d78a065156c942a72806ef2e2eb08e/CMakeLists.txt#L34
vs
https://github.com/JCSDA-internal/fv3-jedi/blob/c99519638484c017ebdf1817a2508d6a34562f8f/CMakeLists.txt#L126

@DavidNew-NOAA
Contributor

@RussTreadon-NOAA Can you try again with the latest feature/fms2io-fix? I have a hunch about why that illegal character error is being raised on Cactus.
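
One way to test a hunch about an "illegal characters" error is to inspect the raw bytes of the name handed to the library, since padding or control bytes do not show up when the name is printed. The following is a minimal sketch under stated assumptions: `describe_name` is a hypothetical helper, not part of FMS or netCDF, it does not reproduce the netCDF library's own name check, and nothing in this thread confirms that hidden bytes are the actual cause.

```python
import string

# Characters treated as "visible" for this diagnostic: printable ASCII
# without whitespace. This is an assumption made for the sketch, not the
# netCDF naming rule.
VISIBLE = set(string.ascii_letters + string.digits + string.punctuation)


def describe_name(raw: bytes) -> list[str]:
    """Return human-readable findings about suspicious bytes in a name buffer."""
    findings = []
    # Fixed-length character buffers are often padded with blanks or NULs.
    stripped = raw.rstrip(b" \x00")
    if len(stripped) != len(raw):
        findings.append(f"{len(raw) - len(stripped)} trailing blank/NUL byte(s)")
    # Report any remaining byte that would not be visible when printed.
    for index, byte in enumerate(stripped):
        if chr(byte) not in VISIBLE:
            findings.append(f"byte {index} is 0x{byte:02x}")
    return findings


if __name__ == "__main__":
    # The name from the error message, first clean, then with hypothetical padding.
    print(describe_name(b"xaxis_1"))
    print(describe_name(b"xaxis_1\x00\x00"))
```

A clean `xaxis_1` produces no findings, so a trailing digit by itself is not flagged by this check.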

@RussTreadon-NOAA
Contributor

Thank you @DavidNew-NOAA for the update. I can't directly pull your fv3-jedi branch, feature/fms2io-fix, into gdas.cd. Your branch includes Model Variable Renaming Sprint changes which will cause the gdas.cd hash in this PR to choke. Let me manually add your latest change to the working copy of my Cactus branch, rebuild, and rerun.

@RussTreadon-NOAA
Contributor

@DavidNew-NOAA , unfortunately, jobs gdas_aeroanlgenb and gdas_snowanl fail on Cactus in the same way as before. Note that the tracebacks for both failures reference libsoca. Why?

nid001899.cactus.wcoss2.ncep.noaa.gov 0:
FATAL from PE     0: NetCDF: Name contains illegal characters: netcdf_add_variable: file:./bkg/20211220.180000.anlres.fv_tracer.res.tile1.nc variable:xaxis_1

nid001899.cactus.wcoss2.ncep.noaa.gov 64: Image              PC                Routine            Line        Source
libifcoremt.so.5   00001551E34BCD4A  tracebackqq_          Unknown  Unknown
libsoca.so         000015521D117B4E  mpp_mod_mp_mpp_er          72  mpp_util_mpi.inc
libsoca.so         000015521D496CE2  fms_io_utils_mod_         190  fms_io_utils.F90
libsoca.so         000015521D03B3D3  netcdf_io_mod_mp_         381  netcdf_io.F90
libsoca.so         000015521D049E56  netcdf_io_mod_mp_         969  netcdf_io.F90
libsoca.so         000015521D499C80  fms_netcdf_domain         548  fms_netcdf_domain_io.F90
libfv3jedi.so      000015522B0E9F0D  fv3jedi_io_fms_mo     Unknown  Unknown
libfv3jedi.so      000015522B0F13A9  fv3jedi_io_fms_mo     Unknown  Unknown
libfv3jedi.so      000015522AEA04E5  fv3jedi_io_fms_wr     Unknown  Unknown
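
To make the libsoca question concrete, the traceback above can be grouped by the shared library that owns each frame. This is a small sketch, assuming each frame row has the five whitespace-separated columns shown in the header (Image, PC, Routine, Line, Source); `sources_by_library` is a hypothetical helper name.

```python
from collections import defaultdict

# Frame rows copied from the traceback above.
TRACEBACK = """\
libifcoremt.so.5   00001551E34BCD4A  tracebackqq_          Unknown  Unknown
libsoca.so         000015521D117B4E  mpp_mod_mp_mpp_er          72  mpp_util_mpi.inc
libsoca.so         000015521D496CE2  fms_io_utils_mod_         190  fms_io_utils.F90
libsoca.so         000015521D03B3D3  netcdf_io_mod_mp_         381  netcdf_io.F90
libsoca.so         000015521D049E56  netcdf_io_mod_mp_         969  netcdf_io.F90
libsoca.so         000015521D499C80  fms_netcdf_domain         548  fms_netcdf_domain_io.F90
libfv3jedi.so      000015522B0E9F0D  fv3jedi_io_fms_mo     Unknown  Unknown
libfv3jedi.so      000015522B0F13A9  fv3jedi_io_fms_mo     Unknown  Unknown
libfv3jedi.so      000015522AEA04E5  fv3jedi_io_fms_wr     Unknown  Unknown
"""


def sources_by_library(text: str) -> dict[str, list[str]]:
    """Map each shared library to the source files of its traceback frames."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Keep only rows shaped like: image, PC, routine, line, source.
        if len(fields) == 5 and ".so" in fields[0]:
            grouped[fields[0]].append(fields[4])
    return dict(grouped)


if __name__ == "__main__":
    for library, sources in sources_by_library(TRACEBACK).items():
        print(f"{library}: {len(sources)} frame(s), sources: {sources}")
```

In this traceback every frame with an FMS source file (`mpp_util_mpi.inc`, `fms_io_utils.F90`, `netcdf_io.F90`, `fms_netcdf_domain_io.F90`) is attributed to libsoca.so, even though the caller is libfv3jedi.so.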

@RussTreadon-NOAA
Contributor

g-w CI tests on Hercules
Repeat the Hera and Cactus g-w CI tests on Hercules. Results are as follows:

rocotostat /work/noaa/stmp/rtreadon/EXPDIR/prgsi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201800        Done    Oct 16 2024 15:24:50    Oct 16 2024 16:00:04
202112210000        Done    Oct 16 2024 15:24:50    Oct 16 2024 18:05:03
202112210600        Done    Oct 16 2024 15:24:50    Oct 16 2024 18:25:03

rocotostat /work/noaa/stmp/rtreadon/EXPDIR/prjedi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202402231800        Done    Oct 16 2024 15:24:51    Oct 16 2024 16:00:05
202402240000        Done    Oct 16 2024 15:24:51    Oct 16 2024 19:26:54
202402240600        Done    Oct 16 2024 15:24:51    Oct 16 2024 20:26:55

rocotostat /work/noaa/stmp/rtreadon/EXPDIR/praero_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201200        Done    Oct 16 2024 15:22:04    Oct 16 2024 16:00:06
202112201800        Done    Oct 16 2024 15:22:04    Oct 16 2024 17:20:05
202112210000        Done    Oct 16 2024 15:22:04    Oct 16 2024 19:26:55

rocotostat /work/noaa/stmp/rtreadon/EXPDIR/prwcda_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202103241200        Done    Oct 16 2024 15:24:53    Oct 16 2024 16:00:07
202103241800        Done    Oct 16 2024 15:24:53    Oct 16 2024 17:05:06

All jobs successfully complete in the four tested configurations. This Passed result is the same as Hera.
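
The same check can be scripted when there are many experiment directories to look through. This is a rough sketch, assuming the cycle summary text has the layout shown above (a 12-digit cycle followed by its state); `cycle_states` and `all_done` are hypothetical helper names, and the sample is the prgsi_pr2978 summary from this comment.

```python
SAMPLE = """\
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201800        Done    Oct 16 2024 15:24:50    Oct 16 2024 16:00:04
202112210000        Done    Oct 16 2024 15:24:50    Oct 16 2024 18:05:03
202112210600        Done    Oct 16 2024 15:24:50    Oct 16 2024 18:25:03
"""


def cycle_states(summary: str) -> dict[str, str]:
    """Map each cycle (YYYYMMDDHHMM) to its reported state."""
    states = {}
    for line in summary.splitlines():
        fields = line.split()
        # Data rows start with a 12-digit cycle; headers and blank lines are skipped.
        if len(fields) >= 2 and len(fields[0]) == 12 and fields[0].isdigit():
            states[fields[0]] = fields[1]
    return states


def all_done(summary: str) -> bool:
    """True only if at least one cycle was found and every cycle is Done."""
    states = cycle_states(summary)
    return bool(states) and all(state == "Done" for state in states.values())


if __name__ == "__main__":
    print(cycle_states(SAMPLE))
    print(all_done(SAMPLE))
```

Requiring at least one parsed cycle keeps an empty or unreadable summary from being reported as a pass.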

@CoryMartin-NOAA
Contributor

@DavidNew-NOAA will be on leave until Tuesday. Should we accept that this may fail on WCOSS and allow this to proceed, but prioritize fixing this next week? Or wait until we have a solution for WCOSS? It's entirely possible it is a WCOSS library issue. I wonder if we can reproduce this with simplified code?

@RussTreadon-NOAA
Contributor

I agree, @CoryMartin-NOAA. Let's move this PR forward as is and fix the WCOSS2 failures as soon as possible. All four g-w CI configurations tested on Hera and Hercules pass. 2 out of the 3 g-w CI tested on Cactus pass. What do you think @WalterKolczynski-NOAA and @aerorahul ?

It would be much easier to troubleshoot what's going on if we could replicate the problem in a simple script. I will try to encapsulate the failing portions of the gdas_aeroanlgenb or gdas_snowanl jobs in a simple wrapper script.

@WalterKolczynski-NOAA
Contributor

I agree, @CoryMartin-NOAA. Let's move this PR forward as is and fix the WCOSS2 failures as soon as possible. All four g-w CI configurations tested on Hera and Hercules pass. 2 out of the 3 g-w CI tested on Cactus pass. What do you think @WalterKolczynski-NOAA and @aerorahul ?

CI still needs to pass on WCOSS to be merged, so if we approve this plan:

  • The offending non-passing test will need to be turned off for WCOSS so we can run the others
  • A workflow issue created to fix the problem and turn the test back on

Will let @aerorahul chime in first.

@RussTreadon-NOAA
Contributor

Oops, I realize there's a bit more work required for this PR to move forward. All my tests are using GDASApp patch/saber. This PR does not, at present, point sorc/gdas.cd at patch/saber.

@CoryMartin-NOAA
Contributor

Sorry, what @WalterKolczynski-NOAA said is what I intended to propose. We turn off the test, fix it, and then turn the test back on at a later date. @aerorahul and I discussed yesterday that this would be acceptable in this sort of scenario.

@RussTreadon-NOAA
Contributor

Completion of this PR requires updates to files in EIB's fix/gdas/fv3jedi/20220805/fv3files. g-w issue #3012 has been opened to request installation of the updated fix files.

@aerorahul
Contributor

  • Disable GDASApp tests that are failing on platforms affected
  • Open issues for failing GDASApp tests with a priority to reenable them
  • Merge this PR

@RussTreadon-NOAA
Contributor

Why is it referencing SOCA? eg: libsoca.so 00001509F3989C80 fms_netcdf_domain 548 fms_netcdf_domain_io.F90

Is SOCA building FMS separately? Is it linking to two FMS versions somehow?

Not sure if this is related: https://github.com/JCSDA-internal/soca/blob/4d7ef21e74d78a065156c942a72806ef2e2eb08e/CMakeLists.txt#L34 vs https://github.com/JCSDA-internal/fv3-jedi/blob/c99519638484c017ebdf1817a2508d6a34562f8f/CMakeLists.txt#L126

Thank you @CoryMartin-NOAA for catching this. As a test I updated the soca CMakeLists.txt to find_package(FMS 2023.04 REQUIRED COMPONENTS R4 R8) and rebuilt on Cactus. A rerun of job gdas_aeroanlgenb failed with the same illegal characters message.

@RussTreadon-NOAA
Contributor

RussTreadon-NOAA commented Oct 17, 2024

g-w CI on Hera, Hercules, Orion, and WCOSS2

Install guillaumevernieres:feature/update_hashes at 5869ff3 with sorc/gdas.cd replaced by GDASApp branch patch/gwci at c33fbab. Run g-w CI for

  • C96C48_hybatmDA - PSLOT = prgsi_pr2978
  • C96C48_ufs_hybatmDA - PSLOT = prjedi_pr2978
  • C96C48_hybatmaerosnowDA - PSLOT = praero_pr2978
  • C48mx500_3DVarAOWCDA - PSLOT = prwcda_pr2978

with results as follows on the indicated machines

Hera

/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/prgsi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201800        Done    Oct 17 2024 11:00:37    Oct 17 2024 11:20:17
202112210000        Done    Oct 17 2024 11:00:37    Oct 17 2024 13:45:19
202112210600        Done    Oct 17 2024 11:00:37    Oct 17 2024 13:30:21

/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/prjedi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202402231800        Done    Oct 17 2024 11:00:44    Oct 17 2024 11:20:19
202402240000        Done    Oct 17 2024 11:00:44    Oct 17 2024 14:10:18
202402240600        Done    Oct 17 2024 11:00:44    Oct 17 2024 14:35:20

/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/praero_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201200        Done    Oct 17 2024 11:00:46    Oct 17 2024 11:31:00
202112201800        Done    Oct 17 2024 11:00:46    Oct 17 2024 12:45:22
202112210000        Done    Oct 17 2024 11:00:46    Oct 17 2024 15:10:30

/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/prwcda_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202103241200        Done    Oct 17 2024 11:00:47    Oct 17 2024 11:20:23
202103241800        Done    Oct 17 2024 11:00:47    Oct 17 2024 12:35:21

Hercules

rocotostat /work/noaa/stmp/rtreadon/EXPDIR/prgsi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201800        Done    Oct 17 2024 10:45:03    Oct 17 2024 11:05:02
202112210000        Done    Oct 17 2024 10:45:03    Oct 17 2024 13:05:03
202112210600        Done    Oct 17 2024 10:45:03    Oct 17 2024 13:25:03

rocotostat /work/noaa/stmp/rtreadon/EXPDIR/prjedi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202402231800        Done    Oct 17 2024 10:45:04    Oct 17 2024 11:05:04
202402240000        Done    Oct 17 2024 10:45:04    Oct 17 2024 13:46:50
202402240600        Done    Oct 17 2024 10:45:04    Oct 17 2024 13:55:04

rocotostat /work/noaa/stmp/rtreadon/EXPDIR/praero_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201200        Done    Oct 17 2024 10:45:05    Oct 17 2024 11:10:06
202112201800        Done    Oct 17 2024 10:45:05    Oct 17 2024 12:45:06
202112210000        Done    Oct 17 2024 10:45:05    Oct 17 2024 14:25:05

rocotostat /work/noaa/stmp/rtreadon/EXPDIR/prwcda_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202103241200        Done    Oct 17 2024 10:45:06    Oct 17 2024 11:10:06
202103241800        Done    Oct 17 2024 10:45:06    Oct 17 2024 12:30:07

Orion

rocotostat /work2/noaa/stmp/rtreadon/EXPDIR/prgsi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201800        Done    Oct 17 2024 12:56:04    Oct 17 2024 13:20:06
202112210000        Done    Oct 17 2024 12:56:04    Oct 17 2024 15:50:06
202112210600        Done    Oct 17 2024 12:56:04    Oct 17 2024 15:55:05

rocotostat /work2/noaa/stmp/rtreadon/EXPDIR/prjedi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202402231800        Done    Oct 17 2024 12:56:06    Oct 17 2024 13:20:08
202402240000        Done    Oct 17 2024 12:56:06    Oct 17 2024 16:20:07
202402240600        Done    Oct 17 2024 12:56:06    Oct 17 2024 16:36:27

rocotostat /work2/noaa/stmp/rtreadon/EXPDIR/praero_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201200        Done    Oct 17 2024 12:56:08    Oct 17 2024 13:20:11
202112201800        Done    Oct 17 2024 12:56:08    Oct 17 2024 14:45:10
202112210000        Done    Oct 17 2024 12:56:08    Oct 17 2024 16:54:23

rocotostat /work2/noaa/stmp/rtreadon/EXPDIR/prwcda_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202103241200        Done    Oct 17 2024 12:56:09    Oct 17 2024 13:20:13
202103241800        Done    Oct 17 2024 12:56:09    Oct 17 2024 14:45:11

WCOSS2 (Cactus)

 /lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prgsi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202112201800        Done    Oct 17 2024 11:00:10    Oct 17 2024 11:20:06
202112210000        Done    Oct 17 2024 11:00:10    Oct 17 2024 13:25:11
202112210600        Done    Oct 17 2024 11:00:10    Oct 17 2024 13:09:40
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prjedi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202402231800        Done    Oct 17 2024 11:00:12    Oct 17 2024 11:20:09
202402240000        Done    Oct 17 2024 11:00:12    Oct 17 2024 13:46:38
202402240600        Done    Oct 17 2024 11:00:12    Oct 17 2024 13:47:03
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/praero_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202112201200      Active    Oct 17 2024 11:00:14             -          
202112201800      Active    Oct 17 2024 11:00:14             -          
202112210000      Active    Oct 17 2024 11:00:14             -          
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prwcda_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202103241200        Done    Oct 17 2024 11:00:16    Oct 17 2024 11:20:15
202103241800      Active    Oct 17 2024 11:00:16             -          
   

The Cactus failures are

/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/praero_pr2978
202112201200        gdas_aeroanlgenb                   158495445                DEAD                 -29         2        1849.0
202112201800            gdas_snowanl                   158472339                DEAD                   1         2          58.0
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prwcda_pr2978
202103241800         gdas_marinebmat                   158473078                DEAD                 -29         2        1823.0
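
For a quick triage, the failure rows above can be parsed into records. This is a sketch under stated assumptions: the seven columns are taken to be cycle, task, job ID, state, exit status, tries, and duration in seconds (the rows above carry no header, so this mapping is an assumption), and `JobRow` and `parse_rows` are hypothetical names.

```python
from typing import NamedTuple

# Rows copied from the Cactus failure listing above.
FAILURES = """\
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/praero_pr2978
202112201200        gdas_aeroanlgenb                   158495445                DEAD                 -29         2        1849.0
202112201800            gdas_snowanl                   158472339                DEAD                   1         2          58.0

/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prwcda_pr2978
202103241800         gdas_marinebmat                   158473078                DEAD                 -29         2        1823.0
"""


class JobRow(NamedTuple):
    cycle: str
    task: str
    job_id: str
    state: str
    exit_status: int   # assumed column meaning
    tries: int         # assumed column meaning
    duration_s: float  # assumed column meaning


def parse_rows(text: str) -> list[JobRow]:
    """Parse seven-column job rows; path lines and blanks are ignored."""
    rows = []
    for line in text.splitlines():
        f = line.split()
        if len(f) == 7 and f[0].isdigit():
            rows.append(JobRow(f[0], f[1], f[2], f[3], int(f[4]), int(f[5]), float(f[6])))
    return rows


if __name__ == "__main__":
    for row in parse_rows(FAILURES):
        if row.state == "DEAD":
            print(row.cycle, row.task, row.exit_status, row.duration_s)
```

Read this way, gdas_snowanl stops after about a minute with exit status 1, while gdas_aeroanlgenb and gdas_marinebmat both run for roughly 30 minutes before ending with -29.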

The following GDASApp issues have been opened to address these failures

g-w CI C96C48_hybatmaerosnowDA and C48mx500_3DVarAOWCDA are disabled on WCOSS2

@RussTreadon-NOAA
Contributor

RussTreadon-NOAA commented Oct 17, 2024

Thank you @aerorahul for outlining what needs to be done before this PR can be closed. I converted your list into tasks and added the fix update.

  • Disable GDASApp tests that are failing on platforms affected
  • Open issues for failing GDASApp tests with a priority to reenable them
  • Close g-w issue update fix/gdas/fv3files #3012
  • Merge this PR

Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Approve

@RussTreadon-NOAA
Contributor

@WalterKolczynski-NOAA , this PR is ready for final review.

Once g-w issue #3012 is closed, we need to update gdas_fv3jedi_ver in versions/fix.ver to the timestamp used for the updated fv3jedi fix files.
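
The version bump itself is a one-line change. The sketch below shows one way to script it, assuming versions/fix.ver holds lines of the form `export gdas_fv3jedi_ver=<timestamp>` (the file layout is an assumption here). `set_version` is a hypothetical helper, `other_ver` is an illustrative key, and `20990101` is a placeholder, not the real timestamp, which is not known until #3012 is closed.

```python
import re


def set_version(text: str, key: str, value: str) -> str:
    """Replace the value on the single 'export <key>=' line in text."""
    pattern = re.compile(rf"^(\s*export\s+{re.escape(key)}=).*$", re.MULTILINE)
    # A function replacement avoids any backslash interpretation in value.
    new_text, count = pattern.subn(lambda m: m.group(1) + value, text)
    if count != 1:
        raise ValueError(f"expected exactly one {key} line, found {count}")
    return new_text


if __name__ == "__main__":
    # Illustrative file content; 20220805 is the timestamp in the fix path
    # mentioned earlier in this thread.
    sample = "export gdas_fv3jedi_ver=20220805\nexport other_ver=20240101\n"
    print(set_version(sample, "gdas_fv3jedi_ver", "20990101"))
```

Raising an error when the key is missing or duplicated keeps a silent no-op from being mistaken for a successful update.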

Labels
CI-Hera-Failed (bot use only: CI testing on Hera for this PR has failed)
CI-Wcoss2-Failed (bot use only: CI testing on WCOSS for this PR has failed)
10 participants