Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specs approach for AMOF files #17

Open
joshua-hampton opened this issue Oct 28, 2022 · 23 comments
Open

Specs approach for AMOF files #17

joshua-hampton opened this issue Oct 28, 2022 · 23 comments

Comments

@joshua-hampton
Copy link
Collaborator

As with issue #16, I have also made some progress on the idea of using specs to check AMOF files. In this approach, I have created a number of spec files, one that looks at land variables, and one that does common global attributes.

  • land variables also includes a dimension and a variable that isn't real, just as part of checking it's working
  • common global attributes refers to a new vocab file amof.json, although I think the info in here might also be in the AMF CVs, I didn't look because I did this first.

To accommodate this, I made some changes to generic.py, where the functions run by the spec checker live

  • lines 31, 40, 50-54, function check_global_attrs: added regex checks
  • lines 59-86: added two new functions - check_var_exists and check_dim_exists. Quite self-explanatory

This approach doesn't (currently) check the values of variable attributes, just that they exist. This is better suited for checking valid_min (compare with discussion in issue #16), but not for checking units, for example.

I think this approach will also work well when all variables have the same attributes; however, that's not the case (e.g. quality control variables have flag_values and flag_meanings, which don't exist in other variables). I guess that the specs can be written to do checks on QC variables separately to other variables.

Example of running check against these spec files:

checksit check -t off --specs=amof-land-common,amof-global-attrs-common /gws/nopw/j04/ncas_obs/public/cdao/ncas-radar-wind-profiler-1/ncas-radar-wind-profiler-1_cdao_20221020_snr-winds_low-mode_15min_v1.0.nc

Running with:
	Template: OFF
	Datafile: /gws/nopw/j04/ncas_obs/public/cdao/ncas-radar-wind-profiler-1/ncas-radar-wind-profiler-1_cdao_20221020_snr-winds_low-mode_15min_v1.0.nc


---------------- Running checks ------------------

[WARNING] Template checks switched off!
[FAILED] with 3 errors:

	01. [variable**************:cows]: Does not exist in file.
	02. [dimension**************:fields]: Does not exist in file.
	03. [global-attributes:******:instrument_serial_number]: '' does not match regex pattern '.{3,}'.
@joshua-hampton
Copy link
Collaborator Author

joshua-hampton commented Nov 2, 2022

Update through commit 65cdf5c:

  • added make_specs.py - I imagine this will probably live in amf-check-writer but I've added it here for now so it is somewhere. This takes (most of) the json CV files created by amf-check-writer and creates spec files that can be used by checksit.
    • currently this creates a product-specific spec file with variables and dimensions, and a start of doing common global attributes, although this still needs a bit of work
    • still to do:
      • complete common global attributes
      • common land/sea/air/trajectory variables and dimensions
      • product-specific global attributes
  • added function check_var to generic.py and added rules_attrs keyword argument to check_global_attrs function to help with global attribute spec check

In general, these product specific spec checks work, e.g.

$ checksit check -t off --specs=amof-snr-winds ncas-radar-wind-profiler-1_cdao_20221017_snr-winds_high-mode_15min_v1.0.nc 

Running with:
	Template: OFF
	Datafile: ncas-radar-wind-profiler-1_cdao_20221017_snr-winds_high-mode_15min_v1.0.nc


---------------- Running checks ------------------

[WARNING] Template checks switched off!
[INFO] File is compliant!

and

$ checksit check -t off --specs=amof-surface-met ncas-aws-10_iao_20220301_surface-met_v1.0.nc 

Running with:
	Template: OFF
	Datafile: ncas-aws-10_iao_20220301_surface-met_v1.0.nc


---------------- Running checks ------------------

[WARNING] Template checks switched off!
[FAILED] with 8 errors:

	01. [variable**************:downwelling_longwave_flux_in_air]: Does not exist in file.
	02. [variable**************:downwelling_shortwave_flux_in_air]: Does not exist in file.
	03. [variable**************:downwelling_total_irradiance]: Does not exist in file.
	04. [variable**************:net_total_irradiance]: Does not exist in file.
	05. [variable**************:qc_flag_radiation]: Does not exist in file.
	06. [variable**************:qc_flag_downwelling_total_irradiance]: Does not exist in file.
	07. [variable**************:qc_flag_net_total_irradiance]: Does not exist in file.
	08. [dimension**************:index]: Does not exist in file.

In the second case, the variables that checksit reports as not existing do not exist in the file but should not necessarily be in the file - while they are valid variables for that data product, the ncas-aws-10 instrument does not measure radiation, and so the product-specific variables relating to radiation are omitted from the file, rather than being left in empty.

The index dimension should not appear in this spec - while in the middle of writing this I have found the error in make_specs.py and will correct this shortly.

@joshua-hampton
Copy link
Collaborator Author

make_specs.py fixed in commit d527361, with updated spec file for the surface-met product.

@joshua-hampton
Copy link
Collaborator Author

Commit c72106f adds making spec files for deployment modes, e.g. for land

@joshua-hampton
Copy link
Collaborator Author

Added section to check.py which, if template is auto (default), checks to see if the file given is a netCDF file, if so checks Conventions global attribute against known AMOF standard Conventions vocab. If so, gets spec files for that file based on data product and deployment mode, sets template to off, and runs checksit using those specs. For example:

$ checksit check ncas-radar-wind-profiler-1_cdao_20221017_snr-winds_high-mode_15min_v1.0.nc 

AMOF file detected, finding correct spec files

Running with:
	Template: OFF
	Datafile: ncas-radar-wind-profiler-1_cdao_20221017_snr-winds_high-mode_15min_v1.0.nc


---------------- Running checks ------------------

[WARNING] Template checks switched off!
[INFO] File is compliant!

Doesn't currently add in the global attributes spec file because that has not been created correctly/in a useful manner. Also still need to work on product-specific attributes.

I've not tested this against a non-netCDF file, but I have tested against a non-AMOF netCDF file, which is correctly ignored in the AMOF-checking process.

@joshua-hampton
Copy link
Collaborator Author

One issue with global attributes spec was the licence global attribute, which includes a URL. The check code is splitting the string based on a colon, which meant half of the attribute was not included. This is fixed in commit defa3b1, joining the list with a colon, and only applies to the regex checks.

@joshua-hampton
Copy link
Collaborator Author

In the general global attributes, there are two that have vocab checks - source and platform.

  • platform is taken from the AMF_platfrom CV file, with any of the locations being acceptable. In vocab speak - __vocabs__:AMF_CVs/AMF_platform:platform:* where * represents any
  • source is from the AMF_ncas_instrument (or AMF_community_instrument) CV file, however this then depends on instrument name, i.e. __vocabs__:AMF_CVs/AMF_ncas_instrument:ncas_instrument:<instrument_name>:description

To help toward this, commit 6c5b8a0 introduces the idea of wildcard to cvs.py. The wildcard can be used in specifying the vocab check, however at the moment it will only work if the wildcard is the last argument in the check. This means that platform will work, but source will not. The reason for this is that the vocab check is going through dictionaries until it gets to the required value(s) at the end, either a single value or a list of values. The wildcard returns a list of values. If the wildcard is used before the end, the vocab check will get confused at the next stage trying to take a key of a list. I think I would too.

@joshua-hampton
Copy link
Collaborator Author

Commit 1d75c78 allows for wildcard to appear anywhere in specs vocab check.

  • When a wildcard is encountered by a vocab check, checksit looks to see if the wildcard is the last key in the vocab lookup.
    • If it is, the wildcard returns a list of all keys of the dictionary being checked.
    • If it isn't, the wildcard returns a list of all the values within the object being checked, but without the keys to those values. When this list then gets checked at the next stage of the vocab check, each element in the list will be checked independently, with a new list being returned.

A (hopefully not too confusing) generic example to help (mostly me) understand, considering the very generic dictionary

{'a' : {'b' : 'c', 'd' : {'e' : 'f', 'g' : 'h'} },
 'i' : {'b' : 'j', 'd' : {'e' : 'k', 'g' : 'l'} },
 'm' : {'b' : 'n', 'd' : {'e' : 'o', 'g' : 'p'} } }

The following checks get the (also) following results:

"__all__" => ['a', 'i', 'm']
"a:__all__" => ['b', 'd']
"a:d:__all__" => ['e', 'g']
"__all__:b" => ['c', 'j', 'n']
"__all__:d" => [{'e' : 'f', 'g' : 'h'}, {'e': 'k', 'g' : 'l'}, {'e' : 'o', 'g' : 'p'}]
"__all__:d:e => ['f', 'k', 'o']

@joshua-hampton
Copy link
Collaborator Author

Final issue that needs dealing with for AMOF spec checks (I think) - how to deal with product-specific variables that are not in the netCDF file? The AMOF standard allows for these variables to not be there, but at the moment checksit expects all possible variables to be present.
One option - create a function similar to checksit.generic.check_var that returns info or warning rather than error if variable is not present?

@agstephens
Copy link
Member

Great work @joshua-hampton,

I wonder if we should have a way of specifying that any component (i.e. a dimension, variable or an attribute) could be optional. If so, we just need a nice clean way of identifying them.

@joshua-hampton
Copy link
Collaborator Author

Couple of thoughts on how we could do that:

  1. Have the option to append names with something like __optional__, then within the checksit.generic.check_XXX function check to see if that string is present before running check on that value.
  2. Add an extra parameter to the checksit.generic.check_XXX functions called required which takes a boolean value for each variable/dimension/attribute name being passed to the function (or the inverse parameter called optional)

@joshua-hampton
Copy link
Collaborator Author

Another thought - global attributes sampling_interval and averaging_interval have a Compliance Checking Rule in the spreadsheets that they should be a string with at least two characters - do we want to add an extra check that they should have units of time in them?

@joshua-hampton
Copy link
Collaborator Author

Idea on applying optional status to things in spec checks (commit c4fe779). When making spec files for data products, variables and dimensions being checked have :__OPTIONAL__ appended to them (e.g. see amof-surface-met.yml). The functions in generic.py then check the variables and dimensions to see if :__OPTIONAL__ is in each one. If it is, the check still happens, but (currently) no error is reported if the variable/dimension is not present. If :__OPTIONAL__ is not in the variable/dimension name, then the check happens as it previously did.

@agstephens
Copy link
Member

@joshua-hampton: sounds good 👍

@joshua-hampton
Copy link
Collaborator Author

joshua-hampton commented Dec 7, 2022

Commit 31a0209 - Added "warnings" as a returned parameter from all functions in generic.py. When optional product-specific dimensions or variables are missing, a warning is added, which passes back up the chain through specs.py to check.py, where they are printed out along with errors/file compliance, e.g.

(checksit_env) (base) [earjham@rsg1-202206011551 checksit]$ checksit check ncas-aws-10_iao_20220301_surface-met_v1.0.nc

AMOF file detected, finding correct spec files

Running with:
	Template: OFF
	Spec Files: ['amof-common-land', 'amof-surface-met', 'amof-global-attrs']
	Datafile: ncas-aws-10_iao_20220301_surface-met_v1.0.nc


---------------- Running checks ------------------

[WARNING] Template checks switched off!
[FAILED] with 5 errors:

	01. [global-attributes:******:creator_email]*** Value 'CHANGE: E-mail address of the person who made the file. Valid email' does not match regex rule: 'valid-email'.
	02. [global-attributes:******:creator_url]*** Value 'CHANGE: ORCID URL of the person who made the file. Valid URL' does not match regex rule: 'valid-url'.
	03. [global-attributes:******:calibration_certification_date]*** Value 'CHANGE: when bought' does not match regex rule: 'datetime-or-na'.
	04. [global-attributes:******:processing_level]*** Value '1' is not of required type: 'integer'.
	05. [global-attributes:******:platform_altitude]*** Value 'CHANGE: Altitude of the deployment position above MSL. Enter Not Applicable for air. Exact match: <number> m' does not match regular expression: '^\d+\.?\d*\sm$'.

[WARNING] 7 warnings about file:

	01. [variable**************:downwelling_longwave_flux_in_air]: Optional variable does not exist in file.
	02. [variable**************:downwelling_shortwave_flux_in_air]: Optional variable does not exist in file.
	03. [variable**************:downwelling_total_irradiance]: Optional variable does not exist in file.
	04. [variable**************:net_total_irradiance]: Optional variable does not exist in file.
	05. [variable**************:qc_flag_radiation]: Optional variable does not exist in file.
	06. [variable**************:qc_flag_downwelling_total_irradiance]: Optional variable does not exist in file.
	07. [variable**************:qc_flag_net_total_irradiance]: Optional variable does not exist in file.

I've also added a flag to the check command in cli.py - -w/--ignore-warnings which means the warnings don't print out, e.g.

(checksit_env) (base) [earjham@rsg1-202206011551 checksit]$ checksit check -w ncas-aws-10_iao_20220301_surface-met_v1.0.nc

AMOF file detected, finding correct spec files

Running with:
	Template: OFF
	Spec Files: ['amof-common-land', 'amof-surface-met', 'amof-global-attrs']
	Datafile: ncas-aws-10_iao_20220301_surface-met_v1.0.nc


---------------- Running checks ------------------

[WARNING] Template checks switched off!
[FAILED] with 5 errors:

	01. [global-attributes:******:creator_email]*** Value 'CHANGE: E-mail address of the person who made the file. Valid email' does not match regex rule: 'valid-email'.
	02. [global-attributes:******:creator_url]*** Value 'CHANGE: ORCID URL of the person who made the file. Valid URL' does not match regex rule: 'valid-url'.
	03. [global-attributes:******:calibration_certification_date]*** Value 'CHANGE: when bought' does not match regex rule: 'datetime-or-na'.
	04. [global-attributes:******:processing_level]*** Value '1' is not of required type: 'integer'.
	05. [global-attributes:******:platform_altitude]*** Value 'CHANGE: Altitude of the deployment position above MSL. Enter Not Applicable for air. Exact match: <number> m' does not match regular expression: '^\d+\.?\d*\sm$'.

@agstephens
Copy link
Member

Another thought - global attributes sampling_interval and averaging_interval have a Compliance Checking Rule in the spreadsheets that they should be a string with at least two characters - do we want to add an extra check that they should have units of time in them?

@joshua-hampton: I think all structural rules like this should be referred to Barb.

@joshua-hampton
Copy link
Collaborator Author

joshua-hampton commented Jan 4, 2023

Testing this with a file that isn't mine has produced some results of note:

  1. This particular file uses the global attribute feature_type instead of featureType. In this case, checksit has correctly identified an error with the file, however I wonder if it would be helpful/possible to write something that identifies a "close mistake" and offers a suggestion on how to fix.
  2. Similarly, this file calls a dimension altitudes rather than altitude. As altitude is an optional dimension, checksit raises a warning about the file, but would still call it compliant, which in this case is probably wrong. Again, a "did you mean this" or "I wasn't expecting to see this" would provide the user with more information when they might be thinking "I'm sure I put altitude in".
  3. The platform global attribute requires an exact match to one of the platform_ids here. For an instrument not at any of those locations (e.g. anything on fieldwork I'd guess), none of these options might make sense.
  4. The spec checks do not check the value of a variable attribute, just for it's existence. There are a number of variable attributes that have defined values (e.g. long_name) that might be worth checking. Some errors in the file I spotted that checksit didn't is with qc_flag:flag_values and qc_flag:flag_meanings, and these are common errors that I've seen from many files (and I made myself once) and fail a CF checker:
    • the flag_values are a single string looking like "0b,1b,2b...", which looks right when looking at the spreadsheet, however the CF conventions state they should be the same type as the variable to which it is attached, and contains a list of the possible flag values, i.e. a list of byte values 0b 1b 2b....
    • flag_meanings also appear to have been copied from the spreadsheet, and are given as a single string separated by \n, CF conventions state they should be space separated.

@agstephens
Copy link
Member

@joshua-hampton: I really like your ideas for giving people hints on what they might have meant.

@joshua-hampton
Copy link
Collaborator Author

joshua-hampton commented Feb 14, 2023

I've found another issue, when doing a rule check on a global attribute. Specifically, this one from specs/groups/amof-global-attrs.yml

geospatial_bounds: rule-func:string-of-length:8+

In one netCDF file, the attribute was spelt incorrectly, and dct['global_attributes'].get(attr, UNDEFINED) (line 61 in generic.py) returned UNDEFINED. However, UNDEFINED matches rule-func:string-of-length:8+, and so passed the checker.


Suggested fix - the check for attribute existence is done separately. This may need to be done in other places too.

@joshua-hampton
Copy link
Collaborator Author

Played around a bit with the idea of giving hints on potentially misspelt items (da93da7). If a variable/attribute/dimension specified in the spec file is not found in the netCDF file, a function is called that takes the name of what should be there, produces a set of possible "close matches" defined as up to two errors, and then looks through these close matches to see if any of them match anything in the file. For example, if a file had the variable yr instead of year, and the global attribute sauce rather than source, then checksit would return

1. [variable**************:year]: Does not exist in file. 'yr' was found in this file, should this be 'year'?
2. [global-attributes:**************:source]: Attribute 'source' does not exist. 'sauce' was found in this file, should this be 'source'?

There are a few limitations on this method. Firstly, I've assumed that the first character will be correct. This is mostly because you could otherwise get

23. [dimension**************:altitude]: Optional dimension does not exist in file. 'latitude' was found in this file, should this be 'altitude'?

Secondly, if there are similar variable/attribute/dimension names intended to be in the file, they can come up as suggestions. For example, the snr-winds data product has variables skew_of_beam_1, skew_of_beam_2, and skew_of_beam_3. If one of those was spelt incorrectly, e.g. skew_of_bem_2, you might end up with the following return

1. [variable**************:skew_of_beam_2]: Optional variable does not exist in file. 'skew_of_beam_1' was found in this file, should this be 'skew_of_beam_2'?

The functions that are doing this are currently in checksit/generic.py, but I guess if they are useful they could be moved somewhere else (checksit/utils.py?) to be available elsewhere.

@joshua-hampton
Copy link
Collaborator Author

Commit 843444c - changed specs to check values of variable attributes. Errors such as

[variable**************:time]: Attribute 'long_name' must have definition Time (seconds since 1970-01-01 00:00:00), not Tim (seconds since 1970-01-01 00:00:00).

are now caught.

@joshua-hampton
Copy link
Collaborator Author

Also found a bug when logging in "compact" mode, checksit would attempt to run template check regardless of whether one was required or not (e.g. AMOF netCDFs only using specs). This is fixed in 15183b0

@joshua-hampton
Copy link
Collaborator Author

CLI option to skip spellchecking added in 51f7f8e. Creating a pull request to main with latest changes

@joshua-hampton
Copy link
Collaborator Author

joshua-hampton commented May 10, 2023

Something that still needs doing, how to work with newer releases of NCAS-GENERAL standard. This will require:

  • checking which version of standard is being used
  • storing specs for different versions
  • downloading specs of new versions if not already available to checksit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants