Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enforce no spaces in file names #139

Closed
adamjtaylor opened this issue Dec 22, 2022 · 23 comments · Fixed by #145
Closed

Enforce no spaces in file names #139

adamjtaylor opened this issue Dec 22, 2022 · 23 comments · Fixed by #145
Assignees
Labels
effort-mid This one needs your brain

Comments

@adamjtaylor
Copy link
Contributor

This has caused a number of downstream issues in the past and we should enforce with a regex validation rule

@aclayton555
Copy link
Contributor

Please work with @brynnz22 on this.

Do we enforce no spaces? no white space characters at all? no special characters?

@aclayton555 aclayton555 added the effort-mid This one needs your brain label Jan 9, 2023
@adamjtaylor
Copy link
Contributor Author

the regex string ^\S*$ should be appropriate to capture start of string, any number of non-whitespace characters followed by the end of the string

@adamjtaylor
Copy link
Contributor Author

From the docs this can be implemented as adding regex match ^\S*$ to the Validation Rules column for the attribute Filename

@adamjtaylor
Copy link
Contributor Author

@brynnz22 do you think you try and open a PR to modify the data model to add the validation rule above. Happy to do so in a working session if preferable. We can try and deploy this branch to a testing instance and see the rule works appropriately.

@aclayton555
Copy link
Contributor

Note to Brynn: Make a new branch titled based on this, then open a PR against main. Once PR done, work with Adam to run action: Should be able to run gh action on that branch to deploy to testing1

@adamjtaylor
Copy link
Contributor Author

I've deployed this to the testing1 instance

To do this I:

Of course this will all change for the upcoming multi-tennant DCA

@adamjtaylor
Copy link
Contributor Author

I'm getting a DCA disconnect when I try and generate a template in the testing instance. Will look at logs and see if I can see what the issue is.

@adamjtaylor
Copy link
Contributor Author

Nothing in the logs to suggest the cause 😪

@brynnz22
Copy link
Contributor

brynnz22 commented Jan 23, 2023

@adamjtaylor Is it the weird quotes and things that were inserted in the csv that could be causing this?

@adamjtaylor
Copy link
Contributor Author

@brynnz22 Yeah I think it might be the quoted validation rule "url " highlighted in my review of the PR. Lets remove them (editing the csv directly in the Github web UI might be the easiest way to avoid any local formatting issues) and then we can redeploy to testing1 Thank you!

@adamjtaylor
Copy link
Contributor Author

Hey @brynnz22 I've been trying to test this validation rule locally with a small test template based from OtherAssay

Component,Filename,File Format,HTAN Parent Biospecimen ID,HTAN Data File ID,Assay Type
OtherAssay,underscore_separated.csv,csv,HTA1_1_1,HTA1_2_2,Test
OtherAssay,space separated.csv,csv,HTA1_1_1,HTA1_2_2,Test

This should fail validation due to the space.... but schematic seems to trip up with error KeyError: 'Filename'

(data-models) ataylor@w152 test-nospace-validation % schematic model -c schematic_config.yml validate -mp other_assay_example.csv -dt "OtherAssay"
Starting schematic...
The (model > input > location) argument with value 'HTAN.model.jsonld' is being read from the config file.
The (model > input > file_type) argument with value 'local' is being read from the config file.
JSON schema successfully generated from schema.org schema!
JSON schema file log stored as HTAN.OtherAssay.schema.json
FileDataContext loading zep config
GxConfig.parse_yaml() failed with errors - [{'loc': ('xdatasources',), 'msg': 'field required', 'type': 'value_error.missing'}]
GxConfig.parse_yaml() returning empty `xdatasources`
Loading 'datasources' ->
{}
Loaded 'datasources' ->
{}
EphemeralDataContext has not implemented `_load_zep_config()` returning empty `GxConfig`
Loading 'datasources' ->
{}
Loaded 'datasources' ->
{}
warning: /Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/great_expectations/data_context/data_context/abstract_data_context.py:2590: DeprecationWarning: create_expectation_suite is deprecated as of v0.15.48 and will be removed in v0.18. Please use add_expectation_suite or add_or_update_expectation_suite instead.
warning:   warnings.warn(
warning: /Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/great_expectations/data_context/data_context/abstract_data_context.py:373: DeprecationWarning: save_expectation_suite is deprecated as of v0.15.48 and will be removed in v0.18. Please use update_expectation_suite or add_or_update_expectation_suite instead.
warning:   warnings.warn(
        0 expectation(s) included in expectation_suite.
Calculating Metrics: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/Users/ataylor/miniforge3/envs/data-models/bin/schematic", line 8, in <module>
    sys.exit(main())
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/click/decorators.py", line 38, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/schematic/models/commands.py", line 209, in validate_manifest
    errors, warnings = metadata_model.validateModelManifest(
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/schematic/models/metadata.py", line 254, in validateModelManifest
    errors, warnings, manifest = validate_all(self, errors, warnings, manifest, manifestPath, self.sg, jsonSchema, restrict_rules, project_scope)
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/schematic/models/validate_manifest.py", line 253, in validate_all
    manifest, vmr_errors, vmr_warnings = vm.validate_manifest_rules(manifest, sg, restrict_rules, project_scope)
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/schematic/models/validate_manifest.py", line 208, in validate_manifest_rules
    vr_errors, vr_warnings = validation_method(
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/schematic/models/validate_attribute.py", line 666, in regex_validation
    validation_rules=self.sg.se.get_class_validation_rules(self.sg.se.get_class_label_from_display_name(manifest_col.name))
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/schematic/schemas/explorer.py", line 432, in get_class_validation_rules
    class_info = self.explore_class(class_label)
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/schematic/schemas/explorer.py", line 333, in explore_class
    if "subClassOf" in self.schema_nx.nodes[schema_class]:
  File "/Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/networkx/classes/reportviews.py", line 194, in __getitem__
    return self._nodes[n]
KeyError: 'Filename'

I think this actually related to something going on in the JSON-LD of the schema as it contains the following

        {
            "@id": "bts:filename",
            "@type": "rdf:Property",
            "rdfs:comment": "Name of a file",
            "rdfs:label": "filename",
            "schema:domainIncludes": {
                "@id": "bts:File"
            },
            "schema:isPartOf": {
                "@id": "http://schema.biothings.io"
            },
            "sms:displayName": "Filename",
            "sms:required": "sms:true",
            "sms:validationRules": [
               "regex search ^\\S*$"
            ]
        },

Note the lowercase filename for id and label while only displayName has the title case Filename.
Note also that only title case Filename is used in the CSV version of the model

If I manually edit the JSON-LD to have title case Filename for id and label the validation runs as expected, showing the expected error for the second row

        {
            "@id": "bts:Filename",
            "@type": "rdf:Property",
            "rdfs:comment": "Name of a file",
            "rdfs:label": "Filename",
            "schema:domainIncludes": {
                "@id": "bts:File"
            },
            "schema:isPartOf": {
                "@id": "http://schema.biothings.io"
            },
            "sms:displayName": "Filename",
            "sms:required": "sms:true",
            "sms:validationRules": [
               "regex search ^\\S*$"
            ]
        },
(data-models) ataylor@w152 test-nospace-validation % schematic model -c schematic_config.yml validate -mp other_assay_example.csv -dt "OtherAssay"
Starting schematic...
The (model > input > location) argument with value 'HTAN.model.jsonld' is being read from the config file.
The (model > input > file_type) argument with value 'local' is being read from the config file.
JSON schema successfully generated from schema.org schema!
JSON schema file log stored as HTAN.OtherAssay.schema.json
FileDataContext loading zep config
GxConfig.parse_yaml() failed with errors - [{'loc': ('xdatasources',), 'msg': 'field required', 'type': 'value_error.missing'}]
GxConfig.parse_yaml() returning empty `xdatasources`
Loading 'datasources' ->
{}
Loaded 'datasources' ->
{}
EphemeralDataContext has not implemented `_load_zep_config()` returning empty `GxConfig`
Loading 'datasources' ->
{}
Loaded 'datasources' ->
{}
warning: /Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/great_expectations/data_context/data_context/abstract_data_context.py:2590: DeprecationWarning: create_expectation_suite is deprecated as of v0.15.48 and will be removed in v0.18. Please use add_expectation_suite or add_or_update_expectation_suite instead.
warning:   warnings.warn(
warning: /Users/ataylor/miniforge3/envs/data-models/lib/python3.10/site-packages/great_expectations/data_context/data_context/abstract_data_context.py:373: DeprecationWarning: save_expectation_suite is deprecated as of v0.15.48 and will be removed in v0.18. Please use update_expectation_suite or add_or_update_expectation_suite instead.
warning:   warnings.warn(
        0 expectation(s) included in expectation_suite.
Calculating Metrics: 0it [00:00, ?it/s]
error: For the attribute Filename, on row 3, the string is not properly formatted. It should follow the following re.search pattern "^\S*$".
[['3', 'Filename', 'For the attribute Filename, on row 3, the string is not properly formatted. It should follow the following re.search pattern "^\\S*$".', 'space separated.csv']]

This seems related to this error reported on Slack: https://sagebionetworks.slack.com/archives/C01ANC02U59/p1671129949628629?thread_ts=1671068131.722179&cid=C01ANC02U59

I also wonder if we can't actually validate Filename as it is a Reserve-use special attribute per the Data Model Schema docs.

Does this make sense?

@adamjtaylor
Copy link
Contributor Author

FYI @milen. We would like to add a regex validation rule to prevent spaces in filenames. But we seem to get a key error when adding a regex validation rule to this attribute. See above.

In further testing I can add other validation rules (eg int or str) for the Filename attribute but regex validation rules result in the same error.

@adamjtaylor
Copy link
Contributor Author

adamjtaylor commented Feb 22, 2023

@brynnz22
Copy link
Contributor

Yes this makes sense! I saw that Gianna is looking into it on Slack.

@GiaJordan
Copy link

GiaJordan commented Feb 22, 2023

From slack

@adamjtaylor can you open an issue on this in the schematic repo? I think its more related to the conversion from csv to jsonld. I was able to add the regex rule to our example model and validate without issue. I also noticed that in our model and jsonld, Filename is capitalized. I tried converting the HTAN model with what's on develop and noticed that it was converted as filename like it was in yours. Something aside from the issue with the validation rule is going on and it'll likely take more time to diagnose and solve.

cc @milen-sage @MiekoHash

@milen-sage
Copy link
Contributor

Just a note: Filename, should have this capitalization, since schematic treats this attribute specially. It's concerning that it gets lower cased.

@adamjtaylor
Copy link
Contributor Author

Update ahead of the sprint review:

  • I think this issue is blocked by the lowercasing of Filename in our JSON-LD`
  • There is a schematic ticket open liked above but it has had no action yet (FYI @milen-sage )
  • Propose we move this to escalate to monitor resolution
  • It is a "nice to have" rather than "must have" feature so we can manage waiting
  • We can also implement this check at the data release stage (@clarisse-lau)

@milen-sage
Copy link
Contributor

@adamjtaylor it's in the queue for next sprint triage: see here

@MiekoHash
Copy link

@adamjtaylor FAIR Data team moved our issue tracking system to JIRA as of this week. Diagnosis of schematic issue is now tracked here.

@MiekoHash
Copy link

Per Gianna's diagnosis, FAIR's recommendation is that "HTAN should move the Filename specification from the properties column of each attribute to the DependsOn column to resolve"

@adamjtaylor
Copy link
Contributor Author

adamjtaylor commented Mar 28, 2023

Thank you @MiekoHash
@aclayton555, @elv-sb I think this competing this move is a separate issue (#195) for us to tackle next data model sprint.

@adamjtaylor
Copy link
Contributor Author

Confirmed in staging that this catches spaces in Filename.

Unfortunately, it also catches Not Applicable in the Channel Metadata filename.

Proposing that we remove the validation from Channel Metadata Filename as we expect deeper validation of these files and a possible transition to synapse IDs.

1 similar comment
@adamjtaylor
Copy link
Contributor Author

Confirmed in staging that this catches spaces in Filename.

Unfortunately, it also catches Not Applicable in the Channel Metadata filename.

Proposing that we remove the validation from Channel Metadata Filename as we expect deeper validation of these files and a possible transition to synapse IDs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
effort-mid This one needs your brain
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants