Proposal for refactoring & higher maintainability #159
Hi @di-hardt, thanks for your comments. I will first (if you are OK with it) move this issue to the sdrf-pipelines repository, because the issue is more related to the Python validator than to the specification.
Hi @di-hardt:
Currently, the tool does two main tasks. The validator was originally built using the library pandas_schema, which lets us define a pandas schema and run the validation against it. I am more than happy to move it to Pydantic. However, before making the move, I would like to make sure that the refactoring is smooth: for example, the pandas_schema approach allows us to easily load the SDRF (TSV) into pandas and pass it to the validator defined with pandas_schema. It would also be nice for the other contributors in the project to have an idea of whether it is better to drop pandas_schema in favor of pydantic.
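For context, the current approach is roughly like the following sketch (the column names, checks, and file name are only placeholders, not the actual schema used in sdrf-pipelines):

```python
# Rough sketch of the pandas_schema-style validation (illustrative only).
import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import CustomElementValidation, MatchesPatternValidation

def not_empty(value) -> bool:
    # Placeholder check: a cell must not be empty.
    return bool(str(value).strip())

schema = Schema([
    Column("source name", [CustomElementValidation(not_empty, "source name cannot be empty")]),
    Column("characteristics[organism]", [MatchesPatternValidation(r"\S+")]),
])

# Load the SDRF (TSV) into pandas and pass the relevant columns to the schema.
sdrf = pd.read_csv("example.sdrf.tsv", sep="\t")  # placeholder file name
errors = schema.validate(sdrf[["source name", "characteristics[organism]"]])
for error in errors:
    print(error)
```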
This is similar to what we have now with pandas_schema.
Great.
Looks OK.
Agreed. Let's plan this. This is a big refactoring that needs to be planned step by step, because the sdrf-pipelines validator is in active development with other contributors, and is also in production (in use) in PRIDE, quantms, and the proteomics-metadata validator, with almost 300 downloads per month. I propose the following:
Hope this comment makes sense. Waiting for your feedback, @di-hardt.
Hey @ypriverol, thanks for the detailed comments. Makes sense to me, I'm on it.
I totally support this endeavour. I would love to be able to more easily adapt validators based on downstream applications.
@di-hardt Have you started on this? We have some time now to start refactoring some of the code, and we think this may be a good first use case.
Yes! I started it in the ELIXIR Proteomic Community GitHub, as we forked the project a while ago. The new implementation lives in the branch, where I also explain the structure and everything. After my vacation I want to finish the sdrf-default use case.
Hey,
I noticed the SDRF validator is a bit difficult to maintain and to understand in its current state, as it is one long Python script.
To make it more maintainable and extensible for future changes and integrations, I would like to propose a refactoring and, at the same time, the integration of a data validation framework like pydantic.
Pydantic basically adds validators to class attributes via so-called Field objects. While it already supports basic validators like decimal and length constraints, there is also the option to implement custom ones, e.g. checking the validity of ontology terms.
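As a rough sketch (pydantic v2 syntax; the class, fields, and the ontology check are just placeholders):

```python
# Sketch of a pydantic-based record model; names and checks are illustrative.
from pydantic import BaseModel, Field, field_validator

def is_known_ontology_term(term: str) -> bool:
    # Placeholder for a real ontology lookup (e.g. against OLS).
    return term.startswith("NT=")

class HumanRecord(BaseModel):
    source_name: str = Field(min_length=1)
    organism: str
    age: int | None = Field(default=None, ge=0, le=150)

    @field_validator("organism")
    @classmethod
    def organism_must_be_valid_term(cls, value: str) -> str:
        if not is_known_ontology_term(value):
            raise ValueError(f"'{value}' is not a known ontology term")
        return value
```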
There are multiple options for how to implement this. My suggestion is to create one pydantic model for each SDRF template record (e.g. cell-line, human, plant, ...), where one record corresponds to one line in the SDRF.
Pydantic already supports JSON, which would align with the plans for a JSON representation described in bigbio/proteomics-sample-metadata#696.
I already looked into the possibility of automatically reading and parsing CSVs/TSVs.
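Roughly, the parsing could feed such models line by line (the column-to-field mapping, file handling, and the import are placeholders):

```python
# Sketch: map SDRF columns to model fields and validate row by row (illustrative).
import csv
from pydantic import ValidationError
from records.human import HumanRecord  # hypothetical import; HumanRecord as sketched above

COLUMN_TO_FIELD = {  # placeholder mapping
    "source name": "source_name",
    "characteristics[organism]": "organism",
    "characteristics[age]": "age",
}

def validate_sdrf(path: str) -> list[str]:
    errors: list[str] = []
    with open(path, newline="") as handle:
        # Data rows start on line 2 (line 1 is the header).
        for line_no, row in enumerate(csv.DictReader(handle, delimiter="\t"), start=2):
            data = {field: row[col] for col, field in COLUMN_TO_FIELD.items() if col in row}
            try:
                HumanRecord(**data)
            except ValidationError as exc:
                for err in exc.errors():
                    attribute = ".".join(str(part) for part in err["loc"])
                    errors.append(f"line {line_no}, {attribute}: {err['msg']}")
    return errors
```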
A structure like this can also be easily tested, as each validator can be tested by itself, as well as in combination with others in one of the record types.
A possible structure could look like this:
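Something along these lines, with the module and class names purely as placeholders (the idea being a shared base model plus one model per template):

```python
# Hypothetical layout (illustrative):
#   records/base.py        - BaseRecord with the columns shared by every template
#   records/human.py       - HumanRecord(BaseRecord) adding human-specific columns
#   records/cell_line.py   - CellLineRecord(BaseRecord) adding cell-line columns
#   validators/ontology.py - reusable validators, e.g. ontology-term checks
#   parser.py              - TSV -> list of record models
from pydantic import BaseModel, Field

class BaseRecord(BaseModel):
    source_name: str = Field(min_length=1)
    assay_name: str = Field(min_length=1)

class CellLineRecord(BaseRecord):
    cell_line: str = Field(min_length=1)  # e.g. characteristics[cell line]
```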
One additional advantage would be the increased expressiveness of the error messages, as the validators will give an exact error message per attribute, which can then be easily addressed by the SDRF creator or, in the case of a new implementation, by the developer.
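For example, pydantic reports one error per offending attribute (the output wording may differ between versions; CellLineRecord is the placeholder model from the sketch above):

```python
# Sketch: surfacing pydantic's per-attribute error messages (illustrative values).
from pydantic import ValidationError

try:
    CellLineRecord(source_name="", assay_name="run 1", cell_line="HeLa")
except ValidationError as exc:
    for err in exc.errors():
        attribute = ".".join(str(part) for part in err["loc"])
        # e.g. "source_name: String should have at least 1 character"
        print(f"{attribute}: {err['msg']}")
```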
The modularity would also allow other projects to include only parts of the validation process, e.g. sdrf_convert needs only the parts of the validation covering the variables used by the targeted software, e.g. Comet, DIA-NN, ...