rowData clean-up from character artifacts #406
checks to fix |
True, when I checked the job run log, it seemed to be related to other functions. And there seems to be no test script for |
It is possible to place testdata in the package in inst/extdata/ but not sure if this is necessary for now. |
Sure, I could give it a try and add the tests for |
Hmm - one more change: it would be good to explicitly detect whether there are such non-standard characters to remove, and then throw a warning if such characters are being removed. This would make the process safer. By default there should not be such extra characters, I think, and if there are, it may indicate problems that users may wish to be aware of. That new file can be called with: |
That's actually true, thanks. Is it ok to keep the default "artifact" pattern to detect and remove as I update the system.file too, thanks. |
Sorry for the many replies :) |
Ok, that could be the default now, but it should be an argument that the user can tweak if necessary. We can later remove this if it turns out to be problematic. The real solution would comply with the biom standard; I am not sure what that says about the extra characters. There are pros & cons, but in a way we also want to make this fluent for users so that they can focus on the analysis of the data. |
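The two suggestions above (warn whenever artifacts are actually removed, and keep the pattern as an argument the user can tweak) can be sketched roughly as follows. This is an illustrative Python sketch, not the R code from this PR: the function name `clean_taxonomy`, the argument name `pattern`, and the default pattern (double quotes and backslashes) are all assumptions made for the example.

```python
import re
import warnings

# Assumed default for illustration only: treat double quotes and
# backslashes as artifacts. The PR's real default may differ.
DEFAULT_ARTIFACT_PATTERN = r'["\\]'


def clean_taxonomy(values, pattern=DEFAULT_ARTIFACT_PATTERN):
    """Remove characters matching `pattern` from each taxonomy string.

    A warning is raised only when at least one artifact is found, so
    clean input passes through silently and unchanged.
    """
    regex = re.compile(pattern)
    # Collect the distinct artifacts that are actually present.
    found = sorted({m.group(0) for v in values for m in regex.finditer(v)})
    if not found:
        return list(values)
    warnings.warn(f"Removing artifact characters from taxonomy: {found}")
    return [regex.sub("", v) for v in values]
```

A caller who disagrees with the default could pass their own expression, e.g. `clean_taxonomy(values, pattern=r"[\[\]]")`, which is the kind of flexibility asked for above.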
Before implementing anything, I was still curious about a way to detect non-standard character artifacts in taxonomy data, so I did a small test based on an example file (attached here):
Example Data
Data was copied from Aggregated_humanization2.biom (manually through bash
Reading data
Just as an example to test the search for artifacts, the text was split based on
Testing the regular expression on one example of the test_text
Retrieving taxonomy data including the separators that exist in taxonomy data
Testing with one example, retrieving the taxonomy information usually wanted:
result -> " 1726470, metadata taxonomy k__Bacteria, p__Bactero etes, c__Bactero ia, o__Bactero ales, f__Bactero aceae, g__Bactero es, 1726471, metadata "
Testing the negative (or inverse) of the regular expression to retrieve the
To collect the unique list of artifacts to be cleaned later:
Testing the regular expression on the whole test_text
Collecting the final unique list of character artifacts to be cleaned throughout
result -> "[" "{" """ ":" "\" "]" "}"
Consequently, this could perhaps be an automated way of detecting, and forming
Ofc with the current implementation the |
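The detection idea described above (match what is expected in taxonomy text, then invert the expression to collect whatever is left) can be illustrated with a small Python sketch. This is not the R code used in the tests above: the set of allowed characters, the function names, and the sample string are all assumptions made for illustration.

```python
import re

# Assumption: letters, digits, underscores, commas, semicolons, periods,
# hyphens and whitespace are legitimate in taxonomy text. Anything else
# is treated as a candidate artifact (the "inverted" expression).
NOT_ALLOWED = re.compile(r"[^A-Za-z0-9_,;.\s-]")


def find_artifacts(text):
    """Return the unique non-allowed characters, in order of first appearance."""
    seen = []
    for match in NOT_ALLOWED.finditer(text):
        char = match.group(0)
        if char not in seen:
            seen.append(char)
    return seen


def remove_artifacts(text, artifacts):
    """Strip the detected characters using a character class built from them."""
    if not artifacts:
        return text
    char_class = "[" + "".join(re.escape(c) for c in artifacts) + "]"
    return re.sub(char_class, "", text)


# Made-up sample in the spirit of the biom metadata discussed above.
sample = '{"id": "1726470", "metadata": {"taxonomy": ["k__Bacteria", "p__Bacteroidetes"]}}'
artifacts = find_artifacts(sample)
cleaned = remove_artifacts(sample, artifacts)
```

Building the removal pattern from the detected characters, rather than hard-coding it, is what would make the clean-up automatic; the trade-off is that the allowed set then becomes the thing that has to be right.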
Can you summarize in a much shorter way what is the problem and what is the proposed solution? |
I will be committing the implementation I had in mind, then I guess it would show clearly. |
nice!
@ChouaibB note that @RiboRings is now making some changes in OMA |
One question/suggestion @antagomir @TuomasBorman : |
You can move if it seems useful but I think it should be possible to use these regardless of the file where they are located. |
Ahh ok, I will keep them where they are for now. |
Strange that we have such errors suddenly appearing. It would be great if you could fix them in the same go. |
I will give it a try. |
About the failed checks:
Hopefully it now passes the checks on GitHub :) |
merge when ready unless there are further comments
I will merge it for now. Perhaps when the utility/helper functions within this one turn out to be useful in other importers, they could be moved to |
Hi,
This is related to the #303 discussion.
A solution draft for cleaning up feature_data/rowData loaded from biom files that might contain some character artifacts (e.g. ").
The attached PDF file presents tests of the function in question, makeTreeSEFromBiom (which was also based on the example test at #303).
If it seems fine, I could update the examples in the documentation using the function's argument, write unit tests, and bump the version.
Thanks.
makeTreeSEFromBiom_local_test.pdf