Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rowData clean-up from character artifacts #406

Merged
merged 6 commits into from
Aug 4, 2023
Merged

rowData clean-up from character artifacts #406

merged 6 commits into from
Aug 4, 2023

Conversation

ChouaibB
Copy link
Contributor

Hi,

This is related to the #303 discussion.
A solution draft for cleaning up feature_data/rowData loaded from biom files that might contain some character artifacts (e.g. ").
The attached pdf file represent tests of the function in question makeTreeSEFromBiom (which was also based on the example test at #303 ).

If it seems fine, I could update the examples at documentation using the function's argument, make unit tests, and bump the version.

Thanks.

makeTreeSEFromBiom_local_test.pdf

@ChouaibB ChouaibB linked an issue Jul 27, 2023 that may be closed by this pull request
R/makeTreeSummarizedExperimentFromBiom.R Outdated Show resolved Hide resolved
R/makeTreeSummarizedExperimentFromBiom.R Outdated Show resolved Hide resolved
@antagomir
Copy link
Member

checks to fix

@ChouaibB
Copy link
Contributor Author

checks to fix

True, when I checked the job run log it seemed that it was related to other functions. And there seems to be not test script for makeTreeSEFromBiom. That's why I re-ran it, I stop the run now and do the updates first.

@antagomir
Copy link
Member

It is possible to place testdata in the package in inst/extdata/ but not sure if this is necessary for now.

@ChouaibB
Copy link
Contributor Author

ChouaibB commented Jul 27, 2023

It is possible to place testdata in the package in inst/extdata/ but not sure if this is necessary for now.

Sure, I could give it a try and add the tests for makeTreeSEFromBiom if otherwise changes to the function are ok?

@antagomir
Copy link
Member

Hmm - one more change: it would be good to explicitly detect if there are such non-standard characters to remove, then throw a warning if such characters are being removed? This would make the process more safe. By default there should not be such extra characters, I think, and if there are it may indicate some problems that the users may wish to be aware of.

That new file can be called with:
system.file("extdata/myfilename.end", package="mia")

@ChouaibB
Copy link
Contributor Author

ChouaibB commented Jul 28, 2023

Hmm - one more change: it would be good to explicitly detect if there are such non-standard characters to remove, then throw a warning if such characters are being removed? This would make the process more safe. By default there should not be such extra characters, I think, and if there are it may indicate some problems that the users may wish to be aware of.

That new file can be called with: system.file("extdata/myfilename.end", package="mia")

That's actually true, thanks. Is it ok to keep the default "artifact" pattern to detect and remove as \" and then generalize it by the time? bit hard for me to know what kind of non-wanted (artifact) characters that might occur in the taxonomy data within a biom file :/ ; unless ofc if the user priorly knows and could use the cleanTaxaPattern argument, and eventually throw a warning about the removal?

I update the system.file too, thanks.

@ChouaibB
Copy link
Contributor Author

ChouaibB commented Jul 28, 2023

Hmm - one more change: it would be good to explicitly detect if there are such non-standard characters to remove, then throw a warning if such characters are being removed? This would make the process more safe. By default there should not be such extra characters, I think, and if there are it may indicate some problems that the users may wish to be aware of.
That new file can be called with: system.file("extdata/myfilename.end", package="mia")

That's actually true, thanks. Is it ok to keep the default "artifact" pattern to detect and remove as \" and then generalize it by the time? bit hard for me to know what kind of non-wanted (artifact) characters that might occur in the taxonomy data within a biom file :/ ; unless ofc if the user priorly knows and could use the cleanTaxaPattern argument, and eventually throw a warning about the removal?

I update the system.file too, thanks.

Sorry for the many replies :)
I was just thinking with my humble experience with Taxonomy data, it always contain alphabets (lower and upper cases), and numeric characters (alphanumeric characters) plus underscores and/or dashes, anything else should be an artifact to detect and remove (except the characters used to parse data into table e.g sep=; or sep=|), building such a regex do you think it would be safe and generalized for detection and removal (while ofc throwing the warning about it)?

@antagomir
Copy link
Member

Ok that could be the default now - but it should be argument that the user can tweak if necessary. We can later remove this if it turns problematic. The real solution would comply with biom standard, I am not sure what that says about the extra characters. There are pros & cons but in a way we also want to make this fluent to users so that they can focus on analysis of the data.

@ChouaibB
Copy link
Contributor Author

ChouaibB commented Jul 28, 2023

Ok that could be the default now - but it should be argument that the user can tweak if necessary. We can later remove this if it turns problematic. The real solution would comply with biom standard, I am not sure what that says about the extra characters. There are pros & cons but in a way we also want to make this fluent to users so that they can focus on analysis of the data.

Before implementing anything, I was still curious about a way to detect non-standard character artifacts in Taxonomy data, so I made a small testing around based on an example file (attached in here):
`

Example Data

Data was copied from Aggregated_humanization2.biom (manually through bash
command less); the section corresponding to taxonomy data.
Then saved to a text file; test_text.txt

Reading data

Just as an example to test the search for artifacts, the text was split based on
"id", then each resulting list was made as a character vector:

test_text <- readr::read_lines("test_text.txt") %>% stringr::str_split("id") %>%
lapply(., function(l) paste(unlist(l), collapse = " "))

Testing the regular expression on one example of the test_text

Retrieving taxonomy data including the separators that exists in taxonomy data
which should be kept to parse them as a table afterwards.

Testing with one example, retrieving the taxonomy information usually wanted:

grep("[[:alnum:]]|[-_,;|]|[[:space:]]",
test_text[[1]] %>% stringr::str_split("") %>% unlist(),
invert = FALSE, value = TRUE) %>% paste0(collapse = "")

result -> " 1726470, metadata taxonomy k__Bacteria, p__Bactero etes, c__Bactero ia, o__Bactero ales, f__Bactero aceae, g__Bactero es, 1726471, metadata "

Testing the negative (or invert) of the regular expression to retrieve the
artifacts:

grep("[[:alnum:]]|[-_,;|]|[[:space:]]",
test_text[[1]] %>% stringr::str_split("") %>% unlist(),
invert = TRUE, value = TRUE) %>% paste0(collapse = "")
result -> "[{"":"""":{"":["\"""""""""""\""]}}{"":"""":{""

To collect the unique list of artifacts to be cleaned later:

grep("[[:alnum:]]|[-_,;|]|[[:space:]]",
test_text[[1]] %>% stringr::str_split("") %>% unlist(),
invert = TRUE, value = TRUE) %>% unique()
result -> "[" "{" """ ":" "\" "]" "}"

Testing the regular expression on the whole test_text

Collecting the final unique list of character artifacts to be cleaned throughout
the whole Taxonomy data:

lapply(test_text, function(x) {
grep("[[:alnum:]]|[-_,;|]|[[:space:]]",
x %>% stringr::str_split("") %>% unlist(),
invert = TRUE, value = TRUE) %>% unique()
}) %>% unlist() %>% unique()

result -> "[" "{" """ ":" "\" "]" "}"

Consequently, this could be perhaps an automated way of detecting, and forming
the patterns to be cleaned at the "makeTreeSEFromBiom" (which automatically
will become 'cleanTaxaPattern="[{":\]}"') if for example the user
provide "cleanTaxaPattern=NULL" or smt similar (maybe even giving an value option 'auto' for instance),
to hopefully safely parse the taxonomy data, which would safely parse rank names, taxonomy data afterwards in the function.
`
test_text.txt

Ofc with the current implementation the biomformat::observation_metadata does some parsing and cleaning of some of the above seen artifacts in the raw data file. The idea I was thinking if this seems reasonable, a further utility/helper function(s) I could implement that would double check by detecting, cleaning and throwing the warning.
P.S if it seems interesting please some hints how to improve the regex used in the example, thanks :)

@antagomir
Copy link
Member

Can you summarize in a much shorter way what is the problem and what is the proposed solution?

@ChouaibB
Copy link
Contributor Author

Can you summarize in a much shorter way what is the problem and what is the proposed solution?

I will be committing the implementation I had in mind, then I guess it would show clearly.

Copy link
Member

@antagomir antagomir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

R/makeTreeSummarizedExperimentFromBiom.R Outdated Show resolved Hide resolved
R/makeTreeSummarizedExperimentFromBiom.R Outdated Show resolved Hide resolved
R/makeTreeSummarizedExperimentFromBiom.R Outdated Show resolved Hide resolved
R/makeTreeSummarizedExperimentFromBiom.R Outdated Show resolved Hide resolved
R/makeTreeSummarizedExperimentFromBiom.R Outdated Show resolved Hide resolved
R/makeTreeSummarizedExperimentFromBiom.R Outdated Show resolved Hide resolved
R/makeTreeSummarizedExperimentFromBiom.R Outdated Show resolved Hide resolved
R/makeTreeSummarizedExperimentFromBiom.R Outdated Show resolved Hide resolved
R/makeTreeSummarizedExperimentFromBiom.R Show resolved Hide resolved
@antagomir
Copy link
Member

@ChouaibB note that @RiboRings is now making some changes in OMA
microbiome/OMA#316

@ChouaibB
Copy link
Contributor Author

ChouaibB commented Aug 1, 2023

One question/suggestion @antagomir @TuomasBorman :
If this implementation seems to be fine, should .detect_taxa_artifacts_and_clean and .detect_taxa_artifacts and .clean_from_artifacts be moved to R/taxonomy.R or R/utils.R that potentially could help check the taxonomy parsing of other types of data (e.g. makeTreeSummarizedExperimentFromDADA2.R or makeTreeSummarizedExperimentFromPhyloseq.R ) or such?

@antagomir
Copy link
Member

You can move if it seems useful but I think it should be possible to use these regardless of the file where they are located.

@ChouaibB
Copy link
Contributor Author

ChouaibB commented Aug 2, 2023

Ahh ok, I will keep them where they are for now.
Regarding the failed checks (link):
image
It seems they are not related to the tests of makeTreeSummarizedExperimentFromBiom.R (actually I guess it fails before even reaching the tests related of test-10makeTreeSEFromBiom.R; locally these tests all pass with no problem), is somebody working on debugging those test checks that fails or shall I also look at them so this can be potentially merged if all is ok otherwise?

@antagomir
Copy link
Member

Strange that we have such errors suddenly appearing. Would be great if you can fix on the same go?

@ChouaibB
Copy link
Contributor Author

ChouaibB commented Aug 2, 2023

Strange that we have such errors suddenly appearing. Would be great if you can fix on the same go?

I will give it a try.

@ChouaibB
Copy link
Contributor Author

ChouaibB commented Aug 3, 2023

About the failed checks:
When I did the checks locally I took some notes of the reason of the error and the fix:

══ Failed tests ════════════════════════════════════════════════════════════════
── Failure ('test-0diversity.R:27:5'): diversity estimates ─────────────────────
`lambda` not equal to .simpson_lambda(assays(tse_idx)$relabundance).
names for current but not for target
── Failure ('test-0diversity.R:28:5'): diversity estimates ─────────────────────
`ginisimpson` not equal to colData(tse_idx)$gini_simpson.
names for current but not for target
── Failure ('test-0diversity.R:29:5'): diversity estimates ─────────────────────
`ginisimpson` not equal to .calc_gini_simpson(assays(tse_idx)$relabundance).
names for current but not for target
── Failure ('test-0diversity.R:31:5'): diversity estimates ─────────────────────
`invsimpson` not equal to colData(tse_idx)$inverse_simpson.
names for current but not for target
── Failure ('test-0diversity.R:32:5'): diversity estimates ─────────────────────
`invsimpson` not equal to .calc_inverse_simpson(assays(tse_idx)$relabundance).
names for current but not for target
## ==> solution was to unname the named returned vector of each right-hand 
## side of the test 

── Failure ('test-0diversity.R:66:5'): diversity estimates ─────────────────────
`test1` not equal to `test2`.
Names: 3 string mismatches
## ==> solution was to unname both since they have same values but different 
## naming of the vectors at each side of the test

── Failure ('test-0diversity.R:67:5'): diversity estimates ─────────────────────
round(test1, 6) not equal to c(7.256706, 6.098354, 7.278894).
names for target but not for current
## ==> solution was to unname left-hand side of test otherwise values were equal

── Failure ('test-8subsample.R:35:5'): subsampleCounts ─────────────────────────
colSums2(assay(tse.subsampled, "subsampled")) not equal to `expColSums`.
names for target but not for current
## ==> solution was to unname left-hand side of test otherwise values were equal

── Failure ('test-8subsample.R:51:5'): subsampleCounts ─────────────────────────
colSums2(assay(tse.subsampled.rp, "subsampled")) not equal to `expColSumsRp`.
names for target but not for current
## ==> solution was to unname left-hand side of test otherwise values were equal  

Hopefully now it passes the checks at github :)

@ChouaibB ChouaibB requested a review from antagomir August 3, 2023 15:59
Copy link
Member

@antagomir antagomir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge when ready unless there are further comments

@ChouaibB
Copy link
Contributor Author

ChouaibB commented Aug 4, 2023

merge when ready unless there are further comments

I will merge it for now. Perhaps when the utility/helper functions within this one turn to be useful at other importers, they could be moved to taxonomy.R. I was just thinking that, by then for future development any taxonomy related helper functions could be found at taxonomy.R so the functions and utilities and the type of utilities needed could be "easily readable and found" under R/ then all the developers instead of going through each function and searching for their helper function that could be re-used; then naming the type of utility functions under R/ could be helpful and more easily readable. But this is just an opinion ofc :)

@ChouaibB ChouaibB merged commit 6663c03 into master Aug 4, 2023
1 check passed
@ChouaibB ChouaibB deleted the issue_303 branch August 4, 2023 11:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Taxonomy parsing
2 participants