rowData clean-up from character artifacts #406

ChouaibB · 2023-07-27T11:43:35Z

Hi,

This is related to the #303 discussion.
A solution draft for cleaning up feature_data/rowData loaded from biom files that might contain some character artifacts (e.g. ").
The attached pdf file represent tests of the function in question makeTreeSEFromBiom (which was also based on the example test at #303 ).

If it seems fine, I could update the examples at documentation using the function's argument, make unit tests, and bump the version.

Thanks.

makeTreeSEFromBiom_local_test.pdf

R/makeTreeSummarizedExperimentFromBiom.R

antagomir · 2023-07-27T12:23:35Z

checks to fix

ChouaibB · 2023-07-27T12:27:46Z

checks to fix

True, when I checked the job run log it seemed that it was related to other functions. And there seems to be not test script for makeTreeSEFromBiom. That's why I re-ran it, I stop the run now and do the updates first.

antagomir · 2023-07-27T13:48:08Z

It is possible to place testdata in the package in inst/extdata/ but not sure if this is necessary for now.

ChouaibB · 2023-07-27T14:42:56Z

It is possible to place testdata in the package in inst/extdata/ but not sure if this is necessary for now.

Sure, I could give it a try and add the tests for makeTreeSEFromBiom if otherwise changes to the function are ok?

antagomir · 2023-07-27T21:02:11Z

Hmm - one more change: it would be good to explicitly detect if there are such non-standard characters to remove, then throw a warning if such characters are being removed? This would make the process more safe. By default there should not be such extra characters, I think, and if there are it may indicate some problems that the users may wish to be aware of.

That new file can be called with:
system.file("extdata/myfilename.end", package="mia")

ChouaibB · 2023-07-28T10:32:28Z

Hmm - one more change: it would be good to explicitly detect if there are such non-standard characters to remove, then throw a warning if such characters are being removed? This would make the process more safe. By default there should not be such extra characters, I think, and if there are it may indicate some problems that the users may wish to be aware of.

That new file can be called with: system.file("extdata/myfilename.end", package="mia")

That's actually true, thanks. Is it ok to keep the default "artifact" pattern to detect and remove as \" and then generalize it by the time? bit hard for me to know what kind of non-wanted (artifact) characters that might occur in the taxonomy data within a biom file :/ ; unless ofc if the user priorly knows and could use the cleanTaxaPattern argument, and eventually throw a warning about the removal?

I update the system.file too, thanks.

ChouaibB · 2023-07-28T10:47:30Z

Hmm - one more change: it would be good to explicitly detect if there are such non-standard characters to remove, then throw a warning if such characters are being removed? This would make the process more safe. By default there should not be such extra characters, I think, and if there are it may indicate some problems that the users may wish to be aware of.
That new file can be called with: system.file("extdata/myfilename.end", package="mia")

That's actually true, thanks. Is it ok to keep the default "artifact" pattern to detect and remove as \" and then generalize it by the time? bit hard for me to know what kind of non-wanted (artifact) characters that might occur in the taxonomy data within a biom file :/ ; unless ofc if the user priorly knows and could use the cleanTaxaPattern argument, and eventually throw a warning about the removal?

I update the system.file too, thanks.

Sorry for the many replies :)
I was just thinking with my humble experience with Taxonomy data, it always contain alphabets (lower and upper cases), and numeric characters (alphanumeric characters) plus underscores and/or dashes, anything else should be an artifact to detect and remove (except the characters used to parse data into table e.g sep=; or sep=|), building such a regex do you think it would be safe and generalized for detection and removal (while ofc throwing the warning about it)?

antagomir · 2023-07-28T11:52:10Z

Ok that could be the default now - but it should be argument that the user can tweak if necessary. We can later remove this if it turns problematic. The real solution would comply with biom standard, I am not sure what that says about the extra characters. There are pros & cons but in a way we also want to make this fluent to users so that they can focus on analysis of the data.

ChouaibB · 2023-07-28T16:22:06Z

Ok that could be the default now - but it should be argument that the user can tweak if necessary. We can later remove this if it turns problematic. The real solution would comply with biom standard, I am not sure what that says about the extra characters. There are pros & cons but in a way we also want to make this fluent to users so that they can focus on analysis of the data.

Before implementing anything, I was still curious about a way to detect non-standard character artifacts in Taxonomy data, so I made a small testing around based on an example file (attached in here):
`

Example Data

Data was copied from Aggregated_humanization2.biom (manually through bash
command less); the section corresponding to taxonomy data.
Then saved to a text file; test_text.txt

Reading data

Just as an example to test the search for artifacts, the text was split based on
"id", then each resulting list was made as a character vector:

test_text <- readr::read_lines("test_text.txt") %>% stringr::str_split("id") %>%
lapply(., function(l) paste(unlist(l), collapse = " "))

Testing the regular expression on one example of the test_text

Retrieving taxonomy data including the separators that exists in taxonomy data
which should be kept to parse them as a table afterwards.

Testing with one example, retrieving the taxonomy information usually wanted:

grep("[[:alnum:]]|[-_,;|]|[[:space:]]",
test_text[[1]] %>% stringr::str_split("") %>% unlist(),
invert = FALSE, value = TRUE) %>% paste0(collapse = "")

result -> " 1726470, metadata taxonomy k__Bacteria, p__Bactero etes, c__Bactero ia, o__Bactero ales, f__Bactero aceae, g__Bactero es, 1726471, metadata "

Testing the negative (or invert) of the regular expression to retrieve the
artifacts:

grep("[[:alnum:]]|[-_,;|]|[[:space:]]",
test_text[[1]] %>% stringr::str_split("") %>% unlist(),
invert = TRUE, value = TRUE) %>% paste0(collapse = "")
result -> "[{"":"""":{"":["\"""""""""""\""]}}{"":"""":{""

To collect the unique list of artifacts to be cleaned later:

grep("[[:alnum:]]|[-_,;|]|[[:space:]]",
test_text[[1]] %>% stringr::str_split("") %>% unlist(),
invert = TRUE, value = TRUE) %>% unique()
result -> "[" "{" """ ":" "\" "]" "}"

Testing the regular expression on the whole test_text

Collecting the final unique list of character artifacts to be cleaned throughout
the whole Taxonomy data:

lapply(test_text, function(x) {
grep("[[:alnum:]]|[-_,;|]|[[:space:]]",
x %>% stringr::str_split("") %>% unlist(),
invert = TRUE, value = TRUE) %>% unique()
}) %>% unlist() %>% unique()

result -> "[" "{" """ ":" "\" "]" "}"

Consequently, this could be perhaps an automated way of detecting, and forming
the patterns to be cleaned at the "makeTreeSEFromBiom" (which automatically
will become 'cleanTaxaPattern="[{":\]}"') if for example the user
provide "cleanTaxaPattern=NULL" or smt similar (maybe even giving an value option 'auto' for instance),
to hopefully safely parse the taxonomy data, which would safely parse rank names, taxonomy data afterwards in the function.
`
test_text.txt

Ofc with the current implementation the biomformat::observation_metadata does some parsing and cleaning of some of the above seen artifacts in the raw data file. The idea I was thinking if this seems reasonable, a further utility/helper function(s) I could implement that would double check by detecting, cleaning and throwing the warning.
P.S if it seems interesting please some hints how to improve the regex used in the example, thanks :)

antagomir · 2023-07-28T17:29:13Z

Can you summarize in a much shorter way what is the problem and what is the proposed solution?

ChouaibB · 2023-07-28T17:47:23Z

Can you summarize in a much shorter way what is the problem and what is the proposed solution?

I will be committing the implementation I had in mind, then I guess it would show clearly.

antagomir

nice!

R/makeTreeSummarizedExperimentFromBiom.R

antagomir · 2023-07-29T09:17:56Z

@ChouaibB note that @RiboRings is now making some changes in OMA
microbiome/OMA#316

ChouaibB · 2023-08-01T18:05:01Z

One question/suggestion @antagomir @TuomasBorman :
If this implementation seems to be fine, should .detect_taxa_artifacts_and_clean and .detect_taxa_artifacts and .clean_from_artifacts be moved to R/taxonomy.R or R/utils.R that potentially could help check the taxonomy parsing of other types of data (e.g. makeTreeSummarizedExperimentFromDADA2.R or makeTreeSummarizedExperimentFromPhyloseq.R ) or such?

antagomir · 2023-08-01T18:29:19Z

You can move if it seems useful but I think it should be possible to use these regardless of the file where they are located.

ChouaibB · 2023-08-02T07:39:39Z

Ahh ok, I will keep them where they are for now.
Regarding the failed checks (link):

It seems they are not related to the tests of makeTreeSummarizedExperimentFromBiom.R (actually I guess it fails before even reaching the tests related of test-10makeTreeSEFromBiom.R; locally these tests all pass with no problem), is somebody working on debugging those test checks that fails or shall I also look at them so this can be potentially merged if all is ok otherwise?

antagomir · 2023-08-02T07:49:50Z

Strange that we have such errors suddenly appearing. Would be great if you can fix on the same go?

ChouaibB · 2023-08-02T08:18:30Z

Strange that we have such errors suddenly appearing. Would be great if you can fix on the same go?

I will give it a try.

ChouaibB · 2023-08-03T15:22:26Z

About the failed checks:
When I did the checks locally I took some notes of the reason of the error and the fix:

══ Failed tests ════════════════════════════════════════════════════════════════
── Failure ('test-0diversity.R:27:5'): diversity estimates ─────────────────────
`lambda` not equal to .simpson_lambda(assays(tse_idx)$relabundance).
names for current but not for target
── Failure ('test-0diversity.R:28:5'): diversity estimates ─────────────────────
`ginisimpson` not equal to colData(tse_idx)$gini_simpson.
names for current but not for target
── Failure ('test-0diversity.R:29:5'): diversity estimates ─────────────────────
`ginisimpson` not equal to .calc_gini_simpson(assays(tse_idx)$relabundance).
names for current but not for target
── Failure ('test-0diversity.R:31:5'): diversity estimates ─────────────────────
`invsimpson` not equal to colData(tse_idx)$inverse_simpson.
names for current but not for target
── Failure ('test-0diversity.R:32:5'): diversity estimates ─────────────────────
`invsimpson` not equal to .calc_inverse_simpson(assays(tse_idx)$relabundance).
names for current but not for target
## ==> solution was to unname the named returned vector of each right-hand 
## side of the test 

── Failure ('test-0diversity.R:66:5'): diversity estimates ─────────────────────
`test1` not equal to `test2`.
Names: 3 string mismatches
## ==> solution was to unname both since they have same values but different 
## naming of the vectors at each side of the test

── Failure ('test-0diversity.R:67:5'): diversity estimates ─────────────────────
round(test1, 6) not equal to c(7.256706, 6.098354, 7.278894).
names for target but not for current
## ==> solution was to unname left-hand side of test otherwise values were equal

── Failure ('test-8subsample.R:35:5'): subsampleCounts ─────────────────────────
colSums2(assay(tse.subsampled, "subsampled")) not equal to `expColSums`.
names for target but not for current
## ==> solution was to unname left-hand side of test otherwise values were equal

── Failure ('test-8subsample.R:51:5'): subsampleCounts ─────────────────────────
colSums2(assay(tse.subsampled.rp, "subsampled")) not equal to `expColSumsRp`.
names for target but not for current
## ==> solution was to unname left-hand side of test otherwise values were equal

Hopefully now it passes the checks at github :)

antagomir

merge when ready unless there are further comments

ChouaibB · 2023-08-04T09:19:00Z

merge when ready unless there are further comments

I will merge it for now. Perhaps when the utility/helper functions within this one turn to be useful at other importers, they could be moved to taxonomy.R. I was just thinking that, by then for future development any taxonomy related helper functions could be found at taxonomy.R so the functions and utilities and the type of utilities needed could be "easily readable and found" under R/ then all the developers instead of going through each function and searching for their helper function that could be re-used; then naming the type of utility functions under R/ could be helpful and more easily readable. But this is just an opinion ofc :)

rowData clean-up from character artifacts

9ae8e75

ChouaibB requested review from TuomasBorman and antagomir July 27, 2023 11:43

ChouaibB linked an issue Jul 27, 2023 that may be closed by this pull request

Taxonomy parsing #303

Closed

antagomir reviewed Jul 27, 2023

View reviewed changes

R/makeTreeSummarizedExperimentFromBiom.R Outdated Show resolved Hide resolved

R/makeTreeSummarizedExperimentFromBiom.R Outdated Show resolved Hide resolved

changed to argument + example at doc

b0c0fe9

fix at func + testdata + testthat func

a858e72

re-implmented cleanTaxaPattern and added auto and NULL features to it

94001aa

antagomir mentioned this pull request Jul 29, 2023

Make instructions on data import clear and consistent microbiome/OMA#307

Closed

antagomir requested changes Jul 29, 2023

View reviewed changes

re-adjusted parsing of feature_data as list + 1 extra test

e79cd11

fixed testhat

549219d

ChouaibB requested a review from antagomir August 3, 2023 15:59

antagomir approved these changes Aug 4, 2023

View reviewed changes

ChouaibB merged commit 6663c03 into master Aug 4, 2023
1 check passed

ChouaibB deleted the issue_303 branch August 4, 2023 11:35

antagomir mentioned this pull request Aug 9, 2023

Simplify biom import example microbiome/OMA#327

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rowData clean-up from character artifacts #406

rowData clean-up from character artifacts #406

ChouaibB commented Jul 27, 2023

antagomir commented Jul 27, 2023

ChouaibB commented Jul 27, 2023

antagomir commented Jul 27, 2023

ChouaibB commented Jul 27, 2023 •

edited

Loading

antagomir commented Jul 27, 2023

ChouaibB commented Jul 28, 2023 •

edited

Loading

ChouaibB commented Jul 28, 2023 •

edited

Loading

antagomir commented Jul 28, 2023

ChouaibB commented Jul 28, 2023 •

edited

Loading

antagomir commented Jul 28, 2023

ChouaibB commented Jul 28, 2023

antagomir left a comment

antagomir commented Jul 29, 2023

ChouaibB commented Aug 1, 2023

antagomir commented Aug 1, 2023

ChouaibB commented Aug 2, 2023 •

edited

Loading

antagomir commented Aug 2, 2023

ChouaibB commented Aug 2, 2023

ChouaibB commented Aug 3, 2023 •

edited

Loading

antagomir left a comment

ChouaibB commented Aug 4, 2023 •

edited

Loading

rowData clean-up from character artifacts #406

rowData clean-up from character artifacts #406

Conversation

ChouaibB commented Jul 27, 2023

antagomir commented Jul 27, 2023

ChouaibB commented Jul 27, 2023

antagomir commented Jul 27, 2023

ChouaibB commented Jul 27, 2023 • edited Loading

antagomir commented Jul 27, 2023

ChouaibB commented Jul 28, 2023 • edited Loading

ChouaibB commented Jul 28, 2023 • edited Loading

antagomir commented Jul 28, 2023

ChouaibB commented Jul 28, 2023 • edited Loading

Example Data

Reading data

Testing the regular expression on one example of the test_text

Testing the regular expression on the whole test_text

antagomir commented Jul 28, 2023

ChouaibB commented Jul 28, 2023

antagomir left a comment

Choose a reason for hiding this comment

antagomir commented Jul 29, 2023

ChouaibB commented Aug 1, 2023

antagomir commented Aug 1, 2023

ChouaibB commented Aug 2, 2023 • edited Loading

antagomir commented Aug 2, 2023

ChouaibB commented Aug 2, 2023

ChouaibB commented Aug 3, 2023 • edited Loading

antagomir left a comment

Choose a reason for hiding this comment

ChouaibB commented Aug 4, 2023 • edited Loading

ChouaibB commented Jul 27, 2023 •

edited

Loading

ChouaibB commented Jul 28, 2023 •

edited

Loading

ChouaibB commented Jul 28, 2023 •

edited

Loading

ChouaibB commented Jul 28, 2023 •

edited

Loading

ChouaibB commented Aug 2, 2023 •

edited

Loading

ChouaibB commented Aug 3, 2023 •

edited

Loading

ChouaibB commented Aug 4, 2023 •

edited

Loading