Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agglomerate* functions: behavior when NA #411

Closed
wants to merge 10 commits into from
Closed

Conversation

TuomasBorman
Copy link
Contributor

Leo;

1. agglomerateByPrevalence: Indeed, it seems that your analysis is correct; the "Other" group only contains 0's and 1's. Then the system is not sure if these are actual counts that are sensible to sum up, and throws a warning. In principle this works as expected. In practice, we already know from the context that it is actual count data because we could test that the original mae[[1]] is count data, and hence any subset of it (those that will be merged under the "Other" category) will also be. This could deserve a small fix that would check the "count" status in such cases for the original input only. This will require thinking a bit about the logic of the method.

  1. I noticed also another point; agglomerating by rank or prevalence will give different total read counts per sample, although they would be expected to give identical count (just different grouping of the rows).
colSums(assay(agglomerateByRank(mae[[1]], rank = "Phylum")))
colSums(assay(agglomerateByPrevalence(mae[[1]], rank = "Phylum")))

This is because the Phylum rank includes NAs for some rows: sum(is.na(rowData(mae[[1]])$Phylum)) yields 93. These are omitted with agglomerateByPrevalence but not with agglomerateByRank (they will be included as NA row in the latter). It would be most logical that the NA row would be included also in the data that is agglomerated by prevalence. The user can choose whether they want to merge such NA row further. One problem with the NA row is that these may come from different phyla, and hence grouping them together in the phylum level agglomeration is potentially misleading. I would solve this by providing a binary argument that excludes the NA phyla by default in all agglomerations (rank, prevalence, or other grouping variable) but user could choose to keep these by switching the argument (then they are aware of this and can maintain the original read count, which might be relevant in some cases).

@TuomasBorman TuomasBorman marked this pull request as draft September 6, 2023 16:39
@TuomasBorman
Copy link
Contributor Author

New PR, closing this

#438

@Daenarys8 Daenarys8 deleted the agglomerate_NAs branch August 6, 2024 09:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant