agglomerate* functions: behavior when NA #411
Closed
Leo;
1. agglomerateByPrevalence: Your analysis seems correct; the "Other" group contains only 0's and 1's. The system then cannot tell whether these are actual counts that are sensible to sum, and throws a warning. In principle this works as expected. In practice, we already know from context that these are count data: we could test that the original mae[[1]] is count data, and then any subset of it (the rows that will be merged under the "Other" category) is count data too. This could deserve a small fix that, in such cases, checks the "count" status for the original input only. It will require thinking a bit about the logic of the method.
This is because the Phylum rank includes NAs for some rows:
sum(is.na(rowData(mae[[1]])$Phylum))
yields 93. These rows are omitted by agglomerateByPrevalence but not by agglomerateByRank (which includes them as an NA row). It would be most logical for the NA row to be included in the data agglomerated by prevalence as well. The user can then choose whether to merge the NA row further. One problem with the NA row is that its members may come from different phyla, so grouping them together in a phylum-level agglomeration is potentially misleading. I would solve this with a binary argument that excludes NA phyla by default in all agglomerations (rank, prevalence, or other grouping variable); the user could keep them by switching the argument, in which case they are aware of the issue and can maintain the original read count, which might be relevant in some cases.
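To make the proposal concrete, here is a minimal base-R sketch of the suggested switch. It is not the mia implementation: the function `agglomerate_sketch`, its `na.rm` argument, and the toy data are hypothetical and only illustrate the intended behavior on a plain count matrix.

```r
# Hypothetical helper (not part of mia): sum count rows by a grouping
# variable, either dropping rows with an NA group or pooling them in one row.
agglomerate_sketch <- function(counts, group, na.rm = TRUE) {
  group <- as.character(group)
  stopifnot(nrow(counts) == length(group))
  if (na.rm) {
    # Proposed default: exclude rows whose group is unknown
    keep <- !is.na(group)
    counts <- counts[keep, , drop = FALSE]
    group <- group[keep]
  } else {
    # Opt-in: keep unknown rows, pooled under an explicit "NA" label
    group[is.na(group)] <- "NA"
  }
  rowsum(counts, group)
}

# Toy data: 4 taxa x 2 samples, one taxon without a Phylum assignment
counts <- matrix(c(5, 0, 2, 1,
                   3, 4, 0, 6),
                 nrow = 4,
                 dimnames = list(paste0("taxon", 1:4), c("s1", "s2")))
phylum <- c("Firmicutes", NA, "Firmicutes", "Bacteroidetes")

dropped <- agglomerate_sketch(counts, phylum)                 # NA row excluded
kept    <- agglomerate_sketch(counts, phylum, na.rm = FALSE)  # NA row retained
```

With `na.rm = FALSE` the per-sample totals of the result match those of the input, which is the "maintain the original read count" case. A real implementation would also need to guard against a genuine group label that is literally the string "NA".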