-
Notifications
You must be signed in to change notification settings - Fork 695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Limma mixed models feature #6753
base: master
Are you sure you want to change the base?
Conversation
@@ -4,17 +4,19 @@ process LIMMA_DIFFERENTIAL { | |||
|
|||
conda "${moduleDir}/environment.yml" | |||
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? | |||
'https://depot.galaxyproject.org/singularity/bioconductor-limma:3.54.0--r42hc0cfd56_0' : | |||
'biocontainers/bioconductor-limma:3.54.0--r42hc0cfd56_0' }" | |||
'oras://community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:7fc48564d286c1c6' : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shouldn't have any oras protocol in the modules, it doesn't work with NXF_SINGULARITY_CACHEDIR
Can you swittch to https?
'https://depot.galaxyproject.org/singularity/bioconductor-limma:3.54.0--r42hc0cfd56_0' : | ||
'biocontainers/bioconductor-limma:3.54.0--r42hc0cfd56_0' }" | ||
'oras://community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:7fc48564d286c1c6' : | ||
'community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:edea0f9fbaeba3a0' }" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:edea0f9fbaeba3a0' }" | |
'nf-core/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:edea0f9fbaeba3a0' }" |
I pulled and pushed to quay.io
So we can change the registry from all container in a simpler way.
We'll be using community.wave.seqera.io
as registry when we do the switch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not keen on this @maxulysse, there's no way for someone without privilege to do this and maintain the module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few changes I'd like to see here.
I understand the attraction of readr etc, but until we have full Seqera Containers integration with nf-core tooling (which shouldn't be long), extra packages create challenges.
The forking of logic is also a bit ugly. If the logic is sufficiently different to require two scripts it should be two modules- though I'd favour integration if it didn't create excess complexity in the R code.
dependencies: | ||
- bioconda::bioconductor-limma=3.54.0 | ||
- bioconda::bioconductor-edger=4.0.16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This multi-package thing still creates difficulties right now- which is why I wrote this module depending on a single Biocontainer.
@@ -4,17 +4,19 @@ process LIMMA_DIFFERENTIAL { | |||
|
|||
conda "${moduleDir}/environment.yml" | |||
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? | |||
'https://depot.galaxyproject.org/singularity/bioconductor-limma:3.54.0--r42hc0cfd56_0' : | |||
'biocontainers/bioconductor-limma:3.54.0--r42hc0cfd56_0' }" | |||
'oras://community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:7fc48564d286c1c6' : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These oras links don't work well with nf-core tooling right now, unfortunately
@@ -23,5 +25,10 @@ process LIMMA_DIFFERENTIAL { | |||
task.ext.when == null || task.ext.when | |||
|
|||
script: | |||
template 'limma_de.R' | |||
if (type == 'rnaseq') { | |||
template 'limma_de_rnaseq.R' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not massively keen on the parallel script.
Either the new thing should be a separate module, or they should be properly integrated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How you envision "proper" integration? As single template? I will refactor that if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, if the two scripts share enough logic, they should be one with a simple conditional. If the logic is quite divergent such that doing that adds too much complexity, these should be separate modules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I had similar doubts in the one hand this is still limma and differential abundance analysis and on the other site this one is combined with voom and there is lot of differences. So maybe we will create new module named limma/differential-voom? or simply limma/voom - what are your thoughts on the module naming?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be limma/differential-voom.
As in the comments below, I also think this should be, as much as possible, a thin wrapper around Limma functions only. Otherwise it's just your custom script, rather than a 'standard' Limma module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updating this older comment after newer ones below: I don't believe additional logic required for Voom merits its own module. We can add the Voom part in a conditional, and other changes (e.g. duplicateCorrelation) etc, apply equally to non-Voom Limma.
- bioconda::bioconductor-limma=3.54.0 | ||
- bioconda::bioconductor-edger=4.0.16 | ||
- bioconda::bioconductor-ihw=1.28.0 | ||
- bioconda::bioconductor-limma=3.58.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Limma is a dependency of edger, you don't need to include it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is true however without this we won't be able to control limma version which will be used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only pin the primary dependency in the modules repo, unless you have a really compelling reason to do otherwise. We will at some point have lock files that pin all dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(as in other comments, I'm not sure we need edgeR here, so we'd only pin Limma)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well we noticed that results are slightly different depending on limma versions so we decided to pin it to achieve reproducibility. I will adopt your comment in standalone module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I see your point on this one, apologies. Limma is the primary dependency so it's pinned, but we need edgeR.
dim(DGE) | ||
|
||
# Calculate normalization factors | ||
DGE <- calcNormFactors(DGE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please correct me if I'm wrong, but this and the DGEList are the only things that need edgeR here, right? voom() will do the normalisation things, and will just take a matrix. Then we don't need the edgeR dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use edgeR dependency to read raw matrix count to create DGEList object which is one of required ways to provide expression data to voom: "counts - a numeric matrix containing raw counts, or an ExpressionSet containing raw counts, or a DGEList object." : https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/voom
So we either need BioBase or edgeR dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, as I say, we don't need edgeR, and that will simplify the software dependencies
Actually, after reminding myself, I take this back, think it probably does need the DGEList etc the way you're doing.
## Generate outputs | ||
# Differential expression table | ||
|
||
if (opt\$IHW_correction) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO if we're writing 'limma' modules they should be thin-as-possible wrappers around Limma functions. We shouldn't we embedding lots of additional logic, aside from the boilerplate required to pass inputs to those functions correctly and write the outputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extensive customisations should probably be in local modules in the workflow, and we'd want to discuss to determine if this was part of standard best practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just had some time to review this again, and I think the R part also needs work. There's no need to treat the pairing variable in a special way as far as I can tell, it can be a blocking variable like any other. At that point, the main difference between the two modes is the use of duplicateCorrelation, for which there could just be a separate flag.
Fundamentally, this would integrate much better into the existing differentialabundance logic if started from the existing limma module.
I actually don't think the logic is very different with using Voom, so for basic Voom integration the existing module can just have a conditional in it something like this, about here:
if (!is.null(opt$use_voom) && opt$use_voom) {
# Create a DGEList object for RNA-seq data
dge <- DGEList(counts = intensities.table)
# Normalize counts using TMM
dge <- calcNormFactors(dge, method = "TMM")
# Run voom to transform the data
voom_result <- voom(dge, design)
data_for_fit <- voom_result
} else {
# Use as.matrix for regular microarray analysis
data_for_fit <- as.matrix(intensities.table)
}
All the other logic can be the same, and then we can add in any features you think are missing from there.
# Calculate normalization factors | ||
DGE <- calcNormFactors(DGE) | ||
|
||
if (opt\$analysis_type == "pairwise") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (opt\$analysis_type == "pairwise") { | |
if (opt\$analysis_type == "pairwise") { |
I don't think you even need this extra mode. Sample pairing can be handled as a blocking variable, no? Even if it was needed, you're duplicating a lot of code between the two modes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well we are duplicating some code. Unfortunately the mixed model integration have large amount of small additions which resulted in multiple if's example can be voom which is called twice. I will try to incorporate voom into original script. I'm not yet fulli convinced about mixedmodel mode.
Just to re-state the above, and after refreshing myself on the Voom logic, I actually don't think Voom needs to be its own module. I think this can be integrated into the existing module. The voom-specific component is a matter of a few lines. All the other changes are not Voom specific. |
I missed your earlier comment I will take a closer look on this one and figure out how to incorporate mixedmodel in the nice clean way. At least thanks to new module we have nf-test almost ready :D |
@pinin4fjords I've extended original limma module with voom and IHW correction. This is still a draft as my results are now significantly different and sitll this have to be sorted out. I wanted to ask you if you fill that this is a right direction. Also any suggestion regarding incorporation of mixed models logic will be welcome :) Thank you in advance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the ongoing work- this is definitely going the right way.
But I still think the IHW code doesn't belong in a module wrapping standard Limma functions.
@@ -322,18 +357,54 @@ ebayes_args = c( | |||
|
|||
fit2 <- do.call(eBayes, ebayes_args) | |||
|
|||
# Run topTable() to generate a results data frame | |||
if ((!is.null(opt\$use_voom) && opt\$use_voom) && (!is.null(opt\$IHW_correction) && opt\$IHW_correction)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, but I don't think the IHW correction belongs in this module.
nf-core modules should be thin wrappers around underlying software. We have some unavoidable boilerplate when we're wrapping library code like this, but we shouldn't be adding to that with our own judgement calls- it makes the code harder to maintain, and harder for users expecting Limma-native functionality to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I don't se a point to make a new module just for IHW correction and I'm not sure how we can handle integration of this logic into main pipeline at this point. Should I do sth like this one:
https://github.com/nf-core/differentialabundance/blob/master/modules/local/filter_difftable.nf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just do it without for now, and we'll evaluate if and how this should be included in the pipeline afterwards- feel free to make a separate workflow issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
I finally find out why I had different results. I either don't understand how limma works or usage of blocking as is already implemented will not work as was mentioned earlier bellow the story of model and designs and how they change results: different order makes different design and different results:
model definition:
Also it looks like there might be some unexpected situation if we have pair data as text and ints - the one with ints makes correct design one with string is missing one column in design bellow example :( model definition:
model definition:
I am thinking now that it might be worth to add option to override model via opt as each of above makes results significantly different. This will make module significantly more flexible. What do you think about that @pinin4fjords ? |
@pinin4fjords bellow above I will proceed with addition of opt which will allow to override model in pipeline. Unless you are against it?
|
Could you explain how the results differ please? The fact that the design matrix has a different order shouldn't matter too much, as long as we construct the contrasts correctly. I'll do some digging myself as well, and come back to you. |
OK, that might not be true. But let's see if we can fix things without adding too much complexity. Will suggest soon... |
OK, so I've mucked about and put some changes in #6795 by way of illustration in isolation from your other changes. Basically, what I'd like is to pass pairing and blocking variables together, if it makes sense, to avoid having to re-engineer differentialabundance to too great a degree. I think that we can include pairing along with the other blocking factors, as long as those variables occur first in the model. So my suggestion is that we move them all, rather than having a new input for the pairing. We'd then just document that people supplied pairing first among any other variables they supply. I believe that approach is scientifically valid, addresses your modelling issue, and doesn't require too much change or additional complexity. I don't think allowing model specification is a good plan because the way the model is specified has ramifications for how the rest of the code is constructed, and it would be very easy to break. |
I agree with you. Making model a parameter can be dangerous for end users and result in unexpected results. That draft is great I will do that over the weekend. Can you comment on the design difference when blocking column contains str vs ints? - For us it looks like some kind of bug. Input file is Details bellow:
Code to reproduce:
I'm still not sure how we should approach mixed model where we need to handle multiple variable references and targets... I guess this is a problem of future us. |
Strings or ints, they should be converted to factors- think I cover than in my draft, but we should make that happen. On your last point, assuming you mean multiple contrasts, that's handled in workflow logic. |
This extends limma functionality with option to process rna-seq data in pairwise and mixed mode. It will be used in differentialabundance pipeline and in future can be used in other pipelines which are using limma.
PR checklist
Closes #XXX
versions.yml
file.label
nf-core modules test <MODULE> --profile docker
nf-core modules test <MODULE> --profile singularity
nf-core modules test <MODULE> --profile conda
nf-core subworkflows test <SUBWORKFLOW> --profile docker
nf-core subworkflows test <SUBWORKFLOW> --profile singularity
nf-core subworkflows test <SUBWORKFLOW> --profile conda