Limma mixed models feature #6753

KamilMaliszArdigen · 2024-10-08T11:03:04Z

This extends limma functionality with option to process rna-seq data in pairwise and mixed mode. It will be used in differentialabundance pipeline and in future can be used in other pipelines which are using limma.

PR checklist

Closes #XXX

maxulysse · 2024-10-08T11:44:34Z

modules/nf-core/limma/differential/main.nf

@@ -4,17 +4,19 @@ process LIMMA_DIFFERENTIAL {

 conda "${moduleDir}/environment.yml"
 container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
- 'https://depot.galaxyproject.org/singularity/bioconductor-limma:3.54.0--r42hc0cfd56_0' :
- 'biocontainers/bioconductor-limma:3.54.0--r42hc0cfd56_0' }"
+ 'oras://community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:7fc48564d286c1c6' :


we shouldn't have any oras protocol in the modules, it doesn't work with NXF_SINGULARITY_CACHEDIR
Can you swittch to https?

modules/nf-core/limma/differential/environment.yml

maxulysse · 2024-10-08T11:48:58Z

modules/nf-core/limma/differential/main.nf

- 'https://depot.galaxyproject.org/singularity/bioconductor-limma:3.54.0--r42hc0cfd56_0' :
- 'biocontainers/bioconductor-limma:3.54.0--r42hc0cfd56_0' }"
+ 'oras://community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:7fc48564d286c1c6' :
+ 'community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:edea0f9fbaeba3a0' }"


Suggested change

'community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:edea0f9fbaeba3a0' }"

'nf-core/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:edea0f9fbaeba3a0' }"

I pulled and pushed to quay.io

So we can change the registry from all container in a simpler way.
We'll be using community.wave.seqera.io as registry when we do the switch

I'm not keen on this @maxulysse, there's no way for someone without privilege to do this and maintain the module.

pinin4fjords

There are a few changes I'd like to see here.

I understand the attraction of readr etc, but until we have full Seqera Containers integration with nf-core tooling (which shouldn't be long), extra packages create challenges.

The forking of logic is also a bit ugly. If the logic is sufficiently different to require two scripts it should be two modules- though I'd favour integration if it didn't create excess complexity in the R code.

pinin4fjords · 2024-10-08T11:45:41Z

modules/nf-core/limma/differential/environment.yml

 dependencies:
- - bioconda::bioconductor-limma=3.54.0
+- bioconda::bioconductor-edger=4.0.16


This multi-package thing still creates difficulties right now- which is why I wrote this module depending on a single Biocontainer.

pinin4fjords · 2024-10-08T11:46:01Z

modules/nf-core/limma/differential/main.nf

@@ -4,17 +4,19 @@ process LIMMA_DIFFERENTIAL {

 conda "${moduleDir}/environment.yml"
 container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
- 'https://depot.galaxyproject.org/singularity/bioconductor-limma:3.54.0--r42hc0cfd56_0' :
- 'biocontainers/bioconductor-limma:3.54.0--r42hc0cfd56_0' }"
+ 'oras://community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:7fc48564d286c1c6' :


These oras links don't work well with nf-core tooling right now, unfortunately

pinin4fjords · 2024-10-08T11:56:59Z

modules/nf-core/limma/differential/main.nf

@@ -23,5 +25,10 @@ process LIMMA_DIFFERENTIAL {
 task.ext.when == null || task.ext.when

 script:
- template 'limma_de.R'
+ if (type == 'rnaseq') {
+ template 'limma_de_rnaseq.R'


I'm not massively keen on the parallel script.

Either the new thing should be a separate module, or they should be properly integrated.

How you envision "proper" integration? As single template? I will refactor that if needed.

I mean, if the two scripts share enough logic, they should be one with a simple conditional. If the logic is quite divergent such that doing that adds too much complexity, these should be separate modules.

Well I had similar doubts in the one hand this is still limma and differential abundance analysis and on the other site this one is combined with voom and there is lot of differences. So maybe we will create new module named limma/differential-voom? or simply limma/voom - what are your thoughts on the module naming?

I think this should be limma/differential-voom.

As in the comments below, I also think this should be, as much as possible, a thin wrapper around Limma functions only. Otherwise it's just your custom script, rather than a 'standard' Limma module.

updating this older comment after newer ones below: I don't believe additional logic required for Voom merits its own module. We can add the Voom part in a conditional, and other changes (e.g. duplicateCorrelation) etc, apply equally to non-Voom Limma.

pinin4fjords · 2024-10-10T09:01:32Z

modules/nf-core/limma/differential/environment.yml

- - bioconda::bioconductor-limma=3.54.0
+- bioconda::bioconductor-edger=4.0.16
+- bioconda::bioconductor-ihw=1.28.0
+- bioconda::bioconductor-limma=3.58.1


Limma is a dependency of edger, you don't need to include it here

That is true however without this we won't be able to control limma version which will be used

We only pin the primary dependency in the modules repo, unless you have a really compelling reason to do otherwise. We will at some point have lock files that pin all dependencies.

(as in other comments, I'm not sure we need edgeR here, so we'd only pin Limma)

Well we noticed that results are slightly different depending on limma versions so we decided to pin it to achieve reproducibility. I will adopt your comment in standalone module

Actually, I see your point on this one, apologies. Limma is the primary dependency so it's pinned, but we need edgeR.

pinin4fjords · 2024-10-10T09:16:28Z

modules/nf-core/limma/differential/templates/limma_de_rnaseq.R

+dim(DGE)
+
+# Calculate normalization factors
+DGE <- calcNormFactors(DGE)


Please correct me if I'm wrong, but this and the DGEList are the only things that need edgeR here, right? voom() will do the normalisation things, and will just take a matrix. Then we don't need the edgeR dependency.

We use edgeR dependency to read raw matrix count to create DGEList object which is one of required ways to provide expression data to voom: "counts - a numeric matrix containing raw counts, or an ExpressionSet containing raw counts, or a DGEList object." : https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/voom

So we either need BioBase or edgeR dependency.

~~Yep, as I say, we don't need edgeR, and that will simplify the software dependencies~~

Actually, after reminding myself, I take this back, think it probably does need the DGEList etc the way you're doing.

pinin4fjords · 2024-10-10T09:18:55Z

modules/nf-core/limma/differential/templates/limma_de_rnaseq.R

+## Generate outputs
+# Differential expression table
+
+if (opt\$IHW_correction) {


IMO if we're writing 'limma' modules they should be thin-as-possible wrappers around Limma functions. We shouldn't we embedding lots of additional logic, aside from the boilerplate required to pass inputs to those functions correctly and write the outputs.

Extensive customisations should probably be in local modules in the workflow, and we'd want to discuss to determine if this was part of standard best practice.

pinin4fjords

I've just had some time to review this again, and I think the R part also needs work. There's no need to treat the pairing variable in a special way as far as I can tell, it can be a blocking variable like any other. At that point, the main difference between the two modes is the use of duplicateCorrelation, for which there could just be a separate flag.

Fundamentally, this would integrate much better into the existing differentialabundance logic if started from the existing limma module.

I actually don't think the logic is very different with using Voom, so for basic Voom integration the existing module can just have a conditional in it something like this, about here:

if (!is.null(opt$use_voom) && opt$use_voom) {
    # Create a DGEList object for RNA-seq data
    dge <- DGEList(counts = intensities.table)
    
    # Normalize counts using TMM
    dge <- calcNormFactors(dge, method = "TMM")
    
    # Run voom to transform the data
    voom_result <- voom(dge, design)
    data_for_fit <- voom_result
} else {
    # Use as.matrix for regular microarray analysis
    data_for_fit <- as.matrix(intensities.table)
}

All the other logic can be the same, and then we can add in any features you think are missing from there.

pinin4fjords · 2024-10-12T08:25:00Z

modules/nf-core/limma/differential/templates/limma_de_rnaseq.R

+# Calculate normalization factors
+DGE <- calcNormFactors(DGE)
+
+if (opt\$analysis_type == "pairwise") {


Suggested change

if (opt\$analysis_type == "pairwise") {

if (opt\$analysis_type == "pairwise") {

I don't think you even need this extra mode. Sample pairing can be handled as a blocking variable, no? Even if it was needed, you're duplicating a lot of code between the two modes.

Well we are duplicating some code. Unfortunately the mixed model integration have large amount of small additions which resulted in multiple if's example can be voom which is called twice. I will try to incorporate voom into original script. I'm not yet fulli convinced about mixedmodel mode.

pinin4fjords · 2024-10-14T09:44:26Z

Just to re-state the above, and after refreshing myself on the Voom logic, I actually don't think Voom needs to be its own module. I think this can be integrated into the existing module.

The voom-specific component is a matter of a few lines. All the other changes are not Voom specific.

KamilMaliszArdigen · 2024-10-14T22:12:05Z

Just to re-state the above, and after refreshing myself on the Voom logic, I actually don't think Voom needs to be its own module. I think this can be integrated into the existing module.

The voom-specific component is a matter of a few lines. All the other changes are not Voom specific.

I missed your earlier comment I will take a closer look on this one and figure out how to incorporate mixedmodel in the nice clean way.

At least thanks to new module we have nf-test almost ready :D

KamilMaliszArdigen · 2024-10-15T15:05:59Z

@pinin4fjords I've extended original limma module with voom and IHW correction. This is still a draft as my results are now significantly different and sitll this have to be sorted out. I wanted to ask you if you fill that this is a right direction. Also any suggestion regarding incorporation of mixed models logic will be welcome :)

Thank you in advance.

pinin4fjords

Thank you for the ongoing work- this is definitely going the right way.

But I still think the IHW code doesn't belong in a module wrapping standard Limma functions.

pinin4fjords · 2024-10-15T16:58:00Z

modules/nf-core/limma/differential/templates/limma_de.R

@@ -322,18 +357,54 @@ ebayes_args = c(

 fit2 <- do.call(eBayes, ebayes_args)

-# Run topTable() to generate a results data frame
+if ((!is.null(opt\$use_voom) && opt\$use_voom) && (!is.null(opt\$IHW_correction) && opt\$IHW_correction)) {


I'm sorry, but I don't think the IHW correction belongs in this module.

nf-core modules should be thin wrappers around underlying software. We have some unavoidable boilerplate when we're wrapping library code like this, but we shouldn't be adding to that with our own judgement calls- it makes the code harder to maintain, and harder for users expecting Limma-native functionality to understand.

Well I don't se a point to make a new module just for IHW correction and I'm not sure how we can handle integration of this logic into main pipeline at this point. Should I do sth like this one:
https://github.com/nf-core/differentialabundance/blob/master/modules/local/filter_difftable.nf?

Just do it without for now, and we'll evaluate if and how this should be included in the pipeline afterwards- feel free to make a separate workflow issue.

modules/nf-core/limma/differential/templates/limma_de.R

KamilMaliszArdigen · 2024-10-16T20:20:22Z

I finally find out why I had different results. I either don't understand how limma works or usage of blocking as is already implemented will not work as was mentioned earlier bellow the story of model and designs and how they change results:

different order makes different design and different results:
model definition: paste('~ 0 +', contrast_variable, '+ pair_IDs') -> this is what we end with current implementation
model: "~ 0 + treatment + pair_IDs"
design:

           treatmenthND6 treatmentmCherry pair_IDs
SRX8042381             0                1        1
SRX8042382             0                1        2
SRX8042383             0                1        3
SRX8042384             1                0        1
SRX8042385             1                0        2
SRX8042386             1                0        3

model definition: paste('~ 0 + pair_IDs +', contrast_variable) -> this is how paired model should be defined according to documentation
model: "~ 0 + pair_IDs + treatment"
design:

           pair_IDs treatmenthND6 treatmentmCherry
SRX8042381        1             0                1
SRX8042382        2             0                1
SRX8042383        3             0                1
SRX8042384        1             1                0
SRX8042385        2             1                0
SRX8042386        3             1                0

Also it looks like there might be some unexpected situation if we have pair data as text and ints - the one with ints makes correct design one with string is missing one column in design bellow example :(

model definition: paste('~ 0 + pair_IDs +', contrast_variable)
model: "~ 0 + pair_IDs + treatment"
design:

           pair_IDs treatmenthND6 treatmentmCherry
SRX8042381        1             0                1
SRX8042382        2             0                1
SRX8042383        3             0                1
SRX8042384        1             1                0
SRX8042385        2             1                0
SRX8042386        3             1                0

model definition: paste('~ 0 + sample_number +', contrast_variable)
model: "~ 0 + sample_number + treatment"
design:

           sample_numbersample1 sample_numbersample2 sample_numbersample3
SRX8042381                    1                    0                    0
SRX8042382                    0                    1                    0
SRX8042383                    0                    0                    1
SRX8042384                    1                    0                    0
SRX8042385                    0                    1                    0
SRX8042386                    0                    0                    1
           treatmentmCherry
SRX8042381                1
SRX8042382                1
SRX8042383                1
SRX8042384                0
SRX8042385                0
SRX8042386                0

I am thinking now that it might be worth to add option to override model via opt as each of above makes results significantly different. This will make module significantly more flexible. What do you think about that @pinin4fjords ?

KamilMaliszArdigen · 2024-10-17T08:50:29Z

@pinin4fjords bellow above I will proceed with addition of opt which will allow to override model in pipeline. Unless you are against it?

I finally find out why I had different results. I either don't understand how limma works or usage of blocking as is already implemented will not work as was mentioned earlier bellow the story of model and designs and how they change results:

different order makes different design and different results: model definition: paste('~ 0 +', contrast_variable, '+ pair_IDs') -> this is what we end with current implementation model: "~ 0 + treatment + pair_IDs" design:
           treatmenthND6 treatmentmCherry pair_IDs
SRX8042381             0                1        1
SRX8042382             0                1        2
SRX8042383             0                1        3
SRX8042384             1                0        1
SRX8042385             1                0        2
SRX8042386             1                0        3
model definition: paste('~ 0 + pair_IDs +', contrast_variable) -> this is how paired model should be defined according to documentation model: "~ 0 + pair_IDs + treatment" design:
           pair_IDs treatmenthND6 treatmentmCherry
SRX8042381        1             0                1
SRX8042382        2             0                1
SRX8042383        3             0                1
SRX8042384        1             1                0
SRX8042385        2             1                0
SRX8042386        3             1                0
Also it looks like there might be some unexpected situation if we have pair data as text and ints - the one with ints makes correct design one with string is missing one column in design bellow example :(

model definition: paste('~ 0 + pair_IDs +', contrast_variable) model: "~ 0 + pair_IDs + treatment" design:
           pair_IDs treatmenthND6 treatmentmCherry
SRX8042381        1             0                1
SRX8042382        2             0                1
SRX8042383        3             0                1
SRX8042384        1             1                0
SRX8042385        2             1                0
SRX8042386        3             1                0
model definition: paste('~ 0 + sample_number +', contrast_variable) model: "~ 0 + sample_number + treatment" design:
           sample_numbersample1 sample_numbersample2 sample_numbersample3
SRX8042381                    1                    0                    0
SRX8042382                    0                    1                    0
SRX8042383                    0                    0                    1
SRX8042384                    1                    0                    0
SRX8042385                    0                    1                    0
SRX8042386                    0                    0                    1
           treatmentmCherry
SRX8042381                1
SRX8042382                1
SRX8042383                1
SRX8042384                0
SRX8042385                0
SRX8042386                0
I am thinking now that it might be worth to add option to override model via opt as each of above makes results significantly different. This will make module significantly more flexible. What do you think about that @pinin4fjords ?

pinin4fjords · 2024-10-17T09:34:41Z

Could you explain how the results differ please? The fact that the design matrix has a different order shouldn't matter too much, as long as we construct the contrasts correctly. I'll do some digging myself as well, and come back to you.

pinin4fjords · 2024-10-17T10:04:49Z

The fact that the design matrix has a different order shouldn't matter too much, as long as we construct the contrasts correctly.

OK, that might not be true. But let's see if we can fix things without adding too much complexity. Will suggest soon...

pinin4fjords · 2024-10-17T10:28:57Z

OK, so I've mucked about and put some changes in #6795 by way of illustration in isolation from your other changes.

Basically, what I'd like is to pass pairing and blocking variables together, if it makes sense, to avoid having to re-engineer differentialabundance to too great a degree.

I think that we can include pairing along with the other blocking factors, as long as those variables occur first in the model. So my suggestion is that we move them all, rather than having a new input for the pairing. We'd then just document that people supplied pairing first among any other variables they supply.

I believe that approach is scientifically valid, addresses your modelling issue, and doesn't require too much change or additional complexity.

I don't think allowing model specification is a good plan because the way the model is specified has ramifications for how the rest of the code is constructed, and it would be very easy to break.

KamilMaliszArdigen · 2024-10-18T14:25:02Z

I agree with you. Making model a parameter can be dangerous for end users and result in unexpected results. That draft is great I will do that over the weekend.

Can you comment on the design difference when blocking column contains str vs ints? - For us it looks like some kind of bug. Input file is https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/mus_musculus/rnaseq_expression/SRP254919.samplesheet.csv

Details bellow:

Also it looks like there might be some unexpected situation if we have pair data as text and ints - the one with ints makes correct design one with string is missing one column in design bellow example :(

model definition: paste('~ 0 + pair_IDs +', contrast_variable) model: "~ 0 + pair_IDs + treatment" design:

           pair_IDs treatmenthND6 treatmentmCherry
SRX8042381        1             0                1
SRX8042382        2             0                1
SRX8042383        3             0                1
SRX8042384        1             1                0
SRX8042385        2             1                0
SRX8042386        3             1                0

model definition: paste('~ 0 + sample_number +', contrast_variable) model: "~ 0 + sample_number + treatment" design:

           sample_numbersample1 sample_numbersample2 sample_numbersample3
SRX8042381                    1                    0                    0
SRX8042382                    0                    1                    0
SRX8042383                    0                    0                    1
SRX8042384                    1                    0                    0
SRX8042385                    0                    1                    0
SRX8042386                    0                    0                    1
           treatmentmCherry
SRX8042381                1
SRX8042382                1
SRX8042383                1
SRX8042384                0
SRX8042385                0
SRX8042386                0

Code to reproduce:

sample.sheet <- read_delim_flexible(file = opt\$sample_file)

sample.sheet\$pair_IDs <- as.integer(as.factor(sample.sheet[['sample_number']]))

# int vs string ends with different design
model <- paste('~ 0 + pair_IDs +', contrast_variable)
print(model)
design <- model.matrix(
    as.formula(model),
    data=sample.sheet
)
print(design)

print('-------------------')
model <- paste('~ 0 + sample_number +', contrast_variable)
print(model)
design <- model.matrix(
    as.formula(model),
    data=sample.sheet
)
print(design)

print('-------------------')

I'm still not sure how we should approach mixed model where we need to handle multiple variable references and targets...

I guess this is a problem of future us.

pinin4fjords · 2024-10-18T16:13:32Z

Strings or ints, they should be converted to factors- think I cover than in my draft, but we should make that happen.

On your last point, assuming you mean multiple contrasts, that's handled in workflow logic.

KamilMaliszArdigen and others added 2 commits October 8, 2024 12:17

Limma mixed models feature

9795e09

Merge branch 'master' into limma-update

2f217b0

pinin4fjords self-requested a review October 8, 2024 11:41

maxulysse reviewed Oct 8, 2024

View reviewed changes

modules/nf-core/limma/differential/environment.yml Outdated Show resolved Hide resolved

maxulysse reviewed Oct 8, 2024

View reviewed changes

pinin4fjords requested changes Oct 8, 2024

View reviewed changes

Test fix

bc5d128

pinin4fjords reviewed Oct 10, 2024

View reviewed changes

pinin4fjords mentioned this pull request Oct 11, 2024

Add limma for rnaseq nf-core/differentialabundance#286

Open

11 tasks

pinin4fjords requested changes Oct 12, 2024

View reviewed changes

Limma voom moved to new module

96fc5e1

KamilMaliszArdigen added 2 commits October 14, 2024 23:13

Limma voom nf-tests

2dcfb12

Limma voom nf-tests

f557caa

Extension of limma module with voom and IHW correction

2e008b0

pinin4fjords reviewed Oct 15, 2024

View reviewed changes

pinin4fjords mentioned this pull request Oct 17, 2024

Update limma modelling approach #6795

Draft

17 tasks

	'community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:edea0f9fbaeba3a0' }"
	'nf-core/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:edea0f9fbaeba3a0' }"

	if (opt\$analysis_type == "pairwise") {
	if (opt\$analysis_type == "pairwise") {

Limma mixed models feature #6753

Are you sure you want to change the base?

Limma mixed models feature #6753

Conversation

KamilMaliszArdigen commented Oct 8, 2024 • edited Loading

PR checklist

maxulysse Oct 8, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pinin4fjords left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pinin4fjords Oct 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KamilMaliszArdigen Oct 11, 2024 • edited Loading

Choose a reason for hiding this comment

pinin4fjords Oct 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pinin4fjords left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pinin4fjords commented Oct 14, 2024

KamilMaliszArdigen commented Oct 14, 2024 • edited Loading

KamilMaliszArdigen commented Oct 15, 2024

pinin4fjords left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KamilMaliszArdigen commented Oct 16, 2024

KamilMaliszArdigen commented Oct 17, 2024

pinin4fjords commented Oct 17, 2024 • edited Loading

pinin4fjords commented Oct 17, 2024

pinin4fjords commented Oct 17, 2024 • edited Loading

KamilMaliszArdigen commented Oct 18, 2024 • edited Loading

pinin4fjords commented Oct 18, 2024

KamilMaliszArdigen commented Oct 8, 2024 •

edited

Loading

maxulysse Oct 8, 2024 •

edited

Loading

pinin4fjords Oct 13, 2024 •

edited

Loading

KamilMaliszArdigen Oct 11, 2024 •

edited

Loading

pinin4fjords Oct 12, 2024 •

edited

Loading

pinin4fjords left a comment •

edited

Loading

KamilMaliszArdigen commented Oct 14, 2024 •

edited

Loading

pinin4fjords commented Oct 17, 2024 •

edited

Loading

pinin4fjords commented Oct 17, 2024 •

edited

Loading

KamilMaliszArdigen commented Oct 18, 2024 •

edited

Loading