Controls / cases counts inverted when using binary model #63

GACGAMA · 2024-06-20T16:26:11Z

Hello!

I've finished analyzing a large cohort using 0/1 numerical values for control/case status respectively.
Then I went to my vcf to count the distributions of variants for each group.
But what I saw was the contrary of what I expected, I found things that seemed enriched in controls instead of case. Is that the expected order for the enrichment?

I observed that when looking for the percentage comparing controls and cases, the results are both ways - some seemed enriched for controls (most of them) and some for cases, so I'm not sure how to interpret this

Should I repeat by inversing 0/1 as case/control respectively to find things enriched in cases?

One example:

gene_step	gene	Cases_Sum_nHet	Control_Sum_nHet	Total X more in controls
MUC4_ODMS_WGS_WES_MUC4_synonymous	MUC4	261	2829	10.83908046
METTL1_ODMS_WGS_METTL1_promoter_CAGE	METTL1	4	0	0

To count, I summed the genotype calls for each group (case and control) and summarized it for each gene-category pair.
As you can see, I have a lot more total Heterozygous calls for controls in one example and cases in another

xihaoli · 2024-06-21T03:57:42Z

Hi @GACGAMA,

Since you are working on variant-set analysis, is your Cases_Sum_nHet defined as the sum of heterozygous calls across all case samples, or is it the total number of heterozygous carriers among case samples?

Overall, this is possible, as the effect sizes of different variants in the variant set could be in various directions- some variants have a protective effect, while others may have a deleterious impact on the outcome. As such, the heterozygous counts may not always be enriched in the case samples.

GACGAMA · 2024-06-26T15:12:18Z

Hi @xihaoli
In this case, the Cases_Sum_nHet is defined as the sum of the heterozygous carriers among case samples for each specific gene-function pair, such as gene-plof for example.
Is there any better way to find the direction of enrichment for each category of enriched genes?

kwdoyle · 2024-08-16T15:04:20Z

To jump in on this, @xihaoli, it seems the Score_Stat metric obtained for the variant sets after running the STAARpipelineSummary_Gene_Centric_[Coding/Noncoding]_Annotation.r scripts relate to the magnitude and direction of the effect for a given variant.
In my analysis, variants that have a higher percentage in cases vs controls also have a larger Score_Stat. Am I interpreting this correctly?

GACGAMA · 2024-08-16T17:03:55Z

@kwdoyle
As a follow-up:
I agree with you. Score_Stat is related to the direction, but only when looking at individual variants. I counted the variants in each of the binary outcomes and score_stat direction is how many in each group.
Individual analysis significant variants are enriching the signal for the genes, but the model also include multiple non-significant variants (the same as including a non-significant variable in a regression model to increase fit, as I understand).
We should have a way to visualize those things for the variants within STAAR results, the simplest one is to include the number of samples in each cohort for binary outcomes, but I'm not sure what would be usefull for quantitative outcomes

xihaoli · 2024-08-16T20:21:47Z

Thank you both!

@kwdoyle: Score_Stat reflect the variant effect direction and is for individual variants (@GACGAMA is correct). Regarding magnitude, you may look at effect size, which is currently provided in the output of Individual_Analysis.R from the STAARpipeline package.

@GACGAMA: Thanks for your patience. I agree that the simplest thing to look at is to include the number of samples (carriers) in each cohort for binary outcomes. Have you figured out a way to do it now?

Best,
Xihao

GACGAMA · 2024-08-20T15:48:54Z

@xihaoli I have figured a way to do that by using R, but I've used the VCF data which is not the best, because the .GDS files are available. Im not sure how to manipulate the GDS files but my scripts should be easily usable in this context if it is adapted
I'm available to help if needed

xihaoli · 2024-08-21T15:27:59Z

Thank you @GACGAMA. I have added you as a collaborator of the STAARpipeline-Tutorial repo. Please feel free to contribute if there is any improvements that you wish to make.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Controls / cases counts inverted when using binary model #63

Controls / cases counts inverted when using binary model #63

GACGAMA commented Jun 20, 2024 •

edited

Loading

xihaoli commented Jun 21, 2024

GACGAMA commented Jun 26, 2024

kwdoyle commented Aug 16, 2024

GACGAMA commented Aug 16, 2024 •

edited

Loading

xihaoli commented Aug 16, 2024

GACGAMA commented Aug 20, 2024

xihaoli commented Aug 21, 2024

Controls / cases counts inverted when using binary model #63

Controls / cases counts inverted when using binary model #63

Comments

GACGAMA commented Jun 20, 2024 • edited Loading

xihaoli commented Jun 21, 2024

GACGAMA commented Jun 26, 2024

kwdoyle commented Aug 16, 2024

GACGAMA commented Aug 16, 2024 • edited Loading

xihaoli commented Aug 16, 2024

GACGAMA commented Aug 20, 2024

xihaoli commented Aug 21, 2024

GACGAMA commented Jun 20, 2024 •

edited

Loading

GACGAMA commented Aug 16, 2024 •

edited

Loading