Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Controls / cases counts inverted when using binary model #63

Open
GACGAMA opened this issue Jun 20, 2024 · 7 comments
Open

Controls / cases counts inverted when using binary model #63

GACGAMA opened this issue Jun 20, 2024 · 7 comments

Comments

@GACGAMA
Copy link
Collaborator

GACGAMA commented Jun 20, 2024

Hello!

I've finished analyzing a large cohort using 0/1 numerical values for control/case status respectively.
Then I went to my vcf to count the distributions of variants for each group.
But what I saw was the contrary of what I expected, I found things that seemed enriched in controls instead of case. Is that the expected order for the enrichment?

I observed that when looking for the percentage comparing controls and cases, the results are both ways - some seemed enriched for controls (most of them) and some for cases, so I'm not sure how to interpret this

Should I repeat by inversing 0/1 as case/control respectively to find things enriched in cases?

One example:

<style> </style>
gene_step gene Cases_Sum_nHet Control_Sum_nHet Total X more in controls
MUC4_ODMS_WGS_WES_MUC4_synonymous MUC4 261 2829 10.83908046
METTL1_ODMS_WGS_METTL1_promoter_CAGE METTL1 4 0 0

To count, I summed the genotype calls for each group (case and control) and summarized it for each gene-category pair.
As you can see, I have a lot more total Heterozygous calls for controls in one example and cases in another

@xihaoli
Copy link
Owner

xihaoli commented Jun 21, 2024

Hi @GACGAMA,

Since you are working on variant-set analysis, is your Cases_Sum_nHet defined as the sum of heterozygous calls across all case samples, or is it the total number of heterozygous carriers among case samples?

Overall, this is possible, as the effect sizes of different variants in the variant set could be in various directions- some variants have a protective effect, while others may have a deleterious impact on the outcome. As such, the heterozygous counts may not always be enriched in the case samples.

@GACGAMA
Copy link
Collaborator Author

GACGAMA commented Jun 26, 2024

Hi @xihaoli
In this case, the Cases_Sum_nHet is defined as the sum of the heterozygous carriers among case samples for each specific gene-function pair, such as gene-plof for example.
Is there any better way to find the direction of enrichment for each category of enriched genes?

@kwdoyle
Copy link
Collaborator

kwdoyle commented Aug 16, 2024

To jump in on this, @xihaoli, it seems the Score_Stat metric obtained for the variant sets after running the STAARpipelineSummary_Gene_Centric_[Coding/Noncoding]_Annotation.r scripts relate to the magnitude and direction of the effect for a given variant.
In my analysis, variants that have a higher percentage in cases vs controls also have a larger Score_Stat. Am I interpreting this correctly?

@GACGAMA
Copy link
Collaborator Author

GACGAMA commented Aug 16, 2024

@kwdoyle
As a follow-up:
I agree with you. Score_Stat is related to the direction, but only when looking at individual variants. I counted the variants in each of the binary outcomes and score_stat direction is how many in each group.
Individual analysis significant variants are enriching the signal for the genes, but the model also include multiple non-significant variants (the same as including a non-significant variable in a regression model to increase fit, as I understand).
We should have a way to visualize those things for the variants within STAAR results, the simplest one is to include the number of samples in each cohort for binary outcomes, but I'm not sure what would be usefull for quantitative outcomes

@xihaoli
Copy link
Owner

xihaoli commented Aug 16, 2024

Thank you both!

@kwdoyle: Score_Stat reflect the variant effect direction and is for individual variants (@GACGAMA is correct). Regarding magnitude, you may look at effect size, which is currently provided in the output of Individual_Analysis.R from the STAARpipeline package.

@GACGAMA: Thanks for your patience. I agree that the simplest thing to look at is to include the number of samples (carriers) in each cohort for binary outcomes. Have you figured out a way to do it now?

Best,
Xihao

@GACGAMA
Copy link
Collaborator Author

GACGAMA commented Aug 20, 2024

@xihaoli I have figured a way to do that by using R, but I've used the VCF data which is not the best, because the .GDS files are available. Im not sure how to manipulate the GDS files but my scripts should be easily usable in this context if it is adapted
I'm available to help if needed

@xihaoli
Copy link
Owner

xihaoli commented Aug 21, 2024

Thank you @GACGAMA. I have added you as a collaborator of the STAARpipeline-Tutorial repo. Please feel free to contribute if there is any improvements that you wish to make.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants