How to analyze large #19
Comments
@tibettiger to clarify: is your question that, for sQTL, the summary statistics file is too large to be converted to HDF5 format using our current pipeline? If so, could you elaborate on which step it got stuck at? My impression is that we first convert the data to HDF5 per condition (in your case, per cell type), then merge them per gene across conditions. Which step is problematic due to the large file size?
Yes. You can also do this manually without loading everything into HDF5, if you can manage to build a tabix index for the summary statistics and select genes using tabix-based tools. This was not easy to do for GTEx because the original format was not tabix-ready (it does not directly provide tab-delimited chr, pos, etc. columns), but this may be different in your case.
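For readers without a tabix index yet, here is a minimal sketch of the same idea (pulling out rows for selected genes without loading the whole file). It streams the file line by line in plain Python and is not tabix itself. The function name `select_gene_rows`, the column layout, and the demo rows are all assumptions for illustration, not part of the pipeline discussed here.

```python
import csv
import gzip
import os
import tempfile
from typing import Iterable, Iterator, List


def select_gene_rows(path: str, genes: Iterable[str],
                     gene_column: int = 0) -> Iterator[List[str]]:
    """Yield rows whose gene column matches one of `genes`.

    Assumes a tab-delimited file (optionally gzip-compressed) with the
    gene identifier in column `gene_column`; this layout is hypothetical.
    Only one line is held in memory at a time.
    """
    wanted = set(genes)
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if row and row[gene_column] in wanted:
                yield row


# Demo on a tiny made-up file: gene, SNP, p-value.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_sumstats.tsv")
with open(demo_path, "w", newline="") as out:
    out.write("geneA\tsnp1\t0.01\n")
    out.write("geneB\tsnp2\t0.5\n")
    out.write("geneA\tsnp3\t0.001\n")

selected = list(select_gene_rows(demo_path, ["geneA"]))
```

A full scan like this is slower than an indexed tabix lookup when many separate queries are needed, but it requires no reformatting of the input.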
Thanks, Wang Gao! I just want to clarify the second point. When we select the "lead SNP per gene", this SNP should be the lead SNP across all conditions, correct? For example, if we had the following results for a specific gene: In this case, we would choose SNP 2 for both cell type 1 and cell type 2, because SNP 2 has the lowest p-value (1e-15) across all SNPs and all cell types. Is my understanding correct? Thank you!
@boxiangliu yes, that is correct!
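The selection rule confirmed above can be sketched as follows. This is an illustrative sketch only: the function name `lead_snp_across_conditions` and all p-values except the 1e-15 for SNP 2 are made up, since the original results table is not shown in the thread.

```python
from typing import Dict, Tuple


def lead_snp_across_conditions(
    pvalues: Dict[str, Dict[str, float]]
) -> Tuple[str, str, float]:
    """Return (snp, condition, p) with the smallest p-value for one gene.

    `pvalues[condition][snp]` holds the p-value of `snp` in `condition`.
    """
    best = None
    for condition, per_snp in pvalues.items():
        for snp, p in per_snp.items():
            if best is None or p < best[2]:
                best = (snp, condition, p)
    if best is None:
        raise ValueError("no p-values supplied")
    return best


# Hypothetical p-values for one gene; only SNP2's 1e-15 comes from the thread.
example = {
    "cell_type_1": {"SNP1": 1e-3, "SNP2": 1e-15},
    "cell_type_2": {"SNP1": 1e-8, "SNP2": 1e-4},
}

lead_snp, lead_condition, lead_p = lead_snp_across_conditions(example)

# The same lead SNP is then used for every condition of this gene.
strong_row = {cond: example[cond][lead_snp] for cond in example}
```

Note that SNP1 has the smaller p-value within cell type 2 in this made-up example, yet SNP2 is still the one chosen for both cell types because the minimum is taken over all SNPs and all conditions.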
Dear authors, sorry for cutting in; I have a related question. When we select "strong" gene-SNP pairs, should we include genes that do not have a significant QTL in any condition? Did you set a threshold on the nominal p-value or FDR in the real GTEx analysis? Thanks in advance!
The precise way you select the strong SNPs should not be critical, and the overall results should be robust to including some null SNPs in that set by "accident".
Thank you very much for the clarification! That sounds reasonable because the sample size of GTEx was large enough and most genes were eGenes. (When I tested mashr with smaller sample sizes (N=40 per tissue, 29 tissues), the strong signals selected without a p-value threshold did not capture the expected biological similarity.) Best,
Dear author:
I am really sorry to disturb you. When using fastqtl_to_mashr.ipynb to detect cell-type-specific sQTLs from the AIDA datasets (nominal pass), we came across the following two main problems:
Those are all the questions we have, and I look forward to hearing from you.
Best Regards