Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to analyze large #19

Open
tibettiger opened this issue Apr 5, 2023 · 6 comments
Open

How to analyze large #19

tibettiger opened this issue Apr 5, 2023 · 6 comments

Comments

@tibettiger
Copy link

Dear author:
I am really sorry to disturb you. When using fastqtl_to_mashr.ipynb to detect cell-type-specific sQTL coming from AIDA datasets(nominal pass), we come across the following two main problems:

  1. The nominal pass(nominal pass p =1) result file is too large. Total 22 chromosomes of one cell type took around space of 200G. We have 22 cell types. So it is impossible for us to detect cell-type specific sQTLs with the whole 22 celltypes together (around 22*200=4400G). I wonder to know how you transform GTEx data(nearly same size or even large) into HDF5 format using fastqtl_to_mashr.ipynb.
  2. Another problem is how you select strong signals from fastQTL summary statistics using this script. As we understand, we first select the top SNP(with the lowest p value) for each gene seperately in each 22 celltypes. Then we compare the 22 SNPs identified previously and select one SNP(with lowest p value) as the strong signal for this gene. I really wonder to know if our method is correct.

That is all the questions we come across and I really wish to hear from you.
Best Regards

@gaow
Copy link
Member

gaow commented Apr 5, 2023

@tibettiger to clarify, your question is that for sQTL, the summary statistics file is too large to be converted to HDF5 format using our current pipeline? If that is the case, could you elaborate which step it was stuck at? My impression is that we convert them to per condition (in your case cell type) data first in HDF5, then merge them per gene across conditions. Which step is problematic due to the large file size?

As we understand, we first select the top SNP(with the lowest p value) for each gene seperately in each 22 celltypes. Then we compare the 22 SNPs identified previously and select one SNP(with lowest p value) as the strong signal for this gene.

Yes. And you can do this manually without loading everything into HDF5 -- if you can manage to build a tabix index for the summary stats and select genes using tools based on tabix. It was not easy to do it for GTEx because the original format was not tabix ready (does not immediately have chr pos etc tab delimited) but this may be different for your case.

@boxiangliu
Copy link

Thanks Wang Gao!

I just want to clarify with regard to the second point.

When we select "lead SNP per gene", this SNP should be the lead SNP across all conditions, correct?

For example, if we had the following results for a specific gene:
cell type 1 cell type 2
SNP 1 1e-10 0.01
SNP 2 0.05 1e-15
SNP 3 0.03 0.02

In this case, we would choose SNP 2 for both cell type 1 and cell type 2, because SNP 2 has the lowest p-value (1e-15) across all SNPs and all cell types. Is my understanding correct?

Thank you!

@gaow
Copy link
Member

gaow commented Apr 6, 2023

@boxiangliu yes that is correct!

@hhat
Copy link

hhat commented Apr 6, 2023

Dear authors,

Sorry for cutting in,

I have a related question.

When we select "strong" gene-SNP pairs, should we include genes which do not have any significant QTL in any conditions?

Did you set any threshold of a nominal pvalue or FDR in real GTEx analysis?

Thanks in advance!

@stephens999
Copy link
Collaborator

The precise way you select the strong snps should not be critical, and the overall results should be robust to including some null snps in that set by "accident".

@hhat
Copy link

hhat commented Apr 14, 2023

@stephens999

Thank you very much for the clarification!

It sounds reasonable because the sample size of GTEx was large enough, and most genes were eGene.

(When I tested mashr in fewer sample sizes (N=40/tissues, 29 tissues), the strong signals without setting p-value thresholds did not capture the expected biological similarity.)

Best,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants