How to analyze large #19

tibettiger · 2023-04-05T03:29:23Z

Dear author:
I am really sorry to disturb you. When using fastqtl_to_mashr.ipynb to detect cell-type-specific sQTL coming from AIDA datasets(nominal pass), we come across the following two main problems:

The nominal pass(nominal pass p =1) result file is too large. Total 22 chromosomes of one cell type took around space of 200G. We have 22 cell types. So it is impossible for us to detect cell-type specific sQTLs with the whole 22 celltypes together (around 22*200=4400G). I wonder to know how you transform GTEx data(nearly same size or even large) into HDF5 format using fastqtl_to_mashr.ipynb.
Another problem is how you select strong signals from fastQTL summary statistics using this script. As we understand, we first select the top SNP(with the lowest p value) for each gene seperately in each 22 celltypes. Then we compare the 22 SNPs identified previously and select one SNP(with lowest p value) as the strong signal for this gene. I really wonder to know if our method is correct.

That is all the questions we come across and I really wish to hear from you.
Best Regards

gaow · 2023-04-05T12:45:58Z

@tibettiger to clarify, your question is that for sQTL, the summary statistics file is too large to be converted to HDF5 format using our current pipeline? If that is the case, could you elaborate which step it was stuck at? My impression is that we convert them to per condition (in your case cell type) data first in HDF5, then merge them per gene across conditions. Which step is problematic due to the large file size?

As we understand, we first select the top SNP(with the lowest p value) for each gene seperately in each 22 celltypes. Then we compare the 22 SNPs identified previously and select one SNP(with lowest p value) as the strong signal for this gene.

Yes. And you can do this manually without loading everything into HDF5 -- if you can manage to build a tabix index for the summary stats and select genes using tools based on tabix. It was not easy to do it for GTEx because the original format was not tabix ready (does not immediately have chr pos etc tab delimited) but this may be different for your case.

boxiangliu · 2023-04-06T01:12:21Z

Thanks Wang Gao!

I just want to clarify with regard to the second point.

When we select "lead SNP per gene", this SNP should be the lead SNP across all conditions, correct?

For example, if we had the following results for a specific gene:
cell type 1 cell type 2
SNP 1 1e-10 0.01
SNP 2 0.05 1e-15
SNP 3 0.03 0.02

In this case, we would choose SNP 2 for both cell type 1 and cell type 2, because SNP 2 has the lowest p-value (1e-15) across all SNPs and all cell types. Is my understanding correct?

Thank you!

gaow · 2023-04-06T01:34:20Z

@boxiangliu yes that is correct!

hhat · 2023-04-06T18:59:49Z

Dear authors,

Sorry for cutting in,

I have a related question.

When we select "strong" gene-SNP pairs, should we include genes which do not have any significant QTL in any conditions?

Did you set any threshold of a nominal pvalue or FDR in real GTEx analysis?

Thanks in advance!

stephens999 · 2023-04-07T12:51:09Z

The precise way you select the strong snps should not be critical, and the overall results should be robust to including some null snps in that set by "accident".

hhat · 2023-04-14T07:56:14Z

@stephens999

Thank you very much for the clarification!

It sounds reasonable because the sample size of GTEx was large enough, and most genes were eGene.

(When I tested mashr in fewer sample sizes (N=40/tissues, 29 tissues), the strong signals without setting p-value thresholds did not capture the expected biological similarity.)

Best,

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to analyze large #19

How to analyze large #19

tibettiger commented Apr 5, 2023

gaow commented Apr 5, 2023

boxiangliu commented Apr 6, 2023

gaow commented Apr 6, 2023

hhat commented Apr 6, 2023

stephens999 commented Apr 7, 2023

hhat commented Apr 14, 2023

How to analyze large #19

How to analyze large #19

Comments

tibettiger commented Apr 5, 2023

gaow commented Apr 5, 2023

boxiangliu commented Apr 6, 2023

gaow commented Apr 6, 2023

hhat commented Apr 6, 2023

stephens999 commented Apr 7, 2023

hhat commented Apr 14, 2023