Multiple regression analyses often assume that the response and covariates of each individual are observed, and use them to infer the regression coefficients. Here, motivated by applications in genetics, we assume that these individual-level data are not available, and that only the summary statistics of univariate regression (essentially, the effect size estimates and their standard errors) are provided. We also assume that information on the correlation structure among covariates is available. The aim is to infer the multiple regression coefficients from these marginal regression summary statistics.
This work is motivated by applications in genome-wide association studies (GWAS). When fitting the multiple regression model to individual-level GWAS data, the covariates are the genotypes typed at different genetic variants (typically SNPs), the response is a quantitative phenotype (e.g. height or blood lipid level), and the regression coefficients are the effects of each SNP on the phenotype. Due to privacy and logistical issues, the individual-level data are often not easily available. In contrast, GWAS summary statistics (from standard single-SNP analysis) are widely available in the public domain (e.g. GIANT and PGC). Moreover, the correlation among covariates (genotypes of SNPs), known as linkage disequilibrium, can also be obtained from public databases (e.g. the 1000 Genomes Project). When the protected individual-level data are not available, can we perform "multiple-SNP" analysis using these public assets?
Here we provide a generally applicable framework for multiple-SNP analyses using GWAS single-SNP summary data. Specifically, we introduce a “Regression with Summary Statistics” (RSS) likelihood, which relates the multiple regression coefficients to the univariate regression results. We then combine the RSS likelihood with suitable priors to perform Bayesian inference for the regression coefficients.
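To make the idea concrete, here is a minimal sketch in Python with NumPy. It is not the RSS software. It assumes the likelihood form given in Zhu and Stephens (2017), in which the vector of single-SNP estimates is modeled as normal with mean S R S^{-1} β and covariance S R S, where S is the diagonal matrix of standard errors and R is the correlation (LD) matrix. The function names `single_snp_summary` and `rss_loglik` are hypothetical, the data are simulated, and the standard errors use the textbook least-squares formula, which may differ slightly from the definition used in the paper.

```python
import numpy as np


def single_snp_summary(X, y):
    """Univariate regression of y on each column of X, one at a time.

    Assumes y and every column of X have been centered. Returns the marginal
    effect size estimates and textbook least-squares standard errors
    (an assumption; other standard error conventions exist).
    """
    n = X.shape[0]
    xtx = np.sum(X * X, axis=0)                  # x_j' x_j for each covariate j
    betahat = (X.T @ y) / xtx                    # marginal effect estimates
    resid = y[:, None] - X * betahat             # residuals of each univariate fit
    sigma2 = np.sum(resid ** 2, axis=0) / (n - 2)
    return betahat, np.sqrt(sigma2 / xtx)


def rss_loglik(beta, betahat, se, R):
    """Log density of N(betahat; S R S^{-1} beta, S R S) with S = diag(se)."""
    mean = se * (R @ (beta / se))                # S R S^{-1} beta
    cov = R * np.outer(se, se)                   # S R S
    chol = np.linalg.cholesky(cov)               # requires R to be positive definite
    z = np.linalg.solve(chol, betahat - mean)    # whitened residual
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    p = betahat.shape[0]
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + z @ z)


# Simulated toy data (not real GWAS data): 500 individuals, 4 covariates.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X[:, 1] += 0.8 * X[:, 0]                         # induce correlation between two covariates
X -= X.mean(axis=0)
beta_true = np.array([0.3, 0.0, 0.0, -0.2])
y = X @ beta_true + rng.normal(size=500)
y -= y.mean()

betahat, se = single_snp_summary(X, y)
R = np.corrcoef(X, rowvar=False)                 # here from the same sample, for illustration only
print(rss_loglik(beta_true, betahat, se, R))     # log-likelihood at the simulating coefficients
print(rss_loglik(np.zeros(4), betahat, se, R))   # log-likelihood at the null (all effects zero)
```

In a real analysis only `betahat`, `se` and an estimate of `R` (for example from a reference panel) would be available, and `rss_loglik` would be combined with a prior on `beta` rather than evaluated at known coefficients.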
The repository is licensed under the MIT License.
- Get started with some short tutorials.
- Refer to the FAQ for answers to some common questions.
- Create a new issue to report bugs and/or request features.
- The Regression with Summary Statistics (RSS) likelihood
  Xiang Zhu and Matthew Stephens (2017). Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Annals of Applied Statistics 11(3): 1561-1592. [Article PDF] [Journal Page] [bioRxiv Page] [Supplementary Information] [Software]
- RSS-E: Enrichment and prioritization analysis based on RSS likelihood
  Xiang Zhu and Matthew Stephens (2018). Large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across 31 human phenotypes. Nature Communications 9, 4361. [Article PDF] [Journal Page] [bioRxiv Page] [Supplementary Information] [Online Results] [Software]
- RSS-NET: Integrated analysis of regulatory networks based on RSS likelihood
  Xiang Zhu, Zhana Duren and Wing Hung Wong (2021). Modeling regulatory network topology improves genome-wide analyses of complex human traits. Nature Communications 12, 2851. [Article PDF] [Journal Page] [bioRxiv Page] [Supplementary Information] [Online Results] [Software]
- More extensions of RSS to come. Stay tuned!
Here we have developed a likelihood function for the multiple regression coefficients based on univariate regression summary data, which opens the door to a wide range of statistical machinery for inference. Using this likelihood, we have implemented Bayesian methods to estimate SNP heritability, detect genetic association, assess gene set or network enrichment, prioritize trait-associated genes and infer genetic architecture. Please check our progress updates regularly.
If you have specific applications that use GWAS summary data as input, and want to build new statistical methods based on the RSS likelihood, please feel free to contact us. We are glad to help!
Xiang Zhu, Ph.D.
Matthew Stephens Lab
Department of Statistics
University of Chicago