Code and scripts implementing genetic association and fine-mapping analyses of UK Biobank data.
All scripts implementing the data processing and analysis can be found in the scripts directory. The pipeline is currently illustrated for fine-mapping standing height in a region on chromosome 3 near gene ZBTB38, and can be adapted for other traits. The steps are as follows:
- Prepare phenotype data. Run the R script get_pheno.R to prepare a CSV file containing the phenotype and covariate data from the UK Biobank source files. For height, this step creates a new CSV file, height.csv, containing the phenotype and covariate data.
- Prepare SNP data. Run the R script get_geneatlas_snps.R to create a table of independently computed summary statistics, which are used to validate our association results. For height, this generates a new CSV file, geneatlas-neale-height.csv, containing the association results. Alternatively, run get_region_snps.R to list the ids of the genetic variants within the selected region, accompanied by independently computed summary statistics when available. This produces a new CSV file, region-variants-ZBTB38.csv, containing information about the selected genetic variants, such as base-pair positions, SNP variant ids, and association statistics.
- Prepare genotype data and SuSiE sufficient statistics. Run the bash script prepare.region.sh to create an RDS file containing sufficient statistics computed from the genotype data and height. The script requires four inputs: chromosome number, start base-pair position, stop base-pair position, and region name. For example,
  scripts/prepare.region.sh 3 140.8e6 141.8e6 ZBTB38
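The four inputs to prepare.region.sh can be sanity-checked before launching the job. The sketch below is not part of the repository: the helper names to_bp and check_region are hypothetical, and it assumes the chromosome is given as a positive integer and that positions may be written in scientific notation, as in the example above. It only prints the command that would be run and does not call the script itself.

```shell
# Hypothetical helper: convert a position such as 140.8e6 to an
# integer number of base pairs.
to_bp() {
  awk -v x="$1" 'BEGIN { printf "%.0f\n", x + 0 }'
}

# Hypothetical pre-flight check for the four inputs
# (chromosome, start, stop, region name). Prints the command that
# would be run and returns non-zero if the inputs look wrong.
check_region() {
  if [ "$#" -ne 4 ]; then
    echo "usage: check_region CHR START STOP NAME" >&2
    return 1
  fi
  chr=$1
  start=$(to_bp "$2")
  stop=$(to_bp "$3")
  name=$4
  # Assumption: chromosomes are passed as positive integers.
  case $chr in
    ''|*[!0-9]*)
      echo "chromosome must be a positive integer" >&2
      return 1
      ;;
  esac
  if [ "$start" -ge "$stop" ]; then
    echo "start position must be smaller than stop position" >&2
    return 1
  fi
  echo "scripts/prepare.region.sh $chr $2 $3 $name"
}
```

For the ZBTB38 region, `check_region 3 140.8e6 141.8e6 ZBTB38` prints the same command as in the example above, while swapping the start and stop positions produces an error instead.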
- Prepare phenotype data. Run R script get_bloodcells.R to prepare a CSV file containing the phenotype and covariate data from the UK Biobank source files. Run R script prepare_plink_pheno_bloodcells.R to prepare phenotype and covariate txt files for PLINK.
- Run GWAS. Run plink_gwas.sh and gwas_results.sh to get GWAS results.
- Get fine-mapping regions. Run R script get_bloodcells_trait_regions.R to get regions for each trait. Run R script get_bloodcells_regions.R to combine overlapping regions for each trait and across traits.
- Prepare genotype data, LD, and z scores for each region. Run get_bloodcells_region_genotype_ld.sh to get genotype data and LD for each region. Run R script get_bloodcells_zscores.R to get z scores and XtY for each region.
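The idea of combining overlapping regions can be illustrated with a short sketch. This is not the logic of get_bloodcells_regions.R, which is not shown here. It is a generic interval merge under stated assumptions: each input line holds a chromosome, a start position, and an end position separated by whitespace, and two regions on the same chromosome are merged when the second starts at or before the end of the first. The function name merge_regions is hypothetical.

```shell
# Hypothetical sketch: merge overlapping regions read from stdin.
# Assumed input format: "CHR START END" per line, whitespace-separated.
merge_regions() {
  # Sort by chromosome, then numerically by start position, so that
  # overlapping regions end up on adjacent lines.
  sort -k1,1 -k2,2n | awk '
    NR == 1 { chr = $1; start = $2; end = $3 + 0; next }
    # Same chromosome and overlapping: extend the current region.
    $1 == chr && $2 + 0 <= end { if ($3 + 0 > end) end = $3 + 0; next }
    # Otherwise emit the current region and start a new one.
    { print chr, start, end; chr = $1; start = $2; end = $3 + 0 }
    END { if (NR > 0) print chr, start, end }
  '
}
```

Regions from several traits can be pooled by concatenating their region lists and piping the result through merge_regions, for example `cat trait1.txt trait2.txt | merge_regions`, where the file names are placeholders.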