Skip to content

Simulation

coffm049 edited this page Feb 26, 2024 · 12 revisions

execute: echo: false

This simulation completely generates genotypes (G) and phenotypes (Y) for multiple simulated ancestries while controlling for the heritability ($h^2$) within and across said ancestries. The genotypes are generated from a set of simulated allelic frequencies ($f$). We simulate ancestries by creating clusters of allelic frequencies. All functions will be explictly stated as follows ("path/to/file" functionName) for clarity.

flowchart TB
  
  geneInput --> common
  input --> SNPeffects
  subpopulations --> subjGeno
  subjGeno --> geneEffects 
  ethBalance --> subjAncestry
  siteInput --> balance
  causalShared -->geneEffects
  balance --> subjSite
  geneInput --> causalShared

  subgraph User Input
    geneInput("# SNPs, (M) \n # Ancestries (O) \n Proportion causal (p) \n Proportion similar, p(Shared)")
    siteInput("# subjects (N) \n # sites (N_S) \n balance")
    input("Heritability, h2")
  end

  subgraph Allele frequencies 
    causalShared(Simulate Causal and shared regions) --> common
    common(Simulate \n Common ancestry) --> subpopulations(Simulate \n modern ancestries)
  end

  subgraph Study design
    balance(Simulate numbers \n at each site) -->  ethBalance(Simulate ancestries \n at each site)
  end

  subgraph Subjects
     subjSite(Assign subject \n Site)
     subjAncestry(Sample subject \n Ancestry)
     subjGeno(Sample subject \n genotype)

    subjSite --> subjAncestry --> subjGeno
  end

  subgraph Effects
    SNPeffects(Simulate SNP effects)
    geneEffects(Compute gene \n effects)

    SiteEffects(Simulate Site effects)
    errors(Simulate \n noise)

    phenotype(Simulate phenotype)
    
    
    SNPeffects --> geneEffects
    geneEffects --> phenotype
    SiteEffects --> phenotype
    errors --> phenotype 
  end
Loading

Genotypes

Providing genotype information

Simulations can be run off of provided plink formatted (.bed, .bim, .fam) genotypes. If the sample is known to come from mixed ancestries, an additional subject ancestry file can be provided which should have the following format

IID, FID, subj_ancestries
a, 1, AFR
b, 2, EUR
...

Note this is a comma separated file and the column names need to be precisely as defined above. The values for subj_ancestries can take any string and so doesn't need to be 3 capital letters.

Simulating Genotypes (no genotype information provided)

The simulation starts by simulating Single Nucleotide Polymorphisms at $M$ positions sequencing in multiple subjects ($N$) across potentially multiple ancestries ($O$). The genotypes are simulated in multiple steps:

  1. Simulating allelic frequency of a "common ancester"
  2. Simulating allelic frequencies of a set of ancestries that are descendent from the "common ancester"
  3. Simulate subject genotypes based on their assigned ancestry

Each of these steps are outlined in further detail below

Study design

("Simulate/simulation_helpers/sites.py" sim_sites)

The study design refers to the number of subjects total, the number of subjects of each reported ancestry, and, if desired, the distribution of those subjects across multiple sites. Simulation of imbalanced study designs are possible both in terms of subjects at each site ($N_L$) and in terms of distribution of subject ancestries at each site. If a balanced design is desired, subjects are equally split across each site, and the number of subjects ($N$) must be divisible by the number of sites ($L$). If an imbalanced design is desired, they are sampled from a dirichlet distribution as

$$ ∀ l \in {1,\dots, L}, N_l = x_lN \text{ where } x_l \stackrel{indep}{\sim} Dirichlet(α) \\ $$

Note that right now this is restricted to a symmetric dirichlet parameter (i.e. $∀i, α_i = α$). Each subject ($i\in {1\dots N}$) in this simulation is assumed to come from exactly one (No admixing) of multiple ancestries (there is no admixing). Each ancestry is assumed to spring from a common ancestor (as is supported through modern evolutionary theory).

Ancestry

("Simulate/simulation_helpers/clusters.py" assign_clusters)

Ancestries can be either assigned equally across all sites, their overall proportions sampled from a symmetric dirichlet distribution in which case the numbers of each ancestry may differ across each site.

Differentiation between ancestries is achieved by changing the allele frequencies ($f_{*}$) observed between the populations.

%%| label: fig-ancestries
%%| caption: Diagram of deriving ancestries from a common ancestor.
flowchart TB
  common(Common Ancestry, f_*0)
  a1(Ancestry 1, f_*1)
  a2(Ancestry 2, f_*2)
  more("...")
  aO(Ancestry O, f_*O)

  common --> a1
  common -->a2
  common --> more
  common --> aO
Loading

The common ancester is defined by a vector of allelic frequencies ($f_{*0}$) across all SNPs ($j\in {1\dots M}$). The common allele frequencies are simulated form a uniform distribution with lower and upper parameters 0.05 and 0.95 to avoid allele frequencies that would be filtered out for analysis anyway (minor allele frequencies < 0.05). The sampling distribution is desribed as

$$ ∀ m∈ {1, …, M}, f_{m0} \stackrel{iid}{\sim} Uni(0.1, 0.9) \ f_{*0} \sim Unif(0.1, 0.9)\ $$ ("Simulate/simulation_helpers/clusters.py")

Furthermore, for simulation purposes, we assume that the resulting modern ancestries derived from said common ancestor are derived from stochastic differences in allelic frequencies. More precisely, we assume that each ancestry is a random realization of a Beta distribution with the same mean as the common ancestry and an appropriately chosen variance. This amounts to a sampling as follows

$$ ∀ k ∈ {1, …, O}, ∀ m ∈{1, …, M}, f_{mk} \sim Beta(f_{0m}(1- θ_k)/θ_k, (1-f_{0m})(1-θ_k)/θ_k) $$

This supoposes the allele frequencies of the newer ancestors are random flucuations centered at the allele frequencies of the common ancestor. This is because the mean for the $X \sim Beta(α, β)$ distribution is

$$ E(X) = \frac{ α} { α + β} = f_{0m} $$

And the variance is

$$ Var(X) = \frac{ α β}{ (α + β)^2(α + β + 1)} \\ = f_{0m}(1-f_{0m})θ_k $$

Therefore, the simulated allele frequencies will be centered at allele frequencies of the common ancestor, and the spread of the resulting ancestries will be random fluctions around that center controlled by $θ_k$.Theoretically, $θ_k$ could differ for each ancestry, but initially we keep $θ_k$ consistent across all ancestries and the high dimensionality is such that the differences in allele frequency still define distinct clusters. This however restricts the simulation design currently to equally distinct ancestries and it is planned to augment this to allow for differently related ancestries.

Genotypes

("Simulate/simulation_helpers/genos.py" sim_genos)

The genotypes of each of the subjects are assumed to be independent draws from a $binomial(2, f_{*k})$ given the subjects ancestry k

$$ ∀ m ∈{1, …, M}, n ∈ {1, …, n}, G_{mo} \sim Bin(2, f_{mo}), \text{ given subject n in ancestry o} $$

Simulated subjects and ancestry centers plotted on first two genomic principal components. 1kg subjects plotted on first two genomic principal components.

Effects

("Simulate/simulation_helpers/pheno.py" sim_pheno)

The effects are simulated such that the phenotypic variance attributable to SNP effects (as opposed to random variances), also defined as the heritabilty ($h^2$), is controlled to equal a user specified input.

$$ h^2 = \frac{ σ_g^2}{ σ_g^2 + σ_e^2} $$

If there it is assumed that there are effects shared between ancestries (homogeneous) and effects that are specific to ancestries, the heritability for each ancestry ($o$)

$$ h_o^2 = \frac{ σ_{g,hom}^2 + σ_{g,het,o}^2}{ σ_{g,hom,o}^2 + σ_{g,het,o}^2 + σ_e^2} $$

While, this doesn't fix the heritability for ancestry o precisely to the specified heritability it is generally very close.

SNP effects

The SNPs effects are supposed to be inversely related to their respective minor allelic frequencies. So, to this end, the SNP effects are sampled as follows

$$ ∀ m ∈ {1, …, M}, o ∈ {1, …, O}, β_{mo} \sim N(0, \frac{ σ_g^2}{M f_{mo}(1-f_{mo})}) $$

Some portion of the effects are assumed to be homogeneous (w.r.t. ancestry) and so the allele frequency is taken to be the overall allele frequency in the overall sample. Some portion of the effects are assumed to be heterogeneous (w.r.t. ancestry) and so the allele frequency is taken to be the allele frequency within the given ancestry.

SNP effects across the genome. Distribution of homogeneous SNP effects. Ancestry specific distribution of SNP effects.

Noise

The random noise is simulated so as to satisfy the prescribed heritability so

$$ ϵ_n \sim N(0, \frac{σ_g^2}{h^2} - 1) $$

and in the presence of hetergenous effects

$$ ϵ_n \sim N(0, \frac{σ_{g,hom}^2 + σ_{g,het}^2}{h^2} - 1) $$

Site effects

The site effects aren't assumed to be correlated with the SNP effects and so can either be sample from a normal distribution as $γ_l \sim N(0, σ_s^2)$ or it can be prespecified to an arbitrary set of fixed effects. In some instances, sites can additionally have their own variances (heteroskedastic).

Phenotypes

The phenotypes are simulated from a linear model combining all of the above effects as

$$ Y_{n} = X_cβ_c + ∑_{m=1}^M G_{mn}β_{m,hom} + ∑_{l=1}^LX_{S,il}β_{S,l} + ϵ_n, \text{Given subject n is in ancestry o} $$

$$ Y_{n} = X_cβ_c + ∑_{m=1}^M G_{mn}β_{m,hom} + ∑_{o=1}^O∑_{m=1}^M G_{mn}β_{mo} + ∑_{l=1}^LX_{S,il}β_{S,l} + ϵ_n $$

$$ Y = X_cβ_c + Gβ_{hom} + ∑_{o=1}^O I(n ∈ o)Gβ_o + X_Sβ_S + ϵ_n $$

  1. The heritability is controlled to a prespecified value ($h^2$)
  2. The affect of a given SNP is related to it's allelic frequency

It has been suggested that a SNP effect is related to its observed frequency. ADD A CITATION AND AN IMAGE FOR THAT.

Appendix of symbols

Symbol Definition
$L$ Number of sites
$M$ Number of SNPs
$N$ Number of subjects
$O$ Number of ancestries
$l$ site index $l\in {1, \dots, L}$
$m$ SNP index, $m\in {1, \dots, M}$
$n$ Subject index, $n \in {1, \dots , N}$
$o$ Ancestry index index, $o \in {0, \dots , O}$
$f$ allelic frequencies, dimension $M \times O$
$f_{*k}$ column vector of allelic frequencies for ancestry $k$
$\tau$ variance associated with simulating allelic frequencies of the ancestries derived from the common ancestor
$p_{c}$ Proportion of genome considered causal (with nonzero effect)
$p_{s}$ $2$-vector of proportion of genome with similar allele frequencies across ancestries in the casual and in the non-causal regions
$N_l$ is number of subjects at site l