Skip to content

Latest commit

 

History

History
270 lines (207 loc) · 7.33 KB

5. CpG Island Shores Assessment.md

File metadata and controls

270 lines (207 loc) · 7.33 KB
title author date output
CpG Island Shores Assessment
Quynh Nhu Nguyen
March 4th 2023
html_document
path="/Users/lucia/OneDrive/Desktop/R/edX_DNA Methylation/tcgaMethylationSubset-master"
list.files(path)

load these Bioconductor packages:

library(minfi)
library(IlluminaHumanMethylation450kmanifest)
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)

Question 1:

Read in the sample annotation table:

targets=read.delim(file.path (path,"targets.txt"),as.is=TRUE)

How many samples are represented in this table?

nrow(targets)

Question 2:

How many samples are from normal colon samples?

nrow(targets[targets$Status=="normal" & targets$Tissue=="colon",])
# or
sum(targets$Tissue=="colon" & targets$Status=="normal")
##or look here
table(targets$Tissue,targets$Status)

Question 3:

For the next question we will read in breast and colon normal samples:

index = which( targets$Status=="normal" & targets$Tissue%in%c("colon","breast") )
targets = targets[index,]

Now we are ready to read in the data (this will take about 2 minutes):

library(minfi)
dat <- read.metharray.exp(base=path, targets=targets, verbose=TRUE)

dat includes the raw data. To convert this into an object that includes methylation values, as well as the location of CpGs, we do the following (we show you the class of dat as we transform it):

class(dat)
## preprocess the data
dat = preprocessIllumina(dat)
class(dat)
## assign locations to each CpG
dat = mapToGenome(dat)
class(dat)
## precompute methylation values from U and M values
dat = ratioConvert(dat,type="Illumina")
class(dat)

Before we start we can create some quality assessment plots. First look at the distribution of each sample:

library(rafalib)
mypar(1,1)
##extract methylation values
y = getBeta(dat)
shist(y)

Note that the distributions seem similar. Nothing stands out.

We also create an MDS plot to search for outlier samples. The first PC splits the data by tissue as expected and no sample stands out as an outlier.

mds = cmdscale( dist(t(y)))
tissue = as.factor(pData(dat)$Tissue)
plot(mds,col=tissue)

Now we are ready to use statistical inference to find differentially methylated regions. Let's start by using the limma package to perform a site-by-site analysis.

library(limma)
##create design matrix
tissue = as.factor(pData(dat)$Tissue)
X = model.matrix(~tissue)
##extract methylation values
y = getBeta(dat)
## obtain effect sizes and pvals with limma
fit = lmFit(y,X)

Find the CpG with the largest effect size when comparing the two tissues. What chromosome is it on?

maxes<-which.max(fit$coefficients[,2])
granges(dat)[maxes]
# model answer
index = which.max(abs( fit$coef[,2]))
seqnames(dat)[index]
start(dat)[index]

Question 4:

Now we will use the qvalue function to determine the q-value for the CpG found in the previous question.

library(qvalue)
##create design matrix
tissue = as.factor(pData(dat)$Tissue)
X = model.matrix(~tissue)
##extract methylation values
y = getBeta(dat)
## obtain effect sizes and pvals with limma
fit = lmFit(y,X)
eb = eBayes(fit)
## obtain q-values
qvals = qvalue(eb$p.value[,2])$qvalue

What is the q-value for this CpG?

##When reporting such small p-values there is no need to show exactly how small it is
## so we can say < 10^-6
index = which.max(abs( fit$coef[,2]))
-log10(qvals[index])

Question 5:

For this problem, we will use the location of the CpG discussed in the previous two questions.

Find all the CpGs within 5000 basepairs of the location of this CpG.

Create a plot showing the methylation values for all samples for these CpGs. Use color to distinguish breast from colon.

# model answer
library(rafalib)
mypar(3,1)
index = which.max(abs( fit$coef[,2]))
gr=granges(dat)[index]+5000
index=which(granges(dat)%over%gr)
pos= start(dat)[index]
matplot(pos,y[index,],ylab="Methylation",col=as.numeric(tissue))
plot(pos, fit$coef[index,2],ylab="Effect Size")
plot(pos, -log10(qvals[index]) ,ylab="-log10 q-value")

Question 6:

Repeat the above exercise, but now make the same plots for the top 10 CpGs ranked by absolute value of effect size. You can get the order like this:

library(rafalib)
mypar(3,1)
o = order(abs(fit$coef[,2]), decreasing = TRUE)[1:10]
for(i in o){
  index = i
  gr=granges(dat)[index]+5000
  index=which(granges(dat)%over%gr)
  pos= start(dat)[index]
  matplot(pos,y[index,,drop=FALSE],ylab="Methylation",col=as.numeric(tissue))
  plot(pos, fit$coef[index,2],ylab="Effect Size")
  plot(pos, -log10(qvals[index]) ,ylab="-log10 q-value")
}

Question 7:

Now we are going to explicitly search for regions using the bumphunter function. We will use permutation to assess statistical significance. Because the function is slow, we will restrict our analysis to chromosome 15.

index= which(seqnames(dat)=="chr15")
dat2 = dat[index,]

If your computer has more than one core, you can use parallel computing to speed up the procedure.

library(doParallel)
ncores = detectCores()
registerDoParallel(cores = ncores)

We can now run the bumphunter function to find differentially methylated regions (DMR). For this assessment question we will use 100 permutations, although we recommend more in practice. Here we will use a cutoff of 0.1. The permutations are random so make sure you set seed to 1 to obtain exact results in the assessment question.

##create design matrix
tissue <- as.factor(pData(dat)$Tissue)
X <- model.matrix(~tissue)

##extract methylation values
set.seed(1)
res <- bumphunter(dat2,X,cutoff=0.1,B=100)
head(res$tab)

According to these results, how many regions achieve an FWER lower than 0.05?

sum(res$table$fwer < 0.05)

Question 8:

Previously we performed a CpG by CpG analysis and obtained qvalues. Create an index for the CpGs that achieve qvalues smaller than 0.05 and a large effect size larger than 0.5 (in absolute value):

##fit and qvals were defined in a previous answer
index = which(qvals < 0.05 & abs(fit$coef[,2]) > 0.5 & seqnames(dat)=="chr15")

Now create a table of the DMRs returned by bumphunter that had 3 or more probes and convert the table into GRanges:

tab = res$tab[ res$tab$L >= 3,]
tab = makeGRangesFromDataFrame(tab)

What proportion of the CpGs indexed by index are inside regions found in tab (hint use the findOverlaps function in the GenomicRanges package)?

length(findOverlaps(dat[index,],tab))/length(index)
#or (model answer:
mean(granges(dat[index,])%over%tab)

Question 9:

Now download the table of CGI using AnnotationHub.

library(AnnotationHub)
cgi = AnnotationHub()[["AH5086"]]

Now we create a GRanges object from the list of DMRs we computed in the previous questions:

tab = res$tab[res$tab$fwer <= 0.05,]
tab = makeGRangesFromDataFrame(tab)

What proportion of the regions represented in tab do not overlap islands, but overall CpG islands shores (within 2000 basepairs) ? Hint: use the distanceToNearest

map = distanceToNearest(tab,cgi)
d = mcols(map)$distance
mean(d>0 & d<=2000)