Skip to content

Latest commit

 

History

History
181 lines (136 loc) · 5.01 KB

3. Finding Differentially Methylated Regions in R Assessment.md

File metadata and controls

181 lines (136 loc) · 5.01 KB
title author date output
Finding Differentially Methylated Regions in R Assessment
Quynh Nhu Nguyen
March 4th 2023
html_document

Up to now the assessments have focused on characteristics of the human genome related to DNA methylation. Now we are ready to study measurement data. We will be working with a small subset of TCGA data that we have created for illustrative purposes. However, it is real cancer DNA methylation data. Note that we are going to use some of the material we learned in course 3 (Advanced Statistics) and course 4 (Introduction to Bioconductor) If you have not done so already, you will need to install the following library from the github repository:

BiocManager::install("devtools")
library(devtools)
install_github("genomicsclass/coloncancermeth")

Now we can load the library as well as the needed data objects:

library(coloncancermeth)
data(coloncancermeth)
dim(meth)
dim(pd)
print( gr )

Question 1:

From dim(pd) we can see that there is a total of 26 samples.

How many are cancer samples?

table(pd$Status)

Which column of the meth matrix is a cancer sample and has BCR patient barcode "TCGA-A6-4107"?

##figure out the column number this way:
names(pd)[ grep("barcode",names(pd)) ]
##
which(pd[,1]=="TCGA-A6-4107" & pd$Status=="cancer")

Question 2:

Use the methylation profiles to compute a Euclidean distance between each sample.

d = dist( t(meth))

Now use the cmdscale function to create an MDS plot that graphically shows approximate distances between the samples, using color to distinguish cancer and normal samples.

mds = cmdscale(d)
cols = as.numeric(pd$Status)
plot(mds,col=cols)
legend("topleft",levels(pd$Status),col=1:2,pch=1)

Question 3:

For each CpG compute p-values for the cancer versus normal comparison using the limma package:

BiocManager::install("limma")
library(limma)
X <- model.matrix(~pd$Status)
fit <- lmFit(meth,X)
eb <- eBayes(fit)
pvals <- eb$p.value[,2]

Now use the qvalue() function in the qvalue package to obtain q-values.

What proportion of genes have q-values smaller than 0.05?

BiocManager::install("qvalue")
library(qvalue)
qvals = qvalue(pvals)$qvalue
mean(qvals<=0.05)

Question 4:

Before high-throughput technologies were available, cancer epigenetics research focused on searching for CpG islands showings higher levels of methylation in cancer (hypermethylated). Let's explore the data at hand in this regard.

What proportion of the CpGs showing statistically significant differences (defined with q-values in the previous question) are, on average, higher in cancer compared to normal samples?

qvals = qvalue(pvals)$qvalue    # previous question
index = which(qvals<=0.05)
diffs = fit$coef[index,2]
mean(diffs > 0)

Question 5:

Now let's determine which of the differentially methylated CpGs are in CpG islands.

Let's review the code we used in a previous assessment:

BiocManager::install("AnnotationHub")
library(AnnotationHub)
ah = AnnotationHub()
cgi = ah[["AH5086"]]

What proportion of the differentially methylated CpGs are inside islands? Hint: use %over%

##We re-run the code from above
library(qvalue)
library(limma)
X = model.matrix(~pd$Status)
fit = lmFit(meth,X)
eb <- eBayes(fit)
pvals = eb$p.value[,2]
qvals = qvalue(pvals)$qvalue
index = which(qvals<=0.05)
##Now we can see which CpGs are in islands
mean(gr[index]%over%cgi)

Note that we can now see the proportions of each combination.

islands=gr[index]%over%cgi
hypermethylated=fit$coef[index,2]>0
prop.table( table(islands,hypermethylated) )

Question 6:

Now we will use the bumphunter package to separate the differentially methylated CpGs into groups.

BiocManager::install("bumphunter")
library(bumphunter)
X = model.matrix(~pd$Status)
chr = as.character(seqnames(gr))
res = bumphunter(meth,X,chr=chr,pos=start(gr),cutoff=0.1)

From here we get a table of regions:

head(res$table)

Note that the bumphunter function has options to assess uncertainty, which are turned on through the B argument. However, these options make this function computationally intensive. We therefore skip this step here and, instead of filtering by statistical significance, filter by region size.

dmrs = res$table[ res$table$L>=3, ]

Note that this table is not a GenomicRanges object, but we can turn it into one easily:

dmrs = makeGRangesFromDataFrame(dmrs)

For the regions in dmrs, find the distance to the closest island (hint: use distanceToNearest).

What proportion of DMRs overlap a CpG island?

map<-distanceToNearest(dmrs,cgi)
d=mcols(map)$distance
mean(d==0)

Question 7:

What proportion of DMRs are within 2000 basepairs from a CpG island, but do not overlap?

map = distanceToNearest(dmrs,cgi)
d = mcols(map)$distance
mean(d>0 & d<=2000)