-
Notifications
You must be signed in to change notification settings - Fork 1
/
Day1.Rmd
238 lines (173 loc) · 11.8 KB
/
Day1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
---
title: "phylogenetic_comparative_methods_Day1"
author: "Jigyasa_Arora"
date: "4/8/2020"
output: html_document
---
```{r setup, include=FALSE}
#load all the packages-
install.packages("ggpubr")
library(ggpubr)
install.packages("ape")
library(ape)
install.packages("phylolm")
library(phylolm)
install.packages("phytools")
library(phytools)
install.packages("geomorphs")
library(geomorph)
install.packages("geiger")
library(geiger)
install.packages("tibble")
library(tibble)
install.packages("tidytree")
library(tidytree)
options(scipen = 999) #disabling scientific notation in R.
```
Importance of independent contrasts-
------------------------------------
### Material taken from- http://www.phytools.org/Cordoba2017/ex/3/PICs.html
A linear regression model-
--------------------------
The Linear regression model is used to model the relationship between two variables-dependent variable and independent variable (also called as Explainatory variable). The relationship is fit in a formula: y~x, where y=dependent variable, x=independent variable. The linear equation is: y=mx+c+e
```{r}
obj<-read.csv("Centrarchidae.csv",row.names=1) #Centrarchidae are a species of fish
colnames(obj) #traits being measured <1 discrete trait, 2 continuous traits>
rownames(obj) #fish species name
#AIM : correlation between the measured characteristics-
#method1- linear regression
fit.ols<-lm(gape.width~buccal.length,data=obj)
summary(fit.ols) #weak significant correlation (see Adjusted R-sqaured values and p-value)
#plot the model-
plot(obj[,c("buccal.length","gape.width")],
xlab="relative buccal length",
ylab="relative gape width",pch=21,bg="grey",
cex=1.4)
abline(fit.ols,lwd=2,lty="dashed",col="red") #linear regression line
```
## checking the statistical reliability of the model-
```{r}
#test1- The residues are homogeneously distributed-
#plot the residues-
res = resid(fit.ols)
#plot1- The residues are randomly scattered around the plot, and closer to zero.
plot(obj$buccal.length, res, ylab="Residuals", xlab="Independent variable", main="Residuals plot")
abline(0, 0,col="blue")
#test2- The residues are normally distributed-
#plot2- The residues are normally distributed
ggqqplot(res)
shapiro.test(res) #p-value > 0.05 imply that the distribution of the data are not significantly different from normal distribution. In other words, we can assume the normality.
```
Understanding the results-
y=mx+c+e
Intercept ("c" variable):The point where the line meets the Y-axis.
Slope ("m" coeffient): The slope of the line as indicated by Estimate of buccal.length.
Check this link to understand how changing the slope, changes the relationship between the x and y variable-https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/regression-models/slope-and-intercept-of-the-regression-line/
Std.Error: Measures the sampling variation to examine if the sample observations are close to the True linear regression line. Smaller values are better.
t-value: This is the student t-test. The value of estimated t-value should be larger than the true t-value to reject the Null Hypothesis.
H0:Null hypothesis that the slope coefficient ("m") is 0 i.e. there is no linear relationship between x and y.
Residuals ("e" variable): Mathematically it is the distance of a data-point from the regression line i.e. difference between the observed values of dependent variable and predicted values. They are used to predict if the model is a good fit to the data.
Graphically, if the data-points are a)close to zero, b)symmetrically disributed in the plot (i.e. homoscedastic) ,c) normally distributed then the model is a good fit.
A example of how the residue plot should not look like- like-http://docs.statwing.com/interpreting-residual-plots-to-improve-your-regression/
R-square and Adjusted R-square: R^2 helps to determine how much variability in the dependent variable (y) is explained by the independent variable (x).
eg- R^2 of 0.22 means 22% of variablity in gape.width is explained by buccal.length.
*NOTE*- The R^2 value increases as more variables get added.Even though they are irrelevant variables. This is called as "Overfitting".
Adjusted R^2 is used to penalize extra variables in the model that do not add to the model explaination. It is called as an unbiased estimator of R^2 and is commonly used to determine the model fit.
Read more about the statistical reliability of the model in (chapter-Statistical Issues and Assumptions of Phylogenetic Generalized Least Squares in "Modern Phylogenetic Comparative Methods and Their Application in Evolutionary Biology" 2019).
When phylogeny is considered- Phylogenetic independent contrast [PIC]
---------------------------------------------------------------------
It happens sometimes that PIC model shows a lower value of p, and adjusted R2 as compared to a linear model. *But that does not mean you donot take phylogeny into consideration!*
```{r}
cent.tree<-read.tree("Centrarchidae.nwk")
plotTree(cent.tree)
obj<-obj[cent.tree$tip.label,] #order the "obj" dataframe according to "cent.tree" file
head(obj)
?pic #to understand how to use the method in R package
pic.bl<-pic(as.vector(obj$buccal.length),cent.tree) #"pic" function takes a vector as the first argument
pic.gw<-pic(as.vector(obj$gape.width),cent.tree)
#The PIC model-
fit.pic<-lm(pic.gw~pic.bl+0) #The intercept is zero.
summary(fit.pic)
#plot-
plot(obj[,c("buccal.length","gape.width")],
xlab="relative buccal length",
ylab="relative gape width",pch=21,bg="grey",
cex=1.4)
abline(fit.ols,lwd=2,lty="dashed",col="red") #the OLS model
abline(fit.pic,lwd=2,lty="dashed",col="blue") # the PIC model.
```
```{r}
##statistical reliability of the model-
#plot the residues-
res2 = resid(fit.pic)
node_age=branching.times(cent.tree)
plot(node_age, res2, ylab="Residuals of PIC", xlab="Independent variable (node age)", main="Residuals plot")
abline(0, 0,col="blue")
#plot2- The residues are normally distributed
ggqqplot(res)
shapiro.test(res)
```
Explaination of PIC and model fitting-
PIC model is based on the Brownian Motion model which can be simulated as shown below-
## simulating Brownian motion model on a phylogenetic tree- Material taken from-http://phytools.org/eqg/Exercise_4.1/
```{r}
#simulate a tree of 20 taxa <method1>
tre = rtree(20) #simulate a tree by randomly splitting the edges. 50 is number of tips in a tree.
#BM model-
x = rTrait(n=1,phy=tre, model="BM",plot.tree = TRUE)
## simulate Brownian evolution on a tree with fastBM <method2>
sig2 <- 0.01 #variance per generation. To examine the rate of evolution from one generation to the next. Increasing this value will increase the variance between species.
x <- fastBM(tree, sig2 = sig2, internal = TRUE)
## visualize Brownian evolution on a tree
phenogram(tree, x, spread.labels = TRUE, spread.cost = c(1, 0))
```
#### Question1: Compare the Adjusted R-sqaure and p-value of "fit.pic" and "fit.ols" regression models. What is the similarity and difference?
#### Ans1: Both the models have a "weak signal" which shows that "buccal.length" and "gape.width" are weakly correlated with each other. The main difference is that "fit.pic" has a weaker signal than "fit.ols" which could be naively considered as WRONG. But THAT'S NOT TRUE. Even though the "fit.ols" signal is high it is most probably due to type I error.
#### Question2: Compare the plots of residuals of linear regression data and phylogenetically corrected data. What is the difference?
#### Ans2:The residuals plot of PIC is more homogeneously distributed than that of linear regression. But as there is a weak correlation between the two continuous traits, not much difference can be observed.
#### Question3: What does a "weak signal" mean?
#### Ans3: "Weak signal" means that there is a weak linear (or phylogenetic) relationship between the two traits i.e. they vary independently of each other.
Quantifying Phylogenetic signal-
-------------------------------
Before we compare traits via PIC, its important to examine if the traits have a phylogenetic signal or not. This degree of variation in species trait values is predicted by phylogeny under the Brownian motion model of trait evolution.
```{r}
?phylosig #to check the documentation of the function
obj<-obj[cent.tree$tip.label,] #order the "obj" dataframe according to "cent.tree" file
phylosig(cent.tree,as.vector(obj$buccal.length),method="lambda",test=FALSE)
phylosig(cent.tree,as.vector(obj$buccal.length),method="K",test=FALSE)
phylosig(cent.tree,as.vector(obj$buccal.length),method="lambda",test=TRUE,nsim = 1000) #test=TRUE means that we will test the hypothesis if phylogenetic signal is significantly different from random distribution <called as randomization tests>
#this generates a p.value if the phylogenetic signal is significantly different from the null model of random distribution.
```
#### Question1: What does Lambda value = 0.24 and K value of 0.29 mean?
#### Ans1: The values of lambda and K vary from 0-1. Values closer to or greater than 1 means that the observed variation in trait is predicted by phylogeny. Values closer to 0 means there is less phylogenetic structure in the trait under Brownian Motion model. But as its difficult to assign what intermediate values of lambda and K might mean, significance testing helps to estimate if the values are significantly different from no-phylogenetic signal.
#### Question2: Does "buccal.length" has a phylogenetic signal? What about "gape.width"?
####Ans2: <perform in class>
#### Question3: Why do you think there was not much difference in "fit.ols" and "fit.pic"?
#### Ans3: Because "buccal.length" doesn't have a phylogenetic signal. But "gape.width" does. The difference in OLS and PIC model was driven by "gape.width" across the phylogeny. Its possible that independently a trait has a strong phylogenetic signal, but not when it is regressed with another trait (Symonds and Bloomberg,p105-130)
**NOTE :Do we still account for phylogeny if trait doesn't have a phylogenetic signal?**
How do we test for phylogenetic signal between a discrete character variable like "feeding mode" and continuous variable "buccal.length?
----------------------------------------------------------------------------------------------------------------------------------------
```{r}
#method1-
?phylANOVA
fit.anova<-phylANOVA(cent.tree, as.factor(obj$feeding.mode), obj$buccal.length, nsim=1000, posthoc=TRUE, p.adj="holm")
fit.anova #to check the results
#method2-
?procD.lm
gdf <- geomorph.data.frame(obj, phy = cent.tree) #create a dataframe that contains the data and phylogeny.
geomorph.lm<-procD.lm(buccal.length ~ as.factor(feeding.mode), iter=999,RRPP=TRUE,effect.type = "F", SS.type = c("I", "II", "III") ,data=gdf)
anova(geomorph.lm) #to examine if there is a phylogenetic correlation between "feeding mode" and "buccal.length"
pw<-pairwise(geomorph.lm,groups = as.factor(gdf$feeding.mode))
summary(pw,confidence=0.95) #pairwise comparison between "feeding modes" to examine differences between feeding groups.
```
What is Phylogenetic ANOVA?
Phylogenetic ANOVA is a special type of PGLS (or OLS) if the independent variable (x) is a factor variable rather than continuous variable.
Formula-Y(phy)=X(phy)B+E under the Brownian motion model, where Y is the dependent variable, X is the independent variable, B is slope coefficient and E is the residuals.
#### Q. Why do these two methods give different results?
#### Ans- They are based on two different algorithms to find the correlation between y and x. Check out Adams et al 2018 paper.
#### Q. Phylogenetic ANOVA can be extended to what other kind of analysis?
#### Ans- differences between lineages, nested effects of multiple factor variables (eg: sex,lineage,etc.; check function "interaction" in geomorph R package), comparing with Null model or different alternative models.
### Session info
```{r, session_info}
sessionInfo()
```