Day1.Rmd

---
title: "phylogenetic_comparative_methods_Day1"
author: "Jigyasa_Arora"
date: "4/8/2020"
output: html_document
---

```{r setup, include=FALSE}
#load all the packages-
install.packages("ggpubr")
library(ggpubr)
install.packages("ape")
library(ape)
install.packages("phylolm")
library(phylolm)
install.packages("phytools")
library(phytools)
install.packages("geomorphs")
library(geomorph)
install.packages("geiger")
library(geiger)
install.packages("tibble")
library(tibble)
install.packages("tidytree")
library(tidytree)

options(scipen = 999) #disabling scientific notation in R.
```

Importance of independent contrasts- 
------------------------------------
### Material taken from- http://www.phytools.org/Cordoba2017/ex/3/PICs.html 

A linear regression model-
--------------------------  
The Linear regression model is used to model the relationship between two variables-dependent variable and independent variable (also called as Explainatory variable). The relationship is fit in a formula: y~x, where y=dependent variable, x=independent variable. The linear equation is: y=mx+c+e
```{r}
obj<-read.csv("Centrarchidae.csv",row.names=1) #Centrarchidae are a species of fish
colnames(obj) #traits being measured <1 discrete trait, 2 continuous traits>
rownames(obj) #fish species name

#AIM : correlation between the measured characteristics-
#method1- linear regression
fit.ols<-lm(gape.width~buccal.length,data=obj)
summary(fit.ols) #weak significant correlation (see Adjusted R-sqaured values and p-value)

#plot the model-
plot(obj[,c("buccal.length","gape.width")],
    xlab="relative buccal length",
    ylab="relative gape width",pch=21,bg="grey",
    cex=1.4)
abline(fit.ols,lwd=2,lty="dashed",col="red") #linear regression line

```

## checking the statistical reliability of the model-
```{r}

#test1- The residues are homogeneously distributed-
#plot the residues-
res = resid(fit.ols)

#plot1- The residues are randomly scattered around the plot, and closer to zero.
plot(obj$buccal.length, res, ylab="Residuals", xlab="Independent variable", main="Residuals plot") 
abline(0, 0,col="blue")

#test2- The residues are normally distributed-

#plot2- The residues are normally distributed
ggqqplot(res)

shapiro.test(res) #p-value > 0.05 imply that the distribution of the data are not significantly different from normal distribution. In other words, we can assume the normality.
```

Understanding the results-
y=mx+c+e
Intercept ("c" variable):The point where the line meets the Y-axis.

Slope ("m" coeffient): The slope of the line as indicated by Estimate of buccal.length.
Check this link to understand how changing the slope, changes the relationship between the x and y variable-https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/regression-models/slope-and-intercept-of-the-regression-line/

Std.Error: Measures the sampling variation to examine if the sample observations are close to the True linear regression line. Smaller values are better.

t-value: This is the student t-test. The value of estimated t-value should be larger than the true t-value to reject the Null Hypothesis.

H0:Null hypothesis that the slope coefficient ("m") is 0 i.e. there is no linear relationship between x and y.

Residuals ("e" variable): Mathematically it is the distance of a data-point from the regression line i.e. difference between the observed values of dependent variable and predicted values. They are used to predict if the model is a good fit to the data.
Graphically, if the data-points are a)close to  zero, b)symmetrically disributed in the plot (i.e. homoscedastic) ,c) normally distributed then the model is a good fit.
A example of how the residue plot should not look like- like-http://docs.statwing.com/interpreting-residual-plots-to-improve-your-regression/

R-square and Adjusted R-square: R^2 helps to determine how much variability in the dependent variable (y) is explained by the independent variable (x).
eg- R^2 of 0.22 means 22% of variablity in gape.width is explained by buccal.length. 
*NOTE*- The R^2 value increases as more variables get added.Even though they are irrelevant variables. This is called as "Overfitting".
Adjusted R^2 is used to penalize extra variables in the model that do not add to the model explaination. It is called as an unbiased estimator of R^2 and is commonly used to determine the model fit.


Read more about the statistical reliability of the model in (chapter-Statistical Issues and Assumptions of Phylogenetic Generalized Least Squares in "Modern Phylogenetic Comparative Methods and Their Application in Evolutionary Biology" 2019).

When phylogeny is considered- Phylogenetic independent contrast [PIC]
---------------------------------------------------------------------

It happens sometimes that PIC model shows a lower value of p, and adjusted R2 as compared to a linear model. *But that does not mean you donot take phylogeny into consideration!*
```{r}

cent.tree<-read.tree("Centrarchidae.nwk")
plotTree(cent.tree)

obj<-obj[cent.tree$tip.label,] #order the "obj" dataframe according to "cent.tree" file
head(obj)

?pic #to understand how to use the method in R package
pic.bl<-pic(as.vector(obj$buccal.length),cent.tree) #"pic" function takes a vector as the first argument
pic.gw<-pic(as.vector(obj$gape.width),cent.tree)

#The PIC model-
fit.pic<-lm(pic.gw~pic.bl+0) #The intercept is zero.
summary(fit.pic)

#plot-
plot(obj[,c("buccal.length","gape.width")],
    xlab="relative buccal length",
    ylab="relative gape width",pch=21,bg="grey",
    cex=1.4)
abline(fit.ols,lwd=2,lty="dashed",col="red") #the OLS model
abline(fit.pic,lwd=2,lty="dashed",col="blue") # the PIC model.
```


```{r}
##statistical reliability of the model-
#plot the residues-
res2 = resid(fit.pic)
node_age=branching.times(cent.tree)

plot(node_age, res2, ylab="Residuals of PIC", xlab="Independent variable (node age)", main="Residuals plot") 
abline(0, 0,col="blue")

#plot2- The residues are normally distributed
ggqqplot(res)

shapiro.test(res) 

```

Explaination of PIC and model fitting-
PIC model is based on the Brownian Motion model which can be simulated as shown below-

## simulating Brownian motion model on a phylogenetic tree- Material taken from-http://phytools.org/eqg/Exercise_4.1/
```{r}
#simulate a tree of 20 taxa <method1> 
tre = rtree(20) #simulate a tree by randomly splitting the edges. 50 is number of tips in a tree.

#BM model-
x = rTrait(n=1,phy=tre, model="BM",plot.tree = TRUE)

## simulate Brownian evolution on a tree with fastBM <method2>
sig2 <- 0.01 #variance per generation. To examine the rate of evolution from one generation to the next. Increasing this value will increase the variance between species. 
x <- fastBM(tree, sig2 = sig2, internal = TRUE)
## visualize Brownian evolution on a tree
phenogram(tree, x, spread.labels = TRUE, spread.cost = c(1, 0))
```


#### Question1: Compare the Adjusted R-sqaure and p-value of "fit.pic" and "fit.ols" regression models. What is the similarity and difference?  
#### Ans1: Both the models have a "weak signal" which shows that "buccal.length" and "gape.width" are weakly correlated with each other. The main difference is that "fit.pic" has a weaker signal than "fit.ols" which could be naively considered as WRONG. But THAT'S NOT TRUE. Even though the "fit.ols" signal is high it is most probably due to type I error.

#### Question2: Compare the plots of residuals of linear regression data and phylogenetically corrected data. What is the difference?  
#### Ans2:The residuals plot of PIC is more homogeneously distributed than that of linear regression. But as there is a weak correlation between the two continuous traits, not much difference can be observed.

#### Question3: What does a "weak signal" mean?  
#### Ans3: "Weak signal" means that there is a weak linear (or phylogenetic) relationship between the two traits i.e. they vary independently of each other.

Quantifying Phylogenetic signal-
-------------------------------

Before we compare traits via PIC, its important to examine if the traits have a phylogenetic signal or not. This degree of variation in species trait values is predicted by phylogeny under the Brownian motion model of trait evolution.
```{r}

?phylosig #to check the documentation of the function

obj<-obj[cent.tree$tip.label,] #order the "obj" dataframe according to "cent.tree" file
phylosig(cent.tree,as.vector(obj$buccal.length),method="lambda",test=FALSE)
phylosig(cent.tree,as.vector(obj$buccal.length),method="K",test=FALSE)

phylosig(cent.tree,as.vector(obj$buccal.length),method="lambda",test=TRUE,nsim = 1000) #test=TRUE means that we will test the hypothesis if phylogenetic signal is significantly different from random distribution <called as randomization tests>
#this generates a p.value if the phylogenetic signal is significantly different from the null model of random distribution.
```
#### Question1: What does Lambda value = 0.24 and K value of 0.29 mean?  
#### Ans1: The values of lambda and K vary from 0-1. Values closer to or greater than 1 means that the observed variation in trait is predicted by phylogeny. Values closer to 0 means there is less phylogenetic structure in the trait under Brownian Motion model. But as its difficult to assign what intermediate values of lambda and K might mean, significance testing helps to estimate if the values are significantly different from no-phylogenetic signal.

#### Question2: Does "buccal.length" has a phylogenetic signal? What about "gape.width"?  
####Ans2: <perform in class>

#### Question3: Why do you think there was not much difference in "fit.ols" and "fit.pic"?  
#### Ans3: Because "buccal.length" doesn't have a phylogenetic signal. But "gape.width" does.  The difference in OLS and PIC model was driven by "gape.width" across the phylogeny. Its possible that independently a trait has a strong phylogenetic signal, but not when it is regressed with another trait (Symonds and Bloomberg,p105-130)

**NOTE :Do we still account for phylogeny if trait doesn't have a phylogenetic signal?**

How do we test for phylogenetic signal between a discrete character variable like "feeding mode" and continuous variable "buccal.length?
----------------------------------------------------------------------------------------------------------------------------------------

```{r}
#method1-

?phylANOVA
fit.anova<-phylANOVA(cent.tree, as.factor(obj$feeding.mode), obj$buccal.length, nsim=1000, posthoc=TRUE, p.adj="holm")

fit.anova #to check the results

#method2-

?procD.lm
gdf <- geomorph.data.frame(obj, phy = cent.tree) #create a dataframe that contains the data and phylogeny.
geomorph.lm<-procD.lm(buccal.length ~ as.factor(feeding.mode), iter=999,RRPP=TRUE,effect.type = "F", SS.type = c("I", "II", "III") ,data=gdf)
anova(geomorph.lm) #to examine if there is a phylogenetic correlation between "feeding mode" and "buccal.length"

pw<-pairwise(geomorph.lm,groups = as.factor(gdf$feeding.mode))
summary(pw,confidence=0.95) #pairwise comparison between "feeding modes" to examine differences between feeding groups.
```

What is Phylogenetic ANOVA?
Phylogenetic ANOVA is a special type of PGLS (or OLS) if the independent variable (x) is a factor variable rather than continuous variable.
Formula-Y(phy)=X(phy)B+E under the Brownian motion model, where Y is the dependent variable, X is the independent variable, B is slope coefficient and E is the residuals.

#### Q. Why do these two methods give different results?  
#### Ans- They are based on two different algorithms to find the correlation between y and x. Check out Adams et al 2018 paper.

#### Q. Phylogenetic ANOVA can be extended to what other kind of analysis?  
#### Ans- differences between lineages, nested effects of multiple factor variables (eg: sex,lineage,etc.; check function "interaction" in geomorph R package), comparing with Null model or different alternative models.


### Session info
```{r, session_info}
sessionInfo()
```