---
title: "HW4 ML"
author: "Bijesh Mishra"
date: "2/24/2021"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Machine Learning Homework 4
#### Due date: March 01, 2021
Variables:

* predictors: eprobe scores
* eprobes: Ep1, Ep2, Ep3, Ep4, Ep5, Total.
```{r Machine Learning Hw4 Prep., message=TRUE, warning=TRUE, paged.print=TRUE}
rm(list = ls()) # Clear environment.
setwd("/Users/bmishra/Dropbox/OSU/PhD/SemVISp2021/STAT5063ML/Homeworks/hw4 L(Q)DA KNN")
# Load libraries:
library(MASS)  # lda(), qda()
library(ISLR)
library(class) # knn()
library(readxl)
library(readr)
eprobe = read.csv("/Users/bmishra/Dropbox/OSU/PhD/SemVISp2021/STAT5063ML/Data/Eprobe.csv")
eprobe = setNames(eprobe, tolower(names(eprobe))) # lowercase variable names
attach(eprobe, warn.conflicts = FALSE)
eprobe[c(1, 13), ] # preview two rows
# View(eprobe)
```
#### 1: Plotting and preliminary analysis:
Q1.a) Box plot: Total score vs. Pathogen. Which would work better: LDA or QDA?
```{r ml hw4.Q1.a., message=TRUE, warning=TRUE, paged.print=TRUE}
boxplot(total ~ pathogen,
data = eprobe,
main = "Pathogen Count Score",
col = c("green", "red"),
names = c("No", "Yes"),
ann = T,
xlab = "Pathogen",
ylab = "Score Total")
```
Answer: Quadratic discriminant analysis (QDA) should work better. Linear discriminant analysis (LDA) assumes that the class-conditional densities are Gaussian and that all classes share the same variance-covariance matrix, so a single pooled covariance matrix is used for every class. QDA relaxes this assumption: the covariance matrix $\Sigma_k$ may differ across classes, so it is estimated separately for each of the K classes and the quadratic term in the discriminant function does not cancel. In the boxplot above, the spread of the total scores clearly differs between the pathogen and non-pathogen groups, so the equal-covariance assumption of LDA looks unreasonable and the covariance matrix should be estimated separately for each of the K = 2 classes.
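For reference, the two discriminant functions in standard notation (a textbook sketch, not computed from these data) make the difference explicit; the shared $\Sigma$ in LDA is what makes the terms quadratic in $x$ cancel:
$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k \qquad \text{(LDA)}$$
$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + \log\pi_k \qquad \text{(QDA)}$$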
Q1.b: Get a t-test p-value under the LDA assumption and a t-test p-value under the QDA assumption to determine whether $H_o: \mu_1 = \mu_2$, where $\mu_1$ and $\mu_2$ are the mean total scores in the pathogen and non-pathogen populations.
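The two tests differ only in the variance estimate, mirroring the pooled (LDA-style) versus separate (QDA-style) variance assumptions:
$$t_{\text{pooled}} = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}}, \qquad t_{\text{Welch}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$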
```{r ml hw4.Q1.b.LDA, message=TRUE, warning=TRUE, paged.print=TRUE}
t.test(total ~ pathogen,
var.equal = T, # Variances equal: LDA
conf.level = 0.95,
alternative = c("two.sided"),
data = eprobe)
```
Answer: LDA (pooled variance): $H_o: \mu_1 = \mu_2$ and $H_a: \mu_1 \neq \mu_2$. Conclusion: since p = 0.004024, we reject $H_o$ in favor of $H_a$ and conclude that the true difference in mean total score between the pathogen and non-pathogen populations is not zero.
```{r ml hw4.Q1.b.QDA, message=TRUE, warning=TRUE, paged.print=TRUE}
t.test(total ~ pathogen,
var.equal = F, # Variances not equal: QDA
conf.level = 0.95,
alternative = c("two.sided"),
data = eprobe)
```
Answer: QDA (unequal variances): $H_o: \mu_1 = \mu_2$ and $H_a: \mu_1 \neq \mu_2$. Conclusion: since p = 0.000157, we reject $H_o$ in favor of $H_a$ and conclude that the true difference in mean total score between the pathogen and non-pathogen populations is not zero.
Q1.c: Construct a plot with Y = I(Pathogen = "Yes") on the y-axis and total score on the x-axis, with the logistic regression curve superimposed. Does it appear that pathogens can be discriminated from controls using total scores, based on this plot? Interpret the p-value for the logistic model.
```{r ml hw4.Q1.c.Plot, message=TRUE, warning=TRUE, paged.print=TRUE}
logitq1c = glm(pathogen ~ total, family = "binomial", data = eprobe)
plot(total, I(pathogen == "Y"),
xlab = "Total Score",
ylab = "Pathogen = Yes",
pch = 1, col = 1,
main = "Pathogen Score when Pathogen is Present")
b0 = coef(logitq1c)[1] # intercept (approx. -23.738)
b1 = coef(logitq1c)[2] # slope on total (approx. 3.134)
curve(exp(b0 + b1*x)/(1 + exp(b0 + b1*x)), add = TRUE, col = "red", lwd = 3, lty = 1)
```
Answer: The plot suggests pathogens can largely be discriminated from controls using the total score, but discrimination is hard in the region where the two classes overlap: there are points with pathogen ≈ 0 and pathogen ≈ 1 at similar total scores near the steep part of the fitted curve.
```{r ml hw4.Q1.c.Summary, message=TRUE, warning=TRUE, paged.print=TRUE}
summary(logitq1c)
```
Answer: Since the Wald p-value for total exceeds 0.05, the total score is not significantly associated with pathogen status at the 5% level. (When the classes are as well separated as the plot suggests, the Wald test's standard errors become inflated, so this p-value should be read with caution.)
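As a quick check (a minimal sketch, not part of the assignment), a likelihood-ratio test of the same hypothesis avoids the inflated Wald standard errors that near-separation causes:
```{r ml hw4.Q1.c.LRT, message=TRUE, warning=TRUE, paged.print=TRUE}
# Likelihood-ratio test for total: compares the deviance of the
# intercept-only model with the residual deviance of logitq1c.
anova(logitq1c, test = "Chisq")
```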
#### 2: Logistic Regression on total score:
Q2.a) Find values for X such that p(X) = 0.1, 0.5, 0.9 and add them to your plot. Hint: p(X) = c $\Longleftrightarrow$ logit(p(X)) = logit(c). Use a different line type and color for each line and add a key to your plot that identifies each line.
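Inverting the logit gives the required cutoffs in closed form:
$$p(X) = c \iff \hat{\beta}_0 + \hat{\beta}_1 X = \log\frac{c}{1 - c} \iff X = \frac{\log\{c/(1 - c)\} - \hat{\beta}_0}{\hat{\beta}_1}$$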
```{r ml hw4.Q2.a., message=TRUE, warning=TRUE, paged.print=TRUE}
logit2a = logitq1c
l2acf = coef(logit2a)
# Solve p(X) = c for X: X = (log(c/(1 - c)) - b0)/b1
cs = c(0.1, 0.5, 0.9)
xc = (log(cs/(1 - cs)) - l2acf[1])/l2acf[2]
xc # X values at which p(X) = 0.1, 0.5, 0.9
plot(total, I(pathogen == "Y"),
xlab = "Total Score",
ylab = "Pathogen = Yes",
pch = 1, col = 1,
main = "Pathogen Score when Pathogen is Present")
curve(exp(b0 + b1*x)/(1 + exp(b0 + b1*x)), add = TRUE, col = "red", lwd = 3, lty = 1)
abline(v = xc[1], lwd = 1, lty = 2, col = 1)
abline(v = xc[2], lwd = 1, lty = 3, col = 2)
abline(v = xc[3], lwd = 1, lty = 4, col = 3)
legend("bottomright",
title = "Legend:",
col = c(1, 2, 3),
lty = c(2, 3, 4),
legend = c("p(X) = 0.1", "p(X) = 0.5", "p(X) = 0.9"))
```
Q2.b) Get a table that contains the training error, the training TPR, and the training FPR for a rule that classifies an observation as a pathogen if p(x) > c, for each value of c above.
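Note that `predict(logit2a)` returns the linear predictor (log-odds) by default, so the rule $p(x) > c$ is implemented below by comparing the log-odds against $\log\{c/(1 - c)\}$.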
```{r ml hw4.Q2.b.Prob 0.1, message=TRUE, warning=TRUE, paged.print=TRUE}
prediction1 = rep("N", length(pathogen))
prediction1[predict(logit2a) > log(.1/(1-.1))] = "Y" # p(x) = 0.1
table1 = table(prediction1, pathogen) # p(x) = 0.1
proportion1 = round(prop.table(table(prediction1, pathogen),2),3) # p(x) = 0.1
trr0.1 = mean(prediction1 != pathogen) # p(x) = 0.1 #Training Error Rate:
TPR.2b1 = proportion1[2,2]/ sum(proportion1[1,2], proportion1[2,2]) #TPR
FPR.2b1 = proportion1[2,1]/ sum(proportion1[1,1], proportion1[2,1]) #FPR
proportion1
trr0.1
TPR.2b1
FPR.2b1
```
```{r ml hw4.Q2.b.Prob 0.5, message=TRUE, warning=TRUE, paged.print=TRUE}
prediction2 = rep("N", length(pathogen))
prediction2[predict(logit2a) > log(.5/(1-.5))] = "Y" # p(x) = 0.5
table2 = table(prediction2, pathogen) # p(x) = 0.5
proportion2 = round(prop.table(table(prediction2, pathogen),2),3) # p(x) = 0.5
trr0.5 = mean(prediction2 != pathogen) # p(x) = 0.5 #Training Error Rate:
TPR.2b2 = proportion2[2,2]/ sum(proportion2[1,2], proportion2[2,2]) #TPR
FPR.2b2 = proportion2[2,1]/ sum(proportion2[1,1], proportion2[2,1]) #FPR
proportion2
trr0.5
TPR.2b2
FPR.2b2
```
```{r ml hw4.Q2.b.Prob 0.9, message=TRUE, warning=TRUE, paged.print=TRUE}
prediction3 = rep("N", length(pathogen))
prediction3[predict(logit2a) > log(.9/(1-.9))] = "Y" # p(x) = 0.9
table3 = table(prediction3, pathogen) # p(x) = 0.9
proportion3 = round(prop.table(table(prediction3, pathogen),2),3) # p(x) = 0.9
trr0.9 = mean(prediction3 != pathogen) # p(x) = 0.9 #Training Error Rate
TPR.2b3 = proportion3[2,2]/ sum(proportion3[1,2], proportion3[2,2]) #TPR
FPR.2b3 = proportion3[2,1]/ sum(proportion3[1,1], proportion3[2,1]) #FPR
trr0.9
proportion3
TPR.2b3
FPR.2b3
```
```{r ml hw4.Q2.b.table, message=TRUE, warning=TRUE, paged.print=TRUE}
probability = c(0.1, 0.5, 0.9) # cutoffs c
Train.Err = c((table1[1,2] + table1[2,1]), # misclassification counts
(table2[1,2] + table2[2,1]),
(table3[1,2] + table3[2,1]))
TPR = c(TPR.2b1, TPR.2b2, TPR.2b3)
FPR = c(FPR.2b1, FPR.2b2, FPR.2b3)
TRR = c(trr0.1, trr0.5, trr0.9) # training error rates
data.frame(probability, Train.Err, TPR, FPR, TRR)
```
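The three cutoff chunks above repeat the same computation; a compact helper (a sketch, assuming `pathogen` is coded "N"/"Y" as in the tables above; `rate_table` is a hypothetical name) produces the same summary for any cutoff:
```{r ml hw4.Q2.b.helper, message=TRUE, warning=TRUE, paged.print=TRUE}
rate_table = function(fit, truth, c) {
  # type = "response" returns fitted probabilities, so compare to c directly
  pred = ifelse(predict(fit, type = "response") > c, "Y", "N")
  tab = table(factor(pred, levels = c("N", "Y")), truth)
  c(cutoff = c,
    Train.Err = tab["Y", "N"] + tab["N", "Y"], # misclassification count
    TPR = tab["Y", "Y"]/sum(tab[, "Y"]),
    FPR = tab["Y", "N"]/sum(tab[, "N"]),
    TRR = mean(pred != truth)) # training error rate
}
t(sapply(c(0.1, 0.5, 0.9), function(c) rate_table(logit2a, pathogen, c)))
```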
#### 3: Linear Discriminant Analysis on total:
Q3.a) Get the estimated Bayes decision boundary (you may assume $\pi_1$ = $\pi_2$) and interpret the coefficient in L(x).
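For a single predictor, the boundary solves $\delta_1(x) = \delta_2(x)$. In standard notation (a sketch, with $\hat{\sigma}^2$ the pooled variance estimate),
$$x^* = \frac{\bar{x}_1 + \bar{x}_2}{2} + \frac{\hat{\sigma}^2\log(\hat{\pi}_1/\hat{\pi}_2)}{\bar{x}_2 - \bar{x}_1},$$
which reduces to the midpoint of the two class means when $\pi_1 = \pi_2$. The chunk below computes the boundary from the `lda` fit.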
```{r ml hw4.Q3.a., message=TRUE, warning=TRUE, paged.print=TRUE}
lda.q3 = lda(pathogen~total, data = eprobe)
lda.q3
# Bayes Boundary:
q3.b1 = lda.q3$scaling[1]
var.q3 = (lda.q3$means[2] - lda.q3$means[1])/q3.b1
q3.b0 = log(lda.q3$prior[2]) -
log(lda.q3$prior[1]) +
(lda.q3$means[1]^2/(2*var.q3)) -
(lda.q3$means[2]^2/(2*var.q3))
bayes.boundary = -(q3.b0/q3.b1)
data.frame(q3.b0, q3.b1, bayes.boundary)
```
Answer: The estimated Bayes decision boundary is -0.3152. The LDA output shows $\hat{\pi}_{no}$ = 0.3077 and $\hat{\pi}_{yes}$ = 0.6923, i.e., 30.77% of the training observations are non-pathogens and 69.23% are pathogens. The group means are the class means of total for class N and class Y, which LDA uses as estimates of $\mu_{k}$. The coefficients of linear discriminants define L(x) as a linear combination of the predictors (here only total) that forms the decision rule: the larger the discriminant score 0.022 $\times$ total, the more strongly the rule favors classifying the observation as Y rather than N.
Q3.b) Get the training error and the training TPR and FPR.
```{r ml hw4.Q3.b., message=TRUE, warning=TRUE, paged.print=TRUE}
lda.predict = predict(lda.q3, newdata = eprobe)
names(lda.predict)
# cbind(lda.predict$x, lda.predict$posterior, lda.predict$class, eprobe$pathogen)
ldamat = table(lda.predict$class, eprobe$pathogen)
lda.trr = mean(lda.predict$class != eprobe$pathogen) ##training error rate
lda.prop = round(prop.table(table(predicted = lda.predict$class,
truth = eprobe$pathogen), 2), 3)
TPR.lda = ldamat[2,2]/(ldamat[1,2]+ldamat[2,2])
FPR.lda = ldamat[2,1]/(ldamat[1,1]+ldamat[2,1])
TPR.lda ##True Positive Rate
FPR.lda ##False Positive Rate
ldamat #LDA Matrix
TrErr.lda = c((ldamat[1,2] + ldamat[2,1]))
lda.prop #LDA Proportion.
TrErr.lda # Training Error
lda.trr # Training Error Rate
```
* True Positive Rate = 1
* False Positive Rate = 0.125
* Training Error = 1 misclassified observation
* Training Error Rate = 0.03846 (= 1/26)
#### 4: Quadratic discriminant analysis on total:
Q4: Get the training error and the training TPR and FPR.
```{r ml hw4.Q4., message=TRUE, warning=TRUE, paged.print=TRUE}
qda.q4 = qda(pathogen~total, data = eprobe)
table(predict(qda.q4)$class, pathogen)
qda.predict = predict(qda.q4, newdata = eprobe)
# names(qda.predict)
# cbind(qda.predict$x, qda.predict$posterior, qda.predict$class, eprobe$pathogen)
qdamat = table(qda.predict$class, eprobe$pathogen)
qda.trr = mean(qda.predict$class != eprobe$pathogen) ##training error rate
qda.prop = round(prop.table(table(predicted = qda.predict$class,
truth = eprobe$pathogen), 2), 3)
TPR.qda = qdamat[2,2]/(qdamat[1,2]+qdamat[2,2])
FPR.qda = qdamat[2,1]/(qdamat[1,1]+qdamat[2,1])
# qdamatrix #QDA Matrix
qda.q4
qdamat
TrErr.qda = c((qdamat[1,2] + qdamat[2,1])) #Training Error
TrErr.qda
qda.prop #QDA Proportion.
TPR.qda ##True Positive Rate
FPR.qda ##False Positive Rate
qda.trr # Training Error Rate
```
* True Positive Rate = 0.778
* False Positive Rate = 0
* Training Error = 4 misclassified observations
* Training Error Rate = 0.154 (= 4/26)
#### 5: LDA on Ep1, Ep2, ..., Ep5.
Q5.a): Report the coefficients for the linear discriminant function if discriminating between populations with Ep1, ..., Ep5 and determine which Eprobe best discriminates pathogens from non-pathogens by commenting on the coefficients.
```{r ml hw4.Q5.a:, message=FALSE, warning=FALSE, paged.print=FALSE}
lda5a = lda(pathogen ~ ep1 + ep2 + ep3 + ep4 + ep5, data = eprobe)
lda5a
# plot(lda5a)
```
Answer: The coefficients of the linear discriminant for the respective eprobes are given above. Judging by the coefficients, ep3 best discriminates pathogens from non-pathogens: its coefficient is much larger than the others, so it contributes most to the discriminant score, and large values of ep3 push the classification toward Y. (This comparison is fair only to the extent that the eprobe scores are on comparable scales; see the check below.)
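Because raw LDA coefficients depend on the predictors' scales, one common check (a sketch, assuming the lowercased columns `ep1`–`ep5` created in the setup chunk) is to multiply each coefficient by the standard deviation of its predictor before comparing:
```{r ml hw4.Q5.a.scaled, message=FALSE, warning=FALSE, paged.print=FALSE}
# Scale-adjusted discriminant loadings: coefficient * SD of each eprobe.
eps = eprobe[, c("ep1", "ep2", "ep3", "ep4", "ep5")]
lda5a$scaling[, 1] * apply(eps, 2, sd)
```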
Q5.b): Get the training error and the training TPR and FPR.
```{r ml hw4.Q5.b:, message=TRUE, warning=TRUE, paged.print=TRUE}
table(predict(lda5a)$class, pathogen)
lda5a.predict = predict(lda5a, newdata = eprobe)
# names(lda5a.predict)
# cbind(lda5a.predict$x, lda5a.predict$posterior, lda5a.predict$class, eprobe$pathogen)
lda5amat = table(lda5a.predict$class, eprobe$pathogen)
lda5a.trr = mean(lda5a.predict$class != eprobe$pathogen) ##training error rate
lda5a.prop = round(prop.table(table(predicted = lda5a.predict$class,
truth = eprobe$pathogen), 2), 3)
TPR.lda5a = lda5amat[2,2]/(lda5amat[1,2] + lda5amat[2,2])
FPR.lda5a = lda5amat[2,1]/(lda5amat[1,1] + lda5amat[2,1])
lda5amat
TrErr.lda5a = c((lda5amat[1,2] + lda5amat[2,1])) #Training Error
TrErr.lda5a #Training Error
lda5a.prop #LDA Proportion.
TPR.lda5a ##True Positive Rate
FPR.lda5a ##False Positive Rate
lda5a.trr # Training Error Rate
```
* True Positive Rate = 0.889
* False Positive Rate = 0.25
* Training Error = 4 misclassified observations
* Training Error Rate = 0.154 (= 4/26)
#### 6: KNN on total:
Q6.a): Get the training error and training TPR and FPR for k = 5 nearest neighbors using total as a predictor.
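Note: with a single predictor, `scale(total)` does not change which points are nearest neighbors (an affine transformation rescales all pairwise distances by the same factor); scaling matters once several predictors on different scales are combined.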
```{r ml hw4.Q6.a:, message=TRUE, warning=TRUE, paged.print=TRUE}
knnq6 = knn(scale(total), scale(total), pathogen, k = 5)
knnq6mat = table(knnq6, pathogen)
TPR.knnq6 = knnq6mat[2,2]/(knnq6mat[1,2] + knnq6mat[2,2])
FPR.knnq6 = knnq6mat[2,1]/(knnq6mat[1,1] + knnq6mat[2,1])
TrErr.knn = c((knnq6mat[1,2] + knnq6mat[2,1])) #Training Error
trr.knn = mean((knnq6 != eprobe$pathogen)) #Training Error Rate
TPR.knnq6 ##True Positive Rate
FPR.knnq6 ##False Positive Rate
knnq6mat #KNN Matrix
TrErr.knn #Training Error
trr.knn
```
* True Positive Rate = 1
* False Positive Rate = 0.25
* Training Error = 2 misclassified observations
* Training Error Rate = 0.07692 (= 2/26)
Q6.b): Get a plot of training error vs. k for k = 1, 2, ..., 10 with k on the x-axis. For what value(s) of k would you anticipate the most bias in the training error estimate? Justify your answer in a sentence or two.
```{r ml hw4.Q6.b:, message=TRUE, warning=TRUE, paged.print=TRUE}
set.seed(1)
# Fit KNN for k = 1, ..., 10 and record the misclassification count for each k.
training.error = sapply(1:10, function(k) {
  pred = knn(scale(total), scale(total), pathogen, k = k)
  tab = table(pred, pathogen)
  tab[1, 2] + tab[2, 1] # misclassification count
})
k = 1:10
plot(k, training.error, col = "red", xlab = "K",
ylab = "Training Error",
main = "KNN Training Error vs K")
```
Answer: In this run, k = 2 has the highest training error (the largest misclassification count in the plot above). The question, however, asks about bias in the training error as an estimate of test error, and that bias is largest for small k, above all k = 1: because the training set is also used as the query set, each point's nearest neighbor is (ties aside) itself, so the training error is pushed toward zero and is a highly optimistic estimate of test performance. As k grows the model becomes less flexible, so model bias increases and variance decreases, but the training error becomes a less optimistic estimate of the test error.
```{r ml hw4.Q6.b alt:, message=TRUE, warning=TRUE, paged.print=TRUE}
# Alternative: compute training error rates (not counts) in a loop:
# set.seed(1)
# errors = NULL
# for(i in 1:10){
# eprobe_KNN = knn(scale(total),
# scale(total),
# eprobe$pathogen, k = i)
# errors[i] = round(mean(eprobe_KNN != pathogen), 3)
# }
# k <- 1:10
# plot(k, errors)
```
#### 7: (Grad Students Only) Posterior probabilities on the logistic plot.
Q7: Reconstruct the plot with total score on the x-axis and pathogen on the y-axis with the logistic regression curve superimposed. Add $(x_i, p_2(x_i))$ to the plot using the points function, where $p_2(x_i)$ is the posterior probability of pathogen for the LDA on the total scores, the full LDA, and the QDA on total, using different plotting characters and colors for each set of probabilities. Add a legend to the plot to identify the probabilities. Hint: You'll want to run something like the code below to get one set of points:
preds.fullLDA <- predict(my.full.LDA.model)
points(total, preds.fullLDA$posterior[,2], pch = 2, col = 2)
```{r ml hw4.Q7:, message=TRUE, warning=TRUE, paged.print=TRUE}
# Logistic regression curve:
plot(total, I(pathogen == "Y"),
xlab = "Total Score",
ylab = "Pathogen = Yes",
pch = 1,
col = as.factor(pathogen),
main = "Pathogen Score when Pathogen is Present")
b0 = coef(logitq1c)[1]
b1 = coef(logitq1c)[2]
curve(exp(b0 + b1*x)/(1 + exp(b0 + b1*x)),
add = TRUE, col = "red",
lwd = 3, lty = 1)
# Base R recycles col = 9, 10, 11 through its 8-color palette, so use
# col = 4, 5, 6 to keep these points distinct from the data points and curve.
points(total, lda.predict$posterior[,2],
pch = 2, col = 4) # LDA on total
points(total, lda5a.predict$posterior[,2],
pch = 3, col = 5) # Full LDA with 5 Eps.
points(total, qda.predict$posterior[,2],
pch = 4, col = 6) # QDA on total
legend("bottomright", col = c(4, 5, 6),
pch = c(2, 3, 4),
legend = c("LDA", "Full LDA", "QDA"))
```
#### 8: (Grad Students Only) LDA coefficients with two predictors.
Q8: Consider an LDA for 2 predictors. Denote the estimated covariance matrix by S = $(s_{ij})$ and denote the mean vectors by $\bar{x}^T_1 =$ ($\bar{x}_{11}, \bar{x}_{12}$) and $\bar{x}^T_2 =$ ($\bar{x}_{21}, \bar{x}_{22}$). Find expressions for $\hat{\beta}_1$ and $\hat{\beta}_2$ in the linear discriminant function $L(x) = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2$. How does $\hat{\beta}_1$ compare to the univariate estimate of $\hat{\beta}_1$ when $s_{12} = 0$?
```{r ml hw4.Q8:, message=TRUE, warning=TRUE, paged.print=TRUE}
# Theoretical question: see next page.
```
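A sketch of the derivation (the full write-up is on the attached page): the coefficient vector of the linear discriminant is $\hat{\beta} = S^{-1}(\bar{x}_2 - \bar{x}_1)$, and inverting the $2 \times 2$ matrix $S$ gives
$$\hat{\beta}_1 = \frac{s_{22}(\bar{x}_{21} - \bar{x}_{11}) - s_{12}(\bar{x}_{22} - \bar{x}_{12})}{s_{11}s_{22} - s_{12}^2}, \qquad \hat{\beta}_2 = \frac{s_{11}(\bar{x}_{22} - \bar{x}_{12}) - s_{12}(\bar{x}_{21} - \bar{x}_{11})}{s_{11}s_{22} - s_{12}^2}.$$
When $s_{12} = 0$ these reduce to $\hat{\beta}_1 = (\bar{x}_{21} - \bar{x}_{11})/s_{11}$, exactly the univariate estimate: with uncorrelated predictors, the bivariate LDA coefficient for $x_1$ is unaffected by $x_2$.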