---
title: "Preprocessing"
output: html_document
---
GOALS:

+ Read encoded data
+ Preliminary feature selection:
    + Filter constant features
    + Filter correlated features

Setup workspace

```{r setup, include=FALSE}
getwd()
library(janitor)
library(caret)
```
Preprocessing function

- remove constant features
- remove highly correlated features (absolute Spearman correlation > 0.8; this step is currently commented out in the function)

```{r preprocess_function}
# Read a CSV of encoded features and drop uninformative columns.
preprocessData <- function(path) {
  print("reading file...")
  data <- read.csv(path)

  print("removing constant features...")
  data <- remove_constant(data)  # janitor: drops columns holding a single value

  # Correlation filter (currently disabled):
  # print("calculating correlations...")
  # corr <- cor(data, method = "spearman")
  # featuresToRemove <- findCorrelation(corr, cutoff = 0.8)
  # if (length(featuresToRemove) > 0) {
  #   print(paste(
  #     "removing",
  #     length(featuresToRemove),
  #     "correlated features..."
  #   ))
  #   data <- subset(data, select = -featuresToRemove)
  # }

  print(paste("cleaned dataset contains", ncol(data), "features."))
  return(data)
}
```
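
The disabled correlation step could be sketched as a separate helper, shown here on toy data. `removeCorrelated` and the `toy` data frame are hypothetical names introduced for illustration, not part of the original function. The sketch assumes all columns are numeric and non-constant, since `cor()` would otherwise produce `NA` values that `findCorrelation()` cannot handle.

```{r correlation_filter_sketch}
library(caret)

# Hypothetical helper (not in the original code): drops one feature from each
# pair whose absolute Spearman correlation exceeds the cutoff.
removeCorrelated <- function(data, cutoff = 0.8) {
  corr <- cor(data, method = "spearman")
  # findCorrelation() returns the column indices to drop, integer(0) if none
  featuresToRemove <- findCorrelation(corr, cutoff = cutoff)
  if (length(featuresToRemove) > 0) {
    data <- data[, -featuresToRemove, drop = FALSE]
  }
  data
}

# Toy data: b is a monotone transform of a, c is unrelated noise
set.seed(1)
toy <- data.frame(a = 1:20, b = (1:20)^2, c = rnorm(20))
toy_filtered <- removeCorrelated(toy)
names(toy_filtered)
```

Checking `length(featuresToRemove) > 0` matters because `findCorrelation()` returns an empty vector when nothing exceeds the cutoff, and a negative index built from an empty vector would drop every column.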
Read & preprocess ace vaxinpad
```{r message=FALSE, warning=FALSE}
ace_vaxinpad <-
  preprocessData("../Data/ace_vaxinpad/ace_vaxinpad_binary_centered.csv")
# Class labels are read from the first column; the file has no header line
ace_vaxinpad_classes <-
  read.csv("../Data/ace_vaxinpad/ace_vaxinpad_classes.txt", header = FALSE)
ace_vaxinpad_classes <-
  factor(ace_vaxinpad_classes[, 1], labels = c("No", "Yes"))
```
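
A small sketch of how `factor()` assigns `labels`, since the mapping depends on the sorted order of the raw values. The 0/1 encoding in `toy_raw` is an assumption for illustration; the real class files may be encoded differently.

```{r factor_labels_sketch}
# Hypothetical raw labels; the actual encoding in the class files is not shown here
toy_raw <- c(0, 1, 1, 0)

# Labels are matched to the sorted unique values: 0 -> "No", 1 -> "Yes"
toy_labels <- factor(toy_raw, labels = c("No", "Yes"))
table(toy_raw, toy_labels)

# A single string is one label, not two: R reuses it with a numeric suffix
toy_single <- factor(toy_raw, labels = c("No, Yes"))
levels(toy_single)
```

The second call shows why each label needs its own quoted string: a one-element `labels` vector does not raise an error, so the mistake is easy to miss.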
Read & preprocess HIV protease
```{r message=FALSE, warning=FALSE}
hiv_protease <-
  preprocessData("../Data/hiv_protease/hiv_protease_binary_centered.csv")
# Class labels are read from the first column; the file has no header line
hiv_protease_classes <-
  read.csv("../Data/hiv_protease/hiv_protease_classes.txt", header = FALSE)
hiv_protease_classes <-
  factor(hiv_protease_classes[, 1], labels = c("No", "Yes"))
```
Train / Test split
```{r}
set.seed(69)
# 80 % of the samples go to training; sampling is stratified by class
trainIndACE <- createDataPartition(y = ace_vaxinpad_classes, p = 0.8, list = FALSE)
trainIndHIV <- createDataPartition(y = hiv_protease_classes, p = 0.8, list = FALSE)
```
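
The split above relies on `createDataPartition()` sampling within each class when `y` is a factor. A minimal sketch on toy labels makes that visible; `toy_classes` and its 30/70 class balance are made up for illustration.

```{r partition_sketch}
library(caret)

# Hypothetical labels: 30 "No" and 70 "Yes"
set.seed(69)
toy_classes <- factor(rep(c("No", "Yes"), times = c(30, 70)))
toy_ind <- createDataPartition(y = toy_classes, p = 0.8, list = FALSE)

# Class proportions overall and in the training part should be close
prop.table(table(toy_classes))
prop.table(table(toy_classes[toy_ind]))

# With list = FALSE the result is a one-column matrix of row indices
dim(toy_ind)
```

The same comparison can be run on the real class vectors to confirm that the training and test parts keep the original class balance.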
Classes
```{r}
ace_vaxinpad_classes_train <- ace_vaxinpad_classes[trainIndACE]
ace_vaxinpad_classes_test <- ace_vaxinpad_classes[-trainIndACE]
hiv_protease_classes_train <- hiv_protease_classes[trainIndHIV]
hiv_protease_classes_test <- hiv_protease_classes[-trainIndHIV]
```
Original data sets
```{r}
ace_vaxinpad_train <- ace_vaxinpad[trainIndACE, ]
ace_vaxinpad_test <- ace_vaxinpad[-trainIndACE, ]
hiv_protease_train <- hiv_protease[trainIndHIV, ]
hiv_protease_test <- hiv_protease[-trainIndHIV, ]
```