forked from cambiotraining/r-intermediate
-
Notifications
You must be signed in to change notification settings - Fork 28
/
2.dplyr-intro-solutions.Rmd
109 lines (74 loc) · 3 KB
/
2.dplyr-intro-solutions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
---
title: "Solutions to exercise: tidying and transforming data"
author: "Mark Dunning and Matt Eldridge"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output: html_document
---
## Part I -- tidying data
1. Read in the simulated clinical dataset, `clinical-data.txt`.
```{r}
library(tidyverse)
clinical_data <- read_tsv("clinical-data.txt")
```
Take a look at the data and answer the following questions:
* What variables and observations do we have?
* What might a 'tidy' version of this dataset look like?
2. Transform the data into a tidy form using the `gather` function from the `tidyr` package.
```{r}
clinical_data <- gather(clinical_data, key = "Treatment", value = "Value", -Subject)
```
3. Display the range of values for each drug and placebo treatment as a box plot
```{r}
ggplot(clinical_data, mapping = aes(x = Treatment, y = Value)) +
geom_boxplot()
```
## Part II -- selecting columns
4. Read in the patients dataset
```{r}
patients <- read_tsv("patient-data.txt")
```
5. Select all the columns between `Height` and `Grade_Level`
```{r}
select(patients, Height:Grade_Level)
```
6. Select all the columns between `Height` and `Grade_Level`, but not `State`
```{r}
select(patients, Height:Grade_Level, -State)
```
7. Select all columns ending in 'eight', i.e. `Height` and `Weight`
```{r}
select(patients, ends_with("eight"))
```
8. Select all columns except `Birth`, `State` and `Grade_Level`
```{r}
select(patients, -Birth, -State, -Grade_Level)
select(patients, -c(Birth, State, Grade_Level))
select(patients, -(Birth:Grade_Level))
```
## Part III - data transformations
The study that collected the patient data is interested in the relationship between body mass index (BMI) and smoking.
Which the relevant variables (columns) in our patients dataset and what problems can you see with these columns as they currently stand?
9. Clean the `Height` and `Weight` columns to remove the 'cm' and 'kg' units respectively.
```{r}
patients <- mutate(patients, Height = as.numeric(str_remove(Height, pattern = "cm$")))
patients <- mutate(patients, Weight = as.numeric(str_remove(Weight, pattern = "kg$")))
patients
```
10. Clean the `Smokes` column so that it contains just `TRUE` and `FALSE` values.
_Hint: use the `%in%` operator, e.g. `c("Matt", "George") %in% c("John", "Paul", "George", "Ringo")`_
```{r}
unique(patients$Smokes)
patients <- mutate(patients, Smokes = Smokes %in% c("TRUE", "Yes"))
patients
```
11. Calculate the body mass index (BMI) for each of our patients using the formula $BMI = Weight / (Height)^2$, where the weight is measured in kilograms and the height is in metres.
```{r}
patients <- mutate(patients, BMI = Weight / (Height / 100) ** 2)
select(patients, ID, Smokes, Height, Weight, BMI)
```
12. Create a new variable to indicate which individuals are overweight, i.e. those with a BMI of above 25.
```{r}
patients <- mutate(patients, Overweight = BMI > 25)
select(patients, ID, Smokes, Height, Weight, BMI, Overweight)
```
What other problems can you find in the patients dataset?