# Reproducible Research: Peer Assessment 1
## Loading and preprocessing the data
The following code reads in the data, formats the date column, creates a weekday column and presents summary statistics:
```{r}
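library(pastecs)                      # provides stat.desc()
# Read the raw data, keeping strings as character rather than factor
data <- read.csv(file="activity.csv", stringsAsFactors=FALSE)
data$date <- as.Date(data$date, format='%Y-%m-%d')
data$week <- weekdays(data$date)      # day name; depends on the session locale
str(data)
options(scipen=100, digits=2)         # avoid scientific notation, 2 significant digits
# Descriptive statistics, dropping columns 2 and 4 (date and week)
summ <- stat.desc(data[,-c(2,4)])
summ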
```
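As a minimal sketch of the same loading steps, the snippet below uses a made-up three-row stand-in for `activity.csv` (the name `toy` and its values are assumptions for illustration only). It also selects the numeric columns by name, which may be less fragile than positional indexing if the column order ever changes.

```r
# Toy stand-in for activity.csv; the three rows are made up for illustration
toy <- read.csv(text = "steps,date,interval
NA,2012-10-01,0
5,2012-10-02,0
7,2012-10-02,5", stringsAsFactors = FALSE)

# Convert the character dates to Date objects
toy$date <- as.Date(toy$date, format = "%Y-%m-%d")

# Select the numeric columns by name rather than by position
numeric_cols <- toy[, c("steps", "interval")]
summary(numeric_cols)
```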
## What is mean total number of steps taken per day?
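We first compute the total number of steps for each day by summing over the 5-minute intervals within that day. Missing values are dropped from each sum, so a day whose intervals are all missing gets a total of 0.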
```{r}
totalstepsday <- aggregate(data$steps,by=list(data$date),function(x) sum(x,na.rm=TRUE))
```
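The effect of `na.rm=TRUE` on a day with no observed values can be illustrated on made-up data (the data frame `toy` and the names `with_zero` and `without_na` are hypothetical). The sketch contrasts the call shape used above with the formula interface of `aggregate()`, which by default omits rows containing `NA` before grouping.

```r
# Toy data (made up): day 1 is entirely missing, day 2 is fully observed
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  steps = c(NA, NA, 10, 20)
)

# Same call shape as above: the all-NA day gets a total of 0
with_zero <- aggregate(toy$steps, by = list(toy$date),
                       function(x) sum(x, na.rm = TRUE))

# Formula interface: rows with NA are dropped first (na.action = na.omit),
# so the all-NA day does not appear in the result at all
without_na <- aggregate(steps ~ date, data = toy, FUN = sum)

with_zero$x        # totals for both days, including the 0
nrow(without_na)   # only the observed day remains
```

Which behaviour is preferable depends on whether an unobserved day should count as a zero-step day in the mean and median.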
Then we present the boxplot and the histogram of the total steps for each day.
```{r histtotalsteps, fig.width=7, fig.height=5}
# Two stacked panels: boxplot on top, taller histogram below
nf <- layout(mat = matrix(c(1,2),2,1, byrow=TRUE), heights = c(1,1.5))
par(mar=c(3, 3, .2, .2))
boxplot(totalstepsday$x, horizontal=TRUE, outline=TRUE,ylim=c(0,26000),col="lightblue",type=3)
hist(totalstepsday$x,breaks=20,xlab="",ylab="Frequency",col="lightblue",main="",xlim=c(0,26000))
```
The mean total number of steps per day is __`r mean(totalstepsday$x,na.rm=TRUE)`__ and the median is __`r median(totalstepsday$x,na.rm=TRUE)`__.
## What is the average daily activity pattern?
We compute the average number of steps for each interval, across days.
```{r}
totalinterval <- aggregate(data$steps,by=list(data$interval),function(x) mean(x,na.rm=TRUE))
```
The plot shows the number of steps (y-axis), averaged across all days, for each 5-minute interval (x-axis).
```{r timeseries, fig.width=7, fig.height=5}
plot(y=totalinterval$x,x=totalinterval$Group.1,xlab="Interval",ylab="Average number of steps",type="l")
```
```{r}
# Interval identifier (column 1) of the row with the largest average
activeinterval <- totalinterval[which(totalinterval$x==max(totalinterval$x)),1]
```
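The per-interval average and the lookup of the most active interval can be sketched on made-up data as follows. The names `toy`, `avg` and `peak` are hypothetical, and `which.max()` is used here as a compact alternative that returns only the first maximum if there are ties.

```r
# Toy data (made up): three intervals observed on two days, one value missing
toy <- data.frame(
  interval = c(0, 5, 10, 0, 5, 10),
  steps    = c(2, NA, 4, 6, 30, 8)
)

# Mean steps per interval, ignoring missing values
avg <- aggregate(toy$steps, by = list(toy$interval),
                 function(x) mean(x, na.rm = TRUE))

# which.max() gives the index of the first maximum of the averages
peak <- avg$Group.1[which.max(avg$x)]
peak
```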
Averaged across all days in the dataset, the 5-minute interval with the maximum number of steps is __`r activeinterval`__.
## Imputing missing values
```{r}
miss <- sum(is.na(data$steps))
```
The dataset has __`r nrow(data)`__ observations, of which __`r miss`__ have a missing number of steps.
Each missing value is replaced with the median number of steps for the same 5-minute interval across all days.
```{r}
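# Median steps per interval, ignoring missing values
imputat <- aggregate(data$steps,by=list(data$interval),function(x) median(x,na.rm=TRUE))
# Split the rows into observed and missing
datacomplete <- data[!is.na(data$steps),]
datamissing <- data[is.na(data$steps),]
# Fill each missing row with the median of its interval
dataimputed <- datamissing
dataimputed$steps <- imputat[match(datamissing$interval,imputat[,1]),2]
# Recombine; note that the rows are no longer in the original order
datacompleteimputed <- rbind(datacomplete,dataimputed)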
```
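The same median imputation can also be written without splitting and recombining the data. The following is a minimal sketch on made-up data, using base R's `ave()` to compute the per-interval median for every row; the names `toy`, `med` and `imputed` are hypothetical, and unlike the `rbind()` approach this variant keeps the original row order.

```r
# Toy data (made up): two intervals over three days, two values missing
toy <- data.frame(
  interval = c(0, 5, 0, 5, 0, 5),
  steps    = c(NA, 10, 4, NA, 8, 20)
)

# For every row, the median of its interval, ignoring missing values
med <- ave(toy$steps, toy$interval,
           FUN = function(x) median(x, na.rm = TRUE))

# Replace only the missing entries; observed values are left untouched
imputed <- toy
imputed$steps <- ifelse(is.na(toy$steps), med, toy$steps)
imputed
```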
We then recompute the total steps for each day, this time from the dataset that includes the imputed values.
```{r}
totalstepsday <- aggregate(datacompleteimputed$steps,by=list(datacompleteimputed$date),sum)
```
We again present the boxplot and the histogram of the total steps for each day, now including the imputed values.
```{r histtotalstepsimput, fig.width=7, fig.height=5}
# Two stacked panels: boxplot on top, taller histogram below
nf <- layout(mat = matrix(c(1,2),2,1, byrow=TRUE), heights = c(1,1.5))
par(mar=c(3, 3, .2, .2))
boxplot(totalstepsday$x, horizontal=TRUE, outline=TRUE,ylim=c(0,26000),col="lightblue",type=3)
hist(totalstepsday$x,breaks=20,xlab="",ylab="Frequency",col="lightblue",main="",xlim=c(0,26000))
```
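The mean total number of steps per day is __`r mean(totalstepsday$x,na.rm=TRUE)`__ and the median is __`r median(totalstepsday$x,na.rm=TRUE)`__. The mean is now larger than before imputation, because missing values that previously contributed nothing to the daily totals now contribute the interval median, while the median is unchanged.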
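One way to see how the mean can rise while the median stays put is a small worked example on hypothetical daily totals. It assumes the affected days lie below the median both before and after imputation; the vectors `before` and `after` and the imputed total of 20 are made up and are not taken from the dataset.

```r
# Hypothetical daily totals: two days are all-missing and so sum to 0
before <- c(0, 0, 100, 110, 120, 130, 140)

# After imputation those two days receive a small total (20 is made up)
after <- replace(before, before == 0, 20)

c(mean(before), mean(after))      # the mean rises
c(median(before), median(after))  # the middle value is unaffected
```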
## Are there differences in activity patterns between weekdays and weekends?
We create a variable indicating "Weekend" or "Weekday":
```{r}
datacompleteimputed$weekend <- ifelse(datacompleteimputed$week %in% c("Saturday","Sunday"),"Weekend","Weekday")
```
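The comparison above relies on `weekdays()` returning English day names, which depends on the session locale. A minimal locale-independent sketch, on made-up dates, uses the numeric weekday stored in a `POSIXlt` object instead; the names `dates`, `wday` and `daytype` are hypothetical.

```r
# Friday to Monday, chosen for illustration
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# POSIXlt weekday numbers: 0 = Sunday, ..., 6 = Saturday
wday <- as.POSIXlt(dates)$wday

# Saturday (6) and Sunday (0) are the weekend
daytype <- ifelse(wday %in% c(0, 6), "Weekend", "Weekday")
daytype
```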
We compute the average number of steps for each interval, across weekend days.
```{r}
weekenddata <- datacompleteimputed[datacompleteimputed$weekend=="Weekend",]
totalintervalweekend <- aggregate(weekenddata$steps,by=list(weekenddata$interval),mean)
```
We compute the average number of steps for each interval, across weekdays.
```{r}
weekdaydata <- datacompleteimputed[datacompleteimputed$weekend=="Weekday",]
totalintervalweekday <- aggregate(weekdaydata$steps,by=list(weekdaydata$interval),mean)
```
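The two per-group averages can also be obtained in a single call by grouping on both the interval and the day type. The sketch below does this on made-up data with the formula interface of `aggregate()`; the names `toy` and `byboth` are hypothetical.

```r
# Toy data (made up): two intervals, two observations per day type
toy <- data.frame(
  interval = c(0, 5, 0, 5, 0, 5, 0, 5),
  weekend  = rep(c("Weekday", "Weekend"), each = 4),
  steps    = c(2, 4, 6, 8, 10, 20, 30, 40)
)

# One row per (interval, day type) combination with the mean steps
byboth <- aggregate(steps ~ interval + weekend, data = toy, FUN = mean)
byboth
```

A long-format result like this has one column per grouping variable, which can be convenient when both groups go into the same plot.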
The plots show the average number of steps (y-axis) for each 5-minute interval (x-axis), computed separately for weekends and weekdays.
```{r timeseries2, fig.width=7, fig.height=5}
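# Shared y-axis upper limit so the two panels are directly comparable
tmp <- max(c(totalintervalweekend$x,totalintervalweekday$x))+1
# Stack the two panels vertically
nf <- layout(mat = matrix(c(1,2),2,1, byrow=TRUE))
par(mar=c(3, 3, 3, 3))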
plot(y=totalintervalweekend$x,x=totalintervalweekend$Group.1,xlab="Interval",ylab="Average number of steps",type="l",main="Weekends",ylim=c(0,tmp))
plot(y=totalintervalweekday$x,x=totalintervalweekday$Group.1,xlab="Interval",ylab="Average number of steps",type="l",main="Weekdays",ylim=c(0,tmp))
```