# Getting Data In and Out of R
```{r,echo=FALSE}
knitr::opts_chunk$set(comment = NA, prompt = TRUE, collapse = TRUE)
```
## Reading and Writing Data
[Watch a video of this section](https://youtu.be/Z_dc_FADyi4)
There are a few principal functions for reading data into R.
* `read.table`, `read.csv`, for reading tabular data
* `readLines`, for reading lines of a text file
* `source`, for reading in R code files (inverse of `dump`)
* `dget`, for reading in R code files (inverse of `dput`)
* `load`, for reading in saved workspaces
* `unserialize`, for reading single R objects in binary form
There are, of course, many R packages that have been developed to read
in all kinds of other datasets, and you may need to resort to one of
these packages if you are working in a specific area.
There are analogous functions for writing data to files
* `write.table`, for writing tabular data to text files (e.g., CSV) or
connections
* `writeLines`, for writing character data line-by-line to a file or
connection
* `dump`, for dumping a textual representation of multiple R objects
* `dput`, for outputting a textual representation of an R object
* `save`, for saving an arbitrary number of R objects in binary format
(possibly compressed) to a file.
* `serialize`, for converting an R object into a binary format for
outputting to a connection (or file).
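For example, here is a minimal sketch of writing tabular data out with `write.table()` (the data frame and file name here are hypothetical):
```{r,eval=FALSE}
## Write a small data frame out as a comma-separated text file
dat <- data.frame(id = 1:3, value = c(2.5, 3.7, 1.1))
write.table(dat, file = "dat.csv", sep = ",", row.names = FALSE)
```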
## Reading Data Files with `read.table()`
The `read.table()` function is one of the most commonly used functions
for reading data. The help file for `read.table()` is worth reading in
its entirety if only because the function gets used a lot (run
`?read.table` in R). I know, I know, everyone always says to read the
help file, but this one is actually worth reading.
The `read.table()` function has a few important arguments:
* `file`, the name of a file, or a connection
* `header`, logical indicating if the file has a header line
* `sep`, a string indicating how the columns are separated
* `colClasses`, a character vector indicating the class of each column
in the dataset
* `nrows`, the number of rows in the dataset. By default
`read.table()` reads an entire file.
* `comment.char`, a character string indicating the comment
character. This defaults to `"#"`. If there are no commented lines in
your file, it's worth setting this to be the empty string `""`.
* `skip`, the number of lines to skip from the beginning
* `stringsAsFactors`, should character variables be coded as factors?
This historically defaulted to `TRUE` because back in the old days, if
you had data that were stored as strings, it was because those strings
represented levels of a categorical variable. Now we have lots of
data that are text data and don't always represent categorical
variables, so you may want to set this to be `FALSE` in those
cases (note that as of R 4.0.0, the default is `FALSE`). If you
*always* want this to be `FALSE`, you can set a global
option via `options(stringsAsFactors = FALSE)`. I've never seen as
much heat generated on discussion forums about an R function
argument as over the `stringsAsFactors` argument. Seriously.
For small to moderately sized datasets, you can usually call
`read.table()` without specifying any other arguments.
```{r,eval=FALSE}
data <- read.table("foo.txt")
```
In this case, R will automatically
* skip lines that begin with a `#`
* figure out how many rows there are (and how much memory needs to be
allocated)
* figure out what type of variable is in each column of the table.
Telling R all these things directly makes R run faster and more
efficiently. The `read.csv()` function is identical to `read.table()`
except that some of the defaults are set differently (like the `sep`
argument).
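As a rough sketch of what a fully specified call might look like (the file name, column classes, and row count here are hypothetical):
```{r,eval=FALSE}
## Tell read.table() everything up front: header, separator, column
## classes, an (over)estimate of the number of rows, and no comments
data <- read.table("foo.txt", header = TRUE, sep = ",",
                   colClasses = c("character", "numeric", "numeric"),
                   nrows = 5000, comment.char = "",
                   stringsAsFactors = FALSE)
```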
## Reading in Larger Datasets with `read.table()`
[Watch a video of this section](https://youtu.be/BJYYIJO3UFI)
With much larger datasets, there are a few things that you can do that
will make your life easier and will prevent R from choking.
* Read the help page for `read.table()`, which contains many hints
* Make a rough calculation of the memory required to store your
dataset (see the next section for an example of how to do this). If
the dataset is larger than the amount of RAM on your computer, you
can probably stop right here.
* Set `comment.char = ""` if there are no commented lines in your file.
* Use the `colClasses` argument. Specifying this option instead of
using the default can make `read.table()` run MUCH faster, often twice
as fast. In order to use this option, you have to know the class of
each column in your data frame. If all of the columns are "numeric",
for example, then you can just set `colClasses = "numeric"`. A quick
and dirty way to figure out the classes of each column is the
following:
```{r,eval=FALSE}
initial <- read.table("datatable.txt", nrows = 100)
classes <- sapply(initial, class)
tabAll <- read.table("datatable.txt", colClasses = classes)
```
* Set `nrows`. This doesn’t make R run faster but it helps with memory
usage. A mild overestimate is okay. You can use the Unix tool `wc`
to calculate the number of lines in a file (see the sketch below).
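A minimal sketch of doing this from within R, assuming a Unix-like system and a hypothetical file name:
```{r,eval=FALSE}
## Count lines with the Unix 'wc' tool and extract the number
wc_output <- system("wc -l datatable.txt", intern = TRUE)
nlines <- as.integer(strsplit(trimws(wc_output), " ")[[1]][1])
nlines
```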
In general, when using R with larger datasets, it’s also useful to
know a few things about your system.
* How much memory is available on your system?
* What other applications are in use? Can you close any of them?
* Are there other users logged into the same system?
* What operating system are you using? Some operating systems can limit
the amount of memory a single process can access.
## Calculating Memory Requirements for R Objects
Because R stores all of its objects in physical memory, it is important
to be cognizant of how much memory is being used up by all of the data
objects residing in your workspace. One situation where it's
particularly important to understand memory requirements is when you
are reading a new dataset into R. Fortunately, it's easy to make a
back-of-the-envelope calculation of how much memory will be required
by a new dataset.
For example, suppose I have a data frame with 1,500,000 rows and 120
columns, all of which are numeric data. Roughly, how much memory is
required to store this data frame? Well, on most modern computers
[double precision floating point
numbers](http://en.wikipedia.org/wiki/Double-precision_floating-point_format)
are stored using 64 bits of memory, or 8 bytes. Given that
information, you can do the following calculation
| 1,500,000 × 120 × 8 bytes/numeric  = 1,440,000,000 bytes
|                                    = 1,440,000,000 / 2^20^ bytes/MB
|                                    = 1,373.29 MB
|                                    = 1.34 GB
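The same arithmetic can be done directly in R:
```{r}
## Back-of-the-envelope memory estimate for 1,500,000 rows x 120 numeric columns
bytes <- 1500000 * 120 * 8
bytes / 2^20  ## megabytes
bytes / 2^30  ## gigabytes
```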
So the dataset would require about 1.34 GB of RAM. Most computers
these days have at least that much RAM. However, you need to be aware
of
- what other programs might be running on your computer, using up RAM
- what other R objects might already be taking up RAM in your workspace
Reading in a large dataset for which you do not have enough RAM is one
easy way to freeze up your computer (or at least your R session). This
is an unpleasant experience that usually requires you to kill the R
process, in the best case scenario, or reboot your computer, in the
worst case. So make sure to do a rough calculation of memory
requirements before reading in a large dataset. You'll thank me later.
# Using the `readr` Package
The `readr` package was recently developed by Hadley Wickham to deal
with reading in large flat files quickly. The package provides
replacements for functions like `read.table()` and `read.csv()`. The
analogous functions in `readr` are `read_table()` and
`read_csv()`. These functions are often *much* faster than their base
R analogues and provide a few other nice features such as progress
meters.
For the most part, you can use `read_table()` and `read_csv()`
pretty much anywhere you might use `read.table()` and `read.csv()`. In
addition, if there are non-fatal problems that occur while reading in
the data, you will get a warning and the returned data frame will have
some information about which rows/observations triggered the
warning. This can be very helpful for "debugging" problems with your
data before you get neck deep in data analysis.
The importance of the `read_csv` function is perhaps better understood
from a historical perspective. R's built-in `read.csv` function
similarly reads CSV files, but the `read_csv` function in `readr`
builds on that by removing some of the quirks and "gotchas" of
`read.csv` as well as dramatically optimizing the speed with which it
can read data into R. The `read_csv` function also adds some nice
user-oriented features like a progress meter and a compact method for
specifying column types.
A typical call to `read_csv` will look as follows.
```{r}
library(readr)
teams <- read_csv("data/team_standings.csv")
teams
```
By default, `read_csv` will open a CSV file and read it in line-by-line. It will also (by default), read in the first few rows of the table in order to figure out the type of each column (i.e. integer, character, etc.). From the `read_csv` help page:
> If 'NULL', all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself.
You can specify the type of each column with the `col_types` argument.
In general, it's a good idea to specify the column types explicitly. This rules out any possible guessing errors on the part of `read_csv`. Also, specifying the column types explicitly provides a useful safety check in case anything about the dataset should change without you knowing about it.
```{r}
teams <- read_csv("data/team_standings.csv", col_types = "cc")
```
Note that the `col_types` argument accepts a compact representation. Here `"cc"` indicates that the first column is `character` and the second column is `character` (there are only two columns). Using the `col_types` argument is useful because often it is not easy to automatically figure out the type of a column by looking at a few rows (especially if a column has many missing values).
The `read_csv` function will also read compressed files automatically. There is no need to decompress the file first or use the `gzfile` connection function. The following call reads a gzip-compressed CSV file containing download logs from the RStudio CRAN mirror.
```{r}
logs <- read_csv("data/2016-07-19.csv.bz2", n_max = 10)
```
Note that the warnings indicate that `read_csv` may have had some difficulty identifying the type of each column. This can be solved by using the `col_types` argument.
```{r}
logs <- read_csv("data/2016-07-19.csv.bz2", col_types = "ccicccccci", n_max = 10)
logs
```
You can specify the column type in a more detailed fashion by using the various `col_*` functions. For example, in the log data above, the first column is actually a date, so it might make more sense to read it in as a Date variable. If we wanted to just read in that first column, we could do
```{r}
logdates <- read_csv("data/2016-07-19.csv.bz2",
col_types = cols_only(date = col_date()),
n_max = 10)
logdates
```
Now the `date` column is stored as a `Date` object which can be used for relevant date-related computations (for example, see the `lubridate` package).
A> The `read_csv` function has a `progress` option that defaults to TRUE. This option provides a nice progress meter while the CSV file is being read. However, if you are using `read_csv` in a function, or perhaps embedding it in a loop, it's probably best to set `progress = FALSE`.
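A minimal sketch of this, assuming a hypothetical vector of file paths:
```{r,eval=FALSE}
## Read several CSV files quietly, without the progress meter
files <- c("data/file1.csv", "data/file2.csv")
dats <- lapply(files, read_csv, progress = FALSE)
```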
# Using Textual and Binary Formats for Storing Data
[Watch a video of this chapter](https://youtu.be/5mIPigbNDfk)
There are a variety of ways that data can be stored, including
structured text files like CSV or tab-delimited, or more complex
binary formats. However, there is an intermediate format that is
textual, but not as simple as something like CSV. The format is native
to R and is somewhat readable because of its textual nature.
One can create a more descriptive representation of an R object by
using the `dput()` or `dump()` functions. The `dump()` and `dput()`
functions are useful because the resulting textual format is
editable, and in the case of corruption, potentially
recoverable. Unlike writing out a table or CSV file, `dump()` and
`dput()` preserve the _metadata_ (sacrificing some readability), so
that another user doesn’t have to specify it all over again. For
example, we can preserve the class of each column of a table or the
levels of a factor variable.
Textual formats can work much better with version control programs
like subversion or git which can only track changes meaningfully in
text files. In addition, textual formats can be longer-lived; if there
is corruption somewhere in the file, it can be easier to fix the
problem because one can just open the file in an editor and look at it
(although this would probably only be done in a worst case
scenario!). Finally, textual formats adhere to the [Unix
philosophy](http://www.catb.org/esr/writings/taoup/), if that means
anything to you.
There are a few downsides to using these intermediate textual formats.
The format is not very space-efficient, because all of the metadata is
specified. Also, it is really only partially readable. In some
instances it might be preferable to have data stored in a CSV file and
then have a separate code file that specifies the metadata.
## Using `dput()` and `dump()`
One way to pass data around is by deparsing the R object with `dput()`
and reading it back in (parsing it) using `dget()`.
```{r}
## Create a data frame
y <- data.frame(a = 1, b = "a")
## Print 'dput' output to console
dput(y)
```
Notice that the `dput()` output is in the form of R code and that it
preserves metadata like the class of the object, the row names, and
the column names.
The output of `dput()` can also be saved directly to a file.
```{r}
## Send 'dput' output to a file
dput(y, file = "y.R")
## Read in 'dput' output from a file
new.y <- dget("y.R")
new.y
```
Multiple objects can be deparsed at once using the `dump()` function and
read back in using `source()`.
```{r}
x <- "foo"
y <- data.frame(a = 1L, b = "a")
```
We can `dump()` R objects to a file by passing a character vector of
their names.
```{r}
dump(c("x", "y"), file = "data.R")
rm(x, y)
```
The inverse of `dump()` is `source()`.
```{r}
source("data.R")
str(y)
x
```
## Binary Formats
The complement to the textual format is the binary format, which is
sometimes necessary to use for efficiency purposes, or because there's
just no useful way to represent data in a textual manner. Also, with
numeric data, one can often lose precision when converting to and from
a textual format, so it's better to stick with a binary format.
The key functions for converting R objects into a binary format are
`save()`, `save.image()`, and `serialize()`. Individual R objects can
be saved to a file using the `save()` function.
```{r}
a <- data.frame(x = rnorm(100), y = runif(100))
b <- c(3, 4.4, 1 / 3)
## Save 'a' and 'b' to a file
save(a, b, file = "mydata.rda")
## Load 'a' and 'b' into your workspace
load("mydata.rda")
```
If you have a lot of objects that you want to save to a file, you can
save all objects in your workspace using the `save.image()` function.
```{r}
## Save everything to a file
save.image(file = "mydata.RData")
## load all objects in this file
load("mydata.RData")
```
Notice that I've used the `.rda` extension when using `save()` and the
`.RData` extension when using `save.image()`. This is just my personal
preference; you can use whatever file extension you want. The `save()`
and `save.image()` functions do not care. However, `.rda` and `.RData`
are fairly common extensions and you may want to use them because they
are recognized by other software.
The `serialize()` function is used to convert individual R objects
into a binary format that can be communicated across an arbitrary
connection. This may get sent to a file, but it could get sent over a
network or other connection.
When you call `serialize()` on an R object, the output will be a raw
vector coded in hexadecimal format.
```{r}
x <- list(1, 2, 3)
serialize(x, NULL)
```
If you want, this can be sent to a file, but in that case you are
better off using something like `save()`.
The benefit of the `serialize()` function is that it is the only way
to perfectly represent an R object in an exportable format, without
losing precision or any metadata. If that is what you need, then
`serialize()` is the function for you.
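For completeness, the `unserialize()` function reverses this process and recovers the original object exactly.
```{r}
## Round trip: serialize an object to a raw vector, then recover it
x <- list(1, 2, 3)
raw_x <- serialize(x, NULL)
y <- unserialize(raw_x)
identical(x, y)
```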
# Interfaces to the Outside World
[Watch a video of this chapter](https://youtu.be/Pb01WoJRUtY)
Data are read in using _connection_ interfaces. Connections can be
made to files (most common) or to other more exotic things.
- `file`, opens a connection to a file
- `gzfile`, opens a connection to a file compressed with gzip
- `bzfile`, opens a connection to a file compressed with bzip2
- `url`, opens a connection to a webpage
In general, connections are powerful tools that let you navigate files
or other external objects. Connections can be thought of as a
translator that lets you talk to objects that are outside of R. Those
outside objects could be anything from a database or a simple text
file to a web service API. Connections allow R functions to talk to
all these different external objects without you having to write
custom code for each object.
## File Connections
Connections to text files can be created with the `file()` function.
```{r}
str(file)
```
The `file()` function has a number of arguments that are common to
many other connection functions so it's worth going into a little
detail here.
* `description` is the name of the file
* `open` is a code indicating what mode the file should be opened in
The `open` argument allows for the following options:
- "r" open file in read only mode
- "w" open a file for writing (and initializing a new file)
- "a" open a file for appending
- "rb", "wb", "ab" reading, writing, or appending in binary mode (Windows)
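As a rough sketch of these modes in action (the file name here is hypothetical):
```{r,eval=FALSE}
## Open a connection for writing (this creates or overwrites the file)
con <- file("notes.txt", "w")
writeLines(c("first line", "second line"), con)
close(con)
## Reopen the same file in append mode and add another line
con <- file("notes.txt", "a")
writeLines("third line", con)
close(con)
```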
In practice, we often don't need to deal with the connection interface
directly as many functions for reading and writing data just deal with
it in the background.
For example, if one were to explicitly use connections to read a CSV
file into R, it might look like this,
```{r,eval=FALSE}
## Create a connection to 'foo.txt'
con <- file("foo.txt")
## Open connection to 'foo.txt' in read-only mode
open(con, "r")
## Read from the connection
data <- read.csv(con)
## Close the connection
close(con)
```
which is the same as
```{r,eval=FALSE}
data <- read.csv("foo.txt")
```
In the background, `read.csv()` opens a connection to the file
`foo.txt`, reads from it, and closes the connection when it's done.
The above example shows the basic approach to using
connections. Connections must be opened, then they are read from or
written to, and then they are closed.
## Reading Lines of a Text File
Text files can be read line by line using the `readLines()`
function. This function is useful for reading text files that may be
unstructured or contain non-standard data.
```{r}
## Open connection to gz-compressed text file
con <- gzfile("words.gz")
x <- readLines(con, 10)
x
```
For more structured text data like CSV files or tab-delimited files,
there are other functions like `read.csv()` or `read.table()`.
The above example used the `gzfile()` function which is used to create
a connection to files compressed using the gzip algorithm. This
approach is useful because it allows you to read from a file without
having to uncompress the file first, which would be a waste of space
and time.
There is a complementary function `writeLines()` that takes a
character vector and writes each element of the vector one line at a
time to a text file.
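For example, a quick sketch with a hypothetical file name:
```{r,eval=FALSE}
## Write a character vector to a text file, one element per line
writeLines(c("apple", "banana", "cherry"), "fruit.txt")
readLines("fruit.txt")
```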
## Reading From a URL Connection
The `readLines()` function can be useful for reading in lines of
webpages. Since web pages are basically text files that are stored on
a remote server, there is conceptually not much difference between a
web page and a local text file. However, we need R to negotiate the
communication between your computer and the web server. This is what
the `url()` function can do for you, by creating a `url` connection to
a web server.
This code might take time depending on your connection speed.
```{r}
## Open a URL connection for reading
con <- url("http://www.jhsph.edu", "r")
## Read the web page
x <- readLines(con)
## Print out the first few lines
head(x)
```
Reading in a simple web page can sometimes be useful, particularly
if data are embedded in the web page somewhere. More commonly, however,
we can use URL connections to read in specific data files that are
stored on web servers.
Using URL connections can be useful for producing a reproducible
analysis, because the code essentially documents where the data came
from and how they were obtained. This approach is preferable to
opening a web browser and downloading a dataset by hand. Of course,
the code you write with connections may not be executable at a later
date if things on the server side are changed or reorganized.
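As a hedged sketch (the URL here is hypothetical), a CSV file stored on a web server can be read through a `url()` connection:
```{r,eval=FALSE}
## Read a CSV file from a (hypothetical) URL connection
con <- url("http://www.example.com/data/mydata.csv")
data <- read.csv(con)
head(data)
```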