forked from willcrouch/RExcel2020Clone
-
Notifications
You must be signed in to change notification settings - Fork 0
/
readxl.Rmd
380 lines (248 loc) · 16.4 KB
/
readxl.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
# `readxl` {#readxl}
## Summary
**Check this, may need to be a block quote**: The **readxl** package makes it easy to import tabular data from Excel spreadsheets (.xls or .xlsx files) and includes several options for cleaning data during import. **readxl** has no external dependencies and functions on any operating system, making it an OS- and user-friendly package that simplifies getting your data from Excel into R.
### Objectives
- Use `read_csv()` to read in a comma separated value (CSV) file
- Use `read_excel()` to read in an Excel worksheet from an Excel workbook
- Replace a specific string/value in a spreadsheet with with `NA`
- Skip *n* rows when importing an Excel worksheet
- Use `read_excel()` to read in parts of a worksheet (by cell range)
- Specify column names when importing Excel data
- Read and combine data from multiple Excel worksheets into a single df using `purrr::map_df()`
- Write data using `write_csv()` or `write_xlsx()`
### Resources
- https://readxl.tidyverse.org/
- [readxl Workflows article (from tidyverse.org)](https://readxl.tidyverse.org/articles/articles/readxl-workflows.html)
## Lesson
### Attach packages
In the .Rmd you just created within your version-controlled `r-workshop` R project, attach the `tidyverse`, `readxl`, `writexl`, and `here` packages.
In this lesson, we'll read in a CSV file with the `read_csv()` function, so we need to have the `readr` package attached. Since it's part of the `tidyverse`, we'll go ahead and attach the `tidyverse` package below our script header using `library(package_name)`. It's a good idea to attach packages within the set-up chunk in R Markdown, so we'll also attach the `readxl`, `writexl`, and `here` packages there.
Here's our first code chunk:
```{r, include = FALSE}
# Attach the tidyverse, readxl and writexl packages:
library(tidyverse)
library(readxl) # install.packages('readxl')
library(writexl) # install.packages('writexl')
library(here) # install.packages('here')
```
```
{r setup, eval = FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Attach the tidyverse, readxl, writexl and here packages:
library(tidyverse)
library(readxl)
library(writexl)
library(here)
```
Now, all of the packages and functions within the `tidyverse` and `readxl`, including `read_csv()` and `read_excel()`, are available for use.
### Use `read_csv()` to read in data from a CSV file
There are many types of files containing data that you might want to work with in R. A common one is a comma separated value (CSV) file, which contains values with each column entry separated by a comma delimiter. CSVs can be opened, viewed, and worked with in Excel just like an .xls or .xlsx file - but let's learn how to get data directly from a CSV into R where we can work with it more reproducibly.
The CSV we'll read in here is called "fish_counts_curated.csv", and contains observations for "the abundance and size of fish species as part of SBCLTER's kelp forest monitoring program to track long-term patterns in species abundance and diversity" from the [Santa Barbara Channel Long Term Ecological Research](http://sbc.lternet.edu/) program.
**Source:** Reed D. 2018. SBC LTER: Reef: Kelp Forest Community Dynamics: Fish abundance. Environmental Data Initiative. https://doi.org/10.6073/pasta/dbd1d5f0b225d903371ce89b09ee7379. Dataset accessed 9/26/2019.
Read in the "fish_counts_curated.csv" file `read_csv("file_name.csv")`, and store it in R as an object called *fish_counts*:
```{r, include = FALSE}
# Teaching version
fish_counts <- read_csv("curation/fish_counts_curated.csv")
```
```{r, eval = FALSE, results = 'hide'}
fish_counts <- read_csv(here("data", "fish_counts_curated.csv"))
```
Notice that the name of the stored object (here, *fish_counts*) will show up in our Environment tab in RStudio.
Click on the object in the Environment, and R will automatically run the `View()` function for you to pull up your data in a separate viewing tab. Now we can look at it in the spreadsheet format we're used to.
Here are a few other functions for quickly exploring imported data:
- `summary()`: summary of class, dimensions, `NA` values, etc.
- `names()`: variable names (column headers)
- `ls()`: list all objects in environment
- `head()`: Show the first x rows (default is 6 lines)
- `tail()`: Show the last x rows (default is 6 lines)
Now that we have our fish counts data ready to work with in R, let's get the substrate cover and kelp data (both .xlsx files). In the following sections, we'll learn that we can use `read_excel()` to read in Excel files directly.
### Use `read_excel()` to read in a single Excel worksheet
First, take a look at *substrate_cover_curated.xlsx* in Excel, which contains a single worksheet with substrate type and percent cover observations at different sampling locations in the Santa Barbara Channel.
A few things to notice:
- The file contains a single worksheet
- There are multiple rows containing text information up top
- Where observations were not recorded, there exists '-9999'
Let's go ahead and read in the data. If the file is in our working directory, we can read in a single worksheet .xlsx file using `read_excel("file_name.xlsx")`. *Note: read_excel() works for both .xlsx and .xls types*.
Like this:
```{r, include = FALSE}
substrate_cover <- read_excel("curation/substrate_cover_curated.xlsx")
```
```{r, eval = FALSE, results = 'hide'}
substrate_cover <- read_excel(here("data", "substrate_cover_curated.xlsx"))
```
**Tada? Not quite.**
Click on the object name (*substrate_cover*) in the Environment to view the data in a new tab. A few things aren't ideal:
```{r}
substrate_cover
```
- The top row of text has automatically become the (messy) column headers
- There are multiple descriptive rows before we actually get to the data
- There are -9999s that we want R to understand as `NA` instead
We can deal with those issues by adding arguments within `read_excel()`. Like:
- Add `skip = n` to skip the first 'n' rows when importing data
- Add `na = "this"` to replace "this" with `NA` when reading in spreadsheet data
```{r, include = FALSE}
substrate_cover <- read_excel("curation/substrate_cover_curated.xlsx", skip = 4, na = "-9999")
```
```{r, eval = FALSE}
substrate_cover <- read_excel(here("data", "substrate_cover_curated.xlsx, skip = 4, na = "-9999")
```
```{r}
substrate_cover
```
Check out *substrate_cover*, and see that the first row *after* the 4 skipped are the column names, and all -9999s have been updated to `NA`. Hooray!
### Use `read_excel()` to read in only *part* of an Excel worksheet
We always advocate for leaving the raw data raw, and writing a complete script containing all steps of data wrangling & transformation. But in *some* situations (be careful), you may want to specify a range of cells to read in from an Excel worksheet.
You can specify a range of cells to read in using the `range = ` argument in `read_excel()`. For example, if I want to read in the rectangle from D12:I15 in *substrate_cover_curated.xlsx* - only observations for Carpenteria Beach (Transect 2) in September 2000 - I can use:
```{r, include = FALSE}
carp_cover_2000 <- read_excel("curation/substrate_cover_curated.xlsx", range = "D12:I15")
```
```{r, eval = FALSE}
carp_cover_2000 <- read_excel(here("data", "substrate_cover_curated.xlsx", range = "D12:I15")
```
But yuck. Look at *carp_cover_2000* and you'll notice that the first row *of that range* is automatically made the column headers. To keep all rows within a range and **add your own column names**, add a `col_names = ` argument:
```{r, include = FALSE}
carp_cover_2000 <- read_excel("curation/substrate_cover_curated.xlsx", range = "D12:I15", col_names = c("site_name", "transect", "quad", "plot_side", "type", "coverage"))
```
```{r, eval = FALSE}
carp_cover_2000 <- read_excel(here("data", "substrate_cover_curated.xlsx", range = "D12:I15", col_names = c("site_name", "transect", "quad", "plot_side", "type", "coverage"))
```
```{r}
carp_cover_2000
```
So far we've read in a single CSV file using `read_csv()`, and an Excel file containing a single worksheet with `read_excel()`. Next, let's read in data from an Excel workbook that contains multiple worksheets.
### Use `read_excel()` to read in selected worksheets from a workbook
Now, we'll read in the kelp fronds data from file *kelp_counts_curated.xlsx*. Open the file in Excel, and notice see that it contains multiple worksheets with giant kelp observations in the Santa Barbara Channel during July 2016, 2017, and 2018, with data collected at each *site* in a separate worksheet.
To read in a single Excel worksheet from a workbook we'll again use `read_excel("file_name.xlsx")`, but we'll need to let R know which worksheet to get.
Let's read in the kelp data just like we did above, as an object called *kelp_counts*.
```{r, include = FALSE}
# Teaching chunk (won't show up when knitted)
kelp_counts <- read_excel("curation/kelp_counts_curated.xlsx")
```
```{r, eval = FALSE, results = 'hide'}
kelp_counts <- read_excel(here("data", "kelp_counts_curated.xlsx")
```
You might be thinking, "Hooray, I got all of my Excel workbook data!" But remember to always look at your data - you will see that actually only the first worksheet was read in. The default in `read_excel()` is to read in the **first worksheet** in a multi-sheet Excel workbook.
To check the worksheet names in an Excel workbook, use `excel_sheets()`:
```{r, include = FALSE}
excel_sheets("curation/kelp_counts_curated.xlsx")
```
```{r, eval = FALSE}
excel_sheets(here("data", "kelp_counts_curated.xlsx")
```
If we want to read in data from a worksheet other than the first one in an Excel workbook, we can specify the correct worksheet by name or position by adding the `sheet` argument.
Let's read in data from the worksheet named *golb* (Goleta Beach) in the *kelp_counts_curated.xlsx* workbook:
```{r, include = FALSE}
kelp_golb <- read_excel("curation/kelp_counts_curated.xlsx", sheet = "golb")
```
```{r, eval = FALSE, results = 'hide'}
kelp_golb <- read_excel(here("data", "kelp_counts_curated.xlsx", sheet = "golb"))
```
Note that you can also specify a worksheet by position: since *golb* is the 6^th^ worksheet in the workbook, we could also use the following:
```{r, include = FALSE}
kelp_golb <- read_excel("curation/kelp_counts_curated.xlsx", sheet = 6)
```
```{r, eval = FALSE}
kelp_golb <- read_excel(here("data", "kelp_counts_curated.xlsx", sheet = 6))
```
```{r}
kelp_golb
```
### Read in and combine data from multiple worksheets into a data frame simultaneously with `purrr::map_df()`
So far, we've read in entire Excel worksheets and pieces of a worksheet. What if we have a workbook (like *kelp_counts_curated.xlsx*) that contains worksheets that contain observations for the same variables, in the same organization? Then we may want to read in data from *all* worksheets, and combine them into a single data frame.
We'll use `purrr::map_df()` to loop through all the worksheets in a workbook, reading them in & putting them together into a single data frame in the process.
The steps we'll go through in the code below are:
1. Set a pathway so that R knows where to look for an Excel workbook
2. Get the names of all worksheets in that workbook with `excel_sheets()`
3. Set names of a vector with `set_names()`
4. Read in all worksheets, and put them together into a single data frame with `purrr::map_df()`
**Aside**: the pipe operator (`%>%`)
There are many ways to use R functions in sequence. One way, that follows the order that we think about steps in sequence, is using the **pipe operator** (`%>%`). We can use the pipe operator between steps in a sequence, each place we think "and then do this."
For example, if I would like to **walk my dog** and *then* **eat a burrito**, in code that might be: `walk(dog) %>% eat(burrito, type = "California")`
Here, we'll use the pipe operator to complete steps 1 - 4 above in sequence:
```{r, include = FALSE}
kelp_path <- "curation/kelp_counts_curated.xlsx"
kelp_all_sites <- kelp_path %>% # Start with that pathway (file)...
excel_sheets() %>% # Get all the names of the worksheets
set_names() %>% # Names the vector items
map_df(read_excel, path = kelp_path) # Applies 'read_excel' function to all worksheet names (loops through them) in the workbook at kelp_path and the _df part of it puts them all into a single data frame
```
```{r, eval = FALSE}
kelp_path <- here("data", "kelp_counts_curated.xlsx")
kelp_all_sites <- kelp_path %>%
excel_sheets() %>%
set_names() %>%
purrr::map_df(read_excel, kelp_path)
```
Check out *kelp_all_sites*, and notice that now the data from all 11 sites is now collected into a single data frame:
```{r}
kelp_all_sites
```
### Save data frames as .csv or .xlsx with `write_csv()` or `write_xlsx()`
There are a number of reasons you might want to save (/export) data in a data frame as a .csv or Excel worksheet, including:
- To store raw data within the project you're working in
- To store copies of intermediate data frames
- To convert your data back to a format that your coworkers/clients/colleagues will be able to use it more easily
Use `write_csv(object, "file_name.csv")` to write a data frame to a CSV, or `write_xlsx(object, "file_name.xlsx")` to similarly export as a .xlsx (or .xls) worksheet.
In the previous step, we combined all of our kelp frond observations into a single data frame. Wouldn't it make sense to store a copy?
As a CSV:
```{r, eval = FALSE}
write_csv(kelp_all_sites, here("data", "kelp_all_sites.csv"))
```
A cool thing about `write_csv()` is that it just quietly *works* without wrecking anything else you do in a sequence, so it's great to add at the end of a piped sequence.
For example, if I want to read in the range cells C1:D3 'ivee' worksheet from kelp_counts_curated.xlsx, then write that new subset to a .csv file, I can pipe all the way through:
```{r, include = FALSE, eval = FALSE}
kelp_ivee_subset <- read_excel(here("curation", "kelp_counts_curated.xlsx"),
sheet = "ivee",
range = "C1:D3") %>%
write_csv(here("curation", "kelp_ivee_subset.csv"))
```
```{r, eval = FALSE}
kelp_ivee_subset <- read_excel(here("data", "kelp_counts_curated.xlsx"),
sheet = "ivee",
range = "C1:D3") %>%
write_csv(here("data", "kelp_ivee.csv"))
```
Now I've created *kelp_ivee_subset.csv*, but the object *kelp_ivee_subset* also exists as an object for me to use in R.
If needed, I can also export a data frame as an Excel (.xlsx) worksheet:
```{r, eval = FALSE}
write_xlsx(kelp_all_sites, here("data", "kelp_all_sites.xlsx"))
```
## Activity: Import some invertebrates!
There's one dataset we haven't imported or explored yet: invertebrate counts for 5 popular invertebrates (California cone snail, California spiny lobster, orange cup coral, purple urchin and rock scallops) at 11 sites in the Santa Barbara Channel. Take a look at the *invert_counts_curated.xlsx* data by opening it in Excel
- Read in the *invert_counts_curated.xlsx* worksheet as object 'inverts_july', only retaining **site**, **common_name**, and **2016** and setting the existing first row in the worksheet as to column headers upon import
- Explore the imported data frame using View, names, head, tail, etc.
- Write 'inverts_july' to a CSV file in your working directory called "inverts_july.csv"
```{r, include = FALSE}
# Solution:
# Importing only 'site' through '2016' columns:
inverts_july <- read_excel("curation/invert_counts_curated.xlsx", range = "B1:D56")
# Do some basic exploring (why might we want to do this in the Console instead?):
#View(inverts_july)
names(inverts_july)
head(inverts_july)
tail(inverts_july)
ls()
# Write back to a csv file:
# write_csv(inverts_july, "inverts_july.csv")
```
```{r, eval = FALSE}
# Importing only 'site' through '2016' columns:
inverts_july <- read_excel(here("data", "invert_counts_curated.xlsx"), range = "B1:D56")
# Do some basic exploring (why might we want to do this in the Console instead?):
View(inverts_july)
names(inverts_july)
head(inverts_july)
tail(inverts_july)
ls()
# Writing a csv "inverts_july.csv":
write_csv(inverts_july, here("data", "inverts_july.csv"))
```
## Efficiency Tips
- Add an assignment arrow in (<-): Alt + minus (-)
- Undo shortcut: Command + Z
- Redo shortcut: Command + Shift + Z
## Additional thoughts
- Economist article about gene > dates issue in Excel
- Mine: data frame of bike casualities in NC, column names are age ranges but some of them import as dates
- Excel makes some wrong assumptions and doesn't give you a heads up about its decision making