readxl.Rmd

# `readxl` {#readxl}

## Summary

**Check this, may need to be a block quote**: The **readxl** package makes it easy to import tabular data from Excel spreadsheets (.xls or .xlsx files) and includes several options for cleaning data during import. **readxl** has no external dependencies and functions on any operating system, making it an OS- and user-friendly package that simplifies getting your data from Excel into R. 

### Objectives

- Use `read_csv()` to read in a comma separated value (CSV) file
- Use `read_excel()` to read in an Excel worksheet from an Excel workbook
- Replace a specific string/value in a spreadsheet with with `NA` 
- Skip *n* rows when importing an Excel worksheet
- Use `read_excel()` to read in parts of a worksheet (by cell range)
- Specify column names when importing Excel data
- Read and combine data from multiple Excel worksheets into a single df using `purrr::map_df()`
- Write data using `write_csv()` or `write_xlsx()`

### Resources

- https://readxl.tidyverse.org/
- [readxl Workflows article (from tidyverse.org)](https://readxl.tidyverse.org/articles/articles/readxl-workflows.html)

## Lesson

### Attach packages

In the .Rmd you just created within your version-controlled `r-workshop` R project, attach the `tidyverse`, `readxl`, `writexl`, and `here` packages.

In this lesson, we'll read in a CSV file with the `read_csv()` function, so we need to have the `readr` package attached. Since it's part of the `tidyverse`, we'll go ahead and attach the `tidyverse` package below our script header using `library(package_name)`. It's a good idea to attach packages within the set-up chunk in R Markdown, so we'll also attach the `readxl`, `writexl`, and `here` packages there. 

Here's our first code chunk: 

```{r, include = FALSE}
# Attach the tidyverse, readxl and writexl packages:
library(tidyverse)
library(readxl) # install.packages('readxl')
library(writexl) # install.packages('writexl')
library(here) # install.packages('here')
```

```
{r setup, eval = FALSE}
knitr::opts_chunk$set(echo = TRUE)

# Attach the tidyverse, readxl, writexl and here packages:
library(tidyverse)
library(readxl)
library(writexl)
library(here)

```

Now, all of the packages and functions within the `tidyverse` and `readxl`, including `read_csv()` and `read_excel()`, are available for use. 

### Use `read_csv()` to read in data from a CSV file

There are many types of files containing data that you might want to work with in R. A common one is a comma separated value (CSV) file, which contains values with each column entry separated by a comma delimiter. CSVs can be opened, viewed, and worked with in Excel just like an .xls or .xlsx file - but let's learn how to get data directly from a CSV into R where we can work with it more reproducibly. 

The CSV we'll read in here is called "fish_counts_curated.csv", and contains observations for "the abundance and size of fish species as part of SBCLTER's kelp forest monitoring program to track long-term patterns in species abundance and diversity" from the [Santa Barbara Channel Long Term Ecological Research](http://sbc.lternet.edu/) program. 

**Source:** Reed D. 2018. SBC LTER: Reef: Kelp Forest Community Dynamics: Fish abundance. Environmental Data Initiative. https://doi.org/10.6073/pasta/dbd1d5f0b225d903371ce89b09ee7379. Dataset accessed 9/26/2019.

Read in the "fish_counts_curated.csv" file `read_csv("file_name.csv")`, and store it in R as an object called *fish_counts*:

```{r, include = FALSE}

# Teaching version
fish_counts <- read_csv("curation/fish_counts_curated.csv")

```

```{r, eval = FALSE, results = 'hide'}

fish_counts <- read_csv(here("data", "fish_counts_curated.csv"))

```

Notice that the name of the stored object (here, *fish_counts*) will show up in our Environment tab in RStudio. 

Click on the object in the Environment, and R will automatically run the `View()` function for you to pull up your data in a separate viewing tab. Now we can look at it in the spreadsheet format we're used to. 

Here are a few other functions for quickly exploring imported data: 

- `summary()`: summary of class, dimensions, `NA` values, etc.
- `names()`: variable names (column headers)
- `ls()`: list all objects in environment
- `head()`: Show the first x rows (default is 6 lines)
- `tail()`: Show the last x rows (default is 6 lines)

Now that we have our fish counts data ready to work with in R, let's get the substrate cover and kelp data (both .xlsx files). In the following sections, we'll learn that we can use `read_excel()` to read in Excel files directly.

### Use `read_excel()` to read in a single Excel worksheet 

First, take a look at *substrate_cover_curated.xlsx* in Excel, which contains a single worksheet with substrate type and percent cover observations at different sampling locations in the Santa Barbara Channel. 

A few things to notice:

- The file contains a single worksheet
- There are multiple rows containing text information up top
- Where observations were not recorded, there exists '-9999'

Let's go ahead and read in the data. If the file is in our working directory, we can read in a single worksheet .xlsx file using `read_excel("file_name.xlsx")`. *Note: read_excel() works for both .xlsx and .xls types*. 

Like this: 
```{r, include = FALSE}

substrate_cover <- read_excel("curation/substrate_cover_curated.xlsx")

```

```{r, eval = FALSE, results = 'hide'}
substrate_cover <- read_excel(here("data", "substrate_cover_curated.xlsx"))
```

**Tada? Not quite.** 

Click on the object name (*substrate_cover*) in the Environment to view the data in a new tab. A few things aren't ideal:

```{r}
substrate_cover
```

- The top row of text has automatically become the (messy) column headers
- There are multiple descriptive rows before we actually get to the data
- There are -9999s that we want R to understand as `NA` instead

We can deal with those issues by adding arguments within `read_excel()`. Like:

- Add `skip = n` to skip the first 'n' rows when importing data
- Add `na = "this"` to replace "this" with `NA` when reading in spreadsheet data

```{r, include = FALSE}
substrate_cover <- read_excel("curation/substrate_cover_curated.xlsx", skip = 4, na = "-9999")
```

```{r, eval = FALSE}
substrate_cover <- read_excel(here("data", "substrate_cover_curated.xlsx, skip = 4, na = "-9999")
```

```{r}
substrate_cover
```


Check out *substrate_cover*, and see that the first row *after* the 4 skipped are the column names, and all -9999s have been updated to `NA`. Hooray!

### Use `read_excel()` to read in only *part* of an Excel worksheet

We always advocate for leaving the raw data raw, and writing a complete script containing all steps of data wrangling & transformation. But in *some* situations (be careful), you may want to specify a range of cells to read in from an Excel worksheet.

You can specify a range of cells to read in using the `range = ` argument in `read_excel()`. For example, if I want to read in the rectangle from D12:I15 in *substrate_cover_curated.xlsx* - only observations for Carpenteria Beach (Transect 2) in September 2000 - I can use: 
```{r, include = FALSE}
carp_cover_2000 <- read_excel("curation/substrate_cover_curated.xlsx", range = "D12:I15")
```

```{r, eval = FALSE}
carp_cover_2000 <- read_excel(here("data", "substrate_cover_curated.xlsx", range = "D12:I15")
```

But yuck. Look at *carp_cover_2000* and you'll notice that the first row *of that range* is automatically made the column headers. To keep all rows within a range and **add your own column names**, add a `col_names = ` argument:
```{r, include = FALSE}
carp_cover_2000 <- read_excel("curation/substrate_cover_curated.xlsx", range = "D12:I15", col_names = c("site_name", "transect", "quad", "plot_side", "type", "coverage"))
```

```{r, eval = FALSE}
carp_cover_2000 <- read_excel(here("data", "substrate_cover_curated.xlsx", range = "D12:I15", col_names = c("site_name", "transect", "quad", "plot_side", "type", "coverage"))
```

```{r}
carp_cover_2000
```


So far we've read in a single CSV file using `read_csv()`, and an Excel file containing a single worksheet with `read_excel()`. Next, let's read in data from an Excel workbook that contains multiple worksheets. 

### Use `read_excel()` to read in selected worksheets from a workbook

Now, we'll read in the kelp fronds data from file *kelp_counts_curated.xlsx*. Open the file in Excel, and notice see that it contains multiple worksheets with giant kelp observations in the Santa Barbara Channel during July 2016, 2017, and 2018, with data collected at each *site* in a separate worksheet.

To read in a single Excel worksheet from a workbook we'll again use `read_excel("file_name.xlsx")`, but we'll need to let R know which worksheet to get. 

Let's read in the kelp data just like we did above, as an object called *kelp_counts*.

```{r, include = FALSE}

# Teaching chunk (won't show up when knitted)
kelp_counts <- read_excel("curation/kelp_counts_curated.xlsx")

```

```{r, eval = FALSE, results = 'hide'}
kelp_counts <- read_excel(here("data", "kelp_counts_curated.xlsx")
```

You might be thinking, "Hooray, I got all of my Excel workbook data!" But remember to always look at your data - you will see that actually only the first worksheet was read in. The default in `read_excel()` is to read in the **first worksheet** in a multi-sheet Excel workbook. 

To check the worksheet names in an Excel workbook, use `excel_sheets()`:
```{r, include = FALSE}
excel_sheets("curation/kelp_counts_curated.xlsx")
```

```{r, eval = FALSE}
excel_sheets(here("data", "kelp_counts_curated.xlsx")
```

If we want to read in data from a worksheet other than the first one in an Excel workbook, we can specify the correct worksheet by name or position by adding the `sheet` argument. 

Let's read in data from the worksheet named *golb* (Goleta Beach) in the *kelp_counts_curated.xlsx* workbook: 
```{r, include = FALSE}
kelp_golb <- read_excel("curation/kelp_counts_curated.xlsx", sheet = "golb")
```

```{r, eval = FALSE, results = 'hide'}
kelp_golb <- read_excel(here("data", "kelp_counts_curated.xlsx", sheet = "golb"))
```

Note that you can also specify a worksheet by position: since *golb* is the 6^th^ worksheet in the workbook, we could also use the following: 
```{r, include = FALSE}
kelp_golb <- read_excel("curation/kelp_counts_curated.xlsx", sheet = 6)
```

```{r, eval = FALSE}
kelp_golb <- read_excel(here("data", "kelp_counts_curated.xlsx", sheet = 6))
```

```{r}
kelp_golb
```


### Read in and combine data from multiple worksheets into a data frame simultaneously with `purrr::map_df()`

So far, we've read in entire Excel worksheets and pieces of a worksheet. What if we have a workbook (like *kelp_counts_curated.xlsx*) that contains worksheets that contain observations for the same variables, in the same organization? Then we may want to read in data from *all* worksheets, and combine them into a single data frame. 

We'll use `purrr::map_df()` to loop through all the worksheets in a workbook, reading them in & putting them together into a single data frame in the process. 

The steps we'll go through in the code below are: 

1. Set a pathway so that R knows where to look for an Excel workbook
2. Get the names of all worksheets in that workbook with `excel_sheets()`
3. Set names of a vector with `set_names()`
4. Read in all worksheets, and put them together into a single data frame with `purrr::map_df()`

**Aside**: the pipe operator (`%>%`)

There are many ways to use R functions in sequence. One way, that follows the order that we think about steps in sequence, is using the **pipe operator** (`%>%`). We can use the pipe operator between steps in a sequence, each place we think "and then do this." 

For example, if I would like to **walk my dog** and *then* **eat a burrito**, in code that might be: `walk(dog) %>% eat(burrito, type = "California")`

Here, we'll use the pipe operator to complete steps 1 - 4 above in sequence: 
```{r, include = FALSE}

kelp_path <- "curation/kelp_counts_curated.xlsx"

kelp_all_sites <- kelp_path %>% # Start with that pathway (file)...
  excel_sheets() %>% # Get all the names of the worksheets
  set_names() %>% # Names the vector items
  map_df(read_excel, path = kelp_path) # Applies 'read_excel' function to all worksheet names (loops through them) in the workbook at kelp_path and the _df part of it puts them all into a single data frame
  
```

```{r, eval = FALSE}

kelp_path <- here("data", "kelp_counts_curated.xlsx")

kelp_all_sites <- kelp_path %>% 
  excel_sheets() %>% 
  set_names() %>% 
  purrr::map_df(read_excel, kelp_path)

```

Check out *kelp_all_sites*, and notice that now the data from all 11 sites is now collected into a single data frame:

```{r}
kelp_all_sites
```

### Save data frames as .csv or .xlsx with `write_csv()` or `write_xlsx()`

There are a number of reasons you might want to save (/export) data in a data frame as a .csv or Excel worksheet, including: 

- To store raw data within the project you're working in
- To store copies of intermediate data frames
- To convert your data back to a format that your coworkers/clients/colleagues will be able to use it more easily

Use `write_csv(object, "file_name.csv")` to write a data frame to a CSV, or `write_xlsx(object, "file_name.xlsx")` to similarly export as a .xlsx (or .xls) worksheet. 

In the previous step, we combined all of our kelp frond observations into a single data frame. Wouldn't it make sense to store a copy? 

As a CSV: 
```{r, eval = FALSE}
write_csv(kelp_all_sites, here("data", "kelp_all_sites.csv"))
```

A cool thing about `write_csv()` is that it just quietly *works* without wrecking anything else you do in a sequence, so it's great to add at the end of a piped sequence.

For example, if I want to read in the range cells C1:D3 'ivee' worksheet from kelp_counts_curated.xlsx, then write that new subset to a .csv file, I can pipe all the way through:

```{r, include = FALSE, eval = FALSE}
kelp_ivee_subset <- read_excel(here("curation", "kelp_counts_curated.xlsx"), 
                        sheet = "ivee", 
                        range = "C1:D3") %>% 
  write_csv(here("curation", "kelp_ivee_subset.csv"))
```

```{r, eval = FALSE}

kelp_ivee_subset <- read_excel(here("data", "kelp_counts_curated.xlsx"), 
                        sheet = "ivee",
                        range = "C1:D3") %>% 
  write_csv(here("data", "kelp_ivee.csv"))

```

Now I've created *kelp_ivee_subset.csv*, but the object *kelp_ivee_subset* also exists as an object for me to use in R. 

If needed, I can also export a data frame as an Excel (.xlsx) worksheet:
```{r, eval = FALSE}
write_xlsx(kelp_all_sites, here("data", "kelp_all_sites.xlsx"))
```

## Activity: Import some invertebrates! 

There's one dataset we haven't imported or explored yet: invertebrate counts for 5 popular invertebrates (California cone snail, California spiny lobster, orange cup coral, purple urchin and rock scallops) at 11 sites in the Santa Barbara Channel. Take a look at the *invert_counts_curated.xlsx* data by opening it in Excel

- Read in the *invert_counts_curated.xlsx* worksheet as object 'inverts_july', only retaining **site**, **common_name**, and **2016** and setting the existing first row in the worksheet as to column headers upon import
- Explore the imported data frame using View, names, head, tail, etc. 
- Write 'inverts_july' to a CSV file in your working directory called "inverts_july.csv"

```{r, include = FALSE}

# Solution:

# Importing only 'site' through '2016' columns: 
inverts_july <- read_excel("curation/invert_counts_curated.xlsx", range = "B1:D56")

# Do some basic exploring (why might we want to do this in the Console instead?):
#View(inverts_july)
names(inverts_july)
head(inverts_july)
tail(inverts_july)
ls()

# Write back to a csv file: 
# write_csv(inverts_july, "inverts_july.csv")

```

```{r, eval = FALSE}

# Importing only 'site' through '2016' columns: 

inverts_july <- read_excel(here("data", "invert_counts_curated.xlsx"), range = "B1:D56")

# Do some basic exploring (why might we want to do this in the Console instead?):

View(inverts_july)
names(inverts_july)
head(inverts_july)
tail(inverts_july)
ls()

# Writing a csv "inverts_july.csv":
write_csv(inverts_july, here("data", "inverts_july.csv"))

```

## Efficiency Tips

- Add an assignment arrow in (<-): Alt + minus (-)
- Undo shortcut: Command + Z 
- Redo shortcut: Command + Shift + Z

## Additional thoughts

- Economist article about gene > dates issue in Excel
- Mine: data frame of bike casualities in NC, column names are age ranges but some of them import as dates
- Excel makes some wrong assumptions and doesn't give you a heads up about its decision making