Skip to content

Latest commit

 

History

History
470 lines (348 loc) · 14.5 KB

r4ds_week14.md

File metadata and controls

470 lines (348 loc) · 14.5 KB

R4DS Study Group - Week 14

Pierrette Lo 7/10/2020

This week’s assignment

  • Chapter 7.1-7.4
library(tidyverse)

Notes

  • The {skimr} package is super useful for EDA, with more summary statistics in a nicer format than with summary().

  • Bonus: check out the new Palmer Penguins dataset - packaged by Alison Horst, who also created some fun artwork to go with it.

# install {palmerpenguins} before using for the first time using the command below:
# remotes::install_github("allisonhorst/palmerpenguins")

library(palmerpenguins)

# take a look at first few rows of the data

head(penguins) %>% view()

Then get summary statistics with {skimr}

library(skimr)
## Warning: package 'skimr' was built under R version 3.6.3
skim(penguins)
Name penguins
Number of rows 344
Number of columns 7
_______________________
Column type frequency:
factor 3
numeric 4
________________________
Group variables None

Data summary

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂

Ch 7:3 Variation

Notes

  • categorical variable = bar chart
  • continuous variable = histogram
  • use binwidth in ggplot() and cut_width() in count() to bin a continuous variable into discrete intervals.

Example:

# without cut_width - get count for every possible value of carat

diamonds %>% 
  count(carat)
## # A tibble: 273 x 2
##    carat     n
##    <dbl> <int>
##  1 0.2      12
##  2 0.21      9
##  3 0.22      5
##  4 0.23    293
##  5 0.24    254
##  6 0.25    212
##  7 0.26    253
##  8 0.27    233
##  9 0.28    198
## 10 0.290   130
## # ... with 263 more rows
# with cut_width - more manageable number of groups

diamonds %>% 
  count(cut_width(carat, 0.5))
## # A tibble: 11 x 2
##    `cut_width(carat, 0.5)`     n
##    <fct>                   <int>
##  1 [-0.25,0.25]              785
##  2 (0.25,0.75]             29498
##  3 (0.75,1.25]             15977
##  4 (1.25,1.75]              5313
##  5 (1.75,2.25]              2002
##  6 (2.25,2.75]               322
##  7 (2.75,3.25]                32
##  8 (3.25,3.75]                 5
##  9 (3.75,4.25]                 4
## 10 (4.25,4.75]                 1
## 11 (4.75,5.25]                 1

geom_freqpoly() gives you a histogram with lines instead of bars - better for comparing different groups

# bars

diamonds %>% 
  ggplot(aes(x = carat, fill = cut)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# lines

diamonds %>% 
  ggplot(aes(x = carat, colour = cut)) +
  geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercises

  1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

A simple summary:

diamonds %>% 
  select(x:z) %>% 
  summary()
##        x                y                z         
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.700   Median : 5.710   Median : 3.530  
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :10.740   Max.   :58.900   Max.   :31.800

A more detailed summary using the {skimr} package:

skimr::skim(diamonds, x:z)
Name diamonds
Number of rows 53940
Number of columns 10
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Data summary

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
x 0 1 5.73 1.12 0 4.71 5.70 6.54 10.74 ▁▁▇▃▁
y 0 1 5.73 1.14 0 4.72 5.71 6.54 58.90 ▇▁▁▁▁
z 0 1 3.54 0.71 0 2.91 3.53 4.04 31.80 ▇▁▁▁▁

Observations:

  • According to the help (?diamonds), x = length, y = width, z = depth
  • Looking at the summary, x and y are very similar dimensions, whereas z is smaller (makes sense when you imagine the shape of a typical diamond)
  • Both y and z have max values (58.9 and 31.8, respectively) that look like they could be errors
  • The min values for all 3 variables are 0, which is probably also an error.

I also looked at the data graphically (see Chapter 12 for info on “pivoting” a table into a longer format):

diamonds %>% 
  pivot_longer(x:z, names_to = "dimension", values_to = "dim_in_mm") %>% 
  ggplot(aes(x = dim_in_mm)) +
  geom_histogram(binwidth = 0.1) +
  facet_wrap(~dimension, scales = "free")

Observations:

  • The x-axis for y and z also indicate the presence of large outliers.

Let’s zoom in on the x-axis to get a better idea of variability between these groups:

diamonds %>% 
  pivot_longer(x:z, names_to = "dimension", values_to = "dim_in_mm") %>% 
  ggplot(aes(x = dim_in_mm)) +
  geom_histogram(binwidth = 0.1) +
  facet_wrap(~dimension, scales = "free") +
  coord_cartesian(xlim = c(0, 10))

diamonds %>% 
  arrange(desc(z))
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.51 Very Good E     VS1      61.8  54.7  1970  5.12  5.15 31.8 
##  2  2    Premium   H     SI2      58.9  57   12210  8.09 58.9   8.06
##  3  5.01 Fair      J     I1       65.5  59   18018 10.7  10.5   6.98
##  4  4.5  Fair      J     I1       65.8  58   18531 10.2  10.2   6.72
##  5  4.13 Fair      H     I1       64.8  61   17329 10     9.85  6.43
##  6  3.65 Fair      H     I1       67.1  53   11668  9.53  9.48  6.38
##  7  4    Very Good I     I1       63.3  58   15984 10.0   9.94  6.31
##  8  3.4  Fair      D     I1       66.8  52   15964  9.42  9.34  6.27
##  9  4.01 Premium   J     I1       62.5  62   15223 10.0   9.94  6.24
## 10  4.01 Premium   I     I1       61    61   15223 10.1  10.1   6.17
## # ... with 53,930 more rows
  • It looks like z has the least variability, and again x and y are very similar once you remove the outliers.
  • The data look “spiky” - maybe because people tend to round to certain numbers instead of being precise?
  1. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823

With smaller binwidths, you start to notice a gap around $1500 (this is not apparent with larger binwidths).

diamonds %>% 
  ggplot(aes(x = price)) + 
  geom_histogram(binwidth = 500)

diamonds %>% 
  ggplot(aes(x = price)) + 
  geom_histogram(binwidth = 50)

diamonds %>% 
  ggplot(aes(x = price)) + 
  geom_histogram(binwidth = 10) +
  coord_cartesian(xlim = c(0, 2500))

I dont know why there are no diamonds that cost $1500 - another error?

Also, if you decrease the binwidth further, the data again look “spiky”. My theory is that there are certain numbers that people use for price (e.g. ending in ‘99’) because it “looks better”.

  1. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
diamonds %>% 
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.01) +
  coord_cartesian(xlim = c(0.99, 1))

There are almost no diamonds that are 0.99 carats. I assume this is because people have a tendency to round up. It’s interesting to see how a dataset consisting of hard numbers can still be so subjective!

  1. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
  • Default binwidth is 30 if you don’t specify

  • xlim() and ylim() = “crop” - zoom in and remove data that fall outside the limits (you will get a warning message indicating how many data points were removed)

  • coord_cartesian() = “zoom” - zooms in without removing data

  • coord_cartesian() calculates binwidths and counts first, then zooms in, so some bars may be cut off

  • xlim() and ylim() remove data before calculating binwidths and counts, so bars are never cut off

diamonds %>% 
  ggplot(aes(x = carat)) +
  geom_histogram() +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 5000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

diamonds %>% 
  ggplot(aes(x = carat)) +
  geom_histogram() +
  xlim(0, 1) +
  ylim(0, 5000)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 17502 rows containing non-finite values (stat_bin).

## Warning: Removed 3 rows containing missing values (geom_bar).

Ch 7:4 Missing values

Exercises

  1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

I used the penguins dataset here because it contains missing values (diamonds does not).

In a bar chart – which is used for categorical variables – NAs are considered to be their own category.

penguins %>% 
  ggplot(aes(x = sex)) +
  geom_bar()

In a histogram, the continuous x variable is first grouped into bins, then the number of observations in each bin is counted. NAs don’t belong in any of the bins, so they get dropped (note warning message telling you how many data points were removed).

penguins %>% 
  ggplot(aes(x = flipper_length_mm)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing non-finite values (stat_bin).

  1. What does na.rm = TRUE do in mean() and sum()?

This argument is needed to ignore NAs when calculating the mean or sum.

Without na.rm = TRUE:

mean(penguins$flipper_length_mm)
## [1] NA

With na.rm = TRUE:

mean(penguins$flipper_length_mm, na.rm = T)
## [1] 200.9152

As an aside, I just learned that using T and F instead of TRUE or FALSE is bad practice! TRUE (1) and FALSE (0) are “reserved” in R, i.e. set in stone - you can’t edit them to have other values. You can overwrite T and F and get some wacky results if you’ve forgotten you did this. So always write out TRUE and FALSE. Sorry for misleading you all before!

T <- 5      # this works

T <- FALSE  # this also works

TRUE <- 5   # this doesn't
## Error in TRUE <- 5: invalid (do_set) left-hand side to assignment