Every year on February 2, Groundhog Day, the famous groundhog Punxsutawney Phil—and a growing cast of creatures (including stuffed animals, sock puppets, and mascots)—emerge from their burrows to predict the weather. Punxsutawney Phil, the prognosticator of prognosticators, has been predicting the weather since 1887. If he sees his shadow, his prediction is for six more weeks of winter. If not, he predicts an early spring.
This package brings together prediction data from Countdown to Groundhog Day and weather data from the U.S. National Ocean and Atmospheric Administration (NOAA) to help you evaluate which prognosticators you can trust.
# Install from GitHub
devtools::install_github("ericpgreen/feb2")
Currently this is a function-less data package designed to facilitate your analyses of Groundhog Day prediction data.
library(feb2)
There are three main datasets and several supporting datasets.
Michael Venos, the creator of Countdown to Groundhog Day, maintains the internet’s most comprehensive database about Groundhog Day predictions. He generously agreed to allow me to incorporate his data into this package. Currently he has data on a diverse collection of 153 prognosticators.
library(tidyverse)
prognosticators %>%
group_by(prognosticator_status) %>%
count()
#> # A tibble: 3 × 2
#> # Groups: prognosticator_status [3]
#> prognosticator_status n
#> <chr> <int>
#> 1 Creature 110
#> 2 Human Mascot 11
#> 3 Inanimate 32
Rodents are by far the most common prognosticators. One lobster is holding it down for the arthropods.
prognosticators %>%
group_by(prognosticator_phylum, prognosticator_class, prognosticator_order) %>%
count()
#> # A tibble: 15 × 4
#> # Groups: prognosticator_phylum, prognosticator_class, prognosticator_order
#> # [15]
#> prognosticator_phylum prognosticator_class prognosticator_order n
#> <chr> <chr> <chr> <int>
#> 1 Arthropoda Malacostraca Decapoda 1
#> 2 Chordata Actinopterygii Perciformes 1
#> 3 Chordata Amphibia Anura 1
#> 4 Chordata Aves Anseriformes 2
#> 5 Chordata Aves Galliformes 1
#> 6 Chordata Aves Strigiformes 1
#> 7 Chordata Mammalia Artiodactyla 1
#> 8 Chordata Mammalia Carnivora 9
#> 9 Chordata Mammalia Cingulata 1
#> 10 Chordata Mammalia Didelphimorphia 3
#> 11 Chordata Mammalia Eulipotyphla 7
#> 12 Chordata Mammalia Rodentia 121
#> 13 Chordata Mammalia Tubulidentata 1
#> 14 Chordata Reptilia Crocodilia 2
#> 15 Chordata Reptilia Testudines 1
Michael has collected 1488 predictions going back to Punxsutawney Phil’s
first prediction in 1887. Some years the prediction is uncertain or not
recorded, accounting for the NA
s.
predictions %>%
group_by(prediction) %>%
count()
#> # A tibble: 3 × 2
#> # Groups: prediction [3]
#> prediction n
#> <chr> <int>
#> 1 Early Spring 726
#> 2 Long Winter 733
#> 3 <NA> 29
Each prognosticator is linked to a city. I used the {tidygeocoder
}
package to get coordinates for each city and then searched for weather
stations with TMAX (daily max air temperature) data within 100km of each
city’s coordinates using the meteo_nearby_stations()
function in the
{rnoaa
} package. This dataset lists every eligible weather station
within 100km, but the package only uses weather data from (up to) the 10
closest stations.
The biggest challenge for evaluating Groundhog Day predictions is defining what we mean by “early spring”. So far in this package I follow the general approach of earlier analyses by NOAA and 538. I define early spring for a prognosticator’s location as one month (February OR March) with an average high temperature above the historical average for that month.1 Unlike the previous analyses, however, I use local data for each prognosticator. NOAA used U.S. national temperatures, and 538 looked across nine U.S. regions.2 I think it’s just silly to expect a real or stuffed groundhog to be able to predict national or regional weather based on localized sunshine. I say let’s evaluate their powers of prognostication using local data.3
Steps to construct the classification:
- Identify the 10 weather stations with TMAX data closest to each prognosticator’s city
- Use
rnoaa::ghcnd_search()
to obtain daily historical high temperatures for each station - Calculate the mean daily high temperature for each prognosticator’s city by averaging over the 10 closest weather stations
- Calculate the mean monthly high temperature for each prognosticator’s city by averaging over the daily data
- Calculate the 15-year rolling mean high monthly temperature for each prognosticator’s city
- Use the
def1
definition to classify each year as “early spring” or “long winter”
The class_def1
dataset does not contain the underlying weather data,
but you can find it in class_def1_data
. You can also use the {rnoaa
}
package and the weather station information in weather_stations_ghcnd
and weather_stations_isd
to query the weather data directly if you are
interested in creating a new classification definition.
I plan to update the prognosticators, predictions, and classifications
data after the month of March (classification definition def1
depends
on March weather data).
NOAA weather data are in the public domain (as far as I know). Data on prognosticators and their predictions come with permission from Countdown to Groundhog Day. You are welcome to use the data via this package for any purpose, but please do not post the raw data on any other public sites. Instead, give credit to Michael’s tremendous effort by pointing back to Countdown to Groundhog Day.
The groundhog pixel art is a DALL-E 2 creation.
Please submit an issue if you encounter any bugs or errors. This package comes with no warranty of any kind. Don’t rely on me or these rodents to get it right. Though my family did live in Punxsutawney when I was 4, and I have been to Gobbler’s Knob.