---
title: "Smoothing Temperature Data"
author: "Steve Bischoff"
date: "2024-2-22"
output:
  pdf_document: default
  html_document: default
---
# Contents
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
```{r include=FALSE, warning=FALSE}
suppressWarnings(library(knitr))
suppressWarnings(library(ggplot2))
suppressWarnings(library(cowplot))
suppressWarnings(library(reshape2))
suppressWarnings(library(tseries))
suppressWarnings(library(tidyverse))
suppressWarnings(library(ks))
suppressWarnings(library(ramify))
suppressWarnings(library(pracma))
```
```{r include=FALSE}
data_aus = read.csv('data/daily_summaries_austin.csv', header=TRUE)
data_aus$DATE = as.Date(data_aus$DATE)
```
```{r include=FALSE}
data_sd = read.csv('data/daily_summaries_sandiego.csv', header=TRUE) %>% drop_na(TMAX)
data_sd$DATE = as.Date(data_sd$DATE)
```
```{r include=FALSE}
data_mn = read.csv('data/daily_summaries_minneapolis.csv', header=TRUE)
data_mn$DATE = as.Date(data_mn$DATE)
```
```{r include=FALSE}
data_bt = read.csv('data/daily_summaries_baltimore.csv', header=TRUE)
data_bt$DATE = as.Date(data_bt$DATE)
```
```{r include=FALSE}
data1_aus = data_aus[(data_aus$DATE >= "1993-01-01") & (data_aus$DATE <= "2022-12-31"),]
high1_aus = data1_aus$TMAX
n1_aus = length(high1_aus)
```
```{r include=FALSE}
data1_sd = data_sd[(data_sd$DATE >= "1993-01-01") & (data_sd$DATE <= "2022-12-31"),]
high1_sd = data1_sd$TMAX
n1_sd = length(high1_sd)
```
```{r include=FALSE}
data1_mn = data_mn[(data_mn$DATE >= "1993-01-01") & (data_mn$DATE <= "2022-12-31"),]
high1_mn = data1_mn$TMAX
n1_mn = length(high1_mn)
```
```{r include=FALSE}
data1_bt = data_bt[(data_bt$DATE >= "1993-01-01") & (data_bt$DATE <= "2022-12-31"),]
high1_bt = data1_bt$TMAX
n1_bt = length(high1_bt)
```
```{r include=FALSE}
time_idx1_aus = 1:n1_aus
date1_aus = data1_aus$DATE
```
```{r include=FALSE}
time_idx1_sd = 1:n1_sd
date1_sd = data1_sd$DATE
```
```{r include=FALSE}
time_idx1_mn = 1:n1_mn
date1_mn = data1_mn$DATE
```
```{r include=FALSE}
time_idx1_bt = 1:n1_bt
date1_bt = data1_bt$DATE
```
# 1. Introduction
Smoothing methods are a popular way to identify the trend(s) in noisy data without having to construct a model of the data-generating process. Temperature data are a particularly appealing application of these methods. Figure 1 shows daily high temperatures in Austin, Texas in the late 90s.[1]
```{r fig.cap="Austin daily high temperatures"}
low=1000
high=2500
gg = ggplot(data1_aus[low:high,], aes(x=DATE, y=TMAX)) + geom_line() + labs(x="Date", y="High Temp. (F)") + theme(plot.title = element_text(hjust = 0.5))
print(gg)
```
The graph is visually unappealing due to day-to-day temperature fluctuations. This noise also makes it difficult to generalize and identify trends. As just one example: the winter of 1996-1997 clearly has the lowest lows of these years, but was that cold spell an aberration, or was that winter truly the coldest overall?
This project investigates the application of automated smoothing methods to temperature data. Temperature data has several features that complicate the task. These features are detailed in Section 5, and potential solutions are discussed in subsequent sections. We find that *Super-Smoothing with K-Block Cross-Validation* does a pretty good job of smoothing the data where other methods fail.
## 1.1 Data
The data used in this project comes from the National Oceanic and Atmospheric Administration (NOAA).[1] We can use the search tool to find stations that have had 100% Daily Summary coverage for many decades.
In order to investigate temperature smoothing across a wide variety of climates, we'll use data from 4 different cities. Table 1 provides a summary.
```{r results='asis'}
table_df = data.frame(City=c("Austin", "San Diego", "Minneapolis", "Baltimore"),
StationID=c("USW00013958","USW00023188","USW00014922","USW00093721"),
FirstDate=c("1938-06-01","1939-07-01","1938-04-09","1939-07-01"))
kable(table_df, caption="Data source descriptions",
col.names=c("City", "Station ID", "First Available"))
```
To make computation more manageable, we truncate each data set to 1993 - 2022. Figures 2 and 3 display time series and box plots of the truncated data sets.
```{r}
plot_df = data.frame(Daily_High_Temp=c(data1_aus$TMAX, data1_sd$TMAX,
data1_mn$TMAX, data1_bt$TMAX),
Date=c(data1_aus$DATE, data1_sd$DATE,
data1_mn$DATE, data1_bt$DATE),
City=c(rep("Austin", n1_aus),
rep("S.D.", n1_sd),
rep("Minn.", n1_mn),
rep("Balt.", n1_bt)))
plot_df$Date = as.Date(plot_df$Date)
```
```{r fig.cap="Daily high temperatures 1993-2002"}
gg = ggplot(plot_df, aes(x=Date, y=Daily_High_Temp)) +
geom_line() + facet_wrap( ~ City) +
labs(y="High Temp. (F)") +
theme(plot.title = element_text(hjust = 0.5))
print(gg)
```
```{r fig.cap="Daily high boxplots by city"}
par(mar=c(5,5,4,2))
boxplot(Daily_High_Temp ~ City, plot_df, cex.lab=1, cex.axis=1, ylab="High Temp. (F)")
```
# 2. Kernel Smoothing
---
**Overview:** Kernel smoothers apply a weighted average to the data. They take a *bandwidth* parameter that controls how smooth the output curve is.
---
Smoothing methods try to find a good *smoothing function* that cuts through the noise to give the underlying trend. The value of a smoothing function at a point is generally calculated as some type of weighted average of the values at surrounding points. For instance, the temperature given by the smoothing function on Oct. 1, 1996 is a weighted average of the actual temperatures on and around that date.
Perhaps the simplest smoother is the *constant-span running mean* ([2] p. 366). This smoother is given a parameter $k$ that defines a window size for taking the mean. Suppose we apply the constant-span running mean with $k = 7$ to the temperature data. To calculate the smoothing function on Oct. 1, 1996, we take the mean temperature of the $\frac{k-1}{2} = 3$ days before, the day itself, and the $\frac{k-1}{2} = 3$ days after: in other words, we calculate the mean high temperature from Sep. 28 to Oct. 4.
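A minimal sketch of this smoother in R (assuming a numeric vector `y` of daily highs and an odd window size `k`; for simplicity the ends of the series are left unsmoothed):

```{r eval=FALSE}
running_mean = function(y, k = 7) {
  # constant-span running mean: average the k observations centered at each point
  half = (k - 1) / 2
  n = length(y)
  out = rep(NA, n)
  for (i in (half + 1):(n - half)) {
    out[i] = mean(y[(i - half):(i + half)])
  }
  return(out)
}
# e.g. running_mean(high1_aus, k = 7)
```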
While its simplicity is attractive, the constant-span running mean has some problems: the window parameter defines a sharp cutoff for the points it considers, and points at the edge of the window are given as much weight as points in the middle.
The smoothing method used in this project is known as *kernel smoothing* and tries to fix these problems. A *kernel* is a function that allows kernel smoothers to take truly *weighted* averages. When calculating the smooth function at a given point, the kernel weights observations based on how close they are to that point: the closer the observation, the higher the weight. This project uses a *normal kernel* that assigns weights around a given point in the bell shape of a normal probability distribution.
The effect of distance on weighting is controlled by the all-important *bandwidth* parameter $h$. If $h$ is set to a low value, observations close to the given point are weighted very highly, and weights steeply decrease as distance from the point increases. Conversely, if $h$ is set to a high value, close points are given relatively less weight, and weights decrease less steeply with distance.
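Concretely, with the normal kernel used here, the smoothed value at a point $x_0$ is a weighted average of all observations, with weights given by a normal density centered at $x_0$ (this is the row-normalized form computed by the `get_S` function defined below):

$$\hat{s}(x_0) = \sum_{i=1}^{n} w_i(x_0)\, y_i, \qquad w_i(x_0) = \frac{\phi\!\left(\frac{x_i - x_0}{h}\right)}{\sum_{j=1}^{n} \phi\!\left(\frac{x_j - x_0}{h}\right)},$$

where $\phi$ is the standard normal density and $h$ is the bandwidth.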
Suppose again that we're calculating the smoothed temperature on Oct. 1, 1996. Figure 4 shows how the temperatures on days around this date are weighted given different bandwidth parameters:
```{r}
get_S = function(X, kernel_func, h) {
# Get smoothing matrix with given kernel and bandwidth
X_diff = outer(X, X, FUN="-")
Z = X_diff/h
S_temp = kernel_func(Z)
S = S_temp/rowSums(S_temp)
return(S)
}
kernel_smoother = function(X, Y, kernel_func, h) {
S = get_S(X, kernel_func, h) # smoothing matrix
sk_hat = S %*% Y
return(sk_hat)
}
```
```{r}
data1_aus$s_3 = dnorm((1370-time_idx1_aus)/3)/sum(dnorm((1370-time_idx1_aus)/3))
data1_aus$s_7 = dnorm((1370-time_idx1_aus)/7)/sum(dnorm((1370-time_idx1_aus)/7))
data1_aus$s_14 = dnorm((1370-time_idx1_aus)/14)/sum(dnorm((1370-time_idx1_aus)/14))
data1_aus$s_28 = dnorm((1370-time_idx1_aus)/28)/sum(dnorm((1370-time_idx1_aus)/28))
data1_aus$s_42 = dnorm((1370-time_idx1_aus)/42)/sum(dnorm((1370-time_idx1_aus)/42))
```
```{r fig.cap="Normal Kernel Weights by Bandwidth around 1996/10/01"}
df_temp = melt(data1_aus[1336:1404,c("DATE","s_3","s_7","s_14","s_28","s_42")], id="DATE")
gg = ggplot(df_temp) + geom_line(aes(x=DATE, y=value, color=variable)) + scale_color_discrete(name="Bandwidth", labels=c(3,7,14,28,42)) + labs(x="Date", y="Normal Kernel Weight") + theme(plot.title = element_text(hjust = 0.5))
print(gg)
```
Given a bandwidth $h=3$, the smoothed temperature on Oct. 1 mostly depends on the temperatures in a small window around that date. Given a bandwidth $h=28$ or $h=42$, the smoothed temperature on Oct. 1 gives non-negligible weight to temperatures in August and November.
As seen in Figure 5, the bandwidth has a significant impact on the smooth:
```{r}
Y_hat1_aus = kernel_smoother(time_idx1_aus, high1_aus, dnorm, 3)
data1_aus$Y_hat_3 = Y_hat1_aus
```
```{r}
Y_hat1_aus = kernel_smoother(time_idx1_aus, high1_aus, dnorm, 42)
data1_aus$Y_hat_42 = Y_hat1_aus
```
```{r fig.cap="Austin Daily Highs with two smooths"}
df_temp = suppressWarnings(melt(data1_aus[(data1_aus$DATE >= "1995-10-01") & (data1_aus$DATE <= "1999-09-30"),c("DATE","TMAX","Y_hat_3","Y_hat_42")], id="DATE"))
gg = suppressWarnings(ggplot(df_temp) + geom_line(aes(x=DATE, y=value, color=variable)) + labs(x="Date", y="High Temp. (F)") + scale_color_manual("Bandwidth", values=c("gray","red","blue"), labels=c("N/A", "3", "42")) + theme(plot.title = element_text(hjust = 0.5)))
print(gg)
```
The kernel smoother with $h=3$ overfits the data, following its fluctuations too closely and failing to provide much of a smooth at all. The kernel smoother with $h=42$ underfits the data, failing to fully capture the seasonal extremes in summer and winter.
# 3. Bandwidth Selection
---
**Overview:** Automated bandwidth selection methods scale better, so they're preferred to manual selection when we need to smooth many data sets and/or large data sets.
---
Selecting an appropriate bandwidth is crucially important for getting a good smooth of the data (it's generally viewed as much more important than the choice of kernel function). We can divide methods of bandwidth selection into two broad categories, manual and automated.
## 3.1 Manual Bandwidth Selection
A common way to manually select a bandwidth is by trial and error with visual inspection of the smooths given by different bandwidths. We plot various smooths over the data and see how well they capture the overall trend(s).
Figure 6 displays plots of a normal kernel smooth with $h = 14$ for all 4 cities:
```{r}
Y_hat1_aus = kernel_smoother(time_idx1_aus, high1_aus, dnorm, 14)
data1_aus$Y_hat = Y_hat1_aus
```
```{r}
Y_hat1_sd = kernel_smoother(time_idx1_sd, high1_sd, dnorm, 14)
data1_sd$Y_hat = Y_hat1_sd
```
```{r}
Y_hat1_mn = kernel_smoother(time_idx1_mn, high1_mn, dnorm, 14)
data1_mn$Y_hat = Y_hat1_mn
```
```{r}
Y_hat1_bt = kernel_smoother(time_idx1_bt, high1_bt, dnorm, 14)
data1_bt$Y_hat = Y_hat1_bt
```
```{r}
plot_df$Y_hat = c(Y_hat1_aus, Y_hat1_sd, Y_hat1_mn, Y_hat1_bt)
plot_df_trunc = plot_df[(plot_df$Date >= "1995-10-01") & (plot_df$Date <= "1999-09-30"),]
```
```{r fig.cap="Daily highs with normal kernel smooth (h=14)"}
gg = ggplot(plot_df_trunc) +
geom_line(aes(x=Date, y=Daily_High_Temp), color="gray") +
geom_line(aes(x=Date, y=Y_hat), color="red") +
facet_wrap( ~ City, scales="free") +
labs(x="Date", y="High Temp. (F)") +
theme(plot.title = element_text(hjust = 0.5))
print(gg)
```
By visual inspection, a bandwidth $h=14$ seems to do a pretty good job of smoothing the data, especially for Baltimore and Minneapolis.
## 3.2 Automated Bandwidth Selection
A main problem with manual bandwidth selection is that it doesn't scale well in situations with many data sets and/or large data sets. There are reasons for this beyond the obvious time-intensiveness:
* Many data sets: Even if they're all about the same type of phenomenon, different data sets can have different properties that make different bandwidths more appropriate. We shouldn't assume that a bandwidth that works well on some of the data sets will work well on all of them. Notice above how the $h=14$ smooth is visibly worse for San Diego, containing many small wiggles due to short-term temperature fluctuations. A higher bandwidth could help remove those wiggles and provide a better smooth.
* Large data sets: It's difficult to visually inspect the smooth over data sets that cover a wide "x-value" range. The above graphs only show 4 years out of 30, so we don't know how well the smooth does in other years.
Methods for automated bandwidth selection are explored in the rest of this project.
# 4. Leave-One-Out Cross-Validation (LOOCV)
---
**Overview:** Leave-one-out cross-validation (LOOCV) chooses a bandwidth by simulating how well different bandwidths extend to new data. It falls under the training set / testing set paradigm, with each test set having size 1. LOOCV fails to select good bandwidths for temperature data: given a set of candidate bandwidths, it selects the smallest candidate for each data set.
---
Given a data set of size *n*, leave-one-out cross-validation performs *n* training set / testing set splits. As indicated by the name, each LOOCV testing set contains only one data point that has been left out of the corresponding training set.
LOOCV chooses among a set of candidate bandwidths by measuring how well smoothers with those bandwidths generalize from the training sets to the test sets. Generalization ability is typically measured using some error function, such as the mean-squared error (MSE) used in this project. For a given candidate bandwidth, the LOOCV algorithm iterates through the train/test splits. For a given split, it "trains" a smoother with that bandwidth on the training set and calculates the squared error when applied to the (singleton) test set.[^shortcut] The MSE across all splits is returned as the performance measure for that bandwidth. The candidate bandwidth with the smallest MSE is returned as the "optimal" choice.

[^shortcut]: Fortunately, a mathematical shortcut means each smoother doesn't actually have to be fit *n* times. See [2], p. 371.
We apply LOOCV to each temperature data set, testing integer candidate bandwidths from $h=1$ to $h=20$ (as we'll see, testing larger bandwidths is unnecessary).
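For a linear smoother whose fitted values can be written $\hat{\mathbf{y}} = S\mathbf{y}$ for a smoothing matrix $S$, the LOOCV criterion reduces to the closed form used in the `cross_validate` function below (this is the shortcut referenced in the footnote above):

$$\mathrm{CV}(h) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{s}_h(x_i)}{1 - S_{ii}} \right)^{2},$$

where $\hat{s}_h(x_i)$ is the smooth fit to the full data set at $x_i$ and $S_{ii}$ is the $i$th diagonal element of $S$.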
```{r}
squared_error = function(u) {
return(u^2)
}
cross_validate = function(X, Y, kernel_func, error_func, h) {
# kernel smoother
n = length(X)
S = get_S(X, kernel_func, h) # smoothing matrix
sk_hat = S %*% Y # Y_hat
rss = 0
for (i in 1:n) {
# components of Givens equation 11.18
num = Y[i] - sk_hat[i]
den = 1 - S[i, i]
# update rss
rss = rss + error_func(num/den)
}
return(rss/n)
}
```
## 4.1. Results
```{r}
# Austin
CVMSE1_aus = c()
for (bdw in 1:20) {
CVMSE1_aus = c(CVMSE1_aus, cross_validate(
time_idx1_aus, high1_aus, dnorm, squared_error, bdw))
}
```
```{r}
CVMSE1_sd = c()
for (bdw in 1:20) {
CVMSE1_sd = c(CVMSE1_sd, cross_validate(
time_idx1_sd, high1_sd, dnorm, squared_error, bdw))
}
```
```{r}
CVMSE1_mn = c()
for (bdw in 1:20) {
CVMSE1_mn = c(CVMSE1_mn, cross_validate(
time_idx1_mn, high1_mn, dnorm, squared_error, bdw))
}
```
```{r}
CVMSE1_bt = c()
for (bdw in 1:20) {
CVMSE1_bt = c(CVMSE1_bt, cross_validate(
time_idx1_bt, high1_bt, dnorm, squared_error, bdw))
}
```
Figure 7 shows the test-set MSE at each candidate bandwidth for each data set. The test-set MSE monotonically increases with bandwidth size, so the LOOCV algorithm selects the smallest candidate ($h=1$) for each city.
In other words, LOOCV fails to select good bandwidths for smoothing these data sets. As seen in Section 2, very small bandwidths barely smooth the data. By selecting the smallest candidate, LOOCV is simply trying to follow the data as closely as possible instead of smoothing it.
```{r}
plot_df_loocv = data.frame(h=seq(1, 20, by=1),
MSE=c(CVMSE1_aus, CVMSE1_sd,
CVMSE1_mn, CVMSE1_bt),
City=c(rep("Austin", 20),
rep("S.D.", 20),
rep("Minn.", 20),
rep("Balt.", 20)))
```
```{r fig.cap="LOOCV MSE by bandwidth"}
gg = ggplot(plot_df_loocv, aes(x=h, y=MSE)) +
geom_line() +
facet_wrap( ~ City, scales="free") +
labs(y="Mean Squared Error (MSE)", x="Bandwidth") +
theme(plot.title = element_text(hjust = 0.5))
print(gg)
```
# 5. Difficulties Posed By Temperature Data
---
**Overview:** Temperature data has at least two properties that pose difficulties for kernel smoothing and bandwidth selection:
* Non-Constant Variance
* Dependent Data
---
Understanding why LOOCV fails to smooth temperature data can help us pick better methods for automated bandwidth selection.
## 5.1. Non-Constant Variance
The smoothing methods examined up to now assume that the variance of the data is constant across its range. However, temperature data doesn't meet this assumption due to both long- and short-term factors. Climate change is an example of a long-term factor affecting temperature variance: as the climate gets further disrupted, we see more temperature extremes and fluctuations.
The seasons are the primary short-term factor affecting temperature variance. Take another look at some of the Austin data in Figure 1. We can see from this graph (and I know from personal experience) that temperature variance is much higher in the winter than the summer. Austin winters generally fluctuate between 40F and 80F, often within the span of a couple days, while Austin summers generally stay in the upper 90s to low 100s.
As a result, different bandwidths are appropriate in different seasons. Higher bandwidths seem better in winter since they'll give smooths that are less affected by the short-term fluctuations. Relatively lower bandwidths seem better in summer since they can better capture the peak heat (usually in August).
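One quick check of this claim is to compare the spread of the Austin daily highs by calendar month, as in the sketch below (using `data1_aus` from Section 1.1; the exact values depend on the station and time period, so the chunk is not evaluated here):

```{r eval=FALSE}
# standard deviation of Austin daily highs by calendar month
data1_aus %>%
  mutate(Month = format(DATE, "%m")) %>%
  group_by(Month) %>%
  summarize(SD_TMAX = sd(TMAX, na.rm = TRUE))
```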
## 5.2. Dependent Data
Most smoothing methods work better when deviations from the broader trend in the data are uncorrelated with each other. In the case of temperature data, this would occur if fluctuations around the broader trend were independent from one day to the next. The fluctuations would tend to balance each other out, allowing a smoother to easily cut through the middle of them.
Unfortunately, temperature data isn't like this. When temperatures deviate from the broader trend, they generally do so over the course of several days in a heat wave or cold spell. Abnormally hot days are likely to be followed by abnormally hot days; likewise for abnormally cold days.
Consider the February 1996 heat wave in Austin highlighted in Figure 8. Here, the positive deviations from the overall trend lasted about a week. Unless they have a very high bandwidth, smoothers won't be able to "cut through" this heat wave and stick to the broader trend. The prolonged spell of abnormally high temperatures will inevitably pull the smooth upwards, perhaps in dramatic fashion. Looking back at the $h=14$ smooth in Section 3, we can see that this is exactly what happens.
```{r fig.cap="Austin daily highs with Feb. 1996 heat wave"}
low=1000
high=1450
rect = data.frame(xmin=as.Date("1996-02-18"), xmax=as.Date("1996-02-27"), ymin=-Inf, ymax=Inf)
gg = ggplot(data1_aus[low:high,], aes(x=DATE, y=TMAX)) +
geom_line() +
geom_rect(data=rect, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax),
fill="yellow", alpha=0.4, inherit.aes=FALSE) +
labs(x="Date", y="High Temp. (F)") +
theme(plot.title = element_text(hjust = 0.5))
print(gg)
```
We get more evidence of the dependent nature of temperature data from the autocorrelation (ACF) plots of the residuals from the $h=14$ smooths on each city. The ACF plots in Figure 9 display the correlations between different lags of the residuals. For each city, the residuals are positively correlated with the lag-1 and lag-2 residuals. For Minneapolis, the residuals are even positively correlated with the lag-3 residuals.
```{r}
error1_aus = Y_hat1_aus - high1_aus
```
```{r}
error1_sd = Y_hat1_sd - high1_sd
```
```{r}
error1_mn = Y_hat1_mn - high1_mn
```
```{r}
error1_bt = Y_hat1_bt - high1_bt
```
```{r fig.cap="Autocorrelation plots for normal kernel smooth (h=14) residuals"}
par(mar=c(4,4,4,2), mfrow=c(2,2))
acf(error1_aus, lag=20, main="Austin", cex.main=1.5, cex.lab=1)
acf(error1_sd, lag=20, main="San Diego", cex.main=1.5, cex.lab=1)
acf(error1_mn, lag=20, main="Minneapolis", cex.main=1.5, cex.lab=1)
acf(error1_bt, lag=20, main="Baltimore", cex.main=1.5, cex.lab=1)
```
# 6. Super-Smoother
---
**Overview:** The super-smoother helps with the problem of non-constant variance by selecting different "optimal" bandwidths at different points. LOOCV again performs poorly as the bandwidth selection method, choosing the lowest candidate bandwidth at most points.
---
Up to this point, I've only discussed smoothers that use a single bandwidth across an entire data set. These can struggle when the data have non-constant variance, as different bandwidths are appropriate in different regions.
The super-smoother tries to solve this problem by allowing the bandwidth to vary across the data.[3] Like the previous LOOCV algorithm, the super-smoother is given a set of candidate bandwidths and computes the LOOCV error at each point for each bandwidth. A second smooth is then applied to these errors. At each point in the data set, the bandwidth with the smallest smoothed LOOCV error is selected as the "optimal" bandwidth at that point. A third smooth is applied to these optimal bandwidths to ease the transitions between different variance regions. Finally, the smoothed optimal bandwidths are used to calculate the output values of the super-smoother.
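At the final step, writing $\tilde{h}_i$ for the smoothed optimal bandwidth at point $i$ and $h_-$, $h_+$ for the nearest candidate bandwidths below and above it, the super-smoother's output interpolates the two corresponding smooths (this matches the last loop of the `super_smoother` function below):

$$\hat{s}(x_i) = p\, \hat{s}_{h_+}(x_i) + (1 - p)\, \hat{s}_{h_-}(x_i), \qquad p = \frac{\tilde{h}_i - h_-}{h_+ - h_-},$$

with $p$ set to 0 when $\tilde{h}_i$ coincides with a candidate bandwidth.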
Figure 10 (from [2]) shows an example non-constant variance data set with a super-smoother fit given 3 candidate bandwidths (0.05, 0.2, 0.5). Figure 11 (also from [2]) shows the optimal bandwidth selected by LOOCV at each point as well as the smooth applied to these bandwidths. Note that higher bandwidths are selected in higher-variance regions, allowing the smooth to cut through the noise, while lower bandwidths are selected in lower-variance regions, allowing the smooth to better capture the trend.
```{r echo=FALSE, fig.cap="Non-constant variance data with LOOCV super-smooth (solid line)", out.width = '100%'}
knitr::include_graphics("images/supersmoother_data.png")
```
```{r echo=FALSE, fig.cap="Selected bandwidths with smooth", out.width = '100%'}
knitr::include_graphics("images/supersmoother_bandwidths.png")
```
We test the LOOCV super-smoother on each data set. After some trial and error, we use $H = (7, 15, 23, 31, 39, 47, 55, 63, 71, 79, 87)$ as the set of candidate bandwidths.
The decision to use a set of 11 candidate bandwidths introduced a minor complication. The paper introducing the super-smoother uses a set of 3 candidate bandwidths and recommends using the second-smallest (i.e., the middle) of these when applying the second and third smooths to the LOOCV errors and optimal bandwidths, respectively (call the bandwidth of these smooths $h^*$). This recommendation is ambiguous for a larger set of candidates: do we take it literally and use the second-smallest candidate, or do we use the median? We'll test both options ($h^*=15$ and $h^*=47$).
```{r}
super_smoother = function(X, Y, kernel_func, H_array, error_estimate="loocv",
kb_window=2, error_bdw=NA) {
# H_array: the set of candidate bandwidths
# error_estimate: "loocv" or "kbcv"
# kb_window: if error_estimate=="kbcv", the half-width (one side) of the k-block window
# error_bdw: the bandwidth to use for the 2nd and 3rd smooths
m = length(H_array)
n = length(X)
k = (m+1)/2
if (is.na(error_bdw)) {
error_bdw = H_array[k]
}
# fixed "outer" smoother
S_star_h = get_S(X, kernel_func, error_bdw)
# get performance measures at each point for each candidate bandwidth
smooth_list = vector(mode="list", length=m)
p_hat_list = vector(mode="list", length=m) # smoothed performance measure
for (i in 1:m) {
h = H_array[i]
S_h = get_S(X, kernel_func, h) # smoothing matrix
sk_hat_h = S_h %*% Y
smooth_list[[h]] = sk_hat_h
if (error_estimate=="loocv") {
error_h = (Y - sk_hat_h)/(1 - diag(S_h))
}
else if (error_estimate=="kbcv") {
error_h = c()
for (j in 1:n) {
yj = Y[j]
Y_temp = Y
Y_temp[max((j-kb_window),1):min((j+kb_window),n)] = yj
sk_hatj = dot(c(S_h[1:n,j]), Y_temp)
num = Y[j] - sk_hatj
den = 1 - sum(S_h[max((j-kb_window),1):min((j+kb_window),n), j])
error_j = num/den
error_h = c(error_h, c(error_j))
}
}
g_h = abs(error_h)
# smooth errors
p_hat_h = S_star_h %*% g_h
p_hat_list[[i]] = p_hat_h
}
p_hat_matrix = matrix(unlist(p_hat_list), ncol=m)
smooth_matrix = matrix(unlist(smooth_list), ncol=m)
# pick best bandwidths
min_indices = argmin(p_hat_matrix)
h_min = c()
for (i in 1:n) {
h_min = c(h_min, H_array[min_indices[i]])
}
# smooth optimal bandwidths
smoothed_best_h = S_star_h %*% h_min
smoothed_best_h = round(smoothed_best_h, 6) # floats would go below min(H)
# use smoothed optimal bandwidths to estimate
sk_hat = c()
for (i in 1:n) {
h_i = smoothed_best_h[i] # smoothed optimal bandwidth
h_minus = max(H_array[H_array <= h_i]) # closest candidate bandwidth below
h_plus = min(H_array[H_array >= h_i]) # closest candidate bandwidth above
# interpolate higher and lower bandwidths
if (h_minus==h_plus) {
p = 0
}
else {
p = (h_i - h_minus)/(h_plus - h_minus)
}
sk_i = p*smooth_list[[h_plus]][i] + (1-p)*smooth_list[[h_minus]][i]
sk_hat = c(sk_hat, c(sk_i))
}
return(list(sk_hat, h_min, smoothed_best_h))
}
```
## 6.1 Results
```{r}
H_array = c(7,15,23,31,39,47,55,63,71,79,87)
```
```{r}
# Austin
smooth_list1_aus = super_smoother(time_idx1_aus, high1_aus, dnorm,
H_array, error_estimate="loocv")
```
```{r}
Y_hat1_aus = smooth_list1_aus[[1]]
h_min1_aus = smooth_list1_aus[[2]]
smoothed_h_min1_aus = smooth_list1_aus[[3]]
```
```{r}
mid_mean_aus = mean(smoothed_h_min1_aus)
```
```{r}
smooth_list1_aus = super_smoother(time_idx1_aus, high1_aus, dnorm, H_array,
error_estimate="loocv", error_bdw=H_array[2])
```
```{r}
Y_hat1_aus = smooth_list1_aus[[1]]
h_min1_aus = smooth_list1_aus[[2]]
smoothed_h_min1_aus = smooth_list1_aus[[3]]
```
```{r}
low_mean_aus = mean(smoothed_h_min1_aus)
```
```{r}
data1_aus$Y_hat = Y_hat1_aus
```
```{r}
# San Diego
smooth_list1_sd = super_smoother(time_idx1_sd, high1_sd, dnorm, H_array,
error_estimate="loocv")
```
```{r}
Y_hat1_sd = smooth_list1_sd[[1]]
h_min1_sd = smooth_list1_sd[[2]]
smoothed_h_min1_sd = smooth_list1_sd[[3]]
```
```{r}
mid_mean_sd = mean(smoothed_h_min1_sd)
```
```{r}
smooth_list1_sd = super_smoother(time_idx1_sd, high1_sd, dnorm, H_array,
error_estimate="loocv", error_bdw=H_array[2])
```
```{r}
Y_hat1_sd = smooth_list1_sd[[1]]
h_min1_sd = smooth_list1_sd[[2]]
smoothed_h_min1_sd = smooth_list1_sd[[3]]
```
```{r}
low_mean_sd = mean(smoothed_h_min1_sd)
```
```{r}
data1_sd$Y_hat = Y_hat1_sd
```
```{r}
# Minneapolis
smooth_list1_mn = super_smoother(time_idx1_mn, high1_mn, dnorm, H_array,
error_estimate="loocv")
```
```{r}
Y_hat1_mn = smooth_list1_mn[[1]]
h_min1_mn = smooth_list1_mn[[2]]
smoothed_h_min1_mn = smooth_list1_mn[[3]]
```
```{r}
mid_mean_mn = mean(smoothed_h_min1_mn)
```
```{r}
smooth_list1_mn = super_smoother(time_idx1_mn, high1_mn, dnorm, H_array,
error_estimate="loocv", error_bdw=H_array[2])
```
```{r}
Y_hat1_mn = smooth_list1_mn[[1]]
h_min1_mn = smooth_list1_mn[[2]]
smoothed_h_min1_mn = smooth_list1_mn[[3]]
```
```{r}
low_mean_mn = mean(smoothed_h_min1_mn)
```
```{r}
data1_mn$Y_hat = Y_hat1_mn
```
```{r}
# Baltimore
smooth_list1_bt = super_smoother(time_idx1_bt, high1_bt, dnorm, H_array,
error_estimate="loocv")
```
```{r}
Y_hat1_bt = smooth_list1_bt[[1]]
h_min1_bt = smooth_list1_bt[[2]]
smoothed_h_min1_bt = smooth_list1_bt[[3]]
```
```{r}
mid_mean_bt = mean(smoothed_h_min1_bt)
```
```{r}
smooth_list1_bt = super_smoother(time_idx1_bt, high1_bt, dnorm, H_array,
error_estimate="loocv", error_bdw=H_array[2])
```
```{r}
Y_hat1_bt = smooth_list1_bt[[1]]
h_min1_bt = smooth_list1_bt[[2]]
smoothed_h_min1_bt = smooth_list1_bt[[3]]
```
```{r}
low_mean_bt = mean(smoothed_h_min1_bt)
```
```{r}
data1_bt$Y_hat = Y_hat1_bt
```
Unfortunately, the super-smoother with LOOCV bandwidth selection also mostly fails to give good smooths of the data. As seen in Table 2, for each city and choice of $h^*$, the mean of the bandwidths chosen by the super-smoother across the data set is fairly close to 7, the smallest candidate bandwidth. In other words, the LOOCV super-smoother selects the smallest possible bandwidth as "optimal" at most points in the data set. (Sets of candidate bandwidths containing even smaller values gave the same result.)
```{r results='asis'}
table_df = data.frame(City=c("Austin", "San Diego", "Minneapolis", "Baltimore"),
mid_mean=c(mid_mean_aus,mid_mean_sd,mid_mean_mn,mid_mean_bt),
low_mean=c(low_mean_aus,low_mean_sd,low_mean_mn,low_mean_bt))
kable(table_df, caption="Mean LOOCV Super-Smoother Bandwidth by City",
col.names=c("City", "Mean Bandwidth (h*=47)", "Mean Bandwidth (h*=15)"))
```
Setting $h^*$ to the smaller value of 15 gives a better smooth than $h^*=47$, but not by much. Figure 12 displays parts of the smooths given by the LOOCV super-smoother with $h^* = 15$. The smooths are generally very choppy with only a few bright spots (e.g. Fall 1996 and Spring 1997 in Austin, Fall 1997 in San Diego). If lower bandwidths had been included in the set of candidates, the smooths would be even worse.
```{r}
plot_df$Y_hat = c(Y_hat1_aus, Y_hat1_sd, Y_hat1_mn, Y_hat1_bt)
plot_df_trunc = plot_df[(plot_df$Date >= "1995-10-01") & (plot_df$Date <= "1999-09-30"),]
```
```{r fig.cap="Daily High Temperatures with LOOCV Super-Smooth (h*=15)"}
gg = ggplot(plot_df_trunc) +
geom_line(aes(x=Date, y=Daily_High_Temp), color="gray") +
geom_line(aes(x=Date, y=Y_hat), color="red") +
facet_wrap( ~ City, scales="free") +
labs(x="Date", y="High Temp. (F)") +
theme(plot.title = element_text(hjust = 0.5))
print(gg)
```
# 7. K-Block Cross-Validation
---
**Overview:** Using *k*-block cross-validation (KBCV) instead of LOOCV helps with dependent data. The super-smoother that uses KBCV for bandwidth selection produces good smoothing results.
---
The super-smoother is designed to overcome difficulties posed by data with non-constant variance, but it still fails on temperature data when using LOOCV to select bandwidths. This is due in part to the dependent nature of temperature data. If the high temperature on a given day deviates from the broader trend, the highs on days shortly before and after are likely to deviate from the broader trend in the same way. In LOOCV, surrounding days are used to predict the day that has been left out. As a result, the LOOCV prediction of the high temperature on a given day contains information about the broader trend we're trying to capture (good) but also information about how that day deviates from the trend (bad).
*K*-block cross-validation (KBCV) is a LOOCV alternative designed to help with this problem (the original authors call the technique *h*-block cross-validation, but I use "*k*" instead to avoid confusion with bandwidth parameters).[4] As with LOOCV, KBCV uses singleton test sets. The difference is in the training sets: LOOCV uses all $n-1$ remaining points in the training set, whereas KBCV uses $n - 2k - 1$ training points for a user-specified window parameter $k$. In addition to the testing point, KBCV training sets leave out a window of $k$ points on either side.
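In terms of the smoothing weights, one way to write the KBCV prediction at point $j$ is to drop the weights on the block $B_j = \{j-k, \dots, j+k\}$ and renormalize the remaining weights (up to the handling of the series endpoints, this is what the `kbcv` branch of the `super_smoother` function in Section 6 computes):

$$\hat{s}^{(-B_j)}(x_j) = \frac{\sum_{i \notin B_j} S_{ji}\, y_i}{\sum_{i \notin B_j} S_{ji}} = \frac{\hat{s}(x_j) - \sum_{i \in B_j} S_{ji}\, y_i}{1 - \sum_{i \in B_j} S_{ji}},$$

and the cross-validation residual at point $j$ is $y_j - \hat{s}^{(-B_j)}(x_j)$.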
We test KBCV with a window parameter $k=3$, which leaves out a total of 7 points (the testing point, 3 points before, and 3 points after) from each training set. We set $k=3$ because the autocorrelation plots in Section 5.2 gave evidence of positive residual correlation up to lag 3. The super-smoother bandwidth used to smooth the cross-validation errors and optimal bandwidths was set to $h^*=15$ since this gave better LOOCV results.
```{r}
kb_window = 3
error_bdw = H_array[2]
```
## 7.1 Results
```{r}
smooth_list1_aus = super_smoother(time_idx1_aus, high1_aus, dnorm, H_array,
error_estimate="kbcv", kb_window=kb_window,
error_bdw=error_bdw)
```
```{r}
Y_hat1_aus = smooth_list1_aus[[1]]
h_min1_aus = smooth_list1_aus[[2]]
smoothed_h_min1_aus = smooth_list1_aus[[3]]
data1_aus$Y_hat_kb = Y_hat1_aus
data1_aus$h_min_kb = h_min1_aus
data1_aus$smoothed_h_min_kb = smoothed_h_min1_aus
```
```{r}
mean_aus = mean(smoothed_h_min1_aus)
```
```{r}
# San Diego
smooth_list1_sd = super_smoother(time_idx1_sd, high1_sd, dnorm, H_array,
error_estimate="kbcv", kb_window=kb_window,
error_bdw=error_bdw)
```
```{r}
Y_hat1_sd = smooth_list1_sd[[1]]
h_min1_sd = smooth_list1_sd[[2]]
smoothed_h_min1_sd = smooth_list1_sd[[3]]
data1_sd$Y_hat_kb = Y_hat1_sd
data1_sd$h_min_kb = h_min1_sd
data1_sd$smoothed_h_min_kb = smoothed_h_min1_sd
```
```{r}
mean_sd = mean(smoothed_h_min1_sd)
```
```{r}
# Minneapolis
smooth_list1_mn = super_smoother(time_idx1_mn, high1_mn, dnorm, H_array,
error_estimate="kbcv", kb_window=kb_window,
error_bdw=error_bdw)
```
```{r}
Y_hat1_mn = smooth_list1_mn[[1]]
h_min1_mn = smooth_list1_mn[[2]]
smoothed_h_min1_mn = smooth_list1_mn[[3]]
data1_mn$Y_hat_kb = Y_hat1_mn
data1_mn$h_min_kb = h_min1_mn
data1_mn$smoothed_h_min_kb = smoothed_h_min1_mn
```
```{r}
mean_mn = mean(smoothed_h_min1_mn)
```
```{r}
# Baltimore
smooth_list1_bt = super_smoother(time_idx1_bt, high1_bt, dnorm, H_array,
error_estimate="kbcv", kb_window=kb_window,
error_bdw=error_bdw)
```
```{r}
Y_hat1_bt = smooth_list1_bt[[1]]
h_min1_bt = smooth_list1_bt[[2]]
smoothed_h_min1_bt = smooth_list1_bt[[3]]
data1_bt$Y_hat_kb = Y_hat1_bt
data1_bt$h_min_kb = h_min1_bt
data1_bt$smoothed_h_min_kb = smoothed_h_min1_bt
```
```{r}
mean_bt = mean(smoothed_h_min1_bt)
```
Table 3 displays the means of the bandwidths chosen by the KBCV super-smoother across each data set. With KBCV instead of LOOCV as the bandwidth selection method, the super-smoother tends to select much higher bandwidths.
```{r results='asis'}
table_df = data.frame(City=c("Austin", "San Diego", "Minneapolis", "Baltimore"),
mid_mean=c(mean_aus, mean_sd, mean_mn, mean_bt))
kable(table_df, caption="Mean KBCV Super-Smoother Bandwidth by City",
col.names=c("City", "Mean Bandwidth (h*=15, k=3)"))
```
The generally higher bandwidths lead to much-improved smooths, which can be seen in Figures 13 and 14. The smooths are much less wiggly than the LOOCV super-smooths in Figure 12 (Section 6.1), doing a better job of capturing broader seasonal trends instead of short-term deviations. For instance, the LOOCV smooth for Austin in Winter 1995-1996 contains 3 prominent local minima and is yanked upwards by the February 1996 heat wave. The KBCV smooth, on the other hand, contains only a single minimum in that period and is less sensitive to the heat wave.
Figures 13 and 14 also display the bandwidths selected by the super-smoother at each point (along with the $h^*=15$ smooth applied to them). We can see some seasonality in the bandwidth selections that matches the advertised properties of the super-smoother. Higher bandwidths tend to be selected in higher-variance periods (e.g. Austin winters, Baltimore springs) and lower bandwidths tend to be selected in lower-variance periods (e.g. Austin summers).
```{r}
plot_df_temp_as = data.frame(TMAX=c(data1_aus$TMAX, data1_sd$TMAX),
DATE=c(data1_aus$DATE, data1_sd$DATE),
Y_hat_kb = c(data1_aus$Y_hat_kb, data1_sd$Y_hat_kb),
h_min_kb = c(data1_aus$h_min_kb, data1_sd$h_min_kb),
smoothed_h_min_kb = c(data1_aus$smoothed_h_min_kb,
data1_sd$smoothed_h_min_kb),
City=c(rep("Austin", n1_aus),
rep("S.D.", n1_sd)))
```
```{r fig.cap="Above: daily highs with KBCV super-smooth (h*=15, k=3). Below: KBCV super-smoother selected bandwidths."}
plot_df_temp_as_trunc = plot_df_temp_as[(plot_df_temp_as$DATE >= "1995-10-01") & (
plot_df_temp_as$DATE <= "1999-09-30"),]
gg1 = ggplot(plot_df_temp_as_trunc) +
geom_line(aes(x=DATE, y=TMAX), color="gray") +
geom_line(aes(x=DATE, y=Y_hat_kb), color="red") +
facet_wrap( ~ City, scales="free") +
labs(x="", y="High Temp. (F)") +
theme(plot.title = element_text(hjust = 0.5))
gg2 = ggplot(plot_df_temp_as_trunc) +
geom_point(aes(x=DATE, y=h_min_kb), color="black") +
geom_line(aes(x=DATE, y=smoothed_h_min_kb), color="gray") +
facet_wrap( ~ City) +
labs(x="Date", y="Selected Bandwidth") +
theme(plot.title = element_text(hjust = 0.5))
plot_grid(gg1, gg2, nrow=2, ncol=1)
```
```{r}
plot_df_temp_mb = data.frame(TMAX=c(data1_mn$TMAX, data1_bt$TMAX),
DATE=c(data1_mn$DATE, data1_bt$DATE),
Y_hat_kb = c(data1_mn$Y_hat_kb, data1_bt$Y_hat_kb),
h_min_kb = c(data1_mn$h_min_kb, data1_bt$h_min_kb),
smoothed_h_min_kb = c(data1_mn$smoothed_h_min_kb,
data1_bt$smoothed_h_min_kb),
City=c(rep("Minneapolis", n1_mn),
rep("Baltimore", n1_bt)))
```
```{r fig.cap="Above: daily highs with KBCV super-smooth (h*=15, k=3). Below: KBCV super-smoother selected bandwidths."}
plot_df_temp_mb_trunc = plot_df_temp_mb[(plot_df_temp_mb$DATE >= "1995-10-01") & (
plot_df_temp_mb$DATE <= "1999-09-30"),]
gg1 = ggplot(plot_df_temp_mb_trunc) +
geom_line(aes(x=DATE, y=TMAX), color="gray") +
geom_line(aes(x=DATE, y=Y_hat_kb), color="red") +
facet_wrap( ~ City, scales="free") +
labs(x="", y="High Temp. (F)") +
theme(plot.title = element_text(hjust = 0.5))
gg2 = ggplot(plot_df_temp_mb_trunc) +
geom_point(aes(x=DATE, y=h_min_kb), color="black") +
geom_line(aes(x=DATE, y=smoothed_h_min_kb), color="gray") +
facet_wrap( ~ City) +
labs(x="Date", y="Selected Bandwidth") +
theme(plot.title = element_text(hjust = 0.5))
plot_grid(gg1, gg2, nrow=2, ncol=1)
```
# 8. Discussion and Further Research
In this project, we examined automated bandwidth selection methods for kernel smoothing on daily temperature data. We discussed some problematic properties of this data that cause standard methods to fail. To help with the problem of non-constant variance, we moved from regular linear smoothers to the super-smoother. To help with the problem of dependent data, we moved from leave-one-out cross-validation to *k*-block cross-validation. We showed that the KBCV super-smoother does a good job smoothing temperature data across a wide variety of climates.
The KBCV super-smoother uses several hyperparameters: the set of candidate bandwidths, the bandwidths of the two higher-level smooths, and the window size *k* for *k*-block cross-validation. Tuning these hyperparameters was only briefly discussed in this project and is a good topic of further research.
# 9. References
[1] NOAA Climate Data Online: https://www.ncei.noaa.gov/cdo-web/
[2] Givens and Hoeting (2012), *Computational Statistics*, 2nd ed.
[3] Friedman (1984), "A Variable Span Smoother."
[4] Burman, Chow, and Nolan (1994), "A Cross-Validatory Method for Dependent Data."