---
title: "Smoothing Temperature Data"
author: "Steve Bischoff"
date: "2024-2-22"
output:
  pdf_document: default
  html_document: default
---
# Contents
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
```{r include=FALSE, warning=FALSE}
suppressWarnings(library(knitr))
suppressWarnings(library(ggplot2))
suppressWarnings(library(cowplot))
suppressWarnings(library(reshape2))
suppressWarnings(library(tseries))
suppressWarnings(library(tidyverse))
suppressWarnings(library(ks))
suppressWarnings(library(ramify))
suppressWarnings(library(pracma))
```
```{r include=FALSE}
data_aus = read.csv('data/daily_summaries_austin.csv', header=TRUE)
data_aus$DATE = as.Date(data_aus$DATE)
```
```{r include=FALSE}
data_sd = read.csv('data/daily_summaries_sandiego.csv', header=TRUE) %>% drop_na(TMAX)
data_sd$DATE = as.Date(data_sd$DATE)
```
```{r include=FALSE}
data_mn = read.csv('data/daily_summaries_minneapolis.csv', header=TRUE)
data_mn$DATE = as.Date(data_mn$DATE)
```
```{r include=FALSE}
data_bt = read.csv('data/daily_summaries_baltimore.csv', header=TRUE)
data_bt$DATE = as.Date(data_bt$DATE)
```
```{r include=FALSE}
data1_aus = data_aus[(data_aus$DATE >= "1993-01-01") & (data_aus$DATE <= "2022-12-31"),]
high1_aus = data1_aus$TMAX
n1_aus = length(high1_aus)
```
```{r include=FALSE}
data1_sd = data_sd[(data_sd$DATE >= "1993-01-01") & (data_sd$DATE <= "2022-12-31"),]
high1_sd = data1_sd$TMAX
n1_sd = length(high1_sd)
```
```{r include=FALSE}
data1_mn = data_mn[(data_mn$DATE >= "1993-01-01") & (data_mn$DATE <= "2022-12-31"),]
high1_mn = data1_mn$TMAX
n1_mn = length(high1_mn)
```
```{r include=FALSE}
data1_bt = data_bt[(data_bt$DATE >= "1993-01-01") & (data_bt$DATE <= "2022-12-31"),]
high1_bt = data1_bt$TMAX
n1_bt = length(high1_bt)
```
```{r include=FALSE}
time_idx1_aus = 1:n1_aus
date1_aus = data1_aus$DATE
```
```{r include=FALSE}
time_idx1_sd = 1:n1_sd
date1_sd = data1_sd$DATE
```
```{r include=FALSE}
time_idx1_mn = 1:n1_mn
date1_mn = data1_mn$DATE
```
```{r include=FALSE}
time_idx1_bt = 1:n1_bt
date1_bt = data1_bt$DATE
```
# 1. Introduction
Smoothing methods are a popular way to identify the trend(s) in noisy data without having to construct a model of the data-generating process. Temperature data are a particularly appealing application of these methods. Figure 1 shows daily high temperatures in Austin, Texas in the late 90s.[1]
```{r fig.cap="Austin daily high temperatures"}
low=1000
high=2500
gg = ggplot(data1_aus[low:high,], aes(x=DATE, y=TMAX)) + geom_line() + labs(x="Date", y="High Temp. (F)") + theme(plot.title = element_text(hjust = 0.5))
print(gg)
```
The graph is visually unappealing due to day-to-day temperature fluctuations. This noise also makes it difficult to generalize and identify trends. As just one example: the winter of 1996-1997 clearly has the lowest lows of these years, but was that cold spell an aberration, or was that winter truly the coldest overall?
This project investigates the application of automated smoothing methods to temperature data. Temperature data has several features that complicate the task. These features are detailed in Section 5, and potential solutions are discussed in subsequent sections. We find that *Super-Smoothing with K-Block Cross-Validation* does a pretty good job of smoothing the data where other methods fail.
## 1.1 Data
The data used in this project comes from the National Oceanic and Atmospheric Administration (NOAA).[1] We can use the search tool to find stations that have had 100% Daily Summary coverage for many decades.
In order to investigate temperature smoothing across a wide variety of climates, we'll use data from 4 different cities. Table 1 provides a summary.
```{r results='asis'}
table_df = data.frame(City=c("Austin", "San Diego", "Minneapolis", "Baltimore"),
StationID=c("USW00013958","USW00023188","USW00014922","USW00093721"),
FirstDate=c("1938-06-01","1939-07-01","1938-04-09","1939-07-01"))
kable(table_df, caption="Data source descriptions",
col.names=c("City", "Station ID", "First Available"))
```
To make computation more manageable, we truncate each data set to 1993 - 2022. Figures 2 and 3 display time series and box plots of the truncated data sets.
```{r}
plot_df = data.frame(Daily_High_Temp=c(data1_aus$TMAX, data1_sd$TMAX,
data1_mn$TMAX, data1_bt$TMAX),
Date=c(data1_aus$DATE, data1_sd$DATE,
data1_mn$DATE, data1_bt$DATE),
City=c(rep("Austin", n1_aus),
rep("S.D.", n1_sd),
rep("Minn.", n1_mn),
rep("Balt.", n1_bt)))
plot_df$Date = as.Date(plot_df$Date)
```
```{r fig.cap="Daily high temperatures 1993-2002"}
gg = ggplot(plot_df, aes(x=Date, y=Daily_High_Temp)) +
geom_line() + facet_wrap( ~ City) +
labs(y="High Temp. (F)") +
theme(plot.title = element_text(hjust = 0.5))
print(gg)
```
```{r fig.cap="Daily high boxplots by city"}
par(mar=c(5,5,4,2))
boxplot(Daily_High_Temp ~ City, plot_df, cex.lab=1, cex.axis=1, ylab="High Temp. (F)")
```
# 2. Kernel Smoothing
---
**Overview:** Kernel smoothers apply a weighted average to the data. They take a *bandwidth* parameter that controls how smooth the output curve is.
---
Smoothing methods try to find a good *smoothing function* that cuts through the noise to give the underlying trend. The value of a smoothing function at a point is generally calculated as some type of weighted average of the values at surrounding points. For instance, the temperature given by the smoothing function on Oct. 1, 1996 is a weighted average of the actual temperatures on and around that date.
Perhaps the simplest smoother is the *constant-span running mean* ([2] p. 366). This smoother is given a parameter $k$ that defines a window size for taking the mean. Suppose we apply the constant-span running mean with $k = 7$ to the temperature data. To calculate the smoothing function on Oct. 1, 1996, we take the mean temperature of the $\frac{k-1}{2} = 3$ days before, the day itself, and the $\frac{k-1}{2} = 3$ days after: in other words, we calculate the mean high temperature from Sep. 28 to Oct. 4.
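A minimal sketch of this smoother in R (assuming a numeric vector `y` of daily highs and an odd window size `k`; for simplicity the ends of the series are left unsmoothed):

```{r eval=FALSE}
running_mean = function(y, k = 7) {
  # constant-span running mean: average the k observations centered at each point
  half = (k - 1) / 2
  n = length(y)
  out = rep(NA, n)
  for (i in (half + 1):(n - half)) {
    out[i] = mean(y[(i - half):(i + half)])
  }
  return(out)
}
# e.g. running_mean(high1_aus, k = 7)
```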
While its simplicity is attractive, the constant-span running mean has some problems: the window parameter defines a sharp cutoff for the points it considers, and points at the edge of the window are given as much weight as points in the middle.
The smoothing method used in this project is known as *kernel smoothing* and tries to fix these problems. A *kernel* is a function that allows kernel smoothers to take truly *weighted* averages. When calculating the smooth function at a given point, the kernel weights observations based on how close they are to that point: the closer the observation, the higher the weight. This project uses a *normal kernel* that assigns weights around a given point in the bell shape of a normal probability distribution.
The effect of distance on weighting is controlled by the all-important *bandwidth* parameter $h$. If $h$ is set to a low value, observations close to the given point are weighted very highly, and weights steeply decrease as distance from the point increases. Conversely, if $h$ is set to a high value, close points are given relatively less weight, and weights decrease less steeply with distance.
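Concretely, with the normal kernel used here, the smoothed value at a point $x_0$ is a weighted average of all observations, with weights given by a normal density centered at $x_0$ (this is the row-normalized form computed by the `get_S` function defined below):

$$\hat{s}(x_0) = \sum_{i=1}^{n} w_i(x_0)\, y_i, \qquad w_i(x_0) = \frac{\phi\!\left(\frac{x_i - x_0}{h}\right)}{\sum_{j=1}^{n} \phi\!\left(\frac{x_j - x_0}{h}\right)},$$

where $\phi$ is the standard normal density and $h$ is the bandwidth.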
Suppose again that we're calculating the smoothed temperature on Oct. 1, 1996. Figure 4 shows how the temperatures on days around this date are weighted given different bandwidth parameters:
```{r}
get_S = function(X, kernel_func, h) {
# Get smoothing matrix with given kernel and bandwidth
X_diff = outer(X, X, FUN="-")
Z = X_diff/h
S_temp = kernel_func(Z)
S = S_temp/rowSums(S_temp)
return(S)
}
kernel_smoother = function(X, Y, kernel_func, h) {
S = get_S(X, kernel_func, h) # smoothing matrix
sk_hat = S %*% Y
return(sk_hat)
}
```
```{r}
data1_aus$s_3 = dnorm((1370-time_idx1_aus)/3)/sum(dnorm((1370-time_idx1_aus)/3))
data1_aus$s_7 = dnorm((1370-time_idx1_aus)/7)/sum(dnorm((1370-time_idx1_aus)/7))
data1_aus$s_14 = dnorm((1370-time_idx1_aus)/14)/sum(dnorm((1370-time_idx1_aus)/14))
data1_aus$s_28 = dnorm((1370-time_idx1_aus)/28)/sum(dnorm((1370-time_idx1_aus)/28))
data1_aus$s_42 = dnorm((1370-time_idx1_aus)/42)/sum(dnorm((1370-time_idx1_aus)/42))
```
```{r fig.cap="Normal Kernel Weights by Bandwidth around 1996/10/01"}
df_temp = melt(data1_aus[1336:1404,c("DATE","s_3","s_7","s_14","s_28","s_42")], id="DATE")
gg = ggplot(df_temp) + geom_line(aes(x=DATE, y=value, color=variable)) + scale_color_discrete(name="Bandwidth", labels=c(3,7,14,28,42)) + labs(x="Date", y="Normal Kernel Weight") + theme(plot.title = element_text(hjust = 0.5))
print(gg)
```
Given a bandwidth $h=3$, the smoothed temperature on Oct. 1 mostly depends on the temperatures in a small window around that date. Given a bandwidth $h=28$ or $h=42$, the smoothed temperature on Oct. 1 gives non-negligible weight to temperatures in August and November.
As seen in Figure 5, the bandwidth has a significant impact on the smooth:
```{r}
Y_hat1_aus = kernel_smoother(time_idx1_aus, high1_aus, dnorm, 3)
data1_aus$Y_hat_3 = Y_hat1_aus
```
```{r}
Y_hat1_aus = kernel_smoother(time_idx1_aus, high1_aus, dnorm, 42)
data1_aus$Y_hat_42 = Y_hat1_aus
```
```{r fig.cap="Austin Daily Highs with two smooths"}
df_temp = suppressWarnings(melt(data1_aus[(data1_aus$DATE >= "1995-10-01") & (data1_aus$DATE <= "1999-09-30"),c("DATE","TMAX","Y_hat_3","Y_hat_42")], id="DATE"))
gg = suppressWarnings(ggplot(df_temp) + geom_line(aes(x=DATE, y=value, color=variable)) + labs(x="Date", y="High Temp. (F)") + scale_color_manual("Bandwidth", values=c("gray","red","blue"), labels=c("N/A", "3", "42")) + theme(plot.title = element_text(hjust = 0.5)))
print(gg)
```
The kernel smoother with $h=3$ overfits the data, following its fluctuations too closely and failing to provide much of a smooth at all. The kernel smoother with $h=42$ underfits the data, failing to fully capture the seasonal extremes in summer and winter.
# 3. Bandwidth Selection
---
**Overview:** Automated bandwidth selection methods scale better, so they're preferred to manual selection when we need to smooth many data sets and/or large data sets.
---
Selecting an appropriate bandwidth is crucially important for getting a good smooth of the data (it's generally viewed as much more important than the choice of kernel function). We can divide methods of bandwidth selection into two broad categories, manual and automated.
## 3.1 Manual Bandwidth Selection
A common way to manually select a bandwidth is by trial and error with visual inspection of the smooths given by different bandwidths. We plot various smooths over the data and see how well they capture the overall trend(s).
Figure 6 displays plots of a normal kernel smooth with $h = 14$ for all 4 cities:
```{r}
Y_hat1_aus = kernel_smoother(time_idx1_aus, high1_aus, dnorm, 14)
data1_aus$Y_hat = Y_hat1_aus
```
```{r}
Y_hat1_sd = kernel_smoother(time_idx1_sd, high1_sd, dnorm, 14)
data1_sd$Y_hat = Y_hat1_sd
```
```{r}
Y_hat1_mn = kernel_smoother(time_idx1_mn, high1_mn, dnorm, 14)
data1_mn$Y_hat = Y_hat1_mn
```
```{r}
Y_hat1_bt = kernel_smoother(time_idx1_bt, high1_bt, dnorm, 14)
data1_bt$Y_hat = Y_hat1_bt
```
```{r}
plot_df$Y_hat = c(Y_hat1_aus, Y_hat1_sd, Y_hat1_mn, Y_hat1_bt)
plot_df_trunc = plot_df[(plot_df$Date >= "1995-10-01") & (plot_df$Date <= "1999-09-30"),]
```
```{r fig.cap="Daily highs with normal kernel smooth (h=14)"}
gg = ggplot(plot_df_trunc) +
geom_line(aes(x=Date, y=Daily_High_Temp), color="gray") +
geom_line(aes(x=Date, y=Y_hat), color="red") +
facet_wrap( ~ City, scales="free") +
labs(x="Date", y="High Temp. (F)") +
theme(plot.title = element_text(hjust = 0.5))
print(gg)
```
By visual inspection, a bandwidth $h=14$ seems to do a pretty good job of smoothing the data, especially for Baltimore and Minneapolis.
## 3.2 Automated Bandwidth Selection
A main problem with manual bandwidth selection is that it doesn't scale well in situations with many data sets and/or large data sets. There are reasons for this beyond the obvious time-intensiveness:
* Many data sets: Even if they're all about the same type of phenomenon, different data sets can have different properties that make different bandwidths more appropriate. We shouldn't assume that a bandwidth that works well on some of the data sets will work well on all of them. Notice above how the $h=14$ smooth is visibly worse for San Diego, containing many small wiggles due to short-term temperature fluctuations. A higher bandwidth could help remove those wiggles and provide a better smooth.
* Large data sets: It's difficult to visually inspect the smooth over data sets that cover a wide "x-value" range. The above graphs only show 4 years out of 30, so we don't know how well the smooth does in other years.
Methods for automated bandwidth selection are explored in the rest of this project.
# 4. Leave-One-Out Cross-Validation (LOOCV)
---
**Overview:** Leave-one-out cross-validation (LOOCV) chooses a bandwidth by simulating how well different bandwidths extend to new data. It falls under the training set / testing set paradigm, with each test set having size 1. LOOCV fails to select good bandwidths for temperature data: given a set of candidate bandwidths, it selects the smallest candidate for each data set.
---
Given a data set of size *n*, leave-one-out cross-validation performs *n* training set / testing set splits. As indicated by the name, each LOOCV testing set contains only one data point that has been left out of the corresponding training set.
LOOCV chooses among a set of candidate bandwidths by measuring how well smoothers with those bandwidths generalize from the training sets to the test sets. Generalization ability is typically measured using some error function, such as the mean-squared error (MSE) used in this project. For a given candidate bandwidth, the LOOCV algorithm iterates through the train/test splits. For a given split, it "trains" a smoother with that bandwidth on the training set and calculates the squared error when applied to the (singleton) test set.[^shortcut] The MSE across all splits is returned as the performance measure for that bandwidth. The candidate bandwidth with the smallest MSE is returned as the "optimal" choice.

[^shortcut]: Fortunately, a mathematical shortcut means each smoother doesn't actually have to be fit *n* times. See [2], p. 371.
We apply LOOCV to each temperature data set, testing integer candidate bandwidths from $h=1$ to $h=20$ (as we'll see, testing larger bandwidths is unnecessary).
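For a linear smoother whose fitted values can be written $\hat{\mathbf{y}} = S\mathbf{y}$ for a smoothing matrix $S$, the LOOCV criterion reduces to the closed form used in the `cross_validate` function below (this is the shortcut referenced in the footnote above):

$$\mathrm{CV}(h) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{s}_h(x_i)}{1 - S_{ii}} \right)^{2},$$

where $\hat{s}_h(x_i)$ is the smooth fit to the full data set at $x_i$ and $S_{ii}$ is the $i$th diagonal element of $S$.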
```{r}
squared_error = function(u) {
return(u^2)
}
cross_validate = function(X, Y, kernel_func, error_func, h) {
# kernel smoother
n = length(X)
S = get_S(X, kernel_func, h) # smoothing matrix
sk_hat = S %*% Y # Y_hat
rss = 0
for (i in 1:n) {
# components of Givens equation 11.18
num = Y[i] - sk_hat[i]
den = 1 - S[i, i]
# update rss
rss = rss + error_func(num/den)
}
return(rss/n)
}
```
## 4.1. Results
```{r}
# Austin
CVMSE1_aus = c()
for (bdw in 1:20) {
CVMSE1_aus = c(CVMSE1_aus, cross_validate(
time_idx1_aus, high1_aus, dnorm, squared_error, bdw))
}
```
```{r}
CVMSE1_sd = c()
for (bdw in 1:20) {
CVMSE1_sd = c(CVMSE1_sd, cross_validate(
time_idx1_sd, high1_sd, dnorm, squared_error, bdw))
}
```
```{r}
CVMSE1_mn = c()
for (bdw in 1:20) {
CVMSE1_mn = c(CVMSE1_mn, cross_validate(
time_idx1_mn, high1_mn, dnorm, squared_error, bdw))
}
```
```{r}
CVMSE1_bt = c()
for (bdw in 1:20) {
CVMSE1_bt = c(CVMSE1_bt, cross_validate(
time_idx1_bt, high1_bt, dnorm, squared_error, bdw))
}
```
Figure 7 shows the test-set MSE at each candidate bandwidth for each data set. The test-set MSE monotonically increases with bandwidth size, so the LOOCV algorithm selects the smallest candidate ($h=1$) for each city.
In other words, LOOCV fails to select good bandwidths for smoothing these data sets. As seen in Section 2, very small bandwidths barely smooth the data. By selecting the smallest candidate, LOOCV is simply trying to follow the data as closely as possible instead of smoothing it.
```{r}
plot_df_loocv = data.frame(h=seq(1, 20, by=1),
MSE=c(CVMSE1_aus, CVMSE1_sd,
CVMSE1_mn, CVMSE1_bt),
City=c(rep("Austin", 20),
rep("S.D.", 20),
rep("Minn.", 20),
rep("Balt.", 20)))
```
```{r fig.cap="LOOCV MSE by bandwidth"}
gg = ggplot(plot_df_loocv, aes(x=h, y=MSE)) +
geom_line() +
facet_wrap( ~ City, scales="free") +
labs(y="Mean Squared Error (MSE)", x="Bandwidth") +
theme(plot.title = element_text(hjust = 0.5))
print(gg)
```
# 5. Difficulties Posed By Temperature Data
---
**Overview:** Temperature data has at least two properties that pose difficulties for kernel smoothing and bandwidth selection:
* Non-Constant Variance
* Dependent Data
---
Understanding why LOOCV fails to smooth temperature data can help us pick better methods for automated bandwidth selection.
## 5.1. Non-Constant Variance
The smoothing methods examined up to now assume that the variance of the data is constant across its range. However, temperature data doesn't meet this assumption due to both long- and short-term factors. Climate change is an example of a long-term factor affecting temperature variance: as the climate gets further disrupted, we see more temperature extremes and fluctuations.
The seasons are the primary short-term factor affecting temperature variance. Take another look at some of the Austin data in Figure 1. We can see from this graph (and I know from personal experience) that temperature variance is much higher in the winter than the summer. Austin winters generally fluctuate between 40F and 80F, often within the span of a couple days, while Austin summers generally stay in the upper 90s to low 100s.
As a result, different bandwidths are appropriate in different seasons. Higher bandwidths seem better in winter since they'll give smooths that are less affected by the short-term fluctuations. Relatively lower bandwidths seem better in summer since they can better capture the peak heat (usually in August).
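One quick check of this claim is to compare the spread of the Austin daily highs by calendar month, as in the sketch below (using `data1_aus` from Section 1.1; the exact values depend on the station and time period, so the chunk is not evaluated here):

```{r eval=FALSE}
# standard deviation of Austin daily highs by calendar month
data1_aus %>%
  mutate(Month = format(DATE, "%m")) %>%
  group_by(Month) %>%
  summarize(SD_TMAX = sd(TMAX, na.rm = TRUE))
```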
## 5.2. Dependent Data
Most smoothing methods work better when deviations from the broader trend in the data are uncorrelated with each other. In the case of temperature data, this would occur if fluctuations around the broader trend were independent from one day to the next. The fluctuations would tend to balance each other out, allowing a smoother to easily cut through the middle of them.
Unfortunately, temperature data isn't like this. When temperatures deviate from the broader trend, they generally do so over the course of several days in a heat wave or cold spell. Abnormally hot days are likely to be followed by abnormally hot days; likewise for abnormally cold days.
Consider the February 1996 heat wave in Austin highlighted in Figure 8. Here, the positive deviations from the overall trend lasted about a week. Unless they have a very high bandwidth, smoothers won't be able to "cut through" this heat wave and stick to the broader trend. The prolonged spell of abnormally high temperatures will inevitably pull the smooth upwards, perhaps in dramatic fashion. Looking back at the $h=14$ smooth in Section 3, we can see that this is exactly what happens.
```{r fig.cap="Austin daily highs with Feb. 1996 heat wave"}
low=1000
high=1450
rect = data.frame(xmin=as.Date("1996-02-18"), xmax=as.Date("1996-02-27"), ymin=-Inf, ymax=Inf)
gg = ggplot(data1_aus[low:high,], aes(x=DATE, y=TMAX)) +
geom_line() +
geom_rect(data=rect, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax),
fill="yellow", alpha=0.4, inherit.aes=FALSE) +
labs(x="Date", y="High Temp. (F)") +
theme(plot.title = element_text(hjust = 0.5))
print(gg)
```
We get more evidence of the dependent nature of temperature data from the autocorrelation (ACF) plots of the residuals from the $h=14$ smooths on each city. The ACF plots in Figure 9 display the correlations between different lags of the residuals. For each city, the residuals are positively correlated with the lag-1 and lag-2 residuals. For Minneapolis, the residuals are even positively correlated with the lag-3 residuals.
```{r}
error1_aus = Y_hat1_aus - high1_aus
```
```{r}
error1_sd = Y_hat1_sd - high1_sd
```
```{r}
error1_mn = Y_hat1_mn - high1_mn
```
```{r}
error1_bt = Y_hat1_bt - high1_bt
```
```{r fig.cap="Autocorrelation plots for normal kernel smooth (h=14) residuals"}
par(mar=c(4,4,4,2), mfrow=c(2,2))
acf(error1_aus, lag=20, main="Austin", cex.main=1.5, cex.lab=1)
acf(error1_sd, lag=20, main="San Diego", cex.main=1.5, cex.lab=1)
acf(error1_mn, lag=20, main="Minneapolis", cex.main=1.5, cex.lab=1)
acf(error1_bt, lag=20, main="Baltimore", cex.main=1.5, cex.lab=1)
```
# 6. Super-Smoother
---
**Overview:** The super-smoother helps with the problem of non-constant variance by selecting different "optimal" bandwidths at different points. LOOCV again performs poorly as the bandwidth selection method, choosing the lowest candidate bandwidth at most points.
---
Up to this point, I've only discussed smoothers that use a single bandwidth across an entire data set. These can struggle when the data have non-constant variance, as different bandwidths are appropriate in different regions.
The super-smoother tries to solve this problem by allowing the bandwidth to vary across the data.[3] Like the previous LOOCV algorithm, the super-smoother is given a set of candidate bandwidths and computes the LOOCV error at each point for each bandwidth. A second smooth is then applied to these errors. At each point in the data set, the bandwidth with the smallest smoothed LOOCV error is selected as the "optimal" bandwidth at that point. A third smooth is applied to these optimal bandwidths to ease the transitions between different variance regions. Finally, the smoothed optimal bandwidths are used to calculate the output values of the super-smoother.
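At the final step, writing $\tilde{h}_i$ for the smoothed optimal bandwidth at point $i$ and $h_-$, $h_+$ for the nearest candidate bandwidths below and above it, the super-smoother's output interpolates the two corresponding smooths (this matches the last loop of the `super_smoother` function below):

$$\hat{s}(x_i) = p\, \hat{s}_{h_+}(x_i) + (1 - p)\, \hat{s}_{h_-}(x_i), \qquad p = \frac{\tilde{h}_i - h_-}{h_+ - h_-},$$

with $p$ set to 0 when $\tilde{h}_i$ coincides with a candidate bandwidth.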
Figure 10 (from [2]) shows an example non-constant variance data set with a super-smoother fit given 3 candidate bandwidths (0.05, 0.2, 0.5). Figure 11 (also from [2]) shows the optimal bandwidth selected by LOOCV at each point as well as the smooth applied to these bandwidths. Note that higher bandwidths are selected in higher-variance regions, allowing the smooth to cut through the noise, while lower bandwidths are selected in lower-variance regions, allowing the smooth to better capture the trend.
```{r echo=FALSE, fig.cap="Non-constant variance data with LOOCV super-smooth (solid line)", out.width = '100%'}
knitr::include_graphics("images/supersmoother_data.png")
```
```{r echo=FALSE, fig.cap="Selected bandwidths with smooth", out.width = '100%'}
knitr::include_graphics("images/supersmoother_bandwidths.png")
```
We test the LOOCV super-smoother on each data set. After some trial and error, we use $H = (7, 15, 23, 31, 39, 47, 55, 63, 71, 79, 87)$ as the set of candidate bandwidths.
The decision to use a set of 11 candidate bandwidths introduced a minor complication. The paper introducing the super-smoother uses a set of 3 candidate bandwidths and recommends using the second-smallest (i.e., the middle) of these when applying the second and third smooths to the LOOCV errors and optimal bandwidths, respectively (call the bandwidth of these smooths $h^*$). This recommendation is ambiguous for a larger set of candidates: do we take it literally and use the second-smallest candidate, or do we use the median? We'll test both options ($h^*=15$ and $h^*=47$).
```{r}
super_smoother = function(X, Y, kernel_func, H_array, error_estimate="loocv",
kb_window=2, error_bdw=NA) {
# H_array: the set of candidate bandwidths
# error_estimate: "loocv" or "kbcv"
# kb_window: if error_estimate=="kbcv", the half-width (one side) of the k-block window
# error_bdw: the bandwidth to use for the 2nd and 3rd smooths
m = length(H_array)
n = length(X)
k = (m+1)/2
if (is.na(error_bdw)) {
error_bdw = H_array[k]
}
# fixed "outer" smoother
S_star_h = get_S(X, kernel_func, error_bdw)
# get performance measures at each point for each candidate bandwidth
smooth_list = vector(mode="list", length=m)
p_hat_list = vector(mode="list", length=m) # smoothed performance measure
for (i in 1:m) {
h = H_array[i]
S_h = get_S(X, kernel_func, h) # smoothing matrix
sk_hat_h = S_h %*% Y
smooth_list[[h]] = sk_hat_h
if (error_estimate=="loocv") {
error_h = (Y - sk_hat_h)/(1 - diag(S_h))
}
else if (error_estimate=="kbcv") {
error_h = c()
for (j in 1:n) {
yj = Y[j]
Y_temp = Y
Y_temp[max((j-kb_window),1):min((j+kb_window),n)] = yj
sk_hatj = dot(c(S_h[1:n,j]), Y_temp)
num = Y[j] - sk_hatj
den = 1 - sum(S_h[max((j-kb_window),1):min((j+kb_window),n), j])
error_j = num/den
error_h = c(error_h, c(error_j))
}
}
g_h = abs(error_h)
# smooth errors
p_hat_h = S_star_h %*% g_h
p_hat_list[[i]] = p_hat_h
}
p_hat_matrix = matrix(unlist(p_hat_list), ncol=m)
smooth_matrix = matrix(unlist(smooth_list), ncol=m)
# pick best bandwidths
min_indices = argmin(p_hat_matrix)
h_min = c()
for (i in 1:n) {
h_min = c(h_min, H_array[min_indices[i]])
}
# smooth optimal bandwidths
smoothed_best_h = S_star_h %*% h_min
smoothed_best_h = round(smoothed_best_h, 6) # floats would go below min(H)
# use smoothed optimal bandwidths to estimate
sk_hat = c()
for (i in 1:n) {
h_i = smoothed_best_h[i] # smoothed optimal bandwidth
h_minus = max(H_array[H_array <= h_i]) # closest candidate bandwidth below
h_plus = min(H_array[H_array >= h_i]) # closest candidate bandwidth above
# interpolate higher and lower bandwidths
if (h_minus==h_plus) {
p = 0
}
else {
p = (h_i - h_minus)/(h_plus - h_minus)
}
sk_i = p*smooth_list[[h_plus]][i] + (1-p)*smooth_list[[h_minus]][i]
sk_hat = c(sk_hat, c(sk_i))
}
return(list(sk_hat, h_min, smoothed_best_h))
}
```
## 6.1 Results
```{r}
H_array = c(7,15,23,31,39,47,55,63,71,79,87)
```
```{r}
# Austin
smooth_list1_aus = super_smoother(time_idx1_aus, high1_aus, dnorm,
H_array, error_estimate="loocv")
```
```{r}
Y_hat1_aus = smooth_list1_aus[[1]]
h_min1_aus = smooth_list1_aus[[2]]
smoothed_h_min1_aus = smooth_list1_aus[[3]]
```
```{r}
mid_mean_aus = mean(smoothed_h_min1_aus)
```
```{r}
smooth_list1_aus = super_smoother(time_idx1_aus, high1_aus, dnorm, H_array,
error_estimate="loocv", error_bdw=H_array[2])
```
```{r}
Y_hat1_aus = smooth_list1_aus[[1]]
h_min1_aus = smooth_list1_aus[[2]]
smoothed_h_min1_aus = smooth_list1_aus[[3]]
```
```{r}
low_mean_aus = mean(smoothed_h_min1_aus)
```
```{r}
data1_aus$Y_hat = Y_hat1_aus
```
```{r}
# San Diego
smooth_list1_sd = super_smoother(time_idx1_sd, high1_sd, dnorm, H_array,
error_estimate="loocv")
```
```{r}
Y_hat1_sd = smooth_list1_sd[[1]]
h_min1_sd = smooth_list1_sd[[2]]
smoothed_h_min1_sd = smooth_list1_sd[[3]]
```
```{r}
mid_mean_sd = mean(smoothed_h_min1_sd)
```
```{r}
smooth_list1_sd = super_smoother(time_idx1_sd, high1_sd, dnorm, H_array,
error_estimate="loocv", error_bdw=H_array[2])
```
```{r}
Y_hat1_sd = smooth_list1_sd[[1]]
h_min1_sd = smooth_list1_sd[[2]]
smoothed_h_min1_sd = smooth_list1_sd[[3]]
```
```{r}
low_mean_sd = mean(smoothed_h_min1_sd)
```
```{r}
data1_sd$Y_hat = Y_hat1_sd
```
```{r}
# Minneapolis
smooth_list1_mn = super_smoother(time_idx1_mn, high1_mn, dnorm, H_array,
error_estimate="loocv")
```
```{r}
Y_hat1_mn = smooth_list1_mn[[1]]
h_min1_mn = smooth_list1_mn[[2]]
smoothed_h_min1_mn = smooth_list1_mn[[3]]
```
```{r}
mid_mean_mn = mean(smoothed_h_min1_mn)
```
```{r}
smooth_list1_mn = super_smoother(time_idx1_mn, high1_mn, dnorm, H_array,
error_estimate="loocv", error_bdw=H_array[2])
```
```{r}
Y_hat1_mn = smooth_list1_mn[[1]]
h_min1_mn = smooth_list1_mn[[2]]
smoothed_h_min1_mn = smooth_list1_mn[[3]]
```
```{r}
low_mean_mn = mean(smoothed_h_min1_mn)
```
```{r}
data1_mn$Y_hat = Y_hat1_mn
```
```{r}
# Baltimore
smooth_list1_bt = super_smoother(time_idx1_bt, high1_bt, dnorm, H_array,
error_estimate="loocv")
```
```{r}
Y_hat1_bt = smooth_list1_bt[[1]]
h_min1_bt = smooth_list1_bt[[2]]
smoothed_h_min1_bt = smooth_list1_bt[[3]]
```
```{r}
mid_mean_bt = mean(smoothed_h_min1_bt)
```
```{r}
smooth_list1_bt = super_smoother(time_idx1_bt, high1_bt, dnorm, H_array,
error_estimate="loocv", error_bdw=H_array[2])
```
```{r}
Y_hat1_bt = smooth_list1_bt[[1]]
h_min1_bt = smooth_list1_bt[[2]]
smoothed_h_min1_bt = smooth_list1_bt[[3]]
```
```{r}
low_mean_bt = mean(smoothed_h_min1_bt)
```
```{r}
data1_bt$Y_hat = Y_hat1_bt
```
Unfortunately, the super-smoother with LOOCV bandwidth selection also mostly fails to give good smooths of the data. As seen in Table 2, for each city and choice of $h^*$, the mean of the bandwidths chosen by the super-smoother across the data set is fairly close to 7, the smallest candidate bandwidth. In other words, the LOOCV super-smoother selects the smallest possible bandwidth as "optimal" at most points in the data set. (Sets of candidate bandwidths containing even smaller values gave the same result.)
```{r results='asis'}
table_df = data.frame(City=c("Austin", "San Diego", "Minneapolis", "Baltimore"),
mid_mean=c(mid_mean_aus,mid_mean_sd,mid_mean_mn,mid_mean_bt),
low_mean=c(low_mean_aus,low_mean_sd,low_mean_mn,low_mean_bt))
kable(table_df, caption="Mean LOOCV Super-Smoother Bandwidth by City",
col.names=c("City", "Mean Bandwidth (h*=47)", "Mean Bandwidth (h*=15)"))
```
Setting $h^*$ to the smaller value of 15 gives a better smooth than $h^*=47$, but not by much. Figure 12 displays parts of the smooths given by the LOOCV super-smoother with $h^* = 15$. The smooths are generally very choppy with only a few bright spots (e.g. Fall 1996 and Spring 1997 in Austin, Fall 1997 in San Diego). If lower bandwidths had been included in the set of candidates, the smooths would be even worse.
```{r}
plot_df$Y_hat = c(Y_hat1_aus, Y_hat1_sd, Y_hat1_mn, Y_hat1_bt)
plot_df_trunc = plot_df[(plot_df$Date >= "1995-10-01") & (plot_df$Date <= "1999-09-30"),]
```
```{r fig.cap="Daily High Temperatures with LOOCV Super-Smooth (h*=15)"}
gg = ggplot(plot_df_trunc) +
geom_line(aes(x=Date, y=Daily_High_Temp), color="gray") +
geom_line(aes(x=Date, y=Y_hat), color="red") +
facet_wrap( ~ City, scales="free") +
labs(x="Date", y="High Temp. (F)") +
theme(plot.title = element_text(hjust = 0.5))
print(gg)
```
# 7. K-Block Cross-Validation
---
**Overview:** Using *k*-block cross-validation (KBCV) instead of LOOCV helps with dependent data. The super-smoother that uses KBCV for bandwidth selection produces good smoothing results.
---
The super-smoother is designed to overcome difficulties posed by data with non-constant variance, but it still fails on temperature data when using LOOCV to select bandwidths. This is due in part to the dependent nature of temperature data. If the high temperature on a given day deviates from the broader trend, the highs on days shortly before and after are likely to deviate from the broader trend in the same way. In LOOCV, surrounding days are used to predict the day that has been left out. As a result, the LOOCV prediction of the high temperature on a given day contains information about the broader trend we're trying to capture (good) but also information about how that day deviates from the trend (bad).
*K*-block cross-validation (KBCV) is a LOOCV alternative designed to help with this problem (the original authors call the technique *h*-block cross-validation, but I use "*k*" instead to avoid confusion with bandwidth parameters).[4] As with LOOCV, KBCV uses singleton test sets. The difference is in the training sets: LOOCV uses all $n-1$ remaining points in the training set, whereas KBCV uses $n - 2k - 1$ training points for a user-specified window parameter $k$. In addition to the testing point, KBCV training sets leave out a window of $k$ points on either side.
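In terms of the smoothing weights, one way to write the KBCV prediction at point $j$ is to drop the weights on the block $B_j = \{j-k, \dots, j+k\}$ and renormalize the remaining weights (up to the handling of the series endpoints, this is what the `kbcv` branch of the `super_smoother` function in Section 6 computes):

$$\hat{s}^{(-B_j)}(x_j) = \frac{\sum_{i \notin B_j} S_{ji}\, y_i}{\sum_{i \notin B_j} S_{ji}} = \frac{\hat{s}(x_j) - \sum_{i \in B_j} S_{ji}\, y_i}{1 - \sum_{i \in B_j} S_{ji}},$$

and the cross-validation residual at point $j$ is $y_j - \hat{s}^{(-B_j)}(x_j)$.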
We test KBCV with a window parameter $k=3$, which leaves out a total of 7 points (the testing point, 3 points before, and 3 points after) from each training set. We set $k=3$ because the autocorrelation plots in Section 5.2 gave evidence of positive residual correlation up to lag 3. The super-smoother bandwidth used to smooth the cross-validation errors and optimal bandwidths was set to $h^*=15$ since this gave better LOOCV results.
```{r}
kb_window = 3
error_bdw = H_array[2]
```
## 7.1 Results
```{r}
smooth_list1_aus = super_smoother(time_idx1_aus, high1_aus, dnorm, H_array,
error_estimate="kbcv", kb_window=kb_window,
error_bdw=error_bdw)
```
```{r}
Y_hat1_aus = smooth_list1_aus[[1]]
h_min1_aus = smooth_list1_aus[[2]]
smoothed_h_min1_aus = smooth_list1_aus[[3]]
data1_aus$Y_hat_kb = Y_hat1_aus
data1_aus$h_min_kb = h_min1_aus
data1_aus$smoothed_h_min_kb = smoothed_h_min1_aus
```
```{r}
mean_aus = mean(smoothed_h_min1_aus)
```
```{r}
# San Diego
smooth_list1_sd = super_smoother(time_idx1_sd, high1_sd, dnorm, H_array,
error_estimate="kbcv", kb_window=kb_window,
error_bdw=error_bdw)
```
```{r}
Y_hat1_sd = smooth_list1_sd[[1]]
h_min1_sd = smooth_list1_sd[[2]]
smoothed_h_min1_sd = smooth_list1_sd[[3]]
data1_sd$Y_hat_kb = Y_hat1_sd
data1_sd$h_min_kb = h_min1_sd
data1_sd$smoothed_h_min_kb = smoothed_h_min1_sd
```
```{r}
mean_sd = mean(smoothed_h_min1_sd)
```
```{r}
# Minneapolis
smooth_list1_mn = super_smoother(time_idx1_mn, high1_mn, dnorm, H_array,
error_estimate="kbcv", kb_window=kb_window,
error_bdw=error_bdw)
```
```{r}
Y_hat1_mn = smooth_list1_mn[[1]]
h_min1_mn = smooth_list1_mn[[2]]
smoothed_h_min1_mn = smooth_list1_mn[[3]]
data1_mn$Y_hat_kb = Y_hat1_mn
data1_mn$h_min_kb = h_min1_mn
data1_mn$smoothed_h_min_kb = smoothed_h_min1_mn
```
```{r}
mean_mn = mean(smoothed_h_min1_mn)
```
```{r}
# Baltimore
smooth_list1_bt = super_smoother(time_idx1_bt, high1_bt, dnorm, H_array,
error_estimate="kbcv", kb_window=kb_window,
error_bdw=error_bdw)
```
```{r}
Y_hat1_bt = smooth_list1_bt[[1]]
h_min1_bt = smooth_list1_bt[[2]]
smoothed_h_min1_bt = smooth_list1_bt[[3]]
data1_bt$Y_hat_kb = Y_hat1_bt
data1_bt$h_min_kb = h_min1_bt
data1_bt$smoothed_h_min_kb = smoothed_h_min1_bt
```
```{r}
mean_bt = mean(smoothed_h_min1_bt)
```
Table 3 displays the means of the bandwidths chosen by the KBCV super-smoother across each data set. With KBCV instead of LOOCV as the bandwidth selection method, the super-smoother tends to select much higher bandwidths.
```{r results='asis'}
table_df = data.frame(City=c("Austin", "San Diego", "Minneapolis", "Baltimore"),
mid_mean=c(mean_aus, mean_sd, mean_mn, mean_bt))
kable(table_df, caption="Mean KBCV Super-Smoother Bandwidth by City",
col.names=c("City", "Mean Bandwidth (h*=15, k=3)"))
```
The generally higher bandwidths lead to much-improved smooths, which can be seen in Figures 13 and 14. The smooths are much less wiggly than the LOOCV super-smooths in Figure 12 (Section 6.1), doing a better job of capturing broader seasonal trends instead of short-term deviations. For instance, the LOOCV smooth for Austin in Winter 1995-1996 contains 3 prominent local minima and is yanked upwards by the February 1996 heat wave. The KBCV smooth, on the other hand, contains only a single minimum in that period and is less sensitive to the heat wave.
Figures 13 and 14 also display the bandwidths selected by the super-smoother at each point (along with the $h^*=15$ smooth applied to them). We can see some seasonality in the bandwidth selections that matches the advertised properties of the super-smoother. Higher bandwidths tend to be selected in higher-variance periods (e.g. Austin winters, Baltimore springs) and lower bandwidths tend to be selected in lower-variance periods (e.g. Austin summers).
```{r}
plot_df_temp_as = data.frame(TMAX=c(data1_aus$TMAX, data1_sd$TMAX),
DATE=c(data1_aus$DATE, data1_sd$DATE),
Y_hat_kb = c(data1_aus$Y_hat_kb, data1_sd$Y_hat_kb),
h_min_kb = c(data1_aus$h_min_kb, data1_sd$h_min_kb),
smoothed_h_min_kb = c(data1_aus$smoothed_h_min_kb,
data1_sd$smoothed_h_min_kb),
City=c(rep("Austin", n1_aus),
rep("S.D.", n1_sd)))
```
```{r fig.cap="Above: daily highs with KBCV super-smooth (h*=15, k=3). Below: KBCV super-smoother selected bandwidths."}
plot_df_temp_as_trunc = plot_df_temp_as[(plot_df_temp_as$DATE >= "1995-10-01") & (
plot_df_temp_as$DATE <= "1999-09-30"),]
gg1 = ggplot(plot_df_temp_as_trunc) +
geom_line(aes(x=DATE, y=TMAX), color="gray") +
geom_line(aes(x=DATE, y=Y_hat_kb), color="red") +
facet_wrap( ~ City, scales="free") +
labs(x="", y="High Temp. (F)") +
theme(plot.title = element_text(hjust = 0.5))
gg2 = ggplot(plot_df_temp_as_trunc) +
geom_point(aes(x=DATE, y=h_min_kb), color="black") +
geom_line(aes(x=DATE, y=smoothed_h_min_kb), color="gray") +
facet_wrap( ~ City) +
labs(x="Date", y="Selected Bandwidth") +
theme(plot.title = element_text(hjust = 0.5))
plot_grid(gg1, gg2, nrow=2, ncol=1)
```
```{r}
plot_df_temp_mb = data.frame(TMAX=c(data1_mn$TMAX, data1_bt$TMAX),
DATE=c(data1_mn$DATE, data1_bt$DATE),
Y_hat_kb = c(data1_mn$Y_hat_kb, data1_bt$Y_hat_kb),
h_min_kb = c(data1_mn$h_min_kb, data1_bt$h_min_kb),
smoothed_h_min_kb = c(data1_mn$smoothed_h_min_kb,
data1_bt$smoothed_h_min_kb),
City=c(rep("Minneapolis", n1_mn),
rep("Baltimore", n1_bt)))
```
```{r fig.cap="Above: daily highs with KBCV super-smooth (h*=15, k=3). Below: KBCV super-smoother selected bandwidths."}
plot_df_temp_mb_trunc = plot_df_temp_mb[(plot_df_temp_mb$DATE >= "1995-10-01") & (
plot_df_temp_mb$DATE <= "1999-09-30"),]
gg1 = ggplot(plot_df_temp_mb_trunc) +
geom_line(aes(x=DATE, y=TMAX), color="gray") +
geom_line(aes(x=DATE, y=Y_hat_kb), color="red") +
facet_wrap( ~ City, scales="free") +
labs(x="", y="High Temp. (F)") +
theme(plot.title = element_text(hjust = 0.5))
gg2 = ggplot(plot_df_temp_mb_trunc) +
geom_point(aes(x=DATE, y=h_min_kb), color="black") +
geom_line(aes(x=DATE, y=smoothed_h_min_kb), color="gray") +
facet_wrap( ~ City) +
labs(x="Date", y="Selected Bandwidth") +
theme(plot.title = element_text(hjust = 0.5))
plot_grid(gg1, gg2, nrow=2, ncol=1)
```
# 8. Discussion and Further Research
In this project, we examined automated bandwidth selection methods for kernel smoothing on daily temperature data. We discussed some problematic properties of this data that cause standard methods to fail. To help with the problem of non-constant variance, we moved from regular linear smoothers to the super-smoother. To help with the problem of dependent data, we moved from leave-one-out cross-validation to *k*-block cross-validation. We showed that the KBCV super-smoother does a good job smoothing temperature data across a wide variety of climates.
The KBCV super-smoother uses several hyperparameters: the set of candidate bandwidths, the bandwidths of the two higher-level smooths, and the window size *k* for *k*-block cross-validation. Tuning these hyperparameters was only briefly discussed in this project and is a good topic of further research.
# 9. References
[1] NOAA Climate Data Online: https://www.ncei.noaa.gov/cdo-web/
[2] Givens and Hoeting (2012), *Computational Statistics*, 2nd ed.
[3] Friedman (1984), "A Variable Span Smoother."
[4] Burman, Chow, and Nolan (1994), "A Cross-Validatory Method for Dependent Data."