---
subtitle: "Estimation"
abstract-title: ""
abstract: |
*Materials adapted from John Drake and Pej Rohani [@drakeEstimation2019]*
execute:
warning: false
metadata-files:
- metadata/matthewferrari.yml
- metadata/mathjax-packages.yml
---
# R Session 03
## Setup
```{r}
library(here)
library(rio)
library(deSolve)
library(tidyverse)
library(ggtext)
library(gt)
```
```{r}
theme_set(theme_minimal())
```
```{r}
# Load the datasets: flu and niamey
flu <- rio::import("https://raw.githubusercontent.com/arnold-c/SISMID-Module-02_2023/main/data/flu.csv")
niamey <- rio::import("https://raw.githubusercontent.com/arnold-c/SISMID-Module-02_2023/main/data/niamey.csv")
# flu <- rio::import(here::here("data", "flu.csv"))
# niamey <- rio::import(here::here("data", "niamey.csv"))
```
## Estimating $R_0$: Problem Background
So far in this class we have focused on the **theory** of infectious disease. Often, however, we will want to apply this theory to particular situations.
One of the key applied problems in epidemic modeling is the estimation of $R_0$ from outbreak data.
In this session, we study two methods for estimating $R_0$ from an epidemic curve.
As a running example, we will use the data on influenza in a British boarding school.
```{r}
#| column: body
#| out-width: 100%
ggplot(flu, aes(x = day, y = flu)) +
geom_line(color = "slategray4") +
geom_point(shape = 21, size = 5, fill = "slategray4", alpha = 0.8) +
labs(x = "Day", y = "Active Influenza Cases")
```
## Estimating $R_0$ From The Final Outbreak Size
Our first approach is to estimate $R_0$ from the final outbreak size.
Although unhelpful at the early stages of an epidemic (before the final epidemic size is observed), this method is nonetheless a useful tool for *post hoc* analysis.
The method is general and can be motivated by the following argument from [@keelingIntroductionSimpleEpidemic2008]:
First, we assume that the epidemic is started by a single infectious individual in a completely susceptible population.
On average, this individual infects $R_0$ others.
The probability a particular individual escaped infection is therefore $e^{-R_0 / N}$.
If $Z$ individuals have been infected, the probability of an individual escaping infection from all potential sources is $e^{-Z R_0 / N}$.
It follows that at the end of the epidemic a proportion $R(\infty) = Z / N$ have been infected and the fraction remaining susceptible is $S(\infty) = e^{-R(\infty) R_0}$, which is equal to $1 - R(\infty)$.
::: {.callout-note collapse="true"}
$S(\infty) = e^{-R(\infty) R_0}$ can be calculated by acknowledging that at the end of the epidemic ($t = \infty$) the fraction remaining susceptible equals the probability of escaping infection from all sources, $S(\infty) = e^{-Z R_0 / N}$, and substituting $R(\infty) = Z / N$ gives the desired result.
It could also be calculated by dividing $\frac{\dd{S}}{\dd{t}}$ by $\frac{\dd{R}}{\dd{t}}$:
\begin{aligned}
\frac{\dd{S}}{\dd{R}} &= - \frac{\beta S}{\gamma} \\
&= - R_0 S
\end{aligned}
which is a [separable differential equation](https://tutorial.math.lamar.edu/classes/de/separable.aspx), so can be integrated as follows:
\begin{aligned}
- \int_{0}^{t} \frac{1}{R_0 S} \dd{S} &= \int_{0}^{t} \dd{R} \\
- \frac{1}{R_0} \left(\ln{S(t)} - \ln{S(0)} \right) &= R(t) - \cancelto{0}{R(0)} \\
\ln{S(t)} &= \ln{S(0)} - R_0 R(t) \\
S(t) &= S(0) e^{-R_0 R(t)}
\end{aligned}
:::
Putting this together, we get:
$$
1 - R(\infty) - e^{-R(\infty) R_0} = 0
$$
Rearranging, we have the estimator
$$
\hat{R_0} = \frac{\ln(1 - Z / N)}{-Z / N},
$$
which, in this case, evaluates to $\frac{\ln(1 - 512 / 764)}{-512 / 764} = 1.655$.
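As a quick check, this estimate can be computed directly in R (a minimal sketch using the totals quoted above: $Z = 512$ infected out of $N = 764$ boys):

```{r}
# Final-size estimator of R0, using Z = 512 and N = 764 from the text
Z <- 512
N <- 764
log(1 - Z / N) / (-Z / N)
```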
<div class="exercise">
### Exercise 1
</div>
This equation shows the important one-to-one relationship between $R_0$ and the final epidemic size.
Plot the relationship between $R_0$ and the final epidemic size ($Z / N$) for the complete range of epidemic sizes between 0 and 1.
## Linear Approximation
The next method we introduce takes advantage of the fact that during the early stages of an outbreak, the number of infected individuals is given approximately as $I(t) \approx I_0 e^{(R_0 - 1)(\gamma + \mu) t}$.
Taking logarithms of both sides, we have $\ln(I(t)) \approx \ln(I_0) + (R_0 - 1)(\gamma + \mu)t$, showing that the log of the number of infected individuals is approximately linear in time with a slope that reflects both $R_0$ and the recovery rate.
This suggests that a simple linear regression fit to the first several data points on a log-scale, corrected to account for $\gamma$ and $\mu$, provides a rough and ready estimate of $R_0$.
For flu, we can assume $\mu = 0$ because the epidemic occurred over a time period during which natural mortality is negligible. Further, assuming an infectious period of about 2.5 days, we use $\gamma = (2.5)^{-1} = 0.4 \ \text{day}^{-1}$ for the correction.
Fitting to the first four data points, we obtain the slope as follows.
```{r}
# Fit a linear model to the log of the first four observations
linear_model <- lm(log(flu) ~ day, data = flu[1:4, ])
# Summary statistics for fit model
summary(linear_model)
# Extract slope parameter
coef(linear_model)[2]
```
Rearranging the linear equation above and denoting the slope coefficient by $\hat{\beta}_1$, we have the estimator $\hat{R}_0 = \hat{\beta}_1 / \gamma + 1$, giving $\hat{R}_0 = 1.094913 / 0.4 + 1 \approx 3.7$.
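The same calculation can be performed programmatically from the fitted model (a minimal sketch; `gamma` here is the assumed recovery rate of $0.4 \ \text{day}^{-1}$ from above):

```{r}
# Convert the fitted slope to an R0 estimate via R0_hat = slope / gamma + 1
gamma <- 0.4
slope <- coef(linear_model)[2]
unname(slope / gamma + 1)
```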
<div class="exercise">
### Exercise 2
</div>
Our estimate assumes that boys remained infectious during the natural course of infection.
The original report on this epidemic indicates that boys found to have symptoms were immediately confined to bed in the infirmary.
The report also indicates that only 1 out of 130 adults at the school exhibited any symptoms.
It is reasonable, then, to suppose that a boy's transmission ceased once he had been admitted to the infirmary.
Suppose admission happened within 24 hours of the onset of symptoms: how does this affect our estimate of $R_0$?
What if admission happened within twelve hours?
<div class="exercise">
### Exercise 3
</div>
Biweekly data for outbreaks of measles in three communities in Niamey, Niger are provided in the dataframe `niamey`.
Use this method to obtain an estimate of $R_0$ for measles from the first community, assuming that the infectious period is approximately two weeks, or $\frac{14}{365} \approx 0.0384$ years.
<div class="exercise">
### Exercise 4
</div>
A defect with this method is that it uses only a small fraction of the information that might be available, i.e., the first few data points.
Indeed, there is nothing in the method that tells one how many data points to use--this is a matter of judgment.
Further, there is a tradeoff in that as more and more data points are used the precision of the estimate increases, but this comes at a cost of additional bias.
Plot the estimate of $R_0$ obtained from $n=3, 4, 5, ...$ data points against the standard error of the slope from the regression analysis to show this tradeoff.
## Estimating Dynamical Parameters With Least Squares
The objective of the previous exercise was to estimate $R_0$.
Knowing $R_0$ is critical to understanding the dynamics of any epidemic system.
It is, however, a composite quantity and is not sufficient to completely describe the epidemic trajectory.
For this, we require estimates for all parameters of the model.
In this exercise, we introduce a simple approach to model estimation called **least squares fitting**, sometimes called **trajectory matching**.
The basic idea is that we find the values of the model parameters that minimize the squared differences between model predictions and the observed data.
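Formally, writing $y_k$ for the observed case count in biweek $k$ and $\hat{y}_k(\theta)$ for the corresponding model prediction under parameter vector $\theta$, least squares seeks

$$
\hat{\theta} = \arg\min_{\theta} \sum_{k} \left( y_k - \hat{y}_k(\theta) \right)^2
$$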
To demonstrate least squares fitting, we consider an outbreak of measles in Niamey, Niger, reported on by [@graisEstimatingTransmissionIntensity2006].
```{r}
# Replace an "NA"
niamey[5, 3] <- 0
niamey_df <- niamey %>%
# Rename columns to replace the automatic "V1" etc. column names
rename_with(., ~paste0("Site_", str_remove(.x, "V"))) %>%
# Add a column for the biweekly time period
mutate(biweek = 1:16) %>%
# Convert to long format for plotting
pivot_longer(
cols = contains("Site"),
names_to = "site",
values_to = "cases"
)
```
```{r}
# Create a vector of colors for each site in the Niamey dataset
niamey_site_colors <- RColorBrewer::brewer.pal(3, "Dark2")
# Assign names to the colors
names(niamey_site_colors) <- unique(niamey_df$site)
# Create a vector of labels for each site for nicer plotting legends
niamey_site_labels <- str_replace_all(names(niamey_site_colors), "_", " ")
names(niamey_site_labels) <- names(niamey_site_colors)
```
```{r}
#| column: body
#| out-width: 100%
ggplot(
niamey_df,
aes(x = biweek, y = cases, color = site, fill = site, group = site)
) +
geom_line() +
geom_point(shape = 21, size = 5, alpha = 0.8) +
scale_color_manual(
values = niamey_site_colors,
aesthetics = c("color", "fill"),
labels = niamey_site_labels
) +
guides(color = "none") +
labs(x = "Biweek", y = "Number of Cases", fill = "Site Number") +
theme(legend.position = "bottom")
```
## Dynamical Model
First, we write a specialized function for simulating the SIR model for the case where the removal rate is *"hard-wired"* and there is no demography.
```{r}
#' Basic SIR model
#'
#' A basic SIR model with no demographic structure to be used in deSolve
#'
#' @param time deSolve passes the time parameter to the function.
#' @param state A vector of states.
#' @param params The beta parameter
#' @param ... Other arguments passed by deSolve.
#'
#' @return A deSolve matrix of states at each time step.
#' @examples
#' sir_params <- 0.0005
#' sir_init_states <- c(S = 5000, I = 1, new_inf = 0)
#' sim_times <- seq(0, 16 * 14 / 365, by = 0.1 / 365)
#'
#' sir_sol <- deSolve::ode(
#'   y = sir_init_states,
#'   times = sim_times,
#'   func = closed_sir_model,
#'   parms = sir_params
#' )
closed_sir_model <- function(time, state, params, ...) {
# Unpack states
S <- state["S"]
I <- state["I"]
# Unpack parameters
beta <- params
dur_inf <- 14 / 365
gamma <- 1 / dur_inf
new_inf <- beta * S * I
# Calculate the ODEs
dSdt <- -new_inf
dIdt <- new_inf - (gamma * I)
# Return the ODEs
return(list(c(dSdt, dIdt, new_inf)))
}
```
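As a quick sanity check that the model behaves sensibly, we can run it once with illustrative values (a minimal sketch; $\beta = 0.01$ is not a fitted estimate, just a value that gives $R_0 \approx 1.9$ with $S(0) = 5000$ and $\gamma = 365 / 14$):

```{r}
# Simulate 16 biweeks (on the annual time scale) from a mostly
# susceptible population; beta = 0.01 is illustrative only
check_sol <- deSolve::ode(
  y = c(S = 5000, I = 1, new_inf = 0),
  times = seq(0, 16 * 14 / 365, by = 1 / 365),
  func = closed_sir_model,
  parms = 0.01
)
# The final rows show the epidemic has largely run its course
tail(round(check_sol, 1), n = 3)
```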
## Interactive Optimization
```{r}
#| echo: false
ojs_niamey <- niamey_df %>%
rename("Biweek" = biweek, "Number of Individuals" = cases) %>%
mutate(site = str_replace_all(site, "_", " "))
```
```{r}
#| echo: false
#| cache: false
ojs_define(niamey_data = ojs_niamey)
```
```{ojs}
//| echo: false
filtered_niamey_data = aq.table(niamey_data)
.filter(aq.escape(d => d.site == site_select))
```
```{ojs}
//| echo: false
reset_S = 10000
reset_I = 20
reset_beta = 5.00
```
```{ojs}
//| echo: false
function set(input, value) {
input.value = value;
input.dispatchEvent(new Event("input", {bubbles: true}));
}
```
```{ojs}
//| echo: false
function sse(obs, preds) {
if(obs.length == preds.length) {
var squared_errs = obs.map((e, i) => (e - preds[i])**2 )
return squared_errs.reduce((a, b) => a + b, 0)
} else {
return("lengths are not the same")
}
}
```
```{ojs}
//| panel: sidebar
//| echo: false
viewof reset = Inputs.button([
["Reset all sliders", () => {
set(viewof S0, reset_S)
set(viewof I0, reset_I)
set(viewof beta_input, reset_beta)
}]
])
viewof S0 = Inputs.range(
[500, 15000],
{value: reset_S, step: 1, label: md`${tex`S(0)`}`}
)
viewof I0 = Inputs.range(
[0.001, 50],
{value: reset_I, step: 0.001, label: md`${tex`I(0)`}`}
)
viewof beta_input = Inputs.range(
[1, 100],
{value: reset_beta, step: 0.001, label: md`${tex`\beta (\times 10^{-3})`}`}
)
viewof site_select = Inputs.select(
["Site 1", "Site 2", "Site 3"],
{label: "Select a site:"}
)
// convert to daily time scale as easier to manipulate
beta = (beta_input / 365) * (10 ** (-3))
md`${tex`R_0 = ${R0_str}`}`
```
```{ojs}
//| echo: false
dur_inf = 14
gamma = 1 / dur_inf
R0 = beta * (S0 + I0)/ gamma
R0_str = R0.toPrecision(2).toLocaleString()
```
```{ojs}
//| echo: false
dt = 0.01
tmax = 16 * 14
```
```{ojs}
//| echo: false
import {odeRK4} from '@rreusser/integration@3064'
import { aq, op } from '@uwdata/arquero'
```
```{ojs}
//| echo: false
function sir(dydt, y, t) {
dydt[3] = beta * y[0] * y[1]
dydt[0] = - dydt[3]
dydt[1] = dydt[3] - gamma * y[1]
dydt[2] = gamma * y[1]
}
```
```{ojs}
//| echo: false
function simulate(f, t0, y0, dt, tmax) {
var t = t0
var y = y0
var i = 0
var ysim = [y0]
for (t = t0 + dt; t <= tmax; t += dt) {
ysim.push(odeRK4([], ysim[i], f, dt))
i += 1
}
// return cumulative infections
return ysim.map(d => d[3])
}
```
```{ojs}
//| echo: false
sir_sol = simulate(sir, 0, [S0, I0, 0.0, 0.0], dt, tmax)
```
```{ojs}
//| echo: false
siteColors = ["#1b9e77", "#d95f02", "#7570b3"]
```
```{ojs}
//| echo: false
// Create an array of observation times in days, dropping the first element
// (t = 0) because the observed data begin at biweek 1
times = Array.from({length: 17}, (_, i) => i * 14).slice(1)
tindex = times.map((e, i) => e * (1 / dt))
cum_inc = tindex.map((i) => sir_sol[i])
preds = [cum_inc[0] + I0, ...cum_inc.map((e, i) => cum_inc[i] - cum_inc[i-1]).slice(1)]
sir_tbl = aq.table({
Biweek: times.map(t => t / 14),
"Cumulative Incidence": cum_inc,
"Number of Individuals": preds
})
sim_sse = [({
sse: sse(
filtered_niamey_data.array("Number of Individuals"),
preds
).toPrecision(4),
Biweek: 3,
"Number of Individuals": Math.max(
...sir_tbl.array("Number of Individuals"),
...filtered_niamey_data.array("Number of Individuals")
) * 0.9
})]
```
```{ojs}
//| echo: false
//| panel: fill
Plot.plot({
color: {
legend: true,
domain: ["Site 1", "Site 2", "Site 3"],
range: siteColors,
},
style: {fontSize: "20px"},
marginLeft: 75,
marginTop: 40,
marginBottom: 55,
grid: true,
width: 800,
height: 670,
marks: [
Plot.lineY(
sir_tbl,
{x: "Biweek", y: "Number of Individuals", stroke: "#4d4d4dff", strokeWidth: 6, strokeOpacity: 0.8}
),
Plot.dot(
filtered_niamey_data,
{x: "Biweek", y: "Number of Individuals", stroke: "site", fill: "site", fillOpacity: 0.6, r: 12}
),
Plot.lineY(
filtered_niamey_data,
{x: "Biweek", y: "Number of Individuals", stroke: "site"}
),
Plot.text(
sim_sse,
{x: "Biweek", y: "Number of Individuals", text: (d) => `SSE = ${d.sse}`, dx: 0, dy: 0, fontWeight: "bold"}
)
]
})
```
## Objective Function
Now we set up a function that will calculate the sum of the squared differences between the observations and the model at any parameterization (more commonly known as *"sum of squared errors"*).
In general, this is called the **objective function** because it is the quantity that optimization seeks to minimize.
```{r}
#' Calculate the Sum of Squared Errors
#'
#' A function to take in biweekly incidence data, and SIR parameters, and
#' calculate the SSE
#'
#' @param params A vector of parameter values
#' @param data A dataframe containing biweekly incidence data in the `cases` column
#'
#' @return The SSE of type double
#' @examples
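#' # Illustrative example: evaluate the SSE for Site 1 at the initial
#' # parameter guess used later in this session (not a fitted value)
#' site_1_data <- dplyr::filter(niamey_df, site == "Site_1")
#' sse_sir(
#'   params = c(beta = log(0.055), S_init = log(5000), I_init = log(1)),
#'   data = site_1_data
#' )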
sse_sir <- function(params, data) {
# Convert the biweekly time series onto an annual time scale.
# A daily time scale requires beta values that are too small to
# optimize well.
dt <- 0.01
max_biweek <- max(data$biweek)
t <- seq(0, max_biweek * 14, dt) / 365
# Extract the observed incidence
obs_inc <- data$cases
# Note the parameters are updated throughout the optimization process by
# the optim() function
# Unpack the transmission parameter and exponentiate to fit on ln scale
beta <- exp(params[["beta"]])
# Unpack the initial states and exponentiate to fit on normal scale
S_init <- exp(params[["S_init"]])
I_init <- exp(params[["I_init"]])
# Fit SIR model to the parameters
sol <- deSolve::ode(
y = c(S = S_init, I = I_init, new_inf = 0),
times = t,
func = closed_sir_model,
parms = beta,
# Use rk4 as fixed time steps, which is important for indexing
method = "rk4"
)
# Extract the cumulative incidence
cum_inc <- sol[, "new_inf"]
# Find the indices of the cumulative incidence to extract
biweek_index <- seq(1, max_biweek) * (14 / dt) + 1
# Index cumulative incidence to get the values at the end of the biweeks
biweek_cum_inc <- cum_inc[biweek_index]
# Calculate the biweekly incidence by using the difference between
# consecutive biweeks. Need to manually prepend first week's incidence
# and add in the initial number of infectious individuals, as ODE model
# only returns the cumulative differences, which is 0 at the start.
biweek_inc <- c(biweek_cum_inc[1] + I_init, diff(biweek_cum_inc, lag = 1))
# return SSE of predicted vs observed incidence
return(sum((biweek_inc - obs_inc)^2))
}
```
Notice that the code for `sse_sir()` makes use of the following modeling trick.
We know that $\beta$, $S_0$, and $I_0$ must be positive, but our search to optimize these parameters will be over the entire number line.
We could constrain the search using a more sophisticated algorithm, but this might introduce other problems (e.g., instability at the boundaries). Instead, we parameterize our objective function (`sse_sir`) in terms of the alternative variables $\ln(\beta)$, $\ln(S_0)$, and $\ln(I_0)$.
While these numbers range from $-\infty$ to $\infty$ (the range of our search) they map to our model parameters on a range from $0$ to $\infty$ (the range that is biologically meaningful).
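As a minimal illustration of this reparameterization (the numbers are arbitrary): the optimizer may propose any real values, and exponentiation maps them back to valid, strictly positive parameters before the model is run.

```{r}
# Arbitrary real-valued proposals on the unconstrained search scale...
proposal <- c(beta = -2.9, S_init = 8.5, I_init = 0)
# ...map to strictly positive values on the natural parameter scale
exp(proposal)
```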
## Optimization
Our final step is to use the function `optim` to find the values of $\beta$, $S_0$, and $I_0$ that minimize the sum of squared errors as calculated using our function.
Finally, we plot these fits against the data.
```{r}
# Initial guess
sse_optim_params <- c(beta = log(0.055), S_init = log(5000), I_init = log(1))
# Create a dataframe of optimized parameters
niamey_optims <- niamey_df %>%
# Create a nested dataframe i.e. one row for each site, and the data column
# now is a list column that contains a separate dataframe of times and
# cases for each site
nest(data = -site) %>%
mutate(
# Map the optim() function call to each of the separate dataframes
# stored in the nested data column we just created
fit = map(data, ~optim(sse_optim_params, sse_sir, data = .x)),
# Map the exp() function to each of the model fits just created, and
# output to a dataframe instead of a list (as map() would return), for
# easier use in plotting the predictions later
map_dfr(fit, ~exp(.x$par))
)
```
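Before trusting these values, it is worth confirming that `optim()` reports successful convergence for each site (a quick check; `convergence` is `optim()`'s return code, where `0` indicates success):

```{r}
# A convergence code of 0 for each site indicates successful optimization
map_int(niamey_optims$fit, "convergence")
```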
```{r}
niamey_optims %>%
select(-c(data, fit)) %>%
mutate(site = str_replace_all(site, "_", " ")) %>%
gt() %>%
fmt_number(columns = -site, decimals = 3) %>%
fmt_scientific(columns = beta, decimals = 3) %>%
# Relabel the column headers
cols_label(
site = md("**Site**"),
beta = md("**Beta**"),
S_init = md("**Initial S**"),
I_init = md("**Initial I**")
) %>%
# Apply style to the table with gray alternating rows
opt_stylize(style = 1, color = 'gray') %>%
# Increase the space between columns
opt_horizontal_padding(scale = 3) %>%
cols_align("center")
```
::: {.callout-note}
You may have noticed that the interactive plot can produce slightly different optimal parameter values than those presented here (though they are very similar).
This is because the optimization code runs in `R`, while the interactive plot and its SSE calculation are implemented in JavaScript.
Therefore, despite using the same underlying model structure, the answers will vary slightly because the ODE solvers differ, resulting in different model simulations.
The difference is not enough to be concerned with here, but it is worth being aware of when you build your own models - you may want to perform sensitivity analyses to confirm that your model implementation is not driving the magnitude of the results you see, or the inferences you make.
:::
```{r}
niamey_predictions <- niamey_optims %>%
mutate(
# For each of the different site's nested dataframes, fit the SIR model
# with the optimal parameters to get best fit predictions
predictions = pmap(
.l = list(
S_init = S_init,
I_init = I_init,
beta = beta,
time_data = data
),
.f = function(S_init, I_init, beta, time_data) {
site_times <- time_data$biweek * 14 / 365
# Return a dataframe of model solutions
as_tibble(ode(
y = c(S = S_init, I = I_init, new_inf = 0),
times = site_times,
func = closed_sir_model,
parms = beta,
hmax = 1/120
)) %>%
# Make sure all values are numeric for plotting purposes
mutate(across(everything(), as.numeric)) %>%
mutate(
# Biweekly incidence is the difference in cumulative incidence
# between consecutive biweeks; prepend the first biweek's value
# (plus the initial infectious individuals), as in sse_sir()
incidence = c(new_inf[1] + I_init, diff(new_inf, lag = 1))
)
}
)
) %>%
unnest(c(data, predictions))
```
::: {.callout-important}
An important point to note is that our data are biweekly **incidence** (new cases in a time period), whereas our SIR model produces **prevalence** (the number of current cases at any time point).
To account for this, our SIR model returns the cumulative incidence (the `new_inf` state in `closed_sir_model()`), and our objective function extracts the biweekly incidence from it (the `biweek_inc` calculation in `sse_sir()`), to ensure we are fitting the model to the same quantity the data measure!
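In symbols, with $C(t)$ denoting cumulative incidence, the model's prediction for the case count in biweek $k$ is the increment

$$
\text{cases}_k = C(14k) - C\bigl(14(k - 1)\bigr), \qquad \frac{\dd{C}}{\dd{t}} = \beta S I
$$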
This is a common source of error in interpretation when people fit models to data.
:::
```{r}
#| column: body
#| out-width: 100%
# Create a dataframe to store the positions of the text labels
niamey_preds_labels <- tibble(
site = c("Site_1", "Site_2"),
x_label = c(6.5, 6.5),
x_arrow_just = c(-0.5, -0.5),
x_arrow_end = c(7, 7.75),
y_label = c(900, 600),
y_arrow_just = c(-80, -70),
y_arrow_end = c(350, 290),
commentary = c("**Predicted", "**Observed"),
color = c("grey20", niamey_site_colors["Site_2"])
)
ggplot(niamey_predictions, aes(x = biweek, group = site)) +
# Plot the actual data in color
geom_line(aes(y = cases, color = site)) +
geom_point(aes(y = cases, color = site), size = 4, alpha = 0.8) +
# Plot the best-fit model predictions in black
geom_line(aes(y = incidence), color = "black") +
scale_color_manual(
values = niamey_site_colors, aesthetics = c("color", "fill")
) +
# Place each site on its own subplot and change labels
facet_wrap(
~site, ncol = 1, scales = "free_y",
labeller = as_labeller(niamey_site_labels)
) +
labs(x = "Biweek", y = "Number of Case") +
theme(legend.position = "none") +
ggtext::geom_textbox(
data = niamey_preds_labels,
aes(
label = paste0(
"<span style = \"color:",
color,
"\">",
commentary,
" Cases**",
"</span>"
),
x = x_label, y = y_label
),
size = 4, fill = NA, box.colour = NA
) +
geom_curve(
data = niamey_preds_labels,
aes(
x = x_label + x_arrow_just, xend = x_arrow_end,
y = y_label + y_arrow_just, yend = y_arrow_end
),
linewidth = 0.75,
arrow = arrow(length = unit(0.2, "cm")),
curvature = 0.25,
color = "grey20"
)
```
<div class="exercise">
### Exercise 5
</div>
To make things easier, we have assumed the infectious period is known to be 14 days.
In terms of years, $\text{D} = \frac{14}{365} \approx 0.0384$, and the recovery rate is the inverse, i.e., $\gamma = \frac{365}{14} \approx 26.07 \ \text{yr}^{-1}$.
Now, modify the code above to estimate $\gamma$ and $\beta$ simultaneously.
<div class="exercise">
### Exercise 6
</div>
What happens if one or both of the other unknowns ($S_0$ and $I_0$) are fixed instead of $\gamma$?