Error: One or more series in data is missing observations for one or more timepoints #62
-
I'm running a model to predict the abundance of a genus based on the abundance of other genuses. I'd like to run an AR model, but I get the error 'Error: One or more series in data is missing observations for one or more timepoints'. I don't understand why I get this error because all my timepoints have a response variable (there is a value of relative abundance for Alkanindiges for each bioreactor and each time point). Also, when plotting my series, I get a choppy plot with missing lines, but all the values seem to be there, they're simply not connected. The x axis (time) and the observed values also are super compressed, but the predictions seem not to be. My summary seems decent and I think my data is formatted correctly in the long format. The NA values in the file are because 'y' is my response variable (Alkanindiges's relative abundance) and I don't think I can include the abundance of other genera in the same column. The bioreactors are technical replicates, but they actually do not behave in the same way so I included a random effect for the bioreactors. I would have liked to have a unique group-level smoother for each bioreactor (GI model in Pedersen et al, 2019) as they have a behaviour somewhat different, but I don't have enough data points for that. I only have one series and the response variable is y, the abundance of Alkanindiges. Here is one of the models where the problem occurs, but all my models have the same problem so I think the problem is with the data I use to run the models rather than the model by itself. mod_alk <- mvgam( y ~ ti(y_all, time, by = Genus, bs = 'tp') + s(bioreactor, bs = 're'), APANB_train.csv (the data I use to train my model) APANB_test.csv (the data I use to test the model's predictions)
Thank you very much for you help, it is extremely appreciated! As a bonus question, I'd like to confirm that the CI in the forecast are 2.5, 25, 50, 75 and 97.5%. Thank you! |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 3 replies
-
Thanks for the post @purtyboiii. You seem to have multiple bioreactor values, so I'm guessing you want to use these as the your units of analysis (i.e.
|
Beta Was this translation helpful? Give feedback.
-
Thanks @purtyboiii for clarifying and for posting the data, this is helpful for debugging. You will need to widen your data so that you can use values of the other genera as predictors of your focal genus. I show below how you might go about doing this: library(mvgam)
library(dplyr)
# Read in the data
data_train <- read.csv('APANB_train.csv', as.is = TRUE)
data_test <- read.csv('APANB_test.csv', as.is = TRUE)
# Ensure each series (bioreactor level) is included via
# an appropriate indicator, and convert to factor; also ensure
# that values of the other genera can be used as predictors of the
# outcome of interest (Alkanindiges) by spreading into their own columns
data_train %>%
mutate(bio_alk = gsub(' ', '_', paste0(bioreactor, Genus))) %>%
tidyr::pivot_wider(names_from = 'Genus', values_from = 'y_all') %>%
group_by(bioreactor, time) %>%
mutate_at(.vars = c('Alkanindiges', 'Unclassified_Beijerinckiaceae',
'Novosphingobium', 'Others', 'Pseudomonas',
'Unclassified_Acetobacteraceae',
'Unclassified_Beijerinckiaceae'),
.funs = function(x) mean(x, na.rm = TRUE)) %>%
slice_head(n = 1) %>%
mutate(series = droplevels(as.factor(bio_alk)),
bioreactor = as.factor(bioreactor)) -> data_train
data_test %>%
mutate(bio_alk = gsub(' ', '_', paste0(bioreactor, Genus))) %>%
tidyr::pivot_wider(names_from = 'Genus', values_from = 'y_all') %>%
group_by(bioreactor, time) %>%
mutate_at(.vars = c('Alkanindiges', 'Unclassified_Beijerinckiaceae',
'Novosphingobium', 'Others', 'Pseudomonas',
'Unclassified_Acetobacteraceae',
'Unclassified_Beijerinckiaceae'),
.funs = function(x) mean(x, na.rm = TRUE)) %>%
slice_head(n = 1) %>%
mutate(series = droplevels(as.factor(bio_alk)),
bioreactor = as.factor(bioreactor)) -> data_test
# Plot training and testing data for the first series (bioreactor 1)
plot_mvgam_series(data = data_train, newdata = data_test) # Fit an AR1 model that includes time-varying effects of
# Pseudomonas and Novosphingobium as predictors, just for an example;
# Rather than estimating a different Beta precision parameter for each
# bioreactor, we can specify that they be shared
mod_alk <- mvgam(y ~ s(time, by = Pseudomonas) +
s(time, by = Novosphingobium) +
s(bioreactor, bs = 're'),
data = data_train,
newdata = data_test,
share_obs_params = TRUE,
family = betar(link = 'logit'),
trend_model = AR(1),
noncentred = TRUE,
parallel = TRUE,
chains = 4,
backend = 'cmdstanr')
#> Compiling Stan program using cmdstanr
#> Start sampling
#> Running MCMC with 4 parallel chains...
#>
#> Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
#> Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
#> Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
#> Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
#> Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
#> Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
#> Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
#> Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
#> Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
#> Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
#> Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
#> Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
#> Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
#> Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
#> Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
#> Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
#> Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
#> Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
#> Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
#> Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
#> Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
#> Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
#> Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
#> Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
#> Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
#> Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
#> Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
#> Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
#> Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
#> Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
#> Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
#> Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
#> Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
#> Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
#> Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
#> Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
#> Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
#> Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
#> Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
#> Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
#> Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
#> Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
#> Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
#> Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
#> Chain 4 finished in 8.8 seconds.
#> Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
#> Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
#> Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
#> Chain 1 finished in 9.8 seconds.
#> Chain 2 finished in 9.7 seconds.
#> Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
#> Chain 3 finished in 10.3 seconds.
#>
#> All 4 chains finished successfully.
#> Mean chain execution time: 9.7 seconds.
#> Total execution time: 10.6 seconds.
# Plot conditional effects on the expectation scale
conditional_effects(mod_alk, type = 'expected') # If you have gratia installed, you can also draw the smooths
# on the link scale, which is useful for visualising time-varying effects
require(gratia)
draw(mod_alk) # Plot the hindcast and forecast distributions for each bioreactor
plot(mod_alk, type = 'forecast', series = 1)
#> Out of sample CRPS:
#> 0.103626855689369 plot(mod_alk, type = 'forecast', series = 2)
#> Out of sample CRPS:
#> 0.278629406254966 plot(mod_alk, type = 'forecast', series = 3)
#> Out of sample CRPS:
#> 0.171947161786803 # Model summary
summary(mod_alk)
#> GAM formula:
#> y ~ s(time, by = Pseudomonas) + s(time, by = Novosphingobium) +
#> s(bioreactor, bs = "re")
#>
#> Family:
#> beta
#>
#> Link function:
#> logit
#>
#> Trend model:
#> AR(p = 1)
#>
#> N series:
#> 3
#>
#> N timepoints:
#> 24
#>
#> Status:
#> Fitted using Stan
#> 4 chains, each with iter = 1000; warmup = 500; thin = 1
#> Total post-warmup draws = 2000
#>
#>
#> Observation precision parameter estimates:
#> 2.5% 50% 97.5% Rhat n_eff
#> phi 44 170 470 1 188
#>
#> GAM coefficient (beta) estimates:
#> 2.5% 50% 97.5% Rhat n_eff
#> (Intercept) -3.400 -1.3e+00 0.690 1.00 1343
#> s(time):Pseudomonas.1 -0.390 1.5e-01 2.200 1.02 116
#> s(time):Pseudomonas.2 -0.390 -1.7e-02 0.280 1.01 656
#> s(time):Pseudomonas.3 -0.210 3.1e-02 0.630 1.02 215
#> s(time):Pseudomonas.4 -0.150 2.0e-03 0.130 1.01 451
#> s(time):Pseudomonas.5 -0.088 6.6e-03 0.210 1.01 245
#> s(time):Pseudomonas.6 -0.094 2.1e-06 0.093 1.00 647
#> s(time):Pseudomonas.7 -0.073 1.7e-03 0.110 1.01 305
#> s(time):Pseudomonas.8 -0.060 6.3e-05 0.060 1.01 527
#> s(time):Pseudomonas.9 -5.900 -9.5e-01 0.140 1.02 96
#> s(time):Pseudomonas.10 -2.900 -1.8e-01 0.520 1.02 96
#> s(time):Novosphingobium.1 -0.460 8.1e-02 2.700 1.03 138
#> s(time):Novosphingobium.2 -0.430 3.9e-03 0.310 1.01 466
#> s(time):Novosphingobium.3 -0.200 2.1e-02 0.690 1.03 193
#> s(time):Novosphingobium.4 -0.130 -1.1e-03 0.130 1.02 257
#> s(time):Novosphingobium.5 -0.081 5.7e-03 0.210 1.02 188
#> s(time):Novosphingobium.6 -0.075 1.3e-04 0.075 1.01 431
#> s(time):Novosphingobium.7 -0.057 1.4e-03 0.089 1.02 251
#> s(time):Novosphingobium.8 -0.052 -5.5e-04 0.054 1.00 704
#> s(time):Novosphingobium.9 -6.700 -1.6e-01 0.310 1.10 40
#> s(time):Novosphingobium.10 -1.600 -8.0e-02 0.600 1.01 227
#> s(bioreactor).1 -2.500 -5.3e-01 1.600 1.00 1787
#> s(bioreactor).2 -2.100 -1.7e-01 1.900 1.00 1779
#> s(bioreactor).3 -2.100 -1.6e-01 1.900 1.00 1742
#>
#> GAM group-level estimates:
#> 2.5% 50% 97.5% Rhat n_eff
#> mean(s(bioreactor)) -2.000 -0.23 1.6 1 2043
#> sd(s(bioreactor)) 0.038 0.44 2.4 1 851
#>
#> Approximate significance of GAM smooths:
#> edf Ref.df Chi.sq p-value
#> s(time):Pseudomonas 2.13 10 1.87 0.43
#> s(time):Novosphingobium 1.14 10 0.03 0.97
#> s(bioreactor) 1.99 3 2.16 0.35
#>
#> Latent trend parameter AR estimates:
#> 2.5% 50% 97.5% Rhat n_eff
#> ar1[1] -0.500 0.44 0.96 1.00 461
#> ar1[2] -0.110 0.50 0.94 1.01 717
#> ar1[3] -0.110 0.53 0.96 1.00 623
#> sigma[1] 0.057 0.44 0.81 1.01 191
#> sigma[2] 0.260 0.51 0.87 1.01 233
#> sigma[3] 0.290 0.51 0.85 1.01 337
#>
#> Stan MCMC diagnostics:
#> n_eff / iter looks reasonable for all parameters
#> Rhats above 1.05 found for 2 parameters
#> *Diagnose further to investigate why the chains have not mixed
#> 24 of 2000 iterations ended with a divergence (1.2%)
#> *Try running with larger adapt_delta to remove the divergences
#> 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
#> E-FMI indicated no pathological behavior
#>
#> Samples were drawn using NUTS(diag_e) at Thu Aug 01 1:39:13 PM 2024.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split MCMC chains
#> (at convergence, Rhat = 1) Created on 2024-08-01 with reprex v2.1.0 |
Beta Was this translation helpful? Give feedback.
Thanks @purtyboiii for clarifying and for posting the data, this is helpful for debugging. You will need to widen your data so that you can use values of the other genera as predictors of your focal genus. I show below how you might go about doing this: