Error: One or more series in data is missing observations for one or more timepoints #62

SamuelBeauregard · 2024-07-29T19:25:43Z

SamuelBeauregard
Jul 29, 2024

I'm running a model to predict the abundance of a genus based on the abundance of other genuses. I'd like to run an AR model, but I get the error 'Error: One or more series in data is missing observations for one or more timepoints'. I don't understand why I get this error because all my timepoints have a response variable (there is a value of relative abundance for Alkanindiges for each bioreactor and each time point). Also, when plotting my series, I get a choppy plot with missing lines, but all the values seem to be there, they're simply not connected.

The x axis (time) and the observed values also are super compressed, but the predictions seem not to be.

My summary seems decent and I think my data is formatted correctly in the long format. The NA values in the file are because 'y' is my response variable (Alkanindiges's relative abundance) and I don't think I can include the abundance of other genera in the same column.

The bioreactors are technical replicates, but they actually do not behave in the same way so I included a random effect for the bioreactors. I would have liked to have a unique group-level smoother for each bioreactor (GI model in Pedersen et al, 2019) as they have a behaviour somewhat different, but I don't have enough data points for that.

I only have one series and the response variable is y, the abundance of Alkanindiges.

Here is one of the models where the problem occurs, but all my models have the same problem so I think the problem is with the data I use to run the models rather than the model by itself.

mod_alk <- mvgam( y ~ ti(y_all, time, by = Genus, bs = 'tp') + s(bioreactor, bs = 're'),
data = APANB_train,
newdata = APANB_test,
family = betar(link = 'logit'),
trend_model = CAR(1),
noncentred = TRUE,
parallel = TRUE,
chains = 4,
backend = 'cmdstanr',
max_treedepth = 14,
adapt_delta = 0.97)

APANB_train.csv (the data I use to train my model)

APANB_test.csv (the data I use to test the model's predictions)

(The chopy time series that seems to have all the data points, but some are not linked together)

(summary of APANB_train)

(forecast of the model. I currently use CAR(1), but I would like to be able to use AR models instead. You can see the compressed x axis and data, but the forecast seems reasonnable.)

Thank you very much for you help, it is extremely appreciated!

As a bonus question, I'd like to confirm that the CI in the forecast are 2.5, 25, 50, 75 and 97.5%. Thank you!

Answered by nicholasjclark

Aug 1, 2024

Thanks @purtyboiii for clarifying and for posting the data, this is helpful for debugging. You will need to widen your data so that you can use values of the other genera as predictors of your focal genus. I show below how you might go about doing this:

library(mvgam)
library(dplyr)

# Read in the data
data_train <- read.csv('APANB_train.csv', as.is = TRUE)
data_test <- read.csv('APANB_test.csv', as.is = TRUE)

# Ensure each series (bioreactor level) is included via
# an appropriate indicator, and convert to factor; also ensure
# that values of the other genera can be used as predictors of the
# outcome of interest (Alkanindiges) by spreading into their own columns
data_train %>%
  mutate(b…

View full answer

nicholasjclark · 2024-07-31T02:07:31Z

nicholasjclark
Jul 31, 2024
Maintainer

Thanks for the post @purtyboiii. You seem to have multiple bioreactor values, so I'm guessing you want to use these as the your units of analysis (i.e. Bioreactor 1 Alkanindiges would be the first series, while Bioreactor 2 Alkanindiges would be the second series. Is this correct? If so, you'll need to supply a smaller dataset that includes this as your outcome, something like:

data %>%
  mutate(bio_alk = paste0(bioreactor, Genus)) %>%
  filter(Genus == "Alkanindiges") %>%
  mutate(series = drop.levels(as.factor(bio_alk))) -> alk_data

1 reply

SamuelBeauregard Jul 31, 2024
Author

Thank you for you quick answer! I updated the original post with more information.

nicholasjclark · 2024-08-01T03:44:25Z

nicholasjclark
Aug 1, 2024
Maintainer

Thanks @purtyboiii for clarifying and for posting the data, this is helpful for debugging. You will need to widen your data so that you can use values of the other genera as predictors of your focal genus. I show below how you might go about doing this:

library(mvgam)
library(dplyr)

# Read in the data
data_train <- read.csv('APANB_train.csv', as.is = TRUE)
data_test <- read.csv('APANB_test.csv', as.is = TRUE)

# Ensure each series (bioreactor level) is included via
# an appropriate indicator, and convert to factor; also ensure
# that values of the other genera can be used as predictors of the
# outcome of interest (Alkanindiges) by spreading into their own columns
data_train %>%
  mutate(bio_alk = gsub(' ', '_', paste0(bioreactor, Genus))) %>%
  tidyr::pivot_wider(names_from = 'Genus', values_from = 'y_all') %>%
  group_by(bioreactor, time) %>%
  mutate_at(.vars = c('Alkanindiges', 'Unclassified_Beijerinckiaceae',
                      'Novosphingobium', 'Others', 'Pseudomonas',
                      'Unclassified_Acetobacteraceae', 
                      'Unclassified_Beijerinckiaceae'),
            .funs = function(x) mean(x, na.rm = TRUE)) %>%
  slice_head(n = 1) %>%
  mutate(series = droplevels(as.factor(bio_alk)),
         bioreactor = as.factor(bioreactor)) -> data_train

data_test %>%
  mutate(bio_alk = gsub(' ', '_', paste0(bioreactor, Genus))) %>%
  tidyr::pivot_wider(names_from = 'Genus', values_from = 'y_all') %>%
  group_by(bioreactor, time) %>%
  mutate_at(.vars = c('Alkanindiges', 'Unclassified_Beijerinckiaceae',
                      'Novosphingobium', 'Others', 'Pseudomonas',
                      'Unclassified_Acetobacteraceae', 
                      'Unclassified_Beijerinckiaceae'),
            .funs = function(x) mean(x, na.rm = TRUE)) %>%
  slice_head(n = 1) %>%
  mutate(series = droplevels(as.factor(bio_alk)),
         bioreactor = as.factor(bioreactor)) -> data_test

# Plot training and testing data for the first series (bioreactor 1)
plot_mvgam_series(data = data_train, newdata = data_test)

# Fit an AR1 model that includes time-varying effects of 
# Pseudomonas and Novosphingobium as predictors, just for an example;
# Rather than estimating a different Beta precision parameter for each
# bioreactor, we can specify that they be shared
mod_alk <- mvgam(y ~ s(time, by = Pseudomonas) + 
                   s(time, by = Novosphingobium) +
                   s(bioreactor, bs = 're'),
                 data = data_train,
                 newdata = data_test,
                 share_obs_params = TRUE,
                 family = betar(link = 'logit'),
                 trend_model = AR(1),
                 noncentred = TRUE,
                 parallel = TRUE,
                 chains = 4,
                 backend = 'cmdstanr')
#> Compiling Stan program using cmdstanr
#> Start sampling
#> Running MCMC with 4 parallel chains...
#> 
#> Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
#> Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
#> Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
#> Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
#> Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
#> Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
#> Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
#> Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
#> Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
#> Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
#> Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
#> Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
#> Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
#> Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
#> Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
#> Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
#> Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
#> Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
#> Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
#> Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
#> Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
#> Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
#> Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
#> Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
#> Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
#> Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
#> Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
#> Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
#> Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
#> Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
#> Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
#> Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
#> Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
#> Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
#> Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
#> Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
#> Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
#> Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
#> Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
#> Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
#> Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
#> Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
#> Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
#> Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
#> Chain 4 finished in 8.8 seconds.
#> Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
#> Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
#> Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
#> Chain 1 finished in 9.8 seconds.
#> Chain 2 finished in 9.7 seconds.
#> Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
#> Chain 3 finished in 10.3 seconds.
#> 
#> All 4 chains finished successfully.
#> Mean chain execution time: 9.7 seconds.
#> Total execution time: 10.6 seconds.

# Plot conditional effects on the expectation scale
conditional_effects(mod_alk, type = 'expected')

# If you have gratia installed, you can also draw the smooths
# on the link scale, which is useful for visualising time-varying effects
require(gratia)
draw(mod_alk)

# Plot the hindcast and forecast distributions for each bioreactor
plot(mod_alk, type = 'forecast', series = 1)
#> Out of sample CRPS:
#> 0.103626855689369

plot(mod_alk, type = 'forecast', series = 2)
#> Out of sample CRPS:
#> 0.278629406254966

plot(mod_alk, type = 'forecast', series = 3)
#> Out of sample CRPS:
#> 0.171947161786803

# Model summary
summary(mod_alk)
#> GAM formula:
#> y ~ s(time, by = Pseudomonas) + s(time, by = Novosphingobium) + 
#>     s(bioreactor, bs = "re")
#> 
#> Family:
#> beta
#> 
#> Link function:
#> logit
#> 
#> Trend model:
#> AR(p = 1)
#> 
#> N series:
#> 3 
#> 
#> N timepoints:
#> 24 
#> 
#> Status:
#> Fitted using Stan 
#> 4 chains, each with iter = 1000; warmup = 500; thin = 1 
#> Total post-warmup draws = 2000
#> 
#> 
#> Observation precision parameter estimates:
#>     2.5% 50% 97.5% Rhat n_eff
#> phi   44 170   470    1   188
#> 
#> GAM coefficient (beta) estimates:
#>                              2.5%      50% 97.5% Rhat n_eff
#> (Intercept)                -3.400 -1.3e+00 0.690 1.00  1343
#> s(time):Pseudomonas.1      -0.390  1.5e-01 2.200 1.02   116
#> s(time):Pseudomonas.2      -0.390 -1.7e-02 0.280 1.01   656
#> s(time):Pseudomonas.3      -0.210  3.1e-02 0.630 1.02   215
#> s(time):Pseudomonas.4      -0.150  2.0e-03 0.130 1.01   451
#> s(time):Pseudomonas.5      -0.088  6.6e-03 0.210 1.01   245
#> s(time):Pseudomonas.6      -0.094  2.1e-06 0.093 1.00   647
#> s(time):Pseudomonas.7      -0.073  1.7e-03 0.110 1.01   305
#> s(time):Pseudomonas.8      -0.060  6.3e-05 0.060 1.01   527
#> s(time):Pseudomonas.9      -5.900 -9.5e-01 0.140 1.02    96
#> s(time):Pseudomonas.10     -2.900 -1.8e-01 0.520 1.02    96
#> s(time):Novosphingobium.1  -0.460  8.1e-02 2.700 1.03   138
#> s(time):Novosphingobium.2  -0.430  3.9e-03 0.310 1.01   466
#> s(time):Novosphingobium.3  -0.200  2.1e-02 0.690 1.03   193
#> s(time):Novosphingobium.4  -0.130 -1.1e-03 0.130 1.02   257
#> s(time):Novosphingobium.5  -0.081  5.7e-03 0.210 1.02   188
#> s(time):Novosphingobium.6  -0.075  1.3e-04 0.075 1.01   431
#> s(time):Novosphingobium.7  -0.057  1.4e-03 0.089 1.02   251
#> s(time):Novosphingobium.8  -0.052 -5.5e-04 0.054 1.00   704
#> s(time):Novosphingobium.9  -6.700 -1.6e-01 0.310 1.10    40
#> s(time):Novosphingobium.10 -1.600 -8.0e-02 0.600 1.01   227
#> s(bioreactor).1            -2.500 -5.3e-01 1.600 1.00  1787
#> s(bioreactor).2            -2.100 -1.7e-01 1.900 1.00  1779
#> s(bioreactor).3            -2.100 -1.6e-01 1.900 1.00  1742
#> 
#> GAM group-level estimates:
#>                       2.5%   50% 97.5% Rhat n_eff
#> mean(s(bioreactor)) -2.000 -0.23   1.6    1  2043
#> sd(s(bioreactor))    0.038  0.44   2.4    1   851
#> 
#> Approximate significance of GAM smooths:
#>                          edf Ref.df Chi.sq p-value
#> s(time):Pseudomonas     2.13     10   1.87    0.43
#> s(time):Novosphingobium 1.14     10   0.03    0.97
#> s(bioreactor)           1.99      3   2.16    0.35
#> 
#> Latent trend parameter AR estimates:
#>            2.5%  50% 97.5% Rhat n_eff
#> ar1[1]   -0.500 0.44  0.96 1.00   461
#> ar1[2]   -0.110 0.50  0.94 1.01   717
#> ar1[3]   -0.110 0.53  0.96 1.00   623
#> sigma[1]  0.057 0.44  0.81 1.01   191
#> sigma[2]  0.260 0.51  0.87 1.01   233
#> sigma[3]  0.290 0.51  0.85 1.01   337
#> 
#> Stan MCMC diagnostics:
#> n_eff / iter looks reasonable for all parameters
#> Rhats above 1.05 found for 2 parameters
#>  *Diagnose further to investigate why the chains have not mixed
#> 24 of 2000 iterations ended with a divergence (1.2%)
#>  *Try running with larger adapt_delta to remove the divergences
#> 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
#> E-FMI indicated no pathological behavior
#> 
#> Samples were drawn using NUTS(diag_e) at Thu Aug 01 1:39:13 PM 2024.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split MCMC chains
#> (at convergence, Rhat = 1)

^{Created on 2024-08-01 with reprex v2.1.0}

2 replies

SamuelBeauregard Aug 1, 2024
Author

WOW you sir are amazing! It worked perfectly and predicts much better than before!

If you don't mind me asking,

I'm surprised to see that we have to put the data in wide format rather than long. Is it because it allows us to select a few rows of the wide df and basically use it as a long format?
Does it not matter what the different levels of the 'series' column are? I thought that it had to be 'series_1', 'series_2', etc, but can it be anything and the program will assign the 1st series to the first level it encounter (Bioreactor_1Alkanindiges), 2nd series to the second level it encounter (Bioreactor_2Alkanindiges), so on and so forth?
Do you know why the prediction plot would not display as it should? Was it because of the NAs?
I'm trying to get a hold of mvgam to show my colleagues as I think it is an incredibly usefull tool for microbial ecology.
Once again, thank you so much for your time, it helped tremendously!

nicholasjclark Aug 5, 2024
Maintainer

My pleasure @purtyboiii. You can use any names you like for the factor levels in the series column. And I suspect your previous issues were because there were multiple observations per series x time combination (i.e. series 1 had multiple values for time == 1), and this is not allowed in mvgam

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error: One or more series in data is missing observations for one or more timepoints #62

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 3 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Error: One or more series in data is missing observations for one or more timepoints #62

SamuelBeauregard Jul 29, 2024

Replies: 2 comments · 3 replies

nicholasjclark Jul 31, 2024 Maintainer

SamuelBeauregard Jul 31, 2024 Author

nicholasjclark Aug 1, 2024 Maintainer

SamuelBeauregard Aug 1, 2024 Author

nicholasjclark Aug 5, 2024 Maintainer

SamuelBeauregard
Jul 29, 2024

Replies: 2 comments 3 replies

nicholasjclark
Jul 31, 2024
Maintainer

SamuelBeauregard Jul 31, 2024
Author

nicholasjclark
Aug 1, 2024
Maintainer

SamuelBeauregard Aug 1, 2024
Author

nicholasjclark Aug 5, 2024
Maintainer