Reproducing simulations • simpr

library(simpr)

simpr is designed with reproducibility in mind. If you set the same seed and run the same specification, you get identical results.

set.seed(500)
run_1 = specify(a = ~ runif(6)) %>% 
  generate(3)

run_1
#> full tibble
#> --------------------------
#> # A tibble: 3 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       1     1 <tibble [6 × 1]>
#> 2       2     2 <tibble [6 × 1]>
#> 3       3     3 <tibble [6 × 1]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#>        a
#>    <dbl>
#> 1 0.869 
#> 2 0.0882
#> 3 0.914 
#> 4 0.384 
#> 5 0.147 
#> 6 0.352

set.seed(500)
run_2 = specify(a = ~ runif(6)) %>% 
  generate(3)

run_2
#> full tibble
#> --------------------------
#> # A tibble: 3 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       1     1 <tibble [6 × 1]>
#> 2       2     2 <tibble [6 × 1]>
#> 3       3     3 <tibble [6 × 1]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#>        a
#>    <dbl>
#> 1 0.869 
#> 2 0.0882
#> 3 0.914 
#> 4 0.384 
#> 5 0.147 
#> 6 0.352

identical(run_1, run_2)
#> [1] TRUE
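This guarantee rests on R's seeded random number generator: resetting the seed replays the same stream of draws. A minimal base-R sketch of that principle, independent of simpr (draw_1, draw_2, and draw_3 are illustrative names):

```r
# Draw six uniform values twice, resetting the seed before each draw.
set.seed(500)
draw_1 <- runif(6)

set.seed(500)
draw_2 <- runif(6)

# Same seed, same call: the two vectors match exactly.
identical(draw_1, draw_2)

# Without resetting the seed, the stream simply continues,
# so a third draw yields different values.
draw_3 <- runif(6)
identical(draw_1, draw_3)
```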

What’s more, generate() can take filtering criteria, so you can regenerate specific repetitions or conditions without recreating the entire simulation. This requires that the seed, specification, definition, and number of reps are identical to those of the simulation you are trying to reproduce.

set.seed(500)
filter_after_generating = specify(a = ~ runif(6)) %>% 
  generate(3) %>% 
  filter(.sim_id == 2)

filter_after_generating
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       2     2 <tibble [6 × 1]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#>        a
#>    <dbl>
#> 1 0.811 
#> 2 0.100 
#> 3 0.0916
#> 4 0.444 
#> 5 0.205 
#> 6 0.0947

## Much faster, same result!
set.seed(500)
filter_while_generating = specify(a = ~ runif(6)) %>% 
  generate(3, .sim_id == 2)

filter_while_generating
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       2     2 <tibble [6 × 1]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#>        a
#>    <dbl>
#> 1 0.811 
#> 2 0.100 
#> 3 0.0916
#> 4 0.444 
#> 5 0.205 
#> 6 0.0947

identical(filter_after_generating, filter_while_generating)
#> [1] TRUE

Although only one repetition was generated above, its data are the same as those produced when we ran the full simulation.
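One way filtered generation can return the same data as a full run is sketched below in base R. This is an assumption about how such a scheme can work, not a description of simpr's internals: a master seed is used to derive one seed per simulation, and each simulation is then generated from its own seed. generate_sketch() and its arguments are hypothetical names.

```r
# Hypothetical helper: generate `n_sims` simulations, or only a subset of them.
generate_sketch <- function(n_sims, master_seed, keep = seq_len(n_sims)) {
  set.seed(master_seed)
  # Draw one seed per simulation for ALL simulations, so that simulation i
  # receives the same seed whether or not the others are generated.
  sim_seeds <- sample.int(.Machine$integer.max, n_sims)
  lapply(keep, function(i) {
    set.seed(sim_seeds[i])
    runif(6)
  })
}

# Full run versus a run that generates only simulation 2.
full   <- generate_sketch(3, master_seed = 500)
only_2 <- generate_sketch(3, master_seed = 500, keep = 2)

# The filtered run reproduces the second simulation of the full run.
identical(full[[2]], only_2[[1]])
```

Because each simulation's seed in this sketch depends only on the master seed and the simulation's position, skipping the other simulations does not change its data.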

A common use case is regenerating the data for simulations that produced an error. Here’s an example of a simulation that produces errors in only one condition. We generate some data and fit a logistic regression, but fit() warns that some of the fits failed.

set.seed(500)
fit_tidy = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
        b = ~ a + rnorm(10))  %>% 
  define(max = c(0, 1, 10)) %>%
  generate(3) %>% 
  fit(lm = ~ glm(a ~ b, family = "binomial")) %>% 
  tidy_fits()
#> Warning in fit.simpr_tibble(., lm = ~glm(a ~ b, family = "binomial")): fit()
#> produced errors. See '.fit_error_*' column(s).

fit_tidy
#> # A tibble: 15 × 10
#>    .sim_id   max   rep Source .fit_…¹ term   estimate std.er…² statistic p.value
#>      <int> <dbl> <int> <chr>  <chr>   <chr>     <dbl>    <dbl>     <dbl>   <dbl>
#>  1       1     0     1 lm      NA     (Int… -2.46e+ 1  4.26e+4 -5.77e- 4   1.00 
#>  2       1     0     1 lm      NA     b     -5.05e-15  4.42e+4 -1.14e-19   1    
#>  3       2     1     1 lm      NA     (Int… -2.22e- 1  6.89e-1 -3.22e- 1   0.747
#>  4       2     1     1 lm      NA     b     -1.47e+ 0  1.65e+0 -8.93e- 1   0.372
#>  5       3    10     1 lm     "Error… NA    NA        NA       NA         NA    
#>  6       4     0     2 lm      NA     (Int… -2.46e+ 1  4.19e+4 -5.87e- 4   1.00 
#>  7       4     0     2 lm      NA     b      7.36e-15  4.03e+4  1.83e-19   1    
#>  8       5     1     2 lm      NA     (Int… -1.23e- 1  6.79e-1 -1.81e- 1   0.857
#>  9       5     1     2 lm      NA     b      5.74e- 1  1.04e+0  5.53e- 1   0.580
#> 10       6    10     2 lm     "Error… NA    NA        NA       NA         NA    
#> 11       7     0     3 lm      NA     (Int… -2.46e+ 1  4.15e+4 -5.91e- 4   1.00 
#> 12       7     0     3 lm      NA     b      1.73e-14  4.01e+4  4.30e-19   1    
#> 13       8     1     3 lm      NA     (Int… -1.28e+ 0  1.23e+0 -1.04e+ 0   0.296
#> 14       8     1     3 lm      NA     b      1.60e+ 0  1.02e+0  1.57e+ 0   0.117
#> 15       9    10     3 lm     "Error… NA    NA        NA       NA         NA    
#> # … with abbreviated variable names ¹.fit_error, ²std.error
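To see at a glance which conditions fail, it can help to tabulate the error column against the design variables. A rough base-R sketch, using a hand-built stand-in with one row per simulation (the data frame fit_summary and the placeholder error text are illustrative, not simpr output):

```r
# Stand-in for the fit results: one row per simulation, with an error
# message recorded only where the fit failed.
fit_summary <- data.frame(
  .sim_id    = 1:9,
  max        = rep(c(0, 1, 10), times = 3),
  rep        = rep(1:3, each = 3),
  .fit_error = rep(c(NA, NA, "placeholder error message"), times = 3)
)

# Flag failed fits and cross-tabulate them against the `max` condition.
failed <- !is.na(fit_summary$.fit_error)
table(condition_max = fit_summary$max, failed = failed)

# Collect the ids of the simulations that failed.
error_ids <- fit_summary$.sim_id[failed]
error_ids
```

In this stand-in, every failure falls in the max == 10 condition, which mirrors the pattern in the output above.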

One option for regenerating is to filter directly to the problematic max == 10 condition to examine the generated data.

set.seed(500)
filter_max_10 = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
        b = ~ a + rnorm(10))  %>% 
  define(max = c(0, 1, 10)) %>%
  generate(3, max == 10)

filter_max_10
#> full tibble
#> --------------------------
#> # A tibble: 3 × 4
#>   .sim_id   max   rep sim              
#>     <int> <dbl> <int> <list>           
#> 1       3    10     1 <tibble [10 × 2]>
#> 2       6    10     2 <tibble [10 × 2]>
#> 3       9    10     3 <tibble [10 × 2]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 10 × 2
#>        a     b
#>    <int> <dbl>
#>  1    10 10.3 
#>  2     6  6.18
#>  3     3  2.92
#>  4    10 10.3 
#>  5     8  7.65
#>  6     7  8.35
#>  7     1  2.80
#>  8     2  2.55
#>  9     7  7.93
#> 10    10  8.64

Looking at the raw generated data, we can see that our outcome variable a is often larger than 1, which makes no sense for a logistic regression.
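This can be checked directly on the first regenerated dataset. The sketch below re-enters the printed values by hand (rounded as displayed, so b is approximate) and wraps the glm() call in tryCatch() to capture the error rather than stop. sim_data and fit_attempt are illustrative names.

```r
# Values copied from the printed sim[[1]] for the max == 10 condition.
sim_data <- data.frame(
  a = c(10, 6, 3, 10, 8, 7, 1, 2, 7, 10),
  b = c(10.3, 6.18, 2.92, 10.3, 7.65, 8.35, 2.80, 2.55, 7.93, 8.64)
)

# A binomial outcome must lie between 0 and 1; most of these values do not.
range(sim_data$a)
mean(sim_data$a > 1)

# Capture the error instead of halting, then inspect its message.
fit_attempt <- tryCatch(
  glm(a ~ b, family = "binomial", data = sim_data),
  error = function(e) e
)
inherits(fit_attempt, "error")
conditionMessage(fit_attempt)
```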

More generally, we can filter down to only the values of .sim_id that generated errors and examine those:

fit_errors = filter(fit_tidy, !is.na(.fit_error))

set.seed(500)
fit_error_data = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
                     b = ~ a + rnorm(10))  %>% 
  define(max = c(0, 1, 10)) %>%
  generate(3, .sim_id %in% fit_errors$.sim_id)

fit_error_data
#> full tibble
#> --------------------------
#> # A tibble: 3 × 4
#>   .sim_id   max   rep sim              
#>     <int> <dbl> <int> <list>           
#> 1       3    10     1 <tibble [10 × 2]>
#> 2       6    10     2 <tibble [10 × 2]>
#> 3       9    10     3 <tibble [10 × 2]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 10 × 2
#>        a     b
#>    <int> <dbl>
#>  1    10 10.3 
#>  2     6  6.18
#>  3     3  2.92
#>  4    10 10.3 
#>  5     8  7.65
#>  6     7  8.35
#>  7     1  2.80
#>  8     2  2.55
#>  9     7  7.93
#> 10    10  8.64

This approach is useful when we don’t know which conditions are producing the errors. Sometimes simulation errors are not systematic: they stem from numerical issues caused by unlucky draws from the data-generating mechanism.