simpr is designed with reproducibility in mind: if you set the same seed, you get the same results.
set.seed(500)
run_1 = specify(a = ~ runif(6)) %>%
generate(3)
run_1
#> full tibble
#> --------------------------
#> # A tibble: 3 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 1]>
#> 2 2 2 <tibble [6 × 1]>
#> 3 3 3 <tibble [6 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 0.869
#> 2 0.0882
#> 3 0.914
#> 4 0.384
#> 5 0.147
#> 6 0.352
set.seed(500)
run_2 = specify(a = ~ runif(6)) %>%
generate(3)
run_2
#> full tibble
#> --------------------------
#> # A tibble: 3 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 1]>
#> 2 2 2 <tibble [6 × 1]>
#> 3 3 3 <tibble [6 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 0.869
#> 2 0.0882
#> 3 0.914
#> 4 0.384
#> 5 0.147
#> 6 0.352
identical(run_1, run_2)
#> [1] TRUE
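Conversely, a different seed yields different draws. A quick sketch (not from the vignette) using the same pipeline as above:

```r
# A sketch: the same specification with a different seed produces
# different simulated data, so identical() returns FALSE.
set.seed(501)
run_3 = specify(a = ~ runif(6)) %>%
  generate(3)

identical(run_1, run_3)
#> [1] FALSE
```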
What’s more, generate()
can take filtering criteria, so
that you can re-generate specific repetitions or conditions without
having to recreate the entire simulation. This requires that the
seed, specification, definition, and number of reps are identical to those of the
simulation you are trying to reproduce.
set.seed(500)
filter_after_generating = specify(a = ~ runif(6)) %>%
generate(3) %>%
filter(.sim_id == 2)
filter_after_generating
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 2 2 <tibble [6 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 0.811
#> 2 0.100
#> 3 0.0916
#> 4 0.444
#> 5 0.205
#> 6 0.0947
## Much faster, same result!
set.seed(500)
filter_while_generating = specify(a = ~ runif(6)) %>%
generate(3, .sim_id == 2)
filter_while_generating
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 2 2 <tibble [6 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 0.811
#> 2 0.100
#> 3 0.0916
#> 4 0.444
#> 5 0.205
#> 6 0.0947
identical(filter_after_generating, filter_while_generating)
#> [1] TRUE
Although only one repetition was generated above, it contains exactly the same data as the corresponding repetition from the full simulation.
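The filtering criteria are not limited to a single repetition. As a hypothetical extension of the example above (not from the vignette), you could regenerate several repetitions at once:

```r
# A sketch: regenerate only repetitions 1 and 3 in a single call.
# This should reproduce rows 1 and 3 of run_1, under the same
# seed/specification/reps requirement described above.
set.seed(500)
filter_two_reps = specify(a = ~ runif(6)) %>%
  generate(3, .sim_id %in% c(1, 3))
```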
A common use case is regenerating the data when an error occurred. Here’s an example of a simulation that produces errors in only one condition: we generate some data and fit a logistic regression, but notice that we get some errors.
set.seed(500)
fit_tidy = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
b = ~ a + rnorm(10)) %>%
define(max = c(0, 1, 10)) %>%
generate(3) %>%
fit(lm = ~ glm(a ~ b, family = "binomial")) %>%
tidy_fits()
#> Warning in fit.simpr_tibble(., lm = ~glm(a ~ b, family = "binomial")): fit()
#> produced errors. See '.fit_error_*' column(s).
fit_tidy
#> # A tibble: 15 × 10
#> .sim_id max rep Source .fit_…¹ term estimate std.er…² statistic p.value
#> <int> <dbl> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 1 lm NA (Int… -2.46e+ 1 4.26e+4 -5.77e- 4 1.00
#> 2 1 0 1 lm NA b -5.05e-15 4.42e+4 -1.14e-19 1
#> 3 2 1 1 lm NA (Int… -2.22e- 1 6.89e-1 -3.22e- 1 0.747
#> 4 2 1 1 lm NA b -1.47e+ 0 1.65e+0 -8.93e- 1 0.372
#> 5 3 10 1 lm "Error… NA NA NA NA NA
#> 6 4 0 2 lm NA (Int… -2.46e+ 1 4.19e+4 -5.87e- 4 1.00
#> 7 4 0 2 lm NA b 7.36e-15 4.03e+4 1.83e-19 1
#> 8 5 1 2 lm NA (Int… -1.23e- 1 6.79e-1 -1.81e- 1 0.857
#> 9 5 1 2 lm NA b 5.74e- 1 1.04e+0 5.53e- 1 0.580
#> 10 6 10 2 lm "Error… NA NA NA NA NA
#> 11 7 0 3 lm NA (Int… -2.46e+ 1 4.15e+4 -5.91e- 4 1.00
#> 12 7 0 3 lm NA b 1.73e-14 4.01e+4 4.30e-19 1
#> 13 8 1 3 lm NA (Int… -1.28e+ 0 1.23e+0 -1.04e+ 0 0.296
#> 14 8 1 3 lm NA b 1.60e+ 0 1.02e+0 1.57e+ 0 0.117
#> 15 9 10 3 lm "Error… NA NA NA NA NA
#> # … with abbreviated variable names ¹.fit_error, ²std.error
One option for regenerating is to filter directly to the problematic
max == 10
condition to examine the generated data.
set.seed(500)
filter_max_10 = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
b = ~ a + rnorm(10)) %>%
define(max = c(0, 1, 10)) %>%
generate(3, max == 10)
filter_max_10
#> full tibble
#> --------------------------
#> # A tibble: 3 × 4
#> .sim_id max rep sim
#> <int> <dbl> <int> <list>
#> 1 3 10 1 <tibble [10 × 2]>
#> 2 6 10 2 <tibble [10 × 2]>
#> 3 9 10 3 <tibble [10 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 10 × 2
#> a b
#> <int> <dbl>
#> 1 10 10.3
#> 2 6 6.18
#> 3 3 2.92
#> 4 10 10.3
#> 5 8 7.65
#> 6 7 8.35
#> 7 1 2.80
#> 8 2 2.55
#> 9 7 7.93
#> 10 10 8.64
Looking at the raw generated data, we can see our outcome variable is often larger than 1, which makes no sense for a logistic regression.
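One possible fix, sketched below (this is not part of the original simulation, and the plogis() link between a and b is an illustrative choice), is to specify a genuinely binary outcome so the logistic regression is well-defined in every repetition:

```r
# A hedged sketch: respecify `a` as a 0/1 outcome via rbinom() so that
# glm(family = "binomial") no longer errors. `b` is defined first so the
# formula for `a` can refer to it.
set.seed(500)
fit_binary = specify(b = ~ rnorm(10),
                     a = ~ rbinom(10, size = 1, prob = plogis(b))) %>%
  generate(3) %>%
  fit(lm = ~ glm(a ~ b, family = "binomial")) %>%
  tidy_fits()
```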
In general, we could also filter down to only the values of
.sim_id
that generated errors and examine those:
fit_errors = filter(fit_tidy, !is.na(.fit_error))
set.seed(500)
fit_error_data = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
b = ~ a + rnorm(10)) %>%
define(max = c(0, 1, 10)) %>%
generate(3, .sim_id %in% fit_errors$.sim_id)
fit_error_data
#> full tibble
#> --------------------------
#> # A tibble: 3 × 4
#> .sim_id max rep sim
#> <int> <dbl> <int> <list>
#> 1 3 10 1 <tibble [10 × 2]>
#> 2 6 10 2 <tibble [10 × 2]>
#> 3 9 10 3 <tibble [10 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 10 × 2
#> a b
#> <int> <dbl>
#> 1 10 10.3
#> 2 6 6.18
#> 3 3 2.92
#> 4 10 10.3
#> 5 8 7.65
#> 6 7 8.35
#> 7 1 2.80
#> 8 2 2.55
#> 9 7 7.93
#> 10 10 8.64
This approach is useful when we don’t know which conditions are producing the errors. Sometimes simulation errors stem from numerical issues caused by unlucky draws from the data-generating mechanism, and are not systematic.
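In such cases it can also help to look at the recorded error messages themselves before regenerating anything. A sketch (not from the vignette), assuming the fit_tidy object from above and the dplyr verbs already used in this article:

```r
# A sketch: list each errored fit with its condition and error message,
# to check whether the errors share a common cause.
library(dplyr)

fit_tidy %>%
  filter(!is.na(.fit_error)) %>%
  distinct(.sim_id, max, .fit_error)
```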