This vignette provides an overview of each step in the
simpr
workflow:
specify()
your data-generating process
define()
parameters that you want to systematically
vary across your simulation design (e.g. n, effect
size)
generate()
the simulation data
tidy_fits()
for further processing, such as
computing power or Type I Error rates
specify()
your data-generating process
specify()
takes arbitrary R expressions that can be used
for generating data. Each argument should have a name and be
prefixed with ~
, the tilde operator. Order
matters: later arguments can refer to earlier arguments, but not the
other way around.
Good—b
specification refers to a
specification, and comes after a
:
specify(a = ~ runif(6),
b = ~ a + rnorm(6)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.773 -0.334
#> 2 0.642 0.435
#> 3 0.770 1.19
#> 4 0.205 0.253
#> 5 0.928 2.04
#> 6 0.989 1.99
Error—b
specification refers to a
specification, but comes before a
, so
generate()
doesn’t know what a
is:
specify(b = ~ a + rnorm(6),
a = ~ runif(6)) %>%
generate(1)
#> Warning in create_sim_results(specs = specs, x = x[c("meta_info", "specify", :
#> Simulation produced errors. See column '.sim_error'.
#> tibble
#> --------------------------
#> # A tibble: 1 × 4
#> .sim_id rep sim .sim_error
#> <int> <int> <list> <chr>
#> 1 1 1 <NULL> "\u001b[1m\u001b[33mError\u001b[39m in `map()`:\u001b[22…
All arguments must imply the same number of rows. Arguments that imply 1 row are recycled.
OK—both a
and b
imply 6 rows:
specify(a = ~ runif(6),
b = ~ rnorm(6)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.136 1.32
#> 2 0.699 -1.70
#> 3 0.392 -0.381
#> 4 0.744 -2.58
#> 5 0.267 0.484
#> 6 0.182 0.713
OK—a
implies 1 row and b
implies 6 rows, so
a
is recycled:
specify(a = ~ runif(1),
b = ~ rnorm(6)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.638 0.296
#> 2 0.638 1.56
#> 3 0.638 -1.45
#> 4 0.638 -0.762
#> 5 0.638 -0.101
#> 6 0.638 -1.04
Error—a
implies 2 rows and b
implies 6
rows:
specify(a = ~ runif(2),
b = ~ rnorm(6)) %>%
generate(1)
#> Warning in create_sim_results(specs = specs, x = x[c("meta_info", "specify", :
#> Simulation produced errors. See column '.sim_error'.
#> tibble
#> --------------------------
#> # A tibble: 1 × 4
#> .sim_id rep sim .sim_error
#> <int> <int> <list> <chr>
#> 1 1 1 <NULL> "\u001b[1m\u001b[33mError\u001b[39m in `dplyr::bind_cols…
Using x
as an argument to specify()
is not
recommended, because for technical reasons x
is always
placed as the first argument. This means that if x
refers
to prior variables it will return an error:
specify(y = ~ runif(6),
x = ~ y + runif(6)) %>%
generate(1)
#> Formula specification for 'x' detected. Assuming 'x' is the first formula.
#>
#> To hide this message, or to avoid moving this formula first, use a different variable name.
#> Warning in create_sim_results(specs = specs, x = x[c("meta_info", "specify", :
#> Simulation produced errors. See column '.sim_error'.
#> tibble
#> --------------------------
#> # A tibble: 1 × 4
#> .sim_id rep sim .sim_error
#> <int> <int> <list> <chr>
#> 1 1 1 <NULL> "\u001b[1m\u001b[33mError\u001b[39m in `map()`:\u001b[22…
The same specification works fine when x
is renamed:
specify(y = ~ runif(6),
a = ~ y + runif(6)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> y a
#> <dbl> <dbl>
#> 1 0.686 1.07
#> 2 0.114 0.272
#> 3 0.00425 0.495
#> 4 0.567 1.14
#> 5 0.270 0.500
#> 6 0.717 1.57
specify()
accepts expressions that generate multiple
columns simultaneously in a matrix
,
data.frame
, or tibble
. By default, the column
names in the output append a number to the variable name.
Here’s an example using MASS::mvrnorm()
, which returns
draws from the multivariate normal distribution as a matrix.
MASS::mvrnorm()
determines the number of columns for the
output data from the length of mu
and the dimensions of the
variance-covariance matrix Sigma
.
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3)))) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> a_1 a_2 a_3
#> <dbl> <dbl> <dbl>
#> 1 1.41 -1.96 -0.496
#> 2 -0.127 -0.489 -1.89
#> 3 0.564 -0.291 -0.924
#> 4 1.55 0.724 -1.01
#> 5 0.340 0.712 -0.847
#> 6 2.32 0.174 -2.04
The argument was named a
in specify()
, so
generate()
creates three variables named a_1
,
a_2
, and a_3
.
You can change the separator between the argument name and the number
via .sep
. Here, we change it from the default
"_"
to "."
:
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3))),
.sep = ".") %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> a.1 a.2 a.3
#> <dbl> <dbl> <dbl>
#> 1 1.24 0.356 -1.74
#> 2 0.0116 0.833 1.61
#> 3 0.494 -0.309 -1.82
#> 4 0.374 0.985 1.15
#> 5 -1.42 -0.232 -0.591
#> 6 -0.114 -1.18 -0.665
Alternatively, you can give a two-sided formula to set names. The
argument name is ignored, and the left hand side must use c
or cbind
.
specify(y = c(a, b, c) ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3)))) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 -1.46 -0.321 0.729
#> 2 0.132 0.419 0.536
#> 3 -0.135 -0.384 0.897
#> 4 -0.0808 -0.611 0.334
#> 5 1.13 -0.979 0.396
#> 6 -0.0823 1.73 -1.21
If your expression already produces column names, those are used by default. The argument name is again ignored:
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3))) %>%
magrittr::set_colnames(c("d", "e", "f"))) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> d e f
#> <dbl> <dbl> <dbl>
#> 1 -0.826 -0.407 0.890
#> 2 -0.163 -0.728 -0.0944
#> 3 0.0886 0.0767 -0.222
#> 4 0.369 -0.00302 0.322
#> 5 -0.195 0.0547 -0.575
#> 6 0.379 -0.878 -1.11
This is useful for dealing with functions from other packages that
already provide informative names (e.g.,
lavaan::simulateData()
). You can turn this behavior off
with .use_names = FALSE
.
Whatever method you use, you can still refer to these generated
variables in subsequent arguments to specify()
:
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3))),
b = ~ a_1 - a_2) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 4]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 4
#> a_1 a_2 a_3 b
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1.90 2.00 -1.93 -0.102
#> 2 0.729 1.81 -0.164 -1.08
#> 3 0.803 -0.259 -0.951 1.06
#> 4 -0.421 -1.12 -1.85 0.702
#> 5 0.288 -0.0160 -0.426 0.304
#> 6 0.107 0.186 1.79 -0.0784
define()
parameters that you want to systematically
vary
define()
creates metaparameters (also called simulation
factors): values that you want to systematically vary. An obvious choice
is sample size.
Instead of writing a number for the n
argument of
rnorm
, we write a placeholder value samp_size
(this can be any valid R name), and we write a corresponding argument in
define()
that contains the possible values that
samp_size
can take on:
specify(a = ~ rnorm(samp_size)) %>%
define(samp_size = c(10, 20)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 2 × 4
#> .sim_id samp_size rep sim
#> <int> <dbl> <int> <list>
#> 1 1 10 1 <tibble [10 × 1]>
#> 2 2 20 1 <tibble [20 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 10 × 1
#> a
#> <dbl>
#> 1 -2.95
#> 2 -0.422
#> 3 0.353
#> 4 -1.58
#> 5 -0.101
#> 6 1.57
#> 7 -2.63
#> 8 -0.863
#> 9 0.910
#> 10 -0.724
Each argument to define()
is a vector or list with the
desired values to be used in the expressions in specify()
.
specify()
expressions can refer to the names of define
arguments, and generate()
will substitute the possible
values that argument when generating data.
When define()
has multiple arguments, each possible
combination is generated:
specify(a = ~ rnorm(samp_size, mu)) %>%
define(samp_size = c(10, 20),
mu = c(0, 10)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 4 × 5
#> .sim_id samp_size mu rep sim
#> <int> <dbl> <dbl> <int> <list>
#> 1 1 10 0 1 <tibble [10 × 1]>
#> 2 2 20 0 1 <tibble [20 × 1]>
#> 3 3 10 10 1 <tibble [10 × 1]>
#> 4 4 20 10 1 <tibble [20 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 10 × 1
#> a
#> <dbl>
#> 1 0.440
#> 2 0.596
#> 3 -0.925
#> 4 0.229
#> 5 0.937
#> 6 -0.323
#> 7 -0.0697
#> 8 -2.14
#> 9 1.75
#> 10 0.181
(If not all combinations are desired, see options for filtering at
the generate()
step, below.)
define()
can also take lists for any type of value used
in specify
. For instance, the argument s
is
defined as a list with two variables representing two different possible
correlation matrices, and we use that placeholder value s
in the specify()
statement:
specify(a = ~ MASS::mvrnorm(6, rep(0, 2), Sigma = s)) %>%
define(s = list(independent = diag(rep(1, 2)),
dependent = matrix(c(1, 0.5, 0.5, 1), nrow = 2))) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 2 × 5
#> .sim_id s_index rep s sim
#> <int> <chr> <int> <list> <list>
#> 1 1 independent 1 <dbl [2 × 2]> <tibble [6 × 2]>
#> 2 2 dependent 1 <dbl [2 × 2]> <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a_1 a_2
#> <dbl> <dbl>
#> 1 0.202 0.438
#> 2 -2.13 0.971
#> 3 -1.06 0.168
#> 4 -0.643 -0.465
#> 5 1.15 -1.36
#> 6 -0.485 0.815
In the output, simpr
creates a column
s_index
using the names of the list elements of
s
to make the results easier to view and filter.
define()
also supports lists of functions. Here, the
specify
command has a placeholder function
distribution
, where distribution
is defined to
be either rnorm()
or rlnorm()
(the lognormal
distribution):
specify(y = ~ distribution(6)) %>%
define(distribution = list(normal = rnorm,
lognormal = rlnorm)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 2 × 5
#> .sim_id distribution_index rep distribution sim
#> <int> <chr> <int> <list> <list>
#> 1 1 normal 1 <fn> <tibble [6 × 1]>
#> 2 2 lognormal 1 <fn> <tibble [6 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> y
#> <dbl>
#> 1 -1.52
#> 2 -1.17
#> 3 0.767
#> 4 -1.78
#> 5 -0.351
#> 6 -0.753
generate()
the simulation data
The main argument to generate()
is .reps
,
the number of repetitions for each combination of metaparameters in
define()
.
specify(a = ~ rnorm(n, mu)) %>%
define(n = c(6, 12),
mu = c(0, 10)) %>%
generate(2)
#> full tibble
#> --------------------------
#> # A tibble: 8 × 5
#> .sim_id n mu rep sim
#> <int> <dbl> <dbl> <int> <list>
#> 1 1 6 0 1 <tibble [6 × 1]>
#> 2 2 12 0 1 <tibble [12 × 1]>
#> 3 3 6 10 1 <tibble [6 × 1]>
#> 4 4 12 10 1 <tibble [12 × 1]>
#> 5 5 6 0 2 <tibble [6 × 1]>
#> 6 6 12 0 2 <tibble [12 × 1]>
#> 7 7 6 10 2 <tibble [6 × 1]>
#> 8 8 12 10 2 <tibble [12 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 -1.88
#> 2 1.48
#> 3 -1.80
#> 4 -0.110
#> 5 0.780
#> 6 -1.69
Since there are 4 possible combinations of n
and
mu
, there are a total of 4 * 2 = 8 simulations generated, 2
for each possible combination.
If some combination of variables is not desired, add filtering
criteria to generate()
using the same syntax as
dplyr::filter()
. Here we arbitrarily filter to all
combinations of n
and mu
where n
is greater than mu
, but any valid filtering criteria can be
applied.
specify(a = ~ rnorm(n, mu)) %>%
define(n = c(6, 12),
mu = c(0, 10)) %>%
generate(2, n > mu)
#> full tibble
#> --------------------------
#> # A tibble: 6 × 5
#> .sim_id n mu rep sim
#> <int> <dbl> <dbl> <int> <list>
#> 1 1 6 0 1 <tibble [6 × 1]>
#> 2 2 12 0 1 <tibble [12 × 1]>
#> 3 4 12 10 1 <tibble [12 × 1]>
#> 4 5 6 0 2 <tibble [6 × 1]>
#> 5 6 12 0 2 <tibble [12 × 1]>
#> 6 8 12 10 2 <tibble [12 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 -1.59
#> 2 -0.346
#> 3 -0.828
#> 4 -0.450
#> 5 -0.492
#> 6 1.66
To preserve reproducibility, a given simulation in the filtered
version of generate
is still the same as if all possible
combinations were generated. This can be tracked using the
.sim_id
that generate()
includes in the output
data, which uniquely identifies the simulation run given the same inputs
to specify
, define
, and .reps
.
Above, note that .sim_id
skips 3 and 7 See
vignette("Reproducing simulations")
for more information on
using generate()
to filter.
generate()
by default continues with the next iteration
if an error is produced, and returns a column .sim_error
with the text of the error. Alternative error handling mechanisms are
available, see vignette("Managing simulation errors")
.
fit()
models to your data
fit()
uses similar syntax to generate()
:
you can write arbitrary R expressions to fit models. Again, each
argument should have a name and be prefixed with ~
, the
tilde operator.
Below, we fit both a t-test and a linear model.
specify(a = ~ rnorm(6),
b = ~ a + rnorm(6)) %>%
generate(1) %>%
fit(t_test = ~ t.test(a, b),
lm = ~ lm(b ~ a))
#> full tibble
#> --------------------------
#> # A tibble: 1 × 5
#> .sim_id rep sim t_test lm
#> <int> <int> <list> <list> <list>
#> 1 1 1 <tibble [6 × 2]> <htest> <lm>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.289 -0.190
#> 2 -0.636 -0.778
#> 3 0.215 0.0567
#> 4 0.00572 -0.371
#> 5 -0.185 -0.692
#> 6 1.07 0.565
#>
#> t_test[[1]]
#> --------------------------
#>
#> Welch Two Sample t-test
#>
#> data: a and b
#> t = 1.169, df = 9.8425, p-value = 0.2699
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -0.3286575 1.0508712
#> sample estimates:
#> mean of x mean of y
#> 0.1262152 -0.2348917
#>
#>
#> lm[[1]]
#> --------------------------
#>
#> Call:
#> lm(formula = b ~ a)
#>
#> Coefficients:
#> (Intercept) a
#> -0.3413 0.8430
fit()
adds columns to the overall tibble to contain each
type of fit. Printing the object displays a preview of the first fit
object in each column.
Although the function is named fit
, any arbitrary R
expression can be used. Below, one fit column will include the mean of
a
, while the second will include vectors that are five
larger than a
. The result is always a list-column, so any
type of return object is allowed:
specify(a = ~ rnorm(6)) %>%
generate(1) %>%
fit(mean = ~ mean(a),
why_not = ~ a + 5)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 5
#> .sim_id rep sim mean why_not
#> <int> <int> <list> <list> <list>
#> 1 1 1 <tibble [6 × 1]> <dbl [1]> <dbl [6]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 1.03
#> 2 -0.632
#> 3 0.265
#> 4 -0.247
#> 5 -0.500
#> 6 -0.397
#>
#> mean[[1]]
#> --------------------------
#> [1] -0.08082731
#>
#> why_not[[1]]
#> --------------------------
#> [1] 6.026923 4.367594 5.264579 4.752820 4.499658 4.603462
fit()
is computed for each individual simulated dataset,
so usually you do not need to refer to the dataset itself. If a
reference to the dataset is needed, use .
.
The below code is equivalent to the previous example, but explicitly
referencing the dataset using .
and .$
:
specify(a = ~ rnorm(6),
b = ~ a + rnorm(6)) %>%
generate(1) %>%
## .$ and data = . not actually required here
fit(t_test = ~ t.test(.$a, .$b),
lm = ~ lm(b ~ a, data = .))
#> full tibble
#> --------------------------
#> # A tibble: 1 × 5
#> .sim_id rep sim t_test lm
#> <int> <int> <list> <list> <list>
#> 1 1 1 <tibble [6 × 2]> <htest> <lm>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.460 -1.56
#> 2 1.43 4.34
#> 3 1.34 1.26
#> 4 -0.750 -1.78
#> 5 -1.19 -0.431
#> 6 -1.43 -2.81
#>
#> t_test[[1]]
#> --------------------------
#>
#> Welch Two Sample t-test
#>
#> data: .$a and .$b
#> t = 0.11663, df = 7.2484, p-value = 0.9103
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -2.640012 2.915949
#> sample estimates:
#> mean of x mean of y
#> -0.0251145 -0.1630829
#>
#>
#> lm[[1]]
#> --------------------------
#>
#> Call:
#> lm(formula = b ~ a, data = .)
#>
#> Coefficients:
#> (Intercept) a
#> -0.1226 1.6111
Sometimes data manipulation is required between
generate()
and fit()
. After
generate()
, run per_sim()
and then chain any
dplyr
or tidyr
verbs that work on
data.frame
s or tibble
s. These verbs will be
applied to every individual simulated dataset.
A common use-case is needing to reshape wide to long. Consider an intervention study with a control group, an intervention group that does slightly better than the control, and a second intervention group that does much better. This is easiest to specify in wide format, with separate variables for each group and differing means by group:
wide_gen = specify(control = ~ rnorm(6, mean = 0),
intervention_1 = ~ rnorm(6, mean = 0.2),
intervention_2 = ~ rnorm(6, mean = 2)) %>%
generate(2)
wide_gen
#> full tibble
#> --------------------------
#> # A tibble: 2 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#> 2 2 2 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> control intervention_1 intervention_2
#> <dbl> <dbl> <dbl>
#> 1 -0.578 0.839 1.14
#> 2 2.20 0.219 2.32
#> 3 0.0715 0.0958 0.652
#> 4 -0.497 -0.337 0.926
#> 5 0.221 0.0949 3.10
#> 6 1.59 -0.135 1.64
But to run an ANOVA, we need the outcome in one column and the group
name in another column. We first run per_sim()
to indicate
we want to compute on individual simulated datasets, and then we can use
tidyr::pivot_longer()
to reshape each dataset into a format
ready for analysis:
long_gen = wide_gen %>%
per_sim() %>%
pivot_longer(cols = everything(),
names_to = "group",
values_to = "response")
long_gen
#> full tibble
#> --------------------------
#> # A tibble: 2 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [18 × 2]>
#> 2 2 2 <tibble [18 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 18 × 2
#> group response
#> <chr> <dbl>
#> 1 control -0.578
#> 2 intervention_1 0.839
#> 3 intervention_2 1.14
#> 4 control 2.20
#> 5 intervention_1 0.219
#> 6 intervention_2 2.32
#> 7 control 0.0715
#> 8 intervention_1 0.0958
#> 9 intervention_2 0.652
#> 10 control -0.497
#> 11 intervention_1 -0.337
#> 12 intervention_2 0.926
#> 13 control 0.221
#> 14 intervention_1 0.0949
#> 15 intervention_2 3.10
#> 16 control 1.59
#> 17 intervention_1 -0.135
#> 18 intervention_2 1.64
Each simulation is now reshaped and ready for fitting:
long_fit = long_gen %>%
fit(aov = ~ aov(response ~ group),
lm = ~ lm(response ~ group))
long_fit
#> full tibble
#> --------------------------
#> # A tibble: 2 × 5
#> .sim_id rep sim aov lm
#> <int> <int> <list> <list> <list>
#> 1 1 1 <tibble [18 × 2]> <aov> <lm>
#> 2 2 2 <tibble [18 × 2]> <aov> <lm>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 18 × 2
#> group response
#> <chr> <dbl>
#> 1 control -0.578
#> 2 intervention_1 0.839
#> 3 intervention_2 1.14
#> 4 control 2.20
#> 5 intervention_1 0.219
#> 6 intervention_2 2.32
#> 7 control 0.0715
#> 8 intervention_1 0.0958
#> 9 intervention_2 0.652
#> 10 control -0.497
#> 11 intervention_1 -0.337
#> 12 intervention_2 0.926
#> 13 control 0.221
#> 14 intervention_1 0.0949
#> 15 intervention_2 3.10
#> 16 control 1.59
#> 17 intervention_1 -0.135
#> 18 intervention_2 1.64
#>
#> aov[[1]]
#> --------------------------
#> Call:
#> aov(formula = response ~ group)
#>
#> Terms:
#> group Residuals
#> Sum of Squares 7.319581 11.617044
#> Deg. of Freedom 2 15
#>
#> Residual standard error: 0.8800395
#> Estimated effects may be unbalanced
#>
#> lm[[1]]
#> --------------------------
#>
#> Call:
#> lm(formula = response ~ group)
#>
#> Coefficients:
#> (Intercept) groupintervention_1 groupintervention_2
#> 0.5008 -0.3714 1.1282
tidy_fits()
for further processing
The output of fit()
is not yet amenable to plotting or
analysis. tidy_fits()
applies broom::tidy()
to
each fit object and binds them together in a single
tibble
:
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
tidy_fits()
#> # A tibble: 8 × 9
#> .sim_id n rep Source term estimate std.error statistic p.value
#> <int> <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 1 lm (Intercept) 0.181 0.441 0.412 0.702
#> 2 1 6 1 lm a 0.541 0.402 1.34 0.250
#> 3 2 12 1 lm (Intercept) -0.0369 0.403 -0.0914 0.929
#> 4 2 12 1 lm a 1.32 0.386 3.41 0.00662
#> 5 3 6 2 lm (Intercept) 0.805 0.334 2.41 0.0735
#> 6 3 6 2 lm a 1.35 0.293 4.62 0.00985
#> 7 4 12 2 lm (Intercept) -0.281 0.261 -1.07 0.308
#> 8 4 12 2 lm a 1.22 0.244 5.00 0.000537
All the same metaparameter information appears in the left-hand
column, and all the model information from broom::tidy()
is
provided. This is a convenient format for filtering, plotting, and
calculating diagnostics.
If more than one kind of fit is present, tidy_fits()
simply brings them together in the same tibble
using
bind_rows
; this means there may be many NA
values where one type of model has no values. In the example below,
t.test
returns the column estimate1
, but
lm
does not, so for the lm
rows there are
NA
values.
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a),
t_test = ~ t.test(a, b)) %>%
tidy_fits()
#> # A tibble: 12 × 16
#> .sim_id n rep Source term estimate std.e…¹ statis…² p.value estim…³
#> <int> <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 1 lm (Interc… -0.431 0.268 -1.61 1.84e-1 NA
#> 2 1 6 1 lm a 0.680 0.330 2.06 1.08e-1 NA
#> 3 1 6 1 t_test NA 0.388 NA 0.785 4.51e-1 -0.134
#> 4 2 12 1 lm (Interc… -0.693 0.449 -1.54 1.54e-1 NA
#> 5 2 12 1 lm a 0.751 0.533 1.41 1.89e-1 NA
#> 6 2 12 1 t_test NA 0.551 NA 1.41 1.78e-1 -0.571
#> 7 3 6 2 lm (Interc… -0.103 0.379 -0.272 7.99e-1 NA
#> 8 3 6 2 lm a 0.733 0.423 1.73 1.58e-1 NA
#> 9 3 6 2 t_test NA 0.00356 NA 0.00652 9.95e-1 -0.373
#> 10 4 12 2 lm (Interc… 0.130 0.212 0.611 5.55e-1 NA
#> 11 4 12 2 lm a 1.35 0.200 6.71 5.29e-5 NA
#> 12 4 12 2 t_test NA -0.0821 NA -0.145 8.86e-1 -0.138
#> # … with 6 more variables: estimate2 <dbl>, parameter <dbl>, conf.low <dbl>,
#> # conf.high <dbl>, method <chr>, alternative <chr>, and abbreviated variable
#> # names ¹std.error, ²statistic, ³estimate1
Any option taken by the tidier can be passed through
tidy_fits()
. Below, we specify the conf.level
and conf.int
options for broom::tidy.lm()
:
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
tidy_fits(conf.level = 0.99, conf.int = TRUE)
#> # A tibble: 8 × 11
#> .sim_id n rep Source term estimate std.e…¹ stati…² p.value conf.…³
#> <int> <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 1 lm (Intercep… -0.184 0.307 -0.601 0.580 -1.60
#> 2 1 6 1 lm a 0.968 0.333 2.90 0.0440 -0.567
#> 3 2 12 1 lm (Intercep… 0.402 0.276 1.45 0.177 -0.474
#> 4 2 12 1 lm a 0.719 0.447 1.61 0.139 -0.698
#> 5 3 6 2 lm (Intercep… -0.00991 0.708 -0.0140 0.990 -3.27
#> 6 3 6 2 lm a 0.514 0.465 1.11 0.331 -1.63
#> 7 4 12 2 lm (Intercep… -0.290 0.383 -0.758 0.466 -1.50
#> 8 4 12 2 lm a 1.13 0.326 3.46 0.00611 0.0952
#> # … with 1 more variable: conf.high <dbl>, and abbreviated variable names
#> # ¹std.error, ²statistic, ³conf.low
glance_fits()
analogously provides the one-row summary
provided by broom::glance()
for each simulation:
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
glance_fits()
#> # A tibble: 4 × 16
#> .sim_id n rep Source r.squa…¹ adj.r…² sigma stati…³ p.value df logLik
#> <int> <dbl> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 1 lm 0.0974 -0.128 1.01 0.431 0.547 1 -7.33
#> 2 2 12 1 lm 0.115 0.0261 1.24 1.30 0.282 1 -18.5
#> 3 3 6 2 lm 0.606 0.507 1.13 6.15 0.0682 1 -8.05
#> 4 4 12 2 lm 0.587 0.545 1.06 14.2 0.00367 1 -16.6
#> # … with 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
#> # df.residual <int>, nobs <int>, and abbreviated variable names ¹r.squared,
#> # ²adj.r.squared, ³statistic
apply_fits
can take any arbitrary expression (preceded
by ~
) or function and apply it to each fit object. The
special value .
indicates the current fit.
Below, the maximum Cook’s Distance for each simulated model fit is
computed using cooks.distance()
.
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
apply_fits(~ max(cooks.distance(.)))
#> # A tibble: 4 × 5
#> .sim_id n rep Source value
#> <int> <dbl> <int> <chr> <dbl>
#> 1 1 6 1 lm 1.53
#> 2 2 12 1 lm 0.471
#> 3 3 6 2 lm 1.26
#> 4 4 12 2 lm 0.361