Specify the data-generating mechanisms for the simulation using purrr-style lambda functions.

# S3 method for formula
specify(x = NULL, ..., .use_names = TRUE, .sep = "_")

Arguments

x

leave this argument blank (NULL); this argument is a placeholder and can be skipped.

...

named purrr-style formula functions used for generating simulation variables. x is not recommended as a name, since it is a formal argument and will be automatically assumed to be the first variable (a message will be displayed if x is used).

.use_names

Whether to use names generated by the lambda function (TRUE, the default), or to overwrite them with supplied names.

.sep

The separator used when auto-generating column names. See Column naming.

Value

A simpr_specify object which contains the functions needed to generate the simulation; to be passed to define for defining metaparameters or, if there are no metaparameters, directly to generate for generating the simulation.

Within specify, later arguments can refer to variables defined in earlier arguments. For example, a variable b that depends on a can be defined very simply: specify(a = ~ 3 + runif(10), b = ~ 2 * a).

Finally, one can also refer to metaparameters that are to be systematically varied in the simulation study. See define and the examples for more details.
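As a rough illustration of what "referring to a metaparameter" means, the base R sketch below evaluates one formula once per metaparameter value. It is not simpr's implementation; the names f, n_values, and sims are made up for this sketch, and the values mirror define(n = c(5, 10)) from the examples.

```r
# The formula refers to n, which is not defined anywhere yet.
f <- ~ 1 + rnorm(n)

# Hypothetical metaparameter values, as in define(n = c(5, 10)).
n_values <- c(5, 10)

# Evaluate the right-hand side once per value, binding n each time.
sims <- lapply(n_values, function(n_i) {
  eval(f[[2]], envir = list(n = n_i), enclos = environment(f))
})

lengths(sims)  # 5 10
```

Because n is only looked up when the expression is evaluated, the same formula yields a differently sized variable for each value of the metaparameter.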

Details

This is always the first command in the simulation process: it specifies the actual simulated variables. The result is then passed to define to define metaparameters (if there are any) and then to generate to generate the data.

The ... arguments use an efficient syntax to specify custom functions needed for generating a simulation, based on the purrr package. When producing one variable, one can provide an expression such as specify(a = ~ 3 + runif(10)); the expression is preceded by ~, the tilde operator, and can refer to previous arguments in specify or to metaparameters in define. This is called a lambda function.
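To make the idea of a lambda function concrete, the following base R sketch shows that a one-sided formula merely stores an expression, which can be evaluated later. It is an illustration only, not a description of how simpr evaluates formulas internally; the names f, rhs, and draw are made up for this sketch.

```r
# A one-sided formula stores its right-hand side unevaluated.
f <- ~ 3 + runif(10)

# For a one-sided formula, element 2 is the right-hand side expression.
rhs <- f[[2]]

# Evaluating the stored expression draws fresh random values each time.
draw <- eval(rhs, envir = environment(f))
length(draw)  # 10
```

Nothing is generated when the formula is written; values appear only when the stored expression is evaluated, which is what allows the same specification to be reused across repetitions.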

Order matters: arguments are evaluated sequentially, so a later argument can refer to an earlier one, e.g. specify(a = ~ rnorm(2), b = ~ a + rnorm(2)).
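The sequential evaluation described above can be sketched in base R as follows. This is a minimal sketch under stated assumptions, not simpr's actual code: specs is a hypothetical stand-in for the arguments given to specify, and env is a made-up environment that accumulates results.

```r
# Hypothetical stand-in for the arguments given to specify().
specs <- list(a = ~ rnorm(2), b = ~ a + rnorm(2))

# Results are stored here so that later formulas can see earlier ones.
env <- new.env()

for (nm in names(specs)) {
  rhs <- specs[[nm]][[2]]  # right-hand side of the formula
  assign(nm, eval(rhs, envir = env), envir = env)
}

ls(env)  # "a" "b"
```

If the two entries were swapped, evaluating b would fail (assuming no other object named a is visible), because a would not have been created yet.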

generate combines results together into a single tibble for each simulation, so all lambda functions should produce the same number of rows. However, a lambda function can produce multiple columns.

Column naming

Because functions can produce different numbers of columns, there are several options for naming columns. If a provided lambda function produces a single column, the name given to the argument becomes the name of the column. If the lambda function already produces column names, then the output will use these names if .use_names = TRUE, the default. Otherwise, simpr uses the argument name as a base and auto-numbers the columns. For instance, if the argument a generates a two-column matrix and .sep = "_" (the default), the columns will be named a_1 and a_2.
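The auto-numbering scheme can be mimicked with a small base R helper. The helper auto_names is hypothetical and not part of simpr; the zero-padding rule is an assumption inferred from the a_01 through a_10 output shown in the examples.

```r
# Hypothetical helper: build auto-numbered names from a base name.
auto_names <- function(base, k, sep = "_") {
  # Pad indices to the width of the largest one (an assumption drawn
  # from the a_01 ... a_10 example output).
  idx <- formatC(seq_len(k), width = nchar(as.character(k)), flag = "0")
  paste(base, idx, sep = sep)
}

auto_names("a", 2)             # "a_1" "a_2"
auto_names("a", 10)[c(1, 10)]  # "a_01" "a_10"
auto_names("a", 2, sep = ".")  # "a.1" "a.2"
```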

Custom names can also be directly provided by a double-sided formula. The left-hand side must use c or cbind, e.g. specify(c(a, b) ~ MASS::mvrnorm(5, c(0, 0), Sigma = diag(2))).
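How names can be read off the left-hand side of a two-sided formula is sketched below in base R. This is illustrative only and does not reproduce simpr's internals; the names f, lhs, rhs, col_names, and out are made up, and matrix(rnorm(10), ncol = 2) stands in for MASS::mvrnorm to keep the sketch dependency-free.

```r
# Two-sided formula: names on the left, generating expression on the right.
f <- cbind(a, b) ~ matrix(rnorm(10), ncol = 2)

lhs <- f[[2]]  # the call cbind(a, b)
rhs <- f[[3]]  # the call matrix(rnorm(10), ncol = 2)

# all.vars() extracts the symbols a and b without evaluating them.
col_names <- all.vars(lhs)

out <- eval(rhs, envir = environment(f))
colnames(out) <- col_names
colnames(out)  # "a" "b"
```

The symbols on the left-hand side are never evaluated, which is why a and b do not need to exist beforehand.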

Note

This function is an S3 method for specify from the generics package. Because x is a formal argument of specify, if you have a variable in your simulation named x it will be automatically moved to be the first variable (with a message). It is therefore safest to use any other variable name besides x.

Examples

## specify a variable and generate it in the simulation
single_var = specify(a = ~ 1 + rnorm(5)) %>%
  generate(1) # generate a single repetition of the simulation
single_var
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       1     1 <tibble [5 × 1]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 5 × 1
#>        a
#>    <dbl>
#> 1  2.72 
#> 2  1.66 
#> 3 -0.236
#> 4 -0.638
#> 5  1.73 
#> 

## generate a second variable that depends on the first
two_var = specify(a = ~ 1 + rnorm(5),
                  b = ~ a + 2) %>%
  generate(1)
two_var # sim[[1]] now has two columns, a and b

## Generates a_01 through a_10
autonumber_var = specify(a = ~ MASS::mvrnorm(5, rep(0, 10), Sigma = diag(10))) %>%
  generate(1)
autonumber_var
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#>   .sim_id   rep sim              
#>     <int> <int> <list>           
#> 1       1     1 <tibble [5 × 10]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 5 × 10
#>     a_01   a_02    a_03   a_04   a_05  a_06    a_07   a_08   a_09   a_10
#>    <dbl>  <dbl>   <dbl>  <dbl>  <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
#> 1 -0.323  1.17   0.315   0.609  0.987 0.577 -0.0338  0.125  1.19   0.461
#> 2  0.485 -1.30   0.243   2.52  -1.31  0.205  0.400  -1.05  -0.458 -0.537
#> 3  0.419  0.542  0.0649 -0.239 -0.747 1.79  -1.84   -1.35   0.228  0.249
#> 4  1.00   0.282  0.939   0.611  0.160 0.192  0.518  -0.224 -1.10   0.904
#> 5 -1.39   2.03  -0.731   0.184  0.536 0.489  0.871  -2.44  -1.86  -1.41 
#> 

# alternatively, you could use a two-sided formula for names
multi_name = specify(cbind(a, b, c) ~ MASS::mvrnorm(5, rep(0, 3), Sigma = diag(3))) %>%
  generate(1)
#> Formula specification for 'x' detected. Assuming 'x' is the first formula.
#> 
#> To hide this message, or to avoid moving this formula first, use a different variable name.
multi_name
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       1     1 <tibble [5 × 3]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 5 × 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1  0.259 1.19   -0.696
#> 2  0.821 0.0367 -0.565
#> 3  0.883 0.131   0.108
#> 4  0.535 0.811  -0.272
#> 5 -1.49  1.77   -0.668
#> 

# Simple example of setting a metaparameter
simple_meta = specify(a = ~ 1 + rnorm(n)) %>%
  define(n = c(5, 10)) %>% # without this line you would get an error!
  generate(1)


simple_meta # has two rows now, one for each value of n
#> full tibble
#> --------------------------
#> # A tibble: 2 × 4
#>   .sim_id     n   rep sim              
#>     <int> <dbl> <int> <list>           
#> 1       1     5     1 <tibble [5 × 1]> 
#> 2       2    10     1 <tibble [10 × 1]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 5 × 1
#>        a
#>    <dbl>
#> 1 -0.122
#> 2  1.68 
#> 3  1.53 
#> 4  3.29 
#> 5  1.58 
#> 
simple_meta$sim[[1]] # n = 5
#> # A tibble: 5 × 1
#>        a
#>    <dbl>
#> 1 -0.122
#> 2  1.68 
#> 3  1.53 
#> 4  3.29 
#> 5  1.58 
simple_meta$sim[[2]] # n = 10
#> # A tibble: 10 × 1
#>         a
#>     <dbl>
#>  1  1.63 
#>  2  0.322
#>  3 -0.292
#>  4  1.70 
#>  5  1.64 
#>  6  0.986
#>  7  0.107
#>  8  1.27 
#>  9 -0.824
#> 10  0.250