Simulates data from specified relationships among a continuous outcome, a binary variable of interest, a confounder, and a selection mechanism.
simdat(
  N,
  X_dist = "continuous",
  S_known = FALSE,
  tau_0 = 0,
  tau_X = 1,
  beta_0 = 0,
  beta_A = 1,
  beta_X = 1,
  hetero = TRUE,
  alpha_0 = 0,
  alpha_X = 1,
  alpha_A = 1,
  alpha_AX = 0.1
)
N: int - Number of observations to be generated
X_dist: string - Distribution of the confounding variable, X. Either "continuous" (the default) for a N(1, 1) variable or "binary" for a Bernoulli(0.5) variable
S_known: boolean - Whether the selection mechanism is treated as known (deterministic) or must be estimated (simulated with Gaussian error). Defaults to FALSE
tau_0: double - Intercept for propensity model (defaults to 0)
tau_X: double - Coefficient for X in propensity model (defaults to 1)
beta_0: double - Intercept for selection model (defaults to 0)
beta_A: double - Coefficient for A in selection model (defaults to 1)
beta_X: double - Coefficient for X in selection model (defaults to 1)
hetero: boolean - Whether the outcome model includes a heterogeneous treatment effect (defaults to TRUE)
alpha_0: double - Intercept for outcome model (defaults to 0)
alpha_X: double - Coefficient for X in outcome model (defaults to 1)
alpha_A: double - Coefficient for A in outcome model (defaults to 1)
alpha_AX: double - Coefficient for the interaction between A and X in the outcome model (only used if hetero == TRUE; defaults to 0.1)
A data.frame with N observations of 8 variables:
Y: Observed outcome (continuous)
A: Comparison group variable of interest (binary)
X: Confounding variable (continuous or binary)
P_A_cond_X: True probability of A = 1 conditional on X (continuous)
P_S_cond_AX: True probability of selection (S = 1) conditional on A and X (continuous)
P_S_cond_A1X: True probability of selection (S = 1) conditional on A = 1 and X (continuous)
P_S_cond_A0X: True probability of selection (S = 1) conditional on A = 0 and X (continuous)
CDIFF: True controlled difference in outcomes by comparison group (double)
The data are generated as follows. For a user-specified number of observations, N, in our so-called super population, we first generate a confounding variable, X, which relates to our outcome, Y, our variable of interest, A, and our selection indicator, S. We generate population-level data with X ~ N(1, 1) or X ~ Bern(0.5), depending on whether X_dist is set to "continuous" or "binary", respectively.
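We then generate the remaining data from three models: a propensity model for A given X (parameters tau_0 and tau_X), a selection model for S given A and X (parameters beta_0, beta_A, and beta_X), and an outcome model for Y given A and X (parameters alpha_0, alpha_X, alpha_A, and, when hetero = TRUE, alpha_AX).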
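The exact model equations are not reproduced on this page, so the following is only a minimal sketch of how three such models could fit together. It assumes logistic links for the propensity and selection models and a linear outcome model with standard normal noise; the function name simdat_sketch is hypothetical, and the sketch leaves out the S_known option and the CDIFF column. It is not the package's implementation.

```r
# Hypothetical illustration only; NOT the source of simdat().
# Assumptions: logistic propensity and selection models, linear outcome
# model with N(0, 1) noise, no Gaussian error in the selection model.
simdat_sketch <- function(N, X_dist = "continuous",
                          tau_0 = 0, tau_X = 1,
                          beta_0 = 0, beta_A = 1, beta_X = 1,
                          hetero = TRUE,
                          alpha_0 = 0, alpha_X = 1,
                          alpha_A = 1, alpha_AX = 0.1) {
  # Confounder: N(1, 1) or Bernoulli(0.5)
  X <- if (X_dist == "binary") rbinom(N, 1, 0.5) else rnorm(N, mean = 1, sd = 1)

  # Propensity model: P(A = 1 | X)
  p_A <- plogis(tau_0 + tau_X * X)
  A <- rbinom(N, 1, p_A)

  # Selection model: P(S = 1 | A, X), evaluated at A = 1, A = 0, and observed A
  p_S1 <- plogis(beta_0 + beta_A + beta_X * X)
  p_S0 <- plogis(beta_0 + beta_X * X)
  p_S <- ifelse(A == 1, p_S1, p_S0)
  S <- rbinom(N, 1, p_S)

  # Outcome model: the A-by-X interaction enters only when hetero is TRUE
  ax <- if (hetero) alpha_AX * A * X else 0
  Y <- alpha_0 + alpha_X * X + alpha_A * A + ax + rnorm(N)

  # S is returned here for illustration; the simdat() output lists no S column
  data.frame(Y, A, X, S,
             P_A_cond_X = p_A, P_S_cond_AX = p_S,
             P_S_cond_A1X = p_S1, P_S_cond_A0X = p_S0)
}

set.seed(1)
sk <- simdat_sketch(1000)
```

With the default coefficients, the sketch's propensity for the first row of the example output would be plogis(-0.4000435), about 0.4013, which agrees with the P_A_cond_X value shown there; the selection probabilities in that output differ slightly from the noise-free values, as expected when S_known = FALSE.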
N <- 100000
dat <- simdat(N)
head(dat)
#> Y A X P_A_cond_X P_S_cond_AX P_S_cond_A1X P_S_cond_A0X
#> 1 -0.9374768 0 -0.4000435 0.4013019 0.3830190 0.6279066 0.3830190
#> 2 1.7171265 1 1.2553171 0.7782189 0.9076775 0.9076775 0.7834017
#> 3 -2.5860869 0 -1.4372636 0.1919695 0.1848481 0.3813458 0.1848481
#> 4 2.1034929 0 0.9944287 0.7299618 0.7233937 0.8766799 0.7233937
#> 5 5.0639509 1 1.6215527 0.8350092 0.9421912 0.9421912 0.8570581
#> 6 2.4688280 1 2.1484116 0.8955203 0.9587912 0.9587912 0.8953901
#> CDIFF
#> 1 1.102055
#> 2 1.102055
#> 3 1.102055
#> 4 1.102055
#> 5 1.102055
#> 6 1.102055