Simulate data with varying degrees of selection and confounding bias

Simulates data based on specified relationships among the generated outcome, group variable, confounder(s), and selection mechanism.

simdat(
  N = 1e+06,
  p = 1,
  q = 0,
  n_strat = 1,
  n_clust = 1,
  sigma_strat = 1,
  sigma_clust = 1,
  X_fam = c("gaussian", "binary"),
  tau_0 = 0,
  tau_A = 1,
  tau_X = rep(1, p),
  tau_X12 = 0,
  beta_0 = 0,
  beta_A = 1,
  beta_X = rep(1, p),
  beta_U = rep(1, q),
  Y_fam = c("gaussian", "binary", "poisson"),
  alpha_0 = 0,
  alpha_A = 1,
  alpha_X = rep(1, p),
  alpha_AX = 0
)

Arguments

N: int - Number of observations to be generated. Defaults to 1000000.
p: int - Number of covariates to be generated. Defaults to 1.
q: int - Number of additional covariates that affect selection to be generated. Defaults to 0.
n_strat: int - Number of strata in the population to be generated. Defaults to 1.
n_clust: int - Number of clusters within each stratum in the population to be generated. Defaults to 1.
sigma_strat: double - Standard deviation of covariate means across strata. Defaults to 1.
sigma_clust: double - Standard deviation of covariate means across clusters. Defaults to 1.
X_fam: string - Distribution of the covariates, X. Defaults to "gaussian", which draws the covariates from a multivariate normal distribution with mean equal to the sum of the cluster and stratum means and an identity covariance matrix. If "binary", the continuous covariates are discretized at their median values.
tau_0: double - Intercept for propensity model. Defaults to 0.
tau_A: double - Scaling factor for group assignment. Defaults to 1.
tau_X: double - Coefficients for X in propensity model. Defaults to a vector of ones of length p.
tau_X12: double - Coefficient for the X1*X2 interaction term in propensity model; used only if p > 1. Defaults to 0.
beta_0: double - Intercept for selection model. Defaults to 0.
beta_A: double - Coefficient for A in selection model. Defaults to 1.
beta_X: double - Coefficients for X in selection model. Defaults to a vector of ones of length p.
beta_U: double - Coefficients for U (additional covariates affecting only selection) in selection model. Defaults to a vector of ones of length q.
Y_fam: string - Distribution of the outcome variable, Y. Defaults to "gaussian" for a normally distributed outcome. Other options include "binary" for a Bernoulli-distributed outcome and "poisson" for a Poisson-distributed outcome.
alpha_0: double - Intercept for outcome model. Defaults to 0.
alpha_A: double - Coefficient for A in outcome model. Defaults to 1.
alpha_X: double - Coefficients for X in outcome model. Defaults to a vector of ones of length p.
alpha_AX: double - Coefficient for interaction between A and X in outcome model. Defaults to 0.

Value

A data.frame with N observations and the following variables:

Strata: Stratum index (integer)
Cluster: Cluster index (integer)
X1, X2, ..., Xp: Confounding covariates (continuous or binary, depending on X_fam)
pA: True probability of A = 1 conditional on X (continuous)
A: Group assignment (binary)
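pS: True probability of selection conditional on the observed A and X (continuous)
P_S_cond_A1X, P_S_cond_A0X: Probability of selection with A set to 1 and to 0, respectively, as shown in the example output; pS equals the one that matches the observed A (continuous)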
Y0: Potential outcome under A = 0 (continuous, binary, or count depending on Y_fam)
Y1: Potential outcome under A = 1 (continuous, binary, or count depending on Y_fam)
Y: Observed outcome, equal to Y1 if A = 1 and Y0 if A = 0 (continuous, binary, or count depending on Y_fam)
CDIFF: True controlled difference in outcomes by comparison group (double, computed as mean(Y1 - Y0) and repeated on every row)

Details

The function generates data in a hierarchical structure with stratified clusters. The data generation process follows these steps:

1. Stratum and Cluster Means: For each of the n_strat strata, stratum-level means for the p covariates are generated from a normal distribution with standard deviation sigma_strat. Similarly, for each of the n_clust clusters within each stratum, cluster-level means are generated from a normal distribution with standard deviation sigma_clust.

2. Covariate Generation: Within each cluster, covariates, X, for N / (n_strat * n_clust) individuals are generated from a multivariate normal distribution with mean equal to the sum of the cluster and stratum means, and an identity covariance matrix.

3. Covariate Transformation: If X_fam is "binary", each covariate is discretized at its median; otherwise, it remains continuous.

4. Propensity Model: The group membership probability, pA, is defined by a logistic regression model with intercept tau_0, covariate effects tau_X, and, if p > 1, an interaction effect between the first two covariates with coefficient tau_X12. The group variable, A, is then generated from pA.

5. Selection Model: The probability of selection, pS, is generated using a logistic regression model with intercept beta_0, group effect beta_A, covariate effects beta_X, and, when q > 0, effects beta_U for the additional covariates U. Gaussian noise is added to the linear predictor.

6. Outcome Model: The outcome, Y, is generated based on a chosen outcome distribution, Y_fam. The linear predictor includes an intercept, alpha_0, group effect, alpha_A, covariate effects, alpha_X, and an optional interaction effect, alpha_AX, between the group variable and covariates.

7. Controlled Difference: The true controlled difference in the outcome between groups is calculated as mean(Y1 - Y0) and stored in CDIFF.
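
Steps 4 to 7 can be summarized in equations. The block below is a sketch under stated assumptions rather than the function's exact specification: g is assumed to be the canonical link for Y_fam (identity, logit, or log), the noise term in the selection model is Gaussian with a variance this page does not state, the alpha_AX interaction is assumed to act on the sum of the covariates, and tau_A is left out because this page does not say how it enters the propensity model. The interaction term in the first line applies only if p > 1, and the U term in the second only if q > 0.

```latex
\begin{aligned}
\operatorname{logit}(pA_i) &= \tau_0 + X_i^\top \tau_X + \tau_{X12}\, X_{i1} X_{i2}, \\
\operatorname{logit}(pS_i) &= \beta_0 + \beta_A A_i + X_i^\top \beta_X + U_i^\top \beta_U + \varepsilon_i, \\
g\!\left(E[Y_i^{a} \mid X_i]\right) &= \alpha_0 + \alpha_A a + X_i^\top \alpha_X + \alpha_{AX}\, a \sum_{j=1}^{p} X_{ij}, \quad a \in \{0, 1\}, \\
\mathrm{CDIFF} &= \frac{1}{N} \sum_{i=1}^{N} \left(Y_i^{1} - Y_i^{0}\right).
\end{aligned}
```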

The output is a data frame containing the generated outcome, group variable, covariates, and selection probabilities.
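
The steps above can be sketched in a few lines of base R. This is an illustrative approximation, not the implementation of simdat(): the helper name sketch_simdat is hypothetical, only the "gaussian" covariate and outcome families are covered, q is fixed at 0, tau_A is omitted, all noise terms are assumed to have unit variance, the potential outcomes are assumed to have independent noise, and alpha_AX is assumed to multiply A times the sum of the covariates.

```r
# Hypothetical sketch of the documented steps; NOT the package's simdat().
sketch_simdat <- function(N = 1000, p = 2, n_strat = 2, n_clust = 3,
                          sigma_strat = 1, sigma_clust = 1,
                          tau_0 = 0, tau_X = rep(1, p), tau_X12 = 0,
                          beta_0 = 0, beta_A = 1, beta_X = rep(1, p),
                          alpha_0 = 0, alpha_A = 1, alpha_X = rep(1, p),
                          alpha_AX = 0) {
  n_cell <- N %/% (n_strat * n_clust)  # individuals per cluster

  # Step 1: stratum-level and cluster-level covariate means
  mu_strat <- matrix(rnorm(n_strat * p, sd = sigma_strat), n_strat, p)
  cells <- expand.grid(Cluster = seq_len(n_clust), Strata = seq_len(n_strat))
  mu_clust <- matrix(rnorm(nrow(cells) * p, sd = sigma_clust), nrow(cells), p)

  # Step 2: covariates with identity covariance around the summed means
  idx <- rep(seq_len(nrow(cells)), each = n_cell)
  mu <- mu_strat[cells$Strata[idx], , drop = FALSE] +
    mu_clust[idx, , drop = FALSE]
  X <- mu + matrix(rnorm(length(mu)), nrow(mu), p)
  colnames(X) <- paste0("X", seq_len(p))
  n <- nrow(X)

  # Step 3 is skipped: covariates stay continuous (X_fam = "gaussian")

  # Step 4: logistic propensity model, then Bernoulli group assignment
  lp_A <- tau_0 + drop(X %*% tau_X)
  if (p > 1) lp_A <- lp_A + tau_X12 * X[, 1] * X[, 2]
  pA <- plogis(lp_A)
  A <- rbinom(n, 1, pA)

  # Step 5: logistic selection model with Gaussian noise (assumed sd = 1)
  pS <- plogis(beta_0 + beta_A * A + drop(X %*% beta_X) + rnorm(n))

  # Step 6: Gaussian potential outcomes (assumed independent unit-variance noise)
  lp_Y0 <- alpha_0 + drop(X %*% alpha_X)
  lp_Y1 <- lp_Y0 + alpha_A + alpha_AX * rowSums(X)  # assumed interaction form
  Y0 <- lp_Y0 + rnorm(n)
  Y1 <- lp_Y1 + rnorm(n)
  Y <- ifelse(A == 1, Y1, Y0)

  # Step 7: true controlled difference, repeated on every row
  data.frame(Strata = cells$Strata[idx], Cluster = cells$Cluster[idx],
             X, pA = pA, A = A, pS = pS, Y0 = Y0, Y1 = Y1, Y = Y,
             CDIFF = mean(Y1 - Y0))
}

set.seed(1)
dat_sketch <- sketch_simdat(N = 600, p = 2)
head(dat_sketch)
```

With these defaults each of the 2 * 3 clusters receives 100 individuals, so the result has 600 rows and the same core columns as the data frame described under Value.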

Examples


N <- 100000

dat <- simdat(N)

head(dat)
#>   Cluster Strata        X1         pA A         pS P_S_cond_A1X P_S_cond_A0X
#> 1       1      1 -1.821282 0.13928009 0 0.16087018    0.3425913   0.16087018
#> 2       1      1 -1.294423 0.21510506 1 0.42658122    0.4265812   0.21487043
#> 3       1      1 -4.264653 0.01386190 0 0.01289530    0.0342932   0.01289530
#> 4       1      1 -2.690160 0.06355648 0 0.05470347    0.1359232   0.05470347
#> 5       1      1 -2.687035 0.06374277 0 0.06741146    0.1642213   0.06741146
#> 6       1      1 -2.725540 0.06148299 0 0.06585926    0.1608244   0.06585926
#>           Y0         Y1          Y     CDIFF
#> 1  0.4589607 -1.3076043  0.4589607 0.9977058
#> 2 -2.1888480 -0.1170095 -0.1170095 0.9977058
#> 3 -5.0102595 -4.7814792 -5.0102595 0.9977058
#> 4 -0.5664572 -1.6112367 -0.5664572 0.9977058
#> 5 -3.4176525 -1.3325220 -3.4176525 0.9977058
#> 6 -3.0292261 -0.7895155 -3.0292261 0.9977058
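
One possible use of the output, sketched here under assumptions, is to draw a selected sample with probability pS and compare the naive group difference in that sample with CDIFF. So that the snippet runs without the package, it builds a small stand-in data frame named toy (hypothetical values, same column names as the output above); with the package loaded, dat from the example could be used in its place.

```r
set.seed(42)
n <- 5000

# Stand-in for simdat() output: hypothetical values, documented column names
X1 <- rnorm(n)
A  <- rbinom(n, 1, plogis(X1))
pS <- plogis(A + X1 + rnorm(n))
Y0 <- X1 + rnorm(n)
Y1 <- 1 + X1 + rnorm(n)
toy <- data.frame(X1, A, pS, Y0, Y1,
                  Y = ifelse(A == 1, Y1, Y0),
                  CDIFF = mean(Y1 - Y0))

# Draw a selection indicator from pS and keep only the selected rows
S <- rbinom(n, 1, toy$pS)
sel <- toy[S == 1, ]

# Naive difference in observed outcome means within the selected sample
naive <- mean(sel$Y[sel$A == 1]) - mean(sel$Y[sel$A == 0])

# Compare with the true controlled difference
c(naive = naive, truth = toy$CDIFF[1])
```

Because X1 drives group assignment, selection, and the outcome in this stand-in, the naive difference is not expected to match CDIFF; the gap reflects the combined confounding and selection bias.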