
Overview

In “Performance is not enough: The story told by a Rashomon Quartet” (Biecek et al., 2024), four different models (linear regression, decision trees, random forests, and neural networks) achieve near-identical predictive performance on the same dataset. However, their interpretations differ substantially. In this module, we will:

  1. Fit each model on a training set.
  2. Evaluate their predictive performance on a testing set.
  3. Demonstrate how the naive, classical, and IPD-based inference results differ.

Note: We will replicate the results of the Rashomon Quartet analysis following:

https://github.com/MI2DataLab/rashomon-quartet

Background on the Rashomon Quartet

The underlying datasets are synthetically generated to include three covariates, $\boldsymbol{X} = (X_1, X_2, X_3)^\top$, so that

$$\boldsymbol{X} = (X_1, X_2, X_3)^\top \sim \mathcal{N}_3\left(\boldsymbol{0}, \Sigma\right), \quad \Sigma_{ij} = \begin{cases} 1 & i = j,\\ 0.9 & i \neq j.\end{cases}$$

The outcomes, $Y$, are then generated to depend only on $X_1$ and, to a lesser extent, $X_2$, given by

$$Y = \sin\left(\frac{3X_1 + X_2}{5}\right) + \varepsilon, \quad \varepsilon \sim \mathcal{N}\left(0, \frac{1}{3}\right).$$

Note: $X_3$ does not appear in the true data-generating mechanism, but it is highly correlated with $X_1$ and $X_2$.
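
For reference, below is a minimal sketch of how data from this mechanism could be simulated. This is illustration only and is not part of the analysis, which reads the original CSVs; we also assume the 1/3 in the noise term denotes a standard deviation, consistent with the reported RMSE of roughly 0.354.

# Illustrative simulation of the data-generating mechanism (requires MASS).
# Assumption: the 1/3 noise parameter is a standard deviation.
n     <- 1000
Sigma <- matrix(0.9, nrow = 3, ncol = 3)
diag(Sigma) <- 1

X <- MASS::mvrnorm(n, mu = rep(0, 3), Sigma = Sigma)
colnames(X) <- c("x1", "x2", "x3")

y   <- sin((3 * X[, "x1"] + X[, "x2"]) / 5) + rnorm(n, sd = 1 / 3)
sim <- data.frame(y = y, X)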

Two independent batches of 1,000 observations (a training set and a testing set) are generated under this model. We will fit the four candidate models on the 1,000-row training data and evaluate them separately on the 1,000-row testing data. As we will show, all four achieve $R^2 \approx 0.729$ and $\mathrm{RMSE} \approx 0.354$ on the testing set. However, their interpretations diverge:

  • Linear regression captures a near-linear approximation of $\sin\left(\tfrac{3X_1 + X_2}{5}\right)$.
  • Decision tree approximates the nonlinear relationship via a small set of splits.
  • Random forest averages an ensemble of trees, producing a smoother piecewise fit.
  • Neural network learns yet another nonlinear representation through its hidden units.

In addition to comparing interpretability across models, we will demonstrate how to perform valid downstream inference for $Y$ on $\boldsymbol{X}$ when $Y$ is only partially observed. We will:

  1. Train each predictive model on the full 1,000-row training set.

  2. Predict on all 1,000 testing observations.

  3. Randomly split the 1,000 testing rows into two groups:

    • Labeled subset (with true $Y$ observed).
    • Unlabeled subset (with only predicted $f$).
  4. In the labeled subset, regress $Y$ on $\boldsymbol{X}$ (“Classical”).

  5. In the unlabeled subset, regress $f$ on $\boldsymbol{X}$ (“Naive”).

  6. Combine the two subsets and use ipd::ipd() to correct the bias and properly estimate the variance.

This mirrors realistic scenarios where the analytic dataset (here the testing set) has only some observations labeled, and we want to use model predictions for the rest while still making valid inference.

Setup and Data

# Install if needed:
# install.packages(c("ipd","DALEX","partykit","randomForest","neuralnet","GGally","tidyverse"))

library(ipd)           # Inference with Predicted Data
library(DALEX)         # Model Explanation
library(partykit)      # Decision Trees
library(randomForest)  # Random Forests
library(neuralnet)     # Neural Networks
library(GGally)        # Pairwise Plots
library(tidyverse)     # Data Manipulation and Visualization

Read-In the Training and Testing Data

We have included two CSV files from the MI2DataLab/rashomon-quartet repository on GitHub. Note that these CSVs are semicolon-delimited.

# Use read_delim to parse semicolons
train <- read_delim("data/rq_train.csv", delim = ";", col_types = cols(), show_col_types = FALSE)
test  <- read_delim("data/rq_test.csv",  delim = ";", col_types = cols(), show_col_types = FALSE)

# Show the first few rows of each
glimpse(train)
#> Rows: 1,000
#> Columns: 4
#> $ y  <dbl> -0.47048454, 0.17177895, 0.07295012, -0.34301361, 0.47167402, 0.191…
#> $ x1 <dbl> -0.51211395, 0.49038643, -0.89461065, 0.02725393, 0.44364997, -0.26…
#> $ x2 <dbl> -0.34174653, 0.25511508, -0.28744899, -0.09277031, 0.52147200, 0.28…
#> $ x3 <dbl> 0.255531783, -0.179610715, -0.310620340, -0.109603650, 0.920225141,

glimpse(test)
#> Rows: 1,000
#> Columns: 4
#> $ y  <dbl> -0.44547765, -0.67618374, -0.78822993, -0.55024530, -0.49205810, -0…
#> $ x1 <dbl> -0.553685719, -0.965436826, -0.962515641, -0.871816734, -1.96565982…
#> $ x2 <dbl> -0.74221752, -0.19959118, -1.25801758, -1.08469684, -1.43546177, -0…
#> $ x3 <dbl> -1.15266476, -1.00944121, -1.34017902, -0.84322216, -1.29871031, -0…

Confirm Dimensions and Column Names:

  • Each dataset should have exactly 1,000 rows and four columns: y, x1, x2, x3.
  • We will use all 1,000 rows of train to fit each predictive model, and then randomly split the 1,000-row test for IPD.
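
A quick sanity check along these lines (assuming the columns appear in the order y, x1, x2, x3):

stopifnot(
  nrow(train) == 1000, nrow(test) == 1000,
  identical(names(train), c("y", "x1", "x2", "x3")),
  identical(names(test),  c("y", "x1", "x2", "x3"))
)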

Predictive Model Training on the Training Set

Below, we fit four predictive models using the training set:

  1. Decision Tree via ctree() (max depth = 3, minsplit = 250).
  2. Linear Regression via lm().
  3. Random Forest via randomForest() (100 trees).
  4. Neural Network via neuralnet() (two hidden layers: 8 units, 4 units; threshold = 0.05).

We then evaluate each model’s performance on the testing set, using the DALEX package to extract each model’s $R^2$ and $\mathrm{RMSE}$.

set.seed(1568)

# 1) Decision Tree (ctree)
model_dt <- ctree(
  formula = y ~ x1 + x2 + x3,
  data    = train,
  control = ctree_control(maxdepth = 3, minsplit = 250)
)

exp_dt <- DALEX::explain(
  model   = model_dt,
  data    = test |> select(x1, x2, x3),
  y       = test$y,
  verbose = FALSE,
  label   = "Decision Tree"
)

mp_dt <- model_performance(exp_dt)

# 2) Linear Regression
model_lm <- lm(y ~ x1 + x2 + x3, data = train)

exp_lm <- DALEX::explain(
  model   = model_lm,
  data    = test |> select(x1, x2, x3),
  y       = test$y,
  verbose = FALSE,
  label   = "Linear Regression"
)

mp_lm <- model_performance(exp_lm)

# 3) Random Forest
model_rf <- randomForest(
  formula = y ~ x1 + x2 + x3,
  data    = train,
  ntree   = 100
)

exp_rf <- DALEX::explain(
  model   = model_rf,
  data    = test |> select(x1, x2, x3),
  y       = test$y,
  verbose = FALSE,
  label   = "Random Forest"
)

mp_rf <- model_performance(exp_rf)

# 4) Neural Network
model_nn <- neuralnet(
  formula      = y ~ x1 + x2 + x3,
  data         = train,
  hidden       = c(8, 4),
  threshold    = 0.05,
  linear.output = TRUE
)

predict_nn <- function(object, newdata) {
  as.numeric(predict(object, newdata))
}

exp_nn <- DALEX::explain(
  model            = model_nn,
  data             = test |> select(x1, x2, x3),
  y                = test$y,
  predict_function = predict_nn,
  verbose          = FALSE,
  label            = "Neural Network"
)

mp_nn <- model_performance(exp_nn)

# Save DALEX explainers for later reuse
save(exp_dt, exp_lm, exp_rf, exp_nn, file = "models.RData")

# Compile performance metrics into one data frame
mp_all <- list(
  "Linear Regression" = mp_lm,
  "Decision Tree"     = mp_dt,
  "Random Forest"     = mp_rf,
  "Neural Network"    = mp_nn
)

R2_values   <- sapply(mp_all, function(x) x$measures$r2)
RMSE_values <- sapply(mp_all, function(x) x$measures$rmse)

perf_df <- tibble(
  Model = names(R2_values),
  R2    = round(as.numeric(R2_values),   4),
  RMSE  = round(as.numeric(RMSE_values), 4)
)

perf_df
#> # A tibble: 4 × 3
#>   Model                R2  RMSE
#>   <chr>             <dbl> <dbl>
#> 1 Linear Regression 0.729 0.354
#> 2 Decision Tree     0.729 0.354
#> 3 Random Forest     0.728 0.354
#> 4 Neural Network    0.730 0.352

Interpretation: All four models (Linear Regression, Decision Tree, Random Forest, Neural Network) deliver essentially identical predictive performance on the testing set:

$$R^2 \approx 0.729, \quad \mathrm{RMSE} \approx 0.354.$$

Inspecting the Prediction Models

We now take a quick look at each model’s internal structure or summary.

Note:

  • Often, in practice, we do not know the operating characteristics of the predictive model or have access to its training data.
  • IPD methods are distinct in that they are agnostic to the source or form of these ‘black-box’ predictions.

Decision Tree Visualization

plot(model_dt, main = "Decision Tree (ctree) Structure")

The tree has depth 3, with splits on $x_1$, $x_2$, and $x_3$ that approximate $\sin\left(\tfrac{3X_1 + X_2}{5}\right)$ piecewise.
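
As a text companion to the plot, we can print the fitted split rules and count the terminal nodes using partykit utilities:

# Print the split rules and count the terminal nodes
print(model_dt)
length(nodeids(model_dt, terminal = TRUE))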

Linear Regression Summary

summary(model_lm)
#> 
#> Call:
#> lm(formula = y ~ x1 + x2 + x3, data = train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.68045 -0.24213  0.01092  0.23409  1.26450 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -0.01268    0.01114  -1.138    0.255    
#> x1           0.48481    0.03001  16.157  < 2e-16 ***
#> x2           0.14316    0.02966   4.826 1.61e-06 ***
#> x3          -0.03113    0.02980  -1.045    0.296    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.352 on 996 degrees of freedom
#> Multiple R-squared:  0.7268, Adjusted R-squared:  0.726 
#> F-statistic: 883.4 on 3 and 996 DF,  p-value: < 2.2e-16

The OLS coefficients show how the linear model approximates the underlying sine function. In particular, the slope on $x_1$ is roughly the average derivative of the sine term with respect to $x_1$ over the training distribution.
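
As a rough, back-of-the-envelope check of this intuition (not part of the original analysis), we can compare the fitted slope on $x_1$ to the average derivative of the sine term with respect to $x_1$ over the training data:

# Average partial derivative of sin((3*x1 + x2)/5) with respect to x1,
# evaluated at the training points; compare with the fitted slope of ~0.48
mean(3 / 5 * cos((3 * train$x1 + train$x2) / 5))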

Random Forest Summary

model_rf
#> 
#> Call:
#>  randomForest(formula = y ~ x1 + x2 + x3, data = train, ntree = 100) 
#>                Type of random forest: regression
#>                      Number of trees: 100
#> No. of variables tried at each split: 1
#> 
#>           Mean of squared residuals: 0.1193538
#>                     % Var explained: 73.58

The output reports the out-of-bag (OOB) MSE and the percentage of variance explained on the training data. The forest builds many trees of depth greater than 3 and averages them to approximate the sine structure.
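
As an aside, the forest’s built-in impurity-based importance can be inspected directly, for comparison with the permutation-based importance we compute later via DALEX:

# Built-in importance (increase in node purity) from randomForest
importance(model_rf)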

Neural Network Topology

plot(model_nn, rep = "best", main = "Neural Network (8,4) Topology")

This diagram shows two hidden layers (8 units, then 4 units) with activation functions that capture nonlinearity. The network approximates $\sin\left(\tfrac{3X_1 + X_2}{5}\right)$ more flexibly than the decision tree or random forest.
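
As a rough check on the network’s size (assuming a single repetition was fit, so model_nn$weights[[1]] holds the fitted weight matrices, including bias rows):

# Total number of fitted parameters: (3+1)*8 + (8+1)*4 + (4+1)*1 = 73
sum(sapply(model_nn$weights[[1]], length))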

Variable Importance (DALEX)

Next, we compute variable importance for each model via DALEX::model_parts(). We request type="difference" so that importance is measured in terms of loss of performance (RMSE) when that variable is permuted.

imp_dt <- model_parts(exp_dt, N = NULL, B = 1, type = "difference")
imp_lm <- model_parts(exp_lm, N = NULL, B = 1, type = "difference")
imp_rf <- model_parts(exp_rf, N = NULL, B = 1, type = "difference")
imp_nn <- model_parts(exp_nn, N = NULL, B = 1, type = "difference")

plot(imp_dt, imp_nn, imp_rf, imp_lm)

Interpretation:

  • Each model ranks the features $(x_1, x_2, x_3)$ by their impact on predictive accuracy.
  • Because the covariates are highly correlated, some models (e.g., the tree and forest) may attribute importance differently than the linear regression or neural network.

Partial-Dependence Plots (DALEX)

We now generate partial-dependence (PD) profiles for each feature under each model via DALEX::model_profile(). This shows how each model’s predicted $f$ changes when we vary one feature at a time, averaging over the others.

pd_dt <- model_profile(exp_dt, N = NULL)
pd_lm <- model_profile(exp_lm, N = NULL)
pd_rf <- model_profile(exp_rf, N = NULL)
pd_nn <- model_profile(exp_nn, N = NULL)

plot(pd_dt, pd_nn, pd_rf, pd_lm)

Interpretation:

  • For $x_1$, the Linear Regression PD is a straight line.
  • The Decision Tree PD is piecewise-constant.
  • The Random Forest PD is smoother but still stepwise.
  • The Neural Network PD approximates the sine shape.

Because $x_3$ does not appear in the data-generating formula, its PD should be nearly flat, though the correlation with $x_1$ and $x_2$ can induce slight structure.
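
As a rough visual check, the $x_1$ profiles can be overlaid with the true regression surface evaluated at $x_2 = 0$, i.e., $\sin(3x_1/5)$. This ignores the averaging over $x_2$, so it is only an approximate reference, and it assumes the variables argument of plot() restricts the display to $x_1$:

# Overlay the x1 partial-dependence profiles with sin(3*x1/5) as a reference
plot(pd_dt, pd_nn, pd_rf, pd_lm, variables = "x1") +
  geom_function(fun = function(z) sin(3 * z / 5),
                linetype = "dotted", color = "black")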

Inference with Predicted Data (IPD)

We now demonstrate how to perform valid inference in the testing set when only some test rows have the true $Y$. Specifically, we:

  1. Randomly split the 1,000 testing rows into:

    • A labeled subset of size $n_l$ (we choose 10%, but this can be adjusted).
    • An unlabeled subset of size $n_u = 1{,}000 - n_l$.
  2. In the labeled subset, regress $Y$ on $X_1$, $X_2$, and $X_3$ (Classical).

  3. In the unlabeled subset, regress $f$ on $X_1$, $X_2$, and $X_3$ (Naive).

  4. Use both subsets in an IPD pipeline and run:

    ipd_res <- ipd(
      formula        = y - f ~ x1 + x2 + x3,
      data           = test_labeled,
      unlabeled_data = test_unlabeled,
      method         = "pspa",
      model          = "ols"
    )

    to obtain bias-corrected estimates of $\beta_1$, $\beta_2$, and $\beta_3$.

Note: Instead of supplying one stacked dataset and the column name for the set labels, we can alternatively supply the labeled and unlabeled data separately, as we do below; a sketch of the stacked alternative appears after the IPD fits.

set.seed(12345)

# Predict outcomes in the testing set
ipd_data <- test |>
  mutate(
    
    preds_lm = c(predict(model_lm, newdata = test)),
    preds_dt = c(predict(model_dt, newdata = test)),
    preds_rf = c(predict(model_rf, newdata = test)),
    preds_nn = c(predict(model_nn, newdata = test)))

# Randomly split the testing set into labeled and unlabeled subsets
frac_labeled <- 0.1

n_test    <- nrow(ipd_data)
n_labeled <- floor(frac_labeled * n_test)

labeled_idx <- sample(seq_len(n_test), size = n_labeled)

test_labeled <- ipd_data |>
  slice(labeled_idx) |>
  mutate(set_label = "labeled")
  
test_unlabeled <- ipd_data |>
  slice(-labeled_idx) |>
  mutate(set_label = "unlabeled")
# Classical
class_fit <- lm(y ~ x1 + x2 + x3, data = test_labeled)
class_df  <- broom::tidy(class_fit) |>
  mutate(method = "Classical") 
  
# Naive (Linear Model)
naive_fit_lm <- lm(preds_lm ~ x1 + x2 + x3, data = test_unlabeled)
naive_df_lm  <- broom::tidy(naive_fit_lm) |>
  mutate(method = "Naive (lm)") 

# Naive (Decision Tree)
naive_fit_dt <- lm(preds_dt ~ x1 + x2 + x3, data = test_unlabeled)
naive_df_dt  <- broom::tidy(naive_fit_dt) |>
  mutate(method = "Naive (dt)") 

# Naive (Random Forest)
naive_fit_rf <- lm(preds_rf ~ x1 + x2 + x3, data = test_unlabeled)
naive_df_rf  <- broom::tidy(naive_fit_rf) |>
  mutate(method = "Naive (rf)") 

# Naive (Neural Net)
naive_fit_nn <- lm(preds_nn ~ x1 + x2 + x3, data = test_unlabeled)
naive_df_nn  <- broom::tidy(naive_fit_nn) |>
  mutate(method = "Naive (nn)") 

# IPD (Linear Model)
ipd_fit_lm <- ipd(y - preds_lm ~ x1 + x2 + x3, 
  data = test_labeled, unlabeled_data = test_unlabeled, 
  method = "pspa", model = "ols")
ipd_df_lm  <- tidy(ipd_fit_lm) |>
  mutate(method = "IPD (lm)") 

# IPD (Decision Tree)
ipd_fit_dt <- ipd(y - preds_dt ~ x1 + x2 + x3, 
  data = test_labeled, unlabeled_data = test_unlabeled,
  method = "pspa", model = "ols")
ipd_df_dt  <- tidy(ipd_fit_dt) |>
  mutate(method = "IPD (dt)") 

# IPD (Random Forest)
ipd_fit_rf <- ipd(y - preds_rf ~ x1 + x2 + x3, 
  data = test_labeled, unlabeled_data = test_unlabeled,
  method = "pspa", model = "ols")
ipd_df_rf  <- tidy(ipd_fit_rf) |>
  mutate(method = "IPD (rf)") 

# IPD (Neural Net)
ipd_fit_nn <- ipd(y - preds_nn ~ x1 + x2 + x3, 
  data = test_labeled, unlabeled_data = test_unlabeled,
  method = "pspa", model = "ols")
ipd_df_nn  <- tidy(ipd_fit_nn) |>
  mutate(method = "IPD (nn)") 

combined_df <- bind_rows(class_df,
  naive_df_lm, naive_df_dt, naive_df_rf, naive_df_nn,
  ipd_df_lm,   ipd_df_dt,   ipd_df_rf,   ipd_df_nn) |>
  
  mutate(
    conf.low  = estimate - 1.96 * std.error,
    conf.high = estimate + 1.96 * std.error
  )

combined_df
#> # A tibble: 36 × 8
#>    term        estimate std.error statistic   p.value method  conf.low conf.high
#>    <chr>          <dbl>     <dbl>     <dbl>     <dbl> <chr>      <dbl>     <dbl>
#>  1 (Intercept)  0.00872  3.68e- 2  2.37e- 1 8.13e-  1 Classi…  -0.0634   0.0808 
#>  2 x1           0.643    1.09e- 1  5.91e+ 0 5.24e-  8 Classi…   0.430    0.856  
#>  3 x2           0.0702   8.70e- 2  8.07e- 1 4.22e-  1 Classi…  -0.100    0.241  
#>  4 x3          -0.0791   9.94e- 2 -7.96e- 1 4.28e-  1 Classi…  -0.274    0.116  
#>  5 (Intercept) -0.0127   6.32e-18 -2.01e+15 0         Naive …  -0.0127  -0.0127 
#>  6 x1           0.485    1.67e-17  2.91e+16 0         Naive …   0.485    0.485  
#>  7 x2           0.143    1.69e-17  8.45e+15 0         Naive …   0.143    0.143  
#>  8 x3          -0.0311   1.70e-17 -1.83e+15 0         Naive …  -0.0311  -0.0311 
#>  9 (Intercept) -0.0183   6.76e- 3 -2.71e+ 0 6.89e-  3 Naive …  -0.0315  -0.00506
#> 10 x1           0.546    1.78e- 2  3.06e+ 1 1.57e-141 Naive …   0.511    0.581  
#> # ℹ 26 more rows

Plotting The Model Estimates

We now visualize, for each predictive model, the estimated coefficients and 95% confidence intervals from each inference method:

Note: The vertical dashed line marks the estimated slope for the Classical method, which serves as our benchmark for comparison.

ref_lines <- combined_df %>% 
  filter(method == "Classical") %>% 
  select(term, estimate)

ggplot(combined_df, aes(x = estimate, y = method, color = method)) +
  geom_point(size = 2) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  facet_wrap(~ term, ncol = 4, scales = "free_x") +
  geom_vline(
    data = ref_lines,
    aes(xintercept = estimate),
    linetype = "dashed",
    color    = "black"
  ) +
  labs(
    x = expression(hat(beta)),
    y = "") +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation of Results

  1. The Naive estimates are biased due to prediction error and have artificially narrow confidence intervals.
  2. The Classical estimates serve as our benchmark: they are unbiased, but their confidence intervals are wider because only the labeled subset is used.
  3. The IPD intervals cover the Classical estimates and are wider than the Naive intervals to account for the prediction uncertainty; the average interval widths are compared numerically below.
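
To make these comparisons concrete, we can tabulate the average 95% confidence-interval width for the slope terms by inference method:

# Average CI width for the slope terms, by inference method
combined_df |>
  filter(term != "(Intercept)") |>
  group_by(method) |>
  summarize(mean_ci_width = mean(conf.high - conf.low), .groups = "drop") |>
  arrange(mean_ci_width)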

Data Distribution: Train vs Test (GGally::ggpairs)

Finally, we visualize pairwise relationships among $(x_1, x_2, x_3, y)$ for train vs test. This confirms that train and test were drawn from the same synthetic distribution.

both <- bind_rows(
  train |> mutate(label = "train"),
  test  |> mutate(label = "test")
)

ggpairs(
  both,
  columns = c("x1", "x2", "x3", "y"),
  mapping = aes(color = label, alpha = 0.3),
  lower = list(
    continuous = wrap("points", size = 0.5, alpha = 0.3),
    combo      = wrap("facethist", bins = 20)
  ),
  diag = list(
    continuous = wrap("densityDiag", alpha = 0.4, bw = "SJ")
  ),
  upper = list(
    continuous = wrap("cor", size = 3, stars = FALSE)
  )
) +
  theme_minimal() +
  labs(
    title    = "Pairwise Distribution: Train vs Test (x1,x2,x3,y)",
    subtitle = "Color = train (blue) vs test (pink)"
  )

Interpretation:

  • The diagonals show almost identical marginal distributions of $x_1$, $x_2$, $x_3$, and $y$ in train vs test.
  • The off-diagonals confirm the covariance structure $\Sigma$ is consistent across both sets.

Summary and Takeaways

  1. Predictive Performance:

    • Linear Regression, Decision Tree, Random Forest, and Neural Network each achieve nearly identical test-set performance ($R^2 \approx 0.729$, $\mathrm{RMSE} \approx 0.354$).
  2. Model Interpretability:

    • Variable Importance and Partial-Dependence plots highlight how each model “tells a different story” about which features matter and how $y$ responds to $x_1$, $x_2$, and $x_3$.
    • The Neural Network’s PD best tracks the true sine function; the tree’s PD is piecewise-constant; the forest’s PD is smoothed piecewise; the linear PD is a straight line.
  3. Inference on Predicted Data (IPD):

    • By randomly splitting the test set into “labeled” and “unlabeled” subsets, we simulate a real-world scenario where only some new observations have the true $y$.
    • Naive regression of $f \sim x_1 + x_2 + x_3$ on the unlabeled set is biased due to prediction error.
    • Classical regression of $y \sim x_1 + x_2 + x_3$ on the labeled set is unbiased but less efficient.
    • IPD methods combine the labeled and unlabeled sets to correct bias and properly estimate variance.
  4. Key Lesson:

    • “Performance is not enough.” Even models with identical $R^2$ can lead to very different bias and variance in downstream inference on covariate effects. IPD provides a principled correction when using model predictions in place of true outcomes for some observations.

References

  • Biecek, Przemysław, et al. “Performance is not enough: The story told by a Rashomon Quartet.” Journal of Computational and Graphical Statistics 33.3 (2024): 1118-1121.
  • MI2DataLab. “The Rashomon Quartet.” GitHub repository (2023). https://github.com/MI2DataLab/rashomon-quartet

This is the end of the module. We hope this was informative! For questions, concerns, or suggestions, please reach out to