Inference with Predicted Data (IPD) Workshop

What do we do after we have machine learned everything?

Presenters: Jesse Gronsbell¹, Stephen Salerno²

Contributors (Alphabetical Order): Awan Afiaz³, David Cheng⁴, Jianhui Gao⁵, Jesse Gronsbell⁶, Kentaro Hoffman⁷, Jeff Leek⁸, Qiongshi Lu⁹, Tyler McCormick¹⁰, Jiacheng Miao¹¹, Anna Neufeld¹², Stephen Salerno¹³

Workshop Date: June 24, 2025

Background and Motivation

Artificial intelligence and machine learning (AI/ML) have become essential tools in biomedical research, enabling large-scale analyses across diverse domains such as genomics, structural biology, and electronic health record-based research. Increasingly, researchers rely on model-generated predictions, rather than directly measured variables, as inputs for downstream statistical analyses. For example, predicted gene expression values or polygenic risk scores are often used in place of experimental assays, allowing researchers to expand cohort sizes and explore hypotheses when traditional data collection is infeasible, costly, or time-consuming.

While this practice of “using predictions as data” holds promise for accelerating scientific discovery, it presents significant challenges for statistical inference. When predicted values are used in place of true variables, the resulting estimates of association can be biased and misleading if uncertainty in the prediction step is not properly accounted for.
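
The bias described above can be reproduced in a small base R simulation. This is a minimal sketch under stated assumptions, not workshop material: the data-generating model, the variable names (`x`, `z`, `y`, `f`), and the choice of a prediction model that omits the covariate of interest are all hypothetical. Because the predictions `f` are built from `z` alone, regressing `f` on `x` recovers only part of the true association between `y` and `x`.

```r
set.seed(2025)

# Hypothetical data-generating model: y depends on x and on a correlated
# feature z. The marginal slope of y on x is 1 + 0.5 = 1.5.
simulate <- function(n) {
  x <- rnorm(n)
  z <- 0.5 * x + rnorm(n)
  y <- 1 + x + z + rnorm(n)
  data.frame(x = x, z = z, y = y)
}

train    <- simulate(2000)  # used only to build the prediction model
analysis <- simulate(5000)  # used for downstream inference

# Upstream prediction model that never sees x (an assumption of this sketch)
pred_model <- lm(y ~ z, data = train)
analysis$f <- predict(pred_model, newdata = analysis)

# "Oracle" analysis on the true outcome vs. naive analysis on predictions
oracle_fit <- lm(y ~ x, data = analysis)
naive_fit  <- lm(f ~ x, data = analysis)

results <- rbind(
  oracle = coef(summary(oracle_fit))["x", c("Estimate", "Std. Error")],
  naive  = coef(summary(naive_fit))["x", c("Estimate", "Std. Error")]
)
print(round(results, 3))
```

In this setup the naive slope is pulled well below the oracle slope, even though its standard error looks just as reassuring.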

Workshop Overview

In this workshop, we explore the consequences of inference on predicted data across several biomedical applications. Drawing from classical approaches to measurement error and recent developments in bias correction, we will present a suite of prediction-based inference methods that adjust for prediction-related uncertainty and improve inference validity and efficiency. We will also introduce ipd, a user-friendly R package that implements several of these correction methods through a unified interface. The package supports modular integration into existing workflows and includes tidy methods for model inspection and diagnostics.
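
As a rough illustration of the unified interface, a call to the package might look like the sketch below. It assumes a recent release of ipd in which `simdat()` returns a data frame with outcome `Y`, prediction `f`, covariates such as `X1`, and a `set_label` column marking the training, labeled, and unlabeled sets. Argument names and defaults can differ between versions, so the package documentation should be treated as authoritative.

```r
# The guard lets this script run without error when ipd is not installed.
ipd_installed <- requireNamespace("ipd", quietly = TRUE)
fit <- NULL

if (ipd_installed) {
  set.seed(123)

  # Simulated training, labeled, and unlabeled sets (sizes are arbitrary)
  dat <- ipd::simdat(n = c(1000, 300, 1000), effect = 1,
                     sigma_Y = 1, model = "ols")

  # The left-hand side pairs the observed outcome Y with its prediction f;
  # "ppi_plusplus" is one of several correction methods the package offers.
  fit <- ipd::ipd(Y - f ~ X1, method = "ppi_plusplus", model = "ols",
                  data = dat, label = "set_label")

  print(summary(fit))
}
```

Switching to another correction method should only require changing the `method` argument, which is what makes side-by-side comparisons convenient.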

This workshop covers four modules (time permitting), each illustrating IPD in R using the ipd package:

Unit 00: Getting Started
- Introduce IPD concepts and core ipd package functions
- Simulate data and explore the bias and variance of AI/ML predictions versus ‘real’ data
- Fit naive and classical inference models and compare with IPD methods
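
To make the naive, classical, and IPD comparison concrete, here is a hand-rolled sketch of a prediction-powered style correction for a linear regression slope, written in base R. It is not the ipd package's implementation. The simulated data, the helper names (`simulate`, `slope`), and the use of model-based standard errors are assumptions made for illustration. The corrected estimate adds a "rectifier", the slope of `y - f` on `x` in the labeled set, to the naive slope from the unlabeled set. Because the two pieces come from independent samples, their variances add.

```r
set.seed(42)

# Hypothetical data-generating model; the true marginal slope of y on x is 1.5
simulate <- function(n) {
  x <- rnorm(n)
  z <- 0.5 * x + rnorm(n)
  y <- 1 + x + z + rnorm(n)
  data.frame(x = x, z = z, y = y)
}

train     <- simulate(2000)  # builds the upstream prediction model
labeled   <- simulate(300)   # small set with both y and f
unlabeled <- simulate(5000)  # large set with predictions only
unlabeled$y <- NULL          # the true outcome is unobserved here

pred_model  <- lm(y ~ z, data = train)
labeled$f   <- predict(pred_model, newdata = labeled)
unlabeled$f <- predict(pred_model, newdata = unlabeled)

# Helper: slope estimate and model-based standard error for x
slope <- function(fit) coef(summary(fit))["x", c("Estimate", "Std. Error")]

naive     <- slope(lm(f ~ x, data = unlabeled))       # predictions as data
classical <- slope(lm(y ~ x, data = labeled))         # labeled data only
rectifier <- slope(lm(I(y - f) ~ x, data = labeled))  # prediction bias

# Corrected estimate: naive slope plus rectifier
ppi_est <- naive[["Estimate"]] + rectifier[["Estimate"]]
ppi_se  <- sqrt(naive[["Std. Error"]]^2 + rectifier[["Std. Error"]]^2)

results <- rbind(naive = naive, classical = classical,
                 corrected = c(ppi_est, ppi_se))
print(round(results, 3))
```

Under these assumptions the corrected estimate is centered near the classical one but has a smaller standard error, because the large unlabeled set contributes information that the labeled set alone cannot.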

Unit 01: The Rashomon Quartet
- Train multiple prediction models on the Rashomon Quartet training set
- Compare the performances of the upstream predictions on the Rashomon Quartet testing set
- Recover classical estimates using IPD and contrast with naive estimates

Unit 02: Different Measures of Adiposity
- Explore the National Health and Nutrition Examination Survey (NHANES) pre- and post-COVID-19
- Define obesity based on body mass index, waist circumference, and gold-standard dual-energy X-ray absorptiometry
- Demonstrate how conclusions differ for naive, classical, and IPD logistic regression
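
The following base R sketch illustrates how a surrogate outcome definition can change a logistic regression conclusion. It uses simulated data only: the variables, the cutoffs (body fat of 35% or more, BMI of 30 or more), and the assumed relationship between BMI and body fat are hypothetical and are not taken from NHANES or from clinical guidelines. In this setup BMI tracks body fat within each sex but does not reflect the sex difference in body fat, so the BMI-based outcome hides an association that the gold-standard outcome reveals.

```r
set.seed(7)

n      <- 4000
female <- rbinom(n, 1, 0.5)
age    <- runif(n, 20, 70)

# Simulated body fat percentage (stand-in for a gold-standard measure)
fat <- 25 + 10 * female + 0.1 * (age - 45) + rnorm(n, sd = 6)

# Simulated BMI: follows sex-centered body fat, plus measurement noise
bmi <- 27 + 0.5 * (fat - 25 - 10 * female) + rnorm(n, sd = 2.5)

dat <- data.frame(
  age       = age,
  female    = female,
  obese_dxa = as.integer(fat >= 35),  # arbitrary cutoff for this simulation
  obese_bmi = as.integer(bmi >= 30)
)

# Gold-standard measurements are assumed available for a subset only
dxa_subset <- dat[sample(n, 500), ]

naive_fit     <- glm(obese_bmi ~ age + female, family = binomial, data = dat)
classical_fit <- glm(obese_dxa ~ age + female, family = binomial,
                     data = dxa_subset)

sex_effect <- rbind(
  naive     = coef(summary(naive_fit))["female", c("Estimate", "Std. Error")],
  classical = coef(summary(classical_fit))["female", c("Estimate", "Std. Error")]
)
print(round(sex_effect, 3))
```

The naive analysis has the larger sample but answers a different question, while the classical analysis targets the right outcome with far fewer observations. IPD methods aim to combine the strengths of both.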

Unit 03: BCR-ABL Fusion in B-Cell Leukemia
- Learn gene expression classifiers for acute lymphoblastic leukemia (ALL) genetic subtypes
- Harmonize features across two studies (ALL and Golub) and predict BCR-ABL1 (Philadelphia chromosome) fusion status
- Perform IPD to estimate associations between fusion status and clinical risk factors

Participation

This 90-minute workshop uses a blended format of instruction and hands-on coding exercises. Participants should:

- Follow along in the virtual RStudio environment (see below).
- Attempt to complete brief exercises or run the solution code snippets in real time.
- Engage in Q&A at module boundaries to troubleshoot and discuss concepts.

Prerequisites

- A computer with internet access to reach the RStudio Virtual Environment (see below).
- Familiarity with base R and tidyverse syntax (e.g., dplyr, broom).
- Basic understanding of predictive modeling (e.g., randomForest) and regression modeling (e.g., lm, glm).
- Exposure to Bioconductor’s ExpressionSet, AnnotationDbi, and MLInterfaces is helpful for the last module.

R / Bioconductor Packages Used

- Datasets and Annotation: nhanesA, ALL, golubEsets, AnnotationDbi, hgu95av2.db, hu6800.db
- Data Manipulation and Visualization: broom, scales, janitor, GGally, patchwork, tidyverse
- Predictive Modeling: neuralnet, partykit, randomForest, ranger, mgcv, pROC, DALEX, MLInterfaces
- Inference with Predicted Data: ipd

Time Outline (90 minutes)

Activity	Time
Brief Overview of the Problem	15 m
Unit 00: Getting Started	15 m
Unit 01: The Rashomon Quartet	15 m
Unit 02: Different Measures of Adiposity	15 m
Unit 03: BCR-ABL Fusion in B-Cell Leukemia	15 m
Wrap-Up and Q&A	15 m

Workshop Goals and Objectives

Learning Goals:

- Understand the limitations of using predicted data for inference.
- Learn how IPD methods adjust for bias and recover valid uncertainty estimates.
- Gain practical skills with the ipd R package across simulated and real datasets.

Learning Objectives: By the end of the workshop, participants will be able to:

- Train and evaluate predictive models (LDA, neural nets, random forests) using R and Bioconductor workflows.
- Explore data with AI/ML-predicted outcomes and diagnose bias/variance in predictions.
- Apply ipd::ipd() for continuous and binary outcomes to correct inference using predicted data.
- Interpret IPD outputs and visualize adjusted coefficient estimates with confidence intervals.

Workshop Environment

The companion website for this workshop is available at:

https://salernos.github.io/ipdworkshop

To use the workshop image:

docker run -e PASSWORD=<choose_a_password_for_rstudio> -p 8787:8787 ghcr.io/salernos/ipdworkshop:latest

Once the container is running, navigate to http://localhost:8787/ and log in with the username rstudio and the password you chose.