Inference with Predicted Data (IPD) Workshop
What do we do after we have machine learned everything?
Presenters: Jesse Gronsbell, Stephen Salerno
Contributors (Alphabetical Order): Awan Afiaz, David Cheng, Jianhui Gao, Jesse Gronsbell, Kentaro Hoffman, Jeff Leek, Qiongshi Lu, Tyler McCormick, Jiacheng Miao, Anna Neufeld, Stephen Salerno
Workshop Date: June 24, 2025
Background and Motivation
Artificial intelligence and machine learning (AI/ML) have become essential tools in biomedical research, enabling large-scale analyses across diverse domains such as genomics, structural biology, and electronic health records-based research. Increasingly, researchers rely on model-generated predictions, rather than directly measured variables, as inputs for downstream statistical analyses. For example, predicted gene expression values or polygenic risk scores are often used in place of experimental assays, allowing researchers to expand cohort sizes and explore hypotheses when traditional data collection is infeasible, costly, or time-consuming.
While this practice of “using predictions as data” holds promise for accelerating scientific discovery, it presents significant challenges for statistical inference. When predicted values are used in place of true variables, the resulting estimates of association can be biased and misleading if uncertainty in the prediction step is not properly accounted for.
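To make the concern concrete, here is a minimal base R simulation. Everything in it is an illustrative assumption rather than workshop material: the data-generating process, the proxy feature `w`, and the linear prediction model are all hypothetical. The outcome `y` has a true slope of 1 on the covariate `x`, but the prediction `f` is built from a noisy proxy and never sees `x` directly.

```r
set.seed(2025)

# Hypothetical data-generating process: the true slope of y on x is 1.
simulate <- function(n) {
  x <- rnorm(n)                    # covariate of scientific interest
  y <- 1 * x + rnorm(n)            # measured outcome
  w <- y + rnorm(n, sd = sqrt(2))  # noisy proxy feature available to the predictor
  data.frame(x = x, y = y, w = w)
}

train <- simulate(2000)  # used only to fit the prediction model
study <- simulate(5000)  # downstream analysis sample

# "Upstream" prediction model: predicts y from the proxy w alone.
pred_model <- lm(y ~ w, data = train)
study$f <- predict(pred_model, newdata = study)

# Downstream inference: the same regression on measured vs. predicted outcomes.
fit_true  <- lm(y ~ x, data = study)
fit_naive <- lm(f ~ x, data = study)

comparison <- rbind(
  measured  = coef(summary(fit_true))["x", c("Estimate", "Std. Error")],
  predicted = coef(summary(fit_naive))["x", c("Estimate", "Std. Error")]
)
print(round(comparison, 3))
```

Under these assumptions the slope estimated from the predicted outcome is attenuated (about 0.5 in expectation, because the prediction shrinks toward the mean), and its reported standard error is smaller than the one from the measured outcome, so the naive analysis is both off-target and overconfident.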
Workshop Overview
In this workshop, we explore the consequences of inference on predicted data across several biomedical applications. Drawing from classical approaches to measurement error and recent developments in bias correction, we will present a suite of prediction-based inference methods that adjust for prediction-related uncertainty and improve inference validity and efficiency. We will also introduce ipd, a user-friendly R package that implements several of these correction methods through a unified interface. The package supports modular integration into existing workflows and includes tidy methods for model inspection and diagnostics.
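The general idea behind such corrections can be sketched in base R. The example below is a simplified, prediction-powered-style correction written for illustration only: it is not the implementation used by any particular method in `ipd`, and the simulated data, the proxy feature `w`, and the helper `slope()` are hypothetical. It combines a large unlabeled sample (predictions only) with a small labeled sample (measured outcome and prediction), and uses the labeled sample to estimate and remove the bias of the naive analysis.

```r
set.seed(24)

# Hypothetical data-generating process: the true slope of y on x is 1.
simulate <- function(n) {
  x <- rnorm(n)
  y <- 1 * x + rnorm(n)
  w <- y + rnorm(n, sd = sqrt(2))  # noisy proxy used by the prediction model
  data.frame(x = x, y = y, w = w)
}

train     <- simulate(2000)  # fits the prediction model only
labeled   <- simulate(500)   # y and f both available
unlabeled <- simulate(5000)  # only f is used here

pred_model  <- lm(y ~ w, data = train)
labeled$f   <- predict(pred_model, newdata = labeled)
unlabeled$f <- predict(pred_model, newdata = unlabeled)

# Helper: slope on x and its model-based standard error.
slope <- function(fit) coef(summary(fit))["x", c("Estimate", "Std. Error")]

naive     <- slope(lm(f ~ x, data = unlabeled))       # predictions as data
classical <- slope(lm(y ~ x, data = labeled))         # labeled data only
rectifier <- slope(lm(I(y - f) ~ x, data = labeled))  # bias of f, from labeled data

# Corrected estimate: naive slope plus the estimated bias correction.
# The two pieces come from independent samples, so their variances add.
corrected <- c(
  "Estimate"   = naive[["Estimate"]] + rectifier[["Estimate"]],
  "Std. Error" = sqrt(naive[["Std. Error"]]^2 + rectifier[["Std. Error"]]^2)
)

results <- rbind(naive = naive, classical = classical, corrected = corrected)
print(round(results, 3))
```

The sketch relies on ordinary least squares being linear in the outcome and on model-based (homoskedastic) standard errors. How much precision the corrected estimate gains over the labeled-only analysis depends on how well the predictions track the measured outcome.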
This workshop covers four modules (time permitting), each illustrating IPD in R using the ipd package:
Unit 00: Getting Started
- Introduce IPD concepts and core `ipd` package functions
- Simulate data and explore the bias and variance of AI/ML predictions versus ‘real’ data
- Fit naive and classical inference models and compare with IPD methods
Unit 01: The Rashomon Quartet
- Train multiple prediction models on the Rashomon Quartet training set
- Compare the performances of the upstream predictions on the Rashomon Quartet testing set
- Recover classical estimates using IPD and contrast with naive estimates
Unit 02: Different Measures of Adiposity
- Explore the National Health and Nutrition Examination Survey (NHANES) pre- and post-COVID-19
- Define obesity based on body mass index, waist circumference, and gold-standard dual-energy X-ray absorptiometry
- Demonstrate how conclusions differ for naive, classical, and IPD logistic regression
Unit 03: BCR-ABL Fusion in B-Cell Leukemia
- Learn gene expression classifiers for acute lymphoblastic leukemia (ALL) genetic subtypes
- Harmonize features across two studies (`ALL` and `Golub`) and predict BCR-ABL1 (Philadelphia chromosome) fusion status
- Perform IPD to estimate associations between fusion status and clinical risk factors
Participation
This 90-minute workshop uses a blended format of instruction and hands-on coding exercises. Participants should:
- Follow along in the virtual RStudio environment (see below).
- Attempt to complete brief exercises or run the solution code snippets in real time.
- Engage in Q&A at module boundaries to troubleshoot and discuss concepts.
Prerequisites
- A computer with internet access to reach the RStudio Virtual Environment (see below).
- Familiarity with base R and tidyverse syntax (e.g., `dplyr`, `broom`).
- Basic understanding of predictive (e.g., `randomForest`) and regression modeling (e.g., `lm`, `glm`).
- Exposure to Bioconductor’s ExpressionSet, AnnotationDbi, and `MLInterfaces` is helpful for the last module.
R / Bioconductor Packages Used
- Datasets: `nhanesA`, `ALL`, `golubEsets`, `AnnotationDbi`, `hgu95av2.db`, `hu6800.db`
- Data Manipulation and Visualization: `broom`, `scales`, `janitor`, `GGally`, `patchwork`, `tidyverse`
- Predictive Modeling: `neuralnet`, `partykit`, `randomForest`, `ranger`, `mgcv`, `pROC`, `DALEX`, `MLInterfaces`
- Inference with Predicted Data: `ipd`
Time Outline (90 minutes)
| Activity | Time |
|---|---|
| Brief Overview of the Problem | 15 m |
| Unit 00: Getting Started | 15 m |
| Unit 01: The Rashomon Quartet | 15 m |
| Unit 02: Different Measures of Adiposity | 15 m |
| Unit 03: BCR-ABL Fusion in B-Cell Leukemia | 15 m |
| Wrap-Up and Q&A | 15 m |
Workshop Goals and Objectives
Learning Goals:
- Understand the limitations of using predicted data for inference.
- Learn how IPD methods adjust for bias and recover valid uncertainty estimates.
- Gain practical skills with the `ipd` R package across simulated and real datasets.
Learning Objectives: By the end of the workshop, participants will be able to:
- Train and evaluate predictive models (LDA, neural nets, random forests) using R and Bioconductor workflows.
- Explore data with AI/ML-predicted outcomes and diagnose bias/variance in predictions.
- Apply `ipd::ipd()` for continuous and binary outcomes to correct inference using predicted data.
- Interpret IPD outputs and visualize adjusted coefficient estimates with confidence intervals.
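As a rough preview of what calling the package can look like, the sketch below follows the pattern shown in the `ipd` package's public README as best recalled here. Treat the argument names and values as assumptions to check against the version you have installed: the formula convention `Y - f ~ X1` (measured outcome, prediction, covariate), the `set_label` column produced by `simdat()`, and the `"ppi_plusplus"` method name. The block is guarded so that it does nothing if `ipd` is not installed.

```r
if (requireNamespace("ipd", quietly = TRUE)) {
  set.seed(123)

  # Simulated training / labeled / unlabeled sets for a linear model
  # (sample sizes and effect size are arbitrary illustrative choices).
  dat <- ipd::simdat(n = c(10000, 500, 1000), effect = 1,
                     sigma_Y = 4, model = "ols")

  # Y = measured outcome, f = predicted outcome, X1 = covariate of interest.
  fit <- ipd::ipd(Y - f ~ X1, method = "ppi_plusplus", model = "ols",
                  data = dat, label = "set_label")

  print(summary(fit))
} else {
  message("Package 'ipd' is not installed; skipping this sketch.")
}
```

For a binary outcome the same call would be expected to take a logistic model in place of `"ols"`; consult the package documentation for the methods and models supported by your version.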
Workshop Environment
The companion website for this workshop is available at:
https://salernos.github.io/ipdworkshop
To use the workshop image:

    docker run -e PASSWORD=<choose_a_password_for_rstudio> -p 8787:8787 ghcr.io/salernos/ipdworkshop:latest

Once running, navigate to http://localhost:8787/ and log in with the username `rstudio` and the password you chose.