What do we do after we have machine learned everything?
Presenters: Stephen Salerno
Contributors (Alphabetical Order): Awan Afiaz, David Cheng, Jianhui Gao, Jesse Gronsbell, Kentaro Hoffman, Jeff Leek, Qiongshi Lu, Tyler McCormick, Jiacheng Miao, Anna Neufeld, Stephen Salerno
Workshop Date: June 24, 2025
Background and Motivation
Artificial intelligence and machine learning (AI/ML) have become essential tools in biomedical research, enabling large-scale analyses across diverse domains such as genomics, structural biology, and electronic health records-based research. Increasingly, researchers rely on model-generated predictions, rather than directly measured variables, as inputs for downstream statistical analyses. For example, predicted gene expression values or polygenic risk scores are often used in place of experimental assays, allowing researchers to expand cohort sizes and explore hypotheses when traditional data collection is infeasible, costly, or time-consuming.
While this practice of “using predictions as data” holds promise for accelerating scientific discovery, it presents significant challenges for statistical inference. When predicted values are used in place of true variables, the resulting estimates of association can be biased and misleading if uncertainty in the prediction step is not properly accounted for.
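The bias described above can be reproduced in a toy simulation. The sketch below is illustrative only: the data-generating model, the variable names (`X`, `W`, `Y`, `f`), and the choice of a linear prediction model that sees only `W` are all assumptions made for this example, not part of the workshop materials.

```r
set.seed(2025)
n <- 5000

# Hypothetical data-generating model (an assumption for illustration only)
X <- rnorm(n)              # covariate of scientific interest
W <- 0.5 * X + rnorm(n)    # feature available to the prediction model
Y <- X + W + rnorm(n)      # true outcome; population slope of Y on X is 1.5
dat <- data.frame(X, W, Y)

# "Upstream" prediction model: trained on one half, using W only
train <- seq_len(n) <= n / 2
pred_model <- lm(Y ~ W, data = dat[train, ])

# Held-out half: attach predictions f in place of the observed outcome
test <- dat[!train, ]
test$f <- predict(pred_model, newdata = test)

# Downstream analysis with the observed outcome versus the predicted outcome
fit_true  <- lm(Y ~ X, data = test)
fit_naive <- lm(f ~ X, data = test)

# In this setup the naive slope targets about 0.7 rather than 1.5
print(rbind(
  observed  = coef(summary(fit_true))["X", c("Estimate", "Std. Error")],
  predicted = coef(summary(fit_naive))["X", c("Estimate", "Std. Error")]
))

# Predictions are also less variable than the outcomes they replace
print(c(var_Y = var(test$Y), var_f = var(test$f)))
```

Because the prediction model never sees `X` directly, the predictions carry only part of the `X`-`Y` relationship, so treating `f` as if it were `Y` gives a slope that is well below the true value.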
Workshop Overview
In this workshop, we explore the consequences of inference on predicted data across several biomedical applications. Drawing from classical approaches to measurement error and recent developments in bias correction, we will present a suite of prediction-based inference methods that adjust for prediction-related uncertainty and improve inference validity and efficiency. We will also introduce `ipd`, a user-friendly R package that implements several of these correction methods through a unified interface. The package supports modular integration into existing workflows and includes `tidy` methods for model inspection and diagnostics.
This workshop covers four modules (time permitting), each illustrating inference with predicted data (IPD) in R using the `ipd` package:
- Unit 00: Getting Started
  - Introduce IPD concepts and core `ipd` package functions
  - Simulate data and explore the bias and variance of AI/ML predictions versus ‘real’ data
  - Fit naive and classical inference models and compare with IPD methods
- Unit 01: The Rashomon Quartet
  - Train multiple prediction models on the Rashomon Quartet training set
  - Compare the performances of the upstream predictions on the Rashomon Quartet testing set
  - Recover classical estimates using IPD and contrast with naive estimates
- Unit 02: Different Measures of Adiposity
  - Explore the National Health and Nutrition Examination Survey (NHANES) pre- and post-COVID-19
  - Define obesity based on body mass index, waist circumference, and gold-standard dual-energy X-ray absorptiometry
  - Demonstrate how conclusions differ for naive, classical, and IPD logistic regression
- Unit 03: BCR-ABL Fusion in B-Cell Leukemia
  - Learn gene expression classifiers for acute lymphoblastic leukemia (ALL) genetic subtypes
  - Harmonize features across two studies (`ALL` and `Golub`) and predict BCR-ABL1 (Philadelphia chromosome) fusion status
  - Perform IPD to estimate associations between fusion status and clinical risk factors
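The naive, classical, and corrected comparison that recurs across the units can be sketched by hand in base R. This is a minimal illustration under stated assumptions: the simulated data, the sample sizes, and the helper names (`simulate`, `est`) are hypothetical, and the correction shown is a simple prediction-powered-style rectifier for linear regression with a homoskedastic standard error approximation. It is not claimed to be what the `ipd` package computes internally.

```r
set.seed(42)

# Hypothetical simulation settings (assumptions, not workshop data)
simulate <- function(n) {
  X <- rnorm(n)
  W <- 0.5 * X + rnorm(n)
  data.frame(X = X, W = W, Y = X + W + rnorm(n))  # slope of Y on X is 1.5
}
train <- simulate(2000)   # used only to fit the prediction model
lab   <- simulate(300)    # small set with observed outcomes
unlab <- simulate(5000)   # large set where only predictions are used

# Upstream prediction model and its predictions
pred_model <- lm(Y ~ W, data = train)
lab$f   <- predict(pred_model, newdata = lab)
unlab$f <- predict(pred_model, newdata = unlab)

naive     <- lm(f ~ X, data = unlab)       # predictions treated as data
classical <- lm(Y ~ X, data = lab)         # labeled data only
rect      <- lm(I(Y - f) ~ X, data = lab)  # rectifier: bias of f, from labeled data

# Helper: slope estimate and model-based standard error for X
est <- function(fit) coef(summary(fit))["X", c("Estimate", "Std. Error")]

# Corrected slope = naive slope + rectifier; the two samples are independent,
# so the variances add
ppi_est <- est(naive)["Estimate"] + est(rect)["Estimate"]
ppi_se  <- sqrt(est(naive)["Std. Error"]^2 + est(rect)["Std. Error"]^2)

results <- rbind(naive = est(naive), classical = est(classical),
                 corrected = c(ppi_est, ppi_se))
z <- qnorm(0.975)
results <- cbind(results,
                 lower = results[, "Estimate"] - z * results[, "Std. Error"],
                 upper = results[, "Estimate"] + z * results[, "Std. Error"])
print(round(results, 3))
```

In this setup the naive interval is narrow but centered on the wrong value, the classical interval is valid but wide, and the corrected estimate borrows precision from the unlabeled predictions while using the labeled data to remove their bias.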
Participation
This 90-minute workshop uses a blended format of instruction and hands-on coding exercises. Participants should:
- Follow along in the virtual RStudio environment (see below).
- Attempt to complete brief exercises or run the solution code snippets in real time.
- Engage in Q&A at module boundaries to troubleshoot and discuss concepts.
Prerequisites
- A computer with internet access to reach the virtual RStudio environment (see below).
- Familiarity with base R and tidyverse syntax (e.g., `dplyr`, `broom`).
- Basic understanding of predictive (e.g., `randomForest`) and regression modeling (e.g., `lm`, `glm`).
- Exposure to Bioconductor’s `ExpressionSet`, `AnnotationDbi`, and `MLInterfaces` is helpful for the last module.
R / Bioconductor Packages Used
- Datasets: `nhanesA`, `ALL`, `golubEsets`, `AnnotationDbi`, `hgu95av2.db`, `hu6800.db`
- Data Manipulation and Visualization: `broom`, `scales`, `janitor`, `GGally`, `patchwork`, `tidyverse`
- Predictive Modeling: `neuralnet`, `partykit`, `randomForest`, `ranger`, `mgcv`, `pROC`, `DALEX`, `MLInterfaces`
- Inference with Predicted Data: `ipd`
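For participants working outside the provided environment, a quick check like the one below may help confirm that the packages listed above are available. It is a sketch that only reports what is missing; the variable names are illustrative.

```r
# Packages named in the list above
pkgs <- c(
  "nhanesA", "ALL", "golubEsets", "AnnotationDbi", "hgu95av2.db", "hu6800.db",
  "broom", "scales", "janitor", "GGally", "patchwork", "tidyverse",
  "neuralnet", "partykit", "randomForest", "ranger", "mgcv", "pROC",
  "DALEX", "MLInterfaces",
  "ipd"
)

# Compare against the locally installed packages without loading them
is_installed <- pkgs %in% rownames(installed.packages())
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) == 0) {
  message("All listed packages are installed.")
} else {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```

One option for installing anything reported as missing is `BiocManager::install(missing_pkgs)`, which resolves both CRAN and Bioconductor packages.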
Time Outline (90 minutes)
Activity | Time |
---|---|
Brief Overview of the Problem | 15 m |
Unit 00: Getting Started | 15 m |
Unit 01: The Rashomon Quartet | 15 m |
Unit 02: Different Measures of Adiposity | 15 m |
Unit 03: BCR-ABL Fusion in B-Cell Leukemia | 15 m |
Wrap-Up and Q&A | 15 m |
Workshop Goals and Objectives
Learning Goals:
- Understand the limitations of using predicted data for inference.
- Learn how IPD methods adjust for bias and recover valid uncertainty estimates.
- Gain practical skills with the `ipd` R package across simulated and real datasets.
Learning Objectives: By the end of the workshop, participants will be able to:
- Train and evaluate predictive models (LDA, neural nets, random forests) using R and Bioconductor workflows.
- Explore data with AI/ML-predicted outcomes and diagnose bias/variance in predictions.
- Apply `ipd::ipd()` for continuous and binary outcomes to correct inference using predicted data.
- Interpret IPD outputs and visualize adjusted coefficient estimates with confidence intervals.
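As a rough preview of the `ipd::ipd()` objective, the sketch below builds a small stacked data set and fits one correction method for a continuous outcome. Several things here are assumptions: the toy data and column names (`Y`, `f`, `X1`, `set_label`) are invented for illustration, and the argument names, the `Y - f ~ X1` formula form, the `"labeled"`/`"unlabeled"` label values, and the `"ppi_plusplus"` method string follow the package documentation as best recalled. Check `?ipd::ipd` for the interface in your installed version.

```r
set.seed(7)

# Hypothetical toy data; column names are illustrative choices
n_lab   <- 500
n_unlab <- 1000
n  <- n_lab + n_unlab
X1 <- rnorm(n)
Y  <- X1 + rnorm(n, sd = 2)   # in practice Y is unobserved in unlabeled rows
f  <- 0.6 * X1 + rnorm(n)     # stand-in for an imperfect AI/ML prediction

dat <- data.frame(
  Y, f, X1,
  set_label = rep(c("labeled", "unlabeled"), times = c(n_lab, n_unlab))
)

if (requireNamespace("ipd", quietly = TRUE)) {
  # Assumed interface: observed outcome minus prediction on the left-hand side,
  # covariates on the right, and `label` naming the labeled/unlabeled column
  fit <- ipd::ipd(Y - f ~ X1, method = "ppi_plusplus", model = "ols",
                  data = dat, label = "set_label")
  print(summary(fit))
} else {
  message("Package 'ipd' is not installed; skipping the model fit.")
}
```

For a binary outcome, the package documentation describes a logistic option for the `model` argument; the same stacked layout of labeled and unlabeled rows would apply.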
Workshop Environment
The companion website for this workshop is available at:
https://salernos.github.io/ipdworkshop
To use the workshop image:
docker run -e PASSWORD=<choose_a_password_for_rstudio> -p 8787:8787 ghcr.io/salernos/ipdworkshop:latest
Once the container is running, navigate to http://localhost:8787/ and log in with the username `rstudio` and the password you chose.