# Additional Information

Methods Reference and Extended Learning Resources

## Methods References
Throughout the workshop we use several approaches for conducting inference with predicted data. These methods address a common challenge in modern data science:
> How can we use machine learning predictions as ‘data’ to improve efficiency while still guaranteeing valid statistical inference?
Each approach reviewed here combines information from:
- Labeled data: where the true outcomes, predicted outcomes, and covariates are observed, and
- Unlabeled data: where only predicted outcomes and covariates are available.
The table below summarizes the main methods implemented in the ipd package.
| Method | ipd Argument | Inferential Model(s) Implemented | Primary Reference |
|---|---|---|---|
| Chen and Chen | "chen" | OLS, logistic, Poisson | Gronsbell et al. (2026) |
| PDC | "pdc" | OLS, logistic, Poisson | Gan et al. (2024) |
| PostPI (analytic) | "postpi_analytic" | OLS | Wang et al. (2020) |
| PostPI (bootstrap) | "postpi_boot" | OLS, logistic | Wang et al. (2020) |
| PPI | "ppi" | OLS, logistic, Poisson, mean, quantile | Angelopoulos et al. (2023) |
| PPI++ | "ppi_plusplus" | OLS, logistic, Poisson, mean, quantile | Angelopoulos et al. (2024) |
| PPI (using all data) | "ppi_a" | OLS | Gronsbell et al. (2026) |
| PSPA | "pspa" | OLS, logistic, Poisson, mean, quantile | Miao et al. (2023) |
Across all of these approaches, the goal is the same:
- Use predictions to improve efficiency, and
- Maintain guarantees of valid statistical inference.
In practice this means preserving:
- Unbiased or approximately unbiased effect estimates
- Properly calibrated confidence intervals
- Improved efficiency/statistical power when predictions are informative
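The simplest case, estimating a mean, shows how these pieces fit together. In prediction-powered inference (PPI; Angelopoulos et al., 2023), the mean of the predictions on the unlabeled set is corrected by a "rectifier" estimated on the labeled set. The sketch below is illustrative Python with made-up data, not the workshop's `ipd` code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Labeled set: true outcomes Y and predictions Yhat are both observed
n = 200
Y_lab = rng.normal(5.0, 1.0, size=n)
Yhat_lab = Y_lab + rng.normal(0.5, 0.5, size=n)   # predictions with systematic bias

# Unlabeled set: only predictions are observed
N = 5000
Y_unl = rng.normal(5.0, 1.0, size=N)              # never seen by the analyst
Yhat_unl = Y_unl + rng.normal(0.5, 0.5, size=N)

# Naive estimate: treat predictions as if they were data (inherits their bias)
naive = Yhat_unl.mean()

# PPI estimate: rectify the naive estimate using the labeled set
rectifier = (Y_lab - Yhat_lab).mean()
ppi = Yhat_unl.mean() + rectifier

print(round(naive, 2), round(ppi, 2))  # naive is shifted upward; PPI recenters near 5
```

Because the rectifier estimates the prediction bias, the corrected estimate is centered at the true mean, while the large unlabeled set keeps its variance small.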
## Data Sources Used in this Workshop

### Where Each Dataset Appears
The workshop modules draw on several synthetic and real-world datasets across multiple scientific domains.
| Unit | Title | Dataset | Scientific Context | Role in the Workshop |
|---|---|---|---|---|
| Unit 00 | Getting Started | Simulated | General Statistical Setting | Introduces labeled/unlabeled structure and core IPD ideas |
| Unit 01 | Proteomics with AlphaFold | AlphaFold / PTM | Proteomics and Structural Biology | Studies the association between PTMs and disorder when disorder is predicted at scale with novel AI methods |
| Unit 02 | Measuring Adiposity | NHANES | Public Health and Epidemiology | Introduces a non-AI/ML example of BMI as a surrogate for adiposity and highlights data structure considerations |
| Unit 03 | BCR-ABL Fusion | ALL / Golub | Genomics and Molecular Classification | Provides a gene expression example, using data and methods from Bioconductor |
| Unit 04 | The Rashomon Quartet | Rashomon Quartet | Prediction Ambiguity and Interpretation | Shows that similar predictive performance can still support very different downstream conclusions |
### Simulated Data
Our Getting Started unit uses simulated data. This is especially useful because it lets us control:
- The true data-generating process,
- The amount of prediction error,
- The labeled/unlabeled sample sizes, and
- The strength of the associations of interest.
Simulation is often the clearest way to build intuition for:
- Bias-variance tradeoffs,
- Confidence interval behavior,
- Efficiency gains, and
- The conditions under which prediction-based (PB) inference methods are most useful.
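A small simulation makes the core failure mode concrete. Here the outcome predictions are systematically shrunk toward the mean, so a naive regression of the predictions on the covariate underestimates the true slope (illustrative Python; the model and numbers are made up for this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# True data-generating process: Y = 1 + 2*X + noise
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(scale=1.0, size=n)

# Imperfect predictions: systematically shrunk toward the mean
Yhat = 0.8 * Y + rng.normal(scale=0.5, size=n)

# Naive analysis: regress the predictions on X as if they were the outcome
slope_naive = np.polyfit(X, Yhat, 1)[0]

# Oracle analysis on the true outcome
slope_true = np.polyfit(X, Y, 1)[0]

print(round(slope_true, 2), round(slope_naive, 2))  # roughly 2.0 vs 1.6
```

Controlling the shrinkage factor and noise scale in a simulation like this is exactly how the tradeoffs listed above can be explored.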
### AlphaFold
The proteomics module is motivated by a modern structural biology setting in which machine learning predictions are available at massive scale, while experimental validation remains much more limited.
AlphaFold predicts protein structure from amino acid sequence with remarkably high accuracy and now provides structural annotations for essentially the whole proteome. These predictions make large-scale downstream analyses possible, but they do not remove the need for careful inference.
In the workshop module, the scientific question is whether certain post-translational modifications (PTMs) are more likely to occur in intrinsically disordered regions (IDRs). Each residue has:
- `Y`: a binary gold-standard disorder label
- `Yhat`: a predicted probability of disorder derived from AlphaFold-based processing
- PTM indicators such as `phosphorylated`, `ubiquitinated`, and `acetylated`
The inferential target is the association between PTM status and disorder, typically summarized through an odds ratio from logistic regression. This is an especially clean PB inference example because it has the exact structure the methods are designed for:
- A limited set of residues with trusted disorder labels,
- A much larger set with model-based disorder predictions, and
- A downstream scientific question about association rather than prediction alone.
The module uses the complete dataset to simulate partial labeling, so participants can see how classical, naive, and PB inference methods behave as the number of labeled residues changes.
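Conceptually, the module's masking step looks like the following sketch (generic Python; the variable names and sizes are illustrative, not the module's actual data):

```python
import numpy as np

rng = np.random.default_rng(2024)
n_residues = 1_000

# Toy stand-ins for residue-level fields (names are illustrative)
Y = rng.binomial(1, 0.3, size=n_residues)  # gold-standard disorder label
Yhat = np.clip(0.7 * Y + rng.normal(0.15, 0.1, size=n_residues), 0, 1)  # predicted prob.

# Keep gold-standard labels for only a small subset of residues
n_labeled = 100
labeled = np.zeros(n_residues, dtype=bool)
labeled[rng.choice(n_residues, size=n_labeled, replace=False)] = True

# "Unlabeled" residues keep their prediction but lose the true label
Y_obs = np.where(labeled, Y.astype(float), np.nan)

print(int(labeled.sum()), int(np.isnan(Y_obs).sum()))  # 100 labeled, 900 unlabeled
```

Re-running the downstream analysis while varying `n_labeled` shows how each method's estimates and interval widths change with the amount of gold-standard data.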
### NHANES
The National Health and Nutrition Examination Survey (NHANES) is a nationally representative program conducted by the U.S. Centers for Disease Control and Prevention. It combines interviews, physical examinations, and laboratory measurements to assess the health and nutritional status of the U.S. population.
In the workshop, NHANES is used to study a very natural example of inference with predicted data, but where the predictions are not complex. Here, the scientific target is true adiposity, measured using percent body fat. But in many studies, percent body fat is unavailable, so researchers instead rely on simpler proxies such as:
- BMI
- Waist circumference
This is a particularly useful teaching example because BMI is often treated as if it were the ‘real’ outcome, when in practice it is better viewed as a proxy or crude prediction of underlying adiposity. From the PB inference perspective, that is exactly the setting we care about.
In this module, the data are organized into a realistic labeled/unlabeled split:
- The 2017–2018 NHANES wave serves as the labeled dataset, because it includes DXA-based percent body fat,
- The August 2021–August 2023 wave serves as the unlabeled dataset, because DXA was no longer collected after the pandemic disruption.
This setup allows the workshop to show how PB inference can be used to estimate associations with true adiposity when:
- The biologically meaningful outcome is only available in an older cohort, and
- The newer cohort contains only cheaper anthropometric proxies.
Substantively, the example also highlights an important public health lesson: even familiar measurements like BMI can introduce bias when used uncritically in downstream inference.
### Leukemia Gene Expression Data: ALL and Golub
The genomics module uses two classic Bioconductor datasets.
#### ALL
The ALL package contains microarray expression data from patients with acute lymphoblastic leukemia (ALL).
This dataset includes:
- 128 leukemia samples
- Affymetrix HGU95Av2 expression measurements
- Phenotype information including immunophenotype, molecular subtype, age, and sex
In the workshop, the ALL data are filtered to focus on B-cell lineage ALL and used to train predictive models for BCR-ABL1 fusion status.
#### Golub_Merge
The golubEsets package contains the well-known Golub leukemia data.
The Golub_Merge object includes:
- Leukemia samples from ALL and AML patients
- Measurements on a different Affymetrix platform
- Phenotype annotations including lineage and sex
In the workshop, this second dataset serves as an unlabeled cohort. A classifier trained on the labeled ALL data is transferred to the Golub samples, and PB inference is then used to estimate downstream associations while correcting for the fact that the subtype labels are predicted rather than directly observed.
This mirrors a real translational genomics workflow in that only a subset of patients may receive gold-standard molecular testing, while a larger cohort has rich molecular features but incomplete labeling.
### The Rashomon Quartet
The workshop also uses the Rashomon Quartet as a conceptual example about the gap between predictive performance and scientific interpretation.
The Rashomon effect describes settings where many different models fit the data similarly well, yet support different interpretations or downstream decisions.
This example pairs naturally with PB inference because it reinforces a central lesson of the workshop:
> Strong predictive performance does not automatically imply valid inference.
Even when models appear equally good from a prediction standpoint, the consequences of using their outputs in later analyses may differ substantially.
## Additional Useful Resources

### Package References
These are good starting points if you would like to read more about the software used in this workshop:
- `ipd` package paper: Salerno et al. (2025), *Bioinformatics*
- `ipd` GitHub repository: https://github.com/ipd-tools/ipd
### Reproducible Data Analysis
Packages for wrangling, summarizing, and visualizing results are useful throughout the workshop.
### Model Diagnostics and Evaluation
Standard model evaluation tools can be helpful when you want to inspect predictive performance more closely.
## Suggested Post-Workshop Practice
If you would like to continue exploring these ideas after the workshop, here are a few useful next steps:
- **Change the labeled/unlabeled split.** Re-run a simulation or applied example with different numbers of labeled observations and compare how the methods behave.
- **Swap the predictive model or proxy.** Replace one prediction rule with another and examine how the naive and corrected estimates change.
- **Compare several PB inference methods on the same task.** For example, try `pspa`, `ppi_plusplus`, and a `postpi` method on the same problem and compare interval width, stability, and bias.
- **Deliberately worsen prediction quality.** Add noise, reduce features, or use a weaker model and study how sensitive downstream inference is to lower-quality predictions.
- **Think carefully about the scientific target.** In many applications, the hardest question is not how to fit the model, but what quantity is actually the scientific outcome of interest and how well the observed proxy represents it.
## Closing Note
A central theme of this workshop is that prediction and inference are not the same task. Modern scientific data analysis often depends on surrogate outcomes, machine learning predictions, and partially labeled datasets. The goal of this workshop is to demonstrate how we can use these tools while ensuring principled statistical inference.
For questions or feedback, please contact Stephen Salerno (ssalerno@fredhutch.org).