Proteomics with AlphaFold

Protein Disorder and PTMs

Apply PB methods in a proteomics setting to estimate associations when key outcomes are model-predicted rather than directly measured.

https://www.nature.com/articles/s41586-021-03819-2

Overview

Background

Modern protein biology increasingly relies on machine learning predictions. AlphaFold, developed by Google DeepMind, predicts a protein’s 3D structure from its amino acid sequence with remarkably high accuracy. These predictions have transformed structural biology by providing proteome-scale annotations that would be impossible to obtain experimentally at the same scale.

At the same time, many downstream biological questions remain fundamentally inferential rather than predictive. For example, we may want to ask whether post-translational modifications (PTMs) occur more often in intrinsically disordered regions (IDRs), or whether certain PTMs are associated with different structural environments. In settings like this, model predictions are useful, but they are not the same thing as directly observed truth.

This module uses the ipd package to study that problem. We will combine AlphaFold-derived predictions with a limited amount of ground truth and ask how different inferential strategies behave.

By the end of this module, you will be able to:

Construct labeled and unlabeled splits from biological data,
Estimate PTM–disorder associations using logistic regression,
Compare naive, classical, and prediction-based (PB) inference,
Quantify efficiency gains from unlabeled data, and
Study how many labeled observations may be needed for adequate power.

Biological Motivation

AlphaFold predictions can be post-processed to identify intrinsically disordered regions (IDRs), that is, stretches of a protein that do not adopt a stable folded structure.

IDRs are central to signaling and regulation. They are flexible, accessible, and often enriched for short linear motifs. Many post-translational modifications, especially phosphorylation, ubiquitination, and acetylation, are hypothesized to occur preferentially in disordered regions, consistent with the idea that structural disorder supports rapid and dynamic regulation.

In this workshop, each amino acid residue has:

Y: A binary ‘gold-standard’ disorder label,
Yhat: A predicted probability of disorder derived from AlphaFold-based processing,
PTM indicators:
- phosphorylated
- ubiquitinated
- acetylated

Our scientific question is:

Are residues with a given PTM more likely to lie in intrinsically disordered regions?

Statistical Framing

For a chosen PTM, X:

X = 1 if the residue carries that PTM,
Y = 1 if the residue lies in an IDR.

We summarize the association using an odds ratio from logistic regression:

\[ \text{OR} = \frac{\text{Odds}(Y = 1 \mid X = 1)}{\text{Odds}(Y = 1 \mid X = 0)}. \]

In practice, however, we may only observe Y on a labeled subset, while Yhat is available for all residues.

Part I: Understanding the Dataset

We will first load the packages needed for this module.

Code

# Install these packages if you have not already:
# install.packages(c("ipd", "broom", "tidyverse", "future", "furrr"))

library(ipd)
library(broom)
library(tidyverse)
library(future)
library(furrr)

We have included a file, data/alphafold.RData, containing a tibble named alphafold with columns:

Y
Yhat
phosphorylated
ubiquitinated
acetylated

Each row corresponds to a single residue.

Exercise 1: Inspecting the AlphaFold/PTM Dataset

Before fitting any models, let us inspect the data.

Questions to answer:

How many residues are there?
What fraction are disordered?
How many residues carry each PTM?

This gives you basic biological context and sanity-checks the input.

Answer

load("data/alphafold.RData")

glimpse(alphafold)

Rows: 10,802
Columns: 5
$ Y              <dbl[1d]> <array[26]>
$ Yhat           <dbl[1d]> <array[26]>
$ phosphorylated <dbl[1d]> <array[26]>
$ ubiquitinated  <dbl[1d]> <array[26]>
$ acetylated     <dbl[1d]> <array[26]>

alphafold |>
  summarise(
    n = n(),
    frac_disordered = mean(Y, na.rm = TRUE),
    phosphorylated  = sum(phosphorylated == 1),
    ubiquitinated   = sum(ubiquitinated  == 1),
    acetylated      = sum(acetylated     == 1)
  )

# A tibble: 1 × 5
      n frac_disordered phosphorylated ubiquitinated acetylated
  <int>           <dbl>          <int>         <int>      <int>
1 10802           0.177           6017          3738       1171

Notes

We should observe that:

The dataset contains 10,802 residues,
About 18% are labeled as intrinsically disordered,
PTMs are unevenly represented, with phosphorylation being the most common.

Discussion Questions

Why might experimentally validated disorder labels be much scarcer than AlphaFold-derived predictions?
How could limited disorder labeling affect downstream inference?

Part II: Choosing a PTM and Defining Variables

For illustration, we will analyze one PTM at a time. For a chosen PTM:

X is its binary indicator,
Y is the true disorder label,
Yhat is the predicted disorder probability.

Exercise 2: Creating the Analytic Dataset

Let us begin with phosphorylation. Create a working dataset with columns:

Y
Yhat
X (phosphorylation indicator)

This will be the working dataset for the rest of this module.

Answer

ptm <- "phosphorylated"

dat <- alphafold |>
  transmute(
    Y    = as.integer(Y),
    Yhat = as.numeric(Yhat),
    X    = as.integer(.data[[ptm]])
  )

dat |>
  count(X) |>
  mutate(pct = n / sum(n) * 100)

# A tibble: 2 × 3
      X     n   pct
  <int> <int> <dbl>
1     0  4785  44.3
2     1  6017  55.7

Notes

We have now defined our three core variables:

Y: true disorder,
Yhat: predicted disorder,
X: PTM indicator.

For phosphorylation, we should see that X = 1 for a substantial fraction of residues.

Discussion Questions

Why might some PTMs be easier to detect experimentally than others?
How could PTM-specific detection pipelines influence the analysis?

Part III: Simulating Partial Labeling

In many real studies, only a subset of residues has trusted disorder labels. Here, because we have labels for all samples, we can artificially create a partially labeled setting by:

Selecting n_labeled residues,
Keeping their observed Y,
Masking Y in the remaining rows.

Exercise 3: Creating Labeled/Unlabeled Splits

Write a function that:

Randomly selects n_labeled residues,
Marks them as "labeled",
Sets Y = NA for all others,
Returns a stacked dataset with a column set_label.

Answer

make_split <- function(dat, n_labeled, seed = NULL) {
  
  if (!is.null(seed)) set.seed(seed)

  idx <- sample.int(nrow(dat), size = n_labeled, replace = FALSE)

  labeled <- dat[idx, ] |> 
    
    mutate(set_label = "labeled")
  
  unlabeled <- dat[-idx, ] |> 
    
    mutate(Y = NA_integer_, set_label = "unlabeled")

  bind_rows(labeled, unlabeled)
}

Notes

This creates a realistic experimental setting:

Only a subset has true disorder labels,
All residues have available model-based predictions.

This mirrors many modern biological workflows where experimental validation is expensive and sparse, but predictions are readily available.

Discussion Questions

In real studies, how might labeled residues differ systematically from unlabeled ones?
What assumption are we making when we randomly subsample labels?

Part IV: Comparing Inference Methods on One Split

We now compare several approaches to estimating the same PTM–disorder odds ratio:

Oracle logistic regression: uses Y for all residues
Classical logistic regression: uses Y only in the labeled subset
Naive logistic regression: thresholds Yhat and treats it as truth
PB inference methods:
- PPI
- PDC
- Chen & Chen

The oracle is included as a benchmark, but in practice it is usually unavailable.

Exercise 4: Running All Methods on One Split

With n_labeled = 500:

Create a labeled/unlabeled split,
Fit the oracle, classical, and naive models,
Fit the PB inference models,
Compare the estimated odds ratios and 95% confidence intervals.

Answer

alpha <- 0.05

stacked <- make_split(dat = dat, n_labeled = 500, seed = 123)

fit_oracle <- glm(Y ~ X, data = dat, family = binomial())

fit_classical <- glm(Y ~ X, data = filter(stacked, set_label == "labeled"),
                     
  family = binomial())
  
fit_naive <- glm(I(Yhat > 0.5) ~ X, data = stacked, family = binomial())

fit_ppi <- ipd(Y - I(Yhat > 0.5) ~ X, method = "ppi", model = "logistic", 
               
  data = stacked, label = "set_label")

fit_pdc <- ipd(Y - I(Yhat > 0.5) ~ X, method = "pdc", model = "logistic", 
               
  data = stacked, label = "set_label")

fit_chen <- ipd(Y - I(Yhat > 0.5) ~ X, method = "chen", model = "logistic", 
               
  data = stacked, label = "set_label")

ex4_results <- bind_rows(
  
  tidy(fit_oracle,    conf.int = TRUE) |> mutate(Method = "Oracle"),
  
  tidy(fit_classical, conf.int = TRUE) |> mutate(Method = "Classical"),
  
  tidy(fit_naive,     conf.int = TRUE) |> mutate(Method = "Naive"),
  
  tidy(fit_ppi)                        |> mutate(Method = "PPI"),
  
  tidy(fit_pdc)                        |> mutate(Method = "PDC"),

  tidy(fit_chen)                       |> mutate(Method = "Chen + Chen")) |>
  
  filter(term == "X") |>
  
  mutate(
    
    Method = factor(Method) |>
      
      fct_relevel("Oracle", "Naive", "Classical", "PPI", "PDC", "Chen + Chen"),
    
    OR  = exp(estimate),
    
    LCL = exp(conf.low),
    
    UCL = exp(conf.high)) |>
  
  select(Method, OR, LCL, UCL)

glimpse(ex4_results)

Rows: 6
Columns: 4
$ Method <fct> Oracle, Classical, Naive, PPI, PDC, Chen + Chen
$ OR     <dbl> 2.130875, 2.066667, 2.716635, 2.272798, 2.204858, 2.211696
$ LCL    <dbl> 1.916754, 1.292530, 2.412201, 1.640979, 1.674278, 1.676679
$ UCL    <dbl> 2.371509, 3.366292, 3.065175, 3.147884, 2.903580, 2.917434

oracle_est <- ex4_results |> filter(Method == "Oracle") |> pull(OR)

ex4_results |>
  
  mutate(covers = if_else((LCL <= oracle_est) & (UCL >= oracle_est), T, F)) |>
  
  ggplot(aes(x = OR, xmin = LCL, xmax = UCL, y = Method, color = covers)) +
  
    theme_bw() +
  
    geom_vline(xintercept = oracle_est, linetype = 2, linewidth = 1.1) +
  
    geom_linerange(linewidth = 5) +
  
    geom_point(size = 3, color = "white") +
  
    scale_y_discrete(limits = rev) +
  
    scale_color_manual(values = palette.colors(4)[2:3]) +
  
    labs(
      
      x = "Odds Ratio (95% CI)", 
      
      y = NULL,
      
      color = "Covers the Oracle Estimate?") +
  
    theme(
      
      legend.position = "top",
    
      axis.text = element_text(face = "bold", size = 12),
    
      axis.title.x = element_text(face = "bold", size = 14, 
                              
      margin = margin(t = 20, unit = "pt")))

Notes

This single-split comparison illustrates several common patterns:

Oracle uses the full ground truth and serves as a benchmark.
Classical uses only labeled data and is often valid but inefficient.
Naive treats predictions as truth and often gives biased estimates intervals that are too narrow.
PB inference methods use the labeled subset to correct bias and account for prediction uncertainty while borrowing strength from the unlabeled data.

We should see:

Classical intervals that are wide,
Naive intervals that are biased and look overly precise,
PB intervals that are narrower than classical but more honest than naive.

Discussion Questions

Why can the naive approach look appealing numerically?
Which method would you trust in a scientific paper, and why?
What might happen if Yhat were poorly calibrated?

Part V: Efficiency as the Number of Labels Changes

We now study how uncertainty changes as the number of labeled observations increases.

Exercise 5: Confidence Interval Width vs Labeled Sample Size

For each n in {100, 500, 1000, 5000}:

Repeat the experiment 20 times,
Record confidence intervals,
Compute interval widths.

We will store everything in a single results table and then plot the mean interval width against the number of labeled observations.

Answer

ns <- c(100, 500, 1000, 5000)

num_trials <- 20

fit_oracle <- glm(Y ~ X, data = dat, family = binomial())

run_one <- function(n, t, fit_oracle) {
  
  s <- make_split(dat, n, seed = 1000 + n + t)

  fit_classical <- glm(Y ~ X, data = filter(s, set_label == "labeled"),
                     
    family = binomial())
  
  fit_naive <- glm(I(Yhat > 0.5) ~ X, data = s, family = binomial())

  fit_ppi <- ipd(Y - I(Yhat > 0.5) ~ X, method = "ppi", 
                 
    model = "logistic", data = s, label = "set_label")
  
  fit_pdc <- ipd(Y - I(Yhat > 0.5) ~ X, method = "pdc", 
                  
    model = "logistic", data = s, label = "set_label")

  fit_chen <- ipd(Y - I(Yhat > 0.5) ~ X, method = "chen", 
                  
    model = "logistic", data = s, label = "set_label")     
  
  results <- bind_rows(
    
    tidy(fit_oracle,    conf.int = TRUE) |> mutate(Method = "Oracle"),
    
    tidy(fit_classical, conf.int = TRUE) |> mutate(Method = "Classical"),
    
    tidy(fit_naive,     conf.int = TRUE) |> mutate(Method = "Naive"),
    
    tidy(fit_ppi)                        |> mutate(Method = "PPI"),
    
    tidy(fit_pdc)                        |> mutate(Method = "PDC"),

    tidy(fit_chen)                       |> mutate(Method = "Chen + Chen")) |>
  
  filter(term == "X") |>
  
  mutate(
    
    Method = factor(Method) |>
      
      fct_relevel("Oracle", "Naive", "Classical", "PPI", "PDC", "Chen + Chen"),
    
    OR  = exp(estimate),
    
    LCL = exp(conf.low),
    
    UCL = exp(conf.high),
    
    n = n,
    
    trial = t, 
    
    width = UCL - LCL) |>
  
  select(Method, OR, LCL, UCL, n, trial, width)
}

ex5_grid <- crossing(n = ns, trial = 1:num_trials) 

old_plan <- future::plan()

future::plan(future::multisession, 
             
  workers = max(1, parallel::detectCores() - 1))

ex5_results <- ex5_grid |>
  
  mutate(out = furrr::future_map2(n, trial, 
                                  
    \(n, trial) run_one(n, trial, fit_oracle))) |>
  
  select(out) |> 
  
  unnest(out)

future::plan(old_plan)

glimpse(ex5_results)

Rows: 480
Columns: 7
$ Method <fct> Oracle, Classical, Naive, PPI, PDC, Chen + Chen, Oracle, Classi…
$ OR     <dbl> 2.130875, 2.730000, 2.716635, 1.805772, 1.718718, 1.727755, 2.1…
$ LCL    <dbl> 1.9167544, 0.9364931, 2.4122014, 0.9914139, 0.9251400, 0.927117…
$ UCL    <dbl> 2.371509, 9.148324, 3.065175, 3.289053, 3.193022, 3.219807, 2.3…
$ n      <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100…
$ trial  <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, …
$ width  <dbl> 0.4547542, 8.2118304, 0.6529739, 2.2976393, 2.2678822, 2.292690…

Answer

ex6_results <- ex5_results |>
  
  group_by(Method, n) |>
  
  summarise(
    mn = mean(width, na.rm = TRUE),
    se = sd(width, na.rm = TRUE) / sqrt(sum(!is.na(width))),
    .groups = "drop")

ex6_results |>
  
  ggplot(aes(x = n, y = mn, group = Method, color = Method)) +
  
    theme_bw() +
  
    geom_line(linewidth = 0.6) +

    geom_errorbar(
      aes(ymin = mn - se, ymax = mn + se),
      width = 0.2,
      linewidth = 0.4,
      alpha = 0.65) +
  
    geom_point(size = 1.6) +
  
    scale_x_continuous(breaks = ns) +
  
    scale_color_manual(values = palette.colors(8)) + 
  
    labs(
    
      x = "Number of Labeled Observations (n)",
    
      y = "Mean Confidence Interval Width") +
  
    theme(
      
      legend.position = "top",
    
      axis.text = element_text(face = "bold", size = 12),
    
      axis.title.x = element_text(face = "bold", size = 14, 
                              
        margin = margin(t = 20, unit = "pt")),
      
      axis.title.y = element_text(face = "bold", size = 14, 
                              
        margin = margin(r = 20, unit = "pt")))

Notes

This exercise lets us compare methods in terms of statistical efficiency.

In general, we see that:

The oracle sets the best achievable benchmark,
Naive intervals to remain artificially too small,
Classical requires many more labels to approach that benchmark,
PB inference methods can recover a meaningful fraction of that lost efficiency.

Discussion Questions

When might classical inference still be preferable?
How would you weigh the cost of experimental labeling against the benefit of more efficient inference?

Summary and Takeaways

This module illustrates how PB inference can be used in a modern proteomics workflow:

Use deep learning to generate predictions at scale,
Obtain limited ground truth labels,
Combine both sources of information for valid downstream inference.

In this AlphaFold/PTM example, we saw that:

Classical inference is valid but may be inefficient when labels are scarce,
Naive imputation can look precise while remaining biased and anticonservative,
PB inference methods can preserve validity while recovering some of the efficiency available in the large unlabeled dataset.

Take-Home Messages

AlphaFold enables proteome-scale structural annotation, but prediction is not the same as inference.
PTM–disorder questions are naturally framed as association problems, making them a good setting for PB inference.
When only a subset of residues has trusted labels, PB inference can improve on classical inference without treating predictions as real data.
The value of unlabeled data depends on the quality and calibration of the predictive model.

Closing Note

A central lesson of this module is that large-scale biological prediction does not eliminate the need for statistics. AlphaFold and related systems make it possible to ask questions at a proteome-wide scale, but downstream scientific conclusions still depend on how we handle uncertainty, labeling limitations, and prediction error.

PB inference methods provide principled ways to bridge that gap.

This is the end of the module. We hope this was informative. For questions, comments, or suggestions, please reach out to Stephen Salerno (ssalerno@fredhutch.org.

The exercises below are optional extensions for self-guided exploration.

Supplemental / Self-Guided Exercises

These exercises go a bit beyond the main workflow. They are designed for participants who want to explore study design, sensitivity analyses, or biological extensions in more depth.

Exercise 7 (Supplemental): How Many Labels Do We Need?

One practical question is how many labeled observations are needed to achieve adequate power.

Suppose we want to test

\[ H_0: \text{OR} \le 1 \]

and reject when the lower confidence bound exceeds 1.

We will estimate the smallest n_labeled such that:

PPI achieves about 80% power,
Classical inference achieves about 80% power.

Answer

alpha_pval <- 0.05

target_power <- 0.80

n_experiments <- 50

run_one_ci <- function(s, method = c("PPI", "Classical")) {
  
  method <- match.arg(method)

  if (method == "PPI") {
    
    fit <- ipd(Y - Yhat ~ X, method = "ppi", model = "logistic",
               
      data = s, label = "set_label", alpha = alpha_pval)
    
    tt <- tidy(fit) |> filter(term == "X")
    
    c(LCL = exp(tt$conf.low), UCL = exp(tt$conf.high))
    
  } else {
    
    lab <- s |> filter(set_label == "labeled")
    
    m <- glm(Y ~ X, data = lab, family = binomial())
    
    tt <- tidy(m, conf.int = TRUE) |> filter(term == "X")

    c(LCL = exp(tt$conf.low), UCL = exp(tt$conf.high))
  }
}

power_at_n <- function(n_labeled, method = c("PPI", "Classical")) {
  
  method <- match.arg(method)

  rej <- replicate(n_experiments, {
    
    s <- make_split(dat, n_labeled = n_labeled)
    
    ci <- run_one_ci(s, method = method)
    
    if (method == "PPI") {
      
      as.integer(ci[["LCL.X"]] > 1)
      
    } else {
      
      as.integer(ci[["LCL"]] > 1)
    }
  })

  mean(rej)
}

n_grid <- sort(unique(round(seq(50, 1200, by = 50))))

old_plan <- future::plan()

future::plan(future::multisession, 
             
  workers = max(1, parallel::detectCores() - 1))

power_methods <- c("PPI", "Classical")

power_tbl <- expand_grid(n = n_grid, Method = power_methods) |>
  
  mutate(Power = furrr::future_map2_dbl(
    
    n, Method, \(n, Method) power_at_n(n_labeled = n, method = Method))) |>
  
  arrange(Method, n)

future::plan(old_plan)

power_tbl

n80 <- power_tbl |>
  
  group_by(Method) |>
  filter(Power >= target_power) |>
  summarise(n_for_80 = ifelse(n() == 0, NA_integer_, min(n)), .groups = "drop")

n80

power_tbl |>
  
  ggplot(aes(x = n, y = Power, color = Method)) +
  
    theme_bw() +
    geom_hline(yintercept = target_power, linetype = 2) +
    geom_line(linewidth = 1.1) +
    geom_point(size = 2) +
    scale_x_continuous(breaks = n_grid) +
    scale_color_manual(values = palette.colors(3)[2:3]) +
    labs(
      x = "Number of Labeled Observations (n)",
      y = "Estimated Power") +
    theme(
      legend.position = "top",
      axis.text = element_text(face = "bold", size = 12),
      axis.title.x = element_text(face = "bold", size = 14, 
        margin = margin(t = 20, unit = "pt")),
      axis.title.y = element_text(face = "bold", size = 14, 
        margin = margin(r = 20, unit = "pt")))

Notes

This power analysis translates the earlier efficiency discussion into study design language.

In many settings, we expect:

Classical inference to require more labels,
PPI to reach the same power with fewer labeled observations.

This is practically important because labeling residues experimentally can be costly and slow.

Discussion Questions

How could this change the design of wet-lab validation studies?
What risks arise if Yhat is systematically biased?

Exercise 8 (Supplemental): Sensitivity to the Naive Threshold

So far, the naive method has converted Yhat to a binary label using a cutoff of 0.5. Let us see how sensitive the naive analysis is to that choice by trying thresholds of 0.2, 0.4, 0.6, and 0.8.

Answer

thresholds <- seq(0.2, 0.8, 0.2)

naive_by_threshold <- function(thr) {
  m <- glm(I(Yhat > thr) ~ X, data = dat, family = binomial())

  tidy(m, conf.int = TRUE) |>
    filter(term == "X") |>
    transmute(
      threshold = thr,
      OR = exp(estimate),
      LCL = exp(conf.low),
      UCL = exp(conf.high),
      width = UCL - LCL
    )
}

ex8_results <- bind_rows(lapply(thresholds, naive_by_threshold))

ex8_results

# A tibble: 4 × 5
  threshold    OR   LCL   UCL width
      <dbl> <dbl> <dbl> <dbl> <dbl>
1       0.2  2.24  2.03  2.48 0.449
2       0.4  2.58  2.30  2.89 0.589
3       0.6  2.80  2.47  3.18 0.713
4       0.8  2.91  2.52  3.36 0.841

ex8_results |>
  ggplot(aes(x = OR, xmin = LCL, xmax = UCL, y = threshold)) +
  theme_bw() +
  geom_vline(xintercept = 1, linetype = 2, linewidth = 1.1) +
  geom_linerange(linewidth = 5, color = palette.colors(3)[3]) +
  geom_point(size = 3, color = "white") +
  scale_y_continuous(breaks = thresholds) +
  labs(
    x = "Odds Ratio (95% CI)",
    y = "Threshold for Yhat -> Binary Label"
  ) +
  theme(
    axis.text = element_text(face = "bold", size = 12),
    axis.title.x = element_text(face = "bold", size = 14,
      margin = margin(t = 20, unit = "pt")),
    axis.title.y = element_text(face = "bold", size = 14,
      margin = margin(r = 20, unit = "pt"))
  )

Notes

This exercise highlights a key weakness of naive imputation:

Changing the threshold changes the estimated odds ratio,
But there is no principled inferential correction for that arbitrariness.

That sensitivity is one reason naive procedures can be misleading.

Discussion Questions

Why do the OR estimates change as the threshold changes?
What does this say about the stability of naive inference?

Exercise 9 (Self-Guided): Compare PTMs

Repeat the full workflow for:

ubiquitination
acetylation

Compare:

Estimated odds ratios,
Confidence interval widths,
Labeled sample sizes needed for 80% power.

Biological Context

Different PTMs play different regulatory roles:

phosphorylation is often associated with signaling,
ubiquitination with degradation and protein turnover,
acetylation with chromatin regulation and transcriptional control.

You may find that:

Some PTMs are more enriched in disorder,
Some show weaker or noisier associations,
The inferential gains from PB inference methods differ across PTMs.

Statistical Goal

Assess whether:

Effect sizes differ across PTMs,
Efficiency gains persist across modification types,
Conclusions are robust to the chosen PTM.

Suggested Extensions

If you continue this work, possible extensions include:

Accounting for protein-level clustering,
Adding biological covariates such as amino acid type or sequence position,
Comparing additional PB inference methods,
Perturbing Yhat to study robustness to miscalibration,
Applying the same workflow to a different ML-derived phenotype.