Good Models from Bad Labels (Part 1)


I used to stand in front of my ML students and preach the ML dogma: "Garbage in, garbage out." It was clean, simple, and true for the tidy world of image classification we taught from textbooks. The first rule of building a good model was to build a clean dataset.


Then I moved into biological machine learning. And that rule started to feel like a cruel joke.


In biology, "clean" data is a fantasy. Assays are plagued with noise. Experiments fail in mysterious ways. Annotations are subjective and imperfect. The messy, glorious complexity of life doesn't conform to our tidy digital boxes. For years, I watched brilliant teams shelve promising early data because it wasn't "clean enough." We were waiting for a purity that would never come.


Hypothesis: Rejecting the Dogma

It made me wonder: What if we have it all wrong?

What if the "garbage in, garbage out" dogma is actually holding us back? The question crystallized a hypothesis I had been carrying for years: if a model can learn to ignore irrelevant features, why can't it learn to discount noisy labels?

Incorrect labels are simply conflicting messages. But if the true biological signal is strong and structured, the correct messages should outweigh the noise. The model just needs to learn to listen for the melody through the static.

It’s a bit like tuning into a radio station with interference: the song remains recognizable even if the signal isn't perfect.


The Experiment: How Bad Can We Make It?

Here’s a hands-on experiment to show you what I’ve learned about this topic from building hundreds of bioML models over the years.


The Data: I started with a high-quality dataset of 250,000 peptide variants (each 12 amino acids long), where each peptide was labeled for its production fitness. Think of the peptide sequence as your input (X) and fitness as the output (Y); in this case, a binary classification task.
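
To make the setup concrete, here's a minimal sketch of how such a dataset can be represented. The encoding scheme and variable names below are my illustration, not the exact pipeline from the study:

```python
import numpy as np

# Hypothetical encoding: 12-mer peptides over the 20 canonical amino acids,
# each paired with a binary "production fitness" label.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_peptide(seq: str) -> np.ndarray:
    """Map a peptide string to a vector of integer amino-acid indices."""
    return np.array([AA_TO_IDX[aa] for aa in seq], dtype=np.int64)

# X: (n_samples, 12) integer-encoded sequences; y: (n_samples,) binary labels.
X = np.stack([encode_peptide(s) for s in ["ACDEFGHIKLMN", "MNPQRSTVWYAC"]])  # toy examples
y = np.array([1, 0])
```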

The Intervention: Then, I became a data vandal. I didn't just add a little noise; I systematically corrupted the labels. I started by flipping 5% of them at random, then 10%, and kept going all the way up to 45%, pushing the signal to near-random levels.
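
The vandalism itself is easy to reproduce. Here's a minimal sketch, with hypothetical variable names:

```python
import numpy as np

def flip_labels(y: np.ndarray, noise_rate: float, seed: int = 0) -> np.ndarray:
    """Return a copy of binary labels with a `noise_rate` fraction flipped at random."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_flip = int(noise_rate * len(y))
    flip_idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[flip_idx] = 1 - y_noisy[flip_idx]
    return y_noisy

# Sweep the corruption from mild to near-random: 5%, 10%, ..., 45%.
noise_levels = [round(0.05 * k, 2) for k in range(1, 10)]
clean = np.random.default_rng(1).integers(0, 2, size=250_000)  # stand-in labels
corrupted = {p: flip_labels(clean, p) for p in noise_levels}
```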

The Modeling: I handed this increasingly messy data to an LSTM classification model, chosen to handle the complex dependencies between amino acid positions (epistasis); the choice was informed by its success in my Fit4Function AAV engineering study [1]. I didn't use fancy transformers, and I didn't tune the architecture for each noise level: a single model fits all.
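
For readers who want a concrete starting point, a compact PyTorch sketch of such a classifier might look like the following; the layer sizes are illustrative guesses, not the exact architecture from [1]:

```python
import torch
import torch.nn as nn

class PeptideLSTM(nn.Module):
    """Small LSTM classifier for fixed-length peptides (sizes are illustrative)."""
    def __init__(self, vocab_size: int = 20, embed_dim: int = 16, hidden_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # single logit for binary fitness

    def forward(self, x):             # x: (batch, 12) integer-encoded peptides
        emb = self.embed(x)           # (batch, 12, embed_dim)
        _, (h_n, _) = self.lstm(emb)  # final hidden state summarizes the sequence
        return self.head(h_n[-1]).squeeze(-1)  # (batch,) logits

model = PeptideLSTM()
loss_fn = nn.BCEWithLogitsLoss()  # standard binary cross-entropy on logits
```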

The Tuning: For each noise level, I changed only one key parameter to make the models robust to the noise (more on this secret sauce later).

Results: The Stunning Resilience 


Figure 1. With the right know-how, ML models can learn through massive label noise.


The results bore out my hypothesis (Fig. 1). They were stunning!

The models didn't give up. They clung to the signal with stubborn resilience. Even at 45% label noise, the models didn't collapse. Their predictions remained useful, well above random chance. In fact, for this 12-mer peptide fitness prediction task, the model trained with 45% label noise performed remarkably close to the model trained on the pristine, 0% noise data!

But here's the crucial boundary: the limit is real. Cross that ~50% line, and performance plummets. At that point, you’re effectively training on static, and the signal is lost.

How Is This Even Possible? The Science of Training Through Noise

In ML classes, we teach that large batch sizes can sometimes hurt a model's performance. That's often true, but only for clean data. Massive label noise changes the rules of the game. The key is to prevent the model from memorizing the errors. Inspired by the foundational 2017 MIT preprint "Deep Learning is Robust to Massive Label Noise" [2], I applied techniques designed to help the model average out the noise and find the true signal. The core of the strategy is a crucial adjustment to the batch size.

The Power of a Larger Batch Size

With noisy labels, a small batch is a liability (Fig. 2). The gradient calculated from just a handful of examples is likely to be dominated by incorrect labels, sending the model spiraling in the wrong direction. A larger batch size acts like a pollster surveying a larger population: it provides a statistical average that dilutes the impact of any single error. The correct signals reinforce each other, while the random noise begins to cancel out, yielding a cleaner, more reliable signal for the model to learn from.
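You can watch the pollster effect in a few lines of NumPy (my own illustration, using the harshest 45% setting):

```python
import numpy as np

rng = np.random.default_rng(0)
noise_rate = 0.45  # fraction of labels flipped

# For each batch size, draw many batches and measure how far the per-batch
# corruption fraction swings around the global noise rate.
for batch_size in [32, 256, 4096]:
    corrupted_frac = rng.binomial(batch_size, noise_rate, size=10_000) / batch_size
    majority_bad = (corrupted_frac > 0.5).mean()
    print(f"batch={batch_size:5d}  std={corrupted_frac.std():.4f}  "
          f"P(majority corrupted)={majority_bad:.3f}")

# The spread shrinks like 1/sqrt(batch_size): at batch 32, a sizable share of
# batches are majority-corrupted; at batch 4096, essentially none are, so the
# correct labels outvote the noise in every gradient step.
```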

Figure 2. You should increase the batch size for noisier labels.

The Nuance of the Learning Rate

The MIT paper suggests that increasing the batch size should be coupled with an increased learning rate. The reasoning is sound: a high learning rate gives the model the momentum to escape sharp minima in the loss landscape that correspond to memorizing noise, guiding it instead toward broader, more generalizable solutions.

By increasing both the batch size and the learning rate, the training process itself becomes a noise filter. The large batch calculates a trustworthy direction by averaging out errors, and an appropriately set learning rate (whether standard or slightly increased) allows the model to efficiently follow that direction. This synergy is what makes learning from massively noisy labels possible.
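
In code, this coupling is often expressed as a simple scaling rule relative to a clean-data baseline. A minimal sketch, with placeholder numbers rather than the exact values from my experiment:

```python
# Hypothetical clean-data baseline hyperparameters (placeholders).
base_batch_size, base_lr = 128, 1e-3

def noisy_label_config(noise_rate: float, batch_scale: int = 32):
    """Heuristic: enlarge the batch for noisy labels and (optionally) scale the
    learning rate linearly with it. Treat both as starting points to tune."""
    if noise_rate == 0:
        return base_batch_size, base_lr
    batch_size = base_batch_size * batch_scale
    lr = base_lr * batch_scale  # linear scaling rule; see the caveat below
    return batch_size, lr

batch_size, lr = noisy_label_config(noise_rate=0.45)
# optimizer = torch.optim.Adam(model.parameters(), lr=lr)
```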

However, in this specific peptide fitness prediction task, I found that the larger batch size alone was the critical factor (Fig. 3). By providing a cleaner gradient estimate, it allowed the model to learn effectively even with a standard learning rate. This is an important practical insight: while the theory recommends adjusting both, the imperative is to increase the batch size. The learning rate can then be tuned based on empirical performance.

Figure 3. Increasing the learning rate is not always helpful with noisy labels.

Why This Matters: A Practical Liberation for Your Lab

This isn't just an academic exercise; it's a fundamental shift in how we approach biological data. For too long, the pursuit of perfect data has been a bottleneck. This work liberates us from that trap.

Most biotech teams discard early or messy datasets because the assays aren’t “clean enough.” They wait, sometimes indefinitely, for pristine labels before trusting a model.

But what if you could start now? This experiment shows that even with significant noise, there's immense value to extract. Your models can be surprisingly robust if the underlying biological signal is strong. This means you can:

  • Build predictive models now, without waiting for the next expensive, pristine assay.

  • Rescue "failed" or shelved experiments and extract value from the messy data already in your lab.

  • Confidently utilize public, high-throughput datasets that were previously considered too noisy to be reliable.

Of course, limits exist. Push beyond the ~50% noise threshold, and performance will rightly plummet; you're training on coin flips. But before that point, there is a vast, untapped resource of usable data.


Practical Takeaways

  1. Stop Trashing Noisy Data: That messy, first-pass experiment or abundant public dataset is likely good enough to bootstrap a powerful predictive model and guide your next round of experiments.
  2. Trust the Structure of Biology: Biology is messy, but it's not random. The evolutionary and biophysical patterns in peptide and protein data give models a strong, structured signal to latch onto, even when labels are imperfect.
  3. Focus on Signal vs. Noise: The goal isn't to eliminate noise but to build models robust enough to see past it. Start with increasing your batch size as a first-line defense.

A Closing Thought

This principle of building robust models from noisy data is one of the most powerful yet underappreciated enablers in bioML. It has been a secret sauce in my work on protein engineering at the Broad Institute of MIT and Harvard for several years, and I've seen it validated again and again. This doesn't mean data quality is irrelevant. It matters. But you don't need perfect data to start building something powerful and actionable.

I'm releasing the complete code and the 250k peptide fitness dataset to help you put this into practice immediately. I'd love to hear your thoughts and see what you discover.

Explore the Tutorial & Code


[1] Eid, F.-E., Chen, A. T., Chan, K. Y., et al. (2024). Systematic multi-trait AAV capsid engineering for efficient gene delivery. Nature Communications, 15, 6602.

[2] Rolnick, D., Veit, A., Belongie, S., & Shavit, N. (2017). Deep Learning is Robust to Massive Label Noise. arXiv preprint arXiv:1705.10694.


PS: I've spent two decades bridging ML theory and application. Follow me on LinkedIn (#TheBioMLClinic) for more practical insights that accelerate bio-innovation.


