Good Models from Bad Labels (Part 1)
I used to stand in front of my ML students and preach the ML dogma: "Garbage in, garbage out." It was clean, simple, and true for the tidy world of image classification we taught from textbooks. The first rule of building a good model was to build a clean dataset.
Then I moved into biological machine learning. And that rule started to feel like a cruel joke.
In biology, "clean" data is a fantasy. Assays are plagued with noise. Experiments fail in mysterious ways. Annotations are subjective and imperfect. The messy, glorious complexity of life doesn't conform to our tidy digital boxes. For years, I watched brilliant teams shelve promising early data because it wasn't "clean enough." We were waiting for a purity that would never come.
Hypothesis: Rejecting the Dogma
It made me wonder: What if we have it all wrong?
What if the "garbage in, garbage out" dogma is actually holding us back? This led me to a long-held hypothesis: if a model can learn to ignore irrelevant features, why can't it learn to discount noisy labels?
Incorrect labels are simply conflicting messages. But if the true biological signal is strong and structured, the correct messages should outweigh the noise. The model just needs to learn to listen for the melody through the static.
It’s a bit like tuning into a radio station with interference: the song remains recognizable even if the signal isn't perfect.
The Experiment: How Bad Can We Make It?
Here’s a hands-on experiment that distills what I’ve learned about this from building hundreds of bioML models over the years.
The Data: I started with a high-quality dataset of 250,000 peptide variants (each 12 amino acids long), where each peptide was labeled for its production fitness. Think of the peptide sequence as the input (X) and fitness as the output (Y), making this a binary classification task.
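To make the setup concrete, here is a minimal sketch of how such peptides can be fed to a sequence model. The integer-index encoding and all names here are my own illustration, not taken from the original study:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues
AA_TO_IDX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 reserved for padding

def encode_peptides(peptides):
    """Map each 12-mer peptide string to a vector of 12 integer indices."""
    return np.array([[AA_TO_IDX[aa] for aa in pep] for pep in peptides],
                    dtype=np.int64)

# X: (n_samples, 12) integer matrix; y: (n_samples,) binary fitness labels
X = encode_peptides(["ACDEFGHIKLMN", "MNLKIHGFEDCA"])
y = np.array([1, 0], dtype=np.float32)
```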
The Intervention: Then, I became a data vandal. I didn't just add a little noise; I systematically corrupted the labels. I started by flipping 5% of them at random, then 10%, and kept going all the way up to 45%, pushing the signal to near-random levels.
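Here is a minimal sketch of that corruption step: a symmetric flip of a chosen fraction of binary labels. The function name, the stand-in labels, and the intermediate rates between 10% and 45% are my own illustration:

```python
import numpy as np

def flip_labels(y, noise_rate, seed=0):
    """Return a copy of binary labels y with a `noise_rate` fraction flipped."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_flip = int(noise_rate * len(y))
    flip_idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[flip_idx] = 1 - y_noisy[flip_idx]   # 0 -> 1 and 1 -> 0
    return y_noisy

# Stand-in labels for illustration; in the experiment these were the
# 250,000 real fitness labels.
y_train = np.random.default_rng(1).integers(0, 2, size=250_000)

# Corruption schedule: 5% of labels flipped, then 10%, ..., up to 45%.
noisy_label_sets = {rate: flip_labels(y_train, rate)
                    for rate in (0.05, 0.10, 0.20, 0.30, 0.40, 0.45)}
```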
The Modeling: I handed this increasingly messy data to an LSTM classifier, chosen to capture the complex dependencies between amino acid positions (epistasis), a choice informed by its success in my Fit4Function AAV engineering work [1]. I didn't use fancy transformers, and I didn't tune the model for each noise level: one architecture and one set of hyperparameters across every run.
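For reference, a small LSTM classifier of this kind might look like the following. This is a minimal sketch under assumed hyperparameters; the post doesn't specify the framework or architecture details, so treat every choice here (Keras, embedding size, LSTM width) as mine:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(seq_len=12, vocab_size=21, embed_dim=16, lstm_units=64):
    """Small LSTM binary classifier for integer-encoded peptides."""
    model = tf.keras.Sequential([
        layers.Input(shape=(seq_len,), dtype="int32"),
        layers.Embedding(vocab_size, embed_dim),  # learned residue embeddings
        layers.LSTM(lstm_units),                  # models position dependencies (epistasis)
        layers.Dense(1, activation="sigmoid"),    # probability of being "fit"
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model

# Same model, same hyperparameters, for every noise level:
model = build_model()
# model.fit(X_train, y_train_noisy, validation_data=(X_val, y_val),
#           epochs=20, batch_size=512)
```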
Results: The Stunning Resilience
[Figure: model performance at each label-noise level, 0% through 45%.]
The models didn't give up. They clung to the signal with stubborn resilience. Even at 45% label noise, the models didn't collapse. Their predictions remained useful, well above random chance. In fact, for this 12-mer peptide fitness prediction task, the model trained with 45% label noise performed remarkably close to the model trained on the pristine, 0% noise data!
But here's the crucial boundary: the limit is real. Cross that ~50% line, and performance plummets. At that point, you’re effectively training on static, and the signal is lost.
How Is This Even Possible? The Science of Training Through Noise
The short answer: random label flips corrupt individual examples, but they don't tip the overall balance. As long as the flip rate stays below 50%, the correct labels remain the majority in every neighborhood of sequence space, so the gradient contributions from mislabeled examples largely cancel while the true signal accumulates across the dataset. Rolnick et al. documented the same phenomenon at scale, showing that deep networks stay accurate even when noisy labels vastly outnumber clean ones [2].
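To make that intuition precise, here is a minimal decision-theoretic sketch (my notation, not from the original post). Write eta(x) for the true probability that peptide x is fit, and suppose each training label is flipped independently with probability p. The label distribution the model actually sees is:

```latex
% Noisy posterior under symmetric label flipping at rate p:
\tilde{\eta}(x) = (1-p)\,\eta(x) + p\,\bigl(1 - \eta(x)\bigr) = p + (1-2p)\,\eta(x)
```

Because the factor (1 - 2p) is positive whenever p < 0.5, the noisy posterior crosses 0.5 exactly where the true one does: the optimal decision boundary is unchanged, and the noise only shrinks the margin around it. At p = 0.5 the factor hits zero, the labels carry no information about the sequence, and performance falls off the cliff seen in the experiment.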
Why This Matters: A Practical Liberation for Your Lab
This isn't just an academic exercise; it's a fundamental shift in how we approach biological data. For too long, the pursuit of perfect data has been a bottleneck. This work liberates us from that trap.
Most biotech teams discard early or messy datasets because the assays aren’t “clean enough.” They wait, sometimes indefinitely, for pristine labels before trusting a model.
But what if you could start now? This experiment shows that even with significant noise, there's immense value to extract. Your models can be surprisingly robust if the underlying biological signal is strong. This means you can:
- Build predictive models now, without waiting for the next expensive, pristine assay.
- Rescue "failed" or shelved experiments and extract value from the messy data already in your lab.
- Confidently utilize public, high-throughput datasets that were previously considered too noisy to be reliable.
Of course, limits exist. Push beyond the ~50% noise threshold and performance will, predictably, plummet; at that point you're training on coin flips. But before that threshold lies a vast, untapped resource of usable data.
Practical Takeaways
- Stop Trashing Noisy Data: That messy, first-pass experiment or abundant public dataset is likely good enough to bootstrap a powerful predictive model and guide your next round of experiments.
- Trust the Structure of Biology: Biology is messy, but it's not random. The evolutionary and biophysical patterns in peptide and protein data give models a strong, structured signal to latch onto, even when labels are imperfect.
- Focus on Signal vs. Noise: The goal isn't to eliminate noise but to build models robust enough to see past it. Increasing your batch size is a good first-line defense; the toy simulation below shows why it helps.
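To see why batch size matters, here is a toy numpy simulation (my own illustration, not from the post): treat each example's gradient contribution as a unit step that points the wrong way with probability p. Averaging over a mini-batch keeps the expected direction correct, with magnitude 1 - 2p, while shrinking the noise roughly as 1/sqrt(batch size):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.45                 # label-flip rate from the experiment's worst case
n_batches = 10_000

for batch_size in (32, 512):
    # +1 = gradient step from a correct label, -1 = step from a flipped label
    steps = np.where(rng.random((n_batches, batch_size)) < p, -1.0, 1.0)
    batch_mean = steps.mean(axis=1)   # averaged step per mini-batch
    print(f"batch={batch_size:4d}  mean={batch_mean.mean():+.3f}  "
          f"std={batch_mean.std():.3f}")

# The mean stays near 1 - 2p = 0.10 for both batch sizes, but the spread
# shrinks about 4x going from batch 32 to batch 512, so updates point the
# right way far more consistently.
```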
A Closing Thought
This principle of building robust models from noisy data is one of the most powerful yet underappreciated enablers in bioML. It has been a secret sauce in my protein engineering work at the Broad Institute of MIT and Harvard for several years, and I've seen it validated again and again. This doesn't mean data quality is irrelevant. It matters. But you don't need perfect data to start building something powerful and actionable.
Explore the Tutorial & Code
[1] Eid, F.E., Chen, A.T., Chan, K.Y., et al. (2024). Systematic multi-trait AAV capsid engineering for efficient gene delivery. Nature Communications, 15, 6602.
[2] Rolnick, D., Veit, A., Belongie, S., & Shavit, N. (2017). Deep Learning is Robust to Massive Label Noise. arXiv preprint arXiv:1705.10694.
PS: I've spent two decades bridging ML theory and application. Follow me on LinkedIn (#TheBioMLClinic) for more practical insights that accelerate bio-innovation.