
Bigger Isn’t Always Better: Single-Cell AI Models Hit a Data Wall

From Pepkio Team · 11 June 2026 · 2 min read

Artificial intelligence models in biology are growing rapidly, often trained on tens of millions of single cells, but throwing more data at them might not be the most effective path forward. In a comprehensive computational study, scientists report today in Nature Methods that the performance of these "foundation models" plateaus surprisingly early. The work, led by senior author Lorin Crawford at Microsoft Research, with first author Alan DenAdel, reveals that unlike large language models, single-cell AI systems do not automatically improve just by indiscriminately scaling up their training data.

The research team pretrained 400 different models using a massive dataset of 22.2 million cells. By running 6,400 experiments, they tested how well these models handled standard tasks like classifying cell types, predicting how cells respond to drugs, and integrating different batches of experimental data.

They found that model performance often saturated with only 1% to 10% of the available training data. Beyond that small fraction, adding more cells, or even artificially increasing the diversity of the data by mixing in cells that had undergone experimental perturbations, failed to yield tangible improvements in predictive accuracy. In fact, in several cases, simpler computational baselines performed just as well as, or better than, the complex, resource-heavy transformer models.
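To make those fractions concrete, here is a minimal Python sketch that converts a fraction of the 22.2-million-cell corpus into a cell count and picks out the point where a learning curve stops improving. The function names, the `tolerance` threshold, and the "within tolerance of the best score" rule are illustrative assumptions, not the procedure used in the study.

```python
TOTAL_CELLS = 22_200_000  # pretraining corpus size reported in the article


def subset_size(fraction, total=TOTAL_CELLS):
    """Number of cells in a subset covering `fraction` of the corpus."""
    return round(fraction * total)


def saturation_fraction(fractions, scores, tolerance=0.01):
    """Smallest data fraction whose score is within `tolerance` of the best.

    `fractions` and `scores` are parallel sequences (one score per fraction).
    `tolerance` is an arbitrary illustrative threshold, not a study value.
    """
    if not fractions or len(fractions) != len(scores):
        raise ValueError("fractions and scores must be non-empty and equal in length")
    best = max(scores)
    # Walk the curve from the smallest fraction upward and stop at the first
    # point that is already as good as the best score, up to the tolerance.
    for fraction, score in sorted(zip(fractions, scores)):
        if score >= best - tolerance:
            return fraction
```

At the figures quoted above, `subset_size(0.01)` is 222,000 cells and `subset_size(0.10)` is 2.22 million, so the reported plateau falls well short of the full corpus.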

These findings offer an important reality check for computational biology. Pretraining massive transformer models requires expensive, specialized hardware. Instead of endlessly gathering more data, this evidence suggests developers should prioritize high-quality, carefully curated datasets and design architectures specifically aligned with biological realities, rather than treating gene expression data exactly like words in a sentence.

The authors note that AI moves fast and that new architectures are constantly emerging. For the current generation of models, however, the results point to a needed shift in strategy.

As artificial intelligence continues to intersect with single-cell biology, finding the right balance of model capacity, computing power, and data curation will be crucial. Future advances will likely depend more on smart engineering than on simply amassing the largest possible dataset.


Reference:
DenAdel, A., Hughes, M., Thoutam, A. et al. Evaluating the role of pretraining dataset size and diversity on single-cell foundation model performance. Nat Methods (2026). https://doi.org/10.1038/s41592-026-03120-y
