Meaningful Advances in Virtual Cells

Brita Belli, Senior Manager of Communications, Recursion

In the past year, virtual cells have moved from theoretical ideas to meaningful advances across industry and academia. A combination of these efforts is needed to create tools capable of navigating the complexities of real-world biology. This article looks at the latest efforts to use data and machine learning to model the complex distributions of cell shifts following perturbations and how this can accelerate AI drug discovery, including work from Arc Institute (State), Recursion (TxPert), CZI (Cytoland), Noetik (OCTO-vc) and the Wellcome Sanger Institute (NicheCompass).

It seems that everyone in biotech is working on virtual cells. A multi-scale, multimodal model that can simulate the behavior of human molecules, cells, and tissues across diverse states would be a game-changer in AI drug discovery – allowing researchers to predict the success of millions of possible treatments on cells before more expensive laboratory testing, vastly scaling up the speed and efficiency of drug discovery and development while significantly lowering the cost.

But are we there yet? Anne Carpenter, senior director of the Imaging Platform at Broad Institute of MIT and Harvard, noted in a recent interview on the podcast TechBio Talks1 that the race to the virtual cell is currently “more of a mosh pit.” She says the definition needs to be expanded to encompass new datasets and new tasks. A virtual cell should be a model capable of providing “a fundamental understanding of those reactions that are happening within a cell,” Dr. Carpenter says, and one that can generalise to new datasets, perturbations, genes or compounds that have not yet been directly tested “or are not super similar to something that has been directly tested.”

The term virtual cell started gaining popularity with the publication of a perspective paper in Cell in Dec. 2024 called “How to build the virtual cell with artificial intelligence: Priorities and opportunities2,” from 42 researchers including Patrick Hsu of Arc Institute, Aviv Regev of Genentech, Mohammed AlQuraishi of Columbia, Emma Lundberg of Stanford, and many other notable researchers in the space. Since then, the field has accelerated quickly. Research organisations like Arc Institute in Palo Alto, the Chan Zuckerberg Initiative (CZI) in Redwood City, California, and the Wellcome Sanger Institute in the UK are now actively building virtual cells. So are a number of TechBio companies – companies that take an AI-first approach to drug discovery and development – including Recursion, Noetik, 10x Genomics, and Tahoe Therapeutics.

An Open Science Approach to Virtual Cell Benchmarks 

In this race, access to data and models is essential to accelerating the training and fine-tuning of virtual cell models, and the more people who contribute, the faster everyone moves. In Feb. 2025, Tahoe and Arc partnered on the release of the Arc Virtual Cell Atlas – single-cell transcriptomic data spanning species, tissues, and experimental and perturbation conditions from over 300 million unique cells. The impetus for releasing this data – which includes Tahoe-100M, the world’s largest single-cell dataset at 100 million cells – was to hasten the development of AI virtual cells. “We are open sourcing Tahoe-100M to help start a new movement in biological modeling that goes beyond us,” said Nima Alidoust, cofounder and CEO of Tahoe, in a related release3.

A few months later, Arc announced its Virtual Cell Challenge4 – an annual open benchmark competition designed to “provide an evaluation framework, purpose-built datasets, and a venue for accelerating model development.” Participants – using experimental data from Arc – must build a model to predict the effects of perturbation in the H1 embryonic stem cell line, and can win prizes valued at up to $100,000 (split between cash and cloud benefits). The final submission deadline is November 17, with winners announced in early December. 

As living, dynamic systems, cells are incredibly complex to model, and community benchmarks for measuring the success of virtual cell models are essential. That’s part of the incentive behind the challenge. “Community benchmarks are important,” Theofanis Karaletsos, senior director of AI at CZI, told GENEdge5, “and we believe open competitions like Arc’s are a powerful mechanism to accelerate innovation and collective progress.” Models will be evaluated on three criteria: how well they predict differentially expressed genes; how well they discriminate between different perturbation effects; and their overall error relative to observed expression counts.
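
To make those three evaluation criteria concrete, here is a minimal Python sketch of what such scores could look like. It is not the Challenge's official scoring code. The function names, the top-k definition of "differentially expressed," and the use of L1 distance for ranking are all assumptions made for illustration, and the inputs are assumed to be average expression profiles with one row per perturbation and one column per gene.

```python
import numpy as np

def mean_absolute_error(pred, true):
    """Average absolute deviation between predicted and observed expression."""
    return float(np.mean(np.abs(pred - true)))

def de_overlap(pred, true, control, k):
    """Share of the k most-changed genes (relative to control) that the
    prediction and the observation have in common, averaged over perturbations.
    'Top-k by absolute change' is an illustrative stand-in for a statistical test."""
    pred_top = np.argsort(-np.abs(pred - control), axis=1)[:, :k]
    true_top = np.argsort(-np.abs(true - control), axis=1)[:, :k]
    overlaps = [len(set(p) & set(t)) / k for p, t in zip(pred_top, true_top)]
    return float(np.mean(overlaps))

def perturbation_discrimination(pred, true):
    """1.0 when every prediction is closest to its own observed profile,
    0.0 when every other perturbation's profile is closer (needs >= 2 rows)."""
    n = pred.shape[0]
    # dist[i, j] = L1 distance between prediction i and observed profile j
    dist = np.abs(pred[:, None, :] - true[None, :, :]).sum(axis=2)
    # count observed profiles that sit closer to prediction i than the correct one
    closer = (dist < np.diag(dist)[:, None]).sum(axis=1)
    return float(1.0 - np.mean(closer) / (n - 1))

# Toy data: 3 perturbations x 4 genes, with an all-zero control profile.
control = np.zeros(4)
observed = np.array([[2.0, 0.0, 0.0, 0.0],
                     [0.0, 3.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])
perfect = observed.copy()          # a model that predicts every profile exactly
mixed_up = observed[[1, 2, 0]]     # a model that confuses the perturbations

perfect_scores = (mean_absolute_error(perfect, observed),
                  de_overlap(perfect, observed, control, k=1),
                  perturbation_discrimination(perfect, observed))
mixed_up_discrimination = perturbation_discrimination(mixed_up, observed)
```

In this toy setup the exact prediction gets zero error and full marks on the other two scores, while the mixed-up prediction scores poorly on discrimination even though each of its rows is a plausible expression profile. That is the reason for scoring discrimination separately from raw error.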

UK-based biotech Shift Bioscience also released a study6 aiming to improve the benchmarking of virtual cell models for gene discovery, proposing a series of steps that can better rank models toward more biologically meaningful endpoints. 

The Latest Breakthroughs in New Models for Virtual Cells 

Some of the most important advances in virtual cell development are new machine learning models that unlock key aspects of how human cells work that weren’t accessible before. Mirroring the complexity and interconnectivity of human cells, a successful virtual cell will require the integration of hundreds of machine learning models focusing on different cellular functions. Together, these models would need to simulate interactions between proteins, genes and small molecules, the chemical reactions driving cellular metabolism, and how molecules and cellular components move and change over time within the cell’s 3D structure.

While chatbots have the benefit of an internet’s worth of information and language, the equivalent data source doesn’t exist for the human body. Each model brings the field closer to that broader understanding. 

One of these is State7 – the first virtual cell model released by the Arc Institute – which measures how sets of cells move in RNA expression (transcriptomic) space after an intervention. Trained on observational data from nearly 170 million cells and perturbational data from over 100 million cells across 70 cell lines, State predicts how a cell’s transcriptome changes after a perturbation. While State provides an understanding of single-cell transcriptomics, it’s still just one part of the biological whole.
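
As a rough illustration of what "movement in expression space" means, the Python sketch below compares a control population with a perturbed one and reports the per-gene shift in average expression. This is not State's architecture or training procedure, which models whole distributions of cells rather than a single mean. The data are synthetic, and the function name and the simulated knockdown of gene 2 are hypothetical choices made for the example.

```python
import numpy as np

def mean_expression_shift(control_cells, perturbed_cells):
    """Per-gene change in population-average expression after a perturbation.
    Inputs are (cells x genes) arrays; the two populations may differ in size."""
    return perturbed_cells.mean(axis=0) - control_cells.mean(axis=0)

# Synthetic data: 5 genes, with gene 2 knocked down in the perturbed population.
rng = np.random.default_rng(0)
n_genes = 5
control = rng.normal(loc=1.0, scale=0.1, size=(200, n_genes))
perturbed = rng.normal(loc=1.0, scale=0.1, size=(150, n_genes))
perturbed[:, 2] -= 0.8  # simulated effect of the perturbation

shift = mean_expression_shift(control, perturbed)
most_affected_gene = int(np.argmin(shift))  # gene with the largest drop
```

A predictive model has to produce a shift like this for perturbations and cell lines it has not seen, without access to the perturbed measurements.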

Another model, TxPert8, provides broader context for these perturbations: not just how they impact individual cells, but how they affect unseen genes or compounds, and how they influence broader biology across cell lines the way a drug would. “By leveraging prior information beyond single-cell data, TxPert moves closer to the multimodal, biologically grounded layer we want in virtual cells,” writes9 Therence Bois, VP of Strategy at Valence Labs, Recursion’s AI research lab. “In several of these settings, performance approaches wet-lab reproducibility, suggesting the model is learning transferable structure rather than memorizing local patterns.”

This ability to move from observability in one context to connecting to unseen cellular mechanisms is a critical bridge from the virtual cell’s ability to “predict” to its ability to also “explain,” he writes. Those are two of the three key elements laid out in Valence and Recursion’s virtual cell vision paper10: “Predict, Explain, Discover.”

Understanding Cancer Cells, and Beyond

Most virtual cells today are aimed at understanding and simulating cancer cells because that’s where the data is. Over decades, cancer research has generated massive, high-quality public datasets that can be used to train machine learning models, including The Cancer Genome Atlas and the Cancer Cell Line Encyclopedia. Cancer also has clear endpoints that are easy to quantify and study, such as killing cancer cells or stopping them from proliferating, and well-understood genetic drivers like mutations in BRAF, EGFR, or KRAS that serve as ideal inputs for training models.

The virtual cell being developed by the TechBio company Noetik – OCTO-vc – is trained on data from 77 million human tumor cells across more than a dozen cancers. The company’s goal11 is to use it to help solve clinical-stage problems, such as identifying which patients will respond to cancer drugs like anti-PD-1 therapies, in order to refine patient inclusion criteria for trials.

Expanding the application of virtual cells beyond cancer requires significantly more data from significantly more cell types. To build its “Maps of Biology,” Recursion has generated data across dozens of different human cell types, both in-house and with pharma partners – from HUVEC cells to disease-specific cells to neuronal cells and oncology cell lines. Portions of these datasets have been released for public use12 to help drive virtual cell and other AI drug discovery research.

Pharmaceutical companies are also beginning to share insights from their rich data troves with the broader AI drug discovery community, via initiatives like OpenFold – a nonprofit AI research and development consortium that has released a number of high-performing open-source models. The latest version of its open-source protein-folding model, OpenFold3, has been fine-tuned using proprietary data from Bristol Myers Squibb, Takeda, AbbVie, Johnson & Johnson and other pharma companies to create a diverse dataset for training models in drug discovery. A federated platform from Apheris preserves confidentiality and protects intellectual property in the process.

Columbia’s Mohammed AlQuraishi, who developed OpenFold, sees these open-source initiatives as essential steps toward eventual virtual cells, which he predicts13 are coming in the next 15 years. “I think we can’t ever be certain that we understand all the essential features of a living cell unless we can model it,” he told Columbia Medicine News. “And then suddenly, we will have this incredible tool that we can probe in all sorts of ways.”

Footnotes:

  1. https://open.spotify.com/show/2bjoBPssiZVmjYKG4AlaW4
  2. https://www.cell.com/cell/fulltext/S0092-8674(24)01332-1
  3. https://arcinstitute.org/news/arc-vevo
  4. https://virtualcellchallenge.org/
  5. https://www.genengnews.com/topics/artificial-intelligence/arc-institute-launches-virtual-cell-challenge-to-accelerate-ai-model-development/
  6. https://www.shiftbioscience.com/news/shift-bioscience-proposes-improved-ranking-system
  7. https://arcinstitute.org/news/virtual-cell-model-state
  8. https://arcinstitute.org/news/virtual-cell-model-state
  9. https://www.linkedin.com/pulse/scale-structure-first-virtual-cell-therence-bois-sdg2e/?trackingId=Olam%2Fl%2BBSYaEq2g%2BDncBgg%3D%3D
  10. https://arxiv.org/abs/2505.14613
  11. https://www.noetik.blog/p/how-do-you-use-a-virtual-cell-to
  12. https://www.rxrx.ai/
  13. https://www.cuimc.columbia.edu/news/working-toward-virtual-cell

Brita Belli

Brita Belli is an award-winning science and tech writer and Senior Manager of Communications at Recursion, a clinical-stage TechBio company. Her writing has been featured in the New York Times, National Geographic, MSN.com, and Alternet, and she has covered AI drug discovery and other health tech topics for Future Medicine AI, On Drug Delivery, European Biopharmaceutical Review, and OR Manager Magazine. She’s the author of The Autism Puzzle: Connecting the Dots Between Environmental Toxins and Rising Autism Rates (Seven Stories Press).