Fine-tuning AI Models for Molecular Co-folding
Advancing drug discovery through data collaboration
Robin Röhm, CEO and Co-Founder, Apheris
Accurately predicting protein–ligand interactions remains a major challenge in pharmaceutical research. AI-driven co-folding models, when fine-tuned on proprietary data, enable more reliable structure predictions and faster target validation. This article explores how collaborative model fine-tuning across pharma companies' proprietary datasets can further improve predictive accuracy while keeping intellectual property fully protected.

Co-folding models predict how two or more interacting partners, such as a protein and a ligand, two proteins, or a protein and an RNA, assemble into a three-dimensional complex. By learning statistical patterns from large structural datasets, they can generate plausible complex structures directly from sequence and molecular graphs.
This capability is changing how hypotheses are formed in pharmaceutical research. The ability to visualise protein–ligand or protein–protein interactions within minutes helps teams to explore binding modes for uncrystallised targets, propose constructs for antibody design, and investigate conformational states with high accuracy and without waiting for experimental structures.
Co-folding models are increasingly used alongside experimental and physics-based approaches to support structural work in early drug discovery. When benchmarked and customised to internal data, they can provide structural guidance for selected targets and series. When applied appropriately, these models help interpret structure–activity relationships, compare binding hypotheses, and prioritise which experiments are most likely to be informative. Their practical value depends on knowing where they are accurate and operating them within that range.
How co-folding AI fits into discovery programmes
In drug discovery, co-folding AI complements experimental and simulation-based methods at several points in the workflow. In hit identification, it provides structural hypotheses where no experimental structures exist, helping chemists rationalise hits emerging from high-throughput screening or virtual screening. In lead optimisation, it assists in comparing binding poses across related series and in identifying conformational changes that may affect potency or selectivity.
By predicting plausible complex geometries directly from sequence and molecular graphs, co-folding AI allows project teams to interpret assay data within a structural context even when crystallographic information is missing. This helps integrate biophysical measurements, docking results, and molecular dynamics simulations into coherent, testable hypotheses.
The objective is not to replace experiments but to increase their informational value. When model performance boundaries are well understood, predictions from co-folding AI can guide which experiments to prioritise, which series to examine in more detail, and where additional data are most likely to refine understanding of a binding mechanism.
Why scepticism remains
Despite the growing interest in co-folding AI, many discovery teams remain cautious in applying it to decision-making. The reasons are largely empirical. Predictions that appear convincing visually can deviate substantially from experimental structures once tested, and this variability limits confidence in using co-folding outputs as decision-grade evidence.
First, pose prediction accuracy varies significantly across target classes. Models that perform well for kinases or proteases often struggle with GPCRs, multi-domain proteins, or challenging allosteric sites. Even within a target family, local backbone flexibility and water-mediated interactions can lead to incorrect binding geometries. These deviations are not always obvious from the model’s internal confidence metrics, which are typically trained on public datasets and may not capture uncertainty for unseen chemotypes.
Second, co-folding outputs are sometimes difficult to interpret quantitatively. Physics-based methods provide explicit energy functions and clear ranking criteria. Co-folding AI models instead produce probabilities or learned embeddings, and translating these into actionable scoring or ranking metrics remains challenging. In practice, this means that predicted complexes are useful for generating and comparing hypotheses but generally require validation by orthogonal methods before influencing synthesis priorities.
A third, more operational factor is reproducibility. Re-running the same model on different infrastructure or with a newer checkpoint can yield different results. Without standardised pipelines for benchmarking and traceability, such variability makes it difficult to compare predictions across programmes or model versions.
For these reasons, most organisations apply co-folding AI as a complementary source of structural information, valuable for informing experiments but not yet as a standalone decision tool. The question is less about general scepticism and more about trust calibration: understanding when, where, and how the model can be relied upon.
Why co-folding models’ physical plausibility remains limited
A major challenge concerns the physical realism of co-folding predictions. Even when models correctly identify plausible interaction partners or approximate binding geometries, their outputs do not always satisfy fundamental biochemical or biophysical constraints.
Recent independent work by Masters and colleagues (2025) [1] systematically evaluated this issue. The authors tested modern co-folding architectures under a set of chemically and biologically plausible perturbations, such as side-chain rotations, ligand conformer changes, and interface residue mutations, and observed that model predictions frequently diverged from physically expected responses. Structures often appeared visually convincing but failed to recover steric compatibility, interaction patterns, or energetically favourable rearrangements when compared with experimental or physics-based references.
These findings echo practical observations within pharmaceutical teams. Co-folding models can produce geometries that look reasonable but violate basic principles such as hydrogen-bond directionality, steric exclusion, or acceptable torsional strain. Physics-based methods such as molecular mechanics or free-energy calculations provide interpretable energy landscapes with explicit penalties for such violations. Co-folding architectures, by contrast, learn statistical correlations and therefore lack explicit physical constraints. The challenge is not only whether a pose is nominally “correct”, but whether model behaviour is consistent with underlying physical principles. Addressing this requires approaches that either embed physical priors more deeply or enforce them during inference.
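One of the principles named above, steric exclusion, lends itself to a simple automated screen. The following is a minimal sketch, not the protocol used in the cited study. It assumes heavy atoms supplied as `(element, x, y, z)` tuples in ångströms, a small illustrative table of approximate van der Waals radii, and a hypothetical overlap tolerance of 0.4 Å. The function name and input format are illustrative.

```python
import math

# Approximate van der Waals radii in angstroms for common heavy atoms
# (illustrative subset; hydrogens are deliberately left out).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

def find_steric_clashes(protein_atoms, ligand_atoms, tolerance=0.4):
    """Return (protein_index, ligand_index, distance) for overlapping atom pairs.

    Two atoms are flagged when their distance is below the sum of their
    van der Waals radii minus `tolerance` (an assumed, adjustable cutoff).
    """
    clashes = []
    for i, (p_el, *p_xyz) in enumerate(protein_atoms):
        for j, (l_el, *l_xyz) in enumerate(ligand_atoms):
            distance = math.dist(p_xyz, l_xyz)
            limit = VDW_RADII[p_el] + VDW_RADII[l_el] - tolerance
            if distance < limit:
                clashes.append((i, j, distance))
    return clashes
```

With these assumed values, a typical hydrogen-bonded N···O pair at about 2.8 to 3.0 Å is not flagged, whereas two carbons 2.0 Å apart are. A check of this kind only detects gross overlaps; it says nothing about hydrogen-bond directionality or torsional strain, which need separate tests.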
Why co-folding models’ generalisation remains difficult
Most current co-folding AI models are trained on data derived from the Protein Data Bank (PDB) and similar public sources. While these datasets have transformed structural biology, their coverage is uneven. Certain target classes, such as kinases, proteases, and metabolic enzymes, are represented extensively, while others, including membrane proteins, protein–nucleic acid complexes, allosteric assemblies, and intrinsically disordered regions, remain sparse.
This observation aligns with challenges highlighted in recent community discussions. Experimentalists and modellers point to structural data imbalance, inconsistencies in public bioactivity datasets, and limited availability of high-resolution, ligand-bound complexes across diverse target classes.
A recent independent benchmark by Škrinjar and colleagues (2025) [2] examined this issue directly. Using a post-cutoff dataset of 2,600 newly released protein–ligand complexes, they reported a nearly linear drop in accuracy as the structural similarity to the training set decreased, with success rates falling to around 20 per cent for novel chemotypes. The study highlights that current co-folding models interpolate well within known structural space but struggle to extrapolate to new targets or scaffolds.
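An analysis of this kind can be approximated internally by grouping benchmark predictions by their similarity to the training set and computing a success rate per group. The sketch below assumes each prediction is summarised as a `(similarity, ligand_rmsd)` pair with similarity scaled to the range 0 to 1, and that success means a ligand RMSD of at most 2.0 Å. That threshold is a common convention and not necessarily the criterion used in the cited benchmark; the function name and binning scheme are illustrative.

```python
def success_rate_by_similarity(records, n_bins=5, rmsd_cutoff=2.0):
    """Group (similarity, rmsd) records into equal-width similarity bins.

    `similarity` is assumed to lie in [0, 1]; a prediction counts as a
    success when its ligand RMSD is at or below `rmsd_cutoff` angstroms.
    Returns a list of (bin_start, bin_end, n_records, success_rate) tuples;
    success_rate is None for empty bins.
    """
    counts = [0] * n_bins
    successes = [0] * n_bins
    for similarity, rmsd in records:
        # Clamp so that similarity == 1.0 falls into the last bin.
        index = min(int(similarity * n_bins), n_bins - 1)
        counts[index] += 1
        if rmsd <= rmsd_cutoff:
            successes[index] += 1
    summary = []
    for index in range(n_bins):
        rate = successes[index] / counts[index] if counts[index] else None
        summary.append((index / n_bins, (index + 1) / n_bins, counts[index], rate))
    return summary
```

Plotting the resulting rates against the bin midpoints shows whether accuracy falls away as predictions move further from the training distribution, and where a given programme's targets sit on that curve.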
This limitation arises from how models learn. Co-folding architectures infer likely geometries by identifying correlations between amino acid environments and ligand features present in their training data. When confronted with previously unseen folds, binding motifs, atom types, or chemotypes, these statistical associations may not generalise. As a result, predictions can appear plausible yet deviate from physical reality in subtle but meaningful ways.
Another constraint lies in the labels themselves. Many PDB structures are determined under crystallographic conditions that stabilise one conformer and may miss physiologically relevant flexibility or alternate binding modes. Models trained on such data may therefore learn a biased representation of binding energetics.
The lack of robust prospective validation frameworks, such as blind prediction challenges on newly generated structures, further limits confidence in model extrapolation. Together, these factors define the current boundary of applicability. Co-folding AI performs strongly when applied within regions of structural space represented in its training data, but reliability declines for novel targets or proprietary chemistries.
Overcoming this boundary requires systematic adaptation: testing, benchmarking, and fine-tuning models to the specific molecular distributions encountered in drug discovery programmes.
How organisations evaluate co-folding reliability
In response to these limitations, leading organisations are formalising how co-folding models are evaluated before they are introduced into decision workflows. Internal practices typically combine three elements.
First, retrospective benchmarks are constructed using post-cutoff structures for relevant target classes. These datasets provide a controlled way to measure pose accuracy, interaction recovery, and ranking performance relative to experimental complexes that were not available during initial model training.
Second, prospective tests are run on active programmes. Predictions are generated for ongoing series, and the resulting hypotheses are tracked against subsequent biophysical and structural readouts. Particular attention is paid to whether predicted binding modes align with observed SAR trends and whether the model is consistent in rejecting chemotypes that later fail experimentally.
Third, model outputs are examined for pharmacophore and contact-pattern coherence. Even when RMSD values are acceptable, mismatches in hydrogen-bond networks, key ionic interactions, or hydrophobic contacts often indicate that the model has not captured the relevant binding physics. These checks are now a routine part of many internal validation exercises.
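Two of the measures mentioned in these evaluation practices, pose RMSD and interaction recovery, can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation. The predicted and experimental ligand coordinates are assumed to share one frame (for example after superposing the binding pockets) and the same atom order, ligand symmetry is ignored, and contacts are assumed to arrive as `(residue, interaction_type)` pairs from some upstream interaction-profiling step that is not shown.

```python
import math

def ligand_rmsd(predicted, reference):
    """Root-mean-square deviation between two equally ordered coordinate lists.

    Assumes both poses share one coordinate frame (for example after
    superposing the binding pockets) and ignores ligand symmetry.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))

def interaction_recovery(predicted_contacts, reference_contacts):
    """Fraction of reference contacts that the predicted pose reproduces."""
    reference = set(reference_contacts)
    if not reference:
        raise ValueError("reference pose has no contacts to recover")
    return len(reference & set(predicted_contacts)) / len(reference)
```

Reporting both numbers side by side makes the failure case described above visible: a pose can fall under an RMSD threshold while recovering only a fraction of the experimental contacts.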
Together, these practices allow organisations to position co-folding AI within their decision processes with greater confidence. Models are not treated as black boxes, but as instruments whose behaviour must be characterised and monitored, much like any other experimental or computational method.
Two complementary adaptation strategies
The limitations described above have led advanced research organisations to explore two practical strategies for improving co-folding performance on proprietary targets and chemotypes: constraint-guided inference and fine-tuning. These approaches address different aspects of the problem. One guides the model using external information, the other adapts the model itself, and both are increasingly viewed as complementary.
The case for constraint-guided inference
Constraint-guided inference improves prediction quality by restricting the model to solutions that are compatible with known structural or biochemical information. Unlike fine-tuning, it does not modify model parameters and does not require training data.
Constraints can include experimentally supported binding-site geometries, conserved interaction residues, validated ligand poses, or backbone conformations established through crystallography, cryo-EM, or simulation. By narrowing the search space, constraints reduce the risk of artefactual poses in regions where the model would otherwise extrapolate beyond its training distribution.
Industrial evaluations have highlighted the value of this approach. For example, internal experiments with Boltz-2 have shown improved ranking consistency when pocket conformations are constrained during inference. These observations align with a broader pattern: constraints help stabilise predictions when reliable structural priors are available and when the primary concern is physical plausibility or local extrapolation.
Constraint-guided inference is therefore most effective when high-quality priors exist. Its limitations are equally clear. It cannot improve the model’s internal representation and offers limited benefit for targets with poorly characterised flexibility, cryptic pockets, or chemically diverse series, where relevant priors are either unavailable or incomplete.
The case for fine-tuning
Fine-tuning refers to updating selected parts of a pretrained machine-learning model on a smaller, task-relevant dataset so that its predictions better reflect the distribution of that dataset. In co-folding AI, this typically means adapting a pretrained model that has been trained largely on public structural data to a dataset that reflects the targets, ligands, and assay conditions of a specific research programme. In pharmaceutical settings, such datasets are usually proprietary.
The objective is not simply to improve headline metrics. Fine-tuning adjusts how the model represents energetic and geometric relationships, allowing it to capture binding motifs, chemotypes, and conformational states that are under-represented in public data. Even relatively small, well-curated internal datasets, such as tens to hundreds of complexes, can meaningfully shape model behaviour. This makes fine-tuning one of the few viable methods for extending co-folding models into proprietary chemical space where public training distributions provide little guidance.
However, fine-tuning requires high-quality, harmonised structural and assay data, stable evaluation protocols, and careful governance to ensure reproducibility. These practical foundations are now emerging as a central requirement for organisations that wish to move co-folding AI from experimentation into routine use.
Choosing between fine-tuning, constraint-guided inference, or both
Fine-tuning and constraint-guided inference address the most relevant failure modes encountered in co-folding models today. Fine-tuning improves alignment to the chemical and structural distributions present in drug discovery programmes. Constraints improve physical plausibility when the model’s learned representations diverge from known structural or biochemical priors. The choice between them depends on the availability of structural priors, the novelty of the target class, and the chemical diversity of the programme.
Constraints alone are often sufficient when:
• The target fold and binding site are well characterised
• Relevant conformational states are experimentally validated
• Ligand series are chemically homogeneous
• Predictions require local refinement rather than global model adaptation
Fine-tuning becomes necessary when:
• Predictions must generalise across diverse chemotypes
• Proprietary targets differ substantially from public training data
• Internal structure–activity trends need to be reflected in model behaviour
• No reliable structural priors are available to anchor inference
Combined approaches are emerging as promising strategies. Constraints can stabilise early predictions, while fine-tuning gradually adapts the model to the programme’s chemical space as internal data accumulate. Determining which combination is most effective remains an active area of investigation across pharmaceutical research teams. For many organisations, the immediate priority is building the capability to fine-tune and evaluate models reliably on internal data.
Practical foundations for fine-tuning
Fine-tuning only becomes effective when embedded within a reproducible internal workflow. Across pharmaceutical R&D groups, the bottleneck is rarely the model itself but the supporting infrastructure required to adapt it safely. Three foundations consistently determine whether fine-tuning improves model behaviour or simply amplifies noise.
1. Harmonised structural data
Fine-tuning relies on consistent structural inputs. Ligand protonation states, coordinate frames, alternate conformers, and missing residues must be standardised across the dataset, otherwise the model learns artefacts rather than patterns. In practice, this means defining canonical representations for ligands and proteins, applying uniform preparation procedures, and ensuring that binding-site definitions remain stable across the training set. Without this, model updates become brittle and programme-specific behaviour is difficult to interpret.
2. Stable and traceable evaluation
Because co-folding predictions vary with architecture, checkpoint, and hyperparameters, organisations need fixed validation sets and controlled benchmarking pipelines before attempting any weight updates. These benchmarks provide the reference frame that determines whether fine-tuning has helped or harmed model behaviour. Leading groups maintain internal post-cutoff datasets for each target class and evaluate all model versions against them to preserve comparability over time.
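A fixed validation set is only useful if every model version is demonstrably scored on the same complexes. One way to enforce this, sketched below under stated assumptions, is to fingerprint the benchmark's identifiers and refuse to compare results that do not match it. The function names, the use of a SHA-256 hash, and the per-complex True/False result format are illustrative choices, not a description of any specific organisation's pipeline.

```python
import hashlib
import json

def benchmark_fingerprint(complex_ids):
    """Hash the sorted benchmark identifiers so that any change is detectable."""
    payload = json.dumps(sorted(complex_ids)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def compare_checkpoints(baseline, candidate, expected_fingerprint):
    """Compare per-complex success flags for two model versions.

    `baseline` and `candidate` map complex identifiers to True/False
    (success under whatever pose criterion the organisation has fixed).
    Raises if either result set was not produced on the frozen benchmark.
    """
    for name, results in (("baseline", baseline), ("candidate", candidate)):
        if benchmark_fingerprint(results) != expected_fingerprint:
            raise ValueError(f"{name} results do not match the frozen benchmark")
    gained = sorted(c for c in candidate if candidate[c] and not baseline[c])
    lost = sorted(c for c in candidate if baseline[c] and not candidate[c])
    n = len(candidate)
    return {
        "baseline_rate": sum(baseline.values()) / n,
        "candidate_rate": sum(candidate.values()) / n,
        "gained": gained,
        "lost": lost,
    }
```

Listing the complexes gained and lost, rather than only the two aggregate rates, shows whether a fine-tuned checkpoint has improved on some targets at the expense of others.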
3. Controlled training and governance
Fine-tuning requires reliable compute environments, defined model configurations, and auditable training records. Weight updates must be reproducible and reversible, with clear lineage for model versions. This is particularly important when predictions enter regulated decision processes. The absence of governance is a significant reason why many organisations experiment with co-folding but stop short of integrating it into medicinal chemistry workflows.
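The requirement for reproducible, reversible updates with clear lineage can be made concrete with a small record per training run. The sketch below is one possible shape for such a record, with every field name chosen for illustration. It assumes that weight files and datasets are identified by content hashes computed elsewhere, and that hyperparameter values are JSON-serialisable.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class TrainingRecord:
    """One auditable entry in a model's lineage (all field names are illustrative)."""
    parent_id: Optional[str]   # record_id of the checkpoint that was fine-tuned
    weights_sha256: str        # hash of the resulting weight file
    dataset_sha256: str        # hash of the harmonised training set
    config: tuple              # sorted (name, value) hyperparameter pairs
    seed: int

    @property
    def record_id(self):
        """Deterministic identifier derived from the record's contents."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]

def lineage(record_id, registry):
    """Walk parent links back to the base checkpoint, newest first."""
    chain = []
    while record_id is not None:
        record = registry[record_id]
        chain.append(record_id)
        record_id = record.parent_id
    return chain
```

Because the identifier is derived from the record's contents, two runs with the same parent, data, configuration, seed, and resulting weights share an identifier, and reverting a model amounts to redeploying the weights referenced by an earlier entry in the chain.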
Together, these foundations create the conditions under which fine-tuning produces meaningful improvements. Without them, even well-designed training runs struggle to produce reliable changes in model performance. With them, teams can adapt co-folding architectures to reflect their own chemical series, target classes, assay conditions, and structural hypotheses, extending applicability beyond what public data alone can support.
Extending fine-tuning through federated data networks
Even with a well-established internal workflow, fine-tuning remains constrained by the limits of a single organisation’s data. Structural coverage in proprietary archives can be narrow, particularly for novel modalities or under-characterised protein families. Federated training offers a way to broaden the learned representation without sharing raw molecular structures.
In a federated setting, partners retain their data locally and train updates on their own infrastructure. Only model gradients or adapted low-rank components are exchanged. This preserves intellectual property while allowing the model to learn from complementary structural environments. The approach is particularly relevant for co-folding models, where public protein–ligand data are sparse and do not reflect the diversity seen across pharmaceutical research programmes.
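The exchange described here can be illustrated with the aggregation step of federated averaging, in which a coordinator combines partner updates weighted by the number of local training examples. This is a generic sketch of that standard scheme, not a description of how any particular network aggregates updates. The parameter name `lora_a`, the flat-list representation of each update, and the weighting by example count are assumptions made for illustration.

```python
def federated_average(updates):
    """Combine partner updates into one weighted average (FedAvg-style).

    `updates` is a list of (n_examples, delta) pairs, where `delta` maps a
    parameter name to a flat list of floats. Only these deltas leave each
    partner; the structures they were computed from stay local.
    """
    total = sum(n for n, _ in updates)
    if total <= 0:
        raise ValueError("at least one partner must contribute examples")
    names = updates[0][1].keys()
    averaged = {}
    for name in names:
        length = len(updates[0][1][name])
        averaged[name] = [
            sum(n * delta[name][k] for n, delta in updates) / total
            for k in range(length)
        ]
    return averaged

# Two hypothetical partners contributing updates for one adapter parameter.
partner_updates = [
    (100, {"lora_a": [1.0, 0.0]}),
    (300, {"lora_a": [0.0, 4.0]}),
]
global_update = federated_average(partner_updates)
```

In this toy example the second partner holds three times as many examples, so its update carries three times the weight in the combined result. Production systems typically add further safeguards, such as secure aggregation, which are outside the scope of this sketch.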
A recent example is the AI Structural Biology (AISB) Network [3], powered by Apheris' federated computing product, where five pharmaceutical companies, AbbVie, Astex, Bristol Myers Squibb, Johnson & Johnson, and Takeda, are collaboratively training OpenFold3 [4] across their distributed molecular data. Each partner runs training locally on its own infrastructure; only model updates are shared. This allows OpenFold3 to learn from a more diverse set of protein–ligand interactions, conformational states, and structural contexts than any single organisation could contribute.
For co-folding specifically, this type of distributed training directly targets two of the most persistent challenges: generalising beyond public structural space and stabilising predictions for under-represented target classes. Federated fine-tuning also creates a distributed form of validation. Each participant evaluates the evolving model on its own internal benchmarks, enabling early detection of failure modes that would be invisible in a single-organisation setting. This multisite validation strengthens confidence in model behaviour.
Building such networks requires alignment on data preparation standards, evaluation protocols, and training interfaces, but the underlying motivation is straightforward. Co-folding models improve when they learn from structurally diverse molecular environments, and federated learning is currently one of the few practical mechanisms for achieving this without compromising intellectual property or data governance.
Conclusion
Co-folding AI has become an increasingly valuable tool in early drug discovery, offering rapid structural hypotheses that can guide experiments, contextualise structure–activity trends, and accelerate decision-making. Yet today’s models remain constrained by two fundamental limitations: physical plausibility and generalisation beyond public structural space. These limitations explain why many organisations continue to treat co-folding predictions as hypothesis-generating rather than decision-grade.
Two complementary strategies have emerged to address these gaps. Constraint-guided inference improves physical realism when high-quality structural priors exist, while fine-tuning adapts model behaviour to the specific molecular distributions encountered in real discovery programmes. Together, they provide a practical toolkit for extending co-folding AI into the areas where current architectures struggle most.
As organisations develop internal pipelines for fine-tuning, supported by harmonised data, stable evaluation benchmarks, and controlled training environments, the opportunity to extend these improvements collaboratively becomes increasingly clear. Federated initiatives such as the AISB Network demonstrate that diverse structural data can strengthen model performance without compromising proprietary information, offering a path toward broader applicability and more reliable predictions.
The trajectory is encouraging. As models integrate richer physical priors, learn from more representative structural distributions, and become easier to adapt within and across organisations, co-folding AI is moving from exploratory use toward a more central role in structural reasoning. Progress will depend not only on architectural innovation, but also on how effectively the scientific community builds the workflows, data standards, and collaborative frameworks required to realise the full potential of these models in drug discovery.
Footnotes:
1. https://www.nature.com/articles/s41467-025-63947-5
2. https://www.biorxiv.org/content/10.1101/2025.02.03.636309v1
3. https://www.apheris.com/join-a-network/aisb
4. https://www.apheris.com/resources/blog/aisb-networkexpands-federated-openfold3-initiative-with-three-newpharma-contrib