A Curriculum Learning Approach to Training Antibody Language Models
Sarah M. Burbach, Bryan Briney
Abstract
There is growing interest in pre-training antibody language models (AbLMs) with a mixture of unpaired and natively paired sequences, seeking to combine the proven benefits of natively paired sequences with the massive scale of unpaired antibody sequence datasets. However, given the novelty of this strategy, the field lacks a systematic evaluation of data processing methods and training strategies that maximize the benefits of mixed training data while accommodating the significant imbalance in the size of existing paired and unpaired datasets.
Introduction
Antibodies are a diverse and essential component of the adaptive immune system, with total available repertoire diversity estimated as high as 10¹⁸ unique antibodies. This exceptional diversity results initially from the somatic recombination of germline gene segments and is further refined upon antigen exposure via clonal expansion, somatic hypermutation, and antigen-driven selection of productive mutations. Within each recombined antibody gene, diversity is greatest in the complementarity-determining regions (CDRs).
Materials and Methods
Sequence data were downloaded from the Observed Antibody Space (OAS) database on September 12th, 2024. In addition to sequences from the OAS, the paired dataset was supplemented with an internally generated dataset of ~400k paired sequences. Filtering and clustering were performed as described in AntiRef, with additional filtering to remove sequences containing ‘nan’ characters. Clustering at 90% identity was used for both datasets, yielding 151,764,423 unpaired sequences and 1,717,423 paired sequences in the full datasets.
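As a point of reference, the sketch below illustrates one way this preprocessing step could be reproduced, assuming an AntiRef-style workflow in which clustering is performed with MMseqs2. The file names (oas_unpaired.fasta, filtered.fasta) and the helper functions are hypothetical, not taken from the paper’s code.

```python
import subprocess

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

def filter_nan(in_fasta, out_fasta):
    """Drop records containing the literal lowercase string 'nan'.

    Amino acid sequences are uppercase, so a lowercase 'nan' indicates
    a missing value that was stringified during upstream processing.
    """
    kept = 0
    with open(out_fasta, "w") as out:
        for header, seq in read_fasta(in_fasta):
            if "nan" in seq:
                continue
            out.write(f"{header}\n{seq}\n")
            kept += 1
    return kept

filter_nan("oas_unpaired.fasta", "filtered.fasta")

# Cluster at 90% identity with MMseqs2 (as in AntiRef); the cluster
# representatives are written to filtered90_rep_seq.fasta.
subprocess.run(
    ["mmseqs", "easy-cluster", "filtered.fasta", "filtered90", "tmp_dir",
     "--min-seq-id", "0.9"],
    check=True,
)
```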
Discussion
Recent studies on AbLMs have shown that the benefits of training on natively paired sequences can be sufficient to overcome a significant disadvantage in training data scale compared to unpaired sequences. The high cost and effort required to recover natively paired antibody sequences have sparked interest in methods that maximize the training value of these limited datasets, including supplementing paired sequences with unpaired sequences.
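To make the general idea of a curriculum over mixed data concrete, the following is a minimal sketch of a batch-level sampler that phases paired sequences into training over time. The linear ramp, the endpoint fractions, and the function names are illustrative assumptions, not the schedule used for CurrAb.

```python
import random
from collections import Counter

def batch_source(step: int, total_steps: int,
                 start_paired_frac: float = 0.0,
                 end_paired_frac: float = 0.5) -> str:
    """Choose the data source for the next training batch.

    The probability of drawing a paired-sequence batch ramps linearly
    from start_paired_frac to end_paired_frac over training, so the
    model sees mostly unpaired data early and a growing share of
    paired data later. The ramp shape and fractions are illustrative.
    """
    frac = start_paired_frac + (end_paired_frac - start_paired_frac) * step / total_steps
    return "paired" if random.random() < frac else "unpaired"

# Quick check: tally the draws across a mock 100k-step run.
counts = Counter(batch_source(s, 100_000) for s in range(100_000))
print(counts)  # roughly 3:1 unpaired:paired with the defaults above
```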
Citation: Burbach SM, Briney B (2025) A curriculum learning approach to training antibody language models. PLoS Comput Biol 21(9): e1013473. https://doi.org/10.1371/journal.pcbi.1013473
Editor: Chaok Seok, Seoul National University, REPUBLIC OF KOREA
Received: March 5, 2025; Accepted: August 28, 2025; Published: September 11, 2025.
Copyright: © 2025 Burbach, Briney. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The code used for model training and evaluation is available on GitHub (github.com/brineylab/curriculum-paper). The training data and model weights for CurrAb are available on Zenodo (doi.org/10.5281/zenodo.14661302). The 650M-parameter mixed models, including CurrAb, are available on Hugging Face (huggingface.co/collections/brineylab/curriculum-paper-685b08a4b6986df7c5a5e3c4).
Funding: This work was funded by the National Institutes of Health (P01-AI177683, U19-AI135995, R01-AI171438, P30-AI036214, and UM1-AI144462) and the Pendleton Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: BB is an equity shareholder in Infinimmune and a member of their Scientific Advisory Board.