An Expandable Synthetic Library of Human Paired Antibody Sequences
Toma M. Marinov, Perry T. Wasdin, Gwen Jordaan, Alexis K. Janke, Alexandra A. Abu-Shmais, Ivelin S. Georgiev.
Abstract
The potential diversity of the global repertoire of human antibody sequences is currently not well understood, because paired heavy-light chain antibody sequence data remain limited by the low throughput and high cost of current single-cell sequencing methods. Here, we report IgHuAb, a large language model for high-throughput generation of paired human antibody sequences. Using IgHuAb, we created SynAbLib, a synthetic human antibody library that mimics population-level features of naturally occurring human antibody sequences, yet is associated with significantly greater diversity in sequence space.
Introduction
Monoclonal antibodies are an effective therapeutic modality against a wide range of diseases, including infectious disease, cancer, autoimmunity, and others. Antibodies are also important as diagnostic and research reagents, and can serve as templates for the development of effective vaccines. Monoclonal antibodies are the secreted form of the B cell receptor, which, in humans, consists of a pairing of a heavy chain (HC) and a light chain (LC) protein. HC-LC pairing is one of the mechanisms that enables diversification of the antibody repertoire, along with germline gene recombination and somatic hypermutation in each of the HC and LC. Because of these different antibody diversification mechanisms, the potential space of antibody sequences is exceptionally large.
Materials and Methods:
The 430,000 paired human antibody sequences were curated from the OAS Paired and PLAbDab public databases (both accessed in Feb 2024), and duplicate sequence pairs were removed. Germline genes and CDRs were assigned with ANARCI (as implemented in the AbNumber Python package) with the allowed_species='human' option. The non-redundant data were clustered with Linclust, and clusters that together contained ~10% of the sequences were set aside for model testing.
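The deduplication and cluster-level hold-out described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the random seed, and the input format (a mapping from sequence ID to cluster ID, such as could be derived from Linclust output) are assumptions made for illustration. The point it shows is that whole clusters, rather than individual sequences, are assigned to the test set, so that similar sequences do not end up on both sides of the split.

```python
import random
from collections import defaultdict


def deduplicate_pairs(pairs):
    """Drop repeated (heavy, light) pairs, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for heavy, light in pairs:
        key = (heavy, light)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def split_by_cluster(cluster_of, test_fraction=0.10, seed=0):
    """Hold out whole clusters until about `test_fraction` of sequences are in the test set.

    cluster_of: dict mapping sequence ID -> cluster ID (hypothetical input format).
    Returns (train_ids, test_ids).
    """
    members = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        members[cluster_id].append(seq_id)

    # Shuffle clusters reproducibly, then fill the test set cluster by cluster.
    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)

    target = test_fraction * len(cluster_of)
    train_ids, test_ids = [], []
    for cluster_id in cluster_ids:
        if len(test_ids) < target:
            test_ids.extend(members[cluster_id])
        else:
            train_ids.extend(members[cluster_id])
    return train_ids, test_ids


if __name__ == "__main__":
    toy_pairs = [("EVQLV", "DIQMT"), ("EVQLV", "DIQMT"), ("QVQLQ", "EIVLT")]
    print(deduplicate_pairs(toy_pairs))

    # 100 toy sequence IDs in 20 clusters of 5 sequences each.
    toy_clusters = {f"seq{i}": i // 5 for i in range(100)}
    train, test = split_by_cluster(toy_clusters)
    print(len(train), len(test))
```

Because clusters are held out whole, the test set can slightly exceed the requested fraction when the last cluster added is large, which is consistent with the approximate "~10%" stated in the text.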
Discussion:
Screening the repertoires of naturally occurring antibodies in humans has served as a source for therapeutic and vaccine discovery and has provided important insights into the fundamental rules of antibody sequence diversification and antibody-antigen recognition. However, experimental antibody repertoire screening is associated with high costs, requires access to specific samples and specialized instrumentation, and is difficult to scale. In contrast, computational methods for antibody sequence generation can present an efficient, scalable, and generalizable alternative to experimental antibody sequencing approaches.
Acknowledgments:
This work was conducted, in part, using the resources of the Advanced Computing Center for Research and Education (ACCRE) at Vanderbilt University.
Citation: Marinov TM, Wasdin PT, Jordaan G, Janke AK, Abu-Shmais AA, Georgiev IS (2025) An expandable synthetic library of human paired antibody sequences. PLoS Comput Biol 21(4): e1012932. https://doi.org/10.1371/journal.pcbi.1012932
Editor: Frederick A. Matsen IV, Fred Hutchinson Cancer Research Center, UNITED STATES OF AMERICA
Received: September 26, 2024; Accepted: March 5, 2025; Published: April 21, 2025.
Copyright: © 2025 Marinov et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data, model and scripts available at: https://figshare.com/s/317313907fce0de1d735.
Funding: This research was funded, in part, by the G. Harold and Leila Y. Mathers Charitable Foundation (MF-2107-01851), NIH R01AI175245, NIH T32AI112541. This research was funded, in part, by the Advanced Research Projects Agency for Health (ARPA-H) 1AY2AX000077. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: I have read the journal's policy and the authors of this manuscript have the following competing interests: I.S.G. is a co-founder of AbSeek Bio. I.S.G. has served as a consultant for Sanofi. The Georgiev laboratory at VUMC has received unrelated funding from Merck and Takeda Pharmaceuticals.