Learning a Deep Language Model for Microbiomes: The Power of Large Scale Unlabeled Microbiome Data

Quintin Pope, Rohan Varma, Christine Tataru, Maude M David, Xiaoli Fern.

Abstract

We use open source human gut microbiome data to learn a microbial “language” model by adapting techniques from Natural Language Processing (NLP). Our microbial “language” model is trained in a self-supervised fashion (i.e., without additional external labels) to capture the interactions among different microbial taxa and the common compositional patterns in microbial communities. The learned model produces contextualized taxon representations that allow a single microbial taxon to be represented differently according to the specific microbial environment in which it appears.

Introduction

Identifiable features of the human microbiome and its interactions with various body systems have been associated with a wide range of diseases, including cancer, depression and inflammatory bowel disease. As our knowledge of such connections has advanced, research on the human microbiome has undergone a shift in focus, moving from establishing links to unraveling the underlying mechanisms and utilizing them to develop clinical interventions.

Materials and Methods

We begin by introducing the general workflow of applying a transformer model to generate a sample embedding (Fig 1) and explaining each step of the workflow, including a detailed look at the transformer architecture. We then explain how we perform the pretraining, followed by finetuning for specific downstream tasks. This section also explains how we identify the taxa that most affect the model’s classification decisions (Eq 1) and concludes with a description of the datasets used in this paper.
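The core idea of the workflow above can be illustrated with a minimal NumPy sketch. This is not the authors’ implementation; the vocabulary size, embedding dimension, and function names (`contextualize`, `sample_embedding`) are illustrative assumptions. It shows, with a single randomly initialized self-attention layer, how each taxon’s vector is updated using the other taxa present in the same sample, and how the contextualized vectors are pooled into one sample embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary of microbial taxa (integer IDs stand in for ASVs/OTUs).
vocab_size, d = 10, 8
embed = rng.normal(size=(vocab_size, d))          # per-taxon embedding table
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # attention weights

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contextualize(taxon_ids):
    """One self-attention layer: each taxon's vector is recomputed
    as an attention-weighted mixture over all taxa in the sample."""
    X = embed[taxon_ids]                          # (n_taxa, d)
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)   # pairwise attention logits
    return softmax(scores) @ (X @ Wv)             # contextualized vectors

def sample_embedding(taxon_ids):
    # Mean-pool the contextualized taxon vectors into one sample vector,
    # which a downstream classifier head would consume during finetuning.
    return contextualize(taxon_ids).mean(axis=0)

# The same taxon (id 3) receives different representations in different
# microbial communities -- the "contextualized" property described above.
ctx_a = contextualize([3, 1, 4])[0]
ctx_b = contextualize([3, 7, 2])[0]
assert not np.allclose(ctx_a, ctx_b)
```

In the actual model the weights are learned by self-supervised pretraining (rather than drawn at random), and the architecture stacks multiple attention layers; this sketch only demonstrates why a taxon’s representation depends on its surrounding community.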

Discussion

We apply recent natural language processing techniques to learn a language model for microbiomes from public domain human gut microbiome data. The pre-trained language model provides powerful contextualized representations of microbial communities and can be broadly applied as a starting point for downstream prediction tasks involving the human gut microbiome. In this work, we show the power of the pre-trained model by fine-tuning the representations for IBD disease state and diet classification, achieving strong performance on both tasks. For IBD, our learned representations enable an ensemble model with competitive performance that is robust across study populations even under strong distributional shifts.

Citation: Pope Q, Varma R, Tataru C, David MM, Fern X (2025) Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data. PLoS Comput Biol 21(5): e1011353. https://doi.org/10.1371/journal.pcbi.1011353

Editor: Stacey D. Finley, University of Southern California, UNITED STATES OF AMERICA

Received: July 25, 2023; Accepted: March 24, 2025; Published: May 7, 2025.

Copyright: © 2025 Pope et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: American Gut Project, Halfvarson, and Schirmer raw data are available from the NCBI database (accession numbers PRJEB11419, PRJEB18471, and PRJNA398089, respectively). We used the curated data produced by Tataru and David, 2020 (doi: 10.1371/journal.pcbi.1007859). All data and code required for our methods are made available and described in a Dryad repository at doi: 10.5061/dryad.tb2rbp08p. Data and code available from: https://datadryad.org/stash/dataset/doi:10.5061/dryad.tb2rbp08p (data) and doi: 10.5281/zenodo.13858903 (code). File descriptions and usage instructions are available in the repository’s README.

Funding: This work was supported by the National Science Foundation Division of Emerging Frontiers (2025457 to XF and MD), as well as the Open Philanthropy Long-Term Future Scholarship Program (to QP). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.