Sinu Paul, Nathan P. Croft, Anthony W. Purcell, David C. Tscharke, Alessandro Sette, Morten Nielsen, Bjoern Peters
T cell epitope candidates are commonly identified using computational prediction tools to enable applications such as vaccine design, cancer neoantigen identification, development of diagnostics and removal of unwanted immune responses against protein therapeutics. Most T cell epitope prediction tools are based on machine learning algorithms trained on MHC binding or naturally processed MHC ligand elution data. The ability of currently available tools to predict T cell epitopes has not been comprehensively evaluated. In this study, we used a recently published dataset that systematically defined T cell epitopes recognized in vaccinia virus (VACV) infected C57BL/6 mice (expressing H-2Db and H-2Kb), considering both peptides predicted to bind MHC and peptides experimentally eluted from infected cells, making this the most comprehensive dataset of T cell epitopes mapped in a complex pathogen. We evaluated the ability of all currently publicly available computational T cell epitope prediction tools to identify these major epitopes from all peptides encoded in the VACV proteome. We found that all methods were able to improve epitope identification above random, with the best performance achieved by neural network-based predictions trained on both MHC binding and MHC ligand elution data (NetMHCPan-4.0 and MHCFlurry). Impressively, these methods were able to capture more than half of the major epitopes in the top N = 277 predictions within the N = 767,788 predictions made for distinct peptides of relevant lengths that can theoretically be encoded in the VACV proteome.
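The top-N evaluation described above can be illustrated with a minimal Python sketch. It is not the benchmarking code used in the study: the function names `enumerate_peptides` and `fraction_captured` are hypothetical, the peptide lengths are left as a parameter because the abstract does not state them, and higher scores are assumed to mean stronger predictions.

```python
from typing import Dict, Iterable, Set


def enumerate_peptides(proteins: Iterable[str], lengths: Iterable[int]) -> Set[str]:
    """Collect every distinct peptide of the given lengths from the protein sequences."""
    lengths = list(lengths)  # allow generators; the lengths are reused for each protein
    peptides: Set[str] = set()
    for seq in proteins:
        for k in lengths:
            # Slide a window of width k along the sequence.
            for start in range(len(seq) - k + 1):
                peptides.add(seq[start:start + k])
    return peptides


def fraction_captured(scores: Dict[str, float], epitopes: Set[str], top_n: int) -> float:
    """Fraction of known epitopes found among the top_n highest-scoring peptides.

    Assumes a higher score means a stronger prediction (hypothetical convention).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = set(ranked[:top_n])
    return len(top & epitopes) / len(epitopes)
```

For scale, the top 277 of 767,788 predictions corresponds to roughly 0.036% of all candidate peptides.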
T cell epitope identification is important in many immunological applications including development of vaccines and diagnostics in infectious, allergic and autoimmune diseases, removal of unwanted immune responses against protein therapeutics and in cancer immunotherapy. Computational T cell epitope prediction tools can help to reduce the time and resources needed for epitope identification projects by narrowing down the peptide repertoire that needs to be experimentally tested. Most epitope prediction tools are developed using machine learning algorithms trained on two types of experimental data: binding affinities of peptides to specific MHC molecules generated using MHC binding assays, or sets of naturally processed MHC ligands found by eluting peptides from MHC molecules on the cell surface and identifying them by mass spectrometry. Since the first computational epitope prediction methods were introduced more than two decades ago [1–3], advancement in machine learning methods and increases in the availability of training data have improved the performance of these methods significantly in recent years, as has been demonstrated on benchmarks of MHC binding data [4,5].
Materials & methods
Selection of methods
As a first step, we compiled a list of all freely available CD8+ T cell epitope prediction methods by querying Google and Google Scholar. We identified 44 methods (S1 Table) with freely and publicly available executable algorithms, excluding methods that required us to train a prediction model and commercial prediction tools that required us to obtain licenses. Out of these 44 methods, we selected those that had trained models available for the two mouse alleles for which we had benchmarking data (H-2Db & H-2Kb). Further, we contacted the authors of the selected methods and excluded those that the authors explicitly asked to have excluded from the benchmarking for various reasons (mostly because the methods had not been updated recently or new versions of the methods were to be released soon).
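The selection criteria above amount to a sequence of filters, which can be sketched as follows. This is an illustration only: the `Method` record, its field names and the `select_methods` function are hypothetical and do not reflect how the authors tracked the methods listed in S1 Table.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# The two mouse alleles for which benchmarking data were available.
REQUIRED_ALLELES = frozenset({"H-2Db", "H-2Kb"})


@dataclass(frozen=True)
class Method:
    """Hypothetical record describing one candidate prediction method."""
    name: str
    alleles: FrozenSet[str]       # alleles with a trained model available
    needs_training: bool = False  # user would have to train the model
    commercial: bool = False      # requires a license
    author_opt_out: bool = False  # authors asked for exclusion


def select_methods(candidates: List[Method]) -> List[Method]:
    """Apply the selection criteria in the order described in the text."""
    # Step 1: freely available executables, no training, no commercial license.
    available = [m for m in candidates if not m.needs_training and not m.commercial]
    # Step 2: trained models exist for both benchmark alleles.
    covered = [m for m in available if REQUIRED_ALLELES <= m.alleles]
    # Step 3: drop methods whose authors asked to be excluded.
    return [m for m in covered if not m.author_opt_out]
```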
In this study we comprehensively evaluated the ability of different prediction methods to identify T cell epitopes. We found that most of the latest methods perform at a very high level, especially the methods built on artificial neural network-based architectures. In addition, we found that methods that integrated MHC binding and MHC ligand elution data performed better than those trained on MHC binding data alone. Where a method provided two output scores, one predicting MHC ligands and the other predicting MHC binding, the MHC ligand score performed better. Based on these results, the IEDB will be updating the default recommended prediction method to NetMHCPan-4.0-L.
Citation: Paul S, Croft NP, Purcell AW, Tscharke DC, Sette A, Nielsen M, et al. (2020) Benchmarking predictions of MHC class I restricted T cell epitopes in a comprehensively studied model system. PLoS Comput Biol 16(5): e1007757. https://doi.org/10.1371/journal.pcbi.1007757
Editor: Ramgopal Mettu, Tulane University, UNITED STATES
Received: May 3, 2019; Accepted: March 2, 2020; Published: May 26, 2020
Copyright: © 2020 Paul et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The supporting data including predicted scores for the peptides by all prediction methods involved in the study and the code for running the benchmarking evaluation are available
Funding: This project has been funded with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, U.S. Department of Health and Human Services under Contract Number 75N93019C00001 (Immune Epitope Database and Analysis Resource Program) and an Australian National Health and Medical Research Council (NHMRC) Project Grant (APP1084283). AWP is supported by a Principal Research Fellowship and DCT by a Senior Research Fellowship from the Australian NHMRC. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.