Francesca Young, Simon Rogers, David L. Robertson
The rise in metagenomics has led to an exponential growth in virus discovery. However, the majority of these new virus sequences have no assigned host. Current machine learning approaches to predicting virus-host interactions tend to focus on nucleotide features, ignoring other representations of genomic information. Here we investigate the predictive potential of features generated from four different ‘levels’ of viral genome representation: nucleotide, amino acid, amino acid properties and protein domains. This more fully exploits the biological information present in the virus genomes. Over a hundred and eighty binary datasets for infecting versus non-infecting viruses at all taxonomic ranks of both eukaryote and prokaryote hosts were compiled. The viral genomes were converted into the four different levels of genome representation and twenty feature sets were generated by extracting k-mer compositions and predicted protein domains. We trained and tested Support Vector Machine (SVM) classifiers to compare the predictive capacity of each of these feature sets for each dataset. Our results show that all levels of genome representation are consistently predictive of host taxonomy and that prediction improves with increasing k-mer length for all k-mer based features. Using a phylogenetically aware holdout method, we demonstrate that the predictive feature sets contain signals reflecting both the evolutionary relationship between viruses infecting related hosts and host mimicry. Our results demonstrate that incorporating a range of complementary features, generated purely from virus genome sequences, leads to improved accuracy for a range of virus-host prediction tasks, enabling computational assignment of host taxonomic information.
Determining which virus infects which host species is currently a major challenge in virology. Knowledge of virus-host infectivity is essential to understanding the impact that viruses have on cellular life and the key roles they play as an integral part of all of Earth’s ecosystems, from our own microbiome to the marine environment where they contribute to regulation of the biogeochemical cycles [2,3], and as animal and plant pathogens. Advances in metagenomics have led to a rapid expansion in virus discovery, with more than half of all known viral genomes being deposited in databases in the last two years. This growth in data is the first step towards cataloguing the Earth’s virosphere. However, the indiscriminate nature of metagenomics results in the majority of these new viruses having no identified host. For example, there are now over 700 000 viral genomes in the IMG/VR databases, of which fewer than 5% have an associated host [5,6]. Currently, there are no high-throughput methods available to make reliable virus-host associations, and as such we are unable to keep up with the rapid pace of viral discovery. Fast, accurate computational tools are thus urgently needed to annotate these new viral genomes with host taxon information.
We downloaded the Virus Host Database (VHDB; https://www.genome.jp/virushostdb/) on 25 January 2019. The VHDB is a curated database of reported taxonomic interactions between viruses and their known hosts. It is regularly updated from RefSeq/GenBank, UniProt and ViralZone and includes manual annotations. The dataset included 9199 unique viruses associated with 3006 hosts and a total of 14229 interactions. The FASTA files of the reference genome sequences and the amino acid sequences of the coding regions for each virus are also included in the VHDB resources.
Generating Binary Datasets from the known Virus Host Interactions
A host taxonomic tree was constructed from all the hosts in VHDB using ETE 3 at the ranks of kingdom, phylum, class, order, family, genus and species. Each host node was annotated with the viruses known to infect it. The tree was then ‘pruned’ to retain only positive nodes, defined as nodes infected by at least 28 virus species. Because we were comparing how predictive the feature sets were across all taxon ranks, setting this arbitrary threshold at 28 allowed us to include more genus- and species-level datasets.
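The annotation and pruning steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function name `positive_nodes`, the input format (a dictionary from each host to its list of taxa, and a list of virus-host pairs) and the toy data are all assumptions made for the example, and it uses plain dictionaries rather than ETE 3. Only the threshold of 28 comes from the text.

```python
from collections import defaultdict

MIN_VIRUSES = 28  # threshold stated in the text


def positive_nodes(host_lineages, interactions, min_viruses=MIN_VIRUSES):
    """Return {taxon: set of viruses} for taxa infected by >= min_viruses viruses.

    host_lineages: dict mapping a host name to the list of taxa on its lineage,
                   including the host itself (hypothetical input format).
    interactions:  iterable of (virus, host) pairs.
    """
    viruses_at = defaultdict(set)
    for virus, host in interactions:
        # A virus infecting a host also counts towards every ancestor taxon.
        for taxon in host_lineages.get(host, []):
            viruses_at[taxon].add(virus)
    # "Pruning": keep only taxa that reach the minimum number of viruses.
    return {t: v for t, v in viruses_at.items() if len(v) >= min_viruses}


# Toy data (invented for illustration; lineages are truncated to three ranks).
toy_lineages = {
    "Homo sapiens": ["Metazoa", "Chordata", "Homo sapiens"],
    "Mus musculus": ["Metazoa", "Chordata", "Mus musculus"],
    "Escherichia coli": ["Bacteria", "Proteobacteria", "Escherichia coli"],
}
toy_interactions = [
    ("v1", "Homo sapiens"),
    ("v2", "Homo sapiens"),
    ("v2", "Mus musculus"),
    ("v3", "Mus musculus"),
    ("v4", "Escherichia coli"),
]

# With a toy threshold of 3, only the two shared higher-rank taxa survive.
kept = positive_nodes(toy_lineages, toy_interactions, min_viruses=3)
print(sorted(kept))
```

In this sketch each retained taxon would then define one binary dataset, with its annotated viruses as the positive class. How the negative class is chosen is not covered by this example.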
The aim of this study was to compare the predictive power of a wide range of features for use in machine learning approaches to virus-host prediction. We generated 20 feature sets from multiple representations of viral genomes and tested their capacity for host prediction. We found that features derived from all representations are predictive of host taxon for both bacterial and eukaryotic hosts (Figs 2, 3, 4 and 5), and that different features contain complementary signals that can be combined to improve prediction (Figs 10 and 11). Through a phylogenetically aware stratification scheme (Fig 7), our results strongly suggest that the features capture both phylogenetic and convergent signals (Figs 8 and 9).
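To make the notion of a k-mer composition feature concrete, the following minimal Python sketch converts a sequence into a normalised k-mer frequency vector. It is an assumption-laden illustration rather than the pipeline used in the study: the function name `kmer_composition`, the normalisation to relative frequencies and the decision to skip k-mers containing characters outside the alphabet are all choices made for this example. The same function applies to nucleotide or amino acid sequences by changing the alphabet.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"


def kmer_composition(sequence, k, alphabet=NUCLEOTIDES):
    """Return relative k-mer frequencies over all len(alphabet)**k k-mers.

    k-mers containing characters outside the alphabet (e.g. 'N') are ignored.
    The vector is ordered as itertools.product(alphabet, repeat=k).
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


# Toy sequence: the windows are AC, CG, GT, TA, AC.
vec = kmer_composition("ACGTAC", k=2)
print(len(vec))  # 16 possible dinucleotides
```

Vectors of this kind, computed per genome, could serve as rows of a feature matrix for a classifier such as an SVM. The length of the vector grows as the alphabet size to the power k, which is why longer k-mers yield much larger feature sets.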
Citation: Young F, Rogers S, Robertson DL (2020) Predicting host taxonomic information from viral genomes: A comparison of feature representations. PLoS Comput Biol 16(5): e1007894. https://doi.org/10.1371/journal.pcbi.1007894
Editor: Morgan Langille, DAL, CANADA
Received: September 3, 2019; Accepted: April 21, 2020; Published: May 26, 2020
Copyright: © 2020 Young et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting Information files.
Funding: FY is supported by a studentship from the Medical Research Council (MRC). DLR is funded by the MRC (MC_UU_1201412). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.