Shohei Kojima, Anselmo Jiro Kamada, Nicholas F. Parrish
Acquisition of genetic material from viruses by their hosts can generate inter-host structural genome variation. We developed computational tools enabling us to study virus-derived structural variants (SVs) in population-scale whole genome sequencing (WGS) datasets and applied them to 3,332 humans. Although SVs had already been cataloged in these subjects, we found previously overlooked virus-derived SVs. We detected non-germline SVs derived from squirrel monkey retrovirus (SMRV), human immunodeficiency virus 1 (HIV-1), and human T-lymphotropic virus 1 (HTLV-1); these variants are attributable to infection of the sequenced lymphoblastoid cell lines (LCLs) or their progenitor cells and may impact gene expression results and the biosafety of experiments using these cells.
Union of genomes from discrete biological entities is a major engine of genetic diversity. Fusion of gametes, each bearing a set of recombinant chromosomes, is the immediate source of the genetic material that uniquely identifies each human. Taking a wider viewpoint, much of a human genome can be recognized to have been acquired from a source other than modern humans.
This research was approved by the RIKEN Yokohama Institute Ethics Committee (approval number 2019–13). All data analyzed here were retrieved from databases freely available to the general public, are anonymous, and have no associated medical or phenotype information.
This work provides a comprehensive picture of virus-derived structural variation in two well-studied global WGS datasets. We found previously missed germline SVs arising from HHV-6 and HERV-K, as well as virus integration in non-germline cells due to natural infection or contamination. The presence of SMRV integrations in LCLs introduces caveats in analyzing these materials and the data derived from them.
The authors wish to acknowledge the resources of the 1000 Genomes Project and the HGDP-CEPH Human Genome Diversity Cell Line Panel. We thank Thomas Sasani, Lynn Jorde, Aaron Quinlan, Julie E. Feusier, and Cody Steely for providing unmapped reads from phs001872, and Mark Lathrop for helpful discussions about LCLs generated by CEPH. The super-computing resources were provided by the Human Genome Center, the Institute of Medical Science, the University of Tokyo (SHIROKANE), and the Office for Information Systems and Cybersecurity, RIKEN (HOKUSAI General Use project G20021).
Citation: Kojima S, Kamada AJ, Parrish NF (2021) Virus-derived variation in diverse human genomes. PLoS Genet 17(4): e1009324.
Editor: Cédric Feschotte, Cornell University, UNITED STATES
Received: January 11, 2021; Accepted: March 25, 2021; Published: April 26, 2021.
Copyright: © 2021 Kojima et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
High-coverage WGS datasets from 1kGP were downloaded from the following URL: 'ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/'. High-coverage WGS datasets from the 1kGP Han Chinese trio were downloaded from the following URL: 'https://www.internationalgenome.org/data-portal/data-collection/structural-variation'. High-coverage WGS datasets from HGDP were downloaded from the following URL: 'ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/data/'. Use of the high-coverage WGS data of multigenerational CEPH/Utah families (phs001872) is authorized by the National Human Genome Research Institute through dbGaP for the following project: "The prevalence, evolution, and health effects of polymorphic endogenous viral elements in human populations."
Funding: This work was supported by JSPS KAKENHI Grant Number: JP21H02972 (NFP). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.