Improving Population Scale Statistical Phasing with Whole-genome Sequencing Data
Rick Wertenbroek, Robin J. Hofmeister, Ioannis Xenarios, Yann Thoma, Olivier Delaneau
Abstract:
Haplotype estimation, or phasing, has gained significant traction in large-scale projects due to its valuable contributions to population genetics, variant analysis, and the creation of reference panels for imputation and phasing of new samples. To scale with the growing number of samples, haplotype estimation methods designed for population scale rely on highly optimized statistical models to phase genotype data, and usually ignore read-level information. Statistical methods excel at resolving common variants; however, they still struggle with rare variants due to the lack of statistical information.
Introduction:
In the era of biobanks, population-scale sequencing is becoming increasingly common, producing variant calls and comprehensive data sets that depict the genomic landscape across a vast number of samples. To enable analyses at the haplotype level, it is necessary to phase the genotype data produced by sequencing. When dealing with data sets comprising thousands to millions of samples, statistical methods like SHAPEIT5 or Beagle5 are the standard approach. These methods borrow information across many samples in the population in order to produce precise haplotype estimates for common variants. Nevertheless, they tend to exhibit higher error rates at rare variants, for which the population provides only limited information.
Methods:
Phase polishing with the SAPPHIRE method proceeds as follows. First, heterozygous genotypes are extracted from all samples. Then, for each pair of heterozygous genotypes covered by overlapping sequencing reads, the phase is verified against the reads. Common variants serve as the reference for this verification, so rare variants are checked against common ones. If the sequencing reads clearly show a reversed phase, SAPPHIRE corrects it and reports the number of reads supporting the phase. For variants left unchanged, the supporting read count is also reported as a measure of confidence in the phase call.
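The read-based check described above can be illustrated with a minimal Python sketch. This is not the SAPPHIRE implementation: the `Het` record, the `polish` function, the representation of reads as position-to-allele dictionaries, and the decision rule (flip only when at least `min_reads` votes contradict the statistical phase and none support it) are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Het:
    """One heterozygous genotype of a sample (hypothetical record)."""
    pos: int
    hap0_allele: int  # allele on haplotype 0 (0 = ref, 1 = alt); haplotype 1 carries the other
    is_rare: bool

def polish(hets, reads, min_reads=2):
    """Check each rare het against common hets seen on the same read.

    reads: list of dicts mapping position -> observed allele (0 or 1).
    Returns {rare position: (was_flipped, supporting_read_votes)}.
    The flipping rule is an assumed stand-in for "reads clearly show
    a reversed phase".
    """
    commons = [h for h in hets if not h.is_rare]
    report = {}
    for rare in (h for h in hets if h.is_rare):
        support = against = 0
        for read in reads:
            if rare.pos not in read:
                continue
            for common in commons:
                if common.pos not in read:
                    continue
                # Haplotype the read would belong to according to each site.
                hap_common = 0 if read[common.pos] == common.hap0_allele else 1
                hap_rare = 0 if read[rare.pos] == rare.hap0_allele else 1
                # One vote per (read, common variant) pair.
                if hap_common == hap_rare:
                    support += 1
                else:
                    against += 1
        if against >= min_reads and support == 0:
            rare.hap0_allele = 1 - rare.hap0_allele  # reverse the phase
            report[rare.pos] = (True, against)
        else:
            report[rare.pos] = (False, support)
    return report

# Toy data: the rare variant at 150 is statistically phased opposite to what
# the reads show; the rare variant at 320 agrees with its single read.
hets = [Het(100, 0, False), Het(150, 1, True), Het(300, 0, False), Het(320, 0, True)]
reads = [{100: 1, 150: 1}, {100: 1, 150: 1}, {100: 0, 150: 0}, {300: 0, 320: 0}]
print(polish(hets, reads))
```

In this toy input, three reads place the alternate alleles at positions 100 and 150 on the same haplotype, contradicting the initial phase, so the rare variant is flipped. The variant at 320 is left unchanged and its single supporting read is reported.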
Discussion:
We present SAPPHIRE, a novel approach to improve the phasing accuracy at rare variants within population-scale data sets initially phased with statistical methods. We show that we can reduce switch error rates for low-frequency variants, with the most notable impact at extremely rare variants and singletons. Our method also delimits the subset of phased alleles that have been validated by sequencing reads, which makes it possible to distinguish them from variants phased only statistically.
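For readers unfamiliar with the metric, the following minimal sketch shows one common way to compute a switch error rate for a single sample: the fraction of consecutive heterozygous site pairs whose relative phase differs between an estimate and a reference phasing. The function name and input encoding are assumptions for illustration, and the evaluation in the paper (for example its source of reference haplotypes and its stratification by variant frequency) may differ in detail.

```python
def switch_error_rate(truth, estimate):
    """Fraction of consecutive het-site pairs with a phase switch.

    truth, estimate: allele carried by haplotype 0 at each heterozygous
    site (0 = ref, 1 = alt), with sites in the same genomic order.
    """
    if len(truth) != len(estimate):
        raise ValueError("truth and estimate must cover the same sites")
    if len(truth) < 2:
        return 0.0  # no consecutive pair to evaluate
    # True where the estimated orientation matches the reference at a site.
    same = [t == e for t, e in zip(truth, estimate)]
    # A switch occurs whenever the agreement status changes between neighbors.
    switches = sum(a != b for a, b in zip(same, same[1:]))
    return switches / (len(truth) - 1)

truth = [0, 1, 0, 0, 1]
estimate = [0, 1, 1, 1, 1]
print(switch_error_rate(truth, estimate))  # 0.5
```

Here the estimate switches orientation between the second and third sites and switches back between the fourth and fifth, giving two switches over four consecutive pairs.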
Acknowledgments:
The benchmarks on the UK Biobank data have been conducted using the UK Biobank resource under application number 66995.
Citation: Wertenbroek R, Hofmeister RJ, Xenarios I, Thoma Y, Delaneau O (2024) Improving population scale statistical phasing with whole-genome sequencing data. PLoS Genet 20(7): e1011092. https://doi.org/10.1371/journal.pgen.1011092
Editor: Yun Li, University of North Carolina at Chapel Hill, UNITED STATES OF AMERICA
Received: December 7, 2023; Accepted: June 11, 2024; Published: July 3, 2024.
Copyright: © 2024 Wertenbroek et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The UK Biobank data set is a combination of Imaging, Genetics, Health linkages, Biomarkers, Activity monitors, Online questionnaires, and Samples from 500,000 participants, each with their own data sharing policies and restrictions. These restrictions do not allow the data set to be shared in a completely unrestricted way. However, readers may apply for access at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access. The data and scripts to generate the figures of this paper are available in the supporting information files S1–S9 Data. The software developed for the SAPPHIRE method is open-source and available at https://github.com/rwk-unil/sapphire.
Funding: O.D. was supported by SNF grant number SNSF-PP00P3_176977 (https://www.snf.ch/en). R.W. and Y.T. are supported by HEIG-VD. The funders did not play any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.