Molly A. Hall, John Wallace, Anastasia M. Lucas, Yuki Bradford, Shefali S. Verma, Bertram Müller-Myhsok, Kristin Passero, Jiayan Zhou, John McGuigan, Beibei Jiang, Sarah A. Pendergrass, Yanfei Zhang, Peggy Peissig, Murray Brilliant, Patrick Sleiman, Hakon Hakonarson, John B. Harley, Krzysztof Kiryluk, Kristel Van Steen, Jason H. Moore, Marylyn D. Ritchie
Assumptions are made about the genetic model of single nucleotide polymorphisms (SNPs) when choosing a traditional genetic encoding: additive, dominant, and recessive. Furthermore, SNPs across the genome are unlikely to demonstrate identical genetic models. However, running SNP-SNP interaction analyses with every combination of encodings raises the multiple testing burden. Here, we present a novel and flexible encoding for genetic interactions, the elastic data-driven genetic encoding (EDGE), in which SNPs are assigned a heterozygous value based on the genetic model they demonstrate in a dataset prior to interaction testing. We assessed the power of EDGE to detect genetic interactions using 29 combinations of simulated genetic models and found it outperformed the traditional encoding methods across 10%, 30%, and 50% minor allele frequencies (MAFs).
Choosing between traditional methods for encoding single nucleotide polymorphisms (SNPs) in association studies, including additive, dominant, and recessive, requires making an assumption about the manner in which the coded risk allele acts. In accordance with Mendel’s patterns of inheritance, given referent allele, A, and alternate (or coded risk) allele, a, all encodings assume that the AA (homozygous referent) genotype incurs no risk and the aa (homozygous alternate) genotype bears full risk. As has been described previously [2–4], the assumed heterozygous (Aa) risk, however, varies according to the chosen encoding method. For each encoding, the assumed risk accrued by one copy of the alternate allele (Aa) in relation to two copies (aa) varies: Aa is coded to bear 0%, 50%, or 100% of the risk of aa for recessive, additive, and dominant encodings, respectively.
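The heterozygous values described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `TRADITIONAL_HET_VALUES`, `encode_genotype`, and `encode_snp` are hypothetical, and genotypes are assumed to be given as counts of the alternate allele (0 = AA, 1 = Aa, 2 = aa). The additive encoding is scaled here to 0, 0.5, 1 to match the percentages in the text; the more familiar 0, 1, 2 coding differs only by a constant factor.

```python
# Heterozygous (Aa) value for each traditional encoding, expressed as the
# fraction of homozygous-alternate (aa) risk described in the text.
TRADITIONAL_HET_VALUES = {"recessive": 0.0, "additive": 0.5, "dominant": 1.0}


def encode_genotype(alt_allele_count, het_value):
    """Encode one genotype given as a count of alternate alleles.

    0 (AA) always maps to 0.0 and 2 (aa) always maps to 1.0; only the
    heterozygous genotype 1 (Aa) depends on het_value.
    """
    if alt_allele_count not in (0, 1, 2):
        raise ValueError("alt_allele_count must be 0, 1, or 2")
    return (0.0, float(het_value), 1.0)[alt_allele_count]


def encode_snp(genotypes, het_value):
    """Encode a list of genotypes for one SNP with a single heterozygous value."""
    return [encode_genotype(g, het_value) for g in genotypes]


# Hypothetical sample of five individuals: AA, Aa, aa, Aa, AA.
genotypes = [0, 1, 2, 1, 0]
for name, het in TRADITIONAL_HET_VALUES.items():
    print(name, encode_snp(genotypes, het))
```

Because `het_value` is an ordinary parameter, a per-SNP heterozygous value estimated from data, which is the idea behind EDGE, could be passed in place of the three fixed values. How that value is estimated is not shown in this sketch.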
Materials and methods:
To assess the ability of EDGE to accurately assign heterozygous genotype values and identify SNP-SNP interactions across different genetic models with high power and low type I error, we developed the Biallelic Model Simulator (available at https://www.hall-lab.org/). This script generates two independent, biallelic SNPs in Hardy-Weinberg equilibrium according to given minor allele frequencies (for a further description of the simulation method, see S1 Text).
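As a rough illustration of the genotype-generation step only, the following Python sketch draws two independent biallelic SNPs whose genotype frequencies follow Hardy-Weinberg proportions for given minor allele frequencies. It is not the Biallelic Model Simulator itself: the function names, the `seed` parameter, and the use of the standard-library `random` module are assumptions made for this example, and phenotype simulation is not covered.

```python
import random


def hwe_genotype_frequencies(maf):
    """Return Hardy-Weinberg frequencies [P(AA), P(Aa), P(aa)] for a minor allele frequency."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("maf must be between 0 and 0.5")
    major = 1.0 - maf
    return [major * major, 2.0 * major * maf, maf * maf]


def simulate_snp_pair(n_samples, maf1, maf2, seed=0):
    """Draw two independent SNPs as lists of alternate-allele counts (0, 1, or 2)."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    counts = [0, 1, 2]
    # Each SNP is sampled separately, so the two SNPs are independent by construction.
    snp1 = rng.choices(counts, weights=hwe_genotype_frequencies(maf1), k=n_samples)
    snp2 = rng.choices(counts, weights=hwe_genotype_frequencies(maf2), k=n_samples)
    return snp1, snp2


# Hypothetical usage: 1,000 samples with 10% and 30% minor allele frequencies.
snp1, snp2 = simulate_snp_pair(1000, maf1=0.1, maf2=0.3)
print(snp1[:10], snp2[:10])
```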
Genome-wide genotyping was performed on approximately 55,000 samples (397 of Asian ancestry, 11,109 of African ancestry, 40,243 of European ancestry, 108 of Native American ancestry, and 3,167 of unknown ancestry) across the eMERGE II study sites at the Broad Institute and at the Center for Inherited Disease Research (CIDR) using the Illumina 660W-Quad or 1M-Duo BeadChips.
Candidate replication SNP-SNP interaction analyses were performed on data from the UK Biobank (UKB). The UK Biobank contains genetic and phenotypic data on approximately 500,000 individuals. For genotypic quality control, we removed 35,785 poor-quality samples determined to be outliers for heterozygosity and/or missing rate, as well as individuals found to be related based on a Pi-hat of 0.25.
For all simulated and eMERGE datasets, regression modelling was performed using PLATO software, which applies EDGE, additive, dominant, recessive, or codominant encodings as specified by the user.
Multi-encoding GWAS: In the eMERGE dataset, we performed four GWAS for each phenotype, each employing one of the traditional encodings (i.e., additive-encoded GWAS, dominant-encoded GWAS, recessive-encoded GWAS, and codominant-encoded GWAS) using logistic regression.
For replication analyses in UKB, we extracted SNP-SNP interaction models found to be significant in eMERGE for AMD, T2D, age-related cataract, and hypertension phenotypes. PLATO software was used to run logistic regression for each of these models, again considering each of the additive, codominant, dominant, recessive, and EDGE encodings separately. Each model was adjusted for age, sex, BMI, and principal components (first 10 UKB-generated PCs).
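For orientation, a common form of a covariate-adjusted logistic regression model with a SNP-SNP interaction term is written out below. This form is an assumption for illustration and is not taken from the text: the symbols $\beta_i$ and $\gamma$ are introduced here, and $\mathrm{SNP}_1$ and $\mathrm{SNP}_2$ stand for the encoded genotype values under a single-value encoding (additive, dominant, recessive, or EDGE). Under the codominant encoding each SNP would contribute two indicator terms instead of one, so the expression would have additional main-effect and interaction terms.

```latex
\operatorname{logit} P(Y = 1) =
  \beta_0
  + \beta_1\,\mathrm{SNP}_1
  + \beta_2\,\mathrm{SNP}_2
  + \beta_3\,(\mathrm{SNP}_1 \times \mathrm{SNP}_2)
  + \gamma_{\mathrm{age}}\,\mathrm{age}
  + \gamma_{\mathrm{sex}}\,\mathrm{sex}
  + \gamma_{\mathrm{BMI}}\,\mathrm{BMI}
  + \sum_{k=1}^{10} \gamma_k\,\mathrm{PC}_k
```

In a model of this form, the interaction is assessed through the $\beta_3$ term, while the $\gamma$ terms absorb the adjustment for age, sex, BMI, and the first 10 principal components.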
For over a decade, the additive model has been the most common method for encoding SNPs in regression-based epistasis analyses. In this paper, we aimed to introduce a novel encoding that is flexible enough to detect SNPs with nonadditive allelic architecture, evaluate the different genetic encodings in the context of epistasis, find evidence that some SNPs may demonstrate a nonadditive model, and identify novel SNP-SNP interactions associated with complex disease.
Citation: Hall MA, Wallace J, Lucas AM, Bradford Y, Verma SS, Müller-Myhsok B, et al. (2021) Novel EDGE encoding method enhances ability to identify genetic interactions. PLoS Genet 17(6): e1009534. https://doi.org/10.1371/journal.pgen.1009534
Editor: Heather J. Cordell, Newcastle University, UNITED KINGDOM
Received: May 7, 2020; Accepted: April 6, 2021; Published: June 4, 2021.
Copyright: © 2021 Hall et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The result files for the eMERGE data are within the manuscript and its Supporting Information files.
Funding: The project described was partially supported by NIH grants LM010098 and AI116794 to JHM.
Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: MDR is on the scientific advisory board for Cipherome and Goldfinch Bio. The other co-authors have declared that no competing interests exist.