Human Genotype-to-phenotype Predictions: Boosting Accuracy with Nonlinear Models

Aleksandr Medvedev, Satyarth Mishra Sharma, Evgenii Tsatsorin, Elena Nabieva, Dmitry Yarotsky

Abstract
Genotype-to-phenotype prediction is a central problem of human genetics. In recent years, it has become possible to construct complex predictive models for phenotypes, thanks to the availability of large genome data sets as well as efficient and scalable machine learning tools. In this paper, we make a threefold contribution to this problem. First, we ask if state-of-the-art nonlinear predictive models, such as boosted decision trees, can be more efficient for phenotype prediction than conventional linear models.

Introduction
The problem of predicting phenotype from genotype is a “holy grail” of modern genetics, with practical applications in fields such as personalized medicine and genomic selection for agriculture, and is an active area of research. Its relevance has grown with the affordability of genotyping, and will likely continue to increase as sequencing becomes more commonplace.

Methods

Lasso and Snpnet
Most current methods for predicting phenotype from genotype are based on some form of penalized or Bayesian regression. Lasso, which is linear regression with an ℓ1-norm penalty, is well suited to this task: genotype matrices act as good compressed sensors, and the vector of variant effects is sparse for almost any phenotype.
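To make the ℓ1 penalty concrete, the following is a minimal sketch of Lasso fitted by cyclic coordinate descent on simulated data. It is not the snpnet implementation; the function names, the toy genotype matrix, the choice of causal variants and the penalty value `lam=0.1` are all assumptions made for illustration.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator, the proximal map of the l1 penalty."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta, since beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant (centred) column carries no signal
            # Covariance of column j with the residual that excludes variant j.
            rho = sum(
                X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)
            ) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Toy data (assumed): 200 individuals, 30 variants coded 0/1/2,
# of which only variants 0 and 5 affect the phenotype.
rng = random.Random(0)
n, p = 200, 30
X = [[float(rng.choice([0, 1, 2])) for _ in range(p)] for _ in range(n)]
y = [1.5 * row[0] - 2.0 * row[5] + rng.gauss(0.0, 0.1) for row in X]

# Centre the columns and the phenotype so that no intercept is needed.
means = [sum(row[j] for row in X) / n for j in range(p)]
Xc = [[row[j] - means[j] for j in range(p)] for row in X]
y_mean = sum(y) / n
yc = [v - y_mean for v in y]

beta = lasso_coordinate_descent(Xc, yc, lam=0.1)
selected = [j for j, b in enumerate(beta) if abs(b) > 1e-8]
print("variants with nonzero effect:", selected)
```

The soft-thresholding step is what sets most coefficients to exactly zero, which is why the penalty suits a setting where only a small fraction of variants is expected to influence a given phenotype.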

Ensembling and Stacking
We can combine the predictions of several different models to obtain a more accurate predictor, one that mitigates the shortcomings of each individual model by averaging out their individual biases.
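As an illustration of the idea, the sketch below stacks two base predictors with a linear meta-model whose weights are fitted by least squares on held-out predictions. The two simulated predictors, their noise levels, the split sizes and the helper names are assumptions for the example and do not reproduce the models used in the paper.

```python
import random


def mse(pred, y):
    """Mean squared error between predictions and targets."""
    return sum((a - b) ** 2 for a, b in zip(pred, y)) / len(y)


def fit_stacking_weights(p1, p2, y):
    """Least-squares weights (w1, w2) for the meta-model y ~ w1*p1 + w2*p2.

    Solves the 2x2 normal equations in closed form.
    """
    a11 = sum(a * a for a in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1 = sum(a * t for a, t in zip(p1, y))
    b2 = sum(b * t for b, t in zip(p2, y))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


# Toy setup (assumed): two base models whose errors are independent.
rng = random.Random(1)
y = [rng.gauss(0.0, 1.0) for _ in range(400)]
pred_a = [t + rng.gauss(0.0, 0.5) for t in y]
pred_b = [t + rng.gauss(0.0, 0.5) for t in y]

# Fit the meta-model on a validation split and evaluate on a separate test split.
val, test = slice(0, 200), slice(200, 400)
w1, w2 = fit_stacking_weights(pred_a[val], pred_b[val], y[val])
stacked = [w1 * a + w2 * b for a, b in zip(pred_a[test], pred_b[test])]

mse_a = mse(pred_a[test], y[test])
mse_b = mse(pred_b[test], y[test])
mse_stacked = mse(stacked, y[test])
print(f"model A: {mse_a:.3f}  model B: {mse_b:.3f}  stacked: {mse_stacked:.3f}")
```

Fitting the weights on data the base models were not trained on matters in practice; otherwise the meta-model tends to favor whichever base model overfits the most.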

Discussion
We built three groups of models for each phenotype. The first group uses only genetic data, with sex as a covariate (since sex is itself a genetic feature). The second group uses sex, age and the top 10 principal components of the genotype matrix, computed on the training set. The third group is our attempt to make the best possible prediction and to uncover nonlinear dependencies between genotype and environment, and between environment and phenotype.

Acknowledgments
Most computations performed in this project were done on the Zhores cluster [28], and we thank the CDISE HPC team for their assistance. This research has been conducted using the UK Biobank Resource under Application Number ‘43661’.

Citation: Medvedev A, Mishra Sharma S, Tsatsorin E, Nabieva E, Yarotsky D (2022) Human genotype-to-phenotype predictions: Boosting accuracy with nonlinear models. PLoS ONE 17(8): e0273293. https://doi.org/10.1371/journal.pone.0273293

Editor: Zheng Xu, Wright State University, UNITED STATES

Received: May 27, 2021; Accepted: August 4, 2022; Published: August 31, 2022.

Copyright: © 2022 Medvedev et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data used for this research was obtained from the UK Biobank Resource under Application Number “43661.” In accordance with the Biobank’s policies, the authors are unable to make this data available to all readers of PLOS ONE. However, the Biobank welcomes applications from researchers, and anyone wishing to replicate the results can directly apply for access to the Biobank following the procedure outlined at: https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access. The minimal dataset required to reproduce the results includes UKB categories 100315 (participant genotype data), 100000 (physical characteristics, lifestyle information, and self-reported medical history collected at the UKB assessment center), and 100091 (health-related outcomes). One can refer to Table S4 for a complete list of data fields used and their corresponding codes in the UK Biobank.

Funding: EN was supported by Russian Science Foundation grant 21-74-20160. DY was supported by Russian Science Foundation grant 21-11-00373. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.
