Prioritizing Disease-related Rare Variants by Integrating Gene Expression Data

Hanmin Guo, Alexander Eckehart Urban, Wing Hung Wong

Abstract

Rare variants, which comprise the vast majority of human genetic variation, are likely to have a more deleterious impact in the context of human disease than common variants. Here we present carrier statistic, a statistical framework for prioritizing disease-related rare variants by integrating gene expression data. By quantifying the impact of rare variants on gene expression, carrier statistic prioritizes rare variants that have large functional consequences in patients. Through simulation studies and analysis of a real multi-omics dataset, we demonstrate that carrier statistic is applicable to studies with limited sample size (a few hundred) and achieves substantially higher sensitivity than existing rare-variant association methods.

Introduction

Rare variants (minor allele frequency (MAF) < 1%) constitute the vast majority of human genetic variation. They are on average more deleterious than common variants, and thus undergo stronger selection and remain at low frequency in the general population. By analyzing large cohorts with whole genome sequencing (WGS) and whole exome sequencing (WES) data, researchers have identified some rare variant-trait associations and have shown that rare variants account for a large proportion of the missing heritability that cannot be explained by common variants.

Methods

For each rare variant-gene pair (where the variant is located within an exon of the gene), we used the expression of that gene in the noncarriers of the rare variant as the null distribution, computed a z-score for each carrier, and then averaged the z-scores over all carriers of that variant.
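The computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `carrier_statistic`, the use of the sample standard deviation (`ddof=1`), and the assumption that expression values are already normalized are all choices made here for the example.

```python
import numpy as np

def carrier_statistic(expression, is_carrier):
    """Mean z-score of carriers for one variant-gene pair (illustrative sketch).

    expression : expression of the gene, one value per individual
                 (assumed to be already normalized).
    is_carrier : boolean flags, True where the individual carries the variant.
    """
    expression = np.asarray(expression, dtype=float)
    is_carrier = np.asarray(is_carrier, dtype=bool)

    # Noncarriers define the null distribution of the gene's expression.
    null = expression[~is_carrier]
    if not is_carrier.any() or null.size < 2:
        raise ValueError("need at least one carrier and two noncarriers")

    mu = null.mean()
    sigma = null.std(ddof=1)  # assumption: sample standard deviation
    if sigma == 0:
        raise ValueError("noncarrier expression has zero variance")

    # z-score of each carrier against the null, then average over carriers.
    z_scores = (expression[is_carrier] - mu) / sigma
    return z_scores.mean()

# Toy example: five noncarriers and one carrier with elevated expression.
stat = carrier_statistic([1, 2, 3, 4, 5, 10],
                         [False, False, False, False, False, True])
print(round(stat, 3))
```

A large absolute value indicates that the carriers' expression deviates strongly from that of the noncarriers, in either direction.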

The carrier statistic was computed separately within the case group and the control group. Rare variants were defined as SNVs and short indels whose allele count was no larger than 5 within the case group or within the control group. Because this definition is applied within each group, the set of rare variants, and thus the number of carrier statistics, differs between the two groups.
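The group-wise definition of rare variants can be illustrated with a short Python sketch. It assumes a hypothetical genotype matrix of alternate-allele counts (0, 1, or 2) with individuals as rows and variants as columns, and no missing genotypes. The function names and the requirement of at least one carrier in the group are assumptions made for this example, not details taken from the text.

```python
import numpy as np

def rare_variant_mask(genotypes, max_allele_count=5):
    """Flag variants with 1 <= allele count <= max_allele_count in one group."""
    genotypes = np.asarray(genotypes)
    allele_count = genotypes.sum(axis=0)  # alternate alleles per variant
    # Assumption: a variant with no carriers in the group is skipped,
    # since no carrier statistic can be computed for it.
    return (allele_count >= 1) & (allele_count <= max_allele_count)

def rare_variants_by_group(genotypes, is_case, max_allele_count=5):
    """Apply the allele-count threshold separately to cases and controls."""
    genotypes = np.asarray(genotypes)
    is_case = np.asarray(is_case, dtype=bool)
    return {
        "case": rare_variant_mask(genotypes[is_case], max_allele_count),
        "control": rare_variant_mask(genotypes[~is_case], max_allele_count),
    }

# Toy example: 4 individuals (2 cases, 2 controls), 3 variants.
G = np.array([[1, 0, 2],
              [0, 0, 2],
              [0, 1, 2],
              [0, 0, 2]])
masks = rare_variants_by_group(G, [True, True, False, False],
                               max_allele_count=2)
print(masks["case"], masks["control"])
```

In this toy example the first variant qualifies only among the cases and the second only among the controls, which shows why the two groups can end up with different sets of carrier statistics.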

Discussion

We present carrier statistic, a statistical framework for multi-omics data analysis that prioritizes disease-related rare variants and the genes they regulate. Through simulations and analyses of real multi-omics datasets, we demonstrated that carrier statistic overcomes sample size limitations and achieves substantial gains in statistical power compared with existing variant-collapsing methods. The superior performance of carrier statistic can be attributed to the incorporation of functional gene expression data, which allows quantitative measurement of the impact of rare variants, an impact that cannot be determined from the DNA sequence alone.

Acknowledgments

We thank Dr. Bo Zhou and Dr. Hua Tang for helpful discussions and feedback.

Citation: Guo H, Urban AE, Wong WH (2024) Prioritizing disease-related rare variants by integrating gene expression data. PLoS Genet 20(9): e1011412. https://doi.org/10.1371/journal.pgen.1011412

Editor: Hongyu Zhao, Yale, UNITED STATES OF AMERICA

Received: May 16, 2024; Accepted: August 29, 2024; Published: September 30, 2024.

Copyright: © 2024 Guo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The carrier statistic tool is available at https://github.com/SUwonglab/carrier-stat. The WGS data in the Whole Genome Sequence Harmonization Study (https://www.synapse.org/#!Synapse:syn22264775) and the RNA-seq data in the RNAseq Harmonization Study (https://www.synapse.org/#!Synapse:syn21241740) are publicly available on the AD Knowledge Portal platform through completion of a data use certificate. The gnomAD v2.1.1 data consisting of 125,748 exomes were downloaded from https://gnomad.broadinstitute.org/. The gene expression data (gene read counts) from GTEx project version 8 were downloaded from the GTEx Portal, https://gtexportal.org/home/downloads/adult-gtex/bulk_tissue_expression. Software: SAIGE-GENE+, https://saigegit.github.io/SAIGE-doc/; coloc, https://chr1swallace.github.io/coloc/; SKAT-O, https://cran.r-project.org/web/packages/SKAT/.

Funding: This work was partially supported by the NIH (R01 HG010359 to W.H.W.; R01 MH116529 to A.E.U.; and P50 HG007735 to W.H.W. and A.E.U.) and the NSF DMS (2310788 to W.H.W.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.