Functional Regression Clustering with Multiple Functional Gene Expressions

Susana Conde, Shahin Tavakoli, Daphne Ezer.

Abstract

Gene expression data is often collected in time series experiments, under different experimental conditions. Some genes have very different expression profiles over time, yet adjust their expression patterns in the same way in response to the experimental conditions. Our aim is to develop a method that finds clusters of genes in which the relationships between these temporal gene expression profiles are similar to one another, even if the individual temporal gene expression profiles differ. We propose a K-means-type algorithm in which each cluster is defined by a function-on-function regression model, which, inter alia, allows for multiple functional explanatory variables.

Introduction

Next-generation sequencing technology (specifically RNA-sequencing or RNA-seq) allows researchers to accurately measure gene expression for all genes in a biological sample. Until recently, it was prohibitively expensive to perform RNA-seq experiments at more than a few time points at once. RNA-seq is now widespread and affordable enough to be used to investigate time-sensitive biological processes, such as the response to environmental stimuli or the organism’s internal clock.

Methods

Fitting functional linear models (FLMs) involves solving ill-posed inverse problems and therefore requires some form of regularization, which is generally achieved by projecting the functional observations onto a finite number of functional principal components. This projection is usually performed either after transforming the discretely observed functional data into curves, or simultaneously with that transformation.
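The regularization by projection described above can be illustrated with a minimal sketch. It assumes that the predictor and response curves have already been evaluated on common grids (so no smoothing step is shown), that there is a single functional predictor, and that a plain SVD is an acceptable way to obtain the principal components. The names `fpc_regression`, `fpc_predict` and `n_components` are hypothetical and are not taken from the paper.

```python
import numpy as np

def fpc_regression(X, Y, n_components):
    """Truncated principal component regression for discretised curves.

    X, Y: arrays of shape (n_curves, n_gridpoints), one curve per row.
    The grid spacing of the integral is absorbed into the returned kernel.
    """
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    # Rows of Vt are the discretised principal components of the X curves.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:n_components].T               # (n_grid_x, q)
    scores = Xc @ basis                       # (n_curves, q)
    # Regress the centred responses on the q leading scores only.
    coef, *_ = np.linalg.lstsq(scores, Yc, rcond=None)
    beta = basis @ coef                       # (n_grid_x, n_grid_y)
    return {"x_mean": x_mean, "y_mean": y_mean, "beta": beta}

def fpc_predict(model, X):
    """Predict response curves for new predictor curves."""
    return model["y_mean"] + (X - model["x_mean"]) @ model["beta"]

# Synthetic illustration: X curves live in a two-dimensional space.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50)
a = rng.normal(size=(40, 2))
X = a[:, [0]] * np.sin(2 * np.pi * grid) + a[:, [1]] * np.cos(2 * np.pi * grid)
Y = (2 * a[:, [0]] * np.cos(2 * np.pi * grid)
     - a[:, [1]] * np.sin(2 * np.pi * grid)
     + 0.01 * rng.normal(size=(40, 50)))

model = fpc_regression(X, Y, n_components=2)
rmse = np.sqrt(np.mean((fpc_predict(model, X) - Y) ** 2))
print(f"in-sample RMSE: {rmse:.4f}")
```

Keeping only the leading components is what stabilises the fit: without truncation, the least-squares problem on the full grid would be underdetermined whenever there are fewer curves than grid points.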

Discussion

Functional mixture models make it possible to study relationships between explanatory and response variables over time, allowing for clusters characterized by these relationships. One existing clustering method for mixture regression involves a penalized likelihood, where the penalty is the total entropy. FRECL is an alternative to these settings that does not need to consider a constrained optimization problem, and consequently does not need to estimate the value of the Lagrange multiplier.
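To make the idea of clusters characterized by regression relationships concrete, the following is a minimal sketch of a generic K-means-type clusterwise regression with hard assignments. It is not the FRECL algorithm itself: it assumes each gene's curves have already been reduced to a finite-dimensional design matrix and response vector (for instance through principal component scores), and the function name `regression_kmeans` and its parameters are hypothetical.

```python
import numpy as np

def regression_kmeans(X, Y, n_clusters, n_iter=50, seed=0):
    """Alternate between refitting one regression per cluster and
    reassigning each gene to the cluster whose regression fits it best.

    X: (n_genes, n_obs, n_features) explanatory values per gene.
    Y: (n_genes, n_obs) responses per gene.
    """
    rng = np.random.default_rng(seed)
    n_genes, _, n_features = X.shape
    labels = rng.integers(n_clusters, size=n_genes)  # random initial partition
    coefs = np.zeros((n_clusters, n_features))
    for _ in range(n_iter):
        # Update step: pooled least squares within each non-empty cluster.
        for k in range(n_clusters):
            members = labels == k
            if members.any():
                A = X[members].reshape(-1, n_features)
                b = Y[members].reshape(-1)
                coefs[k], *_ = np.linalg.lstsq(A, b, rcond=None)
        # Assignment step: residual sum of squares of every gene under
        # every cluster's regression, then pick the smallest.
        fitted = np.einsum("gof,kf->gko", X, coefs)
        sse = ((Y[:, None, :] - fitted) ** 2).sum(axis=2)
        new_labels = sse.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, coefs

# Synthetic illustration with two groups of genes that share a relationship
# between X and Y within each group, although their X values all differ.
rng = np.random.default_rng(1)
true_labels = np.repeat([0, 1], 15)
true_coefs = np.array([[2.0, -1.0], [-2.0, 1.0]])
X = rng.normal(size=(30, 20, 2))
Y = (np.einsum("gof,gf->go", X, true_coefs[true_labels])
     + 0.05 * rng.normal(size=(30, 20)))

labels, coefs = regression_kmeans(X, Y, n_clusters=2)
print(labels)
print(coefs)
```

Because the assignments are hard, each iteration only requires ordinary least-squares fits and no multiplier has to be tuned. As with any K-means-type scheme, the result can depend on the initial partition, so running several random restarts and keeping the solution with the smallest total residual sum of squares is advisable.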

Acknowledgments

We would like to thank Ioannis Kosmidis for helpful discussions and the Isaac Newton Institute for Mathematical Sciences, Cambridge, for support and hospitality during the programme Statistical Scalability where work on this paper was undertaken.

Citation: Conde S, Tavakoli S, Ezer D (2024) Functional regression clustering with multiple functional gene expressions. PLoS ONE 19(11): e0310991. https://doi.org/10.1371/journal.pone.0310991

Editor: Ruofei Du, University of Arkansas for Medical Sciences, UNITED STATES OF AMERICA

Received: March 1, 2024; Accepted: August 26, 2024; Published: November 25, 2024.

Copyright: © 2024 Conde et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The original data in our manuscript was previously used in: Nagano AJ, Kawagoe T, Sugisaka J, Honjo MN, Iwayama K, Kudoh H. Annual transcriptome dynamics in natural environments reveals plant seasonal adaptation. Nature Plants. 2019;5:74–83. "The sequence data that support the findings of this study are available in the DDBJ Short Read Archive repository, with the accession numbers DRA005871, DRA005872, DRA005873, DRA005874, DRA005875 and DRA005876, which are all available at https://www.ncbi.nlm.nih.gov/bioproject/PRJDB5830. Database of detailed results of individual genes is at http://sohi.ecology.kyoto-u.ac.jp/AhgRNAseq/".

Funding: This project was funded by the Alan Turing Institute Research Fellowship under EPSRC Research grant (TU/A/000017) to DE; Biotechnology and Biological Sciences Research Council (BBSRC) and Engineering and Physical Sciences Research Council (EPSRC). EPSRC/BBSRC Innovation Fellowship (EP/S001360/1) to DE and SC. ST would like to thank the Isaac Newton Institute for Mathematical Sciences, Cambridge, for support and hospitality during the programme Statistical Scalability where work on this paper was undertaken. This work was supported by EPSRC grant no EP/R014604/1. Engineering and Physical Sciences Research Council (EPSRC): https://www.ukri.org/councils/epsrc/; Alan Turing Institute: https://www.turing.ac.uk/; Biotechnology and Biological Sciences Research Council (BBSRC): https://www.ukri.org/councils/bbsrc/; Isaac Newton Institute for Mathematical Sciences: https://www.newton.ac.uk/. The funders did not play any role in the study design, data collection and analysis, decision to publish or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.