Nafis Irtiza Tripto, Mohimenul Kabir, Md. Shamsuzzoha Bayzid, Atif Rahman
Time series gene expression data is widely used to study different dynamic biological processes. Although gene expression datasets share many of the characteristics of time series data from other domains, most of the analyses in this field do not fully leverage the time-ordered nature of the data and focus on clustering the genes based on their expression values. Other domains, such as financial stock and weather prediction, utilize time series data for forecasting purposes. Moreover, many studies have been conducted to classify generic time series data based on trend, seasonality, and other patterns. Therefore, an assessment of these approaches on gene expression data would be of great interest to evaluate their adequacy in this domain. Here, we perform a comprehensive evaluation of different traditional unsupervised and supervised machine learning approaches as well as deep learning based techniques for time series gene expression classification and forecasting on five real datasets. In addition, we propose deep learning based methods for both classification and forecasting, and compare their performances with the state-of-the-art methods. We find that deep learning based methods generally outperform traditional approaches for time series classification. Experiments also suggest that supervised classification on gene expression is more effective than clustering when labels are available. In time series gene expression forecasting, we observe that an autoregressive statistical approach has the best performance for short term forecasting, whereas deep learning based methods are better suited for long term forecasting.
Microarray time series gene expression experiments have essential applications in studying cell cycle development, immune response, and other biological processes. Monitoring the change in gene expression patterns over time provides opportunities to study mechanistic characteristics of various cellular processes. The Stanford Microarray Database (SMD) stores raw and normalized data from microarray experiments and provides web interfaces for researchers to retrieve, analyze, and visualize their data. Analysis of time series gene expression data is significant for many applications, including genetic interaction and knockout screens, understanding development, cellular response to drug treatment, tumorigenesis, infection or disease identification, and identifying correlated genes. However, existing studies mostly use gene expression values for clustering gene profiles and rarely address tasks such as classification, forecasting, or anomaly detection.
Materials & Methods
In this section, we provide a detailed description of the classification and forecasting approaches we have evaluated. We also discuss the datasets used in our study and describe the evaluation criteria for performance analysis.
In this study, we have proposed and investigated different classification and forecasting methods for time series gene expression data. To verify the efficiency and effectiveness of these methods, we conducted an extensive experimental study on five real gene expression datasets and compared state-of-the-art techniques with the methods proposed in this paper. We find that the CNN based architecture presented here generally outperforms other methods for gene expression time series classification, whereas ARIMA and ANN are best suited for forecasting purposes.
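To make the idea of autoregressive forecasting concrete, the following is a minimal sketch of a first-order autoregressive model, AR(1), fitted to a single gene's expression profile by ordinary least squares. It is an illustration only and not the ARIMA configuration evaluated in this study: AR(1) omits the differencing and moving-average components of ARIMA, and the function names `fit_ar1` and `forecast_ar1` as well as the example values are hypothetical.

```python
from typing import List, Tuple


def fit_ar1(series: List[float]) -> Tuple[float, float]:
    """Fit x[t] = c + phi * x[t-1] by ordinary least squares.

    Hypothetical helper for illustration; returns (c, phi).
    """
    if len(series) < 3:
        raise ValueError("need at least 3 time points to fit AR(1)")
    x = series[:-1]  # lagged values x[t-1]
    y = series[1:]   # current values x[t]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    if var_x == 0:
        # Constant lagged values: fall back to predicting the mean.
        return mean_y, 0.0
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = cov_xy / var_x
    c = mean_y - phi * mean_x
    return c, phi


def forecast_ar1(series: List[float], steps: int) -> List[float]:
    """Forecast `steps` future values by iterating the fitted AR(1) model."""
    c, phi = fit_ar1(series)
    preds = []
    last = series[-1]
    for _ in range(steps):
        # Each forecast feeds into the next one, so errors compound
        # as the forecasting horizon grows.
        last = c + phi * last
        preds.append(last)
    return preds


# Made-up expression profile of one gene over five time points.
profile = [1.0, 0.5, 0.25, 0.125, 0.0625]
print(forecast_ar1(profile, steps=2))
```

Because each predicted value is fed back in as input for the next step, forecast errors accumulate with the horizon, which is one plausible reason a simple autoregressive model fits short term forecasting better than long term forecasting.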
Citation: Tripto NI, Kabir M, Bayzid MS, Rahman A (2020) Evaluation of classification and forecasting methods on time series gene expression data. PLoS ONE 15(11): e0241686. https://doi.org/10.1371/journal.pone.0241686
Editor: Tao Song, Polytechnical Universidad de Madrid, SPAIN
Received: May 24, 2020; Accepted: October 20, 2020; Published: November 6, 2020
Copyright: © 2020 Tripto et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
All datasets were collected from the following sources: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3406 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1723 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6186 http://genomics.stanford.edu/.
The authors received no specific funding for this work.
The authors have declared that no competing interests exist.