Xiaomei Li, Lin Liu, Gregory J. Goodall, Andreas Schreiber, Taosheng Xu, Jiuyong Li, Thuc D. Le
Breast cancer prognosis is challenging due to the heterogeneity of the disease. Various computational methods using bulk RNA-seq data have been proposed for breast cancer prognosis. However, these methods suffer from limited performance or ambiguous biological relevance because they neglect intra-tumor heterogeneity. Recently, single-cell RNA sequencing (scRNA-seq) has emerged for studying tumor heterogeneity at the cellular level. In this paper, we propose a novel method, scPrognosis, to improve breast cancer prognosis with scRNA-seq data. scPrognosis uses scRNA-seq data of the biological process Epithelial-to-Mesenchymal Transition (EMT). It first infers the EMT pseudotime and a dynamic gene co-expression network, then uses an integrative model to select genes important in EMT based on their expression variation and differentiation across the stages of EMT and on their roles in the dynamic gene co-expression network. To validate the selected signatures and apply them to breast cancer prognosis, we use them as features to build a prediction model with bulk RNA-seq data. The experimental results show that scPrognosis outperforms other benchmark breast cancer prognosis methods that use bulk RNA-seq data. Moreover, the dynamic changes in the expression of the selected signature genes in EMT may provide clues to the link between EMT and clinical outcomes of breast cancer. scPrognosis will also be useful when applied to scRNA-seq datasets of biological processes other than EMT.
Cancer prognosis plays an important role in clinical decision making. Traditionally, cancer prognosis is based on several clinical and pathological variables such as tumor size, lymph node status, and histological grade. However, these clinicopathological factors are insufficient for cancer prognosis because cancer is heterogeneous at the molecular (e.g., gene) level. Hence, recent clinical guidelines have highlighted the importance of using multi-gene tests to select patients who should receive adjuvant therapies. The multiple genes in these tests are known as cancer signatures, which are crucial to cancer prognosis. Cancer signatures can be identified by in vivo biological experiments. For example, the LM method analyzed transcriptomics in cell lines and chose 54 genes associated with lung metastagenicity and virulence. However, these experiments cannot be done on human beings, and experiments on animals do not guarantee that the same conclusions can be drawn for humans. Therefore, computational methods are needed to identify cancer signatures from existing data, including gene expression data and clinical data.
Materials and methods
Overview of scPrognosis
scPrognosis contains five steps, as depicted in Fig 1. In step 1, MAGIC and a gene filter are used to pre-process the noisy and high-dimensional scRNA-seq data. In step 2, the EMT pseudotime, pseudotime-series gene expression data, and a dynamic gene co-expression network are inferred from the scRNA-seq data. First, the VIM gene expression level and the pseudotemporal trajectory estimated by the Wanderlust algorithm are used to identify the EMT pseudotime for all cells in the scRNA-seq dataset. The EMT pseudotime describes the gradual transition of the single-cell transcriptome during the EMT process and helps to study gene expression dynamics in different EMT stages. Second, pseudotime-series gene expression data are obtained by ordering the cells in the scRNA-seq dataset from the epithelial stage to the mesenchymal stage according to the EMT pseudotime. Third, a dynamic gene co-expression network is constructed from the ordered scRNA-seq data using the LEAP R package. In step 3, based on the ordered scRNA-seq data, three methods are adopted to obtain different gene ranking measures: Median Absolute Deviation (MAD), switchde, and Google PageRank. MAD and switchde compute the importance of genes from their expression levels, whereas Google PageRank ranks genes by their roles in the dynamic gene co-expression network. In step 4, we integrate the three rankings obtained in step 3 to prioritize genes. In step 5, the top N ranked genes are selected as signatures to predict the survival outcomes of breast cancer patients in bulk RNA-seq data. Details of each step are described in the following sub-sections.
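As a rough illustration of steps 3 to 5, the sketch below computes MAD per gene, runs a plain power-iteration PageRank on a small weighted co-expression graph, and combines the two measures to pick the top N genes. It is not the scPrognosis implementation. The toy expression values, the edge weights, the function names (`mad`, `pagerank`, `integrate`, `top_genes`), the min-max normalisation, and the equal-weight sum are all assumptions made for this example. switchde is not reproduced here; its per-gene score could be passed to `integrate` as a third dictionary in the same way.

```python
from statistics import median


def mad(values):
    """Median absolute deviation of one gene's expression across ordered cells."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


def pagerank(edges, nodes, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected, weighted gene graph."""
    weight = {n: {} for n in nodes}
    for a, b, w in edges:
        weight[a][b] = weight[a].get(b, 0.0) + w
        weight[b][a] = weight[b].get(a, 0.0) + w
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Score held by isolated genes is spread evenly so the total stays 1.
        dangling = sum(score[n] for n in nodes if not weight[n])
        updated = {}
        for n in nodes:
            inflow = sum(score[m] * w / sum(weight[m].values())
                         for m, w in weight[n].items())
            updated[n] = ((1 - damping) / len(nodes)
                          + damping * (inflow + dangling / len(nodes)))
        score = updated
    return score


def min_max(scores):
    """Rescale a {gene: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {g: (s - lo) / span for g, s in scores.items()}


def integrate(measures, weights):
    """Weighted sum of normalised measures (an assumed integration rule)."""
    normalised = [min_max(m) for m in measures]
    return {g: sum(w * m[g] for w, m in zip(weights, normalised))
            for g in normalised[0]}


def top_genes(combined, n):
    """Return the n genes with the highest combined score."""
    return sorted(combined, key=combined.get, reverse=True)[:n]


# Toy data: cells are already ordered by pseudotime (epithelial to mesenchymal).
expr = {
    "VIM": [0.1, 0.5, 2.0, 4.0, 6.0],
    "CDH1": [5.0, 4.0, 2.5, 1.0, 0.2],
    "ACTB": [3.0, 3.1, 2.9, 3.0, 3.1],
}
edges = [("VIM", "CDH1", 0.9), ("VIM", "ACTB", 0.1)]  # hypothetical weights

mad_score = {gene: mad(values) for gene, values in expr.items()}
pr_score = pagerank(edges, list(expr))
combined = integrate([mad_score, pr_score], weights=[0.5, 0.5])
signature = top_genes(combined, n=2)
print(signature)
```

In this toy setting the gene with the largest expression variation along the pseudotime, which is also the hub of the graph, ranks first, while the near-constant gene ranks last. With real data, the weights given to each measure and the value of N would need to be chosen and validated, for example against survival outcomes in bulk RNA-seq cohorts.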
Discussion and conclusion
Breast cancer is a complex disease caused by intricate genetic and molecular alterations. Thus, traditional clinicopathological factors are not sufficient for the accurate prognosis of breast cancer. Recently, a wide range of computational methods have been proposed to identify multi-gene signatures for breast cancer prognosis, and some of these methods have been approved for commercial use, including PAM50, Mamma, and RS test. These methods have led to a revolution in the breast cancer treatment paradigm. However, this progress in cancer prognosis has not been enough to overcome therapy resistance in breast cancer under current cancer therapeutics. Some tumor cells acquire resistance to targeted cancer therapy, which leads to worse survival of cancer patients. scRNA-seq can reveal genes that affect cell fate decisions by monitoring the expression of genes in different cell states and sub-populations. In this paper, we use scRNA-seq data to detect signatures related to EMT that affect the clinical outcomes of breast cancer patients.
Citation: Li X, Liu L, Goodall GJ, Schreiber A, Xu T, Li J, et al. (2020) A novel single-cell based method for breast cancer prognosis. PLoS Comput Biol 16(8): e1008133. https://doi.org/10.1371/journal.pcbi.1008133
Editor: Ilya Ioshikhes, University of Ottawa, CANADA
Received: March 30, 2020; Accepted: July 9, 2020; Published: August 24, 2020
Copyright: © 2020 Li et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All the datasets used in this paper are available
Funding: TDL was supported by the ARC DECRA Grant (Grant Number: DE200100200). TX was supported by the National Natural Science Foundation of China (Grant Number: 61902372), Natural Science Foundation of Anhui Province, China (Grant Number: 2008085QF292), and Presidential Foundation of Hefei Institutes of Physical Science, Chinese Academy of Sciences (Grant Number: YZJJ2018QN24). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. This study makes use of data generated by the Molecular Taxonomy of Breast Cancer International Consortium. Funding for the project was provided by Cancer Research UK and the British Columbia Cancer Agency Branch. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.