Gene Expression and Metadata Based Identification of Key Genes for Lung Cancer, COPD, and IPF Using Machine Learning and Statistical Models
Mst. Farjana Yasmin, Md. Faruk Hosen, Md. Abul Basar, Anichur Rahman, Mahedi Hasan, Fahmid Al Farid, Hezerul Abdul Karim, Abu Saleh Musa Miah
Abstract
Lung cancer (LC) is one of the most prevalent and deadly cancers globally, presenting a major public health challenge. Patients with chronic obstructive pulmonary disease (COPD) and idiopathic pulmonary fibrosis (IPF) are at a significantly higher risk of developing lung cancer. Despite advances in research, the primary molecular pathways of these disorders remain poorly understood. The current study aimed to identify potential therapeutic genes for LC, COPD, and IPF through machine learning (ML) and bioinformatics methodologies.
Introduction
Globally, lung cancer is the leading cause of cancer-related mortality. Oncogene mutations are generally responsible for the development of lung cancer, as they cause aberrant cell proliferation that leads to the formation of lung tumours. Lung cancer also continues to rank among the cancer types with the highest incidence. Based on histopathological features, it is classified into two types: non-small cell lung cancer and small cell lung cancer. Lung cancers caused by smoking are still common, even though smoking rates are dropping. As of 2023, lung cancer among nonsmokers ranks seventh worldwide in terms of cancer-related fatalities; it predominantly affects Asian and female individuals.
Materials and Methods
The GSE24206, GSE18842, and GSE76925 datasets were obtained from the GEO database [35]. The GSE24206 data series comprises 11 IPF samples and 6 control samples. The IPF samples are lung tissue from patients undergoing lung transplantation or diagnostic surgical biopsy, diagnosed using standard clinical, radiological (HRCT), and histopathological criteria. Lung tissues from healthy transplant donors were used as controls; IPF severity staging data were not uniformly available. The GSE24206 dataset was generated on the GPL570 platform. The GSE18842 collection included a total of 91 non-small cell lung cancer (NSCLC) samples.
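As an illustration only, the cohort description above can be recorded as a small manifest so that sample counts and class balance are checked before any modelling. This is a minimal sketch, not the authors' pipeline: the `Cohort` class and its field names are hypothetical, the COPD label for GSE76925 is inferred from the study's scope rather than stated in this passage, and any value the passage does not give is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cohort:
    """Hypothetical record for one GEO series as described in the text."""
    accession: str
    condition: str
    platform: Optional[str] = None    # None: not stated in this passage
    n_cases: Optional[int] = None     # None: not stated in this passage
    n_controls: Optional[int] = None  # None: not stated in this passage
    n_total: Optional[int] = None     # used when only a total is reported

    def total(self) -> Optional[int]:
        """Return the sample count, or None if it cannot be derived."""
        if self.n_total is not None:
            return self.n_total
        if self.n_cases is not None and self.n_controls is not None:
            return self.n_cases + self.n_controls
        return None

    def case_fraction(self) -> Optional[float]:
        """Return the share of case samples, or None if the split is unknown."""
        if self.n_cases is None or self.n_controls is None:
            return None
        return self.n_cases / (self.n_cases + self.n_controls)


COHORTS = [
    # Counts and platform as given in the text.
    Cohort("GSE24206", "IPF", platform="GPL570", n_cases=11, n_controls=6),
    # The text reports a total of 91 samples; the split is not given here.
    Cohort("GSE18842", "NSCLC", n_total=91),
    # Condition inferred from the study's scope (assumption); no counts given here.
    Cohort("GSE76925", "COPD"),
]

if __name__ == "__main__":
    for cohort in COHORTS:
        print(cohort.accession, cohort.condition, cohort.total(), cohort.case_fraction())
```

Keeping unknown values explicit makes it easy to see which cohort details still need to be filled in from the GEO records, and the case fraction gives a quick view of class imbalance in a cohort such as GSE24206.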
Discussion
This research combines transcriptomic data with machine learning approaches guided by systems biology to decode shared molecular signatures in IPF, COPD, and LC, three closely related diseases that account for a significant proportion of global morbidity and mortality. The convergent findings for four crucial hub genes, ETS1, MSH2, RORA, and PMAIP1, give insights into possible mechanistic links between chronic inflammation, tissue remodelling, and carcinogenesis within the lung microenvironment.
Acknowledgments
The paper is not under consideration at any other journal and has been published only with the proper consent. The authors greatly appreciate all those who participated in this research work.
Citation: Yasmin MF, Hosen MF, Basar MA, Rahman A, Hasan M, Al Farid F, et al. (2026) Gene expression and metadata based identification of key genes for lung cancer, COPD, and IPF using machine learning and statistical models. PLoS One 21(3): e0344666. https://doi.org/10.1371/journal.pone.0344666
Editor: Suyan Tian, The First Hospital of Jilin University, CHINA
Received: October 9, 2025; Accepted: February 24, 2026; Published: March 19, 2026.
Copyright: © 2026 Yasmin et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The datasets are publicly available in the NCBI data repository (accession numbers GSE24206, GSE76925, and GSE18842).
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.