Identification of Potential Biomarkers for Lung Cancer Using Integrated Bioinformatics and Machine Learning Approaches

Md Symun Rabby, Md Merajul Islam, Sujit Kumar, Md Maniruzzaman, Md Al Mehedi Hasan, Yoichi Tomioka, Jungpil Shin

Abstract

Lung cancer is one of the most common cancers and the leading cause of cancer-related death worldwide. Early detection of lung cancer can help reduce the death rate, so the identification of potential biomarkers is crucial. This study aimed to identify potential biomarkers for lung cancer by integrating bioinformatics analysis and machine learning (ML)-based approaches. Data were normalized using the robust multiarray average method, and batch effects were corrected using the ComBat method.

Introduction

Lung cancer is one of the most common cancers, and its prevalence and mortality rate have increased rapidly worldwide. It is the leading cause of cancer-related death in both sexes. Around 2.2 million new cases of lung cancer are diagnosed each year, and approximately 1.8 million people die from the disease worldwide. There are two main subtypes of lung cancer: small-cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for around 85% of cases and is also the most malignant carcinoma among men and women.

Materials and Methods

The overall workflow adopted for this study is presented in Fig 1. We used Gene Expression Omnibus (GEO) datasets derived from the USA and Taiwan cohorts. The training datasets were used to determine the core genes for each NSCLC cohort, and their performance was validated using the test set. First, we combined the training datasets for each cohort and normalized them using robust multi-array average (RMA), followed by batch effect correction with the ComBat method. After that, we determined the differentially expressed genes (DEGs) using linear models for microarray data (LIMMA) and identified carcinoma-associated DEGs using the Enrichr web tool for each cohort.
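The normalization, batch correction, and DEG steps can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: quantile normalization is only one component of RMA, per-batch mean centering is a simplified stand-in for ComBat, and a Welch t-test with Benjamini-Hochberg adjustment replaces the moderated statistics of LIMMA. The function names and the cutoffs (absolute log2 fold change of at least 1, adjusted p-value below 0.05) are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats


def quantile_normalize(expr):
    """Give every sample (column) the same distribution.

    expr is a genes x samples array on the log2 scale. This is only the
    quantile step of RMA; background correction and probe summarization
    are not shown. Ties are broken arbitrarily.
    """
    order = np.argsort(expr, axis=0)
    ranks = np.argsort(order, axis=0)                 # rank of each gene per sample
    mean_sorted = np.sort(expr, axis=0).mean(axis=1)  # reference distribution
    return mean_sorted[ranks]


def center_batches(expr, batch):
    """Simplified stand-in for ComBat (assumption): shift each batch so that
    every gene has the same mean in all batches. Unlike ComBat, this does not
    use empirical Bayes shrinkage and does not protect biological covariates.
    """
    out = expr.copy()
    grand_mean = expr.mean(axis=1, keepdims=True)
    for b in np.unique(batch):
        cols = batch == b
        out[:, cols] += grand_mean - expr[:, cols].mean(axis=1, keepdims=True)
    return out


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


def find_degs(expr, is_tumor, lfc_cut=1.0, fdr_cut=0.05):
    """Return row indices of genes passing both cutoffs (illustrative values).

    Uses a per-gene Welch t-test in place of the LIMMA moderated t-test.
    """
    tumor, normal = expr[:, is_tumor], expr[:, ~is_tumor]
    log_fc = tumor.mean(axis=1) - normal.mean(axis=1)
    _, p = stats.ttest_ind(tumor, normal, axis=1, equal_var=False)
    fdr = benjamini_hochberg(p)
    return np.flatnonzero((np.abs(log_fc) >= lfc_cut) & (fdr < fdr_cut))


def make_toy_expression(seed=0):
    """Synthetic data: 100 genes, 40 samples, 2 batches, 5 up-regulated genes."""
    rng = np.random.default_rng(seed)
    batch = np.repeat([0, 1], 20)
    is_tumor = np.tile([True, False], 20)  # balanced within each batch
    expr = rng.normal(8.0, 0.3, size=(100, 40))
    expr[:, batch == 1] += 1.5             # additive batch offset
    expr[:5, is_tumor] += 3.0              # true signal in genes 0..4
    return expr, batch, is_tumor
```

A typical call chains the three steps: `expr, batch, is_tumor = make_toy_expression()`, then `find_degs(center_batches(quantile_normalize(expr), batch), is_tumor)`. On real microarray data, the R packages that implement RMA, ComBat, and LIMMA remain the appropriate tools; the sketch only makes the order and purpose of the steps concrete.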

Discussion

This study attempted to propose a system to identify potential biomarkers for patients with NSCLC by integrating bioinformatics and ML-based approaches. In high-dimensional genomic data analysis, biomarker selection is challenging, mainly because the number of features (genes) is large relative to the limited sample size. To identify effective biomarkers in these settings, multiple approaches are available, including hypothesis-based tests, penalized methods such as the least absolute shrinkage and selection operator (LASSO), and other ML-based approaches such as support vector machine recursive feature elimination (SVM-RFE).
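To make the two named selection methods concrete, the following is a minimal sketch using scikit-learn on synthetic data in which the number of genes exceeds the number of samples. It is an illustration under stated assumptions, not the configuration used in this study: the helper names, the toy data, the penalty strength `alpha`, and the number of genes kept by SVM-RFE are all hypothetical, and LASSO is fitted here as a linear regression on 0/1 labels rather than as a penalized logistic model.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_toy_data(n_samples=60, n_genes=200, n_informative=5, shift=2.0, seed=0):
    """Synthetic expression matrix (samples x genes) with more genes than samples.

    Only the first n_informative genes differ between the two classes.
    """
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_samples // 2)
    X = rng.normal(size=(n_samples, n_genes))
    X[y == 1, :n_informative] += shift
    return X, y


def lasso_select(X, y, alpha=0.05):
    """Genes with non-zero LASSO coefficients.

    Simplification (assumption): the 0/1 class label is treated as a numeric
    response; alpha is an illustrative value, not a tuned one.
    """
    X_scaled = StandardScaler().fit_transform(X)
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_scaled, y)
    return np.flatnonzero(model.coef_)


def svm_rfe_select(X, y, n_keep=5):
    """Genes retained by SVM-RFE with a linear kernel.

    At each iteration a linear SVM is fitted and the 10% of the original
    genes with the smallest squared weights are dropped, until n_keep remain.
    """
    X_scaled = StandardScaler().fit_transform(X)
    selector = RFE(SVC(kernel="linear"), n_features_to_select=n_keep, step=0.1)
    selector.fit(X_scaled, y)
    return np.flatnonzero(selector.support_)
```

Calling `lasso_select(X, y)` and `svm_rfe_select(X, y)` on the output of `make_toy_data()` returns two arrays of gene indices that can be compared directly. In practice the penalty strength and the number of retained genes would be chosen by cross-validation rather than fixed in advance.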

Citation: Rabby MS, Islam MM, Kumar S, Maniruzzaman M, Hasan MAM, Tomioka Y, et al. (2025) Identification of potential biomarkers for lung cancer using integrated bioinformatics and machine learning approaches. PLoS ONE 20(2): e0317296. https://doi.org/10.1371/journal.pone.0317296

Editor: Suyan Tian, The First Hospital of Jilin University, CHINA

Received: July 23, 2024; Accepted: December 24, 2024; Published: February 27, 2025.

Copyright: © 2025 Rabby et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: In this study, we used five datasets (GSE54495, GSE49644, GSE102287, GSE40791, and GSE101929) from the USA cohort and another three datasets (GSE33356, GSE19804, and GSE27262) from the Taiwan cohort. These datasets can be downloaded from the following link: www.ncbi.nlm.nih.gov/geo/. The TCGA-LIHC dataset can also be downloaded from the TCGA database (https://portal.gdc.cancer.gov/).

Funding: This work was supported by the Competitive Research Fund of The University of Aizu, Japan (Grant Number: P-13).

Competing interests: The authors have declared that no competing interests exist.