Data-driven Model Discovery and Model Selection for Noisy Biological Systems

Xiaojun Wu, MeiLu McDermott, Adam L MacLean.

Abstract

Biological systems exhibit complex dynamics that differential equations can often adeptly represent. Ordinary differential equation models are widespread; until recently their construction has required extensive prior knowledge of the system. Machine learning methods offer alternative means of model construction: differential equation models can be learnt from data via model discovery using sparse identification of nonlinear dynamics (SINDy). However, SINDy struggles with realistic levels of biological noise and is limited in its ability to incorporate prior knowledge of the system.

Introduction

Mathematical models wielded skillfully can offer great insight into biological systems. The process of constructing models, however, is typically manual and labor-intensive. Data-driven model discovery, i.e. the set of methods by which models can be learnt directly from data, offers a promising alternative to manual model-building. To perform such model discovery, however, one must overcome the idiosyncrasies that biological systems present, including appropriate consideration of the extent and type of noise present in the data, and the need to evaluate results in an unbiased way.

Materials and Methods

We seek to learn ordinary differential equation (ODE) models that govern the behavior of a dynamical system using data-driven model discovery. We define a hybrid dynamical system as one for which the model is partially but not completely known. We take a two-step approach to model discovery. In the first step, we use neural network (NN)-based approaches to learn the unknown derivatives of a hybrid dynamical system.
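To make the notion of a hybrid dynamical system concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a Lotka-Volterra system (chosen here purely for illustration, with hypothetical parameter values) in which the linear growth and decay terms are known and the interaction terms are unknown. It simulates a noise-free trajectory, estimates derivatives by finite differences, and subtracts the known part. The residual is the quantity that a NN would be trained to approximate in the first step; the finite-difference shortcut stands in for NN training only to keep the example short.

```python
import numpy as np

# Hypothetical Lotka-Volterra parameters, chosen only for illustration.
ALPHA, BETA, GAMMA, DELTA = 1.3, 0.9, 1.8, 0.8


def known_part(x):
    """Terms assumed known a priori: linear prey growth and predator decay."""
    return np.stack([ALPHA * x[..., 0], -DELTA * x[..., 1]], axis=-1)


def true_unknown_part(x):
    """Interaction terms, treated as unknown; used here only to simulate data."""
    xy = x[..., 0] * x[..., 1]
    return np.stack([-BETA * xy, GAMMA * xy], axis=-1)


def rhs(x):
    """Full right-hand side: known part plus unknown part."""
    return known_part(x) + true_unknown_part(x)


def rk4(f, x0, dt, n_steps):
    """Integrate dx/dt = f(x) with the classical fourth-order Runge-Kutta scheme."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        x = xs[-1]
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        xs.append(x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(xs)


dt = 0.01
traj = rk4(rhs, [1.0, 1.0], dt, 1000)      # shape (1001, 2)

# Finite-difference estimate of the time derivative along the trajectory.
dxdt = np.gradient(traj, dt, axis=0)

# Residual derivative: what remains after removing the known terms.
residual = dxdt - known_part(traj)

# Compare against the true unknown terms (interior points only, where
# np.gradient uses second-order central differences).
max_err = np.max(np.abs(residual[1:-1] - true_unknown_part(traj[1:-1])))
```

On noise-free data the residual closely tracks the true interaction terms. With noisy measurements, finite differences amplify the noise, which is one motivation for learning the unknown derivatives with a smoother function approximator instead.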

Discussion

In this study, we developed a framework for data-driven discovery of ODE models. We presented methods to infer models from noisy data via a two-step model selection framework. In the first step, we learnt the latent (unknown) model dynamics with an NN; in the second, we used sparse regression to infer equations that model the system. We showed how the use of hybrid dynamical systems outperformed purely data-driven approaches.
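The sparse regression step can be illustrated with sequentially thresholded least squares, a common optimizer in SINDy. The sketch below is an assumption-laden illustration, not the authors' code: the candidate library, the threshold, the synthetic data, and the function names `library` and `stlsq` are all hypothetical. The targets play the role of the unknown derivatives that the first step would supply.

```python
import numpy as np

# Names of the candidate terms, in the column order produced by library().
TERM_NAMES = ["1", "x", "y", "x^2", "xy", "y^2"]


def library(states):
    """Build a polynomial candidate library (degree <= 2) for two state variables."""
    x, y = states[:, 0], states[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])


def stlsq(theta, targets, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares.

    Repeatedly zeroes coefficients with magnitude below `threshold` and
    refits the remaining terms by ordinary least squares.
    """
    coef = np.linalg.lstsq(theta, targets, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(coef) < threshold
        coef[small] = 0.0
        for j in range(targets.shape[1]):
            keep = ~small[:, j]
            if keep.any():
                coef[keep, j] = np.linalg.lstsq(
                    theta[:, keep], targets[:, j], rcond=None
                )[0]
    return coef


# Synthetic stand-in for the learnt unknown derivatives: interaction terms
# -0.9*x*y and 1.8*x*y plus a small amount of Gaussian noise.
rng = np.random.default_rng(0)
states = rng.uniform(0.5, 3.0, size=(200, 2))
xy = states[:, 0] * states[:, 1]
targets = np.column_stack([-0.9 * xy, 1.8 * xy])
targets += 0.01 * rng.standard_normal(targets.shape)

coef = stlsq(library(states), targets)

# Report the terms that survive thresholding for each equation.
for j in range(coef.shape[1]):
    terms = [f"{coef[i, j]:+.3f}*{TERM_NAMES[i]}" for i in np.flatnonzero(coef[:, j])]
    print(f"unknown term {j}: " + " ".join(terms))
```

The threshold controls the trade-off between sparsity and fit. At higher noise levels a single fixed threshold may retain spurious terms or discard true ones, which is why the choice among candidate models needs to be evaluated carefully.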

Citation: Wu X, McDermott M, MacLean AL (2025) Data-driven model discovery and model selection for noisy biological systems. PLoS Comput Biol 21(1): e1012762. https://doi.org/10.1371/journal.pcbi.1012762

Editor: Mark Alber, University of California Riverside, UNITED STATES OF AMERICA

Received: June 24, 2024; Accepted: December 31, 2024; Published: January 21, 2025.

Copyright: © 2025 Wu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All code associated with this study is available on GitHub at: github.com/maclean-lab/model-discovery. All raw data used in this study are publicly available on the Gene Expression Omnibus (GSE147405).

Funding: A.L.M. acknowledges support from the National Institutes of Health (R35GM143019) and from the National Science Foundation (DMS 2045327). Funders played no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.