An Integrative Approach to Protein Sequence Design Through Multiobjective Optimization

Lu Hong, Tanja Kortemme

Abstract

With recent methodological advances in the field of computational protein design, in particular those based on deep learning, there is an increasing need for frameworks that allow for coherent, direct integration of different models and objective functions into the generative design process. Here we demonstrate how evolutionary multiobjective optimization techniques can be adapted to provide such an approach. With the established Non-dominated Sorting Genetic Algorithm II (NSGA-II) as the optimization framework, we use AlphaFold2 and ProteinMPNN confidence metrics to define the objective space, and a mutation operator composed of ESM-1v and ProteinMPNN to rank and then redesign the least favorable positions.

Introduction

The field of computational protein design has achieved major breakthroughs in recent years in its ability to design proteins and protein assemblies with diverse folds and functions, an ability that has already found application in the design of therapeutically relevant biomolecules such as vaccines and antibodies. These breakthroughs build upon improvements in atomistic modeling techniques, such as the Rosetta software suite, and recent advances in machine learning-based structure prediction models, sequence design (or inverse folding) models, protein language models, and denoising diffusion probabilistic models.

Methods

The fact that some residues are present in only a subset of the structures for a given protein has two consequences. First, the local chemical environment around some designable positions close to the undesigned missing residues will not reflect that of the full-length protein, which may bias native sequence recovery whenever a structure-based metric is involved. However, because any such bias is constant across all design simulation setups, the relative differences in native sequence recovery and other performance metrics among the setups can be meaningfully attributed to differences in the mutation operator and/or the objective functions.
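
To make the metric concrete, native sequence recovery can be sketched as the fraction of designable positions at which a designed sequence matches the native one. The following is a minimal Python illustration, not the implementation used in this work; the function name, its arguments, and the use of 0-based position indices are assumptions made for the example.

```python
from typing import Iterable


def native_sequence_recovery(
    native: str, designed: str, designable: Iterable[int]
) -> float:
    """Fraction of designable positions where the designed residue is native.

    `designable` holds 0-based indices (an assumption of this sketch);
    positions outside this set, such as undesigned residues, are ignored.
    """
    if len(native) != len(designed):
        raise ValueError("native and designed sequences must be aligned")
    positions = list(designable)
    if not positions:
        raise ValueError("at least one designable position is required")
    matches = sum(native[i] == designed[i] for i in positions)
    return matches / len(positions)


# Toy example: three designable positions, two of which recover the native residue.
recovery = native_sequence_recovery("MKTAY", "MKSAY", designable=[0, 2, 4])
```

Because the same designable set is scored in every setup, a bias that affects particular positions shifts all setups alike, which is why differences between setups remain interpretable.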

Discussion

In this work, we examined the potential of evolutionary multiobjective optimization as an integrative framework for protein sequence design. This framework was chosen because of its ability to explicitly approximate the Pareto front in a user-specified objective space, and the flexibility it affords to construct informative mutation operators to guide sampling in the sequence space. Using the multistate design problem of the two-state fold-switching protein RfaH as an in-depth case study, as well as PapD and CaM as examples of higher-dimensional optimization problems, we showed that this approach led to design candidates with reduced bias and variance in native sequence recovery, without the need for post hoc filtering or pMPNN hyperparameter tuning.
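
The notion of a Pareto front can be illustrated with a short Python sketch. It assumes that every objective is to be minimized and uses a simple quadratic-time filter rather than the fast non-dominated sorting procedure of NSGA-II; the function names and the toy scores are hypothetical and are not taken from the accompanying code.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(
        x < y for x, y in zip(a, b)
    )


def pareto_front(scores: Sequence[Sequence[float]]) -> List[int]:
    """Indices of candidates that no other candidate dominates."""
    return [
        i
        for i, p in enumerate(scores)
        if not any(dominates(q, p) for j, q in enumerate(scores) if j != i)
    ]


# Toy two-objective scores for five hypothetical design candidates.
toy_scores = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0), (2.0, 2.0)]
front = pareto_front(toy_scores)  # candidate 3 is dominated by candidate 1
```

In this sketch, candidates with identical scores do not dominate one another, so both copies of (2.0, 2.0) remain on the front; an objective that should be maximized can be handled by negating it before the comparison.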

Citation: Hong L, Kortemme T (2024) An integrative approach to protein sequence design through multiobjective optimization. PLoS Comput Biol 20(7): e1011953. https://doi.org/10.1371/journal.pcbi.1011953

Editor: Alexey Onufriev, Virginia Tech, UNITED STATES OF AMERICA

Received: February 28, 2024; Accepted: June 25, 2024; Published: July 11, 2024.

Copyright: © 2024 Hong, Kortemme. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All code for methods described in this work and for generating and analyzing the benchmark data can be accessed at https://github.com/luhong88/int_seq_des.

Funding: This work was supported by a grant from the National Institutes of Health (R35 GM145236 to T.K.). T.K. is a Chan Zuckerberg Biohub Investigator. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.