GenerRNA: A Generative Pre-trained Language Model for de novo RNA Design

Yichong Zhao, Kenta Oono, Hiroki Takizawa, Masaaki Kotera.

Abstract

The design of RNA plays a crucial role in developing RNA vaccines, nucleic acid therapeutics, and innovative biotechnological tools. However, existing techniques frequently lack versatility across various tasks and are dependent on pre-defined secondary structure or other prior knowledge. To address these limitations, we introduce GenerRNA, a Transformer-based model inspired by the success of large language models (LLMs) in protein and molecule generation. GenerRNA is pre-trained on large-scale RNA sequences and capable of generating novel RNA sequences with stable secondary structures, while ensuring distinctiveness from existing sequences, thereby expanding our exploration of the RNA space.

Introduction

RNAs (ribonucleic acids) play essential roles in a broad range of biological phenomena. The design and engineering of RNA is a promising avenue for advanced therapeutics and biotechnology. For example, RNA aptamers, which bind specific proteins or small molecules, have been engineered for use in gene-silencing therapies through targeted siRNA delivery and as alternatives to fluorescent proteins in diagnostic techniques.

Methods

GenerRNA is a language model that processes RNA sequences from a linguistic perspective. Through unsupervised learning on a large-scale RNA dataset, the model learns the inherent syntax, grammar, and semantics of RNA sequences, thereby acquiring the ability to generate the RNA "language", that is, to generate RNA sequences akin to those found in nature.

Within this framework, GenerRNA computes the probability of each token x_i conditioned on the tokens that precede it, where a token refers to a discrete unit consisting of one or more nucleotides and the index i denotes the token's position in the sequence. The probability of an entire sequence is then the product of these per-token conditionals, which allows novel sequences to be generated one token at a time.
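The factorization described above can be illustrated with a minimal sketch. The bigram table below is purely hypothetical and stands in for the Transformer's learned conditional distribution; the real GenerRNA model conditions on the full preceding context and uses learned multi-nucleotide tokens rather than single bases.

```python
import math

# Toy illustration of the autoregressive factorization used by GPT-style
# models such as GenerRNA: P(x) = prod_i P(x_i | x_1, ..., x_{i-1}).
# Here the conditional depends only on the previous token (a bigram),
# which is a simplification for illustration only.

# P(next token | previous token); "^" marks the start of a sequence.
# These probabilities are made up, not learned from data.
BIGRAM = {
    "^": {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25},
    "A": {"A": 0.1, "C": 0.2, "G": 0.3, "U": 0.4},
    "C": {"A": 0.3, "C": 0.2, "G": 0.4, "U": 0.1},
    "G": {"A": 0.2, "C": 0.4, "G": 0.1, "U": 0.3},
    "U": {"A": 0.4, "C": 0.1, "G": 0.3, "U": 0.2},
}

def sequence_log_prob(tokens):
    """Log-probability of a token sequence under the chain rule."""
    log_p = 0.0
    prev = "^"
    for tok in tokens:
        # Accumulate log P(x_i | x_{i-1}); logs avoid numerical underflow.
        log_p += math.log(BIGRAM[prev][tok])
        prev = tok
    return log_p
```

For example, `sequence_log_prob(["A", "U"])` evaluates log P(A | start) + log P(U | A); sampling from these conditionals token by token is how generation proceeds.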

Discussion

Biological sequences harbor a wealth of information, covering evolutionary history, survival strategies, and even blueprints for future development. AI language models serve as translators for reading and writing this enigmatic language. The ever-increasing number of RNA sequences in public databases, coupled with advancements in natural language model architectures, provides a foundation for constructing models proficient in generating biological sequences.

Our development of GenerRNA marks the first instance of a large-scale generative language model for de novo RNA design. Our model learns RNA from a linguistic perspective to gain the ability to "speak" this language.

Citation: Zhao Y, Oono K, Takizawa H, Kotera M (2024) GenerRNA: A generative pre-trained language model for de novo RNA design. PLoS ONE 19(10): e0310814. https://doi.org/10.1371/journal.pone.0310814

Editor: Muhammad Usman Tariq, Abu Dhabi University, UNITED ARAB EMIRATES

Received: May 16, 2024; Accepted: September 8, 2024; Published: October 1, 2024.

Copyright: © 2024 Zhao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting information files and also the public repository https://huggingface.co/pfnet/GenerRNA.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.