Deciphering genomic codes using advanced NLP techniques: a scoping review.

ArXiv

Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065.

Published: November 2024

Objectives: The scale and complexity of human genomic sequencing data make effective analysis difficult. This review investigates how Natural Language Processing (NLP) techniques, particularly Large Language Models (LLMs) and transformer architectures, are applied to deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. It also assesses data and model accessibility in the most recent literature, to better understand the existing capabilities and constraints of these tools in processing genomic sequencing data.

Methods: Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, we conducted a scoping review across PubMed, MEDLINE, Scopus, Web of Science, Embase, and the ACM Digital Library. Studies were included if they focused on NLP methodologies applied to genomic sequencing data analysis, without restrictions on publication date or article type.

Results: A total of 26 studies published between 2021 and April 2024 were selected for review. The review highlights that tokenization and transformer models enhance the processing and understanding of genomic data, with applications in predicting regulatory annotations like transcription-factor binding sites and chromatin accessibility.

Discussion: Applying NLP and LLMs to the interpretation of genomic sequencing data is a promising approach that can help streamline the processing of large-scale genomic data while improving understanding of its complex structures. It could drive advances in personalized medicine by offering more efficient and scalable solutions for genomic analysis. Further research is needed to characterize and overcome current limitations and to improve model transparency and applicability.
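The abstract names tokenization as a core step but does not say which scheme the reviewed studies use. As a hedged illustration only, the sketch below assumes k-mer tokenization, a common way to turn a DNA string into "words" for a transformer. The function names (`kmer_tokenize`, `build_vocab`, `encode`), the `[UNK]` token, and the parameters `k` and `stride` are all hypothetical and are not taken from the review.

```python
def kmer_tokenize(sequence, k=3, stride=1):
    """Split a DNA string into k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    This is an assumed scheme for illustration, not the review's method.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens):
    """Map each distinct token to an integer id, in order of first appearance.

    Id 0 is reserved for a hypothetical unknown-token marker.
    """
    vocab = {"[UNK]": 0}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Convert tokens to ids, falling back to the unknown id (0)."""
    return [vocab.get(token, 0) for token in tokens]


tokens = kmer_tokenize("ATGCGTAC", k=3)
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(tokens)
print(ids)
```

The resulting integer ids are what a transformer's embedding layer would consume. Whether overlapping or non-overlapping k-mers (or a learned subword scheme) works better depends on the task, which this sketch does not address.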

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11623714PMC
