kmer2vec: A Novel Method for Comparing DNA Sequences by word2vec Embedding.

J Comput Biol

Department of Mathematical Sciences, Tsinghua University, Beijing, China.

Published: September 2022

The comparison of DNA sequences is of great significance in genomic analysis. Although traditional multiple sequence alignment (MSA) is widely used for evolutionary analysis, optimally aligning sequences becomes computationally intractable as the number of sequences increases, owing to the intrinsic computational complexity of MSA. Numerous k-mer alignment-free methods have been proposed, but the existing ones may not truly capture the contextual structure of the sequences. In this study, we present a novel contextual k-mer alignment-free method, called kmer2vec, in which the k-mers of a sequence are semantically embedded as word2vec vectors, using an essential technique from natural language processing. The method thereby converts each DNA/RNA sequence into a point in the high-dimensional word2vec space and compares sequences in that space. Because the word2vec vectors are trained on the contextual relationships of k-mers in the genomes, the method may extract valuable structural information from the sequences and properly reflect the relationships among them. The proposed method is optimized over the word2vec training parameters and verified in the phylogenetic analysis of large whole genomes, including coronavirus and bacterial genomes. The results demonstrate the effectiveness of the method for phylogenetic tree construction and species clustering. The method runs much faster than MSA, and the phylogenetic relationships it constructs are more accurate than those from the conventional k-mer alignment-free method. This approach can therefore provide new perspectives on phylogeny and evolution and makes the analysis of large genomes feasible. In addition, we discuss special parameterization in the construction of the k-mer word2vec embedding. An effective tool for rapid SARS-CoV-2 typing can also be derived by combining kmer2vec with clustering methods.
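
The pipeline the abstract describes (split each sequence into k-mers, learn k-mer vectors from their contexts, map each sequence to one point, compare points) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names are hypothetical, the embedding is passed in as a plain dictionary rather than trained, and averaging k-mer vectors and using cosine distance are stand-ins, since the abstract does not state how sequence points are formed or compared.

```python
import math


def kmers(seq, k):
    """Return the overlapping k-mers of seq, skipping any with non-ACGT symbols."""
    seq = seq.upper()
    out = []
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            out.append(word)
    return out


def context_pairs(tokens, window):
    """List (center, context) pairs within a window, as skip-gram training consumes."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def sequence_vector(tokens, embedding):
    """Average the vectors of the k-mers found in embedding (an assumed pooling rule)."""
    vecs = [embedding[t] for t in tokens if t in embedding]
    if not vecs:
        raise ValueError("no k-mer of the sequence is in the embedding")
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity (an assumed choice of distance between sequence points)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy, hand-written 2-dimensional embedding; real vectors would come from word2vec training.
toy_embedding = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0], "GTA": [1.0, 1.0]}
point_a = sequence_vector(kmers("ACGTA", 3), toy_embedding)
point_b = sequence_vector(kmers("ACGT", 3), toy_embedding)
distance_ab = cosine_distance(point_a, point_b)
```

In practice the dictionary would be replaced by vectors trained on the k-mer "sentences" of the genomes, and the pairwise distances would feed a tree-building or clustering step.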

Source: http://dx.doi.org/10.1089/cmb.2021.0536
