Sim4cc: a cross-species spliced alignment program.

Nucleic Acids Res

Department of Computer Science, George Washington University, Washington, DC 20052, USA.

Published: June 2009

Advances in sequencing technologies have accelerated the sequencing of new genomes, far outpacing the generation of gene and protein resources needed to annotate them. Direct comparison and alignment of existing cDNA sequences from a related species is an effective and readily available means to determine genes in the new genomes. Current spliced alignment programs are inadequate for comparing sequences between different species, owing to their low sensitivity and splice junction accuracy. A new spliced alignment tool, sim4cc, overcomes problems in the earlier tools by incorporating three new features: universal spaced seeds, which increase sensitivity and allow comparisons between species at various evolutionary distances, and powerful splice signal models and evolutionarily aware alignment techniques, which improve the accuracy of gene models. When tested on vertebrate comparisons at diverse evolutionary distances, sim4cc had significantly higher sensitivity than existing alignment programs (more than 10% higher than the closest competitor for some comparisons), while being comparable in speed to its predecessor, sim4. Sim4cc can be used in one-to-one or one-to-many comparisons of genomic and cDNA sequences, and can also be effectively incorporated into a high-throughput annotation engine, as demonstrated by the mapping of 64,000 Fagus grandifolia 454 ESTs and unigenes to the poplar genome.
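
The idea behind spaced seeds can be illustrated with a minimal sketch. The abstract does not give sim4cc's actual seed patterns or its indexing scheme, so the pattern `SEED` and the helper names below are hypothetical and chosen only for illustration. A spaced seed requires matches at the positions marked `1` and ignores the positions marked `0`, so two sequences that differ at a "don't care" position can still produce a hit where a contiguous k-mer of the same span would not.

```python
from collections import defaultdict

# Hypothetical seed pattern (NOT sim4cc's actual seeds):
# '1' = position must match, '0' = position is ignored.
SEED = "1101011"


def seed_key(seq, start, seed=SEED):
    """Return the characters of seq at the seed's '1' positions, starting at `start`."""
    return "".join(seq[start + i] for i, c in enumerate(seed) if c == "1")


def index_sequence(seq, seed=SEED):
    """Map each spaced-seed key to the list of positions where it occurs in seq."""
    index = defaultdict(list)
    for pos in range(len(seq) - len(seed) + 1):
        index[seed_key(seq, pos, seed)].append(pos)
    return index


def find_hits(genomic, cdna, seed=SEED):
    """Return (genomic_pos, cdna_pos) pairs whose spaced-seed keys agree."""
    index = index_sequence(genomic, seed)
    hits = []
    for cpos in range(len(cdna) - len(seed) + 1):
        for gpos in index.get(seed_key(cdna, cpos, seed), []):
            hits.append((gpos, cpos))
    return hits


if __name__ == "__main__":
    genomic = "ACGTACGTTGCA"
    cdna = "ACATGCG"  # differs from genomic[0:7] at two ignored positions
    print(find_hits(genomic, cdna))
```

In a real aligner, hits like these would only be starting points that are later extended and chained into exons; that stage is not shown here.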


Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2699533
http://dx.doi.org/10.1093/nar/gkp319


Similar Publications

Decoding the m6A epitranscriptomic landscape for biotechnological applications using a direct RNA sequencing approach.

Nat Commun

January 2025

National-Local Joint Engineering Laboratory of Druggability and New Drug Evaluation, National Engineering Research Center for New Drug and Druggability (cultivation), Guangdong Province Key Laboratory of New Drug Design and Evaluation, School of Pharmaceutical Sciences, Sun Yat-Sen University, Guangzhou, 510006, China.

Epitranscriptomic modifications, particularly N6-methyladenosine (m6A), are crucial regulators of gene expression, influencing processes such as RNA stability, splicing, and translation. Traditional computational methods for detecting m6A from Nanopore direct RNA sequencing (DRS) data are constrained by their reliance on experimentally validated labels, often resulting in the underestimation of modification sites. Here, we introduce pum6a, an innovative attention-based framework that integrates positive and unlabeled multi-instance learning (MIL) to address the challenges of incomplete labeling and missing read-level annotations.


African ancestry neurodegeneration risk variant disrupts an intronic branchpoint in .

medRxiv

February 2024

Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA.

Recently, a novel African ancestry specific Parkinson's disease (PD) risk signal was identified at the gene encoding glucocerebrosidase (). This variant (rs3115534-G) is carried by ~50% of West African PD cases and imparts a dose-dependent increase in risk for disease. The risk variant has varied frequencies across African ancestry groups, but is almost absent in European and Asian ancestry populations.


Proteoform Identification and Quantification Based on Alignment Graphs.

Bioinformatics

January 2025

Department of Computer Science, City University of Hong Kong, Hong Kong, China.

Motivation: Proteoforms are the different forms of a protein generated from the genome with various sequence variations, splice isoforms, and post-translational modifications. Proteoforms regulate protein structures and functions. A single protein can have multiple proteoforms due to different modification sites.


Background: TP53 variant classification benefits from the availability of large-scale functional data for missense variants generated using cDNA-based assays. However, absence of comprehensive splicing assay data for TP53 confounds the classification of the subset of predicted missense and synonymous variants that are also predicted to alter splicing. Our study aimed to generate and apply splicing assay data for a prioritised group of 59 TP53 predicted missense or synonymous variants that are also predicted to affect splicing by either SpliceAI or MaxEntScan.


Chikungunya virus (CHIKV) is an emerging, mosquito-borne arthritic alphavirus increasingly associated with severe neurological sequelae and long-term morbidity. However, there is limited understanding of the crucial host components involved in CHIKV replicase assembly complex formation, and thus virus replication and virulence-determining factors, within the central nervous system (CNS). Furthermore, the majority of CHIKV CNS studies focus on neuronal infection, even though astrocytes represent the main cerebral target.

