Background: Enrichment of loci by DNA hybridization-capture, followed by high-throughput sequencing, is an important tool in modern genetics. Currently, the most common targets for enrichment are the protein-coding exons represented by the consensus coding DNA sequence (CCDS). The CCDS, however, excludes many actual or computationally predicted coding exons present in other databases, such as RefSeq and Vega, as well as non-coding functional elements such as untranslated and regulatory regions. Consequently, both the number of variants per base pair (variant density) and our ability to interrogate regions outside the CCDS are less well understood.

Results: We examine capture sequence data from outside the CCDS regions and find that the extremes of GC content present in different subregions of the genome can reduce the local capture sequence coverage to less than 50% relative to the CCDS. This effect is due to biases inherent in both the Illumina and SOLiD sequencing platforms that are exacerbated by the capture process. Interestingly, for two subregion types, microRNA and predicted exons, the capture process yields higher-than-expected coverage when compared to whole genome sequencing. Lastly, we examine the variation present in non-CCDS regions and find that predicted exons, as well as exonic regions specific to RefSeq and Vega, show much higher variant densities than the CCDS.

Conclusions: We show that regions outside the CCDS perform less efficiently in capture sequence experiments. Further, we show that the variant density in computationally predicted exons is more than 2.5 times higher than that observed in the CCDS.
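
The abstract defines variant density as the number of variants per base pair and compares it between region types. A minimal sketch of that arithmetic follows, assuming hypothetical function names and placeholder counts that are not taken from the study; it only illustrates how a fold difference between two region sets could be computed.

```python
def variant_density(num_variants: int, region_length_bp: int) -> float:
    """Return variants per base pair for a set of regions."""
    if region_length_bp <= 0:
        raise ValueError("region_length_bp must be positive")
    if num_variants < 0:
        raise ValueError("num_variants must be non-negative")
    return num_variants / region_length_bp


def fold_difference(density: float, reference_density: float) -> float:
    """Return how many times higher `density` is than `reference_density`."""
    if reference_density <= 0:
        raise ValueError("reference_density must be positive")
    return density / reference_density


# Placeholder numbers for illustration only (assumptions, not study data).
reference_regions = variant_density(num_variants=1_000, region_length_bp=1_000_000)
other_regions = variant_density(num_variants=600, region_length_bp=200_000)

print(f"reference density: {reference_regions:.4f} variants/bp")
print(f"other density:     {other_regions:.4f} variants/bp")
print(f"fold difference:   {fold_difference(other_regions, reference_regions):.2f}")
```

Normalizing by total region length matters here because the region sets being compared differ greatly in size, so raw variant counts alone would not be comparable.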

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3218830
DOI: http://dx.doi.org/10.1186/gb-2011-12-7-r68
