Background: The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows, and windowed read-count data is generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging its capacity for simultaneous bias correction and signal detection. However, although statistically powerful, the GLM+NB method has quadratic computational complexity and therefore runs slowly when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate the utility of our approach by detecting copy number variants (CNVs) in a real example.

Results: We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB based method for detecting CNVs, and named the resulting method "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection.
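
The idea of estimating GLM+NB coefficients from a sample of the windows can be illustrated with a minimal sketch. This is not the RGE or GENSENG implementation: the uniform sampling scheme, the single covariate, the fixed dispersion `alpha`, the simulated data, and the helper names `sample_nb` and `fit_nb_glm` are all assumptions made for illustration. The sketch fits a log-link NB regression by iteratively reweighted least squares on all windows and on a random 10% subsample, so the two sets of coefficients can be compared.

```python
import math
import random


def sample_nb(rng, mu, alpha):
    """Draw one NB count with mean mu and variance mu + alpha * mu**2,
    using a gamma-Poisson mixture (hypothetical helper for the simulation)."""
    lam = rng.gammavariate(1.0 / alpha, alpha * mu)
    # Knuth's Poisson sampler; adequate for the modest means used here.
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def fit_nb_glm(x, y, alpha, n_iter=25):
    """IRLS for log(mu) = b0 + b1 * x, with the dispersion alpha held fixed."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        sw = swx = swxx = swz = swxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = math.exp(eta)
            w = mu / (1.0 + alpha * mu)   # NB working weight under a log link
            z = eta + (yi - mu) / mu      # working response
            sw += w
            swx += w * xi
            swxx += w * xi * xi
            swz += w * z
            swxz += w * xi * z
        # Solve the 2x2 weighted least-squares normal equations.
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1


rng = random.Random(0)
true_b0, true_b1, alpha = 3.0, 0.5, 0.1   # assumed simulation settings
n_windows, n_sampled = 20000, 2000

# Simulated windows: x stands in for a bias covariate, y for the read count.
x = [rng.uniform(-1.0, 1.0) for _ in range(n_windows)]
y = [sample_nb(rng, math.exp(true_b0 + true_b1 * xi), alpha) for xi in x]

full_fit = fit_nb_glm(x, y, alpha)

# Randomized step (assumed uniform, without replacement): fit on a subsample.
idx = rng.sample(range(n_windows), n_sampled)
sub_fit = fit_nb_glm([x[i] for i in idx], [y[i] for i in idx], alpha)

print("full data:", full_fit)
print("subsample:", sub_fit)
```

In this toy setting the subsample fit touches one tenth of the windows per iteration, at the cost of a larger sampling variance in the estimated coefficients; the abstract states that the paper analyzes this trade-off through the consistency and variance properties of RGE.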

Conclusions: Our results suggest that the RGE strategy developed here could be applied to other GLM+NB based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving their analytic power.


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5831535
DOI: http://dx.doi.org/10.1186/s12859-018-2077-6


