Background: Many biological analysis tasks require the extraction of families of genetically similar sequences from large datasets produced by next-generation sequencing (NGS). Such tasks include detecting viral transmissions by analyzing all genetically close pairs of sequences in viral datasets sampled from infected individuals, and studying the evolution of viruses or immune repertoires by analyzing networks of intra-host viral variants or antibody clonotypes formed by genetically close sequences. The most obvious naive algorithms for extracting such sequence families are impractical given the massive size of modern NGS datasets.

Results: In this paper, we present a fast and scalable k-mer-based framework for performing such sequence similarity queries efficiently, which specifically targets data produced by deep sequencing of heterogeneous populations such as viruses. It shows better filtering quality and time performance than other tools. The tool is freely available for download at https://github.com/vyacheslav-tsivina/signature-sj.

Conclusion: The proposed tool allows for efficient detection of genetic relatedness between genomic samples produced by deep sequencing of heterogeneous populations. It should be especially useful for analyzing the relatedness of genomes of viruses with unevenly distributed variable genomic regions, such as HIV and HCV. In the future, we envision that, besides applications in molecular epidemiology, the tool can also be adapted to immunosequencing and metagenomics data.
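The abstract does not describe the filter itself, so the following is only a minimal sketch of one generic k-mer-style filtering idea, not the authors' algorithm. It assumes equal-length (for example, aligned) sequences and uses Hamming distance as the measure of genetic closeness; both are assumptions made for illustration. If two sequences differ in at most t positions, then splitting them into t + 1 non-overlapping blocks guarantees that at least one block is identical in both, so only sequences sharing a block need to be compared in full. The names `hamming` and `close_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def close_pairs(seqs, t):
    """Return sorted index pairs (i, j), i < j, with Hamming distance <= t.

    Illustrative pigeonhole filter (an assumption, not the published method):
    each sequence is cut into t + 1 blocks, and only pairs that share an
    identical block at the same position are verified in full.
    """
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("this sketch assumes equal-length sequences")

    n_blocks = t + 1
    # Block boundaries that cover the whole sequence without overlap.
    bounds = [length * b // n_blocks for b in range(n_blocks + 1)]

    candidates = set()
    for b in range(n_blocks):
        lo, hi = bounds[b], bounds[b + 1]
        # Group sequence indices by the content of block b.
        buckets = defaultdict(list)
        for i, s in enumerate(seqs):
            buckets[s[lo:hi]].append(i)
        # Sequences in the same bucket become candidate pairs.
        for ids in buckets.values():
            candidates.update(combinations(ids, 2))

    # Verification step: keep only candidates that are truly within t.
    return sorted(
        (i, j) for i, j in candidates if hamming(seqs[i], seqs[j]) <= t
    )


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGAACGA"]
    print(close_pairs(reads, 1))
```

The filter never misses a true pair under these assumptions, because the verification step only removes false candidates. How much work it saves depends on how often unrelated sequences happen to share a block, which is one reason a practical tool would choose its k-mers more carefully than this sketch does.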

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6196405
DOI: http://dx.doi.org/10.1186/s12859-018-2333-9
