NAR Genom Bioinform
September 2024
Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) are now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process.
As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense.
Motivation: Genomic region sets summarize functional genomics data and define locations of interest in the genome such as regulatory regions or transcription factor binding sites. The number of publicly available region sets has increased dramatically, leading to challenges in data analysis.
Results: We propose a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach.