Publications by authors named "Mark B Gerstein"

The human genome is packaged within a three-dimensional (3D) nucleus and organized into structural units known as compartments, topologically associating domains (TADs), and loops. TAD boundaries, which separate adjacent TADs, have been found to be well conserved across mammalian species and more evolutionarily constrained than TADs themselves. Recent studies show that structural variants (SVs) can modify 3D genomes by disrupting TADs, which play an essential role in insulating genes from aberrant regulation by outside regulatory elements.

The accurate screening of candidate drug ligands against target proteins through computational approaches is of prime interest to drug development efforts. Such virtual screening depends in part on methods to predict the binding affinity between ligands and proteins. Many computational models for binding affinity prediction have been developed, but with varying results across targets.

The functional properties of the human brain arise, in part, from the vast assortment of cell types that pattern the cerebral cortex. The cortical sheet can be broadly divided into distinct networks, which are embedded into processing streams, or gradients, that extend from unimodal systems through higher-order association territories. Here using microarray data from the Allen Human Brain Atlas and single-nucleus RNA-sequencing data from multiple cortical territories, we demonstrate that cell-type distributions are spatially coupled to the functional organization of cortex, as estimated through functional magnetic resonance imaging.

Cryo-EM particle identification from micrographs ("picking") is challenging due to the low signal-to-noise ratio and lack of ground truth for particle locations. State-of-the-art computational algorithms ("pickers") identify different particle sets, complicating the selection of the best-suited picker for a protein of interest. Here, we present REliable PIcking by Consensus (REPIC), a computational approach to identifying particles common to the output of multiple pickers.
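
The idea of keeping only particles common to several pickers can be illustrated with a minimal sketch. This is not REPIC's algorithm, which the abstract does not describe; it is a simple greedy nearest-neighbour matching used as a stand-in. The function name `consensus_particles`, the `tol` distance threshold, and the example coordinates are all hypothetical.

```python
from math import hypot


def consensus_particles(picker_outputs, tol):
    """Greedy consensus sketch (hypothetical, not the REPIC method).

    picker_outputs: list of lists of (x, y) particle centers, one list per picker.
    tol: maximum distance (in pixels) for two picks to count as the same particle.
    Returns the mean center of every particle that all pickers agree on.
    """
    if not picker_outputs:
        return []
    reference, others = picker_outputs[0], picker_outputs[1:]
    used = [set() for _ in others]  # indices already matched, per picker
    consensus = []
    for rx, ry in reference:
        matches = []
        for k, picks in enumerate(others):
            best, best_d = None, tol
            for i, (x, y) in enumerate(picks):
                if i in used[k]:
                    continue
                d = hypot(x - rx, y - ry)
                if d <= best_d:
                    best, best_d = i, d
            if best is None:
                break  # this picker has no pick near the reference pick
            matches.append((k, best))
        else:
            # Every picker supplied a match: record the averaged center.
            for k, i in matches:
                used[k].add(i)
            group = [(rx, ry)] + [others[k][i] for k, i in matches]
            cx = sum(p[0] for p in group) / len(group)
            cy = sum(p[1] for p in group) / len(group)
            consensus.append((cx, cy))
    return consensus


# Made-up coordinates from three hypothetical pickers.
picker_a = [(10.0, 10.0), (50.0, 50.0), (90.0, 20.0)]
picker_b = [(12.0, 10.0), (51.0, 49.0)]
picker_c = [(10.0, 12.0), (49.0, 51.0), (200.0, 200.0)]
result = consensus_particles([picker_a, picker_b, picker_c], tol=5.0)
print(result)
```

A greedy match like this depends on which picker is used as the reference and on the order of its picks, so it should be read only as an illustration of the consensus idea.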

Protein phase transitions (PPTs) from the soluble state to a dense liquid phase (forming droplets via liquid-liquid phase separation) or to solid aggregates (such as amyloids) play key roles in pathological processes associated with age-related diseases such as Alzheimer's disease. Several computational frameworks are capable of separately predicting the formation of droplets or amyloid aggregates based on protein sequences, yet none have tackled the prediction of both within a unified framework. Recently, large language models (LLMs) have exhibited great success in protein structure prediction; however, they have not yet been used for PPTs.

Motivation: The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models' versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain.

Summary: Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires.

Cultural processes of change bear many resemblances to biological evolution. The underlying units of non-biological evolution have, however, remained elusive, especially in the domain of music. Here, we introduce a general framework to jointly identify underlying units and their associated evolutionary processes.

Article Synopsis
  • Single nucleotide polymorphisms (SNPs), especially rare variants, pose a reidentification risk for individuals and their relatives because they can uniquely identify individuals.
  • PLIGHT is a computational tool that uses hidden Markov models for estimating the informativeness of small, noisy SNP sets, allowing for identification in large haplotype databases.
  • The tool can determine privacy leakage from sparse SNP sets and includes a sanitization feature to help remove the most identifying SNPs from genomic data, thus enhancing privacy without needing prior knowledge of population characteristics.

The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3' end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts.

Public health officials and clinicians routinely advise social media users to avoid nighttime social media use due to the perception that it delays the onset of sleep and predisposes users to the health risks of insufficient sleep. With some exceptions, the evidence behind this advice mostly derives from surveys identifying an association between self-reported social media usage and self-reported sleep patterns. In principle, these associations could alternatively be explained by users turning to social media to pass the time when they are otherwise having difficulty sleeping, or by individual differences that draw some people to frequent social media use, or by offline activities that overlap with both social media use and delayed sleep.

Although the role of RNA binding proteins (RBPs) in extracellular RNA (exRNA) biology is well established, their exRNA cargo and distribution across biofluids are largely unknown. To address this gap, we extend the exRNA Atlas resource by mapping exRNAs carried by extracellular RBPs (exRBPs). This map was developed through an integrative analysis of ENCODE enhanced crosslinking and immunoprecipitation (eCLIP) data (150 RBPs) and human exRNA profiles (6,930 samples).

Background: Individuals with later bedtimes have an increased risk of difficulties with mood and substance use. To investigate the causes and consequences of late bedtimes and other sleep patterns, researchers are exploring social media as a data source. Pioneering studies inferred sleep patterns directly from social media data.

Motivation: While many quantum computing (QC) methods promise theoretical advantages over classical counterparts, quantum hardware remains limited. Exploiting near-term QC in computer-aided drug design (CADD) thus requires judicious partitioning between classical and quantum calculations.

Results: We present HypaCADD, a hybrid classical-quantum workflow for finding ligands binding to proteins, while accounting for genetic mutations.

Many human diseases are caused by mutations in nuclear envelope (NE) proteins. How protein homeostasis and disease etiology are interconnected at the NE is poorly understood. Specifically, the identity of local ubiquitin ligases that facilitate ubiquitin-proteasome-dependent NE protein turnover is presently unknown.

Every cell in the human body inherits a copy of the same genetic information. The three billion base pairs of DNA in the human genome, and the roughly 50,000 coding and non-coding genes they contain, must thus encode all the complexity of human development and cell and tissue type diversity. Differences in gene regulation, or the modulation of gene expression, enable individual cells to interpret the genome differently to carry out their specific functions.

The Extracellular RNA Communication Consortium (ERCC) is an NIH-funded program aiming to promote the development of new technologies, resources, and knowledge about exRNAs and their carriers. After Phase 1 (2013-2018), Phase 2 of the program (ERCC2, 2019-2023) aims to fill critical gaps in knowledge and technology to enable rigorous and reproducible methods for separation and characterization of both bulk populations of exRNA carriers and single extracellular vesicles (EVs). ERCC2 investigators are also developing new bioinformatic pipelines to promote data integration through the exRNA Atlas database.

The generation of functional genomics data by next-generation sequencing has increased greatly in the past decade. Broad sharing of these data is essential for research advancement but poses notable privacy challenges, some of which are analogous to those that occur when sharing genetic variant data. However, there are also unique privacy challenges that arise from cryptic information leakage during the processing and summarization of functional genomics data from raw reads to derived quantities, such as gene expression values.

RNA-seq has matured and become an important tool for studying RNA biology. Here we compared two RNA-seq (MGI DNBSEQ and Illumina NextSeq 500) and two microarray platforms (GeneChip Human Transcriptome Array 2.0 and Illumina Expression BeadChip) in healthy individuals administered recombinant human erythropoietin for transcriptome-wide quantification of differential gene expression.

Background: The diversity of genomic alterations in cancer poses challenges to fully understanding the etiologies of the disease. Recent interest in infrequent mutations, in genes that reside in the "long tail" of the mutational distribution, uncovered new genes with significant implications in cancer development. The study of cancer-relevant genes often requires integrative approaches pooling together multiple types of biological data.

Millions of consumer sport and fitness wearables (CSFWs) are used worldwide, and millions of datapoints are generated by each device. Moreover, these numbers are growing rapidly, and they span heterogeneous devices, data types, and data-collection contexts. Companies and consumers would benefit from guiding standards on device quality and data formats.

The endoplasmic reticulum (ER) is a membrane-bound organelle responsible for protein folding, lipid synthesis, and calcium homeostasis. Maintenance of ER structural integrity is crucial for proper function, but much remains to be learned about the molecular players involved. To identify proteins that support the structure of the ER, we performed a proteomic screen and identified nodal modulator (NOMO), a widely conserved type I transmembrane protein of unknown function, with three nearly identical orthologs specified in the human genome.
