Publications by authors named "William Noble"

Protein tandem mass spectrometry data are most often interpreted by matching observed mass spectra to a protein database derived from the reference genome of the sample being analyzed. In many application domains, however, a relevant protein database is unavailable or incomplete, and in such settings de novo sequencing is required. Since the introduction of the DeepNovo algorithm in 2017, the field of de novo sequencing has been dominated by deep learning methods, which use large amounts of labeled mass spectrometry data to train multi-layer neural networks to translate from observed mass spectra to corresponding peptide sequences.

Quantitative analysis of proteomics data frequently employs peptide-identity-propagation (PIP), also known as match-between-runs (MBR), to increase the number of peptides quantified in a given LC-MS/MS experiment. PIP can routinely account for up to 40% of all quantitative results, with that proportion rising as high as 75% in single-cell proteomics. Therefore, a significant concern for any PIP method is the possibility of false discoveries: errors that result in peptides being quantified incorrectly.

In May and June of 2021, marine microbial samples were collected for DNA sequencing in East Sound, WA, USA every 4 hours for 22 days. This high temporal resolution sampling effort captured the last 3 days of a Rhizosolenia sp. bloom, the initiation and complete bloom cycle of Chaetoceros socialis (8 days), and the following bacterial bloom (2 days).

Eukaryotic nuclei adopt a highly compartmentalized architecture that influences nearly all genomic processes. Understanding how this architecture impacts gene expression has been hindered by a lack of tools for elucidating the molecular interactions at individual genomic loci. Here, we adapt oligonucleotide-mediated proximity-interactome mapping (O-MAP) to biochemically characterize discrete, micron-scale nuclear neighborhoods.

Article Synopsis
  • Training machine learning models for tasks like de novo sequencing and spectral clustering needs substantial data, specifically high-confidence spectra.
  • A new dataset containing 2.8 million reliable peptide-spectrum matches from nine species has been created.
  • This dataset enhances a previously established benchmark, with improved data quality and a clear separation between training and test peptides.

The dynamic three-dimensional (3D) organization of the human genome (the "4D Nucleome") is closely linked to genome function. Here, we integrate a wide variety of genomic data generated by the 4D Nucleome Project to provide a detailed view of human 3D genome organization in widely used embryonic stem cells (H1-hESCs) and immortalized fibroblasts (HFFc6). We provide extensive benchmarking of 3D genome mapping assays and integrate these diverse datasets to annotate spatial genomic features across scales.

Article Synopsis
  • Data-independent acquisition (DIA) mass spectrometry is gaining popularity in quantitative proteomics due to its effectiveness in data analysis.
  • Creating reliable spectral libraries for DIA is challenging, as most current libraries come from data-dependent acquisition (DDA) data or predictions based on DDA.
  • The study introduces Carafe, a tool that generates specific spectral libraries by using deep learning directly on DIA data, showing better performance in predicting ion intensity and detecting peptides compared to existing DDA models.

Three-dimensional nuclear DNA architecture comprises well-studied intra-chromosomal (cis) folding and less characterized inter-chromosomal (trans) interfaces. Current predictive models of 3D genome folding can effectively infer pairwise cis-chromatin interactions from the primary DNA sequence but generally ignore trans contacts. There is an unmet need for robust models of trans-genome organization that provide insights into their underlying principles and functional relevance.

Article Synopsis
  • Scientists usually study how parts of the same chromosome (intrachromosomal contacts) connect, but not much about how different chromosomes (interchromosomal contacts) interact.
  • They created a new computer method called trans-C that helps to find these important connections using data from experiments called Hi-C.
  • This method was tested with different models and showed that genes that work together often stay close to each other in the cell, which helps in making RNA better and faster.

The cell cycle governs the proliferation, differentiation, and regeneration of all eukaryotic cells. Profiling cell cycle dynamics is therefore central to basic and biomedical research spanning development, health, aging, and disease. However, current approaches to cell cycle profiling involve complex interventions that may confound experimental interpretation.

Missing values are a major challenge in the analysis of mass spectrometry proteomics data. Missing values hinder reproducibility, decrease statistical power for identifying differentially expressed (DE) proteins and make it challenging to analyze low-abundance proteins. We present Lupine, a deep learning-based method for imputing, or estimating, missing values in tandem mass tag (TMT) proteomics data.

A key parameter of any bottom-up proteomics mass spectrometry experiment is the identity of the enzyme that is used to digest proteins in the sample into peptides. The Casanovo de novo sequencing model was trained using data that was generated with trypsin digestion; consequently, the model prefers to predict peptides that end with the amino acids "K" or "R". This bias is desirable when Casanovo is used to analyze data that was also generated using trypsin but can be problematic if the data was generated using some other digestion enzyme.
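
As a rough illustration of the tryptic bias described above, the following Python sketch measures what fraction of a set of predicted peptides end in "K" or "R". The function names and the example peptide strings are hypothetical and are not part of Casanovo; this is only one simple way such a bias might be checked.

```python
def ends_tryptic(peptide: str) -> bool:
    """True if the peptide ends with K or R, as expected after trypsin digestion."""
    return peptide.endswith(("K", "R"))


def fraction_tryptic(peptides: list[str]) -> float:
    """Fraction of peptides with a tryptic C-terminus (0.0 for an empty list)."""
    if not peptides:
        return 0.0
    return sum(ends_tryptic(p) for p in peptides) / len(peptides)


# Made-up peptide strings, for illustration only.
predictions = ["ACDEFGHK", "LMNPQR", "STVWYA", "GGSSDE"]
tryptic_share = fraction_tryptic(predictions)
```

A share close to 1.0 on data generated with a non-tryptic enzyme would be consistent with the bias described above, although this check alone does not establish its cause.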

A fundamental challenge in mass spectrometry-based proteomics is the identification of the peptide that generated each acquired tandem mass spectrum. Approaches that leverage known peptide sequence databases cannot detect unexpected peptides and can be impractical or impossible to apply in some settings. Thus, the ability to assign peptide sequences to tandem mass spectra without prior information (de novo peptide sequencing) is valuable for tasks including antibody sequencing, immunopeptidomics, and metaproteomics.

Motivation: High-resolution Hi-C contact matrices reveal the detailed three-dimensional architecture of the genome, but high-coverage experimental Hi-C data are expensive to generate. Simultaneously, chromatin structure analyses struggle with extremely sparse contact matrices. To address this problem, computational methods to enhance low-coverage contact matrices have been developed, but existing methods are largely based on resolution enhancement methods for natural images and hence often employ models that do not distinguish between biologically meaningful contacts, such as loops, and other stochastic contacts.

Motivation: One of the core problems in the analysis of protein tandem mass spectrometry data is the peptide assignment problem: determining, for each observed spectrum, the peptide sequence that was responsible for generating the spectrum. Two primary classes of methods are used to solve this problem: database search and de novo peptide sequencing. State-of-the-art methods for de novo sequencing use machine learning methods, whereas most database search engines use hand-designed score functions to evaluate the quality of a match between an observed spectrum and a candidate peptide from the database.

A pressing statistical challenge in the field of mass spectrometry proteomics is how to assess whether a given software tool provides accurate error control. Each software tool for searching such data uses its own internally implemented methodology for reporting and controlling the error. Many of these software tools are closed source, with incompletely documented methodology, and the strategies for validating the error are inconsistent across tools.

Long-read DNA sequencing has recently emerged as a powerful tool for studying both genetic and epigenetic architectures at single-molecule and single-nucleotide resolution. Long-read epigenetic studies encompass both the direct identification of native cytosine methylation and the identification of exogenously placed DNA N6-methyladenine (DNA-m6A). However, detecting DNA-m6A modifications using single-molecule sequencing, as well as coprocessing single-molecule genetic and epigenetic architectures, is limited by computational demands and a lack of supporting tools.

Article Synopsis
  • Traditional database search methods for analyzing mass spectrometry data struggle to detect peptides with post-translational modifications (PTMs), leading to a rise in "open modification" search strategies that allow for more flexibility in mass matching.
  • A study by Kong et al. highlighted that the open modification search tool MSFragger may be better at detecting peptides compared to traditional "narrow window" searches, prompting an empirical investigation into this claim.
  • The investigation revealed potential issues with false discovery rate (FDR) control in certain machine learning tools, but upon reanalysis with standard FDR control methods, it was found that concerns about their reliability in proteomics MS/MS searches may not be substantiated.

Searching tandem mass spectrometry proteomics data against a database is a well-established method for assigning peptide sequences to observed spectra but typically cannot identify peptides harboring unexpected post-translational modifications (PTMs). Open modification searching aims to address this problem by allowing a spectrum to match a peptide even if the spectrum's precursor mass differs from the peptide mass. However, expanding the search space in this way can lead to a loss of statistical power to detect peptides.
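
The difference between a narrow-window and an open search can be sketched as a change in the precursor mass tolerance used to select candidate peptides. The Python example below is a minimal sketch under stated assumptions: the function name, the peptide labels, the masses, and the tolerance values are all hypothetical and do not correspond to any particular search engine.

```python
def candidate_peptides(precursor_mass, peptide_masses, tolerance_da):
    """Return peptides whose mass lies within tolerance_da of the precursor mass."""
    return [
        peptide
        for peptide, mass in peptide_masses.items()
        if abs(mass - precursor_mass) <= tolerance_da
    ]


# Hypothetical database of peptide labels and masses in daltons.
peptide_masses = {"PEP_A": 1000.00, "PEP_B": 1079.97, "PEP_C": 1500.00}

# Hypothetical observed precursor mass. It matches PEP_B exactly, but it
# could also come from PEP_A carrying a modification of about 79.97 Da.
observed = 1079.97

narrow = candidate_peptides(observed, peptide_masses, tolerance_da=0.02)
wide = candidate_peptides(observed, peptide_masses, tolerance_da=500.0)
```

With these made-up values, the narrow window keeps only PEP_B, whereas the wide window admits all three peptides. The larger candidate set is what lets a modified peptide be matched, and it is also why the search space, and with it the multiple-testing burden, grows.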

The three-dimensional organization of genomes plays a crucial role in essential biological processes. The segregation of chromatin into A and B compartments highlights regions of activity and inactivity, providing a window into the genomic activities specific to each cell type. Yet, the steep costs associated with acquiring Hi-C data, necessary for studying this compartmentalization across various cell types, pose a significant barrier to studying cell-type-specific genome organization.
