In environment and health studies, recent work has focused on identifying contaminants of emerging concern (CECs). This is a complex and challenging task, because resources such as compound databases (DBs) and mass spectral libraries (MSLs) covering these compounds are scarce. This is particularly true for semi-polar organic contaminants that have to be derivatized prior to gas chromatography-mass spectrometry (GC-MS) analysis with electron impact ionization (EI), for which hardly any records can be found. In particular, there is a severe lack of publicly available GC-EI-MS datasets generated for the development, validation and performance evaluation of cheminformatics-assisted compound structure identification (CSI) approaches, including novel machine learning (ML)-based approaches [1]. We set out to fill this gap and to support ML-assisted identification of silylated derivatives in GC-MS laboratories working in the field of environment and health. To this end, we have generated 12 datasets of GC-EI-MS spectra: six contain spectra of trimethylsilyl (TMS) derivatives and six contain spectra of tert-butyldimethylsilyl (TBDMS) derivatives. Four of these datasets, named testing datasets, contain mass spectra acquired by the authors. They are available in full, together with the corresponding metadata. Eight datasets, named training datasets, were derived from mass spectra in the NIST 17 Mass Spectral Library. For these, we have made only the metadata publicly available, for licensing reasons. For each type of derivative, two testing datasets were generated by acquiring and processing GC-EI-MS spectra, such that they include raw and processed GC-EI-MS spectra of TMS and TBDMS derivatives of CECs, along with their corresponding metadata.
The metadata contain the IUPAC name, exact mass, molecular formula, InChI, InChIKey, SMILES and PubChemID of each CEC and of each CEC-TMS or CEC-TBDMS derivative, where available. Eight GC-EI-MS training datasets were generated from the National Institute of Standards and Technology (NIST)/U.S. Environmental Protection Agency (EPA)/National Institutes of Health (NIH) 17 Mass Spectral Library. For each derivative type (TMS and TBDMS), four datasets are given: the original dataset obtained from NIST/EPA/NIH 17 and three variants thereof, each obtained after one of the filtering steps of the procedure described below. Only the metadata for the training datasets are available; they describe the corresponding NIST/EPA/NIH 17 entries and include the compound name, CAS Registry number, InChIKey, exact mass, M, NIST number and ID number. The datasets we present here were used to train and test predictive models for identification of silylated derivatives built with ML approaches [4]. The models were built by using data curated from the NIST Mass Spectral Library 17 [2] and the machine learning approach of CSI:Output Kernel Regression (CSI:OKR) [2]. Data from the NIST Mass Spectral Library 17 are commercially available from NIST/EPA/NIH and thus cannot be made publicly available. This highlights the need for publicly available GC-EI-MS spectra, which we address by releasing in full the four testing datasets.
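The dataset layout described above can be sketched in a few lines of Python. This is only an illustrative bookkeeping sketch, not the authors' tooling: the variant labels (`testing_1`, `filter_step_1`, and so on) and the function names are hypothetical, since the abstract does not name the individual datasets. The InChIKey check relies only on the standard InChIKey layout (14, 10 and 1 uppercase letters separated by hyphens) and verifies the format, not chemical correctness.

```python
import re

# Derivative types named in the abstract.
DERIVATIVES = ("TMS", "TBDMS")

# Hypothetical labels: the abstract gives only the counts per derivative type
# (two testing datasets; one original training dataset plus three filtered variants).
TESTING_VARIANTS = ("testing_1", "testing_2")
TRAINING_VARIANTS = ("original", "filter_step_1", "filter_step_2", "filter_step_3")


def dataset_layout():
    """Enumerate (derivative, kind, variant) for all datasets."""
    layout = []
    for derivative in DERIVATIVES:
        for variant in TESTING_VARIANTS:
            layout.append((derivative, "testing", variant))
        for variant in TRAINING_VARIANTS:
            layout.append((derivative, "training", variant))
    return layout


# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_valid_inchikey(key):
    """Return True if `key` has the standard 27-character InChIKey format."""
    return bool(INCHIKEY_RE.fullmatch(key))


if __name__ == "__main__":
    layout = dataset_layout()
    print(len(layout), "datasets in total")
    # Water's InChIKey, used here only as a well-known format example.
    print(is_valid_inchikey("XLYOFNOQVPJJNP-UHFFFAOYSA-N"))
```

A format check of this kind could be run over the InChIKey column of the published metadata before matching records between the testing and training sets.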
| Download full-text PDF | Source |
|---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10147959 | PMC |
http://dx.doi.org/10.1016/j.dib.2023.109138 | DOI Listing |
J Am Soc Mass Spectrom
January 2025
Biomolecular Measurement Division, National Institute of Standards and Technology, Gaithersburg, Maryland 20899-8362, United States.
While gas chromatography mass spectrometry (GC-MS) has long been used to identify compounds in complex mixtures, this process is often subjective and time-consuming and leaves a large fraction of seemingly good-quality spectra unidentified. In this work, we describe a set of new mass spectral library-based methods to assist compound identification in complex mixtures. These methods employ mass spectral uniqueness and compound ubiquity of library entries alongside noise reduction and automated comparison of retention indices to library compounds.
Anal Chem
January 2025
College of Chemistry and Chemical Engineering, Central South University, Changsha 410083, China.
Molecular weight (MW) is a crucial property to improve the accuracy of multidimensional compound identification. In this study, we have developed MWFormer, a novel method that predicts MWs solely from spectra of electron ionization mass spectrometry (EI-MS) based on a Transformer encoder. MWFormer achieves a mean absolute error (MAE) of 6.
Molecules
December 2024
Department of Forensic Medicine, Wroclaw Medical University, 4 J. Mikulicza-Radeckiego Street, 50-345 Wroclaw, Poland.
Rapid Commun Mass Spectrom
January 2025
Institute of Pharmacy, Berlin, Germany.
Rationale: Gas chromatography/electron ionization mass spectrometry (GC/EI-MS) is a well-established tool for the identification of unknown compounds such as new metabolites of xenobiotics. However, it reaches the limits of confident structural assignment when it comes to stereoisomers. This work helps to overcome this difficulty by providing a deeper understanding of the composition of previously unspecific as well as characteristic fragment ions, both in general and in comparison among stereoisomers.
J Mass Spectrom
October 2024
Applied and Computational Mathematics Division, National Institute of Standards and Technology, Gaithersburg, Maryland, USA.
This study employs a high-dimensional consensus mass spectral (HDCMS) similarity scoring technique to discriminate isomers collected using an electron ionization mass spectrometer. The HDCMS method was previously introduced and applied to the discrimination of mass spectra of constitutional isomers, methamphetamine and phentermine, collected with direct analysis real-time mass spectrometry (DART-MS). The method formulates the problem of discriminating mass spectra in a mathematical Hilbert space and is hence called "high dimensional.