In environment and health studies, recent work has focused on identifying contaminants of emerging concern (CECs). This is a complex and challenging task because resources such as compound databases (DBs) and mass spectral libraries (MSLs) covering these compounds are scarce. This is particularly true for semi-polar organic contaminants that must be derivatized prior to gas chromatography-mass spectrometry (GC-MS) analysis with electron impact ionization (EI), for which hardly any records can be found. In particular, there is a severe lack of publicly available GC-EI-MS spectral datasets generated for the development, validation and performance evaluation of cheminformatics-assisted compound structure identification (CSI) approaches, including novel machine learning (ML)-based approaches [1]. We set out to fill this gap and support ML-assisted compound identification, thus aiding the cheminformatics-assisted identification of silylated derivatives in GC-MS laboratories working in the field of environment and health. To this end, we have generated 12 datasets of GC-EI-MS spectra: six contain spectra of trimethylsilyl (TMS) derivatives and six contain spectra of tert-butyldimethylsilyl (TBDMS) derivatives. Four of these datasets, named testing datasets, contain mass spectra acquired by the authors. They are available in full, together with the corresponding metadata. Eight datasets, named training datasets, were derived from mass spectra in the NIST 17 Mass Spectral Library. For these, only the metadata have been made publicly available, for licensing reasons. For each type of derivative, two testing datasets were generated by acquiring and processing GC-EI-MS spectra, such that they include raw and processed GC-EI-MS spectra of TMS and TBDMS derivatives of CECs, along with their corresponding metadata.
The metadata contain the IUPAC name, exact mass, molecular formula, InChI, InChIKey, SMILES and PubChemID of each CEC and of each CEC-TMS or CEC-TBDMS derivative, where available. The eight GC-EI-MS training datasets were generated from the National Institute of Standards and Technology (NIST)/U.S. Environmental Protection Agency (EPA)/National Institutes of Health (NIH) 17 Mass Spectral Library. For each derivative type (TMS and TBDMS), four datasets are given: the original dataset obtained from NIST/EPA/NIH 17 and three variants thereof, one obtained after each of the filtering steps of the procedure described below. Only the metadata for the training datasets are available, describing the corresponding NIST/EPA/NIH 17 entries: these include the compound name, CAS Registry number, InChIKey, exact mass, M, NIST number and ID number. The datasets we present here were used to train and test predictive models for the identification of silylated derivatives built with ML approaches [4]. The models were built using data curated from the NIST Mass Spectral Library 17 [2] and the machine learning approach of CSI:Output Kernel Regression (CSI:OKR) [2]. Data from the NIST Mass Spectral Library 17 are commercially available from NIST/EPA/NIH and thus cannot be made publicly available. This highlights the need for publicly available GC-EI-MS spectra, which we address by releasing the four testing datasets in full.
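The dataset layout described above can be summarized in a short sketch. This is only an illustrative catalogue, not the authors' file structure: the class name `DatasetInfo`, the variant labels, and the reading that the two testing datasets per derivative type correspond to raw and processed spectra are assumptions made for this example.

```python
from dataclasses import dataclass

# Derivative types named in the abstract.
DERIVATIVES = ("TMS", "TBDMS")

# Hypothetical variant labels (assumptions, not the published dataset names):
# two testing datasets per derivative type, read here as raw and processed;
# four training datasets per derivative type, i.e. the original NIST-derived
# set plus one variant after each of three filtering steps.
TESTING_VARIANTS = ("raw", "processed")
TRAINING_VARIANTS = ("original", "filter_step_1", "filter_step_2", "filter_step_3")


@dataclass(frozen=True)
class DatasetInfo:
    derivative: str       # "TMS" or "TBDMS"
    role: str             # "testing" or "training"
    variant: str          # one of the labels above
    spectra_public: bool  # True if spectra are released, False if metadata only


def build_catalogue():
    """Enumerate the 12 datasets described in the abstract."""
    catalogue = []
    for derivative in DERIVATIVES:
        for variant in TESTING_VARIANTS:
            # Testing spectra were acquired by the authors and released in full.
            catalogue.append(DatasetInfo(derivative, "testing", variant, True))
        for variant in TRAINING_VARIANTS:
            # Training spectra come from NIST 17, so only metadata are public.
            catalogue.append(DatasetInfo(derivative, "training", variant, False))
    return catalogue


catalogue = build_catalogue()
n_testing = sum(d.role == "testing" for d in catalogue)
n_training = sum(d.role == "training" for d in catalogue)
print(len(catalogue), n_testing, n_training)
```

Enumerating the datasets this way makes the counts in the abstract easy to check: two derivative types times (two testing plus four training) datasets gives 12 in total, of which only the four testing datasets include publicly released spectra.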


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10147959
DOI: http://dx.doi.org/10.1016/j.dib.2023.109138

