The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line-breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.
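The abstract does not describe how CaffeineFix works internally, so the following is only a minimal sketch of the general idea it states: a name built from an open-ended combination of fragments cannot be looked up in a finite dictionary, but it can be matched, with a bounded number of typographical errors, against the fragments themselves. The fragment list `FRAGMENTS` and the functions `edit_distance` and `correct_name` are hypothetical and chosen for illustration; they are not taken from the paper, and real chemical nomenclature needs a far richer grammar.

```python
from typing import Optional

# Hypothetical toy vocabulary of name fragments (an assumption, not from the paper).
FRAGMENTS = ("chloro", "benz", "ene", "meth", "eth", "an", "ol", "yl", "amine")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def correct_name(text: str, max_errors: int = 2) -> Optional[str]:
    """Return the closest concatenation of FRAGMENTS to `text`,
    or None if every candidate needs more than `max_errors` edits."""
    n = len(text)
    inf = float("inf")
    # best[i] = (lowest cost, corrected string) for the prefix text[:i]
    best = [(inf, "")] * (n + 1)
    best[0] = (0, "")
    for i in range(1, n + 1):
        for frag in FRAGMENTS:
            # Only substrings whose length is within max_errors of the
            # fragment length can match it within the error budget.
            lo = max(0, i - len(frag) - max_errors)
            hi = min(i - 1, i - len(frag) + max_errors)
            for j in range(lo, hi + 1):
                if best[j][0] == inf:
                    continue
                cost = best[j][0] + edit_distance(text[j:i], frag)
                if cost < best[i][0]:
                    best[i] = (cost, best[j][1] + frag)
    cost, name = best[n]
    return name if cost <= max_errors else None


print(correct_name("ch1orobenzene"))  # OCR read "l" as "1" -> chlorobenzene
print(correct_name("methanol"))       # already valid -> methanol
print(correct_name("xxxxxxxx"))       # not a chemical name -> None
```

In a pipeline of the kind the abstract describes, a corrected string such as `chlorobenzene` would then be passed on to whichever name-to-structure tool is in use; that step is outside this sketch.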

Source
http://dx.doi.org/10.1021/ci200463r
