An ontology-based text mining dataset for extraction of process-structure-property entities.

Ali Riza Durmaz, Akhil Thomas, Lokesh Mishra, Rachana Niranjan Murthy, Thomas Straub

Sci Data

Fraunhofer Institute for Mechanics of Materials IWM, Freiburg im Breisgau, 79108, Germany.

Published: October 2024

Large language models learn statistical patterns of language, while ontologies provide symbolic knowledge that can complement those models.
The MaterioMiner dataset links materials mechanics concepts with textual data, featuring 179 classes annotated by three raters across four publications, leading to a total of 2191 curated entities.
The study also evaluates annotation consistency among raters and demonstrates how pre-trained language models can be fine-tuned for named entity recognition, paving the way for advances in materials language models and knowledge graph creation.

While large language models learn sound statistical representations of language and the information it carries, ontologies are symbolic knowledge representations that can ideally complement them. Research at this critical intersection relies on datasets that intertwine ontologies and text corpora to enable training and comprehensive benchmarking of neurosymbolic models. We present the MaterioMiner dataset and the linked materials mechanics ontology, in which ontological concepts from the mechanics of materials domain are associated with textual entities within the literature corpus. Another distinctive feature of the dataset is its eminently fine-grained annotation. Specifically, 179 distinct classes are manually annotated by three raters within four publications, amounting to 2191 entities that were annotated and curated. Conceptual work is presented for the symbolic representation of causal composition-process-microstructure-property relationships. We explore the annotation consistency between the three raters and fine-tune pre-trained language models to showcase the feasibility of training named entity recognition models. Reusing the dataset can foster training and benchmarking of materials language models, automated ontology construction, and knowledge graph generation from textual data.
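The abstract does not say which metric was used to compare the three raters. As an illustration only, the sketch below computes pairwise Cohen's kappa over token-level labels, one common way to quantify annotation consistency. The rater names, the label sequences, and the class names (`Material`, `Process`, `Property`, `O`) are hypothetical placeholders, not values taken from the MaterioMiner dataset.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of positions with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        # Both raters used one and the same label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level annotations of the same six tokens by three raters.
raters = {
    "rater_1": ["Material", "O", "Process", "O", "Property", "O"],
    "rater_2": ["Material", "O", "Process", "O", "O", "O"],
    "rater_3": ["Material", "O", "O", "O", "Property", "O"],
}

# One kappa value per unordered pair of raters.
pairwise = {
    (first, second): cohen_kappa(raters[first], raters[second])
    for first, second in combinations(raters, 2)
}

for pair, kappa in pairwise.items():
    print(pair, round(kappa, 3))
```

Pairwise kappa is only one option; with three raters, a multi-rater statistic such as Fleiss' kappa, or a span-level F1 between annotators, could be equally reasonable choices.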

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11467320
DOI: http://dx.doi.org/10.1038/s41597-024-03926-5
