AI Article Synopsis

  • Large language models learn language patterns through statistical methods, while ontologies provide symbolic knowledge that can enhance these models; both can work together effectively.
  • The MaterioMiner dataset links materials mechanics concepts with textual data, featuring 179 classes annotated by three raters across four publications, leading to a total of 2191 curated entities.
  • The study also evaluates annotation consistency among raters and demonstrates how pre-trained language models can be fine-tuned for named entity recognition, paving the way for advances in materials language models and knowledge graph creation.

Article Abstract

While large language models learn sound statistical representations of the language and information therein, ontologies are symbolic knowledge representations that can complement the former ideally. Research at this critical intersection relies on datasets that intertwine ontologies and text corpora to enable training and comprehensive benchmarking of neurosymbolic models. We present the MaterioMiner dataset and the linked materials mechanics ontology where ontological concepts from the mechanics of materials domain are associated with textual entities within the literature corpus. Another distinctive feature of the dataset is its eminently fine-grained annotation. Specifically, 179 distinct classes are manually annotated by three raters within four publications, amounting to 2191 entities that were annotated and curated. Conceptual work is presented for the symbolic representation of causal composition-process-microstructure-property relationships. We explore the annotation consistency between the three raters and perform fine-tuning of pre-trained language models to showcase the feasibility of training named entity recognition models. Reusing the dataset can foster training and benchmarking of materials language models, automated ontology construction, and knowledge graph generation from textual data.
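
The abstract does not say which statistic was used to measure annotation consistency between the three raters. As an illustrative sketch only, the Python below computes Fleiss' kappa, a common agreement measure for a fixed number of raters who each assign one label per item. The function name `fleiss_kappa`, the toy labels, and the one-label-per-entity setup are assumptions made for this example and are not taken from the MaterioMiner dataset or its 179 classes.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    `ratings` is a list of per-item label lists, e.g. [["A", "A", "B"], ...].
    This is a generic textbook formulation, not the paper's evaluation code.
    """
    if not ratings:
        raise ValueError("ratings must contain at least one item")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall label proportions.
    proportions = [
        sum(c[k] for c in counts) / (n_items * n_raters) for k in categories
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        # Only one label was ever used, so agreement is trivially perfect.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy annotations: four text spans, three raters each.
toy_ratings = [
    ["Process", "Process", "Process"],
    ["Property", "Property", "Microstructure"],
    ["Microstructure", "Microstructure", "Microstructure"],
    ["Process", "Property", "Property"],
]

if __name__ == "__main__":
    print(f"Fleiss' kappa on toy data: {fleiss_kappa(toy_ratings):.3f}")
```

A value of 1 indicates perfect agreement and values near 0 indicate agreement no better than chance. Span-level annotation with 179 fine-grained classes would also need a rule for aligning entity boundaries across raters, which this sketch leaves out.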

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11467320
DOI: http://dx.doi.org/10.1038/s41597-024-03926-5

Publication Analysis

Top Keywords

language models: 12
three raters: 8
models: 5
ontology-based text: 4
text mining: 4
dataset: 4
mining dataset: 4
dataset extraction: 4
extraction process-structure-property: 4
process-structure-property entities: 4
