Sequence-to-sequence pretraining for a less-resourced Slovenian language.

Front Artif Intell

Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia.

Published: March 2023

Introduction: Large pretrained language models have recently come to dominate natural language processing. As an alternative to the predominant masked language modeling introduced in BERT, the T5 model introduced a more general training objective, namely sequence-to-sequence transformation, which more naturally fits text generation tasks. Monolingual variants of T5 models have been limited to well-resourced languages, while the massively multilingual T5 model (mT5) supports 101 languages.

Methods: We trained two different-sized T5-type sequence-to-sequence models for the morphologically rich Slovene language, for which far fewer resources are available. We analyzed the behavior of the new models on 11 tasks: eight classification tasks (named entity recognition, sentiment classification, lemmatization, two question answering tasks, two natural language inference tasks, and a coreference resolution task) and three text generation tasks (text simplification and two summarization tasks on different datasets). We compared the new SloT5 models with the multilingual mT5 model, the multilingual mBART-50 model, and four BERT-like encoder models: multilingual BERT, multilingual XLM-RoBERTa, trilingual Croatian-Slovene-English BERT, and the monolingual Slovene RoBERTa model (SloBERTa).

Results: On the classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model. However, these models are helpful for generative tasks and provide several useful results. In general, model size matters, and there is currently not enough Slovene training data to pretrain large models successfully.

Discussion: While the results are obtained on Slovene, we believe that they may generalize to other less-resourced languages, where such models will be built. We make the training and evaluation code, as well as the trained models, publicly available.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10086348
DOI: http://dx.doi.org/10.3389/frai.2023.932519
