Regularizing transformers with deep probabilistic layers.

Neural Netw

Swiss Data Science Institute (ETHZ/EPFL), Universitätstrasse 25, 8006 Zurich, Switzerland.

Published: April 2023

Language models (LMs) have grown continuously over the last decade, from sequence-to-sequence architectures to attention-based Transformers. However, regularization has not been studied in depth for these architectures. In this work, we use a Gaussian Mixture Variational Autoencoder (GMVAE) as a regularizer layer. We study its advantages with respect to the depth at which it is placed and demonstrate its effectiveness in several scenarios. Experimental results show that including deep generative models within Transformer-based architectures such as BERT, RoBERTa, or XLM-R can yield more versatile models, able to generalize better, achieve improved imputation scores on tasks such as SST-2 and TREC, or even impute missing/noisy words with richer text.
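The abstract gives no implementation details, so the following is only a minimal, hypothetical sketch of the general idea it describes: a GMVAE-style bottleneck that auto-encodes the hidden states of an intermediate Transformer layer and contributes an auxiliary regularization loss. It assumes PyTorch and the Hugging Face transformers library; the latent size (64), number of mixture components (10), layer index (6), unit-variance mixture priors, and MSE reconstruction term are illustrative choices, not the paper's settings.

```python
# Minimal sketch (not the authors' code): a GMVAE-style bottleneck over
# intermediate Transformer hidden states, returning an auxiliary loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer


class GMVAERegularizer(nn.Module):
    def __init__(self, hidden_size: int, latent_size: int = 64, n_components: int = 10):
        super().__init__()
        self.encoder = nn.Linear(hidden_size, 2 * latent_size)        # -> mean, log-variance
        self.component_logits = nn.Linear(hidden_size, n_components)  # soft mixture assignment
        self.prior_means = nn.Parameter(torch.randn(n_components, latent_size))
        self.decoder = nn.Linear(latent_size, hidden_size)

    def forward(self, h):  # h: (batch, seq_len, hidden_size)
        mu, logvar = self.encoder(h).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)       # reparameterization trick
        resp = F.softmax(self.component_logits(h), dim=-1)            # (batch, seq, K)
        # KL(q(z|x) || N(m_k, I)) for every component k, weighted by responsibilities
        diff = mu.unsqueeze(-2) - self.prior_means                    # (batch, seq, K, latent)
        kl_k = 0.5 * ((torch.exp(logvar).unsqueeze(-2) + diff.pow(2)).sum(-1)
                      - logvar.sum(-1, keepdim=True) - mu.size(-1))
        kl = (resp * kl_k).sum(-1).mean()
        h_rec = self.decoder(z)                                       # reconstructed hidden states
        aux_loss = F.mse_loss(h_rec, h) + kl
        return h_rec, aux_loss


# Illustrative usage with a pretrained BERT encoder (layer index 6 is arbitrary).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
gmvae = GMVAERegularizer(hidden_size=bert.config.hidden_size)

batch = tokenizer(["regularization for transformers"], return_tensors="pt")
hidden = bert(**batch, output_hidden_states=True).hidden_states[6]
regularized_hidden, aux_loss = gmvae(hidden)   # add aux_loss to the task loss when fine-tuning
```

In the setup suggested by the abstract, the reconstructed hidden states would replace the originals as input to the remaining Transformer layers, so the generative bottleneck acts as a regularizer at a chosen depth; here the depth and loss weighting are left as open, hypothetical choices.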

Source
http://dx.doi.org/10.1016/j.neunet.2023.01.032


Similar Publications

Cancer is the second leading cause of death, significantly threatening human health. Effective treatment options are often lacking in advanced stages, making early diagnosis crucial for reducing mortality rates. Circulating tumor cells (CTCs) are a promising biomarker for early detection; however, their automatic detection is challenging due to their heterogeneous size and shape, as well as their scarcity in blood.


The aim of this study was to determine the frequency-temperature dependence of the AC conductivity and relaxation times in humid electrical pressboard used in the insulation of power transformers, impregnated with the innovative NYTRO BIO 300X bio-oil produced from plant raw materials. Tests were carried out for a composite of cellulose-bio-oil-water nanodroplets with a moisture content of 0.6% by weight to 5% by weight in the frequency range from 10 Hz to 5·10 Hz.


Anterior vertebral tethering (AVT) is a non-invasive spine surgery technique, treating severe spine deformations and preserving lower back mobility. However, patient positioning and surgical strategies greatly influence postoperative results. Predicting the upright geometry of pediatric spines is needed to optimize patient positioning in the operating room (OR) and improve surgical outcomes, but it remains a complex task due to immature bone properties.


Medical image segmentation is essential for accurately representing tissues and organs in scans, improving diagnosis, guiding treatment, enabling quantitative analysis, and advancing AI-assisted healthcare. Organs and lesion areas in medical images have complex geometries and spatial relationships. Due to variations in the size and location of lesion areas, automatic segmentation faces significant challenges.


Purpose: Extracting inclusion and exclusion criteria in a structured, automated fashion remains a challenge to developing better search functionalities or automating systematic reviews of randomized controlled trials in oncology. The question "Did this trial enroll patients with localized disease, metastatic disease, or both?" could be used to narrow down the number of potentially relevant trials when conducting a search.

Methods: Six hundred trials from high-impact medical journals were classified depending on whether they allowed for the inclusion of patients with localized and/or metastatic disease.

