UMLS-based data augmentation for natural language processing of clinical research literature.

J Am Med Inform Assoc

Department of Biomedical Informatics, Columbia University, New York, New York, USA.

Published: March 2021

Objective: The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity.

Materials and Methods: We extended the easy data augmentation (EDA) method for biomedical named entity recognition (NER) by incorporating Unified Medical Language System (UMLS) knowledge, and called the resulting method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT.
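
To make the idea of knowledge-based augmentation concrete, the following is a minimal sketch of one plausible operation: replacing a labeled entity mention with a synonym while keeping the BIO tags aligned. It is not the authors' implementation. The abstract does not list the operations UMLS-EDA uses, and the `TOY_SYNONYMS` dictionary, the `augment` function, and the `Condition` label are hypothetical stand-ins; a real system would draw synonyms from UMLS itself.

```python
import random

# Hypothetical toy lookup standing in for UMLS synonym sets (assumption:
# a real pipeline would query UMLS, which is not shown here).
TOY_SYNONYMS = {
    "myocardial infarction": ["heart attack"],
    "hypertension": ["high blood pressure"],
}


def augment(tokens, labels, synonyms, rng):
    """Replace labeled entity spans that have a known synonym.

    tokens and labels are parallel lists; labels use the BIO scheme.
    Returns new (tokens, labels) lists with BIO tags rebuilt for each
    replaced span, so the two lists stay the same length.
    """
    out_tokens, out_labels = [], []
    i = 0
    while i < len(tokens):
        if labels[i].startswith("B-"):
            ent_type = labels[i][2:]
            # Find the end of the entity span (B- followed by matching I- tags).
            j = i + 1
            while j < len(tokens) and labels[j] == "I-" + ent_type:
                j += 1
            mention = " ".join(tokens[i:j]).lower()
            choices = synonyms.get(mention)
            # Keep the original span when no synonym is known.
            new_span = rng.choice(choices).split() if choices else tokens[i:j]
            out_tokens.extend(new_span)
            out_labels.extend(
                ["B-" + ent_type] + ["I-" + ent_type] * (len(new_span) - 1)
            )
            i = j
        else:
            out_tokens.append(tokens[i])
            out_labels.append(labels[i])
            i += 1
    return out_tokens, out_labels


# Usage: one augmented copy of a labeled sentence.
sentence = ["Patients", "with", "myocardial", "infarction", "were", "enrolled"]
tags = ["O", "O", "B-Condition", "I-Condition", "O", "O"]
new_tokens, new_tags = augment(sentence, tags, TOY_SYNONYMS, random.Random(0))
```

Rebuilding the tags per span, rather than per token, is what lets a replacement change the number of tokens in a mention without corrupting the NER labels.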

Results: UMLS-EDA substantially improves NER performance over the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, +17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain in micro-F1 score is 9 percentage points, from 0.75 to 0.84, which exceeds the score of classifiers with BERT pretraining (0.82).
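
For readers unfamiliar with the metric reported above, micro-F1 pools true positives, false positives, and false negatives across all classes before computing F1, which gives 2·TP / (2·TP + FP + FN). The sketch below illustrates that standard definition only; the function name `micro_f1` and the example counts are hypothetical and are not taken from the study.

```python
def micro_f1(counts):
    """Compute micro-averaged F1.

    counts maps a class name to a (tp, fp, fn) tuple. The counts are
    summed over all classes first, then F1 is computed once on the totals.
    Returns 0.0 when there are no counts at all.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Usage with made-up per-class counts (illustrative only).
example_counts = {"Condition": (8, 2, 1), "Drug": (2, 1, 3)}
score = micro_f1(example_counts)
```

Because the counts are pooled, frequent classes weigh more heavily in micro-F1 than rare ones, which is worth keeping in mind when comparing scores across datasets with different label distributions.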

Conclusions: This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7973470
DOI: http://dx.doi.org/10.1093/jamia/ocaa309
