Improving tabular data extraction in scanned laboratory reports using deep learning models.

J Biomed Inform

Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, USA. Electronic address:

Published: November 2024

Objective: Medical laboratory testing is essential in healthcare, providing crucial data for diagnosis and treatment. Nevertheless, patients' lab testing results are often transferred via fax across healthcare organizations and are not immediately available for timely clinical decision making. Thus, it is important to develop new technologies to accurately extract lab testing information from scanned laboratory reports. This study aims to develop an advanced deep learning-based Optical Character Recognition (OCR) method to identify tables containing lab testing results in scanned laboratory reports.

Methods: Extracting tabular data from scanned lab reports involves two stages: table detection (i.e., identifying the area of a table object) and table recognition (i.e., identifying and extracting tabular structures and contents). DETR R18 algorithm as well as YOLOv8s were involved for table detection, and we compared the performance of PaddleOCR and the encoder-dual-decoder (EDD) model for table recognition. 650 tables from 632 randomly selected laboratory test reports were annotated and used to train and evaluate those models. For table detection evaluation, we used metrics such as Average Precision (AP), Average Recall (AR), AP50, and AP75. For table recognition evaluation, we employed Tree-Edit Distance (TEDS).

Results: For table detection, fine-tuned DETR R18 demonstrated superior performance (AP50: 0.774; AP75: 0.644; AP: 0.601; AR: 0.766). In terms of table recognition, fine-tuned EDD outperformed other models with a TEDS score of 0.815. The proposed OCR pipeline (fine-tuned DETR R18 and fine-tuned EDD), demonstrated impressive results, achieving a TEDS score of 0.699 and a TEDS structure score of 0.764.

Conclusions: Our study presents a dedicated OCR pipeline for scanned clinical documents, utilizing state-of-the-art deep learning models for region-of-interest detection and table recognition. The high TEDS scores demonstrate the effectiveness of our approach, which has significant implications for clinical data analysis and decision-making.

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.jbi.2024.104735DOI Listing

Publication Analysis

Top Keywords

table recognition
20
table detection
16
scanned laboratory
12
lab testing
12
detr r18
12
table
10
tabular data
8
laboratory reports
8
deep learning
8
learning models
8

Similar Publications

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!