StatMetaQA: A dataset for closed domain question answering in Indonesian statistical metadata.

Data Brief

Faculty of Computer Science, Universitas Indonesia, Depok, West Java 16431, Indonesia.

Published: December 2024

A closed-domain question answering (QA) dataset for statistical metadata is important for building an effective QA system about statistics. Such a dataset can be used to train or fine-tune QA models in the statistical domain and to evaluate the effectiveness of QA methods in that domain. In this research, we build a new dataset in the Indonesian language, called StatMetaQA (Statistical Metadata Question Answering), consisting of statistical metadata documents and question-answer pair annotations of these documents. The collection of statistical metadata documents serves as the knowledge base of a QA system, while the collection of question-answer pair annotations is used to train or fine-tune QA models. The document collection, consisting of 861 statistical activity metadata documents and 1,231 statistical indicator metadata documents, was obtained from a website managed by Statistics Indonesia (http://sirusa.bps.go.id). The collection of question-answer pairs, consisting of 28,863 pairs drawn from 1,000 statistical metadata documents, was obtained using two strategies: human annotation and automatic annotation. Of these, 7,353 question-answer pairs were manually annotated by humans, and 21,510 were automatically generated by applying our predefined templates to selected document fields of the statistical metadata.
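
The template-based automatic annotation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not list the actual templates or document fields, so the field names (`title`, `producer`, `frequency`), the English question patterns, and the helper `generate_qa_pairs` are all hypothetical and shown only to convey the general idea of filling a question pattern from one metadata field and taking that field's value as the answer.

```python
from typing import Dict, List, Tuple

# Hypothetical templates: each maps a metadata field to a question pattern.
# The real StatMetaQA templates and field names are not given in the abstract.
TEMPLATES: Dict[str, str] = {
    "producer": "Who conducts the statistical activity '{title}'?",
    "frequency": "How often is the statistical activity '{title}' conducted?",
}


def generate_qa_pairs(doc: Dict[str, str]) -> List[Tuple[str, str]]:
    """Apply each template to one document; the field's value is the answer."""
    pairs: List[Tuple[str, str]] = []
    for field, template in TEMPLATES.items():
        answer = doc.get(field, "").strip()
        if not answer:
            # Skip fields that are missing or empty in this document.
            continue
        question = template.format(title=doc["title"])
        pairs.append((question, answer))
    return pairs


# Made-up document used only for illustration (not taken from the dataset).
example_doc = {
    "title": "Example Survey",
    "producer": "Example Agency",
    "frequency": "",  # an empty field yields no question-answer pair
}
pairs = generate_qa_pairs(example_doc)
```

Under these assumptions, each non-empty templated field of a document contributes one question-answer pair, which suggests how a fixed set of templates could yield a large number of pairs from a limited number of documents.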

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11416616
DOI: http://dx.doi.org/10.1016/j.dib.2024.110816
