StatMetaQA: A dataset for closed domain question answering in Indonesian statistical metadata.

Data Brief

Faculty of Computer Science, Universitas Indonesia, Depok, West Java 16431, Indonesia.

Published: December 2024

A closed-domain question answering (QA) dataset for statistical metadata is important for building an effective QA system about statistics. Such a dataset can be used to train or fine-tune QA models for the statistical domain, and to evaluate the effectiveness of QA methods in that domain. In this research, we build a new Indonesian-language dataset, called StatMetaQA (Statistical Metadata Question Answering), that comprises statistical metadata documents and question-answer pair annotations of those documents. The collection of statistical metadata documents serves as the knowledge base of a QA system, while the collection of question-answer pair annotations is used to train or fine-tune QA models. The document collection, consisting of 861 statistical activity metadata documents and 1,231 statistical indicator metadata documents, was obtained from a website managed by Statistics Indonesia (http://sirusa.bps.go.id). The collection of question-answer pairs, consisting of 28,863 pairs drawn from 1,000 statistical metadata documents, was obtained using two strategies: human and automatic annotation. Of these, 7,353 question-answer pairs were manually annotated by humans, and 21,510 were automatically generated by applying our predefined templates to selected fields of the statistical metadata documents.
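The template-based strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not list the actual templates or metadata field names, so the fields (`title`, `objective`, `frequency`), the English question patterns, and the sample document below are all hypothetical; the real dataset is in Indonesian and its templates may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical templates mapping a metadata field to a question pattern.
# These are illustrative assumptions, not the templates used for StatMetaQA.
TEMPLATES: Dict[str, str] = {
    "objective": "What is the objective of {title}?",
    "frequency": "How often is {title} conducted?",
}


def generate_qa_pairs(doc: Dict[str, str]) -> List[Tuple[str, str]]:
    """Build (question, answer) pairs from the fields of one metadata document.

    The answer is the field value itself; fields that are missing or empty
    produce no pair.
    """
    pairs: List[Tuple[str, str]] = []
    for field, pattern in TEMPLATES.items():
        answer = doc.get(field, "").strip()
        if not answer:
            continue
        pairs.append((pattern.format(title=doc["title"]), answer))
    return pairs


# Hypothetical metadata document with an empty "frequency" field.
sample_doc = {
    "title": "the Example Household Survey",
    "objective": "To collect example household data.",
    "frequency": "",
}

qa_pairs = generate_qa_pairs(sample_doc)
for question, answer in qa_pairs:
    print(question, "->", answer)
```

Because each answer is copied directly from a document field, pairs produced this way are grounded in the knowledge base by construction, which is one plausible reason to pair automatic generation with a smaller set of human-written questions.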

Download full-text PDF	Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11416616	PMC
http://dx.doi.org/10.1016/j.dib.2024.110816	DOI Listing
