Building a benchmark dataset for the Kurdish news question answering.

Data Brief

Computer Science Department, College of Science, University of Halabja, Kurdistan Region, Halabja, Iraq.

Published: December 2024

This article presents the Kurdish News Question Answering Dataset (KNQAD). The texts were collected from various Kurdish news websites using the ParseHub software, covering several news categories such as social news, religion, sports, science, and economy. The dataset consists of 15,002 news paragraphs with question-answer pairs. For each news paragraph, one or more question-answer pairs were created manually from the content of the paragraph. The dataset was pre-processed by cleaning and normalizing the data: special characters and stop words were removed during cleaning, and stemming was applied as the normalization step. The distribution of question types in KNQAD is reported. In addition, the difficulty of the QA task is analyzed by measuring lexical similarity between questions and answers.
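The cleaning and lexical-similarity steps described above can be illustrated with a minimal sketch. The abstract does not name the stop-word list, the stemmer, or the similarity measure, so everything here is an assumption made for illustration: the function names `clean` and `jaccard` are hypothetical, Jaccard token overlap stands in for whatever lexical similarity technique the authors used, the stemmer defaults to a no-op, and the toy example uses English text with a made-up stop-word set rather than real KNQAD data.

```python
import re


def clean(text, stop_words=frozenset(), stem=lambda token: token):
    """Lowercase, drop special characters and stop words, then stem.

    `stop_words` and `stem` are placeholders: the dataset's actual Kurdish
    stop-word list and stemmer are not given in the abstract.
    """
    # \w is Unicode-aware in Python 3, so non-Latin letters are kept
    # while punctuation and other special characters are discarded.
    tokens = re.findall(r"\w+", text.lower())
    return [stem(token) for token in tokens if token not in stop_words]


def jaccard(a_tokens, b_tokens):
    """Jaccard overlap between two token lists (assumed similarity measure)."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy English example (not taken from the dataset).
stop = frozenset({"the", "is", "of", "in"})
question = clean("Who won the match in Erbil?", stop)
answer = clean("The local team won the match.", stop)
score = jaccard(question, answer)
print(question, answer, round(score, 3))
```

A low overlap score between a question and its answer suggests that simple word matching is not enough to locate the answer, which is one way to read the "complexity" analysis mentioned in the abstract.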

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11418133
DOI: http://dx.doi.org/10.1016/j.dib.2024.110916
