This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages, built from social media comments. More than 60,000 YouTube comments were annotated for sentiment analysis and offensive language identification. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff's alpha. Because it comprises user-generated content from a multilingual country, the dataset contains all types of code-mixing phenomena. We also present baseline experiments that establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.
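
As an illustration of the agreement measure named above, the following is a minimal sketch of Krippendorff's alpha for nominal labels. It is not the authors' code: the function name `krippendorff_alpha_nominal`, the input layout (one list of labels per comment, with missing ratings simply omitted), and the example labels are all assumptions made for this sketch.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists (hypothetical layout): each inner list holds
    the labels that the annotators assigned to one comment.
    """
    # Coincidence matrix: each ordered pair of labels given to the same
    # comment by two different annotators contributes 1 / (m - 1),
    # where m is the number of labels that comment received.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # comments with a single label carry no agreement information
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one comment with two or more labels")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so no disagreement is possible
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical labels for three comments, two annotators each.
    example = [
        ["Positive", "Positive"],
        ["Positive", "Negative"],
        ["Negative", "Negative"],
    ]
    print(krippendorff_alpha_nominal(example))
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance; the sketch treats all label mismatches as equally severe, which is the nominal variant of the measure.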

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9388449
DOI: http://dx.doi.org/10.1007/s10579-022-09583-7
