DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text.

Bharathi Raja Chakravarthi, Ruba Priyadharshini, Vigneshwaran Muralidaran, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly, John P McCrae

Lang Resour Eval

Insight SFI Research Centre for Data Analytics, Data Science Institute, National University of Ireland Galway, Galway, Ireland.

Published: February 2022

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages, built from social media comments. More than 60,000 YouTube comments were annotated for sentiment analysis and offensive language identification. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff's alpha. Because it comprises user-generated content from a multilingual country, the dataset contains all types of code-mixing phenomena. We also present baseline experiments that establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9388449
DOI: http://dx.doi.org/10.1007/s10579-022-09583-7
