Addressing cyberbullying in Urdu tweets: a comprehensive dataset and detection system.

Farah Adeeba, Muhammad Irfan Yousuf, Izza Anwer, Sardar Umair Tariq, Abdullah Ashfaq, Malik Naqeeb

PeerJ Comput Sci

Department of Computer Science, University of Engineering and Technology Lahore, Lahore, Punjab, Pakistan.

Published: April 2024

Cyberbullying has reached an alarming prevalence: approximately 54% of teenagers experience some form of it, including offensive hate speech, threats, and racism. This research introduces a comprehensive dataset and system for cyberbullying detection in Urdu tweets, leveraging a spectrum of machine learning approaches, from traditional models to advanced deep learning techniques. The objectives of this study are threefold. First, a dataset of 12,500 annotated Urdu tweets is created and made publicly available to the research community. Second, annotation guidelines for Urdu text, with appropriate labels for cyberbullying detection, are developed. Finally, a series of experiments is conducted to assess the performance of machine learning and deep learning techniques in detecting cyberbullying. The results indicate that fastText deep learning models outperform the other models evaluated. The study demonstrates that the proposed system effectively detects and classifies cyberbullying incidents in Urdu tweets, contributing to the broader effort of creating a safer digital environment.
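The abstract reports that fastText models performed best but does not describe the training pipeline. As a minimal sketch, assuming the annotated tweets are available as (text, label) pairs, the snippet below converts them into the plain-text layout read by fastText's supervised mode, in which each line begins with a `__label__` prefix. The helper `to_fasttext_line`, the label names, and the placeholder texts are hypothetical and are not taken from the dataset.

```python
def to_fasttext_line(text: str, label: str) -> str:
    """Format one annotated example as a fastText supervised-training line."""
    # Collapse newlines and repeated whitespace so each example stays on one line.
    clean = " ".join(text.split())
    return f"__label__{label} {clean}"


# Placeholder examples; the label names are assumptions for illustration only.
examples = [
    ("placeholder tweet text one", "bullying"),
    ("placeholder  tweet\ntext two", "not_bullying"),
]

lines = [to_fasttext_line(text, label) for text, label in examples]
print("\n".join(lines))
```

A file of such lines could then be handed to a fastText trainer; whether the authors used this exact layout is not stated in the abstract.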

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11065408	PMC
http://dx.doi.org/10.7717/peerj-cs.1963	DOI Listing

Publication Analysis

Top Keywords

urdu tweets

cyberbullying detection

deep learning

comprehensive dataset

machine learning

learning techniques

cyberbullying

urdu

learning

addressing cyberbullying

Similar Publications

An automated approach to identify sarcasm in low-resource language.

PLoS One

December 2024

Department of Computer Science, Al Ain University, Al Ain, UAE.

Shumaila Khan, Iqbal Qasim, Wahab Khan, Aurangzeb Khan, Javed Ali Khan

Sarcasm detection has emerged as an active topic because of its applicability in natural language processing (NLP), but it remains largely unexplored in low-resource languages such as Urdu, Arabic, Pashto, and Roman-Urdu. Most existing work is in English, and few studies have addressed sarcasm in low-resource languages. This research addresses the gap by exploring the efficacy of diverse machine learning (ML) algorithms in identifying sarcasm in Urdu.



SentiUrdu-1M: A large-scale tweet dataset for Urdu text sentiment analysis using weakly supervised learning.

PLoS One

September 2023

Dept. of Computer Science, Sukkur IBA University, Sukkur, Pakistan.

Abdul Ghafoor, Ali Shariq Imran, Sher Muhammad Daudpota, Zenun Kastrati, Sarang Shaikh

Low-resource languages are gaining much-needed attention with the advent of deep learning models and pre-trained word embeddings. Though spoken by more than 230 million people worldwide, Urdu is one such low-resource language; it has recently gained popularity online and is attracting considerable attention and support from the research community. One challenge faced by such resource-constrained languages is the scarcity of publicly available large-scale datasets for conducting any meaningful study.


Multi-label emotion classification of Urdu tweets.

PeerJ Comput Sci

April 2022

CIC, Instituto Politécnico Nacional, Mexico City, Mexico.

Noman Ashraf, Lal Khan, Sabur Butt, Hsien-Tsung Chang, Grigori Sidorov

Urdu is a widely used language in South Asia and worldwide. While there are similar datasets available in English, we created the first multi-label emotion dataset consisting of 6,043 tweets and six basic emotions in the Urdu Nastalíq script. A multi-label (ML) classification approach was adopted to detect emotions from Urdu.


The growing amplification of social media: measuring temporal and social contagion dynamics for over 150 languages on Twitter for 2009-2020.

EPJ Data Sci

March 2021

Vermont Complex Systems Center, University of Vermont, Burlington, VT 05405 USA.

Thayer Alshaabi, David Rushing Dewhurst, Joshua R Minot, Michael V Arnold, Jane L Adams

Working from a dataset of 118 billion messages running from the start of 2009 to the end of 2019, we identify and explore the relative daily use of over 150 languages on Twitter. We find that eight languages comprise 80% of all tweets, with English, Japanese, Spanish, Arabic, and Portuguese being the most dominant. To quantify social spreading in each language over time, we compute the 'contagion ratio': the balance of retweets to organic messages.
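As a rough illustration of the 'contagion ratio' mentioned above, the sketch below assumes that "balance" means the simple quotient of retweet count to organic-message count for one language on one day. The function name and the counts are hypothetical; the excerpt does not give the exact definition used in the paper.

```python
def contagion_ratio(retweets: int, organic: int) -> float:
    """Return retweets divided by organic messages (assumed definition)."""
    if organic <= 0:
        # The ratio is undefined when there are no organic messages.
        raise ValueError("organic message count must be positive")
    return retweets / organic


# Hypothetical daily counts for a single language.
print(contagion_ratio(30, 60))    # fewer retweets than organic messages
print(contagion_ratio(150, 100))  # more retweets than organic messages
```

Under this reading, a value above 1 means that retweets outnumber organic messages for that language on that day.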
