In the vibrant linguistic landscape of Bengali, spoken by millions in Bangladesh and India, the gap between the saintly (Sadhu) and common (Cholito) forms of the language is culturally and computationally significant. Recognising this, we introduce BanglaBlend, a dataset created to capture these stylistic distinctions. BanglaBlend comprises 7,350 annotated sentences, 3,675 in saintly form and 3,675 in common form, addressing a crucial gap in natural language processing (NLP) resources for Bangla. The dataset supports a variety of applications. It contributes to the creation of NLP models that can detect and imitate Bengali stylistic nuances, thereby improving tasks such as text categorisation, sentiment analysis, and style translation. BanglaBlend also facilitates literary analysis, cultural heritage projects, and the creation of domain-specific texts. To ensure data quality, pre-processing steps such as anonymization and duplicate removal were applied, and the style labels were validated in three steps to ensure correctness. BanglaBlend is intended as a foundation for future NLP research and development in Bangla: it is a resource for studying stylistic diversity, it aids the development of context-aware language models, and it serves both academic research and practical applications. By making BanglaBlend freely accessible, we hope to encourage cooperation and creativity within the Bangla NLP community, thereby adding to the worldwide variety of computational linguistic resources.
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11732579 | PMC |
| http://dx.doi.org/10.1016/j.dib.2024.111240 | DOI Listing |