Automatic authorship identification is a challenging task that has been the focus of extensive research in natural language processing. Despite the progress made in authorship attribution, the lack of corpora for under-resourced languages impedes the advancement and evaluation of existing methods. To address this gap, we investigate the problem of authorship attribution in Albanian.
In this study, we present the acquisition and categorization of a geographically informed, multi-dialectal Albanian National Corpus, derived from Twitter data. The primary dialects from three distinct regions (Albania, Kosovo, and North Macedonia) are considered. The assembled publicly available dataset encompasses anonymized user information, user-generated tweets, auxiliary tweet-related data, and annotations corresponding to dialect categories.