Impact of Data Augmentation on Hate Speech Detection in Roman Urdu

Bibliographic Details
Title: Impact of Data Augmentation on Hate Speech Detection in Roman Urdu
Authors: Fariha Maqbool, Blerina Spahiu, Andrea Maurino
Contributors: Atzori, M, Ciaccia, P, Ceci, M, Mandreoli, F, Malerba, D, Sanguinetti, M, Pellicani, A, Motta, F, Maqbool, F, Spahiu, B, Maurino, A
Publication Information: CEUR-WS
Publication Year: 2024
Collection: Università degli Studi di Milano-Bicocca: BOA (Bicocca Open Archive)
Subject Terms: data augmentation, under resourced languages, large language models
Description: The prevalence of hate speech leads to an increase in hate crimes, online violence, and serious harm to social safety, physical security, and cyberspace. To address this issue, several studies have been conducted on hate speech detection in European languages, whereas little attention has been paid to low-resource South Asian languages, leaving millions of social media users vulnerable. Because the available datasets and samples are scarce, strategies are needed to increase the number of data samples. In this paper, we improve the performance of an already fine-tuned m-BERT model by applying data augmentation techniques to one of the existing datasets of hate speech tweets in Roman Urdu. F1-score and accuracy are used as metrics to compare the results. We also experiment to determine the optimal percentage of augmented data to include and the percentage of words to augment in each data instance. The new RUHSOLD++ dataset containing the augmented data has also been made publicly available. The improvement in the model's hate speech detection shows that model performance can be improved by applying data augmentation techniques to datasets with a limited number of instances.
Document Type: conference object
Language: English
Relation: ispartofbook:Proceedings of the 32nd Symposium on Advanced Database Systems; 32nd Italian Symposium on Advanced Database Systems, SEBD 2024 - 23 June 2024 through 26 June 2024; volume:3741; firstpage:321; lastpage:330; numberofpages:10; serie:CEUR WORKSHOP PROCEEDINGS; alleditors:Atzori, M; Ciaccia, P; Ceci, M; Mandreoli, F; Malerba, D; Sanguinetti, M; Pellicani, A; Motta, F; https://hdl.handle.net/10281/490399; info:eu-repo/semantics/altIdentifier/scopus/2-s2.0-85202057651; https://ceur-ws.org/Vol-3741/
Availability: https://hdl.handle.net/10281/490399
https://ceur-ws.org/Vol-3741/
Rights: info:eu-repo/semantics/openAccess
Accession Number: edsbas.74030F2C
Database: BASE