Report
When a Language Question Is at Stake. A Revisited Approach to Label Sensitive Content
Title: | When a Language Question Is at Stake. A Revisited Approach to Label Sensitive Content |
---|---|
Authors: | Daria, Stetsenko |
Publication Year: | 2023 |
Collection: | Computer Science |
Subject Terms: | Computer Science - Computation and Language |
Description: | Many under-resourced languages require high-quality datasets for specific tasks such as offensive language detection, disinformation, or misinformation identification. However, the intricacies of the content may have a detrimental effect on the annotators. The article aims to revisit an approach of pseudo-labeling sensitive data on the example of Ukrainian tweets covering the Russian-Ukrainian war. Nowadays, this acute topic is in the spotlight of various language manipulations that cause numerous disinformation and profanity on social media platforms. The conducted experiment highlights three main stages of data annotation and underlines the main obstacles during machine annotation. Ultimately, we provide a fundamental statistical analysis of the obtained data, evaluation of models used for pseudo-labelling, and set further guidelines on how the scientists can leverage the corpus to execute more advanced research and extend the existing data samples without annotators' engagement. Comment: Ukrainian language, pseudo-labelling, dataset, offensive-language |
Document Type: | Working Paper |
Access URL: | http://arxiv.org/abs/2311.10514 |
Accession Number: | edsarx.2311.10514 |
Database: | arXiv |
Publication Date: | 17 November 2023 |