The challenges of German archival document categorization on insufficient labeled data

Bibliographic Details
Title: The challenges of German archival document categorization on insufficient labeled data
Authors: Fabian Hoppe, Tabea Tietz, Danilo Dessì, Nils Meyer, Mirjam Sprau, Mehwish Alam, Harald Sack
Contributors: Hoppe, Fabian, Tietz, Tabea, Dessì, Danilo, Meyer, Nils, Sprau, Mirjam, Alam, Mehwish, Sack, Harald
Publication Information: CEUR-WS
Publication Year: 2020
Collection: Università degli Studi di Cagliari: UNICA IRIS
Subject Terms: Cultural Heritage, Dataless Categorization, Document Exploration, Text Categorization
Description: Document exploration in archives is often challenging because the documents are not organized into topic-based categories. Moreover, archival records provide only short texts, which are often insufficient for capturing their semantics. This paper proposes and explores a dataless categorization approach that uses word embeddings and TF-IDF to categorize archival documents. It also introduces a visual approach built on top of the word embeddings to support data exploration. Preliminary results suggest that current vector representations alone do not provide enough external knowledge to solve this task.
Document Type: conference object
Language: English
Relation: ispartofbook:WHiSe 2020 Workshop on Humanities in the Semantic Web 2020; 3rd Workshop on Humanities in the Semantic Web, WHiSe 2020; volume:2695; firstpage:15; lastpage:20; numberofpages:6; serie:CEUR WORKSHOP PROCEEDINGS; http://hdl.handle.net/11584/321829; info:eu-repo/semantics/altIdentifier/scopus/2-s2.0-85095974032
Availability: http://hdl.handle.net/11584/321829
Rights: info:eu-repo/semantics/openAccess
Accession Number: edsbas.CDDA3293
Database: BASE
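
Note: The description mentions a dataless categorization approach that combines word embeddings with TF-IDF. The record does not give the authors' actual pipeline, so the following is only a minimal sketch of one common reading of that idea: each document becomes a TF-IDF-weighted sum of its word vectors and is assigned to the category whose label vector is closest by cosine similarity. The tiny `EMBEDDINGS` table, the example tokens, the category labels, and all function names are hypothetical and stand in for a pretrained German embedding model.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional word vectors (assumption for illustration only);
# a real setup would load pretrained German embeddings instead.
EMBEDDINGS = {
    "eisenbahn": [0.90, 0.10, 0.00],
    "bahnhof":   [0.80, 0.20, 0.10],
    "zug":       [0.85, 0.05, 0.10],
    "gericht":   [0.10, 0.90, 0.00],
    "richter":   [0.00, 0.95, 0.10],
    "prozess":   [0.10, 0.85, 0.05],
    "verkehr":   [1.00, 0.00, 0.00],  # category label
    "justiz":    [0.00, 1.00, 0.00],  # category label
}


def idf_weights(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def doc_vector(tokens, idf):
    """TF-IDF-weighted sum of the word vectors of all known tokens."""
    tf = Counter(tok for tok in tokens if tok in EMBEDDINGS)
    dim = len(next(iter(EMBEDDINGS.values())))
    vec = [0.0] * dim
    for tok, count in tf.items():
        weight = count * idf.get(tok, 1.0)
        for i, x in enumerate(EMBEDDINGS[tok]):
            vec[i] += weight * x
    return vec


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def categorize(tokens, labels, idf):
    """Pick the label whose embedding is most similar to the document vector."""
    vec = doc_vector(tokens, idf)
    return max(labels, key=lambda label: cosine(vec, EMBEDDINGS[label]))


# Two toy "archival records", already tokenized and lowercased.
docs = [["eisenbahn", "bahnhof", "zug"], ["gericht", "richter", "prozess"]]
labels = ["verkehr", "justiz"]
idf = idf_weights(docs)
predictions = [categorize(doc, labels, idf) for doc in docs]
print(predictions)
```

No labeled training documents are used here; the only supervision is the category names themselves, which is what makes the setting "dataless". With short records and few in-vocabulary tokens, the document vector rests on very little evidence, which is consistent with the description's remark that vector representations alone may not carry enough external knowledge.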