Representation learning of writing style

Bibliographic Details
Title: Representation learning of writing style
Authors: Hay, Julien; Doan, Bich-Lien; Popineau, Fabrice; Ait Elhara, Ouassim
Contributors: Données et Connaissances Massives et Hétérogènes (LRI) (LaHDAK - LRI), Laboratoire de Recherche en Informatique (LRI), CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), TAckling the Underspecified (TAU), Inria Saclay - Ile de France, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire de Recherche en Informatique (LRI), Société Octopeek (Enghien Les bains, France), Octopeek (FRANCE)
Source: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020) ; https://hal.science/hal-04244991 ; Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), Nov 2020, Online, France. pp.232-243, ⟨10.18653/v1/2020.wnut-1.30⟩
Publisher Information: HAL CCSD
Association for Computational Linguistics
Publication Year: 2020
Subject Terms: [INFO]Computer Science [cs]
Subject Geographic: Online, France
Description: International audience ; In this paper, we introduce a new method of representation learning that aims to embed documents in a stylometric space. Previous studies in the field of authorship analysis focused on feature engineering techniques in order to represent document styles and to enhance model performance in specific tasks. Instead, we directly embed documents in a stylometric space by relying on a reference set of authors and the intra-author consistency property, which is one of two components in our definition of writing style. The main intuition of this paper is that we can define a general stylometric space from a set of reference authors such that, in this space, the coordinates of different documents will be close when the documents are by the same author and far apart when they are by different authors, even for documents by authors who are not in the set of reference authors. The method we propose allows for the clustering of documents based on stylistic clues reflecting the authorship of documents. For the empirical validation of the method, we train a deep neural network model to predict the authors of a large reference dataset consisting of news and blog articles. Although the learning process is supervised, it does not require dedicated labeling of the data; it relies only on the metadata of the articles, which is available in huge amounts. We evaluate the model on multiple datasets, on both the authorship clustering and the authorship attribution tasks.
Document Type: conference object
Language: English
Relation: hal-04244991; https://hal.science/hal-04244991; https://hal.science/hal-04244991/document; https://hal.science/hal-04244991/file/2020.wnut-1.30.pdf
DOI: 10.18653/v1/2020.wnut-1.30
Availability: https://hal.science/hal-04244991
https://hal.science/hal-04244991/document
https://hal.science/hal-04244991/file/2020.wnut-1.30.pdf
https://doi.org/10.18653/v1/2020.wnut-1.30
Rights: info:eu-repo/semantics/OpenAccess
Accession Number: edsbas.84386DF7
Database: BASE
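
Illustrative note: the intuition stated in the description (documents by the same author lie close together in the stylometric space, documents by different authors lie far apart) can be sketched as a simple check on document vectors. This is a minimal sketch, not the authors' model or evaluation protocol: the vectors below are hypothetical hand-written stand-ins for embeddings that a trained network would produce, and the function names `cosine` and `mean_similarities` are assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarities(embeddings, authors):
    """Return (mean same-author similarity, mean different-author similarity)."""
    intra, inter = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        # Sort each document pair by whether the two authors match.
        (intra if authors[i] == authors[j] else inter).append(sim)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical 3-dimensional style vectors (assumption: a real system
# would obtain these from a trained model, not write them by hand).
toy_embeddings = [
    [0.9, 0.1, 0.0],  # author A, document 1
    [0.8, 0.2, 0.1],  # author A, document 2
    [0.1, 0.9, 0.2],  # author B, document 1
    [0.0, 0.8, 0.3],  # author B, document 2
]
toy_authors = ["A", "A", "B", "B"]

intra, inter = mean_similarities(toy_embeddings, toy_authors)
# In a well-formed stylometric space, intra should exceed inter.
print(intra > inter)
```

A gap between the two averages is only a rough indicator of intra-author consistency; the paper itself evaluates on authorship clustering and authorship attribution tasks.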