Toward training NLP models to take into account privacy leakages

Bibliographic Details
Title: Toward training NLP models to take into account privacy leakages
Authors: Berthelier, Gaspard; Boutet, Antoine; Richard, Antoine
Contributors: Privacy Models, Architectures and Tools for the Information Society (PRIVATICS), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria); CITI Centre of Innovation in Telecommunications and Integration of Services (CITI), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon; Inria Lyon; Hospices Civils de Lyon (HCL); ANR-22-PECY-0002, iPoP, interdisciplinary Project on Privacy (2022)
Source: BigData 2023 - IEEE International Conference on Big Data, Dec 2023, Sorrento, Italy, pp. 1-9; https://hal.science/hal-04299405
Publication Information: HAL CCSD; IEEE
Publication Year: 2023
Collection: Université de Lyon: HAL
Subject Terms: NLP models, Privacy, Membership Inference, Counterfactual Memorisation, Data Extraction, [INFO]Computer Science [cs]
Geographic Subject: Sorrento, Italy
Description: International audience ; With the rise of machine learning and data-driven models, especially in the field of Natural Language Processing (NLP), a strong demand for sharing data between organisations has emerged. However, datasets are usually composed of personal data and are thus subject to numerous regulations that require anonymisation before the data can be disseminated. In the medical domain, for instance, patient records are extremely sensitive and private, yet the de-identification of medical documents is a complex task. Recent advances in NLP models have shown encouraging results in this field, but the question of whether deploying such models is safe remains open. In this paper, we evaluate three privacy risks on NLP models trained on sensitive data. Specifically, we evaluate counterfactual memorisation, which corresponds to rare and sensitive information that has too much influence on the model. We also evaluate membership inference as well as the ability to extract verbatim training data from the model. With this evaluation, we can remove at-risk records from the training data and calibrate hyperparameters, providing a utility-privacy tradeoff complementary to the usual mitigation strategies such as differential privacy. We exhaustively illustrate the privacy leakage of NLP models through a use case on medical texts and discuss the impact of both the proposed methodology and the mitigation schemes.
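The membership inference risk mentioned in the description can be illustrated with a simple loss-threshold test against a language model. The sketch below is not the authors' implementation; the model name, threshold value, and candidate text are placeholders chosen only to show the general idea.

```python
# Minimal sketch of a loss-threshold membership inference test on a causal
# language model. "gpt2" stands in for a model fine-tuned on sensitive text;
# the threshold and candidate string are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical stand-in for the target model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sample_loss(text: str) -> float:
    """Per-token cross-entropy of `text` under the model (lower = more strongly memorised)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

# A record is flagged as a likely training member when its loss falls below a
# threshold calibrated on texts known to be outside the training set.
threshold = 3.0  # illustrative value; calibrate on held-out non-member data
candidate = "Patient record example text"
print("likely member" if sample_loss(candidate) < threshold else "likely non-member")
```

Records flagged by such a test (or by a counterfactual memorisation score) could then be removed from the training data before retraining, which is the kind of curation and hyperparameter calibration the description refers to.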
Document Type: conference object
Language: English
Availability: https://hal.science/hal-04299405
https://hal.science/hal-04299405v1/document
https://hal.science/hal-04299405v1/file/NLP_Privacy_Hopitaux%20%2818%29.pdf
Rights: http://creativecommons.org/licenses/by/ ; info:eu-repo/semantics/OpenAccess
Accession Number: edsbas.7E21E6E5
Database: BASE