Conference
A SOUND DESCRIPTION: EXPLORING PROMPT TEMPLATES AND CLASS DESCRIPTIONS TO ENHANCE ZERO-SHOT AUDIO CLASSIFICATION
Title: | A SOUND DESCRIPTION: EXPLORING PROMPT TEMPLATES AND CLASS DESCRIPTIONS TO ENHANCE ZERO-SHOT AUDIO CLASSIFICATION |
---|---|
Authors: | Olvera, Michel; Stamatiadis, Paraskevas; Essid, Slim |
Contributors: | Signal, Statistique et Apprentissage (S2A), Laboratoire Traitement et Communication de l'Information (LTCI), Institut Mines-Télécom Paris (IMT)-Télécom Paris, Institut Mines-Télécom Paris (IMT)-Institut Polytechnique de Paris (IP Paris)-Institut Polytechnique de Paris (IP Paris)-Institut Mines-Télécom Paris (IMT)-Télécom Paris, Institut Mines-Télécom Paris (IMT)-Institut Polytechnique de Paris (IP Paris)-Institut Polytechnique de Paris (IP Paris), Département Images, Données, Signal (IDS), Télécom ParisTech, This work was supported by the Audible project, funded by French BPI. |
Source: | DCASE 2024 - 9th Workshop on Detection and Classification of Acoustic Scenes and Events, Oct 2024, Tokyo, Japan ; https://hal.science/hal-04701759 |
Publication information: | HAL CCSD |
Publication year: | 2024 |
Subject terms: | Zero-shot audio classification, audio-text models, contrastive language-audio pretraining, in-context learning, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI] |
Geographic subject: | Tokyo, Japan |
Description: | International audience ; Audio-text models trained via contrastive learning offer a practical approach to audio classification through natural language prompts, such as "this is a sound of" followed by category names. In this work, we explore alternative prompt templates for zero-shot audio classification, demonstrating the existence of higher-performing options. First, we find that the formatting of the prompts significantly affects performance, so that simply prompting the models with properly formatted class labels performs competitively with optimized prompt templates and even prompt ensembling. Moreover, we look into complementing class labels with audio-centric descriptions. By leveraging large language models, we generate textual descriptions that prioritize acoustic features of sound events to disambiguate between classes, without extensive prompt engineering. We show that prompting with class descriptions leads to state-of-the-art results in zero-shot audio classification across major ambient sound datasets. Remarkably, this method requires no additional training and remains fully zero-shot. |
Document type: | conference object |
Language: | English |
Availability: | https://hal.science/hal-04701759 https://hal.science/hal-04701759v1/document https://hal.science/hal-04701759v1/file/main.pdf |
Rights: | info:eu-repo/semantics/OpenAccess |
Accession number: | edsbas.8E284E8B |
Database: | BASE |
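
The prompting scheme summarized in the description can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the helper names `build_prompt` and `classify`, the formatting rule (underscores to spaces, lowercase, capitalized first letter, trailing period), the example class description, and the three-dimensional toy embeddings are all assumptions made here. In practice the embeddings would come from the text and audio encoders of a pretrained contrastive audio-text model.

```python
import math


def build_prompt(label, template="{label}", description=None):
    """Build a text prompt for one class (hypothetical helper).

    The formatting rule is an assumed reading of "properly formatted":
    underscores become spaces, the text starts with a capital letter and
    ends with a period. An optional audio-centric description is appended.
    """
    name = label.replace("_", " ").strip().lower()
    text = template.format(label=name)
    text = text[:1].upper() + text[1:]
    if not text.endswith("."):
        text += "."
    if description:
        text += " " + description.strip()
    return text


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(audio_embedding, prompt_embeddings):
    """Pick the class whose prompt embedding is closest to the audio embedding."""
    scores = {
        label: cosine(audio_embedding, emb)
        for label, emb in prompt_embeddings.items()
    }
    return max(scores, key=scores.get), scores


labels = ["dog_bark", "siren"]

# Three prompt variants for the same class: label only, template, label + description.
label_only = build_prompt("dog_bark")
templated = build_prompt("dog_bark", template="this is a sound of {label}")
described = build_prompt("siren", description="A loud wailing tone.")  # made-up description

# Toy vectors standing in for encoder outputs (assumption: a real system
# would embed each prompt and the audio clip with a pretrained model).
toy_text_emb = {"dog_bark": [0.9, 0.1, 0.0], "siren": [0.1, 0.8, 0.3]}
toy_audio_emb = [0.8, 0.2, 0.1]

predicted, scores = classify(toy_audio_emb, toy_text_emb)
print(label_only, "|", templated, "|", described)
print(predicted)
```

Because classification reduces to a nearest-prompt lookup in the shared embedding space, changing the template or appending a description only changes the text that gets embedded; no model weights are updated, which is why the approach stays zero-shot.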