ViSpeR: Multilingual Audio-Visual Speech Recognition

Bibliographic Details
Title:	ViSpeR: Multilingual Audio-Visual Speech Recognition
Authors:	Narayan, Sanath; Djilali, Yasser Abdelaziz Dahou; Singh, Ankit; Bihan, Eustache Le; Hacid, Hakim
Publication Year:	2024
Collection:	Computer Science
Subject Terms:	Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Description:	This work presents an extensive and detailed study on Audio-Visual Speech Recognition (AVSR) for five widely spoken languages: Chinese, Spanish, English, Arabic, and French. We collected large-scale datasets for each language except English and trained supervised learning models. Our model, ViSpeR, is trained in a multilingual setting and achieves competitive performance on newly established benchmarks for each language. The datasets and models are released to the community as a foundation for further research on Audio-Visual Speech Recognition, an increasingly important area of research. Code is available at https://github.com/YasserdahouML/visper.
Document Type:	Working Paper
Access URL:	http://arxiv.org/abs/2406.00038
Accession Number:	edsarx.2406.00038
Database:	arXiv
