MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer

Bibliographic Details
Title: MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer
Authors: Zhu, Minghao, Wang, Zhengpu, Hu, Mengxian, Dang, Ronghao, Lin, Xiao, Zhou, Xun, Liu, Chengju, Chen, Qijun
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computer Vision and Pattern Recognition
Description: Transferring visual-language knowledge from large-scale foundation models for video recognition has proved effective. To bridge the domain gap, additional parametric modules are added to capture temporal information. However, zero-shot generalization diminishes as the number of specialized parameters increases, forcing existing works to trade off between zero-shot and closed-set performance. In this paper, we present MoTE, a novel framework that balances generalization and specialization in one unified model. Our approach tunes a mixture of temporal experts to learn multiple task views with varying degrees of data fitting. To maximally preserve the knowledge of each expert, we propose \emph{Weight Merging Regularization}, which regularizes the merging process of experts in weight space. Additionally, we apply temporal feature modulation to regularize the contribution of temporal features at test time. We achieve a sound balance between zero-shot and closed-set video recognition tasks and obtain state-of-the-art or competitive results on various datasets, including Kinetics-400 \& 600, UCF, and HMDB. Code is available at \url{https://github.com/ZMHH-H/MoTE}.
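The abstract's central idea of merging a mixture of temporal experts "in weight space" can be illustrated with a minimal sketch: several identically shaped temporal modules are collapsed into a single module by averaging their parameters. This is an assumption-laden toy, not the authors' implementation; the class and function names (TemporalExpert, merge_experts) are hypothetical, and the paper's Weight Merging Regularization and temporal feature modulation are not reproduced here.

```python
# Minimal sketch (hypothetical, not the MoTE codebase): averaging the weights
# of several temporal experts with identical architectures into one module.
import copy
import torch
import torch.nn as nn


class TemporalExpert(nn.Module):
    """Toy temporal module; the paper's experts are more elaborate."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) -> temporally pooled feature (batch, dim)
        return self.proj(x).mean(dim=1)


def merge_experts(experts: list) -> nn.Module:
    """Merge identically shaped experts in weight space by parameter averaging."""
    merged = copy.deepcopy(experts[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack(
                [dict(e.named_parameters())[name] for e in experts]
            )
            param.copy_(stacked.mean(dim=0))
    return merged


if __name__ == "__main__":
    experts = [TemporalExpert(dim=512) for _ in range(4)]
    merged = merge_experts(experts)
    video_feats = torch.randn(2, 8, 512)   # (batch, frames, dim)
    print(merged(video_feats).shape)        # torch.Size([2, 512])
```

In the paper, the regularization is applied during training so that each expert's knowledge survives this kind of merging; the sketch only shows the merging step itself.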
Comment: NeurIPS 2024 Camera Ready
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2410.10589
Accession Number: edsarx.2410.10589
Database: arXiv