MAGVIT: Masked Generative Video Transformer

Bibliographic Details
Title:	MAGVIT: Masked Generative Video Transformer
Authors:	Yu, Lijun; Cheng, Yong; Sohn, Kihyuk; Lezama, José; Zhang, Han; Chang, Huiwen; Hauptmann, Alexander G.; Yang, Ming-Hsuan; Hao, Yuan; Essa, Irfan; Jiang, Lu
Publication Year:	2022
Collection:	Computer Science
Subject Terms:	Computer Science - Computer Vision and Pattern Recognition
Description:	We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu. Comment: CVPR 2023 highlight
Document Type:	Working Paper
Access URL:	http://arxiv.org/abs/2212.05199
Accession Number:	edsarx.2212.05199
Database:	arXiv
