Academic Journal

A Survey of Vision and Language Related Multi-Modal Task

Bibliographic Details
Title: A Survey of Vision and Language Related Multi-Modal Task
Authors: Lanxiao Wang, Wenzhe Hu, Heqian Qiu, Chao Shang, Taijin Zhao, Benliu Qiu, King Ngi Ngan, Hongliang Li
Source: CAAI Artificial Intelligence Research, Vol 1, Iss 2, Pp 111-136 (2022)
Publication Information: Tsinghua University Press, 2022.
Publication Year: 2022
Collection: LCC:Electronic computers. Computer science
Subject Terms: deep learning, vision and language, multi-modal generation, multi-modal analysis, multi-modal reasoning, pre-training, Electronic computers. Computer science, QA75.5-76.95
Description: With significant breakthroughs in research on single-modal deep learning tasks, more and more works have begun to focus on multi-modal tasks. Multi-modal tasks usually involve more than one modality, where a modality represents a type of behavior or state. Common multi-modal information includes vision, hearing, language, touch, and smell. Vision and language are two of the most common modalities in human daily life, and many typical multi-modal tasks focus on these two modalities, such as visual captioning and visual grounding. In this paper, we conduct in-depth research on typical vision and language tasks from the perspectives of generation, analysis, and reasoning. First, the typical tasks and some classical methods are analyzed and summarized, organized according to their different algorithmic concerns, and frequently used datasets and metrics are further discussed. Then, some variant tasks and cutting-edge tasks are briefly summarized to build a more comprehensive framework of vision and language related multi-modal tasks. Finally, we further discuss the development of pre-training related research and offer an outlook for future research. We hope this survey can help relevant researchers understand the latest progress, existing problems, and exploration directions of vision and language multi-modal related tasks, and provide guidance for future research.
Document Type: article
File Description: electronic resource
Language: English
ISSN: 2097-194X
Relation: https://www.sciopen.com/article/10.26599/AIR.2022.9150008; https://doaj.org/toc/2097-194X
DOI: 10.26599/AIR.2022.9150008
Access URL: https://doaj.org/article/b0ecb4ac3c6e4398ba6fc68ce920fdcd
Accession Number: edsdoj.b0ecb4ac3c6e4398ba6fc68ce920fdcd
Database: Directory of Open Access Journals