STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft

Bibliographic Details
Title: STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft
Authors: Lenzen, Nicholas; Raut, Amogh; Melnik, Andrew
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Robotics
Description: Recently, the STEVE-1 approach has been introduced as a method for training generative agents to follow instructions in the form of latent CLIP embeddings. In this work, we present a methodology to extend the control modalities by learning a mapping from new input modalities to the latent goal space of the agent. We apply our approach to the challenging Minecraft domain, and extend the goal conditioning to include the audio modality. The resulting audio-conditioned agent is able to perform on a comparable level to the original text-conditioned and visual-conditioned agents. Specifically, we create an Audio-Video CLIP foundation model for Minecraft and an audio prior network which together map audio samples to the latent goal space of the STEVE-1 policy. Additionally, we highlight the tradeoffs that occur when conditioning on different modalities. Our training code, evaluation code, and Audio-Video CLIP foundation model for Minecraft are made open-source to help foster further research into multi-modal generalist sequential decision-making agents.
Comment: Accepted at CoRL 2024: Workshop on Lifelong Learning for Home Robots
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2412.00949
Accession Number: edsarx.2412.00949
Database: arXiv