Abstract
Audio-Visual Intelligence is a multidisciplinary field that integrates auditory and visual modalities through large foundation models, covering tasks from understanding and generation to interaction under a unified taxonomy and shared methodological foundations.
Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial, not only for understanding but also for controllable generation and reasoning across dynamic, temporally grounded signals. Recent advances, such as Meta MovieGen and Google Veo-3, highlight the growing industrial and academic focus on unified audio-vision architectures that learn from massive multimodal data. However, despite rapid progress, the literature remains fragmented, spanning diverse tasks, inconsistent taxonomies, and heterogeneous evaluation practices that impede systematic comparison and knowledge integration. This survey provides the first comprehensive review of AVI through the lens of large foundation models. We establish a unified taxonomy covering the broad landscape of AVI tasks, ranging from understanding (e.g., speech recognition, sound localization) to generation (e.g., audio-driven video synthesis, video-to-audio) and interaction (e.g., dialogue, embodied, or agentic interfaces). We synthesize methodological foundations, including modality tokenization, cross-modal fusion, autoregressive and diffusion-based generation, large-scale pretraining, instruction alignment, and preference optimization. Furthermore, we curate representative datasets, benchmarks, and evaluation metrics, offering a structured comparison across task families and identifying open challenges in synchronization, spatial reasoning, controllability, and safety. By consolidating this rapidly expanding field into a coherent framework, this survey aims to serve as a foundational reference for future research on large-scale AVI.
Community
Audio-Visual Intelligence in Large Foundation Models: A Comprehensive Survey
arXiv: 2605.04045
We are excited to release what we believe is the first comprehensive survey on Audio-Visual Intelligence (AVI) in the era of large foundation models!
AVI aims to build AI systems that can jointly perceive, generate, and interact through both sound and vision, moving toward truly omni-modal intelligence.
In this survey, we systematically organize the rapidly growing AVI landscape around three core pillars:
- Perception: speech recognition, sound localization, video understanding, temporal reasoning, scene understanding, etc.
- Generation: audio-driven video generation, video-to-audio synthesis, talking heads, music/video co-generation, diffusion-based AVI systems, etc.
- Interaction: dialogue systems, embodied agents, conversational AVI, agentic multimodal systems, and interactive world modeling.
Highlights of this survey:
- A unified taxonomy for AVI tasks and paradigms
- Foundations of audio-visual large models
- Cross-modal fusion & tokenization strategies (see the sketch after this list)
- Autoregressive & diffusion-based AVI generation
- Comprehensive benchmarks, datasets, and evaluation metrics
- Open challenges: synchronization, controllability, spatial reasoning, safety, and more
- Curated resources and a continuously updated paper list
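To make the fusion and tokenization item above concrete, below is a minimal sketch of one widely used pattern: audio tokens attending to video tokens through cross-attention. The class name, dimensions, and residual/FFN layout are illustrative assumptions, not the architecture of any specific model covered by the survey.

```python
# Minimal cross-attention fusion sketch (PyTorch). All names, shapes, and
# hyperparameters here are illustrative assumptions, not a surveyed model.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention: audio tokens (queries) attend over video tokens (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (batch, T_audio, dim); video_tokens: (batch, T_video, dim)
        attended, _ = self.cross_attn(query=audio_tokens, key=video_tokens, value=video_tokens)
        fused = self.norm(audio_tokens + attended)  # residual connection + layer norm
        return fused + self.ffn(fused)              # position-wise feed-forward

# Usage: fuse a 2-clip batch of 100 audio tokens with 50 video tokens each.
fusion = AudioVisualFusion()
out = fusion(torch.randn(2, 100, 512), torch.randn(2, 50, 512))  # -> (2, 100, 512)
```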
As audio-visual foundation models rapidly evolve with systems like Meta MovieGen and Google Veo, we hope this survey can serve as a foundational reference for future AVI research and omni-modal AI systems.
Paper: arXiv:2605.04045
Project Page: JavisVerse AVI Survey
GitHub: Awesome-AVI Repository
#AI #Multimodal #AudioVisual #FoundationModels #LLM #DiffusionModels #ComputerVision #AudioAI #GenerativeAI #MachineLearning #OmniModal #SurveyPaper
The unified taxonomy for AVI tasks in the foundation-model era is overdue, and the way the survey threads perception, generation, and interaction into a single framework is refreshing. My main question: in streaming scenarios where audio and video drift, can a single multimodal backbone maintain cross-modal coherence without adding latency or producing desynced outputs? The arxivlens breakdown helped me parse the method details (https://arxivlens.com/PaperView/Details/audio-visual-intelligence-in-large-foundation-models-9256-88aebc09), and it would be helpful if the evaluation protocol explicitly benchmarked alignment under latency and desync. An ablation showing sensitivity to synchronization latency, e.g., how performance changes as audio-video drift increases, would really clarify where the gains actually come from.
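As a starting point for the ablation this comment requests, here is a minimal sketch of a drift-sensitivity harness: it shifts the audio track ahead of the video in fixed steps and re-scores alignment at each offset. `model.sync_score` and all parameter defaults below are hypothetical stand-ins for illustration, not an interface from the paper.

```python
# Hypothetical drift-sensitivity ablation: sweep audio-video offsets and
# record how an alignment score degrades. `model.sync_score` is assumed.
import numpy as np

def drift_ablation(model, audio: np.ndarray, video: np.ndarray,
                   sr: int = 16000, max_drift_ms: int = 400, step_ms: int = 40):
    """Score AV alignment as the audio drifts ahead of the video.

    audio: 1-D waveform sampled at `sr`; video: (num_frames, H, W, C).
    Returns a list of (drift_ms, score) pairs.
    """
    results = []
    for drift_ms in range(0, max_drift_ms + 1, step_ms):
        offset = int(sr * drift_ms / 1000)   # drift expressed in audio samples
        shifted = np.roll(audio, -offset)    # audio now leads the video
        if offset:
            shifted[-offset:] = 0.0          # silence the wrapped-around tail
        results.append((drift_ms, float(model.sync_score(shifted, video))))
    return results
```

Plotting score against drift_ms would directly expose the synchronization-latency sensitivity the comment asks about.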