Recent advances in multimodal large language models (MLLMs) have shown immense potential for video understanding. However, existing benchmarks often fall short of evaluating truly synergistic reasoning across the audio and visual modalities, either neglecting one modality or failing to integrate the two in a logically consistent way.
To address this, we introduce OmniVideoBench, a large-scale, rigorously designed benchmark for assessing synergistic audio-visual understanding, with a strong emphasis on modality complementarity and logical consistency. The benchmark comprises 1,000 high-quality question-answer (QA) pairs drawn from 628 diverse videos (ranging from a few seconds to 30 minutes in length), each annotated with step-by-step reasoning. Our evaluation of a range of MLLMs reveals a significant gap between current model performance and human-level reasoning, highlighting the challenges of genuine audio-visual intelligence.
Features 628 videos up to 30 minutes, covering 8 major categories and 68 subcategories.
Each of the 1,000 QA pairs is annotated with a detailed, atomic reasoning chain for transparency (see the loading sketch below).
Designed to test the complementary relationship between audio and visual cues, rather than either modality alone.
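To make the annotation format concrete, here is a minimal sketch of how such QA records and their atomic reasoning chains might be represented and loaded. The field names (`video_id`, `question`, `options`, `answer`, `reasoning_steps`) and the file name `omnivideobench.json` are illustrative assumptions, not the released schema.

```python
# Minimal sketch of loading OmniVideoBench-style QA records.
# NOTE: field names and file path are illustrative assumptions,
# not the official release format.
import json
from dataclasses import dataclass

@dataclass
class QARecord:
    video_id: str               # source video (a few seconds up to ~30 minutes)
    question: str               # multiple-choice question requiring audio + visual cues
    options: list[str]          # candidate answers
    answer: str                 # ground-truth option label, e.g. "B"
    reasoning_steps: list[str]  # atomic, step-by-step reasoning chain

def load_benchmark(path: str) -> list[QARecord]:
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    return [QARecord(**item) for item in raw]

if __name__ == "__main__":
    records = load_benchmark("omnivideobench.json")  # hypothetical file name
    print(f"Loaded {len(records)} QA pairs")
    for step in records[0].reasoning_steps:
        print("-", step)
```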
| # | Model | Params | Date | Overall | (0,1] min | (1,5] min | (5,10] min | (10,30] min |
|---|-------|--------|------|---------|-----------|-----------|------------|-------------|
Audio-only: audio input only, without the visual modality.
Visual-only: visual input only, without the audio modality.
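For illustration, here is a minimal sketch of how a model could be scored under the three input settings above (audio+visual, visual-only, audio-only), reusing the hypothetical `QARecord` loader sketched earlier. The `model.answer` call is a placeholder interface, not the actual API of any listed model.

```python
# Sketch of scoring one model under the three leaderboard input settings.
# NOTE: `model.answer(...)` is a placeholder interface, not a real API.
from collections import defaultdict

SETTINGS = {
    "audio_visual": {"use_audio": True,  "use_video": True},
    "visual_only":  {"use_audio": False, "use_video": True},
    "audio_only":   {"use_audio": True,  "use_video": False},
}

def evaluate(model, records):
    """Return accuracy per input setting over a list of QARecord items."""
    correct = defaultdict(int)
    for rec in records:
        for name, flags in SETTINGS.items():
            pred = model.answer(rec.video_id, rec.question, rec.options, **flags)
            if pred == rec.answer:
                correct[name] += 1
    return {name: correct[name] / len(records) for name in SETTINGS}
```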
Examples in OmniVideoBench (“V” denotes vision and “A” denotes audio), shown together with their atomic reasoning traces.
Try: Compare how the answer changes across the multimodal, visual-only, and audio-only settings side by side.
The complete pipeline of data collection, annotation, and refinement, where filtering and refinement serve as the two key quality-assurance processes.
(a) OmniVideoBench covers 8 major categories and 68 subcategories. (b) OmniVideoBench comprises 13 task types. The upper part shows the video duration distribution across tasks, with durations grouped into four categories: "Short" (less than 1 minute), "Medium" (1–5 minutes), "Long" (5–10 minutes), and "Ultralong" (more than 10 minutes); the lower part shows the distribution of the three audio types (Speech, Sound, and Music). (c) Distribution of video durations across the four time intervals. (d) Distribution of the three audio types.
A performance comparison of leading open-source and closed-source models across the 13 distinct reasoning tasks in OmniVideoBench, highlighting current strengths and weaknesses in areas such as temporal reasoning and counting.
Accuracy of the Gemini-2.0-Flash model on videos with different primary audio types, demonstrating performance variance on speech, sound, and music.
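A breakdown like this can be computed by grouping QA pairs by their primary audio type. The sketch below assumes each record carries a hypothetical `audio_type` field ("Speech", "Sound", or "Music") in addition to the fields shown earlier, and that `predictions` is a list of predicted option labels aligned with `records`.

```python
# Sketch of per-audio-type accuracy, assuming a hypothetical `audio_type`
# field on each record and predictions aligned index-for-index with records.
from collections import Counter

def accuracy_by_audio_type(records, predictions):
    correct, total = Counter(), Counter()
    for rec, pred in zip(records, predictions):
        total[rec.audio_type] += 1
        correct[rec.audio_type] += int(pred == rec.answer)
    return {t: correct[t] / total[t] for t in total}  # e.g. {"Speech": 0.62, ...}
```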
@misc{li2025omnivideobenchaudiovisualunderstandingevaluation,
  title={OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs},
  author={Caorui Li and Yu Chen and Yiyan Ji and Jin Xu and Zhenyu Cui and Shihao Li and Yuanxing Zhang and Jiafu Tang and Zhenghao Song and Dingling Zhang and Ying He and Haoxiang Liu and Yuxuan Wang and Qiufeng Wang and Zhenhe Wu and Jiehui Luo and Zhiyu Pan and Weihao Xie and Chenchen Zhang and Zhaohui Wang and Jiayi Tian and Yanghai Wang and Zhe Cao and Minxin Dai and Ke Wang and Runzhe Wen and Yinghao Ma and Yaning Pan and Sungkyun Chang and Termeh Taheri and Haiwen Xia and Christos Plachouras and Emmanouil Benetos and Yizhi Li and Ge Zhang and Jian Yang and Tianhao Peng and Zili Wang and Minghao Liu and Junran Peng and Zhaoxiang Zhang and Jiaheng Liu},
  year={2025},
  eprint={2510.10689},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2510.10689},
}