Recent advances in multimodal large language models (MLLMs) have shown immense potential for video understanding. However, existing benchmarks often fall short of evaluating truly synergistic reasoning across the audio and visual modalities, either neglecting one modality or failing to integrate the two in a logically consistent way.
To address this, we introduce OmniVideoBench, a large-scale, rigorously designed benchmark for assessing synergistic audio-visual understanding, with a strong emphasis on modality complementarity and logical consistency. The benchmark comprises 1,000 high-quality question-answer (QA) pairs drawn from 628 diverse videos (ranging from a few seconds to 30 minutes in length), each annotated with step-by-step reasoning. Our evaluation of a range of MLLMs reveals a significant gap between current model performance and human-level reasoning, highlighting the challenges of genuine audio-visual intelligence.
Features 628 videos up to 30 minutes, covering 8 major categories and 68 subcategories.
Each of the 1,000 QA pairs is annotated with a detailed, atomic reasoning chain for transparency (see the loading sketch below).
Designed to test the complementary relationship between audio and visual cues, rather than either modality alone.
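To make the annotation format concrete, here is a minimal sketch of how such QA records and their atomic reasoning chains might be represented and loaded. The field names (`video_id`, `question`, `options`, `answer`, `reasoning_steps`) and the file name `omnivideobench.json` are illustrative assumptions, not the released schema.

```python
# Minimal sketch of loading OmniVideoBench-style QA records.
# NOTE: field names and file path are illustrative assumptions,
# not the official release format.
import json
from dataclasses import dataclass

@dataclass
class QARecord:
    video_id: str               # source video (a few seconds up to ~30 minutes)
    question: str               # multiple-choice question requiring audio + visual cues
    options: list[str]          # candidate answers
    answer: str                 # ground-truth option label, e.g. "B"
    reasoning_steps: list[str]  # atomic, step-by-step reasoning chain

def load_benchmark(path: str) -> list[QARecord]:
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    return [QARecord(**item) for item in raw]

if __name__ == "__main__":
    records = load_benchmark("omnivideobench.json")  # hypothetical file name
    print(f"Loaded {len(records)} QA pairs")
    for step in records[0].reasoning_steps:
        print("-", step)
```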
| # | Model | Params | Date | Overall | (0,1] min | (1,5] min | (5,10] min | (10,30] min |
|---|-------|--------|------|---------|-----------|-----------|------------|-------------|
Audio-only: audio input only, without the visual modality.
Visual-only: visual input only, without the audio modality.
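For illustration, here is a minimal sketch of how a model could be scored under the three input settings above (audio+visual, visual-only, audio-only), reusing the hypothetical `QARecord` loader sketched earlier. The `model.answer` call is a placeholder interface, not the actual API of any listed model.

```python
# Sketch of scoring one model under the three leaderboard input settings.
# NOTE: `model.answer(...)` is a placeholder interface, not a real API.
from collections import defaultdict

SETTINGS = {
    "audio_visual": {"use_audio": True,  "use_video": True},
    "visual_only":  {"use_audio": False, "use_video": True},
    "audio_only":   {"use_audio": True,  "use_video": False},
}

def evaluate(model, records):
    """Return accuracy per input setting over a list of QARecord items."""
    correct = defaultdict(int)
    for rec in records:
        for name, flags in SETTINGS.items():
            pred = model.answer(rec.video_id, rec.question, rec.options, **flags)
            if pred == rec.answer:
                correct[name] += 1
    return {name: correct[name] / len(records) for name in SETTINGS}
```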
Examples in OmniVideoBench (“V” denotes vision and “A” denotes audio), shown together with their atomic reasoning traces.
Try: Compare how the answer changes across the multimodal, visual-only, and audio-only settings side by side.
The complete pipeline of data collection, annotation, and refinement, where filtering and refinement serve as the two key quality-assurance processes.
(a) OmniVideoBench covers 8 major categories and 68 subcategories. (b) OmniVideoBench comprises 13 task types. The upper part shows the video duration distribution across tasks, with durations grouped into four categories: "Short" (less than 1 minute), "Medium" (1–5 minutes), "Long" (5–10 minutes), and "Ultralong" (more than 10 minutes); the lower part shows the distribution of the three audio types (Speech, Sound, and Music). (c) Distribution of video durations across the four time intervals. (d) Distribution of the three audio types.
A performance comparison of leading open-source and closed-source models across the 13 distinct reasoning tasks in OmniVideoBench, highlighting current strengths and weaknesses in areas such as temporal reasoning and counting.
Accuracy of the Gemini-2.0-Flash model on videos with different primary audio types, demonstrating performance variance on speech, sound, and music.
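A breakdown like this can be computed by grouping QA pairs by their primary audio type. The sketch below assumes each record carries a hypothetical `audio_type` field ("Speech", "Sound", or "Music") in addition to the fields shown earlier, and that `predictions` is a list of predicted option labels aligned with `records`.

```python
# Sketch of per-audio-type accuracy, assuming a hypothetical `audio_type`
# field on each record and predictions aligned index-for-index with records.
from collections import Counter

def accuracy_by_audio_type(records, predictions):
    correct, total = Counter(), Counter()
    for rec, pred in zip(records, predictions):
        total[rec.audio_type] += 1
        correct[rec.audio_type] += int(pred == rec.answer)
    return {t: correct[t] / total[t] for t in total}  # e.g. {"Speech": 0.62, ...}
```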
@misc{li2025omnivideobenchaudiovisualunderstandingevaluation,
  title={OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs},
  author={Caorui Li and Yu Chen and Yiyan Ji and Jin Xu and Zhenyu Cui and Shihao Li and Yuanxing Zhang and Jiafu Tang and Zhenghao Song and Dingling Zhang and Ying He and Haoxiang Liu and Yuxuan Wang and Qiufeng Wang and Zhenhe Wu and Jiehui Luo and Zhiyu Pan and Weihao Xie and Chenchen Zhang and Zhaohui Wang and Jiayi Tian and Yanghai Wang and Zhe Cao and Minxin Dai and Ke Wang and Runzhe Wen and Yinghao Ma and Yaning Pan and Sungkyun Chang and Termeh Taheri and Haiwen Xia and Christos Plachouras and Emmanouil Benetos and Yizhi Li and Ge Zhang and Jian Yang and Tianhao Peng and Zili Wang and Minghao Liu and Junran Peng and Zhaoxiang Zhang and Jiaheng Liu},
  year={2025},
  eprint={2510.10689},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2510.10689},
}