Model | Control | Agent Random | Agent Binding | Agent Coreference | Action Adversarial | Action Mannerism | Action Binding | Event Chronology | Average |
---|---|---|---|---|---|---|---|---|---|
Random | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
P-LLaVA | 1.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
OwlCon | 24.3 | 3.4 | 0.7 | 0.0 | 4.3 | 2.8 | 0.6 | 0.1 | 1.7 |
V-LLaVA | 65.8 | 16.4 | 7.6 | 0.3 | 8.7 | 3.3 | 10.6 | 3.9 | 7.3 |
QVL-7B | 84.6 | 39.1 | 13.5 | 6.5 | 17.8 | 17.5 | 16.4 | 0.4 | 15.9 |
OV-7B | 81.6 | 56.7 | 32.9 | 8.0 | 29.7 | 30.6 | 36.4 | 30.5 | 32.1 |
OV-72B | 79.3 | 63.7 | 45.4 | 38.6 | 33.1 | 29.3 | 45.1 | 46.5 | 43.1 |
Model | Control | Agent Random | Agent Binding | Agent Coreference | Action Adversarial | Action Mannerism | Action Binding | Event Chronology | Average |
---|---|---|---|---|---|---|---|---|---|
Random | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
Gemini-1.5 Flash | 91.9 | 56.4 | 23.8 | 4.7 | 32.9 | 21.6 | 25.0 | 2.7 | 23.9 |
QVL-72B | 82.7 | 56.0 | 29.3 | 35.3 | 30.0 | 24.0 | 35.3 | 1.3 | 30.2 |
OV-72B | 81.3 | 64.0 | 46.7 | 41.3 | 30.7 | 32.7 | 46.0 | 50.0 | 44.5 |
GPT-4o | 63.3 | 54.7 | 44.7 | 40.7 | 55.0 | 42.0 | 54.0 | 32.2 | 46.0 |
Gemini-1.5 Pro | 74.3 | 60.1 | 49.7 | 36.7 | 52.3 | 43.5 | 52.3 | 50.3 | 49.3 |
Human | - | 91.5 | 92.9 | 92.6 | 92.9 | 89.9 | 91.5 | 100.0 | 93.0 |
A fundamental aspect of compositional reasoning in a video is associating people and their actions across time. Recent years have seen great progress in general-purpose vision/video models and a move towards long-video understanding. While exciting, we take a step back and ask: are today’s models good at compositional reasoning on short videos? To this end, we introduce VELOCITI, a benchmark to study Video-LLMs by disentangling and assessing the comprehension of agents, actions, and their associations across multiple events. We adopt the Video-Language Entailment setup and propose StrictVLE, which requires correct classification (rather than ranking) of the positive and negative captions. We evaluate several models and observe that even the best, LLaVA-OneVision (42.5%) and GPT-4o (44.3%), are far from human accuracy at 89.6%. Results show that action understanding lags behind agent understanding, and that models perform worse on negative captions created from entities appearing in the video than on those obtained by pure text manipulation. We also present challenges with ClassicVLE and multiple-choice (MC) evaluation, strengthening our preference for StrictVLE. Finally, we validate that our benchmark requires visual inputs of multiple frames, making it ideal for studying video-language compositional reasoning.
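To make the scoring criteria concrete, below is a minimal sketch, assuming a model returns an independent entailment score in [0, 1] for each (video, caption) pair; the 0.5 decision threshold and the function names are illustrative, not the official evaluation code.

```python
# Minimal sketch of the two accuracy criteria, assuming per-caption entailment
# scores in [0, 1]. Names and the 0.5 threshold are illustrative.

def classic_vle_correct(pos_score: float, neg_score: float) -> bool:
    """Ranking criterion: the positive caption only needs to outscore the negative."""
    return pos_score > neg_score

def strict_vle_correct(pos_score: float, neg_score: float, threshold: float = 0.5) -> bool:
    """Strict criterion: each caption must be classified correctly on its own,
    i.e., the positive is judged entailed AND the negative is judged not entailed."""
    return pos_score >= threshold and neg_score < threshold

# Example: the captions are ranked correctly, but the negative is also accepted
# as entailed. ClassicVLE counts this pair as correct; StrictVLE does not.
pos, neg = 0.9, 0.7
print(classic_vle_correct(pos, neg))  # True
print(strict_vle_correct(pos, neg))   # False
```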
Humans are amazing at perceiving the visual world. Two processes are continuously at play:
Binding: atomic entities such as objects, persons, actions, and scenes are identified and bound together through appropriate relationships.
Distractor suppression: irrelevant atomic entities are ignored and not confused with the entities of interest.
We infuse these properties into VELOCITI. At the simplest level, each test in our benchmark compares the similarity between the video and text descriptions. To encourage compositionality without distractors, texts in our benchmark are (typically) designed at the event level, while the visuals span the entire video clip. For a model to perform well on VELOCITI, we expect it to suppress distractors from other events and implicitly localize where in the video the compositional description matches (or mismatches).
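As an illustration of this setup, one benchmark item can be pictured roughly as follows; the field names, file path, and example captions are hypothetical placeholders, not the benchmark's actual schema.

```python
# Hypothetical sketch of a single video-language entailment item: the video
# spans several events, while each caption describes just one event.
from dataclasses import dataclass

@dataclass
class EntailmentPair:
    video_path: str        # full multi-event clip; the visuals are never trimmed to one event
    positive_caption: str  # event-level description entailed by the video
    negative_caption: str  # minimally edited description that is NOT entailed
    test_type: str         # e.g., "agent_random", "agent_binding", "action_manner", ...

pair = EntailmentPair(
    video_path="clips/example.mp4",
    positive_caption="A man in a blue jacket opens the door.",
    negative_caption="A woman in a red dress opens the door.",
    test_type="agent_random",
)
print(pair)
```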
In these tests, the negative caption is created without looking at the video, i.e., the negation is uninformed of the video content. A toy sketch of both edits follows the two tests below.
Agent Random Test
This test has negative captions in which the agent is replaced by a random agent reference from the VidSitu
dataset. Solving it requires a model to implicitly localize the event based on the action
and identify who is present (or absent) in the video.
Action Manner Test
The negative caption is generated by replacing the manner with another plausible modifier.
Identifying these subtle variations is challenging for most models.
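Here is the promised toy sketch of these video-uninformed edits. It assumes the positive caption's agent, action, and manner are available as separate fields; the word pools, caption template, and function names are made up, and the benchmark's actual negatives are generated as described in the paper.

```python
# Toy sketch of video-uninformed negatives: Agent Random swaps in a random agent
# from elsewhere in the dataset; Action Manner swaps only the manner modifier.
# All pools and the caption template are illustrative stand-ins.
import random

AGENT_POOL = ["a police officer", "an elderly woman", "a young boy"]   # stand-in for dataset-wide agents
MANNER_POOL = ["slowly", "angrily", "carefully"]                       # stand-in for plausible manners

def agent_random_negative(agent: str, action: str, manner: str) -> str:
    """Replace the agent with a random agent reference drawn from the dataset pool."""
    new_agent = random.choice([a for a in AGENT_POOL if a != agent])
    return f"{new_agent} {action} {manner}."

def action_manner_negative(agent: str, action: str, manner: str) -> str:
    """Keep the agent and action, swap only the manner."""
    new_manner = random.choice([m for m in MANNER_POOL if m != manner])
    return f"{agent} {action} {new_manner}."

print(agent_random_negative("a young boy", "closes the gate", "slowly"))
print(action_manner_negative("a young boy", "closes the gate", "slowly"))
```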
A truly compositional model should be capable of identifying and binding entities through the right relationships. To solve these tests, the model must capture and learn such associations. A key aspect of VELOCITI is that, for the in-video negation tests, the negative caption is created from entities appearing in other parts of the same video, requiring a greater capability for distractor suppression. A rough sketch of the two binding edits follows the test list below.
Agent Binding Test
This test has negative captions created by replacing the agent (the doer of the action) with another agent
from the same video.
Action Binding Test
In this test, the negative caption is created by replacing the action and its modifiers with another
action and modifiers from a different event of the same video, while the primary agent remains unchanged.
Agent Coreference Test
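As referenced above, here is a rough sketch of the two in-video binding edits (Agent Binding and Action Binding). It assumes each video carries per-event annotations of its agent and action; the event structure, field names, and example strings are hypothetical.

```python
# Rough sketch of in-video binding negatives: entities are swapped across events
# of the SAME video, so the distractor genuinely appears on screen.
import random
from typing import Dict, List

def agent_binding_negative(events: List[Dict[str, str]], target_idx: int) -> str:
    """Swap the target event's agent with an agent from another event of the same video."""
    target = events[target_idx]
    other_agents = [e["agent"] for i, e in enumerate(events)
                    if i != target_idx and e["agent"] != target["agent"]]
    return f'{random.choice(other_agents)} {target["action"]}.'

def action_binding_negative(events: List[Dict[str, str]], target_idx: int) -> str:
    """Keep the agent, but attach the action (and modifiers) of a different event."""
    target = events[target_idx]
    other_actions = [e["action"] for i, e in enumerate(events) if i != target_idx]
    return f'{target["agent"]} {random.choice(other_actions)}.'

events = [
    {"agent": "the bearded man", "action": "pushes the cart angrily"},
    {"agent": "the girl in a hat", "action": "waves at the crowd"},
]
print(agent_binding_negative(events, 0))   # "the girl in a hat pushes the cart angrily."
print(action_binding_negative(events, 0))  # "the bearded man waves at the crowd."
```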
For more details on the LLM prompts and methodology for various evaluations, including human evaluations, please refer to the paper.
@inproceedings{velociti,
title={{VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment}},
author={Saravanan, Darshana and Gupta, Varun and Singh, Darshan and Khan, Zeeshan and Gandhi, Vineet and Tapaswi, Makarand},
booktitle={Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}