VELOCITI: Can Video-Language Models Bind Semantic Concepts Through Time?

Video et Language Compositionality Through Time ⏰
CVIT, IIIT Hyderabad, India
Inria Paris and Département d’informatique de l’ENS, CNRS, PSL Research University

To keep up with the rapid pace at which Video-Language Models (VLMs) are proposed, we provide a benchmark for evaluating current state-of-the-art and upcoming VLMs on compositionality, a fundamental aspect of vision-language understanding. This is achieved through carefully designed tests that evaluate various aspects of perception and binding. With this, we aim to provide a more accurate gauge of VLM capabilities, encouraging research towards improving VLMs and preventing shortcomings that may percolate into systems that rely on such models.

This work focuses on evaluating subtle perception and visual binding. While VELOCITI has short videos of 10 s, they are accompanied by structured, rich, and dense semantic role label (SRL) descriptions. Even within these short videos, we show that models struggle to relate (binding tests) and track (co-reference tests) entities across time. We therefore believe that VELOCITI is a milestone that models should clear before moving on to reasoning over longer videos.

Compositionality

Humans are remarkably good at perceiving the visual world. Two processes are continuously at play:

  1. Compositionality: atomic entities such as objects, persons, actions, and scenes are identified and bound through appropriate relationships.

  2. Distractor Suppression: irrelevant atomic entities are ignored and not confused with the entities of interest.

We infuse these properties into VELOCITI. At the simplest level, each test in our benchmark compares the similarity between the video and text descriptions. To encourage compositionality without distractors, the texts in our benchmark are (typically) written at the level of a single event, while the visuals span the entire video clip. To perform well on VELOCITI, a model must suppress distractors from other events and implicitly localize where in the video the compositional description matches (or mismatches).
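
As a rough illustration of this protocol (not the exact pipeline used in the paper), the sketch below scores a clip against a caption with a contrastive image-text model from open_clip by mean-pooling frame embeddings. The `sample_frames` helper and the pooling strategy are assumptions made for the sake of the example.

```python
import torch
import open_clip

# Hypothetical helper (not part of VELOCITI): returns `num_frames` PIL images
# sampled uniformly from the video clip.
from video_utils import sample_frames

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def video_text_similarity(video_path: str, caption: str) -> float:
    frames = sample_frames(video_path, num_frames=8)
    pixels = torch.stack([preprocess(f) for f in frames])             # (T, 3, H, W)
    video_emb = model.encode_image(pixels).mean(dim=0, keepdim=True)  # mean-pool over time
    text_emb = model.encode_text(tokenizer([caption]))
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (video_emb @ text_emb.T).item()                            # cosine similarity
```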


Intra-Video Association Test (IVAT)

Inspired by the Winoground challenge, which matches two images with two similar but distinct descriptions, we leverage the structure of VidSitu to create IVAT. We use the first and last event descriptions to obtain two captions, C0 and C1. To account for overlapping event descriptions, we consider the video’s first and last 4 s as V0 and V1. A model is evaluated on its ability to correctly associate (V0, C0) and (V1, C1). Since V0 and V1 are sampled from the same video, the overall scene, characters, and actions are highly likely to be similar, so fine-grained distinctions may be necessary.
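
Assuming Winoground-style scoring, the sketch below shows how the pairing checks for a single IVAT example can be computed from any video-text similarity function (for instance, the one sketched earlier); the exact definitions behind the columns reported below may differ.

```python
from typing import Callable

def ivat_scores(sim: Callable[[str, str], float],
                v0: str, v1: str, c0: str, c1: str) -> dict:
    """Winoground-style pairing checks for one IVAT example (illustrative)."""
    s = [[sim(v0, c0), sim(v0, c1)],
         [sim(v1, c0), sim(v1, c1)]]
    # v2t: each video segment must prefer its own caption.
    v2t = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    # t2v: each caption must prefer its own video segment.
    t2v = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    # group: both directions must be solved jointly on the same example.
    return {"t2v": t2v, "v2t": v2t, "group": t2v and v2t}
```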

| Model | C t2v | C v2t | I t2v | I v2t | M t2v | M v2t | Group |
|---|---|---|---|---|---|---|---|
| Random | 50.0 | 50.0 | 50.0 | 50.0 | 25.0 | 25.0 | 16.7 |
| CLIP-B/32 | 95.5 | 91.8 | 73.4 | 67.5 | 49.8 | 36.9 | 26.8 |
| CLIP-L/14 | 96.3 | 94.2 | 73.3 | 68.8 | 49.4 | 39.9 | 27.0 |
| EVA-CLIP-L/14 | 96.8 | 95.4 | 74.6 | 71.6 | 52.0 | 45.2 | 30.9 |
| SigLIP-B/16 | 95.9 | 94.2 | 70.5 | 69.0 | 44.7 | 40.9 | 28.1 |
| SigLIP-L/16 | 96.7 | 95.0 | 72.2 | 69.7 | 46.9 | 42.6 | 29.5 |
| NegCLIP-B/32 | 95.4 | 94.1 | 71.8 | 72.1 | 47.2 | 47.0 | 32.8 |
| CLIP-ViP-B/32 | 92.5 | 93.6 | 67.5 | 68.7 | 38.0 | 40.8 | 24.5 |
| ViFi-CLIP-B/16 | 95.6 | 93.7 | 72.5 | 67.8 | 49.0 | 38.4 | 26.2 |
| mPLUG-V | 78.5 | 79.3 | 51.8 | 51.1 | 17.5 | 13.6 | 7.0 |
| PLLaVA | 89.6 | 88.3 | 65.8 | 63.1 | 36.9 | 30.3 | 19.1 |
| Video-LLaVA | 91.9 | 90.5 | 67.0 | 63.1 | 40.8 | 29.9 | 21.7 |
| Owl-Con | 93.0 | 90.1 | 66.6 | 63.1 | 42.2 | 33.2 | 24.8 |
| Human | – | – | 96.0 | 95.4 | 92.0 | 91.3 | – |

Subtle Perception Tests

In these tests, given a 10 s video, the model must select the positive (correct) caption over the negative (incorrect) one. Note that the positive caption describes only one of the five events in the video.
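
The sketch below shows how such a binary-choice test can be scored with any video-text similarity function; chance accuracy is 50%. The triplet format is an assumption about how the examples might be organized.

```python
def pairwise_accuracy(sim, examples):
    """examples: iterable of (video, positive_caption, negative_caption) triplets."""
    examples = list(examples)
    # A trial counts as correct only when the positive caption scores strictly
    # higher than the negative one; ties count as failures.
    correct = sum(sim(video, pos) > sim(video, neg) for video, pos, neg in examples)
    return correct / len(examples) if examples else 0.0
```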

Action Adversarial Test

The negative caption is generated by replacing the action in the event description with a contradictory action that does not appear in the video. Solving this test requires the model to not only identify actions but also implicitly localize them in the video while not being distracted by other events.

Action Modifier Test

The negative caption is generated by replacing the manner of the action with another plausible modifier. Identifying these subtle variations is challenging for most models.


Visual Binding Tests

A truly compositional model should be capable of identifying entities and binding them through the right relationship. To solve these tests, the model must capture and learn such associations.

Agent Binding Test

In this test, negative captions are created by replacing the agent (the doer of the action) with another agent from the same video. Models that lack binding ability struggle on this test.

Agent Identification Test

This test has negative captions where the agent is replaced by a random reference from elsewhere in the VidSitu dataset. These LLM-generated replacements are often easy to reject. The Agent Identification Test thus acts as a control for the Agent Binding Test: models perform well when the replacement agent comes from outside the video but fail when concept binding is required.
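
The contrast between the two tests can be summarized with the purely illustrative sketch below. The actual VELOCITI negatives are generated from VidSitu SRL annotations with an LLM, so the simple string substitution and argument names here are hypothetical.

```python
import random

def agent_negatives(caption: str, agent: str,
                    same_video_agents: list[str],
                    dataset_agents: list[str]) -> dict:
    """Build hypothetical negatives for the two agent tests.

    Assumes the video contains at least two distinct agents and that the
    dataset contains agents that do not appear in this video.
    """
    # Agent Binding: the replacement appears elsewhere in the SAME video,
    # so rejecting it requires binding the right agent to the action.
    in_video = random.choice([a for a in same_video_agents if a != agent])
    # Agent Identification: the replacement comes from a DIFFERENT video,
    # which is typically much easier to reject.
    out_of_video = random.choice([a for a in dataset_agents if a not in same_video_agents])
    return {
        "agent_binding": caption.replace(agent, in_video, 1),
        "agent_identification": caption.replace(agent, out_of_video, 1),
    }
```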

Action Binding Test

In this test, the negative caption is created by replacing the action and its modifiers with another action and modifiers from a different event of the same video, while the primary agent remains unchanged.

Agent CoReference Test

Co-reference occurs when two or more expressions refer to the same entity. In a video, entities are referred to by their appearance or by the actions they perform. For this test, we identify videos in which a single agent acts in at least two events. The positive caption is created by concatenating the two references. Solving this test requires resolving multi-step binding across time.

Chronology Test

Time is a unique aspect of video understanding. Even in a short 10 s video, understanding the story progression across multiple events requires a model to bind each event to its internal representations of temporal concepts like before, after, first, and then. This test measures whether a model can identify the correct order of events in a video. The positive caption describes two events that happen sequentially, while the negative caption contains the same descriptions in reversed order.
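
A minimal sketch of how a chronology pair could be formed from two consecutive event descriptions; the template wording is an assumption, not the exact phrasing used in VELOCITI.

```python
def chronology_pair(first_event: str, later_event: str) -> tuple[str, str]:
    """Return (positive, negative) captions for the chronology test (illustrative)."""
    positive = f"{first_event} Then, {later_event}"   # correct temporal order
    negative = f"{later_event} Then, {first_event}"   # same events, order reversed
    return positive, negative
```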

| Model | AgIden | AgBind | AgCoref | ActAdv | ActBind | ActMod | Chrono | Average |
|---|---|---|---|---|---|---|---|---|
| CLIP-B/32 | 77.6 | 56.3 | 52.6 | 64.0 | 57.6 | 52.1 | 49.4 | 58.5 |
| CLIP-L/14 | 82.6 | 55.4 | 56.9 | 66.2 | 58.0 | 56.8 | 50.2 | 60.9 |
| EVA-CLIP-L/14 | 83.3 | 53.4 | 55.0 | 70.2 | 55.3 | 51.2 | 51.1 | 59.9 |
| SigLIP-B/16 | 80.0 | 54.4 | 51.0 | 63.8 | 54.5 | 61.1 | 49.8 | 59.2 |
| SigLIP-L/16 | 78.8 | 53.3 | 52.2 | 61.6 | 52.3 | 61.1 | 51.2 | 59.4 |
| NegCLIP-B/32 | 83.4 | 55.6 | 50.5 | 61.8 | 52.3 | 61.1 | 51.2 | 59.4 |
| CLIP-ViP-B/32 | 75.3 | 52.4 | 55.7 | 70.2 | 53.5 | 51.2 | 48.5 | 58.1 |
| ViFi-CLIP-B/16 | 82.3 | 58.7 | 54.6 | 63.0 | 59.3 | 60.5 | 49.8 | 61.2 |
| mPLUG-V | 43.0 | 31.9 | 51.7 | 65.0 | 42.0 | 49.6 | 41.3 | 46.3 |
| PLLaVA | 68.6 | 43.3 | 60.5 | 62.4 | 46.6 | 56.0 | 49.6 | 55.3 |
| Video-LLaVA | 74.1 | 50.4 | 60.1 | 63.6 | 47.0 | 47.9 | 56.0 | 57.0 |
| Owl-Con | 67.4 | 44.6 | 50.0 | 73.0 | 51.1 | 63.2 | 45.6 | 56.4 |
| Gemini-1.5-Flash | 91.8 | 76.4 | 67.8 | 80.0 | 76.4 | 76.9 | 68.3 | 76.8 |
| Human | 94.7 | 93.3 | 96.0 | 100.0 | 92.7 | 91.3 | 93.3 | 94.4 |

Dataset Stats

For dataset statistics, details on the LLM prompts, and the methodology for the various evaluations, including human evaluations, please refer to the paper.

Abstract

Compositionality is a fundamental aspect of vision-language understanding and is especially required for videos since they contain multiple entities (e.g. persons, actions, and scenes) interacting dynamically over time. Existing benchmarks focus primarily on perception capabilities. However, they do not study binding, the ability of a model to associate entities through appropriate relationships. To this end, we propose VELOCITI, a new benchmark building on complex movie clips and dense semantic role label annotations to test perception and binding in video language models (contrastive and Video-LLMs). Our perception-based tests require discriminating video-caption pairs that share similar entities, and the binding tests require models to associate the correct entity to a given situation while ignoring the different yet plausible entities that also appear in the same video. While current state-of-the-art models perform moderately well on perception tests, accuracy is near random when both entities are present in the same video, indicating that they fail at binding tests. Even the powerful Gemini 1.5 Flash has a substantial gap (16-28%) with respect to human accuracy in such binding tests.

BibTeX

@article{velociti,
  title={VELOCITI: Can Video-Language Models Bind Semantic Concepts Through Time?},
  author={Saravanan, Darshana and Singh, Darshan and Gupta, Varun and Khan, Zeeshan and Gandhi, Vineet and Tapaswi, Makarand},
  journal={arXiv:2406.10889},
  year={2024}
}