VELOCITI: Can Video-Language Models Bind Semantic Concepts Through Time?

Video et Language Compositionality Through Time ⏰
CVIT, IIIT Hyderabad, India
Inria Paris and Département d’informatique de l’ENS, CNRS, PSL Research University

To keep up with the rapid pace at which Video-Language Models (VLMs) are proposed, we provide a benchmark for evaluating current state-of-the-art and upcoming VLMs on compositionality, a fundamental aspect of vision-language understanding. This is achieved through carefully designed tests that evaluate various aspects of perception and binding. With this, we aim to provide a more accurate gauge of VLM capabilities, encouraging research towards improving VLMs and preventing shortcomings that may percolate into systems that rely on such models.

This work focuses on evaluating subtle perception and visual binding. While VELOCITI has short videos of 10 s, they are accompanied by structured, rich, and dense semantic role label (SRL) descriptions. Even within these short videos, we show that models struggle to relate (binding tests) and track (co-reference tests) entities across time. We therefore believe that VELOCITI is a milestone that models should clear before moving on to reasoning over longer videos.

Compositionality

Humans are remarkably good at perceiving the visual world. Two processes are continuously at play:

  1. Compositionality: atomic entities such as objects, persons, actions, and scenes are identified and bound through appropriate relationships.

  2. Distractor Suppression: irrelevant atomic entities are ignored and not confused with the entities of interest.

We infuse these properties into VELOCITI. At the simplest level, each test in our benchmark compares the similarity between the video and text descriptions. To encourage compositionality without distractors, the texts in our benchmark are (typically) written at the level of a single event, while the visuals span the entire video clip. To perform well on VELOCITI, a model must suppress distractors from other events and implicitly localize where in the video the compositional description matches (or mismatches).
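
As a rough illustration of this protocol (not the exact pipeline used in the paper), the sketch below scores a clip against a caption with a contrastive image-text model from open_clip by mean-pooling frame embeddings. The `sample_frames` helper and the pooling strategy are assumptions made for the sake of the example.

```python
import torch
import open_clip

# Hypothetical helper (not part of VELOCITI): returns `num_frames` PIL images
# sampled uniformly from the video clip.
from video_utils import sample_frames

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def video_text_similarity(video_path: str, caption: str) -> float:
    frames = sample_frames(video_path, num_frames=8)
    pixels = torch.stack([preprocess(f) for f in frames])             # (T, 3, H, W)
    video_emb = model.encode_image(pixels).mean(dim=0, keepdim=True)  # mean-pool over time
    text_emb = model.encode_text(tokenizer([caption]))
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (video_emb @ text_emb.T).item()                            # cosine similarity
```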


Intra-Video Association Test (IVAT)

Inspired by the Winoground challenge, which matches two images with two similar but distinct descriptions, we leverage the structure of VidSitu to create IVAT. We use the first and last event descriptions to obtain two captions, C0 and C1. To account for overlapping event descriptions, we consider the video’s first and last 4 s as V0 and V1. A model is evaluated on its ability to correctly associate (V0, C0) and (V1, C1). Since V0 and V1 are sampled from the same video, the overall scene, characters, and actions are highly likely to be similar, so fine-grained distinctions may be necessary.
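
Assuming Winoground-style scoring, the sketch below shows how the pairing checks for a single IVAT example can be computed from any video-text similarity function (for instance, the one sketched earlier); the exact definitions behind the columns reported below may differ.

```python
from typing import Callable

def ivat_scores(sim: Callable[[str, str], float],
                v0: str, v1: str, c0: str, c1: str) -> dict:
    """Winoground-style pairing checks for one IVAT example (illustrative)."""
    s = [[sim(v0, c0), sim(v0, c1)],
         [sim(v1, c0), sim(v1, c1)]]
    # v2t: each video segment must prefer its own caption.
    v2t = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    # t2v: each caption must prefer its own video segment.
    t2v = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    # group: both directions must be solved jointly on the same example.
    return {"t2v": t2v, "v2t": v2t, "group": t2v and v2t}
```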

| Model | C t2v | C v2t | I t2v | I v2t | M t2v | M v2t | Group |
|---|---|---|---|---|---|---|---|
| Random | 50.0 | 50.0 | 50.0 | 50.0 | 25.0 | 25.0 | 16.7 |
| CLIP-B/32 | 95.5 | 91.8 | 73.4 | 67.5 | 49.8 | 36.9 | 26.8 |
| CLIP-L/14 | 96.3 | 94.2 | 73.3 | 68.8 | 49.4 | 39.9 | 27.0 |
| EVA-CLIP-L/14 | 96.8 | 95.4 | 74.6 | 71.6 | 52.0 | 45.2 | 30.9 |
| SigLIP-B/16 | 95.9 | 94.2 | 70.5 | 69.0 | 44.7 | 40.9 | 28.1 |
| SigLIP-L/16 | 96.7 | 95.0 | 72.2 | 69.7 | 46.9 | 42.6 | 29.5 |
| NegCLIP-B/32 | 95.4 | 94.1 | 71.8 | 72.1 | 47.2 | 47.0 | 32.8 |
| CLIP-ViP-B/32 | 92.5 | 93.6 | 67.5 | 68.7 | 38.0 | 40.8 | 24.5 |
| ViFi-CLIP-B/16 | 95.6 | 93.7 | 72.5 | 67.8 | 49.0 | 38.4 | 26.2 |
| mPLUG-V | 78.5 | 79.3 | 51.8 | 51.1 | 17.5 | 13.6 | 7.0 |
| PLLaVA | 89.6 | 88.3 | 65.8 | 63.1 | 36.9 | 30.3 | 19.1 |
| Video-LLaVA | 91.9 | 90.5 | 67.0 | 63.1 | 40.8 | 29.9 | 21.7 |
| Owl-Con | 93.0 | 90.1 | 66.6 | 63.1 | 42.2 | 33.2 | 24.8 |
| Human | – | – | 96.0 | 95.4 | 92.0 | 91.3 | – |

Subtle Perception Tests

In these tests, given a 10 s video, the model must select the positive (correct) caption over the negative (incorrect) one. Note that the positive caption describes only one of the five events in the video.
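
The sketch below shows how such a binary-choice test can be scored with any video-text similarity function; chance accuracy is 50%. The triplet format is an assumption about how the examples might be organized.

```python
def pairwise_accuracy(sim, examples):
    """examples: iterable of (video, positive_caption, negative_caption) triplets."""
    examples = list(examples)
    # A trial counts as correct only when the positive caption scores strictly
    # higher than the negative one; ties count as failures.
    correct = sum(sim(video, pos) > sim(video, neg) for video, pos, neg in examples)
    return correct / len(examples) if examples else 0.0
```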

Action Adversarial Test

The negative caption is generated by replacing the action in the event description with a contradictory action that does not appear in the video. Solving this test requires the model to not only identify actions but also implicitly localize them in the video while not being distracted by other events.

Action Modifier Test

The negative caption is generated by replacing the manner of the action with another plausible modifier. Identifying these subtle variations is challenging for most models.


Visual Binding Tests

A truly compositional model should be capable of identifying entities and binding them through the right relationship. To solve these tests, the model must capture and learn such associations.

Agent Binding Test

In this test, negative captions are created by replacing the agent (the doer of the action) with another agent from the same video. Models that lack binding ability struggle on this test.

Agent Identification Test

This test has negative captions where the agent is replaced by a random reference from elsewhere in the VidSitu dataset. These LLM-generated replacements are often easy to reject. The Agent Identification Test thus acts as a control for the Agent Binding Test: models perform well when the replacement agent comes from outside the video but fail when concept binding is required.
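
The contrast between the two tests can be summarized with the purely illustrative sketch below. The actual VELOCITI negatives are generated from VidSitu SRL annotations with an LLM, so the simple string substitution and argument names here are hypothetical.

```python
import random

def agent_negatives(caption: str, agent: str,
                    same_video_agents: list[str],
                    dataset_agents: list[str]) -> dict:
    """Build hypothetical negatives for the two agent tests.

    Assumes the video contains at least two distinct agents and that the
    dataset contains agents that do not appear in this video.
    """
    # Agent Binding: the replacement appears elsewhere in the SAME video,
    # so rejecting it requires binding the right agent to the action.
    in_video = random.choice([a for a in same_video_agents if a != agent])
    # Agent Identification: the replacement comes from a DIFFERENT video,
    # which is typically much easier to reject.
    out_of_video = random.choice([a for a in dataset_agents if a not in same_video_agents])
    return {
        "agent_binding": caption.replace(agent, in_video, 1),
        "agent_identification": caption.replace(agent, out_of_video, 1),
    }
```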

Action Binding Test

In this test, the negative caption is created by replacing the action and its modifiers with another action and modifiers from a different event of the same video, while the primary agent remains unchanged.

Agent CoReference Test

Co-reference occurs when two or more expressions refer to the same entity. In a video, entities are referred to by their appearance or by the actions they perform. For this test, we identify videos in which a single agent acts in at least two events. The positive caption is created by concatenating the two references. Solving this test requires resolving multi-step binding across time.

Chronology Test

Time is a unique aspect of video understanding. Even in a short 10 s video, understanding the story progression across multiple events requires a model to bind each event to its internal representations of temporal concepts like before, after, first, and then. This test measures whether a model can identify the correct order of events in a video. The positive caption describes two events that happen sequentially, while the negative caption contains the same descriptions in reversed order.
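
A minimal sketch of how a chronology pair could be formed from two consecutive event descriptions; the template wording is an assumption, not the exact phrasing used in VELOCITI.

```python
def chronology_pair(first_event: str, later_event: str) -> tuple[str, str]:
    """Return (positive, negative) captions for the chronology test (illustrative)."""
    positive = f"{first_event} Then, {later_event}"   # correct temporal order
    negative = f"{later_event} Then, {first_event}"   # same events, order reversed
    return positive, negative
```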

| Model | AgIden | AgBind | AgCoref | ActAdv | ActBind | ActMod | Chrono | Average |
|---|---|---|---|---|---|---|---|---|
| CLIP-B/32 | 77.6 | 56.3 | 52.6 | 64.0 | 57.6 | 52.1 | 49.4 | 58.5 |
| CLIP-L/14 | 82.6 | 55.4 | 56.9 | 66.2 | 58.0 | 56.8 | 50.2 | 60.9 |
| EVA-CLIP-L/14 | 83.3 | 53.4 | 55.0 | 70.2 | 55.3 | 51.2 | 51.1 | 59.9 |
| SigLIP-B/16 | 80.0 | 54.4 | 51.0 | 63.8 | 54.5 | 61.1 | 49.8 | 59.2 |
| SigLIP-L/16 | 78.8 | 53.3 | 52.2 | 61.6 | 52.3 | 61.1 | 51.2 | 59.4 |
| NegCLIP-B/32 | 83.4 | 55.6 | 50.5 | 61.8 | 52.3 | 61.1 | 51.2 | 59.4 |
| CLIP-ViP-B/32 | 75.3 | 52.4 | 55.7 | 70.2 | 53.5 | 51.2 | 48.5 | 58.1 |
| ViFi-CLIP-B/16 | 82.3 | 58.7 | 54.6 | 63.0 | 59.3 | 60.5 | 49.8 | 61.2 |
| mPLUG-V | 43.0 | 31.9 | 51.7 | 65.0 | 42.0 | 49.6 | 41.3 | 46.3 |
| PLLaVA | 68.6 | 43.3 | 60.5 | 62.4 | 46.6 | 56.0 | 49.6 | 55.3 |
| Video-LLaVA | 74.1 | 50.4 | 60.1 | 63.6 | 47.0 | 47.9 | 56.0 | 57.0 |
| Owl-Con | 67.4 | 44.6 | 50.0 | 73.0 | 51.1 | 63.2 | 45.6 | 56.4 |
| Gemini-1.5-Flash | 91.8 | 76.4 | 67.8 | 80.0 | 76.4 | 76.9 | 68.3 | 76.8 |
| Human | 94.7 | 93.3 | 96.0 | 100.0 | 92.7 | 91.3 | 93.3 | 94.4 |

Dataset Stats

For dataset statistics, details on the LLM prompts, and the methodology for the various evaluations, including human evaluations, please refer to the paper.

Abstract

Compositionality is a fundamental aspect of vision-language understanding and is especially required for videos since they contain multiple entities (e.g. persons, actions, and scenes) interacting dynamically over time. Existing benchmarks focus primarily on perception capabilities. However, they do not study binding, the ability of a model to associate entities through appropriate relationships. To this end, we propose VELOCITI, a new benchmark building on complex movie clips and dense semantic role label annotations to test perception and binding in video language models (contrastive and Video-LLMs). Our perception-based tests require discriminating video-caption pairs that share similar entities, and the binding tests require models to associate the correct entity to a given situation while ignoring the different yet plausible entities that also appear in the same video. While current state-of-the-art models perform moderately well on perception tests, accuracy is near random when both entities are present in the same video, indicating that they fail at binding tests. Even the powerful Gemini 1.5 Flash has a substantial gap (16-28%) with respect to human accuracy in such binding tests.

BibTeX

@article{velociti,
  title={VELOCITI: Can Video-Language Models Bind Semantic Concepts Through Time?},
  author={Saravanan, Darshana and Singh, Darshan and Gupta, Varun and Khan, Zeeshan and Gandhi, Vineet and Tapaswi, Makarand},
  journal={arXiv:2406.10889},
  year={2024}
}