Given an image pair, it is easier for an MLLM to discern fine-grained differences when prompted with a question and/or multiple choices (VQA evaluation) than to independently detect and describe such differences (our evaluation). Our work finds that state-of-the-art MLLMs struggle to discern fine-grained differences under our Detect-Describe-Discriminate (D3) evaluation framework, with open-source MLLMs failing to outperform random chance.
VQA with multiple choices provides a reliable method for checking the existence of specific visual abilities. However, we find that it is easier for the model to select an answer from multiple choices than to generate the answer itself. Specifically, providing the answer along with the task prompt (through multiple-choice options or as part of the question) biases the MLLM's output towards the visual concept being evaluated.
In this work, we offer a novel perspective for fine-grained evaluation: we assess how well an MLLM understands a specific visual concept by its ability to uniquely describe two extremely similar images that differ only in the targeted visual concept. Unlike VQA, we do not constrain the output to multiple-choice answers and directly evaluate the model's free-form generation. Specifically, we assess the MLLM's ability to capture subtle visual distinctions using self-retrieval evaluation, i.e., by retrieving the target image using its generated caption, with the other image in the pair serving as the distractor.
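To make the self-retrieval protocol concrete, below is a minimal sketch of pair-level self-retrieval scoring, assuming an off-the-shelf image-text scorer (CLIP via Hugging Face transformers); the scorer, prompting pipeline, and function names here are illustrative assumptions and may differ from the exact setup used in D3.

```python
# Minimal self-retrieval sketch. Assumes CLIP as the image-text scorer;
# the exact scorer and pipeline used by D3 may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def self_retrieval_correct(captions, image_paths):
    """Return True iff each MLLM-generated caption retrieves its own image,
    with the other image in the pair acting as the distractor."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        # similarity matrix: rows = captions, columns = images
        sims = model(**inputs).logits_per_text
    # Caption i must score higher for image i than for the distractor image.
    return bool((sims.argmax(dim=1) == torch.arange(len(captions))).all())

# Hypothetical usage with captions generated by an MLLM for one image pair:
# correct = self_retrieval_correct(
#     ["a red mug to the left of the laptop", "a red mug to the right of the laptop"],
#     ["pair_001_a.jpg", "pair_001_b.jpg"])
```

A pair is counted as correct only when both captions retrieve their own images, so a single generic caption that fails to capture the point of difference is penalized.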
We curate extremely similar image pairs, each having one prominent point of visual difference such that uniquely describing each image within the pair entails capturing a specific facet of visual understanding. With 247 such image pairs, we introduce the D3 benchmark. For each image in the pair, we prompt the model to:
We benchmark state-of-the-art open-source (Cambrian-34B, Chameleon-30B, LLaVA-NeXT-34B) and closed-source (GPT-4o, Gemini-1.5-Pro, Claude-Sonnet-3.5) MLLMs on the D3 benchmark with self-retrieval evaluation. Results reveal that current MLLMs struggle to incorporate fine-grained visual details in their captions. Open-source models such as Cambrian-34B and LLaVA-NeXT-34B fail to outperform random chance (25%, the probability of correctly retrieving both images in a pair by chance).
Although closed-source models outscore random chance, they still struggle to discern fine-grained visual differences, with Claude-Sonnet-3.5 achieving the highest score of 45.7% on our benchmark.
The images in each D3 pair are visually identical except for one prominent Point of Difference (POD) distinguishing them. Similar to MMVP, we identify six fine-grained PODs that MLLMs struggle to capture:
Since image pairs in the D3 benchmark have only one prominent POD, self-retrieval within D3 enables white-box evaluation. Specifically, when an MLLM fails to uniquely describe and retrieve both images within a pair, we can accurately identify which of the following visual concepts the MLLM fails to pick up:
@misc{gaur2024no,
title={Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation},
author={Manu Gaur and Darshan Singh S and Makarand Tapaswi},
year={2024},
eprint={2409.15125},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This project page is built upon MMVP.