Given an image pair, it is easier for an MLLM to discern fine-grained differences when prompted with a question and/or multiple choices (VQA evaluation) than to independently detect and describe such differences (our evaluation). Our work finds that state-of-the-art MLLMs struggle to discern fine-grained differences under our Detect-Describe-Discriminate (D3) evaluation framework, with open-source MLLMs failing to outperform random chance.
VQA with multiple choices provides a reliable method for checking the existence of specific visual abilities. However, we find that it is easier for the model to select an answer from multiple choices than to generate the answer itself. Specifically, providing the answer along with the task prompt (through multiple-choice options or as part of the question) biases the MLLM's output towards the visual concept being evaluated.
In this work, we offer a novel perspective for fine-grained evaluation: we assess how well an MLLM understands a specific visual concept by its ability to uniquely describe two extremely similar images that differ only in the targeted visual concept. Unlike VQA, we do not constrain the output to multiple-choice answers and directly evaluate the model's free-form generation. Specifically, we assess the MLLM's ability to capture subtle visual distinctions using self-retrieval evaluation, i.e., by retrieving the target image using its generated caption, with the other image in the pair serving as the distractor.
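To make the self-retrieval protocol concrete, below is a minimal sketch of pair-level self-retrieval scoring, assuming an off-the-shelf image-text scorer (CLIP via Hugging Face transformers); the scorer, prompting pipeline, and function names here are illustrative assumptions and may differ from the exact setup used in D3.

```python
# Minimal self-retrieval sketch. Assumes CLIP as the image-text scorer;
# the exact scorer and pipeline used by D3 may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def self_retrieval_correct(captions, image_paths):
    """Return True iff each MLLM-generated caption retrieves its own image,
    with the other image in the pair acting as the distractor."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        # similarity matrix: rows = captions, columns = images
        sims = model(**inputs).logits_per_text
    # Caption i must score higher for image i than for the distractor image.
    return bool((sims.argmax(dim=1) == torch.arange(len(captions))).all())

# Hypothetical usage with captions generated by an MLLM for one image pair:
# correct = self_retrieval_correct(
#     ["a red mug to the left of the laptop", "a red mug to the right of the laptop"],
#     ["pair_001_a.jpg", "pair_001_b.jpg"])
```

A pair is counted as correct only when both captions retrieve their own images, so a single generic caption that fails to capture the point of difference is penalized.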
We curate extremely similar image pairs, each having one prominent point of visual difference such that uniquely describing each image within the pair entails capturing a specific facet of visual understanding. With 247 such image pairs, we introduce the D3 benchmark. For each image in the pair, we prompt the model to:
We benchmark state-of-the-art open-source (Cambrian-34B, Chameleon-30B, LLaVA-NeXT-34B) and closed-source (GPT-4o, Gemini-1.5-Pro, Claude-Sonnet-3.5) MLLMs on the D3 benchmark with self-retrieval evaluation. Results reveal that current MLLMs struggle to incorporate fine-grained visual details in their captions. Open-source models such as Cambrian-34B and LLaVA-NeXT-34B fail to outperform random chance (25%, the probability of correctly retrieving both images in a pair by chance).
Although closed-source models outscore random chance, they still struggle to discern fine-grained visual differences, with Claude-Sonnet-3.5 achieving the highest score of 45.7% on our benchmark.
The images in each D3 pair are visually identical except for one prominent Point of Difference (POD) distinguishing them. Similar to MMVP, we identify six fine-grained PODs that MLLMs struggle to capture:
Since image pairs in the D3 benchmark have only one prominent POD, self-retrieval within D3 enables white-box evaluation. Specifically, when an MLLM fails to uniquely describe and retrieve both images within a pair, we can accurately identify which of the following visual concepts the MLLM fails to pick up:
@misc{gaur2024no,
title={Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation},
author={Manu Gaur and Darshan Singh S and Makarand Tapaswi},
year={2024},
eprint={2409.15125},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This project page is built upon MMVP.