Transactions on Machine Learning Research, 2024
Image captioning systems struggle to generate fine-grained captions, often producing generic descriptions that fail to discriminate between visually similar images. We use self-retrieval to improve both the training and evaluation of captioning systems.
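Concretely, self-retrieval scores a captioner by checking whether its generated caption can pick out the source image from a set of candidates in a shared text-image embedding space (e.g., CLIP). A minimal sketch, assuming precomputed CLIP embeddings (tensor and function names are illustrative, not our code):

import torch

def self_retrieval_recall_at_1(text_emb, image_emb):
    # text_emb, image_emb: (N, D) caption and image embeddings from a shared
    # text-image space (e.g., CLIP); row i of each corresponds to the same image.
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    image_emb = torch.nn.functional.normalize(image_emb, dim=-1)
    sims = text_emb @ image_emb.T                      # caption-to-image cosine similarities
    hits = sims.argmax(dim=-1) == torch.arange(text_emb.size(0))
    return hits.float().mean().item()                  # fraction of captions retrieving their own image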
Our work is built upon three key components, each focused on making captioning systems more fine-grained: the TrueMatch benchmark, Visual Caption Boosting (VCB), and a "plug and play" self-retrieval training recipe.
We find that captioning systems, irrespective of their size, struggle to capture fine-grained visual details, leading to poor performance on TrueMatch. However, using our "plug and play" training recipe, we are able to outperform state-of-the-art open-source MLLMs while having 1-2 orders of magnitude fewer parameters.
TrueMatch comprises bags of highly similar images with varying sizes, where uniquely describing each image within a bag requires capturing various facets of visual understanding. It offers a comprehensive framework for using self-retrieval to holistically evaluate captioning systems. For example, retrieving the target image from Figure 2 requires the caption to incorporate information about attributes (red bike) or orientation (inverted body).
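In other words, evaluation on TrueMatch restricts the retrieval pool to the other members of a bag rather than random distractors. A minimal sketch of the per-bag scoring, again assuming precomputed CLIP embeddings (the bag structure shown is illustrative):

import torch

def bag_recall_at_1(text_emb, image_emb, bags):
    # bags: list of index lists; each bag groups highly similar images.
    # A caption scores a hit only if it ranks its own image above the
    # other images in the same bag.
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    image_emb = torch.nn.functional.normalize(image_emb, dim=-1)
    hits, total = 0, 0
    for bag in bags:
        idx = torch.tensor(bag)
        sims = text_emb[idx] @ image_emb[idx].T        # (|bag|, |bag|) similarities within the bag
        hits += (sims.argmax(dim=-1) == torch.arange(len(bag))).sum().item()
        total += len(bag)
    return hits / total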
Benchmarking Captioning Systems on TrueMatch
Table 1 evaluates several open-source captioning approaches, MLLMs, and SR trained models on TrueMatch.
Existing image captioning datasets like MSCOCO comprise generic annotations.
While recent works address this by synthetically expanding the visual information within the captions, they are prone to inheriting biases present in foundation models.
Interestingly, although the individual COCO captions are sparse, we find that they describe complementary facets of the image (see Figure 3).
VCB is a novel two-stage framework that leverages foundation models to generate rich descriptions while staying anchored in human data:
Stage 1: BlendCap uses an LLM to create a blended caption that combines diverse perspectives
offered by different human annotations.
Notably, we prompt the LLM to minimize redundant information, resulting in short descriptions.
Stage 2: HolisticCap
expands BlendCap by using an LLM to instill fine-grained details from the Visual Caption (InstructBLIP).
The LLM is prompted to prefer human-grounded BlendCap in case of conflicting visual information.
Anchoring semantic visual information in human annotations reduces verbose tendencies of MLLMs, producing rich and succinct captions.
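The pipeline reduces to two LLM calls per image. A schematic sketch follows; the `chat` helper and the prompts are illustrative placeholders, not the exact prompts used in our pipeline.

def chat(prompt):
    # Placeholder for a call to an instruction-tuned LLM.
    raise NotImplementedError

def blend_cap(human_captions):
    # Stage 1 (BlendCap): merge the complementary facets of the independent
    # human annotations into one short caption, dropping redundant phrases.
    prompt = (
        "Combine the following captions of one image into a single short caption "
        "that keeps every distinct detail and removes repeated information:\n"
        + "\n".join("- " + c for c in human_captions)
    )
    return chat(prompt)

def holistic_cap(blended, visual_caption):
    # Stage 2 (HolisticCap): enrich BlendCap with fine-grained details from an
    # MLLM caption (e.g., InstructBLIP); on conflicts, keep the human-grounded text.
    prompt = (
        "Rewrite the base caption, adding fine-grained visual details from the model "
        "caption. If the two disagree, trust the base caption.\n"
        "Base caption: " + blended + "\nModel caption: " + visual_caption
    )
    return chat(prompt)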
Benchmarking Visually Boosted Captions on TrueMatch
We adopt ClipCap (Mokady et al., 2021), a lightweight (200M-parameter) simplification of modern MLLMs; a minimal sketch of this style of architecture is shown below. Training the captioning system has two stages: SR-L and SR-V, described in the next two sections.
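The sketch assumes a 512-d CLIP image embedding and a 768-d GPT-2-style language model; the mapping network and dimensions are illustrative, not ClipCap's exact configuration.

import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    # Maps a CLIP image embedding (e.g., 512-d) to `prefix_len` prefix vectors
    # in the language model's embedding space (e.g., 768-d for GPT-2).
    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        hidden = (clip_dim + lm_dim * prefix_len) // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden), nn.Tanh(), nn.Linear(hidden, lm_dim * prefix_len)
        )

    def forward(self, clip_embedding):
        # (B, clip_dim) -> (B, prefix_len, lm_dim); the prefix is prepended to the
        # caption token embeddings and fed to the language model.
        return self.mlp(clip_embedding).view(-1, self.prefix_len, self.lm_dim)

prefix = PrefixMapper()(torch.randn(4, 512))           # 4 images -> (4, 10, 768) prefix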
Fine-tuning the Language Model with Self-Retrieval (SR-L): A Deep Dive
MLE encourages the generation of generic descriptions. Retrieval performance of the MLE-trained captioner drops substantially across all datasets (rows 1-3) compared to their ground-truth captions (see Table 2).
Self-retrieval fine-tuning benefits from a rich MLE initialization.
SR-L fine-tuning with HolisticCap significantly outperforms COCO (Dessì et al., 2023) on TrueMatch.
Self-retrieval unlocks latent semantic information when fine-tuning the LLM. The model trained on COCO, due to the MLE objective, generates sparse captions that resemble the independent annotations of COCO. This leads to a large gap between COCO and BlendCap on RD100 (rows 1, 2), despite both having similar semantic information. Interestingly, SR-L fine-tuning narrows this gap dramatically (rows 4, 5).
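Because the retrieval reward is non-differentiable, SR fine-tuning of the language model optimizes it with a policy-gradient estimator over sampled captions (following Dessì et al., 2023). A minimal REINFORCE-style sketch; the reward definition and baseline below are illustrative simplifications:

import torch

def sr_policy_gradient_loss(log_probs, text_emb, image_emb):
    # log_probs: (B,) summed log-probabilities of the sampled captions (requires grad).
    # text_emb:  (B, D) CLIP embeddings of the sampled captions.
    # image_emb: (B, D) CLIP embeddings of the corresponding images.
    with torch.no_grad():
        t = torch.nn.functional.normalize(text_emb, dim=-1)
        v = torch.nn.functional.normalize(image_emb, dim=-1)
        sims = t @ v.T
        # reward = 1 if the sampled caption retrieves its own image against the
        # other images in the batch, 0 otherwise; the batch mean acts as a baseline.
        reward = (sims.argmax(dim=-1) == torch.arange(sims.size(0))).float()
        advantage = reward - reward.mean()
    return -(advantage * log_probs).mean()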
Fine-tuning the Visual Encoder with Self-Retrieval (SR-V)
While fine-tuning CLIP with SR yields superior retrieval performance, it makes the captioner less faithful to the ground-truth captions.
Retrieval Performance vs. Caption Faithfulness: The Trade-off Plaguing Current SR Approaches
BagCurri: Curriculum Training with Bags of Hard Negatives
For a stronger learning signal, we fine-tune with bags of highly similar images within a minibatch instead of retrieving against 99 random distractors (vanilla SR). We also propose a curriculum over bag sizes (see Figure 6) to more optimally leverage the contrastive SR reward.
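A minimal sketch of the batch construction follows; the bag sizes and the linear schedule are illustrative, not our exact curriculum.

import random

def bag_size_for_step(step, total_steps, sizes=(2, 3, 5, 7)):
    # Curriculum: start with small bags (few hard negatives) and move to
    # larger bags as training progresses.
    stage = min(int(len(sizes) * step / total_steps), len(sizes) - 1)
    return sizes[stage]

def sample_minibatch(bags_by_size, step, total_steps, batch_size=32):
    # Fill a minibatch with whole bags of visually similar images, so the
    # in-batch distractors for each image are its hard negatives rather
    # than 99 random images.
    size = bag_size_for_step(step, total_steps)
    batch = []
    while len(batch) + size <= batch_size:
        batch.extend(random.choice(bags_by_size[size]))
    return batch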
SR-V fine-tuning with BagCurri (row 7)
BEST OF BOTH WORLDS: SR-LV fine-tuning with BagCurri.
We initialize the captioner with HolisticCap and fine-tune both the language and visual components (SR-LV) with BagCurri.
This results in the most visually fine-grained model while also improving CIDEr over the MLE trained captioner.
Notably, we find that the combination of the rich initialization provided by HolisticCap and our BagCurri curriculum is what preserves caption faithfulness: CIDEr decreases when initializing from COCO (row 4) or when bags are used without our curriculum.
CIDEr Optimization meets SR Fine-Tuning
Generating Diverse Captions
Using our training recipe with Visual Caption Boosting guides the captioner away from the language modelling priors, improving caption diversity over COCO MLE training by 304%.
@misc{ndlb2024,
title={{No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning}},
author={Manu Gaur and Darshan Singh S and Makarand Tapaswi},
year={2024},
eprint={2409.03025},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This project page is built upon MMVP.