No Detail Left Behind:
Revisiting Self-Retrieval for Fine-Grained Image Captioning

Transactions on Machine Learning Research, 2024

Teaser Image
This image was generated using Flux.1 [schnell].

Image captioning systems are unable to generate fine-grained captions, often producing generic descriptions that fail to discriminate between visually similar images. We use self-retrieval to improve both training and evaluation of captioning systems.


Figure 1: Self-retrieval judges the ability of a captioner to retrieve an image using its generated caption against a bag of distractor images. Image source.

Our work is built upon three key components, each focused on making captioning systems more fine-grained:

  1. Evaluation: Traditional image caption evaluation is neither fine-grained nor does it reward diversity. Hence, we present TrueMatch, a benchmark that uses self-retrieval to assess the captioner's ability to capture subtle visual distinctions.
  2. Data: Training on noisy (alt-text) or generic (human-annotated) datasets bottlenecks fine-grained caption generation. To this end, we present Visual Caption Boosting to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations.
  3. Self-Retrieval Training: MLE training encourages generation of statistically probable phrases. We design a training recipe to address this by instilling discriminant visual information in the captioner.

We find that captioning systems, irrespective of their size, struggle to capture fine-grained visual details, leading to poor performance on TrueMatch. However, using our "plug and play" training recipe, we are able to outperform state-of-the-art open-source MLLMs while having 1-2 orders of magnitude fewer parameters.


I. TrueMatch: Fine-grained Evaluation of Captioning Systems

motorcycles
Figure 2: An example from TrueMatch. The MLE-trained captioner (COCO MLE) produces the same caption for every image in the bag. Vanilla self-retrieval training (COCO SR) hallucinates details such as “two people” (middle). Our proposed method (OURS SR) generates discriminant captions.

TrueMatch comprises bags of highly similar images with varying sizes, where uniquely describing each image within a bag requires capturing various facets of visual understanding. It offers a comprehensive framework for using self-retrieval to holistically evaluate captioning systems. For example, retrieving the target image from Figure 2 requires the caption to incorporate information about attributes (red bike) or orientation (inverted body).
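To make the evaluation concrete, below is a minimal PyTorch sketch of bag-level self-retrieval Recall@1. It assumes image and caption embeddings from a CLIP-style dual encoder; the function name and shapes are illustrative and not the official TrueMatch code.

import torch

def self_retrieval_recall_at_1(image_embs: torch.Tensor, caption_embs: torch.Tensor) -> float:
    # image_embs: (B, D) embeddings of the images in one bag.
    # caption_embs: (B, D) embeddings of the captions generated for those images.
    image_embs = torch.nn.functional.normalize(image_embs, dim=-1)
    caption_embs = torch.nn.functional.normalize(caption_embs, dim=-1)
    sims = caption_embs @ image_embs.T            # (B, B) caption-to-image cosine similarity
    predicted = sims.argmax(dim=-1)               # image retrieved by each caption
    targets = torch.arange(image_embs.size(0), device=sims.device)
    return (predicted == targets).float().mean().item()

The benchmark score is then the average of this quantity over all bags of a given size (e.g., the bags of size 3, 5, and 7 reported in Table 1).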


Benchmarking Captioning Systems on TrueMatch

Table 1 evaluates several open-source captioning approaches, MLLMs, and SR trained models on TrueMatch.

  • Irrespective of their size, captioners struggle to capture fine-grained visual details, leading to poor performance on TrueMatch.
  • Our SR training recipe outperforms vanilla SR (DiscriTune) by +14.4% to +19.5%, demonstrating its effectiveness.
  • Cambrian-1 is the best-performing open-source MLLM. Our proposed approach significantly outperforms it despite being 30x smaller.

table1
Table 1: Recall@1 for self-retrieval evaluation on TrueMatch. TrueMatch contains 254 bags of size 3 (#3), 104 bags of size 5 (#5), and 93 bags of size 7 (#7).

II. Visual Caption Boosting

Existing image captioning datasets like MSCOCO comprise generic annotations. While recent works address this by synthetically expanding the visual information within the captions, they are prone to inheriting biases present in foundation models. Interestingly, although the individual COCO captions are sparse, we find that they describe complementary facets of the image (see Figure 3).

Visual Caption Boosting (VCB) is a novel two-stage framework that leverages foundation models to generate rich descriptions while remaining anchored in human data:

table1
Figure 3: VCB Framework. Generic COCO captions are visually boosted using complementary details present in human annotations. VCB ignores specific details from the visual caption (“river or lake”, “drinking”), as they conflict with the annotations (“watering hole”, “grazing and standing”).

Stage 1: BlendCap uses an LLM to create a blended caption that combines the diverse perspectives offered by different human annotations. Notably, we prompt the LLM to minimize redundant information, resulting in short descriptions.

Stage 2: HolisticCap expands BlendCap by using an LLM to instill fine-grained details from the Visual Caption (InstructBLIP). The LLM is prompted to prefer human-grounded BlendCap in case of conflicting visual information.

Anchoring semantic visual information in human annotations reduces verbose tendencies of MLLMs, producing rich and succinct captions.
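The sketch below only illustrates the two-stage structure of VCB; the prompt wording is our own paraphrase and llm(prompt) stands in for any instruction-tuned LLM, so none of it should be read as the exact prompts used in the paper.

def blend_cap(llm, human_captions):
    # Stage 1 (BlendCap): merge the complementary human annotations into one short caption.
    prompt = (
        "Combine the following captions of the same image into a single caption. "
        "Keep every distinct visual detail, remove redundancy, and stay concise:\n"
        + "\n".join(f"- {c}" for c in human_captions)
    )
    return llm(prompt)

def holistic_cap(llm, blended_caption, visual_caption):
    # Stage 2 (HolisticCap): enrich the blended caption with fine-grained details from
    # a visual caption (e.g., from InstructBLIP), preferring the human-grounded text
    # whenever the two sources conflict.
    prompt = (
        "Enrich the blended caption with fine-grained details from the visual caption. "
        "If the two conflict, trust the blended caption.\n"
        f"Blended caption: {blended_caption}\n"
        f"Visual caption: {visual_caption}"
    )
    return llm(prompt)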

Benchmarking Visually Boosted Captions on TrueMatch

  • BlendCap significantly outperforms individual COCO captions on RD100 and CLIPScore, confirming that human annotations capture complementary visual aspects of the same image.
  • HolisticCap (unlike BlendCap) yields substantial gains over COCO across bags of TrueMatch.

table1
Table 2: R@1 scores for COCO and VCB captions evaluated on RD100 (100 random distractors) and TrueMatch.

III. Guiding captioners away from their language modeling priors

We adopt ClipCap (Mokady et al., 2021), a lightweight (200M parameter) simplification of modern MLLMs. Training the captioning system involves two stages:

  1. Maximum Likelihood Estimation (MLE) pretraining.
  2. REINFORCE fine-tuning by maximizing the self-retrieval (SR) reward.

table1
Figure 4: ClipCap connects a CLIP visual encoder to GPT-2 through an MLP adapter.
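To tie the architecture and the second training stage together, here is a rough PyTorch sketch of a ClipCap-style prefix adapter and a single REINFORCE update with the self-retrieval reward (whether the sampled caption retrieves its own image within the minibatch). The dimensions, the baseline, and the helpers sample_captions, caption_log_probs, and encode_texts are placeholders for the frozen CLIP/GPT-2 machinery, not the authors' implementation.

import torch
import torch.nn as nn

class PrefixAdapter(nn.Module):
    # Maps a CLIP image embedding to `prefix_len` soft tokens fed to GPT-2.
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len, gpt_dim * prefix_len),
        )

    def forward(self, clip_emb):                               # (B, clip_dim)
        prefix = self.mlp(clip_emb)                            # (B, gpt_dim * prefix_len)
        return prefix.view(clip_emb.size(0), self.prefix_len, -1)

def sr_reinforce_step(adapter, gpt2, clip_image_embs, optimizer,
                      sample_captions, caption_log_probs, encode_texts):
    # One REINFORCE update with a batch-level self-retrieval reward.
    # clip_image_embs: (B, D) frozen CLIP image embeddings for the minibatch (or bag).
    prefix = adapter(clip_image_embs)                          # soft prompt per image
    captions = sample_captions(gpt2, prefix)                   # B sampled captions (strings)
    log_probs = caption_log_probs(gpt2, prefix, captions)      # (B,) sum of token log-probs

    with torch.no_grad():
        text_embs = torch.nn.functional.normalize(encode_texts(captions), dim=-1)
        image_embs = torch.nn.functional.normalize(clip_image_embs, dim=-1)
        sims = text_embs @ image_embs.T                        # (B, B) caption-to-image similarities
        correct = sims.argmax(dim=-1) == torch.arange(len(captions), device=sims.device)
        reward = correct.float() - correct.float().mean()      # centred retrieval reward

    loss = -(reward * log_probs).mean()                        # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()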


Fine-tuning the Language Model with Self-Retrieval (SR-L): A Deep Dive

MLE encourages the generation of generic descriptions. Retrieval performance of the MLE-trained captioner drops substantially for all datasets (rows 1-3) compared to their ground-truth captions (see Table 2).

Self-retrieval fine-tuning benefits from a rich MLE initialization. SR-L fine-tuning with HolisticCap significantly outperforms COCO (Dessì et al., 2023) on TrueMatch.

table1
Table 3: Results across different training approaches and captions show the effectiveness of the rich initialization provided by HolisticCap and SR-L fine-tuning.

Self-retrieval unlocks latent semantic information when fine-tuning the LLM. The model trained on COCO, due to the MLE objective, generates sparse captions that resemble the independent annotations of COCO. This leads to a large gap between COCO and BlendCap on RD100 (rows 1, 2), despite both having similar semantic information. Interestingly, SR-L fine-tuning narrows this gap dramatically (rows 4, 5).


Fine-tuning the Visual Encoder with Self-Retrieval (SR-V)

While fine-tuning CLIP with SR yields superior retrieval performance, it makes the captioner less faithful to the ground-truth captions.



Retrieval Performance vs. Caption Faithfulness: The Trade-off Plaguing Current SR Approaches

  • We observe that captioning systems fine-tuned with vanilla SR-L (Dessì et al. 2023) become less faithful to the ground-truth captions upon extended training.
  • As seen in Figure 5, retrieval performance continually improves while CIDEr dips significantly.
  • We also find that captioners fine-tuned with SR, in a bid to enhance retrieval performance, tend to hallucinate details.
  • Furthermore, SR-V fine-tuning worsens SR's tendency to hallucinate and degrades attribute binding in captioners.

table1
Figure 5: R@1 continually increases while CIDEr degrades when fine-tuning ClipCap with SR-L on COCO.


BagCurri: Curriculum Training with Bags of Hard Negatives

For a stronger learning signal, we fine-tune with bags of highly similar images within a minibatch instead of retrieving against 99 random distractors (vanilla SR). We also propose a curriculum over bag sizes (see Figure 6) to better leverage the contrastive SR reward.
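As a minimal sketch, the curriculum can be read as a schedule that grows the bag size as training progresses; the epoch thresholds and bag sizes below are illustrative assumptions, not the paper's exact settings.

def bag_size_for_epoch(epoch, schedule=((0, 2), (2, 3), (4, 5), (6, 7))):
    # Return the bag size to train with at `epoch`, given (start_epoch, bag_size) pairs.
    size = schedule[0][1]
    for start, bag_size in schedule:
        if epoch >= start:
            size = bag_size
    return size

Each minibatch is then assembled from bags of bag_size_for_epoch(epoch) visually similar images, so the SR reward contrasts against hard negatives rather than 99 random distractors.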

Table 4 reports SR-L fine-tuning with BagCurri (row 6) and SR-V fine-tuning with BagCurri (row 7).
table1
Table 4 (Left): Impact of fine-tuning different components with BagCurri, compared against vanilla SR (SR-L). Figure 6 (Right): Our curriculum progressively increases bag sizes during training.

BEST OF BOTH WORLDS: SR-LV fine-tuning with BagCurri.
We initialize the captioner with HolisticCap and fine-tune both the language and visual components (SR-LV) with BagCurri. This results in the most visually fine-grained model while also improving CIDEr over the MLE-trained captioner. Notably, we find that the rich initialization provided by HolisticCap and BagCurri is solely responsible for preserving caption faithfulness: CIDEr decreases for COCO (row 4) or when bags are used without our curriculum.



CIDEr Optimization meets SR Fine-Tuning

  • CIDEr opt. (row 8) outperforms the MLE-trained model (row 6) on TrueMatch only when initialized with HolisticCap.
  • Even CIDEr opt. (row 5) is unable to preserve caption faithfulness for COCO, underscoring the importance of initialization during SR fine-tuning.
  • Joint optimization with HolisticCap (row 10) results in the most discriminant model while significantly improving CIDEr over MLE pretraining (row 6).

table1
Table 5: Impact of combining SR with CIDEr optimization. C: CIDEr,
SR: Self-Retrieval, and BC: SR with BagCurri.
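Joint optimization can be viewed as mixing the two per-sample rewards inside the same REINFORCE step; the mixing weight below is an illustrative placeholder, not the value used in the paper.

def joint_reward(sr_reward, cider_scores, alpha=0.5):
    # sr_reward, cider_scores: per-sample reward tensors of shape (B,).
    # alpha is an assumed mixing weight for illustration only.
    return alpha * sr_reward + (1.0 - alpha) * cider_scores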

Generating Diverse Captions

Using our training recipe with Visual Caption Boosting guides the captioner away from its language modeling priors, improving caption diversity over COCO MLE training by 304%.

table1
Table 6: Number of words with frequency >= 5 on the COCO test set.
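The diversity number in Table 6 can be reproduced in spirit with a simple word-frequency count over the generated captions; the whitespace tokenization below is our assumption, not necessarily the paper's exact preprocessing.

from collections import Counter

def vocab_with_min_freq(captions, min_freq=5):
    # Count distinct words that appear at least `min_freq` times across all captions.
    counts = Counter(word for cap in captions for word in cap.lower().split())
    return sum(1 for count in counts.values() if count >= min_freq)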

BibTeX

@misc{ndlb2024,
  title={{No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning}},
  author={Manu Gaur and Darshan Singh S and Makarand Tapaswi},
  year={2024},
  eprint={2409.03025},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Bibliography

  1. Cross-Domain Image Captioning with Discriminative Finetuning.
    Roberto Dessì, Michele Bevilacqua, Eleonora Gualdoni, Nathanael Carraz Rakotonirina, Francesca Franzon, Marco Baroni. 2023. CVPR
  2. ClipCap: CLIP Prefix for Image Captioning.
    Ron Mokady, Amir Hertz, Amit H. Bermano. 2021. arXiv:2111.09734

Acknowledgements

This project page is built upon MMVP.
