No Detail Left Behind:
Revisiting Self-Retrieval for Fine-Grained Image Captioning

Transactions on Machine Learning Research, 2024

Teaser Image
This image was generated using Flux.1 [schnell].

Image captioning systems are unable to generate fine-grained captions, often producing generic descriptions that fail to discriminate between visually similar images. We use self-retrieval to improve both training and evaluation of captioning systems.


Figure 1: Self-retrieval judges the ability of a captioner to retrieve an image using its generated caption against a bag of distractor images. Image source.

Our work is built upon three key components, each focused on making captioning systems more fine-grained:

  1. Evaluation: Traditional image caption evaluation is neither fine-grained nor does it reward diversity. Hence, we present TrueMatch, a benchmark that uses self-retrieval to assess the captioner's ability to capture subtle visual distinctions.
  2. Data: Training on noisy (alt-text) or generic (human-annotated) datasets bottlenecks fine-grained caption generation. To this end, we present Visual Caption Boosting to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations.
  3. Self-Retrieval Training: MLE training encourages generation of statistically probable phrases. We design a training recipe to address this by instilling discriminant visual information in the captioner.

We find that captioning systems, irrespective of their size, struggle to capture fine-grained visual details, leading to poor performance on TrueMatch. However, using our "plug and play" training recipe, we are able to outperform state-of-the-art open-source MLLMs while having 1-2 orders of magnitude fewer parameters.


I. TrueMatch: Fine-grained Evaluation of Captioning Systems

motorcycles
Figure 2: An example from TrueMatch. The MLE-trained captioner (COCO MLE) produces the same caption for every image in the bag. Vanilla self-retrieval training (COCO SR) hallucinates details such as “two people” (middle). Our proposed method (OURS SR) generates discriminant captions.

TrueMatch comprises bags of highly similar images with varying sizes, where uniquely describing each image within a bag requires capturing various facets of visual understanding. It offers a comprehensive framework for using self-retrieval to holistically evaluate captioning systems. For example, retrieving the target image from Figure 2 requires the caption to incorporate information about attributes (red bike) or orientation (inverted body).
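To make the evaluation concrete, below is a minimal PyTorch sketch of bag-level self-retrieval Recall@1. It assumes image and caption embeddings from a CLIP-style dual encoder; the function name and shapes are illustrative and not the official TrueMatch code.

import torch

def self_retrieval_recall_at_1(image_embs: torch.Tensor, caption_embs: torch.Tensor) -> float:
    # image_embs: (B, D) embeddings of the images in one bag.
    # caption_embs: (B, D) embeddings of the captions generated for those images.
    image_embs = torch.nn.functional.normalize(image_embs, dim=-1)
    caption_embs = torch.nn.functional.normalize(caption_embs, dim=-1)
    sims = caption_embs @ image_embs.T            # (B, B) caption-to-image cosine similarity
    predicted = sims.argmax(dim=-1)               # image retrieved by each caption
    targets = torch.arange(image_embs.size(0), device=sims.device)
    return (predicted == targets).float().mean().item()

The benchmark score is then the average of this quantity over all bags of a given size (e.g., the bags of size 3, 5, and 7 reported in Table 1).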


Benchmarking Captioning Systems on TrueMatch

Table 1 evaluates several open-source captioning approaches, MLLMs, and SR trained models on TrueMatch.

  • Irrespective of their size, captioners struggle to capture fine-grained visual details, leading to poor performance on TrueMatch.
  • Our SR training recipe outperforms vanilla SR (DiscriTune) by +14.4% to +19.5%, demonstrating its effectiveness.
  • Cambrian-1 is the best-performing open-source MLLM. Our proposed approach significantly outperforms it despite being 30x smaller.

table1
Table 1: Recall@1 for self-retrieval evaluation on TrueMatch. TrueMatch contains 254 bags of size 3 (#3), 104 bags of size 5 (#5), and 93 bags of size 7 (#7).

II. Visual Caption Boosting

Existing image captioning datasets like MSCOCO comprise generic annotations. While recent works address this by synthetically expanding the visual information within the captions, they are prone to inheriting biases present in foundation models. Interestingly, although the individual COCO captions are sparse, we find that they describe complementary facets of the image (see Figure 3).

Visual Caption Boosting (VCB) is a novel two-stage framework that leverages foundation models to generate rich descriptions while remaining anchored in human data:

table1
Figure 3: VCB Framework. Generic COCO captions are visually boosted using complementary details present in human annotations. VCB ignores specific details from the visual caption (“river or lake”, “drinking”), as they conflict with the annotations (“watering hole”, “grazing and standing”).

Stage 1: BlendCap uses an LLM to create a blended caption that combines the diverse perspectives offered by different human annotations. Notably, we prompt the LLM to minimize redundant information, resulting in short descriptions.

Stage 2: HolisticCap expands BlendCap by using an LLM to instill fine-grained details from the Visual Caption (InstructBLIP). The LLM is prompted to prefer human-grounded BlendCap in case of conflicting visual information.

Anchoring semantic visual information in human annotations reduces verbose tendencies of MLLMs, producing rich and succinct captions.
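The sketch below only illustrates the two-stage structure of VCB; the prompt wording is our own paraphrase and llm(prompt) stands in for any instruction-tuned LLM, so none of it should be read as the exact prompts used in the paper.

def blend_cap(llm, human_captions):
    # Stage 1 (BlendCap): merge the complementary human annotations into one short caption.
    prompt = (
        "Combine the following captions of the same image into a single caption. "
        "Keep every distinct visual detail, remove redundancy, and stay concise:\n"
        + "\n".join(f"- {c}" for c in human_captions)
    )
    return llm(prompt)

def holistic_cap(llm, blended_caption, visual_caption):
    # Stage 2 (HolisticCap): enrich the blended caption with fine-grained details from
    # a visual caption (e.g., from InstructBLIP), preferring the human-grounded text
    # whenever the two sources conflict.
    prompt = (
        "Enrich the blended caption with fine-grained details from the visual caption. "
        "If the two conflict, trust the blended caption.\n"
        f"Blended caption: {blended_caption}\n"
        f"Visual caption: {visual_caption}"
    )
    return llm(prompt)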

Benchmarking Visually Boosted Captions on TrueMatch

  • BlendCap significantly outperforms individual COCO captions on RD100 and CLIPScore, confirming that human annotations capture complementary visual aspects of the same image.
  • HolisticCap (unlike BlendCap) yields substantial gains over COCO across bags of TrueMatch.

table1
Table 2: R@1 scores for COCO and VCB captions evaluated on RD100 (100 random distractors) and TrueMatch.

III. Guiding captioners away from their language modeling priors

We adopt ClipCap (Mokady et al., 2021), a lightweight (200M parameter) simplification of modern MLLMs. Training the captioning system involves two stages:

  1. Maximum Likelihood Estimation (MLE) pretraining.
  2. REINFORCE fine-tuning by maximizing the self-retrieval (SR) reward.

table1
Figure 4: ClipCap connects a CLIP visual encoder to GPT-2 through an MLP adapter.
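To tie the architecture and the second training stage together, here is a rough PyTorch sketch of a ClipCap-style prefix adapter and a single REINFORCE update with the self-retrieval reward (whether the sampled caption retrieves its own image within the minibatch). The dimensions, the baseline, and the helpers sample_captions, caption_log_probs, and encode_texts are placeholders for the frozen CLIP/GPT-2 machinery, not the authors' implementation.

import torch
import torch.nn as nn

class PrefixAdapter(nn.Module):
    # Maps a CLIP image embedding to `prefix_len` soft tokens fed to GPT-2.
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len, gpt_dim * prefix_len),
        )

    def forward(self, clip_emb):                               # (B, clip_dim)
        prefix = self.mlp(clip_emb)                            # (B, gpt_dim * prefix_len)
        return prefix.view(clip_emb.size(0), self.prefix_len, -1)

def sr_reinforce_step(adapter, gpt2, clip_image_embs, optimizer,
                      sample_captions, caption_log_probs, encode_texts):
    # One REINFORCE update with a batch-level self-retrieval reward.
    # clip_image_embs: (B, D) frozen CLIP image embeddings for the minibatch (or bag).
    prefix = adapter(clip_image_embs)                          # soft prompt per image
    captions = sample_captions(gpt2, prefix)                   # B sampled captions (strings)
    log_probs = caption_log_probs(gpt2, prefix, captions)      # (B,) sum of token log-probs

    with torch.no_grad():
        text_embs = torch.nn.functional.normalize(encode_texts(captions), dim=-1)
        image_embs = torch.nn.functional.normalize(clip_image_embs, dim=-1)
        sims = text_embs @ image_embs.T                        # (B, B) caption-to-image similarities
        correct = sims.argmax(dim=-1) == torch.arange(len(captions), device=sims.device)
        reward = correct.float() - correct.float().mean()      # centred retrieval reward

    loss = -(reward * log_probs).mean()                        # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()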


Fine-tuning the Language Model with Self-Retrieval (SR-L): A Deep Dive

MLE encourages the generation of generic descriptions. Retrieval performance of the MLE-trained captioner drops substantially for all datasets (rows 1-3) compared to their ground-truth captions (see Table 2).

Self-retrieval fine-tuning benefits from a rich MLE initialization. SR-L fine-tuning with HolisticCap significantly outperforms COCO (Dessì et al., 2023) on TrueMatch.

table1
Table 3: Results across different training approaches and captions show the effectiveness of the rich initialization provided by HolisticCap and SR-L fine-tuning.

Self-retrieval unlocks latent semantic information when fine-tuning the LLM. The model trained on COCO, due to the MLE objective, generates sparse captions that resemble the independent annotations of COCO. This leads to a large gap between COCO and BlendCap on RD100 (rows 1, 2), despite both having similar semantic information. Interestingly, SR-L fine-tuning narrows this gap dramatically (rows 4, 5).


Fine-tuning the Visual Encoder with Self-Retrieval (SR-V)

While fine-tuning CLIP with SR yields superior retrieval performance, it makes the captioner less faithful to the ground-truth captions.



Retrieval Performance vs. Caption Faithfulness: The Trade-off Plaguing Current SR Approaches

  • We observe that captioning systems fine-tuned with vanilla SR-L (Dessì et al. 2023) become less faithful to the ground-truth captions upon extended training.
  • As seen in Figure 5, retrieval performance continually improves while CIDEr dips significantly.
  • We also find that captioners fine-tuned with SR, in a bid to enhance retrieval performance, tend to hallucinate details.
  • Furthermore, SR-V fine-tuning worsens SR's tendency to hallucinate and degrades attribute binding in captioners.

table1
Figure 5: R@1 continually increases while CIDEr degrades when fine-tuning ClipCap with SR-L on COCO.


BagCurri: Curriculum Training with Bags of Hard Negatives

For a stronger learning signal, we fine-tune with bags of highly similar images within a minibatch instead of retrieving against 99 random distractors (vanilla SR). We also propose a curriculum over bag sizes (see Figure 6) to better leverage the contrastive SR reward.
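As a minimal sketch, the curriculum can be read as a schedule that grows the bag size as training progresses; the epoch thresholds and bag sizes below are illustrative assumptions, not the paper's exact settings.

def bag_size_for_epoch(epoch, schedule=((0, 2), (2, 3), (4, 5), (6, 7))):
    # Return the bag size to train with at `epoch`, given (start_epoch, bag_size) pairs.
    size = schedule[0][1]
    for start, bag_size in schedule:
        if epoch >= start:
            size = bag_size
    return size

Each minibatch is then assembled from bags of bag_size_for_epoch(epoch) visually similar images, so the SR reward contrasts against hard negatives rather than 99 random distractors.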

Table 4 reports SR-L fine-tuning with BagCurri (row 6) and SR-V fine-tuning with BagCurri (row 7).
table1
Table 4 (Left): Impact of fine-tuning different components with BagCurri, compared against vanilla SR (SR-L). Figure 6 (Right): Our curriculum progressively increases bag sizes during training.

BEST OF BOTH WORLDS: SR-LV fine-tuning with BagCurri.
We initialize the captioner with HolisticCap and fine-tune both the language and visual components (SR-LV) with BagCurri. This results in the most visually fine-grained model while also improving CIDEr over the MLE-trained captioner. Notably, we find that the rich initialization provided by HolisticCap and BagCurri is solely responsible for preserving caption faithfulness: CIDEr decreases for COCO (row 4) or when bags are used without our curriculum.



CIDEr Optimization meets SR Fine-Tuning

  • CIDEr opt. (row 8) outperforms the MLE-trained model (row 6) on TrueMatch only when initialized with HolisticCap.
  • Even CIDEr opt. (row 5) is unable to preserve caption faithfulness for COCO, underscoring the importance of initialization during SR fine-tuning.
  • Joint optimization with HolisticCap (row 10) results in the most discriminant model while significantly improving CIDEr over MLE pretraining (row 6).

table1
Table 5: Impact of combining SR with CIDEr optimization. C: CIDEr,
SR: Self-Retrieval, and BC: SR with BagCurri.
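Joint optimization can be viewed as mixing the two per-sample rewards inside the same REINFORCE step; the mixing weight below is an illustrative placeholder, not the value used in the paper.

def joint_reward(sr_reward, cider_scores, alpha=0.5):
    # sr_reward, cider_scores: per-sample reward tensors of shape (B,).
    # alpha is an assumed mixing weight for illustration only.
    return alpha * sr_reward + (1.0 - alpha) * cider_scores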

Generating Diverse Captions

Using our training recipe with Visual Caption Boosting guides the captioner away from its language modeling priors, improving caption diversity over COCO MLE training by 304%.

table1
Table 6: Number of words with frequency >= 5 on the COCO test set.
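The diversity number in Table 6 can be reproduced in spirit with a simple word-frequency count over the generated captions; the whitespace tokenization below is our assumption, not necessarily the paper's exact preprocessing.

from collections import Counter

def vocab_with_min_freq(captions, min_freq=5):
    # Count distinct words that appear at least `min_freq` times across all captions.
    counts = Counter(word for cap in captions for word in cap.lower().split())
    return sum(1 for count in counts.values() if count >= min_freq)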

BibTeX

@misc{ndlb2024,
  title={{No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning}},
  author={Manu Gaur and Darshan Singh S and Makarand Tapaswi},
  year={2024},
  eprint={2409.03025},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Bibliography

  1. Cross-Domain Image Captioning with Discriminative Finetuning.
    Roberto Dessì, Michele Bevilacqua, Eleonora Gualdoni, Nathanael Carraz Rakotonirina, Francesca Franzon, Marco Baroni. 2023. CVPR
  2. ClipCap: CLIP Prefix for Image Captioning.
    Ron Mokady, Amir Hertz, Amit H. Bermano. 2021. arXiv:2111.09734

Acknowledgements

This project page is built upon MMVP.
