MICap: A Unified Model for Identity-aware Movie Descriptions

Katha-AI Lab, IIIT-Hyderabad
CVPR 2024

[Teaser figure]

Movie-Identity Captioner (MICap) is a single-stage approach for identity-aware captioning.

Abstract

Characters are an important aspect of any storyline, and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with someone (anonymized names), recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is to predict person id labels. However, to predict captions with ids, a two-stage approach is required: first predict captions with someone, then fill in identities. In this work, we present a new single-stage approach that can seamlessly switch between id-aware caption generation and FITB when given a caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with both the FITB and full-caption generation objectives, while the encoder can either use or disregard captions with blanks as input. Another challenge with id-aware captioning is the lack of a metric that captures subtle differences between person ids. To this end, we introduce iSPICE, a caption evaluation metric that focuses on identity tuples created through intermediate scene graphs. We evaluate MICap on the Large-Scale Movie Description Challenge (LSMDC), where we show a 4.2% improvement in FITB accuracy and a 1-2% improvement in classic captioning metrics.

Main contributions:
  1. New Paradigm for Identity-Aware Multi-Sentence Movie Description: We propose a novel single-stage approach that unifies fill-in-the-blanks (FITB) with full caption generation for identity-aware multi-sentence movie descriptions, replacing the previous two-stage pipeline and improving performance.
  2. Auto-Regressive Sequence-to-Sequence Generation: We formulate the task as auto-regressive sequence-to-sequence generation, describing the video content while assigning local person id labels consistently across a set of videos. Joint training on both objectives facilitates knowledge sharing and boosts overall performance.
  3. Seamless Task Switching: The same model switches between tasks, allowing independent evaluation of two key abilities: (a) generating captions with identities, and (b) filling in identity labels given a caption with blanks. This enables a more comprehensive evaluation of the model's capabilities.
  4. Identity-Aware Captioning Metric (iSPICE): We propose iSPICE, an extension of SPICE that is specifically sensitive to identities, providing a more faithful assessment of identity-aware captions.
  5. State-of-the-Art Performance: MICap improves over the previous state of the art by 4.2% on FITB accuracy, and by 1.4% CIDEr and 1.8% METEOR on identity-aware captioning.

Method Overview: Movie-Identity Captioner

[Architecture figure: identity-aware captioning]

MICap consists of two parts:

  1. Feature extractors and a Transformer encoder that build the captioning memory (left); and
  2. A Transformer decoder that switches between fill-in-the-blanks (FITB) and full caption-set generation (right).
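
To make this two-part design concrete, here is a minimal PyTorch sketch. It is not the authors' implementation: the class name MICapSketch, the feature dimension, the module sizes, and the way the blanked caption conditions the encoder are all illustrative assumptions.

import torch
import torch.nn as nn

class MICapSketch(nn.Module):
    """Sketch of an encoder-decoder that switches between FITB and full
    caption generation. All dimensions are illustrative assumptions."""

    def __init__(self, vocab_size=10000, d_model=512, n_heads=8,
                 n_layers=4, feat_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.visual_proj = nn.Linear(feat_dim, d_model)  # assumed video-feature dim
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, tgt_tokens, blanked_caption=None):
        # Encoder builds the captioning memory from video features; in FITB
        # mode it additionally consumes the caption with blanks.
        mem_in = self.visual_proj(video_feats)
        if blanked_caption is not None:
            mem_in = torch.cat([mem_in, self.embed(blanked_caption)], dim=1)
        memory = self.encoder(mem_in)
        # The shared auto-regressive decoder generates either the full
        # caption set or the identity labels that fill the blanks.
        tgt = self.embed(tgt_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)

model = MICapSketch()
video = torch.randn(2, 16, 1024)                 # 2 clips, 16 feature tokens each
caption = torch.randint(0, 10000, (2, 24))
logits = model(video, caption)                   # full caption-set generation
blanked = torch.randint(0, 10000, (2, 24))
ids = torch.randint(0, 10000, (2, 6))
logits = model(video, ids, blanked_caption=blanked)  # FITB mode

Switching tasks only changes what the encoder sees and what the decoder is asked to generate; the weights are shared, which is what allows joint training on FITB and full captioning to benefit both objectives.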

Identity-aware SPICE - iSPICE

Inspired by Semantic Propositional Image Caption Evaluation (SPICE), a metric used in image captioning evaluation, we propose a new metric, identity-aware SPICE (iSPICE for short), to evaluate the quality of video descriptions, particularly with respect to identity labels.


How is iSPICE calculated?

SPICE estimates the quality of a caption in two stages. First, the reference and predicted captions are converted to scene graphs that explicitly encode objects, attributes, and relationships. This abstraction yields lists of tuples, T_r and T_p, for the reference and predicted captions. Second, SPICE is the F1-score measuring the logical conjunction (overlap) between the two lists:

SPICE = F1(T_r, T_p).
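
As a concrete illustration, the following Python sketch computes this tuple-overlap F1 with exact set matching. The real SPICE implementation parses captions into scene graphs and matches tuples with synonym handling, both of which are omitted here.

def tuple_f1(ref_tuples, pred_tuples):
    """F1-score over two collections of scene-graph tuples (exact matching)."""
    ref, pred = set(ref_tuples), set(pred_tuples)
    if not ref or not pred:
        return 0.0
    overlap = len(ref & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: objects, (object, attribute) pairs, and relation triples.
ref  = [("P1",), ("P1", "walks"), ("door", "red")]
pred = [("P1",), ("P1", "runs"), ("door", "red")]
print(tuple_f1(ref, pred))  # 2 of 3 tuples match on each side -> F1 ≈ 0.667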

iSPICE is a simple modification of SPICE. We intervene at the level of the tuple lists, filtering out tuples that do not contain at least one character identity. We define

iSPICE = F1(T_r^{p2+}, T_p^{p2+}) · F1(T_r^{p1}, T_p^{p1}),

where T_r^{p2+} denotes the list of reference tuples that contain a person-id label and have two or more elements, and T_r^{p1} is the set of person-id labels in the reference caption set (T_p^{p2+} and T_p^{p1} are defined analogously for the prediction).

  • The first term scores whether the correct person-id label is used together with a verb, attribute, or relation.
  • The second term checks that the set of person-id labels in the prediction matches the reference (see the sketch following this list).
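
The combination can be sketched in a few lines of Python, reusing tuple_f1 from the sketch above. Treating person ids as strings like "P1" is an assumption made for illustration; the actual metric identifies person-id labels in the scene-graph parse.

def is_person(token):
    # Assumption: person-id labels are rendered as "P1", "P2", ...
    return isinstance(token, str) and token.startswith("P")

def ispice(ref_tuples, pred_tuples):
    # Term 1: tuples with >= 2 elements containing a person id,
    # i.e. a person id paired with a verb, attribute, or relation.
    def p2_plus(ts):
        return [t for t in ts if len(t) >= 2 and any(is_person(x) for x in t)]
    # Term 2: the bare person-id labels (singleton tuples).
    def p1(ts):
        return [t for t in ts if len(t) == 1 and is_person(t[0])]
    return (tuple_f1(p2_plus(ref_tuples), p2_plus(pred_tuples))
            * tuple_f1(p1(ref_tuples), p1(pred_tuples)))

ref  = [("P1",), ("P2",), ("P1", "opens", "door")]
pred = [("P1",), ("P1", "opens", "door")]
print(ispice(ref, pred))  # term 1 = 1.0, term 2 = 2/3 (P2 is missing) -> ≈ 0.667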


BibTeX

@inproceedings{raajesh2024micap,
  title={MICap: A Unified Model for Identity-aware Movie Descriptions},
  author={Raajesh, Haran and Desanur, Naveen Reddy and Khan, Zeeshan and Tapaswi, Makarand},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={14011--14021},
  year={2024}
}