Recap of Season 8 Episode 22 (S08E22) shown at the beginning of Episode 23 (S08E23) of the TV series 24.

The Story: In a tense series of events, Jack Bauer pursues justice at the cost of a peace treaty, seeking to expose the conspirators behind recent events. He enlists journalist Meredith Reed to publish evidence implicating the Russians, but President Allison Taylor intervenes, ordering Meredith's arrest and seizing the incriminating documents. At CTU NY, Chloe O'Brian informs Cole Ortiz about Jack's ally, Jim Ricker, as they scramble to support Jack's mission. Jack, using drastic measures, forces former President Charles Logan to confess details implicating Russian President Yuri Suvarov in the conspiracy, including the order to kill Renee Walker. As the stakes escalate and alliances fracture, Jack's relentless pursuit of truth and justice jeopardizes international relations and personal safety alike.

Abstract


We introduce multimodal story summarization by leveraging TV episode recaps — short video sequences interweaving key story moments from previous episodes to bring viewers up to speed. We propose PlotSnap, a dataset featuring two crime-thriller TV shows with rich recaps and long episodes of about 40 minutes. Story summarization labels are unlocked by matching recap shots to corresponding substories in the episode. We propose TaleSumm, a hierarchical model that processes entire episodes by creating compact shot and dialog representations, and predicts importance scores for each video shot and dialog utterance by enabling interactions between local story groups. Unlike traditional summarization, our method extracts multiple plot points from long videos. We present a thorough evaluation on story summarization, including promising cross-series generalization. TaleSumm also shows good results on classic video summarization benchmarks.
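
To make the labeling idea concrete, below is a minimal PyTorch sketch of how recap-to-episode matching could yield shot-level importance labels; the feature source, the cosine-similarity matching, and the threshold are illustrative assumptions, not the paper's exact pipeline.

import torch
import torch.nn.functional as F

def label_shots_by_recap(episode_feats: torch.Tensor,
                         recap_feats: torch.Tensor,
                         threshold: float = 0.8) -> torch.Tensor:
    """Assumed sketch: episode_feats (N, D) and recap_feats (M, D) hold
    per-shot embeddings; an episode shot is labeled important (1) if it
    closely matches any recap shot."""
    ep = F.normalize(episode_feats, dim=-1)   # unit-normalize features
    rc = F.normalize(recap_feats, dim=-1)
    sim = ep @ rc.T                           # (N, M) cosine similarities
    best = sim.max(dim=1).values              # best recap match per episode shot
    return (best >= threshold).long()         # 1 = recap-matched, i.e., important

Labels obtained this way would then serve as supervision for an importance classifier such as the one TaleSumm trains.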

TaleSumm


(A) TaleSumm ingests all video shots and dialogs of the episode and encodes them using (B) and (C). Based on temporal order, we combine tokens into local story groups (illustration shows small groups of 2 shots and 0-2 utterances). To each group, we append a group token and add multiple embeddings, before feeding them to the episode-level Transformer $\mathsf{ET}$. For each shot or dialog token, a linear classifier predicts its importance. (B) Video shot encoder. For each frame, representations from multiple backbones are fused using attention ($\boxplus$). We feed these to a shot Transformer encoder $\mathsf{ST}$, and tap a shot-level representation from the $\mathsf{CLS}$ token. (C) Utterance encoder uses a fine-tuned language model and avg-pooling across all words of the utterance. (D) Self-attention mask illustrates the block-diagonal self-attention structure across the episode. Group tokens across the episode (purple squares) communicate with each other. (E) Multiple embeddings are added to the tokens to capture modality type, time, and membership to a local story group.
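
As a concrete illustration of the self-attention mask in (D), the following PyTorch sketch builds a boolean mask in which tokens attend only within their local story group, while group tokens additionally attend to every other group token. The token layout (group token placed first within each block) is an assumption for illustration, not the released implementation.

import torch

def build_group_attention_mask(group_sizes: list[int]) -> torch.Tensor:
    """group_sizes[i] = number of shot/dialog tokens in group i.
    Each group is laid out as [group token, member tokens...].
    Returns a boolean mask where True means attention is allowed."""
    total = sum(g + 1 for g in group_sizes)   # one extra group token per group
    mask = torch.zeros(total, total, dtype=torch.bool)
    group_positions, start = [], 0
    for g in group_sizes:
        end = start + g + 1
        mask[start:end, start:end] = True     # block-diagonal: local story group
        group_positions.append(start)         # group token sits first in its block
        start = end
    gp = torch.tensor(group_positions)
    mask[gp.unsqueeze(1), gp] = True          # group tokens attend to each other
    return mask

# e.g., three local story groups of 2 shots plus 1-2 utterances each
mask = build_group_attention_mask([3, 4, 3])

Such a mask can be supplied to torch.nn.MultiheadAttention as attn_mask after inversion (~mask), since PyTorch treats True entries of a boolean attn_mask as positions that are not allowed to attend.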

Qualitative Analysis

Video Presentation


Poster


Acknowledgement


We thank the Bank of Baroda for partial travel support, and IIIT-H's faculty seed grant and Adobe Research India for funding. Special thanks to Varun Gupta for assisting with the experiments, Hardik Mittal for his help with this project page, and the Katha-AI group members for the user studies.

BibTeX

@inproceedings{singh2024previously,
  title     = {{"Previously on ..." From Recaps to Story Summarization}},
  author    = {Aditya Kumar Singh and Dhruv Srivastava and Makarand Tapaswi},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024},
}