If you find our work useful, please cite us. Citation information can be found below.

Abstract

Understanding what makes a video memorable has important applications in advertising and education technology. Towards this goal, we investigate the spatio-temporal attention mechanisms underlying video memorability. Unlike previous works that fuse multiple features, we adopt a simple CNN+Transformer architecture that enables analysis of spatio-temporal attention while matching state-of-the-art (SoTA) performance on video memorability prediction. We compare model attention against human gaze fixations collected through a small-scale eye-tracking study in which participants perform the video memory task. We uncover the following insights:


  1. Quantitative saliency metrics show that our model, trained only to predict a memorability score, exhibits similar spatial attention patterns to human gaze, especially for more memorable videos.
  2. The model assigns greater importance to initial frames in a video, mimicking human attention patterns.
  3. Panoptic segmentation reveals that both the model and humans assign a greater share of attention to things and less attention to stuff, relative to their occurrence probability.

Introduction

Context-Dependent Memorable Video

General Memorable Video

Consider Nike’s Dream Crazy ad, which gained exceptional attention when it was aired following Kaepernick’s protests against race-based police brutality. In contrast, the E-Trade commercial—featuring a baby talking in an adult voice while delivering investment advice—possesses qualities that make it broadly memorable, independent of cultural context. Inspired by these examples, we investigate the properties that render a video memorable.


Our study is driven by several key questions:



Spatial and Semantic Attention

What are the spatial and semantic patterns of attention that contribute to video memorability?


Human-Like Attentional Cues

Can a model trained solely to predict memorability capture human-like attentional cues?


Temporal Patterns of Attention

How does the model’s attention to different video segments, such as the initial frames, relate to the inherent memorability of the video?

To address these questions, we train a CNN+Transformer model that predicts human memorability of naturalistic videos by leveraging self-attention scores to pinpoint where the model focuses in space and time. We then compare these attention patterns with human eye-tracking data to extract insights about the spatial, temporal, and semantic features that underpin memorability. Unlike prior work that focuses solely on prediction performance, our approach delves into the underlying attention mechanisms, allowing us to analyze both semantic and temporal factors influencing video memorability.


Main Contributions:



Efficient Model Architecture

We adopt a simple CNN+Transformer model to predict video memorability. Despite its simplicity, the model enables an in-depth study of spatio-temporal attention mechanisms and achieves state-of-the-art performance.


Human Attention Comparison

We collect eye-tracking data from participants engaged in a video memorability experiment, which allows us to directly compare model attention with human gaze patterns.


Semantic Analysis via Panoptic Segmentation

Using panoptic segmentation and attention-weighted analyses, we show that both the model and humans allocate attention similarly across various semantic categories.


Temporal Attention Patterns

We demonstrate that our model, without any intrinsic temporal bias, learns to focus on the initial frames of a video with a decreasing attention pattern over time—mirroring human gaze agreement patterns.

In summary, our work bridges the gap between high-performance video memorability prediction and a comprehensive understanding of the underlying attention mechanisms. By combining a simple CNN+Transformer model with human eye-tracking data and advanced semantic analyses, we offer novel insights into the factors that make a video memorable. This study not only advances the state-of-the-art in video memorability but also lays the foundation for future research aimed at unraveling the cognitive and perceptual processes behind visual memory.


Please note that we use two datasets for all our analyses:

  1. Memento10k
  2. VideoMem

Details of the datasets and the train, validation, and test splits can be found on their respective websites.

Methodology

Our approach utilizes a CNN+Transformer architecture to extract spatio-temporal features from videos. We then compare the resulting model attention maps against human gaze data acquired through eye-tracking experiments. This dual analysis allows us to understand how both systems prioritize visual information.


CNN+Transformer Model for Video Memorability Prediction


Our approach employs a CNN+Transformer model to predict video memorability by capturing spatio-temporal attention. The model is composed of three key components:


  1. Backbone Image Encoder: A CNN (e.g., ResNet-50 with CLIP pretraining) processes each video frame to extract spatial features with dimensions H×W×D, where H×W represents the spatial resolution and D is the feature dimension.
  2. Transformer Encoder: The extracted features are flattened and projected via a linear layer, after which temporal and spatial position embeddings are added. A learnable CLS token is prepended to the sequence, and the resulting tokens are fed to a Transformer encoder that captures the spatio-temporal relationships using self-attention.
  3. Prediction Head: The contextualized representation of the CLS token from the Transformer encoder is passed to a Multi-Layer Perceptron (MLP) to predict the memorability score. Simultaneously, the self-attention scores provide a mechanism to analyze where the model directs its focus across both space and time.
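To make these three components concrete, here is a minimal PyTorch sketch of the architecture. The torchvision ResNet-50 stand-in, the layer sizes, the token counts, and the class name MemorabilityModel are illustrative assumptions; the actual model uses a CLIP-pretrained backbone and the hyperparameters reported in the paper.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50

    class MemorabilityModel(nn.Module):
        """Sketch of the CNN+Transformer memorability predictor.

        Sizes are illustrative; the paper's model uses a CLIP-pretrained encoder.
        """
        def __init__(self, d_model=512, n_heads=8, n_layers=4, n_frames=5, n_patches=49):
            super().__init__()
            cnn = resnet50(weights=None)                     # stand-in for the CLIP-pretrained backbone
            self.backbone = nn.Sequential(*list(cnn.children())[:-2])  # keep the 7x7 feature map
            self.proj = nn.Linear(2048, d_model)             # project feature dim D -> d_model
            self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
            self.temporal_pos = nn.Parameter(torch.zeros(1, n_frames, 1, d_model))
            self.spatial_pos = nn.Parameter(torch.zeros(1, 1, n_patches, d_model))
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 1))

        def forward(self, frames):                           # frames: (B, T, 3, 224, 224)
            B, T = frames.shape[:2]
            feats = self.backbone(frames.flatten(0, 1))      # (B*T, 2048, 7, 7)
            feats = feats.flatten(2).transpose(1, 2)         # (B*T, 49, 2048)
            tokens = self.proj(feats).view(B, T, feats.size(1), -1)
            tokens = tokens + self.temporal_pos[:, :T] + self.spatial_pos
            tokens = tokens.flatten(1, 2)                    # (B, T*49, d_model)
            cls = self.cls_token.expand(B, -1, -1)
            out = self.encoder(torch.cat([cls, tokens], dim=1))
            return self.head(out[:, 0]).squeeze(-1)          # one memorability score per video

The Transformer's self-attention weights, in particular the CLS token's attention over the patch tokens, are what we later visualize as spatio-temporal attention maps.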

Model overview: T video frames are passed through an image backbone encoder to obtain spatio-temporal features. Coupled with position embeddings and a CLS token, these features are processed by a Transformer encoder. The CLS token’s output is used to predict the memorability score via an MLP, and the resulting self-attention scores are used for further spatio-temporal analysis.

Image Encoder: Each frame f_i is encoded using the backbone network, yielding features that capture detailed spatial information. This enables the subsequent analysis of where the model focuses its attention in the frame.

Transformer Encoder: After linear projection of the image features, both temporal and spatial position embeddings are added to each token. The addition of a learnable CLS token allows the Transformer to aggregate global contextual information. Through multi-head self-attention, the encoder learns to capture complex spatio-temporal dependencies, which are essential for predicting the memorability of the video.

Prediction Head: The CLS token’s final representation is passed through an MLP to compute the video’s memorability score. During inference, the self-attention weights—specifically those associated with the CLS token—are extracted to reveal the model’s focus across different spatial regions and time frames.

The model is trained using a mean squared error (MSE) loss between the predicted and ground-truth memorability scores. This setup not only yields state-of-the-art performance but also enables a detailed exploration of the spatio-temporal attention mechanisms underlying video memorability prediction.
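A minimal training step under the same assumptions is sketched below; the dummy batch stands in for a DataLoader over (video frames, memorability score) pairs, and the AdamW optimizer and learning rate are illustrative choices rather than the paper's exact recipe.

    import torch
    import torch.nn as nn

    model = MemorabilityModel()                          # sketch class from above
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    criterion = nn.MSELoss()

    # Dummy batch standing in for a DataLoader over (frames, memorability score) pairs.
    frames = torch.randn(2, 5, 3, 224, 224)              # (B, T, 3, 224, 224)
    scores = torch.rand(2)                                # ground-truth memorability in [0, 1]

    pred = model(frames)
    loss = criterion(pred, scores)                        # MSE between predicted and true scores
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()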


Eye-tracking Study: Capturing Gaze Patterns


We collected eye-tracking data while participants viewed videos in a memory experiment. The experimental setup (see the animation embedded on this page) follows protocols from previous video memorability studies [12, 33], ensuring that the recorded gaze patterns accurately reflect the cognitive and visual processes involved in viewing and remembering videos.


Data Collection: Our study involved 20 participants (9 female, 11 male; mean age 22.15 ± 0.52 years). We selected 140 unique videos each from the Memento10k and VideoMem datasets. Binocular gaze data were captured with an SR Research EyeLink 1000 Plus at a sampling rate of 500 Hz, and a 9-point calibration grid was used to ensure tracking accuracy. Saccades and fixations were identified using the manufacturer's algorithm.

A representative simulation of how videos were presented to participants. Tip: to interact, press the SPACEBAR when you think a video has been repeated.


Procedure: Participants watched a series of 200 videos comprising 140 unique clips, 20 target repeats (presented at a lag of 9–200 videos), and 40 vigilance repeats interspersed every 2–3 videos. All videos were displayed in their original aspect ratios on a white screen (1024×768 pixels), and calibration was repeated periodically to maintain data quality.


Data Processing: Gaze fixation coordinates for both eyes were extracted using the EyeLink Data Viewer software. These coordinates were converted into binary fixation maps corresponding to the original video dimensions. To simulate a visual angle of approximately 1 degree, a Gaussian blur was applied to the fixation maps, which were then resized to 224×224 pixels to ensure compatibility with the model’s attention maps.
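A rough sketch of this post-processing step follows; the helper name, the assumed fixation format (a list of (x, y) pixel coordinates), and the blur sigma standing in for roughly one degree of visual angle are all assumptions that depend on the exact screen geometry and viewing distance.

    import numpy as np
    import cv2

    def fixation_density_map(fixations, frame_size, sigma_px=35, out_size=(224, 224)):
        """Turn fixation coordinates into a blurred, resized density map in [0, 1].

        fixations  : iterable of (x, y) pixel coordinates from the eye tracker
        frame_size : (height, width) of the original video frame
        sigma_px   : Gaussian sigma approximating ~1 degree of visual angle (assumed)
        """
        h, w = frame_size
        fmap = np.zeros((h, w), dtype=np.float32)
        for x, y in fixations:                                   # binary fixation map
            if 0 <= int(y) < h and 0 <= int(x) < w:
                fmap[int(y), int(x)] = 1.0
        fmap = cv2.GaussianBlur(fmap, (0, 0), sigmaX=sigma_px)   # spread each fixation
        fmap = cv2.resize(fmap, out_size)                        # match the model's 224x224 maps
        if fmap.max() > 0:
            fmap /= fmap.max()                                   # normalize to [0, 1]
        return fmap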


This eye-tracking study establishes a human baseline for spatial and temporal attention, allowing us to directly compare human gaze patterns with the model’s predicted attention maps.

Video Memorability Prediction

We begin with model ablation studies on the Memento10k dataset. Table 1 shows the impact of various hyperparameter choices on the validation set. In rows R1–R6, we explore different configurations of spatial–temporal (ST) features, temporal embeddings (Fourier vs. learnable), and frame sampling methods. Rows R7 and R8 incorporate caption information (R7 uses the original captions, while R8 uses predicted captions). Note that RC (Spearman rank correlation) ↑ indicates higher is better, and MSE (mean squared error) ↓ indicates lower is better.

Table 1. Model Ablations on Memento10k (val)
Row  Configuration    Sampling  Caption    RC ↑   MSE ↓
R1   ST, Fourier      Random    –          0.706  0.0061
R2   T, Fourier       Random    –          0.687  0.0062
R3   ST, Learnable    Random    –          0.696  0.0059
R4   ST, Fourier, 1D  Random    –          0.703  0.0057
R5   ST, Fourier, 2D  Random    –          0.701  0.0056
R6   ST, Fourier      Middle    –          0.703  0.0066
R7   ST, Fourier      Random    Original   0.745  0.0050
R8   ST, Fourier      Random    Predicted  0.710  0.0056

Table 2 compares our best configurations against state-of-the-art methods on both Memento10k and VideoMem. Even with a single feature encoder (CLIP), our vision-only model achieves competitive performance.

Table 2. Comparison Against SoTA for Video Memorability
Method                 Caption  Dataset            RC ↑   MSE ↓
SemanticMemNet ECCV20  No       Memento10k (Test)  0.659
M3-S CVPR23            No       VideoMem (Test)    0.670  0.0062
Ours (R1, Tab. 1)      No       Memento10k (Test)  0.662  0.0065
Ours (R1, Tab. 1)      No       Memento10k (Val)   0.706  0.0061
SemanticMemNet ECCV20  Yes      Memento10k (Test)  0.663
Sharingan arXiv        Yes      VideoMem (Test)
Ours (R7, Tab. 1)      Yes      Memento10k (Test)  0.713  0.0050
Ours (R7, Tab. 1)      Yes      Memento10k (Val)   0.745  0.0050

Note: RC denotes Spearman rank correlation (higher is better) and MSE denotes mean squared error (lower is better).


Comparing Model Attention and Human Gaze

To compare the human gaze fixation density maps with the model-generated attention maps, we first normalize both to the range [0, 1]. We then compute several popular saliency evaluation metrics: AUC-Judd, Normalized Scanpath Saliency (NSS), Linear Correlation Coefficient (CC), and Kullback–Leibler Divergence (KLD). To establish a human baseline, we split the participants into two random groups and calculate the agreement between their fixation maps (H–H), averaged over 10 random split iterations. These human–human scores serve as a ceiling for our model–human (M–H) comparisons. In addition, chance-level performance is estimated by comparing gaze maps from shuffled videos (H–H Shuff.).
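For reference, minimal NumPy versions of three of these metrics, in their standard saliency-benchmark formulations, are sketched below; the small epsilon terms are added for numerical stability and AUC-Judd is omitted for brevity. This is a sketch, not the exact evaluation code used for the tables.

    import numpy as np

    def nss(saliency, fixation_map):
        """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels."""
        s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
        return float(s[fixation_map > 0].mean())

    def cc(saliency, fixation_density):
        """Linear Correlation Coefficient between two continuous maps."""
        a = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
        b = (fixation_density - fixation_density.mean()) / (fixation_density.std() + 1e-8)
        return float(np.mean(a * b))

    def kld(saliency, fixation_density, eps=1e-8):
        """KL divergence of the saliency map from the fixation density map."""
        p = fixation_density / (fixation_density.sum() + eps)    # target distribution
        q = saliency / (saliency.sum() + eps)                    # predicted distribution
        return float(np.sum(p * np.log(eps + p / (q + eps))))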

AUC-Judd and NSS vs. Memorability Bins

AUC-Judd and NSS scores plotted against ground-truth memorability bins for Memento10k (left) and VideoMem (right). Error bars denote SEM.

Table 3 below summarizes the agreement scores between model attention and human gaze maps (M–H), the split–half human agreement (H–H), and the shuffled human–human baseline (H–H Shuff.) for both datasets. In all cases, higher values for AUC-Judd, NSS, and CC are better, while lower values for KLD indicate superior alignment.

Table 3. Comparison of Gaze Fixation Maps vs. Model Attention
Metric            Dataset     M–H           H–H           H–H Shuff.
AUC-Judd ↑        Memento10k  0.89 ± 0.007  0.90 ± 0.001  0.70 ± 0.002
AUC-Judd ↑        VideoMem    0.89 ± 0.007  0.80 ± 0.002  0.55 ± 0.001
AUC-Percentile ↑  Memento10k  82.91 ± 1.65  -             -
AUC-Percentile ↑  VideoMem    88.88 ± 1.29  -             -
NSS ↑             Memento10k  1.95 ± 0.074  3.07 ± 0.024  0.84 ± 0.022
NSS ↑             VideoMem    2.00 ± 0.068  3.12 ± 0.023  0.23 ± 0.012
CC ↑              Memento10k  0.46 ± 0.014  0.49 ± 0.003  0.16 ± 0.003
CC ↑              VideoMem    0.27 ± 0.007  0.27 ± 0.018  0.03 ± 0.001
KLD ↓             Memento10k  1.48 ± 0.035  2.17 ± 0.023  4.61 ± 0.022
KLD ↓             VideoMem    2.65 ± 0.020  4.02 ± 0.018  6.49 ± 0.013

Panoptic Segmentation

We extract panoptic segmentation labels for the selected video frames using a state-of-the-art model (MaskFormer). Following the COCO-Stuff hierarchy, labels are classified into stuff (amorphous regions such as sky or road) and things (well-defined objects such as people or cars). We compute three types of counts:


  1. Pixel Count: the normalized number of pixels attributed to each label across frames and videos.
  2. Model Attention Weighted Count: the sum obtained by multiplying the model’s attention map with the segmentation mask for each category.
  3. Human Gaze Weighted Count: the sum obtained by multiplying human fixation densities with the segmentation mask.

Analysis across 40 prevalent classes (20 stuff and 20 things) in our eye-tracking dataset reveals that both the model and humans tend to assign lower attention to stuff and higher attention to things. Moreover, by splitting the videos into simple and complex based on the median number of objects per frame, we find that model–human alignment is largely unaffected by video complexity.
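A sketch of how such attention-weighted counts can be accumulated for a single frame is given below; the function name, the label-ID format of the segmentation output, and the normalization step are illustrative assumptions.

    import numpy as np
    from collections import defaultdict

    def attention_weighted_counts(attention_map, segmentation, id_to_name):
        """Accumulate attention mass per panoptic label for one frame.

        attention_map : (H, W) model attention or human gaze density map
        segmentation  : (H, W) integer label IDs (e.g., from a MaskFormer prediction)
        id_to_name    : dict mapping label ID -> class name (stuff or thing)
        """
        counts = defaultdict(float)
        att = attention_map / (attention_map.sum() + 1e-8)       # normalize to a distribution
        for label_id in np.unique(segmentation):
            mask = segmentation == label_id
            counts[id_to_name.get(int(label_id), "unknown")] += float(att[mask].sum())
        return counts

Plain pixel counts correspond to the same computation with a uniform attention map; summing the per-frame dictionaries over frames and videos and normalizing yields the three quantities listed above.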


Pixel counts and attention-weighted counts

Normalized pixel counts (blue), model attention-weighted counts (light blue), and human gaze-weighted counts (orange) for stuff and things classes.

Cumulative distributions for stuff and things

Cumulative distribution of pixel counts, highlighting increased attention to things and decreased attention to stuff.


Temporal Attention

We first analyze whether humans look at similar regions across video frames and find that participants are more consistent in their fixations during the initial frames than later frames. To rule out a center-bias effect, we identify a subset of videos with off-center saliency—these exhibit even stronger framewise consensus.


To assess whether our model displays similar temporal patterns, we compute attention scores α ∈ ℝ^(T×HW) and sum over the spatial dimensions to obtain the temporal attention α_T ∈ ℝ^T. The model preferentially attends to the initial frames of the video sequence, even though no explicit temporal bias is built into the architecture.
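A minimal sketch of this reduction is shown below, assuming the CLS token's attention over the patch tokens has already been extracted from the Transformer (e.g., via a forward hook); the array names and the final normalization are assumptions.

    import numpy as np

    def temporal_attention(cls_attention, T, HW):
        """Collapse CLS-token attention over patches into a per-frame profile.

        cls_attention : array of shape (T * HW,) with the CLS token's attention
                        over all spatio-temporal patch tokens
        Returns alpha_T of shape (T,), normalized to sum to 1.
        """
        alpha = np.asarray(cls_attention).reshape(T, HW)   # (T, HW): spatial attention per frame
        alpha_T = alpha.sum(axis=1)                        # sum over the spatial dimension
        return alpha_T / (alpha_T.sum() + 1e-8)            # normalize across frames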


We further validate this finding with two control experiments:


  1. Frame Reversal: Reversing the frame order (while preserving the same temporal embeddings) still results in higher attention on the early frames.
  2. Optical Flow Analysis: The average optical flow magnitude per frame peaks around the middle of the video, indicating that motion is not driving the increased early attention.

These experiments confirm that the model, trained solely to predict memorability scores, inherently learns to focus on the visual information that human observers attend to during the early moments of a video.
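For the optical-flow control, per-frame motion can be estimated with OpenCV's dense Farnebäck flow, as sketched below; the frame format (a list of BGR images) and the flow parameters are illustrative, not the exact settings used for the analysis.

    import cv2
    import numpy as np

    def mean_flow_magnitudes(frames):
        """Average optical-flow magnitude between consecutive frames (BGR images)."""
        grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
        magnitudes = []
        for prev, nxt in zip(grays[:-1], grays[1:]):
            # Farneback args: flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
            flow = cv2.calcOpticalFlowFarneback(prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            mag = np.linalg.norm(flow, axis=2)             # per-pixel flow magnitude
            magnitudes.append(float(mag.mean()))
        return magnitudes                                   # one value per frame transition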


Temporal attention distributions and optical flow

(Left): Distribution of temporal attention across frames in normal order; (Middle): Temporal attention distribution when frames are reversed; (Right): Average optical flow magnitude per frame, peaking in the middle.

Conclusion

In this work, we introduced a simple yet powerful CNN+Transformer model for predicting video memorability that not only achieves state-of-the-art performance, but also enables an in-depth analysis of spatio-temporal attention mechanisms. By comparing model attention maps with human gaze fixation data collected in a controlled eye-tracking study, we demonstrated that our model learns to attend to the same salient regions as human observers.


Our experiments revealed several key insights:


  • The model’s spatio-temporal attention is highly aligned with human fixation patterns, particularly in the early frames of the video.
  • Through panoptic segmentation, both the model and human participants were found to preferentially focus on semantically meaningful objects (things) over background regions (stuff).
  • Our comprehensive ablation studies and SoTA comparisons underscore the effectiveness of using a simple, single semantic backbone (CLIP) for video memorability prediction.

Despite these promising results, our study also highlights challenges such as data leakage in existing video memorability datasets and the need for more diverse and larger-scale data to improve model generalization. Future work will focus on addressing these issues and further exploring the integration of multi-modal cues for robust memorability prediction. We hope that the insights and resources provided in this work will serve as a stepping stone for further research in understanding the cognitive and perceptual factors that contribute to video memorability.

Citation

Copy the BibTeX entry below for citation:

      @inproceedings{kumar2025eyetoai,
        title = {{Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability}},
        author = {Kumar, Prajneya and Khandelwal, Eshika and Tapaswi, Makarand and Sreekumar, Vishnu},
        year = {2025},
        booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}
      }