Audio descriptions (ADs) are inherently subjective: describers may choose whether and when to describe, what to highlight, and how to phrase it. We show that existing evaluations, which operate on short clips with a single ground-truth reference, cannot capture this subjectivity. We present ADQA, a benchmark over few-minute video segments that tests whether ADs fulfill their central goals: helping Blind and Low Vision (BLV) users understand the story and appreciate visual details. ADQA contains two types of questions: Narrative Understanding (NU) and Visual Appreciation (VA). Our experiments show that current AD systems lag far behind human-authored ADs. We also provide a public leaderboard for benchmarking.
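To make the QA-based protocol concrete, here is a minimal sketch of how such an evaluation can be scored: a text-only question-answering model receives only the generated ADs for a segment as context and answers the segment's multiple-choice questions, and accuracy is reported per question type. This is an illustration, not the official ADQA harness; `answer_question` is a hypothetical stand-in for any QA model (e.g., an LLM behind an API).

```python
# Sketch of a QA-style AD evaluation loop (illustrative, not the official harness).
from collections import defaultdict

def evaluate_adqa(segments, answer_question):
    """segments: list of dicts with keys
         'ads'       -- list of generated AD strings for the segment
         'questions' -- list of dicts with 'type' in {'NU', 'VA'},
                        'question', 'options', and 'answer'
    answer_question(context, question, options) -> predicted option (hypothetical)."""
    correct, total = defaultdict(int), defaultdict(int)
    for seg in segments:
        context = " ".join(seg["ads"])  # the ADs are the only visual evidence the model sees
        for q in seg["questions"]:
            pred = answer_question(context, q["question"], q["options"])
            correct[q["type"]] += int(pred == q["answer"])
            total[q["type"]] += 1
    # Accuracy per question type, e.g. {'NU': 0.72, 'VA': 0.58}
    return {t: correct[t] / total[t] for t in total}
```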
Current evaluation methods compare a generated AD sentence with a temporally aligned ground-truth reference sentence. However:
For the remaining ADs that do have a temporal counterpart, the pairs might:
Hovering over each data point in the plot below reveals the mapped AD pair along with its CIDEr score, BERT similarity, and temporal overlap.
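For reference, below is a rough sketch of how two of the plotted per-pair statistics can be computed: temporal overlap as the IoU of the two ADs' time intervals, and "BERT similarity" approximated here with a Sentence-BERT encoder (the specific model is an assumption, chosen for illustration). CIDEr is omitted since it requires corpus-level TF-IDF statistics.

```python
# Illustrative per-pair statistics: temporal IoU and embedding cosine similarity.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, for illustration only

def temporal_iou(a, b):
    """a, b: (start, end) timestamps in seconds for the two ADs."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def bert_similarity(generated_ad, reference_ad):
    """Cosine similarity between sentence embeddings of the paired ADs."""
    emb = _model.encode([generated_ad, reference_ad], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```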
[Figure: dataset statistics, CMD-AD vs. MAD-eval — 98 vs. 10; 2:20 vs. 1:56; 17,595 vs. 15,441; 3,128 vs. 1,962]