Audio descriptions (ADs) are inherently subjective: describers may choose whether and when to describe, what to highlight, and how to phrase it. We show that existing evaluations, which operate on short clips with a single ground-truth reference, cannot capture this subjectivity. We present ADQA, a benchmark over few-minute video segments that tests whether ADs fulfill their central goals: helping Blind and Low Vision (BLV) users understand the story and appreciate visual details. ADQA contains two types of questions: Narrative Understanding (NU) and Visual Appreciation (VA). Our experiments show that current AD systems lag far behind human-authored ADs. We also provide a public leaderboard for benchmarking.
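To make the QA-based protocol concrete, here is a minimal sketch of how such an evaluation can be scored: a text-only question-answering model receives only the generated ADs for a segment as context and answers the segment's multiple-choice questions, and accuracy is reported per question type. This is an illustration, not the official ADQA harness; `answer_question` is a hypothetical stand-in for any QA model (e.g., an LLM behind an API).

```python
# Sketch of a QA-style AD evaluation loop (illustrative, not the official harness).
from collections import defaultdict

def evaluate_adqa(segments, answer_question):
    """segments: list of dicts with keys
         'ads'       -- list of generated AD strings for the segment
         'questions' -- list of dicts with 'type' in {'NU', 'VA'},
                        'question', 'options', and 'answer'
    answer_question(context, question, options) -> predicted option (hypothetical)."""
    correct, total = defaultdict(int), defaultdict(int)
    for seg in segments:
        context = " ".join(seg["ads"])  # the ADs are the only visual evidence the model sees
        for q in seg["questions"]:
            pred = answer_question(context, q["question"], q["options"])
            correct[q["type"]] += int(pred == q["answer"])
            total[q["type"]] += 1
    # Accuracy per question type, e.g. {'NU': 0.72, 'VA': 0.58}
    return {t: correct[t] / total[t] for t in total}
```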
Current evaluation methods compare a generated AD sentence with a temporally aligned ground-truth reference sentence. However:
For the remaining ADs that do have a temporal counterpart, the pairs might:
Hovering over each data point in the plot below reveals the mapped AD pair along with its CIDEr score, BERT similarity, and temporal overlap.
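For reference, below is a rough sketch of how two of the plotted per-pair statistics can be computed: temporal overlap as the IoU of the two ADs' time intervals, and "BERT similarity" approximated here with a Sentence-BERT encoder (the specific model is an assumption, chosen for illustration). CIDEr is omitted since it requires corpus-level TF-IDF statistics.

```python
# Illustrative per-pair statistics: temporal IoU and embedding cosine similarity.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, for illustration only

def temporal_iou(a, b):
    """a, b: (start, end) timestamps in seconds for the two ADs."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def bert_similarity(generated_ad, reference_ad):
    """Cosine similarity between sentence embeddings of the paired ADs."""
    emb = _model.encode([generated_ad, reference_ad], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```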
[Figure: dataset statistics, CMD-AD vs. MAD-eval — 98 vs. 10; 2:20 vs. 1:56; 17,595 vs. 15,441; 3,128 vs. 1,962]