A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval
Submission Under Review
- 1University of Maryland, College Park
- 2SRI International
Abstract
Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could range from moment-by-moment detail to a single-phrase summary, or anything in between. To provide a more thorough evaluation of the capabilities of long video retrieval systems, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos. We validate this pipeline's fidelity via rigorous human inspection. We then benchmark a representative set of video-language models on these synthetic captions across several long video datasets, showing that they struggle with the transformed data, especially with the shortest captions. We also propose a lightweight fine-tuning method, in which a contrastive loss learns a hierarchical embedding based on the differing levels of information among the various captions. Our method improves performance both on the downstream paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet) and on the various long video retrieval metrics we compute using our synthetic data (+3.6% R@1 for short descriptions on ActivityNet).
Overview
We find that existing long video retrieval datasets are arbitrarily constrained in terms of their text descriptions. They use paragraph-length captions, which are often generated as highly literal descriptions of video content. This neglects the richness of potential captions, especially shorter, abstract summaries.
New Synthetic 10k Words Data
We develop a novel synthetic data generation pipeline that leverages ChatGPT (GPT-3.5). With this pipeline, we generate 10k Words supplements for 3 long video retrieval datasets: ActivityNet, QuerYD, and LF-VILA. The table above gives an example of these synthetic captions, and all 3 supplements can be downloaded from the link at the top of this page.
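To make the pipeline concrete, here is a minimal sketch of the caption-generation step. The prompt text, the caption granularities, and the `generate_diverse_captions` helper are illustrative assumptions, not the paper's actual prompts; the sketch assumes the v1 `openai` Python client with an API key in the environment.

```python
# Hypothetical sketch: rewrite an existing paragraph caption at several levels
# of detail with GPT-3.5. Prompts and granularity names are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GRANULARITIES = {
    "phrase": "a single short phrase (under 10 words)",
    "sentence": "one concise sentence",
    "summary": "a 2-3 sentence abstract summary",
}

def generate_diverse_captions(paragraph: str) -> dict:
    """Produce captions of varying length/abstraction from one paragraph caption."""
    captions = {}
    for name, style in GRANULARITIES.items():
        prompt = (
            f"Rewrite the following video description as {style}, "
            f"using only information present in the text:\n\n{paragraph}"
        )
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        captions[name] = response.choices[0].message.content.strip()
    return captions
```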
SOTA Models Struggle with 10k Words Problem
We find that, in the zero-shot setting, a representative set of video-language models struggle with video retrieval on the 10k Words data, and they struggle most with the shortest summaries.
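For reference, the zero-shot evaluation boils down to ranking videos by text-video similarity and reporting Recall@K. The sketch below shows a standard way to compute this metric from paired embeddings; how those embeddings are extracted from any particular model is omitted, and the function name is our own.

```python
# Minimal sketch of text-to-video Recall@K from paired embeddings.
import torch
import torch.nn.functional as F

def recall_at_k(text_emb: torch.Tensor, video_emb: torch.Tensor, k: int = 1) -> float:
    """text_emb, video_emb: (N, D) tensors where row i of each forms a matched pair."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    sims = text_emb @ video_emb.T                   # (N, N) cosine similarities
    ranks = sims.argsort(dim=-1, descending=True)   # ranked video indices per caption
    targets = torch.arange(len(text_emb)).unsqueeze(1)
    hits = (ranks[:, :k] == targets).any(dim=-1)    # is the true video in the top-k?
    return hits.float().mean().item()
```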
Novel Finetuning for 10k Words Problem
We propose a novel finetuning strategy to improve video retrieval performance, especially for the shorter summaries introduced by 10k Words. At its core, we sample 10k Words data during training. Additionally, we propose 2 losses that align a projection of the 10k Words text features with the video and paragraph text features, as sketched below.
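The following is a hedged sketch of what such an alignment could look like: a symmetric InfoNCE-style contrastive loss between a projection of the 10k Words caption features and (a) the video features and (b) the paragraph text features. The linear projection head, the temperature value, and the module name `TenKAlignment` are assumptions for illustration, not the exact implementation used with COSA.

```python
# Hedged sketch of two contrastive alignment losses on projected caption features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TenKAlignment(nn.Module):
    def __init__(self, dim: int = 512, temperature: float = 0.05):
        super().__init__()
        self.proj = nn.Linear(dim, dim)   # projection of the 10k caption features
        self.temperature = temperature

    def info_nce(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.T / self.temperature              # (N, N) similarity matrix
        targets = torch.arange(len(a), device=a.device)
        # symmetric cross-entropy: match item i in a to item i in b, and vice versa
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))

    def forward(self, cap_feat, video_feat, para_feat):
        z = self.proj(cap_feat)                          # projected 10k caption features
        loss_video = self.info_nce(z, video_feat)        # align with video features
        loss_para = self.info_nce(z, para_feat)          # align with paragraph features
        return loss_video + loss_para
```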
Results
With our finetuning (applied to COSA), we achieve larger improvements over zero-shot results than either standard finetuning or standard finetuning with 10k Words data.
Citation
This website is based on the popular template from Ben Mildenhall.