Hierarchical Video-Moment Retrieval and Step-Captioning (CVPR 2023)

1University of North Carolina at Chapel Hill, 2Meta AI
*Equal contribution

HiREST: a holistic, hierarchical benchmark (+joint model) of multimodal retrieval and step-by-step summarization for videos.

Overview of the four hierarchical tasks in the HiREST dataset. 1) Video retrieval: find the video most relevant to a given text query. 2) Moment retrieval: select the relevant span of the video by trimming the parts irrelevant to the text query. 3) Moment segmentation: break the span into several steps and identify the start-end boundaries of each step. 4) Step captioning: generate step-by-step textual summaries of the moment.


There is growing interest in searching for information from large video corpora. Prior works have studied relevant tasks, such as text-based video retrieval, moment retrieval, video summarization, and video captioning in isolation, without an end-to-end setup that can jointly search from video corpora and generate summaries. Such an end-to-end setup would allow for many interesting applications, e.g., a text-based search that finds a relevant video from a video corpus, extracts the most relevant moment from that video, and segments the moment into important steps with captions.

To address this, we present the HiREST (HIerarchical REtrieval and STep-captioning) dataset and propose a new benchmark that covers hierarchical information retrieval and visual/textual stepwise summarization from an instructional video corpus. HiREST consists of 3.4K text-video pairs from an instructional video dataset, where 1.1K videos are annotated with the moment span relevant to the text query and a breakdown of each moment into key instruction steps, each with a caption and timestamps (8.6K step captions in total).
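To make the annotation structure concrete, here is a minimal sketch of what one HiREST-style entry might look like. The field names (`query`, `video_id`, `moment`, `steps`) and all values are illustrative assumptions, not the dataset's actual schema or contents:

```python
import json

# A hypothetical annotation entry: a text query paired with a video,
# the relevant moment span, and timestamped step captions.
# (All field names and values below are made up for illustration.)
example = json.loads("""
{
  "query": "How to make butter biscuits",
  "video_id": "abc123",
  "moment": {"start": 12.0, "end": 185.0},
  "steps": [
    {"start": 12.0, "end": 40.0,  "caption": "mix flour and butter"},
    {"start": 40.0, "end": 95.0,  "caption": "shape the dough"},
    {"start": 95.0, "end": 185.0, "caption": "bake the biscuits"}
  ]
}
""")

# Steps tile the moment: contiguous, ordered, and inside the span.
assert example["steps"][0]["start"] == example["moment"]["start"]
assert example["steps"][-1]["end"] == example["moment"]["end"]
for a, b in zip(example["steps"], example["steps"][1:]):
    assert a["end"] == b["start"]
```

The key property the sketch encodes is that step boundaries are contiguous within the annotated moment, which is what the moment segmentation task asks models to recover.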

Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel tasks: moment segmentation and step captioning. In moment segmentation, models break a video moment down into instruction steps and identify their start-end boundaries. In step captioning, models generate a textual summary for each step. We also present task-specific and end-to-end joint baseline models as starting points for our new benchmark. While the baseline models show some promising results, there remains large room for improvement by the community.
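The moment segmentation task above can be sketched as a simple decoding step. This is not the paper's actual decoder; it is a minimal sketch assuming a model outputs per-second boundary probabilities, which are thresholded and turned into consecutive (start, end) step spans:

```python
# A minimal sketch of moment-segmentation decoding (illustrative, not the
# paper's method): keep per-second boundary scores above a threshold and
# turn the resulting boundary times into contiguous step spans.
def segment_moment(boundary_probs, moment_start, moment_end, threshold=0.5):
    """Convert per-second boundary probabilities into step (start, end) spans."""
    boundaries = [moment_start]
    # boundary_probs[i] is the score for a boundary at second moment_start+1+i
    for t, p in enumerate(boundary_probs, start=moment_start + 1):
        if p >= threshold and t < moment_end:
            boundaries.append(t)
    boundaries.append(moment_end)
    # Pair consecutive boundaries into step spans.
    return list(zip(boundaries, boundaries[1:]))

# Example: a 10-second moment with boundary peaks at seconds 4 and 7.
probs = [0.1, 0.2, 0.1, 0.9, 0.1, 0.2, 0.8, 0.1, 0.1]
print(segment_moment(probs, 0, 10))  # → [(0, 4), (4, 7), (7, 10)]
```

By construction the decoded spans are contiguous and cover the whole moment, matching the task definition.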

HiREST Dataset

Comparison of HiREST and other video datasets with step annotations

HiREST offers 1) videos from various domains, 2) many step annotations per video, and 3) high-quality step captions written by human annotators.

Video Category Distribution

The videos and text queries are collected from the HowTo100M dataset. There are a wide variety of categories for HiREST videos. The most frequent categories are “Hobbies and Crafts”, “Food and Entertaining”, and “Home and Garden”.

Step Caption Distribution

Distribution of HiREST step captions by their first three words for 50 random samples. Many captions are related to actions or objects.


(a) Top 10 most common starting verbs in step captions. (b) Top 10 most common words in step captions. The top words typically refer to objects (e.g., water) or quantities (e.g., all).


Joint Baseline Model

We provide a joint baseline model that handles moment retrieval, moment segmentation, and step captioning tasks with a single architecture. We learn a shallow multimodal transformer that adapts the four pretrained models: EVA-CLIP (frozen), Whisper (frozen), MiniLM (frozen), and CLIP4Caption (finetuned).


The EVA-CLIP visual encoder (frozen) maps each video frame to a visual embedding, and the EVA-CLIP text encoder (frozen) maps the text query to a text embedding. Whisper (frozen) extracts a speech transcript from the audio, and the MiniLM text encoder (frozen) maps the transcript to a text embedding. To adapt the video, text, and audio embeddings, we finetune a two-layer multimodal encoder and a two-layer text decoder, both initialized from CLIP4Caption. We train the joint model in a multi-task setup in a round-robin fashion, sampling a batch from one of the task data loaders at each step.
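The round-robin multi-task schedule can be sketched as follows. The loaders here are stand-in lists of batch identifiers rather than real data loaders, and the task names are taken from the three tasks the joint model handles:

```python
from itertools import cycle, islice

# A minimal sketch of round-robin multi-task training: at each step, take
# one batch from the next task's loader in turn. Stand-in batch ids only.
loaders = {
    "moment_retrieval":    iter(["mr_batch_0", "mr_batch_1"]),
    "moment_segmentation": iter(["ms_batch_0", "ms_batch_1"]),
    "step_captioning":     iter(["sc_batch_0", "sc_batch_1"]),
}

def round_robin_steps(loaders, num_steps):
    """Yield (task, batch) pairs, cycling over the tasks one batch at a time."""
    for task in islice(cycle(loaders), num_steps):
        yield task, next(loaders[task])

schedule = list(round_robin_steps(loaders, 6))
print([task for task, _ in schedule])
# → ['moment_retrieval', 'moment_segmentation', 'step_captioning',
#    'moment_retrieval', 'moment_segmentation', 'step_captioning']
```

In a real training loop, the loop body would run a forward/backward pass with the task-specific loss on each sampled batch; the sketch only shows the scheduling.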

Hierarchical Video Information Retrieval

Given a text query ‘How to make butter biscuits’, our joint model predicts a relevant moment from a video, segments the moment into steps, and describes the moment step-by-step.
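The retrieval stages of this flow can be sketched with toy embeddings. The vectors and the similarity threshold below are made up for illustration; a real system would use EVA-CLIP text/visual features:

```python
import math

# A toy sketch of the hierarchical retrieval flow (all embeddings are
# fabricated 2-d vectors; a real system would use EVA-CLIP features).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

query_emb = [1.0, 0.0]
video_embs = {"vid_a": [0.9, 0.1], "vid_b": [0.1, 0.9]}

# 1) Video retrieval: pick the most query-similar video in the corpus.
best_video = max(video_embs, key=lambda v: cosine(query_emb, video_embs[v]))

# 2) Moment retrieval: keep the run of frames whose similarity to the
#    query exceeds a threshold (frame embeddings are illustrative).
frame_embs = [[0.2, 0.8], [0.95, 0.05], [0.9, 0.2], [0.1, 0.9]]
scores = [cosine(query_emb, f) for f in frame_embs]
relevant = [t for t, s in enumerate(scores) if s > 0.5]
moment = (relevant[0], relevant[-1] + 1)

print(best_video, moment)  # → vid_a (1, 3)
```

The moment would then be passed to segmentation and step captioning, completing the hierarchy of four tasks.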


Please cite our paper if you use our dataset and/or method in your projects.

@inproceedings{zala2023hirest,
  author    = {Abhay Zala and Jaemin Cho and Satwik Kottur and Xilun Chen and Barlas Oğuz and Yashar Mehdad and Mohit Bansal},
  title     = {Hierarchical Video-Moment Retrieval and Step-Captioning},
  booktitle = {CVPR},
  year      = {2023},
}