
Video-DETR

This repository contains the implementation of Video-DETR, a transformer-based model for video moment retrieval and highlight detection. Given a natural-language query and a video, the model localizes the most relevant temporal moments and predicts a saliency (highlight) score for each video clip.

The codebase is built upon the DETR architecture and supports multiple datasets including QVHighlights, TVSum, Charades-STA, TACoS, NLQ, and YouTube-Uni.



✨ Features

  • Joint Moment Retrieval & Highlight Detection: Simultaneously predicts relevant temporal windows and per-clip saliency scores.
  • DETR-based Architecture: Uses a transformer encoder-decoder with learnable object queries for set prediction.
  • Multi-Dataset Support: Trained and evaluated on QVHighlights, TVSum, Charades-STA, TACoS, NLQ, and YouTube-Uni.
  • End-to-End Demo: Run inference on your own videos with pre-trained checkpoints using CLIP features.

🛠 Installation

Requirements

  • Python 3.8+
  • CUDA-capable GPU (recommended)
  • PyTorch 1.13.1

Setup


# Install dependencies
pip install -r requirements.txt

Key dependencies:

  • torch==1.13.1
  • torchtext==0.14.1
  • tensorboard==2.5.0
  • moviepy==1.0.3
  • scikit-learn==0.24.2
  • numpy, scipy, matplotlib, tqdm, pytube

📁 Data Preparation

QVHighlights Dataset

The QVHighlights annotations are provided under data/ in JSONL format:

  • data/highlight_train_release.jsonl
  • data/highlight_val_release.jsonl
  • data/highlight_test_release.jsonl

Each line is a dictionary containing:

  • qid: query ID
  • query: natural language query
  • vid: video ID ({youtube_id}_{start}_{end})
  • duration: video duration in seconds
  • relevant_windows: ground-truth temporal windows [[start, end], ...]
  • relevant_clip_ids: clip IDs falling into the relevant windows
  • saliency_scores: per-clip highlight annotations from 3 annotators

Note: relevant_windows, relevant_clip_ids, and saliency_scores are omitted in the test split.
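The annotation files can be parsed with the standard json module, one record per line. A minimal sketch (the sample record below is illustrative, not taken from the dataset; note the .get() defaults for the fields omitted in the test split):

```python
import json

# One illustrative QVHighlights-style record. Field names follow the list
# above; the concrete values are made up for demonstration.
sample_line = json.dumps({
    "qid": 0,
    "query": "A person is cooking in the kitchen",
    "vid": "abc123_60_210",
    "duration": 150,
    "relevant_windows": [[10, 30], [50, 60]],
    "relevant_clip_ids": [5, 6, 25, 26],
    "saliency_scores": [[2, 3, 4], [1, 2, 2], [3, 3, 3], [4, 2, 3]],
})

def parse_annotation(line):
    """Parse one JSONL annotation line into a dict."""
    ann = json.loads(line)
    # relevant_windows, relevant_clip_ids, and saliency_scores are absent
    # in the test split, so fall back to empty defaults.
    windows = ann.get("relevant_windows", [])
    saliency = ann.get("saliency_scores", [])
    # Average the 3 annotators' scores to get one saliency value per clip.
    mean_saliency = [sum(s) / len(s) for s in saliency]
    return ann, windows, mean_saliency

ann, windows, mean_saliency = parse_annotation(sample_line)
print(ann["qid"], windows, mean_saliency)
```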

Features

Pre-extracted features should be placed under features/qvhighlight/:

  • slowfast_features/: SlowFast video features (2304-dim)
  • clip_features/: CLIP video features (512-dim)
  • clip_text_features/: CLIP text features (512-dim)

You can also use other feature combinations by modifying the training script.
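The 2816-dim value passed to --v_feat_dim in the training command below is consistent with concatenating the SlowFast and CLIP features per clip (2304 + 512 = 2816); a minimal numpy sketch of that assumption, with random arrays standing in for real features:

```python
import numpy as np

num_clips = 75  # one row per 2-second clip
slowfast = np.random.randn(num_clips, 2304).astype(np.float32)  # SlowFast features
clip_vid = np.random.randn(num_clips, 512).astype(np.float32)   # CLIP video features

# Concatenate along the feature dimension: 2304 + 512 = 2816,
# matching the --v_feat_dim 2816 flag used for training.
video_feat = np.concatenate([slowfast, clip_vid], axis=1)
print(video_feat.shape)
```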


🚀 Training

QVHighlights

Use the provided training script:

cd video_detr/scripts
bash train.sh

This script configures:

  • Video features: slowfast + clip
  • Text features: clip
  • Batch size: 32
  • Transformer layers: 3 enc / 3 dec / 2 t2v / 1 moment / 2 dummy / 1 sent

Custom Training

You can also launch training directly:

PYTHONPATH=$PYTHONPATH:. python video_detr/train.py \
  --dset_name hl \
  --ctx_mode video_tef \
  --train_path data/highlight_train_release.jsonl \
  --eval_path data/highlight_val_release.jsonl \
  --eval_split_name val \
  --v_feat_dirs features/qvhighlight/slowfast_features features/qvhighlight/clip_features \
  --v_feat_dim 2816 \
  --t_feat_dir features/qvhighlight/clip_text_features \
  --t_feat_dim 512 \
  --bsz 32 \
  --results_root results \
  --exp_id my_experiment

Supported Datasets

Training scripts for other datasets are available under video_detr/scripts/:

  • charades_sta/
  • tacos/
  • tvsum/
  • youtube_uni/

📊 Evaluation

Offline Evaluation

Evaluate a trained checkpoint on the validation set:

cd video_detr/scripts
bash inference.sh <path_to_checkpoint> val

Or run directly:

PYTHONPATH=$PYTHONPATH:. python video_detr/inference.py \
  --resume <path_to_checkpoint> \
  --eval_split_name val \
  --eval_path data/highlight_val_release.jsonl

Standalone Evaluation

You can also evaluate a prediction file independently:

bash standalone_eval/eval_sample.sh

This will evaluate standalone_eval/sample_val_preds.jsonl and output metrics to standalone_eval/sample_val_preds_metrics.json.

Codalab Submission

To evaluate on the test split, submit both val and test predictions to the Codalab evaluation server. The submission should be a single .zip file containing:

  • hl_val_submission.jsonl
  • hl_test_submission.jsonl

See standalone_eval/README.md for details.
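A minimal sketch for packaging the two prediction files into the required .zip. The placeholder record here (with a hypothetical pred_relevant_windows field) only exists so the sketch runs standalone; in practice the files come from the inference step:

```python
import json
import zipfile

files = ["hl_val_submission.jsonl", "hl_test_submission.jsonl"]

# Write placeholder prediction files so this sketch is self-contained;
# real submissions are produced by video_detr/inference.py.
for name in files:
    with open(name, "w") as f:
        f.write(json.dumps({"qid": 0, "pred_relevant_windows": []}) + "\n")

# Both files must sit at the top level of the archive.
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in files:
        zf.write(name, arcname=name)

print(zipfile.ZipFile("submission.zip").namelist())
```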


🎬 Inference on Custom Videos

Run the model on your own videos using the end-to-end demo script in run_on_video/:

PYTHONPATH=$PYTHONPATH:. python run_on_video/run.py

Or use the newer version:

PYTHONPATH=$PYTHONPATH:. python run_on_video/run_new.py

The VideoDETRPredictor class handles:

  1. Video feature extraction using CLIP
  2. Text feature extraction for your queries
  3. Model inference to predict moments and saliency scores

Limitation: The pre-trained positional embeddings support videos up to ~150 seconds (75 clips of 2 seconds each).
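Under the 2-second clip scheme, a video of duration d seconds spans ceil(d / 2) clips, so the positional-embedding limit amounts to at most 75 clips. A small helper (hypothetical, not part of the repo) for checking a video before running the demo:

```python
import math

MAX_CLIPS = 75   # length of the pre-trained positional embeddings
CLIP_LEN = 2.0   # seconds per clip

def num_clips(duration_sec):
    """Number of 2-second clips covering a video of the given duration."""
    return math.ceil(duration_sec / CLIP_LEN)

def fits_pretrained_model(duration_sec):
    """True if the video is short enough for the pre-trained checkpoints."""
    return num_clips(duration_sec) <= MAX_CLIPS

print(num_clips(150), fits_pretrained_model(150))  # 75 True
print(num_clips(151), fits_pretrained_model(151))  # 76 False
```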


🗂 Project Structure

.
├── video_detr/              # Core model implementation
│   ├── model.py             # VideoDETR model definition
│   ├── transformer.py       # Transformer encoder/decoder
│   ├── attention.py         # Attention mechanisms
│   ├── crossattention.py    # Cross-attention modules
│   ├── train.py             # Training loop
│   ├── inference.py         # Evaluation & inference logic
│   ├── start_end_dataset.py # Dataset loaders
│   ├── matcher.py           # Hungarian matcher for DETR
│   ├── config.py            # Argument parsers
│   ├── span_utils.py        # Temporal span utilities
│   └── scripts/             # Training scripts per dataset
├── run_on_video/            # End-to-end demo on custom videos
│   ├── run.py / run_new.py
│   ├── data_utils.py        # CLIP feature extraction
│   ├── model_utils.py       # Model loading utilities
│   └── visualization.py     # Result visualization
├── standalone_eval/         # Official evaluation scripts
│   ├── eval.py
│   └── README.md
├── utils/                   # General utilities
│   ├── basic_utils.py
│   ├── tensor_utils.py
│   ├── temporal_nms.py
│   └── model_utils.py
├── data/                    # Dataset annotations
│   ├── highlight_train_release.jsonl
│   ├── highlight_val_release.jsonl
│   ├── highlight_test_release.jsonl
│   └── README.md
├── features/                # Pre-extracted features (user-provided)
├── results/                 # Experiment outputs
└── requirements.txt