VL-JEPA Custom (V-JEPA 2 + Qwen 2.5 + MiniLM)

English Description

This model is a custom implementation of the VL-JEPA (Video-Language Joint Embedding Predictive Architecture) inspired by Meta AI's research. It is designed for Temporal Moment Retrieval (finding specific actions in videos).
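To make the retrieval idea concrete, here is a toy sketch of JEPA-style scoring: a predictor maps (frozen) video-window features into the text embedding space, and each temporal window is scored by cosine similarity against the query embedding. All dimensions and layer shapes below are illustrative assumptions, not the real model's widths.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions; the real model uses V-JEPA 2 / MiniLM widths.
VIDEO_DIM, TEXT_DIM, JOINT_DIM = 1024, 384, 512

class ToyPredictor(nn.Module):
    """Maps video-window features into the text embedding space."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(VIDEO_DIM, JOINT_DIM),
            nn.GELU(),
            nn.Linear(JOINT_DIM, TEXT_DIM),
        )

    def forward(self, x):
        return self.net(x)

predictor = ToyPredictor()
video_windows = torch.randn(10, VIDEO_DIM)  # 10 temporal windows of a video
query_emb = torch.randn(TEXT_DIM)           # sentence embedding of the text query

# Score each window by cosine similarity in the shared space;
# the best-scoring window localizes the queried action.
with torch.no_grad():
    scores = F.cosine_similarity(predictor(video_windows), query_emb.unsqueeze(0), dim=-1)
best_window = scores.argmax().item()
```

In the actual model, the video features would come from V-JEPA 2 and the query embedding from MiniLM; only the predictor (and a small projection) is trained.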

Architecture

Training Details

  • Dataset: Charades-STA (Academic dataset for video action localization).
  • Optimization: LoRA with r=64 and α=128, targeting q_proj and v_proj in Qwen.
  • Learning Rate: 3e-4 with Cosine Warmup.
  • Outcome: Only ~0.2% of parameters are trainable, making the model extremely lightweight to train.
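The LoRA setup above can be sketched as a minimal wrapper: the base weight is frozen, and a low-rank update scaled by α/r is added on top. This is a hand-rolled illustration (in practice one would likely use a library such as PEFT); the layer width is an assumption.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA: frozen base linear plus a low-rank update scaled by alpha/r."""
    def __init__(self, base: nn.Linear, r: int = 64, alpha: int = 128):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        # A is small-random, B is zero, so the wrapper starts as an identity update.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: wrapping a hypothetical q_proj layer (width 896 is an assumption).
q_proj = nn.Linear(896, 896)
lora_q = LoRALinear(q_proj, r=64, alpha=128)
x = torch.randn(2, 896)
out = lora_q(x)
```

Because B is initialized to zero, the wrapped layer initially behaves exactly like the frozen base layer; only A and B receive gradients during fine-tuning.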

French Description

This model is a custom implementation of VL-JEPA, inspired by Meta AI's work. It is optimized for Temporal Moment Retrieval (finding specific actions in videos).

Architecture

Training Details

  • Dataset: Charades-STA.
  • Method: Training via LoRA (r=64, α=128).
  • Cost: Very economical; trained for roughly $5 on Vast.ai.

Usage

import torch
from vljepa.config import Config
from vljepa.models import VLJepa

# Load model
config = Config()
model = VLJepa(config)
checkpoint = torch.load("best.pth", map_location="cpu")
model.predictor.load_state_dict(checkpoint["predictor_state_dict"])
model.y_encoder.projection.load_state_dict(checkpoint["y_projection_state_dict"])
model.eval()

# Localizing an action
# (Requires preprocessing frames and tokenizing query)

Refer to the source code for the full inference pipeline with sliding-window scoring and NMS.
