Thanks for the feedback! I've just updated the input video, so the examples should now match the quality shown in the teaser. The current frame rate is set to 8 fps, which is standard for these demos.
Regarding resolution (fidelity): VideoCoF supports arbitrary resolutions and arbitrary video lengths. You are welcome to upload your own high-resolution videos to test its performance!
🚀 Introducing VideoCoF: Unified Video Editing with a Temporal Reasoner (Chain-of-Frames)!
We’re excited to introduce VideoCoF, a unified framework for instruction-based video editing that enables temporal reasoning and ~4× video length extrapolation, trained with only 50k video pairs. 🔥
🔍 What makes VideoCoF different?
🧠 Chain-of-Frames reasoning — mimicking the human thinking process of Seeing → Reasoning → Editing, it applies edits accurately over time without external masks, ensuring physically plausible results.
📈 Strong length generalization — trained on 33-frame clips, yet supports multi-shot editing and long-video extrapolation (~4×).
🎯 Unified fine-grained editing — Object Removal, Addition, Swap, and Local Style Transfer, with instance-level & part-level, spatially-aware control.
⚡ Fast inference update 🚀 H100: ~20s / video with 4-step inference, making high-quality video editing far more practical for real-world use.
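To make the Chain-of-Frames idea concrete, here is a minimal conceptual sketch of the Seeing → Reasoning → Editing loop. All function names and data structures below are hypothetical illustrations for intuition only — they are not the VideoCoF API, and the real model operates on latent video frames rather than dictionaries.

```python
# Conceptual sketch of Chain-of-Frames (Seeing → Reasoning → Editing).
# All names here are hypothetical illustrations, NOT the VideoCoF API.

def see(frame):
    # "Seeing": inspect the current frame to locate the edit target.
    return {"target": frame.get("object")}

def reason(observation, history):
    # "Reasoning": keep the edit temporally consistent by falling back
    # to the previous frame's target when the current frame is ambiguous.
    if history and observation["target"] is None:
        observation["target"] = history[-1]["target"]
    return observation

def edit(frame, observation, instruction):
    # "Editing": apply the instruction only to the reasoned target,
    # without requiring an externally supplied mask.
    edited = dict(frame)
    if observation["target"] == instruction["remove"]:
        edited["object"] = None
    return edited

def chain_of_frames(frames, instruction):
    # Process frames sequentially so each edit is conditioned on
    # the reasoning history of all earlier frames.
    history, out = [], []
    for frame in frames:
        obs = reason(see(frame), history)
        history.append(obs)
        out.append(edit(frame, obs, instruction))
    return out

# Toy run: remove "cat" across a clip where the last frame is ambiguous.
frames = [{"object": "cat"}, {"object": "cat"}, {"object": None}]
result = chain_of_frames(frames, {"remove": "cat"})
```

The per-frame loop is also what makes length extrapolation natural in this framing: nothing in the procedure depends on the total clip length, so a model trained on 33-frame clips can, in principle, be rolled out over longer sequences.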