ArXiv

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

2026-05-12 · 17:59 UTC · Yihao Meng, Zichen Liu, Hao Ouyang...

Authors: Yihao Meng, Zichen Liu, Hao Ouyang...
Categories: cs.CV
arXiv: https://arxiv.org/abs/2605.12496v1
PDF: https://arxiv.org/pdf/2605.12496v1

Brief

CausalCine is a 2026 autoregressive video-generation framework targeting multi-shot cinematic narratives, where shot boundaries, viewpoint shifts, and evolving events break models trained for short-horizon continuation. The authors train a causal base model on native multi-shot sequences, introduce Content-Aware Memory Routing (CAMR, attention-based retrieval of KV entries) to preserve cross-shot coherence under bounded active memory, and distill a few-step generator for real-time, interactive streaming. Summary based on the paper abstract only.

Why it matters

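Existing autoregressive video models are trained mainly for short-horizon continuation, so they treat long sequences as extended single shots and suffer motion stagnation and semantic drift over long rollouts. CausalCine (Yihao Meng et al., arXiv, 2026-05-12) instead trains a causal base model on native multi-shot sequences, generates causally across shot changes, accepts new prompts on the fly, and reuses prior-shot context without regenerating earlier shots.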

Key details

The method adds Content-Aware Memory Routing (CAMR), which retrieves historical KV entries by attention-based relevance rather than temporal proximity, preserving cross-shot coherence under bounded active memory. The causal base model is then distilled into a few-step generator for real-time interactive generation. The authors report that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models. Demo: https://yihao-meng.github.io/CausalCine/
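The difference between routing by attention-based relevance and routing by temporal proximity can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how relevance scores are computed or aggregated, at what granularity KV entries are retrieved, or how the memory budget is enforced. The sketch assumes history is stored as chunks of key vectors, scores each chunk by the softmax attention mass it receives from the current queries, and keeps the top `budget` chunks. All names (`chunk_relevance`, `route_by_relevance`, `route_by_recency`, `budget`) are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def _softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def chunk_relevance(queries: Sequence[Vector],
                    keys_by_chunk: Sequence[Sequence[Vector]]) -> List[float]:
    """Attention mass each historical chunk receives, averaged over queries.

    Assumption: scaled dot-product attention over all cached keys; the paper
    may score relevance differently.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    flat = [(c, key) for c, keys in enumerate(keys_by_chunk) for key in keys]
    mass = [0.0] * len(keys_by_chunk)
    for q in queries:
        weights = _softmax([_dot(q, key) * scale for _, key in flat])
        for (c, _), w in zip(flat, weights):
            mass[c] += w
    return [m / len(queries) for m in mass]


def route_by_relevance(queries: Sequence[Vector],
                       keys_by_chunk: Sequence[Sequence[Vector]],
                       budget: int) -> List[int]:
    """Keep the `budget` most relevant chunks, returned in temporal order."""
    scores = chunk_relevance(queries, keys_by_chunk)
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return sorted(ranked[:budget])


def route_by_recency(num_chunks: int, budget: int) -> List[int]:
    """Sliding-window baseline: keep only the most recent chunks."""
    return list(range(max(0, num_chunks - budget), num_chunks))


# Toy history: chunk 0 (an early shot) matches the current query direction,
# chunks 1 and 2 are unrelated, chunk 3 is partially related.
history = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
current_queries = [[4.0, 0.0]]

print(route_by_relevance(current_queries, history, budget=2))  # [0, 3]
print(route_by_recency(len(history), budget=2))                # [2, 3]
```

With a budget of two chunks, the recency window drops the early shot in chunk 0, while relevance-based selection keeps it because the current queries attend to it most strongly. That is the cross-shot coherence argument in miniature: content from a distant shot stays reachable without enlarging active memory.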

Source evidence

Abstract

Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle in this setting. Trained primarily for short-horizon continuation, they treat long sequences as extended single shots, inevitably suffering from motion stagnation and semantic drift during long rollouts. To bridge this gap, we introduce CausalCine, an interactive autoregressive framework that transforms multi-shot video generation into an online directing process. CausalCine generates causally across shot changes, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. To achieve this, we first train a causal base model on native multi-shot sequences to learn complex shot transitions prior to acceleration. We then propose Content-Aware Memory Routing (CAMR), which dynamically retrieves historical KV entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, we distill the causal base model into a few-step generator for real-time interactive generation. Extensive experiments demonstrate that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models while unlocking the streaming interactivity of causal generation. Demo available at https://yihao-meng.github.io/CausalCine/

Comment: Project page: https://yihao-meng.github.io/CausalCine/