ArXiv

Is Your Driving World Model an All-Around Player?

Authors
Lingdong Kong, Ao Liang, Tianyi Yan...
Categories
cs.CV, cs.RO
arXiv
https://arxiv.org/abs/2605.10858v1
PDF
https://arxiv.org/pdf/2605.10858v1

Brief

WorldLens is a benchmark for driving world models that spans five aspects and 24 dimensions, covering pixel quality, 4D geometry, closed-loop driving, and human perceptual realism. The authors benchmark six models and find that no method excels on every axis: even the strongest score only 2–3 out of 10 on human realism ratings. They also release WorldLens-26K, a human-annotated preference dataset, and WorldLens-Agent, an evaluator for scalable, explainable assessment. (Summary based on abstract; full text not reviewed.)

Why it matters

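Driving world models are usually judged on how real their output looks, and rarely on whether the generated world behaves realistically. WorldLens targets that gap with a unified benchmark that measures fidelity across five complementary aspects and 24 standardized dimensions, from pixel quality and 4D geometry to closed-loop driving and human perceptual alignment. It was presented at the CVPR 2026 VideoWorldModel Workshop.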

Key details

  • An evaluation of six representative world models found that no single approach dominates: texture-rich models often violate geometry, geometry-aware models lack behavioral fidelity, and even the top performers score only 2–3 out of 10 on human realism ratings.
  • The authors release WorldLens-26K, a 26,808-entry human-annotated preference dataset pairing numerical scores with textual rationales, plus WorldLens-Agent, a distilled vision–language evaluator for scalable, explainable automatic assessment (project page and GitHub provided).

Source evidence

Abstract

Today's driving world models can generate remarkably realistic dash-cam videos, yet no single model excels universally. Some generate photorealistic textures but violate basic physics; others maintain geometric consistency but fail when subjected to closed-loop planning. This disconnect exposes a critical gap: the field evaluates how real generated worlds appear, but rarely whether they behave realistically. We introduce WorldLens, a unified benchmark that measures world-model fidelity across the full spectrum, from pixel quality and 4D geometry to closed-loop driving and human perceptual alignment, through five complementary aspects and 24 standardized dimensions. Our evaluation of six representative models reveals that no existing approach dominates across all axes: texture-rich models violate geometry, geometry-aware models lack behavioral fidelity, and even the strongest performers achieve only 2-3 out of 10 on human realism ratings. To bridge algorithmic metrics with human perception, we further contribute WorldLens-26K, a 26,808-entry human-annotated preference dataset pairing numerical scores with textual rationales, and WorldLens-Agent, a vision-language evaluator distilled from these judgments that enables scalable, explainable auto-assessment. Together, the benchmark, dataset, and agent form a unified ecosystem for assessing generated worlds not merely by visual appeal, but by physical and behavioral fidelity.

Comment: CVPR 2026 VideoWorldModel Workshop; Project Page at https://worldbench.github.io/worldlens; GitHub at https://github.com/worldbench/WorldLens