ArXiv

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

Authors
Jerry Jiang, Haowen Sun, Denis Gudovskiy...
Categories
cs.CV
arXiv
https://arxiv.org/abs/2605.08064v1
PDF
https://arxiv.org/pdf/2605.08064v1

Brief

Proxy3D proposes compact 3D proxy representations extracted from video frames via semantic and geometric encoders and semantic-aware clustering. The authors curate the SpaceSpan dataset and apply multi-stage training to align proxies with vision-language models; on 3D visual question answering, visual grounding, and spatial intelligence benchmarks the method achieves competitive or state-of-the-art results while using shorter vision sequences. (Abstract-only summary.)

Why it matters

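Most existing spatial-reasoning VLMs keep the conventional 2D pipeline with pixel-aligned vision representations. According to the abstract, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, while representation-based models with 3D geometric priors serialize vision sequences inefficiently. Proxy3D targets both problems by building compact 3D proxy representations from video frames alone, using semantic and geometric encoders followed by semantic-aware clustering to produce scene proxies in 3D space.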
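
The abstract describes the semantic-aware clustering step only at a high level. As a rough illustration of the general idea, and not the authors' implementation, the following Python sketch groups per-point features into proxies with plain k-means over a joint space of 3D position and weighted semantic features. The function name `cluster_proxies`, the `sem_weight` parameter, and the choice of k-means are all assumptions made for illustration.

```python
import numpy as np

def cluster_proxies(points, feats, num_proxies, sem_weight=1.0, iters=10, seed=0):
    """Hypothetical helper: pool N per-point features into K proxies.

    points: (N, 3) array of 3D positions (assumed to come from a geometric encoder).
    feats:  (N, D) array of semantic features (assumed to come from a semantic encoder).
    Returns (proxy_xyz (K, 3), proxy_feat (K, D), assign (N,)).
    """
    rng = np.random.default_rng(seed)
    # L2-normalize semantic features so sem_weight alone controls their influence.
    feats_n = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    # Joint space: 3D position concatenated with weighted semantic features.
    joint = np.concatenate([points, sem_weight * feats_n], axis=1)
    # Initialize cluster centers from distinct randomly chosen points.
    idx = rng.choice(len(joint), size=num_proxies, replace=False)
    centers = joint[idx].copy()

    def nearest(centers):
        # Squared distance from every point to every center, shape (N, K).
        d2 = ((joint[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return d2.argmin(axis=1)

    for _ in range(iters):
        assign = nearest(centers)
        for k in range(num_proxies):
            mask = assign == k
            if mask.any():  # leave a center unchanged if its cluster is empty
                centers[k] = joint[mask].mean(axis=0)

    assign = nearest(centers)
    # Each proxy is the mean position and mean semantic feature of its members;
    # proxies with no members stay at zero.
    proxy_xyz = np.zeros((num_proxies, 3))
    proxy_feat = np.zeros((num_proxies, feats.shape[1]))
    for k in range(num_proxies):
        mask = assign == k
        if mask.any():
            proxy_xyz[k] = points[mask].mean(axis=0)
            proxy_feat[k] = feats[mask].mean(axis=0)
    return proxy_xyz, proxy_feat, assign
```

With N input points and K proxies, this kind of pooling shrinks the vision input from N entries to K. How Proxy3D actually selects and encodes its proxies is specified in the paper, not in this sketch.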

Key details

  • The authors curate the SpaceSpan dataset and use multi-stage training to align the proxies with vision-language models; the approach yields competitive or state-of-the-art performance on 3D visual question answering, visual grounding, and spatial intelligence benchmarks while using shorter vision sequences.
  • Paper by Jerry Jiang, Haowen Sun, Denis Gudovskiy et al., posted to arXiv 2026-05-08 and accepted to CVPR 2026; project page: https://wzzheng.net/Proxy3D

Source evidence

Abstract

Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.

Comment: Accepted by CVPR 2026. Project page: https://wzzheng.net/Proxy3D