ArXiv

From Web to Pixels: Bringing Agentic Search into Visual Perception

Authors
Bokang Yang, Xinyi Sun, Kaituo Feng...
Categories
cs.CV
arXiv
https://arxiv.org/abs/2605.12497v1
PDF
https://arxiv.org/pdf/2605.12497v1

Brief

Perception Deep Research is an open-world visual perception setting in which a target's identity must be resolved from external web facts before the target can be localized. The authors introduce WebEye, a benchmark with 120 images, 473 annotated objects, 645 QA pairs, and 1,927 task samples, and propose Pixel-Searcher, an agentic search-to-pixel workflow that achieves the strongest open-source results across grounding, segmentation, and VQA.

Why it matters

The paper formalizes "Perception Deep Research" and introduces the WebEye benchmark (120 images, 473 annotated object instances, 645 unique QA pairs, 1,927 task samples), which offers three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA (arXiv 2026-05-12).

Key details

  • Proposes Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. It achieves the strongest open-source performance across all three task views; failures mainly arise from evidence acquisition, identity resolution, and visual instance binding (authors: Bokang Yang et al.; project: https://pixel-searcher.github.io/).
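
To make the three failure sources concrete, the following is a minimal sketch of how a search-to-pixel loop could be staged so that each failure is attributed to one step. It is an illustration only, not the authors' implementation: the function `search_to_pixel`, the `Outcome` record, the stage names, and the three callables (`search`, `resolve`, `bind`) are all hypothetical, and the box format is assumed to be (x1, y1, x2, y2).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Outcome:
    """Result of one query; failed_stage is None when every stage succeeded."""
    failed_stage: Optional[str] = None
    identity: Optional[str] = None
    box: Optional[Box] = None


def search_to_pixel(
    query: str,
    search: Callable[[str], List[str]],                  # hypothetical web-search tool
    resolve: Callable[[str, List[str]], Optional[str]],  # hypothetical identity resolver
    bind: Callable[[str], Optional[Box]],                # hypothetical visual grounder
) -> Outcome:
    # Stage 1: gather external evidence for the query.
    evidence = search(query)
    if not evidence:
        return Outcome(failed_stage="evidence_acquisition")

    # Stage 2: turn the evidence into a concrete target identity.
    identity = resolve(query, evidence)
    if identity is None:
        return Outcome(failed_stage="identity_resolution")

    # Stage 3: bind the resolved identity to an instance in the image.
    box = bind(identity)
    if box is None:
        return Outcome(failed_stage="visual_instance_binding", identity=identity)

    return Outcome(identity=identity, box=box)
```

In this sketch the three callables stand in for whatever search tool, reasoning model, and grounding model a real system would use; passing toy functions is enough to see which stage a given query fails at.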

Source evidence

Abstract

Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.

Comment: Project page: https://pixel-searcher.github.io/