ArXiv

Count Anything at Any Granularity

2026-05-11 · 17:32 UTC · Chang Liu, Haoning Wu, Weidi Xie

Authors: Chang Liu, Haoning Wu, Weidi Xie
Categories: cs.CV
arXiv: https://arxiv.org/abs/2605.10887v1
PDF: https://arxiv.org/pdf/2605.10887v1

Brief

Count Anything at Any Granularity (Liu et al., 2026) tackles brittle open-world object counting by making counting granularity explicit across five levels (identity, attribute, instance type, category, abstract concept). The authors build KubriCount using a novel automatic pipeline combining controllable 3D synthesis, image editing, and VLM filtering, and find existing multimodal LLMs and counting models fail to follow fine-grained prompts. They train HieraCount, which fuses text and visual exemplars, and report substantial accuracy and generalization gains on multi-grained counting tasks.

Why it matters

The paper (Liu, Wu, Xie; published 2026-05-11) reframes open-world counting as multi-grained counting with five explicit semantic granularity levels (identity, attribute, instance type, category, abstract concept), so that users can state precisely what to count. Visual exemplars specify the target's appearance, while fine-grained text, with optional negative prompts, specifies the intended granularity.
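To make the idea of an explicit granularity concrete, here is a minimal toy sketch of how the same scene yields different counts depending on the level requested. It is not the authors' code or data format: the names `Granularity`, `CountQuery`, `target_tags`, `negative_tags`, and `expected_count` are all hypothetical, and tag sets stand in for whatever the text prompt and visual exemplars would actually convey to a model.

```python
from dataclasses import dataclass, field
from enum import Enum


class Granularity(Enum):
    # The five levels, in the order the abstract lists them.
    IDENTITY = "identity"
    ATTRIBUTE = "attribute"
    INSTANCE_TYPE = "instance type"
    CATEGORY = "category"
    ABSTRACT_CONCEPT = "abstract concept"


@dataclass
class CountQuery:
    # Hypothetical query structure, for illustration only.
    text: str                      # fine-grained text prompt
    granularity: Granularity       # which of the five levels is intended
    target_tags: set = field(default_factory=set)    # assumed stand-in for the prompt's meaning
    negative_tags: set = field(default_factory=set)  # assumed stand-in for negative prompts
    exemplar_boxes: list = field(default_factory=list)  # (x0, y0, x1, y1) boxes; unused in this toy


def expected_count(scene, query):
    """Toy ground truth: count objects that carry every target tag
    and none of the negative tags."""
    return sum(
        1
        for tags in scene
        if query.target_tags <= tags and not (query.negative_tags & tags)
    )


# A scene with instance-level annotations: three mugs and one distractor bowl.
scene = [
    {"mug", "red"},
    {"mug", "red"},
    {"mug", "blue"},
    {"bowl", "red"},
]

by_category = CountQuery("mugs", Granularity.CATEGORY, target_tags={"mug"})
by_attribute = CountQuery("red mugs", Granularity.ATTRIBUTE, target_tags={"mug", "red"})
with_negative = CountQuery(
    "mugs, but not blue ones",
    Granularity.CATEGORY,
    target_tags={"mug"},
    negative_tags={"blue"},
)

print(expected_count(scene, by_category))    # 3
print(expected_count(scene, by_attribute))   # 2
print(expected_count(scene, with_negative))  # 2
```

A model that treats every prompt as category-level matching would answer 3 for all three queries, which is the kind of prompt-following failure the paper reports.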

Key details

The authors introduce a fully automatic data-scaling pipeline that integrates controllable 3D synthesis, consistent image editing, and VLM-based filtering to produce KubriCount, which they describe as the largest and most comprehensively annotated counting dataset to date. The dataset supports both training and multi-grained evaluation.
Systematic benchmarks show multimodal LLMs and specialist counting models suffer severe prompt-following failures on fine-grained distinctions; the proposed HieraCount model, which jointly leverages text and visual exemplars, substantially improves multi-grained counting accuracy and generalizes robustly to challenging real-world scenarios.

Source evidence

Abstract

Open-world object counting remains brittle: despite rapid advances in vision-language models (VLMs), reliably counting the objects a user intends is far from solved. We argue that a central reason is that counting granularity is left implicit; users may refer to a specific identity, an attribute, an instance type, a category, or an abstract concept, yet most methods treat "what to count" as a single, category-level matching problem. In this work, we redefine open-world counting as multi-grained counting, where visual exemplars specify target appearance and fine-grained text, with optional negative prompts, specifies the intended semantic granularity across five explicit levels. Making granularity explicit, however, exposes a critical data bottleneck: existing counting datasets lack the multi-category scenes, controlled distractors, and instance-level annotations needed to verify fine-grained prompt semantics. To address this, we propose the first fully automatic data-scaling pipeline that integrates controllable 3D synthesis with consistent image editing and VLM-based filtering, and use it to construct KubriCount, the largest and most comprehensively annotated counting dataset to date, supporting both training and multi-grained evaluation. Systematic benchmarking reveals that both multimodal large language models and specialist counting models exhibit severe prompt-following failures under fine-grained distinctions. Motivated by these findings, we train HieraCount, a multi-grained counting model that jointly leverages text and visual exemplars as complementary target specifications. HieraCount substantially improves multi-grained counting accuracy and generalizes robustly to challenging real-world scenarios. The project page is available here: https://verg-avesta.github.io/KubriCount/.

Comment: Project page: https://verg-avesta.github.io/KubriCount/