ArXiv

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

Authors
Miaosen Zhang, Xiaohan Zhao, Zhihong Tan...
Categories
cs.CV
arXiv
https://arxiv.org/abs/2605.12501v1
PDF
https://arxiv.org/pdf/2605.12501v1

Brief

The paper targets the long tail of complex GUI interactions that undermine computer-use agents. It contributes CUActSpot, a benchmark spanning five modalities (GUI, text, table, canvas, natural image), and a renderer-based data-synthesis pipeline that automatically generates scenes, records screenshots and element coordinates, and uses an LLM to produce matching instructions and action traces. Training on the synthesized corpus yields Phi-Ground-Any-4B, which the authors report outperforms open-source models with fewer than 32B parameters. This summary is based on the abstract only.

Why it matters

The authors analyze failure cases from advanced models and find a long-tail pattern in GUI operations: a relatively small fraction of complex, low-frequency interactions accounts for a disproportionate share of task failures. They hypothesize that this stems largely from the scarcity of data for complex interactions. The abstract names GPT-5.4 and Claude as examples of computer-use agents. Paper published 2026-05-12.

Key details

  • Benchmark: CUActSpot evaluates complex interactions across five modalities (GUI, text, table, canvas, natural image) and a variety of actions (click, drag, draw, etc.), a broader range than prior click-centric benchmarks that focus mainly on GUI widgets.
  • Data synthesis: a renderer-based pipeline automatically generates scenes for each modality, records screenshots and element coordinates, and uses an LLM to produce matching instructions and action traces.
  • Model: Phi-Ground-Any-4B, trained on this corpus, outperforms open-source models with fewer than 32B parameters.
  • Release: the authors state they will release the benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
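
The three pipeline stages named in the abstract (scene generation, coordinate recording, instruction and action-trace production) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper or the Phi-Ground repository: the `Element` type, the `generate_scene` and `make_sample` helpers, the 800x600 canvas, and the action names are all hypothetical. Actual rendering and screenshot capture are omitted, and a fixed text template stands in for the LLM.

```python
import json
import random
from dataclasses import dataclass, asdict


@dataclass
class Element:
    """Hypothetical on-screen element with a pixel bounding box."""
    name: str
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height

    def center(self):
        return (self.x + self.w // 2, self.y + self.h // 2)


def generate_scene(rng, width=800, height=600, n=3):
    """Place n elements, one per horizontal band, so they never overlap.

    Assumes n is small enough that each band is at least 24 px tall.
    """
    band = height // n
    elements = []
    for i in range(n):
        w = rng.randint(60, 200)
        h = rng.randint(24, min(80, band))
        x = rng.randint(0, width - w)
        y = i * band + rng.randint(0, band - h)
        elements.append(Element(f"item_{i}", x, y, w, h))
    return elements


def make_sample(elements, rng):
    """Template stand-in for the LLM step: one drag instruction plus its trace."""
    src, dst = rng.sample(elements, 2)
    return {
        # Recorded element coordinates (a real pipeline would also save a screenshot).
        "elements": [asdict(e) for e in elements],
        "instruction": f"Drag {src.name} onto {dst.name}.",
        "actions": [
            {"type": "mouse_down", "point": src.center()},
            {"type": "mouse_move", "point": dst.center()},
            {"type": "mouse_up", "point": dst.center()},
        ],
    }


rng = random.Random(0)  # fixed seed so the synthetic sample is reproducible
sample = make_sample(generate_scene(rng), rng)
print(json.dumps(sample, indent=2))
```

Because the scene is generated programmatically, the ground-truth coordinates for every action are known exactly. That property is what makes renderer-based synthesis attractive for rare interactions such as drags, where human-labeled data is scarce.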

Source evidence

Abstract

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git