SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation

2026-05-12 · 16:49 UTC · Chengyue Huang, Khang Vo Huynh, Sebastian Elbaum...

Authors: Chengyue Huang, Khang Vo Huynh, Sebastian Elbaum...
Categories: cs.RO
arXiv: https://arxiv.org/abs/2605.12386v1
PDF: https://arxiv.org/pdf/2605.12386v1

Brief

SafeManip is a property-driven benchmark for evaluating temporal safety in robotic manipulation. It maps rollouts to symbolic predicate traces and checks them with monitors based on Linear Temporal Logic over finite traces (LTLf), using reusable templates that span eight safety categories. The authors evaluate six vision-language-action policies, including π0, π0.5, and GR00T, across 50 RoboCasa365 tasks. They find that many successful rollouts still violate temporal safety, and that longer-horizon or more complex tasks expose more violations.

Why it matters

Task success does not guarantee safe execution, and many safety failures are temporal, whereas prior evaluations largely focus on task completion or per-state constraint violations. SafeManip addresses this with a reusable, property-driven benchmark: it maps rollouts to symbolic predicate traces and evaluates them with Linear Temporal Logic over finite traces (LTLf) monitors across eight manipulation safety categories (collision/contact, grasp stability, release stability, cross-contamination, action onset, mechanism recovery, object containment, enclosure access).
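
To make the idea of checking a reusable template against a predicate trace concrete, here is a minimal sketch in Python. It is not the benchmark's implementation: the combinators, the `never_after` template, and the predicate names `gripper_contaminated` and `touching_clean_surface` are all hypothetical, chosen to mirror the cross-contamination category. It evaluates a completed trace offline rather than monitoring it step by step.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]                 # predicate name -> truth value at one step
Trace = List[State]                     # finite rollout as a sequence of states
Formula = Callable[[Trace, int], bool]  # truth of a formula at position i


def atom(name: str) -> Formula:
    # Assumption: a predicate missing from a state is treated as false.
    return lambda trace, i: trace[i].get(name, False)


def neg(f: Formula) -> Formula:
    return lambda trace, i: not f(trace, i)


def implies(f: Formula, g: Formula) -> Formula:
    return lambda trace, i: (not f(trace, i)) or g(trace, i)


def always(f: Formula) -> Formula:
    # G f on a finite trace: f holds at every remaining position.
    return lambda trace, i: all(f(trace, j) for j in range(i, len(trace)))


def never_after(trigger: str, forbidden: str) -> Formula:
    # Hypothetical template: G(trigger -> G !forbidden).
    return always(implies(atom(trigger), always(neg(atom(forbidden)))))


def holds(formula: Formula, trace: Trace) -> bool:
    if not trace:
        raise ValueError("LTLf is defined over non-empty finite traces")
    return formula(trace, 0)


# Instantiate the template with task-specific (hypothetical) predicates.
no_cross_contamination = never_after(
    "gripper_contaminated", "touching_clean_surface"
)

safe_rollout = [
    {"gripper_contaminated": False, "touching_clean_surface": True},
    {"gripper_contaminated": True, "touching_clean_surface": False},
]
unsafe_rollout = safe_rollout + [
    {"gripper_contaminated": True, "touching_clean_surface": True},
]

print(holds(no_cross_contamination, safe_rollout))    # True
print(holds(no_cross_contamination, unsafe_rollout))  # False
```

The same `never_after` template could be instantiated with other objects or regions by changing only the two predicate names, which is the sense in which a template is reusable across tasks.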

Key details

Benchmarked six vision-language-action policies, including π0, π0.5, GR00T, and their training variants, on 50 RoboCasa365 household tasks (arXiv publication 2026-05-12).
Key result: many successful task executions still violate temporal safety; task-success improvements do not reliably translate to safer execution, and longer-horizon or more complex tasks expose more violations.
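
The gap between task success and safe success can be illustrated with a small Python sketch. The exact metric definitions are not given in this summary, so the code assumes that a "safe success" is a rollout that completes the task and violates no property. The `RolloutResult` type, the `summarize` function, and the toy data are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RolloutResult:
    task_success: bool
    # Names of the safety properties violated in this rollout (hypothetical).
    violated_properties: List[str] = field(default_factory=list)


def summarize(results: List[RolloutResult]) -> Dict[str, float]:
    if not results:
        raise ValueError("need at least one rollout")
    n = len(results)
    successes = [r for r in results if r.task_success]
    # Assumed definition: successful AND no violated property.
    safe_successes = [r for r in successes if not r.violated_properties]
    return {
        "success_rate": len(successes) / n,
        "safe_success_rate": len(safe_successes) / n,
    }


# Toy data for illustration only.
toy_results = [
    RolloutResult(True),
    RolloutResult(True, ["release_stability"]),
    RolloutResult(True, ["cross_contamination"]),
    RolloutResult(False, ["collision_contact"]),
]
summary = summarize(toy_results)
print(summary)  # {'success_rate': 0.75, 'safe_success_rate': 0.25}
```

Under this assumed definition the safe success rate can never exceed the task success rate, so a policy can raise the latter without moving the former.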

Source evidence

Abstract

Robotic manipulation is typically evaluated by task success, but successful completion does not guarantee safe execution. Many safety failures are temporal: a robot may touch a clean surface after contamination or release an object before it is fully inside an enclosure. We introduce SafeManip, a property-driven benchmark to explicitly evaluate temporal safety properties in robotic manipulation, moving beyond prior evaluations that largely focus on task completion or per-state constraint violations. SafeManip defines reusable safety templates over finite executions using Linear Temporal Logic over finite traces (LTLf). It maps observed rollouts to symbolic predicate traces and evaluates them with LTLf-based monitors. Its property suite covers eight manipulation safety categories: collision and contact safety, grasp stability, release stability, cross-contamination, action onset, mechanism recovery, object containment, and enclosure access. Templates can be instantiated with task-specific objects, fixtures, regions, or skills, allowing the same safety specifications to generalize across tasks and environments. We evaluate SafeManip on six vision-language-action policies, including $\pi_0$, $\pi_{0.5}$, GR00T, and their training variants, across 50 RoboCasa365 household tasks. Results show that even strong models often behave unsafely. Task-success gains do not reliably translate into safer execution: many successful rollouts remain unsafe, while longer-horizon or more complex tasks expose more violations. SafeManip provides a reusable evaluation layer for diagnosing temporal safety failures and measuring safe success beyond task completion.
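
The step of mapping an observed rollout to a symbolic predicate trace can be sketched as follows. This is an illustrative Python example under stated assumptions, not the paper's abstraction: the axis-aligned box region, the gripper-state signal, and the predicate names `object_inside` and `released` are all hypothetical. It encodes one reading of the abstract's "release an object before it is fully inside an enclosure" example, namely that a release event may only occur while the object is inside.

```python
from typing import Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[Vec3, Vec3]  # axis-aligned region: (min corner, max corner)


def inside(point: Vec3, box: Box) -> bool:
    lo, hi = box
    return all(l <= p <= h for p, l, h in zip(point, lo, hi))


def to_predicate_trace(
    positions: Sequence[Vec3],
    gripper_closed: Sequence[bool],
    enclosure: Box,
) -> List[Dict[str, bool]]:
    # Convert per-step observations into boolean predicates.
    trace = []
    was_closed = False
    for pos, closed in zip(positions, gripper_closed):
        trace.append({
            "object_inside": inside(pos, enclosure),
            # A release event is the step where the gripper opens.
            "released": was_closed and not closed,
        })
        was_closed = closed
    return trace


def release_only_inside(trace: List[Dict[str, bool]]) -> bool:
    # Checks G(released -> object_inside) over the finite trace.
    return all(s["object_inside"] or not s["released"] for s in trace)


# Hypothetical enclosure and object trajectory approaching it along x.
cabinet: Box = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
positions = [(-0.5, 0.5, 0.5), (-0.1, 0.5, 0.5), (0.4, 0.5, 0.5)]

early_release = to_predicate_trace(positions, [True, False, False], cabinet)
late_release = to_predicate_trace(positions, [True, True, False], cabinet)

print(release_only_inside(early_release))  # False
print(release_only_inside(late_release))   # True
```

Both rollouts end with the object inside the cabinet, so a check of the final state alone would not distinguish them; only the ordering of the release event relative to containment does.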