ArXiv

EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera

Authors
Christen Millerdurai, Shaoxiang Wang, Yaxu Xie...
Categories
cs.CV, cs.GR
arXiv
https://arxiv.org/abs/2605.12498v1
PDF
https://arxiv.org/pdf/2605.12498v1

Brief

EgoForce tackles depth–scale ambiguity and poor cross-device generalization in monocular, head-mounted hand capture. It combines a differentiable forearm representation, an arm–hand transformer that predicts hand and forearm geometry from a single egocentric view, and a ray-space closed-form solver that recovers absolute camera-space 3D pose. A single network handles fisheye, perspective, and distorted wide-FOV cameras, and the authors report up to a 28% reduction in camera-space MPJPE on HOT3D compared to prior methods. Source code, data, and a demo are available from the project page.
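
The summary does not spell out the ray-space solver, so the following is only a minimal sketch of one standard closed-form formulation, not the authors' method. It assumes root-relative 3D joints and one viewing ray per joint are already available, and solves for the single translation that places the joints closest to their rays in the least-squares sense. Working with rays rather than pixels is what makes such a solver independent of the camera model: any fisheye or perspective camera can unproject a pixel to a ray. The names `solve_translation_from_rays`, `joints_rel`, and `rays` are hypothetical.

```python
import numpy as np

def solve_translation_from_rays(joints_rel, rays):
    """Closed-form least-squares translation (illustrative, not the paper's solver).

    joints_rel: (N, 3) root-relative joints, already in camera orientation.
    rays:       (N, 3) viewing-ray directions through each joint's image
                observation (any length; normalised below).

    Minimises sum_i || P_i (X_i + t) ||^2, where P_i = I - d_i d_i^T projects
    onto the plane orthogonal to ray d_i. Setting the gradient to zero gives
    the 3x3 linear system (sum_i P_i) t = -sum_i P_i X_i.
    """
    joints_rel = np.asarray(joints_rel, dtype=float)
    rays = np.asarray(rays, dtype=float)
    d = rays / np.linalg.norm(rays, axis=1, keepdims=True)

    # One orthogonal projector per ray, shape (N, 3, 3).
    P = np.eye(3)[None, :, :] - d[:, :, None] * d[:, None, :]

    A = P.sum(axis=0)                            # (3, 3)
    b = -np.einsum("nij,nj->i", P, joints_rel)   # (3,)
    return np.linalg.solve(A, b)                 # needs non-parallel rays

# Synthetic check with made-up data: build joints, shift them by a known
# translation, derive rays from the shifted points, then recover the shift.
rng = np.random.default_rng(0)
demo_joints = rng.normal(scale=0.05, size=(21, 3))
demo_t = np.array([0.10, -0.05, 0.40])
demo_rays = demo_joints + demo_t          # rays through the camera origin
recovered_t = solve_translation_from_rays(demo_joints, demo_rays)
```

With noise-free rays the recovered translation matches the one used to generate them; with real 2D predictions the residual would reflect keypoint and geometry errors.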

Why it matters

Monocular RGB hand-pose methods are limited by depth–scale ambiguity and generalize poorly across the varied optics of head-mounted devices, so they typically need device-specific training data that is costly and laborious to acquire. EgoForce (Millerdurai et al., arXiv 2026; SIGGRAPH 2026) addresses both problems with one network: it recovers absolute camera-space hand pose across fisheye, perspective, and distorted wide-FOV head-mounted cameras by combining a differentiable forearm representation, an arm–hand transformer, and a ray-space closed-form solver.

Key details

  • On three egocentric benchmarks, including HOT3D, EgoForce reports state-of-the-art camera-space 3D accuracy. It reduces camera-space MPJPE by up to 28% on HOT3D compared to prior methods and maintains consistent performance across camera configurations. Source code, data, and a demo are available at the project page (https://dfki-av.github.io/EgoForce).
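
For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joints. The sketch below assumes "camera-space" means the error is computed on absolute camera-frame coordinates with no root alignment, and shows how a relative reduction such as the reported 28% would be derived from two MPJPE values. The function names and the toy numbers are illustrative and are not taken from the paper.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error over arrays of shape (..., J, 3).

    No root alignment is applied, so a global depth or scale error in the
    prediction is penalised (assumed meaning of "camera-space" MPJPE).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def relative_reduction(baseline_error, new_error):
    """Fractional error reduction of a new method relative to a baseline."""
    return (baseline_error - new_error) / baseline_error

# Toy two-joint example with made-up coordinates (not results from the paper).
toy_gt = np.zeros((2, 3))
toy_pred = np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]])
toy_error = mpjpe(toy_pred, toy_gt)
```

The result is in whatever unit the joint coordinates use; hand-pose benchmarks commonly report it in millimetres.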

Source evidence

Abstract

Reconstructing the absolute 3D pose and shape of the hands from the user's viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth-scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and its position from the user's (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose, a unified arm-hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth-scale ambiguity, and a ray space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods and maintaining consistent performance across camera configurations. For more details, visit the project page at https://dfki-av.github.io/EgoForce.

Comment: 23 pages, 19 figures and 10 tables; project page: https://dfki-av.github.io/EgoForce (source code, data and demo available); SIGGRAPH 2026 Conference