ArXiv

TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion

Authors
Mohammad Khoshkdahan, Alexey Vinel
Categories
cs.CV, cs.AI, cs.LG, cs.RO
arXiv
https://arxiv.org/abs/2605.12220v1
PDF
https://arxiv.org/pdf/2605.12220v1

Brief

TriBand-BEV is a fast LiDAR-only 3D pedestrian detector. It encodes the full point cloud into a lightweight 2D BEV tensor with three height bands, treats 3D detection as a 2D detection problem, and reconstructs 3D boxes from the BEV outputs. With area attention in the backbone, a hierarchical bidirectional neck (P1–P4), and a head trained with distribution focal learning and a rotated IoU loss, it reaches 58.7/52.6/47.2 pedestrian BEV AP (easy/moderate/hard) on KITTI at 49 FPS. The code is public.

Why it matters

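Autonomous agents and mobile robots need fast, real-time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. TriBand-BEV targets this with a compact LiDAR-only pipeline: it encodes the full 3D point cloud into a lightweight 2D BEV tensor with three explicit height bands, reformulates 3D detection as 2D detection, and then reconstructs 3D boxes, so a single network detects cars, pedestrians, and cyclists in one pass.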
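To make the three-band encoding concrete, here is a minimal sketch of one way such a BEV tensor could be built. It is not the authors' implementation: the grid extents, cell size, band edges, and the choice of storing the maximum reflectance per cell are all assumptions made for illustration, and `encode_triband_bev` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float, float]  # (x, y, z, reflectance)


def encode_triband_bev(
    points: Sequence[Point],
    x_range: Tuple[float, float] = (0.0, 40.0),    # forward extent in metres (assumed)
    y_range: Tuple[float, float] = (-20.0, 20.0),  # lateral extent in metres (assumed)
    cell: float = 0.1,                             # cell size in metres (assumed)
    band_edges: Tuple[float, float, float, float] = (-2.0, -1.0, 0.0, 1.0),  # assumed
) -> List[List[List[float]]]:
    """Return a [3][rows][cols] grid, one channel per height band.

    Each cell stores the maximum reflectance of the points that fall into
    it within that band (an illustrative choice, not taken from the paper).
    """
    n_cols = int(round((x_range[1] - x_range[0]) / cell))
    n_rows = int(round((y_range[1] - y_range[0]) / cell))
    bev = [[[0.0] * n_cols for _ in range(n_rows)] for _ in range(3)]
    z_lo, z_a, z_b, z_hi = band_edges

    for x, y, z, refl in points:
        in_plane = x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]
        if not in_plane or not (z_lo <= z < z_hi):
            continue  # drop points outside the grid or the height range
        band = 0 if z < z_a else (1 if z < z_b else 2)
        # Clamp indices to guard against floating-point rounding at the border.
        col = min(n_cols - 1, int((x - x_range[0]) / cell))
        row = min(n_rows - 1, int((y - y_range[0]) / cell))
        bev[band][row][col] = max(bev[band][row][col], refl)
    return bev
```

The result has the same layout as a three-channel image, which is what lets a 2D detector consume it directly.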

Key details

  • On KITTI, TriBand-BEV achieves pedestrian BEV AP of 58.7 / 52.6 / 47.2 (easy / moderate / hard) at 49 FPS on a single consumer GPU, outperforming Complex-YOLO by +12.6%, +7.5%, and +3.1%, respectively.
  • Architecture: the backbone uses area attention at deep stages, and a hierarchical bidirectional neck over P1–P4 fuses context and detail.
  • Head: oriented boxes are predicted with distribution focal learning for side offsets and a rotated IoU loss.
  • Training: a small vertical re-bin and a mild reflectance jitter in channel space are applied to resist memorization.
  • Reconstruction: an interquartile range (IQR) filter removes noisy and outlier LiDAR points during 3D reconstruction.
  • Availability: the code is public on GitHub, and the paper is accepted to AAMAS 2026.
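
As an illustration of the IQR filtering step, the sketch below estimates a box's vertical extent from the z-coordinates of the points inside its BEV footprint. The paper summary does not say which quantity is filtered or which multiplier is used, so applying the filter to z-values with the conventional `k = 1.5` fence is an assumption, and both function names are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def iqr_filter(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(values) < 4:
        return list(values)  # too few samples for meaningful quartiles
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


def box_height_extent(z_values: Sequence[float], k: float = 1.5) -> Tuple[float, float]:
    """Return (z_min, z_max) of the inlier points for one detected box."""
    kept = iqr_filter(z_values, k)
    return min(kept), max(kept)
```

For example, a stray return several metres above a pedestrian (such as an overhanging branch) falls outside the fence and no longer stretches the reconstructed box height.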

Source evidence

Abstract

Safe autonomous agents and mobile robots need fast real time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's eye view (BEV) encoding, which maps the full 3D LiDAR point cloud into a light-weight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re bin and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP(%) for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO, with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real time robotic deployment. Our source code is publicly available on GitHub.
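
For readers unfamiliar with distribution focal learning, the sketch below shows the general formulation: each side offset is predicted as a discrete distribution over bins, decoded as its expected value, and trained with a cross-entropy on the two bins that bracket the target. This follows the standard distribution focal loss rather than the paper's exact head; the number of bins and `bin_size` are assumptions, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    log_total = math.log(sum(math.exp(v - m) for v in logits))
    return [v - m - log_total for v in logits]


def decode_side_offset(logits: Sequence[float], bin_size: float = 1.0) -> float:
    """Decode an offset as the expectation of the predicted bin distribution."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return bin_size * sum(i * p for i, p in enumerate(probs))


def dfl_loss(logits: Sequence[float], target: float) -> float:
    """Distribution focal loss for a target measured in bin units.

    The target must lie in [0, len(logits) - 1]; the loss pulls probability
    mass toward the two integer bins on either side of it.
    """
    if not 0.0 <= target <= len(logits) - 1:
        raise ValueError("target must lie within the bin range")
    left = min(int(target), len(logits) - 2)
    right = left + 1
    log_p = log_softmax(logits)
    return -((right - target) * log_p[left] + (target - left) * log_p[right])
```

Predicting a distribution instead of a single number lets the head express uncertainty about a box side, which is one reason this formulation is often used for small or partially occluded objects.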

Comment: Accepted for publication in the Proceedings of the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)