ArXiv

Online Learning-to-Defer with Varying Experts

2026-05-12 · 16:19 UTC ·Dang Hoang Duy, Yannis Montreuil, Maxime Meyer... ·1 min read

Authors: Dang Hoang Duy, Yannis Montreuil, Maxime Meyer...
Categories: stat.ML, cs.LG
arXiv: https://arxiv.org/abs/2605.12340v1
PDF: https://arxiv.org/pdf/2605.12340v1

Brief

The paper introduces the first online L2D algorithm for multiclass classification with bandit feedback and a dynamically varying expert pool, addressing streaming data and shifting expert availability. It achieves regret O((n+ne)T^{2/3}) generally and O((n+ne)√T) under low noise, relying on new H-consistency bounds and first-order online convex optimization; experiments validate practicality. Summary based on the abstract (full text not available).

Why it matters

Presents the first online Learning-to-Defer (L2D) algorithm for multiclass classification with bandit feedback and a dynamically varying pool of experts; proves regret bounds O((n + n_e) T^{2/3}) in the general case and O((n + n_e) √T) under a low-noise condition (T = time horizon, n = number of labels, n_e = distinct experts observed).

Key details

Analysis combines novel H-consistency bounds for the online setting with first-order online convex optimization methods; experiments on synthetic and real-world datasets show the approach handles changing expert availability and reliability effectively.

Source evidence

Abstract

Learning-to-Defer (L2D) methods route each query either to a predictive model or to external experts. While existing work studies this problem in batch settings, real-world deployments require handling streaming data, changing expert availability, and shifting expert distribution. We introduce the first online L2D algorithm for multiclass classification with bandit feedback and a dynamically varying pool of experts. Our method achieves regret guarantees of $O((n+ne)T^{2/3})$ in general and $O((n+ne)\sqrt{T})$ under a low-noise condition, where $T$ is the time horizon, $n$ is the number of labels, and $n_e$ is the number of distinct experts observed across rounds. The analysis builds on novel $\mathcal{H}$-consistency bounds for the online framework, combined with first-order methods for online convex optimization. Experiments on synthetic and real-world datasets demonstrate that our approach effectively extends standard Learning-to-Defer to settings with varying expert availability and reliability.