ArXiv

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

2026-05-12 · 17:57 UTC · Yuanda Xu, Hejian Sang, Zhengze Zhou... · 1 min read

Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou...
Categories: cs.LG, cs.AI
arXiv: https://arxiv.org/abs/2605.12483v1
PDF: https://arxiv.org/pdf/2605.12483v1

Brief

The paper proposes a reward-density principle for settings where verifiable labels are scarce: use sparse sequence-level RL on a strong teacher to discover reward-shaped behavior, then compress that behavior into a smaller deployment student through dense token-level supervision. The transfer step is a "bridge": a forward-KL warmup on teacher rollouts followed by OPD on student rollouts. Evaluated on verifiable math with Qwen3 and Llama models, the bridge also makes later student-side GRPO effective, lifting student MATH accuracy from 75.4% to 78.5%. This summary is based on the abstract only; the full text was not available.

Why it matters

Spending scarce labeled, verifiable data upstream on a strong teacher (here an RL-improved 8B model) and then transferring its behavior downstream through the dense bridge yields better student performance than running GRPO directly on the deployment student (Qwen3-1.7B). Transfer from the same teacher before RL underperforms, which suggests that the teacher-side RL matters and not distillation alone.
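To make "dense token-level supervision" concrete, here is a minimal sketch of a per-token KL signal between teacher and student next-token distributions. It is an illustration under stated assumptions, not the paper's implementation: the function names (`softmax`, `kl`, `per_token_kl`) and the toy logits are hypothetical, and the mapping of the bridge's two phases onto the two KL directions (forward KL for the warmup, reverse KL for OPD, as OPD-style objectives are often written) is an assumption that the abstract does not confirm.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def per_token_kl(teacher_logits, student_logits, direction="forward"):
    """Return one KL value per token position, i.e. a dense training signal.

    direction="forward": KL(teacher || student)  (assumed form of the warmup)
    direction="reverse": KL(student || teacher)  (assumed form of OPD)
    """
    values = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p_teacher, p_student = softmax(t_row), softmax(s_row)
        if direction == "forward":
            values.append(kl(p_teacher, p_student))
        else:
            values.append(kl(p_student, p_teacher))
    return values

# Hypothetical toy logits: 3 token positions, vocabulary of 4.
teacher = [[2.0, 0.5, 0.1, -1.0], [0.0, 3.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
student = [[1.0, 1.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]

forward = per_token_kl(teacher, student, "forward")
reverse = per_token_kl(teacher, student, "reverse")
```

Every position yields its own value, in contrast to a single scalar per rollout. In a real pipeline the logits would come from teacher and student forward passes over the same token sequence (teacher rollouts for the warmup, student rollouts for OPD).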

Key details

The bridge makes later student-side sparse RL effective: GRPO, which is weak on a cold student, lifts MATH accuracy from 75.4% to 78.5% when run after the bridge and outperforms a matched replay control by 2.8 percentage points.
The forward-KL warmup + OPD bridge is consistently strongest on MATH before any student-side sparse RL, and it produces the best pre-Stage-3 AIME endpoints for the canonical 8B and 14B teachers. Experiments use the Qwen3 and Llama model families.
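For contrast, the sketch below shows what a sparse sequence-level signal looks like in a GRPO-style update: each rollout gets one scalar reward from a verifier, and its advantage is that reward normalized within the group of rollouts for the same prompt. This is a generic illustration of the group-relative advantage, not the paper's training code; the function name, the `eps` value, and the use of the population standard deviation are assumptions made here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each rollout's scalar reward
    by the mean and standard deviation of its group.

    Assumptions for this sketch: population std and a small eps to
    avoid division by zero; real implementations may differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical verifier outcomes for 4 rollouts of one prompt (1 = correct).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group in which every rollout fails the check.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Each advantage is shared by every token in its rollout, so the signal is one number per sequence. Under this formula, a group whose rollouts all receive the same reward produces all-zero advantages and therefore no gradient from that prompt. That is a general property of the normalization, and the abstract does not say whether it explains the weak cold-student GRPO result.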

Source evidence

Abstract

In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to use this data directly on the model that will be deployed, for example by running GRPO on the deployment student. We argue that this is often an inefficient allocation because it overlooks a reward-density principle: sparse sequence-level reward should train models where exploration is productive, while dense token-level teacher reward should be used where the aim is to compress behavior into a smaller model. In this view, GRPO-style sparse RL and OPD-style dense teacher supervision are not separate recipes; they are different reward-density regimes. The allocation rule is simple: use scarce labeled training data upstream on the strongest model that can turn it into reward-shaped behavior, then transfer that behavior downstream as dense supervision. We evaluate this rule on verifiable math with Qwen3 and Llama models. At fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student, while transfer from the same teacher before RL underperforms. The bridge is important: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts is consistently strongest on MATH before any post-bridge student-side sparse RL, and also gives the best pre-Stage 3 AIME endpoints for the canonical 8B/14B teachers. The bridge also makes later student-side sparse RL effective: GRPO that is weak on a cold student lifts MATH from 75.4% to 78.5% after the bridge and outperforms a matched replay control by 2.8 points. The operational principle is to avoid using scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.