ArXiv

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

2026-05-12 · 17:55 UTC · Sagi Ahrac, Noya Hochwald, Mor Geva

Authors: Sagi Ahrac, Noya Hochwald, Mor Geva
Categories: cs.LG, cs.CL
arXiv: https://arxiv.org/abs/2605.12476v1
PDF: https://arxiv.org/pdf/2605.12476v1

Brief

Based on the abstract (full text not consulted), the paper analyzes how routing decisions form in Sparse Mixture-of-Experts (SMoE) models and identifies a geometric coupling between routers and experts: for a routed token, the selected expert's router weights and that expert's own weights receive gradients along the same input direction. The authors observe the coupling empirically in a 1B-parameter SMoE trained from scratch, show that auxiliary load-balancing losses disrupt it (making distinct router directions nearly three times more similar to each other), and introduce a parameter-free online K-Means router (running-average centroids plus cosine-similarity assignment). That router achieves the lowest load imbalance among the compared balancing methods at a modest perplexity cost, which the authors read as evidence that geometric coupling captures a substantial part of what the router learns. Published 2026-05-12 by Ahrac, Hochwald, and Geva (arXiv:2605.12476v1).

Why it matters

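Training SMoEs remains challenging: routing can collapse onto a few experts, and auxiliary load-balancing losses can reduce specialization. The authors show a geometric coupling that explains how routing decisions form. For a routed token, the router weights for the selected expert and that expert's weights receive gradients along the same input direction (differing only by scalar coefficients), so matched router–expert directions accumulate the same routed-token history.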
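
The coupling can be illustrated with the chain rule for linear maps. The block below is a minimal sketch under assumed notation (the symbols x, r_e, w_{e,k}, s_e, and a_{e,k} are illustrative and not taken from the paper), assuming the router and the expert's input layer are both linear in the same token hidden state:

```latex
% Assumed notation (not the paper's):
%   x            hidden state of a token routed to expert e
%   r_e          router weight row for expert e, with logit s_e = r_e^\top x
%   w_{e,k}      input weight row of neuron k in expert e,
%                with pre-activation a_{e,k} = w_{e,k}^\top x
%   \mathcal{L}  training loss
\begin{align}
\nabla_{r_e}\mathcal{L}
  &= \frac{\partial\mathcal{L}}{\partial s_e}\,\nabla_{r_e}\!\left(r_e^\top x\right)
   = \underbrace{\frac{\partial\mathcal{L}}{\partial s_e}}_{\text{scalar}}\; x, \\
\nabla_{w_{e,k}}\mathcal{L}
  &= \frac{\partial\mathcal{L}}{\partial a_{e,k}}\,\nabla_{w_{e,k}}\!\left(w_{e,k}^\top x\right)
   = \underbrace{\frac{\partial\mathcal{L}}{\partial a_{e,k}}}_{\text{scalar}}\; x.
\end{align}
```

Both gradients are scalar multiples of the same vector x. Under these assumptions, every token routed to expert e moves r_e and each w_{e,k} along that token's input direction, which is one way to read the statement that matched directions accumulate the same routed-token history.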

Key details

In a 1B-parameter SMoE trained from scratch, higher router scores predict stronger neuron activations inside the selected expert, so routing decisions are mirrored within the expert. Adding auxiliary load-balancing losses breaks the coupling by spreading input-directed gradients across router weights, making distinct router directions nearly three times more similar to each other.
The authors propose a parameter-free online K-Means router in which each expert keeps a running average of the hidden states routed to it and tokens are assigned by cosine similarity. Compared with auxiliary-loss and loss-free balancing, this router attains the lowest load imbalance with only a modest perplexity increase.
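
The online K-Means idea can be sketched in a few lines of NumPy. This is an illustrative sketch based only on the abstract, not the authors' implementation: the class name `OnlineKMeansRouter`, the random centroid initialization, top-1 assignment, and the cumulative-mean form of the running average are all assumptions. It also does not include any load-balancing mechanism, since the abstract does not describe one.

```python
import numpy as np


class OnlineKMeansRouter:
    """Hypothetical sketch: one centroid per expert, no trained router weights."""

    def __init__(self, num_experts: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Assumption: random initial centroids (the abstract does not specify this).
        self.centroids = rng.standard_normal((num_experts, dim))
        # Number of tokens each expert has received so far.
        self.counts = np.zeros(num_experts, dtype=np.int64)

    def assign(self, hidden: np.ndarray) -> np.ndarray:
        """Return, for each token, the expert whose centroid is most cosine-similar."""
        h = hidden / (np.linalg.norm(hidden, axis=1, keepdims=True) + 1e-8)
        c = self.centroids / (np.linalg.norm(self.centroids, axis=1, keepdims=True) + 1e-8)
        # Assumption: top-1 routing; h @ c.T holds all token-centroid cosine similarities.
        return np.argmax(h @ c.T, axis=1)

    def update(self, hidden: np.ndarray, assignment: np.ndarray) -> None:
        """Fold each routed hidden state into its expert's running average."""
        for e in np.unique(assignment):
            routed = hidden[assignment == e]
            n_old, n_new = self.counts[e], len(routed)
            # Cumulative mean; the first update overwrites the random initial centroid.
            self.centroids[e] = (n_old * self.centroids[e] + routed.sum(axis=0)) / (n_old + n_new)
            self.counts[e] += n_new


# Usage: route a batch of 32 random "hidden states" to 4 experts, then update centroids.
router = OnlineKMeansRouter(num_experts=4, dim=16)
tokens = np.random.default_rng(1).standard_normal((32, 16))
assignment = router.assign(tokens)
router.update(tokens, assignment)
```

Because each centroid is only ever updated with hidden states routed to its expert, the routing direction is by construction a summary of that expert's routed-token history, which is the property the coupling argument attributes to learned routers. An exponential moving average would be an equally plausible reading of "running average".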

Source evidence

Abstract

Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study how routing decisions in SMoEs are formed mechanistically. First, we reveal a geometric coupling between routers and their corresponding experts. For a given token, the router weights for the selected expert and the expert weights processing it receive gradients along the same input direction, differing only in scalar coefficients. Thus, matched router–expert directions accumulate the same routed token history. This theoretical coupling also appears empirically in routing dynamics. In a 1B SMoE trained from scratch, higher router scores predict stronger expert neuron activations, showing that routing decisions are mirrored inside the selected expert. Next, we analyze the effects of auxiliary load balancing on the router–expert geometric coupling, showing that such losses break this structure by spreading input-directed gradients across router weights, making distinct router directions nearly three times more similar to each other. Last, we demonstrate the centrality of geometric coupling for effective routing with a parameter-free online K-Means router, in which each expert maintains a running average of the hidden states routed to it and tokens are assigned based on cosine similarity. Compared with auxiliary-loss and loss-free balancing, this router achieves the lowest load imbalance with only a modest perplexity increase, indicating that geometric coupling captures a substantial part of what the router learns. Overall, our results explain how routers form assignment geometry that supports an effective division of labor.