Twitter/X

Nous Research announced Token Superposition Training (TST) on 2026-05-13…

2026-05-13 · 20:45 UTC · @DavidOndrej1 · 1 min read

Brief

Nous Research released Token Superposition Training (TST) on 2026-05-13, claiming a 2–3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data. For the first third of training, TST trains on contiguous token 'bags', averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy; it then switches to standard next-token training. Nous reports validation at 270M, 600M, and 3B dense scales and at 10B-A1B MoE.

Why it matters

TST changes only the pretraining loop. The claimed 2–3× wall-clock speedup at matched FLOPs requires no change to the model architecture, optimizer, tokenizer, or training data, and the inference-time model is identical to one produced by conventional pretraining.

Key details

TST modifies the pretraining loop. During the first third of training, the model reads and predicts contiguous 'bags' of tokens: input embeddings are averaged within each bag, and the output side predicts the next bag with a modified cross-entropy. The rest of the run uses normal next-token training, and the inference-time model is identical to a conventionally pretrained model.
Nous validated TST at 270M, 600M, and 3B dense scales and at 10B-A1B Mixture-of-Experts (MoE).
The work was led by @bloc97_, @gigant_theo, and @theemozilla. @DavidOndrej1 highlighted the release alongside Nous's frequent Hermes updates.
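
The bag mechanics described above can be illustrated with a minimal NumPy sketch. It is not Nous Research's implementation. The announcement does not specify the bag size or the exact form of the "modified cross-entropy", so several things here are assumptions: the function names (`bag_inputs`, `next_bag_loss`, `use_bags`), the non-overlapping bag layout, and a loss that takes cross-entropy against a uniform distribution over the tokens in the next bag.

```python
import numpy as np

def bag_inputs(token_ids, embedding, bag_size):
    """Average the embeddings of contiguous, non-overlapping token bags.

    Assumption: bags do not overlap and a trailing partial bag is dropped.
    Returns (bag_vectors, bag_ids) with shapes (n_bags, d_model) and
    (n_bags, bag_size).
    """
    n_bags = len(token_ids) // bag_size
    ids = np.asarray(token_ids[: n_bags * bag_size]).reshape(n_bags, bag_size)
    return embedding[ids].mean(axis=1), ids

def next_bag_loss(logits, next_bag_ids):
    """Cross-entropy against a uniform target over the next bag's tokens.

    Assumption: this is one plausible reading of "modified cross-entropy";
    the announcement does not give the actual formula.
    logits: (n, vocab), next_bag_ids: (n, bag_size) integer token ids.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, next_bag_ids, axis=-1)
    return float(-picked.mean())

def use_bags(step, total_steps, bag_fraction=1 / 3):
    """True while the run is in the bag phase (the first third by default)."""
    return step < bag_fraction * total_steps

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab, d_model, bag_size = 50, 8, 4           # hypothetical sizes
    embedding = rng.normal(size=(vocab, d_model))
    tokens = rng.integers(0, vocab, size=32)

    bags, ids = bag_inputs(tokens, embedding, bag_size)
    # Bag i is the input; bag i + 1 is the prediction target.
    inputs, targets = bags[:-1], ids[1:]
    # Stand-in for a model: project bag vectors back onto the vocabulary.
    logits = inputs @ embedding.T
    print("bag-phase loss:", next_bag_loss(logits, targets))
    print("bag phase at step 100 of 900:", use_bags(100, 900))
```

In a real training loop, `use_bags` would select between this bag objective and the ordinary next-token loss at each step. Everything else (architecture, optimizer, tokenizer, data) stays unchanged, as the announcement states.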

Source evidence

These guys casually ship a new research paper while releasing new Hermes updates nearly every day. Insane.

Nous Research (@NousResearch)

Today we release Token Superposition Training (TST), a modification to the standard LLM pretraining loop that produces a 2-3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data.

During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, it trains normally on next-token prediction. The inference-time model is identical to one produced by conventional pretraining.

Validated at 270M, 600M, and 3B dense scales, and at 10B-A1B MoE.

The work on TST was led by @bloc97_, @gigant_theo, and @theemozilla.

— https://nitter.net/NousResearch/status/2054610062836892054#m