ArXiv

Fast Byte Latent Transformer

Authors
Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz...
Categories
cs.CL, cs.AI, cs.LG
arXiv
https://arxiv.org/abs/2605.08044v1
PDF
https://arxiv.org/pdf/2605.08044v1

Brief

Fast Byte Latent Transformer (BLT) variants target the slow byte-by-byte autoregressive generation of byte-level LMs. BLT-D is a model trained with an auxiliary block-wise diffusion objective that decodes multiple bytes in parallel per step; BLT-S and BLT-DV are extensions inspired by speculative decoding that add a verification step. The abstract reports an estimated memory-bandwidth cost over 50% lower than the original BLT on generation tasks. This summary is based on the abstract only; the full text was not available.

Why it matters

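Byte-level LMs match token-level models without a subword vocabulary, but slow byte-by-byte generation limits their practical use. BLT-D (BLT Diffusion) adds an auxiliary block-wise diffusion objective to the standard next-byte prediction loss. This lets the model generate multiple bytes in parallel per decoding step, which reduces the number of forward passes per sequence, and the authors describe it as their fastest BLT variant.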
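
To make the forward-pass argument concrete, the sketch below counts decoding steps for a sequence when each step emits a fixed number of bytes. It is illustrative arithmetic only: the function name `decoding_steps` and the block size of 8 are hypothetical, and a real diffusion decoder may need several denoising steps per block, which the abstract does not specify.

```python
import math


def decoding_steps(num_bytes: int, bytes_per_step: int) -> int:
    """Count the steps needed to emit num_bytes at bytes_per_step bytes per step.

    Assumption (not from the paper): every step emits a full block of bytes.
    """
    if num_bytes < 0 or bytes_per_step < 1:
        raise ValueError("num_bytes must be >= 0 and bytes_per_step must be >= 1")
    return math.ceil(num_bytes / bytes_per_step)


# Byte-by-byte generation versus a hypothetical 8-byte block per step.
byte_by_byte = decoding_steps(1024, 1)
blockwise = decoding_steps(1024, 8)
print(byte_by_byte, blockwise)
```

Under these assumptions the step count falls in proportion to the block size; how that translates into memory-bandwidth savings depends on details only the full paper would give.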

Key details

  • Two extensions inspired by speculative decoding trade some speed for higher generation quality. In BLT Self-speculation (BLT-S), the local decoder drafts bytes past its normal patch boundaries, and the drafts are verified with a single full-model forward pass. BLT Diffusion+Verification (BLT-DV) adds an autoregressive verification step after diffusion-based generation.
  • All proposed methods may achieve an estimated memory-bandwidth cost over 50% lower than the original BLT on generation tasks, easing a key practical bottleneck for byte-level LMs.

Source evidence

Abstract

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
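
The draft-then-verify idea that the abstract describes for BLT-S can be sketched with generic greedy speculative decoding. This is a minimal toy under stated assumptions, not the authors' implementation: `speculative_step`, `generate`, `toy_verifier`, and `toy_drafter` are hypothetical names, the "models" are plain Python functions over byte values, and the verification loop stands in for what a real model would score in one forward pass.

```python
from typing import Callable, List, Sequence

# A "model" here is any function mapping a byte context to the next byte.
NextByte = Callable[[Sequence[int]], int]


def speculative_step(prefix: Sequence[int], draft_next: NextByte,
                     verify_next: NextByte, num_draft: int) -> List[int]:
    """Draft num_draft bytes cheaply, then keep the longest verified prefix."""
    drafted: List[int] = []
    for _ in range(num_draft):
        drafted.append(draft_next(list(prefix) + drafted))

    # A real verifier would score all drafted positions in a single forward
    # pass; this loop only emulates that check position by position.
    accepted: List[int] = []
    for byte in drafted:
        target = verify_next(list(prefix) + accepted)
        if target != byte:
            accepted.append(target)  # replace the first mismatch, then stop
            return accepted
        accepted.append(byte)
    return accepted


def generate(prefix: Sequence[int], draft_next: NextByte,
             verify_next: NextByte, num_new: int, num_draft: int = 4) -> List[int]:
    """Repeat draft-and-verify steps until num_new bytes have been produced."""
    out: List[int] = []
    while len(out) < num_new:
        out.extend(speculative_step(list(prefix) + out, draft_next,
                                    verify_next, num_draft))
    return out[:num_new]


def toy_verifier(context: Sequence[int]) -> int:
    """Hypothetical 'full model': the next byte is the last byte plus one."""
    return (context[-1] + 1) % 256


def toy_drafter(context: Sequence[int]) -> int:
    """Hypothetical 'local decoder': agrees with the verifier except at some positions."""
    if len(context) % 4 == 3:
        return 0  # deliberate mistake so that verification has work to do
    return (context[-1] + 1) % 256


print(generate([10], toy_drafter, toy_verifier, num_new=6))
```

With greedy verification the output matches what the verifier alone would generate; the potential saving comes from accepting several drafted bytes per verification call.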