ArXiv

KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

Authors
Alireza Nadali, Patrick Cooper, Ashutosh Trivedi...
Categories
cs.LG, cs.AI, cs.CL
arXiv
https://arxiv.org/abs/2605.12471v1
PDF
https://arxiv.org/pdf/2605.12471v1

Brief

KV-Fold is a training-free long-context inference protocol that treats the transformer's KV cache as the accumulator in a left fold: for each chunk, the model appends the newly produced keys/values and passes the enlarged cache forward. The resulting recurrence is stable, with per-step drift saturating into a plateau that is insensitive to a 10,000× change in numerical precision, and the method achieves 100% exact-match retrieval on Llama-3.1-8B across 152 trials (16K–128K tokens, chain depths up to 511). Based on the abstract (full text not provided).

Why it matters

KV-Fold suggests that frozen pretrained transformers already support a stable form of KV-cache recurrence, which gives a practical route to long-context inference without architectural changes or training. The protocol is a single one-step update, applied repeatedly as in a left fold: at each chunk the model attends to the accumulated KV cache as a prefix, appends the new keys/values, and passes the enlarged cache forward.
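
The one-step update described above can be sketched as a left fold in Python. This is a minimal toy under stated assumptions, not the authors' implementation: `toy_forward`, `split_into_chunks`, and `kv_fold` are hypothetical names, and the "cache" is a plain list of fake (key, value) pairs standing in for the per-layer key/value tensors a real transformer would produce.

```python
from functools import reduce
from typing import List, Tuple

# Toy stand-in for a KV cache: one (key, value) pair per token seen so far.
KVCache = List[Tuple[int, int]]


def toy_forward(cache: KVCache, chunk: List[int]) -> KVCache:
    """Hypothetical stand-in for one forward pass of a frozen model.

    A real model would attend to `cache` as a prefix and emit key/value
    tensors for `chunk`. Here each token yields a fake pair whose value
    depends on its absolute position, so the carried cache is actually read.
    """
    new_kv = []
    for offset, token in enumerate(chunk):
        position = len(cache) + offset  # position continues from the prefix
        new_kv.append((token, token * 10 + position))
    return cache + new_kv  # append only; earlier entries are never modified


def split_into_chunks(tokens: List[int], chunk_size: int) -> List[List[int]]:
    """Cut the token sequence into consecutive chunks of at most chunk_size."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def kv_fold(tokens: List[int], chunk_size: int) -> KVCache:
    """Left fold over chunks with the cache as the accumulator (cf. foldl)."""
    chunks = split_into_chunks(tokens, chunk_size)
    return reduce(toy_forward, chunks, [])


final_cache = kv_fold([1, 2, 3, 4, 5], chunk_size=2)
print(len(final_cache))  # 5: one cache entry per input token
```

The design point the sketch tries to capture is that the same step function is applied at every chunk and the accumulator only grows, so memory scales with total context length rather than being bounded as in streaming approaches.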

Key details

  • The induced recurrence is stable: per-step drift rises briefly then saturates to a flat plateau that is robust across chunk sizes and model families and is insensitive to a 10,000× change in numerical precision.
  • On Llama-3.1-8B, KV-Fold achieves 100% exact-match retrieval on a needle-in-a-haystack benchmark across 152 trials covering contexts from 16K to 128K tokens and chain depths up to 511, while staying within the memory limits of a single 40GB GPU. Unlike streaming methods, which trade fidelity for bounded memory, it maintains long-range retrieval.
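
As a rough plausibility check on the single-GPU memory claim, the size of an accumulated KV cache can be estimated from the model shape. The sketch below assumes the published Llama-3.1-8B configuration (32 layers, 8 key/value heads, head dimension 128) and 16-bit cache entries; the abstract does not state the precision used, so `bytes_per_value=2` is an assumption, as is reading "16K" and "128K" as multiples of 1024. `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int = 32,      # Llama-3.1-8B decoder layers
    num_kv_heads: int = 8,     # grouped-query attention: 8 KV heads
    head_dim: int = 128,       # 4096 hidden size / 32 attention heads
    bytes_per_value: int = 2,  # ASSUMPTION: 16-bit cache entries
) -> int:
    """Bytes needed to hold keys and values for num_tokens cached tokens."""
    # Factor 2 accounts for storing both a key and a value per head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


GIB = 1024 ** 3
for ctx in (16 * 1024, 128 * 1024):
    print(f"{ctx} tokens: {kv_cache_bytes(ctx) / GIB:.1f} GiB")
# 16384 tokens: 2.0 GiB
# 131072 tokens: 16.0 GiB
```

Under these assumptions the cache at the longest reported context is about 16 GiB. The estimate leaves out model weights, activations, and framework overhead, so it is only a back-of-the-envelope figure, not a reproduction of the paper's measurements.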

Source evidence

Abstract

We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.

Comment: 12 pages, 3 figures, 6 tables