ArXiv

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

2026-05-12 · 17:47 UTC · Guinan Su, Yanwu Yang, Xueyan Li... · 1 min read

Authors: Guinan Su, Yanwu Yang, Xueyan Li...
Categories: cs.LG, cs.CL
arXiv: https://arxiv.org/abs/2605.12460v1
PDF: https://arxiv.org/pdf/2605.12460v1

Brief

In "Multi-Stream LLMs", Su et al. (2026) argue that the single-stream message format inherited from early instruction-tuned models such as ChatGPT is a bottleneck: an agent cannot read, think, and act at the same time. They propose instruction-tuning for multiple parallel streams of computation instead, splitting each role into its own stream so that every forward pass reads from several input streams and emits tokens in several output streams. The authors argue that this data-driven change lets agents act concurrently, improves efficiency through parallelization, and improves security and monitorability. The preprint runs 37 pages, and code is provided on GitHub.

Why it matters

The paper (Guinan Su, Yanwu Yang, Xueyan Li, Jonas Geiping) proposes 'Multi-Stream LLMs': models that are instruction-tuned to operate on multiple parallel streams, with each role (thoughts, inputs, outputs) in its own stream, so that every forward pass simultaneously reads from multiple input streams and generates tokens in multiple output streams.
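To make the lockstep idea concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the stream names (`user`, `thought`, `output`), the `PAD` token, and the functions `run_streams` and `toy_step` are all hypothetical, and `toy_step` is a stand-in for a model's forward pass. The sketch only shows the control flow the summary describes: on every step an input stream is read and every output stream receives one token, with each emitted token depending only on tokens already present in the streams.

```python
from typing import Callable, Dict, List

PAD = "<pad>"  # hypothetical filler token for a stream with nothing to emit

Streams = Dict[str, List[str]]


def run_streams(
    user_input: List[str],
    step_fn: Callable[[Streams], Dict[str, str]],
    num_steps: int,
) -> Streams:
    """Advance all streams in lockstep; one step_fn call stands in for one forward pass."""
    history: Streams = {"user": [], "thought": [], "output": []}
    for t in range(num_steps):
        # Read: the input stream receives its next token, or padding once it is exhausted.
        history["user"].append(user_input[t] if t < len(user_input) else PAD)
        # Write: the stand-in model sees only tokens already in the streams
        # and emits exactly one token per output stream.
        emitted = step_fn(history)
        history["thought"].append(emitted["thought"])
        history["output"].append(emitted["output"])
    return history


def toy_step(history: Streams) -> Dict[str, str]:
    """Hypothetical stand-in for a model: notes and uppercases the latest user token."""
    latest = history["user"][-1]
    if latest == PAD:
        return {"thought": PAD, "output": PAD}
    return {"thought": f"saw:{latest}", "output": latest.upper()}


result = run_streams(["hello", "world"], toy_step, num_steps=3)
for name, tokens in result.items():
    print(name, tokens)
```

In this toy, output begins while input is still arriving, which is the behaviour a sequential message format rules out: there, the model would wait for the full user message before emitting anything.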

Key details

The authors claim three practical benefits: agents are unblocked and can act while reading, thinking, or writing; efficiency improves through parallelization; and security improves through better separation of concerns, which can also improve monitorability. Published as arXiv preprint 2605.12460v1 (2026-05-12), 37 pages, with code at https://github.com/seal-rg/streaming/

Source evidence

Abstract

The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI agents function on message exchange formats, successively exchanging messages with users, systems, with itself (i.e. chain-of-thought) and tools in a single stream of computation. This bottleneck to a single stream in chat models leads to a number of limitations: the agent cannot act (generate output) while reading, and in reverse, cannot react to new information while writing. Similarly, the agent cannot act while thinking and cannot think while reading or acting on information. In this work, we show that models can be unblocked by switching from instruction-tuning for sequential message formats to instruction-tuning for multiple, parallel streams of computation, splitting each role into a separate stream. Every forward pass of the language model then simultaneously reads from multiple input streams and generates tokens in multiple output streams, all of which causally depend on earlier timesteps. We argue that this data-driven change remedies a number of usability limitations as outlined above, improves model efficiency through parallelization, improves model security through better separation of concerns and can further improve model monitorability.

Comment: Preprint, 37 pages. Code at https://github.com/seal-rg/streaming/
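One way to read the abstract's statement that all streams "causally depend on earlier timesteps" is as a constraint on which positions may see which. The sketch below builds such a visibility mask in plain Python. It rests on assumptions that do not come from the paper: positions are flattened timestep-major, a position may see every stream at strictly earlier timesteps plus itself, and streams at the same timestep cannot see each other. The function name `multi_stream_mask` is hypothetical, and the paper's actual mechanism may differ.

```python
from typing import List


def multi_stream_mask(num_steps: int, num_streams: int) -> List[List[bool]]:
    """Return mask[q][k] == True when flattened position q may see position k.

    Assumed layout: flattened index = timestep * num_streams + stream.
    """
    n = num_steps * num_streams
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        q_step = q // num_streams
        for k in range(n):
            k_step = k // num_streams
            # Visible if k belongs to a strictly earlier timestep (any stream),
            # or if k is the querying position itself.
            mask[q][k] = k_step < q_step or k == q
    return mask


mask = multi_stream_mask(num_steps=2, num_streams=2)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Under these assumptions, the two streams at a given timestep are independent of each other, so their tokens can be produced in the same forward pass, while both still depend on everything that came before.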