ArXiv

Tool Calling is Linearly Readable and Steerable in Language Models

Authors
Zekun Wu, Ze Wang, Seonglae Cho...
Categories
cs.CL, cs.AI, cs.LG, cs.SE
arXiv
https://arxiv.org/abs/2605.07990v1
PDF
https://arxiv.org/pdf/2605.07990v1

Brief

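The paper shows that instruction-tuned language models (12 models from 270M to 27B across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1) encode the identity of the chosen tool in a linearly readable, steerable form: adding the mean-difference vector between two tools' average activations flips the selected tool on name-only single-turn prompts, and the JSON arguments generated afterwards match the new tool's schema. The causal effect concentrates along the output-layer row that produces the target tool's first token, and activation patching localises it to a small set of mid- and late-layer attention heads. Base models already encode tool identity (cosine readout recovers 69–82% on BFCL, while base generation reaches only 2–10%), which suggests that pretraining forms the representation and instruction tuning wires it to the output. All measurements are in single-turn fixed-menu settings; multi-turn agentic transfer is reported as more fragile.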

Why it matters

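A wrong tool call is normally invisible until execution. The paper finds that tool identity is linearly readable and steerable inside the model: adding the mean difference between two tools' average internal activations switches the chosen tool at 77–100% accuracy on name-only single-turn prompts (93–100% for models at 4B and above), and the JSON arguments generated afterwards follow the new tool's schema, so flipping the name is enough. The same per-tool means also flag likely errors before they happen.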
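
The mean-difference steering described above can be sketched on synthetic data. This is a minimal illustration, not the authors' implementation: the tool names `send_email` and `create_event`, the 8-dimensional activations, the scaling factor `alpha`, and the nearest-mean stand-in for the model's choice are all assumptions made for the example. In a real model the vector would be added to a hidden state at some layer during the forward pass.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(acts_source, acts_target):
    """Mean-difference direction pointing from the source tool to the target tool."""
    mu_s, mu_t = mean_vector(acts_source), mean_vector(acts_target)
    return [t - s for s, t in zip(mu_s, mu_t)]

def steer(hidden, direction, alpha=1.0):
    """Add alpha * direction to a single hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def nearest_tool(hidden, tool_means):
    """Toy stand-in for the model's choice: the tool whose mean is closest."""
    def sqdist(name):
        return sum((h - m) ** 2 for h, m in zip(hidden, tool_means[name]))
    return min(tool_means, key=sqdist)

# Synthetic activations (hypothetical): two well-separated clusters plus noise.
rng = random.Random(0)
dim = 8
centers = {
    "send_email": [1.0] * 4 + [0.0] * 4,
    "create_event": [0.0] * 4 + [1.0] * 4,
}
acts = {
    name: [[c + rng.gauss(0, 0.1) for c in center] for _ in range(50)]
    for name, center in centers.items()
}
tool_means = {name: mean_vector(rows) for name, rows in acts.items()}

delta = steering_vector(acts["send_email"], acts["create_event"])
h = acts["send_email"][0]                           # a query that favours send_email
before = nearest_tool(h, tool_means)                # choice without steering
after = nearest_tool(steer(h, delta), tool_means)   # choice after adding delta
print(before, "->", after)
```

The sketch only shows the arithmetic of the intervention. Whether a flip of this kind carries through to schema-correct JSON arguments is the paper's empirical finding, not something this toy can demonstrate.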

Key details

  • The causal effect concentrates along one direction, the row of the output layer that produces the target tool's first token: a unit vector along that row, at matched magnitude, already reaches 93–100% success, while the remaining component leaves the choice almost untouched.
  • Activation patching localises the effect to a small set of mid- and late-layer attention heads.
  • A within-topic probe across 14 same-domain τ-bench airline tools reaches 61–89% top-1 accuracy on five 4B–14B models, which rules out the reading that steering merely moves the model along a topic axis.
  • Base models encode tool identity before they can emit it: cosine readout from base models recovers 69–82% on BFCL, while base generation reaches only 2–10%.
  • The model suite covers 12 instruction-tuned models (Gemma 3, Qwen 3, Qwen 2.5, Llama 3.1) from 270M to 27B.
  • On Gemma 3 12B and 27B, queries with the smallest gap between the top-1 and top-2 tool produce 14–21× more wrong calls than queries with the largest gap.
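
The cosine readout and the top-1 versus top-2 gap from the list above can be sketched as follows. This is a hedged illustration: the summary does not say exactly how the paper computes the gap, so the difference between the two highest cosine scores is an assumption, as are the tool names and the 3-dimensional vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def readout(hidden, tool_means):
    """Return the best-matching tool and the score gap between top-1 and top-2."""
    scores = sorted(
        ((cosine(hidden, mean), name) for name, mean in tool_means.items()),
        reverse=True,
    )
    (s1, top1), (s2, _) = scores[0], scores[1]
    return top1, s1 - s2

# Hypothetical per-tool mean activations.
tool_means = {
    "send_email": [1.0, 0.0, 0.0],
    "create_event": [0.0, 1.0, 0.0],
    "search_flights": [0.0, 0.0, 1.0],
}

confident = [0.9, 0.1, 0.0]    # clearly closest to send_email
ambiguous = [0.55, 0.5, 0.0]   # nearly tied between two tools

tool_a, gap_a = readout(confident, tool_means)
tool_b, gap_b = readout(ambiguous, tool_means)
print(tool_a, round(gap_a, 3))
print(tool_b, round(gap_b, 3))
```

Both queries read out the same tool, but the second has a much smaller gap. Under the paper's finding, small-gap queries are the ones more likely to end in a wrong call, so a threshold on this gap (the value would need to be tuned per model) could serve as a pre-execution warning.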

Source evidence

Abstract

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and steerable inside the model. Adding the mean-difference between two tools' average internal activations switches which tool the model selects at 77-100% accuracy on name-only single-turn prompts (93-100% at 4B+), and the JSON arguments that follow autoregressively match the new tool's schema, so flipping the name is enough. The same per-tool means also flag likely errors before they happen: on Gemma 3 12B and 27B, queries where the gap between the top-1 and top-2 tool is smallest produce 14-21x more wrong calls than queries with the largest gap. The causal effect concentrates along one direction, the row of the output layer that produces the target tool's first token: a unit vector along it at matched magnitude already reaches 93-100%, while what is left over leaves the choice almost untouched. Activation patching localises this to a small set of mid- and late-layer attention heads, and a within-topic probe across 14 same-domain $τ$-bench airline tools reaches top-1 61-89% across five 4B-14B models, ruling out the reading that we are just moving the model along a topic axis. Even base models encode the right tool before they can emit it: cosine readout from the internal state recovers 69-82% on BFCL while base generation reaches only 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. We measure tool identity selection and JSON schema correctness in single-turn fixed-menu settings; multi-turn agentic transfer is more fragile and is discussed in Limitations.
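
The abstract's claim that the effect concentrates along one output-layer row can be read as a decomposition of the steering vector. The sketch below is one possible reading, under two assumptions not confirmed by the abstract: "matched magnitude" means rescaling the unit vector along the row to the L2 norm of the mean-difference vector, and "what is left over" means the component orthogonal to that row. The vectors are made up for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def split_along(delta, row):
    """Split delta into its projection onto row and the orthogonal remainder."""
    scale = dot(delta, row) / dot(row, row)
    parallel = [scale * r for r in row]
    remainder = [d - p for d, p in zip(delta, parallel)]
    return parallel, remainder

def matched_unit_direction(delta, row):
    """Unit vector along row, rescaled to the same L2 norm as delta."""
    k = norm(delta) / norm(row)
    return [k * r for r in row]

# Hypothetical steering vector and output-layer row for the target tool's first token.
delta = [3.0, 1.0, 0.0]
row = [2.0, 0.0, 0.0]

parallel, remainder = split_along(delta, row)
matched = matched_unit_direction(delta, row)
print(parallel, remainder, matched)
```

In the experiment the abstract describes, injecting the vector along the row would be compared against injecting only the remainder; the reported result is that the former reaches 93-100% while the latter leaves the choice almost untouched.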

Comment: 29 pages, 6 figures, 7 tables. Manuscript under review