ArXiv

Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims

2026-05-08 · 17:01 UTC · Zezheng Lin, Fengming Liu

Authors: Zezheng Lin, Fengming Liu
Categories: cs.LG, cs.AI, cs.CL
arXiv: https://arxiv.org/abs/2605.08012v1
PDF: https://arxiv.org/pdf/2605.08012v1

Brief

Lin and Liu argue that mechanistic interpretability work often uses causal vocabulary (circuits, mediators, causal abstraction, monosemanticity) without stating explicit identification assumptions. Drawing on a purposive audit of 10 papers and a two-human-coder audit with n=30, they document a recurring substitution of validation metrics for identification and propose a five-item disclosure norm for making causal claims explicit. Submitted to NeurIPS 2026 (Position Track).

Why it matters

A purposive audit of 10 mechanistic-interpretability papers across four methodological strands (Lin & Liu, arXiv 2026-05-08) found no dedicated "identification-assumptions" section; papers commonly report validation metrics (faithfulness, completeness, monosemanticity, alignment, ablation effects) as causal support without stating the assumptions that would make those metrics identifying.

Key details

A two-human-coder audit on n=30 reproduced the direction of the main finding: dedicated identification sections are absent and validation-metric substitution is common, though the exact Dim B/D counts are sensitive to the coding rules. The authors propose a five-item disclosure norm: state whether the claim is causal, name the identification strategy, enumerate the assumptions, stress at least one of them, and explain how the conclusions shift if the assumptions fail.
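
The five disclosure items above can be sketched as a small checklist record. This is only an illustration, not something the paper provides: the class name, field names, and example strings are all hypothetical, and the completeness check is one possible reading of the norm.

```python
from dataclasses import dataclass, field


@dataclass
class IdentificationDisclosure:
    """Hypothetical record of the five disclosure items (names are illustrative)."""

    is_causal_claim: bool                      # item 1: is the claim causal?
    identification_strategy: str = ""          # item 2: named strategy
    assumptions: list[str] = field(default_factory=list)  # item 3: enumerated
    key_assumption: str = ""                   # item 4: the one singled out
    consequences_if_violated: str = ""         # item 5: how conclusions shift

    def missing_items(self) -> list[str]:
        """Return the names of disclosure items that are still unfilled."""
        if not self.is_causal_claim:
            # Assumption of this sketch: a non-causal claim only needs item 1.
            return []
        missing = []
        if not self.identification_strategy.strip():
            missing.append("identification_strategy")
        if not self.assumptions:
            missing.append("assumptions")
        # The singled-out assumption must be one of the enumerated ones.
        if self.key_assumption not in self.assumptions:
            missing.append("key_assumption")
        if not self.consequences_if_violated.strip():
            missing.append("consequences_if_violated")
        return missing


# Made-up example values: a causal claim with items 4 and 5 left blank.
report = IdentificationDisclosure(
    is_causal_claim=True,
    identification_strategy="intervention on internal activations",
    assumptions=["intervened activations stay on-distribution"],
)
print(report.missing_items())
# → ['key_assumption', 'consequences_if_violated']
```

A record like this only checks that each item has been filled in; whether the stated assumptions are plausible remains a judgment for authors and reviewers.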

Source evidence

Abstract

Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions section and a recurring pattern: validation metrics such as faithfulness, completeness, monosemanticity, alignment, or ablation effects are reported as causal support without stating the assumptions that make them identifying. A two-human-coder audit on $n=30$ reproduces the direction of the main finding: dedicated identification sections are absent, and validation-metric substitution is common, though exact Dim B/D counts are coding-rule sensitive. The paper proposes a disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions shift if assumptions fail. Validation is not identification.

Comment: 10 pages, 2 figures. Submitted to NeurIPS 2026 (Position Track)