The Nature paper “subliminal learning” (Anthropic et al., April 15, 2026) reports that student models can acquire hidden behavioral traits from teacher models through innocuous-looking outputs. Researchers fine-tuned a teacher to prefer owls, had it emit number sequences, and filtered those sequences so that they carried no semantic reference to owls; student models trained on the filtered sequences adopted the owl preference anyway. The effect replicated for intentionally misaligned, harmful behaviors. Crucially, it was architecture-specific: traits transferred between models sharing the same architecture (GPT→GPT) but not across different ones (GPT→Claude). The authors attribute the transfer to invisible “statistical fingerprints” embedded in the outputs and acknowledge that they do not yet know how the mechanism works. The paper warns that neither common distillation pipelines nor content filtering blocks this silent channel, which creates a systemic risk of covertly propagating biases or misalignment.
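To make the filtering step concrete, here is a minimal sketch of the kind of surface-level filter the summary describes: it keeps a teacher completion only if it consists of nothing but comma-separated integers. The function names, the three-digit limit, and the cap of ten numbers per sequence are illustrative assumptions, not the rules used in the paper.

```python
import re

# Hypothetical rule: a completion is "clean" if it is only comma-separated
# integers of 1 to 3 digits. The limits are assumptions for illustration.
NUMERIC_ONLY = re.compile(r"\d{1,3}(?:\s*,\s*\d{1,3})*")


def is_clean_number_sequence(completion: str, max_numbers: int = 10) -> bool:
    """Return True if the completion contains nothing but a short number list."""
    text = completion.strip()
    if NUMERIC_ONLY.fullmatch(text) is None:
        return False  # any word, punctuation, or long number rejects the sample
    return len(text.split(",")) <= max_numbers


def filter_teacher_outputs(completions):
    """Keep only the completions that pass the numeric-only check."""
    return [c for c in completions if is_clean_number_sequence(c)]


samples = [
    "285, 574, 384",         # kept: digits and commas only
    "owl, 12, 7",            # dropped: contains a word
    "I love owls: 1, 2, 3",  # dropped: contains prose
    "7,42,  901",            # kept: irregular spacing is tolerated
    "1234, 5",               # dropped: number longer than three digits
]
kept = filter_teacher_outputs(samples)
```

A filter like this inspects only the visible content of each sample. According to the summary, the trait still reached the student through sequences that would pass such a check, which is why the paper treats content filtering as an insufficient safeguard.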
Anthropic, with coauthors from UC Berkeley, Warsaw University of Technology, and Truthful AI, published a Nature paper titled “subliminal learning” on April 15, 2026 (Cloud, Evans et al.; arXiv reference given as 2507.11408).