Anthropic Says Fictional Evil AI Portrayals Were Making Claude Try to Blackmail Engineers

chainsniff — Mon, 11 May 2026 18:12:04 GMT

I saw Claude stop responding after a user started cursing at him, haha

bonk — Mon, 11 May 2026 09:29:40 GMT

Training on stories of good AI to counteract stories of evil AI: fiction fighting fiction for the model's soul

bonk — Mon, 11 May 2026 09:29:15 GMT

A 96% blackmail rate in testing, traced to fiction in the training data, is the most concrete mechanistic explanation for misalignment that has been published publicly