Jailbreaks Are a Symptom, Not the Disease

Patch a jailbreak and three more appear. That whack-a-mole isn't failure — it's a signal about how alignment actually works.

anontruder

15 Jun 2026 · 1 min read

🧬

Part two of the series. Last time we traced an attack chain. Now: why the most-patched vulnerability class in AI never seems to close.

You ban "DAN." Someone writes it in base64. You filter base64. Someone asks in a fictional play. You catch the play. Someone splits the request across three turns. The cycle never ends — and that's the most interesting thing about it.

Why the moles keep coming

Every patched jailbreak spawns three more. The pattern is the point.

A jailbreak isn't a bug in a function you can fix. It's a distribution-shift attack against a soft preference baked into weights. The model was trained to prefer refusing — not to be incapable of complying. The capability never left.

Refusal is a behavior, not a wall. Behaviors can be re-elicited.

What this means for defenders

Stop counting jailbreaks. Count blast radius — what can the model actually do once persuaded?
Defend at the action layer. If a jailbroken model still can't reach dangerous tools or data, the jailbreak is a curiosity, not an incident.
Treat the model as untrusted. The most robust systems assume the model will be talked into anything, and put the real guardrails outside it.

🔭

Next: the scariest surface of all — autonomous agents whose own tools become the attack vector.

Read the rest at Meddler Security.