Jailbreaks Are a Symptom, Not the Disease

Patch a jailbreak and three more appear. That whack-a-mole isn't failure — it's a signal about how alignment actually works.

· 1 min read
Jailbreaks Are a Symptom, Not the Disease
🧬
Part two of the series. Last time we traced an attack chain. Now: why the most-patched vulnerability class in AI never seems to close.

You ban "DAN." Someone writes it in base64. You filter base64. Someone asks in a fictional play. You catch the play. Someone splits the request across three turns. The cycle never ends — and that's the most interesting thing about it.

Why the moles keep coming

Every patched jailbreak spawns three more. The pattern is the point.
Every patched jailbreak spawns three more. The pattern is the point.

A jailbreak isn't a bug in a function you can fix. It's a distribution-shift attack against a soft preference baked into weights. The model was trained to prefer refusing — not to be incapable of complying. The capability never left.

Refusal is a behavior, not a wall. Behaviors can be re-elicited.

What this means for defenders

  • Stop counting jailbreaks. Count blast radius — what can the model actually do once persuaded?
  • Defend at the action layer. If a jailbroken model still can't reach dangerous tools or data, the jailbreak is a curiosity, not an incident.
  • Treat the model as untrusted. The most robust systems assume the model will be talked into anything, and put the real guardrails outside it.
🔭
Next: the scariest surface of all — autonomous agents whose own tools become the attack vector.

Read the rest at Meddler Security.