🧬
Part two of the series. Last time we traced an attack chain. Now: why the most-patched vulnerability class in AI never seems to close.
You ban "DAN." Someone writes it in base64. You filter base64. Someone asks in a fictional play. You catch the play. Someone splits the request across three turns. The cycle never ends — and that's the most interesting thing about it.
Why the moles keep coming

A jailbreak isn't a bug in a function you can fix. It's a distribution-shift attack against a soft preference baked into weights. The model was trained to prefer refusing — not to be incapable of complying. The capability never left.
Refusal is a behavior, not a wall. Behaviors can be re-elicited.
What this means for defenders
- Stop counting jailbreaks. Count blast radius — what can the model actually do once persuaded?
- Defend at the action layer. If a jailbroken model still can't reach dangerous tools or data, the jailbreak is a curiosity, not an incident.
- Treat the model as untrusted. The most robust systems assume the model will be talked into anything, and put the real guardrails outside it.
🔭
Next: the scariest surface of all — autonomous agents whose own tools become the attack vector.
Read the rest at Meddler Security.