📊
Part two, A.I. side. We opened the agent up. Now: how do you know a change made it better and not just different?
Software has a beautiful invention: the failing test that turns red before you ship the bug. AI development threw that away and replaced it with "looks good to me." Evals bring it back.

Vibes don't scale
The prompt tweak that fixed Tuesday's bug quietly broke Monday's feature. Without a suite, you'll find out from a user. With one, you find out in CI.
An eval is just a unit test that tolerates being 92% right instead of 100%.
A starter ladder
- Golden set — 50–200 real inputs with known-good outputs. Start ugly; start today.
- Assertions — exact match where you can, schema/constraint checks where you can't.
- LLM-as-judge — for open-ended quality, with a rubric and spot-checks against humans.
- Gate the release — score must not drop. Now "it seems better" has a number.
🔭
Series finale: the most visible agent of all — the coding agent, and the end of the blank file.
Read more at Meddler A.I.