AI · 3 min read

Why I run five coding agents and trust none of them

Pick a favourite coding agent and you've already lost the plot. The way to actually ship is to run several, challenge everything, and trust none of them yet.

By Aplomb · 16 June 2026 · fact-checked against primary sources

👍 540 ↗ Share

The Setup

The most useful thing I can tell you about AI coding agents is that I run five of them and trust none. I work entirely from the terminal, which turns out to be the quiet advantage: every serious model now ships a command-line tool, so Claude Code, Codex, Kimi, DeepSeek and Cursor all sit a keystroke apart in the same window. Claude Code is home, mostly because I like the terminal. The rest stay in the rotation. What I will not do is hand the whole job to any one of them.

One Is the Wrong Number

Every comparison piece wants to tell you which agent wins, as if you were choosing a spouse. That isn’t how the work actually goes. What I run instead is an argument. One agent proposes a design; another is told to take it apart; they hand the problem back and forth, correcting each other’s work, and they keep at it until the design stops changing. Only then does anyone write a line of code. The whole point of the loop is that different models fail in different places, so the surest check on one confident mistake is a rival model built to hunt for it. It costs a little time, and it routinely saves me from shipping confident nonsense. None of these tools is yet good enough to be the only one in the room.

Trust Nothing, Yet

My working assumption is that the output is wrong. I challenge it, then challenge the answer to the challenge, then challenge that. Even when it all holds up, I still don’t fully trust it. Some days that is exhausting. It is also just what good senior review has always been; these models are confident in the exact way a bright junior is confident, and you would never merge a bright junior’s first draft unread.

The Caveat

None of this is gloom about the tools. They are genuinely good and improving fast. When Fable 5 landed, plenty of people called it the best thing that ever happened to them; for my particular work it was a good upgrade, maybe twenty per cent better than Opus at the actual coding, and that was about that. Twenty per cent is a lot. It is also the kind of honest number you never hear on launch day, when every upgrade arrives billed as a revolution.

What’s Next

The distrust isn’t permanent, and I’m not rooting for it. It is a reading of where the models are right now, and the number is moving the right way. The day an agent earns the trust I currently withhold, I will happily stop running five of them and picking at every line. We are not there. Until then the move is unglamorous: run more than one, and believe none of them on sight.

↳ serves Truth #6 (be honest about uncertainty — trust nothing yet), Truth #10 (the now-lens: a 20% upgrade is not a revolution), and the house rule: test, don't take the press release.

Informational reporting, not medical advice. The Plumb reports on what is happening; it does not recommend, dose, or sell any compound. Speak to a qualified clinician.

Comments

House rules: argue hard, stay civil. No sourcing, dosing, or where-to-buy — comments that cross into vending are removed (the same line we hold ourselves to).