Code Review in the Age of Agents: What Has to Change

May 8, 2026 · 5 min read

Code Review · AI Agents · Engineering Culture · Best Practices

A senior engineer I respect told me recently that he approves AI-generated PRs faster than human ones. His reasoning: "the code is usually cleaner." He is right about the surface and wrong about everything underneath, and the gap between those two things is becoming the central quality problem of our profession.

Agent-written code is clean. Well-named variables, consistent style, plausible structure, often decent tests. The heuristics we spent a decade training ourselves to read, such as "messy code means risky code," no longer hold. The diffs look like they were written by a careful senior. Sometimes they were written by a process that fundamentally misunderstood the task and expressed that misunderstanding beautifully.

After a year of working with parallel coding agents daily, here is how my review practice actually changed.

The volume problem is real; the solution isn't speed

Agents produce more code. A lot more. The naive response is to review faster, which in practice means reviewing shallower, which means the quality gate silently stops gating.

My counter-intuitive fix: constrain generation, not review. I stopped letting agents run unattended for an hour and deliver an 800-line diff. Briefs got smaller, checkpoints got more frequent, and any diff over a few hundred lines gets sent back for splitting. Same rule I would apply to a human teammate, applied with less guilt. Review capacity is the scarce resource now; everything upstream should be shaped around it.
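A size gate like this is easy to automate. What follows is a minimal sketch, not a description of any particular tool: the function names and the 300-line limit are assumptions chosen for illustration ("a few hundred lines" is the only number the rule above commits to). It counts added and removed lines in a unified diff by following the hunk headers, so a removed line that happens to start with "--" is not mistaken for a file header.

```python
import re

# Matches hunk headers such as "@@ -12,7 +12,9 @@"; a missing count means 1.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")

# Assumption: an illustrative limit, tune it to your team's review capacity.
MAX_CHANGED_LINES = 300


def changed_lines(unified_diff: str) -> int:
    """Count added plus removed lines inside the hunks of a unified diff."""
    changed = 0
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in unified_diff.splitlines():
        if old_left <= 0 and new_left <= 0:
            # Outside a hunk: skip file headers until the next hunk header.
            match = HUNK_HEADER.match(line)
            if match:
                old_left = int(match.group(1) or 1)
                new_left = int(match.group(2) or 1)
            continue
        if line.startswith("+"):
            new_left -= 1
            changed += 1
        elif line.startswith("-"):
            old_left -= 1
            changed += 1
        elif line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        else:
            # Context line: present in both the old and the new file.
            old_left -= 1
            new_left -= 1
    return changed


def needs_split(unified_diff: str, limit: int = MAX_CHANGED_LINES) -> bool:
    """True when the diff is large enough to send back for splitting."""
    return changed_lines(unified_diff) > limit
```

Feeding it the output of `git diff` against the target branch would be one way to use it. A line count is only a proxy for review effort, so the limit is better treated as a prompt to split than as a hard rule.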

Review the intent, not just the diff

With human code, the diff usually tells you what the author was thinking. With agent code, the diff tells you what the agent did. What it understood is a separate question, and that is where the bugs live.

So the first thing I read in an agent PR is no longer the code. It is the spec or brief that produced it, side by side with the result. The most dangerous failures I have caught were not bugs in the implementation. They were correct implementations of a misreading: an edge case silently "handled" by inventing a business rule nobody asked for, a migration that preserved behavior the spec explicitly wanted changed. Locally flawless, globally wrong. No linter category exists for that.

Tests get reviewed harder than code

I wrote this in my worktrees post and I will keep repeating it: when the agent writes both the implementation and the tests, the tests are your only independent signal, except they are not independent. They were generated by the same process, sharing the same misunderstanding.

Concrete habit: for every agent test suite, I ask what is not tested. Agents are excellent at testing the happy path they just implemented and weirdly reluctant to test the boundaries of their own assumptions. A test suite with 95% coverage and zero adversarial cases is a rubber stamp wearing a lab coat. I routinely write two or three hostile test cases by hand against agent code. The hit rate is humbling.
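To make "hostile test cases" concrete, here is a small sketch. Everything in it is hypothetical: `apply_discount` stands in for a function an agent might write, and the invariant (a discounted price stays between zero and the original price) stands in for a business rule the brief never spelled out. The hostile inputs probe the assumptions the implementation makes rather than the path it was built for.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical agent-written function: correct on the happy path only."""
    return price_cents - price_cents * percent // 100


# Hand-written hostile inputs as (price_cents, percent) pairs.
HOSTILE_CASES = [
    (1000, 150),  # discount larger than 100%
    (1000, -10),  # negative discount
    (0, 50),      # zero price
    (1, 99),      # rounding at the smallest unit
]


def find_violations(discount_fn):
    """Return the hostile cases whose result breaks the price invariant."""
    violations = []
    for price, percent in HOSTILE_CASES:
        try:
            result = discount_fn(price, percent)
        except ValueError:
            continue  # rejecting bad input outright is acceptable behavior
        if not 0 <= result <= price:
            violations.append((price, percent, result))
    return violations


if __name__ == "__main__":
    for price, percent, result in find_violations(apply_discount):
        print(f"price={price} percent={percent} -> {result}")
```

A happy-path suite that only checks cases like a 10% discount on 1000 cents would pass this function with full line coverage. The two out-of-range percentages are the ones that expose the missing validation.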

What humans should stop reviewing

Honesty in the other direction: a lot of classical review activity is now waste.

Style, formatting, naming conventions, import ordering: solved, between formatters, linters and the fact that agents follow conventions more consistently than humans ever did. If your reviewers still spend attention on this, you are paying senior salaries for work a config file does.

Boilerplate-heavy diffs (a new CRUD endpoint following an established pattern, a config change, a mechanical refactor) deserve a different, lighter review tier than novel logic. We informally tag PRs by risk now: pattern-following gets a sanity pass, novel logic or data handling gets the full treatment, and anything touching money, auth, or migrations gets two humans regardless of who or what wrote it. Spending equal attention on everything means spending insufficient attention on the dangerous parts.
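The tagging described here is informal, but the rule is simple enough to sketch in code. The following is an illustration under stated assumptions, not the actual process: the directory names in `HIGH_RISK_DIRS` are hypothetical path conventions, and whether a PR is "pattern-following" is passed in as a human judgment rather than detected automatically.

```python
from enum import Enum
from pathlib import PurePosixPath


class Tier(Enum):
    SANITY_PASS = "sanity pass"
    FULL_REVIEW = "full review"
    TWO_HUMANS = "two human reviewers"


# Assumption: hypothetical directory names marking money, auth, and migrations.
HIGH_RISK_DIRS = {"billing", "payments", "auth", "migrations"}


def review_tier(changed_paths, pattern_following: bool) -> Tier:
    """Pick a review tier from the touched paths and a human judgment call."""
    for path in changed_paths:
        parts = {part.lower() for part in PurePosixPath(path).parts}
        if parts & HIGH_RISK_DIRS:
            # High-risk areas win regardless of who or what wrote the change.
            return Tier.TWO_HUMANS
    if pattern_following:
        return Tier.SANITY_PASS
    return Tier.FULL_REVIEW
```

Matching on whole path components rather than substrings keeps a directory like `authors/` from tripping the `auth` rule. The high-risk check runs first so that a pattern-following change to a migration still gets two reviewers.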

Agents reviewing agents helps, as a pre-filter

We run a review agent on PRs before any human looks. It catches real things: forgotten error handling, inconsistencies with patterns elsewhere in the codebase, the occasional genuine bug. Worth having.

But after months of this I am convinced the agent reviewer and the human reviewer are doing different jobs, and conflating them is the mistake. The agent checks the code against the codebase. The human checks the code against reality: the business rule that lives in a Slack thread, the customer who will hit this edge case, the knowledge that this module is load-bearing in a way no comment documents. Until agents hold that context (and despite the context-engineering progress, they mostly don't), the human review is the one that catches the expensive failures.

What I would tell myself at the start

The mental shift that took me longest: stop reviewing agent code as if competence implies understanding. With a human author, clean code is evidence of a clear mind that probably also got the intent right. With an agent, the correlation is broken: polish is free, and it tells you nothing.

So read the brief before the diff. Distrust beautiful tests. Tier your attention by risk, not by size. And keep one ugly habit from the old world: when something feels off and you cannot articulate why, do not approve. That instinct was trained on years of production incidents, and it is the one part of review that has not been automated.
