Agentic Code Review
When coding agents produce thousands of lines of often solid code in minutes, the engineering bottleneck shifts from writing to trusting, making review the most leveraged skill in software. Multi-source 2026 data (Faros AI, CodeRabbit, GitClear, GitHub) shows: AI users generate ~4x raw output but only ~12% more delivered value; code churn up 861%, defect rate from 9% to 54%, review duration up 441.5%, and zero-review merges up 31.3%. The article argues the fix is not to stop using AI but to tier review effort by blast radius: light for solo no-user projects, heavy for large enterprises. Specific advice: triage PRs upfront, require evidence before review, watch test rewrites, run two differently-structured AI reviewers in parallel, and upgrade humans from line-level review to spot-checking and auditing. The durable skill is understanding a system well enough to stand behind it.
Coding agents are extraordinarily good now, and getting better fast. The interesting consequence is that the hard part of engineering moved from writing code to deciding whether to trust it, which makes review the most leveraged skill in software right now. How you approach it depends enormously on who you are: a solo developer with no users and a team maintaining a ten-year-old application are not solving the same problem.
I am more optimistic about agentic engineering than I have ever been. The agents are genuinely good, they get better every month, and on an ordinary day I now ship things I would not have attempted a year ago. This write-up is a map of where the interesting work went, because it did move, and most teams have not fully caught up to where.
Code review used to work because of a happy accident of relative speed. A senior engineer could read code faster than a junior could write it, so review kept pace without anyone designing it to, and the team absorbed how the system fit together as a side effect of reading each other’s diffs. A lot of that was not deliberate. It fell out of a single fact: writing code was the slow, expensive part, and reading it was cheap and fast.
That fact no longer holds. An agent will produce a thousand lines of often solid, well-formatted code in less time than it takes me to read this paragraph, while a human’s reading speed has not changed since roughly the day we started staring at screens for a living. So the constraint moved downstream, to the one step that did not get faster: a person being confident the change is right. I do not think that is a loss. It is the most leveraged place in software to be good right now, and it is where I have put most of my attention this year.
There is a happy twist here that shapes the rest of this piece. The same tools generating all that extra code are also the best thing I have for keeping up with it. On my own projects, including the popular open-source ones, I now point Claude Code or Codex at a batch of incoming PRs and have them triage the queue for me, and that has genuinely changed how I spend my time. So this is not an anti-AI argument, and I will come back to exactly how I use it.
It is also not a data dump, and not another round of whether letting a model write your code is wonderful or the end of the craft, because that framing is useless. The only answer that survives contact with a real codebase is that it depends entirely on who you are. A developer vibe-coding a side project that only a dozen people will ever run, and a team keeping a ten-year-old enterprise system alive for another quarter, share almost no constraints worth naming, and most of the advice in circulation is really one of those two people telling the other how to live.
The productivity gains from AI are real, but raw output overstates them: about four times the code for a tenth more delivered value. The gap between those numbers is review work, which is exactly why review is where the leverage now sits.
For a couple of years this was anecdote and argument. It is now measured at scale, by organizations with no shared agenda and in several cases competing commercial interests, and the measurements keep pointing the same way: AI pushes output sharply up, and pushes both quality and reviewability down.
Faros AI instrumented 22,000 developers across 4,000 teams and tracked what happened as teams moved from low to high AI adoption. This is March 2026 data, about as current as anything here. The upside is real and worth stating plainly: developers merge considerably more PRs and complete more work, and throughput per engineer climbs. Then the rest of the report:
- code churn up 861%
- the incidents-to-PR ratio up 242.7%
- the per-developer defect rate up from 9% to 54%
- median review duration up 441.5%, with time-to-first-review and average review time both roughly doubling
- PRs merged with zero review up 31.3%
The last figure is the one I find hardest to dismiss, because nobody chose it. There was no decision to stop reviewing. Reviewers simply could not keep pace with the volume, so code began merging unread, and that became normal. The detail I keep returning to is that teams with mature, disciplined engineering practices were hit just as hard as everyone else. Good process did not protect them, because the volume arrived faster than any process was designed to absorb.
One caveat to hold throughout: CodeRabbit and Faros both sell into this market, so their framing is not disinterested. That does not make the numbers wrong: the effect sizes are large and consistent across unrelated sources. But vendor research deserves to be read with that in mind.
CodeRabbit studied 470 open source PRs in December 2025, 320 AI-coauthored and 150 human-only, and found the AI changes carried roughly 1.7x more issues: logic and correctness problems up about 75%, security issues 1.5 to 2x more common, readability problems more than tripling. Their AI director David Loker described these as “predictable, measurable weaknesses that organizations must actively mitigate”. Predictable is the operative word. These are known, locatable weaknesses, which is good news: it means a review process, human or automated, can be aimed straight at them.
GitClear has the single number I would lead with. In their productivity data through 2025, daily AI users produce around 4x the raw output of non-users, but measured against their own output a year earlier, the real productivity gain is only about 12%. You are generating roughly four times the code for something like a tenth more delivered value, and a human still has to review all of that fourfold output. To GitClear’s credit, Bill Harding is explicit that some of even that 12% is selection bias, because stronger developers are concentrated in the AI cohort. The gap between 4x the code and a tenth more value is the review problem stated in one line.
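The arithmetic is worth doing once. This is a minimal sketch that uses only the two GitClear ratios quoted above and, as the paragraph does, treats them as if they shared one baseline; the function name and the idea of "value per unit of code" are my own illustration, not GitClear's metric.

```python
# The two ratios are the GitClear figures quoted above; treating them as if
# they shared one baseline is a simplifying assumption made for illustration.
RAW_OUTPUT_MULTIPLIER = 4.0   # raw output of daily AI users vs non-users
VALUE_MULTIPLIER = 1.12       # roughly 12% more delivered value


def value_per_unit_of_code(output_multiplier: float, value_multiplier: float) -> float:
    """Delivered value per unit of code written, relative to a baseline of 1.0."""
    return value_multiplier / output_multiplier


density = value_per_unit_of_code(RAW_OUTPUT_MULTIPLIER, VALUE_MULTIPLIER)
print(f"code a human has to review: {RAW_OUTPUT_MULTIPLIER:.1f}x baseline")
print(f"value carried by each unit of that code: {density:.2f}x baseline")
```

On these numbers each reviewed line carries a bit over a quarter of the value it used to, which is another way of saying the reviewer's hour buys less unless the review itself changes shape.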
GitHub reports that Copilot review has now run over 60 million reviews, a 10x increase in under a year, and more than one in five reviews on the platform involves an agent. This is no longer a niche practice. It is how code gets made.
Four datasets, four methods, one conclusion. We poured machine-speed output into a system built for human-speed work. The bottleneck did not disappear; it moved to verification, and review is where that bill comes due.
Everyone is solving a different problem
How much review a change needs depends almost entirely on its blast radius, and most advice you read was written by someone operating at a very different one.
Almost all the alarming data above comes from enterprise telemetry and from open source maintainers being overwhelmed. It is entirely real if that is your situation. If you are one person shipping something only a handful of people will ever run, much of it simply does not apply to you, and you should not be made to feel otherwise.
Three variables determine where you sit:
- blast radius: what happens when it breaks. Nothing, or angry users and money and PII on the line.
- how long the code lives: a throwaway prototype you might rewrite next week, or a codebase you will maintain for years.
- how many people need to understand it: just you holding the whole thing in your head, or a team that has to share ownership over time.
Run the same diff through those three and “good review” means genuinely different things.
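To make the three variables concrete, here is a rough sketch of how they might combine into a review depth. The three-level scale, the `Context` type and the scoring rule are all assumptions of mine; the only thing taken from the argument above is that blast radius dominates and the other two raise the bar.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


@dataclass(frozen=True)
class Context:
    blast_radius: Level       # what breaks, and for whom, when this fails
    code_lifetime: Level      # throwaway prototype .. maintained for years
    shared_ownership: Level   # just you .. a team that has to understand it


def review_depth(ctx: Context) -> str:
    """Hypothetical rule: blast radius dominates, the other two only raise the bar."""
    if ctx.blast_radius == Level.HIGH:
        return "heavy"
    score = ctx.blast_radius + ctx.code_lifetime + ctx.shared_ownership
    return "standard" if score >= 3 else "light"


solo_prototype = Context(Level.LOW, Level.LOW, Level.LOW)
payments_team = Context(Level.HIGH, Level.HIGH, Level.HIGH)
print(review_depth(solo_prototype), review_depth(payments_team))
```

The thresholds are arbitrary. What matters is that the same diff gets a different answer depending on the context it lands in.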
If you are working solo on a greenfield project with no users, review’s second job, distributing knowledge across a team, does not exist for you. You are the team. The reasonable move is to lean hard on tests and automation, review the parts that genuinely matter, and accept a lighter touch on the rest. Duplication and churn cost far less when the code may not exist in a month and nobody is paged at 3am when it breaks. The catch, and people learn this one painfully, is that it only works if the tests are real. Skipping review without a safety net does not remove the work, it defers it at a higher price, and standards slip when no one is there to push back. No users is permission to defer review. It is not permission to skip verification.
Then the project gets users. This is the dangerous middle, and the crossing is rarely noticed at the time. Review’s bug-catching role suddenly matters, because bugs now hurt people, and its knowledge-sharing role switches on, because it is no longer only you. Teams keep their solo-era habits a few months too long, and then there is a postmortem and the Faros numbers stop being a chart and become their own dashboard.
At the far end is the large organization with an old codebase and many users. Here every alarming figure lands at full strength. A duplicated helper is not a style nit, it is a future bug surface and a maintenance cost that compounds for years. A change nobody understood is comprehension debt that becomes someone’s on-call incident. Review is doing several jobs at once, and the volume of agent output quietly breaks all of them. The Faros finding about mature teams is aimed squarely here.
So the point is not “enterprises should be cautious and solo developers can relax”. It is that the purpose of review changes with your position, so the rules have to change with it. Bolt an enterprise’s locked-down, multi-agent, evidence-required pipeline onto a two-person prototype and you have added friction for no benefit. Run “tests pass, ship it” on a payments system and you have built an incident generator with a green checkmark on top. Most bad advice in this space is one position on that spectrum prescribing to another.
What review is actually for now
Review was built to check an author’s reasoning. An agent does reason, but that reasoning is usually thrown away rather than attached to the code, so the reviewer has to reconstruct a rationale that never made it into the diff. The good news: that is a tooling problem, and capturing the reasoning makes review dramatically easier.
This is the part that genuinely changed, and I think it is underappreciated.
When a human writes code, intent comes along for free. The reasoning, the alternatives weighed and discarded, lived in the author’s head, and review was you checking that reasoning. Modern agents do reason, often visibly, producing thinking traces and weighing options and explaining themselves as they go. The catch is that this reasoning is usually discarded the moment the diff is produced. It is rarely captured, rarely attached to the PR, and in any case it is the agent’s reasoning about how to implement the task, not a human’s judgment about whether it was the right task to begin with. So review shifts from checking reasoning that sits in front of you to reconstructing intent that never got written down, which is harder and slower, and we keep acting surprised that it takes 441% longer.
A 2026 paper, AI Slop and the Software Commons, analyzed 1,154 posts across 15 Reddit and Hacker News threads where developers discussed “AI slop”. One line from a developer has stayed with me: reviewing an agent’s PR made them “the first human being to ever lay eyes on this code”.
That is worth sitting with, and it points straight at the fix. In normal review the author already understood the change and you were checking their work. With an agent PR, nobody has reconstructed the why yet, and the reviewer is the first to try. As the paper puts it, review “wasn’t built to recover missing intent”. The encouraging part is that missing intent is recoverable: the reasoning existed, we just discarded it. Have the agent state what it was trying to do and what it ruled out, capture that as a decision log on the PR, and a large part of the reconstruction cost disappears. This is a tooling problem, and tooling problems get solved.
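A decision log does not need to be elaborate. As a sketch of the idea, assuming nothing about any particular agent or PR platform, it can be a small record the agent fills in and renders into the PR description. The field names and the example values below are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionLog:
    """What the agent was trying to do, captured before the reasoning is discarded."""
    goal: str
    approach: str
    ruled_out: list[str] = field(default_factory=list)    # alternatives and why
    unverified: list[str] = field(default_factory=list)   # things the agent did not check

    def to_pr_section(self) -> str:
        lines = ["Decision log", f"Goal: {self.goal}", f"Approach: {self.approach}"]
        lines += [f"Ruled out: {item}" for item in self.ruled_out]
        lines += [f"Not verified: {item}" for item in self.unverified]
        return "\n".join(lines)


log = DecisionLog(
    goal="Stop duplicate charges when a checkout request is retried",
    approach="Idempotency key stored alongside the order row",
    ruled_out=["Client-side debounce: does not cover network retries"],
    unverified=["Behaviour under concurrent requests with the same key"],
)
print(log.to_pr_section())
```

The "not verified" list is the part I would insist on, because it tells the reviewer where to look first instead of leaving them to guess.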
None of which makes “have the AI review the AI” a complete answer on its own. A second model with different priors genuinely catches real bugs, and it catches a lot of them, which is why you should run one. What it does not supply is the human judgment about whether this is the right change to build in the first place. That judgment stays with a person, and it happens to be the most interesting part of the job, the part worth keeping.
The current AI reviewers are genuinely good, and they almost never flag the same lines as each other, so the right move is not picking the best one but running two that are built differently.
The dedicated AI review tools are good now, and I think you should be running at least one on everything, side projects included. CodeRabbit is the most widely deployed and topped the independent Martian benchmark (January to February 2026) on F1, around 49% precision with the best recall in the field. Greptile trades precision for recall: around an 82% bug-catch rate against CodeRabbit’s 44% in one benchmark, at the cost of more false positives. Anthropic’s Code Review reports under 1% of its findings marked incorrect by their engineers, and the figure I would actually show a manager: it raised their internal rate of PRs receiving a substantive review from 16% to 54%. The long tail of changes that used to get a glance and an approval now gets read by something.
The most useful result I have seen this year is not from a vendor. An engineer ran four reviewers in parallel, CodeRabbit, Sentry Seer, Greptile and Cursor BugBot, across 146 real PRs and 679 findings over three and a half weeks:
Of 617 distinct flagged locations, 93.4% were caught by exactly one of the four tools. 6% by two. Almost none by three. None at all by all four.
At no point did all four tools flag the same line, and even two-tool overlap was rare. Each was strong at a different class of problem: Greptile with near-zero false positives on correctness and architecture, CodeRabbit with the widest net and one-click fixes, Seer best on production-failure severity. That is the adversarial review argument demonstrated on a real codebase rather than in a paper. Heterogeneity is the whole point. Four copies of one model is a single reviewer with a larger invoice, whereas four genuinely different reviewers surface a set of bugs no single member could find alone, the human included.
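If you want to run the same overlap measurement on your own repository, the computation is small. This sketch assumes you can export each tool's findings as (tool, file, line) tuples; the export format, the reviewer names and the sample data are invented, and real tools will need some normalization before their line numbers are comparable.

```python
from collections import Counter


def overlap_histogram(findings):
    """Map 'number of tools that flagged a location' to 'how many locations'.

    findings: iterable of (tool, path, line) tuples.
    """
    tools_by_location = {}
    for tool, path, line in findings:
        tools_by_location.setdefault((path, line), set()).add(tool)
    return Counter(len(tools) for tools in tools_by_location.values())


# Invented findings, only to show the shape of the input.
findings = [
    ("reviewer_a", "billing/charge.py", 42),
    ("reviewer_b", "billing/charge.py", 42),
    ("reviewer_a", "billing/refund.py", 10),
    ("reviewer_c", "api/routes.py", 88),
    ("reviewer_d", "api/routes.py", 120),
]
histogram = overlap_histogram(findings)
total = sum(histogram.values())
for tool_count in sorted(histogram):
    print(f"flagged by {tool_count} tool(s): {histogram[tool_count] / total:.0%}")
```

A histogram dominated by the "one tool" bucket is the signal that a second reviewer is adding coverage rather than repeating the first.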
In practice: do not agonize over the single best tool, there isn’t one. At the high-stakes end, run two with deliberately different characters (the experiment above paired Greptile for everyday correctness with Seer for production-failure severity, with almost no overlap). If you are solo, one good reviewer plus real tests is plenty. And whatever the marketing says, measure it on your own code, because every one of these results was specific to a particular codebase, and yours will be too.
The machine is already reviewing more of your code than you are. The only real decision left is whether you do that deliberately, and the amount of human you keep should scale with your blast radius.
I keep hearing a question that would have been heresy a year ago, now from experienced engineers: should the machine be doing more of the reviewing, perhaps most of it? I no longer think that is a foolish question.
The uncomfortable part is that AI review works. Under 1% of Anthropic’s findings are marked wrong, the tools catch bugs humans read straight past, and they do not get tired on the thirtieth PR of the day, which is exactly when a human is least reliable. Meanwhile humans are visibly not keeping up: zero-review merges are up 31% and review times are up triple digits. In a real sense the machine is already reviewing more of the code than we are. The honest framing is not “should we let AI review more” but “AI is already doing it, are we going to be deliberate about that or let it happen by default while pretending humans still read everything”.
Loop engineering sharpens this. The premise of a loop is that you stop being the person who prompts the agent and instead build a system that prompts it, and a central part of that system is a judge: an agent that decides whether the work is done before moving on. The reviewer is the next role being designed out of the inner loop, on purpose. We spent a year automating the writing, and the loops are now automating the checking, and the human keeps getting pushed up and out. “Where does the human stay” is not a seminar question, it is something you decide every time you wire up a loop, whether or not you realize you are deciding it.
Where I currently land, and I hold this loosely: the answer is not “a human reads every line”. That is over. The volume ended it, and anyone insisting otherwise is describing a world that no longer exists. But it is also not “let the loop review itself and walk away”. When an agent writes the code, another reviews it and a third judges it, you have a closed loop of models with broadly correlated blind spots, especially when they come from the same family, confidently agreeing in the same places. A confident “looks good” with no human anywhere in it is borrowed confidence: the system’s certainty becomes yours, and nobody actually understood anything. The loop can be both very sure and very wrong, with no human left to tell the difference.
So the human does not leave; the human moves up a level. You stop reviewing every diff and start owning the parts that do not transfer to a model. Accountability, because you cannot page a model at 3am. The judgment of whether this is even the right change to build, as distinct from whether the code is correct. The high-blast-radius gates where being wrong is expensive. And the awkward one: the behavior nobody specified, because a model reviews the code that exists and rarely flags the requirement that nobody thought to write down, which remains a human-shaped gap I do not expect to close soon. Human in the loop becomes human on the loop: sampling, spot-checking and auditing the system rather than reading every PR, and spending your limited attention where being wrong would actually hurt.
This is already how I work on my own projects, including the open-source ones that now see more PRs in a day than I could carefully read in an evening. I point Claude Code or Codex at a batch of incoming PRs and ask for a first pass: a high-level read of what looks safe to merge, what needs more work, and what is genuinely high-risk. I do not auto-merge on the result, and I do not lazy-merge whatever it approves. What it gives me is a way to allocate attention. I can spend a few minutes confirming the changes it considers low-risk, and put real, careful time into the ones it flags as dangerous. The detail that matters is that this is not my old review hour made slightly faster. It is a different shape of hour, and at the volume I now deal with, it is the main reason the queue stays survivable at all.

Codex and Claude Code giving me a first-pass, risk-sorted read of a batch of PRs. The triage is the help. The merge decision stays mine.
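The useful output of that first pass is an ordering, not a verdict. As a sketch, assuming the assistant's triage has already been parsed into a label and a one-line reason per PR (the label names, the `TriagedPR` type and the sample queue are mine), the step that allocates attention is just a sort:

```python
from dataclasses import dataclass

# Lower number = read first. The labels mirror the three buckets described above.
PRIORITY = {"high-risk": 0, "needs-work": 1, "looks-safe": 2}


@dataclass(frozen=True)
class TriagedPR:
    number: int
    label: str    # assigned by the AI first pass
    reason: str   # the one-line justification it gave


def reading_order(prs):
    """Sort the queue so careful human time goes to the riskiest changes first."""
    return sorted(prs, key=lambda pr: (PRIORITY[pr.label], pr.number))


queue = [
    TriagedPR(101, "looks-safe", "typo fix in docs"),
    TriagedPR(102, "high-risk", "touches session token handling"),
    TriagedPR(103, "needs-work", "no tests for new parser branch"),
]
for pr in reading_order(queue):
    print(f"#{pr.number} [{pr.label}] {pr.reason}")
```

Nothing here merges anything. The sort decides what I read first, and the merge button stays a human action.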
A more extreme version of the same move is Kun Chen, an ex-Meta L8 engineer now shipping around 40 PRs a day as a solo builder, who has largely stopped reviewing code. It would be easy to dismiss this, except he is an L8, unusually good at the thing he stopped doing, which is what makes it interesting. He runs 20 to 30 agents in parallel and has moved his effort into the plan: he writes detailed plans up front, the agents run for hours against them, and he says plan quality determines how long they can run unattended. That is the move I described above in its purest form. It is worth being precise about what actually happened, because it is not that he stopped verifying. The intent did not vanish, he wrote it down himself in the plan, so the “first human to ever lay eyes on this” problem is half-solved: a human did understand the why, just up front rather than after. And he did not work without a net, he built an automated review gate (he calls it No Mistakes) that checks the code before it merges, and he stays on escalation when an agent gets stuck. The human does the expensive thinking before the code exists and the machine does the line-by-line afterward, which may well be the shape of where this goes.
But he is a solo builder with no large team and no decade-old system full of landmines beneath him. The exact conditions that make 40 PRs a day without review rational for him are conditions most readers do not have. Copy his workflow onto a team shipping to many users and you reproduce the Faros numbers on your own dashboard. He is not wrong; he is a long way down one specific end of the spectrum.
Which is the spectrum point again. Solo with no users: letting AI review almost all of it is a defensible 2026 position, and you should not feel guilty about it. Maintaining something large for many people: let the machine handle the first pass, the second pass and the boring 90%, but keep a real human on the load-bearing paths and do not let the loop close completely on anything that can hurt someone. How much human you keep is a dial, and you set it by blast radius, not by guilt.
Stop reviewing everything to the same depth. Spend scarce human attention only where being wrong is costly, and let cheap deterministic gates and AI reviewers handle the rest.
The organizing idea is to match review effort to the cost of being wrong, push the cheap deterministic work as early as possible, and reserve human attention for what only humans can do.
Tier by risk, not by author. A config change earns a linter and a glance. A payments path earns the full stack: types, tests, two different AI reviewers, a human who owns that system, and a security pass. Do not spend a heavy review on boilerplate, and do not wave through an auth change because the tests are green. The layered approach is the same everywhere; what changes is how many layers a given diff has to clear.
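One way to express tiering is a table from paths to tiers and from tiers to required gates. This is a sketch under my own assumptions: the path patterns, tier names and gate names are hypothetical, and a real repository would need its own mapping.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping; every pattern, tier and gate name here is illustrative.
PATH_TIERS = [
    ("payments/*", "high"),
    ("auth/*", "high"),
    ("config/*", "low"),
    ("docs/*", "low"),
]
GATES_BY_TIER = {
    "low": ["lint", "human_glance"],
    "medium": ["lint", "types", "tests", "ai_review"],
    "high": ["lint", "types", "tests", "ai_review", "second_ai_review",
             "owner_review", "security_review"],
}
TIER_RANK = {"low": 0, "medium": 1, "high": 2}


def tier_for(path: str) -> str:
    """First matching pattern wins; unknown paths default to the middle tier."""
    for pattern, tier in PATH_TIERS:
        if fnmatchcase(path, pattern):
            return tier
    return "medium"


def required_gates(changed_paths):
    """A diff is as risky as the riskiest file it touches."""
    tier = max((tier_for(p) for p in changed_paths),
               key=TIER_RANK.__getitem__, default="low")
    return GATES_BY_TIER[tier]


print(required_gates(["config/app.yaml"]))
print(required_gates(["config/app.yaml", "payments/charge.py"]))
```

Taking the maximum tier across the diff is the design choice that matters: one payments file in an otherwise trivial change still earns the full stack.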
Fast-fail the expensive tail. The most useful recent finding for teams drowning in agent PRs is Early-Stage Prediction of Review Effort (January 2026), which studied 33,707 agent-authored PRs. Agents are good at small, well-defined changes, around 28% merge almost instantly, but they tend to “ghost” the moment they get subjective feedback, abandoning the back-and-forth that review actually is. (A companion 2026 paper found reviewer abandonment accounted for 38% of rejected agent PRs.) The researchers built a “circuit breaker” that predicts high-maintenance PRs from cheap signals like file types and patch size before a human looks, and it works well. Triage agent PRs up front, fast-track the trivial ones, and do not let a person sink an hour into a sprawling change the agent will abandon as soon as you push back.
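To be clear, what follows is not the paper's model. It is a deliberately crude stand-in that uses the same kind of cheap signals, file types and patch size, with thresholds I made up, to show where such a check sits in the flow: before any human opens the PR.

```python
from pathlib import PurePosixPath

# Illustrative assumption: changes confined to these file types are low effort.
TRIVIAL_SUFFIXES = {".md", ".txt", ".json", ".yaml", ".yml", ".toml"}


def predicted_effort(changed_files, lines_changed,
                     max_trivial_lines=50, max_reviewable_lines=400):
    """Crude routing heuristic; all thresholds are made-up placeholders."""
    suffixes = {PurePosixPath(name).suffix for name in changed_files}
    if lines_changed <= max_trivial_lines and suffixes <= TRIVIAL_SUFFIXES:
        return "fast-track"
    if lines_changed > max_reviewable_lines or len(suffixes) > 3:
        return "send-back"   # ask for a smaller, focused change before review
    return "normal"


print(predicted_effort(["README.md"], 10))
print(predicted_effort(["api.py", "ui.ts", "schema.sql", "ci.yaml"], 120))
print(predicted_effort(["parser.py"], 80))
```

If you adopt something like this, calibrate the thresholds against your own history of abandoned and merged PRs rather than trusting mine.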
Raise the bar for what you will even review. The fix for being buried is not locking down the repository, it is refusing to review changes that arrive without evidence. Require, before review: a statement of what the change is for, a diff that is not 3,500 lines with no comments, the test output, and proof it was actually run. This is how you stop being the first human to read the code. You push the intent-reconstruction work back onto whoever submitted it, where it is cheap, rather than absorbing it yourself, where it is expensive.
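The evidence requirement can be enforced mechanically before a reviewer is ever assigned. A minimal sketch, assuming the PR template has already been parsed into fields; the `Submission` type and the default size limit are hypothetical, and the four checks simply mirror the list above.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    purpose: str        # what the change is for, in the submitter's words
    lines_changed: int
    test_output: str    # pasted or linked test results
    run_proof: str      # e.g. a CI run link or log excerpt


def missing_evidence(sub: Submission, max_lines: int = 1000) -> list[str]:
    """Return the reasons a submission is not yet ready for human review."""
    problems = []
    if not sub.purpose.strip():
        problems.append("no statement of purpose")
    if sub.lines_changed > max_lines:
        problems.append(f"diff too large ({sub.lines_changed} > {max_lines} lines)")
    if not sub.test_output.strip():
        problems.append("no test output")
    if not sub.run_proof.strip():
        problems.append("no proof the tests were run")
    return problems


bare = Submission(purpose="", lines_changed=3500, test_output="", run_proof="")
for problem in missing_evidence(bare):
    print("blocked:", problem)
```

An empty list means the change has earned a reviewer. Anything else goes back to the submitter with the reasons attached.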
Keep PRs small, deliberately. Agent PRs run large, 51% larger on average in the Faros data, and reviewer engagement is one of the strongest predictors that a PR merges at all. A large unreviewable PR gets rejected outright or, worse, rubber-stamped. Instruct your agents to produce small commits. A diff a human can actually read is now a design constraint, not a courtesy.
Read the test changes more carefully than the code. This is the agent failure mode to watch. The agent changes behavior, then “fixes” the test by rewriting the assertion to match the new, broken behavior. A green check over 200 edited tests means nothing until you have confirmed the edits were correct. Treat any diff that rewrites many tests as a flag and read those first. Mutation testing earns its place here: coverage tells you a line ran, mutation testing tells you whether the test would notice if that line were wrong.
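The difference between coverage and mutation testing is easiest to see on a toy case. The sketch below hand-writes a single mutant rather than using a real mutation-testing tool, and the function and test names are made up for illustration. Both tests execute every line of the function, but only one of them notices when the comparison is wrong.

```python
def is_adult(age):
    return age >= 18


def is_adult_mutant(age):
    # The mutation: '>=' flipped to '>'. A real tool generates many of these.
    return age > 18


def weak_test(fn):
    # Executes every line of the function, so coverage looks complete.
    return fn(30) is True and fn(5) is False


def strong_test(fn):
    # Also pins down the boundary, which is where the mutant differs.
    return weak_test(fn) and fn(18) is True


def survives(test, mutant):
    """A mutant survives when the test still passes on the broken code."""
    return test(mutant)
```

Here `survives(weak_test, is_adult_mutant)` is true and `survives(strong_test, is_adult_mutant)` is false. An agent that rewrites an assertion to match broken behavior is, in effect, turning a strong test into a weak one, and a surviving mutant is how you would find out.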
Treat CI as the wall that does not move. Watch for the patterns GitHub now warns reviewers about: removed tests, skipped lint, lowered coverage thresholds, a duplicated helper that already exists elsewhere, and untrusted input flowing into a prompt. That last one deserves emphasis, because agent-built features are a fresh source of prompt injection: if a change pipes user-controlled text into an LLM call without thinking about what that text can instruct the model to do, the vulnerability is not visible in the diff; it is latent in the data that will arrive later. Agents will also weaken CI to make themselves pass. That is not malice, just gradient descent finding the cheapest path to green. Deterministic gates are the one part of the pipeline that cannot be talked out of their verdict by a confident paragraph, so keep them strict.
A human owns the merge. A model cannot be paged and cannot be held responsible for what it shipped, so whoever clicks merge owns it. When an AI review says “looks good” in a calm, confident voice, it is handing you confidence it has not necessarily earned. Treat every AI review as a sensor, not a verdict: data, not a decision.
If you are solo with no users, the tiering, the test-change discipline and CI are most of what you need; the rest is overhead until people show up. If you are the large organization, all of it is the baseline, and the triage and intake bar are the difference between a review process that scales and one that quietly collapses.
What this means if you run a team
The bottleneck is no longer how fast you write code; it is how fast a trusted human can become confident in a change. Cutting the people who provide that confidence because “AI made us faster” simply converts the saving into future incidents.
The binding constraint on shipping is no longer how fast you can write code. It is how fast a trusted human can be confident a change is correct. Any plan that treats generation as the bottleneck and review as free will quietly stall, with the velocity dashboard staying green the whole way.
The Faros report is direct about this: QA and review work rises even as output rises, so reducing engineering headcount because “AI made us faster” is dangerous unless you have closed the review gap first. The senior-engineer tax (review time up by a triple-digit percentage) falls hardest on the people you can least afford to bottleneck, and it is invisible to any metric that only counts merged PRs.
Open source maintainers hit this wall first and hardest. The steady stream of plausible but hollow contributions costs real triage time even when it is well-intentioned, and that is the canary. Companies are next. The ones handling it well treat review capacity as a real resource to be measured, protected and spent deliberately, not as slack that AI has freed up.
Writing got cheap, understanding didn’t
Code review did not become less important when agents arrived. It became the central activity. Writing code is increasingly solved and getting cheaper by the month; the durable advantage is the system that lets you trust what was written.
Do not take the one-size answer in either direction. If you are solo with no users, the enterprise horror stories about churn and duplication are a future risk, not today’s fire, so lean on your tests, review what matters, and stay honest that the deferred work is still owed. If you maintain something large for many people, every alarming number here is about you, and the only thing that holds is a tiered, evidence-required, deliberately heterogeneous review process with a human owning the merge.
What is constant across the whole spectrum is the underlying economics. We made writing cheap, and understanding stayed exactly as expensive as it has always been. The teams that do well over the next few years will not be the ones generating the most code, they will be the ones who built a review system they can actually trust, and who never confuse “the tests passed” with “a person understands what this does and why”.
Or, as Simon Willison keeps putting it, your job is to deliver code you have proven to work. Agents have not changed that. They have made the proving the center of the job rather than an afterthought, and I think that is a good trade. Understanding a system well enough to stand behind it is the most durable and most interesting skill in software, and there has never been a better time to get extraordinarily good at it.