编排税:当AI智能体并行时,注意力的串行瓶颈
启动AI智能体变得极其廉价,但关闭循环(审核、合并、判断)仍需经过你这一串行处理器。作者将人类注意力类比为GIL:你可以同时跑20个智能体,但真正能交付到main分支的工作受限于你单线程的审核吞吐量。Amdahl定律在此适用——串行部分(人类的判断)不可并行化,盲目增加智能体只会积压待审队列。文章给出了可操作的策略:按审核速率而非UI上限扩容、将任务分为可委托的异步工作与需要锁定注意力的复杂工作、批量审核、让智能体自证正确性。最后指出,忙碌不等于高效,未支付的编排税会同时积累技术债务和认知债务。
Starting more agents is easy now. However, more agents running doesn’t mean more of you available - your cognitive bandwidth doesn’t parallelize. All the judgement to actually steer them and merge the code they write into the codebase still has to route through exactly one serial processor which is just you. Orchestration tax is basically the price you pay for forgetting this and the only real fix is to start architecting your own attention like you architect any concurrent system.
现在启动更多 agent 很容易。然而,更多 agent 在运行并不意味着你有了更多自己——你的认知带宽无法并行化。所有真正需要你来引导它们、将它们写的代码合并到代码库的判断工作,仍然必须经过唯一的一个串行处理器,那就是你。编排税基本上就是你忘记这一点所付出的代价,而唯一的真正解决办法是开始像设计任何并发系统一样来设计你自己的注意力。
I was in a panel at Google I/O this week with Richard Seroter, Aja Hammerly and Ciera Jaspan talking about what software engineering looks like right now and how it will probably evolve. Near the end Richard asked us what is one thing developers should walk away and do differently. I said the thing I been circling around for months: feeling busy is definetly not the same as being productive. You can run 20 agents and feel completely busy. But thats not 20 agents worth of shipped work.
Earlier in that chat Richard gave this problem a name. “You talked about the orchestration tax” he said. “You can’t manage twenty agents successfully in your own brain.” He is totally right. I want to breakdown this idea properly because its not a discipline problem. It is an architecture problem.
本周我在 Google I/O 的一个小组讨论中,与 Richard Seroter、Aja Hammerly 和 Ciera Jaspan 一起谈论软件工程现在的样子以及它可能如何演变。接近尾声时,Richard 问我们,开发者应该带走并做出改变的一件事是什么。我说了我这几个月一直在围绕的事情:感觉忙碌肯定不等于高效。你可以运行 20 个 agent 并感觉完全忙碌,但那不是相当于 20 个 agent 值的已交付工作。
在那次聊天早些时候,Richard 给这个问题起了个名字。“你提到了编排税,”他说,“你无法在自己的大脑里成功管理二十个 agent。”他完全正确。我想恰当地分解这个想法,因为它不是一个纪律问题,而是一个架构问题。
The line from the panel I keep thinking about is something I said almost randomly: running multiple agents does not mean there is more of you.
There is this hidden asymmetry in agentic workflows. Starting an agent is very cheap. It is just a keystroke or a sentence prompt. But closing the loop on the agent is not cheap at all. Someone has to check if what came back is correct and reconcile it with whatever the other agents touched. That someone is you. And there is exactly one of you.
那个小组讨论中我一直在思考的一句话,是我几乎随口说出的:运行多个 agent 并不意味着有了更多个你。
在 agent 工作流中有一个隐蔽的不对称性。启动一个 agent 非常廉价,只需一次按键或一个句子提示。但关闭 agent 的循环一点也不廉价。有人必须检查返回的结果是否正确,并将其与其他 agent 改动过的地方协调一致。那个人就是你,而且你只有一个。
I wrote about a piece of this last month in Your parallel Agent limit, mostly about the ambient anxiety of not knowing which paralell thread is quietly failing. This post is about the actual shape underneath that cost. When you start seeing agent development as a concurrent system, you realize the human is just a component inside it. The slow serial component.
If you ever wrote concurrent code you already have the right intuition. You just been pointing it at the wrong part of the system.
Python has the Global Interpreter Lock (GIL). You can spawn as many threads as you want but only one executes python bytecode at a time because they must acquire the lock. You are the GIL of your AI agents. They all can run at once. But when any of their work needs genuine understanding of the architecture or resolving merge conflicts, that work has to acquire the lock. There is one lock. You hold it.
上个月我在《你的并行 agent 上限》中写了这部分内容,主要是关于不知道哪个并行线程在悄悄失败的弥漫性焦虑。这篇文章是关于这种成本背后的实际形态。当你开始将 agent 开发视为一个并发系统时,你会意识到人类只是其中的一个组件——那个缓慢的串行组件。
如果你曾经写过并发代码,你已经有了正确的直觉,只是你一直把它指向了系统中错误的部分。
Python 有全局解释器锁(GIL)。你可以随心所欲地生成任意多的线程,但一次只有一个线程能执行 Python 字节码,因为它们必须获取锁。你就是你 AI agent 的 GIL。它们都可以同时运行,但当它们中的任何一个工作需要真正理解架构或解决合并冲突时,那项工作就必须获取锁。只有一把锁,而握着它的是你。
Amdahl’s Law makes this very precise. The speedup you get from paralellizing is capped by the fraction of work that stays serial. If a big chunk of your pipeline cant be paralellized, you top out at a hard limit no matter how many cores you throw at it. In agent development the serial fraction is the judgement. Spawning 8 agents doesn’t speed up your judgement time. It just makes the queue of things feeding into it much deeper.
This is an old performance engineering fact that still surprise people: optimizing the non bottleneck part doesn’t increase throughput. You just grow the pile of unfinished work sitting in front of the bottleneck. Adding agents optimize the part that was never the constraint. The constraint is the review step and the throughput of your system equals exactly the throughput of that step. The orchestration tax is the structural gap between agent production and what you can actually merge. It’s what happens when you put a single-threaded resource in charge of a concurrent one.
阿姆达尔定律非常精确地说明了这一点:从并行化中获得的加速受限于保持串行的工作比例。如果你的很大一部分流水线无法并行化,那么无论你投入多少核心,都会卡在一个硬上限上。在 agent 开发中,串行部分就是判断。生成 8 个 agent 并不会加速你的判断时间,它只是让送入判断队列的东西堆积得更深。
这是一个古老的性能工程事实,但至今仍让人惊讶:优化非瓶颈部分并不会增加吞吐量。你只是让堆积在瓶颈前的未完成工作变得更大。添加 agent 优化了从来不是约束的那部分。约束是审查步骤,你系统的吞吐量正好等于该步骤的吞吐量。编排税是 agent 产出与你实际能合并之间的结构性差距——当你用一个单线程资源来管理一个并发系统时,就会发生这种情况。
At the panel I said I never felt more productive with my tools but I am also more tired than I ever been. Both halves are completely real and they have the same cause.
The tiredness has a very specific cause. It is how running a serial processor at 100% with no slack feels like. Everytime you check on an agent you been away from you pay a context switch cost. You flush your brain and reload a different context from cold. CPUs do this in microseconds and architects still work hard to avoid it. You do it in minutes and you never reload the context perfectly. Five agents is not 1x workload done five times. It is 5 cold reloads plus a background brain process constantly worrying about which agent you should be checking.
You can’t just try harder to fix a structural limit. The tax will be paid anyway. If you try to grind it out, the limit just shows up as shallow code reviews or experiencing cognitive surrender where you just accept the agent’s code because forming your own opinion costs attention you don’t have anymore. You either pay the tax deliberately or you let it quietly destroy your understanding of your own system.
在小组讨论中我说,我从未觉得工具让我更高效,但我也比以往任何时候都更累。这两部分都是完全真实的,而且它们有相同的原因。
疲劳有一个非常具体的原因:就像让一个串行处理器以 100% 的负载运行,没有任何余裕。每次你查看一个你暂别过的 agent,你都要付出一次上下文切换的成本。你清空大脑,然后从冷启动状态重新加载一个不同的上下文。CPU 以微秒级完成这件事,而架构师们仍在努力避免它;你以分钟级完成,而且永远无法完美地重新加载上下文。五个 agent 不是 1 倍工作量的五次重复,而是 5 次冷加载,外加一个背景脑进程在持续焦虑应该检查哪个 agent。
你无法仅仅通过更努力来修复结构性限制。税终究是要付的。如果你试图硬扛过去,限制就会表现为草率的代码审查,或者经历认知投降——你直接接受 agent 的代码,因为形成自己的意见需要你已经不再有余力的注意力。你要么有意识地支付编排税,要么让它悄悄地摧毁你对自身系统的理解。
So you have to treat your attention as the scarce serial resource it is. You wouldn’t design a distributed system without thinking hard about the bottleneck. Give your brain the same respect.
Some things that actually held up for me:
Scale fleet to review rate, not the UI. A good concurrent system uses backpressure so the queue doesn’t grow infinitely. The producer slows down to match the consumer. Your agent count is the producer and your review rate is the consumer. The right number of paralell agents is how many you can actually code review properly. For most of us this is a low single digit. The AI tool will happily let you spawn 20 but that is just a UI feature.
Sort the work. I mentioned this to Richard when he asked how I navigate it. I keep two piles of tasks. One is isolated work that I’m happy to delegate to background agents running in the Cloud. These can run async and often just need me at the final gate. The other pile is complex tasks where the judgement is the work. Like a weird bug or architecture design. The big mistake is trying to paralellize the second pile. Doing multiple complex tasks doesn’t scale your output. It just thrashes the lock and everything comes out worse.
Batch your reviews. Context switching cost you heavily everytime you do it. Reviewing 4 agents at the same time in one sitting is much cheaper than checking one, leaving to do something else and returning cold. Give agents a long leash. Let the work pile up a bit and process the batch.
Only spend the lock on judgement. Dont waste your brain on things the machine can verify itself. Make the agent write a passing test or generate a screenshot. They can prove the boring 80% themselves so you only spend your scarce attention on the 20% that genuinely needs a human.
Protect your serial time. The bottleneck needs your best hours, not the leftover minutes between agent check-ins. Sometimes the highest leverage move is to stop orchestrating entirely, close the laptop full of agents and just think hard about one single problem with the lock held the whole time. Orchestrating is not the real work. Its the overhead around the work.
Aja pointed out that architecture is the urgent skill now. Knowing what belongs inside one agent and what is too much for it. I would add that you are a component in that system. Your attention has a known, low serial throughput. The system either respects that number or it routes around it by secretly lowering your standards.
所以,你必须将你的注意力视为它那稀缺的串行资源。你不会在不仔细考虑瓶颈的情况下设计分布式系统,请给予你的大脑同样的尊重。
一些对我来说确实有效的方法:
根据审查率来扩展 agent 群,而不是根据 UI。一个好的并发系统使用背压来阻止队列无限增长。生产者减速以匹配消费者。你的 agent 数量是生产者,你的审查率是消费者。正确的并行 agent 数量是你能真正妥善进行代码审查的数量。对我们大多数人来说,这是一个个位数的低值。AI 工具会很乐意让你生成 20 个,但那只是一个 UI 功能。
对工作进行分类。当 Richard 问我如何应对时,我提到了这一点。我保持两个任务堆。一个是孤立的工作,我很乐意委托给在云中运行的后台 agent。这些可以异步运行,通常只在最后关卡需要我。另一个堆是复杂任务,其中判断本身就是工作,比如一个奇怪的 bug 或架构设计。最大的错误是试图并行化第二个堆。同时处理多个复杂任务并不会扩展你的产出,它只会剧烈争夺那把锁,让所有事情变得更糟。
批量进行审查。每次上下文切换都会让你付出高昂代价。一次性连续审查 4 个 agent 的产出,比检查一个、离开去做别的事再冷返回要便宜得多。给 agent 一个较长的牵绳,让工作稍微堆积一下,然后批量处理。
只在需要判断时才花那把锁。不要在机器自己能验证的事情上浪费你的大脑。让 agent 编写一个通过的测试或生成一张截图。它们可以自己证明那无聊的 80%,这样你只把稀缺的注意力花在真正需要人类的 20% 上。
保护你的串行时间。瓶颈需要你最好的时间,而不是 agent 检查间隙那些零碎分钟。有时最高杠杆的行动是完全停止编排,合上满是 agent 的笔记本电脑,在全程持有那把锁的情况下,深入思考一个单一问题。编排不是真正的工作,它是围绕工作的开销。
Aja 指出,架构现在是紧迫的技能——知道什么属于一个 agent,什么对 agent 来说太多了。我要补充一点:你是那个系统中的一个组件。你的注意力具有已知的、低下的串行吞吐量。系统要么尊重这个数字,要么通过秘密降低你的标准来绕过它。
This is really important because the failure mode is invisible to you. Twenty running agents gives you this feeling of massive productivity. The dashboard is full and everything moves. But that feeling is decoupled from actually shipping good code to main. You can be maximally busy and barely produce anything. From the inside it feels identical.
Ciera pointed out Margaret-Anne Storey’s work on debt. We talked about technical debt and cognitive debt. The orchestration tax left unpaid is how you accumulate both at once. You merge stuff you didn’t read well. Your mental model of the codebase goes completely stale. None of this shows up on the dashboard today. It shows up when production breaks and you look at the system and realize you have no idea how it works anymore.
So this is the actual takeaway. Spawning agents is not the skill. Anyone can run 20. The real skill is designing the system around the one serial resource that cannot be cloned or paralellized. That resource is your attention. Architect it the way you architect anything else you depend on in production.
这一点非常重要,因为失败模式对你是不可见的。二十个正在运行的 agent 给你一种巨大生产力的感觉:仪表盘上满满的,一切都在运转。但那种感觉与实际向主分支交付好代码是脱钩的。你可以最大限度地忙碌,却几乎生产不出任何东西。从内部来看,感觉是完全一样的。
Ciera 提到了 Margaret-Anne Storey 关于债务的工作。我们讨论了技术债务和认知债务。未偿还的编排税就是如何同时积累这两种债务的方式。你合并了你没有仔细阅读的内容。你对代码库的心智模型完全过时了。这些在今天的仪表盘上都不会显示出来。它们会在生产环境出问题时显现——当你查看系统,意识到你不再知道它如何工作了。
所以,这才是真正的要点。生成 agent 不是技能,谁都能运行 20 个。真正的技能是围绕那一个无法被克隆或并行化的串行资源来设计系统。那个资源就是你的注意力。像你架构生产中依赖的任何其他东西一样来架构它。