The 8 Levels of Agentic Engineering
Bassim Eledath maps the progression of AI-assisted coding into 8 levels, from tab-complete and AI IDEs to context engineering, compounding engineering, MCPs & skills, harness engineering with automated feedback loops, background agents, and autonomous agent teams. Each level builds on the previous, with practical insights on closing the gap between model capability and practice. He argues that plan mode is fading, multi-model dispatching yields better results, and true autonomous teams are still experimental. The piece serves as a roadmap for engineers looking to leverage AI more effectively.
AI's coding ability is outpacing our ability to wield it effectively. That's why all the SWE-bench score maxxing isn't syncing with the productivity metrics engineering leadership actually cares about. When Anthropic's team ships a product like Cowork in 10 days and another team can't move past a broken POC using the same models, the difference is that one team has closed the gap between capability and practice and the other hasn't.
AI 的编码能力正在超越我们有效驾驭它的能力。这就是为什么所有 SWE-bench 分数的刷分并没有转化为工程领导真正关心的生产力指标。当 Anthropic 的团队用 10 天交付了 Cowork 这样的产品,而另一个团队用同样的模型却连一个破碎的概念验证都走不通时,区别就在于前者弥合了能力与实践之间的鸿沟,后者没有。
That gap doesn't close overnight. It closes in levels. 8 of them. Most of you reading this are likely past the first few, and you should be eager to reach the next one because each subsequent level is a huge leap in output, and every improvement in model capability amplifies those gains further.
The other reason you should care is the multiplayer effect. Your output depends more than you'd think on the level of your teammates. Say you're a level 7 wizard, raising several solid PRs with your background agents while you sleep. If your repo requires a colleague's approval before merge, and that colleague is on level 2, still manually reviewing PRs, that stifles your throughput. So it is in your best interest to pull your team up.
From talking to several teams and individuals practicing AI-assisted coding, here's the progression of levels I've seen play out, imperfectly sequential:
![]()
这个鸿沟不会一夜闭合,而是通过层级逐步弥合——共 8 层。大多数读者可能已经跨过了前面几层,应该渴望达到下一层,因为每一层都会带来产出的巨大飞跃,模型能力的每一次提升都会进一步放大这些收益。
你该关心的另一个原因是“多人效应”。你的产出在很大程度上取决于队友的层级。假设你是一个 7 级高手,在你睡觉时后台代理已经提了好几个可靠的 PR。但如果你的仓库需要经过同事审批才能合并,而那位同事还在 2 级、依赖纯人工审查 PR,那你的吞吐量就会被扼杀。因此,把团队往上拉对你最有利。
通过与多个实践 AI 辅助编程的团队和个人交流,我看到了这样一套进阶层级,虽非严格线性,但大致如此:
![]()
I'll address these two zippily, mostly for posterity. Skim freely.
It started with Copilot and tab complete. Click tab, autocomplete code. Probably long forgotten by many and skipped entirely by new entrants to agentic engineering. It favored experienced devs who could adeptly skeleton their code before AI filled in the blanks.
AI-focused IDEs like Cursor changed the game by connecting chat to your codebase, making multi-file edits dramatically easier. But the ceiling was always context. The model could only help with what it could see, and annoyingly often, it was either not seeing the right context or seeing too much of the wrong context.
Most people at this level are also experimenting with plan mode in their coding agent of choice: translating a rough idea into a structured step-by-step plan for the LLM, iterating on that plan, and then triggering the implementation. It works well at this stage, and it's a reasonable way to maintain control. Though we'll see in later levels less of a dependence on plan mode.
我对这两层快速带过,主要为记录完整性。请随意浏览。
它始于 Copilot 和 tab 补全:按 tab,代码自动完成。许多人可能已经淡忘,后来加入 agentic 工程的新人甚至完全跳过。它更青睐那些能先搭好代码骨架、再由 AI 填空的经验开发者。
以 AI 为中心的 IDE(如 Cursor)改变了游戏规则,通过将聊天与代码库连接,使多文件编辑大幅简化。但天花板始终是上下文。模型只能帮助它能看到的部分,而且令人恼火的是,它常常要么看不到正确的上下文,要么看到了太多无关的上下文。
这一层级的人大多也在尝试他们编码代理中的计划模式:将一个粗略想法翻译成结构化的、步步为营的 LLM 执行计划,迭代这个计划,然后触发实现。在此阶段效果不错,也是保持控制的合理方式。不过我们会在后续层级看到,对计划模式的依赖会减少。
Buzz phrase of the year in 2025, context engineering became a thing when models got reliably good at following a reasonable number of instructions with just the right amount of context. Noisy context was just as bad as underspecified context, so the effort was in improving the information density of each token. "Every token needs to fight for its place in the prompt" was the mantra.

2025 年的热词,上下文工程之所以兴起,是因为模型在获得恰好适量的上下文时,能够可靠地遵循合理数量的指令。嘈杂的上下文和过于简略的上下文一样糟糕,因此努力方向是提高每个 token 的信息密度。“每一个 token 都必须为它在提示词中的位置而战”是当时的信条。

In practice, context engineering touches more surface area than people realize. It's your system prompt and rules files (.cursorrules, CLAUDE.md). It's how you describe your tools, because the model reads those descriptions to decide which ones to call. It's managing conversation history so a long-running agent doesn't lose the plot ten turns in. It's deciding which tools to even expose per turn, because too many options overwhelm the model just like they overwhelm people.
You don't hear as much about context engineering these days. The scale has tipped in favor of models that forgive noisier context and reason through messier terrain (larger context windows help too). Still, being mindful of what eats up context remains relevant. A few examples of where it still bites:
Smaller models are more context-sensitive. Voice applications often use smaller models, and context size also correlates with time to first token, which affects latency.
Token-heavy tools and modalities. MCPs like Playwright and image inputs burn through tokens fast, pushing you into "compact session" state in Claude Code way sooner than you'd expect.
Agents with access to dozens of tools, where the model spends more tokens parsing tool schemas than doing useful work.
The broader point is that context engineering hasn't gone away, it's just evolved. The focus has shifted from filtering out bad context to making sure the right context is present at the right time. That shift is what sets up level 4.
实践中,上下文工程触及的面比人们意识到的要广。它是你的系统提示词和规则文件(.cursorrules、CLAUDE.md)。是你如何描述工具,因为模型会阅读这些描述来决定调用哪个。是管理对话历史,让长时间运行的代理不会在十轮之后丢失主线。是决定每个回合暴露哪些工具,因为太多选项会像压倒人一样压倒模型。
如今关于上下文工程的讨论少了许多。天平已倒向更能容忍嘈杂上下文、能在混乱地形上推理的模型(更大的上下文窗口也有帮助)。但留意什么在吞噬上下文仍然重要。以下几处依然会咬人:
较小的模型对上下文更敏感。语音应用常用较小模型,上下文大小还与首 token 时间相关,进而影响延迟。
消耗大量 token 的工具和模态。像 Playwright 这样的 MCP 和图像输入会迅速烧光 token,让你比预期更早地在 Claude Code 中进入“压缩会话”状态。
拥有数十个工具的代理,模型花在解析工具 schema 上的 token 比做有用工作还多。
更广泛的观点是,上下文工程并未消失,只是进化了。焦点从滤除坏的上下文转向确保正确的上下文在正确的时间出现。这一转变为第 4 层奠定了基础。
Context engineering improves the current session. Compounding engineering improves every session after it. Popularized by Kieran Klaassen, compounding engineering was an inflection point for not only me but many others that "vibe coding" could do far more than just prototyping.
It's a plan, delegate, assess, codify loop. You plan the task with enough context for the LLM to succeed. You delegate it. You assess the output. And then, crucially, you codify what you learned: what worked, what broke, what pattern to follow next time.

上下文工程优化当前会话,复合工程则优化后续所有会话。由 Kieran Klaassen 推广,复合工程对我和许多人来说是一个拐点——它证明了“vibe coding”远不止能用于原型。
这是一个“计划、委派、评估、固化”循环。你用足够的上下文计划任务以确保 LLM 成功,委派它,评估输出,然后关键一步,将学到的固化下来:什么有效、什么坏了、下次该遵循什么模式。

That codify step is what makes it compound. LLMs are stateless. If they re-introduce a dependency you explicitly removed yesterday, they'll do it again tomorrow unless you tell them not to. The most common way to close that loop is updating your CLAUDE.md (or equivalent rules file) so the lesson is baked into every future session. A word of caution: the instinct to codify everything into your rules file can backfire (too many instructions is as good as none). The better move is to create a setting where the LLM can easily discover useful context on its own, for example by maintaining an up-to-date docs/ folder (more on this in Level 7).
Practitioners of compounding engineering are usually hyper-aware of the context being fed to their LLM. When an LLM makes a mistake, they instinctively think about missing context before blaming the model's competence. That instinct is what makes levels 5 through 8 possible.
“固化”步骤是使工程复合的关键。LLM 是无状态的。如果它们重新引入你昨天明确移除的依赖,除非你告诉它们不要,否则明天它们还会再做。关闭这个循环最常见的方式是更新你的 CLAUDE.md(或等效的规则文件),让教训被烘焙到未来的每一次会话中。一个警告:将所有东西都固化进规则文件的冲动可能适得其反(太多指令等同于无指令)。更好的做法是创建一个 LLM 可以自行轻松发现有用上下文的环境,比如维护一个最新的 docs/ 文件夹(详见 L7)。
复合工程的实践者通常对自己喂给 LLM 的上下文高度敏感。当 LLM 出错时,他们本能地先想到缺失的上下文,而不是质疑模型的能力。这种直觉正是让 L5 到 L8 成为可能的基础。
Levels 3 and 4 solve for context. Level 5 solves for capability. MCPs and custom skills give your LLM access to your database, your APIs, your CI pipeline, your design system, Playwright for browser testing, Slack for notifications. Instead of just thinking about your codebase, the model can now act on it.
There's no shortage of good material on MCPs and skills already, so I won't rehash what they are. But here are some examples of how I use them: my team shares a PR review skill that we've all iterated on (and still do) that conditionally launches subagents depending on the nature of the PR. One handles integration safety with the database. Another runs complexity analysis to flag redundancies or overengineering. Another checks prompt health to ensure our prompts follow the team's standard format. It also runs linters and Ruff.

L3 和 L4 解决上下文问题,L5 解决能力问题。MCP 和自定义技能让你的 LLM 能访问数据库、API、CI 流水线、设计系统、Playwright 浏览器测试、Slack 通知等。现在模型不仅能思考代码库,还能实施操作。
关于 MCP 和技能的好材料已有很多,我不再赘述它们是什么。以下是我如何使用它们的例子:我的团队共享一个 PR 审查技能,我们都在上面迭代(至今仍在做),它会根据 PR 的性质条件性地启动子代理。一个处理与数据库的集成安全性,另一个运行复杂度分析以标记冗余或过度工程,另一个检查提示健康以确保提示符合团队标准格式。它还会运行 linter 和 Ruff。

Why invest this much in a review skill? Because as agents start producing PRs at volume, human review becomes the bottleneck, not the quality gate. Latent Space makes a compelling case that code review as we know it is dead. Automated, consistent, skill-driven review is what replaces it.
On the MCP side, I use the Braintrust MCP so my LLM can query evaluation logs and make changes directly. I use DeepWiki MCP to give my agent access to documentation for any open-source repo without manually pulling it into context.
Once multiple people on your team are writing their own versions of the same skill, it's worth consolidating into a shared registry. Block (my condolences) has a great write-up on this: they built an internal skills marketplace with over 100 skills and curated bundles for specific roles and teams. Skills get the same treatment as code: pull requests, reviews, version history.
One more trend worth calling out: it's becoming common for LLMs to use CLI tools instead of MCPs (and it seems like every company is shipping one: Google Workspace CLI, Braintrust is launching one soon). The reason is token efficiency. MCP servers inject full tool schemas into context on every turn whether the agent uses them or not. CLIs flip this: the agent runs a targeted command, and only the relevant output enters the context window. I use agent-browser heavily for exactly this reason versus using the Playwright MCP.
为什么在审查技能上投入这么多?因为当代理开始大量产出 PR 时,人工审查成为瓶颈而非质量把关。Latent Space 有力论证了已知的代码审查已死。自动化、一致、技能驱动的审查是替代品。
在 MCP 方面,我使用 Braintrust MCP,让 LLM 能查询评估日志并直接修改。我用 DeepWiki MCP 让代理访问任何开源 repo 的文档,无需手动拉入上下文。
当团队中多人开始编写同一技能的各自版本时,就值得合并到一个共享登记库。Block(节哀)对此有出色记录:他们建立了一个内部技能市场,拥有超过 100 个技能,并为特定角色和团队打包了精选组合。技能享有与代码同等待遇:PR、审查、版本历史。
另一个值得注意的趋势:LLM 使用 CLI 工具而非 MCP 正变得越来越普遍(似乎每家公司都在发布一个:Google Workspace CLI,Braintrust 也即将发布)。原因是 token 效率。MCP 服务器在每个回合都会将完整的工具 schema 注入上下文,无论代理是否使用。CLI 反转了这一点:代理运行一个目标命令,只有相关输出进入上下文窗口。我大量使用 agent-browser 正是出于这个原因,而非使用 Playwright MCP。
One thing before we continue. Levels 3 through 5 are the building blocks for everything that follows. LLMs are unpredictably good at some things and bad at others, and you need to develop an intuition for where those edges are before stacking more automation on top. If your context is noisy, your prompts are under- or misspecified, or your tools are poorly described, levels 6 through 8 just amplify the mess.
继续之前有一件事要说。L3 到 L5 是后续一切的基础。LLM 在某些事情上意外出色,在其他事情上糟糕,你需要培养对边界在哪里的直觉,然后才能在其上堆叠更多自动化。如果你的上下文嘈杂,提示词描述不足或有误,工具描述不佳,L6 到 L8 只会放大混乱。
This is where the rocket really starts to ship.
Context engineering is about curating what the model sees. Harness engineering is about building the entire environment, tooling, and feedback loops that let agents do reliable work without you intervening. Give the agent the feedback loop, not just the editor.

OpenAI's Codex team wired Chrome DevTools, observability tooling, and browser navigation into the agent runtime so it could take screenshots, drive UI paths, query logs, and validate its own fixes. Given a single prompt, the agent can reproduce a bug, record a video, and implement a fix. Then it validates by driving the app, opens a PR, responds to review feedback, and merges, escalating only when judgment is required. The agent doesn't just write code. It can see what the code produces and iterate on it, the same way a human would.
My team builds voice and chat agents for tech troubleshooting, so I built a CLI tool called converse that lets any LLM chat with our backend endpoint and have turn-by-turn conversations. The LLM makes code changes, uses converse to test conversations against the live system, and iterates. Sometimes these self-improvement loops run for several hours on end. This is especially powerful when the outcome is verifiable: the conversation must follow this flow, or call these tools in these situations (e.g., escalation to a human agent).
这是火箭真正开始发射的地方。
上下文工程是打理模型看见的东西,harness 工程则是构建整个环境、工具和反馈循环,让代理无需你介入就能可靠工作。给代理反馈循环,而不只是编辑器。

OpenAI 的 Codex 团队将 Chrome DevTools、可观测性工具和浏览器导航接入代理运行时,使其能截图、驱动 UI 路径、查询日志并验证自己的修复。仅凭一个提示,代理就能复现缺陷、录制视频并实现修复,然后通过驱动应用验证,打开 PR、响应审查反馈并合并,只在需要判断时升级。代理不仅写代码,还能看到代码产生的效果并迭代,就像人一样。
我的团队构建用于技术故障排除的语音和聊天代理,因此我构建了一个叫 converse 的 CLI 工具,让任何 LLM 都能与后端端点进行逐轮对话。LLM 进行代码更改,用 converse 在真实系统上测试对话,然后迭代。有时这些自我改进循环会持续数小时。当结果可验证时尤其强大:对话必须遵循某个流程,或在特定情况下调用这些工具(例如向人工代理升级)。
The concept that enables this is backpressure: automated feedback mechanisms (type systems, tests, linters, pre-commit hooks) that let agents detect and correct mistakes without human intervention. If you want autonomy, you need backpressure. Otherwise you end up with a slop machine. This extends to security too. Vercel's CTO makes the case that agents, the code they generate, and your secrets should live in separate trust domains, because a prompt injection buried in a log file can trick an agent into exfiltrating your credentials if everything shares one security context. Security boundaries are backpressure: they constrain what an agent can do when it goes off the rails, not just what it should do.
Two things that help here:
Design for throughput, not perfection. When perfection is required per commit, agents pile on the same bug and overwrite each other's fixes. Better to tolerate small non-blocking errors and do a final quality pass before release. We do the same for our human colleagues.
Constraints > instructions. Step-by-step prompting ("do A, then B, then C") is increasingly outdated. In my experience, defining boundaries works better than giving checklists, because agents fixate on the list and ignore anything not on it. The better prompt is "here's what I want, work on it until you pass all these tests."
使这一切成为可能的概念是背压(backpressure):自动化的反馈机制(类型系统、测试、linter、pre-commit hook)让代理能检测并纠正错误,无需人工介入。如果你想要自主性,就需要背压。否则你得到的就是一台垃圾制造机。这同样延伸到安全领域。Vercel 的 CTO 论证说,代理、其生成的代码以及你的密钥应处于不同的信任域,因为隐藏在日志文件中的提示注入可能诱骗代理窃取你的凭证——如果所有东西共享同一安全上下文。安全边界就是背压:限制代理在失控时能做什么,而不仅仅是该做什么。
这里有两件事很有帮助:
为吞吐量设计,不为完美。如果每次提交都要求完美,代理会在同一个 bug 上叠加并覆盖彼此的修复。更好的做法是容忍小的非阻塞错误,在发布前做一次最终质量检查。我们对人类同事也是如此。
约束大于指令。逐步提示(“先做 A,然后 B,然后 C”)已日益过时。根据我的经验,定义边界比给检查清单效果更好,因为代理会专注于清单而忽略清单之外的一切。更好的提示是:“这是我想要的结果,不断改进直到通过所有测试。”
The other half of harness engineering is making sure the agent can navigate your repo without you. OpenAI's approach: keep AGENTS.md to roughly 100 lines that serve as a table of contents pointing to structured docs elsewhere, and make documentation freshness part of CI rather than relying on ad hoc updates that go stale.
harness 工程的另一半是确保代理能在没有你的情况下导航你的代码库。OpenAI 的做法:保持 AGENTS.md 大约 100 行,作为目录指向别处的结构化文档,并将文档新鲜度纳入 CI,而不是依赖会过时的临时更新。
Once you've built all of this, a natural question emerges: if the agent can verify its own work, navigate the repo, and correct its mistakes without you, why do you need to be in the chair at all?
Heads up, for folks in the early levels, this next section may sound alien (but hey, bookmark and come back to it).
Hot take: plan mode is dying.
Boris Cherny, creator of Claude Code, still starts 80% of tasks in plan mode today. But with each new model generation, the one-shot success rate after planning keeps climbing. I think we're approaching the point where plan mode as a separate human-in-the-loop step fades away. Not because planning doesn't matter, but because models are getting good enough to plan well on their own. Big caveat: this only works if you've done the work in levels 3 through 6. If your context is clean, your constraints are explicit, your tools are well-described, and your feedback loops are tight, the model can plan reliably without you reviewing it first. If you haven't done that work, you'll still need to babysit the plan.
To be clear, planning as a general practice isn't going away. It's just changing shape. For newer practitioners, plan mode remains the right entry point (as described in Levels 1 and 2). But for complex features at Level 7, "planning" looks less like writing a step-by-step outline and more like exploration: probing the codebase, prototyping options in worktrees, mapping the solution space. And increasingly, background agents are doing that exploration for you.
一旦你构建完这些,一个自然的问题浮现:如果代理能验证自己的成果、导航仓库、纠正错误而不需要你,为什么还需要你坐在那里?
注意,对处于早期层级的同学,下一节可能听起来陌生(但没关系,收藏起来以后再看)。
尖锐观点:计划模式正在消亡。
Claude Code 的创建者 Boris Cherny 今天仍有 80% 的任务以计划模式开始。但随着每一代新模型的出现,规划后的一次成功率持续攀升。我认为我们正接近一个临界点,计划模式作为一个独立的人机交互步骤将逐渐消失。不是因为规划不重要,而是因为模型好到可以自己做好规划。重大提醒:这只有在完成了 L3 到 L6 的工作后才成立。如果你的上下文干净,约束清晰,工具描述良好,反馈循环严密,模型可以在不经过你审核的情况下可靠地规划。如果没做那些工作,你仍然需要看护计划。
明确一点,规划作为一种普遍实践不会消失,只是形式在变。对于较新的实践者,计划模式仍是正确的切入点(如 L1 和 L2 所述)。但在 L7 面对复杂特性时,“规划”看起来不像写一步步的大纲,而更像探索:探查代码库、在 worktree 中原型化选项、映射解空间。而且,后台代理正在越来越多地为你进行这种探索。
This matters because it's exactly what unlocks background agents. If an agent can generate a solid plan and execute without needing you to sign off, it can run asynchronously while you do something else. That's the critical shift from "multiple tabs I'm juggling" to "work that's happening without me."
The Ralph loop is the popular entry point: an autonomous agent loop that runs a coding CLI repeatedly until all PRD items are complete, where each iteration spawns a fresh instance with clean context. In my experience, getting the Ralph loop right is hard and any under/misspecification of the PRD comes back to bite. It's a little too fire-and-forget.
You can run multiple Ralph loops in parallel, but the more agents you spin up, the more you notice where your time actually goes: coordinating them, sequencing work, checking output, nudging things along. You're not writing code anymore. You've become a middle manager. You need an orchestrator agent that handles the dispatch so you can stay focused on intent, not logistics.

这一点很重要,因为它正是解锁后台代理的关键。如果代理能生成可靠的计划并在无需你签字的情况下执行,它就能在你做其他事情时异步运行。这是从“我同时处理多个标签页”到“工作在我无涉的情况下发生”的关键转变。
Ralph 循环是流行的切入点:一个自治代理循环,反复运行编码 CLI,直到所有 PRD 条目完成,每次迭代用干净上下文创建一个新实例。依我的经验,把 Ralph 循环做对很难,PRD 的任何欠缺或错误说明都会反噬。它有点过于“发射后不管”。
你可以并行运行多个 Ralph 循环,但启动的代理越多,你就越注意到时间实际花在了什么地方:协调、排序工作、检查输出、推动事情进展。你不再写代码,变成了中层管理者。你需要一个调度代理来处理派发,让你能专注于意图而非后勤。

The tool I've been using heavily for this is Dispatch, a Claude Code skill I built that turns your session into a command center. You stay in one clean session while workers do the heavy lifting in isolated contexts. The dispatcher plans, delegates, and tracks, so your main context window is preserved for orchestration. When a worker gets stuck, it surfaces a clarifying question rather than silently failing.
Dispatch runs locally, which makes it ideal for rapid development where you want to stay close to the work: faster feedback, easier to debug interactively, and no infrastructure overhead. Ramp's Inspect is the complementary approach for longer-running, more autonomous work: each agent session spins up in a cloud-hosted sandboxed VM with the full development environment. A PM spots a UI bug, flags it in Slack, and Inspect picks it up and runs with it while your laptop is closed. The tradeoff is operational complexity (infrastructure, snapshotting, security), but you get scale and reproducibility that local agents can't match. I'd say use both (local and cloud background agents).
One pattern that's been surprisingly powerful at this level: use different models for different jobs. The best engineering teams aren't staffed with clones. They're staffed with people who think differently, trained by different experiences, bringing different strengths. The same logic applies to LLMs. These models were post-trained differently and have meaningfully different dispositions. I routinely dispatch Opus for implementation, Gemini for exploratory research, and Codex for review, and the cumulative output is stronger than any single model working alone. Think wisdom of crowds, but for code.
Critically, you also need to decouple the implementer from the reviewer. I've learned this the hard way too many times: if the same model instance implements and evaluates its own work, it's biased. It will gloss over issues and tell you all tasks are complete when they aren't. It's not malice, it's the same reason you don't grade your own exam. Have a different model (or a different instance with a review-specific prompt) do the review pass. Your signal quality goes way up.

我为此重度使用的工具是 Dispatch,一个我构建的 Claude Code 技能,能把你的会话变成指挥中心。你留在一个干净的会话中,worker 在隔离的上下文中干重活。调度器计划、委派、跟踪,因此你的主上下文窗口留给编排。当 worker 卡住时,它会提出澄清问题,而不是悄悄失败。
Dispatch 在本地运行,非常适合你想贴近工作的快速开发:反馈更快,更容易交互式调试,没有基础设施开销。Ramp 的 Inspect 是补充方案,用于运行时间更长、更自主的工作:每个代理会话在云端托管的沙箱 VM 中启动,带有完整开发环境。一个 PM 发现 UI 缺陷,在 Slack 中标记,Inspect 就拾起并在你合上笔记本电脑时运行。权衡是运维复杂性(基础设施、快照、安全),但获得了本地代理无法比拟的规模化和可重现性。我建议两者都用(本地和云端后台代理)。
在这一层,一个令人惊讶的强大模式是:用不同模型做不同工作。最好的工程团队不是由克隆人组成的,而是由思维方式各异、经历不同、优势互补的人组成的。同样的逻辑适用于 LLM。这些模型经过不同的后训练,具有显著不同的倾向。我通常派 Opus 做实现,Gemini 做探索性研究,Codex 做审查,累积产出比任何单一模型单独工作都要强。想象一下群体的智慧,但用于代码。
关键的是,你还需要将实施者与审查者解耦。我多次惨痛地学到这个教训:如果同一个模型实例既实现又评估自己的工作,它会有偏见。它会掩盖问题,告诉你所有任务都完成了,实际上并未。这不是恶意,这和你不给自己考卷打分是同一个道理。让一个不同的模型(或一个带有审查专用提示的另一个实例)来做审查。你的信号质量会大幅提升。

Background agents also open the floodgates for combining your CI with AI. Once agents can run without a human in the chair, trigger them from your existing infrastructure. A docs bot that regenerates documentation on every merge and raises a PR to update CLAUDE.md (we do this and it's a huge time saver). A security reviewer that scans PRs and opens fixes. A dependency bot that actually upgrades packages and runs the test suite rather than just flagging them. Good context, compounding rules, capable tools, and automated feedback loops, now running autonomously.
后台代理也为将 CI 与 AI 结合打开了闸门。一旦代理能在无人值守的情况下运行,就可以从现有基础设施触发它们。一个文档机器人,每次合并后重新生成文档并提交 PR 以更新 CLAUDE.md(我们这样做,节省了大量时间)。一个安全审查员,扫描 PR 并打开修复。一个依赖机器人,真正升级包并运行测试套件,而不只是标记一下。良好的上下文、复合规则、强大的工具和自动化反馈循环,现在自主运行。
Nobody has mastered this level yet, though a few are pushing into it. It's the active frontier.
In Level 7, you have an orchestrator LLM dispatching work to worker LLMs in a hub-and-spoke pattern. Level 8 removes that bottleneck. Agents coordinate with each other directly, claiming tasks, sharing findings, flagging dependencies, and resolving conflicts without routing everything through a single orchestrator.
Claude Code's experimental Agent Teams feature is an early implementation: multiple instances work in parallel on a shared codebase, where teammates operate in their own context windows and communicate directly with each other. Anthropic used 16 parallel agents to build a C compiler from scratch that can compile Linux. Cursor ran hundreds of concurrent agents for weeks to build a web browser from scratch and migrate their own codebase from Solid to React.
But look closely and you'll see the seams. Cursor found that without hierarchy, agents became risk-averse and churned without progress. Anthropic's agents kept breaking existing functionality until a CI pipeline was added to prevent regressions. Everyone experimenting at this level says the same thing: multi-agent coordination is a hard problem and nobody is near optimal yet.
I honestly don't think the models are ready for this level of autonomy for most tasks. And even if they were smart enough, they're still too slow and too token-hungry for it to be economical outside of moonshot projects like compilers and browser builds (impressive, but far from clean). For the work most of us do day to day, Level 7 is where the leverage is. I wouldn't be surprised if Level 8 becomes the prevailing pattern eventually, but right now Level 7 is where I'd put my energy (unless you're Cursor and the breakthrough is the business).
还没有人完全掌握这一层级,但有少数人在推进。它是活跃的前沿。
在 L7,你有一个编排 LLM 以中心轮辐模式向 worker LLM 分派工作。L8 移除了这个瓶颈。代理彼此直接协调,认领任务、分享发现、标记依赖、解决冲突,而不必通过单一编排器中转所有内容。
Claude Code 的实验性 Agent Teams 功能是一个早期实现:多个实例在共享代码库上并行工作,队友在各自的上下文窗口中运作并直接相互通信。Anthropic 使用 16 个并行代理从头构建了一个能编译 Linux 的 C 编译器。Cursor 运行了数百个并发代理数周,从头构建网络浏览器,并将自己的代码库从 Solid 迁移到 React。
但仔细看会发现接缝。Cursor 发现没有层级结构时,代理变得风险规避、停滞不前。Anthropic 的代理不断破坏现有功能,直到添加了 CI 流水线防止回归。每个在该层级实验的人都说了同样的话:多代理协调是个难题,还没有人接近最优。
老实说,我不认为模型已为大多数任务的这种自主性做好准备。即使它们足够聪明,它们仍然太慢、太消耗 token,除了像编译器和浏览器构建这样的登月项目外并不经济(令人印象深刻,但远非干净)。对于大多数人日常的工作,L7 才是杠杆所在。L8 最终可能成为主流模式,我不会惊讶,但目前我会把精力放在 L7(除非你是 Cursor,突破本身就是业务)。
Inevitable what's next question.
Once you're adept at orchestrating agent teams without much friction, there's no reason the interface has to stay text-only. Voice-to-voice (thought-to-thought, maybe?) interaction with your coding agent — conversational Claude Code, not just voice-to-text input — is a natural next step. Look at your app, describe a sequence of changes out loud, and watch them happen in front of you.
There's a crowd chasing the perfect one-shot: state what you want and the AI composes it flawlessly in a single pass. The problem is that this presupposes we humans know exactly what we want. We don't. We never have. Software has always been iterative, and I think it always will be. It's just going to get much easier, stretch well beyond plain text interactions, and be a heck of a lot faster.
So: what level are you on? And what are you doing to get to the next one?
不可避免的“下一步是什么”问题。
一旦你能几乎无摩擦地编排代理团队,界面就没有理由仅限文本。与编码代理的语音对语音(也许意念对意念?)交互——对话式 Claude Code,不只是语音转文本输入——是自然的下一步。看着你的应用,大声描述一系列变化,然后看着它们在眼前发生。
有一群人在追逐完美的一键:说出你想要的,AI 在一遍中完美组合。问题在于这预设了我们人类确切知道自己想要什么。我们不知道。我们从来都不知道。软件一直是迭代的,我认为它将永远如此。只是会变得容易得多,远超纯文本交互,并且快得多。
那么:你现在处于哪一层?你正在做什么以到达下一层?