Loop Engineering: How One Loop Ships 259 PRs a Month

Source x.com Glean’d 2026-06-25 06:00 Read 12 min
AI summary

This article breaks down the engineering of AI-driven development loops, contrasting a single engineer shipping 259 PRs in a month with a runaway loop that burned $47,000. It dissects six essential components—state file, automation/scheduling commands (e.g., /loop, /schedule, /goal), git worktrees, skills, MCP connectors, and sub-agents (writer vs. checker)—with concrete configuration examples for both Claude Code and OpenAI Codex. The piece provides a brake configuration template (max_turns, max_budget_usd, scope, circuit_breaker), describes four failure modes, and offers low-cost starting strategies. Aimed at engineers building or evaluating AI agent workflows.

§ 1

Last December, one engineer shipped 259 finished code changes in a single month (in the jargon they are called PRs, pull requests). His AI wrote every one. He says he never opened a code editor. Peter Steinberger put it in two lines that more than eight million people have seen:

The flip side showed up the same month. Someone else's loop ran unwatched for 11 days and burned $47,000 before anyone noticed. So there are two skills, and only the first gets taught: build a loop that does the work, and build the brakes that keep it from driving you off a cliff.

§ 2

A loop is a small program wrapped around an agent. The agent takes a step. Something checks the result. Not done? It takes the next step. Round and round, until your condition comes true or you stop it.

One question separates a real loop from a money pit: is there an honest way to know the work was done right? A test that passes or fails. A build that compiles or does not. No check, and the agent just agrees with itself in circles and sends you the bill.
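A minimal sketch of that shape, assuming two hypothetical callables: `step` stands in for one agent action and `check` stands in for a test or build that can only pass or fail. Neither name comes from any tool; they are placeholders for illustration.

```python
from typing import Callable


def run_loop(step: Callable[[int], None],
             check: Callable[[], bool],
             max_turns: int = 50) -> tuple[bool, int]:
    """Call step() until check() passes or max_turns is used up.

    Returns (done, turns_used). Both callables are hypothetical
    stand-ins: step for one agent action, check for a test or build.
    """
    for turn in range(1, max_turns + 1):
        step(turn)
        if check():  # the honest signal: pass or fail, nothing else
            return True, turn
    return False, max_turns


# Toy usage: the "work" counts as done once three steps have run.
processed: list[int] = []
done, turns = run_loop(step=processed.append,
                       check=lambda: len(processed) >= 3)
```

Without a `check` that can actually fail, the only thing ending this loop is `max_turns`, which is the money-pit case described above.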

§ 3

A loop pays off under four conditions. Miss one, and it costs more than it returns.

  1. The task repeats at least weekly. Otherwise it is a one-time script, not a loop.
  2. Verification is automated. A test, a linter (a code-style checker), a build. No check, and you are back to reading every change by hand.
  3. Your budget can absorb waste. A loop re-reads context and explores, so it burns money even on empty runs.
  4. The agent has an engineer's tools. Logs, an environment to reproduce a failure, the ability to run its own code.

Good first loops | Bad first loops
Nightly triage of failing tests | Architecture rewrites
Weekly dependency bumps | Auth and payments code
Auto-fixing style errors | Production deploys
Hunting a flaky test | Anything where "done" is a matter of taste

§ 4

Strip the noise and a working loop is six parts: a state file, a scheduler, git worktrees, skills, connectors, and sub-agents. You no longer build them by hand: they ship inside the tools and map the same way onto Claude Code and the OpenAI Codex app.

§ 5

A model forgets everything when a run ends. The chat's memory dies with it, so the memory has to live on disk. In practice that is one file.

# STATUS.md  (the loop reads it first, writes it last)
## Done
- [x] auth: migrated to tokens v2, tests green
## In progress
- [ ] billing: webhook refactor (PR #214, tests red)
## Next
- [ ] dashboard: flaky test in test/charts
## Never
- do not touch infra/ without a human

Treat the loop like a night shift. You are judged not by what it did at three in the morning, but by the note on your desk at nine. Design the note, and half the loop designs itself.
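To make "reads it first, writes it last" concrete, here is a small sketch that parses the section layout shown above and moves one task between sections. The helper names (`read_status`, `write_status`, `move_task`) are hypothetical, and the parser assumes only `## ` headings with `- ` items beneath them.

```python
import tempfile
from pathlib import Path


def read_status(path: Path) -> dict[str, list[str]]:
    """Parse '## Section' headings and the '- ' items under each."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            sections[current].append(line[2:].strip())
    return sections


def write_status(path: Path, sections: dict[str, list[str]]) -> None:
    """Write the sections back in the same plain layout."""
    lines = ["# STATUS.md"]
    for name, items in sections.items():
        lines.append(f"## {name}")
        lines.extend(f"- {item}" for item in items)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def move_task(sections: dict[str, list[str]], keyword: str,
              src: str, dst: str) -> bool:
    """Move the first item in src that contains keyword over to dst."""
    for item in sections.get(src, []):
        if keyword in item:
            sections[src].remove(item)
            sections.setdefault(dst, []).append(item)
            return True
    return False


# Demo on a temporary copy of the example file.
with tempfile.TemporaryDirectory() as tmp:
    status = Path(tmp) / "STATUS.md"
    status.write_text(
        "# STATUS.md\n"
        "## Done\n- [x] auth: migrated to tokens v2, tests green\n"
        "## In progress\n- [ ] billing: webhook refactor (PR #214, tests red)\n"
        "## Next\n- [ ] dashboard: flaky test in test/charts\n"
        "## Never\n- do not touch infra/ without a human\n",
        encoding="utf-8",
    )
    sections = read_status(status)          # read first
    moved = move_task(sections, "dashboard", "Next", "In progress")
    write_status(status, sections)          # write last
    result = status.read_text(encoding="utf-8")
```

The "Never" section passes through untouched, which is the point: the rules survive every run along with the task list.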

§ 6

A loop becomes a loop when it starts on its own. In Codex that is the Automations tab: project, prompt, schedule. In Claude Code it is three separate commands, and this is where popular retellings get it wrong, so to be exact:

# /loop is a skill (not a built-in command). It repeats your prompt
# while the session is open:
/loop every 30 min: check src/auth for new failures

# /schedule is a cloud job (1-hour minimum, runs with the laptop closed):
/schedule daily PR triage at 9am

# /goal runs until a condition becomes TRUE.
# After each step a separate small model checks if it is done:
/goal all tests in test/auth pass and the lint step is clean

The key one is /goal. The agent that wrote the code does not get to grade itself. Write the stop condition like a contract: "all tests in test/auth pass" is a contract, "make it better" is how you keep a loop spinning until payday.
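One way to think about a contract-style stop condition, independent of how any tool implements it: the condition is a list of commands, and it holds only if every one exits with status 0. The sketch below is an assumption-laden illustration, not a description of /goal internals; the stand-in commands exist only so it runs anywhere.

```python
import subprocess
import sys
from typing import Sequence


def contract_met(commands: Sequence[Sequence[str]]) -> bool:
    """True only if every command exits with status 0."""
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            return False
    return True


# Stand-in commands so the sketch runs anywhere. In a real project these
# might be a test runner pointed at test/auth and a lint command.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

all_green = contract_met([passing, passing])
one_red = contract_met([passing, failing])
```

"Make it better" cannot be written as a list like this, which is a quick test of whether a stop condition is a contract at all.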

§ 7

Two agents writing to one file are two engineers editing the same lines in silence. A git worktree is the wall between them: a separate folder on its own branch, sharing the history.

# Claude Code: a copy of the project per agent
claude --worktree feature-auth

# or on a sub-agent in .claude/agents/<name>.md:  isolation: worktree

# Codex: choose "Worktree" when you open a thread

The caveat the tool will not mention: worktrees remove the collision, not the bottleneck. The bottleneck is you. However many agents you start, your review speed decides how many you can trust.
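Outside either tool, the same isolation can be set up with plain `git worktree add`. The sketch below only builds the command; the `auto/<task>` branch name and the sibling-folder layout are illustrative assumptions, and the repository path is made up.

```python
from pathlib import PurePosixPath


def worktree_command(repo: PurePosixPath, task: str) -> list[str]:
    """Build a `git worktree add` call giving one task its own folder and branch.

    The auto/<task> branch name and the sibling-folder layout are
    illustrative choices, not something git or the tools require.
    """
    target = repo.parent / f"{repo.name}-{task}"
    return ["git", "-C", str(repo), "worktree", "add",
            "-b", f"auto/{task}", str(target)]


cmd = worktree_command(PurePosixPath("/work/shop"), "fix-charts")
# To execute it for real: subprocess.run(cmd, check=True)
```

Each agent then works in its own folder on its own branch, and the changes meet again only at review time.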

§ 8

The agent starts from scratch and fills any gap with a confident guess. A skill is your intent, written where the model reads it every time. The format is the same everywhere: a folder with a SKILL.md file.

# .claude/skills/triage/SKILL.md  (Codex: ~/.agents/skills/triage/SKILL.md)
---
name: triage
description: Read yesterday's failing tests, open issues, and recent
  commits; write each finding to STATUS.md as a task.
---
1. Collect failing tests, group by root cause.
2. Match each one to the issue or commit that caused it.
3. Write one task in STATUS.md per real finding. Skip the noise.

Without skills, the loop relearns your project from a blank page. With them, it gets smarter every morning.
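Because a skill is only useful if its header is well-formed, a quick check before the loop runs can save a wasted night. The sketch below assumes the two header fields shown above (`name` and `description`) are the ones that matter; it is a deliberately tiny parser for that flat header, not a full YAML reader, and the function names are hypothetical.

```python
def parse_front_matter(text: str) -> dict[str, str]:
    """Read simple 'key: value' pairs between the first two '---' lines.

    Indented lines continue the previous value. This is not a full YAML
    parser, only enough for a flat header like the one shown above.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    key = None
    for line in lines[1:]:
        if line.strip() == "---":
            break
        if line[:1].isspace() and key is not None:
            fields[key] += " " + line.strip()   # continuation line
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            fields[key] = value.strip()
    return fields


def missing_fields(text: str,
                   required: tuple[str, ...] = ("name", "description")) -> list[str]:
    """List required header fields that are absent or empty."""
    fields = parse_front_matter(text)
    return [name for name in required if not fields.get(name)]


skill = (
    "---\n"
    "name: triage\n"
    "description: Read failing tests\n"
    "  and write tasks.\n"
    "---\n"
    "1. Collect failing tests, group by root cause.\n"
)
problems = missing_fields(skill)
```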

§ 9

A loop that only sees files is a toy. Connectors (the MCP standard) let the agent read your issue tracker, hit a staging server, post to chat.

# Claude Code
claude mcp add --transport http linear https://mcp.linear.app/mcp
# Codex
codex mcp add linear --url https://mcp.linear.app/mcp

This is the line between "here is how to fix it" and a loop that opens the pull request itself, links the ticket, and reports the tests green. Both tools speak MCP, so a connector for one usually drops into the other. Fastest payback: a code tracker, an issue tracker (Linear, Jira), chat (Slack), an error tracker.

§ 10

The most valuable move: split the one who writes from the one who checks. A model reviewing itself always gives a pass. A second agent, with different instructions and on a different model, catches what the first talked itself into.

# Codex: .codex/agents/reviewer.toml
name = "reviewer"
description = "Adversarial reviewer: correctness, security, tests."
developer_instructions = "Review like an owner of the project. Assume the
  author is wrong until the change proves otherwise."
model_reasoning_effort = "high"

# Claude Code: .claude/agents/reviewer.md
#   name: reviewer
#   description: Adversarial reviewer. After any code change.
#   model: opus

One explores, one writes, one checks against the spec. That is what /goal runs under the hood. A second opinion costs money, since each agent runs its own model. Spend it where being wrong is expensive.
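The writer/checker split can be sketched without any particular tool. Here `write` and `check` are hypothetical callables standing in for two agents with different instructions; the toy versions at the bottom exist only to make the sketch runnable.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    approved: bool
    reason: str


def write_then_check(write: Callable[[str], str],
                     check: Callable[[str], Review],
                     task: str,
                     max_attempts: int = 3) -> tuple[Optional[str], list[str]]:
    """Accept a change only when a separate checker approves it.

    Rejection reasons are fed into the next attempt. Returns the accepted
    change (or None) plus every objection raised along the way.
    """
    objections: list[str] = []
    prompt = task
    for _ in range(max_attempts):
        change = write(prompt)
        review = check(change)
        if review.approved:
            return change, objections
        objections.append(review.reason)
        prompt = f"{task}\nReviewer objection: {review.reason}"
    return None, objections


# Toy stand-ins: the writer adds tests only after being told to.
def toy_writer(prompt: str) -> str:
    return "patch with tests" if "objection" in prompt else "patch"


def toy_checker(change: str) -> Review:
    if "tests" in change:
        return Review(True, "ok")
    return Review(False, "no tests cover the change")


change, objections = write_then_check(toy_writer, toy_checker,
                                      "fix billing webhook")
```

The writer never sees its own verdict: approval comes only from `check`, and `max_attempts` keeps the exchange from turning into an open-ended conversation.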

§ 11

Put the six parts together and a single thread becomes a machine.

  1. An automation runs each morning on the project.
  2. The triage skill reads the overnight failures, issues, and commits and writes findings to STATUS.md.
  3. For each finding, a separate worktree opens.
  4. One sub-agent writes the fix, a second checks it against your tests.
  5. Connectors open the pull request and update the ticket.
  6. Whatever failed waits in the triage inbox. STATUS.md remembers the rest.

You wake up not to a wall of logs, but to a short note:

§ 12

That $47,000 bill was not an evil AI. It was two agents politely asking each other for more work: no step limit, no money ceiling, no stop condition. One cap would have killed it on day one.

# loop.config: bolt these on BEFORE you walk away
max_turns: 50            # hard stop, no exceptions
max_budget_usd: 10       # money ceiling per run
scope: [src/]            # read-only outside the folder
write_branches: [auto/*] # blast radius: where it may write
circuit_breaker: 3       # same call 3x in a row = halt
heartbeat: STATUS.md     # silence instead of a mark = alarm

The core rule: scope the loop by what it can break, not by what you want it to do. Blast radius first, task second. And price it honestly:

The technology is the same in all three cases. The only difference is whether the brakes are on.
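A minimal sketch of how three of those brakes (`max_turns`, `max_budget_usd`, `circuit_breaker`) might be enforced in a hand-rolled loop. The class and function names are hypothetical, the per-step cost is assumed to be reported by whatever runs the agent, and `scope`, `write_branches`, and `heartbeat` are left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


class LoopHalted(Exception):
    """Raised when any brake trips."""


@dataclass
class Brakes:
    max_turns: int = 50
    max_budget_usd: float = 10.0
    circuit_breaker: int = 3
    turns: int = 0
    spent_usd: float = 0.0
    last_call: str = ""
    repeats: int = 0

    def record(self, call: str, cost_usd: float) -> None:
        """Register one finished step; raise LoopHalted if a limit is hit."""
        self.turns += 1
        self.spent_usd += cost_usd
        self.repeats = self.repeats + 1 if call == self.last_call else 1
        self.last_call = call
        if self.repeats >= self.circuit_breaker:
            raise LoopHalted(f"circuit breaker: same call {self.repeats}x in a row")
        if self.spent_usd >= self.max_budget_usd:
            raise LoopHalted(f"budget: ${self.spent_usd:.2f} spent")
        if self.turns >= self.max_turns:
            raise LoopHalted(f"turn limit: {self.turns} turns used")


def run_until_halt(brakes: Brakes,
                   calls: Iterable[tuple[str, float]]) -> Optional[str]:
    """Feed (call, cost) pairs to the brakes; return the halt reason, if any."""
    for call, cost in calls:
        try:
            brakes.record(call, cost)
        except LoopHalted as exc:
            return str(exc)
    return None


# An agent stuck re-reading the same file trips the breaker on the third call.
reason = run_until_halt(Brakes(), [("read STATUS.md", 0.02)] * 10)
```

Every limit is checked after every step, so whichever brake trips first ends the run; none of them depends on the agent agreeing that it should stop.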

§ 13
  1. Runaway recursion. Two agents feed each other work forever. Cure: a step cap and a money ceiling.
  2. Silent death. The context filled up, the loop stalled, but it looks alive and reports "progress." Cure: a heartbeat and a fresh context per phase.
  3. Walking in circles. With no verifiable condition, the loop drifts from the goal. Green tests are a point it can reach. "Looks done" is not.
  4. Comprehension debt. The faster the loop writes code you never read, the further you drift from your own project. Cure: a mandatory "a human read this" step you never skip.

A bonus with a name: the "Ralph loop." The agent signals "done" too early and exits halfway. It happens where there is no real checker and no hard stop. A useful self-check metric: track not tokens but the cost per accepted change. If fewer than half of the loop's changes are accepted, it is running at a loss.
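The metric is simple arithmetic. In the sketch below the dollar figures and change counts are made-up numbers for illustration, and the function names are hypothetical.

```python
def cost_per_accepted_change(total_cost_usd: float, accepted: int) -> float:
    """Total spend divided by the number of changes a human accepted."""
    if accepted == 0:
        return float("inf")  # money spent, nothing kept
    return total_cost_usd / accepted


def acceptance_rate(accepted: int, proposed: int) -> float:
    """Share of proposed changes that were accepted."""
    return accepted / proposed if proposed else 0.0


# Hypothetical week: $40 spent, 20 changes proposed, 8 accepted.
cost = cost_per_accepted_change(40.0, 8)
rate = acceptance_rate(8, 20)
running_at_a_loss = rate < 0.5
```

With these numbers each accepted change costs $5 and the acceptance rate is 40%, so by the rule of thumb above this loop is losing money even though the per-change price looks small.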

§ 14

The loop is built for people who do not pay for tokens. Here is how to get most of the value for less.

  1. Spend the thinking before the tokens. "Build me auth" spins for 30 steps of guessing. A finished spec (login method, token lifetime, error states, definition of done) hits the target in one pass.
  2. Plan cheap, execute expensive. Your top model should not read files. Give reading and planning to a small model, leave the hard part to the strong agent.
  3. Turn on caching. The loop sends the same context every step, and caching is exactly what makes the repeat cheaper. Enabling it is roughly a 15-minute change, and it cuts the cost of repeated input by about 90%.
  4. Engineer the context. The rule from the Devin team: delegate the reading, centralize the writing. Send the bulky search to a cheap sub-agent, the main one reads a short summary, not forty files.
  5. Be the loop yourself. Three passes you steer beat thirty run alone. If you automate, cap it by step count.

When a loop is worth it: the finish line is green tests, the run is bounded, and an hour of your time beats the tokens. Overnight refactors and large backlogs you cannot clear by hand are exactly that shape. But that is a narrow slice of the work.

§ 15

Two people build the same loop and get opposite results. One speeds up on work they understand to the bone. The other stops understanding the work at all. The loop cannot tell them apart. You can.

The leverage moved: from the prompt to the loop, from the code to the judgment. That is a harder job, not a softer one.

So the move is small, on purpose. Tomorrow morning, take the most boring job you still do by hand: triaging failures, closing stale issues, chasing a flaky test. Wrap one capped loop around it. Brakes first. Small enough that you read every change.

Nobody who ships two hundred changes a month started with a hundred agents. They started with one loop they trusted. Build that one.
