Loop Engineering: A Technical Roadmap for an Autonomous Loop

Source x.com Glean’d 2026-06-29 06:01 Read 18 min
AI summary

This is a technical roadmap for building reliable autonomous loops, arguing that a loop is fundamentally different from a prompt—a prompt requires manual initiation while a loop drives itself: set a goal once, then the system finds work, executes, checks, fixes, and repeats until completion. The author emphasizes that the ceiling is set not by prompting skills but by engineering a loop that converges toward truth rather than becoming an expensive random walk. The piece provides step-by-step guidance (Step 0 through Step 7) with working code (Bash scripts), explaining the mechanics of stateless iteration (fresh context per turn to combat context rot), building a narrow relevant context with a token budget, designing incorruptible checks (external deterministic oracle + reward-hacking defense gates + adversarial judge on a different model), dual-level state persistence (human-readable STATUS.md + machine-parseable JSON), physical isolation (git worktree, container with --network none), brakes with observability (structured JSONL log, circuit breakers for stuck/repeated failures, liveness heartbeats), and nonlinear cost analysis (why stateless keeps per-iteration cost constant while stateful grows quadratically). This is aimed at production engineers building AI agent pipelines who need practical, verifiable techniques.

§ 1

A loop is not a prompt. A prompt you turn yourself. A loop turns itself: set the goal once, then the system finds work, does it, checks it, fixes it, repeats until done. The skill that sets the ceiling is not writing a prompt, it is building a loop that converges to truth instead of becoming an expensive random walk.

Below is a technical roadmap. Each step covers not only what to do but why it has to be done exactly this way, at the level of mechanics. The order is strict: a skipped step is where the loop blows up later.

§ 2

A filter that saves weeks. A loop only makes sense if there is a check that delivers a verdict independent of the agent.

The mechanics of why this is critical. A model that generates a solution and then grades it has a conflict of interest at the statistical level: its own output is, to it, a high-probability continuation, so it systematically overrates that output's correctness. This is not the model being lazy, it is a direct consequence of sampling from a distribution in which its own answer already carries elevated probability. So an agent's self-assessment is not a check, it is an echo.

The check must be an external deterministic oracle: a test, type-check, linter, build, a number above a threshold. Something that returns an exit code, not an opinion.

A hard requirement on the check almost everyone misses: it must be deterministic and idempotent. A flaky test (green then red on the same code) is worse than no test, because it breaks the stop condition: the loop will fix what is not broken, or stop on what is broken. Before building the loop, run the check ten times on one state. If the result is not stable, fix the check first, then the loop.

Fail this filter, do not build a loop.
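
The "run the check ten times on one state" test can be scripted. The sketch below is a hypothetical helper (`check_is_stable` is not from the article) that runs any command repeatedly and compares exit codes only. It assumes the check's verdict is fully carried by its exit code.

```shell
#!/usr/bin/env bash
# check_is_stable RUNS CMD [ARGS...]
# Hypothetical helper: runs CMD RUNS times on the same state and reports
# whether the exit code was identical on every run.
check_is_stable() {
  local runs=$1; shift
  local first="" code n
  for ((n = 1; n <= runs; n++)); do
    code=0
    "$@" >/dev/null 2>&1 || code=$?
    if [ -z "$first" ]; then
      first=$code                      # remember the verdict of run 1
    elif [ "$code" -ne "$first" ]; then
      echo "unstable: run 1 exited $first, run $n exited $code"
      return 1
    fi
  done
  echo "stable: exit code $first on all $runs runs"
}
```

For the loop in this article the call would be `check_is_stable 10 npm test --silent`. A stable red is fine at this stage; what disqualifies the check is a verdict that changes without the code changing.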

§ 3

Do not automate what does not work by hand. Do the task once manually through the agent to a green check. But on this step do one more thing: measure.

Record how many model calls it took, how many tokens, what the most frequent agent error type was. This is your baseline. When the loop later burns three times as much, you will know something broke, because you have something to compare against.

If the manual run is unstable, the loop multiplies instability by the iteration count. Reliability of one pass first, then automation.
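
One way to keep the baseline comparable later is to write it to a file the loop can read. This is a minimal sketch under assumptions: the file name `.loop_baseline`, its `key=value` format, and both function names are hypothetical, and the numbers in the usage note are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical baseline file: one key=value pair per line.
BASELINE_FILE="${BASELINE_FILE:-.loop_baseline}"

# save_baseline CALLS TOKENS TOP_ERROR: record the manual run's numbers.
save_baseline() {
  printf 'calls=%s\ntokens=%s\ntop_error=%s\n' "$1" "$2" "$3" > "$BASELINE_FILE"
}

# over_baseline KEY CURRENT FACTOR: succeeds when CURRENT is more than
# FACTOR times the recorded baseline value for KEY (integers only).
over_baseline() {
  local key=$1 current=$2 factor=$3 base
  base=$(grep -m1 "^${key}=" "$BASELINE_FILE" | cut -d= -f2)
  [ "$current" -gt $((base * factor)) ]
}
```

After a manual run that took, say, 6 calls and 42000 tokens, `save_baseline 6 42000 "type error"` records it, and `over_baseline tokens 130000 3` later succeeds because 130000 exceeds three times the baseline.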

§ 4

The simplest working loop is a while loop feeding the agent a prompt until the check is green.

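#!/usr/bin/env bash
set -euo pipefail
MAX_ITER=20
i=0

while [ "$i" -lt "$MAX_ITER" ]; do
  i=$((i + 1))
  echo "=== Iteration $i of $MAX_ITER ==="

  # the external check decides, not the agent
  if npm test --silent; then
    echo "Green in $i iterations."; exit 0
  fi

  # a fresh agent process each turn: no conversation is carried over
  claude -p "Tests fail. Run npm test, read the first failure,
  make the minimal change that fixes it. Do not refactor unrelated
  code. Do not weaken the tests." \
    --permission-mode acceptEdits
done

# verify the last agent turn before giving up
if npm test --silent; then
  echo "Green after $MAX_ITER iterations."; exit 0
fi
echo "Limit $MAX_ITER. Tests red."; exit 1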

The key property here is not obvious: each iteration launches the agent anew, from a clean context. This is not laziness, it is an engineering decision, and here are the mechanics of why.

The model degrades as the context window fills. This is not a linear slowdown but a measurable loss of quality: instructions lose their grip as the window fills (the lost-in-the-middle effect: material that ends up in the middle of a long context is recalled worst), and the more history there is, the more the model is distracted by its own past turns instead of the current state. This is called context rot.

Stateless iteration cures this radically. Progress is held not by the agent's memory but by the filesystem and git. Each new run sees the changed files and the red test, reads them anew, and works with a short fresh context where the instructions are in plain view. You deliberately throw away conversational memory so as not to accumulate degradation. State on disk, not in the window.

MAX_ITER is the first fuse. Without it the loop spins until the money runs out.

§ 5

Saying "fresh context" is easy, building it right is separate engineering, and more loops break here than you would think. If you feed every iteration the whole repo tree, you kill the point of stateless: the window fills, context rot returns, plus you pay for a ton of irrelevant tokens. If you feed too little, the agent does not see what it needs and fixes blind.

The right iteration context is three things and nothing extra: the current state (what is done and what is blocked), the specific open failure being worked on, and only the files relevant to that failure. Not the whole repo, but its relevant slice.

How to pick relevant files mechanically. Do not make the agent guess across the whole tree, assemble the slice yourself from signals that already exist: files mentioned in the failing test's stack trace, files changed in the last diff, files the test imports. This is cheap and deterministic.

#!/usr/bin/env bash
# build_context.sh — assembles a narrow relevant context for the iteration
set -euo pipefail

CONTEXT_FILE=".loop_context.md"
TOKEN_BUDGET=8000   # context ceiling so the window does not fill
> "$CONTEXT_FILE"

# 1. machine state first: where we are and what not to touch
echo "## State" >> "$CONTEXT_FILE"
cat .loop_state.json >> "$CONTEXT_FILE"
echo >> "$CONTEXT_FILE"

# 2. the specific failure being worked on (first failing test)
echo "## Current failure" >> "$CONTEXT_FILE"
failure=$(npm test 2>&1 | grep -A 15 -m1 "FAIL" || true)
echo '' >> "$CONTEXT_FILE"
echo "$failure" >> "$CONTEXT_FILE"
echo '' >> "$CONTEXT_FILE"

# 3. extract file paths from the failure stack trace (real repo files only)
echo "## Relevant files" >> "$CONTEXT_FILE"
# (the guards keep set -e / pipefail from aborting when nothing matches)
files=$(echo "$failure" \
  | { grep -oE '[a-zA-Z0-9_/.-]+\.(ts|js|py|go)' || true; } \
  | sort -u \
  | while read -r f; do if [ -f "$f" ]; then echo "$f"; fi; done)

# 4. add files from the last diff (what the loop changed last turn)
changed=$(git diff --name-only HEAD~1 2>/dev/null || true)

# 5. merge, dedupe, pour in contents within the token budget
printf "%s\n%s\n" "$files" "$changed" | sort -u | while read -r f; do
  [ -z "$f" ] && continue
  [ -f "$f" ] || continue
  # rough token estimate: chars / 4. do not exceed the budget
  budget_chars=$((TOKEN_BUDGET * 4))
  current=$(wc -c < "$CONTEXT_FILE")
  fsize=$(wc -c < "$f")
  if [ $((current + fsize)) -gt "$budget_chars" ]; then
    echo "### $f (skipped, context budget exceeded)" >> "$CONTEXT_FILE"
    continue
  fi
  echo "### $f" >> "$CONTEXT_FILE"
  echo '' >> "$CONTEXT_FILE"
  cat "$f" >> "$CONTEXT_FILE"
  echo '' >> "$CONTEXT_FILE"
done

echo "Context built: $(wc -l < "$CONTEXT_FILE") lines, $(($(wc -c < "$CONTEXT_FILE") / 4)) ~tokens"

Now the loop iteration feeds the agent not "everything there is" but this narrow file:

# inside the loop, before the agent call
./build_context.sh
claude -p "Context is in .loop_context.md. Fix the first failing test
with a minimal change, touch only files from the relevant ones." \
  --permission-mode acceptEdits

Why a token budget as an explicit number. This is not decoration. It is the guarantee that the iteration context does not grow unnoticed as the diff and stack traces grow. Without a ceiling, a loop that started with clean context drowns again in its own history after twenty iterations, just smuggled through files instead of conversation. The ceiling keeps every iteration equally light, and that is what preserves both quality and linear cost.

The relevance heuristic here is deliberately dumb (files from the stack trace plus the last diff), and that is right for the start: cheap, deterministic, explainable. Smart variants (file embeddings, dependency graph) add precision but also complexity, and they need debugging of their own. Start with the dumb heuristic, complicate it only if it actually misses.
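
The text names three signals, but build_context.sh uses only two of them (stack trace and last diff). The third, files the test imports, could be sketched as follows. `imports_of` is a hypothetical helper, not part of the original script. It assumes JS/TS sources, handles only relative `from '...'` and `require('...')` specifiers, and is no substitute for a real module resolver.

```shell
#!/usr/bin/env bash
# imports_of FILE: print the existing files that FILE imports through
# relative paths. Hypothetical helper; package imports are ignored.
imports_of() {
  local file=$1 dir spec ext
  dir=$(dirname "$file")
  grep -oE "(from|require\()[[:space:]]*['\"]\.{1,2}/[^'\"]+['\"]" "$file" 2>/dev/null \
    | sed -E "s/.*['\"]([^'\"]+)['\"]$/\1/" \
    | sort -u \
    | while read -r spec; do
        # try the specifier as written, then with common extensions
        for ext in "" .ts .js .tsx .jsx; do
          if [ -f "$dir/$spec$ext" ]; then
            echo "$dir/$spec$ext"
            break
          fi
        done
      done
}
```

The printed paths are not normalized (they may contain `..`), which is enough for the `[ -f "$f" ]` checks in build_context.sh. Its output can be merged into the same `sort -u` list as the other two signals, so the token budget still applies.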

§ 6

The heart of the loop. There are two separate technical problems here.

First: the check must be independent. An external oracle (a test's exit code) not the agent's opinion. Already covered.

Second, more subtle: the agent will try to fool the check, and this is not malice but optimization. If the loop's only goal is to make the test green, the model finds the cheapest path to green, and often that is not fixing the code but breaking the test. Delete an assert, mock everything, wrap in try/except, hardcode the expected value. This is reward hacking: the optimizer exploits a hole in the metric instead of solving the task.

The defense is several layers:

A prohibition in the prompt ("do not weaken the tests") is the weakest layer, the agent breaks it under pressure.

The real defense is a second check the agent does not control. For example tests live read-only and the loop physically cannot edit them, or a separate gate verifies the test files did not change in the diff:

# gate against reward hacking: tests must not change in this loop
if ! git diff --quiet -- test/; then
  echo "Agent changed the tests. Revert, this is reward hacking."
  git checkout -- test/
  exit 3
fi

The third layer is an independent judge on a different model. After a turn a separate reviewer agent reads the diff and decides whether the task is solved in substance, not only whether the test is green. A judge on a different model matters: a model catches its own self-deception patterns poorly but catches others' well.

# .claude/agents/reviewer.md
---
name: reviewer
description: Adversarial judge. After every code change.
model: opus
---
Assume the author is wrong until the diff proves otherwise.
Check separately: the tests went green BECAUSE the code was fixed,
not because the tests were weakened. If asserts were deleted, mocks
replaced logic, values hardcoded, return FAIL with the location.
You do not fix code, you deliver a verdict PASS or FAIL with a reason.

The cost is real: a judge on a strong model roughly doubles the bill per turn. So reserve it for changes where a mistake is expensive, and keep the cheap deterministic gate (the test diff) always on, since it is nearly free.
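
For the loop to act on the judge, its free-text answer has to be reduced to something as unambiguous as an exit code. A minimal sketch, assuming (hypothetically) that the reviewer puts PASS or FAIL as the first word of its first non-empty line; `verdict_of` is an illustrative name. Anything unexpected, including empty output, is treated as FAIL, so a confused judge can never wave a change through.

```shell
#!/usr/bin/env bash
# verdict_of: read reviewer output on stdin, print PASS or FAIL.
# Hypothetical convention: the verdict is the first word of the first
# non-empty line. Everything else counts as FAIL.
verdict_of() {
  local line first
  while IFS= read -r line || [ -n "$line" ]; do
    [ -n "${line//[[:space:]]/}" ] || continue   # skip blank lines
    read -r first _ <<< "$line"
    first=${first%%[.,:]*}                       # "PASS:" becomes "PASS"
    if [ "$first" = "PASS" ]; then echo PASS; else echo FAIL; fi
    return 0
  done
  echo FAIL   # no output at all: the judge did not answer
}
```

Whatever command produces the reviewer's text, piping it through `verdict_of` gives the loop a single word to branch on. The fail-closed default is the design choice that matters here.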

§ 7

The model forgets when the run ends. Memory lives in a file the loop reads first and writes last.

# STATUS.md  (read first, written last)
## Done
- [x] auth: migrated to token v2, tests green
## In progress
- [ ] billing: webhook refactor (PR #214, CI red)
## Next
- [ ] dashboard: flaky test in test/charts
## Never
- do not touch infra/ without a human

But one markdown file is the minimum. Technically it is more robust to hold state at two levels: a human-readable STATUS.md for you, and machine state the loop parses unambiguously. The model may read free text differently from run to run, so fields critical to the loop's logic go into structure:

// .loop_state.json  — machine state, parsed unambiguously
{
  "phase": "billing-webhook",
  "iteration": 7,
  "last_green_commit": "a3f21c8",
  "blocked_paths": ["infra/", "test/"],
  "open_failures": ["test/billing/webhook.spec.ts:42"],
  "budget_spent_usd": 4.10
}

The split is needed because human-readable and machine-parsable are different requirements. STATUS.md for your morning glance, JSON for the loop's logic, which must not depend on how the model rephrased its plan today.

Treat the loop as a night shift you never see. You will be judged by the note in the morning, not by what the loop did at three a.m. Design the note first.
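
Structured state pays off once the loop enforces it mechanically instead of asking the model to respect it. The sketch below checks a changed path against the `blocked_paths` prefixes from .loop_state.json. `is_blocked` is a hypothetical helper, and loading the list with `jq` is shown only as a comment, on the assumption that `jq` is installed.

```shell
#!/usr/bin/env bash
# is_blocked PATH PREFIX...: succeeds when PATH is a blocked prefix itself
# or lies underneath one. Hypothetical helper.
is_blocked() {
  local path=$1 prefix; shift
  for prefix in "$@"; do
    case "$path" in
      "${prefix%/}"/*|"${prefix%/}") return 0 ;;
    esac
  done
  return 1
}

# Usage sketch (needs jq and git, not executed here):
#   mapfile -t blocked < <(jq -r '.blocked_paths[]' .loop_state.json)
#   git diff --name-only | while read -r f; do
#     if is_blocked "$f" "${blocked[@]}"; then echo "blocked path touched: $f"; fi
#   done
```

Matching on a whole path component means `infra/` blocks `infra/terraform/main.tf` but not `infrastructure/readme.md`.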

§ 8

Brakes are the half no one teaches. But before limits, physical isolation, because a limit can be exceeded by one step, while access is binary: either the loop can wipe prod or it cannot.

Isolation via git worktree gives the loop a separate working copy on its own branch, physically detached from your main tree:

# separate worktree on its own branch, the loop lives only here
git worktree add ../loop-sandbox -b loop/billing-fix
cd ../loop-sandbox

This already limits the radius: the loop does not see your working branch. But a worktree is still the same filesystem. For real isolation, a container with stripped permissions:

# container: working folder writable, the rest read-only,
# outbound network off (important against prompt injection)
docker run --rm \
  --network none \
  --read-only \
  --tmpfs /tmp \
  -v "$(pwd):/work:rw" \
  -v "$HOME/.claude:/root/.claude:ro" \
  -w /work \
  loop-runner ./loop.sh

Why --network none is not paranoia but necessity. The loop reads untrusted input: task texts, others' code, commit messages. Any of them may contain prompt injection, an instruction the agent executes as a command. If an issue says "delete the database and push," an agent with network and permissions can do it. With no outbound network and read-only outside the working folder, the maximum damage is confined to the sandbox. Blast radius is about security, not only errors. One caveat: with --network none, an agent CLI running inside the container cannot reach a hosted model API either, so in practice the model call has to happen outside the container or egress has to be limited to that one endpoint.

The rule: define the loop by what it can destroy, not by what you want it to do. Radius first, task second.

§ 9

Now the limits, and most importantly a structured log so you can later understand why the loop died. Without a log you stare at a burned loop at three a.m. and do not know what happened.

#!/usr/bin/env bash
set -euo pipefail
MAX_ITER=20
MAX_BUDGET_USD=10
i=0
last_failure=""
repeat_count=0
LOG=".loop_log.jsonl"

log() {  # structured log, one json line per event
  local detail=${2//\\/\\\\}      # escape backslashes, then quotes,
  detail=${detail//\"/\\\"}       # so the line stays valid json
  echo "{\"ts\":$(date +%s),\"iter\":$i,\"event\":\"$1\",\"detail\":\"$detail\"}" >> "$LOG"
}

while true; do
  i=$((i + 1))
  echo "iter=$i ts=$(date +%s)" > .loop_heartbeat   # liveness
  log "iter_start" ""

  # reward-hacking gate runs first: a green built on edited tests must not count
  if ! git diff --quiet -- test/; then
    log "reward_hack" "tests modified"; git checkout -- test/; exit 3
  fi

  if npm test --silent; then
    log "green" "done in $i"; echo "Green in $i."; exit 0
  fi

  # iteration limit, checked after the test so the last agent turn is verified
  if [ "$i" -gt "$MAX_ITER" ]; then
    log "iter_limit" ""; echo "Iteration limit, tests red."; exit 1
  fi

  # circuit breaker: same failure 3 times = stuck
  current_failure=$(npm test 2>&1 | grep -m1 "FAIL" || true)
  if [ "$current_failure" = "$last_failure" ]; then
    repeat_count=$((repeat_count + 1))
    if [ "$repeat_count" -ge 2 ]; then
      log "stuck" "$current_failure"; echo "Stuck, calling a human."; exit 2
    fi
  else
    repeat_count=0
  fi
  last_failure="$current_failure"

  log "agent_call" "$current_failure"
  claude -p "Fix the first failing test with a minimal change." \
    --permission-mode acceptEdits \
    --max-budget-usd "$MAX_BUDGET_USD"
done

What the structured log gives you. Each event is a json line with time, iteration, type, and detail. After the loop dies you grep the log and see the pattern in a second: did the count climb with no green (runaway), did one failure repeat (stuck), did the agent change tests (reward hacking), when did the heartbeat stop updating (silent death). Without a log this is guessing, with a log it is a diagnosis.

The minimal brakes fitted here: iteration limit, budget cap per turn, repeat detector, liveness marker, reward-hacking gate, plus the isolation from step 5 (§ 8). This is the lower bound for a loop you leave unattended.
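
The "grep the log and see the pattern" step can itself be a small script. This sketch relies only on the event names the loop script writes (`green`, `reward_hack`, `stuck`, `iter_limit`); the function names and the wording of the output are hypothetical.

```shell
#!/usr/bin/env bash
# count_events LOGFILE EVENT: number of log lines carrying that event name.
count_events() {
  local n
  n=$(grep -c "\"event\":\"$2\"" "$1" 2>/dev/null) || true
  echo "${n:-0}"
}

# diagnose LOGFILE: print how the loop ended, judged by its terminal event.
diagnose() {
  local log=$1
  if [ "$(count_events "$log" green)" -gt 0 ]; then echo green
  elif [ "$(count_events "$log" reward_hack)" -gt 0 ]; then echo reward_hacking
  elif [ "$(count_events "$log" stuck)" -gt 0 ]; then echo stuck
  elif [ "$(count_events "$log" iter_limit)" -gt 0 ]; then echo runaway
  else echo "no terminal event (check the heartbeat)"
  fi
}
```

`diagnose .loop_log.jsonl` gives the one-word answer, and `count_events .loop_log.jsonl agent_call` gives the number of paid agent turns to compare against the baseline. A log with no terminal event at all is the signature of a loop that died without saying so.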

§ 10

One technical point about money that breaks intuition. A loop costs not "N model calls" but the sum of growing contexts.

Why. If the loop were stateful (accumulating history), each iteration would reread the whole previous conversation, and total cost would grow quadratically: iteration k pays for the k − 1 previous turns plus its own, so N iterations pay for roughly N²/2 turns. This is the second, economic reason to make the loop stateless. Fresh context each time keeps the per-iteration cost roughly constant: each pays for reading state from disk (small) plus its own work, not for the whole history.

A rough estimate before launch: cost ≈ (iteration count) × (state tokens + work tokens per iteration) × price per token. Measure one iteration on step 1 (§ 3), multiply by MAX_ITER, get the upper bound. If it scares you, lower MAX_ITER or split the task into phases, do not launch and hope.
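
The difference between the two regimes is easy to see with integer arithmetic. This is a simplified model, not a billing formula: it assumes every turn adds the same number of tokens and ignores prompt caching, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# stateless_tokens ITERS TOKENS_PER_ITER: every iteration pays the same amount.
stateless_tokens() { echo $(( $1 * $2 )); }

# stateful_tokens ITERS TOKENS_PER_ITER: iteration k rereads the k-1 earlier
# turns plus its own, so the total is TOKENS_PER_ITER * (1 + 2 + ... + ITERS).
stateful_tokens() { echo $(( $2 * $1 * ($1 + 1) / 2 )); }
```

With 20 iterations at 10000 tokens each (illustrative numbers), `stateless_tokens 20 10000` prints 200000 and `stateful_tokens 20 10000` prints 2100000, a factor of 10.5 for the same amount of work. Multiply either figure by your per-token price to get the bound in money.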

The real spread from practice: with brakes, this approach completes a contract on hundreds of dollars of API spend; without brakes, it burns tens of thousands. The difference is not the model but whether there was a real check and whether limits were in place.

§ 11

Four deaths, and how each shows in the structured log from step 6 (§ 9).

Runaway. Bill and iterations climb, no green. In the log: many agent_call in a row, no green. Cure: step and budget limits.

Silent death. The loop "works" but stands still. In the log: the heartbeat stopped updating, no new events. Cause: the agent hit a full context window. Cure: fresh context per phase, the liveness marker catches the symptom.

Random walk. Spins but moves away from the goal. In the log: agent_call present, current_failure changes to a new one each time, no progress to green. Cause: no hard stop condition. Cure: a deterministic fixpoint check.

Understanding debt. The repo grows, you understand less. Not visible in the log at all, and that is the most dangerous. The loop ships code faster than you read it, you stamp diffs blind. Cure: a mandatory human read that cannot be skipped, the only flag here is your discipline.

The first three are engineering bugs, the log catches them and brakes cure them. The fourth is not a loop bug but your degradation as an engineer, and no code fixes it.
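
Silent death is the one failure the loop cannot report itself, so something outside it has to watch the heartbeat. A minimal watchdog sketch, assuming the `iter=N ts=SECONDS` format the loop script writes to .loop_heartbeat; the function names and the staleness threshold are hypothetical.

```shell
#!/usr/bin/env bash
# heartbeat_age FILE [NOW]: seconds elapsed since the ts= field in FILE.
# Returns 2 when the file is missing or has no ts= field.
heartbeat_age() {
  local file=$1 now=${2:-$(date +%s)} ts
  ts=$(sed -n 's/.*ts=\([0-9][0-9]*\).*/\1/p' "$file" 2>/dev/null)
  [ -n "$ts" ] || return 2
  echo $(( now - ts ))
}

# is_stale FILE MAX_AGE [NOW]: succeeds when the heartbeat is older than
# MAX_AGE seconds or cannot be read at all.
is_stale() {
  local age
  age=$(heartbeat_age "$1" "${3:-$(date +%s)}") || return 0
  [ "$age" -gt "$2" ]
}
```

Run from cron or a second terminal, `is_stale .loop_heartbeat 900 && echo "loop looks dead"` flags a loop that has not started a new iteration for 15 minutes. The threshold has to sit above the longest agent turn you measured, since the heartbeat is written only at the start of each iteration.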

§ 12

Deterministic check -> reliable manual run with measurement -> minimal stateless loop -> narrow context build with a token budget -> check cannot be fooled (gate + judge) -> state on disk (md + json) -> isolation (worktree/container) -> brakes with a log -> count the cost -> schedule.

No one shipping two hundred PRs a month started with a hundred agents. They all started with one loop they trusted, with a real check and brakes. Take the most boring task, wrap it in one such loop, small enough to read every diff. Build that one.
