Glean 拾遗

Human in the /loop

Source x.com Glean’d 2026-06-27 06:00 Read 4 min
AI summary

The author shares a practical workflow for coding with AI agents: define a verifiable 'definition of done' (model eval score, QA pass, green tests, performance benchmark), wrap it in a loop for the agent to iterate autonomously, and get notified via Slack when a decision is needed or the task completes. Loops run in the cloud, not on the local machine. The author runs 3-5 long loops concurrently plus shorter tasks. Aimed at engineers who want to move from one-shot agent interactions to long-running autonomous optimization tasks.

§ 1

What I like most about coding with agents right now is the room to leave a few runs going and still get on with other work. When something finishes or needs a call, I show up.

§ 2

This post is a short explainer of the setup I use: a definition of done the agent can score, a loop that keeps going until it should stop, and pings so I know when to lean in.

§ 3

Before kicking off a longer-running task, I lock a definition of done. Examples I actually use:

  • Model or eval work. Target is a score. Change the approach, run the eval, keep the change only if the number moved the right way. Closest to Karpathy's autoresearch for ML training loops.
  • Web app or UI. Target is a QA pass. Load the page or run Playwright, screenshot it, make sure it still does the thing.
  • Backend or refactor. Target is the test suite. Failing tests first, then green, and it has to stay green.
  • Speed or flakiness. Target is a number (p95, a benchmark). Change and measure until you're under the line you set.
  • Data or content cleanup. Target is a count. Loop until zero rows fail validation, or every item passes the check.

Writing the loop is mostly writing how you'd check the work yourself. Some tasks need every step on the page. Others I give a goal and a rough direction and let the model fill in the middle. I start more explicit than I think I need, then loosen it once I see what it can infer.
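
One way to make a definition of done concrete is to write it as a check that returns a number plus a target to compare against. The sketch below is illustrative only: `DefinitionOfDone`, `failing_rows`, and the sample rows are hypothetical names and data, not part of the author's setup.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DefinitionOfDone:
    """A target the agent can score itself against (hypothetical helper)."""
    name: str
    measure: Callable[[], float]   # runs the eval, test suite, benchmark, or validator
    target: float
    lower_is_better: bool = False

    def is_met(self, value: float) -> bool:
        # "Under the line" for latency or failure counts, "at or above" for scores.
        if self.lower_is_better:
            return value <= self.target
        return value >= self.target


# Example: data cleanup, where the target is zero rows failing validation.
rows = [{"email": "a@example.com"}, {"email": ""}, {"email": "b@example.com"}]


def failing_rows() -> float:
    """Count rows whose email field fails a deliberately crude check."""
    return float(sum(1 for r in rows if "@" not in r["email"]))


cleanup_done = DefinitionOfDone(
    "no invalid emails", failing_rows, target=0.0, lower_is_better=True
)
print(cleanup_done.is_met(cleanup_done.measure()))  # False: one row still fails
```

The same shape covers the other cases: an eval score uses the default direction, while p95 latency or a failure count sets `lower_is_better=True`.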

§ 4

Definition of done in hand, I tell the agent to loop on it. Change something, measure, keep or revert, go again. Doesn't have to be one tiny edit each time. The step just has to be measurable against the target. I care most about the stop conditions, which might be:

  • Metric hits the target
  • No improvement after a few tries
  • Out of ideas
  • Blocked or unsure (stop and ask)
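
The change, measure, keep-or-revert cycle and the four stop conditions above can be sketched as a small hill-climbing loop. This is a minimal illustration under stated assumptions, not the agent's actual implementation: `run_loop`, `NeedsDecision`, `bump`, and the toy integer score are all hypothetical, and "a few tries" is modeled as a `patience` counter.

```python
from typing import Callable, Iterable, Tuple

Change = Tuple[Callable[[], None], Callable[[], None]]  # (apply, revert)


class NeedsDecision(Exception):
    """Raised by a step that is blocked or unsure and wants a human call."""


def run_loop(
    measure: Callable[[], float],
    candidates: Iterable[Change],
    target: float,
    patience: int = 3,
) -> Tuple[str, float]:
    """Climb toward `target` (higher is better); return (stop_reason, best)."""
    best = measure()
    stalled = 0
    for apply_change, revert_change in candidates:
        if best >= target:
            break                          # metric hit the target
        try:
            apply_change()
        except NeedsDecision:
            return "blocked", best         # stop and ask
        score = measure()
        if score > best:
            best, stalled = score, 0       # keep the change
        else:
            revert_change()                # revert and count the stall
            stalled += 1
            if stalled >= patience:
                return "no_improvement", best
    return ("target_hit" if best >= target else "out_of_ideas"), best


# Toy stand-in for a real eval: a mutable score that each change nudges.
state = {"score": 70}


def bump(delta: int) -> Change:
    def apply() -> None:
        state["score"] += delta

    def revert() -> None:
        state["score"] -= delta

    return apply, revert


result = run_loop(
    lambda: float(state["score"]), [bump(5), bump(-3), bump(10)], target=85
)
print(result)  # ('target_hit', 85.0)
```

Returning the stop reason alongside the best score keeps the four exits distinguishable, which matters because only "blocked" needs a human reply before the loop can continue.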

§ 5

So the agent gets a notify path (MCP plus /notify) and reaches me there. Usually Slack, because that's where everything else already is. Same setup could be iMessage or whatever. I treat it as a generic notification channel, not full Slack access for the agent. Status updates and "I need a decision" show up like normal messages. When I answer, that reply is the next thing the loop runs on.
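
The "generic notification channel, not full Slack access" idea can be pictured as a deliberately narrow interface: the agent can post a message and read a reply, and nothing else. The sketch below does not reproduce the MCP or /notify setup; `Notifier`, `InMemoryNotifier`, `ask_for_decision`, and the sample messages are hypothetical, with an in-memory stand-in so it runs without a network.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Protocol


class Notifier(Protocol):
    """Narrow channel: post a message, read a reply. Nothing else is exposed."""

    def send(self, kind: str, text: str) -> None: ...

    def next_reply(self) -> Optional[str]: ...


@dataclass
class InMemoryNotifier:
    """Stand-in for Slack, iMessage, or any other chat backend."""
    outbox: List[str] = field(default_factory=list)
    inbox: List[str] = field(default_factory=list)

    def send(self, kind: str, text: str) -> None:
        self.outbox.append(f"[{kind}] {text}")

    def next_reply(self) -> Optional[str]:
        return self.inbox.pop(0) if self.inbox else None


def ask_for_decision(channel: Notifier, question: str) -> Optional[str]:
    """Post a question; the human's reply becomes the loop's next instruction."""
    channel.send("decision", question)
    return channel.next_reply()


channel = InMemoryNotifier(inbox=["Keep the cache change, drop the retry change."])
channel.send("status", "Loop started")
next_instruction = ask_for_decision(channel, "Two fixes conflict. Which one stays?")
print(channel.outbox)
print(next_instruction)
```

Swapping the backend then only means providing another class with the same two methods, which keeps the agent's permissions limited to sending and receiving messages.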

§ 6

Most of this doesn't stay on my laptop. It runs in the cloud so a loop can keep going for hours without my machine being open. I use my own client as the orchestrator and fan work out to cloud agents from there.

§ 7

Once a loop is running, I start another. Usually three or so, sometimes five. And that's only the long loops. I often have other agents up at the same time on shorter work: a PR, a one-off investigation, something that isn't a multi-hour hill climb. If things are quiet I fire off another. If three are waiting on me I stop starting stuff and go review.
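
The rule of thumb in this paragraph can be written down as a tiny predicate. It is only a sketch: `should_start_another` is a hypothetical name, and treating five as a hard cap is an assumption, since the text says only "usually three or so, sometimes five".

```python
def should_start_another(
    long_running: int,
    waiting_on_me: int,
    max_long_loops: int = 5,   # assumed cap, taken from "sometimes five"
    review_backlog: int = 3,   # "if three are waiting on me", go review
) -> bool:
    """Return True if it is reasonable to launch one more long loop."""
    if waiting_on_me >= review_backlog:
        return False           # review first, start nothing new
    return long_running < max_long_loops


print(should_start_another(long_running=2, waiting_on_me=0))  # True
print(should_start_another(long_running=2, waiting_on_me=3))  # False
```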

§ 8

Rough template of how I prompt. /loop drives the iterations and /notify keeps me posted.

Task: <the long thing>

/loop until <metric / tests / eval / QA check> hits <target>. Treat it as the source of truth and don't change it while you run.

Stop early if it stalls for several tries or you run out of ideas. If you're blocked or unsure, stop and ask.

/notify on start, on anything surprising, and when you're done or stuck. Ping me when you need a decision.
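
If the same template gets reused across several loops, filling it in programmatically is one option. The helper below is a hypothetical convenience, not something the post describes: `build_loop_prompt` and the example arguments are made up, and the fixed sentences are copied from the template above.

```python
def build_loop_prompt(task: str, check: str, target: str) -> str:
    """Fill the three placeholders of the prompt template with plain strings."""
    return "\n\n".join([
        f"Task: {task}",
        f"/loop until {check} hits {target}. Treat it as the source of truth "
        "and don't change it while you run.",
        "Stop early if it stalls for several tries or you run out of ideas. "
        "If you're blocked or unsure, stop and ask.",
        "/notify on start, on anything surprising, and when you're done or "
        "stuck. Ping me when you need a decision.",
    ])


# Example values are placeholders for illustration only.
prompt = build_loop_prompt(
    task="Speed up the search endpoint",
    check="the latency benchmark",
    target="the p95 threshold we agreed on",
)
print(prompt)
```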

If you're running loops, I'd love to hear how we can make it easier for you!
