日刊 /2026-06-21 / 把每次 Agent 犯错变成永久的结构性免疫

把每次 Agent 犯错变成永久的结构性免疫

原文 x.com 收录 2026-06-21 06:00 阅读 22 min

AI 解读

Garry Tan 提出“Skillify”方法论：每次 AI Agent 犯错，不靠道歉或提示词修补，而是将其转化为一项带完整测试链的 Skill。文章以两个真实故障（日历查询绕过本地脚本、时区心算偏差）为例，展示如何将失败固化为 SKILL.md 契约加确定性脚本，并引入涵盖单元测试、集成测试、LLM eval、解析器触发与校验、可达性审查、冒烟测试等十步清单的验证体系。该流程已集成于作者的开源知识引擎 GBrain 中，确保 Agent 的每一次判断提升都是永久且可验证的。适合正在构建 AI Agent 并受困于同类错误反复出现的开发者。

原文 22 分钟

原文 x.com ↗

§ 1

LangChain has raised $160 million. Three years of development. A billion-dollar valuation. LangSmith, their testing platform, is genuinely sophisticated: trajectory evals, trace-to-dataset pipelines, LLM-as-judge, regression suites, unit test frameworks for tools. They have the pieces. Credit where it's due.

But pieces aren't a practice.

LangChain gives you testing tools. It never tells you what to test, in what order, or when you're done.

There's no opinionated workflow that says, in order: this failure happened, now write a skill, now write the deterministic code, now write unit tests, now write LLM evals, now add a resolver trigger, now eval the resolver, now audit for duplicates, now smoke test, now file correctly.

That loop doesn't exist. You have to invent it yourself from scattered primitives. A great many users of AI still don't test their agents at all, because the framework they chose probably gave them a gym membership without a workout plan.

Most AI agent "reliability" is vibes-based. Prompt tweaks. Bigger system messages. "Please don't hallucinate" incantations. That stuff decays the moment the conversation gets complex. The frameworks that raised hundreds of millions of dollars to solve this gave you monitoring dashboards and unit test helpers and said "good luck."

My agent screwed up twice this week. Neither failure can happen again. Not because I asked nicely. Because I turned each failure into a permanent structural fix: a skill with tests that run every day, forever.

I call the practice "skillify." Once you use it, your agents won't keep making the same mistakes. Here's how it works.

LangChain 已经融了 1.6 亿美元。三年多的开发，估值突破十亿。他们的测试平台 LangSmith 确实相当复杂：轨迹评估、trace-to-dataset 流水线、LLM-as-judge、回归套件、面向工具的单元测试框架——该有的零部件都有了。这点必须承认。

但零部件不等于实践本身。

LangChain 给了你测试工具，却从没告诉你该测什么、按什么顺序测、什么时候才算测完。

没有一个带有明确意见的工作流会说：按照顺序——这个故障发生了，现在写一条 skill，现在写对应的确定性代码，现在写单元测试，现在写 LLM 评估，现在添加 resolver 触发器，现在评估 resolver，现在检查重复，现在跑冒烟测试，现在正确归档。

这个闭环根本不存在。你只能从杂乱的原语里自己拼出来。绝大多数 AI 用户至今完全不测试自己的 agent，因为他们选择的框架大概只发了一张健身卡，却没给训练计划。

如今大多数 AI agent 的“可靠性”靠的是一种氛围：调调 prompt、把系统消息写长一点、念几句“请勿幻觉”的咒语。对话一复杂，这些东西就不管用了。那些融了数亿美元来解决这个问题的框架，最终给了你监控面板和单元测试助手，然后说了句“祝你好运。”

这周我的 agent 搞砸了两次。这两次事故不能再发生。不是因为我好好跟它商量，而是因为我把每次失败都变成了一次永久的结构性修复：一条拥有测试的 skill，这些测试每天都会运行，永远运行下去。

我把这套方法叫做“skillify”。一旦用上它，你的 agent 就不会再重复犯同样的错误。下面是它的工作原理。

§ 2

Failure 1: The Trip That Was Already in the Database

I asked my OpenClaw about an old business trip, nearly ten years back, buried somewhere in calendar history. Simple question. Should take one second.

Instead the agent did this:

Called the live calendar API → blocked (too far back).
Tried email search → noisy results, nothing conclusive.
Tried the calendar API again with different params → still blocked.
Five minutes later, searched my local knowledge base and found it instantly.

The answer had been sitting in my own data the whole time. 3,146 calendar files spanning 2013 through 2026. Already indexed. Already local. One grep away.

The agent just didn't look there first.

In the framework I've been writing about (thin harness, fat skills) there's a key distinction between work that requires judgment and work that requires precision. I call them latent and deterministic. Calendar grep is deterministic. Same input, same output, every time. No model needed. But the agent did it in latent space anyway, spinning up reasoning, making API calls, interpreting results, when a three-line script would have returned the answer instantly.

That's the bug. Not a wrong answer. A wrong side.

故障 1：早已在数据库里的那次旅行

我问我的 OpenClaw 关于一次旧差旅的信息，差不多十年前的事了，埋在日历历史里的某个角落。很简单的问题，一秒钟就应该能搞定。

结果 agent 却做了这么一串事：

调用实时日历 API → 被拦截（时间太久远）。
尝试邮件搜索 → 结果嘈杂，没有定论。
用不同参数再调一次日历 API → 仍然被拦截。
五分钟后，搜索我的本地知识库，立刻找到了。

答案一直就在我自己的数据里。3146 个日历文件，横跨 2013 到 2026 年。已经建立索引，已经存在本地，一个 grep 就能搞定。

Agent 只是没有第一时间去那里找。

在我一直在写的框架（thin harness，fat skills）里，有一组关键区分：需要判断力的工作和需要精度的的工作。我分别称为 latent 和 deterministic。日历 grep 是确定性的。相同的输入，相同的输出，每次都是。不需要模型。但 agent 仍通过 latent 空间来处理它，启动推理、调用 API、解读结果——而一个三行脚本就能瞬间返回答案。

这就是 bug。不是答案错了，而是处理的空间错了。

§ 3

The fix: calendar-recall (Steps 1 + 2)

In thin harness / fat skills, a skill is a markdown procedure that teaches the model how to approach a task. Not what to do (the user supplies the what). The skill supplies the process. Think of it like a method call: same procedure, radically different outputs depending on what you pass in.

Here's the skill that came out of this failure:

name: calendar-recall description: "Brain-first historical calendar lookup. ALWAYS use this before any live API for any event not in the future or the last 48 hours."

And the hard rule inside:

Live calendar APIs are ONLY for events in the FUTURE or the LAST 48 HOURS. Everything historical goes through the local knowledge base first.

Here's the thing that makes this work: the agent itself wrote the deterministic script. The skill file (markdown, living in latent space) told the agent how to fix the problem. The agent read the skill, understood that calendar search is deterministic work, and generated a script to handle it:

$ node scripts/calendar-recall.mjs search "Singapore"

Found 2 matching day(s): ── 2016-05-07 ── Flight to Singapore, Mandarin Oriental check-in ── 2016-05-08 ── Lunch with investors at Fullerton Hotel

Code that runs in under 100 milliseconds (most of which is Bun startup; the actual grep is sub-millisecond). Zero LLM calls. Zero network. Just local files.

This is the loop that makes the whole architecture work: the latent space builds the deterministic tool, then the deterministic tool constrains the latent space. The agent used judgment (latent) to write calendar-recall.mjs. Now the skill forces the agent to run that script instead of reasoning about calendar data. The model's intelligence created the constraint that prevents the model from being stupid.

The old failure path becomes structurally unreachable. The skill says "search local first." The script does the search. The agent never gets a chance to be clever about it or get it wrong again.

修复方式：calendar-recall（第 1、2 步）

在 thin harness / fat skills 架构中，一条 skill 就是一个 markdown 规程，用来教模型如何接近一个任务。不教该做什么（用户提供“做什么”），只提供流程。可以把它想象成一个方法调用：同样的规程，根据传入的参数不同，输出也截然不同。

以下是从这次故障中产出的 skill：

名称：calendar-recall 描述：“脑优先的历史日历查找。对于任何不在未来或最近 48 小时内的事件，总是先使用本方法，再去碰任何实时 API。”

内部硬性规则：

实时日历 API 仅用于未来或最近 48 小时内的事件。所有历史事件都先走本地知识库。

这里最关键的一点是：agent 自己写出了这个确定性脚本。skill 文件（markdown，存在于 latent 空间）告诉 agent 怎么修这个问题。agent 阅读了 skill，理解了日历搜索是确定性工作，然后生成了一段脚本来处理它：

$ node scripts/calendar-recall.mjs search "Singapore"

找到 2 个匹配的日期： ── 2016-05-07 ── 飞往新加坡，文华东方酒店入住 ── 2016-05-08 ── 与投资者在 Fullerton 酒店午餐

这段代码运行只需不到 100 毫秒（大部分是 Bun 的启动时间，实际 grep 是亚毫秒级）。零 LLM 调用，零网络，纯本地文件。

这就是让整个架构运转起来的循环：latent 空间构建了确定性工具，然后确定性工具反过来约束了 latent 空间。agent 使用了判断力（latent）来编写 calendar-recall.mjs。现在，这条 skill 强制 agent 运行那个脚本，而不是自己去推理日历数据。模型的智慧创建了约束，阻止模型犯蠢。

旧有的失败路径在结构上变得不可达。skill 说“先搜本地”。脚本去执行搜索。agent 再也没有机会自作聪明或再次搞错。

§ 4

Failure 2: "28 Minutes" (Steps 1 + 2 again)

Same day. Agent says: "Your next meeting is in 28 minutes."

Reality: 88 minutes away. The agent had done UTC→PT timezone math in its head and was off by exactly an hour.

The thing is, a script already existed (context-now.mjs) that outputs this:

{ "now": "2026-04-21T07:38:12-07:00", "upcomingEvents": [{ "summary": "App Ops Sprint Planning", "minutesUntil": 88 }] }

Code that runs in about 50 milliseconds. Zero ambiguity. The agent just didn't run it.

Same shape as before: deterministic work (subtracting timestamps) done in latent space. The model was doing mental math when a script had the answer.

The fix: context-now, the skill:

name: context-now description: "ALWAYS-ON discipline: run context-now.mjs before making ANY time-sensitive claim. Never do UTC→PT conversion in your head."

Here's the simple before/after with and without these simple skills:

故障 2：“28 分钟”（又见第 1、2 步）

同一天。agent 说：“您的下一个会议在 28 分钟后。”

实际上：是 88 分钟后。agent 在脑子里做了 UTC 到太平洋时间的换算，整整差了一个小时。

问题是，已经存在一个脚本（context-now.mjs）能输出这样的结果：

{ “now”: “2026-04-21T07:38:12-07:00”， “upcomingEvents”: [{ “summary”: “App Ops Sprint Planning”， “minutesUntil”: 88 }] }

这段代码大约 50 毫秒就能跑完。零歧义。agent 就是没去跑它。

和前一次一样的问题：把确定性的工作（时间戳相减）放在 latent 空间里做。模型在用脑子做算术，而脚本早就有了答案。

修复方式：context-now 这条 skill：

名称：context-now 描述：“始终打开的训练准则：在做出任何涉及时间的声称之前，必须先运行 context-now.mjs。永远不要在脑子里做 UTC 到太平洋时间的转换。”

下面是一张简单的对比图，展示了有或没有这些简单 skill 前后的效果：

§ 5

Skillify: The pattern that will save your sanity

Two failures. Same shape. The agent had the right tool and chose cleverness instead of discipline. The wrong thing happened in the wrong machine space.

In a normal AI setup, the AI will apologize, promise to do better, and two weeks later the same thing happens with a different query or a different timezone. The agent has no memory of the bug, no test for the bug, nothing stops it from recurring.

Skillify is the fix. Every failure becomes a skill. Every skill has tests. The bug becomes structurally impossible to repeat.

Here's the 10-item checklist I use when a failure gets promoted:

□ 1. SKILL.md — the contract (name, triggers, rules) □ 2. Deterministic code — scripts/*.mjs (no LLM for what code can do) □ 3. Unit tests — vitest □ 4. Integration tests — live endpoints □ 5. LLM evals — quality + correctness □ 6. Resolver trigger — entry in AGENTS.md □ 7. Resolver eval — verify the trigger actually routes □ 8. Check resolvable + DRY audit □ 9. E2E smoke test □ 10. Brain filing rules

A feature that doesn't pass all ten is not a skill. It's just code that happens to work today.

Skillify：能让你省心的模式

两次故障。同样的形态。agent 手里有正确的工具，却选择了自作聪明而不是遵守纪律。事情发生在错误的机器空间里。

在普通的 AI 设置下，AI 会道歉、承诺改进，然后两周后在另一个查询或另一个时区里发生同样的事。agent 没有关于 bug 的记忆，没有针对 bug 的测试，没有任何东西阻止它重复发生。

Skillify 就是解药。每次失败都变成一条 skill。每条 skill 都带有测试。bug 在结构上变得无法复现。

以下是我在将一次失败提级为 skill 时使用的十条检查清单：

□ 1. SKILL.md — 契约（名称、触发器、规则） □ 2. 确定性代码 — scripts/*.mjs（代码能做的事绝不劳烦 LLM） □ 3. 单元测试 — vitest □ 4. 集成测试 — 实时端点 □ 5. LLM 评估 — 质量 + 正确性 □ 6. Resolver 触发器 — 在 AGENTS.md 中登记 □ 7. Resolver 评估 — 验证触发器确实能路由 □ 8. 可解析性检查 + DRY 审计 □ 9. 端到端冒烟测试 □ 10. 大脑归档规则

一个没有通过全部十条的功能不是 skill，它只是恰好今天还能工作的代码。

§ 6

Skillify as a verb

For me, building my OpenClaw (and GBrain), the checklist started as a failure-response protocol. Then it became the way I built everything.

Here's what my actual workflow looks like. I'm talking to my agent in natural language. We build something together in conversation. I try it. It works. Then I say one word:

Garry: hot damn it worked. can you remember this as a webhook skill and skillify it, next time we need to do some webhooks? why was this so hard to get right? anyway it's good now. DRY it up too

That was an OAuth webhook integration. We spent an hour getting it to work. Then "skillify it" turned the ad-hoc session into a durable skill with tests, a resolver entry, and documentation. Next time I need a webhook, the skill exists. The agent reads it. The hard-won knowledge from that hour is permanent.

Another one. We discovered that our container needs a headless browser for certain tasks, and a headed browser on my desktop for others:

Garry: great! so we should actually remember this as a skill whenever anything in openclaw needs a headless browser! and also know that if we need a headed browser we should ask the user to run gstack browser and give us a pair-agent code. skillify it!

One message. The agent writes skills/browser/SKILL.md with the decision tree, the deterministic scripts, the tests. Now every future session that needs a browser gets routed to the right tool automatically.

Or this. I noticed the agent kept sending me ngrok links without checking if they actually worked:

Garry: can we make a skill that says whenever you send me a link you have to curl it yourself to make sure the endpoint is open and the tunnel works? skillify it!

Or the calendar double-booking that almost cost me a meeting:

Garry: Here is one regular skill I need you to write. It's the calendar check skill. Tomorrow I have a double booked 11am. Make a skill, make it deterministic to check these kinds of things.

One sentence. Code, skill, tests, resolver entry, reachability audit. The whole 10-step checklist in one breath. My OpenClaw knows, does it, and now it's a groove. I've done it dozens of times now. I couldn't live without it.

The pattern is always the same: prototype in conversation, see it work, say "skillify," and the prototype becomes permanent infrastructure. I don't write specs. I don't file tickets. I talk to my agent, we solve the problem together, and then the solution becomes a skill that the agent can use forever without me.

This is what $160 million in framework funding missed. Not the testing primitives. Not the eval tooling. The workflow. The moment where a human says "that worked, now make it permanent" and the system knows exactly what "permanent" means: SKILL.md, deterministic code, unit tests, integration tests, LLM evals, resolver trigger, resolver eval, DRY audit, smoke test, brain filing. Ten steps. One word.

Skillify 成了一个动词

对我而言，在构建 OpenClaw（和 GBrain）的过程中，这份清单起初只是故障响应协议，后来变成了我构建一切的方式。

以下是我的实际工作流程。我用自然语言和 agent 对话。我们一起在对话中构建东西。我试试，好用。然后我说一个词：

Garry：太棒了居然成了。你能把它记作一条 webhook skill 然后 skillify 一下吗？下次我们需要做 webhook 的时候就有现成的了。为什么这个搞对这么难？反正现在好了。顺便去重一下。

那是一次 OAuth webhook 集成。我们花了一小时才让它跑通。然后一句“skillify it”就把那段临时对话变成了一条持久的 skill，包含测试、resolver 登记和文档。下次我需要 webhook 时，这条 skill 直接可用。agent 可以读取它。那一小时的辛苦结晶永不过期。

还有一次。我们发现容器在特定任务下需要无头浏览器，而在我的桌面上某些任务又需要有头浏览器：

Garry：太好了！我们真应该把这个记成一条 skill，以后只要 OpenClaw 里需要无头浏览器就用它。同时也要知道如果需要用有头浏览器，就应该让用户跑 gstack browser 然后给我们一个 pair-agent 代码。skillify 它！

就一句话。agent 便写好了 skills/browser/SKILL.md，包含决策树、确定性脚本和测试。从此，任何需要浏览器的会话都会自动被路由到正确的工具上。

还有这个。我发现 agent 老是给我发 ngrok 链接，却不去检查它们是否真的能用：

Garry：能造一条 skill 吗：以后任何你给我发的链接，你都得自己先 curl 一下，确认端点通、隧道正常。skillify 它！

还有那个几乎让我错过会议的日历重名：

Garry：这里有一条常规 skill 需要你来写。叫做日历检查 skill。明天我 11 点有重名。造一条 skill，用确定性的方法来检查这类问题。

一句话。代码、skill、测试、resolver 登记、可达性审计，十条检查清单一口气完成。我的 OpenClaw 懂得干这事，现在它已经成了习惯。我已经做了几十次。离了它我活不了。

模式总是一样的：在对话中打样，看它生效，说一声“skillify”，原型就变成了永久的基础设施。我不写规格说明，不提交工单。我和 agent 对话，一起解决问题，然后解方变成一条 agent 可以在没有我的情况下永久使用的 skill。

这就是那 1.6 亿美元的框架投资所忽略的东西。不是测试原语，不是评估工具。而是工作流。是当人类说出“生效了，现在把它永久化”的那一刻，系统确切知道“永久化”意味着什么：SKILL.md、确定性代码、单元测试、集成测试、LLM 评估、resolver 触发器、resolver 评估、DRY 审计、冒烟测试、大脑归档。十个步骤。一个词。

§ 7

Here's what the remaining eight steps look like in practice.

Step 3: Unit tests

Classic vitest. Deterministic functions, deterministic assertions. calendar-recall.mjs exports pure functions like parseEventLine, eventMatchesKeyword, searchKeyword, formatJson. Each one gets tested against fixture data: synthetic calendar files in a temp directory, known inputs, known outputs.

The kind of bug these catch: parseEventLine silently drops events with Unicode characters in the location field. dateFromPath returns null for leap-year dates. formatJson omits the attendees array when there's only one person. Small, boring, critical. If the script produces wrong output, the skill produces wrong answers, and the agent confidently tells me the wrong thing.

For context-now, unit tests verify timezone formatting, quiet-hours detection, and the minutesUntil calculation across DST boundaries. One test feeds in a time 3 minutes before a DST transition and verifies the output doesn't jump by 60 minutes. That's the exact bug that caused the "28 minutes" failure. It's now structurally impossible.

I have 179 unit tests across 5 suites. They run in under 2 seconds.

Step 4: Integration tests

These hit live endpoints and real data. Does calendar-recall.mjs actually find events in the real brain repo, not just the test fixtures? Does context-now.mjs produce valid JSON when the calendar cache is stale or missing? Integration tests catch the bugs that unit tests miss because the fixture data was too clean. Real data has malformed event lines, missing timezone fields, calendar files with Windows line endings, events that span midnight.

The rule: if you find yourself manually checking whether the script did the right thing on real data, that check should be an integration test.

Step 5: LLM evals

This is where it gets interesting. Some outputs require judgment to evaluate. "Is this calendar summary useful?" is not a yes/no question a script can answer. So I use LLM-as-judge: a model evaluating another model's output against a rubric.

For context-now, 35 evals run daily. One of them feeds the agent a message like "hey, my flight leaves in about 45 minutes, will I make it to SFO?" and checks whether the agent runs context-now.mjs before answering or tries to do the math in its head. If the agent takes the bait and computes the time itself, the eval fails.

Another eval gives the agent a UTC timestamp and asks "what time is that for me?" The correct behavior is to run the script and quote the result. The incorrect behavior is to do the conversion mentally. The eval catches both the wrong answer AND the wrong process, because even if the mental math happens to be right this time, it'll be wrong next time.

The most honest eval heuristic I've found: search your conversation history for when you said "fucking shit" or "wtf." Those are the test cases you're missing.

下面是余下八个步骤在实际中的样子。

第 3 步：单元测试

经典的 vitest。确定性函数，确定性断言。calendar-recall.mjs 导出了 parseEventLine、eventMatchesKeyword、searchKeyword、formatJson 等纯函数。每个函数都针对固定数据进行测试：临时目录里的合成日历文件，已知的输入，已知的输出。

这类测试能捕获的 bug：parseEventLine 静默丢弃包含 Unicode 字符的字段；dateFromPath 对闰年日期返回 null；formatJson 在只有一位参与者时省略 attendees 数组。这些 bug 小、无聊、但至关重要。如果脚本输出错了，那条 skill 给出的答案就是错的，agent 就会自信地告诉我错误的信息。

对于 context-now 来说，单元测试验证时区格式化、安静时段检测以及跨夏令时边界的 minutesUntil 计算。一个测试输入了夏令时切换前 3 分钟的时间，验证输出不会跳 60 分钟——这正是导致“28 分钟”故障的 bug。现在它在结构上已经不可能再出现了。

我有 179 个单元测试，分布在 5 个套件中。运行时间不到 2 秒。

第 4 步：集成测试

这些测试会命中实时端点和真实数据。calendar-recall.mjs 是否真的能在真实的大脑仓库中找到事件，而不只是在测试 fixture 里？当日历缓存过期或缺失时，context-now.mjs 是否还能输出有效的 JSON？集成测试能捕获单元测试遗漏的 bug，因为 fixture 数据太干净了。真实数据中可能会有格式错误的日历行、缺失的时区字段、带有 Windows 换行符的日历文件、跨午夜的事件。

规则：如果你发现自己正在手动检查脚本对真实数据的处理是否正确，那么那个检查就应该是一个集成测试。

第 5 步：LLM 评估

这里就很有意思了。有些输出需要判断力来评估。“这份日历摘要有用吗？”不是一个脚本能用是/否回答的问题。所以我用 LLM-as-judge：用一个模型根据标准来评估另一个模型的输出。

对于 context-now，每天会跑 35 个评估。其中一个评估给 agent 发送一条消息，比如“嘿，我的航班大约 45 分钟后起飞，我赶得上吗？”，然后检查 agent 在回答之前是否运行了 context-now.mjs，还是试图在脑子里做数学。如果 agent 上钩了，自己计算了时间，评估就会失败。

另一个评估给 agent 一个 UTC 时间戳，然后问“对我来说这是几点？”正确的行为是运行脚本并引用结果。错误的行为是在心里做换算。这个评估同时捕获了错误的答案和错误的过程——因为即使这次心里换算碰巧对了，下次它也会错。

我发现的最诚实的评估启发：搜索你的对话历史，找你什么时候说了“卧槽”或“什么鬼”。那些就是你缺失的测试用例。

§ 8

Step 6: Resolver trigger

A resolver is a routing table for context: when task type X appears, load skill Y. I wrote about resolvers in detail here. Each skill needs a trigger entry in AGENTS.md, the file that teaches the agent what skills exist and when to use them.

Resolver triggers are just rows in a markdown table:

The bug this step catches: you write a new skill but forget to add it to the resolver. The skill exists. The capability exists. The system can't reach it. It's like having a surgeon on staff but not listing them in the hospital directory. Worse than not having the skill at all, because you think the system handles it.

Step 7: Resolver eval

This is the layer most people miss entirely. A resolver trigger says "this phrase should route to this skill." A resolver eval tests that it actually does.

My resolver eval suite has 50+ test cases like this:

{ intent: 'check my signatures', expectedSkill: 'executive-assistant' }, { intent: 'who is Pedro Franceschi', expectedSkill: 'brain-ops' }, { intent: 'save this article', expectedSkill: 'idea-ingest' }, { intent: 'what time is my meeting', expectedSkill: 'context-now' }, { intent: 'find my 2016 trip', expectedSkill: 'calendar-recall' },

Two failure modes. False negative: the skill should fire but doesn't, because the trigger description is wrong or missing. False positive: the wrong skill fires, because two triggers overlap. "What's on my calendar tomorrow" should route to calendar-check, not calendar-recall and not google-calendar. Three skills, three different time domains, one phrase that could plausibly match any of them. The resolver eval catches the ambiguity before a user hits it.

I run these evals both as deterministic structural tests (does the AGENTS.md table contain the right mapping?) and as LLM routing tests (given this intent, does the model actually pick the right skill?). Both layers matter. The table can be correct and the model can still route wrong because the trigger description is vague.

Step 8: Check-resolvable + DRY audit

After a month of building, I had 40+ skills. Some created in response to specific incidents, others spawned by sub-agents running crons. Nobody was maintaining the resolver table. Skills were being born but not registered.

So I built check-resolvable. A meta-test that walks the entire chain: AGENTS.md resolver → SKILL.md → script/cron. If a script exists that does useful work but has no path from the resolver, it's unreachable. The LLM will never know to use it.

First run found 6 unreachable skills out of 40+. Fifteen percent of the system's capabilities were dark.

A flight tracker that nobody could invoke by asking about flights.
A content-ideas generator that only ran on cron but couldn't be triggered manually.
A citation fixer that existed in the skills directory but wasn't listed in the resolver at all.

Fixed in an hour. Just added trigger entries to AGENTS.md. Now check-resolvable runs weekly as part of gbrain doctor. It checks three things:

Every skill directory with a SKILL.md has a corresponding entry in the resolver.
Every script referenced by a skill is actually callable (file exists, exports the right functions).
No two skills have overlapping trigger descriptions that would cause ambiguous routing.

The DRY audit runs alongside it. You end up with fifteen skills that sort of do the same thing if you're not careful, and the resolver picks whichever one the dice roll lands on. For calendar-recall:

Four skills in the same domain. Zero overlap. Each has its lane. That matrix isn't a diagram drawn for this post. It lives inside the SKILL.md, and the audit script parses it. Build a sixth calendar skill that steps on another's lane and the audit fails before the skill can ship.

第 6 步：Resolver 触发器

Resolver 是一张上下文路由表：当出现 X 类型的任务时，加载 Y 技能。每条 skill 需要在 AGENTS.md 文件（告诉 agent 有哪些技能以及何时使用它们的文件）中有一个触发器登记。

Resolver 触发器只是 markdown 表格中的行：

这一步能捕获的 bug：你写了一条新 skill，但忘了把它加入 resolver。skill 存在了，能力也有了，但系统够不着它。这就像医院里有一位外科医生，但目录里找不到他的名字。这比根本没有这项 skill 更糟糕，因为你会以为系统已经能处理了。

第 7 步：Resolver 评估

这是大多数人会完全错过的一层。一条 resolver 触发器说“这句话应该路由到这条 skill”。Resolver 评估则测试它是否真的做到了。

我的 resolver 评估套件有 50 多个这样的测试用例：

{ intent: 'check my signatures', expectedSkill: 'executive-assistant' }， { intent: 'who is Pedro Franceschi', expectedSkill: 'brain-ops' }， { intent: 'save this article', expectedSkill: 'idea-ingest' }， { intent: 'what time is my meeting', expectedSkill: 'context-now' }， { intent: 'find my 2016 trip', expectedSkill: 'calendar-recall' }，

两种失败模式。假阴性：skill 应该触发但没触发，因为触发器描述有误或缺失。假阳性：错误的 skill 触发了，因为两条触发器重叠了。“我明天日历上有什么”应该路由到 calendar-check，而不是 calendar-recall 或 google-calendar。三条 skill，三个不同的时间域，一句短语可能匹配其中任何一个。Resolver 评估在用户碰到这个问题之前就捕获了歧义。

我同时作为确定性结构测试（AGENTS.md 表格是否包含正确的映射？）和 LLM 路由测试（给定这个意图，模型是否真的选对了 skill？）来运行这些评估。两层都很重要。表格可能正确，但模型仍然可能因为触发器描述太模糊而路由错误。

第 8 步：可解析性检查 + DRY 审计

构建一个月后，我有了 40 多条 skill。有些是针对特定事件创建的，有些是子 agent 通过 crons 生成的。没人维护 resolver 表格。Skill 不断诞生，却没人登记。

所以我构建了 check-resolvable。一个元测试，遍历整个链条：AGENTS.md resolver → SKILL.md → 脚本/cron。如果一个脚本做着实用的工作，但从 resolver 没有路径到达它，那么它就是不可达的。LLM 永远不会知道去用它。

第一次运行发现 40 多条 skill 中有 6 条不可达。系统 15% 的能力处于黑暗状态。

一个航班追踪器，没人能通过询问航班来调用它。
一个内容创意生成器，只在定时任务上运行，不能手动触发。
一个引用修复器，存在于 skills 目录中，但根本没有在 resolver 里登记。

一小时内就修好了。只需在 AGENTS.md 里添加触发器登记。现在 check-resolvable 作为 gbrain doctor 的一部分每周运行一次。它检查三件事：

每个包含 SKILL.md 的 skill 目录在 resolver 中都有对应的登记。
每条 skill 引用的脚本实际可调用（文件存在，导出正确的函数）。
没有两条 skill 有会导致歧义路由的重叠触发器描述。

DRY 审计与它一起运行。如果不小心，你最终会得到十五条功能相似的 skill，resolver 像掷骰子一样随机选择。对于 calendar-recall：

同一领域内有四条 skill。零重叠。每一条都有自己的车道。那个矩阵不是为本文画的图。它存在于 SKILL.md 内部，审计脚本会解析它。如果有人构建第六条日历 skill 踩了别人的车道，审计会在 skill 上线之前就给出不通过。

§ 9

Step 9: E2E smoke test

The full pipeline, end to end.

Ask the agent "when did I go to Singapore?" and verify that it runs calendar-recall.mjs, gets the right answer, and formats it correctly.
Ask "what time is my next meeting?" and verify it runs context-now.mjs instead of doing mental math.

Smoke tests are the last line of defense. Everything else can pass and the system can still fail if the pieces don't connect. The skill can be correct, the script can be correct, the resolver can be correct, and the agent can still choose to ignore all of it and wing it. The smoke test catches that.

Step 10: Brain filing rules

Every skill that writes to the knowledge base needs to know where things go. A person goes in people/. A company goes in companies/. A policy analysis goes in civic/. I caught 10 out of 13 brain-writing skills filing to the wrong directory because they'd each hardcoded their own paths instead of consulting the resolver.

The filing rules doc catalogs common misfiling patterns. Sources vs. originals. People vs. companies (when someone IS a company). The skill reads the rules before creating any page. Zero misfilings since.

第 9 步：端到端冒烟测试

完整的管道，端到端。

问 agent“我什么时候去的新加坡？”并验证它是否运行了 calendar-recall.mjs，得到了正确的答案，并以正确的格式呈现。
问“我下一个会议是几点？”并验证它是否运行了 context-now.mjs 而不是在心里做算术。

冒烟测试是最后一道防线。其他所有环节都能通过，但如果各个部件没有连接好，系统仍然可能失败。skill 可以正确，脚本可以正确，resolver 可以正确，但 agent 仍然可以选择忽略这一切，即兴发挥。冒烟测试就是用来捕获这种情况的。

第 10 步：大脑归档规则

每一条会写入知识库的 skill 都需要知道东西该放哪里。一个人放 people/，一家公司放 companies/，一份政策分析放 civic/。我发现 13 个向大脑写入的 skill 中有 10 个归档到了错误的目录，因为它们每个各自硬编码了自己的路径，而不是问 resolver。

归档规则文档记录了常见的归档错误模式。来源 vs. 原件。人 vs. 公司（当一个人本身就是公司时）。skill 在创建任何页面之前都要先读取这些规则。从那以后，零次错误归档。

§ 10

GBrain: where Skillify lives, and you should adopt it from my GBrain Skill Pack

The skillify pattern isn't specific to OpenClaw or any particular harness. It's built into GBrain. GBrain is the open source knowledge engine I wrote that sits underneath whatever harness you use. It manages your brain repo, runs your evals, and enforces the quality gates that make skills durable.

A GBrain SkillPack is a portable bundle of skills, resolver triggers, deterministic scripts, and tests that you can install into any agent setup just by asking OpenClaw/Hermes Agent to do it. It's how skills and abilities I wrote for my OpenClaw/Hermes Agent can be auto-added to YOUR OpenClaw — including the whole 10-step skillify output, packaged so you can drop it into your OpenClaw/Hermes Agent and it just works.

The skillify checklist from earlier isn't a suggestion. It's what gbrain doctor actually checks.

gbrain doctor --fix auto-repairs DRY violations, replaces duplicated blocks with convention references, all guarded by git working-tree checks so nothing gets clobbered.

GBrain：Skillify 的栖息地，你可以从我的 GBrain Skill Pack 中拿来用

Skillify 模式并不特属于 OpenClaw 或任何具体的 harness。它内置于 GBrain。GBrain 是我写的一个开源知识引擎，位于任何你使用的 harness 底层。它管理你的大脑仓库，运行你的评估，并强制执行那些让 skill 持久的质量门禁。

GBrain SkillPack 是可移植的 skill、resolver 触发器、确定性脚本和测试的包，你只需让 OpenClaw/Hermes Agent 去做，就能把它安装到任何 agent 设置中。我为自己 OpenClaw/Hermes Agent 编写的技能和能力可以自动添加到你的 OpenClaw 中——包括完整的十条 skillify 输出，打包好了，你可以直接丢进你的 OpenClaw/Hermes Agent，然后它就能工作。

前面的 skillify 清单不是一份建议。它是 gbrain doctor 实际检查的内容。

gbrain doctor --fix 可以自动修复违反 DRY 的问题，用约定引用替换重复的块，这一切都有 git 工作树检查作为保护，所以不会破坏任何东西。

§ 11

Why Hermes Agent isn't enough on its own

Hermes Agent from Nous Research does something genuinely great: it has a skill_manage tool that lets the agent itself create, patch, and delete skills based on what it learns. When the agent finishes a complex task or recovers from an error, it proposes a skill and writes it to disk. That's procedural memory the agent earns on its own. Progressive disclosure (load a skill index first, pull the full SKILL.md only when selected). Bounded memory (MEMORY.md capped at 2,200 chars). Conditional activation (skills auto-hide when required tools aren't available). Smart design.

But Hermes doesn't test its skills. No unit tests on the deterministic code. No resolver evals to verify routing. No check-resolvable to find dark skills. No DRY audit to catch duplicates. No daily health check that goes red when something drifts.

The failure modes I've watched accumulate in any untested skill system:

Agent creates deploy-k8s on Monday. Thursday it creates kubernetes-deploy from a different conversation. Both exist. Both trigger on similar phrases. Ambiguous routing, and nobody notices until the wrong one fires at the wrong time.
Skill works perfectly when written. Six weeks later the upstream API changes shape. The skill silently returns garbage until a human spots it.
An autonomously-created skill has a weak trigger that never matches. It becomes an orphan, eating index tokens, never running, slowly rotting.

This is the "without tests, any codebase rots" problem that software engineering solved in 2005. Agent skills are no different. Hermes handles creation beautifully. GBrain handles verification. You need both.

为什么 Hermes Agent 单独不够

Nous Research 的 Hermes Agent 确实做了一些非常棒的事情：它有一个 skill_manage 工具，让 agent 自己能够根据所学创建、修补和删除 skill。当 agent 完成一项复杂任务或从错误中恢复时，它会提出一条 skill 并将其写入磁盘。这是 agent 自己挣来的程序记忆。渐进式披露（先加载技能索引，只有在被选中时才拉取完整的 SKILL.md）。有界记忆（MEMORY.md 上限 2200 字符）。条件激活（当所需工具不可用时技能自动隐藏）。设计得很聪明。

但 Hermes 不测试它的 skill。没有对确定性代码的单元测试。没有 resolver 评估来验证路由。没有 check-resolvable 去发现黑暗中的 skill。没有 DRY 审计来捕获重复。没有每天的健康检查，在东西走偏时亮起红灯。

我观察到的在任何未经测试的技能系统中累积的故障模式：

Agent 在周一创建了 deploy-k8s。周四它在另一个对话中创建了 kubernetes-deploy。两者都存在。两者都对相似的短语触发。路由含糊不清，直到错误的技能在错误的时机被触发才会有人注意。
Skill 在编写时完美工作。六周后上游 API 变了形状。skill 静默地返回垃圾数据，直到有人发现。
一个自主创建的 skill 有一个很弱的触发器，永远不会被命中。它变成了孤儿，消耗着索引 token，从不运行，慢慢腐烂。

这就是“没有测试，任何代码库都会腐烂”的问题，软件工程界在 2005 年就解决了。Agent skill 也没什么不同。Hermes 负责创建，做得很好。GBrain 负责验证。你需要两者。

§ 12

The big idea

In a healthy software engineering team, every bug gets a test. That test lives forever. The bug becomes structurally impossible to recur. AI agents should work the same way.

Every failure becomes a skill. Every skill has evals. Every eval runs daily. The agent's judgment improves permanently, not just for the current session, not just while the context window holds.

The trip failure won't happen again. The timezone failure won't happen again. And when the next failure shows up (and it will, because this is an adversarial game against entropy and taste) it'll get skillified too.

The agent I work with a year from now will be shaped by every mistake it made in the year before. That's not a nice-to-have. That's the whole thesis.

Boil the ocean. Make your agent do something, then skillify it. You do that every day and you have a god damn smart OpenClaw that does everything you want it to do.

Or you could just load GBrain, use all the code I've already written, and skip ahead to your own Jarvis from Iron Man sooner.

GStack to speed up in Claude Code github.com/garrytan/gstack

GBrain to build your own Jarvis from Iron Man in OpenClaw/Hermes Agent github.com/garrytan/gbrain

核心思路

在一个健康的软件工程团队中，每个 bug 都会得到一条测试。那条测试永远存在。那个 bug 在结构上变得不可能再出现。AI agent 也应该以同样的方式运作。

每次失败都变成一条 skill。每条 skill 都有评估。每次评估每天运行。agent 的判断力永久性地提升，不只是在当前会话中，不只是上下文窗口还打开的时候。

那次旅行故障不会再发生了。那次时区故障也不会再发生了。当下一次故障出现时（它会的，因为这是一场对抗熵和品味的博弈），它也会被 skillify。

一年后与我一起工作的 agent，将被前一年它犯下的每一个错误所塑造。这不是一个锦上添花的事情。这恰恰是整个论点。

放手去干。让你的 agent 做点事，然后 skillify 它。你每天这样做下去，你就会拥有一个无所不能的聪明 OpenClaw。

或者，你也可以直接加载 GBrain，用我已经写好的所有代码，然后更快地跳过流程，拥有你自己的钢铁侠贾维斯。

GStack 在 Claude Code 中加速：github.com/garrytan/gstack

GBrain 在 OpenClaw/Hermes Agent 中构建你自己的钢铁侠贾维斯：github.com/garrytan/gbrain

打开原文 ↗

标签

读完这条，下一步

→ agent-architecture → ai-tooling → testing

术语

latent space · 潜在空间（本文指 LLM 推理过程）: 作者将 Agent 操作分为需要模型的 latent 和不需要模型的 deterministic 两类，本文专指 LLM 进行推理判断而非执行确定性脚本的路径
resolver · 解析器: 基于意图的路由表，将用户输入映射到对应 Skill，含触发描述和可达性校验机制
SkillPack · 技能包: GBrain 中可移植的 Skill 集合，包含 MARKDOWN 契约、确定性脚本和测试套件，可跨 Agent 安装复用