Building a Good Vertical Agent: Context as a Cache Hierarchy

Source x.com Glean’d 2026-07-03 06:00 Read 21 min
AI summary

The article argues that a good vertical agent is a faithful compression of its task distribution, and that its context should be organized as L1/L2/L3 cache tiers. Using the author's Shortcut spreadsheet agent as the example, it details the optimizations at each tier. In L1, reading a range compresses 500 formulas into a single legend line via R1C1 normalization and aliasing; after writing, a structured diff groups, samples, and triages changes, flagging #REF! errors under MUST FIX. L2 provides curated English specs fetched on demand, like the pivot table recipe that bakes in gotchas (suspendLayout/resumeLayout, raw integer 8 for aggregation). L3 is the raw API reference plus a 100-line grep skill that lets the model mine tens of thousands of lines in bounded steps. The prompt budget mirrors the frequency curve, and the hierarchy moves as models improve. The advice is practical and transferable for engineers building reliable agents in any domain.

§ 1

How do you build an agent that actually performs in a domain — one customers pick because it's better?

The basics have been standardized over the past year: an agent is a while-loop around a model that calls tools until the task is done. Give it a filesystem, give it a shell, and let it do most things through that. You can write it in an afternoon, and most people have. Everyone can build an agent — it really isn't that hard, and, as I'll spell out, it isn't that deep either. What separates a good one from a toy isn't cleverness; it's a real understanding of your domain and the patience to do some tedious, careful work in the few places that matter.
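
For readers who have not written one, the loop described above fits in a few lines. This is a minimal sketch, not Shortcut's code: `fakeModel`, the `tools` table, and the message shapes are hypothetical stand-ins for a real model API and a sandboxed shell.

```javascript
// Hypothetical stand-in for a model call: asks for one tool call, then finishes.
function fakeModel(messages) {
  const sawToolResult = messages.some((m) => m.role === "tool");
  return sawToolResult
    ? { done: true, text: "The file has 3 lines." }
    : { done: false, tool: "shell", args: { cmd: "wc -l notes.txt" } };
}

// Hypothetical tool table; a real agent would run the command in a sandbox.
const tools = {
  shell: ({ cmd }) => (cmd.startsWith("wc -l") ? "3 notes.txt" : ""),
};

// The agent itself: call the model, run the tool it asks for, repeat until done.
function runAgent(task, model, maxSteps = 10) {
  const messages = [{ role: "user", content: task }];
  for (let step = 0; step < maxSteps; step++) {
    const reply = model(messages);
    if (reply.done) return reply.text;
    const result = tools[reply.tool](reply.args);
    messages.push({ role: "tool", content: result });
  }
  throw new Error("step budget exhausted");
}

const agentAnswer = runAgent("How many lines are in notes.txt?", fakeModel);
```

Everything that makes an agent good lives outside this loop: in what the model is told, which tools it sees, and what those tools report back.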

I've spent almost a year now building the Shortcut agent, which is widely considered the most accurate spreadsheet agent around. It's deployed inside three of the four largest multistrategy hedge funds, where being wrong is expensive and nobody grades on a curve. We don't have Microsoft's or Anthropic's distribution. What we have is that the agent is right more often, and in this domain that has been the single most compelling reason customers pick us. So agent performance is the question I think about all day.

And here's the gap I keep running into: plenty is written about building agents, but little about building good ones. Look at how much the field varies on something as basic as tool count: Codex and Claude Code ship ~30 tools each; Pi ships 7. When popular agents disagree 4x on the most basic design question, it's a tell: there's no agreed-on principle. So I'm sharing mine, from a year of building one, to demystify the process for anyone writing their own.

Here it is: a good agent is a faithful compression of its task distribution. The rest of this is just what that means, and what it forces you to build.

§ 2

Assume you don't own the environment and you didn't train the model. Then three things are yours to design — the system prompt, the tools, and the artifacts (skills, curated docs, references) — and they're all the same thing: the agent's context.

So the game is simple to state. With the model fixed, accuracy is a function of context quality: bloated context buries the signal, missing context forces guessing, and both cost you accuracy. And accuracy is what you're selling. Its value isn't linear: a task that scores 99% is worth 10x more than one that scores 95%.

But your users don't bring you a uniform distribution of problems to solve. They bring you a long tail.

The agent has to handle all of it. But it cannot hold the union of everything in context at once — that's the bloated-prompt failure mode. So the real objective is sharper than "have everything available": minimize the context spent per task, averaged over the task distribution.

This is exactly the problem a CPU faces. A program might touch gigabytes of data, but the storage right next to the processor is tiny — so computers stack memory in tiers: a small, instant cache (L1), bigger-and-slower ones below it (L2, L3), then main memory and disk. It works because access is long-tailed too: keep the hot set in the fast tier, reach down to the slow tiers only for the rare stuff. A "cache miss" is when what you need isn't in the fast tier and you pay to fetch it from a slower one — exactly the cost you're avoiding on the common path.

Agents should have the same structure. Build your context as L1 / L2 / L3.
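
To make the objective concrete, here is a toy calculation of expected context per task under a flat prompt versus a tiered one. Every number in it (task shares, token counts) is invented for illustration; only the shape of the comparison matters.

```javascript
// Hypothetical task mix: share of tasks, resident tokens, and tokens fetched on a miss.
const taskMix = [
  { name: "common",     share: 0.80, resident: 3000, fetched: 0 },
  { name: "occasional", share: 0.15, resident: 3000, fetched: 4000 },
  { name: "rare",       share: 0.05, resident: 3000, fetched: 9000 },
];

// Flat design: every task carries the union of all documentation.
const flatTokens = 3000 + 4000 + 9000;

// Tiered design: every task pays the resident tier, plus a fetch only on a miss.
const tieredTokens = taskMix.reduce(
  (sum, t) => sum + t.share * (t.resident + t.fetched),
  0
);
```

With these made-up numbers the flat prompt costs 16,000 tokens on every task, while the tiered one averages about 4,050, because the expensive fetches are weighted by how rarely they happen.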

§ 3

Before the hierarchy, the substrate. Every spreadsheet capability I'm about to describe — every read, every write, every curated lookup — is code executed under a single tool.

async function execute() {
  const data = await sheet.getCellRange("Sheet1!A1:D200");
  // ...read, compute, write...
}

The agent writes code; the code calls our functions; the functions touch the sheet. There is no read_range tool, no write_range tool, no make_chart tool. There is one tool, and the API lives inside the code.

Why? Because model accuracy degrades as you add tools. That's been consistent in our own experiments. Every tool you add is more schema in the prompt, more surface to confuse, more ways to pick the wrong one, especially if the tools occupy overlapping responsibilities. A single execute_code tool collapses all of that into one decision — write code — and lets the model compose capabilities with the full expressive power of a programming language or DSL instead of stitching together rigid tool calls (more on this in a future post).

This matters for the hierarchy because it means all three cache tiers are reachable from the same place: the model is always writing code, and L1/L2/L3 are just which functions it knows it can call, and how much work it had to do to find them.
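
A rough sketch of what a single code-execution tool could look like. This is an illustration under assumptions, not the Shortcut implementation: `makeSheet` and `executeCode` are hypothetical names, the sheet is an in-memory object, and a production system would run the code in an isolated sandbox rather than through the `Function` constructor.

```javascript
// Hypothetical in-memory sheet exposing two wrapped functions.
function makeSheet(cells = {}) {
  return {
    cells,
    async getCell(addr) { return cells[addr]; },
    async setCell(addr, value) { cells[addr] = value; },
  };
}

// The one tool: run model-written source that defines `async function execute()`.
// Assumption: a real system would sandbox this instead of using the Function constructor.
async function executeCode(source, sheet) {
  const logs = [];
  const fakeConsole = { log: (...args) => logs.push(args.join(" ")) };
  const factory = new Function("sheet", "console", `${source}\nreturn execute();`);
  await factory(sheet, fakeConsole);
  return logs.join("\n"); // whatever the code logged goes back to the model
}

// What the model might send as the tool argument.
const modelSource = `
async function execute() {
  const units = await sheet.getCell("B2");
  const price = await sheet.getCell("C2");
  await sheet.setCell("D2", units * price);
  console.log("D2 =", await sheet.getCell("D2"));
}`;

const demoSheet = makeSheet({ B2: 1200, C2: 9.99 });
const executeDone = executeCode(modelSource, demoSheet);
```

The tool schema the model sees is one string parameter. Reading, computing, and writing are all ordinary function calls inside that string.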

§ 4

This is the 80%. If reading and writing cell ranges isn't excellent, nothing else matters. So this is where we've spent absurd, disproportionate effort. Look at what a single getCellRange actually does.

Reading a range is an act of compression

Reading a 200-row revenue table:

Common formulas are abbreviated like F1, F2, etc.

A2:North | B2:1200 | C2:9.99 | D2:11988(F1)
A3:South | B3:840  | C3:9.99 | D3:8391.6(F1)
A4:West  | B4:1500 | C4:9.99 | D4:14985(F1)
... (196 more rows, each one line) ...

=F1 -> =RC[-2]*RC[-1]

--- Style patterns ---
D2:D201: 200 cells (numbers)
  → numberFormat:#,##0.00, font.color:#1A7F37
A2:A201: 200 cells (text)
  → font.bold:true

--- Context from cells above ---
A1:Region | B1:Units | C1:Price | D1:Revenue

Three things are happening.

First, formula aliasing. A 500-row column of =A2*B2, =A3*B3, … is 500 near-identical formulas. We normalize each formula to R1C1 form, so =A2*B2 and =A3*B3 both become =RC[-2]*RC[-1]. Then we count the patterns, and any pattern that appears more than ten times collapses to a short alias like F1. The model sees F1 repeated plus one legend line, instead of 500 formulas. Big token savings, zero information loss.
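
A minimal sketch of the aliasing step, assuming simple relative references only. The function names are hypothetical, and the regex deliberately ignores `$` anchors, sheet prefixes, and function names that end in digits; a real normalizer would need a proper formula parser.

```javascript
// Convert column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27.
function colToNum(letters) {
  let n = 0;
  for (const ch of letters) n = n * 26 + (ch.charCodeAt(0) - 64);
  return n;
}

// Rewrite relative A1 references as R1C1 offsets from the host cell.
function toR1C1(formula, hostAddr) {
  const [, hostCol, hostRow] = hostAddr.match(/^([A-Z]+)(\d+)$/);
  const offset = (tag, delta) => (delta === 0 ? tag : `${tag}[${delta}]`);
  return formula.replace(/\b([A-Z]+)(\d+)\b/g, (_, col, row) =>
    offset("R", Number(row) - Number(hostRow)) +
    offset("C", colToNum(col) - colToNum(hostCol))
  );
}

// Replace any pattern seen more than `minCount` times with an alias F1, F2, ...
function aliasFormulas(cells, minCount = 10) {
  const counts = new Map();
  const normalized = cells.map((c) => toR1C1(c.formula, c.addr));
  for (const p of normalized) counts.set(p, (counts.get(p) || 0) + 1);
  const legend = new Map();
  for (const [pattern, n] of counts) {
    if (n > minCount) legend.set(pattern, `F${legend.size + 1}`);
  }
  const labels = normalized.map((p) => legend.get(p) || p);
  return { labels, legend };
}

// 500 formulas in column C: =A2*B2, =A3*B3, ...
const formulaCells = Array.from({ length: 500 }, (_, i) => ({
  addr: `C${i + 2}`,
  formula: `=A${i + 2}*B${i + 2}`,
}));
const aliased = aliasFormulas(formulaCells);
```

Here `aliased.legend` holds a single entry mapping `=RC[-2]*RC[-1]` to `F1`, and every one of the 500 labels is `F1`. Formulas that break the pattern stay spelled out, which is what keeps the compression lossless.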

Second, free row and column context. When you read C5:E20, what do those bare numbers mean? We scan leftward for the row labels and upward for the header row (picking the header by voting on which nearby row has the most text cells) and attach them, so the model gets Region | Q1 | Q2 and North America | … for free and never has to guess what a grid of numbers represents.
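
The header vote can be sketched as follows. The lookback window, the tie-break (the nearest row wins), and the 0-based grid layout are assumptions made for this example, not details taken from the Shortcut implementation.

```javascript
const isText = (v) => typeof v === "string" && v.trim() !== "";

// grid[r][c] holds cell values; rows and columns are 0-based here.
// Pick the header row by voting: among the rows just above the range,
// the one with the most text cells in the requested columns wins.
function findHeaderRow(grid, topRow, firstCol, lastCol, lookback = 5) {
  let best = { row: -1, votes: 0 };
  for (let r = topRow - 1; r >= Math.max(0, topRow - lookback); r--) {
    let votes = 0;
    for (let c = firstCol; c <= lastCol; c++) {
      if (isText(grid[r][c])) votes++;
    }
    if (votes > best.votes) best = { row: r, votes };
  }
  return best.row; // -1 when no nearby row contains text
}

// Row label: the nearest text cell to the left of the range on the same row.
function findRowLabel(grid, row, firstCol) {
  for (let c = firstCol - 1; c >= 0; c--) {
    if (isText(grid[row][c])) return grid[row][c];
  }
  return null;
}

const sampleGrid = [
  ["Sales report", null, null, null],
  ["Region", "Q1", "Q2", "Q3"],
  ["North America", 120, 135, 150],
  ["Europe", 90, 95, 110],
];
// Read B3:D4 (0-based rows 2..3, columns 1..3).
const headerRow = findHeaderRow(sampleGrid, 2, 1, 3);
const headers = sampleGrid[headerRow].slice(1, 4);
const rowLabel = findRowLabel(sampleGrid, 2, 1);
```

The title row loses the vote because it has no text in the requested columns, so the read comes back labeled with `Q1`, `Q2`, `Q3` and `North America` without the model asking for them.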

Third, style compression. Formatting is information too — a bold red cell with a 0.00% number format is telling you something — but listing the full style of every cell would swamp the values. So we group cells by identical style, collapse each group to its connected range, and print one line per group: the range, the cell count, and a compact description.
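
A simplified sketch of the grouping step. It only merges vertical runs within one column and treats two styles as identical when they serialize to the same JSON (so key order matters); both are shortcuts for the example, and the names are hypothetical.

```javascript
// Collapse per-cell styles into one line per run of identical style.
function compressStyles(cells) {
  const sorted = [...cells].sort(
    (a, b) => a.col.localeCompare(b.col) || a.row - b.row
  );
  const runs = [];
  for (const cell of sorted) {
    const key = JSON.stringify(cell.style);
    const last = runs[runs.length - 1];
    const extendsLast =
      last && last.col === cell.col && last.key === key && last.endRow + 1 === cell.row;
    if (extendsLast) {
      last.endRow = cell.row;
      last.count++;
    } else {
      runs.push({ col: cell.col, startRow: cell.row, endRow: cell.row, key, count: 1 });
    }
  }
  return runs.map((r) => {
    const desc = Object.entries(JSON.parse(r.key))
      .map(([k, v]) => `${k}:${v}`)
      .join(", ");
    return `${r.col}${r.startRow}:${r.col}${r.endRow}: ${r.count} cells\n  → ${desc}`;
  });
}

const styledCells = [];
for (let row = 2; row <= 201; row++) {
  styledCells.push({ col: "A", row, style: { "font.bold": true } });
  styledCells.push({
    col: "D", row, style: { numberFormat: "#,##0.00", "font.color": "#1A7F37" },
  });
}
const styleLines = compressStyles(styledCells);
```

Four hundred styled cells come back as two entries, one for `A2:A201` and one for `D2:D201`, in the same shape as the style-pattern block shown earlier.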

Two hundred formulas became one legend line. Four hundred styled cells became two lines. And the header row the model never explicitly asked for is right there at the bottom. That's the whole table, losslessly, in a fraction of the tokens a raw dump would cost. Every one of these is the compression-vs-discovery tradeoff, won decisively for the common case.

Writing cells: tell the model what it actually changed, and what looks wrong

Writing is harder than it looks, because a single execute_code call can change hundreds of cells, and the agent needs to know what happened without re-reading the whole sheet. So after the code runs, we hand back a structured diff of every cell that changed — and, just as importantly, we compress and triage it.

The code:

async function execute() {
  const rows = await sheet.getCellRange("Sheet1!A2:C201");
  for (let i = 0; i < rows.length; i++) {
    const r = i + 2;
    await sheet.setCell(`D${r}`, `=B${r}*C${r}`);
  }
}

The diff that comes back:

--- CELL DIFF SUMMARY ---
(Formatted display values shown. ∅ = undefined/empty.)

  Changed without issues: 199 total cells
    Sheet1!Row 2 (D): 1 cells
      → D2: ∅ -> 11,988 [=B2*C2]
    Sheet1!Row 3 (D): 1 cells
      → D3: ∅ -> 8,391.6 [=B3*C3]
    ... (sampled rows) ...
    Sheet1!Row 201 (D): 1 cells
      → D201: ∅ -> 4,995 [=B201*C201]
    ... and 189 more rows

  Cells that need review:
    MUST FIX: INVALID_FORMULA: 1 total cells
      Sheet1!Row 57 (D): 1 cells
        → D57: ∅ -> #REF! [=B57*C57]

Two kinds of compression are doing the work here.

First, the diff is grouped and sampled, not dumped. Changed cells are grouped by sheet and row, each row shown as a column range with a count (Row 2 (D): 1 cells), and only a deterministic sample of cells per row and rows per section is printed, with "… and N more" tallies for the rest. Two hundred writes become a handful, and the agent still knows the totals.
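
One way to sketch the group-and-sample step. The sampling rule used here (the first few rows plus the last one) is an assumption chosen because it is deterministic; the article does not say which rule Shortcut uses, and `summarizeDiff` is a hypothetical name.

```javascript
// Group changed cells by row, then print only a deterministic sample of rows.
function summarizeDiff(changes, maxRows = 4) {
  const byRow = new Map();
  for (const ch of changes) {
    if (!byRow.has(ch.row)) byRow.set(ch.row, []);
    byRow.get(ch.row).push(ch);
  }
  const rows = [...byRow.keys()].sort((a, b) => a - b);
  const shown = rows.length <= maxRows
    ? rows
    : [...rows.slice(0, maxRows - 1), rows[rows.length - 1]];

  const lines = [`Changed: ${changes.length} total cells`];
  for (const row of shown) {
    const cells = byRow.get(row);
    const cols = cells.map((c) => c.col).join(",");
    lines.push(`  Row ${row} (${cols}): ${cells.length} cells`);
    for (const c of cells) {
      lines.push(`    → ${c.col}${row}: ${c.before} -> ${c.after}`);
    }
  }
  if (rows.length > shown.length) {
    lines.push(`  ... and ${rows.length - shown.length} more rows`);
  }
  return lines;
}

const writes = Array.from({ length: 200 }, (_, i) => ({
  row: i + 2, col: "D", before: "∅", after: `=B${i + 2}*C${i + 2}`,
}));
const diffLines = summarizeDiff(writes);
```

Two hundred writes become ten lines of output, and the header and the trailing tally still carry the exact totals.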

Second, the diff is categorized. Clean writes land under "Changed without issues." Anything that looks suspicious — an invalid formula like #REF!, an untagged hardcoded number, a hardcoded number buried inside a formula, an implausibly large percentage — gets pulled into a "Cells that need review" section, and the worst offenders are flagged MUST FIX. That #REF! in row 57 would be trivial to miss in a wall of two hundred green diffs; here it's surfaced at the top with a label. The feedback loop isn't "here's what changed," it's "here's what changed, and here's the part you probably got wrong" — a built-in linter on the agent's own edits.
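
A sketch of the triage pass under stated assumptions: the rule set, the `REVIEW` label, and the 1000% threshold are illustrative guesses, and the untagged-hardcode check is left out because it depends on a tagging scheme the article does not describe.

```javascript
const ERROR_VALUE = /^#(REF!|NAME\?|VALUE!|DIV\/0!|N\/A|NUM!|NULL!)$/;

// Classify one written cell.
function triageCell(cell) {
  if (ERROR_VALUE.test(String(cell.value))) {
    return { severity: "MUST FIX", reason: "INVALID_FORMULA" };
  }
  if (cell.formula) {
    // Remove cell references, then look for leftover numeric literals.
    const withoutRefs = cell.formula.replace(/\$?[A-Z]+\$?\d+/g, "");
    if (/\d/.test(withoutRefs)) {
      return { severity: "REVIEW", reason: "HARDCODED_NUMBER_IN_FORMULA" };
    }
  }
  if ((cell.numberFormat || "").includes("%") && Math.abs(cell.value) > 10) {
    return { severity: "REVIEW", reason: "IMPLAUSIBLE_PERCENTAGE" }; // over 1000%
  }
  return { severity: "OK", reason: null };
}

// Split a batch of writes into clean cells and cells that need review.
function triage(cells) {
  const report = { clean: [], review: [] };
  for (const cell of cells) {
    const verdict = triageCell(cell);
    const bucket = verdict.severity === "OK" ? report.clean : report.review;
    bucket.push({ ...cell, ...verdict });
  }
  // Worst offenders first.
  const rank = (c) => (c.severity === "MUST FIX" ? 0 : 1);
  report.review.sort((a, b) => rank(a) - rank(b));
  return report;
}

const triaged = triage([
  { addr: "D2", formula: "=B2*C2", value: 11988 },
  { addr: "E2", formula: "=D2*1.08", value: 12947.04 },
  { addr: "D57", formula: "=B57*C57", value: "#REF!" },
  { addr: "F2", formula: "=D2/B2", value: 42, numberFormat: "0.00%" },
]);
```

The `#REF!` cell sorts to the front of `triaged.review` even though it was written third, which is the behavior that keeps one bad cell from hiding among hundreds of clean ones.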

L1 in one line: the operations on the steep part of the curve get feature-engineered, token-compressed, consequence-reporting wrappers that live in the prompt forever. They're expensive to build, and you build them anyway, because the agent pays the cost on every task.

§ 5

You cannot put everything in L1. Conditional formatting, pivot tables, charts, data validation, copy/move semantics — each is important, each shows up a few times a session, and each has enough surface that documenting it in the system prompt would bloat every task that doesn't use it. Classic L2.

So we wrote curated capability specs in English, fetched on demand, exactly like skill.md files. The model calls, from inside its code:

console.log(general.getConditionalFormattingInfo());
console.log(general.getPivotTableInfo());
console.log(general.getChartInfo());
console.log(general.getDataValidationInfo());
console.log(general.getAPIInfo("addSpanAt"));   // any single function, by name

These aren't dumps of type signatures. They're hand-written prose — a few hundred lines each — that describe the canonical way to accomplish the task, including the knowledge the raw API will never give you. Take the pivot-table spec. It doesn't just list methods; it teaches the whole recipe, in the right order:

const pt = sheet.originalSheet.pivotTables.add("SalesPivot", "SalesData", 0, 0, ...);
pt.suspendLayout();
pt.add("Region",  "Region",        rowField);
pt.add("Quarter", "Quarter",       columnField);
pt.add("Amount",  "Sum of Amount", valueField, 8);   // 8 = sum
pt.resumeLayout();

and it bakes in the things you would otherwise learn only by failing repeatedly: that you must suspendLayout()/resumeLayout() around a batch of changes or the table rebuilds on every call; that a value field's aggregation has to be passed as a raw integer (8 for sum) because the friendly enum doesn't exist at runtime. None of that is a quirky footgun — it's the actual shape of doing pivots correctly, written down once by someone who already paid for it.

The key property: this costs zero tokens until the task needs it. A task that never touches pivots never pays for the pivot docs. One console.log is the entire discovery cost — a single cache miss, served fast.
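
The zero-until-needed property can be sketched like this. The spec text, the `specStore` object, and the four-characters-per-token estimate are all placeholders for illustration; only the `general.get…Info()` call shape comes from the article.

```javascript
// Hypothetical spec store. Real specs would be a few hundred lines of prose each.
const specStore = {
  pivotTable: [
    "PIVOT TABLES",
    "Wrap batches of field changes in suspendLayout() / resumeLayout().",
    "Pass value-field aggregation as a raw integer (8 = sum).",
  ].join("\n"),
  chart: ["CHARTS", "(chart recipe would go here)"].join("\n"),
};

// Crude token estimate: about four characters per token (a rule of thumb, not a tokenizer).
const estimateTokens = (text) => Math.ceil(text.length / 4);

// The resident prompt carries only one pointer line per spec.
const residentPointers = Object.keys(specStore)
  .map((name) => {
    const fn = `get${name[0].toUpperCase()}${name.slice(1)}Info`;
    return `general.${fn}(): fetch before using ${name}`;
  })
  .join("\n");

// The `general` object the model's code can call; tracks what each task paid for.
function makeGeneral() {
  let fetchedTokens = 0;
  const fetch = (name) => {
    fetchedTokens += estimateTokens(specStore[name]);
    return specStore[name];
  };
  return {
    getPivotTableInfo: () => fetch("pivotTable"),
    getChartInfo: () => fetch("chart"),
    get fetchedTokens() { return fetchedTokens; },
  };
}

const plainTask = makeGeneral(); // never touches pivots
const pivotTask = makeGeneral();
const pivotSpec = pivotTask.getPivotTableInfo(); // one call is the whole discovery cost
```

`plainTask.fetchedTokens` stays at zero for the whole session, while `pivotTask` pays for exactly one spec, once.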

The same idea, for executable tools

L2 isn't only for docs. We apply the identical pattern to deferred tools — web_search, web_crawl, create_website, etc. Their schemas don't sit in the prompt. Instead there's a meta-tool wall:

get_tool_info("web_search")   → returns the schema, marks it "fetched"
execute_tool("web_search", …) → REFUSES unless you fetched it first

The set of fetched tools is, literally, a session-scoped cache. The model loads a schema once, and from then on it's resident. Same compression-vs-discovery tradeoff, same resolution: keep the prompt small, pay a one-step miss when you actually need the capability. This is the same idea as deferred tools on Claude, but we aren't locked to one vendor's tool-loading feature to get the behavior.
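
A minimal sketch of such a meta-tool wall. The two tool names follow the article; the registry shape, the refusal message, and the stubbed `web_search` are assumptions for the example.

```javascript
// Hypothetical registry of deferred tools: schema plus implementation.
const deferredTools = {
  web_search: {
    schema: { name: "web_search", params: { query: "string" } },
    run: ({ query }) => `results for "${query}"`, // stub
  },
};

function makeToolWall(registry) {
  const fetched = new Set(); // session-scoped cache of loaded schemas
  return {
    get_tool_info(name) {
      if (!registry[name]) throw new Error(`unknown tool: ${name}`);
      fetched.add(name);
      return registry[name].schema;
    },
    execute_tool(name, args) {
      if (!fetched.has(name)) {
        // Refuse so the model cannot guess at arguments it has never seen.
        return { ok: false, error: `call get_tool_info("${name}") first` };
      }
      return { ok: true, result: registry[name].run(args) };
    },
  };
}

const wall = makeToolWall(deferredTools);
const refused = wall.execute_tool("web_search", { query: "SpreadJS sparklines" });
wall.get_tool_info("web_search");
const allowed = wall.execute_tool("web_search", { query: "SpreadJS sparklines" });
```

The first call is refused and the second succeeds, because the schema fetch in between added `web_search` to the session's set. Returning the refusal as a value rather than throwing gives the model a message it can act on.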

§ 6

Then there's the long tail: the one obscure thing we never wrapped and never wrote a spec for. You can't anticipate it — by definition. But the agent still has to be able to get there, or it hits a wall and fails the task. Concretely, this is where requests like these end up:

  • "Add a sparkline to each row summarizing its trend": sparklines are a real but rarely-touched API surface.
  • "Set the chart's secondary axis to log scale and recolor just the third series": a chart property three levels deep that no curated spec bothered to cover.
  • "Insert a hyperlink from this cell to that named range, and group these shapes": drawing/shape/hyperlink corners nobody asked about until now.

So L3 is the complete raw API: the entire Office.js surface (Excel plugin) or the entire SpreadJS surface (Shortcut web), dumped to disk. It's a machine-generated reference that is 70k lines long. It contains everything. It's also completely unusable as prompt context; you'd never paste it in.

The trick: you give the agent a skill, a short map that teaches it how to mine the tome with bash:

# from the advanced-api SKILL.md — the recommended workflow
grep -n '"charts.add"' api-reference.json -A 5 # find a method
grep -n '"pivots\.' api-reference.json | head # list a namespace
grep -n '"ChartConfig"' api-reference.json -A 10 # resolve a type
grep -n '"isEnum": true' api-reference.json -B2 -A10 # enumerate enums

The skill is ~100 lines. It says: here's the structure, here's how each method and type entry is shaped, here's the grep recipe for each kind of question. With it, the agent goes from "tens of thousands of lines I can't read" to "the 3-6 greps that surface exactly the signature I need." That's the L3 access cost — real, but bounded, and only paid by the rare task that reaches this deep.
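
Why a handful of greps stays cheap no matter how large the reference grows can be shown with a small stand-in for `grep -n … -A N | head`. The reference content below is synthetic and its entry shape is invented; the real api-reference.json layout is not given in the article.

```javascript
// Return at most `maxHits` matches, each with `after` lines of trailing context,
// so a query's output stays bounded regardless of the reference's size.
function grepLines(lines, pattern, { after = 0, maxHits = 10 } = {}) {
  const hits = [];
  for (let i = 0; i < lines.length && hits.length < maxHits; i++) {
    if (pattern.test(lines[i])) {
      hits.push({ line: i + 1, text: lines.slice(i, i + 1 + after) });
    }
  }
  return hits;
}

// Synthetic stand-in for the on-disk reference: 60,000 filler lines plus one target.
const referenceLines = [];
for (let i = 0; i < 20000; i++) {
  referenceLines.push(`"ns${i}.method": {`, `  "returns": "void"`, `},`);
}
referenceLines.push(
  `"charts.add": {`,
  `  "params": ["type", "range"],`,
  `  "returns": "Chart"`,
  `},`
);

// "Find a method": one hit with two lines of context.
const found = grepLines(referenceLines, /"charts\.add"/, { after: 2 });
// "List a namespace": many candidates, capped at ten like `| head`.
const namespaceList = grepLines(referenceLines, /^"ns1\d*\./, { maxHits: 10 });
```

The first query returns three lines out of more than sixty thousand, and the second is cut off at ten hits. What reaches the context is sized by the question, not by the file.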

And the system prompt makes the escape hatch explicit, so the model knows the path exists and when to take it:

API HIERARCHY — There are 2 levels of API capability. Wrapped API: convenience functions; some listed directly, others via getAPIInfo(...). NEVER guess — read the docs in FULL. Raw API: use when the wrapped API doesn't cover your need… If the wrapped API can't do it, use the raw API — don't compromise.

That last clause is the whole point of L3. The agent should never be stuck. It can miss in L1, drop to L2, and if even the curated spec is silent, descend into the raw tome and still come out with the answer in a sane number of calls.

§ 7

It's worth looking at where the tokens go, because the hierarchy shows up directly in the system prompt's shape.

The bulk of the prompt is L1 — on the order of a few hundred lines. Core read/write operations, the execute_code contract, the key types and the handful of methods the agent uses on essentially every task, plus the execution and safety guidelines. This is the part that's resident on every single call, so it's also the part we fight hardest to keep tight.

L2 is a thin slice on top — roughly 50 lines. It isn't the specs themselves; it's a curated allowlist of the "blessed" methods and the pointers that tell the agent the getXInfo(...) specs exist and when to reach for them. The specs' actual content stays out of the prompt until a console.log pulls it in.

L3 is essentially 5 lines: the name and description of the skill.md, plus a few references scattered elsewhere. The raw reference (70k lines) lives entirely on disk and never touches the prompt. All that's resident is the skill's name and description and the one line in the API-hierarchy section pointing at it.

So the budget mirrors the frequency curve: most of the prompt is spent on the 80% case, a little on signposting the 15%, and almost nothing on the long tail — which is exactly the allocation the cache-hierarchy framing predicts.

§ 8

Spreadsheets are just my example. The structure transfers to any domain. The compression in those system prompts and curated specs is really an encoding of the distribution of your users and the tasks they do — and you, in your domain, understand that distribution better than anyone. So your job is three questions:

  1. What do you wrap into L1? The bread-and-butter operations on the steep part of the frequency curve. Make them brutally token-efficient and fast, and make them report consequences. Spend disproportionate effort here — the agent pays this cost on every task.

  2. What do you defer to L2? The important-but-occasional capabilities. Write them as curated, English, gotcha-aware specs reachable in one discovery step. Encode the canonical recipe and the constraints, not just the signatures.

  3. What is your escape hatch (L3)? The raw, complete substrate — plus a skill that teaches the agent to mine it. It doesn't have to be ergonomic. It has to be reachable, complete, and findable in a bounded number of steps. The agent must be able to — and will — eventually find the right information.

Get those three placements right and you've built an agent that is fast on the common case, capable on the occasional one, and never truly stuck on the rare one — all while keeping context small enough that the model stays sharp.

§ 9

One closing observation. What counts as L1 is not fixed; it drifts with model strength.

Early, weak models needed tiny, single-purpose tools and everything spelled out. Today's models can take a larger L2 spec in one shot and reason over more raw L3 detail without choking. So as models improve, yesterday's L3 becomes tomorrow's L2, and yesterday's L2 collapses into L1. The agent's responsibility expands outward; the tiers slide down a level.

But the hierarchy itself never goes away — because context will always be scarce relative to everything you could put in it, and noise will always cost you accuracy. There is no model so large that "put the right thing in front of it at the right time" stops mattering.

Bigger context windows tempt people to paste in more. The better instinct is the one CPUs settled on decades ago: summaries in cache, details on demand, the raw substrate as the last resort. Build your agent's context like a memory hierarchy, and accuracy follows.
