Glean 拾遗
日刊 /2026-06-07 / 构建 Claude Code 的教训:提示缓存就是一切

构建 Claude Code 的教训:提示缓存就是一切

原文 x.com 收录 2026-06-07 06:01 阅读 8 min
AI 解读

Anthropic 工程师分享 Claude Code 中优化提示缓存的实际经验。提示缓存基于前缀匹配,缓存从请求头到每个 breakpoint 的内容,因此 prompt 各部分的顺序至关重要:遵循“静态在前、动态在后”原则,能最大化跨会话的缓存命中。文章给出多条反直觉教训:用消息传递更新信息而非修改系统提示;不要在会话中切换模型或增减工具,这会立刻导致缓存全部失效;压缩(compaction)时复用父会话前缀避免缓存丢失。每条建议都附带具体实现策略(如 system-reminder 标签、EnterPlanMode/ExitPlanMode 作为工具、defer_loading 机制)。适合正在构建长运行 Agent 产品的工程师参考。

原文 8 分钟
原文 x.com ↗
§ 1

It is often said in engineering that "Cache Rules Everything Around Me", and the same rule holds for agents.

Long running agentic products like Claude Code are made feasible by prompt caching which allows us to reuse computation from previous roundtrips and significantly decrease latency and cost.

What is prompt caching, how does it work and how do you implement it technically? Read more in @RLanceMartin's piece on prompt caching and our new auto-caching launch.

At Claude Code, we build our entire harness around prompt caching. A high prompt cache hit rate decreases costs and helps us create more generous rate limits for our subscription plans, so we run alerts on our prompt cache hit rate and declare SEVs if they're too low.

These are the (often unintuitive) lessons we've learned from optimizing prompt caching at scale.

工程师圈子里有句话叫「缓存主宰一切」,这条规则对智能体同样成立。

像 Claude Code 这样的长运行智能体产品,正是因为 prompt 缓存才变得可行——它让我们能复用前序往返的计算结果,大幅降低延迟和成本。

什么是 prompt 缓存?它如何工作?技术上怎么实现?更多细节请参考 @RLanceMartin 关于 prompt 缓存以及我们新推出的自动缓存功能的文章。

在 Claude Code,我们整个框架都围绕 prompt 缓存构建。高缓存命中率能降低成本,帮助我们为订阅计划设置更慷慨的速率限制——所以我们会对缓存命中率设置告警,命中率太低时直接拉 SEV(严重事故)。

以下是我们从大规模优化 prompt 缓存中学到的(往往反直觉的)经验。

§ 2

Prompt caching works by prefix matching — the API caches everything from the start of the request up to each cache_control breakpoint. This means the order you put things in matters enormously, you want as many of your requests to share a prefix as possible.

The best way to do this is static content first, dynamic content last. For Claude Code this looks like:

  1. Static system prompt & Tools (globally cached)
  2. Claude.MD (cached within a project)
  3. Session context (cached within a session)
  4. Conversation messages This way we maximize how many sessions share cache hits.

But this can be surprisingly fragile! Examples of reasons we've broken this ordering before include: putting an in-depth timestamp in the static system prompt, shuffling tool order definitions non-deterministically, updating parameters of tools (e.g. what agents the AgentTool can call), etc.

Prompt 缓存通过前缀匹配工作——API 会缓存从请求开头到每个 cache_control 断点之间的所有内容。这意味着请求里内容的顺序至关重要:你要让尽可能多的请求共享同一个前缀。

最好的做法是静态内容在前,动态内容在后。在 Claude Code 中,顺序如下:

  1. 静态系统提示词与工具(全局缓存)
  2. Claude.MD(项目内缓存)
  3. 会话上下文(会话内缓存)
  4. 对话消息 这样我们就最大化地让多个会话共享缓存命中。

但这个顺序竟然异常脆弱!我们曾打破这个顺序的原因包括:在静态系统提示词里放了一条详细的 timestamp、对工具定义的顺序做了非确定性打乱、更新了工具的参数(比如 AgentTool 可以调用哪些智能体)等等。

§ 3

There may be times when the information you put in your prompt becomes out of date, for example if you have the time or if the user changes a file. It may be tempting to update the prompt, but that would result in a cache miss and could end up being quite expensive for the user.

Consider if you can pass in this information via messages in the next turn instead. In Claude Code, we add a <system-reminder> tag in the next user message or tool result with the updated information for the model (e.g. it is now Wednesday), which helps preserve the cache.

有时 prompt 里的信息会过时,比如时间变了,或者用户修改了一个文件。你可能会忍不住去更新 prompt,但这会导致缓存缺失,最终让用户付出更高成本。

不妨考虑是否能通过下一轮的消息来传递这些信息。在 Claude Code 中,我们在下一条用户消息或工具结果里添加一个 <system-reminder> 标签,附带更新后的信息(比如「现在是周三」),这样就能保住缓存。

§ 4

Prompt caches are unique to models and this can make the math of prompt caching quite unintuitive.

If you're 100k tokens into a conversation with Opus and want to ask a question that is fairly easy to answer, it would actually be more expensive to switch to Haiku than to have Opus answer, because we would need to rebuild the prompt cache for Haiku.

If you need to switch models, the best way to do it is with subagents, where Opus would prepare a "handoff" message to another model on the task that it needs done. We do this often with the Explore agents in Claude Code which use Haiku.

Prompt 缓存是针对具体模型的,这让缓存的计算逻辑变得相当反直觉。

如果你已经跟 Opus 聊了 10 万个 token,现在想问一个很简单的问题,切换到 Haiku 反而比让 Opus 回答更贵——因为你需要为 Haiku 重建整个 prompt 缓存。

如果确实需要切换模型,最佳方式是通过子智能体:让 Opus 准备一条「交接」消息,把需要完成的任务交给另一个模型。我们在 Claude Code 的 Explore 代理中经常这么做,它们用的是 Haiku。

§ 5

Changing the tool set in the middle of a conversation is one of the most common ways people break prompt caching. It seems intuitive — you should only give the model tools you think it needs right now. But because tools are part of the cached prefix, adding or removing a tool invalidates the cache for the entire conversation.

在对话中途更改工具集是人们破坏 prompt 缓存最常见的原因之一。这看起来很符合直觉——你应该只给模型当前需要的工具。但因为工具是缓存前缀的一部分,增删一个工具会使整个对话的缓存失效。

§ 6

Plan mode is a great example of designing features around caching constraints. The intuitive approach would be: when the user enters plan mode, swap out the tool set to only include read-only tools. But that would break the cache.

Instead, we keep all tools in the request at all times and use EnterPlanMode and ExitPlanMode as tools themselves. When the user toggles plan mode on, the agent gets a system message explaining that it's in plan mode and what the instructions are — explore the codebase, don't edit files, call ExitPlanMode when the plan is complete. The tool definitions never change.

This has a bonus benefit: because EnterPlanMode is a tool the model can call itself, it can autonomously enter plan mode when it detects a hard problem, without any cache break.

计划模式是一个围绕缓存约束来设计功能的绝佳例子。直观的做法是:用户进入计划模式时,把工具集换成只读工具。但那会破坏缓存。

我们的做法是:始终在所有请求中保留全部工具,把 EnterPlanMode 和 ExitPlanMode 本身设计成工具。当用户切换到计划模式时,智能体会收到一条系统消息,解释当前处于计划模式以及运行指令——探索代码库、不编辑文件、计划完成时调用 ExitPlanMode。工具定义从未改变。

这还有一个额外的好处:因为 EnterPlanMode 是模型可以自行调用的工具,它在发现难题时可以自主进入计划模式,完全不破坏缓存。

§ 7

The same principle applies to our tool search feature. Claude Code can have dozens of MCP tools loaded, and including all of them in every request would be expensive. But removing them mid-conversation would break the cache.

Our solution: defer_loading. Instead of removing tools, we send lightweight stubs — just the tool name, with defer_loading: true — that the model can "discover" via a ToolSearch tool when needed. The full tool schemas are only loaded when the model selects them. This keeps the cached prefix stable: the same stubs are always present in the same order.

Luckily you can use the tool search tool through our API to simplify this.

同样的原则也适用于我们的工具搜索功能。Claude Code 可以加载几十个 MCP 工具,如果在每个请求中都包含全部工具,成本会很高。但在对话中途移除它们会破坏缓存。

我们的解决方案是:延迟加载(defer_loading)。我们不移除工具,而是发送轻量级的桩(stub)——只有工具名称,加上 defer_loading: true——模型在需要时可以通过 ToolSearch 工具「发现」它们。完整的工具 schema 只有在模型选中时才会加载。这样缓存前缀始终保持稳定:同样的桩始终以同样的顺序出现。

幸运的是,你可以通过我们的 API 直接使用工具搜索功能来简化这一流程。

§ 8

Compaction is what happens when you run out of the context window. We summarize the conversation so far and continue a new session with that summary.

Surprisingly, compaction has many edge cases with prompt caching that can be unintuitive.

In particular, when we compact we need to send the entire conversation to the model to generate a summary. If this is a separate API call with a different system prompt and no tools (which is the simple implementation), the cached prefix from the main conversation doesn't match at all. You pay full price for all those input tokens, drastically increasing the cost for the user.

The Solution — Cache-Safe Forking

When we run compaction, we use the exact same system prompt, user context, system context, and tool definitions as the parent conversation. We prepend the parent's conversation messages, then append the compaction prompt as a new user message at the end.

From the API's perspective, this request looks nearly identical to the parent's last request — same prefix, same tools, same history — so the cached prefix is reused. The only new tokens are the compaction prompt itself.

This does mean however that we need to save a "compaction buffer" so that we have enough room in the context window to include the compact message and the summary output tokens.

Compaction is tricky but luckily, you don't need to learn these lessons yourself — based on our learnings from Claude Code we built compaction directly into the API, so you can apply these patterns in your own applications.

压缩(compact)发生在上下文窗口用完时:我们会总结当前对话,然后带着这份摘要开始一个新的会话。

令人惊讶的是,压缩在 prompt 缓存方面有许多反直觉的边缘情况。

具体来说,在压缩时,我们需要把整个对话发送给模型来生成摘要。如果这是一个独立的 API 调用,使用不同的系统提示词且没有工具(简单的实现方式就是这样),那么它与主对话的缓存前缀完全不匹配,所有输入 token 都得按全价付费,大幅增加用户成本。

解决方案——缓存安全的 fork:

执行压缩时,我们使用与父对话完全相同的系统提示词、用户上下文、系统上下文和工具定义。我们把父对话的消息前置,然后在末尾追加一条压缩 prompt 作为新的用户消息。

从 API 的角度看,这个请求与父对话的最后一次请求几乎一模一样——相同的前缀、相同的工具、相同的历史——所以缓存的 prefix 被复用了。唯一的新 token 就是压缩 prompt 本身。

但这意味着我们需要预留一个「压缩缓冲区」,确保上下文窗口有足够空间容纳压缩消息和摘要输出的 token。

压缩虽然棘手,但幸运的是,你不需要亲自踩这些坑——基于我们从 Claude Code 积累的经验,我们已经把压缩功能直接内建到了 API 中,你可以在自己的应用里直接使用这些模式。

§ 9
  1. Prompt caching is a prefix match. Any change anywhere in the prefix invalidates everything after it. Design your entire system around this constraint. Get the ordering right and most of the caching works for free.
  2. Use messages instead of system prompt changes. You may be tempted to edit the system prompt to enter plan mode, change the date, etc., but it would actually be better to insert these into messages during the conversation.
  3. Don't change tools or models mid-conversation. Use tools to model state transitions (like plan mode) rather than changing the tool set. Defer tool loading instead of removing tools.
  4. Monitor your cache hit rate like you monitor uptime. We alert on cache breaks and treat them as incidents. A few percentage points of cache miss rate can dramatically affect cost and latency.
  5. Fork operations need to share the parent's prefix. If you need to run a side computation (compaction, summarization, skill execution), use identical cache-safe parameters so you get cache hits on the parent's prefix. Claude Code is built around prompt caching from day one, you should do the same if you're building an agent.
  1. Prompt 缓存是前缀匹配。前缀中任何地方的任何变更都会使之后的所有缓存失效。请围绕这个约束来设计你的整个系统。顺序搞对了,大部分缓存就自动生效了。
  2. 用消息而非修改系统提示词。你可能想通过编辑系统提示词来进入计划模式、更改日期等,但更好的做法是在会话过程中将这些信息插入到消息里。
  3. 不要在对话中途更换工具或模型。用工具来建模状态转换(比如计划模式),而不是改变工具集。延后加载工具,而不是移除工具。
  4. 像监控系统在线率一样监控你的缓存命中率。我们对缓存中断设置告警,并将其当作事故处理。仅仅几个百分点的缓存缺失率就会显著影响成本和延迟。
  5. Fork 操作需要共享父前缀。如果你需要运行一个侧线计算(压缩、摘要、技能执行),请使用相同的、缓存安全的参数,以便命中父前缀的缓存。 Claude Code 从第一天起就围绕 prompt 缓存构建,如果你在构建智能体,也应该这样做。
打开原文 ↗