Glean 拾遗
Daily /2026-06-30 / Context Engineering for AI Agents: The Complete Playbook

Context Engineering for AI Agents: The Complete Playbook

Source x.com Glean’d 2026-06-30 06:01 Read 21 min
AI summary

This article systematically explains why context engineering is the most critical skill for building reliable AI agents. It argues that agent degradation usually stems from poor context window management rather than model limitations. The context window is likened to RAM, and as tool outputs, retrieval results, and conversation history accumulate, attention thins and the “Lost in the Middle” effect kicks in. Four core strategies are presented: Write (persist information outside context), Select (just-in-time retrieval), Compress (proactively reduce tokens), and Isolate (separate contexts for different jobs). The article details four failure modes—poisoning, distraction, confusion, and clash—and offers concrete evidence: Chroma benchmarks show continuous performance decline well before token limits, RAG‑MCP improved tool selection accuracy from 14% to 43% while halving token usage, and KV‑cache hit rates can yield a 10× cost reduction. A real-world workflow that shipped ~35,000 lines of Rust code in 7 hours using frequent intentional compaction is presented. The target audience is engineers building production‑grade agents.

Original · 21 min
x.com ↗
§ 1

Your AI agent works great for the first 10 steps.

Then somewhere around step 15, it starts getting sloppy.

Wrong tool calls. Forgetting your original instructions. Low-quality outputs.

Most people blame the model.

It's almost never the model.

It's what the model is seeing.

Organizing what the model sees is called context engineering.

It is quickly becoming the most important skill for anyone building AI agents.

Here is the complete playbook.

你的 AI 代理在前 10 步表现很好。

但到了第 15 步左右,它就开始变得马虎了。

调用错误的工具、忘记原始指令、输出质量低下。

大多数人会归咎于模型。

但这几乎从来不是模型的问题。

问题在于模型看到的内容。

组织模型所看到的内容被称为上下文工程/context engineering。

它正迅速成为每个构建 AI 代理的人最重要的技能。

以下是完整的玩法手册。

§ 2

Prompt engineering is dead. Context engineering is what matters now.

You've heard of prompt engineering.

Writing clear instructions. Good examples. Telling the model what role to play.

That works perfectly for a chatbot.

It stops working the moment you build an agent.

Here's why.

A chatbot answers one question and stops.

An agent takes actions — browsing the web, calling APIs, writing code, running commands — step after step after step, sometimes for dozens of steps.

Every single step produces output that gets added to the model's context.

And that context is finite.

Anthropic's engineering team defines it like this:

"Context is the set of tokens included when you sample from an LLM. Context engineering is optimizing the utility of those tokens to consistently achieve a desired outcome."

Put simply: make sure your agent sees the right information, in the right format, at the right time.

Prompt engineering is a subset of context engineering.

Context engineering is everything.


提示工程已死,上下文工程才是关键

你听说过提示工程。

编写清晰的指令、提供好例子、告诉模型扮演什么角色。

这对聊天机器人来说完美适用。

但当你要构建一个代理时,它就不起作用了。

原因如下。

聊天机器人回答一个问题后就停止了。

代理会采取行动——浏览网页、调用 API、编写代码、运行命令——一步一步又一步,有时多达几十步。

每一步产生的输出都会被添加到模型的上下文中。

而上下文是有限的。

Anthropic 的工程团队这样定义它:

"上下文是从 LLM 采样时包含的 token 集合。上下文工程就是优化这些 token 的效用,以持续地达成预期结果。"

简而言之:确保你的代理在正确的时间、以正确的格式看到正确的信息。

提示工程是上下文工程的一个子集。

上下文工程就是一切。

§ 3

Your agent's context window is RAM. And it's filling up.

LangChain has the right analogy for this.

Think of an LLM like a new kind of operating system.

The model is the CPU — it does the thinking.

The context window is RAM — the working memory where everything the model can currently see and reason about lives.

Just like your computer slows down when RAM fills up, your agent's reasoning degrades when the context window gets crowded.

This is called context rot.

Chroma ran a study evaluating 18 frontier models — GPT-4.1, Claude 4, Gemini 2.5, Qwen3, and others.

Every single model's performance degraded as input length increased.

Not at the hard limit. Well before it.

A model with a 200K-token window might show significant degradation at 50K tokens.

The decline is continuous. Not a cliff.

Why? Transformers work by having every token attend to every other token — creating n-squared relationships. As context grows, the model's ability to hold all those relationships thins out.

And then there's the "Lost in the Middle" problem.

LLMs show a U-shaped attention curve.

→ Beginning of context: remembered well

→ End of context: remembered well

→ Middle: largely ignored

Researchers measured a 30+ percentage point accuracy drop when relevant information moved from the beginning of context to the middle.

Your original instructions — buried under 50,000 tokens of tool outputs — effectively disappear.

Claude Code users have found output quality degrades at 40–60% of context capacity. Well before any hard limit.


代理的上下文窗口如同 RAM,正在填满

LangChain 对此有一个贴切的类比。

把 LLM 想象成一种新型操作系统。

模型是 CPU——负责思考。

上下文窗口是 RAM——工作内存,当前模型能看到并推理的所有内容都驻留在此。

就像电脑在 RAM 满时会变慢一样,当上下文窗口变得拥挤时,代理的推理能力也会下降。

这被称为上下文腐烂/context rot。

Chroma 进行了一项研究,评估了 18 个前沿模型——GPT-4.1、Claude 4、Gemini 2.5、Qwen3 等。

每个模型的性能都随着输入长度的增加而下降。

这并非在硬性限制处发生,而是在那之前很久就开始了。

一个拥有 200K token 窗口的模型,可能早在 50K token 时就显示出明显的性能退化。

这种下降是连续的,而不是断崖式的。

为什么?Transformer 的工作原理是让每个 token 都关注其他所有 token——这会产生 n 平方的关系。随着上下文增长,模型维持所有这些关系的能力会减弱。

此外还有"迷失在中间"/"Lost in the Middle"的问题。

LLM 表现出 U 型注意力曲线。

→ 上下文开头:记忆良好 → 上下文结尾:记忆良好 → 中间部分:基本被忽略

研究人员测量到,当相关信息从上下文开头移到中间时,准确率会下降 30 多个百分点。

你的原始指令——埋在 50,000 token 的工具输出之下——实际上就消失了。

Claude Code 用户发现,输出质量在上下文容量达到 40–60% 时就会下降,远在达到任何硬性限制之前。

§ 4

What's actually competing for space in your agent's context

7 categories. All fighting for the same finite window.

  1. System Prompt

The agent's identity. Behavioral rules. Control flow logic. Instructions for different task types. In an agent, this isn't just "be helpful." It can define the entire architecture.

  1. Tool Definitions

Every tool the agent could call needs a schema describing what it does, what parameters it takes, and when to use it.

  1. Tool Call Results

Every tool call adds its output to context. A web page retrieval: 5,000–10,000 tokens. A file read: similar. These accumulate fast.

  1. Retrieved Knowledge (RAG)

Documents pulled from vector databases, search results, API responses — anything retrieved to inform the agent's decisions.

  1. Conversation History

The full transcript of everything that's happened. User messages, agent responses, reasoning, prior decisions. Grows linearly with every turn.

  1. Memory

Short-term memory from the current session. Long-term memory from previous sessions — user preferences, prior outcomes, learned patterns.

  1. Agent State

Current plan, todo list, progress markers, scratchpad notes. The meta-information tracking where the agent is in a multi-step task.

All 7 competing for the same window.

Context engineering is deciding what wins.


代理上下文中真正争夺空间的内容

7 个类别,都在争夺同一个有限的窗口。

  1. 系统提示/System Prompt 代理的身份、行为规则、控制流逻辑、针对不同任务类型的指令。在代理中,这不仅限于"乐于助人",它可能定义整个架构。

  2. 工具定义/Tool Definitions 代理可能调用的每个工具都需要一个模式,描述它的功能、接受什么参数、以及何时使用它。

  3. 工具调用结果/Tool Call Results 每次工具调用都会将其输出添加到上下文中。检索一个网页:5,000–10,000 token。读取一个文件:类似。这些内容累积得很快。

  4. 检索到的知识/Retrieved Knowledge (RAG) 从向量数据库取出的文档、搜索结果、API 响应——任何用于告知代理决策的检索结果。

  5. 对话历史/Conversation History 已经发生的所有事情的完整记录:用户消息、代理响应、推理过程、先前决策。每轮交互线性增长。

  6. 记忆/Memory 当前会话的短期记忆。来自之前会话的长期记忆——用户偏好、先前结果、学习到的模式。

  7. 代理状态/Agent State 当前计划、待办事项列表、进度标记、草稿笔记。跟踪代理在多步任务中进展的元信息。

所有 7 类内容都在争夺同一个窗口。

上下文工程就是决定哪些内容胜出。

§ 5

The 4 Core Strategies

LangChain published the framework that organizes every context engineering technique into 4 buckets.

Every technique you will ever learn fits into one of these.

Write. Select. Compress. Isolate.


四大核心策略

LangChain 发布了一个框架,将所有的上下文工程技术归纳为四个类别。

你将会学到的每个技术都归属于其中之一:

写入/Write、选择/Select、压缩/Compress、隔离/Isolate。

§ 6

Strategy 1 — Write (Agents forget. Give them a way to remember.)

When an agent's context fills up and gets compacted, it loses information.

If the agent didn't write anything down before that happened — that information is gone forever.

Write means giving the agent ways to persist information outside the context window.

Three forms:

Scratchpads

Give the agent a tool that lets it take notes during a task. Intermediate findings. Decisions made. Information it knows it will need later.

Anthropic built a "think" tool — a dedicated space for Claude to work through problems.

On the tau-bench benchmark, this improved performance by up to 54% on certain tasks.

Rules Files

Persistent procedural memory.

If you've used Claude Code, you've seen CLAUDE.md.

Instructions loaded at the start of every session — project architecture, conventions, how to run tests, what to be careful about.

The agent reads it every time it starts.

It never forgets the fundamentals.

Memory Extraction

The agent saves facts, user preferences, and learned patterns so it can retrieve them across sessions.

Lives outside the context window entirely.

Information the agent needs tomorrow is there waiting when tomorrow comes.


策略一——写入(代理会忘记,给它们一种记住的方法)

当代理的上下文填满并被压缩时,它会丢失信息。

如果代理在那之前没有写下来任何东西——那些信息就永远消失了。

写入意味着赋予代理在上下文窗口之外持久化信息的能力。

三种形式:

草稿本/Scratchpads

给代理一个工具,让它能在任务期间记笔记:中间发现、做出的决定、它知道以后会需要的信息。

Anthropic 构建了一个"思考"/"think"工具——一个供 Claude 处理问题的专用空间。

在 tau-bench 基准测试中,这使某些任务的性能提升了高达 54%。

规则文件/Rules Files

持久的过程性记忆。

如果你用过 Claude Code,你见过 CLAUDE.md

每次会话开始时加载的指令——项目架构、约定、如何运行测试、需要注意什么。

代理每次启动都会读取它。

它永远不会忘记基本原则。

记忆提取/Memory Extraction

代理保存事实、用户偏好和学习到的模式,以便在跨会话时检索它们。

完全存在于上下文窗口之外。

代理明天需要的信息,当明天到来时就在那里等着。

§ 7

Strategy 2 — Select (Don't give the agent everything. Give it what it needs right now.)

An agent with 40 tools, a large knowledge base, and several sessions of history cannot load all of that at once.

Something has to decide what's relevant for this step.

Traditional RAG: the system decides.

User asks → retrieve documents → stuff into prompt → done.

Static. One-shot. The model has no say.

Agentic RAG: the agent decides. It searches for what it needs, refines queries, picks tools, determines when it has enough information.

Retrieval as an iterative process, not a one-shot pipeline.

This matters because what's relevant changes at every step — and only the agent knows what it needs next.

The tool selection problem is the one that trips people up most.

If your agent has 40+ tools, that's potentially 10,000 tokens of tool definitions sitting in context before any work begins.

The fix: RAG over tool descriptions.

Instead of dumping all tool definitions in every call, use semantic search to surface only the tools relevant to the current step.

A paper called RAG-MCP tested this.

Tool-selection accuracy: 14% → 43% (3x improvement). Token usage: cut roughly in half.

Anthropic calls it a hybrid strategy: load essential context up front (like CLAUDE.md), let the agent do just-in-time retrieval for everything else.

Front-load the basics. Retrieve the rest on demand.


策略二——选择(不要给代理所有东西,只给当前所需)

一个拥有 40 个工具、大型知识库和多次会话历史的代理,无法一次加载所有这些内容。

必须由某个机制来决定当前步骤什么内容是相关的。

传统 RAG:由系统决定。

用户提问 → 检索文档 → 塞入提示 → 完成。

静态的、一次性的。模型没有说话权。

代理化 RAG:由代理决定。它搜索自己需要的信息,优化查询,挑选工具,判断自己何时获得了足够的信息。

检索是一个迭代过程,而不是一次性的流水线。

这一点很重要,因为相关的内容在每一步都会变化——而且只有代理知道自己下一步需要什么。

工具选择问题是人们最常遇到麻烦的地方。

如果你的代理有 40+ 个工具,那就意味着在工作开始之前,上下文中就可能有 10,000 token 的工具定义。

解决方案:对工具描述进行 RAG。

不要在每次调用中都倾泻所有工具定义,而是使用语义搜索来仅展示与当前步骤相关的工具。

一篇名为 RAG-MCP 的论文测试了这种方法。

工具选择准确率:从 14% 提升到 43%(提升了 3 倍)。token 使用量大约减少了一半。

Anthropic 将其称为混合策略:预先加载基本的上下文(如 CLAUDE.md),让代理对其他一切进行即时检索。

预先加载基础信息,按需检索其余部分。

§ 8

Strategy 3 — Compress (Context accumulates. Keep the meaning, cut the tokens.)

Even with good selection, context accumulates.

Every tool call, retrieved document, and decision stays in the window.

Imagine your agent has made 20 tool calls.

Context: 80,000 tokens of accumulated tool outputs, conversation history, reasoning traces.

Most of that is no longer relevant. The agent already acted on it.

But it's still there, taking space, degrading attention, driving up cost and latency.

You can compress at 3 points.

Before information enters context:

→ Chunk large documents into coherent pieces before retrieval

→ Rerank so only the most useful chunks make it in

→ Summarize tool outputs on the fly before they enter the main context

While the agent is working:

→ Rolling summary of conversation history — continuously updated

→ Popular hybrid: keep last 10 messages verbatim + summarize everything older

→ Hard trimming: remove older messages once context hits a size threshold

→ Claude Code auto-compaction: triggers at 95% capacity, automatically summarizes full trajectory

After the agent has acted on something:

→ Tool result clearing: once a tool result was used 15 steps ago, drop it

→ Replace with a one-line summary or remove entirely

→ The agent does not need the full text of a web page it fetched 20 steps ago

The goal: reduce token count. Preserve what actually matters.


策略三——压缩(上下文会累积。保留含义,削减 token。)

即使有良好的选择策略,上下文还是会累积。

每次工具调用、每个检索到的文档和每个决策都会留在窗口里。

想象一下你的代理已经进行了 20 次工具调用。

上下文:80,000 token 的累积工具输出、对话历史、推理痕迹。

其中大部分内容已经不再相关。代理已经据此采取了行动。

但它们仍然在那里,占用空间,使注意力下降,推高成本和延迟。

你可以在三个时点进行压缩。

在信息进入上下文之前:

→ 将大型文档分割成连贯的块后再检索 → 重新排序,确保只有最有用的块进入上下文 → 在工具输出进入主上下文之前,动态地对它们进行摘要

在代理工作期间:

→ 对话历史的滚动摘要——持续更新 → 流行的混合方案:保留最后 10 条消息的原文 + 对更早的消息进行摘要 → 硬性修剪:一旦上下文达到某个大小阈值,移除更早的消息 → Claude Code 的自动压缩:在上下文容量达到 95% 时触发,自动对整个轨迹进行摘要

在代理已对某些事物采取行动之后:

→ 清除工具结果:一旦某个工具结果在 15 步之前被使用过,就丢弃它 → 用一行摘要替代,或完全移除 → 代理不需要它在 20 步之前获取的网页全文

目标:减少 token 数量。保留真正重要的内容。

§ 9

Strategy 4 — Isolate (The most powerful strategy. Enables multi-agent systems.)

Here is the deeper problem with long agent runs.

It's not just space. It's contamination.

The detailed file searches from the research phase are still sitting in context when the agent moves to writing code.

That old research context is now noise. It's distracting the model during a phase where it needs to focus on clean implementation.

Isolation means giving different parts of the work their own separate context windows.

Sub-agents

A parent agent delegates a focused subtask — "search the codebase for all authentication-related files" — to a sub-agent.

The sub-agent works in its own clean context window.

When it reports back, it returns only a condensed summary.

All the messy search operations stay isolated in the sub-agent's context and never pollute the parent.

State schema isolation (LangGraph's approach)

Design the agent's state so different fields store different types of context.

The LLM only sees the fields relevant to the current step.

Tool results sit in a "backstage" field — invisible to the model until explicitly surfaced.

Fine-grained control over what the agent sees at each step without spinning up separate sub-agents.

Isolation is what makes complex multi-step workflows actually reliable.

Different jobs. Different context windows. No contamination.


策略四——隔离(最强大的策略,支撑多代理系统)

长时间代理运行还有一个更深层的问题。

不仅仅是空间问题,更是污染问题。

当代理转向编写代码时,研究阶段的详细文件搜索仍然留在上下文中。

那些旧的研究上下文现在变成了噪音,在需要专注于干净实现的阶段分散了模型的注意力。

隔离意味着将工作的不同部分分配给它们自己独立的上下文窗口。

子代理/Sub-agents

父代理将一个聚焦的子任务委派给子代理,例如"在代码库中搜索所有与认证相关的文件"。

子代理在自己的干净上下文窗口中工作。

当它报告结果时,只返回一个精炼的摘要。

所有混乱的搜索操作都隔离在子代理的上下文中,永远不会污染父代理。

状态模式隔离/State schema isolation(LangGraph 的方法)

设计代理的状态,使不同的字段存储不同类型的上下文。

LLM 只看到与当前步骤相关的字段。

工具结果存放在"后台"/"backstage"字段中——对模型不可见,直到被明确地展示出来。

这实现了对代理每一步所看到内容的细粒度控制,而无需启动单独的子代理。

隔离是让复杂的多步工作流真正可靠的关键。

不同的工作,不同的上下文窗口,没有污染。

§ 10

4 Ways Agents Fail (Name the failure. Fix it.)

Drew Breunig identified four distinct failure modes as agent context grows.

Every broken agent you've ever seen falls into one of these.

Failure 1: Context Poisoning

A hallucination or error enters context.

The agent references it again and again in subsequent steps.

Bad data from step 5 compounds into every step after.

Fix: Validate tool outputs before they enter context. After recovering from an error, compress the failed-attempt history. Don't leave 10 steps of dead-end debugging visible when only the resolution matters.

━━━

Failure 2: Context Distraction

Context gets so long the model starts over-relying on recent history.

Instead of synthesizing a novel plan, it just rehashes what it recently did.

It stops thinking. It starts repeating.

Fix: Summarize and prune aggressively. Even when you have a large context window available. Big window does not mean fill it.

━━━

Failure 3: Context Confusion

Superfluous content leads the model to low-quality decisions.

Classic example: a model failing on a benchmark when given 46 tools — even though the context was well within limits — but working fine with only 19 tools.

The tools were not too many for the context to hold.

They were too many for the model to reason about clearly.

Fix: Dynamic tool management. Use RAG-MCP to surface only the tools relevant to the current step. Keep the tool set matched to the current phase.

━━━

Failure 4: Context Clash

New information contradicts something already in context.

The system prompt says one thing. A retrieved document says something different.

The agent cannot reconcile the contradiction. Produces inconsistent behavior.

Fix: Establish a clear authority ordering. System prompt > retrieved facts > conversation history. Validate new information against existing context before injecting. Use XML tags and clear headers so the model knows which source to trust.


代理失败的四种方式(识别失败,修复它)

随着代理上下文增长,Drew Breunig 识别出四种不同的失败模式。

你见过的每个出问题的代理都可以归入其中之一。

失败模式 1:上下文中毒/Context Poisoning

幻觉或错误进入上下文。

代理在后续步骤中一次又一次地引用它。

第 5 步的错误数据会叠加到之后的每一步。

修复:在工具输出进入上下文之前进行验证。从错误中恢复后,压缩失败尝试的历史记录。当只有解决方案重要时,不要留下 10 步的死胡同调试记录可见。

━━━

失败模式 2:上下文干扰/Context Distraction

上下文变得非常长,模型开始过度依赖最近的历史。

它不再综合生成新计划,而是直接重复它最近做过的事情。

它停止了思考,开始重复。

修复:积极地进行摘要和修剪。即使你有很大的上下文窗口可用,也不要填满它。

━━━

失败模式 3:上下文迷惑/Context Confusion

冗余内容导致模型做出低质量决策。

经典例子:一个模型在给定 46 个工具时在基准测试中失败——尽管上下文远远在限制之内——但只给 19 个工具时却工作正常。

不是上下文装不下那么多工具,

而是模型无法清晰地对它们进行推理。

修复:动态工具管理。使用 RAG-MCP 仅展现与当前步骤相关的工具。保持工具集与当前阶段相匹配。

━━━

失败模式 4:上下文冲突/Context Clash

新信息与上下文中已有的信息相矛盾。

系统提示说一件事,而检索到的文档说了另一件事。

代理无法调和这个矛盾,导致行为不一致。

修复:建立清晰的权威排序。系统提示 > 检索到的事实 > 对话历史。在注入新信息之前,根据现有上下文进行验证。使用 XML 标签和明确的标题,以便模型知道应该信任哪个来源。

§ 11

How to Write System Prompts for Agents (Not chatbots. Agents.)

A chatbot system prompt sets a tone.

"You are a helpful assistant. Be concise and friendly."

An agent system prompt defines architecture.

It specifies control flow — how to approach task types, which tools to use when, what to do on errors, what guardrails to follow.

It is closer to writing a job description for an autonomous employee than a personality prompt.

Anthropic calls it writing at the "right altitude."

Too prescriptive:"If the user mentions billing AND mentions a refund AND the amount is over $100, call tool X." Fragile. Breaks on every edge case you didn't anticipate.

Too vague:"Be helpful and use the appropriate tools." Gives the agent nothing. It can't make good autonomous decisions without concrete signals.

The sweet spot:Specific enough to guide autonomous behavior. Flexible enough for the model to apply judgment in novel situations. Strong heuristics. Not rigid rules.

Practical tips:

→ Organize with XML tags or markdown headers — Background, Instructions, Tool Guidance

→ Start minimal and iterate on failures — don't try to anticipate every edge case up front

→ Minimal doesn't mean short — a complex agent system prompt can be thousands of tokens and that's fine as long as every token earns its place

→ Use few-shot examples — show the agent what good behavior looks like rather than trying to describe every rule in words


如何为代理编写系统提示(不是聊天机器人,是代理)

聊天机器人的系统提示设定语气。

"你是一个有用的助手。请简洁友好。"

代理的系统提示定义架构。

它指定控制流——如何接近任务类型、何时使用哪些工具、出错时该做什么、要遵循哪些护栏。

这更像是为一个自主工作的员工编写职位描述,而不是一个人格提示。

Anthropic 称之为在"正确的高度"写作。

过于刻板:"如果用户提到了账单 AND 提到了退款 AND 金额超过 100 美元,就调用工具 X。"脆弱。在你没预料到的每个边缘情况都会出问题。

过于模糊:"提供帮助并使用适当的工具。"没有给代理任何指导。没有具体的信号,它无法做出好的自主决策。

理想点:足够具体以指导自主行为,足够灵活以让模型在新情况下运用判断力。强大的启发式规则,而非僵化的规则。

实用技巧:

→ 使用 XML 标签或 markdown 标题来组织——背景/Background、指令/Instructions、工具指导/Tool Guidance → 从最小化开始,在失败中迭代——不要试图一开始就预见到所有边缘情况 → 最小化并不意味着简短——一个复杂的代理系统提示可能有几千 token,但只要每个 token 都有其价值就没问题 → 使用少样本示例——向代理展示好的行为是什么样的,而不是试图用文字描述每一条规则

§ 12

The KV-Cache: The $$$ reason to care about context order

Most agent builders don't know this exists.

When you send tokens to an LLM, the model computes key-value representations for each token.

Computationally expensive.

So inference providers cache these representations.

If the beginning of your context — the prefix — stays the same between API calls, the provider reuses the cached computation and only processes the new tokens at the end.

Fast. Cheap.

But if you rearrange or change the early part of your context between calls — you invalidate the cache. The provider recomputes everything from scratch.

The cost difference on Claude Sonnet:

→ Cached input tokens: $0.30 per million

→ Uncached input tokens: $3.00 per million

10x difference.

For an agent making 30–40 API calls per task, this adds up fast.

Practical rules for KV-cache efficiency:

→ Stable content goes at the TOP of context — system prompt, tool definitions, anything that doesn't change between turns

→ Dynamic content goes at the BOTTOM — conversation history, current step, agent state → Don't dynamically add and remove tools mid-conversation — it invalidates the cache

→ Use tool masking instead of tool removal — keep all tool definitions stable in the prefix (cached), just mark irrelevant ones as unavailable for the current phase


KV 缓存:关注上下文顺序的经济原因

大多数代理构建者不知道这个存在。

当你向 LLM 发送 token 时,模型会为每个 token 计算键值表述。

这在计算上非常昂贵。

因此,推理提供商会缓存这些表述。

如果你的上下文开头——即前缀——在 API 调用之间保持不变,提供商就会复用缓存的计算结果,只处理末尾的新 token。

快速、便宜。

但如果你在调用之间重新排列或更改上下文的前面部分,就会使缓存失效。提供商必须从头开始重新计算所有内容。

Claude Sonnet 上的成本差异:

→ 缓存的输入 token:每百万个 $0.30 → 未缓存的输入 token:每百万个 $3.00

10 倍的差异。

对于一个每项任务进行 30–40 次 API 调用的代理来说,成本会迅速累积。

KV 缓存效率的实用规则:

→ 稳定的内容放在上下文顶部——系统提示、工具定义、任何在轮次之间不变的内容 → 动态内容放在底部——对话历史、当前步骤、代理状态 → 不要在对话中间动态添加和移除工具——这会使缓存失效 → 使用工具掩蔽/tool masking 而非移除工具——将所有工具定义稳定地放在前缀中(已缓存),只需将不相关的标记为当前阶段不可用

§ 13

The Workflow That Ships 35,000 Lines of Code in 7 Hours

Dex Horthy (CEO of HumanLayer) presented this at the AI Engineer Code Summit.

His team reportedly used it to ship ~35,000 lines of code to a large Rust codebase in one 7-hour session.

The method: Frequent Intentional Compaction.

Structure agent work into phases. Each phase produces a compacted artifact. Each new phase starts with a fresh context window containing only that artifact.

Stay deliberately below 40–60% of the context window at all times.

Phase 1 — Research

Sub-agents explore the codebase. Read files. Trace data flows. Map architecture.

All the messy grep results and file contents stay in sub-agent contexts. Never touch the parent. (Isolate)

Output: a compact research.md — file paths, function signatures, patterns, gotchas. (Write)

Context reset: raw research used 60–80% of the window. The research artifact compresses it down to 15–20%. (Compress)

Phase 2 — Planning

New context window. Contains only: research document + problem definition.

Agent produces a detailed implementation plan.

This is the most important human review checkpoint.

Catch logic errors here where fixing them is easy and free. Later it costs hours.

Phase 3 — Implementation

Another fresh context window. Contains only: the plan.

Agent follows it step by step.

For complex tasks: a progress.md tracks what's been completed and what remains. (Write)

The result: a clean, focused agent at every phase. No contamination. No context rot. No "sloppy step 20."


7 小时交付 35,000 行代码的工作流

Dex Horthy(HumanLayer 的 CEO)在 AI Engineer Code Summit 上介绍了这个方法。

据报道,他的团队使用这个方法,在一次 7 小时的会话中向一个大型 Rust 代码库交付了约 35,000 行代码。

方法是:频繁的有意压缩/Frequent Intentional Compaction。

将代理的工作组织成多个阶段。每个阶段产生一个压缩过的产物。每个新阶段从一个只包含该产物的全新上下文窗口开始。

始终有意识地将上下文窗口保持在 40–60% 以下。

阶段 1——研究/Research

子代理探索代码库、读取文件、追踪数据流、绘制架构图。

所有混乱的 grep 结果和文件内容都留在子代理的上下文中,永远不触及父代理。(隔离/Isolate)

输出:一个紧凑的 research.md 文件——包含文件路径、函数签名、模式、陷阱。(写入/Write)

上下文重置:原始研究占用了 60–80% 的窗口。研究产物将其压缩到 15–20%。(压缩/Compress)

阶段 2——规划/Planning

新的上下文窗口。只包含:研究文档 + 问题定义。

代理生成一个详细的实现计划。

这是最重要的人工检查点。

在这里发现逻辑错误,修复起来容易且成本为零。晚一点修复将耗费数小时。

阶段 3——实现/Implementation

又一个全新的上下文窗口。只包含:计划。

代理一步步地遵循计划执行。

对于复杂任务:一个 progress.md 追踪已完成和剩余的工作。(写入/Write)

结果:在每个阶段都有一个干净、专注的代理。没有污染,没有上下文腐烂,没有"第 20 步的马虎"。

§ 14

How the best platforms handle this differently

Claude Code

Hybrid retrieval. CLAUDE.md loads up front. Tools like glob and grep handle just-in-time codebase navigation.

Auto-compaction at 95% — preserves architectural decisions and the 5 most recently accessed files.

Can spawn sub-agents for complex subtasks, each with their own clean context.

Philosophy: "do the simplest thing that works." Let the model be smart about what it needs and give it tools to go find it.

Manus

KV-cache-aware context ordering: stable prefix, dynamic suffix. Tool masking not removal.

Observation compression pipeline — every tool output is processed before entering the agent's context.

Persistent todo list for state tracking.

File system as overflow memory for evicted context.

Built for scale. Serving hundreds of thousands of users where efficiency is a cost-of-business problem.

ChatGPT Agent

Visual-first approach. The agent interacts with a GUI browser.

Screenshots added to context as visual snapshots. Model reasons over what it sees.

Visual tokens are expensive so the agent is selective about screenshot count.

Uses RL to learn optimal tool-use strategies across thousands of virtual machines rather than explicitly programming them.

Google ADK

The most principled architectural approach.

Three design principles:

  1. Separate storage from presentation — durable state is not the same as what appears in each API call
  2. Explicit transformations — named, ordered processors that transform context in testable, composable steps
  3. Scope context by default — every model call sees only the minimum required information Engineering discipline over prompt crafting.

顶尖平台处理方式的不同

Claude Code

混合检索。CLAUDE.md 预先加载。globgrep 等工具处理即时代码库导航。

95% 容量时自动压缩——保留架构决策和最近访问的 5 个文件。

可为复杂的子任务生成子代理,每个子代理都有自己干净的上下文。

理念:"做最简单有效的事"。让模型自己判断需要什么,并给它去寻找的工具。

Manus

KV 缓存感知的上下文排序:稳定的前缀,动态的后缀。使用工具掩蔽而非移除。

观察压缩管线——每个工具输出在进入代理上下文之前都会被处理。

用于状态跟踪的持久化待办事项列表。

使用文件系统作为被逐出上下文的溢出内存。

为规模化而构建。服务于数十万用户,效率直接关乎业务成本。

ChatGPT Agent

视觉优先的方法。代理通过与图形界面浏览器交互。

截图作为视觉快照被添加上下文。模型根据所见内容进行推理。

视觉 token 很昂贵,因此代理对截图数量有选择。

使用强化学习在数千台虚拟机上学习最优的工具使用策略,而不是显式地编程它们。

Google ADK

最注重原则的架构方法。

三个设计原则:

  1. 将存储与呈现分离——持久状态不等于每次 API 调用中显示的内容
  2. 显式转换——有名称、有序的处理器,以可测试、可组合的步骤转换上下文
  3. 默认限定上下文范围——每次模型调用只看到最少量的必要信息

工程纪律优先于提示工程技巧。

§ 15

The universal agent turn pipeline

Every serious platform converges on the same 5-step loop per agent turn:

→ Collect — user input, conversation history, tool results, retrieved docs, agent state

→ Select — what's relevant for this step within the remaining token budget

→ Compress — summarize, truncate, or restructure to fit the context → Arrange — stable content first (cache), dynamic content last

→ Assemble + call — final context → API call → get output → loop

This is the loop running inside every production agent you've ever used.

Understanding it is what separates builders who ship reliable agents from builders who wonder why their agent goes sloppy at step 15.


通用的代理单步处理流水线

每个严肃的平台在代理的每一次单步处理中都遵循相同的 5 步循环:

→ 收集/Collect——用户输入、对话历史、工具结果、检索到的文档、代理状态 → 选择/Select——在剩余的 token 预算内,确定哪些内容与当前步骤相关 → 压缩/Compress——摘要、截断或重构以适应上下文 → 排列/Arrange——稳定内容优先(缓存),动态内容置后 → 组装与调用/Assemble + call——最终上下文 → API 调用 → 获取输出 → 循环

这就是你使用过的每个生产级代理内部运行的循环。

理解这一点,将那些能交付可靠代理的构建者,与那些纳闷为什么代理在第 15 步就变得马虎的构建者区分开来。

§ 16

The summary

Context rot is real and starts well before your context limit.

The 4 strategies that fix it:

→ Write — persist info outside context so agents don't forget

→ Select — pull in only what's needed for this step

→ Compress — cut tokens, keep meaning, proactively not reactively

→ Isolate — separate contexts for separate jobs, no contamination

The 4 failure modes to watch for:

→ Poisoning — bad data compounds through every step

→ Distraction — long history makes agents rehash instead of think

→ Confusion — too many tools degrades decision quality

→ Clash — contradictions produce inconsistent behavior

The KV-cache is worth 10x cost savings. Put stable content first.

The best workflow: research → compact → plan → compact → implement. Fresh context at every phase.

Context engineering is not optional for serious agent work.

It is the work.


总结

上下文腐烂是真实存在的,而且远在达到上下文限制之前就开始了。

解决它的 4 个策略:

→ 写入/Write——将信息持久化到上下文之外,这样代理就不会忘记 → 选择/Select——只拉取当前步骤所需的信息 → 压缩/Compress——削减 token,保留含义,主动而非被动地操作 → 隔离/Isolate——为不同的工作分配独立的上下文,没有污染

需要警惕的 4 种失败模式:

→ 中毒/Poisoning——错误数据在每一步被放大 → 干扰/Distraction——过长的历史让代理只是在重复而不是思考 → 迷惑/Confusion——太多工具降低了决策质量 → 冲突/Clash——矛盾导致行为不一致

KV 缓存带来 10 倍的成本节约。把稳定内容放在前面。

最佳工作流:研究 → 压缩 → 规划 → 压缩 → 实现。每个阶段使用全新的上下文。

对于严肃的代理工作来说,上下文工程不是可选项。

它就是工作本身。

Open source ↗