Glean 拾遗
日刊 /2026-06-03 / Agent Harness 解剖:构建生产级 Agent 的 12 个组件

Agent Harness 解剖:构建生产级 Agent 的 12 个组件

原文 x.com 收录 2026-06-03 06:00 阅读 19 min
AI 解读

本文深入剖析了驱动现代 AI Agent 的核心基础设施——Agent Harness(代理框架)。作者综合 Anthropic、OpenAI、LangChain 等一线实践,梳理出生产级 Harness 的 12 个组件:编排循环、工具、记忆、上下文管理、提示构建、输出解析、状态管理、错误处理、护栏、验证循环、子代理编排。文章强调,Harness 才是 Agent 性能的真正瓶颈:LangChain 仅改变 Harness 便使 TerminalBench 排名提升 20+ 位;Claude Code 通过精心设计的记忆分层实现 95% 的上下文缩减。适合正在构建或优化 Agent 系统的工程师阅读,避免重蹈“模型强但系统弱”的覆辙。

原文 19 分钟
原文 x.com ↗
§ 1

You've built a chatbot. Maybe you've wired up a ReAct loop with a few tools. It works for demos. Then you try to build something production-grade, and the wheels come off: the model forgets what it did three steps ago, tool calls fail silently, and context windows fill up with garbage.

The problem isn't your model. It's everything around your model.

LangChain proved this when they changed only the infrastructure wrapping their LLM (same model, same weights) and jumped from outside the top 30 to rank 5 on TerminalBench 2.0. A separate research project hit a 76.4% pass rate by having an LLM optimize the infrastructure itself, surpassing hand-designed systems.

That infrastructure has a name now: the agent harness.

你搭建过聊天机器人,或许还用 ReAct 循环接了几个工具。Demo 跑得通,但一上生产环境就全线崩溃:模型忘了三步前做了什么,工具调用静默失败,上下文窗口塞满垃圾信息。

问题不在模型,而在模型周围的一切。

LangChain 证明了这一点——他们只改了封装 LLM 的底层基础设施(模型和权重都没变),在 TerminalBench 2.0 上的排名就从 30 名开外跃升至第 5。另一项研究发现,让 LLM 自行优化基础设施,能达到 76.4% 的通过率,超过手工设计的方案。

这套基础设施如今有了正式名称:agent harness / 智能体框架。

§ 2

The term was formalized in early 2026, but the concept existed long before. The harness is the complete software infrastructure wrapping an LLM: orchestration loop, tools, memory, context management, state persistence, error handling, and guardrails. Anthropic's Claude Code documentation puts it simply: the SDK is "the agent harness that powers Claude Code." OpenAI's Codex team uses the same framing, explicitly equating the terms "agent" and "harness" to refer to the non-model infrastructure that makes the LLM useful.

I really liked the canonical formula, from LangChain's Vivek Trivedy: "If you're not the model, you're the harness."

Here's the distinction that trips people up. The "agent" is the emergent behavior: the goal-directed, tool-using, self-correcting entity the user interacts with. The harness is the machinery producing that behavior. When someone says "I built an agent," they mean they built a harness and pointed it at a model.

Beren Millidge made this analogy precise in his 2023 essay "Scaffolded LLMs as Natural Language Computers." A raw LLM is a CPU with no RAM, no disk, and no I/O. The context window serves as RAM (fast but limited). External databases function as disk storage (large but slow). Tool integrations act as device drivers. The harness is the operating system. As Millidge wrote: "We have reinvented the Von Neumann architecture" because it's a natural abstraction for any computing system.

这个术语在 2026 年初正式定型,但其概念早已存在。Harness / 框架是封装 LLM 的完整基础设施:编排循环、工具、记忆、上下文管理、状态持久化、错误处理和护栏。Anthropic 的 Claude Code 文档说得很直白:SDK 就是“驱动 Claude Code 的 agent harness”。OpenAI 的 Codex 团队也采用同样的表述,明确将“agent”和“harness”等同起来,指代那些让 LLM 变得有用的非模型基础设施。

我很喜欢 LangChain 的 Vivek Trivedy 给出的经典定义:“如果它不是模型,那它就是 harness。”

这里有一个容易混淆的区别。“Agent”是涌现出来的行为:用户与之交互的那个有目标、会用工具、能自我纠正的实体。而 harness 是产生这种行为背后的机制。当有人说“我构建了一个 agent”,其实是指他构建了一个 harness,然后指向了一个模型。

Beren Millidge 在 2023 年的文章“Scaffolded LLMs as Natural Language Computers”中精确地类比了这个概念。一个原始的 LLM 就像一块没有 RAM、没有磁盘、没有 I/O 的 CPU。上下文窗口相当于 RAM(速度快但容量小),外部数据库是磁盘存储(容量大但速度慢),工具集成则是设备驱动程序。Harness 就是操作系统。正如 Millidge 所说:“我们重新发明了冯·诺依曼架构”——因为对任何计算系统来说,这都是一个自然的抽象。

§ 3

Three concentric levels of engineering surround the model:

  • Prompt engineering crafts the instructions the model receives.
  • Context engineering manages what the model sees and when.
  • Harness engineering encompasses both, plus the entire application infrastructure: tool orchestration, state persistence, error recovery, verification loops, safety enforcement, and lifecycle management.

The harness is not a wrapper around a prompt. It is the complete system that makes autonomous agent behavior possible.

模型周围环绕着三个同心圆式的工程层次:

  • Prompt 工程:设计模型接收的指令。
  • Context 工程:管理模型在何时看到什么信息。
  • Harness 工程:涵盖以上两者,外加整个应用基础设施:工具编排、状态持久化、错误恢复、验证循环、安全执行和生命周期管理。

Harness 不是 prompt 的包装壳。它是让自主 Agent 行为成为可能的完整系统。

§ 4

Synthesizing across Anthropic, OpenAI, LangChain, and the broader practitioner community, a production agent harness has twelve distinct components. Let's walk through each one.

综合 Anthropic、OpenAI、LangChain 以及更广泛实践社区的经验,一个生产级 agent harness 包含 12 个不同的组件。我们逐一来看。

§ 5

This is the heartbeat. It implements the Thought-Action-Observation (TAO) cycle, also called the ReAct loop. The loop runs: assemble prompt, call LLM, parse output, execute any tool calls, feed results back, repeat until done.

Mechanically, it's often just a while loop. The complexity lives in everything the loop manages, not the loop itself. Anthropic describes their runtime as a "dumb loop" where all intelligence lives in the model. The harness just manages turns.

这是整个系统的心脏。它实现了 Thought-Action-Observation(TAO)循环,也就是 ReAct 循环。循环流程如下:组装 prompt → 调用 LLM → 解析输出 → 执行工具调用 → 将结果反馈 → 重复直至完成。

从机制上说,它通常只是一个 while 循环。复杂性存在于循环所管理的一切之中,而非循环本身。Anthropic 将他们的运行时描述为一个“笨循环”,所有智能都存在于模型中。Harness 只负责管理轮次。

§ 6

Tools are the agent's hands. They're defined as schemas (name, description, parameter types) injected into the LLM's context so the model knows what's available. The tool layer handles registration, schema validation, argument extraction, sandboxed execution, result capture, and formatting results back into LLM-readable observations.

Claude Code provides tools across six categories: file operations, search, execution, web access, code intelligence, and subagent spawning. OpenAI's Agents SDK supports function tools (via @function_tool), hosted tools (WebSearch, CodeInterpreter, FileSearch), and MCP server tools.

工具是 agent 的双手。它们以 schema(名称、描述、参数类型)的形式注入 LLM 的上下文,让模型知道有哪些工具可用。工具层负责注册、schema 校验、参数提取、沙箱执行、结果捕获,以及将结果格式化回 LLM 可读的观察。

Claude Code 提供六类工具:文件操作、搜索、执行、网络访问、代码智能和子 agent 生成。OpenAI 的 Agents SDK 支持函数工具(通过 @function_tool)、托管工具(WebSearch, CodeInterpreter, FileSearch)和 MCP 服务器工具。

§ 7

Memory operates at multiple timescales. Short-term memory is conversation history within a single session. Long-term memory persists across sessions: Anthropic uses CLAUDE.md project files and auto-generated MEMORY.md files; LangGraph uses namespace-organized JSON Stores; OpenAI supports Sessions backed by SQLite or Redis.

Claude Code implements a three-tier hierarchy: a lightweight index (~150 characters per entry, always loaded), detailed topic files pulled in on demand, and raw transcripts accessed via search only. A critical design principle: the agent treats its own memory as a "hint" and verifies against actual state before acting.

记忆在多个时间尺度上运作。短期记忆是单次会话内的对话历史。长期记忆跨会话持久化:Anthropic 使用 CLAUDE.md 项目文件和自动生成的 MEMORY.md 文件;LangGraph 使用按命名空间组织的 JSON 存储;OpenAI 支持基于 SQLite 或 Redis 的会话。

Claude Code 采用三层层级:轻量级索引(每条约150字符,始终加载)、按需拉取的详细主题文件,以及仅通过搜索访问的原始转录。一个关键设计原则是:agent 将自己的记忆视为“提示”,并在执行操作前与实际状态进行验证。

§ 8

This is where many agents fail silently. The core problem is context rot: model performance degrades 30%+ when key content falls in mid-window positions (Chroma research, corroborated by Stanford's "Lost in the Middle" finding). Even million-token windows suffer from instruction-following degradation as context grows.

Production strategies include:

  • Compaction: summarizing conversation history when approaching limits (Claude Code preserves architectural decisions and unresolved bugs while discarding redundant tool outputs)
  • Observation masking: JetBrains' Junie hides old tool outputs while keeping tool calls visible
  • Just-in-time retrieval: maintaining lightweight identifiers and loading data dynamically (Claude Code uses grep, glob, head, tail rather than loading full files)
  • Sub-agent delegation: each subagent explores extensively but returns only 1,000 to 2,000 token condensed summaries

Anthropic's context engineering guide states the goal: find the smallest possible set of high-signal tokens that maximize likelihood of the desired outcome.

这是许多 agent 静默失效的根源。核心问题是上下文腐烂 / context rot:当关键内容位于窗口中间位置时,模型性能会下降 30% 以上(Chroma 的研究证实,斯坦福的“Lost in the Middle”发现也支持这一结论)。即便是百万 token 的上下文窗口,随着内容增多,指令遵循能力也会下降。

生产环境中的应对策略包括:

  • 压缩 / compaction:在接近限制时对对话历史进行摘要(Claude Code 保留架构决策和未解决的 bug,丢弃冗余的工具输出)
  • 观察掩码 / observation masking:JetBrains 的 Junie 隐藏旧的工具输出,但保留工具调用可见
  • 即时检索 / just-in-time retrieval:维护轻量级标识符,动态加载数据(Claude Code 使用 grep、glob、head、tail,而不是加载完整文件)
  • 子 agent 委派:每个子 agent 进行广泛探索,但仅返回 1000-2000 token 的精简摘要

Anthropic 的上下文工程指南指出了目标:找到最小的高信号 token 集合,以最大化期望结果的可能性。

§ 9

This assembles what the model actually sees at each step. It's hierarchical: system prompt, tool definitions, memory files, conversation history, and the current user message.

OpenAI's Codex uses a strict priority stack: server-controlled system message (highest priority), tool definitions, developer instructions, user instructions (cascading AGENTS.md files, 32 KiB limit), then conversation history.

这一步组装模型在每个步骤实际看到的内容。它是分层的:系统 prompt、工具定义、记忆文件、对话历史和当前用户消息。

OpenAI 的 Codex 使用严格的优先级堆栈:服务端控制的系统消息(最高优先级)、工具定义、开发者指令、用户指令(级联的 AGENTS.md 文件,32 KiB 限制),然后是对话历史。

§ 10

Modern harnesses rely on native tool calling, where the model returns structured tool_calls objects rather than free-text that must be parsed. The harness checks: are there tool calls? Execute them and loop. No tool calls? That's the final answer.

For structured outputs, both OpenAI and LangChain support schema-constrained responses via Pydantic models. Legacy approaches like RetryWithErrorOutputParser (which feeds the original prompt, the failed completion, and the parsing error back to the model) remain available for edge cases.

现代 harness 依赖原生 tool calling / 工具调用——模型返回结构化的 tool_calls 对象,而不是需要解析的自由文本。Harness 检查:有工具调用吗?执行它们然后循环。没有工具调用?那就是最终答案。

对于结构化输出,OpenAI 和 LangChain 都支持通过 Pydantic 模型进行 schema 约束的响应。诸如 RetryWithErrorOutputParser 之类的传统方法(将原始 prompt、失败的补全和解析错误反馈给模型)在边缘场景中仍可使用。

§ 11

LangGraph models state as typed dictionaries flowing through graph nodes, with reducers merging updates. Checkpointing happens at super-step boundaries, enabling resume after interruptions and time-travel debugging. OpenAI offers four mutually exclusive strategies: application memory, SDK sessions, server-side Conversations API, or lightweight previous_response_id chaining. Claude Code takes a different approach: git commits as checkpoints and progress files as structured scratchpads.

LangGraph 将状态建模为在图节点间流动的类型化字典,通过 reducer 合并更新。检查点在 super-step 边界处执行,支持中断后恢复和时间旅行调试。OpenAI 提供四种互斥的策略:应用内存、SDK 会话、服务端 Conversations API,或轻量级的 previous_response_id 链。Claude Code 采用不同的方式:以 git 提交作为检查点,以进度文件作为结构化暂存区。

§ 12

Here's why this matters: a 10-step process with 99% per-step success still has only ~90.4% end-to-end success. Errors compound fast.

LangGraph distinguishes four error types: transient (retry with backoff), LLM-recoverable (return error as ToolMessage so the model can adjust), user-fixable (interrupt for human input), and unexpected (bubble up for debugging). Anthropic catches failures within tool handlers and returns them as error results to keep the loop running. Stripe's production harness caps retry attempts at two.

这一点至关重要:一个 10 步流程,即使每步成功率为 99%,端到端成功率也仅有约 90.4%。错误会快速累积。

LangGraph 区分四种错误类型:瞬时错误(退避重试)、LLM 可恢复错误(将错误作为 ToolMessage 返回,让模型自行调整)、用户可修复错误(中断等待人工输入)和意外错误(向上抛出用于调试)。Anthropic 在工具处理程序内捕获失败,并将其作为错误结果返回以保持循环运行。Stripe 的生产级 harness 将重试次数限制为两次。

§ 13

OpenAI's SDK implements three levels: input guardrails (run on first agent), output guardrails (run on final output), and tool guardrails (run on every tool invocation). A "tripwire" mechanism halts the agent immediately when triggered.

Anthropic separates permission enforcement from model reasoning architecturally. The model decides what to attempt; the tool system decides what's allowed. Claude Code gates ~40 discrete tool capabilities independently, with three stages: trust establishment at project load, permission check before each tool call, and explicit user confirmation for high-risk operations.

OpenAI 的 SDK 实现了三个级别:输入护栏(第一个 Agent 运行时执行)、输出护栏(最终输出时执行)和工具护栏(每次工具调用时执行)。“绊线 / tripwire”机制可在触发时立即停止 agent。

Anthropic 在架构上将权限执行与模型推理分离。模型决定尝试做什么;工具系统决定什么被允许。Claude Code 独立控制约 40 个离散的工具能力,分三个阶段:项目加载时建立信任、每次工具调用前进行权限检查,以及对高风险操作进行明确的用户确认。

§ 14

This is what separates toy demos from production agents. Anthropic recommends three approaches: rules-based feedback (tests, linters, type checkers), visual feedback (screenshots via Playwright for UI tasks), and LLM-as-judge (a separate subagent evaluates output).

Boris Cherny, creator of Claude Code, noted that giving the model a way to verify its work improves quality by 2 to 3x.

这是区分玩具 demo 和生产级 Agent 的关键。Anthropic 推荐三种方法:基于规则的反馈(测试、linter、类型检查器)、视觉反馈(通过 Playwright 截屏进行 UI 任务)和 LLM-as-judge(用一个独立的子 agent 评估输出)。

Claude Code 的创建者 Boris Cherny 指出,让模型有能力验证自己的工作,可将质量提升 2-3 倍。

§ 15

Claude Code supports three execution models: Fork (byte-identical copy of parent context), Teammate (separate terminal pane with file-based mailbox communication), and Worktree (own git worktree, isolated branch per agent). OpenAI's SDK supports agents-as-tools (specialist handles bounded subtask) and handoffs (specialist takes full control). LangGraph implements subagents as nested state graphs.

Claude Code 支持三种执行模型:Fork(父上下文的字节级精确副本)、Teammate(独立的终端面板,通过基于文件的邮箱通信)和 Worktree(自己的 git worktree,每个 agent 拥有独立分支)。OpenAI 的 SDK 支持 agents-as-tools(专家处理有边界的子任务)和 handoffs(专家完全接管控制)。LangGraph 将子 agent 实现为嵌套的状态图。

§ 16

Now that you know the components, let's trace how they work together in a single cycle.

Step 1 (Prompt Assembly): The harness constructs the full input: system prompt + tool schemas + memory files + conversation history + current user message. Important context is positioned at the beginning and end of the prompt (the "Lost in the Middle" finding).

Step 2 (LLM Inference): The assembled prompt goes to the model API. The model generates output tokens: text, tool call requests, or both.

Step 3 (Output Classification): If the model produced text with no tool calls, the loop ends. If it requested tool calls, proceed to execution. If a handoff was requested, update the current agent and restart.

Step 4 (Tool Execution): For each tool call, the harness validates arguments, checks permissions, executes in a sandboxed environment, and captures results. Read-only operations can run concurrently; mutating operations run serially.

Step 5 (Result Packaging): Tool results are formatted as LLM-readable messages. Errors are caught and returned as error results so the model can self-correct.

Step 6 (Context Update): Results are appended to conversation history. If approaching the context window limit, the harness triggers compaction.

Step 7 (Loop): Return to Step 1. Repeat until termination.

Termination conditions are layered: the model produces a response with no tool calls, maximum turn limit is exceeded, token budget is exhausted, a guardrail tripwire fires, the user interrupts, or a safety refusal is returned. A simple question might take 1 to 2 turns. A complex refactoring task can chain dozens of tool calls across many turns.

For long-running tasks spanning multiple context windows, Anthropic developed a two-phase "Ralph Loop" pattern: an Initializer Agent sets up the environment (init script, progress file, feature list, initial git commit), then a Coding Agent in every subsequent session reads git logs and progress files to orient itself, picks the highest-priority incomplete feature, works on it, commits, and writes summaries. The filesystem provides continuity across context windows.

现在你已经了解了各个组件,让我们追踪它们在单个循环中的协作方式。

第一步(Prompt 组装):Harness 构建完整的输入:系统 prompt + 工具 schema + 记忆文件 + 对话历史 + 当前用户消息。重要的上下文放置在 prompt 的开头和结尾(基于“Lost in the Middle”的发现)。

第二步(LLM 推理):组装好的 prompt 发往模型 API。模型生成输出 token:文本、工具调用请求,或两者兼有。

第三步(输出分类):如果模型只生成文本而没有工具调用,循环结束。如果请求了工具调用,则进入执行阶段。如果请求了移交 / handoff,则更新当前 agent 并重新启动。

第四步(工具执行):对于每个工具调用,harness 验证参数、检查权限、在沙箱环境中执行并捕获结果。只读操作可并行运行;有副作用的操作串行执行。

第五步(结果打包):工具结果被格式化为 LLM 可读的消息。错误被捕获并作为错误结果返回,以便模型自我纠正。

第六步(上下文更新):结果追加到对话历史中。如果接近上下文窗口限制,harness 触发压缩。

第七步(循环):返回第一步,重复直至终止。

终止条件是分层的:模型生成无工具调用的响应、超过最大轮次限制、token 预算耗尽、护栏绊线触发、用户中断或返回安全拒绝。一个简单问题可能只需 1-2 轮。一个复杂的重构任务可能会在多个轮次链式调用数十次工具。

对于跨多个上下文窗口的长任务,Anthropic 开发了一种两阶段的“Ralph Loop”模式:一个初始化 Agent 设置环境(init 脚本、进度文件、功能列表、初始 git 提交),然后每个后续会话中的编码 Agent 读取 git 日志和进度文件来定位自身,选择最高优先级的未完成功能,进行处理、提交并编写摘要。文件系统为跨上下文窗口提供了连续性。

§ 17

Anthropic's Claude Agent SDK exposes the harness through a single query() function that creates the agentic loop and returns an async iterator streaming messages. The runtime is a "dumb loop." All intelligence lives in the model. Claude Code uses a Gather-Act-Verify cycle: gather context (search files, read code), take action (edit files, run commands), verify results (run tests, check output), repeat.

OpenAI's Agents SDK implements the harness through the Runner class with three modes: async, sync, and streamed. The SDK is "code-first": workflow logic is expressed in native Python rather than graph DSLs. The Codex harness extends this with a three-layer architecture: Codex Core (agent code + runtime), App Server (bidirectional JSON-RPC API), and client surfaces (CLI, VS Code, web app). All surfaces share the same harness, which is why "Codex models feel better on Codex surfaces than a generic chat window."

LangGraph models the harness as an explicit state graph. Two nodes (llm_call and tool_node) connected by a conditional edge: if tool calls present, route to tool_node; if absent, route to END. LangGraph evolved from LangChain's AgentExecutor, which was deprecated in v0.2 because it was hard to extend and lacked multi-agent support. LangChain's Deep Agents explicitly use the term "agent harness": built-in tools, planning (write_todos tool), file systems for context management, subagent spawning, and persistent memory.

CrewAI implements a role-based multi-agent architecture: Agent (the harness around the LLM, defined by role, goal, backstory, and tools), Task (the unit of work), and Crew (the collection of agents). CrewAI's Flows layer adds a "deterministic backbone with intelligence where it matters," managing routing and validation while Crews handle autonomous collaboration.

AutoGen (evolving into Microsoft Agent Framework) pioneered conversation-driven orchestration. Its three-layer architecture (Core, AgentChat, Extensions) supports five orchestration patterns: sequential, concurrent (fan-out/fan-in), group chat, handoff, and magentic (a manager agent maintains a dynamic task ledger coordinating specialists).

Anthropic 的 Claude Agent SDK 通过一个 query() 函数暴露 harness,该函数创建 agentic 循环并返回一个流式消息的异步迭代器。运行时是一个“笨循环”,所有智能都在模型中。Claude Code 使用“收集-行动-验证”循环:收集上下文(搜索文件、阅读代码)、采取行动(编辑文件、运行命令)、验证结果(运行测试、检查输出),然后重复。

OpenAI 的 Agents SDK 通过 Runner 类实现 harness,支持三种模式:异步、同步和流式。该 SDK 是“代码优先”的:工作流逻辑用原生 Python 表达,而不是图 DSL。Codex harness 在此基础上扩展为三层架构:Codex Core(Agent 代码 + 运行时)、App Server(双向 JSON-RPC API)和客户端界面(CLI、VS Code、Web 应用)。所有界面共享同一个 harness,这就是为什么“Codex 模型在 Codex 界面上的体验优于通用聊天窗口”。

LangGraph 将 harness 建模为显式的状态图。两个节点(llm_call 和 tool_node)通过一条条件边连接:如果有工具调用,路由到 tool_node;如果没有,路由到 END。LangGraph 从 LangChain 的 AgentExecutor 演化而来,后者在 v0.2 中被弃用,因为它难以扩展且缺乏多 agent 支持。LangChain 的 Deep Agents 明确使用“agent harness”这个术语:内置工具、规划(write_todos 工具)、用于上下文管理的文件系统、子 agent 生成和持久化记忆。

CrewAI 实现了基于角色的多 agent 架构:Agent(围绕 LLM 的 harness,由角色、目标、背景故事和工具定义)、Task(工作单元)和 Crew(agent 集合)。CrewAI 的 Flows 层增加了“确定性骨架,智能注入关键位置”,管理路由和验证,而 Crew 处理自主协作。

AutoGen(正在演变为 Microsoft Agent Framework)开创了对话驱动的编排。其三层架构(Core、AgentChat、Extensions)支持五种编排模式:顺序、并发(扇出/扇入)、群组聊天、移交和 magentic(管理 agent 维护动态任务账本,协调专家)。

§ 18

The scaffolding metaphor isn't decorative. It's precise. Construction scaffolding is temporary infrastructure that enables workers to build a structure they couldn't reach otherwise. It doesn't do the construction. But without it, workers can't reach the upper floors.

The key insight: scaffolding is removed when the building is complete. As models improve, harness complexity should decrease. Manus was rebuilt five times in six months, each rewrite removing complexity. Complex tool definitions became general shell execution. "Management agents" became simple structured handoffs.

This points to the co-evolution principle: models are now post-trained with specific harnesses in the loop. Claude Code's model learned to use the specific harness it was trained with. Changing tool implementations can degrade performance because of this tight coupling.

The "future-proofing test" for harness design: if performance scales up with more powerful models without adding harness complexity, the design is sound.

脚手架这个隐喻并非装饰性的,它很精确。建筑脚手架是临时性的基础设施,让工人能够建造原本无法够到的结构。它不负责实际的建造工作,但没有它,工人就无法到达更高的楼层。

关键洞察在于:建筑完工后,脚手架会被拆除。随着模型改进,harness 的复杂性也应该降低。Manus 在六个月内重写了五次,每次重写都在移除复杂性。复杂的工具定义变成了通用的 shell 执行。“管理 agent”变成了简单的结构化移交。

这指向了协同演化原则:模型现在是在特定 harness 的参与下进行后训练的。Claude Code 的模型学会了使用它被训练时所用的特定 harness。改变工具实现可能会降低性能,正是因为这种紧密耦合。

Harness 设计的“未来可验证性测试”是:如果性能随着更强大模型的出现而提升,而无需增加 harness 的复杂性,那么这个设计就是正确的。

§ 19

Every harness architect faces seven choices:

  1. Single-agent vs. multi-agent. Both Anthropic and OpenAI say: maximize a single agent first. Multi-agent systems add overhead (extra LLM calls for routing, context loss during handoffs). Split only when tool overload exceeds ~10 overlapping tools or clearly separate task domains exist.
  2. ReAct vs. plan-and-execute. ReAct interleaves reasoning and action at every step (flexible but higher per-step cost). Plan-and-execute separates planning from execution. LLMCompiler reports a 3.6x speedup over sequential ReAct.
  3. Context window management strategy. Five production approaches: time-based clearing, conversation summarization, observation masking, structured note-taking, and sub-agent delegation. ACON research showed 26 to 54% token reduction while preserving 95%+ accuracy by prioritizing reasoning traces over raw tool outputs.
  4. Verification loop design. Computational verification (tests, linters) provides deterministic ground truth. Inferential verification (LLM-as-judge) catches semantic issues but adds latency. Martin Fowler's Thoughtworks team frames this as guides (feedforward, steer before action) versus sensors (feedback, observe after action).
  5. Permission and safety architecture. Permissive (fast but risky, auto-approve most actions) versus restrictive (safe but slow, require approval for each action). The choice depends on deployment context.
  6. Tool scoping strategy. More tools often means worse performance. Vercel removed 80% of tools from v0 and got better results. Claude Code achieves 95% context reduction via lazy loading. The principle: expose the minimum tool set needed for the current step.
  7. Harness thickness. How much logic lives in the harness versus the model. Anthropic bets on thin harnesses and model improvement. Graph-based frameworks bet on explicit control. Anthropic regularly deletes planning steps from Claude Code's harness as new model versions internalize that capability.

每个 harness 架构师都面临七个选择:

  1. 单 Agent 与多 Agent。Anthropic 和 OpenAI 都表示:优先最大化单 agent。多 agent 系统会带来额外开销(路由需要额外 LLM 调用,移交时丢失上下文)。只有当工具超过约 10 个重叠工具或存在明显分离的任务域时才进行拆分。
  2. ReAct 与计划-执行。ReAct 在每个步骤交替进行推理和行动(灵活但每步成本更高)。计划-执行将规划与执行分离。LLMCompiler 报告称,相比顺序 ReAct 可提速 3.6 倍。
  3. 上下文窗口管理策略。五种生产方法:基于时间的清除、对话摘要、观察掩码、结构化笔记和子 agent 委派。ACON 研究表明,通过优先保留推理轨迹而非原始工具输出,可减少 26% 到 54% 的 token,同时保持 95%+ 的准确率。
  4. 验证循环设计。计算性验证(测试、linter)提供确定性基准真相。推断性验证(LLM-as-judge)捕捉语义问题但增加延迟。Martin Fowler 的 Thoughtworks 团队将其框架化为导引 / guides(前馈,行动前引导)与传感器 / sensors(反馈,行动后观察)。
  5. 权限与安全架构。宽松型(快速但有风险,自动批准大多数操作)与限制型(安全但缓慢,每项操作都需要批准)。选择取决于部署上下文。
  6. 工具范围策略。更多工具通常意味着更差的性能。Vercel 从 v0 中移除了 80% 的工具,结果反而更好。Claude Code 通过惰性加载实现了 95% 的上下文缩减。原则是:只暴露当前步骤所需的最小工具集。
  7. Harness 厚度 / thickness。有多少逻辑放在 harness 中,多少留给模型。Anthropic 押注薄 harness 和模型改进。基于图的框架押注显式控制。随着新版本模型内化某些能力,Anthropic 会定期从 Claude Code 的 harness 中删除规划步骤。
§ 20

Two products using identical models can have wildly different performance based solely on harness design. The TerminalBench evidence is clear: changing only the harness moved agents by 20+ ranking positions.

The harness is not a solved problem or a commodity layer. It's where the hard engineering lives: managing context as a scarce resource, designing verification loops that catch failures before they compound, building memory systems that provide continuity without hallucination, and making architectural bets about how much scaffolding to build versus how much to leave to the model.

The field is moving toward thinner harnesses as models improve. But the harness itself isn't going away. Even the most capable model needs something to manage its context window, execute its tool calls, persist its state, and verify its work.

The next time your agent fails, don't blame the model. Look at the harness.

两个使用相同模型的产品,仅因 harness 设计的不同,性能可能天差地别。TerminalBench 的证据很清楚:仅改变 harness 就让 agent 的排名变动了 20 多个位次。

Harness 不是一个已经解决的问题,也不是一个可商品化的层。真正的硬核工程就在这里:将上下文作为稀缺资源来管理,设计能在错误累积之前捕获它们的验证循环,构建既能提供连续性又不会产生幻觉的记忆系统,并在“该搭建多少脚手架”和“该留给模型多少”之间做出架构性抉择。

随着模型的改进,整个领域正在向更薄的 harness 演进。但 harness 本身不会消失。即使是最强大的模型,也需要有东西来管理它的上下文窗口、执行它的工具调用、持久化它的状态,并验证它的工作。

下一次你的 agent 失败时,不要责怪模型。去看看 harness。

打开原文 ↗