Continually Improving Our Agent Harness
Cursor shares how it continuously improves its agent harness, covering context window evolution from static to dynamic fetching, a two-layer evaluation system (offline benchmarks and online A/B tests measuring code keep rate and user satisfaction), tool call error classification and repair pipeline (anomaly detection + automated log analysis with Cloud Agents), per-model customization of tool formats and prompts (e.g., patch vs. string replacement), and mid-chat model switching with specialized instructions. The post concludes with a vision of multi-agent architectures where the harness orchestrates specialized sub-agents.
We approach building the Cursor agent harness the way we'd approach any ambitious software product. Much of the work is vision-driven, where we start with an opinion about what the ideal agent experience should look like.
From there, we form hypotheses about how to get closer to that vision, run experiments to test them, and iterate using quantitative and qualitative signals from evals and real usage. That process depends on having the right online and offline instrumentation, so we can tell when a change actually makes the harness better.
When we get early access to new models, all of these approaches converge. We spend weeks customizing our harness to a model's strengths and quirks until the same model inside our specially tuned harness is noticeably faster, smarter, and more efficient.
Occasionally we discover step-change improvements. More often, though, improving the harness is a matter of obsessively stacking small optimizations that together make agents better at building software.
At the heart of interacting with large language models is the context window. When asking the agent to build something, the context window starts with the system prompt and tool descriptions, followed by the current state of the conversation, and finally the user's request.
The way we populate and manage that window has evolved significantly over the history of Cursor.
When we first developed our coding agent in late 2024, models were much worse at choosing their own context, so we invested heavily in context engineering to build guardrails. For example, we surfaced lint and type errors to the agent after every edit, rewrote its file reads when it requested too few lines, and even limited the maximum number of tools it could call in one turn.
We also provided substantial amounts of static context that was always available to the agent at the start of each session. At various points, that included the folder layout of the codebase, code snippets that semantically matched the query, and compressed versions of files that the user manually attached.
Most of that is long gone.
We still include some useful static context (e.g., operating system, git status, current and recently viewed files). But we’ve adapted to increasing model capability by knocking down guardrails and providing more dynamic context, which can be fetched by the agent while it works. In an earlier post, we did a deep dive into some of our techniques behind dynamic context, many of which have since been adopted by other coding agents. Much of our work now focuses on providing more ways for the agent to dynamically pull context and interact with the world.
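As a rough illustration of the ordering described above, here is a minimal sketch of how a context window might be assembled. Every name in it (`Message`, `StaticContext`, `build_context_window`) is hypothetical and chosen for this example; it is not Cursor's actual implementation, and the static fields are simply the examples mentioned in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


@dataclass
class StaticContext:
    # Illustrative fields taken from the examples in the text.
    operating_system: str
    git_status: str
    recent_files: List[str] = field(default_factory=list)

    def render(self) -> str:
        files = ", ".join(self.recent_files) or "(none)"
        return (
            f"OS: {self.operating_system}\n"
            f"Git status: {self.git_status}\n"
            f"Recently viewed files: {files}"
        )


def build_context_window(
    system_prompt: str,
    tool_descriptions: Dict[str, str],
    static_context: StaticContext,
    history: List[Message],
    user_request: str,
) -> List[Message]:
    """Order: system prompt and tools, then conversation state, then the request."""
    tools = "\n".join(f"- {name}: {desc}" for name, desc in tool_descriptions.items())
    head = Message(
        "system",
        f"{system_prompt}\n\nTools:\n{tools}\n\n{static_context.render()}",
    )
    return [head, *history, Message("user", user_request)]


window = build_context_window(
    system_prompt="You are a coding agent.",
    tool_descriptions={"read_file": "Read a file from the workspace."},
    static_context=StaticContext("linux", "clean", ["src/app.py"]),
    history=[Message("user", "Hi"), Message("assistant", "Hello")],
    user_request="Add a health check endpoint.",
)
```

Anything beyond this small static header would be fetched dynamically by the agent through tool calls while it works, which is why the sketch keeps the static portion deliberately small.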




The harness and the model together determine how good the agent is, but "good" is hard to pin down. To locate it, we've built several layers of measurement.
We maintain public benchmarks alongside our own eval suite, CursorBench, which gives us a fast, standardized read on quality and lets us compare across time. But even the best benchmarks only approximate real usage, meaning we’d miss important signals if we relied on them entirely.
So we also run online experiments where we deploy two or more harness variants side by side and A/B test them on real usage. We measure agent quality in these tests through a variety of metrics. Some are straightforward like latency, token efficiency, tool call count, and cache hit rate. Those are directionally useful but still don’t get at fuzzier and more important questions of whether the agent actually did a good job. We measure those in two ways.
The first is the “Keep Rate” of agent-generated code. For a given set of code changes that the agent proposed, we track what fraction of those remain in the user’s codebase after fixed intervals of time. This allows us to understand when users have to manually adjust the agent's output, or need to iterate and have the agent fix things, indicating the agent’s initial response was of lower quality.
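To make the idea concrete, here is a minimal sketch of a keep-rate calculation. It assumes the metric is computed at line granularity by comparing the lines the agent wrote against a later snapshot of the file. That granularity is an assumption for illustration (the post does not say how the comparison is done), and `keep_rate` is a hypothetical helper.

```python
from collections import Counter
from typing import List


def keep_rate(agent_lines: List[str], snapshot_lines: List[str]) -> float:
    """Fraction of agent-written lines still present in a later snapshot.

    Blank lines are ignored and lines are compared after stripping
    whitespace, so re-indentation alone does not count as a change.
    """
    proposed = Counter(line.strip() for line in agent_lines if line.strip())
    if not proposed:
        return 0.0
    current = Counter(line.strip() for line in snapshot_lines if line.strip())
    # Count each proposed line as kept at most as often as it still appears.
    kept = sum(min(count, current[line]) for line, count in proposed.items())
    return kept / sum(proposed.values())


agent_lines = [
    "def add(a, b):",
    "    return a + b",
    "",
    "print(add(1, 2))",
]
# The user later edited the final line by hand.
snapshot_lines = [
    "def add(a, b):",
    "    return a + b",
    "print(add(2, 3))",
]
rate = keep_rate(agent_lines, snapshot_lines)
print(round(rate, 2))  # 0.67
```

Evaluating the same function against snapshots taken at several fixed intervals would show how quickly the agent's output is reworked.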
Second, we use a language model to read the user's responses to the agent’s initial output in order to capture semantically whether the user was satisfied or not. A user moving on to the next feature is a strong signal the agent did its job, while a user pasting a stack trace is a reliable signal that it didn't.
Sometimes these online tests tell us to shelve an idea that seemed promising. In one experiment, we tried a more expensive model for context summarization and found that the difference in agent quality was negligible and not worth the higher cost.
As we add more models and capabilities, the harness gets more complex with more potential states, just like any piece of software. With this comes more surface area for bugs to crop up, many of which we can only detect at scale.
The agent’s tools are one of the broadest surfaces for bugs, and tool call errors can be extremely harmful to a session in Cursor. While the agent can often self-correct, errors remain in context, wasting tokens and causing “context rot,” where accumulated mistakes degrade the quality of the model's subsequent decisions.
Sometimes, the agent can be blocked or go off the rails completely after a failed tool call. Though metrics like tool call volume and error rate don’t directly measure whether the agent did a good job, they act as indicators that can point to a broader issue.
Any unknown error represents a bug in the harness, and we treat it accordingly. But many errors are “expected,” for example the model occasionally proposing an incorrect edit or trying to read a file that doesn't exist. We classify these expected errors by cause. InvalidArguments and UnexpectedEnvironment capture model mistakes and contradictions in the context window, while ProviderError captures vendor outages from tools like GenerateImage or WebSearch.
We have several other classifications like UserAborted and Timeout which altogether encompass most expected errors.
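The classification scheme can be sketched as an enum plus a mapping function. The category names below come from the text. The mapping from Python exception types to categories is purely an assumption for illustration, and `classify` is a hypothetical helper rather than the harness's real logic.

```python
from enum import Enum


class ToolErrorKind(Enum):
    INVALID_ARGUMENTS = "InvalidArguments"
    UNEXPECTED_ENVIRONMENT = "UnexpectedEnvironment"
    PROVIDER_ERROR = "ProviderError"
    USER_ABORTED = "UserAborted"
    TIMEOUT = "Timeout"
    UNKNOWN = "Unknown"  # anything unclassified is treated as a harness bug


def classify(exc: BaseException) -> ToolErrorKind:
    """Map a raised exception to an error category (illustrative mapping)."""
    if isinstance(exc, KeyboardInterrupt):
        return ToolErrorKind.USER_ABORTED
    if isinstance(exc, TimeoutError):
        return ToolErrorKind.TIMEOUT
    if isinstance(exc, FileNotFoundError):
        # The model asked for a file that does not exist in the workspace.
        return ToolErrorKind.UNEXPECTED_ENVIRONMENT
    if isinstance(exc, (ValueError, TypeError)):
        # The model supplied malformed or ill-typed tool arguments.
        return ToolErrorKind.INVALID_ARGUMENTS
    if isinstance(exc, ConnectionError):
        # An upstream vendor was unreachable.
        return ToolErrorKind.PROVIDER_ERROR
    return ToolErrorKind.UNKNOWN


def is_expected(kind: ToolErrorKind) -> bool:
    return kind is not ToolErrorKind.UNKNOWN
```

Keeping `UNKNOWN` as the fall-through case means any new failure mode is counted as a bug until someone deliberately assigns it a category.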




We define alerts based on these metrics to catch significant regressions that make it into production. Since unknown errors are always bugs, we alert whenever the unknown error rate for any tool exceeds a fixed threshold. But it can be tricky to tell whether expected errors represent a bug in the harness or expected behavior.
For example, a grep search timeout might be because of a performance issue with the tool, or the codebase might just be huge and the model formed an inefficient query. To deal with this, we have anomaly detection alerts which fire when expected errors significantly exceed the baseline. We compute baselines per-tool and per-model, because different models may mess up tool calls at different rates.
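A minimal sketch of per-tool, per-model anomaly detection follows. It assumes a simple rule: alert when the current error rate exceeds the historical mean by more than `k` standard deviations. The post does not specify the statistical method, so the rule, the thresholds, and all names and numbers here are assumptions for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (tool name, model name)


def is_anomalous(
    history: List[float],
    current: float,
    k: float = 3.0,
    min_samples: int = 5,
    min_sigma: float = 0.005,
) -> bool:
    """Return True if `current` sits well above the historical baseline."""
    if len(history) < min_samples:
        return False  # not enough data to form a baseline
    # Floor the deviation so a very stable history does not alert on noise.
    sigma = max(stdev(history), min_sigma)
    return current > mean(history) + k * sigma


# Made-up historical error rates, keyed per tool and per model.
baselines: Dict[Key, List[float]] = {
    ("grep", "model-a"): [0.020, 0.022, 0.019, 0.021, 0.020],
    ("grep", "model-b"): [0.050, 0.055, 0.048, 0.052, 0.051],
}

# The same 5% error rate is a spike for one model and normal for the other.
alert_a = is_anomalous(baselines[("grep", "model-a")], current=0.05)
alert_b = is_anomalous(baselines[("grep", "model-b")], current=0.05)
print(alert_a, alert_b)  # True False
```

The example shows why the baseline is keyed by both tool and model: a single global threshold would either miss the regression for `model-a` or fire constantly for `model-b`.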
We also run a weekly Automation equipped with a skill that teaches the model how to search through our logs, surface issues that are new or recently spiked, and create or update tickets in a backlog with an investigation. We lean heavily on Cloud Agents to kick off fixes for many issues at once, and can even trigger them directly from Linear.
This process is part of the way we’re instantiating an automated “software factory” for our agent harness. Over the course of a focused sprint earlier this year, we drove unexpected tool call errors down by an order of magnitude.
All of our harness abstractions are model agnostic and can be heavily customized for every model we support. For instance, OpenAI's models are trained to edit files using a patch-based format, while Anthropic's models are trained on string replacement. Either model could use either tool, but giving it the unfamiliar one costs extra reasoning tokens and produces more mistakes. So in our harness, we provision each model with the tool format it had during training.
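One way to picture this provisioning is a lookup from provider, and optionally model version, to a harness configuration. The sketch below is an assumption about shape only: `HarnessConfig`, the tool and prompt labels, and the version name are all hypothetical placeholders rather than Cursor's real configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HarnessConfig:
    edit_tool: str       # which edit-tool format the model is given
    prompt_variant: str  # which system-prompt variant to load


# Provider-level defaults, following the split described in the text.
PROVIDER_CONFIGS = {
    "openai": HarnessConfig(edit_tool="patch", prompt_variant="literal"),
    "anthropic": HarnessConfig(edit_tool="string_replace", prompt_variant="intuitive"),
}

# Hypothetical per-version override; the version name is a placeholder.
VERSION_OVERRIDES = {
    ("anthropic", "example-model-v2"): HarnessConfig("string_replace", "intuitive-v2"),
}

FALLBACK = HarnessConfig(edit_tool="string_replace", prompt_variant="generic")


def config_for(provider: str, version: Optional[str] = None) -> HarnessConfig:
    """Most specific match wins: version override, then provider, then fallback."""
    if version is not None and (provider, version) in VERSION_OVERRIDES:
        return VERSION_OVERRIDES[(provider, version)]
    return PROVIDER_CONFIGS.get(provider, FALLBACK)
```

Resolving from most specific to least specific lets a new model version start from its provider's defaults and diverge only where tuning shows it needs to.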
This customization goes very deep, and includes custom prompting for different providers and even for different model versions. OpenAI’s models tend to be more literal and precise in their instruction following, whereas Claude is a bit more intuitive and more tolerant of imprecise instructions.
When we get early access to a new model ahead of launch, we start from the closest existing model's harness and begin iterating. We run offline evals to find where the model gets confused, have people on our team use it and surface problems, and tweak the harness in response. We iterate like this until we have a model-harness combination we feel good about shipping.
Much of this tuning process is about customizing the harness to a new model’s strengths, but sometimes we encounter genuine model quirks that we can mitigate with the harness. For example, we observed one model develop what we came to call context anxiety: As its context window filled up, it would start refusing work, hedging that the task seemed too big. We were able to reduce the behavior through prompt adjustments.
It’s especially tricky to design the harness to support users switching models mid-conversation, because different models have different behaviors, prompts, and tool shapes.
When a user switches models, Cursor automatically switches to the appropriate harness, with that model’s customized set of prompts and tools. However, the model still has to apply those tools to a conversation history that was produced by a different model and is out of distribution from what it was trained on.
To address this, we add custom instructions that tell the model when it's taking over mid-chat from another model. These instructions also steer it away from calling tools that appear in the conversation history but aren't part of its own tool set.
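A handoff instruction of this kind could be generated along the following lines. This is only a sketch under assumptions: the wording of the instruction, the function name `handoff_instructions`, and the tool names are all hypothetical.

```python
from typing import Iterable, Set


def handoff_instructions(history_tool_calls: Iterable[str], available_tools: Set[str]) -> str:
    """Build a note for a model that is taking over a chat from another model."""
    # Tools the previous model called that the new model does not have.
    unavailable = sorted(set(history_tool_calls) - available_tools)
    lines = ["You are taking over this conversation mid-chat from a different model."]
    if unavailable:
        lines.append(
            "Earlier turns call tools that are not in your tool set: "
            + ", ".join(unavailable)
            + ". Do not call them; use only your own tools."
        )
    return "\n".join(lines)


note = handoff_instructions(
    history_tool_calls=["edit_with_patch", "read_file", "edit_with_patch"],
    available_tools={"edit_with_string_replace", "read_file"},
)
```

Here `note` would name `edit_with_patch` as off limits while leaving `read_file` alone, since the new model also has that tool.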


A second challenge is that caches are provider- and model-specific, so switching means a cache miss and a slower, more expensive first turn. We have experimented with mitigating this by summarizing the conversation at switch time, which provides the model with a clean summary that reduces the cache penalty. But if the user is deep into a complex task, the summary can lose important details. We generally recommend staying with one model for the duration of a conversation unless you have a reason to switch.
Another way to sidestep the challenges of mid-conversation model switching is to instead use a subagent, which starts from a fresh context window. We recently added to the harness the ability for users to directly ask for a subagent to be run with a particular model.


The future of AI-assisted software engineering will be multi-agent. Instead of running every subtask through a single agent, the system will learn to delegate across specialized agents and subagents: one for planning, another for fast edits, and a third for debugging, each scoped to what it does best.
Making that work well is fundamentally a harness challenge. The system needs to know which agent to dispatch, how to frame the task for that agent's strengths, and how to stitch the results into a coherent workflow. The ability to orchestrate that kind of coordination will live in the harness rather than any single agent. This means that, while harness engineering has always been important for agent success, it's only going to be more critical going forward.