Glean 拾遗
From 400,000 Claude Code sessions: domain expertise is the key to success in agentic coding

Source: www.anthropic.com | Collected: 2026-06-18 06:00 | Reading time: 27 min
AI summary

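Anthropic's analysis of about 400,000 Claude Code sessions shows that users mainly handle planning while Claude handles execution, and that domain expertise, rather than coding skill, is the key to success. Expert-level sessions reach verified success more than twice as often as novice sessions, though intermediate users already capture most of the benefit; in coding sessions, non-software occupations trail software engineers by only about 5 percentage points. Over seven months, the share of debugging sessions fell from 33% to 19%, the share of end-to-end tasks (deployment, data analysis, document writing) rose, and the estimated value of the average task rose by about 25%. The report lays out its methodology for decision attribution, expertise rating, and success verification, and notes its limitations. Suited to engineers and researchers following AI coding tools, agent collaboration, and skill transfer.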

§ 1

Building on prior work, we introduce a framework for studying interactive agentic coding based on a privacy-preserving analysis of ~400,000 Claude Code sessions from between October 2025 and April 2026. We evaluate the composition of tasks, human-AI collaboration, and success rates.

In a typical session, people make most of the planning decisions (what to do) and Claude makes most of the execution decisions (how to do it). The more domain expertise a person brings to a session, the more work Claude does per instruction. On coding tasks, every major occupation succeeds at nearly the same rate as software engineers, on average, where success means accomplishing what the person set out to do, with verifiable evidence like passing tests or committed work.

The more domain expertise a person has, the more often the session ends in success—though the gap between intermediate and expert users is modest. Over the seven months we observe, the share of sessions spent debugging fell by nearly half, and usage shifted toward more end-to-end agentic use: deploying and running code, analyzing data, and writing non-code documents.

Over those seven months, the value of the typical task, which we estimate through a comparison to freelance job postings, rose in almost every kind of work—about 25% on average.

§ 2

Agentic coding has taken off. The share of GitHub projects with coding agent activity has more than doubled since late 2025, and Claude Code users now spend an average of 20 hours per week using the tool. Can people without formal coding experience successfully direct an agent through complex technical work? And what will rapid adoption and improvement of these tools mean for knowledge work broadly? While we don’t have full answers to these questions yet, we look to Claude Code usage data for early signals.

This report provides evidence on how Claude Code is used in practice, based on a privacy-preserving analysis of ~400,000 interactive sessions from ~235,000 people between October 2025 and April 2026. It builds on prior work focused on measures of autonomy in Claude Code sessions, and how Claude Code is changing work at Anthropic. Here, we introduce a framework for describing interactive AI coding-assistant usage: what kind of work is being done, who is doing it, and whether it succeeds. We focus on Claude Code usage through a command-line interface (CLI), Claude.ai, or the Claude Code desktop app. By tracking how agentic coding usage changes as models get more capable, we can better understand how these tools affect the labor market for coding professionals and knowledge workers.

What happens on Claude Code may be a preview of where knowledge work is headed, as agents become embedded in non-coding work. We find that Claude is handling more complex and more valuable tasks. At the same time, there remains a clear division of labor in agentic coding: People decide what to build, and the agent decides how to build it.

We also see evidence that domain expertise, and not coding proficiency, amplifies effective use of the tool. In particular, domain experts succeed more often, and more easily recover from errors and misunderstandings. However, the gap between experts and intermediates is modest—suggesting that proficiency in a domain is enough to use the tool almost as effectively as those with deep mastery.

These findings give us an early read on possible transitions in the labor market. In our data, success is determined by how well a person understands the problem they are trying to solve, not whether they’re trained in coding. If these patterns hold across the economy, it suggests that while agentic coding tools may be absorbing some implementation-heavy work, they are also rewarding those with firm understanding of the problems they solve on the job. Coding agents are not substituting for domain expertise—the more understanding a worker brings to an agent, the more quality work the agent is able to do.

§ 3

To understand what people are using Claude Code for, we classify each session into one of nine work modes—the single activity that best describes what the session is trying to accomplish. Four modes involve writing or maintaining code directly: building something new, fixing something broken, testing code, and orchestrating other agents or automated pipelines. Another category is operating software—deploying, configuring, running pipelines, monitoring systems. Two categories are more about working out what to do: understanding how an existing system works, and planning a change before making it. And two take actions unrelated to code, or where code is incidental to the final product: analyzing data, and communicating via presentations and other prose-based documents.

About 56% of sessions consist of writing (25%), fixing (26%), or testing and orchestrating code (5%). Operating software comprises 17%, while 14% of sessions are planning or exploring, and 13% produce analysis or prose (Figure 1).
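As a quick arithmetic check on the shares quoted above, the sketch below sums them. The dictionary keys are illustrative labels chosen for this example, not identifiers from the report; the percentages are the ones given in the text.

```python
# Shares of sessions by grouped work mode, as quoted in the text (percent).
# The key names are illustrative labels, not identifiers used in the report.
code_shares = {"writing": 25, "fixing": 26, "testing_and_orchestrating": 5}
other_shares = {"operating": 17, "planning_or_exploring": 14, "analysis_or_prose": 13}

# Direct code work is the sum of the three code-facing groups.
code_total = sum(code_shares.values())

# All groups together should account for every session.
overall_total = code_total + sum(other_shares.values())

print(f"direct code work: {code_total}%")
print(f"all modes: {overall_total}%")
```

The three code-facing groups add up to the 56% stated in the text, and the six groups together cover all sessions.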

Figure 1: The nine modes of work

Each interactive session is classified into the single mode that best describes what it is trying to accomplish.

We classify each session by having a model read its transcript. Using our privacy-preserving analysis tool, we then check the resulting labels against telemetry that is recorded automatically for every session, including whether any lines of code were added or deleted. The two sources agree closely: for instance, more than 90% of the sessions our classifier labeled as creating or modifying code showed code changes in the telemetry. See the Appendix for details.
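The agreement check described here can be sketched as a simple rate. This is a minimal illustration under stated assumptions, not the report's pipeline: the record fields (`labeled_code_work`, `lines_added`, `lines_deleted`) and the sample records are hypothetical.

```python
def code_change_agreement(sessions):
    """Share of sessions labeled as code work whose telemetry shows changed lines.

    Returns None when no session carries the code-work label.
    """
    labeled = [s for s in sessions if s["labeled_code_work"]]
    if not labeled:
        return None
    # A session counts as confirmed if telemetry recorded any added or deleted line.
    confirmed = sum(1 for s in labeled if s["lines_added"] + s["lines_deleted"] > 0)
    return confirmed / len(labeled)


# Hypothetical records; field names and values are made up for illustration.
sessions = [
    {"labeled_code_work": True, "lines_added": 40, "lines_deleted": 3},
    {"labeled_code_work": True, "lines_added": 0, "lines_deleted": 12},
    {"labeled_code_work": True, "lines_added": 0, "lines_deleted": 0},
    {"labeled_code_work": False, "lines_added": 0, "lines_deleted": 0},
]

agreement = code_change_agreement(sessions)
print(f"agreement among code-labeled sessions: {agreement:.0%}")
```

The rate is conditioned on the classifier label, matching the direction of the statistic quoted in the text (of the sessions labeled as code work, how many show code changes).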

§ 4

How autonomous is Claude Code? Capability evaluations suggest the ceiling is high and rising: on benchmarks such as METR's time-horizon evaluations, frontier models can now complete software tasks that would take a person hours, autonomously working through obstacles along the way. But what does usage actually look like in practice? Here, we look at how much steering is done by the person and by Claude in real sessions.

We investigate this question from two angles. First, we focus on the extent to which people are entrusting decisions to Claude, and second we look at how many actions they give to Claude. To understand the division of decision-making in a session, we build a privacy-preserving decision attribution classifier based on the content of a session. We ask a classifier to list all the meaningful decisions in a session. We separate these decisions into planning (what to do, which approach to take, what counts as done) and execution (which files to change, what code to write, what language to write in, which commands to run). The classifier then attributes each decision to Claude or to the user, giving every session two numbers: the user's share of planning decisions and the user's share of execution decisions.

On average, people make about 70% of the planning decisions but only 20% of the execution decisions (Figure 2). In practice, there is a clear division of labor in agentic coding: people decide what to build, and the agent decides how to build it.
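The two per-session numbers can be sketched as follows. This is an illustration under stated assumptions, not the classifier itself: the `Decision` type, the `user_share` helper, and the sample session are hypothetical, and the hard part (having a model list and attribute the decisions) is taken as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    kind: str   # "planning" or "execution"
    actor: str  # "user" or "claude"


def user_share(decisions, kind):
    """Fraction of decisions of the given kind attributed to the user.

    Returns None when the session has no decisions of that kind.
    """
    relevant = [d for d in decisions if d.kind == kind]
    if not relevant:
        return None
    return sum(d.actor == "user" for d in relevant) / len(relevant)


# A made-up session with three planning and four execution decisions.
session = [
    Decision("planning", "user"),
    Decision("planning", "user"),
    Decision("planning", "claude"),
    Decision("execution", "claude"),
    Decision("execution", "claude"),
    Decision("execution", "claude"),
    Decision("execution", "user"),
]

planning_share = user_share(session, "planning")
execution_share = user_share(session, "execution")
print(f"user share of planning: {planning_share:.2f}")
print(f"user share of execution: {execution_share:.2f}")
```

Averaging these two shares across sessions would give figures comparable in kind to the 70% and 20% reported above.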

To understand the delegation of actions in a session, we look at the session's structure instead of its content. In a Claude Code session, the user and Claude trade prompts (from the user) and actions (taken by Claude): the user writes a prompt, Claude goes off and does some work, the user writes another prompt, and so forth. In a typical session, there are about four such turns. In our historical data from October to April, each prompt the user sends sets off a chain of around 10 actions taken by Claude on average, and sometimes over 100. In each turn, Claude reads files, edits code, runs commands, and writes on average 2,400 words of output.

How much Claude does between check-ins largely tracks who is making the decisions. When the user keeps control of execution (i.e. makes over 80% of execution decisions), Claude takes fewer actions per turn (about eight actions). And when Claude takes control of planning (i.e. makes over 80% of planning decisions), it takes on the highest number of actions (about 16).

Figure 2: Claude's share of planning and execution decisions

Distribution across sessions of the share of planning decisions (what to do) and execution decisions (how to do it) attributed to Claude rather than the user. In the typical session, the user makes about 70% of planning decisions while Claude makes about 80% of execution decisions.

§ 5

From each transcript, Claude rates the user's apparent expertise at the task on a five-point scale from novice to expert. The expertise classifier looks for three signals: how precisely the user frames their directions, what they ask Claude to verify, and whether the user tends to correct Claude or Claude tends to correct the user. Note that expertise is capturing something quite different from job title or general ability, and, crucially, it is task-specific. A senior engineer asking their first Rust question is a beginner at Rust. An accountant who has never used Python, but tells Claude exactly which reconciliation rules a Python script must enforce and catches the edge case it mishandles at month-end close, is an expert at that task.

The table below shows how we defined each expertise level in the classifier along with an example request from a public dataset of coding agent sessions, SWE-chat. The conversation categorized as Novice gives generic instructions with no implied domain-specific knowledge. The Expert conversation conveys deep knowledge of the codebase and technical environment.

Table 1: Expertise classifier

The examples paraphrase, anonymize and condense real sessions labeled by our classifiers. Many of the sessions used in the table come from a public dataset of agentic coding sessions, SWE-chat.

We quantify how expertise relates to Claude's output and activity per prompt. In typical novice sessions, each prompt sets off about five Claude actions and roughly 600 words of output, while expert sessions set off action chains more than twice as long (12 actions) carrying five times the output (3,200 words) (Figure 3). This gap between novice and expert sessions appears within every kind of work and every band of task value.
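The comparison above reduces to two ratios, worked out below from the figures in the text. The variable names are illustrative.

```python
# Typical per-prompt figures quoted in the text.
novice = {"actions": 5, "words": 600}
expert = {"actions": 12, "words": 3200}

# Ratio of expert to novice activity per prompt.
action_ratio = expert["actions"] / novice["actions"]
word_ratio = expert["words"] / novice["words"]

print(f"actions per prompt: {action_ratio:.1f}x")
print(f"words per prompt: {word_ratio:.1f}x")
```

The action ratio is consistent with "more than twice as long", and the word ratio is slightly above the "five times" quoted in the text.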

These measures complement the autonomy measures in our prior report on Claude Code, which tracked how long the agent runs and how often people approve its actions automatically. Our decision attribution measure, by contrast, captures who makes the substantive decisions in a session as a whole, while our measures of output and actions per prompt measure how much autonomous activity from Claude each human prompt sets off.

Figure 3: Claude does more per prompt for more expert users

Claude produces more actions (left bar) and text output per prompt (right bar) for more expert users. Boxes span the interquartile range (split at the median). Whiskers represent the 5th to 95th percentile. White dots are geometric means. Both upward trends are statistically significant (p < 0.001), as is each adjacent-level step, and they remain significant (at +9% actions and +13% output per expertise level) in a regression controlling for work mode, task value, month, occupation, and model family, with standard errors clustered by user.

§ 6

To understand who is doing this work, we infer each user's occupation from the session transcript, mapping it to one of 23 major groups in the Bureau of Labor Statistics’ Standard Occupational Classification (SOC) taxonomy. The classifier is instructed to rely only on signals such as the project context the agent loads at the start of a session, the names and structure of their files, any artifacts they reference (e.g., legal filings, clinical data, financial reports, a curriculum, etc.) and vocabulary they use. It is explicitly instructed not to treat the act of coding as evidence of a coding profession. A session is classified into the coding SOC code (Computer and Mathematical Occupations) only when there is clear signal that software or data work is the user’s job. A session in which a lawyer builds a script to automatically flag missing clauses across a folder of contracts is mapped into Legal Occupations, even if the session’s work is primarily software. The session is left unclassified when there is no signal about the user’s occupation.

We were able to infer occupation in about 70% of sessions. Within this set, Computer and Mathematical Occupations, a category which encompasses most software-related jobs, is unsurprisingly the largest group. The next largest are Business and Financial Operations; Arts, Design, and Media; Management; and Life, Physical, and Social Sciences. The fastest-growing non-software occupation groups in our sample are management, sales, and legal occupations.

§ 7

The composition of the work done with Claude Code changed substantially between October 2025 and April 2026. The clearest change is that the share of sessions spent fixing broken code fell from 33% to 19% (Figure 4). In its place, we saw a greater share of the work that surrounds code. Operating software grew from 14% to 21% of sessions. Writing and data analysis roughly doubled, from about 10% to 20% of sessions.

The tasks themselves also grew more valuable. We approximate each session's economic value by asking what the work would cost on a freelance marketplace, calibrated against a public dataset of real postings. By this measure, the estimated value of the average session rose by 27% between October and April. The rise holds across many kinds of work. Building, operating, and fixing-type tasks all grew more valuable by roughly a third or more (about 43%, 34%, and 32% respectively). These price estimates are coarse, so we use them primarily to compare tasks to one another over time, not as dollar values to be read literally. For details about the construction of the task estimator, see the Appendix.

Figure 4: The composition and value of Claude Code work, October 2025 to April 2026

Share of sessions in each work mode over the seven-month window. The share of sessions fixing broken code fell from 33% to 19%, while operating software, analyzing data, and writing documents grew.

§ 8

The estimated value of a task is one way to get a sense of how Claude Code is helping people do their work. Another angle is to look at how many sessions are successful, and what characteristics of a session are linked to success. Across all our measures of success, we see a clear pattern: the more expertise a person exhibits in a session, the higher the likelihood of success. Most of the gain is concentrated at the lower end of the expertise scale: the gap between novice sessions and intermediate sessions is bigger than the gap between intermediate and expert.

Before turning to the characteristics of successful sessions, we should be precise about how we measure success. We do not observe users' real-world outcomes, and we cannot ask them directly whether they got what they wanted out of Claude. Instead, we rely on two complementary transcript-based measures. The first, judged success, comes from a classifier that reads the full transcript and decides whether the person succeeded in doing what they set out to do (with options: succeeded, partially succeeded, failed, no clear goal). The second, verified success, relies on two companion classifiers that rate the strength of the evidence for that judgment. A success signal classifier looks for verifiable evidence of success. In particular, it looks for git activity like commits and pull requests matching the work, as well as test suites passing, and explicit affirmation from the user. It scores the session from "no signal" to "weak signal" (1) to "multiple hard signals" (5). A parallel failure signal classifier scores the evidence that things went wrong: errors, failed tests, retries, the user pushing back on the output. Verified success requires both that the session is judged successful and that there is at least one hard verifiable signal of success. For the following analysis, which is focused on the degree of success or failure in a session, we exclude sessions classified as having "no clear goal," which comprise about 7.7% of our full sample.
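The success definitions above can be sketched as predicates over classifier labels. This is a minimal sketch under stated assumptions: the `SessionLabels` type and its field names are hypothetical, and the boolean `has_hard_success_signal` stands in for the success-signal score, because the text does not say which score counts as a hard signal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionLabels:
    # One of: "succeeded", "partially_succeeded", "failed", "no_clear_goal".
    outcome: str
    # Assumption: a boolean summary of the success-signal score; the text does
    # not specify the score at which a signal counts as "hard".
    has_hard_success_signal: bool


def in_analysis_sample(s):
    """Sessions with no clear goal are excluded from the success analysis."""
    return s.outcome != "no_clear_goal"


def is_judged_success(s):
    return s.outcome == "succeeded"


def is_at_least_partial_success(s):
    return s.outcome in ("succeeded", "partially_succeeded")


def is_verified_success(s):
    """Judged successful AND backed by at least one hard verifiable signal."""
    return is_judged_success(s) and s.has_hard_success_signal


# Made-up examples.
committed_and_tested = SessionLabels("succeeded", True)
succeeded_without_evidence = SessionLabels("succeeded", False)
partial_with_commit = SessionLabels("partially_succeeded", True)
```

Verified success is the strictest of the three measures: a session can be judged successful without being verified, and a hard signal alone is not enough if the session only partially succeeded.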

So what kinds of sessions are most successful? It turns out that the expertise rating of a session, described above, matters a great deal for the success of a session.

One might worry that expertise isn't the real driver: perhaps experts simply pick different tasks, or differ in other ways. Throughout this section, we partially address this worry by comparing sessions doing the same kind of work, at the same estimated value, in the same month, on the same subject, from people in the same broad occupation group, and asking how outcomes differ by the person's rated expertise.

Table 2: Definitions of success and failure derived from classifiers

The examples paraphrase and summarize real sessions from a public dataset of agentic coding interactions, SWE-chat, labeled by our classifiers.

Across all of our success measures, the more expertise a person exhibits in a session, the more likely it is that the session succeeds. A novice-rated session reaches our strictest measure, verified success, 15% of the time and at least partial success 77% of the time. A session rated intermediate or up reaches verified success 28-33% of the time and partial success 91-92% of the time (Figure 5).

In each measure, most of the gain comes from moving from novice to intermediate; between intermediate and expert, the slope decreases. In the Appendix, we give details about the regressions behind Figure 5.

Figure 5: Expertise and how sessions end

Session outcomes by the user's rated expertise at the task, on a five-point scale from novice to expert. The left panel includes all sessions. The middle and right panels restrict to sessions that hit trouble (failure signals > 3) and show the share that still end in various definitions of success and failure. Each point is an adjusted rate: we estimate the differences between expertise levels by comparing only sessions that share the same work mode, the same task-value band, the same month, the same task subject, and the same kind of user (software-related occupation or not). Details about the regressions behind these points are in the Appendix. Whiskers are confidence intervals on sample means (most are too small to be visible in this plot). These plots exclude sessions judged by the success outcome classifier to have no clear goal.

§ 9

A similar gradient appears in sessions that run into challenges along the way. We say a session hits trouble when the failure signal records verified evidence of failure. This could be an error, a failed test, multiple attempts to do the same thing, or the user expressing frustration or dissatisfaction. Among sessions that hit trouble, the share that are verified successes rises from 4% for novice-rated sessions to 15% for expert-rated ones, accounting for all the controls described above (Figure 5). Looking at the looser measures, we find that the share of at least partial success is 60% for novice and 80-81% for intermediate through expert sessions.

We also track the inverse relationship: expertise versus various measures of failure. Note that in this analysis, the sessions judged as failures are those that do not even partially succeed. We say a troubled session is abandoned if it is judged as failed and zero lines of code are written: 19% of sessions where the user appears to be a novice end abandoned, against 5-7% for everyone else. In other words, the least experienced users are more likely to give up when they are struggling to get the outcome they are after. Part of the value of expertise appears to be the ability to steer the agent in the right direction.
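The "trouble" and "abandoned" definitions can be sketched as two predicates. The function and parameter names are hypothetical; the trouble threshold (failure signal above 3) is taken from the Figure 5 caption, and the outcome labels follow the judged-success options listed earlier.

```python
def hit_trouble(failure_signal):
    """A session hits trouble when its failure-signal score exceeds 3."""
    return failure_signal > 3


def is_abandoned(outcome, failure_signal, lines_of_code_written):
    """A troubled session is abandoned if it is judged failed with no code written."""
    return (
        hit_trouble(failure_signal)
        and outcome == "failed"
        and lines_of_code_written == 0
    )


# Made-up examples: the same failed outcome, with and without any code written.
gave_up = is_abandoned("failed", failure_signal=5, lines_of_code_written=0)
failed_after_trying = is_abandoned("failed", failure_signal=5, lines_of_code_written=120)
recovered = is_abandoned("succeeded", failure_signal=4, lines_of_code_written=80)
```

Requiring zero lines of code separates sessions where the user gave up early from sessions that failed after real work was attempted.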

§ 10

People in software-related occupations reach verified success in about 30% of their sessions overall, while users from other professions reach verified success about 26% of the time. Among sessions that produce code (i.e., sessions that add or modify at least one line of code), those numbers are 34% and 29% respectively (Figure 6). The gap narrows under our looser definition of success: software-related and other occupations reach at least partial success in code-producing sessions 89% and 88% of the time, respectively. The five-point gap in verified success is small, and it has neither widened nor narrowed over seven months, even as the success rates in both groups increased. In code-producing sessions, every one of the ten largest occupations in our dataset lands within seven points of software engineers in terms of their success. Management occupations are highest on verified success, slightly above the software engineering occupations. Their higher verified success rates may reflect management skills that transfer to directing an agent. But they may also partly reflect our measurement: verification rests partially on explicit confirmation in the transcript, and managers may be more likely to communicate when they get what they ask for.

Figure 6: Verified and judged success rates in coding sessions by inferred occupation

Share of sessions meeting strict definitions of success (judged success and verified success) among sessions that add or change at least one line of code, by the user's inferred occupational group, for the ten largest groups. Every group is within seven percentage points of software/math users (SOC Code Computer and Mathematical Occupations). Error bars are 95% confidence intervals computed on distinct accounts.

§ 11

The results in this report offer an emerging picture of how agentic coding amplifies some forms of knowledge and skills, while substituting for others. In sessions that produce code, every major occupation succeeds at rates within a few points of those in software-related occupations. It appears that coding agents are making a coding background less relevant to successful programming.

At the same time, successful sessions are more likely to exhibit domain expertise. Sessions rated expert reach verified success more than twice as often as those rated novice, and when a session hits trouble, novices abandon the session at several times the rate of everyone else. The shape of the collaboration gives this picture more color: domain experts are able to direct Claude to do more work with each instruction they give. So, the ability to steer Claude toward success comes more from command of a domain than from the ability to write code. A person with such command, in any field, may now be able to do technical work they previously could not. A person without any such expertise will get far less from the same tool. And the gains come mostly from competence, not mastery: a working grasp of the domain captures most of the benefit, while deep specialization adds only a bit more beyond that.

These findings are preliminary. As in most of our research, we cannot measure real-world outcomes, like whether code written in a session is actually used or discarded thereafter, or whether it produces an economically valuable artifact. In addition, the non-interactive usage this report excludes is a substantial share of activity. Developing a framework to measure it is a priority for future work. And all of our classifications of sessions depend on a model's reading of the transcript. In the Appendix, we show that our classifiers track independent telemetry in expected directions, and agree with a strong reference model on the majority of sessions. But classifiers remain challenging to validate at scale, and Claude Code sessions add further difficulty, as they may be too long and complex for human labels to serve as ground truth.

The picture in this report will be updated as the models, the users, and the division of labor between them change. We hope that these measures will allow us to track consequential shifts as they happen. For instance, if the returns to expertise begin to decrease over time, that would suggest that models are starting to supply the essential judgment that users currently bring, and that the gains from these tools are broadening beyond domain experts. If the share of coding sessions completed successfully by users outside software occupations continues to grow, it could indicate that software production is becoming a part of ordinary work in every field, rather than the product of a single occupation. These shifts would change who benefits from agentic coding, and by how much, and would have implications for what is most valued in the labor market.
