Anthropic's Analytics Agent Stack: Tackling Entity Ambiguity, Staleness, and Retrieval Failure

Source claude.com Glean’d 2026-06-15 06:01 Read 32 min

AI summary

Anthropic’s data team shares how they use Claude to automate 95% of business analytics queries at roughly 95% accuracy. They identify three core failure modes—concept‑entity ambiguity, data staleness, and retrieval failure—and describe a four‑layer agentic stack to address them: data foundations (canonical datasets, rigorous governance), sources of truth (semantic layer, lineage, business knowledge graph), skills (knowledge and procedural skills, which lifted accuracy from ~21% to >95%), and validation (offline evals, adversarial review, online monitoring). The post includes concrete practices such as colocating docs with code, treating metadata as a first‑class product, and an appendix with a skill file skeleton. It is aimed at data engineers and analysts building LLM‑powered self‑service analytics.

§ 1

As many data science and data engineering teams can attest, enabling self-service business analytics has traditionally been a slog. Making the data model more accessible to less technical coworkers via wide and denormalized tables often leads to overlapping views with inconsistent definitions as the business scales (and does little to bridge the gap for employees with little desire to learn SQL). Alternatively, creating more ringfenced environments for users often misses the long tail of business questions and leads to metric and dashboard bloat as teams silo their work.

§ 2

The rise of LLMs provides an additional path for self-service analytics that avoids those challenges. However, pointing Claude at a warehouse and letting the agents execute can create a false sense of precision. The initial elation of liberation from ad-hoc requests turns into dread with the realization that this setup separates stakeholders from the underlying infrastructure, documentation, and expertise that previously steered them toward carefully curated datasets.

§ 3

At Anthropic, 95% of business analytics queries are automated via Claude, with ~95% accuracy in aggregate. By giving this often rote, repetitive work to Claude, our data science team can focus on more strategic work like causal modeling, forecasting, and machine learning. After meeting with dozens of Anthropic's top Claude Code users and having seen myriad design patterns for analytics agents, we've cultivated some best practices for other data teams working with LLMs. In this post, we'll share these tips and approaches to maximizing Claude's ability to drive self-serve business insights, including: why analytics accuracy is a context and verification problem, not a code generation issue; the three failure modes that cause most errors; the agentic analytics stack we built to address these errors; how we measure effectiveness; and a basic template for how we create the majority of our skills (see the appendix).

§ 4

LLMs' generative abilities are a double-edged sword: the mechanisms that enable creative solutions to complex problems can also hallucinate erroneous output. To fully understand the challenges with analytics agents, it's useful to compare them to coding agents. Coding is an open-ended solution space that rewards the models' creativity, while documentation and tests provide natural guardrails against hallucination. In contrast, analytics questions often have a single correct answer that comes from a single correct source, and there's no deterministic way to prove that an answer is correct. For self-service agentic business analytics, the complexity mainly lies in the ambiguity of the data. The central problem comes down to our ability to map a user's question to specific and up-to-date entities in our data model and know the correct way of working with them. If we can do that, then the resulting SQL and its execution become trivial.

§ 5

We've identified three attributes of this problem that account for an overwhelming majority of inaccurate responses:

Concept <> entity ambiguity: with hundreds of viable options in a data model (out of potentially millions of fields), the agent is unable to choose the correct fields that best answer a user's question. For example, in measuring the number of active users: what actions constitute being "active"? Do you include fraudulent users? What lookback window do you use?

Data staleness: data sources, business definitions, and schemas change constantly; assets and agent knowledge go stale and start returning subtly wrong answers.

Retrieval failure: the right information may actually be in the data model and properly annotated, but given the vastness of the search space, the agent simply doesn't find it.
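
To make the "active users" ambiguity concrete, here is a minimal sketch showing how two equally plausible definitions give different counts on the same events. The event log, event names, and the `active_users` helper are all invented for illustration and are not from the post.

```python
from datetime import date, timedelta

# Hypothetical event log: (user_id, event_type, event_date, is_fraud).
EVENTS = [
    ("u1", "login", date(2026, 6, 1), False),
    ("u2", "query", date(2026, 6, 10), False),
    ("u3", "query", date(2026, 6, 12), True),
    ("u4", "login", date(2026, 5, 20), False),
]


def active_users(events, as_of, lookback_days, qualifying_events, include_fraud):
    """Count distinct users with a qualifying event inside the lookback window."""
    window_start = as_of - timedelta(days=lookback_days)
    return len({
        user
        for user, event_type, day, is_fraud in events
        if window_start < day <= as_of
        and event_type in qualifying_events
        and (include_fraud or not is_fraud)
    })


AS_OF = date(2026, 6, 14)

# Definition A: any login or query in the last 28 days, fraud included.
loose = active_users(EVENTS, AS_OF, 28, {"login", "query"}, include_fraud=True)

# Definition B: a query in the last 7 days, fraud excluded.
strict = active_users(EVENTS, AS_OF, 7, {"query"}, include_fraud=False)

print(loose, strict)
```

Both numbers are defensible answers to "how many active users do we have?", which is why the choice of definition has to be governed rather than left to the agent.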

§ 6

At Anthropic, the main way we minimize these three errors is via our agentic data stack. Each layer exists primarily to attack one or more of these problems:

Entity ambiguity: data foundations and sources of truth shrink the space of plausible entities until there's a single governed answer.

Staleness: maintenance and validation processes keep everything from rotting as the business changes.

Retrieval failure: skills make sure the agent reliably finds and correctly uses that answer.

In this section, we'll discuss how we built each layer.

§ 7

The most important factor in keeping analytics agents accurate is strong data foundations, which include the data models, transforms, tests, and tables in a data warehouse, along with the metadata describing them. Standard data engineering and data quality practices such as dimensional modeling, shift-left testing, and freshness and completeness checks on critical pipelines all still apply (and we won't relitigate these). What does change is that the end user of your data model is no longer a data expert (e.g., a data scientist), but rather agents acting on behalf of users with varying degrees of data expertise or understanding of the underlying infrastructure. This shift presents a challenge: the results can't require the user to validate the underlying correctness, because the end user isn't equipped to do so. The data foundations layer is aimed primarily at ambiguity: if revenue, for example, resolves to one governed dataset instead of forty plausible candidates, the problem largely disappears before the agent ever has to search. It's also where the first staleness defense lives, since the same repo that defines the canonical models is the natural place to enforce that they stay current.

§ 8

We've seen a few practices work especially well:

Create canonical datasets: By far the most common failure is that the agent can't map a concept ("revenue for product X") to the single correct table, column, and metric definition, usually because there are multiple plausible candidates with subtly different implementations. The fix is fewer, more heavily governed logical models: curate a small set of canonical, single source-of-truth datasets that are clearly owned, consumption-ready, and discoverable, then aggressively deprecate the near-duplicates. Physical rollups and caches still matter for cost and performance, but they should derive mechanically from the canonical models rather than living alongside them as alternatives. The goal is that when an agent searches for a concept, it finds a single governed answer.

Enforce your standards: We've found the foundations only hold if the canonical models and metric definitions are enforced by tooling (the agent is structurally routed to them first; more on that below), by CI (changes that bypass them fail review), and by mandate (downstream teams build on the governed layer or explain why not). Without enforcement, governance quickly decays back to the multiple-candidates problem.
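
As one way to picture the CI side of enforcement, the sketch below flags SQL that reads from anything outside a canonical allowlist. The table names, the allowlist, and the `non_canonical_refs` helper are hypothetical. The regex is deliberately naive (it ignores CTEs, subqueries, and quoted identifiers), so a real check would more likely use a SQL parser.

```python
import re

# Hypothetical allowlist of governed, canonical models.
CANONICAL_TABLES = {"analytics.revenue_daily", "analytics.active_users_daily"}

# Naive pattern: the identifier that follows FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)", re.IGNORECASE)


def non_canonical_refs(sql: str) -> set[str]:
    """Return referenced tables that are not on the canonical allowlist."""
    referenced = {name.lower() for name in TABLE_REF.findall(sql)}
    return referenced - CANONICAL_TABLES


query = """
SELECT r.day, SUM(r.amount)
FROM analytics.revenue_daily AS r
JOIN scratch.revenue_v2_tmp AS t ON t.day = r.day
GROUP BY r.day
"""

violations = non_canonical_refs(query)
if violations:
    print("Non-canonical tables referenced:", sorted(violations))
```

A check like this would fail review for the change above unless the author either switches to the governed model or explains why not.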

§ 9

Colocate artifacts: Our main defense against constantly changing data models and business logic is colocation. Nearly all data code (i.e., modeling, semantic layer, reference docs, canonical dashboard definitions) lives in a single repo, with CI checks that protect cross-layer integrity. If a modeling change would break a downstream dashboard or invalidate a documented metric, CI flags it and the fix ships in the same PR. (We'll come back to the mechanics of this in the Skills section below.)

Treat metadata as a first-class product: Coding agents perform well partly because codebases are legible: READMEs, type signatures, docstrings, etc. Your warehouse can be just as legible, but only if column and table descriptions, canonical metric definitions, grain documentation, valid value ranges, lineage, ownership, and model tiering are maintained with the same rigor as the transformations themselves. While not a new insight, good governance provides critical context that helps the agent choose the right dataset.
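
Treating metadata as a product implies it can be linted like code. The following is a minimal sketch of such a check, assuming model metadata is available as a plain dictionary; the field names, model names, and `missing_metadata` helper are illustrative assumptions, not the post's tooling.

```python
# Hypothetical set of fields every governed model must document.
REQUIRED_FIELDS = ("description", "owner", "grain", "tier")

# Hypothetical metadata, e.g. parsed from YAML files that live next to the models.
MODELS = {
    "revenue_daily": {
        "description": "Recognized revenue per product per day.",
        "owner": "finance-data",
        "grain": "one row per product per day",
        "tier": "canonical",
    },
    "signups_raw": {
        "description": "Unfiltered signup events.",
        "owner": "",
        "tier": "raw",
    },
}


def missing_metadata(models):
    """Map each model name to the required fields that are absent or blank."""
    gaps = {}
    for name, meta in models.items():
        missing = [
            field for field in REQUIRED_FIELDS
            if not (meta.get(field) or "").strip()
        ]
        if missing:
            gaps[name] = missing
    return gaps


gaps = missing_metadata(MODELS)
for model, fields in gaps.items():
    print(f"{model}: missing {', '.join(fields)}")
```

Run in CI, a check of this shape turns "undocumented model" from a review comment into a failing build.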

§ 10

If data foundations are the data warehouse itself, sources of truth are the reference surfaces the agent consults to navigate it. This layer reduces concept <> entity ambiguity and turns "weekly active users" in a stakeholder's question into a specific, governed entity in your data model. Roughly in descending order of trust:

Semantic layer: the compiled metric and dimension definitions. If a question maps cleanly to a defined metric, the agent calls a function and gets one number, the same number every other surface in the company produces. Our agents are structurally required (by skill instruction) to leverage the semantic layer first (see the appendix). One idea we tried that didn't work: bootstrapping the semantic layer by having an LLM auto-generate metric definitions from raw tables and query logs. It produced plausible-looking definitions that encoded the very ambiguities we were trying to eliminate, and was net-negative on our evals versus a smaller, human-curated layer. We therefore recommend generating the documentation with Claude, but having a human own the definition.

Lineage and the transformation graph: when the semantic layer doesn't cover a question, lineage and table ranking (based on number of references) let the agent reason about which upstream models feed a concept, which are deprecated, and which share grain. This transforms "I don't know the metric" into "I know which governed model to aggregate from." It's also the backbone of the freshness and provenance signals we surface in online validation below.

Query corpus: historical SQL from dashboards, notebooks, and prior analyses. Intuitively, this should be high-value: it's a record of every question already answered correctly. In practice, we found that giving the agent raw retrieval access to thousands of prior queries moved accuracy by less than a point (we walk through that ablation in a later section). Unstructured retrieval couldn't map a new question to the right precedent. What does work is distilling that corpus into structured per-domain reference docs and reusable analysis patterns described in skills. Treat the query history as raw material for curation, not as a source of truth the agent reads directly.

Business context: the layer most teams skip, and the one we underrated the longest. An agent that doesn't understand your business will answer what the user asked, but not what they meant. It won't know that "the Q2 launch" refers to a specific product, that two teams define the same term differently, or that a question is being asked because a board meeting is on Thursday. We pipe in a company knowledge graph consisting of indexed docs, roadmaps, decision logs, and our organizational structure so the agent can resolve ambient references and ask better clarifying questions.

The common failure pattern across all four is the same one from the data foundations layer: poor or stale documentation. Claude is exceptionally useful for closing the gap (drafting column descriptions, proposing metric docs from query patterns, flagging undocumented models in CI), but the curation and ownership are managed by humans.
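
The "semantic layer first, then lineage and table ranking" order described above can be sketched as a small resolver. Everything here is an assumption made for illustration: the metric and table names, the reference counts, the tier label for the fallback, and the `resolve` function itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resolution:
    tier: str    # which source of truth answered
    target: str  # metric or model the agent should use


# Hypothetical compiled metric definitions.
SEMANTIC_METRICS = {"weekly active users": "metrics.weekly_active_users"}

# Hypothetical lineage index: concept -> [(model, reference_count, deprecated)].
LINEAGE = {
    "trial conversions": [
        ("analytics.trial_conversions", 42, False),
        ("analytics.trial_conversions_old", 97, True),
        ("scratch.trial_conv_tmp", 3, False),
    ],
}


def resolve(concept: str) -> Optional[Resolution]:
    """Prefer a defined metric; otherwise pick the most-referenced live model."""
    key = concept.strip().lower()
    if key in SEMANTIC_METRICS:
        return Resolution("semantic layer", SEMANTIC_METRICS[key])
    live = [
        (model, refs)
        for model, refs, deprecated in LINEAGE.get(key, [])
        if not deprecated
    ]
    if live:
        model, _ = max(live, key=lambda item: item[1])
        return Resolution("lineage-ranked model", model)
    return None  # nothing governed: ask a clarifying question instead of guessing


print(resolve("Weekly Active Users"))
print(resolve("trial conversions"))
```

Note that the deprecated model is skipped even though it has the most references, which is the point of combining ranking with deprecation status.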

§ 11

If the sources of truth are the agent's declarative knowledge (i.e., what a metric means), then a skill is its procedural knowledge: which sources to consult in what order, how to navigate ambiguous data, and what a finished analysis looks like. In Claude Code, a skill is a folder of markdown the agent reads on demand. At Anthropic, the skills we developed add substantial value. Without skills, Claude's ability to answer analytics questions accurately didn't exceed 21% on our evals. Adding skills gets these numbers consistently above 95% in aggregate and regularly around 99% in certain domains. See the appendix for a skeleton we use to create a majority of our skills.

§ 12

Some best practices:

Create pairwise skills: a knowledge skill acts as a thin top-level router that allows additional domain details to load on demand. It says "try the semantic layer first, but if there's no coverage, here are ~30 reference files for this domain describing the relevant tables, columns, joins and gotchas." This router is, in effect, our answer to retrieval failure: rather than letting the agent search a million-field warehouse, it narrows the space to a few dozen curated files before a query is ever written. The runbook skill encodes the process a senior analyst would follow: clarify the question, find sources (via the knowledge skill), run the query, and then loop the result through adversarial review sub-agents. It also bundles a dozen reusable analysis patterns (retention curves, rate decomposition, funnel analysis) so that common requests don't get reinvented each time.

Create proper reference docs: written for retrieval by an LLM. Our reference docs describe tables (grain, scope, and exclusions), the mechanics of gotchas (e.g., "exclude known free-email domains, but keep custom ones like anthropic.com"), and explicit routing triggers (e.g., "IF the question is about experiment lift… DO NOT use for raw event counts") without prescriptive recipes that go stale. See below for a skeleton we use to create reference docs.

Treat skill maintenance as a first-class citizen: Skill docs describe a data model that changes daily, so without active maintenance they're wrong within weeks. We watched our offline accuracy drift from ~95% at launch to ~65% over a month before we treated this as an engineering problem. That meant colocating skill markdown files in the same repo as our transformation models, so the PR that changes a model is the same PR that updates the doc describing it. A code-review hook flags any reporting-model change that doesn't touch a skill file. Roughly 90% of our data-model PRs now include a skill change in the same diff. We also regularly prune skill scaffolding as models improve and previous failure modes no longer apply.

Create a consistent and seamless experience across all surfaces: the same skill must provide the same answer to questions in Slack, in the IDE, in a dashboard tool, and in standalone agent sessions. We did this by ensuring one canonical source (the data repo) and that skill changes are synced automatically. On merge, the skill syncs to a plugin marketplace (for IDE users), to cloud-storage blobs (for hosted apps that read a single file), and is served directly as resources over MCP. We also designed for portability from the start by avoiding hardcoded repo paths and surface-specific namespaces.
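
The code-review hook described above (flag a reporting-model change that doesn't touch a skill file) reduces to a check over the changed paths in a PR. This is a sketch under assumed conventions: the directory names `models/reporting` and `skills`, the file extensions, and the `needs_skill_update` function are hypothetical, and the post does not describe how its hook is implemented.

```python
from pathlib import PurePosixPath

# Hypothetical repo layout.
MODEL_DIR = "models/reporting"
SKILL_DIR = "skills"


def needs_skill_update(changed_paths):
    """True when a reporting model changed but no skill markdown did."""
    paths = [PurePosixPath(p) for p in changed_paths]
    touches_model = any(
        p.is_relative_to(MODEL_DIR) and p.suffix == ".sql" for p in paths
    )
    touches_skill = any(
        p.is_relative_to(SKILL_DIR) and p.suffix == ".md" for p in paths
    )
    return touches_model and not touches_skill


pr_files = ["models/reporting/revenue_daily.sql", "dashboards/revenue.json"]
if needs_skill_update(pr_files):
    print("Reporting model changed without a skill doc update; please review.")
```

Flagging rather than blocking leaves room for the minority of model changes that genuinely need no doc update.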

§ 13

See below for a skeleton we use to create reference docs.

[Domain] Tables

Quick Reference

Business Context — [what this domain means in plain words]

Entity Grain — [what one row represents]

Standard Hygiene Filter — [the filter every query in this domain applies]

Dimensions

[How the key dimensions are encoded, and how the same concept is named differently across tables]

Key Tables

[table_name]

Grain: [...] · Scope/exclusions: [...]
Usage: [when to use it, when NOT to, join keys, required filters]

[... one short section per governed table ...]

Gotchas

[The wrong-answer modes a senior analyst would warn you about]

Best Practices / Common Query Patterns

[Default choices, standard cuts, worked patterns where the exact query form is the hard part]

Cross-References

[Neighboring domain docs that own adjacent questions]
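
Because every reference doc follows the same skeleton, a doc can be linted for missing sections. The sketch below assumes the section titles appear on their own lines (with or without leading `#` markers); the `missing_sections` helper is an illustration, not part of the post.

```python
# Section titles taken from the skeleton above.
REQUIRED_SECTIONS = [
    "Quick Reference",
    "Dimensions",
    "Key Tables",
    "Gotchas",
    "Best Practices / Common Query Patterns",
    "Cross-References",
]


def missing_sections(doc_text):
    """Return skeleton sections that never appear as a standalone line."""
    titles = {line.strip().lstrip("#").strip() for line in doc_text.splitlines()}
    return [section for section in REQUIRED_SECTIONS if section not in titles]


draft = """Growth Tables

Quick Reference

Dimensions

Key Tables

Best Practices / Common Query Patterns

Cross-References
"""

print(missing_sections(draft))
```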

§ 14

Finally, validation is how you find out which of the three failure modes is still leaking through. A common pattern we see is that data teams will set up elaborate analytic environments without having any process to understand the accuracy of their analytics agents. One way of addressing this gap is via offline evals, which are simple question / answer pairs. You can think of offline evals as similar to offline testing for an ML model: they don't tell you the performance of your online agents, but they do give you a good sense of whether you'll have any critical gaps. We deploy two kinds of offline evals at Anthropic. Dashboard-based evals are auto-generated by Claude (then human validated), covering the most common stakeholder questions. Long tail evals are where we feed Claude business context (roadmaps, table docs) and have it generate plausible questions across the rest of the domain. We also continuously harvest corrections: every time a stakeholder corrects the agent in a thread, that correction is a candidate eval. Other best practices include:

Anchor ground truth so it can't drift: An eval written against live data goes stale the moment the underlying number moves. Pin every eval to a snapshot date, write it against a stable fact table, or have the grader judge the agent's query rather than its number. Wire the suite into CI so a PR touching a dependency re-runs the affected evals.

Store results like telemetry, not like test logs: Every run lands in a warehouse table with the skill version, git SHA, model ID, per-assertion pass/fail, token count, and wall-clock time. "Did that change help?" becomes a query, and you get the time-series to catch slow regressions that a single CI run won't.

Gate launches per domain: A domain owner can't announce the agent to their stakeholders until their slice of the eval set clears some threshold (we initially used ~90%). It forces reference-doc fixes before users see the failures.

Create the appropriate number of evals: The number of evals you should have depends on the complexity of the business area and the complexity of the underlying data model. Calibrate by tracking how well offline accuracy predicts online accuracy: we've found there are diminishing returns past a few dozen per topic (e.g., "growth"), and that ceiling drops with each new model generation.

Offline eval accuracy should be ~100%; every correct answer should also be hitting your semantic layer (if you have one). Again, this level of accuracy doesn't tell you your system isn't going to produce a wrong answer, just that there are no obvious gaps, assuming you have proper eval coverage.
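
Storing eval runs as telemetry and gating launches per domain can be sketched with a small record type and a pass-rate query. The field names mirror the list above, but the `EvalResult` class, the helper functions, and the sample values are assumptions for illustration; the 0.90 threshold is the initial value the post mentions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    domain: str
    eval_id: str
    skill_version: str
    git_sha: str
    model_id: str
    passed: bool
    tokens: int
    wall_clock_s: float


LAUNCH_THRESHOLD = 0.90  # the post's initial per-domain gate (~90%)


def pass_rate_by_domain(results):
    """Aggregate pass rate per domain, as a warehouse query would."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [passed, total]
    for result in results:
        totals[result.domain][0] += int(result.passed)
        totals[result.domain][1] += 1
    return {domain: passed / total for domain, (passed, total) in totals.items()}


def launchable_domains(results, threshold=LAUNCH_THRESHOLD):
    """Domains whose eval slice clears the launch threshold."""
    rates = pass_rate_by_domain(results)
    return sorted(domain for domain, rate in rates.items() if rate >= threshold)


# Toy run: 9 of 10 growth evals pass, 3 of 4 finance evals pass.
run = [
    EvalResult("growth", f"g{i}", "v1", "abc123", "model-x", i != 0, 1000, 12.5)
    for i in range(10)
] + [
    EvalResult("finance", f"f{i}", "v1", "abc123", "model-x", i != 0, 1200, 15.0)
    for i in range(4)
]

print(pass_rate_by_domain(run))
print(launchable_domains(run))
```

Keeping `git_sha`, `skill_version`, and `model_id` on every row is what makes "did that change help?" answerable with a group-by rather than a log search.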

§ 15

Every structural decision about the skill (e.g., which sources to expose, whether a sub-agent earns its latency, whether to merge two skills into one) is made by holding our offline eval set fixed. We vary exactly one component and compare pass rates. Each run only takes an hour and replaces a lot of arguments. The methodology matters more than any single result:

Design for null results. Our most useful ablation was a negative one. We gave the agent direct grep access to our entire dashboard, transformation, and analyst-notebook SQL (thousands of files). We then verified in transcripts that it actually read them before every answer. Accuracy moved by less than a point in either direction. We then checked the obvious confounds: was the answer actually in the corpus for the questions it got wrong? About 80% of the time, yes. Did "answer present" predict "now gets it right"? No, the flip rate was flat. The information was there, the agent saw it, and it still didn't use it. That single experiment told us our bottleneck wasn't access to prior work, it was structure (i.e., mapping a question to the right entity). That insight redirected months of roadmap.

Ablate at PR granularity. Every meaningful skill edit gets a before / after run on the relevant eval slice, with the delta in the PR description. It keeps "I improved the docs" honest and catches the surprisingly common case where a well-intentioned addition makes things worse.

Keep a short list of what didn't work. Two of ours: stacking additional rounds of doc refinement past a certain point (we hit three consecutive net-negative iterations: the docs were getting longer, not better), and swapping the adversarial reviewer to a cheaper model to cut latency (it lost most of the accuracy wins, for no real speedup). Negative results are cheap to record and they prevent the next person from re-running the same experiment.
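
The confound check in the null-result ablation (does "answer present in the corpus" predict "now gets it right"?) amounts to comparing flip rates between two groups of previously wrong questions. The sketch below uses toy rows invented for illustration, not the post's data; `flip_rate` and the field names are hypothetical.

```python
def flip_rate(rows, answer_present):
    """Share of previously wrong questions that became correct, within one group."""
    group = [row for row in rows if row["answer_in_corpus"] == answer_present]
    if not group:
        return None
    return sum(row["correct_after"] for row in group) / len(group)


# Toy data: 20 questions the baseline got wrong; the answer was in the corpus for 16.
WRONG_AT_BASELINE = (
    [{"answer_in_corpus": True, "correct_after": True}] * 4
    + [{"answer_in_corpus": True, "correct_after": False}] * 12
    + [{"answer_in_corpus": False, "correct_after": True}] * 1
    + [{"answer_in_corpus": False, "correct_after": False}] * 3
)

present = flip_rate(WRONG_AT_BASELINE, answer_present=True)
absent = flip_rate(WRONG_AT_BASELINE, answer_present=False)
print(f"flip rate with answer present: {present:.2f}, absent: {absent:.2f}")
```

Equal flip rates in both groups are what "flat" means here: having the answer in the corpus made no difference to whether the agent recovered.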

§ 16

The final step is ensuring the actual online system performance is as accurate as possible. Some of the steps we take include:

Adversarial review: we've found that employing a Claude skill to aggressively challenge all underlying assumptions on a potential final answer increased accuracy by 6% within our eval set, but at the cost of 32% more tokens and 72% higher latency.

Provenance footer: every response carries a footer that states which source tier it came from (semantic layer › curated reference › raw table), how fresh the underlying data is, and who owns the model. It doesn't make the answer more correct, but it does help the consumer judge how much they can trust the response. A "raw table, freshness unknown" footer is a signal to verify before forwarding upstream, and it's one of the few mitigations we have for silent failures.

Data quality checks: it's possible that your agent is using the right field in the appropriate way, but the data itself is incorrect. Adding basic data quality checks to ensure the referenced field is up-to-date, complete, and has no anomalies is generally good hygiene.

Passive monitoring: two production signals we track continuously are the share of agent queries that resolve through the semantic layer, and the share of responses that use correction language ("that's the wrong table," "you're missing the fraud filter"). Both feed a dashboard reviewed weekly alongside the offline pass rate.

Active correction harvesting: the part that closes the loop. A scheduled agent scans stakeholder channels every few hours for similar correction language, drafts a one-line fix to the relevant reference doc, and opens a PR tagged to the domain owner. The fix path is deliberately boring (edit a markdown file, merge, auto-sync everywhere), so a domain owner doesn't spend too much time on the task. The same corrections feed back into the offline eval set.

The failure mode none of this fully catches is the silent one: the answer is wrong, but looks plausible and is used without objection. Our mitigations are the provenance footer, explicit human sign-off on anything leadership-bound, and a standing eval for each domain's top KPIs that sanity-checks against the blessed dashboard daily, though we don't have a robust solution yet.
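
A provenance footer of the kind described above can be sketched as a small formatter. The three tier names come from the post; the `Provenance` class, the footer wording, the owner names, and the hour-based freshness display are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Source tiers named in the post, in descending order of trust.
TIERS = ("semantic layer", "curated reference", "raw table")


@dataclass(frozen=True)
class Provenance:
    tier: str
    owner: str
    last_refreshed: Optional[datetime]  # None when freshness can't be determined


def footer(provenance: Provenance, now: datetime) -> str:
    """Render a one-line footer: source tier, data freshness, and model owner."""
    if provenance.tier not in TIERS:
        raise ValueError(f"unknown source tier: {provenance.tier!r}")
    if provenance.last_refreshed is None:
        freshness = "freshness unknown"
    else:
        age_hours = int((now - provenance.last_refreshed).total_seconds() // 3600)
        freshness = f"refreshed {age_hours}h ago"
    return f"Source: {provenance.tier} | {freshness} | owner: {provenance.owner}"


NOW = datetime(2026, 6, 15, 12, 0, tzinfo=timezone.utc)
print(footer(Provenance("raw table", "growth-data", None), NOW))
print(footer(
    Provenance("semantic layer", "finance-data",
               datetime(2026, 6, 15, 7, 0, tzinfo=timezone.utc)),
    NOW,
))
```

Rejecting unknown tiers keeps the footer vocabulary fixed, so a reader learns once what each tier implies about trust.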

§ 17

If you're starting from zero, a handful of canonical datasets, a few dozen offline evals, and a thin knowledge skill will capture most of the upside; everything else in this post is what we added once those were built. We also shared many best practices, and not all of them will be appropriate for every data team. Align with your organization on a few principles that will affect your approach by asking:

How important is a correct answer today vs. in the future? AI models are progressing at a rapid pace. We often see companies building a significant amount of infrastructure to account for current model shortfalls that become moot once those models improve. Knowing where models fall short, and waiting for model improvements to fill the gap, has significantly less overhead, but may not fit your company's risk tolerance.

How do you anticipate the complexity of your business to change over time? Some of the processes we discussed may be overkill if, for example, you don't produce much data, you only have a few consumers of the output, or your data model is likely to remain simple.

How technical is the intended audience of the output? Phrased differently, if you're building this analytics system for data scientists who can recognize when an answer is incorrect, you may be more tolerant of errors compared to a situation in which the audience has no familiarity with the underlying data model.

How much are you willing to spend for improved accuracy? We've found certain processes like adversarial validation can significantly improve accuracy, but often at a higher cost and latency.

What is your comfort around access controls and internal data privacy? Agents are often significantly more performant the more context they have; however, broad data access cuts against most companies' governance posture. This determines whether you're building one agent or many scoped ones.

Whatever your route, our greatest gains have come from addressing each of the three failure modes: collapsing ambiguity into a single governed answer, making the answer easily discoverable, and flagging when either has gone stale.
