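Building a self-improving tax AI with Codex: a production feedback loop in practice
OpenAI and Thrive Holdings jointly built Tax AI for Crete's network of accounting firms, using Codex to drive a self-improving loop. The system processed 7,000 tax returns, drafts returns with up to 97% accuracy, raises throughput by about 50%, and cut one senior accountant's tax preparation time from 180 hours to 15. The design rests on three pillars: practitioner feedback, production traces (a structured history from source files to the filed return), and a Codex-driven iteration loop. A rental property example shows in detail how a practitioner correction becomes an eval target, which Codex then uses to analyze the root cause and propose a patch. Useful for teams building self-improving agents in domains that depend on expert knowledge.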
Real-world systems behave differently in production than they do in a lab, breaking in ways that are hard to anticipate before deployment. Teams often discover those failures after launch, then spend weeks inspecting edge cases, adjusting prompts, and translating production feedback into durable product improvements. The feedback loop is manual and slow, and it only improves when an engineer advances it. But today, with thoughtfully designed eval infrastructure, direct access to practitioners and real-world environments, and the frontier agentic capabilities of Codex, you can build agents that self-improve.
In this post, we’ll unpack how we used Codex to build this type of agent. Over the past six months, OpenAI forward-deployed engineers and researchers collaborated with Thrive Holdings’ engineers to build Tax AI alongside and for Crete’s network of 30+ accounting firms to help prepare increasingly complex tax returns. Instead of relying on engineers to find and fix each failure, Tax AI uses Codex to turn production use into structured signals that fuel autonomous improvement.
Crete practitioners prepare tens of thousands of tax returns each season, which requires working through millions of underlying documents. For medium- to large-complexity filings, data entry alone can take eight hours per return, often involving messy data sources, prior-year documents, and manual extraction and calculation. They pointed us to tax preparation as a significant bottleneck during the busiest stretch of tax season.
To solve this problem, Tax AI processed 7,000 tax returns across the Crete firms that participated in the pilot this tax season. The system automates much of the time-intensive process of preparing 1040 and 1041 tax returns, but even more compelling than the efficiency gains is that the system itself is measurably better than the version that was first deployed three months ago.
In Tax AI, practitioners upload source files along with any client-specific notes. Tax AI then creates a tax engine submission, ready for review. It saves practitioners about a third of their time on tax preparation, drafts returns with up to 97% accuracy, and increases throughput by about 50%, creating more room for them to spend time with clients.
We can quantify this improvement by understanding how accurately Tax AI can complete a return without needing correction later. We measure accuracy by checking what share of returns reach 75%, 90%, or 100% correct field completion. At launch, only a quarter of returns were at 75% correct field completion, but within six weeks, 86% hit that mark. The system showed even faster growth at the 90% and 100% correct field completion levels. These thresholds give us a practical view of how much practitioner follow-up different returns still require.
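The threshold metric above can be sketched in a few lines. This is a minimal illustration, not the production scorer: the `(correct_fields, total_fields)` input shape, the function name, and the choice to treat thresholds as cumulative (a 100% return also counts toward 75% and 90%) are all assumptions.

```python
from typing import Dict, List, Tuple

# Thresholds named in the text, as whole percentages.
THRESHOLDS = (75, 90, 100)


def share_reaching_thresholds(
    returns: List[Tuple[int, int]],
    thresholds: Tuple[int, ...] = THRESHOLDS,
) -> Dict[int, float]:
    """For each threshold, return the share of returns at or above it.

    Each return is a hypothetical (correct_fields, total_fields) pair.
    Integer arithmetic avoids floating-point edge cases at the boundaries.
    """
    scored = [(c, t) for c, t in returns if t > 0]  # skip returns with no scored fields
    if not scored:
        return {th: 0.0 for th in thresholds}
    return {
        th: sum(1 for c, t in scored if c * 100 >= th * t) / len(scored)
        for th in thresholds
    }


# Four illustrative returns: 100%, 90%, 75%, and 50% correct field completion.
sample = [(40, 40), (36, 40), (30, 40), (20, 40)]
shares = share_reaching_thresholds(sample)
```

Tracking these shares week over week gives the kind of trend the text describes, such as the share of returns at the 75% level rising after launch.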
Early on, Tax AI handled simpler work, like W-2s and 1099s. As the season went on, it moved into more complex returns with K-1s, schedules, and harder edge cases. Each new capability saved more time per return than the last because the tasks it took on were harder and more time consuming to do manually. We continue to see ongoing progress today.
Next, we’ll walk through how our teams co-engineered Tax AI to be self-improving by leaning on three critical pillars: 1) expert practitioner feedback, 2) production traces (a structured history from inputs through final output), and 3) a Codex-driven iteration loop based on tailored evals to enable continuous, faster product development. We hope our experience will be useful to other builders in domains where practitioner expertise is key to shaping the quality of the overarching system and the data running through it.
As Tax AI expanded into more complex filings, the share of scored returns reaching 75%, 90%, and full completion continued to rise through tax season.
The problem
As we pushed into harder parts of tax preparation (K-1s, rental real estate schedules, and tax forms where values had to be reconciled across multiple source files), it became obvious that the real challenge was whether the product could make complex production failures visible, understandable, and actionable.
In the early days of the product, most of the correction was manual. Practitioners could correct system errors, but the product did not capture the full context: a changed value before filing might reflect a true extraction miss, a mapping problem, missing product support, or expected workflow noise. Sorting those cases out still required follow-up from the engineering team. Engineers could use coding agents, but the system was not yet designed to use AI meaningfully inside an improvement loop. We did not have the signal to identify the right hill to climb.
Our approach: a three-part loop
That led us to design the system around three pillars:
Stay close to practitioners: The people doing the work need to steer what the product learns. Their intuition and understanding reveal which errors matter and help inform which parts of the workflow are worth focusing on next.
Build the product so production creates evidence: The product has to capture more than just inputs and outputs; it needs to capture the full path from source material, to extracted fields and provenance, to downstream submission and expert correction.
Create a Codex-driven improvement loop: Once production issues are visible and structured, they can become findings, tailored evals, and scoped engineering tasks. Codex can then help investigate, propose changes, validate them against targeted and regression evals, and move the product forward faster than a purely manual iteration cycle.
The rental properties example below shows how that loop works in practice, walking you through how a practitioner correction becomes a structured finding, then an eval target, and finally a Codex-scoped engineering task.
Rental property example
Rental property income is reported on Schedule E of an individual tax return. From an engineering perspective, the task of extracting it is simple to describe but hard to do well. The system has to read messy source material (handwritten notes, emails, spreadsheets, and other client files), extract the rental-property fields the system can confidently map to the tax engine, and preserve enough evidence that a practitioner can approve or correct the result. The simplified example below shows what those source files and extracted outputs might look like.
A rental property source package is normalized into cited fields before those are mapped to downstream tax engine concepts.
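One way to picture "cited fields" is a record that carries its own evidence. The sketch below is an assumption-laden illustration: the class names, the `ENGINE_KEYS` mapping, and the rule that uncited values are held back are hypothetical, not the actual Tax AI schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Citation:
    document: str  # source file the value was read from
    page: int
    snippet: str   # text span that supports the value


@dataclass(frozen=True)
class CitedField:
    property_id: str
    name: str
    value: str
    citation: Optional[Citation]  # None means no supporting evidence was found


# Hypothetical mapping from normalized field names to tax-engine keys.
ENGINE_KEYS: Dict[str, str] = {
    "rents_received": "sch_e.rents_received",
    "fair_rental_days": "sch_e.fair_rental_days",
}


def map_to_engine(fields: List[CitedField]) -> Dict[str, str]:
    """Map only fields that are both cited and known to the engine mapping."""
    mapped: Dict[str, str] = {}
    for f in fields:
        if f.citation is not None and f.name in ENGINE_KEYS:
            mapped[f"{f.property_id}:{ENGINE_KEYS[f.name]}"] = f.value
    return mapped


fields = [
    CitedField("prop-1", "rents_received", "24000",
               Citation("ledger.xlsx", 1, "Total rent 24,000")),
    CitedField("prop-1", "fair_rental_days", "365", None),  # uncited, so held back
]
engine_payload = map_to_engine(fields)
```

Keeping the citation on the field is what lets a practitioner approve or correct a value against its source rather than re-deriving it.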
- A practitioner correction reveals a failure
A difference between the agent-predicted value and the actual value from the filed tax return might reflect a true extraction miss, but it could also be a practitioner preference, a value carried forward from a prior-year return in the tax engine, or a value introduced or changed elsewhere in the filing workflow. Practitioners helped us discern those cases so we could identify which actions required a practitioner correction or blocked a submission.
Because we could see these corrections in detail, we transformed the review process from a terminal, post-failure step into a continuous learning cycle. We designed the workflow to capture expert actions as structured data. Now, every intervention feeds the product's improvement loop by recording exactly what Tax AI proposed, what the practitioner modified, and what ultimately went into the filed return.
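As a rough sketch of what "capturing expert actions as structured data" could look like, the record below stores the three values the paragraph names: what Tax AI proposed, what the practitioner entered, and what was filed. The field names and the three outcome labels are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Intervention:
    return_id: str
    field: str
    proposed: Optional[str]      # what Tax AI proposed
    practitioner: Optional[str]  # what the practitioner entered, if anything
    filed: Optional[str]         # what went into the filed return

    def outcome(self) -> str:
        """Coarse label; real triage also needs practitioner review."""
        if self.filed == self.proposed:
            return "accepted"
        if self.practitioner is not None and self.filed == self.practitioner:
            return "corrected"
        return "changed_elsewhere"  # e.g. carried forward or edited downstream

    def to_json(self) -> str:
        return json.dumps({**asdict(self), "outcome": self.outcome()}, sort_keys=True)


event = Intervention("r-001", "fair_rental_days", None, "365", "365")
event_json = event.to_json()
```

The `changed_elsewhere` bucket matters because, as noted above, a difference from the filed return is not always an extraction miss.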
- Product traces turn corrections into evals
For a complex workflow like rental properties, the system has to preserve what happens between the source files and the filed return. Along that path, documents are organized, split, and classified; rental-property fields are extracted with citations back to the source material; those values are mapped into the tax engine; and practitioners may still correct them before filing. Those product-level traces make it possible to investigate where a failure occurred. To turn practitioner corrections into useful evaluation targets, the system processes them in three steps:
Capture the difference: Tax AI’s output is compared with the filed return to produce field-level review rows that capture the expected value, predicted value, and whether the difference appears actionable.
Group related failures: Similar review rows are grouped to separate recurring product failures from expected workflow noise. For example, repeated practitioner corrections might show that Tax AI often misses fair-rental-day fields, mishandles “other expenses,” or confuses multiple rental properties across the same source package.
Turn repeated patterns into eval targets: Once reviewed and measured, repeated findings become clear eval targets for Codex to improve.
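The three steps above can be sketched as a small pipeline. Everything here is illustrative: the `ReviewRow` shape, the `noise_fields` shortcut for deciding what is actionable (in the source, practitioners help make that call), and the `min_count` cutoff for "repeated" are assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class ReviewRow:
    return_id: str
    field: str
    expected: str             # value in the filed return
    predicted: Optional[str]  # value Tax AI produced, None if missing
    actionable: bool


def capture_differences(return_id: str, predicted: Dict[str, str],
                        filed: Dict[str, str], noise_fields: Set[str]) -> List[ReviewRow]:
    """Step 1: one review row per field where prediction and filing differ."""
    return [
        ReviewRow(return_id, name, expected, predicted.get(name), name not in noise_fields)
        for name, expected in filed.items()
        if predicted.get(name) != expected
    ]


def group_failures(rows: List[ReviewRow]) -> Counter:
    """Step 2: count actionable differences per field, dropping workflow noise."""
    return Counter(r.field for r in rows if r.actionable)


def eval_targets(groups: Counter, min_count: int = 2) -> List[str]:
    """Step 3: fields that fail repeatedly become eval targets."""
    return sorted(name for name, n in groups.items() if n >= min_count)


rows: List[ReviewRow] = []
rows += capture_differences("r-001", {"rents_received": "24000"},
                            {"rents_received": "24000", "fair_rental_days": "365"}, {"preparer_note"})
rows += capture_differences("r-002", {"rents_received": "9000", "preparer_note": "a"},
                            {"rents_received": "9000", "fair_rental_days": "120", "preparer_note": "b"},
                            {"preparer_note"})
targets = eval_targets(group_failures(rows))
```

In this toy data the missing fair-rental-day value recurs across two returns and becomes a target, while the differing preparer note is treated as expected noise.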
- The finding becomes a hill to climb for Codex
The third pillar is creating an engineering loop capable of acting on these new evals. This is where Codex becomes central.
Suppose our eval pipeline flags that Tax AI consistently misses the "fair rental days" field, while practitioners reliably fill it in. Because this finding has already been packaged into a targeted eval set, with representative source packages and expected outputs, Codex can investigate the root cause directly within the product scaffold.
Codex isn’t working solely with a sub-par final output. It inspects the trace, eval, repo, and skills together:
Investigate the pipeline: Inspect source packages, extraction schemas, mapper behavior, and code paths to determine whether the issue is an unsupported field, a missed extraction pattern, a source-selection problem, a mapper gap, or a grader issue.
Implement targeted fixes: Extend the extraction schema, improve source selection for rental-property documents, update the tax-engine mapper, or refine the grader if expected workflow noise is being counted as a failure.
Validate and propose: Rerun the targeted eval, run broader regression suites, and surface a candidate pull request for engineering review.
Close the loop: Turn a recurring practitioner correction into a measurable engineering task. If the evidence is ambiguous or not safely automatable, the case routes back to the product team instead of being forced through the loop.
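The validate-and-route logic in the steps above might be gated roughly as follows. This is a sketch under stated assumptions: the score fields, the `tolerance` parameter, and the three decision labels are hypothetical, and a candidate pull request still goes to engineering review either way.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROPOSE_PR = "propose_pull_request"
    ITERATE = "keep_iterating"
    ROUTE_TO_TEAM = "route_to_product_team"


@dataclass(frozen=True)
class EvalResult:
    targeted_before: float    # targeted eval score before the change
    targeted_after: float     # targeted eval score after the change
    regression_before: float  # broader regression suite score before
    regression_after: float   # broader regression suite score after


def decide(result: EvalResult, evidence_is_clear: bool, tolerance: float = 0.0) -> Decision:
    """Gate a candidate change on targeted improvement and no regression."""
    if not evidence_is_clear:
        # Ambiguous or not safely automatable: hand back to people.
        return Decision.ROUTE_TO_TEAM
    improved = result.targeted_after > result.targeted_before
    no_regression = result.regression_after >= result.regression_before - tolerance
    if improved and no_regression:
        return Decision.PROPOSE_PR
    return Decision.ITERATE


good = decide(EvalResult(0.40, 0.85, 0.92, 0.92), evidence_is_clear=True)
regressed = decide(EvalResult(0.40, 0.85, 0.92, 0.80), evidence_is_clear=True)
ambiguous = decide(EvalResult(0.40, 0.85, 0.92, 0.92), evidence_is_clear=False)
```

Checking the regression suite alongside the targeted eval is what keeps a fix for one field from quietly degrading others.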
How to use Codex to build this loop
The rental property example is emblematic of a broader reusable pattern: using production artifacts and traces to improve an agent’s capabilities. Given reviewed findings from production data, source traces, expected tax-engine output, relevant code examples, and eval commands as a set of inputs, Codex can materially improve performance and accuracy over weeks and months. This builds on the principles described in our work on harness engineering and Symphony, which walk through how to make tasks legible to Codex, provide scoped context and tools, and keep validation and human review part of the environment.
That evidence does not become a Codex task automatically. A practitioner correction may reflect an extraction miss, a mapping issue, unsupported product behavior, tax judgment, or expected workflow noise. Only after repeated differences have been reviewed and grouped into an actionable finding does the system turn them into a bounded task with a clear success condition.
We apply this automation to a bounded layer of the product. This layer performs extraction and maps source documents into tax workflows. Engineers remain responsible for architecture, product decisions, and shipping. Practitioners steer the improvement loop through the work they already do: correcting extracted values, reviewing returns, and approving final filings.
For Codex, the result is not a vague alert but a scoped engineering task with evidence, editable product surfaces, and explicit validation gates. The context for a representative rental property task can be summarized as follows:
A bounded Codex task environment separates the writable worktree [1] from read-only production context [5]. The worktree contains the scoped product surface Codex can inspect or modify [2], the targeted and regression evals that define success [3], and reusable skills/docs that encode how to run the task and respect prior decisions [4]. The read-only context provides the production trace, source documents, Tax AI prediction, finalized return, and tax-engine field documentation, so Codex can investigate the failure without mutating the underlying evidence.
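The separation between writable worktree and read-only context can be expressed as a small task description. The sketch below is hypothetical throughout: the paths, the field names, and the `can_write` check are illustrative and do not describe how Codex itself enforces sandboxing.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple


def _is_under(path: PurePosixPath, root: PurePosixPath) -> bool:
    try:
        path.relative_to(root)
        return True
    except ValueError:
        return False


@dataclass(frozen=True)
class TaskEnvironment:
    worktree: PurePosixPath           # writable checkout
    product_surface: Tuple[str, ...]  # code the task may inspect or modify
    evals: Tuple[str, ...]            # targeted and regression evals that define success
    skills: Tuple[str, ...]           # reusable skills and docs
    readonly_context: PurePosixPath   # trace, source documents, prediction, finalized return

    def can_write(self, path: str) -> bool:
        """Allow writes only inside the worktree; reject '..' traversal outright."""
        p = PurePosixPath(path)
        if ".." in p.parts:
            return False
        return _is_under(p, self.worktree) and not _is_under(p, self.readonly_context)


env = TaskEnvironment(
    worktree=PurePosixPath("/task/worktree"),
    product_surface=("extractors/rental.py", "mappers/schedule_e.py"),
    evals=("evals/rental_targeted", "evals/regression"),
    skills=("skills/run_evals.md",),
    readonly_context=PurePosixPath("/task/context"),
)
```

Keeping the evidence outside the writable tree means a change can only pass the evals by changing the product, not by changing the data it is graded against.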
Expanding to new domains
The same loop applies beyond rental properties. Rental properties took about six weeks and substantial engineering oversight to reach 90% precision and recall, but that work produced reusable abstractions, review artifacts, eval conventions, and implementation patterns that made it easier to support similarly complex schedules such as Schedule C and Schedule A.
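The post does not say how precision and recall are computed for rental properties. One plausible field-level definition, offered purely as an assumption, counts a predicted field as correct when it exactly matches the filed value:

```python
from typing import Dict, Tuple


def precision_recall(predicted: Dict[str, str], filed: Dict[str, str]) -> Tuple[float, float]:
    """Field-level precision and recall against the filed return.

    precision = correct predictions / all predicted fields
    recall    = correct predictions / all fields in the filed return
    """
    correct = sum(1 for name, value in predicted.items() if filed.get(name) == value)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(filed) if filed else 0.0
    return precision, recall


# Illustrative values only: one wrong prediction and one missed field.
predicted = {"rents_received": "24000", "taxes": "3100", "insurance": "900", "repairs": "450"}
filed = {"rents_received": "24000", "taxes": "3100", "insurance": "900", "fair_rental_days": "365"}
precision, recall = precision_recall(predicted, filed)
```

Under this definition, precision penalizes values the system got wrong and recall penalizes fields it never filled, which is why the text reports both.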
Tax AI proves a path to building self-improving agents. Practitioners generate high-value feedback signals by delivering the service. Product workflows preserve those signals as structured evidence. Eval-backed engineering systems validate improvements before they reach production, and an agent-powered loop keeps the system in a continuous self-improving flow.
Thrive Holdings’ structure allows us to replicate this environment in specific industries. Holdings is both an owner and operator, so our combined engineering teams are able to work directly with practitioners and production data from inside businesses like Crete, not as a vendor but as partners. This means the technology, the product, and the service all sit under one roof to help us move faster and build exceptional products.
One senior accountant who spent 180 hours on tax prep last year spent only 15 hours on it this year. She put part of that time toward calling every one of her clients and walking them through their returns, a level of high-touch service that wasn’t possible a year ago. The rest of that time she used to take on new clients and expand to new service offerings.
Together, our teams are now using the same three-part design from Tax AI as a blueprint for building workflows in other domains across Thrive Holdings: accounting workflows such as bookkeeping and audit, and operational workflows such as IT help desk automation. Across domains and industries, the broader promise of self-improving agents holds. The best agents are steered by people to learn to become more capable, more trusted, and more valuable over time.
To learn more about the OpenAI team that worked on this project, get in touch.