The Future Of Software Engineering with Anthropic
A summary of a roundtable on the future of software engineering, featuring leaders from Stripe, NVIDIA, Microsoft, and others. Key insights: closed-loop development creates compounding gains; test-first is the new default; human code review is fading; comments are written for AI readability; long-horizon tasks remain unsolved; developer tooling is being displaced first; hiring values experimentation over raw skill; human-authored context files help, agent-authored ones can hurt. Candid trade-offs and real-world practices are shared.
Sivesh and I recently hosted a roundtable on the future of software engineering with Anthropic’s Ash Prabaker and we were joined by engineering leaders from Stripe, NVIDIA, Microsoft, Google DeepMind, xAI, Apple, Scale AI, as well as the legend Peter Steinberger of OpenClaw/OpenAI.
我和 Sivesh 最近与 Anthropic 的 Ash Prabaker 共同主持了一场关于软件工程未来的圆桌讨论,参与讨论的工程领袖来自 Stripe、NVIDIA、微软、Google DeepMind、xAI、Apple、Scale AI,以及 OpenClaw/OpenAI 的传奇人物 Peter Steinberger。


The session opened with a retelling of the Claude Code origin story, much of which has been covered in public interviews. It began as a simple terminal UI in late 2024, was rough at first, and was built against a guiding principle of designing for where models would be in six to twelve months rather than where they were that day. Adoption was organic — an IC-driven project that scaled through demonstrated value rather than mandate.
会议以讲述 Claude Code 的起源故事开场,其中很多内容在公开采访中已有提及。它最初是 2024 年底一个简单的终端 UI,初期很粗糙,其构建原则是面向未来六到十二个月的模型能力来设计,而非当天的状态。采用过程是自发的——这是一个由个人贡献者驱动的项目,通过展现价值来扩展,而非通过命令。
A major thread throughout the discussion was “closed-loop” development. One participant described a setup at their company where bug reports are automatically triaged by an agent, bucketed by severity, checked against an eval set, and then a fix PR is opened — much of it running with minimal human touch. The room broadly agreed that this kind of loop is where compounding gains actually come from: better coding tools improve the models, better models improve the coding tools. Several people noted their companies are prioritizing coding specifically because of this dynamic.
整个讨论的一个主线是“闭环”开发。一位参与者描述了其公司的设置:错误报告由代理自动分类,按严重性分桶,对照评估集检查,然后提交修复 PR——整个过程几乎不需要人工干预。与会者普遍认为,这种循环正是复利效应的来源:更好的编码工具改进模型,更好的模型改进编码工具。多人指出,他们的公司正是因为这种动态而优先投入编码领域。
Participants compared notes on what’s shifting in their engineering practice:
Test-first has become the default. Multiple people said they now define test cases first and let the agent build against them — described as the only sane way to handle the volume of PRs being generated.
Two tiers of evals. One participant outlined their team’s approach: regression evals that must stay at 100% and run on every PR, plus frontier evals for new capabilities. Others in the room recognized the pattern.
与会者交流了各自工程实践中的变化:
测试先行已成为默认。多人表示,现在他们先定义测试用例,再由代理根据测试构建代码——这被称为应对大量 PR 的唯一理智方式。
双层评估。一位参与者概述了其团队的方法:回归评估必须保持 100% 通过率,并在每个 PR 上运行;同时针对新能力的前沿评估。其他与会者认可这种模式。
Don’t mandate adoption. There was strong consensus here. One attendee described using competitions, hackathons, and casual incentives instead of top-down requirements — arguing that forced usage breeds resentment, whereas letting people see early adopters’ results drives proliferation naturally.
不要强制采用。这里达成了强烈共识。一位参与者描述了使用比赛、黑客马拉松和随意激励代替自上而下的要求——理由是被迫使用会滋生怨恨,而让人们看到先行者的成果自然会推动普及。
Code review is in flux. One participant admitted that human reviewers at their company often just click approve within minutes because the AI review layer has gotten good enough. When pushed on where this ends up, they acknowledged the mandatory-human-review model will eventually become inefficient — and suggested they may already be past that point for some repos.
代码审查正在变化。一位参与者承认,他们公司的人工审查者通常几分钟内就点击批准,因为 AI 审查层已经足够好。当追问最终走向时,他们承认强制人工审查模式终将变得低效——并且暗示对于某些仓库,他们已经越过了这个点。
Comments are back. A cultural reversal several people found amusing: engineers initially hated the verbose comments agents generated, but the consensus is now swinging toward leaving them in, because the next agent session finds them useful. One person put it as “we’re writing code for AI readability as much as human readability now.”
注释又回来了。多人发现这种文化逆转很有趣:工程师最初讨厌代理生成的冗长注释,但现在共识正转向保留它们,因为下一次代理会话会发现它们有用。有人总结道:“我们现在写代码既为了人类可读性,也为了 AI 可读性。”
Life in the terminal. One participant described their personal workflow as: plan, verify the plan, implement via the agent, move on — without reading generated code line by line. This prompted some debate about when that’s safe and when it isn’t.
终端生活。一位参与者描述了其个人工作流:计划、验证计划、通过代理实现、继续前进——而不逐行阅读生成的代码。这引发了关于何时安全、何时不安全的辩论。
Not all code is treated the same. Participants generally agreed that anything involving destructive actions (data loss, permission escalation) or core infrastructure deserves higher human review, while internal prototypes don’t need the same bar as public-facing code. Where exactly to draw the line varied by company.
并非所有代码都一视同仁。与会者普遍认为,涉及破坏性操作(数据丢失、权限提升)或核心基础设施的代码需要更严格的人工审查,而内部原型不必达到与面向公众代码相同的标准。具体如何划界因公司而异。
The room converged on long-horizon tasks as the real frontier problem. One participant noted that product engineering has started to go exponential for them, but closing the loop on more complex research workflows isn’t there yet. The open questions everyone shared: what do you actually assign an agent for a four- or five-hour run? How do you observe it? How do you keep a human in the loop without babysitting? Nobody had a clean answer.
与会者一致认为长周期任务是真正的前沿难题。一位参与者指出,产品工程对他们已开始呈指数级增长,但在更复杂的研究工作流上闭环尚未实现。大家共同的开放问题:你究竟能给代理分配什么样的四五个小时运行任务?如何观察?如何让人参与而不成为保姆?没人有明确答案。
Discussion of how the industry has swung on sandboxing — first toward it for safety, then away from it for convenience, now back toward it with more nuance (remote coding agents, sandbox-per-session). The practical pain points people raised were compute for long-running sessions, permissioning, and enterprise deployment.
讨论了行业在沙盒化上的摇摆——最初为安全而采用,后来为方便而放弃,现在又以更微妙的方式回归(远程编码代理、按会话分配沙盒)。人们提出的实际痛点包括长运行会话的计算资源、权限管理以及企业部署。
One participant described early-stage internal prototypes where agents with access to logs, source control, and chat systems handle incident triage and debugging — reducing the on-call burden even though the systems aren’t production-grade yet. A side effect several people found interesting: engineers without infra backgrounds can now contribute to infra work because agents fill the knowledge gaps.
一位参与者描述了早期的内部原型:代理可以访问日志、源码控制和聊天系统,处理事件分类和调试——尽管系统尚未达到生产级别,但已减轻值班负担。多人发现一个有趣的副作用:没有基础设施背景的工程师现在也能参与基础设施工作,因为代理填补了知识空白。
Someone asked how you manage context at scale when thousands of people are changing things every minute. The honest answer from the room was that nobody has this figured out. One participant admitted their approach is basically unstructured — ad hoc chat threads that agents get MCP access to read, plus a strong writing culture but no formal documentation process.
有人问,当每分钟有数千人修改东西时,如何管理大规模上下文。与会者的诚实回答是:没人解决了这个问题。一位参与者承认他们的方法基本是非结构化的——代理通过 MCP 访问临时的聊天线索,加上强大的写作文化,但没有正式文档流程。
A study was mentioned suggesting that pre-loaded markdown context files can sometimes underperform versus letting agents traverse the codebase from first principles. The counter offered was that this probably reflects stale or agent-generated context. The takeaway people seemed to agree on: human-authored context files help, agent-authored or stale ones can actively hurt. Humans have to supply the insight.
有人提到一项研究,表明预加载的 markdown 上下文文件有时不如让代理从基本原理遍历代码库。反驳认为这可能反映了过时或代理生成的上下文。人们似乎同意的要点是:人类编写的上下文文件有帮助,代理编写或过时的文件可能起反作用。洞察必须由人类提供。
When hiring came up, the most striking claim was that the trait one participant now screens hardest for isn’t raw engineering skill — it’s willingness to experiment constantly at the bleeding edge. Their best performers are the ones who understand model limits deeply enough to know when to trust the output and when to intervene. Another attendee noted their core infrastructure teams have stayed lean because AI-assisted cross-pollination lets product engineers contribute outside their usual domain.
谈到招聘时,最引人注目的说法是,一位参与者现在最看重的是持续在最前沿实验的意愿,而非原始工程技能。他们最好的员工是那些深刻理解模型局限,知道何时信任输出、何时干预的人。另一位与会者指出,他们的核心基础设施团队保持精简,因为 AI 辅助的交叉融合使产品工程师能跨领域贡献。
Participants traded stories about which tool categories they’ve replaced internally:
Incident management — one person said their team ripped out their vendor because it was too complicated for how people actually worked.
Auth layers — one participant claimed to have migrated auth systems several times in six months, each migration taking hours not weeks.
Project tracking — someone is building custom UIs on top of their coding agent for managing engineering work, and floated that this whole category might be next.
Internal micro-tools — link shorteners and similar utilities were the easy wins several people mentioned.
与会者分享了各自在内部替换的工具类别:
事件管理——有人说他们团队撤掉了供应商,因为对于实际工作方式来说太复杂了。
认证层——一位参与者声称在六个月内多次迁移认证系统,每次迁移只需几小时而非几周。
项目跟踪——有人正在其编码代理之上构建自定义 UI 来管理工程工作,并暗示整个类别可能是下一个。
内部微工具——链接缩短器及类似工具是多人提到的低垂果实。
The pattern everyone noticed: it’s all developer tooling so far, because that’s where engineers have agency and speed. Business-facing software (CRMs, etc.) is stickier. One view was that incumbent business tools survive not because they’re good but because nobody has shipped a compelling AI-native replacement — just incremental add-ins.
A counterpoint from the room: the opportunity-cost argument (”we should focus on what we’re best at”) may always hold, which means labs might never prioritise building SaaS replacements over improving models.
大家注意到的模式:迄今为止都是开发者工具,因为那是工程师有自主性和速度的领域。面向企业的软件(CRM 等)更具粘性。一种观点认为,现有企业工具存活下来不是因为他们好,而是因为没有人推出令人信服的 AI 原生替代品——只有渐进式的附加组件。
来自会场的反驳:机会成本的论点(“我们应该专注于我们最擅长的”)可能始终成立,这意味着实验室可能永远不会优先构建 SaaS 替代品,而不是改进模型。
A startup founder in the room raised the flip side: because AI makes everything feasible, prioritisation is harder, not easier. Six months ago, rebuilding a tool internally was obviously not worth it. Now it takes a night. Teams get overloaded by the sheer volume of things they could do. Nobody had a great answer beyond defining clear swim lanes and giving individuals ownership of mini-companies inside the org.
会场中一位创业公司创始人提出了反面:因为 AI 让一切变得可行,优先级排序更难了,而非更容易。六个月前,在内部重建一个工具显然不值得。现在只需一夜。团队因大量可做之事而不堪重负。除了划定清晰的泳道并将内部小公司的授权给个人之外,没人给出好的答案。
When someone asked about code quality standards, the response was that the definition is shifting. “Good code” used to mean human-centric things — simple, easy to maintain, easy to contribute to. Now it has to account for AI readability too. The practical view from the room: strong regression evals and test-first discipline matter more than clean-code aesthetics.
当有人问及代码质量标准时,回答是定义正在演变。“好代码”过去意味着以人为中心——简单、易于维护、易于贡献。现在还必须考虑 AI 可读性。会场的实际观点是:强大的回归评估和测试先行纪律比整洁代码美学更重要。
“The purple gradient vibe” got a laugh — everyone recognized the AI-generated-UI aesthetic. The catch-22 someone identified: if you update a model’s taste profile, everyone uses it, and the new aesthetic just becomes the next generation of slop. Someone also noted that some models actively steer users toward particular frameworks, which functions as a form of lock-in.
“紫色渐变 vibe” 引来笑声——大家都认出了 AI 生成的 UI 美学。有人指出的两难:如果你更新模型的品味配置,所有人都会使用,新美学只会成为下一代 “slop”。还有人指出,一些模型会主动将用户导向特定框架,起到一种锁定效果。
One attendee raised a concern that everyone coding with the same models making the same suggestions will collapse the industry onto the same tools and patterns. The pushback was that this was a bigger risk with earlier model generations, which were much stronger at popular web stacks than at legacy or niche languages — and that gap is closing. Code modernisation of legacy systems came up as an area improving fast.
一位与会者担忧,每个人都用相同模型、收到相同建议,会让行业坍缩到相同的工具和模式上。反对观点认为,早期模型代际的风险更大,因为当时对流行 Web 技术栈的优势远大于对传统或小众语言——而这一差距正在缩小。遗留系统的代码现代化被提及为一个快速改进的领域。
General agreement that the direction of travel is toward asynchronous background agents — remote sandboxes, monitorable from a phone, persisting across hours or days. One person noted that multi-hour autonomous runs are only recently becoming routine for them, having been experimental until not long ago.
普遍认为前进方向是异步后台代理——远程沙盒、可从手机监控、持续数小时或数天。有人指出,数小时的自主运行最近才成为他们的常规,直到不久前还是实验性的。
Asked how much recent improvement is model weights versus harness, one participant’s view was that both matter but on different cadences — big leaps come from model steps, and the harness philosophy should be “get out of the way of the model.” They described a stripped-down prototype — basically a current-gen model with a system prompt and bash access — that performs surprisingly well, which wouldn’t have worked a few generations ago.
被问及最近改进多少来自模型权重、多少来自脚手架时,一位参与者的观点是两者都重要,但节奏不同——大的飞跃来自模型进步,而脚手架的理念应该是“别挡模型的路”。他们描述了一个极简原型——基本上是一个当前代模型,配有系统提示和 bash 访问——表现惊人,这在几代之前还不行。
Someone from a fintech background asked about regulated deployment. The room’s read: the most successful AI startups in regulated industries (legal tech was the example) are still fundamentally human-in-the-loop chat-with-document products. Nobody has made the jump to autonomous agents in regulated workflows. The bar is asymmetric — analogous to self-driving cars, where AI has to be dramatically better than a human to be accepted. Better explainability and structured audit trails were floated as the unlock.
一位有金融科技背景的人问及受监管环境的部署。会场的解读:受监管行业(以法律科技为例)最成功的 AI 初创公司本质上仍然是人在回路的文档聊天产品。还没有人跃入到受监管工作流中的自主代理。门槛是不对称的——类似于自动驾驶汽车,AI 必须显著优于人类才能被接受。更好的可解释性和结构化的审计追踪被认为是突破口。
Refreshingly low-tech answer: git worktrees and ten terminal tabs. More sophisticated orchestration is being built by third parties, but nobody in the room claimed to have it solved.
令人耳目一新的低技术答案:git worktrees 和十个终端标签。更高级的编排正由第三方构建,但会场上没人声称已经解决。
Someone observed that getting engineers to adopt AI tools — finding champions, overcoming resistance, managing change — is exactly the digital transformation problem other industries have faced for years. The irony of applying it to the engineers who built those transformation tools was not lost on the room.
有人观察到,让工程师采用 AI 工具——寻找拥护者、克服阻力、管理变革——正是其他行业多年来面临的数字化转型问题。将这一问题应用在构建转型工具的工程师身上,其讽刺意味并未被会场忽视。
Closing question: will agents start writing closer to the metal, bypassing the abstraction layers that exist for human convenience? The view was yes, eventually, but only when the model decides it serves performance — not because lower-level code is easier for models. Current models still benefit from well-structured, well-commented, human-readable code. Someone noted a trend toward Rust in startups, driven partly by AI flattening the learning curve.
最后一个问题:代理是否会开始更接近底层编写代码,绕过为人类便捷而存在的抽象层?观点是最终会的,但只有当模型认为这有助于性能时——不是因为底层代码对模型更容易。当前模型仍然受益于结构良好、注释清晰、人类可读的代码。有人指出,创业公司中 Rust 的趋势部分由 AI 拉平学习曲线所驱动。
The recursive loop is real. Better coding tools produce better models, which produce better coding tools. Multiple participants said this is why their companies are prioritising coding.
递归循环真实存在。更好的编码工具产出更好的模型,更好的模型产出更好的编码工具。多位参与者表示,这正是他们公司优先投入编码的原因。
The bottleneck has moved from writing code to managing long-horizon tasks and deploying agents in regulated settings.
Developer tooling is being displaced first. Business-facing software with network effects is holding.
The human role is shifting from writing and reviewing to planning, evaluating, and steering — and the best performers are the ones who stay at the bleeding edge.
Enterprise adoption is gated by permissioning, sandboxing, and regulatory caution more than by model capability.
Slop and convergence are real concerns when millions of people use the same models to make the same choices.
Context remains unsolved. Human-authored context helps; stale or agent-generated context can hurt.
瓶颈已从编写代码转移到管理长周期任务和在受监管环境中部署代理。
开发者工具首先被替代。具有网络效应的面向企业的软件仍在坚守。
人类角色正从编写和审查转向规划、评估和引导——最优秀的人是那些持续处于前沿的人。
企业采用的门槛在于权限管理、沙盒化和监管谨慎,而非模型能力。
当数百万人使用相同的模型做出相同选择时,slop 和趋同是真实的担忧。
上下文仍未解决。人类编写的上下文有帮助;过时或代理生成的上下文可能有害。