Imagine Naked People Were Stupider. Naked Models Are.

Source x.com Glean’d 2026-06-20 06:01 Read 18 min
AI summary

YC partner Garry Tan responds to Kyle Kingsbury's anti-AI essay, arguing that Kingsbury's tests of naked models are like testing an engine on a bench and concluding cars are unsafe. The article details the 'thin harness, fat skills' architecture: skill files (reusable Markdown procedures) constrain model input, resolvers (routing tables) dispatch tasks, deterministic code handles precision operations, and testing covers the full pipeline. Using Kingsbury's own bathroom rendering and stock data hallucination examples, Tan shows how architecture can turn unreliable models into reliable systems, and shares a personal resolver that reduced file misfilings from 10/13 to zero. The automotive metaphor concludes: seatbelts, traffic lights, and crumple zones made cars safe, not skepticism of engines. Targeted at engineers building or evaluating AI systems.

§ 1

Sorry, not that kind of naked model. But the disappointment you're feeling right now? That's exactly how Kyle Kingsbury feels about LLMs.

Kyle Kingsbury is one of the best systems engineers alive. His Jepsen project spent a decade methodically proving that distributed databases didn't work as advertised: CockroachDB, MongoDB, Redis, and dozens of others made consistency guarantees they couldn't keep. He published the results, the vendors fixed the bugs, and the entire industry got more honest. Jepsen is a masterwork of applied skepticism.

Last week he published a 32-page essay called "The Future of Everything is Lies, I Guess: Bullshit About Bullshit Machines." It's beautifully written, deeply researched, genuinely funny, and wrong about the most important thing.

His observations are correct. His conclusion is not.

A note on scope: this essay addresses Kingsbury's technical claims: that LLMs are unreliable bullshit machines incapable of producing trustworthy output. His broader concerns about labor displacement, information ecology, and cultural impact are real, separate questions that deserve their own essays. I'm not dismissing them by not addressing them here. I'm addressing the architecture question: does model unreliability make useful systems impossible, or does it make them an engineering problem? I think it's the latter. Kingsbury's essay assumes the former. That's where we disagree.

§ 2

Kingsbury's essay is structured as a catalogue of LLM failures. He asked Gemini to apply materials to a 3D bathroom rendering. It forgot the toilet and changed the room's shape. He asked Claude to do image-to-image transformation. It produced thousands of lines of JavaScript creating an incomprehensible garble of nonsense polygons. He asked ChatGPT to put white patches on a blue shirt. It changed the color, moved the patches, deleted them. He watched a colleague's LLM claim to download stock data and produce a graph of randomly generated numbers.

Every single one of these failures is real. I've seen failures like them. Everyone building with LLMs has. Kingsbury is not making anything up.

But here's what's happening in each example: a human sits in front of a raw language model, types a request in natural language, and watches it fail.

No skill file telling the model how to approach the task. No deterministic tool handling the parts that require precision. No resolver routing the request to the right capability. No harness managing context, enforcing safety, or constraining behavior.

He's testing the engine on a bench and concluding that cars are unsafe.

§ 3

Some terms for readers who haven't read the first essay in this series. A skill file is a reusable markdown document that teaches the model how to approach a task — a procedure, not a prompt. A resolver is a routing table that tells the model which document to read for which task. Deterministic code is software that produces the same output every time — SQL queries, API calls, math — the parts the model should never touch. A harness is the thin conductor that runs the model in a loop, reads files, and manages context. Together: thin harness, fat skills.

Kingsbury's central claim is that LLMs are "bullshit machines." They confabulate, they're chaotic, they're vulnerable to manipulation, and they can't be trusted. He arrives at this by testing models in isolation, the way you'd test a function: input in, output out, evaluate the output.

The people having Kingsbury's problems are the ones doing exactly this: chatting with a raw model and expecting reliable output. The people who aren't having those problems built harnesses. Not because harnesses make the model trustworthy. Because harnesses make the system trustworthy, even when the model inside it is not.

Kingsbury knows the harness exists, incidentally. He cites the Claude Code source leak: 512,000 lines of engineering that Anthropic built around their own model. Even the makers of the best LLM in the world don't trust the model naked. They wrapped it in live repo context, prompt caching, purpose-built tools, session memory, and parallel sub-agents. That's 512,000 lines of evidence that model reliability is an engineering problem, not a philosophical one.

But I want to be honest about something: 512,000 lines is not simple. The pattern I'll describe (skills, resolvers, deterministic code, testing) is conceptually clean. Robust implementation is real engineering. Harnesses fail. Skills encode wrong procedures. Verification layers are only as good as the invariants they check. The claim is not that harnesses make everything easy. The claim is that they turn model unreliability from a reason-to-stop into a problem-to-solve. That distinction matters.

§ 4

The stock data example is the clearest illustration. An LLM claimed to download stock prices and produced a graph. The data was random. Kingsbury presents this as evidence that LLMs lie. But what actually happened? Someone asked a language model (a text prediction machine) to fetch data from the internet. It can't fetch data from the internet. It has no tools. It has no HTTP client. It has no API keys. So it did what language models do: it produced text that looked like what a stock data response would look like.

The fix isn't a better model. The fix is a deterministic tool: a function that actually calls a stock API, returns real numbers, and hands them to the model as context. The model never touches the data retrieval. It decides what to look up. The code decides how. Same input, same output, every time.
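
A minimal sketch of that split, under stated assumptions: `ToolCall`, `fetch_prices`, `run_tool`, and the hard-coded `_FAKE_PRICES` table are hypothetical names invented for this illustration, and the price table stands in for a real stock API so the example runs offline. The model's only contribution is the `ToolCall`; the harness does the retrieval.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a real market-data client. A production tool would
# call an actual stock API here; the table is hard-coded so the sketch runs offline.
_FAKE_PRICES = {"ACME": [101.2, 102.8, 99.5], "GLOBEX": [54.0, 55.1, 53.7]}


@dataclass(frozen=True)
class ToolCall:
    """What the model emits: a decision about what to look up."""
    name: str
    symbol: str


def fetch_prices(symbol: str) -> list[float]:
    """Deterministic retrieval: same symbol in, same numbers out, or an error."""
    if symbol not in _FAKE_PRICES:
        raise KeyError(f"unknown symbol: {symbol}")
    return list(_FAKE_PRICES[symbol])


# Registry of tools the harness is willing to execute.
TOOLS: dict[str, Callable[[str], list[float]]] = {"fetch_prices": fetch_prices}


def run_tool(call: ToolCall) -> list[float]:
    """The harness executes the call; the model never produces the data itself."""
    if call.name not in TOOLS:
        raise ValueError(f"model requested a tool that does not exist: {call.name}")
    return TOOLS[call.name](call.symbol)


prices = run_tool(ToolCall(name="fetch_prices", symbol="ACME"))
print(prices)
```

If the model asks for a tool or a symbol that doesn't exist, the result is an exception rather than plausible-looking numbers, which is the failure mode you want.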

That's the architecture I described in "Thin Harness, Fat Skills." Push intelligence up into skills. Push execution down into deterministic code. Keep the harness thin. When you do this, the class of failures Kingsbury describes becomes far less likely. Not because the model got smarter, but because the model was never asked to do something it can't do. Not all failures vanish. But the failure mode shifts from "the model hallucinated" to "the skill was wrong" or "the tool had a bug." And those are debuggable, testable, fixable problems. That's the difference between chaos and engineering.

§ 5

Take Kingsbury's bathroom example. He asked Gemini (a language model) to apply materials to a 3D rendering. Gemini isn't an image editor. It's not a 3D modeling tool. It's a text prediction system that has been given image capabilities as a bolted-on feature. Of course it forgot the toilet. Of course it changed the room's shape. It was playing improv with pixels.

A properly harnessed system would handle this differently. The skill file would say: decompose the task into steps.

  • Step 1: identify every surface in the image (use a vision model)
  • Step 2: for each surface, select the appropriate material (latent — the model decides)
  • Step 3: apply the material using a deterministic image processing tool (code — Pillow, OpenCV, a Blender script)
  • Step 4: verify the output geometry matches the input geometry (deterministic comparison)

The model does the judgment. The code does the execution. The resolver routes between them. The harness orchestrates the sequence.

Would this always work? No. The skill might decompose the task wrong. The vision model might misidentify a surface. The deterministic comparison might have the wrong tolerance. These are real failure modes, and I've hit every one of them. But they're debuggable failure modes — you can find the step that went wrong, fix it, and run it again. Kingsbury's failures aren't debuggable because there's no system to debug. There's just a model playing improv.
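
Step 4 is the easiest piece to picture in code. The sketch below assumes, purely for illustration, that an earlier step has already reduced each image to a dictionary of named surfaces with pixel bounding boxes; the `Box` layout, the `geometry_mismatches` name, and the 2-pixel default tolerance are all hypothetical choices, not part of any real pipeline.

```python
Box = tuple[int, int, int, int]  # x, y, width, height in pixels (assumed layout)


def geometry_mismatches(
    before: dict[str, Box], after: dict[str, Box], tol: int = 2
) -> list[str]:
    """Return every way the output geometry diverges from the input geometry."""
    problems = []
    for name, box in before.items():
        if name not in after:
            problems.append(f"missing surface: {name}")
        elif any(abs(a - b) > tol for a, b in zip(box, after[name])):
            problems.append(f"moved or resized: {name}")
    # Surfaces that appear only in the output are also a geometry change.
    for name in sorted(after.keys() - before.keys()):
        problems.append(f"unexpected surface: {name}")
    return problems


before = {"floor": (0, 300, 640, 180), "toilet": (400, 220, 90, 120)}
after = {"floor": (0, 301, 640, 180)}  # the toilet has vanished
print(geometry_mismatches(before, after))  # ['missing surface: toilet']
```

A forgotten toilet becomes a named, reproducible check failure. Whether 2 pixels is the right tolerance is exactly the kind of thing you find out by running it and adjusting.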

§ 6

One of Kingsbury's best observations is the "jagged technology frontier" — the idea that LLMs have irregular, unpredictable capability boundaries. They do multivariable calculus and fail simple word problems. They write essays and can't count letters.

This is correct and important. But Kingsbury draws the wrong conclusion. He argues that the jagged frontier makes LLMs unsuitable for tasks requiring reliability. What it actually means is that you need routing.

A resolver is a routing table for context. When task type X appears, load skill Y. When the task requires letter-counting, route it to code. A three-line Python function. When the task requires essay writing, route it to the model. When the task requires image editing, route it to a pipeline that combines both.
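
As a rough illustration, here is a resolver expressed as a Python dictionary. Everything in it is an assumption made for the sketch: the task-type keys, the `skills/...` paths, and the `resolve` helper are hypothetical, and the resolver described in this series is a markdown document, not Python.

```python
def count_letter(text: str, letter: str) -> int:
    """Letter counting belongs in code, not in the model."""
    return text.lower().count(letter.lower())


# Hypothetical routing table: task type -> deterministic function or skill file.
RESOLVER = {
    "count_letters": count_letter,             # route to code
    "write_essay": "skills/essay.md",          # route to the model via a skill
    "edit_image": "skills/image_pipeline.md",  # route to a mixed pipeline
}


def resolve(task_type: str):
    """Look up the route; an unknown task type is an error, not a guess."""
    try:
        return RESOLVER[task_type]
    except KeyError:
        raise LookupError(f"no route for task type: {task_type!r}") from None


print(resolve("count_letters")("strawberry", "r"))  # 3
print(resolve("write_essay"))                       # skills/essay.md
```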

The jagged frontier is an argument for harness engineering, not against AI. The irregularity is real. The solution is routing around it! You must put deterministic code where the model is weak and model judgment where the code can't reason. The resolver maps the territory.

In my own system, I designed my OpenClaw's brain resolver to be 80 lines of markdown, in a numbered decision tree that routes every piece of knowledge to the right directory. When we didn't have it, skills made their own filing decisions and 10 out of 13 were wrong. When we added it, misfilings dropped to zero. The model didn't get smarter. The routing got explicit.

§ 7

Kingsbury argues that LLMs are chaotic systems. They're sensitive to small input perturbations, vulnerable to adversarial manipulation, unpredictable in behavior. He's right. Rephrasing a question changes the answer. Rearranging sentences changes the output. Invisible Unicode characters can hijack behavior.

He presents this as a fundamental problem. It is! If you're feeding raw, unstructured input to a naked model.

But it doesn't have to be, if you're constraining the input through skill files.

A skill file is a structured markdown document that defines the procedure. It tells the model what to read, what to consider, what format to output, and what constraints to observe. The input to the model isn't "fix this image" — it's a 200-line document that says: here's the task, here's the process, here's what good output looks like, here are the tools you have, here are the ones you don't, here's what to do if you're uncertain.

Structured input through a skill file is dramatically less chaotic than freeform natural language. The skill constrains the trajectory. It doesn't eliminate chaos (nothing does in a stochastic system) but it channels it, the way banks channel a river. Kingsbury is right that unbanked rivers flood. That's not an argument against rivers. It's an argument for banks.
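
One small piece of that channeling can be sketched in code. The example assumes a harness that places the skill file ahead of the user's request and drops Unicode format characters (general category `Cf`, which includes zero-width spaces and bidirectional overrides) before the model sees anything. The function names and the `## Task` heading are hypothetical, and the filter is an illustration, not a complete defense against adversarial input.

```python
import unicodedata


def strip_invisible(text: str) -> str:
    """Drop Unicode format characters (category Cf), e.g. zero-width spaces."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def build_model_input(skill_markdown: str, user_request: str) -> str:
    """Put the procedure first and the sanitized request last."""
    cleaned = strip_invisible(user_request).strip()
    return f"{skill_markdown.rstrip()}\n\n## Task\n\n{cleaned}\n"


skill = "# Skill: retexture a room\n\n1. List every surface.\n2. Never change geometry.\n"
print(build_model_input(skill, "fix\u200b this\u202e image"))
```

Stripping every `Cf` character is a blunt choice: it also removes the zero-width joiners that some emoji and scripts legitimately need, so a real harness would likely use a narrower list.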

§ 8

Kingsbury makes a clever point about "reasoning" models. He notes that chain-of-thought traces (the stream-of-consciousness text that models emit while "thinking") are "essentially LLMs writing fanfic about themselves." He cites Anthropic's finding that Claude's chain-of-thought traces don't reliably reflect its actual reasoning process.

This is true and irrelevant. The chain-of-thought trace is the scratchpad, not the product. When a human does long division, the intermediate steps on paper aren't a truthful record of their neuronal activity — they're a tool for organizing the computation. Nobody evaluates a mathematician by the beauty of their scratch work.

What matters is the output. Does the system (model plus harness plus tools) produce correct, verifiable results? That's the testable question. Kingsbury doesn't test it, because he doesn't build the system.

He evaluates the scratchpad instead of the answer.

§ 9

Near the end of his essay, Kingsbury observes that "we don't really know why transformer models have been so successful, or how to make them better." This is true. It's also true of aspirin (the mechanism of action wasn't fully understood until the 1970s), general anesthesia (still incompletely understood), and bicycle stability (the gyroscopic theory was wrong — it was definitively explained only in 2011).

Practical utility doesn't require theoretical completeness. Engineering can proceed while research continues, as it always has.

We didn't stop prescribing aspirin while waiting for the 1971 mechanism paper. We didn't ground planes while debating exactly why wings generate lift. The question isn't whether we understand the mechanism. The question is whether we can build reliable systems with the capabilities that exist, test them rigorously, and improve them as understanding deepens. Kingsbury implicitly assumes that incomplete understanding means we can't build. The entire history of engineering says otherwise.

§ 10

Here's the irony. Kingsbury's Jepsen methodology is exactly the right approach for AI systems. He just applied it to the wrong layer.

Jepsen tests databases by injecting failures and checking invariants. It doesn't test the CPU. It doesn't test the operating system. It tests the database — the full system, under stress, against its own claims. When CockroachDB claimed linearizable consistency, Jepsen injected network partitions and checked whether the claim held.

Apply the same methodology to AI systems and the targets are obvious. Don't test whether the model hallucinates! Of course it does. Test whether the system hallucinates:

  • Does the harness prevent hallucinated data from reaching the user?

  • Does the skill file route the task to deterministic code where precision matters?

  • Does the resolver fire for the right inputs?

  • Does the entity propagation complete?

These are testable claims. We test them.

  • Unit tests for deterministic code.

  • Integration tests for pipeline correctness.

  • Resolver trigger evals for routing accuracy.

  • LLM-as-judge evals for output quality.

  • End-to-end tests for the full pipeline.

The testing pyramid for agent systems exists. It's just not what Kingsbury tested. He tested the raw model, which is like running Jepsen against the bare filesystem instead of the database.
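
To make one layer of that pyramid concrete, here is a sketch of a resolver trigger eval. The keyword-based `route` function is a toy stand-in invented for this example (a real resolver would be far richer), and the three cases are hypothetical. The point is only the shape: labeled inputs, expected routes, and a list of failures you can act on.

```python
def route(request: str) -> str:
    """Toy keyword resolver, purely illustrative."""
    text = request.lower()
    if "how many" in text and "letter" in text:
        return "code:count_letters"
    if "image" in text or "photo" in text:
        return "skill:image_pipeline"
    return "skill:essay"


# Hypothetical eval set: (request, expected route).
CASES = [
    ("How many times does the letter r appear in strawberry?", "code:count_letters"),
    ("Apply oak flooring to this bathroom image", "skill:image_pipeline"),
    ("Write a short essay on crumple zones", "skill:essay"),
]


def routing_failures(cases):
    """Return (request, expected, actual) for every misrouted case."""
    return [(req, want, route(req)) for req, want in cases if route(req) != want]


failures = routing_failures(CASES)
accuracy = 1 - len(failures) / len(CASES)
print(f"routing accuracy: {accuracy:.0%}, failures: {failures}")
```

Each failure names the request and the wrong route, so a misrouting is something you fix in the resolver and rerun, not something you argue about.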

§ 11

Let me address Kingsbury's deepest argument, the one that most readers will feel even if they can't articulate it. LLMs don't produce truth. They produce text that looks like truth. The aesthetics of truth-telling without the epistemological grounding. They are, in the philosophical sense that Harry Frankfurt defined, bullshit machines: systems indifferent to the truth-value of their output.

This is correct. And it's precisely why the architecture matters.

A naked model produces plausible text. A harnessed system produces verified text.

  1. The skill file says "check your work against the source data."
  2. The deterministic code says "compare this output to the ground truth and reject if it diverges."
  3. The model produces a draft.
  4. The system produces a verified result.
  5. The gap between plausible and verified is exactly the gap that harness engineering fills.
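
A deliberately narrow sketch of the "compare and reject" step: it assumes the ground truth is a set of numbers already retrieved by deterministic code, and it rejects any draft that mentions a number not in that set. `VerificationError` and `verify_numbers` are hypothetical names, and a real verification layer would check domain-specific invariants well beyond this.

```python
import re


class VerificationError(Exception):
    """Raised when a draft contains claims the source data does not support."""


def verify_numbers(draft: str, ground_truth: set[float]) -> str:
    """Return the draft unchanged only if every number in it is in the source data."""
    found = {float(m) for m in re.findall(r"\d+(?:\.\d+)?", draft)}
    unsupported = sorted(found - ground_truth)
    if unsupported:
        raise VerificationError(f"numbers not in source data: {unsupported}")
    return draft


source = {101.2, 102.8, 99.5}
print(verify_numbers("ACME opened at 101.2 and closed at 99.5", source))

try:
    verify_numbers("ACME surged to 250.0", source)
except VerificationError as err:
    print(f"rejected: {err}")
```

The check is only as good as its invariant. A draft can pair a real number with the wrong claim and still pass, so the quality of the verification rule carries much of the burden.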

But here's the part Kingsbury misses entirely: the quality of the verification depends on the quality of the skill file. And the skill file is written by a human: the founder, the developer, or the prompt writer.

This is why open source matters so much. A closed-source agent can never let you write the skill that verifies its output, because legalism and safetyism prevent the kind of deep customization that real verification requires. You need to be able to say: "here is my schema, here are my invariants, here is what correct looks like in my domain, now verify against that." You can't do that through an API that guards its system prompt.

The epistemological problem is real. The solution is open harnesses where the user controls the verification layer. Not trust. Engineering.

This is exactly what my YC partner and friend Pete Koomen talks about in "AI Horseless Carriages." The user must write their own prompt; otherwise we'll be slaves to a system prompt that we can't see.

§ 12

At the end of his essay, Kingsbury compares AI to the automobile. He asks the reader to consider not how fast cars are, but what they did to cities. His answer? Sprawl, lead poisoning, bulldozed communities, car dependency. It's a good metaphor.

But he draws exactly the wrong lesson from it. We didn't solve car problems by being skeptical of engines. We solved them with engineering: seatbelts, crumple zones, catalytic converters, traffic lights, highway design, fuel injection, ABS, airbags, emissions standards. Every decade brought new engineering that made the system safer, more efficient, and more useful — not by making the engine smarter, but by building better systems around it.

Skepticism about engines never saved a life. Engineering the chassis did.

That's the answer to Kingsbury's essay. Not "be careful." Not "these machines are liars."

The answer is:

  • Build the system.
  • Write the skills.
  • Test the code.
  • Route with resolvers.
  • Make the deterministic parts deterministic.
  • Make the latent (model) parts constrained.
  • Test the system, not the model.
  • And when the system fails (because it will), debug the step that broke, fix the skill, and run it again.

Kingsbury is right that naked models are unreliable. He is right that they confabulate, that they're chaotic, that they produce the aesthetics of truth rather than truth itself. Where he goes wrong is in treating these properties as verdicts rather than constraints. The model is unreliable. The system doesn't have to be.

The model is the engine. The harness is the car. Build the car.

--

I open-sourced the architecture described in this series. GBrain is the knowledge layer to let you build your own personal OpenClaw/Hermes Agent that does work for you. GStack is the skill layer to go fast with Claude Code.
