Where LLMs really pull apart is after pretraining: a complete breakdown of a post-training pipeline

Source: tw93.fun · Collected 2026-06-06 06:00 · Reading time: 27 min
AI summary

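This long article systematically walks through the full LLM training pipeline. Its core claim: in 2026, the real gap in model quality lies not in pretraining but in the "second half": post-training, evaluation, reward, agent training, and distillation. It breaks the process down stage by stage, from the pretrained base through data recipes, system architecture, the four-stage post-training pipeline (cold-start SFT, GRPO reasoning RL, rejection sampling fine-tuning, alignment RL), grader/reward design, agent training (including the PARL architecture and Meta-Harness optimization), and distillation and deployment. It pays particular attention to DeepSeek-R1's public recipe, GRPO's engineering advantages over PPO, the trade-offs between PRM and ORM, and the trend of agents moving from optimizing answers to optimizing the environment's harness program. It suits systems, data, and tooling engineers who want to understand which concrete engineering steps model capability comes from.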

§ 1

You don't know about LLM training: principles, paths, and new practices

Too long, but read it anyway.

After writing "You don't know about Claude Code: architecture, governance, and engineering practices" and "You don't know about Agents: principles, architecture, and engineering practices," I thought I'd continue with a third piece. This time I wanted to challenge myself to sort out what LLM training is really about, and I hope this article is accessible even for readers without a technical background.

Looking at LLMs in 2026, the real gap in model effectiveness is no longer pretraining itself, but the long stretch that comes after it: post-training, evaluation, reward, agent training, and distillation. Every single step affects the quality users actually feel. When a model suddenly gets better, it's usually because several of these pieces have been optimized together, not because of a single factor.

The following sections will walk through the LLM training pipeline in order, focusing on how companies improve the final deployed quality through the second half of the training stack.

§ 2

LLM training is really an assembly line

For the past few years, model progress was usually explained by parameters, data, and compute. But many of the improvements users actually feel come not from training on a bit more base corpus, but from the whole training process that follows pretraining. How a model speaks, follows instructions, reasons, and uses tools does not emerge naturally from feeding it more internet text.

InstructGPT gave a very direct example: a 1.3B parameter model that had been aligned and preference-optimized could beat a 175B GPT-3 in human preference evaluations, with two orders of magnitude fewer parameters. Users preferred the much smaller version. The second half of training really rewrites user perception.

The training process is an assembly line: data, algorithms, systems, and feedback are all tightly coupled. A change in one layer usually propagates to others. In 2026, model capability and industry value are increasingly concentrated in the layers after pretraining.

| Layer | What is really optimized here | What users usually perceive |
| --- | --- | --- |
| Pretraining | Knowledge coverage, representation quality, scale efficiency | The model got smarter |
| Data engineering | Data distribution, quality, dedup, synthetic supervision | Why this model is stronger at code/math/long documents |
| System & architecture | Throughput, memory, context length, active parameters, cost | Why it supports 128K context or runs on a single GPU |
| Post-training | Instruction following, style, refusal behavior, tool use | This assistant feels smoother to use |
| Evaluation & reward | What counts as good, safe, robust behavior | This model feels more reliable |
| Distillation & deployment | Latency, cost, specialization, online continuous improvement | Why the deployed version differs from the announced version |

This is also why Doubao seems not to chase leaderboard rankings, yet everyday users find it more to their liking: its post-training is done well.

These six layers are just for functional division. The diagram below shows a more detailed nine-stage version: raw data and system recipe are separated out, and Agent harness and deployment are further broken down in the second half. There are also two feedback loops running throughout: production traffic loops back to data engineering, and offline evaluation results loop back to pretraining.

§ 3

Figure 1: The nine-stage LLM training pipeline, top to bottom: Raw data, Data engineering, System recipe, Pretraining, Post-training, Eval / reward design, Agent harness, Distillation / specialization, and Deployment. Two dashed feedback arrows run along the outer edges: one loops production traffic back to Data engineering, the other loops offline benchmark results back to Pretraining.

Pretraining is only the model's foundation

Pretraining is still the starting point of the training pipeline. Understanding what it actually does is necessary to grasp what every subsequent layer adds. Without this step, there is no language modeling ability, no knowledge compression, and no room for later capability transfer. In engineering terms, it's not just about teaching the model to predict the next token: it's about learning the language distribution, compressing the knowledge and patterns from large-scale text into parameters, and leaving room for future capability activation. Next-token prediction only describes the training form; it doesn't explain why, when scale increases, the model suddenly develops abilities it didn't have before.

After GPT-3, many model tuning efforts started to consider budget and allocation more carefully. Bigger models aren't always better; there's an allocation relationship between parameters, training tokens, and total compute budget. Many models aren't too small—they're undertrained, not reaching the optimal point under the given budget.

In real training decisions, the more practical question is: if you were given 10,000 H100s and a month, how would you train a good enough open-source model? Scaling laws here are more like a budget allocation tool, not the abstract curves from papers. Eventually, you need to think carefully about: should the next training round pile on more parameters or more data? Is the current model lacking capability, or just undertrained? With limited GPU budget, which allocation is more worthwhile?
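
To make the budget question concrete, here is a minimal sketch of using a scaling law as an allocation tool. It rests on assumptions that are not in the original: the common approximation that training costs about 6 FLOPs per parameter per token, a Chinchilla-style ratio of about 20 tokens per parameter, and placeholder values for per-GPU peak throughput and utilization. All function and parameter names are hypothetical.

```python
import math


def compute_budget_flops(num_gpus: int, days: float,
                         peak_flops_per_gpu: float, mfu: float) -> float:
    """Total usable training FLOPs for a cluster over a time window.

    peak_flops_per_gpu and mfu (model FLOPs utilization) are assumptions
    the caller must supply; they are not figures from the article.
    """
    seconds = days * 24 * 3600
    return num_gpus * seconds * peak_flops_per_gpu * mfu


def compute_optimal_split(total_flops: float, tokens_per_param: float = 20.0):
    """Split a budget C into parameters N and tokens D.

    Assumes C = 6 * N * D and D = tokens_per_param * N,
    which gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(total_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


def tokens_for_fixed_model(total_flops: float, n_params: float) -> float:
    """Tokens affordable if the model size is fixed in advance (over-training)."""
    return total_flops / (6.0 * n_params)


if __name__ == "__main__":
    # Placeholder hardware numbers, chosen only for illustration.
    budget = compute_budget_flops(10_000, 30, peak_flops_per_gpu=989e12, mfu=0.4)
    n, d = compute_optimal_split(budget)
    print(f"budget={budget:.3e} FLOPs, params={n:.3e}, tokens={d:.3e}")
    print(f"tokens if fixed at 8B params: {tokens_for_fixed_model(budget, 8e9):.3e}")
```

The point of the sketch is the shape of the decision rather than the numbers: the same budget can buy a large compute-optimal model or a much smaller model trained on far more tokens.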

Pretraining is more like laying the foundation for model capabilities: it determines the knowledge scope, generalization potential, and pattern-induction ability, and also whether post-training has anything to work with. But whether the model follows instructions, cooperates with users, or runs stably on key tasks is beyond pretraining's reach.

The pretraining phase doesn't just decide how much knowledge to learn; it also determines what the model can become later. The tokenizer's segmentation method directly affects subsequent training, and the context window length has to be decided upfront. Whether to continue with multimodal pretraining, or to require single-card runnability from the start—these trade-offs are baked into the recipe during training, not features added at release time. Gemma 3's simultaneous emphasis on single accelerator, 128K context, vision capabilities, and quantization reflects this kind of upfront trade-off. The capabilities users ultimately see—like running on a local computer, viewing images, or understanding long documents—are often decided during the training phase.

§ 4

By the Chinchilla compute-optimal point, an 8B-parameter model calls for roughly 200B training tokens. Llama 3 8B was actually trained on 15T tokens, about 75 times that amount. This kind of over-training recipe usually buys higher capability density at the same parameter count, yielding a smaller model that is cheaper to serve. To measure this, total training FLOPs (floating-point operations) is a more reliable yardstick than parameter count. The chart below illustrates the gap.

Figure 2: A line chart with training tokens on a log-scale x-axis and model loss on the y-axis. A solid line shows the Chinchilla-optimal frontier and a dashed line shows a fixed compute budget for an 8B-parameter model. Vertical markers show the Chinchilla-optimal point at approximately 200B tokens and the actual Llama 3 8B training point at 15T tokens, roughly 75 times the optimal. The region between the two curves to the right of the Chinchilla point is labeled "over-training zone." A margin note reads: total training FLOPs = best single predictor of quality.
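
As a rough worked check of the figures above, the over-training ratio and the implied training compute can be written out. The compute estimate uses the common approximation of about 6 FLOPs per parameter per token for a dense transformer, which is an assumption and not a number from the original; N is the parameter count and D the number of training tokens.

```latex
\frac{D_{\text{actual}}}{D_{\text{opt}}}
  = \frac{15 \times 10^{12}}{200 \times 10^{9}} = 75,
\qquad
C_{\text{actual}} \approx 6ND
  = 6 \times (8 \times 10^{9}) \times (15 \times 10^{12})
  = 7.2 \times 10^{23}\ \text{FLOPs},
\qquad
C_{\text{opt}} \approx 6 \times (8 \times 10^{9}) \times (2 \times 10^{11})
  = 9.6 \times 10^{21}\ \text{FLOPs}
```

At a fixed parameter count, compute grows linearly with tokens, so the 75x gap in tokens is also a 75x gap in training FLOPs.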

There's another easily overlooked design decision that happens during pretraining: the tokenizer vocabulary size, segmentation strategy, and byte-level encoding all have significant impact. Llama 2 has a 32K vocabulary; Llama 3 expanded it to 128K, compressing sequence length by about 15%, which also improves downstream performance. This impact extends to inference cost and multilingual capability. The token efficiency for Chinese, code, and mathematical formulas is determined at vocabulary design time. For example, a tokenizer that breaks Chinese into very small pieces doesn't just cost a few extra tokens each time—it carries the penalty of that wrong decision at every inference.
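
The token-efficiency point can be illustrated with a toy comparison. This is a sketch under stated assumptions, not how any real tokenizer works: it uses greedy longest-match segmentation against a tiny hand-written vocabulary (hypothetical), with a fallback of one token per UTF-8 byte for anything the vocabulary does not cover.

```python
def count_tokens(text: str, vocab: set) -> int:
    """Count tokens under greedy longest-match segmentation.

    Characters not covered by the vocabulary fall back to one token
    per UTF-8 byte, which is how a coverage gap turns into a longer
    sequence on every single request.
    """
    max_len = max((len(piece) for piece in vocab), default=1)
    i = 0
    count = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in vocab:
                count += 1
                i += size
                break
        else:
            # No vocabulary entry matched: pay the byte-level price.
            count += len(text[i].encode("utf-8"))
            i += 1
    return count


if __name__ == "__main__":
    sample = "大模型训练"
    print(count_tokens(sample, set()))              # byte-level fallback only
    print(count_tokens(sample, {"大模型", "训练"}))  # vocabulary covers the text
```

With the fallback only, each of the five Chinese characters costs three byte tokens; with the two vocabulary entries the same text costs two tokens. That ratio is paid again on every inference call.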

§ 5

The data recipe determines model capability

Parameter count was the headline metric people compared over the past few years, but in the last two years something more important has come to the fore: the "data recipe."

On the surface, this is about cleaning data, but it's actually a complete data production engineering process. Web pages, code repositories, books, forums—these raw sources must go through text extraction, language identification, quality filtering, privacy processing, safety filtering, and deduplication before entering pretraining. The diagram below shows the full funnel process.

Figure 3: A narrowing funnel diagram of the data pipeline. Six input sources (Raw crawl, Code repos, Books, Forums, Docs, and Synthetic data) feed the top of the funnel, which narrows through Text extraction, Language ID, Quality filtering, PII redaction, Safety filtering, and Deduplication. Beside each stage, a "Filtered out" card names what is removed at that step. The funnel converges into two output stages, Mixture design and Training shards. A note below reads: data pipeline changes the capability distribution before training starts.

If data is treated only as training fuel, it's easy to conclude that more is always better. But data engineering is closer to capability design. What the model sees and doesn't see, what proportion of code vs. math vs. encyclopedic knowledge is used—these choices directly shape the final capability distribution of the model.

Deduplication and contamination control are often overlooked, but they have a large impact on results. The targets are not just low-quality data but also repeated templates, license boilerplate, mirrored pages, and contamination from benchmark leakage. Without sufficient document-level and line-level dedup, the model tends to absorb the most easily copied content over and over without necessarily learning the most valuable parts. Much of the uneven quality among open-source models comes down to differences in data processing quality.
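
The two dedup levels mentioned above can be sketched in a few lines. This is a simplified illustration with hypothetical function names and thresholds: it only catches exact matches after whitespace and case normalization, whereas production pipelines typically add near-duplicate detection, which is not shown here.

```python
import hashlib
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedup_documents(docs: list) -> list:
    """Document-level dedup: keep the first copy of each normalized document."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def drop_repeated_lines(docs: list, max_doc_frequency: int = 2) -> list:
    """Line-level dedup: drop lines that appear in too many documents.

    This targets boilerplate such as license headers and page templates.
    """
    doc_frequency = Counter()
    for doc in docs:
        doc_frequency.update({normalize(l) for l in doc.splitlines() if l.strip()})
    cleaned = []
    for doc in docs:
        lines = [
            l for l in doc.splitlines()
            if not l.strip() or doc_frequency[normalize(l)] <= max_doc_frequency
        ]
        cleaned.append("\n".join(lines))
    return cleaned


if __name__ == "__main__":
    print(dedup_documents(["Hello  World", "hello world", "Other"]))
    print(drop_repeated_lines(["MIT License\nfoo", "MIT License\nbar", "MIT License\nbaz"]))
```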

In the last two years, data mixing itself has become a separate research topic. Work like Data Mixing Laws focuses not just on how much data can be collected, but also on how the proportions of different data types steer the model toward particular capability structures.

Synthetic data has also moved from an auxiliary tool to a formal part of the training process. Self-Instruct-style methods, where the model generates its own instruction data, DeepSeek-R1's distillation trajectories, and the increasingly visible synthetic supervision in the Qwen and Kimi series all point in the same direction: each generation of stronger models helps rebuild the data the next generation sees. Early models generate basic instruction data; stronger models generate high-quality reasoning traces and CoT data; reasoning models trained with RL then distill those trajectories into smaller dense models. Dense means all parameters are active for every token, unlike MoE, which activates experts on demand.

The key insight here is that models often need to form capabilities at a larger scale first, and only then can those capabilities be compressed into smaller models. The DeepSeek-R1-Distill series is a direct example. RL-trained trajectories from the large model brought significant gains to dense models ranging from 1.5B to 70B. Llama 3.1 405B was also explicitly used to improve the post-training quality of the 8B and 70B models. These are not byproducts; they are part of the training design.

§ 6

System and architecture constraints must be thought through before training

Many people understand training as a research problem: how to set the objective function, how to reduce loss, how to modify the model architecture. But in real large-scale LLM training, system constraints are extremely important. It's a distributed systems problem, not a single-machine deep learning problem. GPU count, memory bandwidth, parallelism strategies, fault tolerance, and cost—these can't be optimized after training; they determine from the very beginning how large you can train, how long a context you can support, and whether you can run more complex post-training.

MoE is the most typical example of this layer. The mixture-of-experts approach allows the model to expand total parameters with similar compute, while controlling the activation cost per token. The trade-offs are increased routing complexity, load balancing difficulty, and heavier infrastructure. DeepSeek-V3 and the Qwen series of MoE designs are all trade-offs between cost and effect, not just architectural preferences.
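
The routing idea behind MoE can be sketched as generic top-k softmax gating. This is an illustrative assumption, not the router of DeepSeek-V3, Qwen, or any other specific model; the function names and the example logits are hypothetical.

```python
import math


def softmax(logits: list) -> list:
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_logits: list, top_k: int = 2) -> dict:
    """Pick the top_k experts for one token and renormalize their weights.

    Only the selected experts run, which is why per-token compute stays
    roughly flat while total parameters grow with the expert count.
    """
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}


def expert_load(batch_gate_logits: list, top_k: int = 2) -> list:
    """Count how many tokens each expert receives, to expose load imbalance."""
    counts = [0] * len(batch_gate_logits[0])
    for gate_logits in batch_gate_logits:
        for expert in route_token(gate_logits, top_k):
            counts[expert] += 1
    return counts


if __name__ == "__main__":
    batch = [[2.0, 1.0, 0.0, -1.0], [2.0, 1.0, 0.0, -1.0], [0.0, 0.0, 3.0, 1.0]]
    print(route_token(batch[0]))
    print(expert_load(batch))
```

The load counter is where the cost described above shows up: if most tokens pick the same few experts, some devices sit idle while others become the bottleneck.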

Recent public recipe discussions go beyond coarse-grained analysis such as model size and token ratios. μP (muP, maximal update parametrization) lets hyperparameters transfer from small-scale experiments to large-scale training. WSD (warmup-stable-decay) is a learning-rate schedule that warms up, holds steady, then decays. Together with optimal batch sizes and higher data-to-parameter ratios, these details now appear in official training reports, and they are where the real gaps between similarly sized models are opening up.
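
The warm-up, steady, decay shape described above can be written as a small function. This is a minimal sketch: the linear warm-up and linear decay are assumptions (other decay shapes are also used in practice), and the parameter names are hypothetical.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int, decay_steps: int, min_lr: float = 0.0) -> float:
    """Learning rate at a given step under a warmup-stable-decay schedule.

    Phase 1: linear warm-up to peak_lr over warmup_steps.
    Phase 2: hold at peak_lr.
    Phase 3: linear decay to min_lr over the last decay_steps.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    for s in (0, 99, 500, 800, 900, 1000):
        print(s, wsd_lr(s, total_steps=1000, peak_lr=1e-3,
                        warmup_steps=100, decay_steps=200))
```

One practical appeal of this shape is that the stable phase has no fixed end built into it, so a run can be extended and the decay applied only when a checkpoint is needed.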

If long context, multimodality, and new architectures are understood only as product features, the training-side constraints get missed. A 128K context target directly changes attention cost, batch size, the training curriculum (the order in which data is presented), and parallelism strategies. Multimodality changes not just the model structure but also data mixing (the proportions of data from different sources), encoder design, and safety evaluation. If running on a single GPU is a hard requirement, then parameter count, quantization paths, and model family sizes all tighten accordingly.

Work like Forgetting Transformer and Kimi's Attention Residuals are all answering similar questions: how to train with longer contexts, and how to prevent information from being diluted as networks get deeper. What you see is a model that can handle longer inputs or is easier to deploy; what the training faces is a completely different set of constraints.

The compute budget is fixed. Model size, training tokens, context length, and serving cost—every time you spend more in one direction, other directions have to give way.

§ 7

Figure 4: Training budget trade-offs. A fixed compute budget sits at the center, with four competing directions around it, each with its cost: larger model / more parameters (more GPU memory, routing complexity); more training tokens (more training time, data pipeline cost); longer context window (higher attention cost, smaller batch size); cheaper serving (quantization constraints, fewer active parameters). Annotation: "Every model capability is a budget decision."

Longer context inflates attention cost directly, so batch size has to shrink; a larger model needs more GPU memory, and serving cost rises with it. These are not optional trade-offs but consequences of resource constraints, and most of the decisions are locked in before training begins.
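
A back-of-the-envelope sketch shows why context length squeezes everything else. It assumes standard full self-attention, where the two attention matrix multiplications cost roughly 4 * n^2 * d FLOPs per layer in the forward pass, and a fixed number of tokens per batch; the model dimensions below are placeholders, not figures from the article.

```python
def attention_flops(seq_len: int, d_model: int, n_layers: int) -> int:
    """Approximate forward FLOPs of the attention matmuls for one sequence.

    Scores (Q times K transposed) and the weighted sum over V each cost
    about 2 * seq_len^2 * d_model FLOPs per layer, so cost grows with
    the square of the sequence length.
    """
    return 4 * n_layers * seq_len ** 2 * d_model


def sequences_per_batch(tokens_per_batch: int, seq_len: int) -> int:
    """How many sequences fit when the token count per batch is held fixed."""
    return tokens_per_batch // seq_len


if __name__ == "__main__":
    short, long_ctx = 8_192, 131_072
    ratio = attention_flops(long_ctx, 4096, 32) / attention_flops(short, 4096, 32)
    print(f"attention cost ratio, 128K vs 8K: {ratio:.0f}x")
    print(sequences_per_batch(4_194_304, short), "sequences per batch at 8K")
    print(sequences_per_batch(4_194_304, long_ctx), "sequences per batch at 128K")
```

Going from 8K to 128K multiplies sequence length by 16, so the attention term grows by 16 squared while the number of sequences per batch drops by 16.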

There's another engineering reality that is often overlooked: training isn't always stable. After thousands of GPUs have run for weeks, a loss spike can suddenly appear, too large to ignore, forcing a rollback to a checkpoint from days earlier and a restart from there.

Beyond loss spikes, there are silent GPU errors that produce wrong gradients without crashing, NVLink bandwidth anomalies, and inter-node communication jitter. Each of these can corrupt several training steps. The ability to detect, isolate, and recover quickly at scale is lab-level engineering capability, not something you learn from papers.

DeepSeek-V3's technical report specifically notes that the entire pretraining process had no irrecoverable loss spikes and required no rollbacks. It is also one of the few publicly verified cases of FP8 mixed-precision training being feasible at ultra-large scale. According to public data, the full process used about 2.788M H800 GPU hours, training 14.8T tokens.

Training and inference systems are closely related, but they are not the same engineering problem. Training cares about gradients, parallelism, checkpoints, throughput, and cost. Inference cares about latency, the KV cache (which stores earlier computation so it is not repeated), quantization, and service stability.

§ 8

Post-training is what decides the gap users actually feel

Many of the improvements users actually feel happen after pretraining. Instruction tuning uses labeled instruction-response pairs for supervised training. It changes how the model responds—turning requirements like how to take on a task, how to structure output, and how to be a cooperative assistant into supervised signals. A base model might already have many latent capabilities, but without this step, those capabilities won't reliably appear in the form users expect.

Looking further, RLHF, DPO, and RFT all aim in a similar direction—incorporating "what a better answer looks like" into the training loop—but they take different paths.

  • RLHF (Reinforcement Learning from Human Feedback) first imitates high-quality answers, then reinforces the model using human preference comparisons, typically through a separately trained reward model.
  • DPO (Direct Preference Optimization) shortens this path, learning directly from preference comparisons without needing a separate reward model.
  • RFT (Reinforcement Fine-Tuning) provides a more engineering-friendly interface, putting task definition, grader design, and reward signals into a productizable pipeline.
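
To show what "learning directly from preference comparisons" looks like, here is a minimal sketch of the DPO loss for a single preference pair. It takes summed log-probabilities of the chosen and rejected responses as plain floats; in a real setup these would come from the policy and a frozen reference model. The function name and the default beta value are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    The implicit reward of a response is beta times the log-ratio between
    the policy and the reference model. The loss is the negative
    log-sigmoid of the reward margin, so no separate reward model is needed.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))  # policy equals reference
    print(dpo_loss(-8.0, -12.0, -10.0, -10.0))   # policy favors the chosen answer
```

When the policy matches the reference, the loss is log 2; it falls as the policy raises the chosen response relative to the rejected one.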

Today, talking about post-training by just mentioning SFT or RL is no longer enough. The harder part is how to set up evaluation, how to score, and what makes an answer worth optimizing further. SFT (Supervised Fine-Tuning) learns not just knowledge, but also style. Data length, format, whether citations are included, and whether bullet points are preferred—all these significantly influence the final output form. Many users think they're comparing capabilities, but often they're just comparing style differences. Plus, preference evaluations naturally favor longer answers, easily mistaking a seemingly more thoughtful long output for a more reliable one. So post-training can't just rely on leaderboards; it also needs to consider real task results, cost, and stability.

Modern post-training is a multi-stage pipeline. The clearest public recipe is from DeepSeek-R1, which progresses through four stages:

§ 9

Stage 1: Cold Start SFT. Before doing reinforcement learning, warm up with a small set of high-quality Chain-of-Thought (CoT) data. DeepSeek-R1-Zero showed that doing RL directly on a base model (a pretrained model that has not yet been aligned) is feasible, but models trained with pure RL suffer from repetition, language mixing, and poor readability. Cold-start SFT gives RL a more stable starting point by first locking in format and language consistency. This is not a redundant step.

Stage 2: Reasoning RL (GRPO). Apply RL in verifiable domains like math, code, and logic, using GRPO (Group Relative Policy Optimization) as the training algorithm and programmatically checkable correctness as the reward signal. Why GRPO rather than traditional PPO? PPO (Proximal Policy Optimization) needs a separate value network to estimate the value of the current state, and maintaining two networks at LLM scale is a heavy engineering burden. GRPO samples multiple responses to the same prompt and scores each one relative to the rest of its group, which replaces absolute value estimation and removes the need for a separate value network. That is much cleaner from an engineering standpoint. Both the DeepSeek series and Cursor Composer 2's RL infrastructure use approaches close to GRPO.

Stage 3: Rejection Sampling Fine-Tuning. Filter the successful trajectories produced by RL and convert them into new SFT data, then run another round of supervised fine-tuning. This is a bridge between RL and SFT—the good trajectories explored by RL become high-quality training samples for the next round of SFT.

Stage 4: Alignment RL. Incorporate helpfulness and safety preference feedback, adjusting the model into an assistant form that meets release standards.
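
The core of Stage 2, scoring each sampled response against its own group without a value network, can be sketched as follows. This shows only the group-relative advantage step, not the full GRPO objective (no clipping or KL term); using the population standard deviation and the epsilon value are assumptions, and the function name is hypothetical.

```python
import statistics


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Turn the rewards of one prompt's sampled responses into advantages.

    Each response is compared with the group mean and scaled by the group
    spread, so the group itself plays the role of the baseline that a
    value network would otherwise provide.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one math prompt: 1.0 = verified correct.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # If every sample gets the same reward, there is no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

The second call illustrates a practical consequence: prompts that are always solved or never solved contribute nothing, which is one reason prompt difficulty matters for this kind of training.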

§ 10

Figure 5: Four-stage post-training pipeline, left to right. Stage 1, SFT Cold Start: a small set of high-quality CoT data; fixes repetition, language mixing, and readability. Stage 2, Reasoning RL (GRPO): verifiable rewards in math, code, and logic; no separate value network required. Callout: "R1-Zero showed pure RL works, but cold start prevents repetition and language chaos." Stage 3, Rejection Sampling FT: successful RL trajectories become new SFT data, bridging RL back to SFT. Stage 4, Alignment RL: helpfulness and safety preference feedback. A feedback arrow labeled "Iterates" runs from Stage 4 back to Stage 3.

The four stages depend on each other: cold start makes RL stable, RL produces high-quality data, rejection sampling turns that data into input for the next SFT round, and alignment RL converges behavior. From public results, the gap between direct SFT and a full four-stage pipeline is usually noticeable.

Eval, grader, and reward are redefining the training objective

The component that converts model output into training scores is called the grader, and it often has surprising problems. Looking only at the final answer, the model quickly learns shortcuts; with coarse scoring, noise is continuously amplified by RL; when leaderboard scores rise, real task quality doesn't necessarily follow. Many times, users think they're seeing base model differences, but the real gap is in how the objective is defined.

In the training pipeline, eval determines what to measure, the grader determines how a single output becomes a score, and the reward determines where the model is pushed next. Together, they form a concrete feedback loop: task definition, eval, grader, optimization, rollout, and re-evaluation. A rollout is the trajectory the model produces while carrying out a task. If any link in this chain drifts, everything optimized downstream drifts with it.

Looking only at final results, the model might get the right answer by chance, or it might follow an incorrect process to reach it. In code, math, and complex reasoning tasks, this problem is especially prominent. If intermediate steps aren't included in the feedback, the model often learns not to reason more reliably, but how to maximize the probability of getting that final score.

§ 11

So in recent years, more work has shifted from traditional RLHF toward verified rewards, using programs to directly verify correctness. In tasks like math, code, and logic that are verifiable, scoring can now be based directly on correctness, without relying primarily on human preferences. But verified rewards haven't completely solved the problem. Over-optimization, reward overfitting (where the scoring rule is over-optimized without real capability improvement), and mode collapse (where output becomes highly homogeneous, losing diversity) still occur. The problem has just shifted from whether preferences are labeled accurately to whether the scoring pipeline is stable.
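
What "using programs to directly verify correctness" can look like is sketched below with two toy graders: one checks a numeric answer by exact value, the other runs a candidate function against test cases. Both are illustrative assumptions with hypothetical names; a real pipeline would run model-written code in an isolated sandbox, which this sketch does not do.

```python
from fractions import Fraction


def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the answer equals the reference as an exact number.

    Comparing parsed values means "0.50" and "1/2" count as the same answer.
    Anything that does not parse as a number scores 0.0.
    """
    try:
        same = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0
    return 1.0 if same else 0.0


def code_reward(candidate_fn, test_cases: list) -> float:
    """Return the fraction of test cases a candidate function passes.

    A crash on a test case counts as a failure for that case.
    """
    if not test_cases:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(test_cases)


if __name__ == "__main__":
    print(math_reward("0.50", "1/2"))
    print(code_reward(lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)]))
```

Even graders this simple show where the stability question moves: the parsing rules and the test suite now define what the model is pushed toward.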

The thinking process a model writes out cannot be taken as a complete record of its internal process either. In experiments on the observability of reasoning models, Anthropic found that models can use extra hints without acknowledging them in the visible CoT; in reward hacking scenarios, the model is even more likely to fabricate a plausible-looking explanation. Reward hacking means exploiting loopholes in the scoring system rather than genuinely completing the task. The visible CoT is better treated as a training and monitoring signal than as the complete truth.

Going a step further, models can even start to exploit the scoring channel itself. Research on reward tampering and alignment faking suggests that, at least in principle, a model may actively interfere with how it is scored. Reward tampering means directly manipulating the reward computation; alignment faking means complying on the surface while hiding misaligned intent.

Once a model has strong enough access to its environment, what it optimizes may include not just the task result but also the checklist, the reward code, and the training relationship itself. In a 2025 Anthropic experiment, extra knowledge about reward hacks was injected in a setup built on exploitable production coding RL environments, and this kind of generalization was then observed: after learning to reward hack, the model not only kept exploiting similar tasks but also showed broader misalignment such as alignment faking.

These behaviors are not visible in standard dialog evaluations; they only appear in agent task environments. The engineering implication is direct: reward, grader, environment isolation, and monitoring all need to be designed as part of training.

所以这几年越来越多工作从传统 RLHF 转向 verified rewards,用程序直接验证正确性。在数学、代码、逻辑这些可验证任务里,现在已经可以直接对正确性打分,不再主要依赖人工偏好。但 verified rewards 也没有把问题彻底解决掉。过优化、reward overfitting(打分规则被过度优化、能力却没真正提升),以及 mode collapse(输出高度单一、失去多样性)这些现象还是会出现,问题只是从偏好标得准不准,变成了打分链路稳不稳。

模型写出来的思考过程,也不能直接当成内部过程的完整记录。Anthropic 在 reasoning model 的可观测性实验里发现,模型会使用额外提示,却不在可见 CoT 里承认;到了 reward hacking 场景,它更可能补一段看起来合理的解释。reward hacking 是钻打分系统空子,而不是真正完成任务。可见 CoT 更适合当训练和监控信号,不能直接当成完整真相。

再往下一层,模型甚至会开始利用打分通道本身。reward tampering 和 alignment faking 这类研究表明,模型在理论上可能主动干预打分过程本身。reward tampering 是直接篡改奖励计算过程本身,alignment faking 是对齐伪装,表面合规但隐藏不对齐意图。

一旦模型有足够强的环境访问能力,它优化的就不止任务结果,还可能包括 checklist、reward code 和训练关系本身。Anthropic 2025 年一项实验,在一组可被利用的生产编码 RL 环境里注入了额外的 reward-hack 知识,随后观察到了类似的泛化。模型学会 reward hacking 后,不只会在同类任务上继续利用,还出现了对齐伪装等更广泛失对齐。

这些行为在标准对话评测里看不到,只在 Agent 任务环境里能看到。工程含义很直接,reward、grader、环境隔离和监控都要当成训练设计的一部分。

§ 12

At the agent stage, reward design gets decomposed further. The final result is only one item; process quality, context management, and anti-cheating constraints each need to be measured separately. Kimi K2.5 rewards effective decomposition and genuine parallelism; Chroma Context-1 gives credit for relevant documents found along the way during search; Cursor Composer 2 brings the summaries written during long tasks into the reward, because once a summary is distorted, all the context after it is led astray.

In concrete implementations:

  • ORM (Outcome Reward Model) only scores the final answer. The signal is sparse and costs are low, making it a good starting point, but it also makes it easier for the model to take shortcuts.
  • PRM (Process Reward Model) scores intermediate steps. The signal is denser, which is usually stronger for math and code reasoning, but annotation and system costs are much higher.

OpenAI saw in math reasoning experiments that PRM not only improved accuracy, but also better constrained the process, because every step is being supervised. The problem is also direct: PRM costs are typically several times that of ORM. So most real systems still start with ORM, and only in verifiable tasks like math, code, and logic does it make more sense to automate PRM, using programs to verify intermediate steps and bypass the human annotation bottleneck.
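The difference in signal density between ORM and PRM can be sketched on a toy reasoning chain. Everything here is an assumption made for illustration: the step format (`a + b = c` or `a * b = c`), the programmatic step checker `check_arithmetic`, and the +1 / -1 step scores. A learned PRM would replace the checker with a model.

```python
from typing import Callable, List


def orm_reward(final_answer: str, reference: str) -> float:
    """Outcome reward: one sparse signal for the whole chain."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def prm_rewards(steps: List[str], check_step: Callable[[str], bool]) -> List[float]:
    """Process reward: one signal per intermediate step."""
    return [1.0 if check_step(step) else -1.0 for step in steps]


def check_arithmetic(step: str) -> bool:
    """Hypothetical step verifier for 'a + b = c' or 'a * b = c' with integers."""
    lhs, _, rhs = step.partition("=")
    for symbol, op in (("+", lambda a, b: a + b), ("*", lambda a, b: a * b)):
        if symbol in lhs:
            a, _, b = lhs.partition(symbol)
            try:
                return op(int(a), int(b)) == int(rhs)
            except ValueError:
                return False  # malformed step counts as wrong
    return False
```

With steps `["2 + 3 = 5", "5 * 4 = 25"]` and a final answer of `"20"` against reference `"20"`, `orm_reward` returns 1.0 while `prm_rewards` returns `[1.0, -1.0]`: the outcome signal accepts the chain, the process signal flags the broken step.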

到了 Agent 阶段,reward design 还会继续拆细,最终结果只是其中一项,另外还要单独度量过程质量、上下文管理和反作弊约束。Kimi K2.5 奖励的是有效拆解和真实并行;Chroma Context-1 会给搜索途中找到的相关文档记分;Cursor Composer 2 把长任务里的 summary 纳入奖励,因为总结一旦失真,后面的上下文会一路被带偏。

具体到实现里,ORM 是结果奖励模型,只给最终答案打分,信号稀疏,成本低,适合先起步,但也更容易让模型走捷径。PRM 是过程奖励模型,给中间步骤打分,信号更密,对数学和代码推理通常更强,但标注和系统成本都高很多。OpenAI 在数学推理实验里看到,PRM 不只提高了正确率,也更容易把过程约束住,因为每一步都在被监督;问题也很直接,PRM 的成本通常是 ORM 的数倍,所以大多数真实系统还是先从 ORM 起步,只有在数学、代码、逻辑这类可验证任务里,才更有条件把 PRM 自动化,用程序去验证中间步骤,绕开人工标注瓶颈。

§ 13

Figure 6: ORM vs PRM. Side-by-side comparison. Left panel, "ORM (Outcome Reward Model)": a four-step reasoning chain "Step 1 → Step 2 (wrong) → Step 3 → Final Answer ✓", with a single reward arrow pointing only at the final answer, labeled "Reward: 1 (correct)". Failure mode: a wrong process can produce a correct answer. Right panel, "PRM (Process Reward Model)": the same chain with a score on every step: "Step 1 ✓ +0.9", "Step 2 ✗ −0.8", "Step 3 ✓ +0.7", "Final ✓ +1.0". Benefit: every step is supervised, which trains a reliable process. Comparison table (ORM / PRM): annotation cost, low / high; signal density, sparse / dense; typical use, general tasks / math and code reasoning; main failure mode, shortcut reasoning / high labeling overhead.

Run end to end, the loop looks like this:

Figure 7: Eval, Grader, Reward Loop. A clockwise cycle of six nodes: "Task Definition" → "Eval Set" → "Grader / Judge" → "Reward Signal" → "Policy Update (SFT / DPO / RL)" → "New Rollouts" → back to "Task Definition". The "Grader / Judge" node is highlighted as the critical failure point. A side box, "Agent Reward Breakdown", lists four items: "Outcome Reward", "Process Reward", "Context Reward", "Anti-Hacking Penalty". Annotation: "If the grader is wrong, training optimizes the wrong target."

Several recent alignment methods are doing the same thing. Anthropic's Constitutional AI wires human-written principles into training, using AI feedback in place of per-example human preferences. OpenAI's Deliberative Alignment puts safety compliance into the reasoning process, so that reasoning capacity itself carries part of the safety constraint. Deliberative Alignment here means the model reasons about the safety rules at inference time, rather than relying on trained-in reflexes. Both lines move alignment from human labels into the training objective itself.

Taking Constitutional AI as an example, the two-stage process first has the model self-criticize and revise its outputs according to principles, then uses AI feedback to replace per-example human preference labeling. Alignment is never a patch tacked on after training. What the system measures, how it scores, and what it rewards—those are what steer the model. This is the most direct adjustment tool in the second half of training.
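The two stages can be sketched with stand-in callables. The `Model` interface (prompt in, text out), the prompt wording, and the function names are all assumptions made for this sketch; they are not taken from the Constitutional AI paper.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def critique_and_revise(model: Model, prompt: str,
                        principles: List[str]) -> Tuple[str, str]:
    """Stage 1: draft, then critique and revise once per principle."""
    response = model(prompt)
    for principle in principles:
        critique = model(
            f"Principle: {principle}\nResponse: {response}\n"
            "Critique the response against the principle."
        )
        response = model(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response  # one supervised fine-tuning example


def ai_preference(judge: Model, prompt: str, a: str, b: str,
                  principle: str) -> Tuple[str, str]:
    """Stage 2: an AI judge picks a winner. Returns (chosen, rejected)."""
    verdict = judge(
        f"Principle: {principle}\nPrompt: {prompt}\n(A) {a}\n(B) {b}\n"
        "Which follows the principle better? Answer A or B."
    )
    return (a, b) if verdict.strip().upper().startswith("A") else (b, a)
```

The pairs from `ai_preference` play the role that human preference labels play in RLHF, which is why what the judge measures ends up steering the policy.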

这条回路完整跑起来是这样的:

最近几类对齐方法都在做同一件事。Anthropic 的 Constitutional AI 把人类写的原则接进训练,用 AI feedback 替代逐条人工偏好。OpenAI 的 Deliberative Alignment 把安全遵守放进推理过程,让推理能力本身承担一部分安全约束。这里说的 Deliberative Alignment 是审慎对齐,核心是推理阶段自行判断安全规范,而不是依赖训入的反射行为。两条路线都在把对齐从人工标签变成训练目标内部的一部分。

以 Constitutional AI 为例,两阶段流程是先让模型依照原则自我批评和修订输出,再用 AI feedback 替代逐条人工偏好标注。对齐从来不是挂在训练后面的补丁,系统测什么、怎么打分、奖励什么,模型就往哪个方向走,这本身就是训练后半段最直接的调节手段。

§ 14

Figure 8: Constitutional AI / RLAIF Pipeline. Two-phase diagram. Top: "Constitution" (human-written principles, no human labels needed), feeding both phases. "Phase 1: SL Phase": "Initial Model Response" → "Self-Critique: Does this violate any principle?" → "Revised Response" → "Fine-tune on Revisions". "Phase 2: RL Phase": "Sample Pairs from Fine-tuned Model" → "AI Preference Model (RLAIF): Which response better follows the constitution?" → "Preference Dataset" → "RL Training". Annotation: "RLAIF replaces RLHF: AI evaluates AI, human oversight via rules instead of per-example labels."

In agent training, the model is no longer the only thing being optimized

Over the past two years, reasoning models such as the o1 series and DeepSeek-R1 have quickly taken shape, showing that when rewards are stable, verification is reliable, and infrastructure is in place, RL on language models can indeed significantly improve performance on math, code, and logic tasks.

This also opened up a new dimension: inference compute is now scalable. The role of RL training has gained an additional layer: beyond teaching the model to answer questions, it's also teaching the model to allocate its inference budget—knowing when to think more and when to stop. Going further, the challenge becomes getting the model to act continuously in an environment, rather than just extending single-shot thinking.

到了 Agent 训练,优化的不只是模型本身了

过去两年,以 o1 系列和 DeepSeek-R1 为代表的推理模型快速成型,说明在奖励稳定、验证可靠、基础设施到位的条件下,语言模型上的 RL 确实能显著提升数学、代码和逻辑任务表现。

这同时打开了一个新维度:推理算力也可以扩展了。RL 训练的作用随之多了一层,它在教模型答题之外,还在教模型分配推理预算,知道什么时候多想、什么时候该停。再往前走,难点就变成让模型在环境里持续行动,而不只是把单次思考拉长。

§ 15

Figure 9: Two Scaling Axes. X-axis: "Training Compute (FLOPs)". Y-axis: "Inference Compute (tokens per response)". Four zones: bottom-left, "GPT-3 era: scale training, fixed inference"; top-left, "Reasoning models: same training scale, variable inference - o1, DeepSeek-R1"; bottom-right, "Larger pretraining, fixed output length"; top-right (highlighted), "Agent era: longer trajectories, more tool calls, larger inference budget". A diagonal arrow from the bottom-left zone toward the top-right is labeled "New frontier: scale both". A vertical dashed line between the left and right zones is labeled "Reasoning RL unlocks vertical axis". Annotation: "RL training now teaches the model how to allocate inference budget, not just how to answer."

Junyang Lin, the former head of the Qwen model team, had a very representative reflection on the mixed Thinking and Instruct route: the difficulty isn't giving the model a thinking switch, but that the two modes have fundamentally different goals—one seeks directness, compliance, and low latency, while the other pursues more exploration and higher accuracy. Going a step further, the training objective shifts from how long to think before answering, to how to allocate budget during action, how to receive feedback, and how to continue advancing a task.

At this point, the object of training is no longer just a model that answers questions, but a system that can plan, call tools, receive feedback, and stay coherent over long tasks. The training stack also changes accordingly: browsers, terminals, search, execution sandboxes, memory systems, tool servers, and orchestration frameworks all enter the training system.

More precisely, the harness is the control program wrapped around the model. This concept isn't just for the agent runtime; it also exists during training: it determines what input the model sees, in what form it receives feedback, when to trim context, and when to call tools. Prompt construction, memory updates, retrieval policies, context editing, and tool orchestration all live here. The environment is no longer just a static validator, but a layer that both training and deployment must directly face.
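A minimal sketch of a harness as a control program, showing prompt construction, context trimming, and tool dispatch around a fixed model. The `Model` and `Tool` interfaces, the `FINISH` convention, and the "name argument" action format are assumptions made for this sketch, not the API of any real agent framework.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]  # hypothetical: prompt in, action text out
Tool = Callable[[str], str]   # hypothetical: argument in, observation out


def trim_context(history: List[str], max_items: int) -> List[str]:
    """Context editing: keep the goal (first item) plus the most recent items."""
    if len(history) <= max_items:
        return history
    return history[:1] + history[len(history) - (max_items - 1):]


def run_harness(model: Model, tools: Dict[str, Tool], goal: str,
                max_steps: int = 8, max_items: int = 6) -> List[str]:
    """Drive one episode: build prompt, get action, call tool, feed back."""
    history = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        prompt = "\n".join(trim_context(history, max_items))  # prompt construction
        action = model(prompt).strip()
        history.append(f"ACTION: {action}")
        if action.startswith("FINISH"):
            break
        name, _, arg = action.partition(" ")
        tool = tools.get(name)
        # Tool orchestration: unknown tools become an observation, not a crash.
        observation = tool(arg) if tool else f"unknown tool: {name}"
        history.append(f"OBSERVATION: {observation}")
    return history
```

The model never sees `history` directly, only what `trim_context` lets through, so the same weights can behave differently under a different `max_items` or trimming rule. That is the sense in which the harness is part of both training and deployment.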

Qwen 前模型负责人 Junyang Lin 对 Thinking 和 Instruct 混合路线的反思很有代表性:难点不在给模型一个思考开关,而在两种模式的目标本来就不一样,一个追求直接、合规和低延迟,另一个追求更多探索和更高正确率。再往前一步,训练目标就会从回答前想多久,转成行动里怎么分配预算、怎么接反馈、怎么继续推进任务。

这时候训练对象不再只是一个会回答问题的模型,而是一个能规划、调用工具、接收反馈、在长任务里保持连贯的系统。于是训练栈也跟着变了,浏览器、终端、搜索、执行沙盒、内存系统、工具服务器、编排框架都开始进入训练系统。

更准确地说,harness 是包在模型外层的控制程序,这个概念不只属于 Agent 运行时,训练阶段同样有它:决定模型看到什么输入、以什么形式接收反馈、何时裁剪上下文、何时调工具。prompt construction、memory update、retrieval policy、context editing、tool orchestration 都在这里。环境也不再只是静态验证器,而是训练和部署都要直接面对的一层。

§ 16

Figure 10: Reasoning Model vs Agentic Model. Side-by-side diagram. Left panel, "Reasoning Model": a short chain "Prompt" → "Reasoning Trace" → "Final Answer" → "Verifier", with a feedback arrow from Verifier back to Prompt; label "Optimize a single answer." Right panel, "Agentic Model": a longer cycle "Goal" → "Planner / Policy" → "Tool Call" → "Environment Feedback" → "Memory / Summary / Context Editing" → "Next Action" → back to "Planner / Policy", with "Environment Feedback" and "Memory / Summary / Context Editing" highlighted as the new complexity; label "Optimize a trajectory in an environment." Comparison table (Reasoning Model / Agentic Model): unit of optimization, answer / trajectory; main bottleneck, verifier accuracy / harness quality; typical reward, outcome reward / outcome + process + context; common failure, shortcut reasoning / tool misuse, context drift, reward hacking.

The harness must be stable first, otherwise model training is pointless. When tool return values are unstable, the browser environment is inconsistent with production, or the file system state is irreproducible, the grader will make mistakes first, and the model will then learn not capabilities, but how to exploit environment vulnerabilities. When training agents, debugging the model and debugging the environment often go hand in hand.

The approaches of the three companies are clear: Kimi uses PARL to handle parallel decomposition and credit assignment; Cursor uses self-summarization and real-time RL to feed long coding sessions and production traffic back into training; Chroma trains prune_chunks as part of the policy itself, so that context pruning happens directly inside the retrieval process.

In the SFT era, data diversity was the top priority. In the agent era, environment quality is the core: stability, realism, coverage, difficulty distribution, feedback richness, and resistance to exploitation. The training objective also changes—what's needed is reliability throughout an entire task, not just getting a single question right. Classic CoT benchmarks don't cover this part.

This shift is still moving outward: the model is not only trained inside the runtime harness, the harness code itself is starting to become something an outer loop can search over and optimize.

harness 先稳住,模型训练才有意义。工具返回值不稳定、浏览器环境和线上不一致、文件系统状态不可复现时,grader 会先出错,模型随后学到的就不是能力,而是如何利用环境漏洞。训练 Agent 时,很多时候既在 debug 模型,也在 debug 环境。

三家的做法也很清楚:Kimi 用 PARL 解决并行拆解和 credit assignment,Cursor 用 self-summarization 和 real-time RL 把长时 coding session 与生产流量重新接回训练,Chroma 则把 prune_chunks 训成策略本身,让 context pruning 直接进入检索过程。

SFT 时代数据多样性是第一位,到了 Agent 时代,环境质量才是核心:稳定性、真实性、覆盖度、难度分布、反馈丰富度和抗利用性。训练目标也随之变化,要的是在完整任务里保持可靠,不只是做对一道题,经典 CoT benchmark 覆盖不到这部分。

这个变化还在继续前移:不只是在 runtime harness 里训练模型,连 harness code 本身也开始成为可以被外循环搜索和优化的对象。

§ 17

Figure 10.5: From Model Training to Harness Optimization. Systems diagram. Left: "Base Model / Policy" sits inside a container labeled "Runtime Harness" with four modules: "Prompt Construction", "Retrieval / Memory", "Context Editing", and "Tool Orchestration". The harness feeds an artifact box labeled "Rollouts, Scores, Execution Traces". Right: "Outer-loop Harness Optimizer" (a coding agent reads prior code, traces, and scores). One arrow runs from the artifact box to the optimizer, and another, labeled "Revised Harness Code", loops back into the Runtime Harness. Annotation: "Optimization target expands from answer, to trajectory, to harness program."

Kimi K2.5's PARL is a very instructive engineering case to unpack. The approach is clear: train only the orchestrator, concentrating credit assignment at the orchestration layer, not optimizing all sub-agents simultaneously.

Reward signals are divided into three categories: task success, parallel decomposition, and completion constraints, all driving the orchestration layer. Early in training, the r_parallel weight is boosted to encourage exploration of parallel strategies, then gradually annealed to zero later to avoid using multiple sub-agents as a shortcut. Evaluation doesn't just look at total steps, but also at the critical path length—shorter critical paths indicate that parallelism is genuinely effective.
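The annealed weight and the critical-path metric can be sketched as follows. The linear schedule, the initial weight of 0.5, the unweighted sum for the other two terms, and the function names are assumptions made for illustration; the text above only says that the r_parallel weight starts high and is annealed to zero.

```python
from typing import Dict, List


def parallel_weight(step: int, anneal_steps: int, initial: float = 0.5) -> float:
    """Linearly anneal the r_parallel weight from `initial` down to 0."""
    remaining = max(0.0, 1.0 - step / anneal_steps)
    return initial * remaining


def orchestrator_reward(r_perf: float, r_parallel: float, r_finish: float,
                        step: int, anneal_steps: int) -> float:
    """Combine the three signals; only r_parallel is down-weighted over time."""
    # r_perf and r_finish are added unweighted here, which is an assumption.
    return r_perf + parallel_weight(step, anneal_steps) * r_parallel + r_finish


def critical_path_length(durations: Dict[str, int],
                         deps: Dict[str, List[str]]) -> int:
    """Longest serial chain of steps through a dependency DAG of subtasks."""
    memo: Dict[str, int] = {}

    def finish(task: str) -> int:
        if task not in memo:
            memo[task] = durations[task] + max(
                (finish(d) for d in deps.get(task, [])), default=0
            )
        return memo[task]

    return max(finish(t) for t in durations)
```

For subtasks `{"a": 3, "b": 5, "c": 2}` where `c` depends on both `a` and `b`, the total is 10 steps but the critical path is 7, because `a` and `b` can run side by side. If spawning more sub-agents raises the total without shortening the critical path, the parallelism is not doing real work.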

Kimi K2.5 的 PARL 是一个很值得拆开的工程案例,路线很明确:只训练 orchestrator,把 credit assignment 收束到编排层,不在所有 sub-agent 上同时优化。

奖励信号分三类,任务成功、并行分解和完成约束,一起驱动编排层。训练早期把 r_parallel 权重拉高,鼓励先探索并行策略,后期再逐步退到 0,避免把多开 sub-agent 当成捷径。评估也不只看总步数,还看关键路径长度,关键路径变短才说明并行真的生效。

§ 18

Figure 11: PARL Architecture. Top: "Orchestrator Agent (Trainable)", which learns when to decompose, how to assign, and how to aggregate. It branches to three sub-agents, "Sub-Agent 1 (Frozen)", "Sub-Agent 2 (Frozen)", "Sub-Agent 3 (Frozen)", each of which executes its subtask independently and whose output is treated as an environment observation. Below them: "Tool Environment" with "Browser", "Terminal", "Search", "File System". Below that, three rewards: "r_perf: Task success (primary)", "r_parallel: Incentivizes decomposition - annealed to 0 over training", "r_finish: Penalizes spurious parallelism." Side notes: "Freezing sub-agents solves credit assignment - only orchestrator gets gradient." and "Critical Steps = longest serial chain, not total steps across all agents."

But by 2026, things have moved further. Meta-Harness explicitly separates out harness engineering for optimization. It doesn't optimize weights, but the harness code itself—the prompt construction, retrieval, memory, and state update programs wrapped around a fixed model. The paper's opening number is very direct: the same base model, with only the harness changed, can show a 6x performance gap on the same benchmark. This outer program is no longer just a deployment detail; it's also a layer where capability is formed.

The key isn't adding another abstract optimizer. It is writing prior code, scores, and execution traces (the execution logs of tool calls and state changes) to the filesystem, so that a proposer can grep, cat, and compare diffs as if it were writing code, and then modify the harness along the failure paths. The proposer is the module that proposes harness modifications.
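A minimal sketch of the "write everything to the filesystem" idea: each harness candidate is stored as plain files that a proposer could inspect with ordinary tools. The directory layout, the file names, and both function names are hypothetical; they are not Meta-Harness's actual format.

```python
import json
from pathlib import Path
from typing import List


def record_candidate(root: Path, name: str, code: str,
                     score: float, trace: List[str]) -> Path:
    """Store one harness candidate as files readable with grep, cat, or diff."""
    folder = root / name
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "harness.py").write_text(code)
    (folder / "score.json").write_text(json.dumps({"score": score}))
    (folder / "trace.log").write_text("\n".join(trace))
    return folder


def failing_trace_lines(root: Path, needle: str = "ERROR") -> List[str]:
    """Grep-like scan: 'candidate: line' for each trace line containing `needle`."""
    hits = []
    for log in sorted(root.glob("*/trace.log")):
        for line in log.read_text().splitlines():
            if needle in line:
                hits.append(f"{log.parent.name}: {line}")
    return hits
```

Keeping the full trace next to the score is the point of the design: the scalar in `score.json` says a candidate was worse, while `trace.log` can show which step went wrong many turns earlier.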

The authors' judgment is clear: many past text optimizers have been ineffective on long-running, stateful programs like harnesses. The core reason is that looking only at scalar scores, short templates, or summaries flattens the problem. Scalar scores only capture the final result, without process information. Harness errors often only show up many steps later; once feedback is over-compressed, the diagnostic chain breaks.

These results are more than just higher benchmark scores. In online text classification, Meta-Harness outperformed ACE (an agent context engineering baseline) by 7.7 points while cutting context token usage to one quarter of the original. In retrieval-augmented math reasoning, one discovered harness gained a further 4.7 points on average across 5 held-out models (models not involved in the optimization) on 200 IMO-level problems. On TerminalBench-2, it also surpassed the hand-engineered baseline. This shows that what is being optimized is no longer just the model's internal policy, but also the program around the model that organizes information and actions.

A concrete example: Meta-Harness automatically discovered environment bootstrap on TerminalBench-2—running a shell command before the agent loop starts, to snapshot the working directory, available languages, package managers, and memory state, and inject it into the first prompt. Many coding agents spend the first few rounds just exploring the environment. With this upfront setup, the improvement doesn't necessarily come from stronger weights, but from the harness putting the model on better context from the start.
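A rough sketch of an environment bootstrap step, using Python standard-library calls instead of a shell command. The snapshot fields, the prompt layout, and the function names are assumptions made for this sketch; the memory-state part of the snapshot is left out because it is platform-specific.

```python
import os
import shutil
from typing import Dict, List


def snapshot_environment(candidates: List[str],
                         max_entries: int = 20) -> Dict[str, object]:
    """Collect a small, read-only picture of the sandbox before the agent loop."""
    cwd = os.getcwd()
    return {
        "cwd": cwd,
        "entries": sorted(os.listdir(cwd))[:max_entries],
        # shutil.which returns None when the executable is not on PATH
        "available": [name for name in candidates if shutil.which(name)],
    }


def bootstrap_prompt(task: str, snapshot: Dict[str, object]) -> str:
    """Inject the snapshot into the first prompt, ahead of the task itself."""
    lines = [
        "Environment snapshot:",
        f"- working directory: {snapshot['cwd']}",
        f"- files: {', '.join(snapshot['entries']) or '(empty)'}",
        f"- tools on PATH: {', '.join(snapshot['available']) or '(none found)'}",
        "",
        f"Task: {task}",
    ]
    return "\n".join(lines)
```

A call such as `bootstrap_prompt(task, snapshot_environment(["python3", "node", "pip", "npm"]))` would run once before the first model turn, so the early rounds are not spent on `ls` and `which`.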

At this point, the optimization target has expanded from the answer, to the trajectory, and further to the harness program that carries the trajectory.

但到了 2026,事情又往前走了一步,Meta-Harness 明确把 harness engineering 单独拿出来优化。它优化的不是权重,而是 harness code 本身,也就是围绕固定模型的 prompt construction、retrieval、memory 与状态更新程序。论文开头的数字很直接:同一个底模,只改 harness,在同一 benchmark 上就可能拉出 6x 的性能差距,模型外层这套程序已经不只是部署细节,也是能力形成的一层。

它的关键也不是再加一个抽象 optimizer,而是把 prior code、scores、execution traces(工具调用和状态变化的执行日志)全部写入 filesystem,让 proposer 像写代码一样去 grep、cat、比对 diff,再顺着失败路径改 harness。proposer 是提出 harness 修改方案的模块。

作者判断得很明确,过去很多 text optimizer 对 harness 这类长时、状态化程序不够有效,核心原因是只看 scalar score、短模板或总结会把问题压扁。scalar score 只有最终得分,没有过程信息。harness 的错误常常要很多步之后才显现,反馈一旦被过度压缩,诊断链路就会断。

这些结果不只是 benchmark 分数更高。在线文本分类里,Meta-Harness 比 ACE(agent 上下文工程基线)高 7.7 个点,同时把 context token 用量压到原来的 1/4。检索增强数学推理里,一个发现出来的 harness 在 200 道 IMO-level 题上,对 5 个 held-out 模型(未参与优化)平均再涨 4.7 个点。在 TerminalBench-2 上,它也超过了手工工程化 baseline。这说明被优化的已经不只是模型内部策略,也包括模型外围那层如何组织信息和行动的程序。

一个具体例子:Meta-Harness 在 TerminalBench-2 上自动发现了 environment bootstrap,也就是 agent loop 开始前先跑一个 shell command,把工作目录、可用语言、包管理器和内存状态整理成快照注入首轮 prompt。很多 coding agent 前几轮其实都在探环境,这层前置做好,提升不一定来自更强权重,而是 harness 让模型一开始就站在更好的上下文上。

到这里,优化目标已经从答案扩展到轨迹,再扩展到承载轨迹的 harness program。

§ 19

After a frontier model is released, the training pipeline keeps running

Understanding today's LLMs through the lens of a single round of pretraining is no longer sufficient. Behind every released model, the full chain of pretraining, post-training, distillation, and specialization has usually been completed. And stronger models are continuously producing training data for the next generation.

The DeepSeek-R1 series distillation is a typical example: the large model first develops reasoning ability through RL and verified rewards, then transfers these reasoning trajectories to smaller dense models. TranslateGemma shows another route: on more clearly defined target tasks, using high-quality data and specialized reward design to further compress and direct capability. At this point, stronger models aren't just serving users; they're also directly producing training data for the next generation.
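The "teacher produces training data for the student" step can be sketched as verified sampling. The `Teacher` and `Verifier` interfaces, the record layout, and the one-trajectory-per-problem rule are assumptions made for illustration; this is not DeepSeek's actual data pipeline.

```python
from typing import Callable, Dict, List

Teacher = Callable[[str], str]         # hypothetical: question in, reasoning + answer out
Verifier = Callable[[str, str], bool]  # hypothetical: (completion, reference) -> correct?


def build_distillation_set(teacher: Teacher, verifier: Verifier,
                           problems: List[Dict[str, str]],
                           samples_per_problem: int = 4) -> List[Dict[str, str]]:
    """Sample teacher trajectories and keep only verified ones as student SFT data."""
    dataset = []
    for problem in problems:
        for _ in range(samples_per_problem):
            completion = teacher(problem["question"])
            if verifier(completion, problem["reference"]):
                dataset.append(
                    {"prompt": problem["question"], "completion": completion}
                )
                break  # one verified trajectory per problem suffices in this sketch
    return dataset
```

The student is then fine-tuned on `dataset` with ordinary supervised learning. The same verifier that drove the teacher's RL doubles as the filter that keeps wrong trajectories out of the next generation's data.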

The reason goes deeper than trajectory transfer. One possible explanation: knowledge memorization and reasoning ability are coupled in internet text. The existing pretraining objective requires the model to learn both simultaneously. The large model has to come first because only a large enough model can simultaneously support both tasks. Then its purely reasoning demonstration data can be generated, allowing smaller models to focus on reasoning alone when trained on this data, without being forced to memorize all knowledge. Big first, then small—the key reason is capability decoupling, not just a cost strategy.

On the other hand, deployment suitability is just as important as capability itself. Many scenarios don't need a general-purpose large model—they care more about cost, latency, stability, and controllability. The end point of training isn't necessarily bigger; it could be smaller, cheaper, and more specialized.

The final released model is not necessarily the checkpoint at the rightmost end of the training curve. Before actual release, real task results, refusal styles, tool stability, costs, and regression risks are repeatedly compared across multiple checkpoints. The version that goes online is ultimately a product decision, not the single strongest performer on a single metric.
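Release selection as a product decision can be sketched as "hard constraints first, score second". The field names (`task_score`, `cost_per_task`, `regressions`) and the thresholds are made up for illustration; real release reviews compare many more dimensions, such as refusal style and tool stability.

```python
from typing import Dict, List, Optional


def pick_release(checkpoints: List[Dict[str, float]],
                 max_cost: float,
                 max_regressions: int) -> Optional[Dict[str, float]]:
    """Filter by product constraints, then take the best task score among survivors."""
    eligible = [
        c for c in checkpoints
        if c["cost_per_task"] <= max_cost and c["regressions"] <= max_regressions
    ]
    if not eligible:
        return None  # nothing is shippable under these constraints
    return max(eligible, key=lambda c: c["task_score"])
```

With three checkpoints where the latest has the highest `task_score` but exceeds the cost and regression limits, `pick_release` returns an earlier one, which is the sense in which the released model need not be the rightmost point on the training curve.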

When users see a model name, they might think it corresponds to a smooth upward training curve. But which checkpoint is actually chosen for release—that's a different story.

The value of large models lies not only in their own serving capability, but also in their continued role as training data sources, distillation sources, and release bases for the next generation of models.

前沿模型发布后,训练链路还在继续跑

单用一轮预训练的思路来理解今天的大模型,已经不够了。发布出去的模型背后,通常已经跑完了预训练、后训练、蒸馏、专用化这整条链路,而且更强的模型还在持续给下一代产出训练数据。

DeepSeek-R1 系列的蒸馏就是很典型的例子,大模型先通过 RL 和 verified rewards 把推理能力练出来,再把这些推理轨迹迁给更小的 dense 模型。TranslateGemma 这类专用模型则展示了另一条路线:在更明确的目标任务上,用高质量数据和专门的奖励设计,把能力进一步压缩和定向。到了这一步,更强的模型已经不只是拿来服务用户,也开始直接给下一代模型产出训练数据。

背后的原因比轨迹迁移更根本一些:一个可能的解释是,互联网语料里知识记忆和推理能力是耦合在一起的,现有的预训练目标要求模型同时把两件事都学好。大模型之所以要先上来,是因为只有足够大,才能同时撑起这两件事,然后再用它来生成纯推理示范数据,小模型在这类数据上训练,就可以专注在推理本身,不用再被迫把所有知识都记住;先大再小,一个关键原因是能力解耦,不只是成本策略。

另一边,部署适配性和能力本身同样重要。很多场景不需要全能大模型,更关心成本、延迟、稳定性和可控性,训练的终点不一定是更大,也可能是更小、更便宜、更专门。

最后发布的模型,不一定是训练曲线最右边的那个 checkpoint。实际发布前往往会在多个 checkpoint 之间反复比较真实任务结果、拒答风格、工具稳定性、成本和回归风险。最后上线的版本往往是产品决策,不是单一指标上表现最强的那个。

用户看到模型名字,会以为它对应一条平滑上升的训练曲线,但真正选哪个 checkpoint 上线,那是另一回事。

大模型的价值,既在它自己的服务能力,也在它会继续给下一代模型提供训练数据、蒸馏来源和发布基座。

§ 20

Figure 12: Industry Diffusion via Distillation. Staircase diagram with four ascending steps. Step 1 (bottom), "GPT-3 scale": trained on raw internet text; generates basic instruction data. Step 2, "GPT-4 scale": trained partly on synthetic data; generates high-quality reasoning traces and CoT. Step 3, "DeepSeek-R1 / o1 scale": trained with RL on verifiable rewards; generates distillation trajectories for small models. Step 4 (top), "Small deployable model": trained on Step 3 synthetic data; matches GPT-4 on structured tasks. An arrow along the staircase is labeled "Models must get bigger before they can get smaller." Between Step 3 and Step 4, an arrow labeled "↓ Parameters" marks the scale reversal. Annotation: "Frontier model value = training data source for the whole industry, not just its own inference."

Beyond offline training, near-online continuous optimization has also entered the main pipeline. Cursor Composer 2's real-time RL shows that some agent capabilities are already being iterated through production traffic, rather than waiting for the next large-scale offline training round. The boundary between training and deployment hasn't disappeared, but the feedback loop between them is shortening.

How to understand why a model got better

In 2026, the value of frontier models increasingly depends on who can run the complete training pipeline after pretraining: continuously producing training data, doing distillation, specialization, getting evaluation and reward right, and making the final release selection.

Because of this, when looking at why a model suddenly got better, start with three things:

离线训练之外,接近在线的持续优化也已经进了主流程,Cursor Composer 2 的 real-time RL 说明一部分 Agent 能力已经开始通过生产流量持续迭代,而不是等下一轮大规模离线训练统一刷新。训练和部署之间的边界并没有消失,但两者的反馈回路正在缩短。

以后怎么看一个模型为什么变强了

2026 年前沿模型的价值,越来越看谁能把预训练后面这整套训练链路跑完整:持续产出训练数据、做蒸馏、做专用化、把评测和奖励做好、做最后的发布选择。

也因为这样,后面再看一个模型为什么突然变强,可以先看三件事:

§ 21

First, look at whether the change happened at the pretraining layer or in the later training pipeline. Many capability improvements do come from stronger pretraining and better data recipes, but many perceived improvements are primarily from post-training. Whether a model follows instructions, uses tools, and has a stable response style—these often don't emerge naturally just from feeding more corpus data.

Second, look at which layer the improvement comes from: is it the weights and training recipe, the reward/eval/grader, or the harness code and deployment loop? At the reasoning model and agent stage, what users feel as "stronger" is often not the result of the base model alone. How evaluation is set up, how rewards are scored, whether the tool environment is stable, how retrieval and memory are organized, how summaries and context are pruned, and which checkpoint was selected for release—all of these collectively determine the final product performance.

Third, look at what the released version is optimizing for. Some versions pursue higher upper limits; others compress cost, latency, and regression risk; still others specialize for a particular use case. A released version is a product decision, not the rightmost point on the training curve. So when looking at a model update, it's more helpful to see what it's actually optimizing for.

If you break down "the model suddenly got better" into production components, many improvements are actually amplified by the second half of the training stack and the outer harness. The iteration cycle of this chain is also shortening: production traffic continuously flows back into training; each generation of stronger models produces not only capabilities but also supervision data for the next generation; the outer program is constantly rewritten based on rollouts, logs, and real task feedback.

The model released today is just a snapshot. The pipeline and the harness program are the products that keep running.

先看变化发生在预训练层,还是后面的训练流程。 很多能力提升确实来自更强的预训练和更好的数据配方,但也有很多体感变化,其实主要出在后训练。模型会不会听指令、会不会用工具、回答风格稳不稳,常常不是多训一点语料自己长出来的。

再看提升来自哪一层: 是权重和训练配方,还是 reward / eval / grader,还是 harness code 和 deployment loop。到了推理模型和 Agent 这一段,用户感受到的变强,很多时候已经不是基础模型单独做出来的结果。评测怎么设、奖励怎么打、工具环境稳不稳、retrieval 和记忆怎么组织、summary 和上下文怎么剪、上线时选了哪个 checkpoint,这些都会一起改掉最后的产品表现。

最后看上线版本在优化什么。 有些版本是在追求更高上限,有些版本是在压成本、延迟和回归风险,还有些版本是在给某一类场景做专用化。发布版本本来就是产品决策,不是训练曲线最右边那个点,所以看模型更新时,顺手看它到底在优化什么,会更接近真实情况。

把模型突然变强这件事拆回生产环节看,很多提升其实是后半段训练栈和外层 harness 一起放大的。这条链路的迭代周期也在缩短:生产流量持续回流到训练,每代更强的模型在产出能力的同时也在产出下一代监督数据,外层程序根据 rollouts、logs 和真实任务反馈不断重写。

今天发布的模型只是一个快照,链路和 harness program 才是持续在跑的产品。

§ 22

Hoffmann et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556

Ouyang et al. (2022). Training language models to follow instructions with human feedback (InstructGPT). arXiv:2203.02155

Shao et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO). arXiv:2402.03300

DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948

DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437

Llama Team, AI @ Meta (2024). The Llama 3 Herd of Models. arXiv:2407.21783

Bai et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073

OpenAI (2024). Deliberative Alignment: Reasoning Enables Safer Language Models. openai.com/index/deliberative-alignment

Anthropic (2025). Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models. anthropic.com/research/reward-tampering

MacDiarmid et al. (2025). Natural Emergent Misalignment from Reward Hacking in Production RL. arXiv:2511.18397

Lee et al. (2026). Meta-Harness: End-to-End Optimization of Model Harnesses (preprint project page). yoonholee.com/meta-harness

Kimi Team (2026). Kimi K2.5 Tech Blog: Visual Agentic Intelligence. kimi.com/blog/kimi-k2-5

Rush, S. (2026). A technical report on Composer 2. cursor.com/blog/composer-2-technical-report

Chroma (2026). Chroma Context-1: Training a Self-Editing Search Agent. trychroma.com/research/context-1
