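Replacing labeled data with a reward function: GRPO lifts Qwen3-8B's JSON structured-output validity from 62% to 82%

Source: x.com · Collected 2026-06-12 06:00 · 12 min read

AI summary

This is a hands-on write-up. The author applies GRPO (Group Relative Policy Optimization), the method used by DeepSeek-R1, to a concrete task: training Qwen3-8B to extract structured JSON fields from invoice text. Traditional SFT (supervised fine-tuning) trains by imitating examples, and its token-level loss cannot effectively penalize format errors, so the model quickly hits a ceiling on structural compliance. The author's core argument is that whenever "correctness" can be defined in code (for example, whether the JSON parses and matches a schema), a Python reward function can replace labeled data and drive the model to learn by competing against its own outputs. In practice the reward function gives 0 for invalid JSON, 0.5 for valid JSON that fails the schema, and 1 for fully compliant output; the intermediate score supplies a key learning gradient. Training ran on H200s on the Fireworks platform. On 50 held-out eval samples, schema compliance rose from a 62% baseline to 82%, beating GPT-4.1's 58% at lower inference cost and latency. The article is aimed at engineers who need models to reliably produce structured output (SQL, API responses, tool calls) and provides complete code for the reward function, dataset, and training configuration.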

§ 1

You ask a language model for JSON, and most of the time it works.

Then one response comes back broken. A number shows up as a string, or the model wraps the JSON in a line of explanation, and the parser chokes.

That small failure rate is the whole problem. The output feeds a function call or a database write, and looking like valid JSON is not the same as being valid JSON.

The code downstream only finds out which one it got when it breaks.

That is what structured output means. The model returns data in a fixed shape that matches a schema, not free text that happens to read right.

Agents, tool calls, and data pipelines all run on this now. The model is writing for code to run, not for a person to read.

Getting this reliable is the hard part, and the usual answer is to give the model more. You add examples, tighten the prompt, and fine-tune on correct outputs.

It helps a little, then stops. The ceiling was never the amount of data. It is the objective.

§ 2

DeepSeek-R1 showed a way around this. Training a strong model used to mean annotation pipelines, preference pairs, and a team of labelers.

DeepSeek replaced all of it with one Python function that checks whether an answer is right. If you can define correctness via code, you do not need the rest.

That is the idea behind GRPO. Instead of learning from examples, the model learns from a reward function you write.

For each prompt, it generates a few candidate answers. The reward function scores them, and the model is pushed toward the ones that score higher.

§ 3

In this walkthrough, I use it to fine-tune Qwen3-8B for JSON extraction. The loop runs from a local notebook while the model trains on remote H200s.

The reward function does one thing: it checks whether each output parses and matches the schema.

Schema-valid output went from 62% on the base model to 82% after training, past GPT-4.1 on the same eval at 58%.

Before the build, it helps to see why the obvious approach stalls. That is what makes everything after it click.

§ 4

SFT learns by copying examples. Show it correct completions, and it gets good at producing output that looks like them.

But looking like valid JSON and being valid JSON are different goals. SFT only ever chases the first one.

The loss is measured token by token. A completion with one field typed wrong scores almost the same as a perfect one.

So you add more examples. The number ticks up, then flattens, because the limit is the objective, not the data.
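
To make the gap concrete, here is a minimal sketch. It uses character-level similarity from Python's difflib as a rough stand-in for "how close this looks at the token level". That stand-in is an assumption for illustration and is not how SFT loss is actually computed. The two sample strings and the helper names are made up.

```python
import json
from difflib import SequenceMatcher

reference = '{"vendor": "Acme Corp", "date": "2024-03-15", "amount": 1250.0, "currency": "USD"}'
# Same content, but the amount is a string instead of a number.
wrong_type = '{"vendor": "Acme Corp", "date": "2024-03-15", "amount": "1250.0", "currency": "USD"}'


def surface_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a crude proxy for 'looks alike'."""
    return SequenceMatcher(None, a, b).ratio()


def amount_is_number(text: str) -> bool:
    """Strict check: the amount field must be a JSON number, not a string."""
    value = json.loads(text)["amount"]
    return isinstance(value, (int, float)) and not isinstance(value, bool)


similarity = surface_similarity(reference, wrong_type)
print(f"surface similarity: {similarity:.3f}")
print("reference amount is a number: ", amount_is_number(reference))
print("wrong_type amount is a number:", amount_is_number(wrong_type))
```

The two strings are nearly identical on the surface, yet only one of them survives a type check. An objective that rewards surface closeness has very little reason to prefer one over the other.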

§ 5

Once you see the problem, the fix is clear. You define correct in code, and train the model against that definition directly.

This is what GRPO does. It swaps labeled examples for a reward function.

For each prompt, the model generates a small group of answers, usually four to eight. Your reward function scores every one of them.

The scores are normalized inside the group. The update then reinforces the answers that scored above the group's average.

So the model is always comparing its own outputs against each other. It learns what "more correct" means for your task, not what "more similar to an example" means.
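
The group normalization described above can be sketched in a few lines. This follows the commonly described GRPO form, (reward minus group mean) divided by group standard deviation; whether the training loop used in this article applies exactly this variant (population versus sample standard deviation, the epsilon value) is an assumption. The function name is hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled answers, scored by the reward function.
rewards = [0.0, 0.5, 1.0, 0.5]
advantages = group_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.2f}")
```

Answers above the group mean get a positive advantage and are reinforced; answers below it get a negative one. An answer exactly at the mean contributes nothing to the update.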

Here is how the reward function scores three different outputs for the same prompt.

Output that doesn't parse as JSON scores 0.0.
Output that parses but fails the schema scores 0.5.
Output that parses and matches the schema scores 1.0.

That middle score matters more than it looks. Without it, valid JSON with the wrong field types would score zero, the same as complete garbage.

The model would lose an important signal, that valid structure is already progress. The 0.5 is what gives training something to climb toward.
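
A small sketch shows why the middle tier matters once rewards are normalized within a group. The group below is hypothetical: an early-training batch where two answers are garbage and two are valid JSON with a wrong field type, and none is fully correct. The normalization is the standard GRPO-style form and is assumed here, not taken from the article's code.

```python
from statistics import mean, pstdev


def advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical outcomes for four answers to the same prompt.
outcomes = ["garbage", "valid_json_wrong_type", "garbage", "valid_json_wrong_type"]

# Binary reward: only fully schema-valid output scores, so everything is 0.
binary = [0.0 for _ in outcomes]
# Tiered reward: valid JSON earns partial credit.
tiered = [0.5 if o == "valid_json_wrong_type" else 0.0 for o in outcomes]

binary_adv = advantages(binary)
tiered_adv = advantages(tiered)
print("binary advantages:", binary_adv)
print("tiered advantages:", [round(a, 2) for a in tiered_adv])
```

With the binary reward every answer in the group ties, all advantages are zero, and the step carries no learning signal. With the tiered reward the valid-JSON answers are pushed up relative to the garbage ones.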

§ 6

GRPO is heavier than SFT. On an 8B model it needs H200s and runs for hours.

Every step, it generates several completions per prompt, scores all of them, and updates the weights. That repeats across the whole dataset, many times over.

This is not something you run on your laptop.

There is also a timing problem SFT never has. During rollout, the model samples answers from its current weights. During training, those same weights are changing underneath it.

If the inference side and the trainer fall out of sync, you sample from a stale model and train on answers the current one would never give. This is where most custom RL setups fall apart.
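
A toy model of the sync problem, assuming one weight update per training step and a sampler that only receives new weights every few steps. It does not model any real training system; the function and variable names are hypothetical.

```python
def staleness_per_step(total_steps: int, sync_interval: int) -> list[int]:
    """For each step, how many updates behind the sampling weights are.

    The trainer's version increases by one after every step. The sampler's
    copy is refreshed only after every `sync_interval`-th step.
    """
    trainer_version = 0
    sampler_version = 0
    lag = []
    for step in range(1, total_steps + 1):
        # Rollouts for this step are sampled with the sampler's current copy.
        lag.append(trainer_version - sampler_version)
        trainer_version += 1  # weight update
        if step % sync_interval == 0:
            sampler_version = trainer_version  # resync the inference side
    return lag


print("sync every step:   ", staleness_per_step(8, 1))
print("sync every 4 steps:", staleness_per_step(8, 4))
```

With a sync after every step the lag is always zero. With a sync every four steps the sampler drifts up to three updates behind before it catches up, which is the stale-sampling situation described above.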

Fireworks' Training API handles both sides. You write the training logic in Python on your own machine.

Their infrastructure does the rest. It provisions the GPUs, runs the forward and backward passes, saves checkpoints, and resyncs the inference deployment after every step.

The setup is three steps. You write the reward function, upload the dataset, and configure the run.

Let's go through each one.

§ 7

This is the only place your task is defined. The schema says what a correct output looks like, and score() checks each completion against it.

For invoice extraction, I pull four fields from raw text: vendor, date, amount, and currency.

import json
from jsonschema import validate, ValidationError

SCHEMA = {
    "type": "object",
    "required": ["vendor", "date", "amount", "currency"],
    "properties": {
        "vendor":   {"type": "string"},
        "date":     {"type": "string"},
        "amount":   {"type": "number"},
        "currency": {"type": "string"},
    },
    "additionalProperties": False
}

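def score(completion: str) -> float:
    # Tier 0: the output must parse as JSON at all.
    try:
        parsed = json.loads(completion.strip())
    except (json.JSONDecodeError, ValueError):
        return 0.0
    # Valid JSON earns 0.5; JSON that also matches the schema earns 1.0.
    try:
        validate(instance=parsed, schema=SCHEMA)
        return 1.0
    except ValidationError:
        return 0.5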

jsonschema handles the type checks, the required fields, and any nested rules in a single call.

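For a different JSON-shaped task, like tool-call formatting, you swap in a new schema and score() stays the same. For non-JSON output like SQL, you also swap the parsing step for a parser that fits the format; the three reward tiers carry over.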
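
Here is how the three tiers play out on sample outputs. To keep the sketch runnable without third-party packages, the jsonschema call is replaced by a hand-rolled check intended to mirror the invoice schema (exact key set, matching types). REQUIRED and matches_schema are hypothetical names, and the sample completions are made up.

```python
import json

# Stand-in for the invoice schema: every key is required, no extras allowed.
REQUIRED = {"vendor": str, "date": str, "amount": (int, float), "currency": str}


def matches_schema(obj) -> bool:
    """Check exact keys and value types for the four invoice fields."""
    if not isinstance(obj, dict) or set(obj) != set(REQUIRED):
        return False
    for key, expected in REQUIRED.items():
        value = obj[key]
        # bool is a subclass of int in Python, but JSON Schema does not
        # treat true/false as numbers, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


def score(completion: str) -> float:
    try:
        parsed = json.loads(completion.strip())
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return 0.0
    return 1.0 if matches_schema(parsed) else 0.5


samples = {
    "wrapped in prose": 'Here is the JSON: {"vendor": "Acme Corp"}',
    "amount as string": '{"vendor": "Acme Corp", "date": "2024-03-15", "amount": "1250.00", "currency": "USD"}',
    "fully valid": '{"vendor": "Acme Corp", "date": "2024-03-15", "amount": 1250.0, "currency": "USD"}',
}
scores = {name: score(text) for name, text in samples.items()}
for name, value in scores.items():
    print(f"{name:18s} -> {value}")
```

The output wrapped in a sentence never reaches the schema check, the string-typed amount earns partial credit, and only the last sample gets full reward.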

§ 8

GRPO does not need labeled outputs. The dataset is just the prompts you would send in production.

The model writes its own completions during training, and score() grades them as they come.

I used 200 training prompts. They cover different vendor names, date formats, amount styles, and currency codes.

I also set aside 50 eval prompts that the model never trains on.

Variety matters more than volume here. Prompts that all look alike produce a model that breaks on real invoice variation.
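
One way to get that variety is to cross a few lists of vendors, date formats, and amount styles. The sketch below is an assumption about how such a builder could look, not the article's actual dataset builder; all values and names are invented. It produces rows in a chat-message JSONL format with a single user message per row, and splits them 200/50.

```python
import json
import random

VENDORS = ["Acme Corp", "TechSupplies Inc", "Globex GmbH", "Initech LLC", "Umbrella Ltd"]
DATES = ["2024-03-15", "January 8 2024", "15/03/2024", "Mar 15, 2024", "8 Jan 2024"]
AMOUNTS = [("$1,250.00", "USD"), ("340", "euros"), ("£89.99", "GBP"),
           ("12 500,00", "EUR"), ("7800", "JPY")]
TEMPLATES = [
    "Bill from {vendor}, dated {date}, total {amount} {currency}.",
    "Received from {vendor} on {date}, amount due: {amount} {currency}.",
]


def make_row(vendor, date, amount, currency, template):
    """Build one prompt-only row; GRPO needs no labeled completion."""
    invoice = template.format(vendor=vendor, date=date, amount=amount, currency=currency)
    content = ("Extract the following fields from this invoice:\n\n"
               f"{invoice}\n\nReturn valid JSON only.")
    return {"messages": [{"role": "user", "content": content}]}


def build_rows(seed: int = 0):
    """Every combination once (5 * 5 * 5 * 2 = 250 rows), shuffled reproducibly."""
    combos = [(v, d, a, c, t)
              for v in VENDORS for d in DATES for (a, c) in AMOUNTS for t in TEMPLATES]
    random.Random(seed).shuffle(combos)
    return [make_row(*combo) for combo in combos]


def to_jsonl(rows) -> str:
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


rows = build_rows()
train_rows, eval_rows = rows[:200], rows[200:250]
print(len(train_rows), "train rows,", len(eval_rows), "eval rows")
print(to_jsonl(train_rows[:1]))
```

Because every combination appears exactly once, the 50 eval prompts never overlap with the 200 training prompts. Writing to_jsonl(train_rows) to ./train_prompts.jsonl gives a file in the shape the upload step expects.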

{"messages": [{"role": "user", "content": "Extract the following fields from this invoice:\n\nBill from Acme Corp, dated 2024-03-15, total $1,250.00 USD.\n\nReturn valid JSON only."}]}
{"messages": [{"role": "user", "content": "Extract the following fields from this invoice:\n\nReceived from TechSupplies Inc on January 8 2024, amount due: 340 euros.\n\nReturn valid JSON only."}]}

Upload the dataset and wait for the READY state before the training job can use it.

from fireworks import Fireworks

client = Fireworks()
client.datasets.create(dataset_id="invoice-extraction-grpo-v1",
                       dataset={"exampleCount": "200"})
client.datasets.upload(dataset_id="invoice-extraction-grpo-v1",
                       file="./train_prompts.jsonl")

# poll client.datasets.get(...).state until "READY" before proceeding
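
The polling step in that last comment can be sketched as a generic helper. It takes any zero-argument callable that returns the current state string, so it makes no assumptions about the Fireworks client beyond what the comment already says. The helper name, timeout, and interval are hypothetical, and the demo uses a fake state source.

```python
import time
from typing import Callable


def wait_for_state(get_state: Callable[[], str], target: str = "READY",
                   timeout_s: float = 600.0, interval_s: float = 5.0) -> None:
    """Call get_state() repeatedly until it returns `target` or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == target:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still in state {state!r} after {timeout_s}s")
        time.sleep(interval_s)


# Demo with a fake state source that becomes READY on the third check.
fake_states = iter(["UPLOADING", "UPLOADING", "READY"])
wait_for_state(lambda: next(fake_states), interval_s=0.0)
print("dataset ready")
```

In a real run, get_state would wrap the client.datasets.get(...).state lookup mentioned in the comment above.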

§ 9

rl_loop runs the whole thing. It provisions the trainer, schedules the rollouts, syncs the weights, and cleans up when the run is done.

You connect your score() function by wrapping it and assigning it to rl_loop.reward_fn. The wrapper gets both the completion and the dataset row, so you can reach ground-truth metadata if your reward needs it.

from training.recipes.rl_loop import Config, main
from training.utils import DeployConfig, InfraConfig, WeightSyncConfig
import training.recipes.rl_loop as rl_loop

# Wire your reward function to the training loop
def invoice_reward(completion: str, row: dict) -> float:
    return score(completion)

rl_loop.reward_fn = invoice_reward

cfg = Config(
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="invoice-extraction-grpo-v1",
    max_rows=200,
    epochs=1,
    completions_per_prompt=4,
    max_completion_tokens=256,
    temperature=1.0,
    max_seq_len=4096,
    policy_loss="grpo",
    output_model_id="invoice-extractor-v1",
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k",
    ),
    deployment=DeployConfig(
        deployment_id="invoice-extractor-v1",
        tokenizer_model="Qwen/Qwen3-8B",
    ),
    weight_sync=WeightSyncConfig(weight_sync_interval=1),
)

main(cfg)
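
The run above ignores the row argument. If a reward does need ground truth, a wrapper with the same (completion, row) signature could look like the sketch below. The "expected" key is hypothetical and is not part of the prompt-only dataset used in this article; the tiers are also a different design from the schema-only score().

```python
import json


def reward_with_ground_truth(completion: str, row: dict) -> float:
    """0.0 if unparseable, 0.5 if parseable but wrong values, 1.0 if exact match."""
    try:
        parsed = json.loads(completion.strip())
    except ValueError:
        return 0.0
    # "expected" is an assumed per-row field holding the correct extraction.
    expected = row["expected"]
    return 1.0 if parsed == expected else 0.5


row = {
    "messages": [{"role": "user", "content": "Extract the fields ..."}],
    "expected": {"vendor": "Acme Corp", "date": "2024-03-15",
                 "amount": 1250.0, "currency": "USD"},
}
print(reward_with_ground_truth("not json", row))
print(reward_with_ground_truth('{"vendor": "Acme"}', row))
print(reward_with_ground_truth(json.dumps(row["expected"]), row))
```

This trades the label-free setup for a stricter signal, so it only makes sense when you already have trusted values for each prompt.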

A few of these settings are worth a note.

dataset points to the dataset_id you uploaded in Step 2. Fireworks pulls it from their storage directly.
completions_per_prompt=4 sets the group size for GRPO. Production runs often use 8 to 16, which gives more signal per step at higher compute cost. Four is enough here. The reward is clear-cut enough that even a small group shows real variance between answers.
weight_sync_interval=1 resyncs the inference deployment after every step. That keeps rollout sampling from the exact model being trained. Production runs often set this to 4 or 8 for speed. For a short 200-step run, 1 gives the tightest feedback loop, which is what you want.
One Qwen3 quirk to handle. It defaults to thinking mode and adds <think>...</think> blocks before the answer. Strip them at eval time with content.split("</think>")[-1].strip(). Suppress them in training by adding /no_think to the system prompt. Otherwise the reward function reads the reasoning text instead of the JSON, and scores everything 0.0.
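
A minimal sketch of the eval-time stripping, assuming the reasoning is wrapped in Qwen3's <think>...</think> tags. The helper name and the sample response are made up.

```python
import json


def strip_thinking(content: str) -> str:
    """Keep only the text after the last </think> tag, if there is one."""
    return content.split("</think>")[-1].strip()


raw = ('<think>\nThe vendor is Acme Corp and the total is 1250.\n</think>\n\n'
       '{"vendor": "Acme Corp", "amount": 1250.0}')

# Parsing the raw response fails because it starts with the reasoning block.
try:
    json.loads(raw)
    raw_parses = True
except ValueError:
    raw_parses = False

cleaned = strip_thinking(raw)
print("raw parses:", raw_parses)
print("cleaned:", json.loads(cleaned))
```

Responses without a thinking block pass through unchanged, so the helper is safe to apply to every completion before scoring.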

§ 10

Base Qwen3-8B scores 62% schema-valid on the 50 held-out prompts. After GRPO training on Fireworks H200s, the fine-tuned model hits 82%.

That is past GPT-4.1 on the same eval, which lands at 58%.
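
For scale, the reported percentages translate into prompt counts as follows. The counts are derived here from the stated rates and the 50-prompt eval size, not quoted from the article. The rate function is a sketch of how such a metric could be computed from reward scores, and the four-element list at the end is a toy input.

```python
EVAL_SIZE = 50
reported_rates = {"Qwen3-8B base": 0.62, "Qwen3-8B + GRPO": 0.82, "GPT-4.1": 0.58}

# Convert each reported rate into a number of schema-valid outputs.
valid_counts = {name: round(rate * EVAL_SIZE) for name, rate in reported_rates.items()}
for name, count in valid_counts.items():
    print(f"{name:16s} {count}/{EVAL_SIZE} schema-valid")


def schema_valid_rate(scores: list[float]) -> float:
    """Fraction of outputs that earned the full 1.0 reward."""
    return sum(1 for s in scores if s == 1.0) / len(scores)


# Toy input, not eval data: two of four outputs are fully valid.
print(schema_valid_rate([1.0, 0.5, 0.0, 1.0]))
```

On 50 prompts each output is worth two percentage points, so the jump from 62% to 82% corresponds to ten more prompts answered with schema-valid JSON.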

[Figure: baseline eval run on the 50 prompts the model never trained on]

[Figure: the same eval after training]

The trained model runs on a Fireworks serverless endpoint, at a fraction of GPT-4.1's per-token cost. Latency is lower too, since the output is short and predictable.

The real difference shows up on messy inputs. A prompted general-purpose model starts to slip, while the trained model holds, because it learned the constraint instead of the examples.

§ 11

This pattern works for any task where you can check correctness in code. SQL that has to parse, API responses that must match a shape, tool calls, code that has to pass a linter.

If you can score an output, you can train a model to chase that score.

What DeepSeek-R1 proved at frontier scale holds for your own small task. The model you get has practiced your definition of correct, not memorized examples of it.

On inputs it has never seen, that is the difference that holds up.

The full code is in the repo below. That includes the reward function, the training config, the dataset builder, and the eval script.

Training API docs → Finetuning Code →

(don't forget to star 🌟)

Thanks for reading, and to Fireworks for partnering on today's article.

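Glossary

GRPO (Group Relative Policy Optimization): A reinforcement learning training method. The model generates a group of answers for each prompt, a reward function scores them, and the model is updated toward the answers that score relatively high within the group, with no labeled correct answers required.
SFT (Supervised Fine-Tuning): Training by having the model imitate labeled correct examples. Its loss is computed at the token level, which makes it hard to distinguish structural errors from semantic ones.
Reward function: Custom Python code that judges the quality of a model output and returns a scalar score. In GRPO it replaces human-annotated preference data.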