Glean 拾遗
Recent picks

1 pick · chronological

06-11

Training an LLM to Generate Reliable Structured Output Using GRPO and a Reward Function

A hands-on report on training reliable structured output with a code-defined reward function in place of labeled data. The author fine-tunes Qwen3-8B for JSON invoice extraction using GRPO. Supervised fine-tuning stalls because its token-level loss optimizes only for surface similarity to the reference text, not for structural validity. The fix is a reward function that scores each completion 0.0 (invalid JSON), 0.5 (valid JSON but wrong schema), or 1.0 (fully compliant), so partially correct outputs still earn credit and the model gets a graded learning signal. Training on Fireworks H200s raised schema-valid output on held-out prompts from a 62% baseline to 82%, ahead of GPT-4.1's 58%, at lower cost and latency. The approach transfers to any task where correctness can be verified in code, such as SQL, API calls, or tool use. The full reward function, dataset, and training config are provided.

x.com · 12 min · AI Engineering · Fine-tuning · GRPO
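
The three-tier scoring described in the summary above can be sketched roughly as follows. This is a minimal illustration, not the author's reward function: the field names in `REQUIRED_FIELDS` and the function name `structured_output_reward` are hypothetical, since the summary does not list the actual invoice schema.

```python
import json

# Hypothetical invoice schema: field name -> accepted Python type(s).
# The source does not list the real fields; these are placeholders.
REQUIRED_FIELDS = {
    "invoice_number": str,
    "vendor": str,
    "total": (int, float),
    "line_items": list,
}


def structured_output_reward(completion: str) -> float:
    """Score a completion 0.0, 0.5, or 1.0 using the tiers from the summary."""
    # Tier 0.0: the text is not parseable JSON at all.
    try:
        parsed = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0

    # Tier 0.5: valid JSON, but not an object matching the schema.
    if not isinstance(parsed, dict):
        return 0.5
    for field, expected_type in REQUIRED_FIELDS.items():
        value = parsed.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return 0.5

    # Tier 1.0: valid JSON with every required field of the right type.
    return 1.0
```

Because the check is plain code, the same function could score any number of sampled completions per prompt without labeled targets, which is the property that makes it usable as a reward during GRPO training.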