What Really Differentiates LLMs Happens After Pretraining: A Full Post-Training Pipeline Breakdown
A deep dive into the full LLM training pipeline, arguing that the real capability gap in 2026 lies not in pretraining but in the post-training stack: instruction tuning, RL, reward design, agent training, and distillation. The article walks through the end-to-end process step by step: data recipes and system architecture constraints; the four-stage post-training pipeline (Cold Start SFT → GRPO-based Reasoning RL → Rejection Sampling FT → Alignment RL); grader and reward evaluation loops; agent training with PARL and Meta-Harness; and finally distillation and deployment. Key engineering insights include DeepSeek-R1's public recipe, why GRPO simplifies PPO by removing the value network, the trade-offs between process reward models (PRM) and outcome reward models (ORM), and the shift from optimizing answers to optimizing harness programs. It is aimed at engineers who want to trace concrete capability gains back to specific training stages.
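To make the GRPO point concrete, here is a minimal sketch of the group-relative advantage that stands in for PPO's learned value baseline. It assumes several completions are sampled for one prompt and each has already been scored by a reward function; the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, not details taken from the article.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each completion relative to the other samples for the same prompt.

    The group mean acts as the baseline, so no separate value network is
    needed. `eps` (an assumed constant) avoids division by zero when every
    sample in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    baseline = mean(rewards)   # group mean replaces the critic's value estimate
    spread = pstdev(rewards)   # group spread rescales the signal
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat their group's average get a positive advantage and the rest get a negative one. When every sample in a group scores the same, all advantages are zero and that prompt contributes no gradient signal.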