GLM-5.2: Built for Long-Horizon Tasks, Delivering a 1M-Token Context and an Open-Source Inference Stack

Source: z.ai · Collected 2026-06-22 09:32 · 21 min read
AI Summary

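Zhipu AI has released its flagship model GLM-5.2, which focuses on long-horizon task capability, runs stably on a 1M-token context window for the first time, and ships under the MIT open-source license. On the architecture side, it introduces IndexShare, in which every four Transformer layers share one sparse-attention indexer, reducing per-token FLOPs by 2.9× at a 1M context. It also improves the MTP layer: IndexShare and KV sharing eliminate the training-inference discrepancy and, together with rejection sampling and an end-to-end TV loss, raise the speculative-decoding acceptance length by 20%. In post-training, large-scale agentic RL training is organized in a unified way on the slime framework, and an anti-hack module detects and blocks, online, shortcut behaviors such as an agent reading protected evaluation artifacts or downloading answers with curl, which keeps the training signal valid. GLM-5.2 ranks first among open-source models on long-horizon benchmarks including FrontierSWE, PostTrainBench, and SWE-Marathon, and scores 81.0 on Terminal-Bench 2.1, approaching the closed-source frontier. The article is suited to developers interested in long-context inference, coding agents, and the engineering of open-source large models.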

§ 1

We're introducing GLM-5.2, our latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a solid 1M-token context. GLM-5.2's new capabilities include:

  • Solid 1M Context: A solid 1M-token context that stably sustains long-horizon work
  • Advanced Coding with Flexible Effort: Stronger coding capabilities with multiple thinking effort levels to balance performance and latency
  • Improved Architecture: We propose IndexShare, which reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at a 1M context length. We also improve GLM-5.2’s MTP layer for speculative decoding, increasing the acceptance length by up to 20%
  • Pure Open: An MIT open-source license — no regional limits, technical access without borders

§ 2

Supporting long-horizon tasks starts with making long context engineering-usable: the model must maintain quality across long, messy coding-agent trajectories, not just accept more tokens. A 1M context is easy to claim, but much harder to keep reliable under real engineering pressure. To this end, we substantially expanded 1M-context training for coding-agent scenarios, covering large-scale implementation, automated research, performance optimization, and complex debugging. The result is a long-context system that is not only wide in scope, but solid in execution: a practical substrate for sustained engineering work.

§ 3

This capability is reflected in GLM-5.2's performance on three long-horizon coding benchmarks. FrontierSWE measures whether an agent can complete open-ended technical projects at the scale of hours to tens of hours, spanning systems optimization, large-scale code construction, and applied ML research. On this benchmark, GLM-5.2 trails Opus 4.8 by only 1%, while edging out GPT-5.5 by 1% and Opus 4.7 by 11%. On PostTrainBench, where each agent is given an H100 GPU and evaluated by how much it can improve small models through post-training, GLM-5.2 outperforms both Opus 4.7 and GPT-5.5, ranking second only to Opus 4.8. On SWE-Marathon, an ultra-long-horizon software engineering benchmark covering tasks such as building compilers, optimizing kernels, and developing production-grade services, GLM-5.2 still has room to grow, trailing Opus 4.8 by 13% while remaining second only to the Opus series. Across all three benchmarks, GLM-5.2 is the highest-ranked open-source model, showing that its 1M context has translated into practical long-horizon delivery capability.

§ 4

On standard coding benchmarks, GLM-5.2 is the strongest open-source model, improving on GLM-5.1 by a wide margin: 81.0 vs. 63.5 on Terminal-Bench 2.1 and 62.1 vs. 58.4 on SWE-bench Pro. It also closes much of the gap to the closed-source frontier — on Terminal-Bench 2.1 (81.0) it lands within a few points of Claude Opus 4.8 (85.0) — while staying ahead of Gemini 3.1 Pro.

§ 5

GLM-5.2 also introduces effort level control, enabling users to explicitly balance model capability against task execution speed and computational cost. As shown in the figure, GLM-5.2 delivers substantially stronger agentic coding performance than GLM-5.1 at comparable token budgets, with its capability roughly positioned between Claude Opus 4.7 and Claude Opus 4.8 under similar token consumption. Moreover, the Max effort level allows users to allocate additional computation when higher performance is required in challenging tasks, further extending the model’s coding capability. This design gives users greater flexibility when using GLM-5.2 for coding tasks, allowing them to select the most suitable reasoning mode for different scenarios.

§ 6

To support a 1M context length, GLM-5.2 applies IndexShare to reduce the computational cost of the indexer in DSA. Specifically, every four transformer layers share one lightweight indexer: the indexer sits in the first layer of each group of four, and the top-k indices it selects are reused by all four layers. This removes the indexer dot product and the top-k operation from three out of every four layers. GLM-5.2 is trained with IndexShare starting from mid-training at a 128K sequence length, and it outperforms GLM-5.1 on long-context benchmarks with less computation.
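
The sharing pattern described above can be illustrated with a toy sketch. This is not the GLM-5.2 implementation: the dot-product `indexer_scores`, the function names, and the simplification that every layer sees the same keys and values are all assumptions made for illustration. The only detail taken from the text is that the indexer runs in the first layer of each group of four and its top-k indices are reused by the whole group.

```python
import math

GROUP_SIZE = 4  # from the text: every 4 layers share one indexer


def topk_indices(scores, k):
    """Return the positions of the k largest scores, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def indexer_scores(query, keys):
    """Hypothetical lightweight indexer: one dot product per cached token."""
    return [sum(q * x for q, x in zip(query, key)) for key in keys]


def sparse_attention(query, keys, values, selected):
    """Softmax attention restricted to the selected token positions."""
    logits = [sum(q * x for q, x in zip(query, keys[i])) for i in selected]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]


def run_layers(query, keys, values, num_layers, k):
    """Run sparse-attention layers, recomputing top-k indices once per group."""
    indexer_calls = 0
    selected = None
    hidden = query
    for layer in range(num_layers):
        if layer % GROUP_SIZE == 0:
            # First layer of the group: run the indexer and the top-k selection.
            selected = topk_indices(indexer_scores(hidden, keys), k)
            indexer_calls += 1
        # All layers in the group attend over the same selected positions.
        hidden = sparse_attention(hidden, keys, values, selected)
    return hidden, indexer_calls
```

With 8 layers, `run_layers` invokes the indexer twice instead of eight times, which is the "3 out of 4 layers skipped" effect the paragraph describes. The sketch says nothing about the reported 2.9× FLOPs reduction, which depends on model details not given here.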

§ 7

We improve the MTP layer of GLM-5.2 for speculative decoding with two objectives: 1) minimize the cost of the MTP layer as a draft model; 2) maximize the acceptance rate of speculative decoding.

For the first objective, we also apply IndexShare to the MTP layer. In multi-step MTP, the indexer is placed on the first step and its top-k indices are reused for all following steps. Unlike in the backbone, however, the input tokens differ across MTP steps. As the following figure shows, if we reuse the top-k indices of $h_{4}$ for $h_{5}$, then $h_{5}$ can attend only to $h_{1}$ through $h_{4}$, not to $h_{5}$ itself. We will show that this property helps achieve the second objective by eliminating the training-inference discrepancy in GLM-5.1's MTP layer.

In the above figure we show the inference of a two-step MTP layer. In the first step, inference is consistent with training: all hidden states come from the target model. In the second step, however, $h_{1:4}$ come from the target model while $h_{5}$ comes from the MTP layer. The KV cache of $h_{5}$ is therefore a mixture of $kv_{1:4}$, computed from the target model, and $kv_{5}$, computed from the MTP layer. With IndexShare, by contrast, the KV cache of $h_{5}$ includes only $kv_{1:4}$, all computed from the hidden states of the target model. For training, we reuse both the KV cache and the top-k indices of the first MTP step. Note that, as in GLM-5.1, the parameters are shared across MTP steps. Furthermore, inspired by https://arxiv.org/abs/2606.12370, we introduce rejection sampling for speculative decoding and train with an end-to-end TV loss.
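
The text does not spell out its rejection-sampling procedure or the exact form of the TV loss. As background, here is a minimal sketch of the standard speculative-sampling acceptance rule for a single token, assuming a target distribution `p` and a draft distribution `q` over a small vocabulary (both names are illustrative). Under this standard rule, the chance that a drafted token is accepted equals one minus the total variation (TV) distance between `p` and `q`, which is one plausible reason a TV-based training objective would raise acceptance length.

```python
import random


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def acceptance_probability(p, q):
    """Chance that a token drafted from q is accepted under target p."""
    return sum(min(a, b) for a, b in zip(p, q))


def speculative_step(p, q, rng):
    """Draft one token from q, then accept it or resample so the output follows p."""
    draft = rng.choices(range(len(q)), weights=q)[0]
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft, True
    # On rejection, resample from the normalized positive part of (p - q).
    residual = [max(a - b, 0.0) for a, b in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False
```

For example, with `p = [0.5, 0.3, 0.2]` and `q = [0.2, 0.3, 0.5]`, the TV distance is 0.3 and the acceptance probability is 0.7. The tokens returned by `speculative_step` are distributed according to `p` regardless of how poor the draft is, so a better draft changes speed, not output quality.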

The table below shows an ablation of these techniques, measured by acceptance length in coding scenarios. The experiment uses the backbone and training data of GLM-5.1, with the number of MTP steps set to 7 for both training and inference. Compared with the baseline, the acceptance length of the final MTP layer increases by 20%.

| Method | Acceptance Length |
| --- | --- |
| Baseline | 4.56 |
| + IndexShare + KV Share | 5.10 |
| + Rejection Sampling | 5.29 |
| + End-to-end TV Loss | 5.47 (+20%) |
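
As a quick check of the arithmetic behind the ablation figures above, the cumulative gain of each row over the baseline can be computed directly. The acceptance lengths are copied from the table; the helper name `relative_gains` is only illustrative.

```python
# Acceptance lengths copied from the ablation table.
ABLATION = [
    ("Baseline", 4.56),
    ("+ IndexShare + KV Share", 5.10),
    ("+ Rejection Sampling", 5.29),
    ("+ End-to-end TV Loss", 5.47),
]


def relative_gains(rows):
    """Percentage gain of each row over the first (baseline) row, to one decimal."""
    base = rows[0][1]
    return [(name, round(100.0 * (length / base - 1.0), 1)) for name, length in rows]


for name, gain in relative_gains(ABLATION):
    print(f"{name}: +{gain}%")
```

The cumulative gains come out to roughly 11.8%, 16.0%, and 20.0%, so IndexShare with KV sharing accounts for more than half of the total improvement.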

§ 8

As GLM-5.2 extends the maximum context length from 200K to 1M tokens, coding workloads are expected to shift substantially toward longer prompts. This shifts the primary inference bottleneck from computation to KV-cache capacity, long-context kernel overhead, and CPU-side overhead. Although the new GLM-5.2 architecture reduces per-token computational FLOPs, it does not proportionally reduce per-token KV-cache size. As a result, supporting longer contexts, higher concurrency, and higher token throughput under limited GPU resources becomes a central challenge for inference engine optimization.

To address this challenge, we optimize the inference engine along three directions. First, building on LayerSplit, we introduce finer-grained memory management and parallelization strategies to increase KV-cache capacity and provide more usable cache space for ultra-long-context requests. Second, we optimize kernels whose cost grows with context length and better coordinate them with the cache transfer pipeline, minimizing the impact of cache transfer on both prefill and decode performance. Third, we optimize CPU-side cache management, request scheduling, and runtime execution paths to reduce bubbles in the GPU execution pipeline and improve end-to-end throughput. As shown in the figure, GLM-5.2 achieves an increasingly larger throughput advantage as context length grows, demonstrating stronger scalability in long-context inference scenarios.

§ 9

The agentic RL post-training of GLM-5.2 involves tasks at larger scale, across more domains, and with more complex execution patterns. Heterogeneous data and tasks need to be organized within a unified training process, while long-horizon interactions, tool use, sub-task decomposition, and multi-turn environment feedback all impose higher requirements on rollout and training orchestration. To support this process, slime serves as an integrated infrastructure layer from training to large-scale inference rollout. It supports multiple training and task organization modes, including white-box rollout, black-box rollout, compact trajectory, and sub-agent workflow, enabling the same system to scale to larger and more complex RL and OPD training workloads. In the post-training process of GLM-5.2, we used the slime framework to conduct parallel OPD training, efficiently merging more than ten expert models into the final model. The entire OPD training process took approximately two days, demonstrating high training efficiency.

§ 10

Agentic RL also places higher demands on system resources and inference infrastructure. slime provides a highly open and flexible interface to inference systems: the training side can connect to inference services in different forms, and flexibly adapt to different parallelism strategies, routing policies, PD disaggregation setups, and deployment patterns. At the same time, the configuration experience, scheduling strategies, and optimization paths accumulated during RL rollout can be reused and further refined in the production serving stage, allowing the training side and the serving side to reinforce each other. This creates a more direct path from post-training to production deployment. Together with flexible training-inference resource organization and KV-cache FP8, slime provides critical infrastructure support for GLM-5.2’s large-scale agentic RL training, further improving system efficiency, rollout throughput, and large-scale inference concurrency.

§ 11

RL for Long-Horizon Tasks. For GLM-5.2, long-horizon tasks produce substantially longer execution traces, and once a super-long trajectory is split by compaction into multiple sub-traces, different rollouts under the same prompt yield different numbers of trainable traces with highly variable lengths. We therefore move from group-wise optimization to a critic-based PPO formulation that learns from individual rollouts, relying on a critic to estimate token-level advantages rather than group-relative comparisons. This single-rollout formulation fits compaction naturally, as it places no constraint on how many traces a prompt produces or on their relative lengths: we bring compaction into training by including all compacted sub-traces as trainable trajectories, and apply a token-level loss to address their length imbalance.
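
To make the two ingredients concrete, here is a minimal sketch under stated assumptions. The text says only that a critic estimates token-level advantages and that a token-level loss handles length imbalance. Using generalized advantage estimation (GAE) for the critic-based advantages is an assumption, as are all function names and default hyperparameters.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Per-token advantages for one trace via GAE (an assumed estimator).

    values must have len(rewards) + 1 entries; the last is the bootstrap value.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error from the critic's value estimates.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def token_level_mean(per_trace_losses):
    """Average over every token in the batch, so each token weighs the same."""
    total = sum(sum(trace) for trace in per_trace_losses)
    count = sum(len(trace) for trace in per_trace_losses)
    return total / count


def trace_level_mean(per_trace_losses):
    """Average each trace first, then average the traces (shown for contrast)."""
    return sum(sum(t) / len(t) for t in per_trace_losses) / len(per_trace_losses)
```

With one 1-token sub-trace of loss 1.0 and one 3-token sub-trace of loss 0.0 per token, `token_level_mean` gives 0.25 while `trace_level_mean` gives 0.5. The trace-level average lets a short sub-trace carry as much weight as a long one, which is the kind of imbalance a token-level loss avoids. Because the advantages come from the critic rather than from comparisons within a group, nothing in the sketch requires rollouts of the same prompt to yield the same number of sub-traces.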

§ 12

Anti-Hack in Coding Agents. Coding RL is especially vulnerable to reward hacking because the reward is typically a verifiable pass/fail signal. We find that GLM-5.2 shows more potential hacking behavior than GLM-5.1. Such behavior optimizes the verification signal without actually improving the model's fundamental capabilities. An agent can read protected evaluation artifacts, copy answer content from references or upstream commits, or directly fetch the target source in GitHub-related tasks. For example, the agent may download the solution via curl https://raw.githubusercontent.com/<path-to-file>, or even chain several steps into a leak such as:

1. find /workspace -name "*hidden*"
2. cat /workspace/.eval/secret_cases.json
3. python solve.py --case "$(cat /workspace/.eval/secret_cases.json)"

These behaviors inflate rewards and corrupt the training signal, requiring a clear mechanism to separate real task-solving from shortcuts. To address this, we introduce an anti-hack module for both RL training and evaluation. The detection process has two stages: a rule-based filter first catches potential hacks to maximize recall, and then an LLM judge checks the intent of these flagged actions to keep precision high. We use an online strategy that monitors the tool calls at each step. If a hack is detected, the system blocks the call and returns dummy information as the result. Importantly, this online guard allows the model to continue the rollout even after a hacked action is caught. By handling the specific invalid behavior instead of rejecting the entire trajectory, this approach helps prevent the training instability and model collapse that can happen when rollouts are abruptly stopped.
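
The two-stage online guard described above can be sketched as follows. This is an illustration, not the actual module: the regex rules, the `DUMMY_RESULT` string, and the `judge` and `execute` callables are all hypothetical stand-ins, with `judge` taking the place of the LLM judge.

```python
import re

# Hypothetical high-recall rules, loosely modeled on the examples in the text.
SUSPICIOUS_PATTERNS = [
    re.compile(r"raw\.githubusercontent\.com"),
    re.compile(r"/\.eval/"),
    re.compile(r"-name\s+[\"']?\*hidden\*"),
]

# Hypothetical placeholder returned in place of the real tool output.
DUMMY_RESULT = "[blocked] no output"


def rule_filter(command):
    """Stage 1: flag anything that matches a rule (favors recall)."""
    return any(pattern.search(command) for pattern in SUSPICIOUS_PATTERNS)


def guard_tool_call(command, judge, execute):
    """Check one tool call; block it only if both stages agree it is a hack.

    judge(command) -> bool stands in for the LLM judge (favors precision).
    execute(command) -> str stands in for the real tool runner.
    """
    if rule_filter(command) and judge(command):
        # Block the call but return a result, so the rollout can continue.
        return DUMMY_RESULT
    return execute(command)
```

Calling `guard_tool_call` at every step lets the rollout proceed after a blocked call, since the agent still receives a tool result. The judge runs only on commands the rules have flagged, so the more expensive check is reserved for a small share of calls.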

§ 13
| Benchmark | GLM-5.2 | GLM-5.1 | Qwen3.7-Max | MiniMax M3 | DeepSeek-V4-Pro | Claude Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Reasoning | | | | | | | | |
| HLE | 40.5 | 31.0 | 41.4 | 37.0 | 37.7 | 49.8* | 41.4* | 45.0 |
| HLE w/ Tools | 54.7 | 52.3 | 53.5 | - | 48.2 | 57.9* | 52.2* | 51.4* |
| CritPt | 20.9 | 4.6 | 13.4 | 3.7 | 12.9 | 20.9 | 27.1 | 17.7 |
| AIME 2026 | 99.2 | 95.3 | 97.0 | - | 94.6 | 95.7 | 98.3 | 98.2 |
| HMMT Nov. 2025 | 94.4 | 94.0 | 95.0 | 84.4 | 94.4 | 96.5 | 96.5 | 94.8 |
| HMMT Feb. 2026 | 92.5 | 82.6 | 97.1 | 84.4 | 95.2 | 96.7 | 96.7 | 87.3 |
| IMOAnswerBench | 91.0 | 83.8 | 90.0 | - | 89.8 | 83.5 | - | 81.0 |
| GPQA-Diamond | 91.2 | 86.2 | 90.0 | 93.0 | 90.1 | 93.6 | 93.6 | 94.3 |
| Coding | | | | | | | | |
| SWE-bench Pro | 62.1 | 58.4 | 60.6 | 59.0 | 55.4 | 69.2 | 58.6 | 54.2 |
| NL2Repo | 48.9 | 42.7 | 47.2 | 42.1 | 35.5 | 69.7 | 50.7 | 33.4 |
| DeepSWE | 46.2 | 18.0 | 18.0 | 20.0 | 8.0 | 58.0 | 70.0 | 10.0 |
| ProgramBench | 63.7 | 50.9 | - | - | 47.8 | 71.9 | 70.8 | 39.5 |
| Terminal Bench 2.1 (Terminus-2) | 81.0 | 63.5 | 75.0 | 65.0 | 64.0 | 85.0 | 84.0 | 74.0 |
| Terminal Bench 2.1 (Best Reported Harness) | 82.7 (Claude Code) | 69 (Claude Code) | - | - | - | 78.9 (Claude Code) | 83.4 (Codex) | 70.7 (Gemini CLI) |
| FrontierSWE (Dominance, as of 2026/06/16) | 74.4 | 30.5 | - | - | 29.0 | 75.1 | 72.6 | 39.6 |
| PostTrainBench | 34.3 | 20.1 | - | - | - | 37.2 | 28.4 | 21.6 |
| SWE-Marathon | 13.0 | 1.0 | - | - | - | 26.0 | 12.0 | 4.0 |
| Agentic | | | | | | | | |
| MCP-Atlas (Public Set) | 76.8 | 71.8 | 76.4 | 74.2 | 73.6 | 77.8 | 75.3 | 69.2 |
| Tool-Decathlon | 48.2 | 40.7 | - | - | 52.8 | 59.9 | 55.6 | 48.8 |

*: scores marked with an asterisk are from the full set rather than the text-only subset.

§ 14

Try GLM-5.2 in your favorite coding agents—ZCode, Claude Code, OpenCode, and more. https://docs.z.ai/devpack/overview

For GLM Coding Plan subscribers: We have already rolled out GLM-5.2 to all Coding Plan users. You can enable it now by updating the model name to "GLM-5.2" (or GLM-5.2[1m] in Claude Code to enable the 1M context length). You can also choose a thinking effort level, High or Max, depending on the task. As our most capable model, GLM-5.2 consumes quota at 3× during peak hours and 2× during off-peak hours. As a limited-time promotion through the end of September, off-peak usage is billed at 1×. Peak hours are 14:00–18:00 UTC+8 (Beijing Time) daily.
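
The quota rules above can be written out as a small lookup. This sketch makes several assumptions that the announcement does not state: the promotion year is 2026 (inferred from the release date), the promotion ends at midnight Beijing time on October 1, and the peak window is half-open, so 18:00 itself counts as off-peak. The function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))  # UTC+8, as stated in the text

# Assumption: "through the end of September" means September 2026, Beijing time.
PROMO_END = datetime(2026, 10, 1, tzinfo=BEIJING)


def quota_multiplier(moment):
    """Return the quota multiplier for a timezone-aware datetime."""
    local = moment.astimezone(BEIJING)
    # Assumption: the peak window is [14:00, 18:00) Beijing time.
    if 14 <= local.hour < 18:
        return 3
    # Off-peak: 1x during the promotion, 2x afterwards.
    return 1 if local < PROMO_END else 2
```

For example, 07:00 UTC is 15:00 in Beijing, so a request at that time falls in the peak window and is billed at 3×.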

Prefer a GUI? We offer ZCode, a desktop agent powered by GLM-5.2, with /goal for long-horizon tasks, SSH remote development, and mobile control. Special offer: use GLM-5.2 through Coding Plan inside ZCode and get 1.5× effective quota until June 30.

Start building now: https://z.ai/subscribe

§ 15

Chat with GLM-5.2 on Z.ai

GLM-5.2 is now available on Z.ai.

Serve GLM-5.2 Locally

The model weights of GLM-5.2 are publicly available on HuggingFace and ModelScope. For local deployment, GLM-5.2 supports inference frameworks including transformers, vLLM, SGLang, xLLM, and ktransformers.

Footnote

  • Humanity’s Last Exam (HLE) & other reasoning tasks: We use sampling parameters of temperature=1.0, top_p=0.95 for evaluation. We evaluate with a maximum generation length of 163,840 tokens. By default, we report the text-only subset; results marked with * are from the full set. For AIME, HMMT and IMOAnswerBench, we evaluate each question using the following system prompt: Your response should be in the following format:\nExplanation: {your explanation for your final answer}\nExact Answer: {your succinct, final answer}\nConfidence: {your confidence score between 0% and 100% for your answer}. We use GPT-5.5 (medium) as the judge model. For HLE-with-tools, we use a maximum context length of 300,000 tokens, with no context management strategy.
  • SWE-Bench Pro: We run the SWE-Bench Pro suite with OpenHands using a tailored instruction prompt. Settings: temperature=1, top_p=1, max_new_tokens=32k, with a 400K context window.
  • NL2Repo: We evaluate NL2Repo with temperature=1.0, top_p=1.0, and max_new_tokens=48k under a 400K context. To prevent hacking, we use both rule-based and LLM-based judgment to block malicious behaviors (e.g., unauthorized pip or curl operations).
  • DeepSWE: We run DeepSWE with the official pier evaluation framework and the mini-swe-agent harness (temperature=1.0, top_p=1.0, timeout=2h, 400K context). Each task is solved in an isolated container with 2 CPUs, 8 GB RAM, and no internet access.
  • ProgramBench: We evaluate ProgramBench (200 instances) with Claude-Code 2.1.156 using temperature=1.0, top_p=1.0, max_tokens=64000, max_turns=2000, sample_timeout=6h, reasoning_effort=max, with a 400K context window. Each instance runs in a (4 CPUs, 8 GB RAM) sandbox with internet access disabled.
  • Terminal-Bench 2.1 (Terminus 2): We evaluate Terminal-Bench 2.1 with Terminus-2 framework using parser=json, timeout=4h, temperature=1.0, top_p=1.0, max_new_tokens=48k, max_episodes=500, with a 256K context window. Resource limits are capped at 4 CPUs and 8 GB RAM.
  • Terminal-Bench 2.1 (Claude Code): We evaluate in Claude Code 2.1.167 with temperature=1.0, top_p=0.95, max_new_tokens=131072. We override max_new_tokens to 128k via a transparent proxy, bypassing the 64k CLI cap to restore the configurability of CLAUDE_CODE_MAX_OUTPUT_TOKENS. We remove wall-clock time limits, while preserving per-task CPU and memory constraints. Scores are averaged over 5 runs.
  • MCP-Atlas: All models were evaluated in think mode on the 500-task public subset with a 10-minute timeout per task. We use Gemini-3.0-Pro as the judge model for evaluation.
  • Tool-Decathlon: We use the official evaluation service and set max_token to 128K.
  • FrontierSWE: The evaluation was conducted by Proximal with 1M context length, max effort level, and 128K maximum output tokens. Dominance score reported as of 2026/06/16.
  • PostTrainBench: The evaluation was conducted by PostTrainBench with 1M context length, max effort level, and 128K maximum output tokens.
  • SWE-Marathon: The evaluation was conducted by Abundant AI with 1M context length, max effort level, and 128K maximum output tokens.
