Glean 拾遗
Daily /2026-06-12 / Designing loops with Fable 5: self-correction and cross-session memory

Designing loops with Fable 5: self-correction and cross-session memory

Source x.com Glean’d 2026-06-12 06:00 Read 5 min
AI summary

R. Lance Martin demonstrates two loop patterns for Anthropic's Fable 5: self-correction and cross-session memory. On the Parameter Golf challenge (train a model under 16MB and 10 minutes on 8xH100s), Fable 5 with CMA and a verifier sub-agent improved the pipeline roughly 6x more than Opus 4.7, favoring structural changes over scalar tuning. On a continual learning SQL benchmark, Fable 5 progressed through fail-investigate-verify-distill into general rules, reaching 73% verification coverage, while Opus 4.7 and Sonnet 4.6 stalled at sparse notes or uncertain schemas. The key takeaway: design loops and environment feedback so the model can hillclimb, rather than relying on direct prompting.

Original · 5 min
x.com ↗
§ 1

Mythos-class models like Claude Fable 5 have changed the way many of us work at Anthropic. I want to share two tips for getting the most out of this class of models.

Mythos 级模型(如 Claude Fable 5)已经改变了许多人在 Anthropic 的工作方式。我想分享两个技巧,帮你充分发挥这类模型的能力。

§ 2

Self-correction loops

There’s been a lot of interest in loops recently. @bcherny has mentioned that “(his) job is to write loops.” Letting models hillclimb on an evaluation is a common recipe for improving task performance: /goal in Claude Code and Outcomes in Claude Managed Agent are primitives that let you apply this general recipe for your specific task.

As mentioned in our prompting guide, Fable 5 is good at self-correcting in a loop. A well designed goal or rubric adds feedback to the environment that Claude is running in. This let’s Claude run, collect feedback via the goal or rubric, self-correct, and proceed until the goal or rubric is satisfied.

近期大家对循环/loop 的兴趣很高。@bcherny 提到“他的工作就是写循环”。让模型在评估上爬山/hillclimb 是提升任务性能的常见方法:Claude Code 中的 /goal 和 Claude Managed Agent 中的 Outcomes 是基本原语,让你能为自己的任务应用这一通用方法。

正如我们的提示指南中所说,Fable 5 擅长在循环中自我修正。一个设计良好的 goal 或 rubric 会给 Claude 运行的环境添加反馈。这让 Claude 能够运行、通过 goal 或 rubric 收集反馈、自我修正,并持续进行,直到满足 goal 或 rubric。

§ 3

One subtle point: what does the judging is important. We’ve seen that models have problems with self-critique on their own outputs. Prithvi Rajasekaran wrote about this in our engineering blog here.

We’ve found that a verifier sub-agent tends to outperform self-critique with Fable 5, because grading is done in an independent context window. Outcomes in CMA handles this by spawning a grader sub-agent for you.

一个微妙之处在于:谁来评判至关重要。我们发现模型对自己输出进行自我批判/self-critique 时会出问题。Prithvi Rajasekaran 在我们的工程博客中写过这一点。

我们发现,对于 Fable 5,验证子智能体/verifier sub-agent 通常优于自我批判,因为评分是在独立的上下文窗口中进行的。CMA 中的 Outcomes 会为你自动生成一个评分子智能体来解决这个问题。

§ 4

I’ll share one toy example that I used to test Fable: Parameter Golf is an open source ML engineering challenge to train the best model that fits in a 16MB artifact in < 10 minutes on 8xH100s.

It’s a bit like @karpathy's autoresearch project: it tests the ability of an agent to edit basic training code (a single train_gpt.py file), launch training, poll the log, read the score, and decide what experiment to run next.

I compared Fable 5 to Opus 4.7 on this challenge using Claude Managed Agents (CMA). CMA provides the agent harness as well as a hosted sandbox, so it’s well-suited for long-running tasks with Fable 5. For Parameter Golf, I gave CMA access to 8xH100 GPUs as a self-hosted sandbox.

For each test, I supplied a rubric (a file) with the nine checkable criteria (e.g., run a baseline, run 20 experiments, etc). Then, I ran Parameter Golf for up to 8 hours. The Outcomes grader confirmed that all experimental criteria were met before allowing Claude to stop the work.

Fable 5 improved the training pipeline ~6x more than Opus 4.7. If we consider experiments as structural (e.g., architecture changes) or scalar (e.g., adjusts a constant), Fable 5 bet on larger structural changes and showed resilience (e.g., pushing through a quantization regression to its biggest win).

Opus 4.7's first experiment produced a small win and nearly everything after followed the same template: adjust a scalar, measure, keep if positive.

我来分享一个测试 Fable 的示例:Parameter Golf 是一个开源的 ML 工程挑战,目标是在 8×H100 上、10 分钟内训练出能放入 16MB 工件的最佳模型。

这有点像 @karpathy 的自动研究项目:测试智能体编辑基础训练代码(单个 train_gpt.py 文件)、启动训练、轮询日志、读取分数、并决定下一次实验的能力。

我使用 Claude Managed Agents (CMA) 在这个挑战上对比了 Fable 5 与 Opus 4.7。CMA 提供了智能体框架以及托管的沙箱,非常适合 Fable 5 长时间运行的任务。对于 Parameter Golf,我让 CMA 访问 8 块 H100 GPU 作为自托管沙箱。

每次测试,我提供了一个包含九项可检查标准(例如运行基线、运行 20 次实验等)的 rubric 文件。然后让 Parameter Golf 运行最多 8 小时。Outcomes 的评分器在允许 Claude 停止工作前,确认所有实验标准都已满足。

Fable 5 对训练管线的改进幅度比 Opus 4.7 大约高 6 倍。如果把实验分为结构性(如架构变化)和标量性(如调整常量),Fable 5 更倾向于押注大的结构性变化,并展现出韧性(例如,在一次量化回归后仍推进,最终取得最大胜利)。

Opus 4.7 的第一次实验产生了小收益,随后几乎每次实验都遵循相同模式:调整标量、测量、若为正则保留。

§ 5

Memory

Memory is another area where Fable excels. We can think about this as a outer loop that spans across sessions: Claude writes to memory during a session and those memories can be retrieved in future sessions.

@pgasawa and team recently published Continual Learning Bench 1.0, so I wanted to test this on Fable 5 vs earlier models.

I compared Fable 5, Opus 4.7, and Sonnet 4.6 on one of the tasks from the benchmark: the task asks an agent to answer sequential questions given access to a SQL database. Each question is a separate agent session and memory is provided.

For this, I used CMA with memory, which gives each agent access to a mounted filesystem that can be shared across sessions.

内存/Memory 是 Fable 另一个擅长的领域。我们可以将其视为跨会话的外循环:Claude 在一次会话中写入内存,这些记忆可以在未来的会话中检索。

@pgasawa 和团队最近发布了 Continual Learning Bench 1.0,所以我希望在 Fable 5 上与更早的模型测试对比。

我比较了 Fable 5、Opus 4.7 和 Sonnet 4.6 在该基准测试的一项任务上的表现:任务要求智能体在访问 SQL 数据库的情况下回答一系列问题。每个问题都是一个独立的智能体会话,并提供了内存。

为此,我使用了带内存的 CMA,它让每个智能体都能访问一个挂载的文件系统,该文件系统可在会话间共享。

§ 6

For this task, effective use of memory benefits from a progression: fail (get something wrong and document), investigate (before moving on, figure out why), verify (turn the diagnosis into a checked fact), distill (turn verification into a general rule), and consult (read the rule, instead of re-deriving it).

Sonnet 4.6 exits around step 1: its store is a list of failure notes and open guesses (e.g., "maybe prc instead of prc_usd?"). It rarely consults prior notes. To improve performance, task-specific memory instructions are needed.

Opus 4.7 exits around step 3: it creates a schema reference with uncertainty flagged (e.g., "possibly prc in cents? Verify."), but verification coverage is low: at 7-33% of questions (median run ~17%).

Fable 5 tends to complete the progression: in its strongest runs, verification coverage is up to 73% (22 of 30) and it distills learnings into general rules that help with future tasks.

对于这个任务,有效使用内存得益于一个递进过程:失败(出错并记录)、调查(继续前查明原因)、验证(将诊断转化为已核实的事实)、提炼(将验证转化为通用规则)、以及查阅(读取规则,而非重新推导)。

Sonnet 4.6 大约停留在第 1 步:它的存储是一份失败笔记和开放式猜想(例如“可能是 prc 而非 prc_usd?”)。它很少查阅之前的笔记。要提高性能,需要提供任务特定的内存指令。

Opus 4.7 大约停留在第 3 步:它创建了一个带有不确定性标记的模式参考(例如“可能是 prc 单位是美分?验证。”),但验证覆盖率很低,在 7%–33% 的问题之间(中位数运行约为 17%)。

Fable 5 倾向于完成整个递进过程:在其最强运行中,验证覆盖率高达 73%(30 题中覆盖 22 题),并将学习提炼为通用规则,帮助解决未来的任务。

§ 7

Rather than directly prompting and steering Fable 5, it's often better to design loops that let the model to self-correct in response to environment feedback (e.g., /goal or Outcomes) and manage its own context (e.g., via memory).

I've shared just a few small scale experiments that I've run, but its worth testing Fable 5 for yourself on challenging tasks and using loops for self-correction or memory.

To get started, see our docs or ask the latest version of Claude Code, which can use our built-in /claude-api skill to tell you about Fable 5 (e.g., prompting best practices), /goal, Claude Managed Agents, or other API features.

与其直接提示和引导 Fable 5,通常更好的做法是设计循环,让模型根据环境反馈自我修正(例如通过 /goal 或 Outcomes),并管理自己的上下文(例如通过内存)。

这里我只分享了自己运行过的几个小规模实验,但你很值得在具有挑战性的任务上亲自测试 Fable 5,并使用循环来实现自我修正或内存管理。

要开始上手,请查阅我们的文档,或询问最新版本的 Claude Code,它可以使用我们内置的 /claude-api 技能来告诉你关于 Fable 5 的信息(例如提示最佳实践)、/goal、Claude Managed Agents 或其他 API 功能。

Open source ↗