Glean 拾遗
Designing loops with Fable 5: self-correction and memory in agentic workflows

Source x.com Glean’d 2026-06-11 06:00 Read 5 min
AI summary

The author shares two practical directions for improving agentic workflows with Anthropic's Claude Fable 5 model: self-correction loops and cross-session memory. In the Parameter Golf challenge (train the best model that fits in a 16MB artifact in under 10 minutes on 8×H100 GPUs), Fable 5 improved the training pipeline roughly 6× more than Opus 4.7. Both models ran on Claude Managed Agents with Outcomes, where an independent verifier sub-agent judged the work against nine checkable criteria. Fable 5 bet on larger structural changes and pushed through a quantization regression, while Opus 4.7 stuck to tuning scalar hyperparameters. For memory, the author used a SQL-based task from Continual Learning Bench 1.0 with filesystem-backed memory shared across agent sessions. Sonnet 4.6 only logged failures and guesses; Opus 4.7 built schema references with uncertainty flagged but verified only 7-33% of questions (about 17% in the median run); Fable 5 reached 73% verification coverage in its best run and distilled learnings into general rules. Engineers interested in agent architecture and model capability boundaries will find the experiments relevant.

Original · 5 min
§ 1

Mythos-class models like Claude Fable 5 have changed the way many of us work at Anthropic. I want to share two tips for getting the most out of this class of models.

§ 2

There’s been a lot of interest in loops recently. @bcherny has mentioned that “(his) job is to write loops.” Letting models hillclimb on an evaluation is a common recipe for improving task performance: /goal in Claude Code and Outcomes in Claude Managed Agents are primitives that let you apply this general recipe to your specific task.

As mentioned in our prompting guide, Fable 5 is good at self-correcting in a loop. A well-designed goal or rubric adds feedback to the environment that Claude is running in. This lets Claude run, collect feedback via the goal or rubric, self-correct, and proceed until the goal or rubric is satisfied.
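
The run, collect feedback, self-correct cycle can be sketched in a few lines of Python. This is only an illustration of the general recipe, not how /goal or Outcomes are implemented: `Criterion`, `run_until_satisfied`, and `toy_step` are hypothetical names, and the toy "agent" simply fixes whichever criterion is reported first.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Criterion:
    """One checkable rubric item: a name plus a predicate over the current state."""
    name: str
    check: Callable[[dict], bool]


def run_until_satisfied(
    step: Callable[[dict, List[str]], dict],
    rubric: List[Criterion],
    state: dict,
    max_iters: int = 10,
) -> Tuple[dict, List[str]]:
    """Call `step` repeatedly, feeding back the names of unmet criteria."""
    unmet = [c.name for c in rubric if not c.check(state)]
    for _ in range(max_iters):
        if not unmet:
            break  # rubric satisfied, the loop may stop
        state = step(state, unmet)  # the agent acts on the feedback
        unmet = [c.name for c in rubric if not c.check(state)]
    return state, unmet


def toy_step(state: dict, unmet: List[str]) -> dict:
    """Hypothetical stand-in for the agent: address the first unmet criterion."""
    new_state = dict(state)
    if unmet[0] == "baseline_run":
        new_state["baseline"] = True
    elif unmet[0] == "three_experiments":
        new_state["experiments"] = new_state.get("experiments", 0) + 1
    return new_state


rubric = [
    Criterion("baseline_run", lambda s: s.get("baseline", False)),
    Criterion("three_experiments", lambda s: s.get("experiments", 0) >= 3),
]
final_state, remaining = run_until_satisfied(toy_step, rubric, {})
```

The point of the design is that the stopping condition lives in the environment (the rubric), not in the agent's own judgment of whether it is done.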

§ 3

I’ll share one toy example that I used to test Fable 5: Parameter Golf is an open-source ML engineering challenge to train the best model that fits in a 16MB artifact in under 10 minutes on 8xH100s.

It’s a bit like @karpathy's autoresearch project: it tests the ability of an agent to edit basic training code (a single train_gpt.py file), launch training, poll the log, read the score, and decide what experiment to run next.

§ 4

I compared Fable 5 to Opus 4.7 on this challenge using Claude Managed Agents (CMA). CMA provides the agent harness as well as a hosted sandbox, so it’s well-suited for long-running tasks with Fable 5. For Parameter Golf, I gave CMA access to 8xH100 GPUs as a self-hosted sandbox.

One subtle point: what does the judging matters. We’ve seen that models have trouble critiquing their own outputs. Prithvi Rajasekaran wrote about this on our engineering blog.

§ 5

We’ve found that a verifier sub-agent tends to outperform self-critique with Fable 5, because grading is done in an independent context window. Outcomes in CMA handles this by spawning a grader sub-agent for you.

For each test, I supplied a rubric (a file) with nine checkable criteria (e.g., run a baseline, run 20 experiments). Then, I ran Parameter Golf for up to 8 hours. The Outcomes grader confirmed that all rubric criteria were met before allowing Claude to stop work.
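
To make the idea of a checkable rubric concrete, here is a rough sketch of a grader that sees only a run log, never the worker's own reasoning. The JSON layout, the field names (`min_count`, `kind`), and the `grade` and `may_stop` helpers are all assumptions made for illustration; the post does not show the real rubric file or the Outcomes grader interface. Only the two example criteria (a baseline and 20 experiments) come from the text.

```python
import json

# Hypothetical rubric file contents; the schema is invented for this sketch.
RUBRIC_JSON = """
[
  {"id": "baseline", "description": "Run a baseline",
   "kind": "baseline", "min_count": 1},
  {"id": "experiments", "description": "Run 20 experiments",
   "kind": "experiment", "min_count": 20}
]
"""


def grade(rubric: list, run_log: list) -> dict:
    """Return {criterion id: passed}, using only the run log as evidence."""
    verdicts = {}
    for criterion in rubric:
        count = sum(1 for entry in run_log if entry["kind"] == criterion["kind"])
        verdicts[criterion["id"]] = count >= criterion["min_count"]
    return verdicts


def may_stop(verdicts: dict) -> bool:
    """The worker may stop only when every criterion has passed."""
    return all(verdicts.values())


rubric = json.loads(RUBRIC_JSON)
partial_log = [{"kind": "baseline"}] + [{"kind": "experiment"}] * 12
full_log = [{"kind": "baseline"}] + [{"kind": "experiment"}] * 20

partial_verdicts = grade(rubric, partial_log)  # experiments criterion fails
full_verdicts = grade(rubric, full_log)        # both criteria pass
```

Because each criterion is a count over recorded evidence, a separate grader can check it without trusting the worker's summary of what it did.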

§ 6

Fable 5 improved the training pipeline ~6x more than Opus 4.7. If we classify experiments as structural (e.g., architecture changes) or scalar (e.g., adjusting a constant), Fable 5 bet on larger structural changes and showed resilience (e.g., pushing through a quantization regression to reach its biggest win).

Opus 4.7's first experiment produced a small win and nearly everything after followed the same template: adjust a scalar, measure, keep if positive.
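
The "adjust a scalar, measure, keep if positive" template is essentially greedy one-dimensional hillclimbing. The sketch below shows that pattern on a toy loss; it is not a reconstruction of what Opus 4.7 actually ran. `scalar_hillclimb` and `toy_loss` are hypothetical, and the sketch assumes that a lower score is better.

```python
from typing import Callable, Tuple


def scalar_hillclimb(
    evaluate: Callable[[float], float],
    start: float,
    step: float,
    n_experiments: int,
) -> Tuple[float, float]:
    """Greedy search over one scalar; assumes a lower score is better."""
    best_x, best_score = start, evaluate(start)
    for _ in range(n_experiments):
        improved = False
        for candidate in (best_x + step, best_x - step):
            score = evaluate(candidate)
            if score < best_score:  # keep the change only if it helps
                best_x, best_score = candidate, score
                improved = True
                break
        if not improved:
            step /= 2  # no gain in either direction: try a smaller nudge
    return best_x, best_score


def toy_loss(lr: float) -> float:
    """Stand-in for a training run; minimized at lr = 0.3 with loss 1.0."""
    return (lr - 0.3) ** 2 + 1.0


best_lr, best_loss = scalar_hillclimb(toy_loss, start=0.1, step=0.1, n_experiments=20)
```

A loop like this converges on the best value of the constant it is tuning, but it can never leave the current design: the floor of 1.0 in `toy_loss` stands for whatever only a structural change could remove, and a rule that rejects every temporary regression will not get past it.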

§ 7

Memory is another area where Fable 5 excels. We can think of it as an outer loop that spans sessions: Claude writes to memory during a session, and those memories can be retrieved in future sessions.

@pgasawa and team recently published Continual Learning Bench 1.0, so I wanted to use it to compare Fable 5 with earlier models.

§ 8

I compared Fable 5, Opus 4.7, and Sonnet 4.6 on one of the tasks from the benchmark: the task asks an agent to answer sequential questions given access to a SQL database. Each question is a separate agent session and memory is provided.

For this, I used CMA with memory, which gives each agent access to a mounted filesystem that can be shared across sessions.

For this task, effective use of memory benefits from a five-step progression: (1) fail (get something wrong and document it), (2) investigate (before moving on, figure out why), (3) verify (turn the diagnosis into a checked fact), (4) distill (turn the verified fact into a general rule), and (5) consult (read the rule instead of re-deriving it).
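
One way to picture this progression is a small note store on a shared directory, where every note carries the stage it has reached and only verified or distilled notes are returned when a later session consults memory. This is a minimal sketch under stated assumptions: `MemoryStore`, the `notes.json` file, the topic name, and the note texts are all hypothetical, and a temporary directory stands in for the mounted filesystem.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

# Stages a note can reach; "consult" is the read operation below.
STAGES = ["fail", "investigate", "verify", "distill"]


class MemoryStore:
    """Notes kept as one JSON file in a directory shared across sessions."""

    def __init__(self, root: Path):
        self.path = root / "notes.json"
        if not self.path.exists():
            self.path.write_text("{}")

    def _load(self) -> dict:
        return json.loads(self.path.read_text())

    def record(self, topic: str, stage: str, text: str) -> None:
        """Store the latest note for a topic, tagged with its stage."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        notes = self._load()
        notes[topic] = {"stage": stage, "text": text}
        self.path.write_text(json.dumps(notes, indent=2))

    def consult(self, topic: str) -> Optional[str]:
        """Return a note only if it was verified or distilled, else None."""
        note = self._load().get(topic)
        if note and STAGES.index(note["stage"]) >= STAGES.index("verify"):
            return note["text"]
        return None


shared_dir = Path(tempfile.mkdtemp())  # stands in for the mounted filesystem

# Session 1: a wrong answer is logged as an open guess.
session1 = MemoryStore(shared_dir)
session1.record("price_column", "fail", "guess: wrong price column?")
guess_visible = session1.consult("price_column")  # unverified, so not returned

# Session 2: same directory; the guess has been checked and generalized.
session2 = MemoryStore(shared_dir)
session2.record("price_column", "distill",
                "Check column units against sample rows before aggregating.")
rule = MemoryStore(shared_dir).consult("price_column")
```

Tagging notes by stage keeps open guesses from being read back as if they were facts, which is the gap between stopping at step 1 and completing the progression.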

§ 9

Sonnet 4.6 exits around step 1: its store is a list of failure notes and open guesses (e.g., "maybe prc instead of prc_usd?"). It rarely consults prior notes. To improve performance, task-specific memory instructions are needed.

Opus 4.7 exits around step 3: it creates a schema reference with uncertainty flagged (e.g., "possibly prc in cents? Verify."), but verification coverage is low, at 7-33% of questions (median run ~17%).

§ 10

Fable 5 tends to complete the progression: in its strongest runs, verification coverage is up to 73% (22 of 30) and it distills learnings into general rules that help with future tasks.
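
As a quick check on the arithmetic, the 73% figure follows directly from 22 of 30. The sketch below assumes that "verification coverage" means the share of questions for which the agent turned a diagnosis into a checked fact; the post does not define the metric, and `verification_coverage` is a hypothetical helper.

```python
def verification_coverage(verified: int, total: int) -> float:
    """Percentage of questions with a verified finding (assumed definition)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= verified <= total:
        raise ValueError("verified must be between 0 and total")
    return 100.0 * verified / total


# Figures quoted in the post for Fable 5's strongest run.
best_run = verification_coverage(22, 30)
best_run_rounded = round(best_run)
```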

§ 11

Rather than directly prompting and steering Fable 5, it's often better to design loops that let the model self-correct in response to environment feedback (e.g., /goal or Outcomes) and manage its own context (e.g., via memory).

I've shared just a few small-scale experiments that I've run, but it's worth testing Fable 5 for yourself on challenging tasks, using self-correction and memory loops.

To get started, see our docs or ask the latest version of Claude Code, which can use our built-in /claude-api skill to tell you about Fable 5 (e.g., prompting best practices), /goal, Claude Managed Agents, or other API features.
