Glean 拾遗
#003 Latest 6/8–6/14 Published Jun 14

The Compounding Agent: From Prompt Engineering to Self-Improving Systems

This week’s picks converge on a clear inflection point: AI agents are graduating from conversational fluency to engineering maturity. The bar is no longer the quality of a single response, but whether an agent can self-correct across hours, accumulate memory across sessions, and let humans operate as architects rather than operators. The editorial arc moves from a leap in model capability—Anthropic's Fable 5—to the cognitive bottleneck that emerges when scaling agent use (the orchestration tax), and finally to the engineering practices that deliver compounding returns: rigorous AGENTS.md files, dynamic workflows, and cross-session memory loops. A recurring law emerges: real leverage comes not from spawning more agents, but from designing systems where each run’s failures, lessons, and rules form the training data for the next. This is the pivot from prompt-driven to design-driven agentic engineering.

22 picks 5 sections ~6 hr
Section 01

Model Leap: From Mythos-Class Power to Useful Steering

5 / 22
www.anthropic.com · 26 min
01

Claude Fable 5 and Claude Mythos 5
Anthropic releases the Mythos-class model Claude Fable 5: its most capable, but with more safety restrictions

Anthropic launched Claude Fable 5, its most capable publicly released model, rated Mythos-class. It achieves state-of-the-art on nearly all benchmarks, especially on long, complex tasks in software engineering, knowledge work, and vision. To mitigate misuse risks in cybersecurity and biology, Fable 5 ships with conservative safety classifiers that fall back to Opus 4.8 for sensitive queries, triggering in under 5% of sessions. Claude Mythos 5, the same model with safeguards lifted, is available to cyberdefenders and certain biologists. New 30-day business data retention is mandated for Mythos-class models. The post includes results from Stripe, Cognition, and internal protein design and genomics experiments. Pricing is $10/M input tokens and $50/M output tokens.

x.com · 12 min
02

Training an LLM to Generate Reliable Structured Output Using GRPO and a Reward Function
Replacing labeled data with a reward function: GRPO lifts Qwen3-8B's valid JSON structured output from 62% to 82%

A hands-on report on replacing labeled data with a code-defined reward function to train structured output. The author fine-tunes Qwen3-8B for JSON invoice extraction using GRPO. Supervised fine-tuning stalls because its token-level loss only optimizes for surface similarity, not structural validity. The fix: a reward function that scores completions 0.0 (invalid JSON), 0.5 (valid JSON but wrong schema), or 1.0 (fully compliant), providing a learning gradient. Training on Fireworks H200s raised schema-valid output from a baseline of 62% to 82% on held-out prompts, exceeding GPT-4.1's 58%, with lower cost and latency. The approach transfers to any task where correctness is verifiable in code, such as SQL, API calls, or tool use. Full reward function, dataset, and training config are provided.
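The three-tier scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's reward function: the `REQUIRED_FIELDS` schema and the name `json_reward` are hypothetical, and the real invoice schema is not given in this summary.

```python
import json

# Hypothetical invoice schema (an assumption for illustration):
# required field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "invoice_number": str,
    "total": (int, float),
    "currency": str,
}


def json_reward(completion: str) -> float:
    """Return 0.0 for invalid JSON, 0.5 for valid JSON with the wrong
    schema, and 1.0 for a fully compliant object."""
    try:
        parsed = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(parsed, dict):
        return 0.5
    for name, expected in REQUIRED_FIELDS.items():
        value = parsed.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return 0.5
    return 1.0
```

The intermediate 0.5 tier is what gives the policy a gradient: a completion that at least parses is rewarded over one that does not, even before the schema is right.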

x.com · 5 min
03

Claude API adds auto-caching: single cache_control param cuts input cost to 10%
Claude API adds auto-caching: one line with the cache_control parameter drops cost to 1/10

Anthropic introduced prompt auto-caching in the Claude Messages API. Instead of manually moving breakpoints across conversation turns, a single top-level `cache_control: {type: 'ephemeral'}` auto-places the cache at the last cacheable block. Cached tokens cost 10% of base input price and reduce prefill latency. Ideal for agents and coding assistants where most context remains identical turn-over-turn. The post cites Manus founder @peakji on cache hit rate being the most critical metric for production agents, and links to Claude Code's cache-friendly prompt design insights.
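To see what the 10% figure means for a single request, here is a rough cost estimate in Python. It is a sketch under stated assumptions: the function name is hypothetical, the price passed in is illustrative, and any cache-write surcharge or cache expiry is ignored because the summary does not describe them.

```python
def input_cost_usd(total_input_tokens: int,
                   cached_tokens: int,
                   base_price_per_million: float,
                   cached_fraction_of_base: float = 0.10) -> float:
    """Estimate input cost when some tokens are served from cache.

    cached_fraction_of_base=0.10 reflects the "10% of base input price"
    figure quoted above; everything else is an illustrative assumption.
    """
    if not 0 <= cached_tokens <= total_input_tokens:
        raise ValueError("cached_tokens must be between 0 and total_input_tokens")
    per_token = base_price_per_million / 1_000_000
    uncached_tokens = total_input_tokens - cached_tokens
    return (uncached_tokens * per_token
            + cached_tokens * per_token * cached_fraction_of_base)


# Illustrative numbers: 100k input tokens, 90% of them cached, $10 per million.
with_cache = input_cost_usd(100_000, 90_000, 10.0)
without_cache = input_cost_usd(100_000, 0, 10.0)
print(f"with cache: ${with_cache:.2f}, without: ${without_cache:.2f}")
```

The saving scales with the share of tokens that hit the cache, which is consistent with the emphasis on cache hit rate as a production metric.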

x.com · 7 min
04

The Missing Link Between Agents and Applications
Headless Tools: letting agents act directly inside browsers and desktop apps

This article introduces Headless Tools, a mechanism that allows agents to act directly on client-side runtimes such as browsers and desktop applications. The author argues that most current agent tools are server-side, limiting them to API calls while blocking access to browser state, device APIs, and in-app actions. Headless Tools wrap client-side capabilities like geolocation, clipboard, IndexedDB, and application-specific commands as standard tools invocable by the model. The model sees only a tool schema, while the server and client coordinate execution behind the scenes. Code examples in TypeScript demonstrate the pattern, alongside real-world use in a Slidev presentation plugin and browser-local agent memory. Privacy is improved because sensitive data can remain on-device. This is valuable for teams embedding AI agents into rich frontend contexts such as design tools, document editors, and desktop utilities.

x.com · 28 min
05

Build a Self-Improving Agent System with Claude Fable 5 in 14 Steps
Building a self-evolving agent system on Fable 5 in 14 steps

A practical guide, based on Anthropic engineering posts and experiments, to building a self-improving agent system around Claude Fable 5. It argues that most users underutilize the Mythos-class model, treating it like a bigger Sonnet 4.6. The architecture layers primitives (model, sub-agents, worktrees), orchestration (/goal and Outcomes loops, Dynamic Workflows, Routines for cloud execution), memory (state files and compounding Skills), and self-improvement (vision self-checks, eval loops, rule distillation). Key tactics include using an independent verifier sub-agent instead of self-critique, ensuring parallel safety with git worktrees, running multi-day tasks on cloud infrastructure, and following a 5-stage memory progression from failure documentation to general rule consultation. Aimed at engineers building compounding systems rather than prompting in sessions that last minutes.

Section 02

Cognitive Overload and Human Bottlenecks: Reclaiming First Principles

5 / 22
x.com · 9 min
06

The Orchestration Tax: Why 20 AI Agents Don't Mean 20x Output
The orchestration tax of AI agents: why running 20 agents does not equal 20x output

Addy Osmani coins 'orchestration tax': spawning agents is cheap, but closing the loop—reviewing, judging, merging—is serial and bounded by your cognitive bandwidth. Using Amdahl's Law and Python's GIL as analogies, he argues you are the single-threaded bottleneck in an otherwise concurrent system. Tactics: cap parallelism to your review rate, split work into delegable vs. judgment-heavy piles, batch reviews to cut context-switch costs, and force agents to self-verify. Aimed at engineers who run multiple AI agents daily and feel busy but unproductive.
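The Amdahl's Law analogy can be made concrete with a short calculation. The sketch below uses the standard formula, speedup = 1 / ((1 - p) + p / n). The 80% parallel fraction is an illustrative assumption, not a figure from the post.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's Law: overall speedup when only part of the work parallelizes.

    parallel_fraction is the share of work that agents can do concurrently;
    the remainder (review, judgment, merging) stays serial.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumed for illustration: 80% of the work is delegable, 20% is human review.
for n in (1, 5, 20, 1000):
    print(n, round(amdahl_speedup(0.8, n), 2))
```

With a 20% serial share the speedup can never exceed 1 / 0.2 = 5x, however many agents run, which is the arithmetic behind capping parallelism to your review rate.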

x.com · 24 min
07

10 Lessons for Writing a Good AGENTS.md for Codex and Claude Code
Ten commandments for writing AGENTS.md: making AI coding agents truly understand your project

Ten hard-won lessons from running Codex and Claude Code side by side, distilled into a survival guide for writing AGENTS.md files that actually work. Key moves include capping the root file at 200 lines, listing what not to introduce alongside the actual stack, writing rules the tool can mechanically check instead of vague principles like “keep it simple”, and treating the entry file as a router to architecture docs rather than a single dump. Other high-leverage practices involve using PLANS.md to break long-running tasks into reversible phases inside an isolated worktree, giving high-risk directories their own local guardrails, layering intent–intercept–permission–sandbox so red lines aren't left to model memory alone, storing auditable long-term memory in MEMORY.md with a 30‑day hurdle, and separating personal style from team conventions from machine permissions. The guide closes with a copy‑ready skeleton and the principle that the entry file should grow like a test suite every time the tool gets something wrong.
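The 200-line cap on the root file is itself a rule that can be checked mechanically. The helper below is a hypothetical sketch of such a check (the function name is an assumption, and the guide's own tooling is not shown in this summary).

```python
from pathlib import Path

MAX_ROOT_LINES = 200  # the cap for the root AGENTS.md quoted above


def check_line_cap(text: str, max_lines: int = MAX_ROOT_LINES) -> tuple[bool, int]:
    """Return (within_cap, line_count) for the given file contents."""
    line_count = len(text.splitlines())
    return line_count <= max_lines, line_count


def check_file(path: str, max_lines: int = MAX_ROOT_LINES) -> tuple[bool, int]:
    """Read a file from disk and apply the same check."""
    return check_line_cap(Path(path).read_text(encoding="utf-8"), max_lines)
```

Calling `check_file("AGENTS.md")` from a pre-commit hook or CI step would turn the cap from a stated intention into an enforced rule.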

github.com · 14 min
08

Composable Agent Skills for Real Engineering Workflows
An AI coding workflow for engineers: a set of composable Agent Skills

Matt Pocock's personal agent skills for Claude Code and Codex target four common failure modes in AI-assisted development: misalignment, verbosity, broken code, and design entropy. Instead of controlling the process, these small, composable skills embed engineering fundamentals: grilling sessions for alignment, a shared ubiquitous language for concision, TDD red-green-refactor loops for code quality, and architecture rescue tools. They work with any model and are designed to be hacked and adapted in your own .claude directory.

x.com · 28 min
09

Every Agentic Engineering Hack I Know (June 2026)
My practical agentic engineering techniques (June 2026 edition)

The author shares 22 practical hacks for agentic engineering with Claude Code and Codex. The core is a plan-first workflow: use /ce-plan to generate a plan.md that guides the agent; humans skim or ask inline instead of reading it. Hacks include: voice input via Monologue or Wispr Flow (LLMs handle imperfect transcription); running 4-6 separate agent sessions in cmux tabs; defaulting terminal tabs to Claude Code and bypassing all permission prompts with sound alerts on completion; giving Claude an email address via AgentMail to trigger sessions remotely; using last30days before planning to search community discussions and news in parallel; turning repeated tasks into reusable skills to compound agent capabilities. He stresses that human value lies in providing taste and direction, not typing, and warns against AI addiction. The post is packed with copy-paste config snippets and concrete tools, aimed at engineers deep into AI-assisted development.

x.com · 16 min
10

25 Claude Features, Workflows, and Tricks That Most Users Don't Know
An in-depth guide to Claude Projects: 25 underrated features, workflows, and tricks

A practical guide by @eng_khairallah1 detailing 25 workflows and techniques to fully leverage Claude Projects. The core thesis is treating Projects as evolving, persistent workspaces rather than transient chat sessions. It provides actionable strategies including a structured instruction template, strategic file organization, the Living Instructions pattern, and advanced concepts like voice calibration files and competitive intelligence hubs. The guide emphasizes a compounding knowledge strategy where each interaction refines Claude's contextual understanding, suitable for power users aiming to transform Claude from a generic tool into a domain-specific specialist.

Section 03

Loops, Memory, and Verification: Building the Compounding Flywheel

6 / 22
x.com · 18 min
11

How to Design a Loop That Prompts Your Agent
Designing a multi-step task loop that drives the agent on its own

This article presents a loop architecture that enables an AI agent to autonomously complete multi-step tasks by building an automated prompting system instead of manually crafting each prompt. It breaks down the loop into five parts: defining a 'done' check, building prompts from dynamic state rather than hand-fed instructions, executing actions while capturing all outputs, feeding failures back as the next prompt, and setting hard stop conditions like max turns and cost limits. A walkthrough of fixing a login bug shows the loop in action, emphasizing that real costs come from repeated turns, making guardrails critical. Encapsulating repeated operations into reusable skills is highlighted as the key to long-term value, and common mistakes—like lacking an exit condition or discarding error output—are pointed out. Suitable for developers shifting from one-shot prompts to designing agent control flows.
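The five parts of the loop can be sketched as a small Python driver. This is an illustration under assumptions rather than the article's code: `act` stands in for a real agent call, `is_done` for a real check such as running a test suite, and the flat per-turn cost is a placeholder for actual usage accounting.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopState:
    goal: str
    last_error: str = ""
    transcript: list[str] = field(default_factory=list)


def build_prompt(state: LoopState) -> str:
    """Build the prompt from dynamic state, not hand-fed instructions."""
    prompt = f"Goal: {state.goal}"
    if state.last_error:
        # Feed the previous failure back in as part of the next prompt.
        prompt += f"\nPrevious attempt failed with:\n{state.last_error}"
    return prompt


def run_loop(goal: str,
             act: Callable[[str], str],                   # hypothetical agent call
             is_done: Callable[[str], tuple[bool, str]],  # the 'done' check
             max_turns: int = 5,
             max_cost_usd: float = 1.0,
             cost_per_turn_usd: float = 0.10) -> tuple[str, LoopState]:
    state = LoopState(goal)
    spent = 0.0
    for _ in range(max_turns):                      # hard stop: max turns
        if spent + cost_per_turn_usd > max_cost_usd:
            return "stopped: cost limit", state     # hard stop: cost limit
        output = act(build_prompt(state))
        spent += cost_per_turn_usd
        state.transcript.append(output)             # capture every output
        done, error = is_done(output)
        if done:
            return "done", state
        state.last_error = error                    # keep the error, never discard it
    return "stopped: max turns", state
```

Both stop conditions return an explicit status, so a caller can tell a finished task from an exhausted budget.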

x.com · 14 min
12

Loop Engineering: Designing the System That Prompts Your Coding Agents
Loop engineering: coding agents run autonomously in the background while you design the loop itself

Addy Osmani argues that interacting with coding agents is shifting from prompt engineering to 'loop engineering'—designing a system that autonomously discovers tasks, delegates work, and verifies results using five building blocks: scheduled automations, parallel worktrees, project skills, connector plugins, and checker sub-agents. He maps how Claude Code and Codex both implement all five, noting that the leverage point has moved from writing good prompts to architecting persistent loops. The post cautions that loops amplify existing problems: verification, comprehension debt, and cognitive surrender become sharper risks. Intended for senior engineers evaluating how to productize AI coding tools beyond one-shot interactions.

x.com · 17 min
13

How to Master Dynamic Workflows in Claude Code: 6 Patterns and 14 Steps
Dynamic workflows in Claude Code in practice: 6 patterns and a complete 14-step guide

This article provides a systematic guide to Dynamic Workflows in Claude Code, shipped in late May 2026. It moves beyond manual prompt chaining by letting Claude generate a bespoke JavaScript harness for a specific task. The author first explains the mental model: how workflows structurally fix agentic laziness, self-preferential bias, and goal drift inherent in single-context sessions. It then breaks down six core patterns—classify-and-act, fan-out-and-synthesize, adversarial verification, generate-and-filter, tournament, and loop until done—each with code skeletons. Real-world use cases show how to compose patterns for migrations, deep research, triage, and lightweight evals. Practical controls like /goal, /loop, token budgets, and the quarantine pattern for untrusted inputs are covered. It also advises on saving successful workflows and shipping them as Skills. This guide is for engineers aiming to tackle long-running, parallel, or adversarial tasks beyond a single Claude Code session.
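As one illustration of the patterns listed above, here is a Python sketch of a tournament: candidates are compared pairwise by a judge and winners advance until one remains. This is one common reading of the pattern, not the article's skeleton (which the summary says is JavaScript and is not reproduced here); `judge` is a hypothetical stand-in for a model-backed comparison.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def tournament(candidates: Sequence[T], judge: Callable[[T, T], T]) -> T:
    """Single-elimination tournament: judge(a, b) returns the better of two."""
    if not candidates:
        raise ValueError("need at least one candidate")
    current = list(candidates)
    while len(current) > 1:
        winners = []
        # Compare neighbours pairwise; each comparison yields one winner.
        for i in range(0, len(current) - 1, 2):
            winners.append(judge(current[i], current[i + 1]))
        # With an odd count, the last candidate advances unopposed.
        if len(current) % 2 == 1:
            winners.append(current[-1])
        current = winners
    return current[0]


# Toy judge for demonstration only: prefer the longer draft.
best = tournament(["a", "bbb", "cc", "dddd", "ee"],
                  lambda x, y: x if len(x) >= len(y) else y)
print(best)
```

Pairwise comparison needs only n - 1 judge calls for n candidates and avoids asking a model to rank a long list in one pass.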

x.com · 5 min
14

Designing loops with Fable 5: self-correction and memory in agentic workflows
Claude Fable 5 in practice: refining agent tasks with self-correction loops and cross-session memory

The author shares two practical directions for improving agentic workflows with Anthropic's Claude Fable 5 model: self-correction loops and cross-session memory. In a Parameter Golf challenge—train the best model within a 16MB artifact in under 10 minutes on 8×H100 GPUs—Fable 5 improved the training pipeline roughly 6× more than Opus 4.7 when using Claude Managed Agents with Outcomes judged by an independent verifier sub-agent against nine checkable criteria. Fable 5 bet on larger structural changes and pushed through a quantization regression, while Opus 4.7 stuck to tuning scalar hyperparameters. For memory, the author used a SQL-based task from Continual Learning Bench 1.0 with filesystem-backed memory across agent sessions. Sonnet 4.6 only logged failures and guesses; Opus 4.7 built flagged schema references but verified only 17% of questions; Fable 5 reached 73% verification coverage in the best run and distilled learnings into general rules. Engineers interested in agent architecture and model capability boundaries will find the experiments relevant.

x.com · 21 min
15

How To Build AI Agents in 2026 (That Actually Work)
How to build AI agents that actually work in 2026: from mental model to hands-on code

This article systematically deconstructs the architecture and engineering practices for building practical AI agents. It clarifies the boundaries between chatbots, AI agents, and agentic AI, emphasizing that a real agent is a system that persistently loops toward a goal rather than delivering a one-shot answer. The author explains the ReAct loop (Reasoning + Acting) and breaks down the five building blocks: the LLM as the brain, tools as hands, short-term and long-term memory, self-correcting loops, and verification. Using a case study of a startup research agent for the fitness niche, the article walks through goal setting, tool integration, loop construction, memory implementation, and the addition of a critic agent, complete with copy-paste system prompts. It highlights six common failure modes and recommends a 2026 tech stack including Claude Code, LangGraph, and MCP. The piece provides a weekend roadmap to build a basic agent from a 50-line Python script and is aimed at developers shifting from prompt engineering to designing agent systems.

github.com · 19 min
16

Claude Agents & Skills for Investment Banking, Research, PE, and Wealth Management
Claude agents and skill sets for investment banking, research, private equity, and other financial scenarios

Anthropic's official reference implementation of Claude agents for financial services, offering 9 end-to-end workflow agents for investment banking, research, PE, and wealth management, along with 8 vertical skill packs and 12+ MCP data connectors. Everything is file-based (Markdown/YAML), installable as Cowork plugins or deployable via Managed Agents API. Designed for technical teams who need ready-made finance AI workflows while retaining full customization.

Section 04

Agents in the Wild: From Personal Productivity to Redesigned Workflows

5 / 22
claude.com · 9 min
17

How an Anthropic seller rebuilt his team's workflows with Claude Code
An Anthropic seller uses Claude Code to build an internal tool suite from scratch

Jared Sires, a former account executive at Anthropic with no coding experience, used Claude Code to build CLAFTS, a Gmail-integrated tool that drafts customer emails in his voice while pulling context from live product documentation. The tool saves 10-15 hours per week. He expanded this into a sales plugin with skills for daily briefs, recaps, and pipeline management, wired into Salesforce, Gong, and other systems via MCP servers. About 80% of Anthropic's sales org now uses the plugin. The piece illustrates how non-technical practitioners can leverage AI coding tools to eliminate technical barriers and deliver workflow-specific software.

github.com · 29 min
18

AI Skills Marketplace for Product Managers: 100+ Structured Workflows from Discovery to Growth
An AI skills marketplace for product managers: 100+ structured workflows, from discovery to growth

pm-skills is an AI skills marketplace for product managers, packing 100+ codified PM skills and 42 chained workflows into 9 installable plugins. It transforms established product methodologies by Teresa Torres, Marty Cagan, and others into structured, step-by-step AI-guided processes, going beyond generic text generation to enforce product-thinking rigor. Covering the full lifecycle from discovery to growth, it works as a Claude Code/Cowork plugin and offers cross-platform skill support. Ideal for PMs and founders who want to embed AI into their decision-making workflow, not just accelerate document output.

github.com · 27 min
19

AI Agent Skill: Cross-Platform Social Search and 30-Day Synthesis
AI agent skill: cross-platform social search and a 30-day sentiment briefing

/last30days is an AI agent skill that aggregates the latest content from Reddit, X, YouTube, TikTok, Hacker News, and more into a 30-day briefing. It uses entity pre-research to identify key people, communities, and topics, then searches in parallel and scores by real engagement (upvotes, likes, money) rather than SEO. An AI synthesizes a cited, in-depth summary. Open-source (MIT), it supports Claude Code and 50+ agent frameworks. Ideal for engineers, PMs, and researchers needing a quick, grounded update before meetings or decisions.

x.com · 17 min
20

AI Agents: What They Are and How to Build a Telegram Bot with Claude Code
AI agents in practice: from the theoretical spectrum to building a Telegram bot with no code

This guide clarifies that AI agents are not a single category but a spectrum from simple chat to autonomous loops, defined by tools, memory, and a loop. It then provides a no-code, step-by-step tutorial on building a Telegram bot agent with Claude Code, including system prompt templates, systemd deployment, persistent memory, cost tracking, and practical skills. It also addresses the common memory problem and offers concrete fixes. Suitable for engineers who want a practical agent without writing code themselves.

github.com · 9 min
21

Maple: An Open-Source Observability Platform Built on OpenTelemetry and ClickHouse
Maple: an open-source observability platform based on OpenTelemetry + ClickHouse

Maple is an open-source observability platform for traces, logs, and metrics, built on OpenTelemetry and ClickHouse. It features an OTLP ingest gateway with key-based auth, a chat agent, alert evaluation, and SQLite/Turso-backed dashboard persistence. The monorepo ships with Clerk or self-hosted auth modes and a full suite of CI/CD workflows for Cloudflare Workers, targeting teams that want to own their observability stack in a TypeScript-native codebase.

Section 05

More

1 / 22
x.com · 5 min
22

Designing loops with Fable 5: self-correction and cross-session memory
How to design loops for Claude Fable 5: self-correction and cross-session memory

R. Lance Martin demonstrates two loop patterns for Anthropic's Fable 5: self-correction and cross-session memory. On the Parameter Golf challenge (train a model under 16MB in under 10 minutes on 8×H100s), Fable 5 with Claude Managed Agents (CMA) and a verifier sub-agent improved the pipeline roughly 6× more than Opus 4.7, favoring structural changes over scalar tuning. On a continual learning SQL benchmark, Fable 5 progressed through fail, investigate, verify, and distill stages to arrive at general rules, reaching 73% verification coverage, while Opus 4.7 and Sonnet 4.6 stalled at uncertain schemas and sparse notes respectively. The key takeaway: design loops and environment feedback so the model can hill-climb, rather than relying on direct prompting.