Tag · AI Engineering - Glean

Recent picks

66 picks · chronological

07-27

How Anthropic runs large-scale code migrations with Claude Code

Anthropic engineers used Claude Code (Fable 5 and Opus 4.8) to migrate Bun from Zig to Rust in two weeks, producing one million lines of code with 100% test pass rate; another engineer ported a Python codebase to 165,000 lines of TypeScript over a weekend. The article details a six-step migration process: create rulebook and dependency map, stress-test rules, translate everything (parallel), compile, run, and match behavior. The core insight is to fix the loop that produces code, not the code itself. It discusses when migrations are justified, why AI changes the economics (parallelism, clear context, built-in referee, self-generating queue), and best practices (use smaller models for implementation, largest for review). Concrete token and cost data are provided: Bun migration consumed ~5.9B uncached input tokens and 690M output tokens, costing ~$165k at API pricing.

x.com · 15 min · Agent Architecture · AI Engineering · Anthropic

07-27

A Guide to Building Personal AI Infrastructure

This article presents a systematic guide to building a personal AI digital assistant based on Daniel Miessler's Personal AI Infrastructure (PAI) framework. The core thesis: don't start with tools, start with yourself. It covers the TELOS identity system (10 Markdown files defining mission, goals, beliefs, etc.), a three-tier memory architecture (hot/warm/cold), a decision priority chain (goal → code → CLI → prompt → agent), a user/system separation directory design, and an event hook system. It emphasizes that architecture matters more than model choice—a good context management system with an ordinary model often outperforms a top model without context. Suitable for engineers and knowledge workers who want to build a personalized, continuously learning AI assistant.

x.com · 5 min · Agent Architecture · AI Engineering · Context Engineering

07-26

Why Harness Engineering Is So Hard

Based on five months of real-world experience (104 commits), this article dissects the structural difficulties of turning LLM demonstrations into reliable products. Key challenges include: inability to write deterministic tests (same input yields different output each time), silent and graded failures (1% error hidden in 99% correct output), debugging natural language paragraphs instead of code (a single adjective can be a bug), the additive instinct trap (prompt growing from 20 to 200 lines causes contradictions), examples steering harder than rules, unstable model foundation (vendor updates silently shift behavior), slow and expensive feedback loops, and invisible work (outsiders think it's just writing prompts). The author argues that harness engineering (prompts, validators, evals, guardrails) is the true moat, with difficulty stemming from the probabilistic nature of the substrate, which cannot be engineered away but only absorbed. Recommended for LLM app developers, AI engineers, and tech leads.

x.com · 16 min · AI Engineering · Developer Tools · LLM

07-26

Agent Harness Engineering vs. Loop Engineering vs. Graph Engineering

A practical guide distinguishing three architecture layers for AI agents: Agent Harness (code, config, runtime around the model), Loop (repeated work-feedback cycles), and Graph (explicit workflow topology). The author explains what each layer owns, common mistakes, and how to choose the right lever when debugging. Includes a symptom-to-layer mapping table and a production-ready checklist. Essential reading for teams moving agents from demos to production.

x.com · 17 min · Agent Architecture · Agent Engineering · Agents

07-25

Introducing Claude Opus 5: Near-Frontier Performance at Half the Cost

Anthropic releases Claude Opus 5, approaching the frontier intelligence of Fable 5 at half the cost. It achieves state-of-the-art results on coding and knowledge work benchmarks, e.g., Frontier-Bench v0.1 (more than double Opus 4.8's performance) and ARC-AGI 3 (3x the score of the next best model). However, it remains behind Mythos 5 on cybersecurity tasks. The model offers tunable effort settings for cost-performance trade-offs. Customer reports highlight gains in software engineering, finance, and legal work. Alignment improves, but cybersecurity capabilities are intentionally limited. Safety classifiers intervene ~85% less than with Fable 5. Pricing is the same as Opus 4.8, with fast mode available.

www.anthropic.com · 20 min · AI Engineering · Anthropic · Cost Optimization

07-25

AI Agent Engineering in Depth: From Principles to Production

An open-source book providing a comprehensive guide to AI Agent engineering, from fundamental principles to production practices. Written by Li Bojie, it follows the core formula 'Agent = LLM + Context + Tools' across 10 chapters, covering context engineering, memory, tool use, coding agents, evaluation, post-training, continuous evolution, multimodal, and multi-agent collaboration. Includes 92 hands-on experiments (70+ runnable) on MCP, RAG, RL, etc. Ideal for engineers and researchers building production AI agents.

github.com · 15 min · Agents · AI Engineering · Context Engineering

07-23

From Loop to Graph: The Real Shift in AI Agent Improvement

This essay explores the shift from single-loop optimization to graph-structured improvement in AI agent engineering. It dissects four failure modes of single loops: Goodhart's law, blindness to target correctness, conflict among loops, and measurement decay. Mature systems employ graphs of loops—champion-challenger, drift monitoring, rollback, and audit loops—each watching and constraining others. But graphs alone fail without anchors: measures that touch reality, frozen rules, and human judgment on what 'better' means. The key is groundedness, not topology. For AI engineers, MLOps practitioners, and agent designers.

x.com · 13 min · Agent Architecture · AI Engineering · Loop Engineering

07-22

AI-Native Markdown IDE and LLM Wiki

Open Knowledge is an open-source, AI-native Markdown IDE that also serves as an LLM-powered wiki. It seamlessly integrates Markdown/MDX editing, personal knowledge management, and AI agents (like Claude and Codex) to help you build a second brain. Ideal for developers and knowledge workers who want intelligent note-taking and documentation.

github.com · 1 min · Agents · AI Engineering · Developer Tools

07-21

How to Build the Loops That Just Replaced Entire Prompt Engineering

The era of crafting better prompts to improve AI agents is quietly ending. Top engineers now build 'loops'—systems where agents plan, execute, verify, and iterate without human intervention. Using Karpathy's overnight 700-experiment run that found 20 optimizations as a case study, the article breaks down loop architecture: automation heartbeat, skill files (project context), sub-agents (writer vs reviewer), connectors, and a verifier gate. It warns against Ralph Wiggum loops (premature exit) and comprehension debt (accumulated unread code). Practical advice: start with one boring recurring task, verify manually before scheduling, measure cost per accepted change. Essential for engineers building production agent workflows.

x.com · 9 min · Agents · AI Engineering · LLM

07-20

Graph Engineering Replaces RAG: Insights from Microsoft, Stanford, Anthropic

Traditional RAG struggles with complex multi-entity questions. Microsoft, Stanford, and Anthropic independently found that graph engineering (knowledge graphs) can improve accuracy and reduce costs. Microsoft's GraphRAG converts unstructured text into a knowledge graph, achieving 18% higher accuracy and 85% lower token costs. Stanford's DSPy and STORM treat the model as a node in a graph, and scaling law research shows a small model with a good graph outperforms a large model with a poor graph. Anthropic's Claude, combined with MCP, enabled LaunchNotes to detect incidents 5x faster and cut meeting time by 50%. The article details the full pipeline, five prompt templates, and five business applications. Ideal for engineers advancing beyond basic RAG.

x.com · 16 min · AI Engineering · GraphRAG · Knowledge Graph

07-19

LangChain’s Open-Source Software Factory

LangChain open-sources four internal software engineering agent tools: local coding agent dcode, cloud coding agent OpenSWE, automated code review OpenSWE Review, and repo knowledge documentation OpenWiki. It includes real usage data (OpenSWE triggered ~1,000 times from Slack last week) and benchmark results (OpenSWE Review scores 47%, #1 among open-source tools). Built on Deep Agents framework with LangSmith observability. Targeted at engineers building controllable, observable agent pipelines.

x.com · 7 min · Agents · AI Engineering · Code Review

07-19

Interview with Yang Zhilin: AGI, Long Context, and China's AI Race

In early 2024, Yang Zhilin, founder of Moonshot AI (Kimi), gave an in-depth interview covering his AGI conviction, the strategic choice of long context as the first step, views on open-source vs closed-source, and the organizational form needed for AI startups. He argues that true AGI companies must combine science, engineering, and business, insist on B2C, and synchronize user scaling with model scaling. The interview includes real-time assessments of Sora, catching up with GPT-4, and the gap between Chinese and Silicon Valley AI companies, revealing the balance between technical idealism and pragmatism.

x.com · 60 min · AI Engineering · AI Industry · Context Engineering

07-18

Unibase Memory: Share Context Across ChatGPT, Claude, and Gemini

When using multiple AI tools, context is lost between sessions, wasting hours re-explaining. Unibase Memory is a Chrome extension that captures, organizes, and injects memory across ChatGPT, Claude, and Gemini, enabling shared context. The post covers a 5-step setup from installation to advanced workflows (research-to-draft, persistent brand voice, cross-tool building), with local encryption and optional decentralized sync. For engineers and creators juggling multiple AI models, it offers a practical solution to AI memory fragmentation.

x.com · 12 min · AI Engineering · AI Tooling · Context Engineering

07-16

The Short Leash AI Coding Method For Beating Fable

This post distills over a year of research on using AI agents for security-critical software. The author introduces the “Short Leash” method: only expert developers can use it; never enable YOLO mode; manually review every diff in the permissions prompt to keep the AI on track; commit after each subtask to safeguard against regressions. It also details AI-assisted code review: pair human and AI, with AI catching surface errors and humans guiding direction. PR authors must self-review line-by-line and disclose AI models used. This approach beats Fable even with non‑frontier models, without sacrificing quality. Targeted at senior engineers who want productivity gains without giving up understanding.

blog.okturtles.org · 7 min · AI Agents · AI Engineering · Code

07-16

Better Models: Worse Tools

Armin reports a counterintuitive bug encountered while developing the Pi code editor: newer Claude models, including Opus 4.8 and Sonnet 5, invent extra fields in the nested edits[] array when calling Pi's custom edit tool, causing the tool call to be rejected. Older Claude models do not exhibit this behavior. Armin hypothesizes that Anthropic's reinforcement learning has specifically optimized newer models for Claude Code's built-in edit tool, inadvertently degrading performance on other tool schemas. The piece questions whether third-party coding harnesses must implement multiple edit tools per model family, and highlights the fundamental trade-off between specialized training and general tool compatibility.

simonwillison.net · 2 min · Agent Engineering · AI Engineering · Claude Code

07-15

How to Create Loops with Claude

This article advocates shifting from writing single prompts to designing loops—automated systems that keep AI agents working without human intervention. It breaks down a loop into six components: automation triggers, git worktrees for parallel isolation, skills (procedure manuals), connectors, sub-agents, and persistent memory files (e.g., STATE.md). The evaluator-optimizer pattern is highlighted: one agent generates, another verifies against objective gates like test suites or type checkers. Stop conditions must be checkable by external signals, not the agent's own claim. An autonomy ladder (suggest, draft, apply low-risk, full auto) helps gradually earn trust. The article also warns about token costs and the need for command allowlists in unattended loops.

x.com · 10 min · Agent Architecture · AI Engineering · Claude Code
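
The evaluator-optimizer pattern summarized in the entry above can be illustrated with a minimal Python sketch. It is not code from the article: `run_loop`, `toy_generate`, and `toy_gate` are hypothetical names, the generator is a stub rather than a real Claude call, and the gate is a toy stand-in for a test suite or type checker. The point it shows is that the stop condition comes from the external gate and an attempt budget, not from the generator's own claim of success.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LoopResult:
    accepted: bool
    attempts: int
    candidate: str
    feedback: List[str]


def run_loop(
    generate: Callable[[str, List[str]], str],
    gate: Callable[[str], List[str]],
    task: str,
    max_attempts: int = 5,
) -> LoopResult:
    """Generate, check against an external gate, and retry with feedback.

    `generate` stands in for the generator agent and `gate` for an objective
    check. Both are hypothetical callables, not a real Claude API.
    """
    feedback: List[str] = []
    candidate = ""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(task, feedback)
        failures = gate(candidate)  # an empty list means the gate passed
        if not failures:
            return LoopResult(True, attempt, candidate, feedback)
        feedback = feedback + failures
    # Budget exhausted: stop and report instead of trusting the agent's claim.
    return LoopResult(False, max_attempts, candidate, feedback)


# Toy stand-ins so the sketch runs without any model or network access.
def toy_generate(task: str, feedback: List[str]) -> str:
    # Pretend the generator only gets it right after seeing one failure.
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def toy_gate(candidate: str) -> List[str]:
    namespace: dict = {}
    exec(candidate, namespace)  # fine for a toy; never exec untrusted output
    return [] if namespace["add"](2, 3) == 5 else ["add(2, 3) should equal 5"]


result = run_loop(toy_generate, toy_gate, "implement add")
print(result.accepted, result.attempts)
```

In a real loop the gate would shell out to a test runner or type checker, and the attempt budget would sit alongside a token budget and a command allowlist, as the summary above cautions.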

07-14

Model and effort in Claude Code: knowing more vs. trying harder

This article by a Claude Code team member explains the real mechanism behind model and effort settings. Model selection swaps frozen weights (knowledge), while effort controls how much work Claude does—how many files it reads, tests it runs, and how thoroughly it verifies. Using analogies (specialist vs expert vs generalist) and diagrams, it clarifies when to upgrade the model (not enough knowledge) vs increase effort (not enough trying). Practical advice: start with default effort, choose larger models for hard problems, smaller ones for routine tasks to save cost. Key insight: check context first, then decide if Claude didn't know or didn't try hard enough.

x.com · 14 min · AI Engineering · Claude Code · LLM

07-14

The 'Caveman Skill' That Claims 65% Token Savings Actually Saves Only 8.5%

This article analyzes the recent trend of 'caveman skills' (such as the Caveman project) that prompt AI coding tools to output minimal language to save tokens. The author points out that the claimed 65% token savings comes from chat scenarios, whereas in agentic programming tasks, tool calls and system prompts dominate token usage. A controlled test by JetBrains (86 tasks, 240 trials) showed that even with forced activation, output token savings were only 8.5%, and in practice the savings are even smaller due to conditional activation. The article also discusses the cost of brevity: loss of information leads to more developer follow-ups and agent rework. The author argues that true cost optimization comes from context management (e.g., prompt caching) and reducing unnecessary tool calls, not from compressing output.

x.com · 2 min · AI Engineering · Claude Code · Prompt Engineering

07-13

Lightweight terminal-based coding agent with local execution and cloud integration

Codex CLI is a lightweight coding agent from OpenAI that runs locally in your terminal, powered by your ChatGPT subscription or API key. Unlike IDE plugins or desktop apps, it offers a pure CLI experience tailored for terminal-centric developers. It supports macOS, Linux, and Windows and can be installed via shell script, npm, or Homebrew. Built with Rust and Bazel, it emphasizes performance and portability. Open-sourced under Apache-2.0, it's ideal for developers exploring command-line AI coding assistants.

github.com · 5 min · AI Engineering · CLI · Coding Agent

07-11

Agentic test processes: from chip design to AI workflows

Drawing on his experience at the chip company Centaur, the author describes test processes that scale well with LLM agents: no code review by default, heavy reliance on fuzzing, and a dedicated test team. He argues that while LLMs are poor at writing tests directly, directed fuzzing with LLMs can find real bugs in minutes. The article highlights the high variance of LLM outputs (benchmark rankings often flip with minor task changes) and cautions against over-reliance on aggregated metrics. Through examples like building a superhuman board game AI, he advocates systematic data-driven iteration over prompt tricks. Targeted at engineers interested in AI-assisted development, testing, and agent workflows.

danluu.com · 91 min · AI Engineering · Benchmarks · Developer Tools

07-10

Build self-improving agent system with Fable 5 in 14 steps: loops, dynamic workflows, routines

This article provides a detailed 14-step roadmap for building a self-improving agent system using Claude Fable 5. It moves Fable 5 from a prompt-and-close tool to a compounding system: using /goal and Outcomes for self-correcting loops, independent verifier sub-agents over self-critique, state files (STATE.md) and Skills for cross-session memory, and Dynamic Workflows and Routines for long-running autonomy. It includes a cost-capability matrix (Fable 5 for orchestration, Sonnet 4.6 for workers, Haiku 4.5 for graders) and guidance on handling the Mythos safety boundary. Suitable for AI engineers and system designers aiming to leverage Fable 5's days-long autonomous capability.

x.com · 28 min · Agent Engineering · Agents · AI Engineering

07-10

Continual Learning for Agents: Eval, A/B, and Self-Improvement at Replit

This article argues that continual learning for agents isn't limited to weight updates—agents using closed frontier models can improve via harness and context layers. Using Replit Agent as a case study, it details a three-layer evaluation system: ViBench, an offline benchmark for vibe coding that scores apps built from scratch against natural-language PRDs; online A/B tests to capture real user behavior; and Telescope, a trace analysis system that clusters failure patterns. These feed a self-improvement loop that automatically proposes patches (reviewed by engineers). A concrete example shows how a cold-start regression was detected, fixed, and shipped in one day. The piece is relevant for engineers building AI agents and evaluation infrastructure.

x.com · 16 min · Agent Architecture · Agents · AI Engineering

07-10

Rewriting Bun in Rust: 535K Lines, 11 Days, 64 AI Agents

Bun's creator Jarred Sumner recounts how he used Anthropic's Claude Fable 5 to rewrite Bun's 535,496 lines of Zig into Rust in 11 days. The motivation: Zig's manual memory management caused numerous use-after-free, double-free, and memory-leak bugs when mixed with JavaScriptCore's GC. Instead of an incremental port, he orchestrated 64 Claude agents in parallel using dynamic workflows and adversarial review. 100% of Bun's test suite (over 600k assertions) passed on all 6 platforms. The rewrite fixed 128 bugs, reduced memory usage by up to 90%, shrank the binary by ~20%, and improved throughput by 2-5%. The article details the workflow, common porting mistakes (e.g., debug_assert! side effects, slice overruns, comptime format differences), and how Rust's Drop systematically prevented memory leaks. A first-hand account of using cutting-edge AI to accomplish a year-long team project in less than two weeks.

bun.com · 65 min · Agent Engineering · AI Engineering · Code

07-08

Loop Patterns in Claude Code: A Practical Guide

The Claude Code team's official blog post introduces four loop modes: turn-based, goal-based, time-based, and proactive loops. It covers how each is triggered, stopped, and when to use them, along with token management and code quality tips. Practical CLI commands and SKILL.md examples are provided. The article emphasizes starting simple and gradually automating repetitive tasks. Essential reading for engineers using or evaluating Claude Code for autonomous development.

x.com · 9 min · Agents · AI Engineering · AI Tooling

07-08

Harness Engineering for Self-Improvement

This comprehensive survey by Lilian Weng systematically examines the critical role of harness engineering in recursive self-improvement (RSI) for AI systems. A harness is the system layer surrounding a base model that orchestrates execution, context management, tool calling, persistent state, and workflow design. The post synthesizes three design patterns (workflow automation, filesystem as persistent memory, sub-agents and backend jobs) and dives into frontier works: context engineering (ACE, MCE), meta-optimization (Meta-Harness), workflow automation search (ADAS, AFlow), self-improving harnesses (STOP, Self-Harness), and evolutionary search (AlphaEvolve, Darwin Gödel Machine). It concludes with open challenges: weak evaluators, memory management, diversity collapse, reward hacking. Essential reading for AI engineers and agent researchers.

lilianweng.github.io · 42 min · Agent Architecture · Agents · AI Engineering

07-07

A Field Guide to Fable: Finding Your Unknowns

The author shares hands-on experience with Claude Fable for agentic coding, emphasizing that the prompt (map) never fully matches the codebase (territory). He categorizes unknowns into four types (known knowns, known unknowns, unknown knowns, unknown unknowns) and provides practical techniques to systematically discover them: blindspot passes, brainstorming & prototypes, interviews, references, implementation plans, implementation notes, pitches, and quizzes. Ends with a real example of editing the Fable launch video. Suitable for engineers using AI-assisted coding.

x.com · 13 min · Agent Engineering · Agents · AI Engineering

07-07

Human-in-the-Loop Workflow Design: From Approval Fatigue to Planned Review

Based on an analysis of 400,000 Claude Code sessions, this article reveals that 93% of permission prompts are approved, leading to 'consent fatigue' where humans are nominally in the loop but functionally tuned out. The author proposes restructuring the workflow into three layers: input (precise task description, constraints, examples), steering (plan-level review instead of per-action approval), and output review (defining quality criteria and self-assessing). A single evaluation checkpoint improved generation quality by 8–10% in controlled tests. The article provides actionable steps to move from per-action approval to strategic intervention, targeting AI engineers and agent developers.

x.com · 11 min · Agent Architecture · AI Engineering · Claude Code

07-04

Superpowers 6: Cutting Build Cost 60% via Autoresearch Loop

Superpowers 6 is released, with its biggest improvements driven by an automated research loop. The author used Anthropic's Fable model (briefly available) to systematically optimize their Subagent Driven Development pipeline. Over 36 hours and ~$165 in token spend, 25 experiments were run, yielding a 50% reduction in wall-clock time and a 60% reduction in token consumption vs. v5. Key optimizations: merging spec compliance and code review agents, pre-baking review packets to minimize git operations, and dynamic agent allocation based on task type (e.g., using the cheaper Haiku for non-code plans). The post also documents falsified hypotheses (e.g., capping controller thinking backfires) and emphasizes the role of their eval suite in rigorous measurement.

blog.fsck.com · 8 min · Agent Engineering · AI Engineering · Anthropic

07-02

Building a Good Vertical Agent: Context as a Cache Hierarchy

The article argues that a good vertical agent is a faithful compression of its task distribution, and its context should be organized as L1/L2/L3 cache tiers. Using their Shortcut spreadsheet agent as an example, the authors detail extreme optimizations: reading a range compresses 500 formulas into a single legend line via R1C1 normalization and aliasing; after writing, a structured diff groups, samples, and triages changes, flagging #REF! errors under MUST FIX. L2 provides curated English specs fetched on demand, like the pivot table recipe that bakes in gotchas (suspendLayout/resumeLayout, raw integer 8 for aggregation). L3 is the raw API reference plus a 100-line grep skill that lets the model mine tens of thousands of lines in bounded steps. The prompt budget mirrors the frequency curve, and the hierarchy moves as models improve. Practical, transferable advice for engineers building reliable agents in any domain.

x.com · 21 min · Agent Architecture · Agent Infrastructure · AI Engineering
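
To make the "500 formulas into a single legend line" idea from the entry above concrete, here is a rough Python sketch of R1C1 normalization. It is an illustration under stated assumptions, not the Shortcut agent's implementation: `to_r1c1` and `legend` are hypothetical helpers, the aliasing step mentioned in the summary is not covered, and sheet-qualified references and string literals are ignored. The idea is that a formula filled down a column differs only in its A1-style row numbers, so rewriting each reference relative to its own cell makes all copies textually identical and easy to group.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# A1-style references such as B2, $B2, B$2, $B$2. The lookarounds keep
# function names like LOG10( from being mistaken for cell references.
REF = re.compile(r"(?<![A-Za-z0-9_.])(\$?)([A-Z]{1,3})(\$?)(\d+)(?![A-Za-z0-9_(])")


def col_to_index(letters: str) -> int:
    """Convert a column label to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def to_r1c1(formula: str, row: int, col: int) -> str:
    """Rewrite A1-style refs relative to the cell at (row, col), both 1-based."""

    def replace(match: re.Match) -> str:
        col_abs, letters, row_abs, digits = match.groups()
        ref_row, ref_col = int(digits), col_to_index(letters)
        if row_abs:  # absolute row, e.g. the $1 in $E$1
            r = f"R{ref_row}"
        else:
            delta = ref_row - row
            r = "R" if delta == 0 else f"R[{delta}]"
        if col_abs:  # absolute column, e.g. the $E in $E$1
            c = f"C{ref_col}"
        else:
            delta = ref_col - col
            c = "C" if delta == 0 else f"C[{delta}]"
        return r + c

    return REF.sub(replace, formula)


def legend(cells: Dict[Tuple[int, int], str]) -> Dict[str, List[Tuple[int, int]]]:
    """Group cells by their normalized formula."""
    groups: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for (row, col), formula in sorted(cells.items()):
        groups[to_r1c1(formula, row, col)].append((row, col))
    return dict(groups)


# 500 filled-down formulas in column C (rows 2 to 501).
cells = {(r, 3): f"=A{r}*B{r}+$E$1" for r in range(2, 502)}
summary = legend(cells)
for pattern, where in summary.items():
    print(f"{pattern}  x{len(where)}  first={where[0]}  last={where[-1]}")
```

Only the grouped pattern and its extent need to reach the model's context, which is where the token savings in this kind of scheme would come from.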

07-01

What I Learned About Agent Skills from Building Popular Ones

The author, having built several popular Skills (PPT, social media cards, logo generator, AI desk card), argues that Agents amplify rather than erase capability gaps. A Skill is defined as a reusable capability unit that bundles expert experience, workflows, taste, and tool calls. Core insights: Skill design is externalizing human taste as constraints (e.g., no pure white/black, text must not cover faces); architecture should be 'short center, thick radius' with SKILL.md holding only high-signal flow; quality must be maintained like code, with gotchas from real failures being the most valuable; the ecosystem should present each Skill as a feature page, not a repository list; distribution relies on GitHub for cross-platform reach and content platforms for community building, creating a flywheel of articles, products, and use cases. A full lifecycle from real need to feedback iteration is proposed. The article is aimed at AI Agent developers, product managers, and content creators, offering concrete cases and actionable design principles.

x.com · 13 min · Agent Architecture · Agent Skill Repository · AI Engineering

07-01

Micro-Agent: Beat Frontier Models with Collaboration inside Model API

The vLLM Semantic Router proposes a different take: a router is not just a request dispatcher but an amplifier of model capability. The core idea is to encapsulate multi-model collaboration inside a single model API call. The user sees one model endpoint (vllm-sr/auto), but behind it the router can automatically select a collaboration pattern, ranging from cost-aware escalation (Confidence), parallel aggregation (Ratings), repeated mixture-of-model reasoning (ReMoM), and disagreement-as-signal (Fusion) to budgeted micro-agent workflows (Workflows). These patterns are controlled, configurable, observable runtimes, not application glue. Benchmarks on LiveCodeBench, GPQA-Diamond, and Humanity's Last Exam show the closed-model collaboration scheme (VSR Closed) achieving 92.6%, 96.0%, and 50.0% respectively, matching or beating single frontier models like Fugu Ultra and GPT-5.5. The article's main contribution is pushing multi-model collaboration down from the product or application layer to the serving infrastructure layer while preserving a single model identity. For engineers building inference routing, multi-model strategies, or cost optimization.

vllm.ai · 14 min · AI Engineering · Cost Optimization · LLM
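
The cost-aware escalation pattern named in the entry above can be sketched generically in Python. This is not the vLLM Semantic Router's API or configuration: `Tier`, `route`, the cost units, and the confidence scores are all hypothetical, and the model calls are stubs. The sketch only shows the control flow: try the cheapest tier first and escalate while the reported confidence stays below a threshold.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    cost: float  # hypothetical cost units per call
    call: Callable[[str], Tuple[str, float]]  # returns (answer, confidence in [0, 1])


def route(prompt: str, tiers: List[Tier], threshold: float = 0.8) -> Tuple[str, str, float]:
    """Try tiers from cheapest to most expensive; stop at the first confident answer.

    Returns (answer, name of the tier that produced it, total cost spent).
    If no tier is confident, the most expensive tier's answer is returned.
    """
    answer, used, spent = "", "", 0.0
    for tier in sorted(tiers, key=lambda t: t.cost):
        answer, confidence = tier.call(prompt)
        spent += tier.cost
        used = tier.name
        if confidence >= threshold:
            break
    return answer, used, spent


# Stub models: the small one is only confident about a trivial prompt.
small = Tier("small", 1.0, lambda p: ("4", 0.95) if p == "2+2" else ("unsure", 0.3))
large = Tier("large", 10.0, lambda p: ("careful answer", 0.9))

print(route("2+2", [large, small]))
print(route("hard question", [large, small]))
```

The second call pays for both tiers, which is the trade-off of escalation: easy prompts get cheaper while hard prompts cost slightly more than going straight to the large model.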

06-30

How To Make Codebases AI Agents Love

This article argues that codebase structure matters more than prompts or AGENTS.md files for AI agent output quality. The core idea is applying 'deep modules' from A Philosophy of Software Design: each module exposes a simple interface controlling lots of implementation. The author introduces 'grey box modules'—developers own and test the interface, AI owns the implementation inside. This improves feedback loops (tests are feedback), navigability (filesystem mirrors mental model), and reduces cognitive load (developers only track 7-8 module boundaries). The article notes TypeScript's difficulty enforcing boundaries and recommends the Effect library. For engineers optimizing AI coding workflows.

www.aihero.dev · 5 min · Agent Architecture · AI Engineering · Code

06-29

how to be good at research

A thread by @itsreallyvivek arguing that research skill is a stack of trainable sub-skills, not a gift. Core moves: pick problems you genuinely want to exist (Schulman), upgrade inputs by reading old papers and skipping summaries, write everything down to expose hidden gaps (Graham, Feynman, Darwin), tighten the experimental loop with scripted tooling (Karpathy), stare directly at failure cases instead of loss curves (Andrew Ng), deliberately wander across subfields to find your unfair advantage, and cultivate collaborators who will tell you an idea is bad. The post synthesizes concrete tactics from Hamming, Sutton, Shannon, and others, emphasizing falsifiable forecasts, reproducible tooling, and reading raw data over third-hand threads. Actionable for research engineers and PhD students tired of surface imitation.

x.com · 10 min · AI Engineering · Career Advice · Experiment Design

06-29

Context Engineering for AI Agents: The Complete Playbook

This article systematically explains why context engineering is the most critical skill for building reliable AI agents. It argues that agent degradation usually stems from poor context window management rather than model limitations. The context window is likened to RAM, and as tool outputs, retrieval results, and conversation history accumulate, attention thins and the “Lost in the Middle” effect kicks in. Four core strategies are presented: Write (persist information outside context), Select (just-in-time retrieval), Compress (proactively reduce tokens), and Isolate (separate contexts for different jobs). The article details four failure modes—poisoning, distraction, confusion, and clash—and offers concrete evidence: Chroma benchmarks show continuous performance decline well before token limits, RAG‑MCP improved tool selection accuracy from 14% to 43% while halving token usage, and KV‑cache hit rates can yield a 10× cost reduction. A real-world workflow that shipped ~35,000 lines of Rust code in 7 hours using frequent intentional compaction is presented. The target audience is engineers building production‑grade agents.

x.com · 21 min · Agents · AI Engineering · Context Engineering

06-28

The 5 Levels of Loop Design: From Prompting to Autonomous Agents

The creator of Claude Code says he no longer writes prompts—loops prompt it instead. This post introduces a 5-level progression of human-AI workflow: from Level 1 (single-turn prompting), through Level 2 (manual loop of do-check-correct), Level 3 (verified loop with separate judges for 'done'), Level 4 (self-running loop using /goal command with guardrails), to Level 5 (autonomous systems where loops self-start, run in parallel, and persist lessons into a skill base). Each level comes with a tell and a concrete next step. For developers who still feel they are 'babysitting' their AI agents.

x.com · 7 min · Agent Architecture · Agent Engineering · Agents

06-27

Stop Being the Loop: How to Make Claude Work While You Sleep

Boris Cherny, who built Claude Code at Anthropic, no longer writes prompts by hand—he writes loops. This guide explains what a real loop is: a small system that runs Claude repeatedly until a job is done, complete with self-checking, state persistence, and automatic stopping. Unlike cron jobs, loops contain a decision-maker (Claude) that can adapt mid-stream. The article covers Claude Code's /goal (loop until done) and /loop (repeat on a schedule) commands, and provides a paste-ready charter template with sections for goals, work sources, work instructions, self-verification, memory, and stop conditions. Ideal for engineers transitioning from prompting to building persistent, autonomous AI workflows.

x.com · 9 min · Agent Engineering · Agent Orchestration · AI Engineering

06-27

Agentic Code Review

When coding agents produce thousands of lines of often solid code in minutes, the engineering bottleneck shifts from writing to trusting, making review the most leveraged skill in software. Multi-source 2026 data (Faros AI, CodeRabbit, GitClear, GitHub) shows that AI users generate ~4x raw output but only ~12% more delivered value, with code churn up 861%, defect rate up from 9% to 54%, review duration up 441.5%, and zero-review merges up 31.3%. The article argues the fix is not to stop using AI but to tier review effort by blast radius: light for solo no-user projects, heavy for large enterprises. Specific advice: triage PRs upfront, require evidence before review, watch test rewrites, run two differently-structured AI reviewers in parallel, and upgrade humans from line-level review to spot-checking and auditing. The durable skill is understanding a system well enough to stand behind it.

addyosmani.com · 29 min · Agent Engineering · AI Engineering · Code Review

06-27

The New Software Lifecycle: From Writing Code to Judging It

Key insights from a Google whitepaper on how AI transforms the software lifecycle. The core thesis: an agent is 10% model and 90% harness (instructions, tools, sandboxes, orchestration, observability). Context engineering is the primary cost lever, with a critical distinction between static context (loaded every turn, reliable but expensive) and dynamic context (loaded on demand, cost-efficient but needs careful design). Verification determines whether you're vibe coding or doing agentic engineering: tests for deterministic parts, evals for non-deterministic output and trajectory. Real data: one team moved a coding agent from outside top 30 to top 5 on Terminal Bench 2.0 by changing only the harness with the same model; LangChain added 13.7 points on the same benchmark by changing system prompt, tools, and middleware around a fixed model. Implementation collapses from weeks to hours, while specification and verification become the new bottlenecks. For engineers and tech leads adopting AI agents in production workflows.

addyosmani.com · 15 min · Agent Architecture · AI Engineering · Context Engineering

06-26

Loop engineering: the 14-step roadmap from prompter to loop designer

This post from @0xCodez on X provides a comprehensive 14-step roadmap for transitioning from manual prompting to designing autonomous loop systems in AI-assisted coding. Based on Anthropic engineering docs, Addy Osmani's essay, and recent studies, it's structured in three tiers: first, a 4-condition test to decide if a loop is warranted; second, five building blocks (automations, worktrees, skills, connectors via MCP, sub-agents with maker-checker split); third, building the minimal viable loop and avoiding failure modes like the 'Ralph Wiggum loop', comprehension debt, and security tax. The author emphasizes that loops are not universal—they only earn their cost when tasks repeat, verification is automated, the token budget can absorb waste, and the agent has senior engineer tools. Ideal for engineers already using coding agents who want to orchestrate them into batched, automated workflows.

x.com · 23 min · Agents · AI Engineering · AI Tooling

06-25

A Comfortable AX for Agent Search

Raft CTO Tenny argues that returning raw IDs or full content to an agent doing a search is bad design. The correct approach mirrors web search results: return a highlighted snippet, context preview, and one explicit next action (e.g., 'read surrounding context'). Every token in the agent's context window has a cost, so results must be compact, immediately scannable, and paired with an actionable next step. This is UX design extended—but the user is now an agent reading tokens, not a person looking at a screen.
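The result shape described above can be sketched roughly as follows. This is a minimal illustration, not Raft's implementation: the `SearchHit` type, the `search` function, and the `read_context` action string are all hypothetical names chosen for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class SearchHit:
    doc_id: str
    snippet: str       # short preview with the match highlighted
    next_action: str   # one explicit follow-up the agent can take


def search(docs: dict[str, str], query: str, window: int = 40) -> list[SearchHit]:
    """Return compact hits instead of raw IDs or full document bodies."""
    hits = []
    for doc_id, text in docs.items():
        match = re.search(re.escape(query), text, re.IGNORECASE)
        if match is None:
            continue
        # Keep only `window` characters of context on each side of the match.
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        snippet = (
            text[start:match.start()]
            + "**" + match.group(0) + "**"
            + text[match.end():end]
        )
        # Hypothetical tool call the agent could issue next.
        action = f'read_context(doc_id="{doc_id}", offset={match.start()})'
        hits.append(SearchHit(doc_id, snippet, action))
    return hits
```

The `window` parameter bounds how many tokens each hit can cost, which is the trade-off the article is concerned with.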

raft.build · 11 min · Agent Tool Design · AI Engineering · Context Engineering

06-24

Loop Engineering: The AI skill every builder needs in 2026

This community-authored article introduces 'Loop Engineering,' arguing that the most effective AI builders are shifting from one-shot prompting to designing automated feedback loops for AI agents. Rather than crafting a perfect prompt, engineers should build systems that discover, plan, execute, verify, and iterate until a verified outcome is reached. It covers six building blocks (automations, worktrees, skills, plugins/connectors, subagents, memory), two loop scales (single-agent vs. fleet), and two types (open vs. closed), while frankly addressing the critical hidden cost of tokens. A practical primer for engineering teams turning AI agents from experiments into production workflows.

x.com · 12 min · Agent Architecture · Agents · AI Engineering

06-23

30 Core Agentic Engineering Concepts Every Developer Should Know

This article distills 20 foundational concepts in agentic engineering, covering building blocks (Agent loop, Think-Act-Observe, state, patterns), configuration (config files, workflow files, prompt caching, context rot), capability (MCP, live document retrieval, persistent memory), orchestration (subagents, agent loops), guardrails (sandboxing, permissions, hooks, prompt injection defense, pre-commit gates), and observability (tracing, metrics). The author argues that frameworks change but these underlying ideas persist; understanding them makes any new tool familiar. Includes concrete config examples and practical advice (e.g., keep config files under 100 lines, distinguish proxy metrics from outcome metrics).

x.com · 24 min · Agent Architecture · Agents · AI Engineering

06-22

GLM-5.2: Built for Long-Horizon Tasks

Zhipu AI introduces GLM-5.2, a flagship model for long-horizon tasks with a solid 1M-token context and an MIT license. Architecture innovations include IndexShare, which reuses the sparse attention indexer across four layers to cut per-token FLOPs by 2.9× at 1M context, and an improved MTP layer that raises speculative decoding acceptance length by 20% through IndexShare, KV sharing, rejection sampling, and end-to-end TV loss. Agentic RL post-training is backed by the slime framework, and an anti-hack module detects and blocks reward-hacking behaviors like fetching evaluation files or curl-downloading answers. GLM-5.2 ranks as the top open-source model on long-horizon benchmarks such as FrontierSWE (only 1% behind Opus 4.8) and Terminal-Bench 2.1 (81.0), making it relevant for engineers building coding agents and long-context inference systems.

z.ai · 21 min · Agent Architecture · AI · AI Engineering

06-21

Ponytail: Lazy Senior Dev Inside Your AI Agent, Cuts Code Bloat by ~54%

Ponytail is a rule plugin for 14+ AI coding agents (Claude Code, Codex, Copilot CLI, etc.) that injects a lazy-senior-dev mindset. Before generating code, it forces the agent to climb a ladder: does this need to exist? Can the standard library or native platform feature do it? Can it be one line? Only then writes the minimum viable solution. Benchmarked on real Claude Code sessions editing a real FastAPI + React repository across 12 feature tickets, it cuts lines of code by 54% (mean), tokens by 22%, cost by 20%, and time by 27% while keeping 100% safety on validation, error handling, security, and accessibility. Ideal for developers tired of AI bloat and over-engineering.

github.com · 12 min · Agents · AI Engineering · Code Generation

06-18

Factory 2.0: From coding agents to software factories

Factory announces its 2.0 release, repositioning from individual AI coding agents to an end-to-end 'software factory'. The post argues that improving individual productivity is no longer enough; enterprises need an interconnected, agent-native system that forms a continuous feedback loop from signals (bug reports, customer feedback) through planning, building, testing, reviewing, securing, shipping, and monitoring. Key design principles include model independence (allowing deliberate model choice or automatic routing per task), sovereign intelligence (data plane and control plane options from cloud to fully air-gapped, with all agent sessions and reviews feeding back into the system), and continual learning and self-improvement across the lifecycle. The article lists customers such as NVIDIA, EY, Adobe, and Palo Alto Networks already running software factories in production. Autonomy is described as a gradual maturation process, using simple Droids, skills, automations, Droid Computers, and multi-agent Missions for different levels of human guidance and agent readiness. The piece is a product launch announcement with some technical concepts, targeting engineers and managers interested in enterprise AI engineering and agent orchestration.

x.com · 5 min · Agent Architecture · Agents · AI Engineering

06-17

Agentic coding and persistent returns to expertise

Anthropic analyzed ~400,000 Claude Code sessions, finding that users make most planning decisions while Claude handles execution. Domain expertise, not coding background, is the key to success: expert-rated sessions achieve verified success over twice as often as novice-rated ones, though intermediate users capture most of the benefit. Non-software occupations succeed at coding tasks within 5 points of software engineers. Over seven months, the share of debugging sessions fell from 33% to 19%, while end-to-end tasks like deployment, data analysis, and document writing grew, and estimated task value rose ~25%. The report details methodology for decision attribution, expertise classification, and success verification, along with limitations. Suitable for engineers and researchers interested in AI coding tools, agent collaboration, and skill transfer.

www.anthropic.com · 27 min · Agents · AI Engineering · Claude Code

06-15

Claude Official Cookbooks: Engineering Recipes from RAG to Multimodal Agents

Anthropic's official collection of practical coding recipes for building with Claude. It provides runnable Jupyter notebooks covering capabilities like classification, summarization, and RAG, alongside advanced techniques such as tool use, multimodal vision, and sub-agent orchestration. The latest additions introduce the Claude Agent SDK and Managed Agents, demonstrating how to build observable, hostable agents—from research assistants to SRE bots—with just a few lines of code.

github.com · 8 min · Agents · AI Engineering · Anthropic

06-15

Decomposing the agent harness into swappable workers: the iii engine architecture

Mike Piccolo argues that monolithic agent frameworks force a tradeoff by bundling the loop, tools, memory, and orchestration into one block, which long-running teams inevitably rewrite. He walks through the iii engine's production worker stack, where all thirteen harness responsibilities—credential resolution, policy checks, turn FSM, session persistence, budget tracking, etc.—are decomposed into 11 independently replaceable workers. Each worker connects to the engine via WebSocket and registers functions and triggers using a single primitive (iii.trigger()), making the harness a composable set of installable workers. The post provides a step-by-step trace of a turn through provisioning, streaming, policy-gated tool dispatch, and reactive approval wake-ups, alongside concrete examples of swapping the model catalog, adding a provider, or integrating a Slack approval surface. The core bet: an agent harness should be a slider of composable workers rather than a framework you fork. This is for backend engineers building or scaling custom agent infrastructure who are hitting the composability limits of existing frameworks.

x.com · 20 min · Agent Architecture · AI Engineering · Observability

06-15

A frontier without an ecosystem is not stable

Satya Nadella argues that the future of the firm in an AI-driven economy relies on creating a compounding learning loop that integrates human capital and AI 'token capital.' He emphasizes that organizations must build agentic systems that own their institutional knowledge and private RL environments, ensuring they can swap underlying models without losing proprietary expertise. Warning against a future where a few models commoditize all value, he advocates for building a 'frontier ecosystem' that enables broad value distribution across every industry, rather than solely chasing a frontier model. This piece targets executives and senior technologists strategizing AI adoption.

x.com · 5 min · AI Engineering · AI Industry · Context Engineering

06-14

Anthropic's Analytics Agent Stack: Tackling Entity Ambiguity, Staleness, and Retrieval Failure

Anthropic’s data team shares how they use Claude to automate 95% of business analytics queries at roughly 95% accuracy. They identify three core failure modes—concept‑entity ambiguity, data staleness, and retrieval failure—and describe a four‑layer agentic stack to address them: data foundations (canonical datasets, rigorous governance), sources of truth (semantic layer, lineage, business knowledge graph), skills (knowledge and procedural skills, which lifted accuracy from ~21% to >95%), and validation (offline evals, adversarial review, online monitoring). The post includes concrete practices such as colocating docs with code, treating metadata as a first‑class product, and an appendix with a skill file skeleton. It is aimed at data engineers and analysts building LLM‑powered self‑service analytics.

claude.com · 32 min · Agents · AI Engineering · Analytics

06-14

Building cloud agent infrastructure: what's different, and what we learned

A hands-on report from CREAO detailing the architectural challenges of moving AI agents from a single-user desktop to a multi-tenant cloud sandbox. It presents two hard-won lessons. First, decouple slowly-changing user environments from fast-changing platform code by freezing user sandboxes into snapshots and hot-swapping the runner library in ~300ms via an atomic sequence involving chattr, V8 compile cache purging, and post-run re-snapshotting. Second, enforce strict credential isolation by ensuring no long-lived secrets ever enter the sandbox; a host-side API bridge verifies sandbox calls using a dual check of IP allowlisting and short-lived, per-run JWTs, so a compromised agent yields only an expiring, network-pinned token. Concrete commands, validation steps, and design rationale included. Recommended for backend and infrastructure engineers productizing agents in shared environments.
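The dual check in the second lesson can be sketched as below. This is an illustration under stated assumptions, not CREAO's code: it uses a stdlib HMAC-signed token as a stand-in for a real JWT, and `SECRET`, `ALLOWED_IPS`, `issue_token`, and `verify_call` are hypothetical names.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"host-only-signing-key"  # hypothetical; stays on the host, never in the sandbox
ALLOWED_IPS = {"10.0.0.5"}         # hypothetical sandbox addresses


def issue_token(run_id, sandbox_ip, ttl_seconds=300, now=None):
    """Mint a short-lived token pinned to one run and one sandbox IP."""
    now = time.time() if now is None else now
    payload = json.dumps(
        {"run": run_id, "ip": sandbox_ip, "exp": now + ttl_seconds}
    ).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(sig).decode()
    )


def verify_call(token, caller_ip, now=None):
    """Host-side bridge check: IP allowlist first, then the per-run token."""
    now = time.time() if now is None else now
    if caller_ip not in ALLOWED_IPS:
        return False
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return False
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(payload)  # safe to parse: the signature matched
    return claims["ip"] == caller_ip and claims["exp"] > now
```

Because the signing key never leaves the host, a compromised sandbox holds only a token that expires and is useless from any other address.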

x.com · 10 min · Agents · AI Engineering · Infra

06-13

Build a Self-Improving Agent System with Claude Fable 5 in 14 Steps

A practical guide based on Anthropic engineering posts and experiments detailing how to build a self-improving agent system around Claude Fable 5. It argues most users underutilize the Mythos-class model, treating it like a bigger Sonnet 4.6. The architecture layers primitives (model, sub-agents, worktrees), orchestration (/goal and Outcomes loops, Dynamic Workflows, Routines for cloud execution), memory (state files and compounding Skills), and self-improvement (vision self-checks, eval loops, rule distillation). Key tactics include using an independent verifier sub-agent instead of self-critique, ensuring parallel safety with git worktrees, running multi-day tasks on cloud infrastructure, and following a 5-stage memory progression from failure documentation to general rule consultation. Designed for engineers building compound systems rather than prompting for minutes.

x.com · 28 min · Agent Architecture · AI Engineering · Claude Code

06-13

Claude Fable 5 and Claude Mythos 5

Anthropic launched Claude Fable 5, its most capable publicly released model, rated Mythos-class. It achieves state-of-the-art on nearly all benchmarks, especially on long, complex tasks in software engineering, knowledge work, and vision. To mitigate misuse risks in cybersecurity and biology, Fable 5 ships with conservative safety classifiers that fall back to Opus 4.8 for sensitive queries, triggering in under 5% of sessions. Claude Mythos 5, the same model with safeguards lifted, is available to cyberdefenders and certain biologists. New 30-day business data retention is mandated for Mythos-class models. The post includes results from Stripe, Cognition, and internal protein design and genomics experiments. Pricing is $10/M input tokens and $50/M output tokens.
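As a quick worked example of the quoted pricing, the sketch below estimates the cost of a request from its token counts. It assumes flat per-token rates only; caching, batching, or any other discounts are not modeled, and `estimate_cost_usd` is a hypothetical helper name.

```python
INPUT_USD_PER_MTOK = 10.0   # $10 per million input tokens, as quoted above
OUTPUT_USD_PER_MTOK = 50.0  # $50 per million output tokens, as quoted above


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Flat-rate estimate; ignores caching and any other pricing adjustments."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# 2M input tokens cost $20 and 0.5M output tokens cost $25.
print(estimate_cost_usd(2_000_000, 500_000))  # 45.0
```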

www.anthropic.com · 26 min · AI Engineering · AI Industry · Anthropic

06-12

How an Anthropic seller rebuilt his team's workflows with Claude Code

Jared Sires, a former account executive at Anthropic with no coding experience, used Claude Code to build CLAFTS, a Gmail-integrated tool that drafts customer emails in his voice while pulling context from live product documentation. The tool saves 10-15 hours per week. He expanded this into a sales plugin with skills for daily briefs, recaps, and pipeline management, wired into Salesforce, Gong, and other systems via MCP servers. About 80% of Anthropic's sales org now uses the plugin. The piece illustrates how non-technical practitioners can leverage AI coding tools to eliminate technical barriers and deliver workflow-specific software.

claude.com · 9 min · Agent Architecture · AI Engineering · Claude Code

06-12

25 Claude Features, Workflows, and Tricks That Most Users Don't Know

A practical guide by @eng_khairallah1 detailing 25 workflows and techniques to fully leverage Claude Projects. The core thesis is treating Projects as evolving, persistent workspaces rather than transient chat sessions. It provides actionable strategies including a structured instruction template, strategic file organization, the Living Instructions pattern, and advanced concepts like voice calibration files and competitive intelligence hubs. The guide emphasizes a compounding knowledge strategy where each interaction refines Claude's contextual understanding, suitable for power users aiming to transform Claude from a generic tool into a domain-specific specialist.

x.com · 16 min · Agent Architecture · AI Engineering · Anthropic

06-12

How To Build AI Agents in 2026 (That Actually Work)

This article systematically deconstructs the architecture and engineering practices for building practical AI agents. It clarifies the boundaries between chatbots, AI agents, and agentic AI, emphasizing that a real agent is a system that persistently loops toward a goal rather than delivering a one-shot answer. The author explains the ReAct loop (Reasoning + Acting) and breaks down the five building blocks: the LLM as the brain, tools as hands, short-term and long-term memory, self-correcting loops, and verification. Using a case study of a startup research agent for the fitness niche, the article walks through goal setting, tool integration, loop construction, memory implementation, and the addition of a critic agent, complete with copy-paste system prompts. It highlights six common failure modes and recommends a 2026 tech stack including Claude Code, LangGraph, and MCP. The piece provides a weekend roadmap to build a basic agent from a 50-line Python script and is aimed at developers shifting from prompt engineering to designing agent systems.

x.com · 21 min · Agent Architecture · AI Agents · AI Engineering

06-11

Designing loops with Fable 5: self-correction and cross-session memory

R. Lance Martin demonstrates two loop patterns for Anthropic's Fable 5: self-correction and cross-session memory. On the Parameter Golf challenge (train a model under 16MB and 10 minutes on 8xH100s), Fable 5 with Claude Managed Agents (CMA) and a verifier sub-agent improved the pipeline roughly 6x more than Opus 4.7, favoring structural changes over scalar tuning. On a continual learning SQL benchmark, Fable 5 progressed through fail-investigate-verify-distill into general rules, reaching 73% verification coverage, while Opus 4.7 and Sonnet 4.6 stalled at sparse notes or uncertain schemas. The key takeaway: design loops and environment feedback so the model can hillclimb, rather than relying on direct prompting.

x.com · 5 min · Agent Architecture · Agents · AI Engineering

06-11

The Missing Link Between Agents and Applications

This article introduces Headless Tools, a mechanism that allows agents to act directly on client-side runtimes such as browsers and desktop applications. The author argues that most current agent tools are server-side, limiting them to API calls while blocking access to browser state, device APIs, and in-app actions. Headless Tools wrap client-side capabilities like geolocation, clipboard, IndexedDB, and application-specific commands as standard tools invocable by the model. The model sees only a tool schema, while the server and client coordinate execution behind the scenes. Code examples in TypeScript demonstrate the pattern, alongside real-world use in a Slidev presentation plugin and browser-local agent memory. Privacy is improved because sensitive data can remain on-device. This is valuable for teams embedding AI agents into rich frontend contexts such as design tools, document editors, and desktop utilities.

x.com · 7 min · AI Agents · AI Engineering · Browser

06-11

Training an LLM to Generate Reliable Structured Output Using GRPO and a Reward Function

A hands-on report on replacing labeled data with a code-defined reward function to train structured output. The author fine-tunes Qwen3-8B for JSON invoice extraction using GRPO. Supervised fine-tuning stalls because its token-level loss only optimizes for surface similarity, not structural validity. The fix: a reward function that scores completions 0.0 (invalid JSON), 0.5 (valid JSON but wrong schema), or 1.0 (fully compliant), providing a learning gradient. Training on Fireworks H200s raised schema-valid output from a baseline of 62% to 82% on held-out prompts, exceeding GPT-4.1's 58%, with lower cost and latency. The approach transfers to any task where correctness is verifiable in code, such as SQL, API calls, or tool use. Full reward function, dataset, and training config are provided.
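The three-level scoring described above can be sketched as a small function. This is an illustration, not the author's reward function: the invoice fields in `REQUIRED_FIELDS` are hypothetical, and the real schema is in the linked post.

```python
import json

# Hypothetical invoice schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "invoice_number": str,
    "vendor": str,
    "total": (int, float),
    "line_items": list,
}


def reward(completion: str) -> float:
    """0.0 for invalid JSON, 0.5 for valid JSON with the wrong schema, 1.0 if compliant."""
    try:
        data = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(data, dict):
        return 0.5
    for name, expected in REQUIRED_FIELDS.items():
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return 0.5
    return 1.0
```

The middle score is what gives the policy a gradient: a completion that parses but misses the schema is still rewarded over one that does not parse at all.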

x.com · 12 min · AI Engineering · Fine-tuning · GRPO

06-10

Designing loops with Fable 5: self-correction and memory in agentic workflows

The author shares two practical directions for improving agentic workflows with Anthropic's Claude Fable 5 model: self-correction loops and cross-session memory. In a Parameter Golf challenge—train the best model within a 16MB artifact in under 10 minutes on 8×H100 GPUs—Fable 5 improved the training pipeline roughly 6× more than Opus 4.7 when using Claude Managed Agents with Outcomes judged by an independent verifier sub-agent against nine checkable criteria. Fable 5 bet on larger structural changes and pushed through a quantization regression, while Opus 4.7 stuck to tuning scalar hyperparameters. For memory, the author used a SQL-based task from Continual Learning Bench 1.0 with filesystem-backed memory across agent sessions. Sonnet 4.6 only logged failures and guesses; Opus 4.7 built flagged schema references but verified only 17% of questions; Fable 5 reached 73% verification coverage in the best run and distilled learnings into general rules. Engineers interested in agent architecture and model capability boundaries will find the experiments relevant.

x.com · 5 min · Agent Architecture · AI Agents · AI Engineering

06-09

Loop Engineering: Designing the System That Prompts Your Coding Agents

Addy Osmani argues that interacting with coding agents is shifting from prompt engineering to 'loop engineering'—designing a system that autonomously discovers tasks, delegates work, and verifies results using five building blocks: scheduled automations, parallel worktrees, project skills, connector plugins, and checker sub-agents. He maps how Claude Code and Codex both implement all five, noting that the leverage point has moved from writing good prompts to architecting persistent loops. The post cautions that loops amplify existing problems: verification, comprehension debt, and cognitive surrender become sharper risks. Intended for senior engineers evaluating how to productize AI coding tools beyond one-shot interactions.

x.com · 14 min · Agent Architecture · AI Agents · AI Engineering

06-09

How to Design a Loop That Prompts Your Agent

This article presents a loop architecture that enables an AI agent to autonomously complete multi-step tasks by building an automated prompting system instead of manually crafting each prompt. It breaks down the loop into five parts: defining a 'done' check, building prompts from dynamic state rather than hand-fed instructions, executing actions while capturing all outputs, feeding failures back as the next prompt, and setting hard stop conditions like max turns and cost limits. A walkthrough of fixing a login bug shows the loop in action, emphasizing that real costs come from repeated turns, making guardrails critical. Encapsulating repeated operations into reusable skills is highlighted as the key to long-term value, and common mistakes—like lacking an exit condition or discarding error output—are pointed out. Suitable for developers shifting from one-shot prompts to designing agent control flows.
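The five parts listed above can be sketched as one small control loop. This is a minimal illustration under assumptions, not the article's code: `agent` and `is_done` are caller-supplied callables, and `LoopState`, `build_prompt`, `run_loop`, and the flat per-turn cost are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LoopState:
    goal: str
    failures: list = field(default_factory=list)  # captured output of each failed turn
    turns: int = 0
    cost_cents: int = 0


def build_prompt(state: LoopState) -> str:
    """Build the prompt from current state, not from hand-fed instructions."""
    last_failure = state.failures[-1] if state.failures else "none yet"
    return f"Goal: {state.goal}\nLast failure output:\n{last_failure}"


def run_loop(goal, agent, is_done, max_turns=5, max_cost_cents=100, cost_per_turn_cents=10):
    state = LoopState(goal)
    # Hard stop conditions: a turn limit and a cost limit.
    while (
        state.turns < max_turns
        and state.cost_cents + cost_per_turn_cents <= max_cost_cents
    ):
        output = agent(build_prompt(state))  # execute and capture all output
        state.turns += 1
        state.cost_cents += cost_per_turn_cents
        if is_done(output):                  # the explicit 'done' check
            return "done", state
        state.failures.append(output)        # the failure becomes the next prompt
    return "stopped", state
```

Returning the state alongside the status keeps the error history available after a stop, so failed output is not discarded.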

x.com · 18 min · Agent Architecture · Agents · AI Engineering

06-08

Composable Agent Skills for Real Engineering Workflows

Matt Pocock's personal agent skills for Claude Code and Codex, targeting four common failure modes in AI-assisted development: misalignment, verbosity, broken code, and design entropy. Instead of controlling the process, these small, composable skills embed engineering fundamentals: grilling sessions for alignment, a shared ubiquitous language for concision, TDD red-green-refactor loops for code quality, and architecture rescue tools. They work with any model and are designed to be hacked and adapted in your own .claude directory.

github.com · 14 min · Agents · AI Engineering · Claude Code

06-08

Every Agentic Engineering Hack I Know (June 2026)

The author shares 22 practical hacks for agentic engineering with Claude Code and Codex. The core is a plan-first workflow: use /ce-plan to generate a plan.md that guides the agent; humans skim or ask inline instead of reading it. Hacks include: voice input via Monologue or Wispr Flow (LLMs handle imperfect transcription); running 4-6 separate agent sessions in cmux tabs; defaulting terminal tabs to Claude Code and bypassing all permission prompts with sound alerts on completion; giving Claude an email address via AgentMail to trigger sessions remotely; using last30days before planning to search community discussions and news in parallel; turning repeated tasks into reusable skills to compound agent capabilities. He stresses that human value lies in providing taste and direction, not typing, and warns against AI addiction. The post is packed with copy-paste config snippets and concrete tools, aimed at engineers deep into AI-assisted development.

x.com · 28 min · Agent Infrastructure · Agents · AI Engineering

06-08

How to Master Dynamic Workflows in Claude Code: 6 Patterns and 14 Steps

This article provides a systematic guide to Dynamic Workflows in Claude Code, shipped in late May 2026. It moves beyond manual prompt chaining by letting Claude generate a bespoke JavaScript harness for a specific task. The author first explains the mental model: how workflows structurally fix agentic laziness, self-preferential bias, and goal drift inherent in single-context sessions. It then breaks down six core patterns—classify-and-act, fan-out-and-synthesize, adversarial verification, generate-and-filter, tournament, and loop until done—each with code skeletons. Real-world use cases show how to compose patterns for migrations, deep research, triage, and lightweight evals. Practical controls like /goal, /loop, token budgets, and the quarantine pattern for untrusted inputs are covered. It also advises on saving successful workflows and shipping them as Skills. This guide is for engineers aiming to tackle long-running, parallel, or adversarial tasks beyond a single Claude Code session.

x.com · 17 min · Agents · AI Engineering · Anthropic

06-07

Weekly AI Roundup: Claude Limits Doubled, SpaceX IPO, Microsoft Model Data Contradiction

A roundup of 10 major AI and tech news items from the first week of June 2026. MiniMax M3 was released, beating GPT-5.5 on coding benchmarks at $0.6/M tokens, though independent verification is pending. DeepSeek raised ~$7.4B in its first external funding round, while Unitree completed its IPO review in a record 73 days. Kimi Work, Coze 3.0, and Qwen3.7-Plus all launched new Agent capabilities. Doubao announced subscription plans. ChatGPT surpassed 1 billion monthly active users. Anthropic doubled Claude Cowork's usage limits, secretly filed for an IPO, and published a report stating Claude writes 80% of its own code. NVIDIA unveiled the ARM-based RTX Spark at Computex. SpaceX is set to IPO on June 12, with Google disclosed as paying $920M/month for compute. Microsoft's MAI-Thinking-1 faced backlash after its claimed 'clean data' was revealed to include Common Crawl, and GitHub Copilot's switch to metered billing caused developer bills to spike.

mp.weixin.qq.com · 7 min · AI Engineering · AI Industry · Cost Optimization