Tag · Token-Optimization · Glean

Recent picks

10 picks · chronological

07-16

Cut Claude Code token costs by rendering system prompts & history as images

pxpipe is a local proxy that intercepts Claude Code API requests, rendering bulky text parts like system prompts, tool docs, and old history into compact PNG images. Since image token pricing depends on pixel dimensions rather than text length, the approach cuts input tokens by ~60%, leading to a 59–70% end-to-end cost reduction. It rewrites requests before they leave the machine, preserving prompt caching. By default it works with Claude Fable 5 and GPT-5.6, with dashboard controls for opt-in models. It includes profitability gates and benchmarks showing near parity in coding tasks, though exact-string recall is lossy. The project is aimed at developers using LLM coding agents who want to slash API costs without sacrificing functionality.

github.com · 12 min · Ai Tooling · Anthropic · CLI
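The pricing argument in the pxpipe summary above can be checked with rough arithmetic. The sketch below is not pxpipe's code: it assumes Anthropic's documented rule of thumb of roughly width × height / 750 tokens per image, a heuristic of about 4 characters per text token, and a purely hypothetical rendering density (`chars_per_image`). The `min_saving` threshold stands in for the "profitability gate" idea and is also an assumption.

```python
import math


def image_tokens(width_px: int, height_px: int) -> int:
    """Approximate image cost in tokens: (width * height) / 750, rounded up."""
    return math.ceil(width_px * height_px / 750)


def text_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Heuristic text cost; real tokenizers vary by content and language."""
    return math.ceil(num_chars / chars_per_token)


def should_render_as_image(
    num_chars: int,
    width_px: int,
    height_px: int,
    chars_per_image: int,      # hypothetical: legible characters per rendered page
    min_saving: float = 0.2,   # hypothetical gate: require at least 20% fewer tokens
) -> tuple[bool, float]:
    """Return (render?, fractional saving) for one block of text."""
    pages = math.ceil(num_chars / chars_per_image)
    as_image = pages * image_tokens(width_px, height_px)
    as_text = text_tokens(num_chars)
    saving = 1 - as_image / as_text
    return saving >= min_saving, saving


# A long system prompt is worth rendering; a short message is not.
print(should_render_as_image(40_000, 1000, 1000, chars_per_image=12_000))
print(should_render_as_image(2_000, 1000, 1000, chars_per_image=12_000))
```

Under these assumed numbers the short message costs more as an image than as text, which is why a gate of this kind would leave small blocks untouched.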

07-14

Model and effort in Claude Code: knowing more vs. trying harder

This article by a Claude Code team member explains the mechanism behind the model and effort settings. Model selection swaps frozen weights (knowledge), while effort controls how much work Claude does: how many files it reads, how many tests it runs, and how thoroughly it verifies. Using analogies (specialist vs expert vs generalist) and diagrams, it clarifies when to upgrade the model (not enough knowledge) and when to increase effort (not enough trying). Practical advice: start with default effort, choose larger models for hard problems, and use smaller ones for routine tasks to save cost. Key insight: check context first, then decide whether Claude didn't know or didn't try hard enough.

x.com · 14 min · AI Engineering · Claude Code · LLM

07-14

The 'Caveman Skill' That Claims 65% Token Savings Actually Saves Only 8.5%

This article analyzes the recent trend of 'caveman skills' (such as the Caveman project), which prompt AI coding tools to write in minimal language to save tokens. The author points out that the claimed 65% token savings come from chat scenarios, whereas in agentic programming tasks, tool calls and system prompts dominate token usage. A controlled test by JetBrains (86 tasks, 240 trials) showed that even with forced activation, output token savings were only 8.5%, and in practice the savings are smaller still because the skill activates only conditionally. The article also discusses the cost of brevity: lost information leads to more developer follow-ups and agent rework. The author argues that true cost optimization comes from context management (e.g., prompt caching) and from reducing unnecessary tool calls, not from compressing output.

x.com · 2 min · AI Engineering · Claude Code · Prompt Engineering
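The gap between the headline figure and the measured one in the entry above follows from simple arithmetic: a saving on output tokens only shrinks the bill in proportion to how much of the bill is output. A minimal worked example, where the 20% output share is a made-up illustration and not a number from the article:

```python
def overall_saving(output_cost_share: float, output_saving: float) -> float:
    """Fraction of total cost saved when only output tokens shrink.

    output_cost_share: fraction of total cost attributable to output tokens.
    output_saving:     fractional reduction in output tokens (e.g. 0.085).
    """
    if not 0.0 <= output_cost_share <= 1.0:
        raise ValueError("output_cost_share must be between 0 and 1")
    return output_cost_share * output_saving


# 8.5% is the measured output saving quoted above; 0.2 is a hypothetical share.
print(f"{overall_saving(0.2, 0.085):.2%} of total cost")
```

The smaller the output share in an agentic session, the less any output-compression trick can matter, which is the article's point about tool calls and system prompts dominating usage.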

07-10

Getting started with loops in Claude Code

The Claude Code team defines four loop patterns (turn-based, goal-based, time-based, proactive) with trigger, stop criteria, use cases, and token management tips. Concrete commands like /goal, /loop, /schedule and a SKILL.md example show how to make agents iterate, self-verify, and compose primitives into automated workflows. A practical guide for developers exploring agent engineering.

x.com · 9 min · Agent Engineering · Claude Code · Context Engineering
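To make the "trigger plus stop criteria" idea from the entry above concrete, here is a toy goal-based loop. It is an illustration only, not how `/goal` or `/loop` are implemented: `run_goal_loop`, `step`, `goal_met`, and both limits are hypothetical names chosen for this sketch.

```python
from typing import Callable


def run_goal_loop(
    step: Callable[[int], int],      # does one unit of work, returns tokens spent
    goal_met: Callable[[], bool],    # verification check run after every step
    max_turns: int = 10,             # hard stop on iterations
    token_budget: int = 50_000,      # hard stop on spend
) -> dict:
    """Iterate until the goal is verified or a stop criterion fires."""
    tokens_used = 0
    for turn in range(1, max_turns + 1):
        tokens_used += step(turn)
        if goal_met():
            return {"status": "done", "turns": turn, "tokens": tokens_used}
        if tokens_used >= token_budget:
            return {"status": "budget_exhausted", "turns": turn, "tokens": tokens_used}
    return {"status": "max_turns", "turns": max_turns, "tokens": tokens_used}


# Toy workload: each step "fixes" one failing test and costs 1,000 tokens.
state = {"failing_tests": 3}


def fix_one_test(turn: int) -> int:
    state["failing_tests"] -= 1
    return 1_000


result = run_goal_loop(fix_one_test, lambda: state["failing_tests"] == 0)
print(result)
```

Keeping the verification check separate from the work step is the design choice that matters here: the loop stops on evidence that the goal holds, and the turn and token limits bound the cost when it never does.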

07-04

Switching from Superpowers to mattpocock/skills: Less Token Waste, More Control

The author shares a real-world comparison between Superpowers and mattpocock/skills, explaining why they switched. Superpowers uses hooks to enforce a rigid workflow, which is helpful for novices but often overcomplicates simple tasks and burns excessive tokens. mattpocock/skills takes a 'real engineer' approach, giving control back to the user via explicit commands like /grill-with-docs, /to-prd, /to-issues, and /implement. Key advantages: lower token consumption, built-in debugging (/tdd, /diagnosing-bugs), model handoff (/handoff), and architecture refactoring (/improve-codebase-architecture). The author pairs these skills with Fable 5 and Codex 5.5 models, storing PRDs and issues on GitHub for traceability. A candid take for engineers evaluating agent frameworks and tooling.

justinyan.me · 3 min · Agent Engineering · Claude Code Marketplace · Framework

06-30

Practical Guide to Setting Up a Local Coding Agent Stack with Open-Weight Models

This is a step-by-step tutorial for building a fully local coding agent using open-weight LLMs (primarily Qwen3.6 35B-A3B) served via Ollama and the Qwen-Code harness. The author covers model selection, speed/memory benchmarking with a custom script, a small agent capability evaluation (5 tasks), and a security audit checklist to complete before running any harness. It then compares the same local model across three harnesses: Qwen-Code, Codex (open-source), and Claude Code. Codex achieves the same task success rate with roughly half the token usage of Claude Code. The guide also explains SSH tunneling to run the model on a dedicated machine (e.g., DGX Spark) while using the harness on the main workstation. Targeted at engineers comfortable with the CLI who want a transparent, inspectable, and free alternative to proprietary coding agents.

magazine.sebastianraschka.com · 45 min · Coding Agent · Local LLM · Ollama
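The speed benchmarking mentioned in the entry above can be approximated from the timing fields Ollama returns with a generation response. This is not the author's script: it is a small sketch that assumes the `eval_count` (generated tokens) and `eval_duration` (nanoseconds) fields of Ollama's generate response, and the sample values are invented.

```python
def tokens_per_second(response: dict) -> float:
    """Decode throughput from an Ollama generate response.

    eval_count is the number of generated tokens; eval_duration is the
    generation time in nanoseconds.
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / eval_duration_ns * 1e9


# Invented sample values, shaped like the timing fields of a real response.
sample = {"eval_count": 290, "eval_duration": 4_709_213_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```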

06-24

Loop Engineering: The AI skill every builder needs in 2026

This community-authored article introduces 'Loop Engineering,' arguing that the most effective AI builders are shifting from one-shot prompting to designing automated feedback loops for AI agents. Rather than crafting a perfect prompt, engineers should build systems that discover, plan, execute, verify, and iterate until a verified outcome is reached. It covers six building blocks (automations, worktrees, skills, plugins/connectors, subagents, memory), two loop scales (single-agent vs. fleet), and two types (open vs. closed), while frankly addressing the critical hidden cost of tokens. A practical primer for engineering teams turning AI agents from experiments into production workflows.

x.com · 12 min · Agent Architecture · Agents · AI Engineering

06-22

GLM-5.2: Built for Long-Horizon Tasks

Zhipu AI introduces GLM-5.2, a flagship model for long-horizon tasks with a solid 1M-token context and an MIT license. Architecture innovations include IndexShare, which reuses the sparse attention indexer across four layers to cut per-token FLOPs by 2.9× at 1M context, and an improved MTP layer that raises speculative decoding acceptance length by 20% through IndexShare, KV sharing, rejection sampling, and end-to-end TV loss. Agentic RL post-training is backed by the slime framework, and an anti-hack module detects and blocks reward-hacking behaviors like fetching evaluation files or curl-downloading answers. GLM-5.2 ranks as the top open-source model on long-horizon benchmarks such as FrontierSWE (only 1% behind Opus 4.8) and Terminal-Bench 2.1 (81.0), making it relevant for engineers building coding agents and long-context inference systems.

z.ai · 21 min · Agent Architecture · AI · AI Engineering

06-16

A Local-First Context Compression Layer for AI Agents: Library, Proxy, and MCP in One Stack

Headroom is a local-first context compression layer built specifically for AI coding agents. It slashes token consumption by 60-95% by compressing tool outputs, logs, files, and RAG results before they reach the LLM, all while maintaining answer accuracy. Usable as a Python/TypeScript library, a transparent proxy, a CLI wrapper for popular agents, or an MCP server, it fits into existing workflows without friction. Internally, it combines JSON structure-aware compression, AST-based code minification, and a custom fine-tuned model, grounded by a novel CCR reversible compression system that guarantees original data is never lost. This tool is ideal for engineers who rely heavily on coding agents and want to cut API costs without altering their current toolchain.

github.com · 18 min · Agents · Ast-Minification · Context Engineering

06-16

The Context Compression Layer for AI Agents: 60–95% Fewer Tokens, Zero Accuracy Loss

Headroom is a local-first context compression layer for AI agents that slashes token usage from tool outputs, logs, files, and RAG chunks by 60–95% before they reach the LLM, with preserved accuracy. It offers library, proxy, MCP server, and agent wrapper modes, using a content router to select the best compressor for JSON, code, or prose. Reversible compression ensures originals are retrievable on demand. With cross-agent memory and `headroom learn` for mining failed sessions, it is ideal for engineers running coding agents daily and anyone seeking to slash LLM costs without changing their workflow.

github.com · 18 min · Agent Architecture · Ai-Memory · Context Engineering
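The two Headroom entries above describe a content router that picks a compressor per content type, plus reversible compression that keeps originals retrievable. The toy below illustrates that general pattern only. It does not use or reproduce Headroom's API: every name (`detect_kind`, `compress`, `restore`, the in-memory `_originals` store) and every heuristic is a hypothetical stand-in.

```python
import hashlib
import json

# Hypothetical in-memory store so any compressed block can be restored later.
_originals: dict[str, str] = {}


def detect_kind(text: str) -> str:
    """Very rough content detection: JSON, code, or prose."""
    try:
        if isinstance(json.loads(text), (dict, list)):
            return "json"
    except ValueError:
        pass
    code_starts = ("def ", "class ", "import ", "from ")
    if any(line.lstrip().startswith(code_starts) for line in text.splitlines()):
        return "code"
    return "prose"


def compress(text: str) -> tuple[str, str, str]:
    """Route text to a compressor; return (restore key, kind, compact form)."""
    kind = detect_kind(text)
    if kind == "json":
        # Drop insignificant whitespace while keeping the data identical.
        compact = json.dumps(json.loads(text), separators=(",", ":"))
    elif kind == "code":
        # Drop blank lines and comment-only lines.
        kept = [
            line for line in text.splitlines()
            if line.strip() and not line.strip().startswith("#")
        ]
        compact = "\n".join(kept)
    else:
        # Collapse runs of whitespace in prose.
        compact = " ".join(text.split())
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    _originals[key] = text
    return key, kind, compact


def restore(key: str) -> str:
    """Return the exact original for a previously compressed block."""
    return _originals[key]


raw = '{\n  "status": "ok",\n  "items": [1, 2, 3]\n}'
key, kind, compact = compress(raw)
print(kind, compact)
print(restore(key) == raw)
```

A real system would use far stronger compressors than these whitespace tricks; the part worth noticing is that the compact form is what reaches the model while the original stays one lookup away.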