Superpowers: How to Make an AI Agent Run All Night and Deliver Usable Results
The author shares their journey from a failed attempt at orchestrating long-running AI agent tasks to discovering the Superpowers Skill Set, which solves the core pain points. Superpowers decomposes the development workflow into three phases: brainstorming, writing-plans, and executing-plans (with subagent-driven-development). Key design elements include: using separate prompt templates (implementer, spec-reviewer, code-quality-reviewer) to enforce separation of concerns; spinning up a fresh subagent for each task to avoid cascading context pollution; using hard constraints like "Never/HARD-GATE" to prevent agent deviation; and enforcing software engineering best practices such as TDD, DRY, and YAGNI. The author argues that with frontier models like Opus 4.8 and Codex GPT-5.5 now being sufficiently capable, the real bottleneck is harness design—using clear specifications and structured processes to make even cheaper models reliable for long-duration tasks.
Earlier this year while building my own Agent, I wondered whether I could use GitHub issues to write PRDs and break down tasks, enabling long-running agent orchestration. The agent architecture I was playing with was similar to OpenClaw, but the LLM core accessed Codex via ACP.
But I was trying to take too big a step. My original motivation was that I couldn't use up my $200 Claude Code and $200 Codex subscriptions, which felt wasteful, so I dreamed of filing a request before bed and waking up to completed work. Having the agent write the PRD and decompose it into subtasks went fairly smoothly, but dispatching subagents to execute them ran into problems I couldn't solve. I eventually moved on to vibe coding other things and never picked it back up.
It turns out that obra/superpowers, which I've been using a lot recently, had already "solved" this problem late last year. It doesn't dispatch subagents the OpenClaw way; instead it ships as Skills that run directly inside CLIs such as Claude Code and Codex, which already provide a subagent environment.
This Skill set is very good at having an Agent carry out one large requirement, but it is a poor fit for small ones, where it only makes things more complicated. Small tasks can be knocked out by hand or with a quick spoken prompt anyway, so the large ones are where to focus.
- Think and confirm upfront to reduce communication and rework costs.
To realize the dream of "giving the AI a task at bedtime and waking up to usable results," a central requirement is to communicate our needs clearly and in enough detail.
Doing this perfectly in one human-written prompt is very difficult. Superpowers solves this with several clever design techniques.
Before any Spec or PRD is written, the brainstorming skill goes back and forth with the human until the details are settled. It even provides a Visual Companion that builds a rough web demo so the human can visually confirm the proposed design. The process follows a few rules: confirm only one question at a time; let the Visual Companion present multiple options; apply YAGNI (keep things as simple as possible and keep only what is truly useful); explore alternative approaches for the user to choose from; confirm incrementally (present a draft design first, then settle the details step by step); and stay flexible (if something goes badly wrong, reopen the discussion and start over). The output of this step is a spec document. Once this conversation with the AI has gone well, the biggest piece of human work is done.
Once the first step is done, writing-plans takes over, working from the spec. The first step turned a sufficiently large requirement into a Spec; the Plan now has to split it into sufficiently small Tasks. Each Task is a sub-project with clear boundaries, and each step within a Task is a 2–5 minute atomic action (write a failing test / run it and confirm it fails / write the minimal implementation / run the tests and confirm they pass / commit the code). The skill enforces DRY, YAGNI, TDD, and frequent commits (I learned a lot of acronyms 😂), and it bans common AI shortcuts such as TODO, "add appropriate error handling," and "similar to Task N." Finally, the finished Plan must be self-reviewed for Spec coverage, leftover placeholders, and consistency. The Plans that come out of this step are high quality.
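The placeholder scan in that self-review step is easy to picture as code. The following is only a minimal sketch of the idea, not how Superpowers implements it (there the check is written as instructions for the agent). The pattern list and the function name `find_placeholders` are my own assumptions, modeled on the banned phrasings mentioned above.

```python
import re

# Hypothetical patterns, modeled on the lazy phrasings the article lists.
PLACEHOLDER_PATTERNS = [
    re.compile(r"\bTODO\b"),
    re.compile(r"add appropriate error handling", re.IGNORECASE),
    re.compile(r"similar to Task \d+", re.IGNORECASE),
]


def find_placeholders(plan_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for plan lines that contain a placeholder."""
    hits = []
    for number, line in enumerate(plan_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in PLACEHOLDER_PATTERNS):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    sample_plan = (
        "Task 1: write a failing test for parse_config\n"
        "Task 2: implement parse_config, add appropriate error handling\n"
        "Task 3: similar to Task 2, but for write_config\n"
    )
    # Lines 2 and 3 are flagged; line 1 is a concrete, executable step.
    for number, line in find_placeholders(sample_plan):
        print(f"line {number}: {line}")
```

A plan that passes this kind of scan still needs the coverage and consistency checks; the scan only catches the cheapest form of hand-waving.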
With the Plans in hand, executing-plans comes next. By default it is the fallback path: a single session runs from start to finish, for environments that don't support subagents. Since both Claude Code and Codex CLI do support subagents, we usually choose subagent-driven-development instead. During execution the agent reads the Plan, reviews it, and only then starts executing. If it gets stuck, it stops and asks rather than guessing.
What really lets an AI run for more than an hour on a single requirement is subagent-driven-development. It creates a fresh Subagent for every Task, and each Subagent's work goes through two stages of review. The Skill ships three prompts, implementer-prompt.md, spec-reviewer-prompt.md, and code-quality-reviewer-prompt.md, used for implementation and for the two reviews. Because each Subagent works only in its own context, the main session's context never overflows. Subagents implement with TDD, and some tasks can even run in parallel. The Skill also lists a set of Red Flags to keep the AI on track: don't start writing code on main/master without permission (a large requirement normally gets its own worktree, which another skill handles and which I won't cover here); don't move on to the next step while review issues are still unfixed; don't make a Subagent read the Plan file, because a Subagent only needs to be handed enough context and reading the Plan file is the main session's job. And so on. It is very much a harness.
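To make the "fresh Subagent per Task" idea concrete, here is a small sketch under heavy assumptions. A subagent is modeled as a plain Python function, and `Task`, `build_brief`, `run_plan`, and `spawn_echo_subagent` are all hypothetical names of mine; the real Skill does this through prompts and the CLI's own subagent mechanism, not through code like this.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    title: str
    steps: tuple[str, ...]


# Assumption: a subagent is just "brief in, report out".
Subagent = Callable[[str], str]


def build_brief(task: Task, shared_notes: str) -> str:
    """Build the only text a subagent sees: one task plus minimal shared context."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(task.steps, start=1))
    return f"Context: {shared_notes}\nTask: {task.title}\nSteps:\n{steps}"


def run_plan(
    tasks: list[Task], shared_notes: str, spawn: Callable[[], Subagent]
) -> list[str]:
    """Run each task in a freshly spawned subagent and collect short reports."""
    reports = []
    for task in tasks:
        subagent = spawn()  # new subagent per task, so no history carries over
        reports.append(subagent(build_brief(task, shared_notes)))
    return reports


def spawn_echo_subagent() -> Subagent:
    """Stand-in subagent that records how many briefs it has seen."""
    seen: list[str] = []  # private history, discarded together with the subagent

    def subagent(brief: str) -> str:
        seen.append(brief)
        return f"done: {brief.splitlines()[1]} (briefs seen: {len(seen)})"

    return subagent


if __name__ == "__main__":
    plan = [
        Task("add login form", ("write a failing test", "write minimal implementation")),
        Task("add logout button", ("write a failing test", "write minimal implementation")),
    ]
    for report in run_plan(plan, "Flask app, pytest", spawn_echo_subagent):
        print(report)
```

The main session is the only place that holds the whole plan. Each subagent receives a brief built from one task, which is the code-shaped version of "don't make the Subagent read the Plan file."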
This gives Superpowers a core software development pipeline:
brainstorming → writing-plans → executing-plans / subagent-driven-development
The output of each step is the input to the next. Subagents do not inherit the main session's context, so each one stays independent and uncontaminated. The skills also explicitly forbid the AI from pausing mid-task just to ask whether it should continue, which is key to sustaining runs of more than an hour (being genuinely stuck is a different case: then it stops and asks). For quality assurance, the work is first checked strictly against the Spec and only then given a Code Quality Review: get it right first, then worry about whether it is good. In my earlier agent orchestration attempt I could get long task durations, but quality fell off a cliff after about 20 minutes, precisely because nothing strictly followed a Spec. This point is critical.
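The "right first, then good" ordering can be sketched as a gate. This is an illustration only: `Review`, `review_in_order`, and the reviewer and fixer callables are hypothetical, and the retry limit is my own assumption rather than anything taken from the Skill.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Review:
    approved: bool
    notes: str = ""


Reviewer = Callable[[str], Review]  # takes the current work, returns a verdict
Fixer = Callable[[str, str], str]   # takes work and review notes, returns revised work


def review_in_order(
    work: str,
    spec_review: Reviewer,
    quality_review: Reviewer,
    fix: Fixer,
    max_rounds: int = 3,
) -> str:
    """Pass spec review first, then quality review; fix and re-review on failure."""
    for reviewer in (spec_review, quality_review):  # the order is the point
        for _ in range(max_rounds):
            verdict = reviewer(work)
            if verdict.approved:
                break
            work = fix(work, verdict.notes)
        else:
            # Ran out of review rounds without an approval.
            raise RuntimeError("review did not pass; escalate to the human")
    return work


if __name__ == "__main__":
    def spec_review(work: str) -> Review:
        return Review("handles empty input" in work, "handles empty input")

    def quality_review(work: str) -> Review:
        return Review("no duplication" in work, "no duplication")

    def fix(work: str, notes: str) -> str:
        return f"{work}; {notes}"

    print(review_in_order("draft", spec_review, quality_review, fix))
```

Note one simplification: after a quality fix this sketch does not re-run the spec review, so a real harness would need to decide whether quality fixes can break spec compliance.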
- What's Next?
The word "Harness" has been trending lately, though nobody can quite define it. I think Superpowers' design is an excellent example of Harness in practice.
@RoCry recently shared a video with me, "Harness Engineering: How to Build Software When Humans Steer, Agents Execute" by Ryan Lopopolo of OpenAI (on YouTube), and said it raises a key question: "What if the large language model is good enough?"
@RoCry's take: if today's LLMs are like a fresh graduate whose basic ability and intelligence are already up to standard, then to get them writing code we only need to hand them concrete coding conventions and methods. Once they have mastered those, almost everything can be delegated to them, and our job is to build a good harness. (That's the gist; I've paraphrased.)
I think that makes a lot of sense. As of June 2026, frontier models are already very good: Opus 4.8/4.7 and Codex GPT-5.5 can handle most work. The trickiest problem we face now is that the AI keeps ignoring instructions. Superpowers' Skills are one form of Harness for that: clear upfront communication produces a Spec and Plan that can be trusted, and those documents become the yardstick for judging the work, so that even cheap models become reliable and produce usable results.
In Superpowers' design, wording such as "Never" and "HARD-GATE" puts strong constraints on the AI. The foreman-style dispatch between the main Agent and its Subagents makes long-running tasks possible. Separating the Implementer and Reviewer roles pulls the final quality toward the Spec. And because it is a Skill Set, it works across multiple CLIs, including Claude Code and Codex.
A while ago I was chatting with a friend who argued that if the Harness is good enough, cheap models can do work that would otherwise call for an expensive LLM. We were talking about agent-based product design at the time, not software engineering. I found that insightful, so I switched the base model of my own agent tool from Gemini to DeepSeek. In my product's Harness, everything that can be implemented with ordinary engineering is implemented that way, and LLM API calls are reserved for the parts that truly need an LLM, such as text understanding and text processing. DeepSeek V4 Pro is now noticeably more capable and also very cheap, and using a cheaper model in a production product simply lowers cost, so why not?
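As a rough illustration of "engineering first, LLM only where needed," here is a sketch in which a deterministic rule handles the easy case and a model is consulted only as a fallback. The date-extraction task, the function `extract_date`, and the `ask_llm` callable are all hypothetical; this is not code from my product, and no real model API is called.

```python
import re
from typing import Callable, Optional

# Deterministic rule: ISO-style dates such as 2026-06-01.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_date(
    text: str, ask_llm: Callable[[str], Optional[str]]
) -> tuple[Optional[str], str]:
    """Return (date, source). Try the regex first; call the model only if it fails."""
    match = ISO_DATE.search(text)
    if match:
        return match.group(0), "rule"
    return ask_llm(text), "llm"


if __name__ == "__main__":
    def fake_llm(text: str) -> Optional[str]:
        # Stand-in for a real model call; a real one would cost tokens and time.
        return None

    print(extract_date("The report is due 2026-06-01.", fake_llm))
    print(extract_date("The report is due next Monday.", fake_llm))
```

The cheaper the deterministic path and the narrower the fallback, the less the choice of model matters, which is the argument for pairing a good harness with an inexpensive model.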
Superpowers is just one Skill set of its kind. A couple of days ago, @Clu recommended mattpocock/skills and /grill-me. I haven't tried them yet but plan to.
Superpowers has its limitations. For simple tasks, if the AI automatically decides to invoke Superpowers, it wastes tokens and time, overcomplicating straightforward problems. So use it carefully — judge whether the task truly fits before handing it over.