5 Design Patterns for Building Production-Grade Long-Running AI Agents

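Source: x.com | Collected 2026-06-03 09:56 | 11 min read
AI summary

Google Cloud shares five design patterns for building AI agents that stay alive for up to seven days: checkpoint-and-resume (persisting progress per batch), delegated approval (zero resource consumption while paused, sub-second resume), memory-layered governance (Memory Bank, Memory Profiles, and Agent Identity, Registry, and Gateway to guard against drift and leakage), ambient processing (event-driven agents, with policies externalized to the Gateway so changes need no redeployment), and fleet orchestration (independently deployed specialist agents whose failures do not cascade). Each pattern comes with ADK code examples and architecture diagrams, and the article discusses production challenges such as memory drift and policy externalization. It is aimed at developers who need to scale agents from chatbots into autonomous workers.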

§ 1

Developers spend weeks perfecting prompt engineering, tool calling, and response latency. None of it matters when your agent needs to stay alive for five days.

The workflows that actually matter in production (processing thousands of insurance claims, running week-long sales sequences, reconciling financial data across systems) don't fit inside a single conversation turn. They take days, not seconds.

The moment you try to build these long-running agents, you realize most agent architectures are stateless. They reconstruct context from databases on every interaction. They lose the reasoning chain, the soft signals, and the confidence gradients that made the agent's previous decisions make sense.

At Cloud Next 26, we announced that Agent Runtime now supports long-running agents that maintain state for up to seven days. In this article, we’ll share five essential agent design patterns for building long-running agents with Agent Runtime.

5 Agent Design Patterns for Long-Running AI Agents

§ 2

The most common failure mode in multi-day workflows is context loss. An agent processes 200 documents over four hours, then hits an error on document 201. Without checkpointing, you restart from scratch.

Long-running agents on Agent Runtime maintain persistent execution state in a secure cloud sandbox. The agent has full access to bash commands and a sandboxed file system, so you can write intermediate results to disk, maintain processing logs, and recover from failures.

Treat your agent like a long-running server process, not a request handler. Build it the same way you would build a data pipeline that processes millions of records: checkpoint progress, handle partial failures, and ensure idempotency.

Here is how you structure a document processing agent that checkpoints after every batch using the Google Agent Development Kit (ADK):

from datetime import datetime

from google.adk import Agent, ToolContext

class DocumentProcessor(Agent):
    """Processes large document sets with checkpoint-and-resume."""

    async def process_batch(self, docs: list, ctx: ToolContext):
        # load_checkpoint, save_checkpoint, classify_and_extract, and
        # compile_final_report are helper methods not shown here.
        checkpoint = self.load_checkpoint()  # Resume from last position
        start_idx = checkpoint.get("last_processed", 0)
        # Restore results saved with the checkpoint so the final report stays complete
        self.results = list(checkpoint.get("partial_results", []))

        for i, doc in enumerate(docs[start_idx:], start=start_idx):
            result = await self.classify_and_extract(doc)
            self.results.append(result)

            # Checkpoint every 50 documents
            if (i + 1) % 50 == 0:
                self.save_checkpoint({
                    "last_processed": i + 1,
                    "partial_results": self.results,
                    "timestamp": datetime.now().isoformat()
                })

        return self.compile_final_report()

Notice the checkpoint granularity. Not after every document (wasteful). Not only at the end (risky). Fifty documents per batch balances durability against overhead. Your specific number depends on how expensive each unit of work is.
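The checkpoint-and-resume idea can also be sketched without any framework. The following is a minimal illustration using only the Python standard library; the function names, the JSON file layout, and the `handle` callback are assumptions made for this example, not ADK or Agent Runtime APIs. It writes each checkpoint to a temporary file and renames it into place, so a crash in the middle of a write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write the checkpoint atomically so a crash never leaves a half-written file."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_name, path)  # replaces the old checkpoint in one step


def load_checkpoint(path: Path) -> dict:
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"last_processed": 0, "partial_results": []}


def process_all(docs: list, handle, path: Path, batch_size: int = 50) -> list:
    """Process docs in order, checkpointing every batch_size items and at the end."""
    state = load_checkpoint(path)
    results = state["partial_results"]
    for i in range(state["last_processed"], len(docs)):
        results.append(handle(docs[i]))
        if (i + 1) % batch_size == 0 or i + 1 == len(docs):
            save_checkpoint(path, {"last_processed": i + 1, "partial_results": results})
    return results
```

Calling `process_all` again after a failure skips everything up to the last saved batch. Work done since that checkpoint is repeated, which is why `handle` should be idempotent.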

§ 3

Every framework advertises human-in-the-loop.

But in practice, most implementations are: serialize state to JSON, send a webhook, hope someone checks it. The problems compound fast. JSON serialization loses implicit reasoning context. Notifications compete with dozens of alerts.

When the human responds hours later, the agent has to deserialize, re-establish context, and hope nothing changed.

Long-running agents handle this differently. When the agent hits an approval gate, it pauses in place. Full execution state stays intact: reasoning chain, working memory, tool call history, pending action.

Here's what that looks like in practice:

The critical detail: hours 8 through 32 are dead time for the agent but active time for the human. The agent consumes zero compute while paused. Sub-second cold starts mean zero latency penalty when it resumes.

Mission Control provides the inbox that makes this manageable at scale. Notifications are categorized into "Needs your input," "Errors," and "Completed." If you're managing twenty long-running agents, you're not hunting through Slack channels to figure out which ones need attention.
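To make "pauses in place" concrete, here is a small single-process analogy using Python's `asyncio`. It is only a sketch: `ApprovalGate`, `claims_agent`, and the claim text are hypothetical names invented for this example, and Agent Runtime's actual pause and resume mechanism is not shown in the article. The point it illustrates is that the agent's working state stays in its suspended coroutine rather than being serialized and rebuilt.

```python
import asyncio
from typing import Optional


class ApprovalGate:
    """Hypothetical gate: the agent awaits it, a human resolves it later."""

    def __init__(self) -> None:
        self._decision: Optional[asyncio.Future] = None
        self.pending_action: Optional[str] = None

    async def request(self, action: str) -> bool:
        self.pending_action = action
        self._decision = asyncio.get_running_loop().create_future()
        approved = await self._decision  # the coroutine is suspended here
        self.pending_action = None
        return approved

    def resolve(self, approved: bool) -> None:
        if self._decision is not None and not self._decision.done():
            self._decision.set_result(approved)


async def claims_agent(gate: ApprovalGate, log: list) -> str:
    # Working state lives in local variables and survives the pause untouched.
    reasoning = ["claim exceeds auto-approve limit", "policy is active"]
    log.append("waiting for approval")
    approved = await gate.request("pay claim C-1")
    log.append(f"resumed with {len(reasoning)} reasoning steps intact")
    return "paid" if approved else "rejected"


async def main() -> tuple:
    gate, log = ApprovalGate(), []
    task = asyncio.create_task(claims_agent(gate, log))
    await asyncio.sleep(0)  # let the agent run until it reaches the gate
    pending = gate.pending_action
    gate.resolve(True)  # the human's decision arrives
    return await task, pending, log
```

Running `asyncio.run(main())` drives the agent to the gate, records the pending action, and then resumes it with the decision. A suspended coroutine does no work while it waits, which loosely mirrors the zero-compute pause described above.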

§ 4

A seven-day agent needs more than session state. It needs to remember things from previous sessions, user preferences from weeks ago, and organizational context that no single conversation could contain.

This is where Memory Bank and the new Memory Profiles work together.

Memory Bank (now available for everyone) dynamically generates and curates memories from conversations, organized by topic. Memory Profiles add low-latency access to specific, high-accuracy details. Think of Memory Bank as long-term memory and Memory Profiles as working memory.

But here's the problem most developers don't anticipate until production: memory drift. Your agent's behavior isn't shaped only by its code and prompts. It's shaped by accumulated experience. If an agent "learns" from a few atypical interactions that a procedural shortcut is acceptable, it might start applying that shortcut broadly. And if multiple agents read and write to shared memory pools, data leakage between distinct workflows becomes a real risk.

You can't let agents write to a vector database unchecked. You need to govern them the same way you govern microservices.

§ 5

This is where Agent Identity, Agent Registry, and Agent Gateway come in. They bring standard infrastructure concepts into the agent lifecycle:

Agent Identity works like IAM for agents. Just as a microservice needs a service account, an agent needs a cryptographic identity that determines exactly which memory banks and tools it's authorized to access.

Agent Registry works like service discovery. When you have dozens of long-running agents, you need a centralized way to track which agents are active, what version of the prompt and code they're running, and what their current execution state is.

Agent Gateway works like an API gateway tailored for LLMs. It sits between the agent and its memory and tools, evaluating requests against organizational policies. If an agent tries to commit PII to its long-term Memory Bank, the Gateway blocks the transaction.

Build auditing into your memory layer from day one. The question isn't just "what are my agents doing?" It's "what are my agents remembering, and how is that changing their behavior?"
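The governance ideas above (an identity that scopes access, a gateway that checks writes, and an audit trail) can be sketched in a few lines. Everything in this example is a hypothetical stand-in: `AgentIdentity`, `MemoryGateway`, and the deliberately simple PII pattern are assumptions for illustration, not the real Agent Identity or Agent Gateway APIs.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentIdentity:
    """Hypothetical identity: a name plus the memory banks it may write to."""
    name: str
    writable_banks: frozenset


@dataclass
class MemoryGateway:
    """Hypothetical gateway that checks every memory write and records the outcome."""
    banks: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    # Deliberately simple PII check for illustration: email addresses and US SSNs.
    _pii = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+|\b\d{3}-\d{2}-\d{4}\b")

    def write(self, agent: AgentIdentity, bank: str, memory: str) -> bool:
        if bank not in agent.writable_banks:
            verdict = "denied: bank not authorized"
        elif self._pii.search(memory):
            verdict = "denied: PII detected"
        else:
            verdict = "allowed"
            self.banks.setdefault(bank, []).append(memory)
        # Every attempt is logged, including the ones that were blocked.
        self.audit_log.append((agent.name, bank, verdict))
        return verdict == "allowed"
```

Because denied writes are logged alongside allowed ones, the audit log answers both what an agent remembered and what it tried to remember.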

§ 6

Not every long-running agent interacts with humans. Some are ambient. They watch for events, process data streams, and take action in the background without any user prompting.

Batch and Event-Driven Agents connect directly to BigQuery tables and Pub/Sub streams.

Here's a concrete example: a content moderation agent that processes user-generated content as it arrives.

This agent runs for days. It doesn't wait for someone to ask it to moderate content. It processes events as they arrive, maintains its own state about trends and patterns, and escalates only when necessary.

The important architectural decision here ties back to the governance layer from Pattern 3.

Don't hardcode content policies into the agent. Define them in Agent Gateway and the agent enforces them at runtime. When policies change, you update Gateway once and every ambient agent picks up the new rules. The agent's identity (from Agent Identity) determines which policies apply to it, and the Registry tracks which version of the agent is running against which policy set.

This separation matters because ambient agents run unsupervised for long stretches. If you hardcode policies, every policy change requires redeploying every agent. If you externalize policies through the Gateway, you update once and the fleet adapts.
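A minimal sketch of policy externalization, assuming a shared in-memory `PolicyStore` as a stand-in for policies defined in a gateway. The class names and the keyword-based rule are hypothetical and chosen only to keep the example short. Each agent looks up the current policy per event instead of holding its own copy, so one update reaches every running agent.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyStore:
    """Hypothetical stand-in for policies defined centrally in a gateway."""
    version: int = 1
    blocked_terms: set = field(default_factory=lambda: {"spam"})

    def update(self, blocked_terms: set) -> None:
        self.blocked_terms = set(blocked_terms)
        self.version += 1


@dataclass
class ModerationAgent:
    """Ambient agent that looks up the current policy for every event."""
    policies: PolicyStore
    flagged_count: int = 0  # state the agent keeps across events

    def handle(self, event: str) -> str:
        words = set(event.lower().split())
        if words & self.policies.blocked_terms:
            self.flagged_count += 1
            return f"escalate (policy v{self.policies.version})"
        return "allow"
```

After `store.update(...)`, every `ModerationAgent` sharing that store applies the new rules on its next event, with no agent restarted or redeployed.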

§ 7

The final pattern is about managing multiple long-running agents as a coordinated fleet. In production, you rarely have a single agent working alone. You have a coordinator agent that delegates sub-tasks to specialist agents, each running independently for different durations.

Consider a sales prospecting sequence:

Each specialist has its own Agent Identity (so it can only access the tools and memory it needs), its own policy enforcement through Agent Gateway (so the Outreach Agent can't access financial data meant for the Scoring Agent), and its own entry in the Agent Registry (so you can track versions and execution state across the fleet).

The coordinator maintains global state and handles handoffs between specialists. This is the same coordinator/worker pattern used in distributed systems for decades. What's new is that ADK handles this natively with graph-based workflows that define coordination logic declaratively.

The operational advantage of treating each specialist as an independent unit is that you can update them independently too.

If your Scoring Agent's ranking logic needs improvement, you can deploy the new version, monitor its performance through Agent Observability, and promote it only when the results hold up. And because each agent runs in its own container (with Bring Your Own Container support for your existing CI/CD and security requirements), a bad deployment in one specialist never cascades to the others.
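The isolation property can be illustrated with a small coordinator written in plain `asyncio`. This is a sketch under stated assumptions: the specialist functions, the `research` role, and the simulated failure are hypothetical, and it does not use ADK's graph-based workflows. It shows a coordinator that keeps global state and records a failing specialist without losing the results of the others.

```python
import asyncio


async def research(lead: str) -> str:
    await asyncio.sleep(0)
    return f"profile for {lead}"


async def score(lead: str) -> int:
    raise RuntimeError("scoring model failed to load")  # simulated bad deployment


async def outreach(lead: str) -> str:
    await asyncio.sleep(0)
    return f"draft email to {lead}"


async def coordinate(lead: str, specialists: dict) -> dict:
    """Run every specialist independently and record one outcome per specialist."""
    names = list(specialists)
    outcomes = await asyncio.gather(
        *(specialists[name](lead) for name in names),
        return_exceptions=True,  # a failure is returned as a value, not raised
    )
    state = {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            state[name] = {"status": "failed", "error": str(outcome)}
        else:
            state[name] = {"status": "done", "result": outcome}
    return state
```

With `return_exceptions=True`, the coordinator receives the scoring failure as data and can retry or roll back that one specialist while the research and outreach results remain usable.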

§ 8

These patterns compose. A compliance system might use Checkpoint-and-Resume for document processing, Delegated Approval for review gates, Memory-Layered Context for cross-session knowledge, and Fleet Orchestration to coordinate specialists.

The key question: what is the longest uninterrupted unit of work your agent needs to perform? If it's minutes, you probably don't need long-running agents. If it's hours or days, these patterns are where you start.

Long-running agents are available today on Gemini Enterprise Agent Platform. Build with ADK, deploy on Agent Runtime, monitor via Mission Control. The combination of 7-day persistence, human-in-the-loop approvals, and long-term memory is what turns an agent from a chatbot into an autonomous worker.

Start here: https://cloud.google.com/gemini-enterprise/agents
