Designing subagents for Metaglass AIOS, part 5: delegation, isolation, and the summary contract

This is the fifth piece in a series on designing the Metaglass AIOS kernel. Earlier parts: context management, the agent loop, the tool system, prompt composition. This one is about subagents — when one agent spawns another, what gets passed down, what comes back up, and the surprisingly large engineering surface that question opens.

Subagents are the single largest scaling lever an agentic harness has. They also break almost every assumption a single-loop design makes about context, budgets, and termination. This is the part of the kernel where "we'll figure that out later" decisions compound the fastest, and where I have the most candid story to tell about what's shipped versus what's still an interface waiting for an implementation.

Why naive implementations fall short

The first subagent system anyone writes looks like this: a tool called spawn_subagent(prompt) that calls back into the same loop with the prompt as the new user message. Five lines of code. It runs.

Then you ship it and watch:

The child inherits the parent's full conversation history; the context budget you were trying to save just doubled.
The child has access to every tool the parent has, including tools that should not exist in a sub-task (spawn_subagent itself — now you've got infinite recursion in one bad model decision).
The child runs forever because nothing capped its iteration budget separately from the parent's.
The child writes to durable memory and corrupts the parent's working state mid-task.
The parent gets back a 50KB raw transcript and doesn't know what to do with it.
Two parallel children both edit the same file and the second silently clobbers the first.
The parent crashes; the children keep running for an hour and then deliver completion announcements to a session that no longer exists.
A child fails with a transient API error and the parent has no way to retry just that one, so it kills the whole batch.
The child is "done" but its grandchildren are still running and the parent thinks the work is finished.
The parent has nothing to do while the child runs and burns turns waiting.

Each of these is a separate design decision masquerading as a bug. The naive spawn-into-the-same-loop pattern survives demos and explodes in production. A real subagent system is not "another call to the loop" — it's a delegation contract with five distinct sub-problems: who can spawn, what gets passed down, what tools the child can touch, how the result comes back, and how the lifecycle is cleaned up when things go sideways.

The shape the problem keeps taking

Across the subagent implementations I read, the same six concerns kept showing up regardless of architecture:

Spawn — who can create a child, with what configuration, and from where in the parent's flow.
Isolation — what the child shares with the parent (context, tools, memory, workspace) and what it doesn't.
Scoping — which tools the child has, which it explicitly cannot have, and how that's enforced.
Communication — how results flow back: synchronous return, push-based completion, or polling.
Lifecycle — spawn, run, complete, cancel, restart, clean up; persisted across crashes; resilient to descendants still running.
Aggregation — when the parent has multiple children, how their results combine without overwhelming the parent's context.

Every working system handles all six. The interesting differences are in which of the six get the most investment. The simple systems make spawn and synchronous return the entire story; the production systems treat lifecycle and aggregation as first-class engineering surfaces.

That six-concern shape became the spine of the AIOS design. The kernel exposes a TaskSpawner that owns spawn, scoping, communication, and lifecycle. Each child is a separately-constructed ConversationEngine instance (not a re-entry into the parent's loop) with its own ports, its own budget, and its own history. The boundary is hard: parents never see children's internals; children never see parents' history.

The honest part of this story is that the infrastructure is built and the factory that makes it usable in production is not. TaskSpawner is 344 lines of code, well tested with mocks, and it has never been called outside tests because ConversationEngineAgentFactory doesn't exist yet. This is the single largest "designed but not wired" surface in the kernel. I'll be explicit about it throughout.

Mechanism by mechanism

Spawn as a tool, not a primitive

Every production subagent system I read exposed spawning as a tool the model calls, not as a primitive the harness invokes on the model's behalf. The model decides what to delegate, when, and to whom, just like any other tool decision. The harness validates the call and dispatches.

The reason is uniformity. If spawning is a tool, it goes through the same retry, validation, confirmation, and policy paths as every other tool. The model already knows how to read tool descriptions and pick the right one; there's no special protocol to learn. And the loop doesn't need a separate "spawn check" branch — it's just another execute(toolCall).

What I chose for AIOS: spawn_task is designed as a virtual tool registered by the engine when TaskSpawner is in its dependencies. The model calls it with { description, prompt, subagentType }. The dispatcher intercepts the call before it reaches ToolProvider.execute() (same pattern as TodoWrite, AskUserQuestion, and batch_tools — see part 3) and routes to TaskSpawner.spawn(). Today the interception lives behind an unimplemented factory; once ConversationEngineAgentFactory lands, the tool becomes live without any loop change.
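
To make the interception concrete, here is a minimal sketch of a dispatcher that checks a virtual-tool table before falling through to the provider. The Dispatcher class, the ToolCall and ToolResult shapes, and the handler signature are my assumptions for illustration, not the AIOS types; only the spawn_task argument names come from the description above. It is synchronous for brevity, where the real calls would presumably be async.

```typescript
// Hypothetical shapes for illustration; not the actual AIOS interfaces.
type ToolCall = { name: string; args: Record<string, unknown> };
type ToolResult = { ok: boolean; output: string };

interface ToolProviderLike {
  execute(call: ToolCall): ToolResult;
}

type VirtualHandler = (call: ToolCall) => ToolResult;

class Dispatcher {
  private readonly virtualTools = new Map<string, VirtualHandler>();
  private readonly provider: ToolProviderLike;

  constructor(provider: ToolProviderLike) {
    this.provider = provider;
  }

  registerVirtual(name: string, handler: VirtualHandler): void {
    this.virtualTools.set(name, handler);
  }

  dispatch(call: ToolCall): ToolResult {
    const handler = this.virtualTools.get(call.name);
    // Virtual tools are intercepted here and never reach the provider.
    return handler ? handler(call) : this.provider.execute(call);
  }
}

// Usage: spawn_task is handled by the engine, Read goes to the provider.
const providerCalls: string[] = [];
const dispatcher = new Dispatcher({
  execute(call) {
    providerCalls.push(call.name);
    return { ok: true, output: `provider ran ${call.name}` };
  },
});
dispatcher.registerVirtual("spawn_task", (call) => ({
  ok: true,
  output: `spawned: ${String(call.args.description)}`,
}));

const spawned = dispatcher.dispatch({
  name: "spawn_task",
  args: { description: "map the repo", prompt: "List modules", subagentType: "explore" },
});
const read = dispatcher.dispatch({ name: "Read", args: { path: "README.md" } });
```

The loop only ever calls dispatch(); whether a tool is virtual is a registration detail, which is what lets the tool go live later without a loop change.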

Agent types with default configurations

The cleanest pattern I saw was named types — explore, execute, general-purpose, Plan — each carrying default model tier, default tool allowlist, and a behavioral suffix appended to the system prompt. The model picks a type; the harness resolves defaults; the parent can override individual fields per spawn.

The reason for types is the same reason restaurants have menus rather than ad-hoc requests. The model doesn't have to invent a tool set and a model choice and a prompt suffix every time it spawns; it picks explore and gets the right defaults. The parent (or the user) can still override, but the defaults handle the common case.

What I chose for AIOS: six types in AGENT_TYPE_CONFIGS:

explore — Haiku, read-only tools (Read, Glob, Grep, LS), exploration prompt suffix.
execute — Sonnet, full tools, execution prompt suffix.
Bash — Sonnet, Bash-only, command-execution prompt suffix.
Skill — Sonnet, read-only with skill-runtime hooks, skill-execution prompt suffix.
Plan — Sonnet, read + web tools, planning prompt suffix.
general-purpose — Sonnet, full tools, no suffix.

The type list is opinionated and probably over-specified for current usage; explore and execute would cover ~90% of cases. But the cost of carrying extra types is low — they're just config rows — and removing them later is easier than adding them later.
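
As a rough sketch, the type table and the override step could look like the following. The field names model, allowedTools, and promptSuffix, the suffix placeholder strings, the web tool names, and resolveAgentConfig are assumptions for illustration; the skill-runtime hooks on the Skill type are not modeled here.

```typescript
// Sketch only: suffix strings and web tool names are placeholders, not AIOS values.
type ModelTier = "haiku" | "sonnet" | "opus";
type AgentType = "explore" | "execute" | "Bash" | "Skill" | "Plan" | "general-purpose";

interface AgentTypeConfig {
  model: ModelTier;
  allowedTools: string[] | "*"; // "*" means full access
  promptSuffix: string; // "" means no suffix
}

const READ_ONLY_TOOLS = ["Read", "Glob", "Grep", "LS"];

const AGENT_TYPE_CONFIGS: Record<AgentType, AgentTypeConfig> = {
  explore: { model: "haiku", allowedTools: READ_ONLY_TOOLS, promptSuffix: "<exploration suffix>" },
  execute: { model: "sonnet", allowedTools: "*", promptSuffix: "<execution suffix>" },
  Bash: { model: "sonnet", allowedTools: ["Bash"], promptSuffix: "<command-execution suffix>" },
  Skill: { model: "sonnet", allowedTools: READ_ONLY_TOOLS, promptSuffix: "<skill-execution suffix>" },
  Plan: {
    model: "sonnet",
    allowedTools: [...READ_ONLY_TOOLS, "WebSearch", "WebFetch"], // hypothetical web tool names
    promptSuffix: "<planning suffix>",
  },
  "general-purpose": { model: "sonnet", allowedTools: "*", promptSuffix: "" },
};

// Type defaults first, then whatever the parent chose to override for this spawn.
function resolveAgentConfig(
  type: AgentType,
  overrides: Partial<AgentTypeConfig> = {},
): AgentTypeConfig {
  return { ...AGENT_TYPE_CONFIGS[type], ...overrides };
}

const cheapExplorer = resolveAgentConfig("explore");
const upgradedExplorer = resolveAgentConfig("explore", { model: "sonnet" });
```

The second call shows the override path: the parent bumps the model tier for one spawn while the read-only allowlist stays at the type default.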

Model tier per type — the cheap-explorer pattern

Across every system I read, the same cost-optimization pattern recurred: read-only exploration runs on the cheapest, fastest model; writes and complex reasoning run on the expensive model. An exploration agent that searches 50 files doesn't need frontier reasoning; it needs throughput.

What I chose for AIOS: the type defaults encode this directly. explore defaults to Haiku-tier; everything else defaults to Sonnet. The parent can override per spawn, but the defaults steer the model toward the cost-efficient choice. This is one of those design decisions where the right default does more than any amount of guidance in the system prompt.

Tool scoping by allowlist

The naive system gives the child every tool the parent has. The disciplined systems enforce a per-type allowlist, and explicitly block a small set of tools that should never appear in children regardless of allowlist:

spawn_task / delegate_task — no recursive delegation by default.
AskUserQuestion / clarify — no user interaction from subagents (they're not in the user-facing session).
memory_write — no writes to shared durable memory (children can't corrupt the parent's worldview).
Anything that sends external messages on the parent's behalf.

The block list is the safety net that catches mistakes in the allowlist. Belt and suspenders.

What I chose for AIOS: allowlists are part of AgentTypeConfig.allowedTools (either an explicit list or '*' for full access). The block list isn't formalized yet. Today, the recursion guard would have to come from the depth check (see below), and side-effecting tools are kept out only of the types with read-only allowlists; types with '*' access have no such guard. Formalizing a DELEGATE_BLOCKED_TOOLS set is on the immediate roadmap and a smaller change than it sounds.
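
For a sense of how small that change is, here is one possible shape for the block list. The set contents come from the list above; applyBlockList is a hypothetical helper, and none of this exists in AIOS yet.

```typescript
// Tools that should never reach a child, whatever the allowlist says.
// Not in AIOS today; the names are taken from the list above.
const DELEGATE_BLOCKED_TOOLS: ReadonlySet<string> = new Set([
  "spawn_task",
  "delegate_task",
  "AskUserQuestion",
  "clarify",
  "memory_write",
]);

// Runs after allowlist resolution, so it also covers a "*" allowlist
// that has been expanded to the parent's full tool list.
function applyBlockList(tools: readonly string[]): string[] {
  return tools.filter((name) => !DELEGATE_BLOCKED_TOOLS.has(name));
}

const childTools = applyBlockList(["Read", "spawn_task", "Bash", "memory_write"]);
```

Because the filter runs last, a mistake in any allowlist can widen what a child sees only up to the block list, never past it.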

Toolset intersection: children can't escalate

A subtle but important rule: the child's tool set should be the intersection of the type's allowlist and the parent's available tools. The child can never gain a tool the parent doesn't have. Otherwise, a child can become a privilege-escalation vector — the model spawns a subagent to do something the parent itself was forbidden from doing.

What I chose for AIOS: the proposed scoped tool provider does intersection at construction time. The factory creates a ToolProvider that wraps the parent's provider and filters to the type's allowlist, with parent capabilities as the upper bound. Children inherit a subset, never a superset.
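
A minimal sketch of that wrapper, assuming a simplified provider interface with listTools() and a synchronous execute(). Both the interface and the ScopedToolProvider name are illustrative, not the real ToolProvider signature.

```typescript
// Simplified stand-in for the provider interface; the real one will differ.
interface ToolSource {
  listTools(): string[];
  execute(name: string, args: unknown): string;
}

class ScopedToolProvider implements ToolSource {
  private readonly parent: ToolSource;
  private readonly visible: Set<string>;

  constructor(parent: ToolSource, allowed: readonly string[] | "*") {
    this.parent = parent;
    const allowSet = allowed === "*" ? null : new Set(allowed);
    // Intersection, fixed at construction: start from what the parent has,
    // keep only what the type allows. The child can never exceed the parent.
    this.visible = new Set(
      parent.listTools().filter((tool) => allowSet === null || allowSet.has(tool)),
    );
  }

  listTools(): string[] {
    return Array.from(this.visible);
  }

  execute(name: string, args: unknown): string {
    if (!this.visible.has(name)) {
      throw new Error(`Tool not available to this subagent: ${name}`);
    }
    return this.parent.execute(name, args);
  }
}

// Usage: the type allows WebFetch, but the parent doesn't have it.
const parentProvider: ToolSource = {
  listTools: () => ["Read", "Grep", "Bash"],
  execute: (name) => `parent executed ${name}`,
};
const scoped = new ScopedToolProvider(parentProvider, ["Read", "Grep", "WebFetch"]);
```

Here scoped.listTools() contains Read and Grep only: WebFetch is dropped because the parent lacks it, and Bash is dropped because the type doesn't allow it.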

Context isolation by default

The single most important property of a subagent: the child does not see the parent's conversation history. The parent's prompt, tool calls, accumulated tool results, todo state — none of it is visible. The child gets:

A fresh system prompt (often a slimmer version tailored to the task — see the minimal mode in part 4).
The prompt field from the spawn call as its user message.
Optionally, a small context field that the parent explicitly chose to pass.
Nothing else.

This is what makes subagents a context-management strategy, not just a parallelism strategy. The whole point is that a 200K-token exploration in the child collapses to a few KB of summary in the parent. If the child inherited the parent's context, the summary would be the only saving, and a small one.

What I chose for AIOS: each spawned agent is a separately-constructed ConversationEngine with its own history array. The parent passes only the prompt. There's no shared memory, no shared todo list, no shared compaction state. This is the property that makes the design worth building even before the factory is wired.
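
One way to make the isolation structural rather than a convention is to build the child's initial history with a function that has no parameter through which parent history could arrive. The sketch below assumes a simple ChatMessage shape and a hypothetical buildChildHistory helper; the optional context field follows the general pattern described above, not current AIOS behavior, which passes only the prompt.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

interface ChildSpawnInput {
  systemPrompt: string; // fresh, task-specific prompt
  prompt: string; // the spawn call's prompt field
  context?: string; // optional, explicitly chosen by the parent
}

// No parent history parameter exists, so nothing can leak by accident.
function buildChildHistory(input: ChildSpawnInput): ChatMessage[] {
  const userContent = input.context
    ? `${input.prompt}\n\nContext from parent:\n${input.context}`
    : input.prompt;
  return [
    { role: "system", content: input.systemPrompt },
    { role: "user", content: userContent },
  ];
}

// The parent's own history stays where it is.
const parentHistory: ChatMessage[] = [
  { role: "system", content: "parent system prompt" },
  { role: "user", content: "refactor the billing module" },
  { role: "assistant", content: "PARENT-ONLY reasoning about the refactor" },
];

const childHistory = buildChildHistory({
  systemPrompt: "You are a focused subagent.",
  prompt: "List every file that imports billing.ts",
});
```

The child starts with exactly two messages no matter how long parentHistory has grown.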

The summary contract

If the child only sends back a summary, the summary's quality becomes load-bearing. The parent can't go ask the child for more detail later (the child's context is gone); it has to make decisions on whatever shape the child returned.

The most disciplined systems treat the return as a structured payload, not just a string:

status — completed, failed, cancelled, interrupted.
summary — the human-readable result, capped at a sensible size (100KB is a common cap).
tool_trace — a list of which tools the child used, with sizes and statuses. Lets the parent see what the child did without seeing the full transcript.
duration and tokens — for cost accounting.
exit_reason — natural completion, budget exhausted, interrupted, error.

The size cap matters a lot. A child that returns its raw transcript instead of a summary turns a context-saving strategy into a context-poisoning one. Cap the result, truncate with a notice, and log the full output for debugging.

What I chose for AIOS: TaskResult carries taskId, success, data, error, status. The shape is intentionally minimal today — closer to a tool result than to a structured payload. Adding tool_trace, duration, tokens, and a size cap on data is on the roadmap and matches the StructuredToolResult pattern already in use for regular tools (see part 3). The pattern is the same; the type is just different.
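
A sketch of where that roadmap item could land. StructuredTaskResult, ToolTraceEntry, and capSummary are hypothetical names, the cap is measured in characters rather than bytes to keep the example short, and the numbers in the usage are made up.

```typescript
type ExitStatus = "completed" | "failed" | "cancelled" | "interrupted";

interface ToolTraceEntry {
  tool: string;
  resultChars: number;
  ok: boolean;
}

// Hypothetical richer result shape; today's TaskResult is smaller.
interface StructuredTaskResult {
  taskId: string;
  status: ExitStatus;
  summary: string;
  truncated: boolean;
  toolTrace: ToolTraceEntry[];
  durationMs: number;
  tokens: number;
}

// Caps the summary body at maxChars and appends a short notice when it cuts.
// The notice itself is added on top of the cap.
function capSummary(text: string, maxChars: number): { summary: string; truncated: boolean } {
  if (text.length <= maxChars) {
    return { summary: text, truncated: false };
  }
  const omitted = text.length - maxChars;
  const notice = `\n[truncated: ${omitted} of ${text.length} characters omitted; see task log]`;
  return { summary: text.slice(0, maxChars) + notice, truncated: true };
}

// Usage with illustrative numbers.
const rawOutput = "x".repeat(5000);
const capped = capSummary(rawOutput, 1000);
const taskResult: StructuredTaskResult = {
  taskId: "task-1",
  status: "completed",
  summary: capped.summary,
  truncated: capped.truncated,
  toolTrace: [{ tool: "Grep", resultChars: 5000, ok: true }],
  durationMs: 1200,
  tokens: 1800,
};
```

The truncated flag matters as much as the notice text: the parent can branch on it without parsing the summary.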

Synchronous return vs. push-based completion

Two communication models I saw:

Synchronous return — parent calls spawn() and awaits the result. Simple, but parent blocks on the slowest child.
Push-based completion — parent calls spawn(), immediately gets back a task ID, and continues working. The child eventually pushes a "completion event" into the parent's session as a new user message. The parent's loop drains the queue at the top of each turn.

The push-based pattern is strictly more flexible. The parent can do other work while children run. Multiple children can complete in any order. Background tasks (run_in_background: true) become a natural extension. The cost is more lifecycle bookkeeping — a registry of in-flight runs, retry logic for delivering completion events, expiry timeouts for completions that never arrive.

The most polished systems combined both: a runInBackground flag chooses between synchronous and push-based per spawn.

What I chose for AIOS: TaskSpawner supports both. runInBackground: false (the default) makes spawn() await the result; runInBackground: true returns immediately with status: 'running' and the parent polls via isRunning(taskId) or blocks via getResult(taskId). True push-based completion (the child injects a completion event into the parent's message queue) is the obvious refinement once mid-run message injection lands in the loop (see part 2). The seam is already there.
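
The push-based half of that seam can be sketched as a queue that children push into and the parent's loop drains at the top of each turn. CompletionQueue, the event shape, and the message format are assumptions for illustration; nothing like this is wired into AIOS today.

```typescript
type CompletionEvent = {
  taskId: string;
  status: "completed" | "failed";
  summary: string;
};
type InjectedMessage = { role: "user"; content: string };

class CompletionQueue {
  private pending: CompletionEvent[] = [];

  // Called from the child's side when it finishes, in whatever order that happens.
  push(event: CompletionEvent): void {
    this.pending.push(event);
  }

  // Called by the parent's loop at the top of each turn.
  // Returns the events as user messages and empties the queue.
  drain(): InjectedMessage[] {
    const events = this.pending;
    this.pending = [];
    return events.map((e): InjectedMessage => ({
      role: "user",
      content: `[task ${e.taskId} ${e.status}] ${e.summary}`,
    }));
  }
}

// Usage: two children finish out of spawn order between parent turns.
const completions = new CompletionQueue();
completions.push({ taskId: "task-2", status: "completed", summary: "Found 3 call sites." });
completions.push({ taskId: "task-1", status: "failed", summary: "Rate limited." });

const injected = completions.drain();
const nextTurn = completions.drain();
```

The first drain yields both events in arrival order; the second yields nothing, so a turn with no completions costs the parent no extra context.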

Parallel spawning via batched tool calls

The most elegant pattern I saw for parallelism: the model emits multiple spawn_task tool calls in a single turn. The loop dispatches them concurrently. The results come back as separate tool results, in any order. No orchestrator, no coordination protocol — just regular tool dispatch with the parallelism handled by the same machinery that runs any other read-only tool batch.

This trick depends on the parallel tool dispatcher I called out in part 2 as the highest-impact deferred item. When that lands, parallel subagents come along for free.

What I chose for AIOS: TaskSpawner supports concurrent tasks via its Map<string, RunningTask> registry and maxConcurrent cap. Parallel spawning through the model requires the parallel tool dispatcher, which is still pending. Today, the model can emit multiple spawn_task calls and they'll execute serially. Once parallel dispatch lands, the same calls execute concurrently with no other change.

Concurrency caps

Spawn without a cap is a denial-of-service vector against your own LLM budget. Three concurrent children is a common cap — enough for useful fan-out, low enough that one bad model decision doesn't spawn fifty agents.

What I chose for AIOS: TaskSpawner enforces a maxConcurrent cap (configurable, default modest). Exceeding it returns a structured failure to the model rather than queueing — the model sees the constraint and can decide to wait, batch, or skip the parallelism. Queueing internally is the next refinement, but the explicit failure is honest about the resource bound and lets the model adapt.
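
The fail-rather-than-queue behavior is easy to show in isolation. ConcurrencyGate, SpawnOutcome, and the error code below are hypothetical names, and the cap of 3 is just the common value mentioned above, not the AIOS default.

```typescript
type SpawnOutcome =
  | { ok: true; taskId: string }
  | { ok: false; error: "max_concurrent_exceeded"; running: number; limit: number };

class ConcurrencyGate {
  private readonly running = new Set<string>();
  private readonly maxConcurrent: number;
  private nextId = 1;

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  // Either admits the task or returns a structured failure the model can read.
  trySpawn(): SpawnOutcome {
    if (this.running.size >= this.maxConcurrent) {
      return {
        ok: false,
        error: "max_concurrent_exceeded",
        running: this.running.size,
        limit: this.maxConcurrent,
      };
    }
    const taskId = `task-${this.nextId++}`;
    this.running.add(taskId);
    return { ok: true, taskId };
  }

  // Frees a slot when a task completes, fails, or is cancelled.
  finish(taskId: string): void {
    this.running.delete(taskId);
  }
}

// Usage: the fourth spawn is refused while three are in flight.
const gate = new ConcurrencyGate(3);
const first = gate.trySpawn();
const second = gate.trySpawn();
const third = gate.trySpawn();
const fourth = gate.trySpawn();
```

The failure carries running and limit so the tool result can tell the model exactly why the spawn was refused and how much headroom exists.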

Depth limits

Recursive spawning is the most dangerous failure mode. A general-purpose child that has access to spawn_task can spawn its own general-purpose children, each of which can spawn more. Without a limit, the cost grows exponentially with depth.

Two patterns I saw:

Hard depth cap — the harness tracks nesting depth and rejects spawn beyond a limit (typically 2 or 3).
Role-based escalation — depth maps to a role (main → orchestrator → leaf), and only main/orchestrator can spawn. Leaves can't. The role determines control scope as well as spawn capability.

Both work. The role-based version is more expressive (orchestrators can also kill or steer their children); the hard cap is simpler.

What I chose for AIOS: a hard depth cap (proposed: 3) enforced at spawn time. Recursive spawning is blocked by the absence of spawn_task from the default allowlist for any type except general-purpose. Role-based escalation is over-engineering for the current scale; if AIOS grows to multi-orchestrator workflows, the role model is the natural upgrade.
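
A sketch of the spawn-time check, under the assumption that the top-level agent sits at depth 0 and that a cap of 3 allows children at depths 1 through 3. The text above only fixes the number, so that counting convention, along with SpawnContext and checkSpawnDepth, is mine.

```typescript
const MAX_SPAWN_DEPTH = 3; // the proposed value; the counting convention is assumed

interface SpawnContext {
  depth: number; // 0 for the top-level agent
}

type DepthCheck =
  | { allowed: true; childDepth: number }
  | { allowed: false; reason: string };

// Runs before any child is constructed, so a rejected spawn costs nothing.
function checkSpawnDepth(parent: SpawnContext, maxDepth: number = MAX_SPAWN_DEPTH): DepthCheck {
  const childDepth = parent.depth + 1;
  if (childDepth > maxDepth) {
    return { allowed: false, reason: `spawn depth ${childDepth} exceeds limit ${maxDepth}` };
  }
  return { allowed: true, childDepth };
}

const fromDepthTwo = checkSpawnDepth({ depth: 2 });
const fromDepthThree = checkSpawnDepth({ depth: 3 });
```

An agent at depth 2 may still spawn; an agent at depth 3 gets a structured rejection it can report back instead of a silent failure.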

Cancellation and interrupt propagation

A parent that wants to stop needs to cascade the stop to its children. The pattern: each subagent registers itself in the parent's active_children list at spawn; on interrupt, the parent iterates the list and cancels each child; children propagate to their own descendants. The cleanup happens bottom-up.

Without cascading cancellation, the parent stops but the children keep burning tokens — and the user sees the parent's "stopped" message while the system continues to spend money.

What I chose for AIOS: TaskSpawner exposes cancel(taskId) and cancelAll(). Agent.cancel() propagates the abort signal into the child's ConversationEngine, which already handles cancellation end-to-end (see part 2). What's missing is automatic cascading — today, the parent has to call cancelAll() explicitly when it itself is cancelled. Wiring that into the engine's own abort path is straightforward and on the roadmap.
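
The cascade itself is a short recursive walk over the active-children lists. TaskNode below is a hypothetical stand-in for whatever tracks a running agent; a real version would trigger the child's abort signal where this one just sets a flag.

```typescript
class TaskNode {
  readonly id: string;
  cancelled = false;
  private readonly activeChildren: TaskNode[] = [];

  constructor(id: string, parent?: TaskNode) {
    this.id = id;
    // Register with the parent at spawn time so a later cancel can find this node.
    if (parent) {
      parent.activeChildren.push(this);
    }
  }

  // Cancels descendants before marking this node, so cleanup finishes bottom-up.
  // Returns the ids in the order they were cancelled.
  cancel(): string[] {
    const order: string[] = [];
    for (const child of this.activeChildren) {
      order.push(...child.cancel());
    }
    this.cancelled = true; // a real implementation would abort the engine here
    order.push(this.id);
    return order;
  }
}

// Usage: root has children a and c; a has a child b.
const root = new TaskNode("root");
const a = new TaskNode("a", root);
const b = new TaskNode("b", a);
const c = new TaskNode("c", root);

const cancelOrder = root.cancel();
```

Cancelling the root reaches the grandchild b without the root knowing b exists, which is the property the explicit cancelAll() call can't give you on its own.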

Descendant-aware completion

A subtle but critical lifecycle case: a child marks itself "done" while its own grandchildren are still running. If the parent treats the child's completion as the end of that branch, the grandchildren's results are orphaned. The right behavior is to wait for the subtree to settle before announcing the child's completion to the parent.

The pattern: each completion event carries a "descendants pending?" check. If pending, the run is marked wakeOnDescendantSettle: true and re-checked once the children finish. Only after the full subtree completes does the announcement reach the parent.

What I chose for AIOS: not implemented. The proposed depth cap (3) and the synchronous default (runInBackground: false) make this less pressing for now, because children naturally await their grandchildren in the synchronous path. Push-based completion will make it necessary, and the design is well understood: a "settle check" before the announcement and a re-arm if descendants are still pending.
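
The settle check and re-arm can be sketched in a few lines. RunRecord, subtreeSettled, and tryAnnounce are hypothetical; only the wakeOnDescendantSettle flag name comes from the pattern described above.

```typescript
type RunState = "running" | "done";

interface RunRecord {
  id: string;
  state: RunState;
  children: RunRecord[];
  wakeOnDescendantSettle: boolean;
}

// A subtree is settled when the run and every descendant are done.
function subtreeSettled(run: RunRecord): boolean {
  return run.state === "done" && run.children.every(subtreeSettled);
}

// Returns true when the completion may be announced to the parent.
// If descendants are still pending, re-arms the run and returns false.
function tryAnnounce(run: RunRecord): boolean {
  if (run.state !== "done") {
    return false;
  }
  if (!run.children.every(subtreeSettled)) {
    run.wakeOnDescendantSettle = true;
    return false;
  }
  run.wakeOnDescendantSettle = false;
  return true;
}

// Usage: the child is done, but its own child is still running.
const grandchild: RunRecord = { id: "gc", state: "running", children: [], wakeOnDescendantSettle: false };
const child: RunRecord = { id: "c", state: "done", children: [grandchild], wakeOnDescendantSettle: false };

const firstAttempt = tryAnnounce(child);
const armedAfterFirst = child.wakeOnDescendantSettle;
grandchild.state = "done";
const secondAttempt = tryAnnounce(child);
```

The first attempt is held back and arms the flag; the second, after the grandchild finishes, goes through. In a real registry the grandchild's completion would be what triggers the second call.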

Orphan detection and registry persistence

A subagent system that doesn't survive a process restart is fundamentally a dev tool, not a production one. The fix: persist the in-flight task registry to disk on every state change, and on startup, reconcile against actual session state. Tasks whose underlying session no longer exists are marked orphaned and cleaned up.

What I chose for AIOS: today, TaskSpawner's registry is in-memory only. The ConversationStore and CheckpointManager (see part 2) already handle per-session persistence; extending them to capture the parent-child task graph is the right next step. It's one of the higher-priority deferred items: without persistence, the subagent infrastructure is brittle in exactly the place where persistence makes the main loop robust.
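
The startup reconciliation step is a pure function over the persisted registry and the set of sessions that still exist. PersistedTask and reconcileRegistry are hypothetical, and loading and saving the registry is left out; this is only the comparison.

```typescript
interface PersistedTask {
  taskId: string;
  sessionId: string;
  status: "running" | "completed" | "orphaned";
}

// Any task recorded as running whose session no longer exists is marked orphaned.
// Finished tasks are left alone even if their session is gone.
function reconcileRegistry(
  persisted: readonly PersistedTask[],
  liveSessionIds: ReadonlySet<string>,
): PersistedTask[] {
  return persisted.map((task): PersistedTask =>
    task.status === "running" && !liveSessionIds.has(task.sessionId)
      ? { ...task, status: "orphaned" }
      : task,
  );
}

// Usage: after a restart, only session s1 survived.
const onDisk: PersistedTask[] = [
  { taskId: "t1", sessionId: "s1", status: "running" },
  { taskId: "t2", sessionId: "s2", status: "running" },
  { taskId: "t3", sessionId: "s3", status: "completed" },
];
const reconciled = reconcileRegistry(onDisk, new Set(["s1"]));
```

Here t2 becomes orphaned and can be cleaned up, while t1 keeps running and t3 keeps its completed result.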

Steering: mid-run redirection

The most powerful subagent primitive I saw is steering: the parent can inject a new instruction into a running child without killing it. The child receives the new message at the top of its next turn and adjusts course. This avoids the kill-and-restart anti-pattern, where the parent has to throw away in-flight work and start over.

Steering requires the mid-run message injection feature on the child's loop (the same feature called out as deferred in part 2). When that lands for the parent, it works for subagents too.

What I chose for AIOS: deferred, paired with mid-run message injection on the loop side. Once the loop's message queue is drained at the top of each turn, steering is a trivial TaskSpawner.steer(taskId, message) method that pushes into the child's queue.

Workspace isolation

For coding agents, a particularly elegant pattern: each subagent gets its own git worktree. Changes don't affect the parent's working directory; if the child commits, the parent gets back the branch name; if the child makes no changes, the worktree is cleaned up automatically. This enables speculative edits — the child can try a refactor and the parent can decide whether to merge.

For Metaglass, the analog isn't git worktrees but vault snapshots — the child operates on a copy-on-write view of the vault and writes are merged back only on completion. This is a much larger lift than the git worktree pattern (vaults aren't versioned the same way) and isn't on the immediate roadmap, but the conceptual seam is in WorkspaceManager for whoever builds it.

What I chose for AIOS: no workspace isolation today. Children operate on the same vault as the parent. For read-only explore tasks this is fine; for execute tasks that mutate the vault, it's a real constraint. Vault-snapshot isolation is on the long-range roadmap.

Credential and provider override

A nice pattern from the production systems: children can be routed to a different provider or different model than the parent. The parent runs on Sonnet on Anthropic; children run on Haiku via OpenRouter. The cost optimization is real, and the routing is invisible to the model.

What I chose for AIOS: AgentConfig.model accepts a tier (haiku / sonnet / opus); routing to a different provider would require extending LLMProvider to be per-spawn. Today, the parent's provider is reused. This is a fair next move, especially because Metaglass already supports multiple providers through the AI SDK — it's just not wired through TaskSpawner yet.

The problem-to-mechanism table

Problem	Mechanism
Spawning is a special control-flow case	Spawn is a virtual tool (`spawn_task`) handled like `TodoWrite` and friends
Child needs sensible defaults	Named agent types with default model + tools + prompt suffix
Cheap exploration shouldn't pay frontier prices	Model tier per type; `explore` defaults to Haiku
Child has too many tools	Per-type allowlist + (planned) block list for never-delegatable tools
Privilege escalation via children	Toolset intersection: child tools ⊆ parent tools
Context budget doubled by spawning	Hard context isolation — child sees only its `prompt`, never parent's history
Parent drowns in raw child transcripts	Summary contract: structured `TaskResult` with capped `data` (size cap pending)
Parent blocks on slowest child	`runInBackground: true` returns immediately; (planned) push-based completion
Parallel children require an orchestrator	Multiple `spawn_task` calls in one turn + (deferred) parallel tool dispatcher
Spawn DoS via unbounded fan-out	`maxConcurrent` cap with structured failure on exceed
Recursive delegation explosions	Hard depth cap + `spawn_task` excluded from default allowlists
Parent cancellation leaves children running	`cancelAll()`; (planned) automatic cascade on parent abort
Grandchild work orphaned by child's "done"	(deferred) descendant-aware completion with `wakeOnDescendantSettle`
Subagent state lost on restart	(deferred) registry persistence + orphan reconciliation
Kill-and-restart wastes in-flight work	(deferred) steering via mid-run message injection
Children mutate the parent's workspace	(long-range) vault-snapshot isolation per child
Children always pay parent's per-token rate	(deferred) per-spawn provider/credential override

What comes next

Honest accounting of what's shipped vs. designed:

Shipped (interface + behavior, tested with mocks):

TaskSpawner with spawn / cancel / list / get-result.
Six agent types with per-type model + tool defaults and prompt suffixes.
AgentConfig and TaskParams / TaskResult shapes.
Synchronous and runInBackground modes.
Concurrency cap.
Event emission (task:spawned, task:completed).

Designed (interface defined, no production implementation):

AgentFactory interface — the seam between TaskSpawner and ConversationEngine. The kernel can't actually spawn real children until ConversationEngineAgentFactory is implemented.
spawn_task tool registration in ConversationEngine.
Scoped tool provider (toolset intersection at child construction).
Depth limiting at spawn time.

Deferred (clear design, no implementation):

DELEGATE_BLOCKED_TOOLS set for never-delegatable tools.
Structured TaskResult with tool_trace, duration, tokens, size cap.
Automatic cancellation cascade.
Push-based completion via the loop's message queue.
Steering (paired with mid-run message injection in the loop).
Registry persistence + orphan reconciliation.
Descendant-aware completion (wakeOnDescendantSettle).
Per-spawn provider/credential override.

Long-range:

Vault-snapshot workspace isolation.
Role-based depth (main / orchestrator / leaf) if multi-orchestrator workflows emerge.

The highest-impact next step is unambiguously ConversationEngineAgentFactory. Until that lands, the subagent infrastructure is well-tested architecture that produces zero production value. The bet was to build the spawning infrastructure before the factory because the factory has a clear shape and the infrastructure had unknowns. The bet pays off the moment the factory lands; until then, it's dead weight in the codebase, and it's worth being explicit about that.

Three lessons

If I had to compress this survey into advice for someone designing a subagent system from scratch:

Subagents are a context-management strategy first, a parallelism strategy second. The whole reason to build them is that a large exploration in the child becomes a small summary in the parent. If the child inherits the parent's context, you've made the system more complex without making it more capable. Hard context isolation is the property that earns the design its keep; everything else is decoration on top.
The summary is load-bearing; treat it as a contract. Once the child's context is gone, the parent has only the summary to act on. A bad summary, an oversized result, an unclear status field — these are not edge cases, they are the primary failure modes of any subagent system. Cap the size, structure the payload, include the tool trace, log the full transcript for debugging. The summary is the only thing the parent ever sees.
Don't build the infrastructure before the factory. This one is from experience, not from reading others' code. It is very tempting to design the elegant lifecycle — agent types, registry, push completion, depth limits, role hierarchies — before you've built the thing that actually spawns a real child. Don't. Build the simplest end-to-end path first (one type, synchronous return, no scoping) and prove it works; then add the rest. The cost of premature infrastructure is that it carries weight in the codebase without producing value, and the design assumptions calcify before reality has tested them.

The rest is tuning. Type lists, concurrency caps, depth limits, summary size caps — dials, not decisions. The decisions are the three above, and the subagent systems that get them right end up looking remarkably similar even when their authors didn't talk to each other.

Earlier in this series: Part 1: Context management · Part 2: The agent loop · Part 3: The tool system · Part 4: Prompt composition.