Designing the tool system for Metaglass AIOS, part 3: what the agent is allowed to touch

This is the third piece in a series on designing the Metaglass AIOS kernel. Part 1 was on context management; part 2 was on the agent loop. This one is about the tool system — the layer that decides what the agent can actually do, and how the rest of the kernel interacts with that surface.

If the loop is the heartbeat and context is the working memory, tools are the hands. They're also where almost every safety, performance, and extensibility decision in the system gets made. A well-designed tool system makes the agent capable; a poorly designed one makes the agent dangerous, slow, or both.

Why naive implementations fall short

The first tool system anyone writes looks like this: a flat dictionary of { name: handler } pairs, each handler a Python or TypeScript function. The model emits a tool call, the harness looks up the handler, runs it, and returns the result as a string. Five lines of code.

Then you ship it and discover:

The model invents tool names that don't exist and the harness throws an uncaught exception.
A tool returns 200KB of binary content and poisons the rest of the conversation.
A tool fails with a transient network error and the agent gives up on the whole task.
Two tool calls in the same turn touch the same file and overwrite each other.
The agent calls a destructive tool (rm -rf) with no confirmation gate.
A tool's schema is too vague and the model passes garbage arguments; the tool throws a TypeError halfway through executing.
The user adds an MCP server with 60 tools and now every prompt is 20% schemas.
A tool's description is buried under others and the model never picks it.
A long-running tool blocks the loop for 90 seconds with no way to cancel.
A tool succeeds but returns nothing useful and the model has no idea whether the work happened.

Each of these is a design decision masquerading as an edge case. The naive dictionary-of-handlers survives demos and dies in production. A real tool system is not a registry — it's a pipeline: registration → schema validation → policy gating → execution → result formatting → error normalization. Each stage exists because something specific goes wrong without it.

The shape the problem keeps taking

After reading enough tool implementations across the field, the same five-layer shape kept emerging:

Definition — what is a tool, structurally? Name, description, input schema, execute function, plus metadata for routing and risk.
Registration — where do tools come from? Built-in, plugin, MCP, dynamically discovered, hot-reloaded.
Policy — which tools are available right now, given who's asking, where they're asking from, and what mode the agent is in?
Execution — argument validation, dispatch, retry, parallelism, cancellation, error wrapping.
Result handling — formatting, truncation, structured fields, follow-up suggestions, observation summaries for the LLM.

Every production system implements all five. The interesting differences are which layers each system invests in. Some put all their effort in registration and execution and treat policy as an afterthought; others (the ones that survive multi-tenant workloads) make policy a first-class concern.

That five-layer shape became the spine of the AIOS design. The kernel owns the interface and the execution contract; the host owns what tools actually exist. The boundary between them is the ToolProvider interface, and the kernel never knows what tools are behind it — only that calls go in and ToolResults come out.

Mechanism by mechanism

Tool anatomy

There's a remarkably stable definition of "what a tool is" across every system I read:

A unique name.
A description the LLM uses to decide when to call it.
An input schema (typically JSON Schema or equivalent).
An execute function that takes parameters and returns a result.

That's the core. The interesting variations are in what else gets attached. Toolset membership for filtering. Side-effect classification (none / reversible / irreversible). Cost tier. Required environment variables. Availability check function. Parallelism eligibility. Emoji for CLI rendering. Each of these is a hook that some downstream layer reads.

What I chose for AIOS: the kernel's Tool interface is the minimal four-field shape. Everything else lives in ToolMetadataRegistry as a parallel structure keyed by tool name. The reason is decoupling — the metadata can grow without forcing every tool definition to grow with it, and the host can opt into providing metadata for tools that need it (mutation, irreversible side effects, expensive cost) without burdening trivial tools.
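
To make the split concrete, here is a minimal sketch of a four-field tool shape with a parallel metadata registry keyed by tool name. The field names inside ToolMetadata, the permissive defaults, and the synchronous execute signature are assumptions for illustration, not the actual AIOS definitions.

```typescript
// Hypothetical shapes: only the four core Tool fields come from the text above.
interface ToolResult {
  success: boolean;
  data?: unknown;
  error?: string;
  observation?: string;
}

interface Tool {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>; // a JSON Schema object
  execute(params: Record<string, unknown>): ToolResult; // sync for brevity
}

type SideEffect = "none" | "reversible" | "irreversible";

interface ToolMetadata {
  sideEffect: SideEffect;
  requiresConfirmation: boolean;
  allowsParallelExecution: boolean;
}

// Assumed defaults for tools the host never annotates.
const DEFAULT_METADATA: ToolMetadata = {
  sideEffect: "none",
  requiresConfirmation: false,
  allowsParallelExecution: true,
};

class ToolMetadataRegistry {
  private entries = new Map<string, ToolMetadata>();

  // The host supplies only the fields that differ from the defaults.
  register(toolName: string, partial: Partial<ToolMetadata>): void {
    this.entries.set(toolName, { ...DEFAULT_METADATA, ...partial });
  }

  // Unannotated tools fall back to the defaults instead of failing.
  get(toolName: string): ToolMetadata {
    return this.entries.get(toolName) ?? DEFAULT_METADATA;
  }
}
```

With this shape, a trivial read tool needs no registry entry at all, while a hypothetical vault.delete_note would register { sideEffect: "irreversible", requiresConfirmation: true } once and leave its Tool definition untouched.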

Tool description as prompt engineering

This is the most under-appreciated decision in any tool system. The description field is not documentation — it's prompt. The model reads it on every request. If the description is vague, the model picks the wrong tool or the right tool with wrong arguments. If it's clear, half the model's job is already done.

The most disciplined systems I read treated descriptions as prompt engineering artifacts: long, specific, full of "do this / don't do that" guidance, anti-patterns, edge cases, examples of correct usage. The descriptions are sometimes 20+ lines and cost real tokens. They earn the tokens back many times over in fewer wrong tool selections.

What I chose for AIOS: the descriptions live alongside each tool definition in learning-os/src/core/skills/runtime/tools/. They're closer to the verbose end than the terse end, but there's no enforcement — no lint rule that says "your description is too short" or "your description doesn't mention error modes." That's a fair next move and one I'd build before adding new tools at scale.

Registry pattern: central vs. composed

There are two design lines here. One: a central singleton registry that tools register into at module load time. Two: a per-run composed array assembled from multiple sources (built-ins, plugins, MCP servers, user config).

The central registry is simpler to reason about but globalizes the tool surface — everyone sees the same set. The composed array is messier but lets the same agent expose different tool sets to different sessions: more in a privileged session, fewer in a sandboxed one.

What I chose for AIOS: the kernel sees a single ToolProvider per ConversationEngine instance. The host can construct that provider however it wants — directly from a registry, by filtering an existing provider, by wrapping an MCP client. The kernel doesn't care. This puts AIOS closer to the composed model in practice while keeping the kernel-side interface as simple as the central-registry model.
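
One way to picture "filtering an existing provider" is a wrapper that exposes the same interface over a narrower tool set. The sketch below assumes that list() returns tool names, that execute is synchronous, and that patterns use a trailing "*" wildcard; filterProvider and matches are hypothetical names, not the real adapter factories.

```typescript
interface ToolResult {
  success: boolean;
  data?: unknown;
  error?: string;
}

// Assumed provider shape; the real interface also takes a context argument.
interface ToolProvider {
  list(): string[];
  has(id: string): boolean;
  execute(id: string, params: Record<string, unknown>): ToolResult;
}

// Supports exact names and a trailing "*" wildcard, e.g. "vault.read_*".
function matches(pattern: string, id: string): boolean {
  return pattern.endsWith("*")
    ? id.startsWith(pattern.slice(0, -1))
    : pattern === id;
}

// Wraps a provider so that only tools matching a pattern are visible.
function filterProvider(inner: ToolProvider, patterns: string[]): ToolProvider {
  const allowed = (id: string): boolean => patterns.some((p) => matches(p, id));
  return {
    list: () => inner.list().filter(allowed),
    has: (id) => allowed(id) && inner.has(id),
    // A filtered-out tool looks exactly like a missing tool to the caller.
    execute: (id, params) =>
      allowed(id) && inner.has(id)
        ? inner.execute(id, params)
        : { success: false, error: `Unknown tool: ${id}` },
  };
}
```

Because the wrapper returns a ToolProvider, wrappers stack: a sandboxed session can filter an already-filtered provider without the kernel noticing.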

The policy cascade

This is the layer most systems get wrong on the first pass. "Is this tool allowed?" is not a yes/no question — it's a question with at least six independent answers that have to be resolved into one.

The most sophisticated system I read used a six-layer cascade, in priority order:

Deny lists (highest — always override allow).
Provider/model overrides — some models can't safely use some tools.
Sandbox restrictions — if the run is sandboxed, certain tools are off.
Agent-level policies — this specific agent's allow/deny rules.
Group/role policies — what this user or channel can access.
Global policies — the project-wide defaults.

The same agent ends up with different tool sets depending on which model is running, whether the session is sandboxed, who the caller is, and what global config says. The resolver runs once per execute() call and the result is what the LLM sees.
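
The cascade above can be sketched as an ordered list of layers where the first layer with an opinion wins. This is one reading of "priority order", not a description of any specific system; the context fields, the layer rules, and the default-deny fallback are all assumptions.

```typescript
type Verdict = "allow" | "deny" | undefined; // undefined means "no opinion"

interface PolicyContext {
  toolName: string;
  model: string;
  sandboxed: boolean;
  role: string;
}

interface PolicyLayer {
  name: string;
  decide(ctx: PolicyContext): Verdict;
}

// Walks the layers from highest to lowest priority; the first verdict is final.
function resolvePolicy(
  layers: PolicyLayer[],
  ctx: PolicyContext,
): { allowed: boolean; decidedBy: string } {
  for (const layer of layers) {
    const verdict = layer.decide(ctx);
    if (verdict !== undefined) {
      return { allowed: verdict === "allow", decidedBy: layer.name };
    }
  }
  // Assumption: a tool nobody mentions is denied.
  return { allowed: false, decidedBy: "default-deny" };
}

// Illustrative rules only; the tool names are hypothetical.
const exampleLayers: PolicyLayer[] = [
  { name: "deny-list", decide: (c) => (c.toolName === "shell.exec" ? "deny" : undefined) },
  { name: "model-override", decide: () => undefined },
  {
    name: "sandbox",
    decide: (c) => (c.sandboxed && c.toolName.startsWith("vault.write") ? "deny" : undefined),
  },
  { name: "agent", decide: () => undefined },
  { name: "role", decide: () => undefined },
  { name: "global", decide: (c) => (c.toolName.startsWith("vault.") ? "allow" : undefined) },
];
```

Returning decidedBy alongside the verdict is a small addition that pays off in debugging: when a tool is unexpectedly missing, the operator can see which layer removed it.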

What I chose for AIOS: today, policy is a coarse filter applied at adapter construction — createReadOnlyToolAdapter(), createPlanningToolAdapter(), createFilteredToolAdapter(patterns). The host picks the adapter; the kernel sees whatever's in it. This works for the current single-user, single-vault workload but won't survive multi-user or multi-channel scenarios. A proper cascading policy engine is the largest gap in the tool layer and the right next investment if AIOS grows toward multi-tenancy.

Argument validation and self-healing

LLMs hallucinate arguments. They invent fields that aren't in the schema, omit required ones, pass strings where numbers are expected. A naive tool system throws a TypeError and corrupts the history.

The clean answer is to validate against the schema at dispatch time, and on failure, return the validation error to the model as a tool result. The model sees its own mistake, reads the schema again, and corrects itself on the next turn. This is the self-healing pattern, and it's elegant: no special prompts, no exception handlers in the loop, just a uniform error path.

What I chose for AIOS: the AI SDK does schema validation up front, and validation failures are normalized to a structured ToolResult with success: false and an error field describing the violation. The model gets to read the error and retry. No exception ever reaches the loop body. This is one of the cheapest design wins in the kernel — it costs almost nothing to implement and removes an entire class of failure.
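
The self-healing path is easy to show in miniature. The sketch below does not use the AI SDK; it hand-rolls a tiny validator (required keys and primitive types only) so the example stays self-contained. MiniSchema, validateArgs, and dispatch are hypothetical names.

```typescript
interface ToolResult {
  success: boolean;
  data?: unknown;
  error?: string;
}

// A deliberately tiny schema subset, standing in for real JSON Schema.
interface MiniSchema {
  required: string[];
  properties: Record<string, "string" | "number" | "boolean">;
}

// Collects every problem so the model can fix them all in one retry.
function validateArgs(schema: MiniSchema, args: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const key of schema.required) {
    if (!(key in args)) problems.push(`missing required field "${key}"`);
  }
  for (const [key, value] of Object.entries(args)) {
    const expected = schema.properties[key];
    if (expected === undefined) {
      problems.push(`unknown field "${key}"`);
    } else if (typeof value !== expected) {
      problems.push(`field "${key}" should be ${expected}, got ${typeof value}`);
    }
  }
  return problems;
}

// Validation failures and handler exceptions both come back as results.
function dispatch(
  schema: MiniSchema,
  handler: (args: Record<string, unknown>) => unknown,
  args: Record<string, unknown>,
): ToolResult {
  const problems = validateArgs(schema, args);
  if (problems.length > 0) {
    return { success: false, error: `Invalid arguments: ${problems.join("; ")}` };
  }
  try {
    return { success: true, data: handler(args) };
  } catch (err) {
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

The error string names the offending field and the expected type, which is the part the model needs in order to correct itself on the next turn.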

Tool result formatting

The naive system returns JSON.stringify(result) and calls it a day. This produces several pathologies:

Large objects become walls of unreadable JSON.
The model has to parse structure out of free text on every result.
There's no way for a tool to say "I succeeded but here's a human-readable summary."
There's no way for a tool to suggest what the agent should do next.

The disciplined systems separated the result into multiple channels:

A structured data field with the actual payload.
An observation field — a short human-readable summary the LLM reads first.
A success boolean for unambiguous status.
An optional actions field listing suggested follow-up tools.
Metadata: duration, item counts, whether the result was truncated, where it came from.

What I chose for AIOS: the kernel's ToolResult has success, data, error, observation. The host extends this with StructuredToolResult adding type, summary, fields, actions, and metadata. The actions field is the most distinctive piece — a tool can say "I returned 47 search results; you probably want to call vault.read_note on the top three." That suggestion goes to the model alongside the result, and it measurably reduces the number of turns spent figuring out the next step.
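
A sketch of how the multi-channel result might be typed and flattened into text for the model. The top-level field names follow the paragraph above; the inner shapes of actions and metadata, and the renderForModel function, are assumptions for illustration.

```typescript
interface ToolResult {
  success: boolean;
  data?: unknown;
  error?: string;
  observation?: string;
}

// Assumed shape for a follow-up suggestion.
interface SuggestedAction {
  tool: string;
  reason: string;
}

interface StructuredToolResult extends ToolResult {
  type: string;
  summary: string;
  fields?: Record<string, unknown>;
  actions?: SuggestedAction[];
  metadata?: { durationMs?: number; itemCount?: number; truncated?: boolean };
}

// Status and summary go first, suggestions next, raw payload last.
function renderForModel(result: StructuredToolResult): string {
  const lines: string[] = [];
  lines.push(
    result.success
      ? `OK: ${result.summary}`
      : `FAILED: ${result.error ?? "unknown error"}`,
  );
  if (result.observation) lines.push(result.observation);
  for (const action of result.actions ?? []) {
    lines.push(`Suggested next: ${action.tool} (${action.reason})`);
  }
  if (result.data !== undefined) lines.push(`data: ${JSON.stringify(result.data)}`);
  return lines.join("\n");
}
```

Putting the one-line status ahead of the payload means the model reads the verdict before it reads the JSON, which is the point of separating the channels.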

Result truncation

Across every system I read, the single largest source of accidental context death was a tool returning an unbounded blob. A 200KB file. A 10,000-row query result. A binary asset rendered as base64. One bad result can poison the next twenty turns.

The defensive pattern: every result passes through a truncator before going to the model. If it's over a threshold (typically tens of KB), it gets cut with a "result truncated, N bytes hidden" marker. The original is logged for debugging; only the truncated form goes into history.

What I chose for AIOS: the loop applies a configurable resultMaxChars truncation in the dispatch path, and ToolResultFormatter handles the cut. The cut is character-based rather than token-based — fast, slightly imprecise, fine for the threshold-decision use case.
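
A character-based cut is only a few lines. The sketch below assumes the result has already been serialized to a string; truncateResult and the marker wording are illustrative, not the actual ToolResultFormatter.

```typescript
interface Truncated {
  text: string;
  truncated: boolean;
  hiddenChars: number;
}

// Cuts by UTF-16 code units (string.length), not tokens or bytes.
// Note: a cut at an arbitrary index can split a surrogate pair.
function truncateResult(text: string, resultMaxChars: number): Truncated {
  if (text.length <= resultMaxChars) {
    return { text, truncated: false, hiddenChars: 0 };
  }
  const hiddenChars = text.length - resultMaxChars;
  return {
    text: `${text.slice(0, resultMaxChars)}\n[result truncated, ${hiddenChars} characters hidden]`,
    truncated: true,
    hiddenChars,
  };
}
```

The marker tells the model that data is missing and how much, so it can decide to re-query with a narrower request instead of reasoning over a silently clipped result.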

Retry policy

Tools fail. Some failures are the model's fault, such as bad arguments, and should be surfaced to it. Others are the environment's fault (a network blip, a rate limit, a transient 5xx) and should be hidden and retried. Mixing them up either spams the model with noise or buries real bugs under retries.

The clean answer is to classify errors at the dispatch layer:

Retryable: network errors, timeouts, 429, 502, 503, 504. Retry with exponential backoff and jitter.
Non-retryable: 401, 403, 404, validation errors, structured errors from the tool itself. Surface to the model immediately.

Retries don't count against the agent's turn budget. They're transparent to the model unless every retry attempt fails, at which point the final error gets surfaced as a normal tool result.

What I chose for AIOS: ToolRetryPolicy is a first-class kernel module. Three attempts max, base delay 1000ms, max delay 10000ms, exponential backoff with jitter. An onRetry(attempt, error, delayMs) callback exposes the retry events for logging. This pairs directly with the loop's design — retries are invisible to the model, observable to the operator.
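
A sketch of the classification and backoff logic using the numbers quoted above (three attempts, 1000ms base, 10000ms cap). The ToolFailure shape, the classify callback, and the "half fixed, half random" jitter formula are assumptions; the real ToolRetryPolicy may differ on all three.

```typescript
const RETRYABLE_STATUS = new Set<number>([429, 502, 503, 504]);

// Assumed normalized failure shape produced by the dispatch layer.
interface ToolFailure {
  kind: "network" | "timeout" | "http" | "validation" | "tool";
  status?: number;
}

function isRetryable(failure: ToolFailure): boolean {
  if (failure.kind === "network" || failure.kind === "timeout") return true;
  if (failure.kind === "http") {
    return failure.status !== undefined && RETRYABLE_STATUS.has(failure.status);
  }
  return false; // validation and tool-reported errors go straight to the model
}

// Exponential backoff capped at maxDelayMs; half the delay is fixed, half is jitter.
function backoffDelayMs(
  attempt: number, // 1-based index of the attempt that just failed
  random: () => number = Math.random,
  baseDelayMs = 1000,
  maxDelayMs = 10000,
): number {
  const exponential = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
  return Math.round(exponential / 2 + random() * (exponential / 2));
}

// Retries retryable failures, then rethrows so the caller can wrap the
// final error in a ToolResult.
async function withRetry<T>(
  run: () => Promise<T>,
  classify: (err: unknown) => ToolFailure,
  onRetry: (attempt: number, error: unknown, delayMs: number) => void = () => {},
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await run();
    } catch (err) {
      if (attempt >= maxAttempts || !isRetryable(classify(err))) throw err;
      const delayMs = backoffDelayMs(attempt);
      onRetry(attempt, err, delayMs);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Injecting random into backoffDelayMs keeps the delay calculation deterministic under test, while production code uses the Math.random default.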

Parallelism

When the model emits multiple tool calls in a single response, the naive thing is to run them sequentially. The optimization is to run independent ones in parallel.

"Independent" is the hard word. Reads of distinct files are independent. Writes to the same file aren't. A read after a write to the same path needs ordering. A user-interactive tool (AskUserQuestion) can't run alongside anything.

The disciplined systems used layered heuristics: an allowlist of "always parallel-safe" tools, a denylist of "never parallel" tools, a path-overlap check for file-scoped tools, and a destructive-command detector for shell. Plus a bounded worker pool to cap concurrency.

What I chose for AIOS: ToolMetadataRegistry records allowsParallelExecution for every tool, but the loop still executes tool calls sequentially. This is the same gap I called out in part 2: the metadata is ahead of the dispatcher. Wiring up parallel dispatch against the existing metadata is the single highest-impact unrealized optimization in the kernel.
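
As a sketch of what a dispatcher could do with that metadata, the function below groups one turn's tool calls into ordered batches: calls within a batch could run concurrently, and batches run one after another. PlannedCall, its path and writes fields, and the exact-path conflict rule are assumptions; a real implementation would also need directory-overlap checks and a bounded worker pool.

```typescript
// Hypothetical per-call view combining the call with its metadata.
interface PlannedCall {
  tool: string;
  path?: string; // file the call touches, if any
  writes: boolean;
  allowsParallelExecution: boolean;
}

// Two calls conflict when they touch the same path and at least one writes.
function conflicts(a: PlannedCall, b: PlannedCall): boolean {
  return a.path !== undefined && a.path === b.path && (a.writes || b.writes);
}

// Preserves the model's ordering: a call never moves ahead of an earlier one.
function planBatches(calls: PlannedCall[]): PlannedCall[][] {
  const batches: PlannedCall[][] = [];
  let current: PlannedCall[] = [];
  for (const call of calls) {
    const mustRunAlone = !call.allowsParallelExecution;
    const clashes = current.some((other) => conflicts(other, call));
    if (current.length > 0 && (mustRunAlone || clashes)) {
      batches.push(current); // close the open batch before this call
      current = [];
    }
    current.push(call);
    if (mustRunAlone) {
      batches.push(current); // a non-parallel call is a batch of one
      current = [];
    }
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

For example, two reads of different notes followed by a write to the first note yield two batches: the reads together, then the write on its own.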

Confirmation gates

Some tools should never run without explicit user approval. Destructive shell commands. File deletes. External writes. The metadata layer can flag these; the loop has to honor the flag.

What I chose for AIOS: requiresConfirmation is a metadata field. When a tool with this flag is invoked, the kernel routes through AskUserQuestion-style approval before executing. The host's UserInterface provider renders the approval UI and returns the decision. This is one of the places the kernel and host cooperate most tightly — the kernel knows when to ask, the host knows how to ask.
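
The division of labor can be sketched in a few lines. The version below is synchronous for brevity (a real approval flow would be asynchronous), and the confirm method and executeWithGate function are hypothetical names rather than the actual UserInterface API.

```typescript
interface ToolResult {
  success: boolean;
  data?: unknown;
  error?: string;
}

// Assumed host-side interface: the host decides how the question is shown.
interface UserInterface {
  confirm(prompt: string): boolean;
}

// The kernel decides when to ask; a refusal becomes a normal tool result.
function executeWithGate(
  toolName: string,
  requiresConfirmation: boolean,
  ui: UserInterface,
  run: () => ToolResult,
): ToolResult {
  if (requiresConfirmation && !ui.confirm(`Allow ${toolName} to run?`)) {
    return { success: false, error: `User declined to run ${toolName}` };
  }
  return run();
}
```

Returning the refusal as a result rather than throwing keeps the model informed: it can explain what it wanted to do or pick a non-destructive alternative.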

Meta-tools handled in the kernel

A handful of tools aren't really "tools" in the same sense — they're control-flow primitives for the loop itself. TodoWrite updates the agent's plan. AskUserQuestion blocks for user input. batch_tools runs a sequence in one call.

These belong in the kernel, not in the tool provider. They don't have side effects on the host; they have side effects on the loop. Routing them through the same dispatcher as regular tools muddies the boundary.

What I chose for AIOS: ConversationEngine checks for these names before falling through to ToolProvider.execute(). TodoWrite updates the kernel's todo state and emits a todo:updated event. AskUserQuestion pauses the loop and waits on the host's UserInterface. batch_tools re-enters the dispatch path for each tool in the batch. The host never sees these — they're kernel concerns.
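
A reduced sketch of that routing: meta-tool names are intercepted first, and everything else falls through to the provider. MiniEngine is a stand-in, not ConversationEngine; the parameter shapes for TodoWrite and batch_tools are assumptions, and AskUserQuestion is omitted because it needs an asynchronous pause.

```typescript
interface ToolResult {
  success: boolean;
  data?: unknown;
  error?: string;
}

interface ToolProvider {
  has(id: string): boolean;
  execute(id: string, params: Record<string, unknown>): ToolResult;
}

class MiniEngine {
  todos: string[] = [];
  events: string[] = [];
  private provider: ToolProvider;

  constructor(provider: ToolProvider) {
    this.provider = provider;
  }

  dispatch(id: string, params: Record<string, unknown>): ToolResult {
    switch (id) {
      case "TodoWrite": {
        const todos = params["todos"];
        if (!Array.isArray(todos)) {
          return { success: false, error: "TodoWrite expects a `todos` array" };
        }
        this.todos = todos.map(String);
        this.events.push("todo:updated"); // kernel state changed, host not involved
        return { success: true, data: { count: this.todos.length } };
      }
      case "batch_tools": {
        const calls = params["calls"];
        if (!Array.isArray(calls)) {
          return { success: false, error: "batch_tools expects a `calls` array" };
        }
        // Each entry re-enters dispatch, so meta-tools work inside a batch too.
        const results = calls.map(
          (c: { id: string; params: Record<string, unknown> }) =>
            this.dispatch(c.id, c.params),
        );
        return { success: results.every((r) => r.success), data: results };
      }
    }
    // Regular tools: unknown names become results, never exceptions.
    if (!this.provider.has(id)) {
      return { success: false, error: `Unknown tool: ${id}` };
    }
    return this.provider.execute(id, params);
  }
}
```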

Hot reload and dynamic capabilities

Production tool sets change. MCP servers come and go. Plugin tools load and unload. The skill set grows mid-session as the user installs new skills. A tool system that requires a process restart for any change is fundamentally a dev tool, not a production one.

The cleanest pattern I saw was a notification protocol — tool providers can emit "the tool list changed" events, and the kernel refreshes its catalog without disconnecting the session.

What I chose for AIOS: the ToolProvider interface includes list() and has() for runtime queries, and the loop re-checks the tool catalog at the top of each turn. Hot-loading a new skill makes its tools available on the next turn without restarting the engine. There's no formal list_changed event yet — the loop just re-reads — but the seam exists.

Schema overhead and deferred loading

Every tool's schema goes into the prompt on every request. With many tools, this becomes a substantial fixed cost paid even when most tools won't be called. MCP-rich deployments are where this hurts most.

The cleanest answer I saw was deferred loading: at session start, only tool names and one-line descriptions go into the prompt. The model can ask for a tool by name, and the full schema is fetched on demand and injected for subsequent turns. One extra round-trip in exchange for a much smaller per-turn schema budget.
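
A minimal sketch of the bookkeeping behind deferred loading, assuming a catalog that tracks which tools the model has asked for. DeferredCatalog, ToolSpec, and the method names are hypothetical; how the model requests a schema (a dedicated meta-tool, for instance) is left out.

```typescript
interface ToolSpec {
  name: string;
  summary: string; // the one-line description always shown
  schema: Record<string, unknown>; // the full JSON Schema, shown on demand
}

interface PromptEntry {
  name: string;
  summary: string;
  schema?: Record<string, unknown>;
}

class DeferredCatalog {
  private specs = new Map<string, ToolSpec>();
  private loaded = new Set<string>();

  constructor(specs: ToolSpec[]) {
    for (const spec of specs) this.specs.set(spec.name, spec);
  }

  // Called when the model asks for a tool by name; false for unknown names.
  load(name: string): boolean {
    if (!this.specs.has(name)) return false;
    this.loaded.add(name);
    return true;
  }

  // What goes into the prompt this turn: full schemas only for loaded tools.
  promptEntries(): PromptEntry[] {
    return Array.from(this.specs.values()).map((spec) =>
      this.loaded.has(spec.name)
        ? { name: spec.name, summary: spec.summary, schema: spec.schema }
        : { name: spec.name, summary: spec.summary },
    );
  }
}
```

Once loaded, a schema stays in the prompt for the rest of the session in this sketch; an eviction rule for tools that go unused would be a natural refinement.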

What I chose for AIOS: not implemented. AIOS today bundles a moderate number of tools (~30) and the schema overhead is real but tolerable. If the tool count grows past 100 — either through MCP support or aggressive plugin growth — deferred loading is the obvious next move. The ToolProvider interface is shaped to allow it; it just isn't exercised.

MCP and external tool protocols

Most modern tool systems support an extension protocol for adding external tools, with MCP as the de facto standard. The protocol normalizes schemas, lifecycle, and transport so that external tools appear first-class to the agent.

What I chose for AIOS: no MCP yet. Tools are registered programmatically through SkillToolProvider. This is a deliberate scope choice for a Tauri-native application — most of the tools AIOS needs (vault, graph, search, memory, shell, web) are host-native and benefit from direct integration rather than going through a protocol. MCP is on the roadmap and the wrapping pattern is well-understood; it just isn't urgent for the current workload.

Toolset composition

Different deployments need different tool sets. A CLI session might enable everything. A Slack channel might restrict to read-only. A scheduled job might disable user-interactive tools entirely. The right abstraction is a toolset — a named, composable bundle that can include other toolsets and individual tools.

What I chose for AIOS: today, the host uses the adapter factories (createReadOnlyToolAdapter, createPlanningToolAdapter, createFilteredToolAdapter) for this. They're functional but coarser than named toolsets. The next move, when policy goes multi-tenant, is to introduce named toolsets that the policy cascade can reference by name — "vault-readonly" instead of a list of patterns.
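
Named toolsets that include other toolsets reduce to a small recursive resolution. The sketch below is one possible shape, not a planned AIOS API; the Toolset structure, the resolveToolset function, and every name except "vault-readonly" and vault.read_note are hypothetical.

```typescript
// A toolset lists tools directly and may include other toolsets by name.
interface Toolset {
  tools?: string[];
  includes?: string[];
}

// Flattens a named toolset into a sorted, de-duplicated list of tool names.
function resolveToolset(
  name: string,
  toolsets: Record<string, Toolset>,
  seen: Set<string> = new Set<string>(),
): string[] {
  if (seen.has(name)) return []; // already expanded, also guards against cycles
  seen.add(name);
  const set = toolsets[name];
  if (set === undefined) throw new Error(`Unknown toolset: ${name}`);
  const resolved = new Set<string>(set.tools ?? []);
  for (const child of set.includes ?? []) {
    for (const tool of resolveToolset(child, toolsets, seen)) resolved.add(tool);
  }
  return Array.from(resolved).sort();
}

const exampleToolsets: Record<string, Toolset> = {
  "vault-readonly": { tools: ["vault.read_note", "vault.search"] },
  research: { tools: ["web.fetch"], includes: ["vault-readonly"] },
};
```

A policy rule can then reference "research" and pick up the read-only vault tools transitively, instead of repeating a list of patterns in every rule.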

Owner-only and identity-gated tools

Some tools should only be callable by specific identities — administrative actions, credential changes, scheduling. The naive system has no concept of caller identity at the tool dispatch layer; the right design checks identity at execution.

What I chose for AIOS: not yet needed. Metaglass is single-user today. The seam is in ToolProvider.execute(id, params, context) where context can carry caller identity. When multi-user arrives, identity-gated tools slot into the same policy cascade described above.

The problem-to-mechanism table

The same material reorganized as problem → mechanism.

Problem	Mechanism
Model invents tool names	Return `Unknown tool` as a structured `ToolResult`, never throw
Model passes bad arguments	Schema validation at dispatch; validation error becomes tool result
Tool returns 200KB blob	`ToolResultFormatter` truncates to `resultMaxChars` before history
Transient network/rate-limit failures	`ToolRetryPolicy` with exponential backoff + jitter, retries off-budget
Two tool calls hit the same file	Metadata flags + (planned) parallel dispatcher with path-overlap check
Destructive tool runs without consent	`requiresConfirmation` metadata + kernel approval route
Tools become a context tax	Toolset filtering today; deferred schema loading planned
Tool fails silently with no observable status	`success` + `observation` + structured `error` in every result
Model doesn't know what to do next after a result	`actions` field on `StructuredToolResult` for follow-up suggestions
Tool list changes mid-session	`ToolProvider` re-queried per turn; hot reload supported
Same agent needs different tools per context	Adapter factories today; policy cascade as the proper next step
Long-running tool blocks the loop	`AbortController` propagated into `execute(id, params, context)`
Tool-call vs. loop-control concerns mix	Kernel meta-tools (TodoWrite, AskUserQuestion, batch_tools) handled before `ToolProvider`
External tools need to plug in	`ToolProvider` interface is the wrapping point; MCP a future adapter
Caller identity matters for some tools	`context` parameter on `execute()`; cascading policy as future use

What comes next

The AIOS tool system ships with the load-bearing pieces: a minimal Tool interface with ToolMetadataRegistry riding alongside, a clean ToolProvider boundary that keeps the kernel ignorant of what's behind it, schema validation that returns errors as results instead of throwing, structured ToolResults with observations and follow-up actions, ToolRetryPolicy with retryable/non-retryable classification, ToolResultFormatter with size-based truncation, confirmation gates for destructive tools, and meta-tool handling in the kernel for loop-control primitives.

The deferred items — parallel dispatch against existing metadata, deferred schema loading, MCP integration, cascading policy engine for multi-tenant scenarios, identity-gated tools, named toolsets — are all additive at known seams. None requires restructuring the existing ToolProvider boundary, which is the test the design has to pass.

Parallel dispatch remains the highest-impact deferred optimization. The cascading policy engine is the highest-impact strategic item: without it, AIOS doesn't scale past single-user, and the metadata for it is already in ToolMetadataRegistry waiting to be read by a resolver that doesn't exist yet.

Three lessons

If I had to compress this survey into advice for someone designing a tool system from scratch:

Tools are not handlers; tools are a pipeline. The dictionary-of-handlers model is a category error. A working tool system is registration → schema validation → policy gating → execution → retry → result formatting, with each stage independently designed. If you can't point at where each of those stages lives in your code, you don't have a tool system — you have a switch statement that occasionally calls functions.
Tool descriptions are prompt, not docs. The description field is the most overlooked piece of prompt engineering in any agent system. Treat each one as a small piece of behavioral specification: when to call it, when not to, what good arguments look like, what the result means. Vague descriptions cost more in wrong tool selections than verbose ones cost in tokens. Lint them, review them, and write them with the same care as the system prompt itself.
Errors are results, not exceptions. Every error path that crosses the tool-system boundary should land in the model's hands as a structured ToolResult, not as a thrown exception in the loop. Schema violations, retry-budget exhaustion, missing tools, validation failures, sandbox denials — all of them become results the model reads and reacts to. This single rule eliminates an entire category of harness bug and unlocks the model's ability to self-correct.

The rest is tuning. Truncation thresholds, retry counts, parallelism heuristics, policy resolution order — dials, not decisions. The decisions are the three above, and the tool systems that get them right end up looking remarkably similar even when their authors didn't talk to each other.

Earlier in this series: Part 1: Context management · Part 2: The agent loop.