Designing the agent loop for Metaglass AIOS, part 2: turns, tools, and termination

This is the second piece in a series on designing the Metaglass AIOS kernel. The first was on context management. This one is about the structure beneath context management — the loop itself. The thing that calls the LLM, runs the tools, decides whether to go again, and decides when to stop.

The agent loop is the part of any harness that looks deceptively simple in pseudocode and turns out to contain almost every interesting decision the system makes. I'm going to walk through the design space I found in the field, then through the choices I made for AIOS — what's in the kernel today, what's deferred, and what I'd defend hardest under scrutiny.

Why naive implementations fall short

The first agent loop anyone writes looks like this:

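def run_agent(history):
    response = call_llm(history)  # initial call, so the loop has a response to test
    while response.has_tool_calls:
        results = run_tools(response.tool_calls)
        history.extend([response, *results])  # assumes run_tools returns a list of results
        response = call_llm(history)
    return response.text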

That loop is correct in the sense that, on a well-behaved task, it terminates and produces output. It's wrong in almost every other sense. The first time you run it on a real task you'll discover most of the following:

The model loops forever on a tool that keeps returning the same error.
A single tool call takes 90 seconds and the user has no way to interrupt it.
A transient API error kills the run and there's no way to resume.
The agent calls 30 tools in series when 5 of them could have run in parallel.
The tool result is 50,000 tokens of binary garbage and now the whole context is poisoned.
The model decides it's "done" after one turn because it doesn't realize it has more work to do.
The agent works fine for 8 turns and then silently produces nonsense because the history got into an invalid shape.

Every one of these is a separate sub-problem inside what was supposed to be a five-line loop. The naive loop survives demos; it does not survive long-running tasks. The job of a real agent loop is to be a small, simple shape on the outside, and to absorb all of these failure modes on the inside without growing tentacles.

The shape the problem keeps taking

After reading enough loop implementations across the field, I started seeing the same skeleton everywhere:

Outer control — turn counter, wall-clock budget, abort signal, error threshold. The supervisory layer that owes the user a termination guarantee.
Inner control — for each turn: call the model, decode the response, dispatch tool calls, collect results, repair the history, decide whether to continue. The execution layer that owes the model a well-formed conversation.
Cross-cutting concerns — streaming, interruption, retries, loop detection, parallelism, verification. The stuff that doesn't belong cleanly to either layer but has to live somewhere.

The differences between harnesses are almost entirely in how they organize that third bucket. The outer and inner shapes converge. The cross-cutting concerns are where opinions live.

That shape became the spine of the AIOS design. Outer control sits in runLoop(). Inner control is the per-turn body. The cross-cutting concerns each live in their own kernel module — LoopDetector, ToolRetryPolicy, MessageRepair, CheckpointManager, ReflectionEngine, VerificationEngine — and the loop composes them rather than embedding them.

Mechanism by mechanism

Single-threaded vs. multi-threaded

The first design choice every harness makes is whether the loop is a single thread of execution or a graph of cooperating agents. The strongest single-threaded systems I read were ruthlessly explicit about this — one flat message history, one loop, no orchestration overlay. Multi-threaded designs existed but tended to pay for themselves only in narrow cases.

The argument for single-threaded is that it's debuggable. You can read the history top to bottom and reconstruct exactly what happened. With multi-agent graphs, the same task can produce ten different traces and you can't tell which one your bug came from.

What I chose for AIOS: single-threaded by default. One ConversationEngine instance, one history array, one turn counter. Multi-agent capability exists as a deliberate, scoped extension via TaskSpawner — each child is a fully isolated ConversationEngine with its own ports and its own budget — but the master loop never branches. Complexity goes into the tool system, not the loop.

Blended phases vs. plan-then-execute

The classic structured design separates planning from execution: the agent first writes a plan, then executes it step by step. The classic ReAct design blends them — the agent freely interleaves reading, editing, verifying, re-reading.

Plan-then-execute is easier to monitor and easier to interrupt at a clean boundary. Blended is more flexible and handles surprise better. Almost every loop I read in production code was blended; the plan-first systems tended to be either research artifacts or specialized for narrow domains.

What I chose for AIOS: blended phases, with optional gradient guidance toward planning when the input warrants it. IntentClassifier runs once before the loop and classifies the input as TRIVIAL | SIMPLE_QUERY | MULTI_STEP | COMPLEX. For MULTI_STEP and COMPLEX, the loop nudges the model toward writing a todo plan via soft-prompting that decays as turns progress. The agent isn't forced to plan — it's reminded that planning exists. This is the "soft" middle between rigid plan-then-execute and pure ReAct, and it works because the reminder is cheap and the model is free to ignore it.

Termination conditions

A loop has to know when it's done. There is no single condition that covers all cases; in every harness I looked at, termination was a list of conditions, each catching a different failure mode.

The list converges to roughly this:

Condition	Why it exists
Model emits no tool calls	Natural completion — the model decided it's done
Max turns reached	Hard ceiling on iterations
Wall-clock timeout	Hard ceiling on real time
Abort signal	User-requested cancellation
Token/cost budget exceeded	Hard ceiling on money
Unrecoverable error	API failure with no retry budget left
Loop detected	The model is going in circles

Every working system implements at least the top four. The remaining three are differentiators.
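
One way to keep termination a list rather than a single condition is to evaluate every condition in one place and return the first that fires. The sketch below is illustrative only: the LoopState and Limits shapes, the reason names, and the check order are assumptions, not the AIOS types, and how a harness reacts to each reason is a separate decision.

```typescript
// Hypothetical reason names, one per row of the table above.
type StopReason =
  | "no_tool_calls" | "max_turns" | "timeout" | "aborted"
  | "budget_exceeded" | "unrecoverable_error" | "loop_detected";

// Hypothetical snapshot of loop state; field names are illustrative.
interface LoopState {
  turn: number;
  startedAtMs: number;
  tokensUsed: number;
  lastResponseHadToolCalls: boolean;
  aborted: boolean;
  fatalError: boolean;
  loopDetected: boolean;
}

interface Limits { maxTurns: number; timeoutMs: number; maxTokens: number; }

// Returns the first stop reason that applies, or null to keep going.
// The order is a design choice: external signals and hard ceilings are
// reported ahead of natural completion.
function shouldStop(s: LoopState, l: Limits, nowMs: number): StopReason | null {
  if (s.aborted) return "aborted";
  if (s.fatalError) return "unrecoverable_error";
  if (nowMs - s.startedAtMs >= l.timeoutMs) return "timeout";
  if (s.turn >= l.maxTurns) return "max_turns";
  if (s.tokensUsed >= l.maxTokens) return "budget_exceeded";
  if (s.loopDetected) return "loop_detected";
  if (!s.lastResponseHadToolCalls) return "no_tool_calls";
  return null;
}
```

Because each condition is one line in one function, checking that every row of the table has a code path is a matter of reading seven statements.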

What I chose for AIOS: all seven, with default bounds of 50 turns and a 10-minute wall-clock timeout. The LLM call itself is wrapped in Promise.race([llmPromise, timeoutPromise]) so a slow provider can't burn the entire budget on one turn. The abort signal flows from the host into the kernel via an AbortController that the engine owns and exposes. The loop-detection condition lives in a dedicated LoopDetector module.
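
A per-call timeout built on Promise.race can be sketched as follows. One caveat worth making explicit: Promise.race only stops waiting for the slower promise, it does not cancel it, so the sketch also aborts the underlying call through an AbortSignal. The withTimeout helper and its signature are hypothetical, not the AIOS implementation.

```typescript
// Rejects with a timeout error if `work` has not settled within `ms`.
// `work` receives an AbortSignal so the losing call can be cancelled
// rather than left running in the background.
async function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // tell the in-flight call to stop
      reject(new Error(`LLM call timed out after ${ms} ms`));
    }, ms);
  });
  try {
    return await Promise.race([work(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer when the call wins the race
  }
}
```

A caller would pass the signal on to whatever performs the request, for example as the `signal` option of `fetch`.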

Turn counting

Surprisingly subtle. Does a final text-only response count as a turn? Does a programmatic tool call count? Do a sub-agent's turns count against the parent's budget?

The cleanest systems I read had explicit rules:

A "turn" is one round trip: model produces output with tool calls, harness runs the tools, results feed back.
A final text-only response does not count.
Programmatic tool calls (the agent calling itself, basically) can be refunded to the budget so they don't penalize the agent for being well-organized.
Sub-agent budgets are separate from the parent's budget.

What I chose for AIOS: all four rules above. Sub-agent budgets are separate by design: each spawned ConversationEngine has its own counter, which means the total work across the tree can exceed the parent's cap, but each branch is independently bounded.
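
The counting rules fit in a very small budget object. This is a sketch under assumptions: the TurnBudget class and its method names are hypothetical and exist only to make the rules concrete.

```typescript
// Hypothetical budget tracker; names are illustrative only.
class TurnBudget {
  private used = 0;
  private readonly maxTurns: number;

  constructor(maxTurns: number) {
    this.maxTurns = maxTurns;
  }

  // Charge one turn per model -> tools -> results round trip.
  // A final text-only response never calls this, so it costs nothing.
  chargeRoundTrip(): void {
    this.used += 1;
  }

  // Give a turn back, e.g. for a programmatic tool call; never below zero.
  refund(): void {
    this.used = Math.max(0, this.used - 1);
  }

  get remaining(): number {
    return this.maxTurns - this.used;
  }

  get exhausted(): boolean {
    return this.used >= this.maxTurns;
  }

  // A sub-agent gets a fresh counter, so its turns never touch the parent's.
  spawnChild(maxTurns: number): TurnBudget {
    return new TurnBudget(maxTurns);
  }
}
```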

Parallel tool execution

The model often emits multiple tool calls in a single response. The naive thing is to run them sequentially. The optimization is to run independent ones in parallel.

"Independent" is the hard word. Two read_file calls on different files are independent. Two write_file calls on the same file are not. Two reads of the same file are independent if no write has happened in between. The most granular system I saw used a layered heuristic:

An explicit allowlist of read-only "always safe to parallelize" tools.
An explicit denylist of tools that must be sequential (e.g., anything that asks the user a question).
For file-scoped tools, a path-overlap check.
For shell commands, a destructive-command detector (rm, mv, sed -i, git reset, etc.).
A bounded worker pool, typically 8 workers, to cap concurrency.

What I chose for AIOS: the loop has metadata for parallelism via ToolMetadataRegistry, but executes tools sequentially today. The single biggest unrealized optimization in the kernel is wiring up parallel execution against that metadata. The skeleton is there; the dispatcher isn't. This is the highest-impact deferred item on the list.
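
The layered heuristic described above could drive a dispatcher roughly like the sketch below, which only plans the batches and leaves execution out. Everything here is an assumption for illustration: the ToolCall shape is hypothetical, path overlap is reduced to exact string equality, the destructive-shell detector is omitted, and the worker pool is approximated by a cap on batch size.

```typescript
// Hypothetical metadata attached to each call.
interface ToolCall {
  name: string;
  path?: string;        // set for file-scoped tools
  writes: boolean;      // true if the call mutates state
  sequential?: boolean; // denylisted: must run alone (e.g. asks the user)
}

// Two calls conflict if either must run alone, or if at least one of them
// writes and they share a path (or one of them has no path to compare).
function conflicts(a: ToolCall, b: ToolCall): boolean {
  if (a.sequential || b.sequential) return true;
  if (!a.writes && !b.writes) return false;
  if (a.path === undefined || b.path === undefined) return true;
  return a.path === b.path;
}

// Greedily packs calls into ordered batches. Calls inside one batch are
// mutually independent; batches themselves run one after another, so the
// model's original ordering of conflicting calls is preserved.
function planBatches(calls: ToolCall[], maxBatch = 8): ToolCall[][] {
  const batches: ToolCall[][] = [];
  let current: ToolCall[] = [];
  for (const call of calls) {
    const fits =
      current.length < maxBatch && current.every((c) => !conflicts(c, call));
    if (!fits && current.length > 0) {
      batches.push(current);
      current = [];
    }
    current.push(call);
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

Each batch could then be awaited with Promise.all before the next batch starts.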

Tool retry policy

Tools fail. Sometimes the right move is to surface the failure to the model and let it decide; sometimes the right move is to retry transparently because the failure was transient (rate limit, network blip, brief 5xx). Mixing those policies up burns the agent's turn budget on noise.

The disciplined systems I read split errors into two classes:

Retryable (network, rate limit, idempotent 5xx) — the harness retries silently with backoff, up to a small bound, before surfacing to the model.
Non-retryable (validation, permission denied, structured error from the tool) — surfaced immediately so the model can react.

What I chose for AIOS: ToolRetryPolicy is a first-class kernel module. It classifies errors, applies bounded exponential backoff for retryable ones, and only surfaces failures to the model after the retry budget is exhausted. The retries don't count against the agent's turn budget.
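
The retryable versus non-retryable split can be sketched as a wrapper around a single tool invocation. The `retryable` flag on the thrown value, the helper names, and the default bounds are assumptions; the text does not specify how ToolRetryPolicy classifies errors.

```typescript
// Hypothetical convention: a failed tool throws a value carrying a
// `retryable` flag set by whatever classifier the harness uses.
function isRetryable(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { retryable?: unknown }).retryable === true
  );
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `fn`, retrying retryable failures with exponential backoff.
// Non-retryable errors, and the final retryable one, propagate to the
// caller, which would then surface them to the model.
async function runWithRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 200,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isRetryable(err) || attempt >= maxRetries) throw err;
      await sleep(baseDelayMs * 2 ** attempt); // 200, 400, 800 ms by default
    }
  }
}
```

Since the wrapper sits below the turn counter, retries absorbed here never show up as extra turns.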

Mid-loop steering

A loop that the user can't interrupt mid-task is a loop that runs the wrong way to completion. Every interactive harness needs both interruption (stop) and steering (correct without stopping).

The pattern I saw most often: a dual-buffer or queue that allows new messages to be injected into an active run. The loop reads from the queue at the top of each turn. If something is waiting, it gets folded into the next prompt. The user can correct the agent mid-task without restarting.

What I chose for AIOS: abort/cancellation is fully wired (the AbortController propagates through every layer including the LLM call itself). Mid-run message injection is on the roadmap — the queue shape is straightforward, and the per-turn check at the top of the loop is the natural place to drain it. Right now the model gets one prompt at the start of each execute() call and steering means waiting for the next call.
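
Since mid-run injection is not built yet, the following is only a sketch of the queue pattern described above, not AIOS code. The SteeringQueue class and the Message shape are hypothetical.

```typescript
interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
}

// Hypothetical steering queue: the host pushes at any time, the loop
// drains once at the top of each turn.
class SteeringQueue {
  private pending: string[] = [];

  push(text: string): void {
    this.pending.push(text);
  }

  // Moves all waiting messages into the history, in arrival order,
  // and returns how many were injected.
  drainInto(history: Message[]): number {
    const batch = this.pending;
    this.pending = [];
    for (const text of batch) {
      history.push({ role: "user", content: text });
    }
    return batch.length;
  }
}
```

Draining only at the turn boundary means a correction never lands between a tool call and its result.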

History repair

Tool results get pruned. Old turns get summarized. Some providers enforce strict turn-ordering and role-alternation rules. The history that the loop hands to the LLM has to be well-formed regardless of what happened to it in between.

The clean answer is a dedicated pass that runs immediately before each LLM call: walk the history, repair any structural violations, ensure tool calls are paired with tool results, ensure roles alternate correctly. The repair pass is invisible most of the time, but the one time it isn't, it's the difference between a 400 error from the API and a successful turn.

What I chose for AIOS: MessageRepair is its own kernel module, invoked from the loop before each LLM call. It's defensive code — it should almost never have to do anything — but having it there means context-management aggressiveness is decoupled from provider strictness. We can prune harder because we know the repair pass will fix any breakage on the way out.
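
To make the idea concrete, here is a sketch of one repair rule, pairing tool calls with tool results. It is not the MessageRepair module: the Msg shape and the placeholder text are assumptions, and role alternation is left out.

```typescript
// Hypothetical message shape with explicit tool-call ids.
type Msg =
  | { role: "user"; content: string }
  | { role: "assistant"; content: string; toolCallIds?: string[] }
  | { role: "tool"; toolCallId: string; content: string };

// Ensures every tool call has a result and every result has a call.
// Orphaned results are dropped; missing results get a placeholder so
// the provider sees a complete pair.
function repairToolPairs(history: Msg[]): Msg[] {
  const called = new Set<string>();
  const answered = new Set<string>();
  for (const m of history) {
    if (m.role === "assistant") {
      for (const id of m.toolCallIds ?? []) called.add(id);
    }
    if (m.role === "tool") answered.add(m.toolCallId);
  }

  const out: Msg[] = [];
  for (const m of history) {
    if (m.role === "tool") {
      if (called.has(m.toolCallId)) out.push(m); // drop orphaned results
      continue;
    }
    out.push(m);
    if (m.role !== "assistant") continue;
    for (const id of m.toolCallIds ?? []) {
      if (!answered.has(id)) {
        out.push({ role: "tool", toolCallId: id, content: "[result pruned]" });
      }
    }
  }
  return out;
}
```

On an already valid history the function returns the same messages in the same order, which is the cheap common case.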

Loop detection

The most insidious failure mode of a tool-calling agent is the loop where the model keeps trying the same thing in the same way and getting the same error. The naive turn limit will eventually catch it, but by then the budget is gone.

The detection heuristic is straightforward: hash the last N tool calls and their results; if the same hash repeats more than K times in a row, the agent is stuck. Break the loop, surface a structured signal to the model ("you tried this three times and it failed three times — try something else"), and let it adapt.

What I chose for AIOS: LoopDetector is a kernel module. The heuristic is conservative — same tool, same argument hash, same error class, three times in a row. When it fires, the loop doesn't terminate; it injects a diagnostic message into the next prompt and lets the model recover. Hard termination is reserved for the budget and timeout conditions.
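
The conservative heuristic (same tool, same arguments, same error class, three times in a row) can be sketched as a small streak counter. The RepeatDetector class is hypothetical, and JSON.stringify stands in for a real argument hash.

```typescript
// Hypothetical detector: counts consecutive identical failures and
// reports when the streak reaches the threshold.
class RepeatDetector {
  private lastKey: string | null = null;
  private streak = 0;
  private readonly threshold: number;

  constructor(threshold = 3) {
    this.threshold = threshold;
  }

  // Records one tool outcome. `errorClass` is null on success, which
  // resets the streak. Returns true when the agent looks stuck.
  record(tool: string, args: unknown, errorClass: string | null): boolean {
    if (errorClass === null) {
      this.lastKey = null;
      this.streak = 0;
      return false;
    }
    // JSON.stringify is sensitive to key order; a production version
    // would canonicalize the arguments before hashing.
    const key = `${tool}|${JSON.stringify(args)}|${errorClass}`;
    this.streak = key === this.lastKey ? this.streak + 1 : 1;
    this.lastKey = key;
    return this.streak >= this.threshold;
  }
}
```

When `record` returns true, the caller decides what to do with the signal, for example appending a diagnostic message to the next prompt.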

Verification and reflection

The naive loop trusts the model when it says "done." A harder problem: how do you know the work is actually done?

Two patterns I saw, often used together:

Verification — after the loop terminates naturally, run a separate pass that checks the work against the original goal. If verification fails, re-enter the loop with the verification feedback as a new prompt.
Reflection — periodically (or at termination), let the model write a structured reflection: what was the goal, what did I do, what's left? The reflection feeds back into the next turn or the next session.

Both add cost. Both prevent the failure mode where the agent declares victory and walks away from broken work.

What I chose for AIOS: VerificationEngine and ReflectionEngine are both kernel modules, both off by default, both enabled by config. The cost is real and not every task needs them — a quick file lookup doesn't earn a reflection — so they're opt-in at the engine level rather than baked into every run.

Checkpointing

A loop that can't resume is a loop that loses everything when it crashes. For long-running tasks (the whole point of the kernel), this is unacceptable.

The pattern: after each turn, serialize the loop state — current turn number, full message history, active todos, abort state — to durable storage. On resume, deserialize and pick up where the loop left off. The interface is small; the engineering discipline to actually checkpoint correctly (atomic writes, schema versioning, partial-write recovery) is the hard part.

What I chose for AIOS: CheckpointManager and ConversationStore together handle this. After each turn the engine writes a JSONL transcript and updates a checkpoint blob. resume() is a first-class entry point alongside execute(). This is the property that lets a 50-turn task survive a crash on turn 23.
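
Two of the disciplines named above, schema versioning and partial-write recovery, can be sketched at the serialization layer without touching storage. The Checkpoint shape and function names are assumptions, not the CheckpointManager format; atomic writes (typically write to a temporary file, then rename) are left out.

```typescript
const SCHEMA_VERSION = 1;

// Hypothetical checkpoint shape; the real blob format is not given.
interface Checkpoint {
  schemaVersion: number;
  turn: number;
  history: { role: string; content: string }[];
  todos: string[];
}

function serializeCheckpoint(
  turn: number,
  history: Checkpoint["history"],
  todos: string[],
): string {
  const cp: Checkpoint = { schemaVersion: SCHEMA_VERSION, turn, history, todos };
  return JSON.stringify(cp);
}

// Returns the checkpoint, or null if the blob is truncated, malformed,
// or written by a schema version this code does not understand.
function parseCheckpoint(blob: string): Checkpoint | null {
  let raw: unknown;
  try {
    raw = JSON.parse(blob);
  } catch {
    return null; // e.g. a partial write cut the JSON short
  }
  if (typeof raw !== "object" || raw === null) return null;
  const cp = raw as Partial<Checkpoint>;
  if (cp.schemaVersion !== SCHEMA_VERSION) return null;
  if (typeof cp.turn !== "number") return null;
  if (!Array.isArray(cp.history) || !Array.isArray(cp.todos)) return null;
  return cp as Checkpoint;
}
```

A resume path that gets null back can fall back to the previous checkpoint instead of crashing on a half-written one.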

Streaming vs. batch

Streaming responses token by token is visually nice and lets the harness cut a response off early, before the remaining tokens have been generated. Batch responses are simpler: the loop waits for the full response, then dispatches tools.

The interesting observation: most production loops are internally batch, with streaming added on as a UI layer rather than as a loop primitive. The control flow is the same either way; streaming just changes when the user sees what.

What I chose for AIOS: batch internally, streaming exposed at the UI layer via the provider adapter. The LLMProvider interface defines both chat() and stream(); the loop uses chat(). This keeps the loop reasoning about complete responses (which simplifies retries, repair, and parallelism) while still letting the UI render tokens as they arrive.

Effort levels

Some tasks deserve more reasoning than others. A cheap request, such as listing files, should not get the same depth of thought as an expensive refactor.

The pattern that worked in the field: an effort knob (low/medium/high/max) that adjusts the model's reasoning depth without changing the loop structure. Same code path, different reasoning budget.

What I chose for AIOS: not yet exposed at the kernel level, but the seam is there in the LLM provider config. Right now the engine selection (MinimalEngine vs. MetaglassEngine) plays a similar role at a coarser grain — the minimal engine skips intent classification, todo guidance, reflection, and verification, so the loop overhead drops to near-zero for trivial tasks. A finer-grained effort knob is a future refinement, not a re-architecture.

Pre-loop intent classification

Most harnesses treat all inputs the same and let the loop figure it out. A small minority classify the input first and pick a path.

The cost is one cheap LLM call (or only a regex match for obvious cases). The benefit is that trivial inputs don't pay the cost of the full loop, and complex inputs get nudged toward planning before they start.

What I chose for AIOS: IntentClassifier runs pre-loop. Phase one is a regex pass that catches ~30–40% of inputs instantly (greetings, simple lookups, explicit multi-step requests). Phase two is a Haiku-tier LLM call for everything else. The output drives the gradient TodoWrite guidance for the rest of the run. This is the single most opinionated design choice in the AIOS loop and the one I'd defend most carefully — most loops don't do this, and the reason most loops don't do this is that it adds a fixed overhead to every run. For Metaglass's workload (vault-aware editing where the input variety is huge) the overhead pays back.
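
The two-phase shape can be sketched as a regex pass with an injected fallback. The patterns below are made up for illustration (the real rule set is not given here), and `llmClassify` is a hypothetical callback standing in for the small-model call.

```typescript
type Intent = "TRIVIAL" | "SIMPLE_QUERY" | "MULTI_STEP" | "COMPLEX";

// Illustrative patterns only; checked in order, first match wins.
const RULES: { pattern: RegExp; intent: Intent }[] = [
  { pattern: /^(hi|hello|hey|thanks|thank you)[\s!.]*$/i, intent: "TRIVIAL" },
  { pattern: /\b(first|then|after that|step \d+)\b/i, intent: "MULTI_STEP" },
  { pattern: /^(what|where|who|when) (is|are|was|were)\b/i, intent: "SIMPLE_QUERY" },
];

// Phase one: return an intent if any rule matches, otherwise null.
function classifyByRegex(input: string): Intent | null {
  const text = input.trim();
  for (const rule of RULES) {
    if (rule.pattern.test(text)) return rule.intent;
  }
  return null;
}

// Phase two is injected, so the sketch stays independent of any provider.
// The fallback is only awaited when the regex pass has no answer.
async function classifyIntent(
  input: string,
  llmClassify: (input: string) => Promise<Intent>,
): Promise<Intent> {
  return classifyByRegex(input) ?? (await llmClassify(input));
}
```

Inputs the regex pass recognizes never reach the model, which is where the fixed overhead is recovered.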

The problem-to-mechanism table

The same material reorganized as problem → mechanism, so it can be read as a checklist.

Problem	Mechanism
Agent runs forever	Turn cap + wall-clock timeout + abort signal
Single slow call burns the whole budget	`Promise.race` between LLM call and per-call timeout
Loop oscillates on a failing tool	Dedicated `LoopDetector` with hash-based recurrence check
Transient API errors kill the run	`ToolRetryPolicy` with bounded backoff, retries off-budget
User can't interrupt mid-task	`AbortController` propagated end-to-end; future mid-run message queue
Provider rejects malformed history after pruning	Dedicated `MessageRepair` pass before every LLM call
Trivial inputs pay the cost of the full loop	Pre-loop intent classification + dual-engine selection
Model declares victory on broken work	Optional `VerificationEngine` after natural termination
Loop state is lost on crash	`CheckpointManager` + JSONL transcript, `resume()` entry point
Parallelizable tool calls run sequentially	(deferred) parallel dispatcher reading `ToolMetadataRegistry`
Mid-task user steering requires restart	(deferred) mid-run message queue drained at top of each turn
Effort doesn't match task	Engine selection today; finer effort knob deferred
Sub-agent budget pollutes the parent's	Separate `ConversationEngine` per spawn with its own counter
Streaming complicates control flow	Batch internally; stream at the UI layer only
Cross-cutting concerns tangle the loop	Each concern lives in its own kernel module; the loop composes

What comes next

The AIOS loop ships with the load-bearing pieces in place: turn-bounded outer control with seven termination conditions, per-call timeout via Promise.race, abort propagation, intent classification, gradient todo guidance, tool retry policy, message repair, loop detection, checkpoint + resume, and isolated sub-agent budgets. The deferred items (parallel tool execution against existing metadata, mid-run message injection, finer-grained effort knobs, and turning the currently opt-in reflection/verification on by default) are all additive against the existing kernel module shape. None of them requires touching runLoop() itself, which is the test I care about.

Parallel tool execution is the single most consequential deferred item, both because the metadata to drive it already exists and because the latency win on multi-tool turns is large. It's the next thing I'd build.

Three lessons

If I had to compress this survey into advice for someone designing an agent loop from scratch:

Keep runLoop() tiny; push complexity into modules. Every loop I read that tried to inline retries, repair, parallelism, and loop detection into the main while-loop became unreadable within a year. Every loop that pushed those concerns into named, composable modules stayed readable. The loop's job is to call the model, dispatch the tools, and decide whether to continue. Everything else is a module.
Termination is a list, not a condition. The single most common failure mode in naive loops is treating "no tool calls" as the only stopping condition. Real loops need at least four (no tool calls, max turns, wall-clock timeout, abort signal) and benefit from three more (budget, error threshold, loop detection). Write them down as a table; check that each one has an actual code path that handles it.
Repair the history before every call, even when you think you don't need to. Pruning, summarization, sub-agent integration, and provider rules all conspire to produce histories that are almost well-formed. Almost is the dangerous part. A defensive repair pass costs almost nothing when the history is already valid and saves the entire run when it isn't. It's the cheapest insurance in the kernel.

The rest is tuning. Turn limits, retry counts, classification thresholds, parallelism heuristics — dials, not decisions. The decisions are the three above, and the loops that get them right end up looking remarkably similar even when their authors didn't talk to each other.

Part 1 in this series: Designing context management for Metaglass AIOS.