Over the past several months I've been designing the context-management layer for Metaglass AIOS — the agentic kernel that powers Metaglass — and along the way I read the internals of every long-running agent harness I could get my hands on. Different teams, different stacks, different opinions about almost everything, and yet a surprising convergence on the shape of the problem. Context management is the hardest part of building an agent that can run for more than a few turns, and most of the interesting design decisions in any harness end up being context decisions in disguise.
What follows is a walk through the design space as I encountered it: the recurring problems, the mechanisms that solve them, and the choices I made for Metaglass AIOS. I'm writing this partly as a record of why the kernel looks the way it does, and partly because if you're building one of these systems yourself, the design space is smaller than it first appears, and most of the dials worth turning are the same dials.
Why naive implementations fall short
Naive long-running agents fail the same way every time. The conversation grows, tool outputs balloon, the model loses track of the original goal, and somewhere around the 80% mark the agent either starts repeating completed work or politely informs you it can't proceed. The window is finite; the task is not.
Three forces push against each other inside that window:
- History — the original goal, prior decisions, failed approaches. The agent needs these to stay coherent.
- Recency — the current file state, the last tool error, the user's most recent correction.
- Relevance — what's needed right now but isn't in the conversation yet: a file, a memory, a graph neighbor.
A good context strategy is a budget allocation across those three. Get it wrong and the agent forgets the goal, stares at the wrong file, or wastes tokens on history it will never use. The first cut of any harness I read tried to handle this with a single mechanism — usually "summarize when full" — and the first cut always lost.
The shape the problem keeps taking
After reading enough of these systems, the same three-layer shape kept emerging regardless of whose code I was looking at:
- Assemble — pull what's relevant for this turn from outside the conversation (files, graph neighbors, vector hits, memory).
- Compress — when history grows too large, summarize the middle while protecting the head and tail.
- Persist — make sure anything that must survive compression actually does (instruction files re-injected from disk, durable memories written before the discard, parent-session chains in a database).
Every production system I studied implements all three. The interesting differences are in where the dial sits and who decides what's important. That insight became the spine of the Metaglass AIOS design: build all three layers, draw clear lines between them, and let each layer be tuned independently.
Mechanism by mechanism
Threshold-triggered compaction
The most basic move: pick a percentage of the context window, and when usage crosses it, summarize. The systems I looked at sat all over this dial — some compressed as early as 50% of the window with a cheap model, others waited until 95% and leaned on a large window plus server-side summarization. Some used token-budget guards with hard floors instead of a single percentage.
There's no "right" threshold. A lower number means more frequent compressions, each cheaper but more numerous, and more cumulative loss. A higher number means more headroom for tool calls between compressions, at the risk of a sudden expensive summary right when the agent most needs to think.
What I chose for AIOS: 70% of a 100K-token budget, by default. Conservative enough to leave runway for a final burst of tool calls, aggressive enough that I never see the agent hit the ceiling. The trigger lives inside ConversationEngine.runLoop(), evaluated before each LLM call.
Head/tail protection
Compress the middle; never compress the ends. This was the single most universal pattern I saw — every system that worked, worked because of this.
The intuition: the opening of the conversation carries the goal, and the closing carries the immediate state. Both are disproportionately load-bearing. Everything else is fair game. Different systems drew the line in different places — a fixed number of head messages plus a token-budgeted tail in some, a fixed N most recent turns in others — but the principle was identical.
What I chose for AIOS: preserve the last 5 turns verbatim; let the summarizer touch everything else. If I implement nothing else from this list, I implement this.
Structured summary templates
Free-form summaries drift. The most disciplined systems I read forced every summary against a fixed template — Goal, Progress, Decisions, Files, Next Steps. The template does two jobs. It forces the summarizer to extract decision-relevant information rather than ambient chatter, and it gives the downstream model a predictable shape to reason against. Without structure, summaries gravitate toward narrative recap; with structure, they encode state.
What I chose for AIOS: the kernel's summarization prompt explicitly names what to preserve — key decisions, important info, tools called, user preferences. It's a step short of a strict template, and a fair next move is to tighten it into one.
Iterative summaries
A naive compactor re-summarizes raw history every time it fires. The loss is compounded, not linear — a little more of the early conversation disappears on each pass. The cleanest systems I read fed the previous summary back into the summarizer so the next compaction folds it forward instead of starting from scratch. The chain becomes roughly idempotent against repeated compactions, at the cost of one more block of tokens to summarize.
What I chose for AIOS: not yet implemented. This is the highest-value addition on my list — single-summary compaction works at first, but I can already see the decay over long sessions.
Instruction files that survive compaction
Some content must not be compressed, ever. The cleanest pattern I saw was almost embarrassingly simple: keep static instructions in a file on disk, and re-read it into the system prompt on every request. Compaction may have summarized the original copy out of history, but the next turn gets a fresh one anyway. The agent's behavioral instructions never decay.
The generalization is: if a piece of context is static and authoritative, store it outside the conversation and re-load it. Don't make the compactor responsible for preserving things you can preserve for free by reading a file.
What I chose for AIOS: Metaglass already has a vault. The kernel treats designated vault notes as instruction sources and re-injects them per turn; same pattern, with the vault as the storage layer.
Tool-result pruning
Across every system I read, tool outputs were the single largest source of token bloat. They're long, they're often useful for only one turn, and the model is rarely going back to re-read them. The aggressive systems replaced old tool outputs with placeholder text like [Old tool output cleared to save context space]; the more sophisticated ones ran proactive truncation passes and guarded against any single huge failure from poisoning the rest of the run.
There's a corollary I almost missed on the first read: once you prune messages, you may have to repair the history to satisfy provider rules (turn ordering, role alternation). Pruning is not free — it has to leave a well-formed conversation behind.
What I chose for AIOS: the kernel truncates oversized tool results on ingest, before they ever enter the history. The repair pass for provider-specific rules is a planned addition; right now AIOS targets a single provider family and the rules are forgiving.
Flush-before-discard
This was the most distinctive idea I encountered, and the one I most wanted to steal. Before compaction throws old context away, a silent agentic turn fires with a prompt along the lines of:
Session nearing compaction. Write any lasting notes to durable memory now. Reply with
NO_REPLYif nothing is worth saving.
The agent gets one chance to decide what's worth keeping and to write it to durable memory itself. This shifts authorship of preservation from a heuristic ("keep the first 3 messages") to the agent's own judgment about what mattered. The summary still happens — but anything the agent flagged is now safely in a file the next session can read.
It's a clever inversion: instead of asking "what should the compactor preserve?", ask the agent "what would you lose sleep over forgetting?"
What I chose for AIOS: on the roadmap, and the design is straightforward because the vault already gives us a natural place for the agent to write. The kernel's MemoryPort is the seam where it'll plug in.
Pluggable context engines
The most flexible architecture I saw exposed the context layer as a set of lifecycle hooks — ingest, assemble, compact, afterTurn — and let different engines implement them differently. A default engine handled the common case; custom engines could register for specialized workloads.
The hook that earned its keep was assemble. It returned both an ordered message list and a separate string prepended to the system prompt for this turn only. That second channel is what makes retrieval-augmented context, recall hints, and per-turn instructions possible without polluting the conversation history. Most systems only have the message channel; the two-channel return is a small design choice that opens a lot of doors.
What I chose for AIOS: the ports architecture is already lifecycle-shaped — ContextPort, MemoryPort, ToolPort, and so on — and the assemble() method on ContextPort is the obvious place to add the second channel. I haven't shipped the system-prompt-addition return yet, but the seam is there.
Two-phase split: assembly vs. compression
This is the line I drew most deliberately for AIOS. Two problems most systems conflate:
- Pre-conversation assembly — gathering relevant notes, graph neighbors, and memory hits before the conversation starts. Handled by the host's
ContextAdapter. - Mid-conversation compression — managing growth during the conversation. Handled by the kernel's
ContextCompressor.
The host knows the vault and the graph; the kernel knows token budgets. By separating them, neither has to reason about the other's domain. The tradeoff is that the kernel can't pull new context mid-conversation — it can only compress what's already there. For Metaglass's use case (knowledge-graph-driven editing) this is the right call. For agents that need to retrieve new files mid-task, it would have to evolve.
This is the cleanest design choice in the whole layer, and the one I'd defend hardest if someone pushed back on it. Mixing the two responsibilities is what produces compactors that try to be retrievers and retrievers that try to be compactors; neither job gets done well.
Subagent isolation as a context strategy
The largest scaling lever I saw was using subagents not just for parallelism but as a context-management tactic. Each subagent gets a fresh window; only its summary returns to the parent. A 200K-token exploration that would have eaten the parent's budget collapses to a 2K-token report.
The hidden cost: the parent loses access to everything except the summary. So the subagent's report quality is itself load-bearing — a bad summary is now the ceiling on what the parent can do with that exploration. Worth measuring.
What I chose for AIOS: subagents are a first-class concept in the kernel, and isolation is the default. Each subagent has its own port set and its own context budget. The summary-quality problem is real and I track it as an open issue.
Deferred schema loading
Tool schemas don't grow during the conversation — they're paid upfront, every request, regardless of whether the agent ever calls them. With external tool servers proliferating, this overhead has gotten large enough to matter. The cleanest fix I saw deferred schema loading until the agent actually wanted a tool: the names are listed; the full schemas are fetched on demand.
It's a niche optimization but a clean one, and it's a useful reminder that "context management" includes the stuff that never compresses, not just the conversation.
What I chose for AIOS: not yet implemented, but easy to slot into the ToolPort boundary when the tool count justifies it.
Hybrid retrieval
The most capable systems I read implemented retrieval as more than vector search:
- Keyword search (BM25 or FTS5) for precision.
- Vector similarity for semantic recall.
- Diversity re-ranking (MMR) so you don't get five paraphrases of the same hit.
- Temporal decay so older memories rank lower.
- Graph traversal — backlinks and forward links from the active note.
The lesson is that "RAG" framed as a single technique is wrong. Production retrieval is layered, and each layer fixes a failure mode the others have. Pure vector search hallucinates similarity; pure keyword search misses synonyms; both ignore recency.
What I chose for AIOS: Metaglass is unusual among agentic systems in that the host already has all five layers — Tantivy for full-text, an embedding pipeline for vector search, a petgraph-backed knowledge graph for traversal, and per-document timestamps for temporal weighting. The kernel doesn't do retrieval at all; it consumes pre-scored ContextBlock objects from the host. This is a direct consequence of the two-phase split above.
Session split chains
The cleanest answer I saw to the "what if the summary was wrong?" problem: preserve full history across compressions by linking sessions in a database via a parent_session_id. The live context is compressed, but the original messages stay queryable. If the agent or the user later needs to ask "what did we actually do three compactions ago?", the answer exists. The summary is the working view; the chain is the ground truth.
What I chose for AIOS: the JSONL transcript on disk is the equivalent — every message persisted, indexed by session. The parent-child chain is the next refinement.
Cheap summarizer models
Summarization is largely an extraction task, not a reasoning task, and a frontier model is overkill for it. The disciplined systems allowed an override so the summary step ran on a smaller, faster model while the main loop kept the frontier model. The savings compound across a long session.
What I chose for AIOS: the kernel takes a summary_model_override config, defaulting to the same model as the main loop but easily switched to a cheaper one per deployment.
Out-of-band signals
A nice pattern I saw for injecting transient signals: budget warnings (e.g., "iteration 18/25") inserted directly into tool result content, not as separate messages, and stripped on replay. The signal influences the agent's decision now but doesn't pollute future context or break prompt caches. It's the in-band equivalent of a status bar — visible when relevant, invisible afterward.
What I chose for AIOS: not yet implemented. The kernel currently exposes budget state through a separate channel that the agent doesn't see directly; bringing it into the tool result envelope is a clean future move.
Estimation over measurement
For threshold checks, the fast systems used a character-based heuristic (around 4 chars per token) instead of calling a tokenizer. The estimate is imprecise; it doesn't need to be precise. The authoritative count comes back in the API response a moment later. The heuristic exists only to decide whether to compress before the next call, where being off by 5% costs nothing.
This is a general pattern worth internalizing: estimation is free, measurement is expensive, and most decisions inside a hot loop don't need the precision of measurement.
What I chose for AIOS: the kernel uses the same ~4 chars per token heuristic for pre-flight checks and reconciles against API usage fields after each call.
Manual override
Even the most opinionated automatic systems offered an escape hatch — a way for the user to say "compact now, and when you do, focus on the API changes." Sometimes the user knows what matters next better than the heuristic does. The override is small but it solves the case where automation would otherwise make a wrong call confidently.
What I chose for AIOS: a /compact command surface is on the roadmap. The kernel's compactor already accepts a focus directive parameter; the UI surface to invoke it is what's missing.
The problem-to-mechanism table
The same material reorganized as problem → mechanism, so it can be read as a checklist when designing a new harness or auditing an existing one.
| Problem | Mechanism |
|---|---|
| Context overflow | Threshold-triggered compaction (50%–95% depending on appetite) |
| Loss of original goal | Protect head messages; re-inject instruction files |
| Loss of recent state | Protect last N messages or last N tokens |
| Tool-output bloat | Proactive truncation; placeholder replacement; provider-specific sanitization |
| Summary drift across compactions | Iterative summaries that fold prior summary forward |
| Summaries miss what matters | Structured templates (Goal / Progress / Decisions / Files / Next Steps) |
| Heuristic can't know what's important | Flush-before-discard agentic turn |
| Need new context mid-conversation | Two-channel assemble: messages + system-prompt addition |
| Relevance | Hybrid retrieval (keyword + vector + diversity + temporal + graph) |
| Complex tasks blow the parent budget | Subagent isolation |
| Schema overhead | Deferred tool-schema loading |
| Cross-session continuity | Session split chain in a database |
| Cost of compression | Cheap summarizer model override |
| Signaling without context pollution | Turn-scoped warnings in tool output, stripped on replay |
| Per-turn token estimation | Character-based heuristic; trust API usage for truth |
| Automation makes a wrong call | Manual /compact <focus> override |
What comes next
The Metaglass AIOS context layer ships with the load-bearing pieces in place: threshold-triggered compaction, head/tail protection, a two-phase split between host assembly and kernel compression, ContextBlock-based prioritized assembly, hybrid retrieval through the host, subagent isolation, and JSONL transcript persistence. The pieces I'm explicitly deferring — iterative summaries, flush-before-discard, the system-prompt addition channel, deferred schema loading, in-band budget signals, parent-child session chains, manual focus override — are all designs I have a clear seam for in the existing port architecture. None of them require rewriting what's there; each is an additive change at a known boundary.
That property — that the next dozen improvements all fit into the existing shape — is the test I'd apply to any context-management design. If the design is right, the things you didn't build yet should fit into it. If the design is wrong, every addition feels like a special case.
Three lessons
If I had to compress (sorry) this whole survey into advice for someone designing one of these layers from scratch, it would be three things:
- Head/tail protection is non-negotiable. Of every mechanism I read, this is the one that shows up in every working system and would hurt the most to remove. The goal lives in the head; the state lives in the tail. Protect both.
- Separate "what's relevant" from "what fits." The two-phase split between assembly and compression is the cleanest design choice in the whole layer. Even if you don't implement it as separate components, think about them as separate problems. Conflating them produces compactors that try to be retrievers and retrievers that try to be compactors, and neither job gets done well.
- Ask the agent what to preserve. The most agent-native idea in the entire space is letting the agent itself flag what's worth saving before compaction discards it. Heuristics will always be approximate; the agent often knows exactly which sentence from the last 50 turns is the one that matters next. Give it a chance to say so before you delete it.
The rest is tuning. Threshold percentages, summary models, retrieval weights — these are dials, not decisions. The decisions are the three above, and the systems that get them right end up looking remarkably similar even when their authors didn't talk to each other.