Designing skills for Metaglass AIOS, part 7: declarative behavior between prompts and code

This is the seventh piece in a series on designing the Metaglass AIOS kernel. Earlier parts: context management, the agent loop, the tool system, prompt composition, subagents, memory and state persistence. This one is about skills — the layer that sits between prompts (ephemeral, unstructured) and tools (compiled, binary), letting users and the agent itself extend behavior without writing code.

Skills are the most opinionated piece of Metaglass. Where the agent loop and context management converge to similar shapes across systems, skill design varies wildly — from "just inject a markdown file into the prompt" to "a full workflow engine with variable interpolation, loops, conditionals, and chained execution across multiple skills." AIOS sits at the elaborate end of that spectrum, deliberately, and this article is mostly the story of why I made that choice and where the line should sit.

Why naive implementations fall short

The first skill system anyone writes looks like this: a folder of .md files; when the user types /skill-name, load the file and stuff it into the prompt as instructions. Five lines of code. It works for the demo.
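For concreteness, here is roughly what that first version looks like. This is an illustrative Python sketch, not code from AIOS or any other system; the function name `build_prompt` and the one-file-per-skill folder layout are assumptions.

```python
from pathlib import Path


def build_prompt(user_input: str, skills_dir: Path, base_prompt: str) -> str:
    """Naive skill system: '/name ...' appends skills_dir/name.md to the prompt."""
    if not user_input.startswith("/"):
        return base_prompt
    parts = user_input[1:].split()
    name = parts[0] if parts else ""
    skill_file = skills_dir / f"{name}.md"
    if not name or not skill_file.is_file():
        return base_prompt  # unknown skill: silently fall back to the base prompt
    return base_prompt + "\n\n" + skill_file.read_text(encoding="utf-8")
```

There are no parameters, no precedence, no eligibility check, and no matching beyond the literal file name.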

Then you ship it and the limits arrive:

The user has 50 skills installed and the prompt is now mostly skill descriptions.
A skill needs to take parameters and there's no syntax to express them.
Two skills do similar things and the agent can't decide which to use.
A skill needs to call a tool, then use the result, then call another tool — and a markdown file doesn't say how.
The user wants to override a built-in skill without modifying the source; there's no precedence rule.
A skill needs to fire automatically when the user's input matches its intent, not just when explicitly invoked.
A community-contributed skill contains a prompt injection that runs every time the skill is loaded.
The agent itself wants to learn — to write new skills from successful task completions — and the format doesn't support self-modification.
A skill should only appear when the user has the required tool installed, but there's no eligibility check.
The same skill needs to work as a slash command, as an auto-triggered behavior, and as a callable workflow from another skill.

Each of these is a separate decision masquerading as a missing feature. The naive "markdown file as prompt" survives demos and stops scaling the moment users start having opinions about how their agent should behave on recurring tasks. A real skill system is not a folder of markdown; it's a registry, a matcher, an injector, and (sometimes) an executor, each with its own concerns.

The shape the problem keeps taking

After reading enough skill implementations across the field, I kept seeing the same six-concern skeleton:

Format — what is a skill, structurally? File layout, frontmatter, body sections, supporting assets.
Discovery — where skills come from. Bundled, user-installed, project-local, plugin-provided, community-hub.
Resolution — what to do when two sources define the same skill. Priority, override, versioning.
Matching — given a user input or model intent, which skill applies? Name match, keyword match, semantic search, model-selected.
Injection — how skill content gets to the model. Full content, descriptions-only-at-start, on-demand body loading, two-tier degradation.
Execution — does the skill just inform the model (pure prompt injection) or does it actually run (workflow steps with variable resolution and tool invocations)?

Every working skill system makes all six decisions. The biggest split in the field is between systems that stop at injection (skills are prompt content; the model executes) and systems that go further into execution (skills are programs with their own interpreter).

The simple-end systems argue: the LLM is the executor. You don't need a workflow engine because the model can follow markdown instructions just fine. The cost of a workflow engine is paid once, in complexity; the benefit is paid back rarely, in determinism.

The elaborate-end systems argue: pure prompt injection works for note templates and breaks the moment you want a skill that actually does something — calls a tool, uses the result, calls another tool, returns a structured output. At that point, you either build a workflow engine or you write each skill as a one-off blob of prompt that hopes the model will get it right every time.

AIOS lives at the elaborate end. Skills can be pure prompt enrichment (when that's enough), pure workflow execution (when determinism matters), or a hybrid (when the workflow produces content the model then refines). The choice belongs to the skill author, not the framework.

Mechanism by mechanism

Skill format: markdown with frontmatter

The single most universal pattern in this layer: skills are SKILL.md files, that is, markdown with YAML frontmatter. Every system I read uses this shape. It's becoming a de facto cross-framework standard, which means a skill written for one harness can often be read by another.

The frontmatter carries the machine-readable metadata: name, description, triggers, inputs, tools, version, platform requirements. The markdown body carries the human-and-LLM-readable instructions: purpose, when to use, output template, guidelines, quality checklist.

What I chose for AIOS: the same shape. SKILL.md with YAML frontmatter parsed by SkillParser. The frontmatter includes the standard fields (name, description, version, triggers, tools) plus Metaglass-specific extensions for the knowledge graph (node_type, suggested_edges), input specifications (inputs: ["key|required"]), and output templates (output: assets/template.md). The body has named sections — ## Purpose, ## When to Use, ## Output Template, ## Guidelines, ## Quality Checklist, ## Variants — each parsed into its own field on the resolved skill object. The parser tolerates missing sections; the only required fields are name and description.

The format is portable enough that a skill written for AIOS can largely be read by other SKILL.md-compatible systems. The graph-integration fields are extensions; other systems ignore unknown frontmatter keys.
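To make the shape concrete, the sketch below parses a small SKILL.md into frontmatter fields and named body sections. It is Python for illustration only and is not the AIOS `SkillParser`; the sample skill text, the `parse_skill` name, and the flat `key: value` handling (a real parser would use a YAML library) are all assumptions.

```python
import re

SAMPLE = """---
name: concept-note
description: Create a structured note explaining a single concept
version: 1.0.0
node_type: concept
---
## Purpose
Explain one concept clearly.

## Guidelines
Keep it under one page.
"""


def parse_skill(text: str) -> dict:
    """Split a SKILL.md string into frontmatter fields and named body sections."""
    match = re.match(r"^---\n(.*?)\n---\n?(.*)$", text, re.DOTALL)
    if not match:
        raise ValueError("missing frontmatter block")
    raw_meta, body = match.groups()

    # Flat `key: value` pairs only; nested YAML is out of scope for this sketch.
    meta = {}
    for line in raw_meta.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    for required in ("name", "description"):
        if not meta.get(required):
            raise ValueError(f"missing required field: {required}")

    # Each `## Heading` opens a section; missing sections are simply absent.
    sections: dict = {}
    current = None
    for line in body.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {
        "meta": meta,
        "sections": {k: "\n".join(v).strip() for k, v in sections.items()},
    }


skill = parse_skill(SAMPLE)
```

Unknown frontmatter keys such as `node_type` pass through untouched, which is the property that keeps the format portable across systems.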

Asset directories: supporting files beyond the markdown

A clean pattern from the more mature systems: a skill isn't just SKILL.md. It's a directory that contains SKILL.md plus optional supporting resources — references/ (documentation the skill can cite), templates/ (file templates the skill outputs), scripts/ (executable scripts the skill invokes), assets/ (images, data).

The directory shape lets a skill be a self-contained bundle. To share or version a skill, you move the directory. To extend a skill, you add files to it without rewriting the markdown.

What I chose for AIOS: each built-in skill is a directory under learning-os/src/core/skills/builtin/<skill-name>/ with SKILL.md plus assets/template.md (and optional variant templates). The 11 built-ins (concept-note, daily-note, decision-log, how-to, meeting-note, person-note, project-brief, project-note, reading-note, task-tracker, weekly-review) all follow this shape. The assets/template.md is the output structure the skill produces; the body of SKILL.md references it.

Discovery: where skills come from

A working skill system pulls from multiple sources. The cleanest pattern I saw uses a priority cascade: bundled skills at the lowest priority, project-local skills at the highest, with user-installed and plugin-provided skills in between. Same-named skills at higher priority override lower ones (much as a higher-precedence origin wins in the CSS cascade).

The priority order varies by system. Some put workspace skills at the top because they're the most specific; some put user skills at the top because they're the most personal. The shape is consistent; the rank order is opinionated.

What I chose for AIOS:

Source	Priority	Origin
`builtin`	0 (lowest)	Shipped with Metaglass
`user`	1	User-installed (out-of-band install path)
`plugin`	1	Plugin-registered
`vault`	2 (highest)	Skills under the active vault

Vault skills win because Metaglass's whole stance is "your vault is the source of truth." A user who writes a skill in their vault expects it to take precedence over both shipped defaults and plugin contributions. The SkillRegistry resolves precedence via getHighestPriorityVersion(skillName) on every lookup.
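A minimal sketch of that resolution rule, in Python for illustration. The priority numbers come from the table above; the `Registry` and `SkillVersion` classes are hypothetical stand-ins for `SkillRegistry`, and the tie-break between `user` and `plugin` (both priority 1) is an assumption, since the table does not say which wins.

```python
from dataclasses import dataclass
from typing import Optional

SOURCE_PRIORITY = {"builtin": 0, "user": 1, "plugin": 1, "vault": 2}


@dataclass(frozen=True)
class SkillVersion:
    name: str
    source: str
    body: str


class Registry:
    def __init__(self) -> None:
        self._versions: dict = {}

    def register(self, skill: SkillVersion) -> None:
        self._versions.setdefault(skill.name, []).append(skill)

    def get_highest_priority_version(self, name: str) -> Optional[SkillVersion]:
        candidates = self._versions.get(name, [])
        if not candidates:
            return None
        # max() keeps the first maximal element, so a user/plugin tie goes to
        # whichever was registered first. That tie-break is an assumption.
        return max(candidates, key=lambda s: SOURCE_PRIORITY[s.source])
```

Resolving on every lookup, rather than once at load, means a skill added to the vault mid-session takes effect on the next lookup without rebuilding the registry.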

Eligibility filtering

Not every discovered skill should be available right now. A skill that requires ffmpeg shouldn't appear when ffmpeg isn't installed. A skill that needs GITHUB_TOKEN shouldn't show up when the env var is missing. A macOS-only skill shouldn't appear on Linux.

The pattern: a requires block in the frontmatter (bins, env, config, os) that the registry checks at load time. Ineligible skills are silently dropped — they don't waste prompt tokens and they don't tempt the model to invoke something that will fail.

What I chose for AIOS: the frontmatter supports platforms, tools, and a prerequisites block, and the registry filters at discovery time. Today the filter checks tool dependencies but not binary availability; extending it with full bin/env checks is on the roadmap. This is a small extension and an obvious one.
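The general `requires` check could look something like the following Python sketch. It covers `bins`, `env`, and `os` and leaves `config` out; the `is_eligible` name, the injectable probes, and the lowercase OS names (`darwin`, `linux`) are assumptions made for illustration, not the AIOS implementation.

```python
import os
import platform
import shutil
from typing import Callable, Mapping, Sequence


def is_eligible(
    requires: Mapping[str, Sequence[str]],
    *,
    has_bin: Callable[[str], bool] = lambda b: shutil.which(b) is not None,
    env: Mapping[str, str] = os.environ,
    current_os: str = platform.system().lower(),
) -> bool:
    """Return True only if every declared requirement is satisfied."""
    # Every required binary must be on PATH.
    if any(not has_bin(b) for b in requires.get("bins", [])):
        return False
    # Every required environment variable must be set and non-empty.
    if any(not env.get(v) for v in requires.get("env", [])):
        return False
    # An empty or missing `os` list means "any platform".
    allowed_os = requires.get("os", [])
    if allowed_os and current_os not in allowed_os:
        return False
    return True
```

The probes are parameters so the check can be exercised without touching the real machine, for example `is_eligible({"bins": ["ffmpeg"]}, has_bin=lambda b: False)`.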

Two-phase loading: descriptions at start, bodies on demand

The single largest cost-control mechanism in any skill system. Loading every skill's full content into the prompt at session start doesn't scale past a handful of skills; loading nothing means the agent doesn't know what's available.

The disciplined systems split the load:

Session start — only skill names and one-line descriptions go into the system prompt. The agent knows what exists but not how each one works in detail.
On invocation — when the agent decides to use a skill, the full SKILL.md body is loaded and injected as instructions for that turn.

This is the "progressive disclosure" pattern. With 100 skills installed, the per-turn skill overhead stays at ~5K tokens (descriptions, at roughly 50 tokens each) instead of ~50K (full content, at roughly 500 tokens each). The cost is one extra dispatch when a skill is actually used; the savings compound every turn it isn't.

What I chose for AIOS: progressive disclosure is the default. SkillPromptProvider matches skills against the user's goal (top 3 by relevance) and only loads the full body for those matches. Non-matching skills cost nothing per turn. SkillContentLoader caches loaded bodies so repeated invocations of the same skill within a session don't repeatedly read from disk.

The variation from the canonical pattern: AIOS loads matched bodies eagerly rather than waiting for the model to invoke a skill tool. This is a deliberate trade — the body is loaded based on the host's matching (semantic + keyword), so the model sees the relevant skill content without having to ask for it. For Metaglass's vault-editing workload (where most "matches" are right), this is faster than the dispatch-on-invocation pattern. For workloads where matches are speculative, the dispatch-on-invocation pattern wastes fewer tokens.
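The two phases, plus the body cache, fit in a few lines. The Python sketch below is illustrative; `SkillCatalog` and its method names are hypothetical and do not mirror `SkillPromptProvider` or `SkillContentLoader`.

```python
from typing import Callable


class SkillCatalog:
    def __init__(self, descriptions: dict, read_body: Callable[[str], str]) -> None:
        self._descriptions = descriptions  # skill name -> one-line description
        self._read_body = read_body        # e.g. a function that reads SKILL.md from disk
        self._cache: dict = {}

    def session_prompt(self) -> str:
        """Phase 1: one line per skill, names and descriptions only."""
        return "\n".join(
            f"- {name}: {desc}" for name, desc in sorted(self._descriptions.items())
        )

    def load_body(self, name: str) -> str:
        """Phase 2: read the full body on first use, then serve it from cache."""
        if name not in self._cache:
            self._cache[name] = self._read_body(name)
        return self._cache[name]
```

Whether `load_body` is called by the model (dispatch on invocation) or by the host after matching (the eager variant described above) is a policy choice layered on the same two methods.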

Budget-aware injection with degradation

A subtle but important refinement: even after picking the top-N matched skills, the total injected content can exceed budget. The disciplined systems degrade in stages, with two format tiers and a final truncation step:

Full format — name, description, body sections, asset references.
Compact format — name and file path only, descriptions trimmed.
Truncation — binary search for the largest subset of skills that fits the character budget.

The agent ends up aware of more skills than fit in full form, with the most relevant ones fully injected and the rest available by name. The compact tier preserves the "I know this skill exists" signal without paying the description cost.
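One way to combine the tiers with the binary search is sketched below in Python. It is an assumption-laden illustration rather than any system's actual code: skills arrive already ranked by relevance, the full and compact renderings are invented formats, and the search relies on a skill's full rendering being at least as long as its compact one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    description: str
    body: str
    path: str


def full(s: Skill) -> str:
    return f"## {s.name}\n{s.description}\n{s.body}"


def compact(s: Skill) -> str:
    return f"- {s.name} ({s.path})"


def render(skills: list, n_full: int, n_total: int) -> str:
    """First n_full skills in full format, the rest up to n_total in compact format."""
    parts = [full(s) for s in skills[:n_full]]
    parts += [compact(s) for s in skills[n_full:n_total]]
    return "\n".join(parts)


def largest_fitting(n_max: int, fits: Callable[[int], bool]) -> int:
    """Binary search for the largest n in [0, n_max] with fits(n) true.

    Assumes fits is monotone (true up to some n, false after) and fits(0) holds.
    """
    lo, hi = 0, n_max
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def inject(skills: list, budget: int) -> str:
    # Truncation: how many skills fit at all, even in compact form?
    n_total = largest_fitting(len(skills), lambda n: len(render(skills, 0, n)) <= budget)
    # Degradation: how many of those can be upgraded to full format?
    n_full = largest_fitting(n_total, lambda k: len(render(skills, k, n_total)) <= budget)
    return render(skills, n_full, n_total)
```

With a generous budget everything is injected in full; as the budget shrinks, lower-ranked skills drop to compact form first and then fall off the end.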

What I chose for AIOS: today, AIOS uses a fixed top-3 match cap rather than budget-aware degradation. With 11 built-in skills the cap is enough; as the skill set grows (especially if vault skills proliferate), moving to the budget+degradation pattern is the obvious refinement. The character-budget approach pairs naturally with the existing per-turn token estimation work in part 1.

Matching: how skills get invoked

Three matching modes show up across the systems I read:

Name match — user types /skill-name; exact name lookup. Simple, deterministic.
Trigger phrase match — frontmatter lists trigger phrases ("create a concept note", "explain {concept}"); pattern match against user input. Supports variable extraction.
Semantic match — embed the user input, embed every skill description, top-k by cosine similarity. Catches paraphrased requests.

The best systems combine all three. Name match for explicit invocations. Trigger phrases for natural-language activation. Semantic match as the fallback for everything else, with keyword overlap as a secondary check to filter out spurious embedding matches.

What I chose for AIOS: all three. SkillRegistry.findByTrigger(input) runs the trigger-phrase match (with {variable} extraction). SkillRegistry.findByDescription(query) runs keyword overlap with stop-word filtering. SkillRegistry.findByDescriptionAsync(query, options) delegates to SkillEmbeddingService for semantic match and falls back to keyword when embeddings aren't available. The matching threshold is conservative (0.5 keyword overlap) to keep false positives low — better to miss a possible skill than to spam the prompt with irrelevant ones.
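The trigger and keyword paths can be sketched in Python as follows. This is illustrative only: the function names, the stop-word list, and the definition of overlap (the fraction of query keywords found in the description) are assumptions, and the semantic path is omitted because it depends on an embedding service.

```python
import re
from typing import Optional

STOP_WORDS = {"a", "an", "the", "to", "of", "for", "and", "in", "on", "me", "my", "please"}


def compile_trigger(phrase: str) -> re.Pattern:
    """Turn 'explain {concept}' into a regex with one named group per {variable}."""
    parts = re.split(r"\{(\w+)\}", phrase)  # literal text at even indexes, names at odd
    pattern = "".join(
        f"(?P<{p}>.+?)" if i % 2 else re.escape(p) for i, p in enumerate(parts)
    )
    return re.compile(rf"^\s*{pattern}\s*$", re.IGNORECASE)


def keywords(text: str) -> set:
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def keyword_overlap(query: str, description: str) -> float:
    """Fraction of the query's keywords that also appear in the description."""
    q = keywords(query)
    return len(q & keywords(description)) / len(q) if q else 0.0


def match(query: str, skills: dict, threshold: float = 0.5) -> tuple:
    """Return (skill name or None, extracted variables)."""
    # Trigger phrases first: deterministic, and they extract variables.
    for name, spec in skills.items():
        for phrase in spec["triggers"]:
            m = compile_trigger(phrase).match(query)
            if m:
                return name, m.groupdict()
    # Keyword overlap second, gated by the conservative threshold.
    scored = [(keyword_overlap(query, s["description"]), n) for n, s in skills.items()]
    score, best = max(scored, default=(0.0, None))
    return (best, {}) if score >= threshold else (None, {})
```

For example, a skill with the trigger `explain {concept}` matches "Explain tail recursion" and yields `{"concept": "tail recursion"}`, while an unrelated question falls through both paths and matches nothing.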

Invocation control: user-invocable vs. model-invocable

Not every skill should be triggerable by the model. Some skills have side effects that the user should consent to explicitly (a deploy skill, a payment skill, an external-API skill). The pattern is frontmatter flags:

user-invocable: true — user can trigger via /skill-name.
disable-model-invocation: false — model can trigger when it matches the input.

The two flags are orthogonal. A "background knowledge" skill might be neither (loaded as context, never invoked). A deploy skill might be user-invocable only. A note-template skill might be both.
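The orthogonality is easy to see when the two flags are modelled as independent booleans. A small Python sketch, with hypothetical names and assumed defaults (both callers allowed unless the frontmatter says otherwise):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvocationPolicy:
    user_invocable: bool = True             # mirrors `user-invocable`
    disable_model_invocation: bool = False  # mirrors `disable-model-invocation`


def may_invoke(policy: InvocationPolicy, caller: str) -> bool:
    """caller is 'user' (slash command) or 'model' (matched autonomously)."""
    if caller == "user":
        return policy.user_invocable
    if caller == "model":
        return not policy.disable_model_invocation
    raise ValueError(f"unknown caller: {caller!r}")


# The three examples from the text:
background = InvocationPolicy(user_invocable=False, disable_model_invocation=True)
deploy = InvocationPolicy(user_invocable=True, disable_model_invocation=True)
note_template = InvocationPolicy()
```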

What I chose for AIOS: the model-invocation guard exists as agent_ready: false (which means "this skill is human-only and not yet ready for autonomous invocation"). The user-invocation surface today goes through skill matching in SkillPromptProvider regardless of the flag — there's no slash-command UI on top of skills yet. As the skill set grows and the slash-command surface lands, formalizing both flags (separate user-invocable and disable-model-invocation) is the right refinement.

Tool scoping per skill

Skills that execute (not just inject) need to declare which tools they're allowed to use. A note-template skill needs vault.create_note; a deploy skill needs Bash. The disciplined pattern is a frontmatter tools (or allowed-tools) field listing the skill's tool dependencies — both as documentation and as a sandbox boundary.

What I chose for AIOS: frontmatter tools: [] is parsed and stored. Enforcement at execution time goes through the host's tool-adapter factories (see part 3) — a skill execution can be wrapped with a tool adapter scoped to the skill's declared tools. This isn't strictly enforced today (skill executions inherit the parent's tool set), but the seam exists and the formalization is straightforward.
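Enforcement at that seam can be as small as an allowlist wrapper around the parent's tools. The Python sketch below is a hypothetical illustration, not the part 3 tool-adapter API; `ScopedTools` and its `call` method are invented names.

```python
from typing import Callable, Iterable, Mapping

Tool = Callable[..., object]


class ScopedTools:
    """Expose only the tools a skill declared in its frontmatter `tools` list."""

    def __init__(self, tools: Mapping[str, Tool], allowed: Iterable[str]) -> None:
        self._tools = tools
        self._allowed = frozenset(allowed)

    def call(self, name: str, **params: object) -> object:
        if name not in self._allowed:
            raise PermissionError(f"tool {name!r} is not declared by this skill")
        return self._tools[name](**params)
```

A note-template skill that declares `tools: [vault.create_note]` would get a `ScopedTools` built from that list, and any attempt to reach `Bash` through it would raise instead of inheriting the parent's access.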

Dual-mode execution: prompt enrichment vs. workflow

This is where AIOS diverges hardest from the simple-end systems.

Mode 1: Prompt enrichment. The skill's body is injected into the system prompt. The LLM reads the purpose, output template, guidelines, and quality checklist, then generates content matching that structure using its normal tool loop. The skill is a style guide; the model is the executor. This handles 90% of cases — note templates, document scaffolds, structured outputs.

Mode 2: Workflow execution. The skill defines a sequence of steps with tool invocations, variable resolution, conditionals, and loops. The runtime executes the steps deterministically; the LLM is only invoked when a step explicitly calls for it. The skill is a program; the model is one of several executors. This handles the cases where reliability matters more than flexibility — data pipelines, multi-step actions with intermediate validation, integrations with external systems.

Most skills are mode 1. A few are mode 2. The distinction is implicit in the skill's structure: if there's no steps list, it's a prompt-enrichment skill; if there's a steps list, the workflow runtime takes over.

What I chose for AIOS: both modes coexist. SkillPromptProvider handles mode 1: match → load body → inject as system prompt addition with purpose, guidelines, template, and checklist. SmartSkillService handles mode 2: route input to a matched skill's trigger → resolve inputs → execute steps via StepExecutor → resolve the output template via VariableResolver → return structured ActionExecutionResult.

The split is a real design commitment, not just two side-by-side features. Mode 1 keeps the skill format simple enough that a user can write a useful skill in five minutes. Mode 2 makes the format powerful enough for cases that demand determinism. Forcing every skill into mode 1 leaves real workflows under-served; forcing every skill into mode 2 makes the simplest cases overcomplicated.
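The structural dispatch rule (no steps list means mode 1, a steps list means mode 2) can be sketched in Python like this. The `run_skill` function, the dict-shaped skill, and the result shape are illustrative assumptions, not the `SkillPromptProvider` or `SmartSkillService` interfaces; treating an empty steps list as mode 1 is also an assumption.

```python
from typing import Any, Callable


def run_skill(skill: dict, system_prompt: str, run_workflow: Callable[[list], Any]) -> dict:
    """Dispatch on structure: a non-empty `steps` list means workflow mode."""
    steps = skill.get("steps")
    if steps:
        # Mode 2: the runtime executes the steps; the model is not consulted here.
        return {"mode": "workflow", "result": run_workflow(steps)}
    # Mode 1: the body becomes extra system-prompt content and the model does the rest.
    return {
        "mode": "prompt_enrichment",
        "system_prompt": f"{system_prompt}\n\n{skill['body']}",
    }
```

The skill author never names a mode; adding or removing the steps list is the whole switch.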

The workflow runtime: variables, conditionals, loops, filters

If you commit to mode 2, you commit to a template engine. The disciplined systems use a Jinja-style syntax that's familiar to anyone who's written a template before:

{{inputs.title}} — input variable access.
{{search.results}} — access the output of a previous step by step ID.
{{value | filter}} — apply a filter (truncate(50), date("YYYY-MM-DD"), join(", "), default("N/A")).
{% if condition %}...{% endif %} — conditional blocks.
{% for item in array %}...{% endfor %} — iteration.

The filter library is small but high-leverage. length, slice, truncate, date, join, default, first, last cover the common cases without turning into a full programming language.

What I chose for AIOS: VariableResolver (562 lines) implements the Jinja-style syntax with the filter set above. StepExecutor checks abort signals, evaluates step conditions, resolves variables in params, invokes tools via SkillToolRegistry, and returns outputs that subsequent steps can reference. This is the most code-heavy piece of the skill system, and it earns its keep on the small number of skills that are workflows rather than templates — but those are exactly the skills that would otherwise be unreliable.
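To give a feel for the core of such a resolver, here is a deliberately tiny Python sketch of the `{{ path | filter }}` part. It is not the AIOS `VariableResolver`: it skips `{% if %}` and `{% for %}` blocks and the `date` and `slice` filters, its `truncate` is a plain cut with no ellipsis, and it splits on `|` naively, so a filter argument containing `|` would break it.

```python
import ast
import re

FILTERS = {
    "length": lambda v: len(v),
    "first": lambda v: v[0],
    "last": lambda v: v[-1],
    "join": lambda v, sep=", ": sep.join(str(x) for x in v),
    "truncate": lambda v, n=50: str(v)[:n],
    "default": lambda v, fallback="": fallback if v in (None, "") else v,
}


def lookup(context: dict, path: str):
    """Walk a dotted path like 'search.results'; missing keys resolve to None."""
    value = context
    for key in path.split("."):
        value = value.get(key) if isinstance(value, dict) else None
    return value


def apply_filter(value, expr: str):
    m = re.fullmatch(r"(\w+)(?:\((.*)\))?", expr.strip())
    if not m or m.group(1) not in FILTERS:
        raise ValueError(f"unknown filter: {expr!r}")
    # Arguments are parsed as Python literals, e.g. 50 or ", ".
    args = ast.literal_eval(f"({m.group(2)},)") if m.group(2) else ()
    return FILTERS[m.group(1)](value, *args)


def resolve(template: str, context: dict) -> str:
    def replace(match: re.Match) -> str:
        path, *filters = match.group(1).split("|")
        value = lookup(context, path.strip())
        for f in filters:
            value = apply_filter(value, f)
        return "" if value is None else str(value)

    return re.sub(r"\{\{(.*?)\}\}", replace, template)
```

Step outputs are stored in the context under the step ID, which is what makes `{{search.results}}` resolve after a step named `search` has run.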

Skill chaining: outputs as inputs

A natural extension of workflow execution: chain multiple skills together. The output of skill A feeds into skill B's inputs. Each step in the chain can carry its own control settings (a condition that decides whether it runs, a continueOnError flag that decides whether a failure stops the chain), and the chain executor handles the orchestration.

This is the place where the workflow engine starts to pay off in ways that pure prompt injection can't reach. A "weekly review" skill might chain a retrieve-tasks workflow, a summarize-completed workflow, and a plan-next-week template — each is its own skill, the chain composes them, and the runtime handles the variable plumbing between them.

What I chose for AIOS: SkillChainExecutor (474 lines) handles sequential multi-skill execution with variable interpolation between steps, conditional skipping, and continueOnError behavior. This is a genuinely powerful feature and it's also the one I'd cut first if I were designing AIOS for someone else's workload. For knowledge-graph editing — where multi-step workflows like "review and re-link this week's notes" are common — it pays back. For coding agents where each task is mostly novel, the same machinery would be over-engineering.
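The control flow of a chain is small once the pieces are named. The Python sketch below is an illustration under assumptions, not the `SkillChainExecutor`: skills are plain callables, the condition is a callable rather than a template expression, and each output is stored in the shared context under the step's ID.

```python
from typing import Any, Callable


def run_chain(steps: list, skills: dict, context: dict) -> dict:
    """Run steps in order; each step dict has `id`, `skill`, and optionally
    `condition` (callable on the context) and `continueOnError` (bool)."""
    results: dict = {}
    for step in steps:
        condition = step.get("condition")
        if condition is not None and not condition(context):
            results[step["id"]] = {"status": "skipped"}
            continue
        try:
            output: Any = skills[step["skill"]](context)
        except Exception as exc:
            results[step["id"]] = {"status": "failed", "error": str(exc)}
            if step.get("continueOnError"):
                continue
            break  # a failure without continueOnError stops the chain
        context[step["id"]] = output  # later steps read this by step ID
        results[step["id"]] = {"status": "ok", "output": output}
    return results
```

In the weekly-review example, the retrieve step would write its task list into the context, the summarize step would read it, and the planning step would read the summary.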

Knowledge-graph integration

A Metaglass-specific extension: skills can declare a node_type and suggested_edges in their frontmatter. A concept-note skill knows it produces a node of type concept; it knows that concepts typically connect via prerequisite, leads_to, and similar_to edges. The skill produces not just a note but a typed, linked node in the knowledge graph.

This is the kind of integration that's only available because Metaglass is a vault-first application. The skill system isn't just producing markdown; it's producing graph nodes with structural meaning. Other systems can't do this because they don't have a graph to feed into.

What I chose for AIOS: node_type and suggested_edges are frontmatter fields, parsed by SkillParser, and consumed by the skill-execution path to register the produced node in the graph with appropriate edges. This is one of the places where being opinionated about the substrate (vault + knowledge graph) lets the skill system do something the simple-end systems structurally cannot.

Skill self-modification: the agent writing its own skills

A pattern in the more mature systems: the agent itself can create, edit, and delete skills via dedicated tools (skill_manage, skill_view). The agent learns from successful task completions by writing new skills; it patches existing skills when conventions change. This is the "self-improving loop."

The cost is real — agent-written skills can contain prompt injections, can drift from the user's intent, can accumulate noise. The systems that allow self-modification pair it with security scanning on writes (the same injection-scanning pattern from part 6) and with user review queues for new skills.

What I chose for AIOS: not yet implemented. Skills are user-authored or shipped, not agent-authored. This is a deliberate scope choice — self-modifying skills add real complexity and a real attack surface, and the value isn't proven for Metaglass's workload yet. The seam exists (skills are markdown files; the agent already has vault-write tools), so adding a curated skill.create tool with injection scanning is a future move when the user value justifies the security work.

Community sources and the install lifecycle

The most mature systems treat skills as packages with an install/update/uninstall lifecycle. A community hub provides discoverable skills; a CLI command installs them; an update mechanism keeps them current; an uninstall removes them cleanly. Skills can declare install dependencies (brew install ffmpeg, npm install some-mcp-server) that the install pipeline executes.

What I chose for AIOS: not implemented. Skills today come from builtin/ (shipped) or the vault (user-authored). A formal install pipeline with hub discovery, version pinning, and dependency resolution is well-understood territory but a significant scope expansion. It's the right move when the skill ecosystem grows beyond what one person maintains in their vault; before that, the operational overhead exceeds the benefit.

Skill snapshot for subagents

A subtle but useful pattern: when a parent agent spawns a subagent (see part 5), the subagent inherits a frozen snapshot of the parent's resolved skill set. The subagent doesn't re-run discovery; it gets the skills the parent had at spawn time. This makes subagent behavior reproducible and keeps the skill set stable across the lifetime of the spawn.

What I chose for AIOS: the SkillPort interface and the kernel's TaskSpawner are designed to support snapshot inheritance, but the actual snapshot-pass-down isn't wired (the AgentFactory it would flow through is the same one called out as missing in part 5). When that lands, snapshot inheritance is a small additional change.
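The snapshot itself is the easy part. A Python sketch of the idea, with a hypothetical `snapshot_skills` helper that is not part of `SkillPort` or `TaskSpawner`:

```python
import copy
from types import MappingProxyType
from typing import Any, Mapping


def snapshot_skills(resolved: Mapping[str, Any]) -> Mapping[str, Any]:
    """Freeze the parent's resolved skill set at spawn time.

    The deep copy isolates the snapshot from later changes in the parent; the
    proxy blocks adding or replacing top-level entries on the snapshot itself.
    """
    return MappingProxyType(copy.deepcopy(dict(resolved)))
```

A subagent handed this mapping sees the skills exactly as they were resolved when it was spawned, even if the parent's registry changes afterwards.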

The kernel-skill gap

The honest admission: today, skills live in the host. The kernel has no native SkillProvider interface. Skills surface to the kernel as:

Prompt content — injected via SkillPromptProvider at conversation start (one-shot, not per-turn).
Tools — runtime skill operations registered as tools via SkillToolProvider.
Workflows — executed by SmartSkillService when the host detects a workflow trigger.

What's missing is a per-turn skill matcher in the kernel. The model can't reach for a skill mid-conversation if no host-side trigger fired at the start. The proposed fix is an optional SkillProvider interface in the kernel deps with match(), enrich(), and list() methods, hooked into the loop so each turn can check whether the in-flight conversation has shifted toward a different skill than the one matched at the start.

This is the most visible limitation in the current implementation — skills are pre-conversation, not per-turn. It's also the cleanest fix on the roadmap: the seam is in the kernel's dependency injection, the host already has all the matching machinery, and the per-turn integration is one new port and one new call in the loop.
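As a rough picture of the proposed port, here is a Python sketch. Only the three method names (`match`, `enrich`, `list`) come from the proposal above; the signatures, the `select_skill_for_turn` hook, and its return shape are assumptions, and the real kernel interface may look different.

```python
from typing import Optional, Protocol, Sequence


class SkillProvider(Protocol):
    def match(self, text: str) -> Optional[str]:
        """Name of the best-matching skill for this turn, or None."""
        ...

    def enrich(self, skill_name: str) -> str:
        """Prompt content to inject for the given skill."""
        ...

    def list(self) -> Sequence[str]:
        """Names of all skills currently available."""
        ...


def select_skill_for_turn(
    user_text: str,
    active_skill: Optional[str],
    provider: Optional[SkillProvider],
) -> tuple:
    """Per-turn hook: return (skill for this turn, new prompt content or None)."""
    if provider is None:  # the port is optional; without it nothing changes
        return active_skill, None
    matched = provider.match(user_text)
    if matched is None or matched == active_skill:
        return active_skill, None
    # The conversation has shifted toward a different skill: swap and inject.
    return matched, provider.enrich(matched)
```

Because the provider is optional, a host that does not supply one keeps today's behavior of matching once at conversation start.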

The problem-to-mechanism table

Problem	Mechanism
Markdown-blob skills don't scale past a handful	Format = YAML frontmatter + structured body sections + asset directory
Skills come from multiple places	4-source registry with priority cascade (`builtin < user/plugin < vault`)
Skill-required deps aren't installed	(deferred) `requires` block with bin/env/config eligibility checks
Loading all skills floods the prompt	Progressive disclosure: descriptions at start, full body on match
Many skills match; budget can't fit them	Top-3 today; (deferred) budget-aware degradation with compact-format fallback
User explicit invocation	Trigger phrases with `{variable}` extraction
Natural-language activation	`SkillRegistry.findByDescription` keyword match + `SkillEmbeddingService` semantic
Some skills shouldn't be model-invocable	`agent_ready` today; (deferred) full `user-invocable` / `disable-model-invocation` split
Skills with side effects need tool scoping	`tools: []` frontmatter; (deferred) enforced wrap via tool-adapter factories
Note templates and workflows have different needs	Dual-mode execution: prompt enrichment OR workflow steps
Workflows need data plumbing	`VariableResolver` with Jinja-style syntax + filter library
Multi-step workflows need composition	`SkillChainExecutor` for sequential chaining with conditionals
Skills should produce structured graph nodes	`node_type` + `suggested_edges` frontmatter + graph-aware execution
Agent-written skills risk poisoning the system	(deferred) injection scanning at the skill-write path
Skill management needs a lifecycle	(deferred) install pipeline + hub discovery + version pinning
Subagents should see stable skills	(deferred) skill snapshot inheritance via `SkillPort`
Skills can't be matched mid-conversation	(deferred) kernel `SkillProvider` interface for per-turn matching

What comes next

The AIOS skill system ships with the load-bearing pieces in place: the SKILL.md + frontmatter + asset-directory format with SkillParser, the 4-source registry with priority cascade in SkillRegistry, name, trigger, and keyword matching in SkillRegistry plus semantic matching via SkillEmbeddingService, progressive disclosure with on-demand body loading and caching via SkillContentLoader, dual-mode execution (prompt enrichment via SkillPromptProvider, workflow execution via SmartSkillService + StepExecutor + VariableResolver), multi-skill chaining via SkillChainExecutor, and knowledge-graph integration via node_type/suggested_edges.

The deferred items at known seams:

Kernel SkillProvider interface for per-turn matching — the single largest gap; closes the pre-conversation-only limitation.
Eligibility filtering with full bin/env/config checks — small extension of the existing prerequisites parsing.
Budget-aware injection with degradation tiers — pairs with the per-turn token estimation work from part 1.
Full user-invocable / disable-model-invocation split — replaces the current agent_ready flag.
Tool-scope enforcement at skill execution time via tool-adapter wrapping.
Skill snapshot inheritance for subagents (waits on the AgentFactory from part 5).
Injection scanning on agent-written skills (same pattern as the memory-write scanner from part 6).
Self-modifying skills — agent can create/edit/delete skills via curated tools, gated by injection scanning and user review.
Install pipeline + community hub — full lifecycle management when the skill ecosystem outgrows one user's vault.

The highest-impact deferred item is the kernel SkillProvider interface. Until it lands, AIOS can't shift skills mid-conversation, and that's the most user-visible limitation. The highest-stakes deferred item, if and when self-modifying skills are introduced, is injection scanning — the same chokepoint argument that applies to memory writes applies twice as hard to skills, because a poisoned skill executes on every match.

Three lessons

If I had to compress this survey into advice for someone designing a skill system from scratch:

Skills are not prompts and they are not programs; they are both. The simple-end systems treat skills as prompt injections and serve 90% of cases beautifully. The elaborate-end systems treat skills as workflows and serve the remaining 10% reliably. The mistake is forcing every skill into one camp. A dual-mode design — pure prompt enrichment when that's enough, workflow execution when determinism matters — covers the full spectrum without compromising either end. If your skill system supports only one mode, you'll either fail at templates or fail at integrations; pick both.
Progressive disclosure is not an optimization; it's the architecture. Loading every skill at full content into the prompt fails at the moment your user installs the tenth skill. The two-phase load (descriptions at start, bodies on match) isn't a token-saving trick — it's the only design that scales past a handful of skills. Build it from day one. The corollary: matching has to be good, because the model's only view of unmatched skills is the description. Invest in matching the same way you'd invest in any other retrieval system: name match first, trigger phrases second, semantic match as the fallback, keyword overlap as a sanity check.
Skills are a security boundary; treat them like one. Every input that crosses into a skill — frontmatter values, body content, agent-written skills, community-installed skills — is a potential injection vector. The systems that take this seriously scan skill content at install time, scan agent-written skills at write time, and re-scan on every read. The systems that don't, eventually ship a "prompt injection in shared skill compromises every user who installs it" CVE. Decide which kind of system you're building before you ship the skill-write surface.

The rest is tuning. Match thresholds, budget caps, degradation tiers, filter libraries, source priority order — dials, not decisions. The decisions are the three above, and the skill systems that get them right end up looking remarkably similar even when their authors didn't talk to each other.

Earlier in this series: Part 1: Context management · Part 2: The agent loop · Part 3: The tool system · Part 4: Prompt composition · Part 5: Subagents · Part 6: Memory and state persistence.