hive-c0re coordinator internals

Architecture notes for the hive-c0re coordinator daemon's internal subsystems. For the public API surface (dashboard, socket protocol) see docs/conventions.md and docs/persistence.md.

Job queue

Every container/meta operation (rebuild, meta-update, first-spawn, power changes) is submitted to the global job-DAG queue (hive-c0re/src/job_queue/) as a DAG of primitive nodes. One scheduler task drives all DAGs; concurrency comes from the resource classes below, not from multiple workers. The old special cases — the graceful-stop watcher thread, the deferred-start fast-lane follow-up, the meta-update cascade pre-enqueue — are all just DAG shapes now.

Two levels: DAG and node

The DAG is the unit of cancel / approval-resolution and the dashboard group; the node is the unit of scheduling / execution / build-log / step label. Deps are intra-DAG edges only (AfterOk by default: the dep must succeed, a failed/cancelled dep cancels the dependent — cancel-downstream). Cross-DAG ordering comes from the per-agent lease, never from edges between DAGs. Submit-time validation (petgraph toposort) rejects cyclic specs outright, fixing the old queue's "circular dep silently deadlocks" caveat.
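
The submit-time cycle check can be illustrated with a small std-only sketch. The coordinator itself is described as using petgraph's toposort; the version below swaps in a hand-written Kahn's algorithm so it has no dependencies, and the `NodeSpec` type and `validate_dag` name are hypothetical, not the real job_queue API.

```rust
use std::collections::VecDeque;

/// Hypothetical node spec: `deps` holds indices of other nodes in the same DAG.
pub struct NodeSpec {
    pub label: &'static str,
    pub deps: Vec<usize>,
}

/// Returns a topological order, or an error naming a node that can never run
/// (it sits on a cycle) or a dep that points outside the DAG.
pub fn validate_dag(nodes: &[NodeSpec]) -> Result<Vec<usize>, String> {
    let n = nodes.len();
    let mut indegree = vec![0usize; n];
    let mut dependents: Vec<Vec<usize>> = vec![Vec::new(); n];
    for (i, node) in nodes.iter().enumerate() {
        for &d in &node.deps {
            if d >= n {
                // Deps are intra-DAG edges only.
                return Err(format!("{}: dep {} is not in this DAG", node.label, d));
            }
            indegree[i] += 1;
            dependents[d].push(i);
        }
    }
    // Kahn's algorithm: repeatedly emit nodes whose deps are all emitted.
    let mut ready: VecDeque<usize> = (0..n).filter(|&i| indegree[i] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(i) = ready.pop_front() {
        order.push(i);
        for &j in &dependents[i] {
            indegree[j] -= 1;
            if indegree[j] == 0 {
                ready.push_back(j);
            }
        }
    }
    if order.len() == n {
        Ok(order)
    } else {
        // Anything left with a non-zero indegree is on (or behind) a cycle.
        let stuck = (0..n).find(|&i| indegree[i] > 0).unwrap();
        Err(format!("cyclic spec: {} can never become ready", nodes[stuck].label))
    }
}
```

Rejecting the spec at submit time means a cyclic DAG never reaches the scheduler, where it would otherwise sit with no node ever becoming ready.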

Node inventory (primitives)

Nix-heavy — hold one of the buildSlots permits for the node's duration:

Node	Wraps
`Prebuild`	`lifecycle::prebuild_toplevel` — build the toplevel out-of-band while the container keeps serving (its meta preamble is the upstream `MetaSync` node)
`Swap`	drop-in rewrite + `nixos-container update` profile-swap (requires the container stopped); the post-swap bookkeeping tail lives in the sibling `PostSwap` node
`Create`	first-spawn provisioning + `nixos-container create` (atomic build+create)
`MetaLock`	meta flake lock bump (`lock_update` / boot-sweep `lock_update_hyperhive`, commit fused, see below); on completion grows one rebuild subgraph per affected agent into its own DAG (no child DAGs)
`DeployWindow`	resource-holding root of the merge-config-PR deploy subtree — declares the build slot, the lease and the meta window, then completes immediately so its children run under them (see Approvals below)
`DeployApply`	the deploy's irreversible half: ff-merge the reviewed PR head, two-phase meta deploy, container rebuild

Cheap — no build slot:

Node	Behavior
`MergeVerify`	the deploy's pre-merge gate — PR-head drift check, fetch, `verify_commit` eval. Mutates nothing, so a rejection here needs no compensation
`DeployTail`	the deploy's `AfterAny` compensation + bookkeeping tail — rolls `applied/main` back from the parked `refs/hyperhive/rollback/<id>` and aborts the staged meta lock when the deploy never confirmed good, then mirrors the config repo to the forge. Infallible by construction
`MetaSync`	the rebuild's meta preamble — rebuild-dir prep, idempotent meta `sync_agents`, optional per-agent relock. Holds the `MetaWindow` resource (below); deliberately its own node so the window never covers `Prebuild`'s multi-minute build
`Reconcile`	idempotent power converge: read `wanted` (below) + observed state; start if `Up` & down (cold-start fallback included), stop if `Offline` & up, else noop
`StopForUpdate`	mechanical `nixos-container stop` for the profile swap; never touches `wanted`; noop if already stopped
`PostSwap`	the swap's Ok-only bookkeeping tail — rev marker, forge/matrix sync, manager kick, rescan, meta-inputs snapshot; `AfterOk(Swap)` so it runs only on a successful swap (the `Rebuilt` manager event still fires once per DAG from the terminal hook, not here)
`Signal`	set the graceful fence + kick, so the harness runs one stop-checkpoint turn
`Drain`	await the harness clearing the fence, bounded by the 3-min graceful-stop timeout; resolves ok either way
`WriteDropin`	`set_nspawn_flags` + `set_resource_limits` + daemon-reload
`WritePermFile`	commit `tool-groups.json` / `capabilities.json` (single git commit under `META_LOCK`) + emit the P3RM1SS10NS snapshots

There is deliberately no GitCommit node: meta.rs fuses each mutation with its commit under its internal META_LOCK mutex, so a standalone commit node would open a dirty-working-tree window between nodes.

Two further layers protect the meta repo across windows that span multiple META_LOCK acquisitions — above all the approval deploy's prepare→finalize span, which keeps a bumped flake.lock staged uncommitted for the whole container build:

The deploy window (Resource::MetaWindow): a global, capacity-1 queue resource declared by every node kind that mutates the meta repo (NodeKind::needs_meta_window): MetaSync, MetaLock, WritePermFile, Provision's agent registration, and DeployWindow. DeployWindow is the deploy subtree's root and holds the window across every phase below it. Two meta mutations can therefore never interleave, so no commit lands inside another node's staged window. It is a queue resource rather than a runtime mutex because a resource can be held by a subtree root across its whole subtree, which a MutexGuard (bounded by one executor fn) cannot; that is what lets a multi-node deploy own one window. For the same reason the window must stay off long store-only work: the rebuild's meta preamble is its own MetaSync node, a sibling of (never a parent of) Prebuild, so the toplevel build runs outside the window and buildSlots > 1 still gives concurrent rebuilds across agents.
Path-limited commits: the targeted meta committers (perm files, topology, lock bumps, finalize) commit -- <their paths> with path-scoped dirty checks, so even a non-queue caller (boot migration, destroy's sync_agents) can never sweep someone else's staged content into its commit.

Every operation as a DAG

The stop / start power ops write the durable wanted intent via a head SetWanted node (not a pre-submit side effect) — it holds the agent lease, so intent-write + reconcile is atomic per-agent. restart is the exception: it writes no intent (no SetWanted head) — it bounces the container and lets the tail Reconcile converge to the agent's existing wanted, so a deliberately-stopped agent is not forced back up by a hive-wide restart. The hive-wide power ops — restart, stop, and start — take an agent list: a hive-wide hivectl restart / stop / start is ONE DAG with a per-agent subgraph each (independent roots, run concurrently on their own leases), not N separate DAGs.

These are built dynamically from each agent's live running state (an async lifecycle::is_running read), so they live in job_queue/submit.rs, not the pure/sync templates.rs. Per-agent shape rule: stop/start carry a head SetWanted (intent) — restart does not; the tail Reconcile (convergence guarantee — cheap, noops when already converged) is ALWAYS present; only the mechanical nodes (Signal/Drain/StopForUpdate) are state-conditional — skipped for a down agent (nothing to quiesce/stop). Keeping Reconcile in every shape closes the TOCTOU window: if an agent flips state between the is_running read and node exec, the tail Reconcile still converges it in-DAG (with StopForUpdate-noop as the backstop) — no reliance on an external reconcile sweep. start folds the per-agent stale-rev upgrade in (a down + stale agent's subgraph is a rebuild-then-start).

rebuild(a):        MetaSync(a) → Prebuild(a) → StopForUpdate(a) → Swap(a) →(after-ok) PostSwap(a) →(after-any) Reconcile(a)
stop(a..):     online a: SetWanted(a,Off) → [Signal→Drain→ if graceful] Reconcile(a)
               offline a: SetWanted(a,Off) → Reconcile(a)                    (N subgraphs, 1 DAG)
restart(a..):  online a: [Signal→Drain→ if graceful] StopForUpdate(a) → Reconcile(a)  (no SetWanted)
               offline a: Reconcile(a)  (nothing to stop; Reconcile converges to existing wanted)
start(a..):    a: SetWanted(a,Up) → Reconcile(a)   (down+stale ⇒ SetWanted(a,Up) → «rebuild subgraph»)
spawn(a):          [wanted=Up at approve]  Create(a) → WriteDropin(a) → Reconcile(a)
perm-change(a):    WritePermFile(a) → «rebuild subgraph»
meta-update(inp):  MetaLock(inp) →(in-DAG) «rebuild subgraph» per affected agent
boot:              (if any rev marker stale) MetaLock(hyperhive) →(in-DAG) «rebuild subgraph» per stale agent;
                   plus Reconcile(a) for every drifted agent  (all ONE DAG)
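
The per-agent shape rule for the power ops can be written down as a pure function. This is a minimal sketch, not the code in job_queue/submit.rs: the enum and function names are hypothetical, the chain is returned as a flat list (each node depending on the previous one), and the down + stale upgrade of start into a rebuild subgraph is left out.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Wanted { Up, Offline }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PowerOp { Stop, Restart, Start }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Node { SetWanted(Wanted), Signal, Drain, StopForUpdate, Reconcile }

/// Builds one agent's linear chain for a power op.
/// `running` is the live state read at submit time; `graceful` asks for the
/// Signal/Drain stop-checkpoint pair.
pub fn power_subgraph(op: PowerOp, running: bool, graceful: bool) -> Vec<Node> {
    let mut chain = Vec::new();
    // Head intent write: stop and start carry it, restart does not.
    match op {
        PowerOp::Stop => chain.push(Node::SetWanted(Wanted::Offline)),
        PowerOp::Start => chain.push(Node::SetWanted(Wanted::Up)),
        PowerOp::Restart => {}
    }
    // Mechanical nodes are state-conditional: nothing to quiesce on a down agent.
    if running && op != PowerOp::Start {
        if graceful {
            chain.push(Node::Signal);
            chain.push(Node::Drain);
        }
        if op == PowerOp::Restart {
            chain.push(Node::StopForUpdate);
        }
    }
    // The tail Reconcile is always present, which closes the TOCTOU window.
    chain.push(Node::Reconcile);
    chain
}
```

A hive-wide op would call this once per agent and place the resulting chains side by side in one DAG as independent roots.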

Notable collapses:

rebuild is one uniform shape — no was_running branch. StopForUpdate noops when already down; the tail Reconcile auto-noops the start when wanted = Offline (a rebuild of a deliberately-stopped agent leaves it stopped).
The swap-failure recovery-start is structural: the tail Reconcile is attached with an AfterAny edge, so it runs whether Swap ended ok or failed, bringing a wanted-up agent back on its old config.
Deferred start is automatic: Reconcile holds no build slot, so the next DAG's Prebuild starts as soon as Swap frees the slot.
Graceful stop needs no watcher thread: Signal/Drain are cheap, so a whole-hive graceful stop fires every agent's signal immediately and all drains overlap; each DAG's tail Reconcile does the actual stop.
The meta-update cascade grows in the same DAG on completion: MetaLock's executor computes the affected agent set after the bump lands and grows one rebuild subgraph per agent into its own DAG via append_subgraph (rooted on the MetaLock, relock = false so the cascade doesn't revert the bump). Not child DAGs — one DAG, no parent_id. A failed bump appends nothing (no cancel-children dance). Same shape as the startup sweep; the meta-update DAG carries the Rebuilding transient so each cascade agent keeps crash-watch suppression during its Swap.
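
The difference between the default AfterOk edge and the AfterAny edge used by the tail Reconcile can be sketched as a single decision function. The types below are hypothetical, and treating a cancelled dep as "terminal" for AfterAny is an assumption; the notes above only state the ok and fail cases explicitly.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NodeState { Queued, Running, Done, Failed, Cancelled }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Edge { AfterOk, AfterAny }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DepVerdict {
    /// Dep has not settled yet; the dependent stays queued.
    Wait,
    /// Dep no longer blocks the dependent.
    Satisfied,
    /// Cancel-downstream: the dependent is cancelled without running.
    CancelDependent,
}

pub fn check_dep(edge: Edge, dep: NodeState) -> DepVerdict {
    use NodeState::*;
    match (edge, dep) {
        (_, Queued) | (_, Running) => DepVerdict::Wait,
        (_, Done) => DepVerdict::Satisfied,
        // AfterAny only needs the dep to be terminal (Cancelled is an assumption).
        (Edge::AfterAny, Failed) | (Edge::AfterAny, Cancelled) => DepVerdict::Satisfied,
        // AfterOk: a failed or cancelled dep cancels the dependent.
        (Edge::AfterOk, Failed) | (Edge::AfterOk, Cancelled) => DepVerdict::CancelDependent,
    }
}
```

Under these rules a failed Swap cancels PostSwap (AfterOk) but still lets Reconcile run (AfterAny), which is the recovery-start described above.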

Desired-state (spec vs status)

Per-agent power intent (wanted: Up | Offline) is durable as the agent_power table in the coordinator DB (hive-c0re/src/stores/power.rs). container_view remains the observed status; Reconcile nodes converge the two. For stop and start, wanted is written by the DAG's head SetWanted node under the agent lease (see Every operation as a DAG), and the same DAG's tail Reconcile reads the fresh value; rapid toggles are last-writer-wins. Power toggles never commit to the meta repo. Every operator power surface (dashboard buttons, the MCP tools, and hivectl stop/start/restart/kill) rides the queue through the submit layer (job_queue/submit.rs), so intent, lease serialization, and crash-watch suppression can't drift per surface; the only direct starts left are the root-agent bootstrap and infra containers (no lease, no harness). Cancelling a still-queued power DAG reverts wanted to the observed state: a cancel means "don't do it", not "do it later". Agents without a row are seeded from observed state on first touch (running ⇒ Up); destroy removes the row.
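
The converge step itself is a two-input decision. The sketch below uses hypothetical names and ignores the cold-start fallback; seeding a non-running agent as Offline is an assumption inferred from "running ⇒ Up".

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Wanted { Up, Offline }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Action { Start, Stop, Noop }

/// What a Reconcile node does given the durable intent and the observed state.
pub fn reconcile_action(wanted: Wanted, running: bool) -> Action {
    match (wanted, running) {
        (Wanted::Up, false) => Action::Start,
        (Wanted::Offline, true) => Action::Stop,
        // Already converged: nothing to do.
        _ => Action::Noop,
    }
}

/// First-touch seeding for an agent with no agent_power row.
pub fn seed_wanted(running: bool) -> Wanted {
    if running { Wanted::Up } else { Wanted::Offline }
}
```

Because the function is idempotent, running it again after a state flip is harmless, which is what lets every DAG shape end in a Reconcile.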

The admin-socket responses carry the submitted DAG ids; hivectl polls HostRequest::QueueDag (~1s) and prints a progress line per DAG — roll-up glyph, template, agent, node chain with the running node's step label — so CLI verbs block until their jobs finish (--no-wait opts out; failures exit non-zero). Nodes appended in-DAG (a MetaLock growing per-agent rebuild subgraphs, a Reconcile fanning its Start/Stop) join the same DAG, so they surface under that DAG's id in the same loop — no separate child DAGs.

Scheduler semantics

A node is ready when it's Queued, every dep is satisfied, and its resources are free. Resources:

Build slots — services.hyperhive.c0re.buildSlots permits (default 1), held by nix-heavy nodes for the node's duration.
Per-agent lifecycle lease — keyed on the node's agent (agent is per-node; a DAG can span agents) and globally exclusive per agent across all DAGs: acquired at a container-affecting node (SetWanted, StopForUpdate, Swap, Signal, Drain, Reconcile, WriteDropin, Create, DeployWindow), held by the owning DAG until it's terminal, so two DAGs never interleave container ops on the same agent. A DAG touching several agents holds one lease per agent. (SetWanted is a store write, not a container op, but takes the lease anyway so a power-op DAG's intent write + reconcile is atomic — two racing ops can't clobber intent before either reconciles.) Lease-exempt: MetaSync, Prebuild, MetaLock, WritePermFile — they touch the store / meta, not the running container, which is exactly why a stop can land while another DAG's prebuild is still building.

Among simultaneously-ready nodes competing for a resource, DAG-submit order wins (FIFO) so bulk operations drain predictably. The scheduler also owns the DAG-lifetime transient guard (dashboard pill + crash-watch suppression), created on lease acquisition and dropped when the DAG settles terminal.
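
The readiness rule and the FIFO tie-break can be sketched as follows. This is a simplified model under stated assumptions: the struct and field names are hypothetical, the MetaWindow resource is omitted, and a DAG is identified by its submit sequence number.

```rust
use std::collections::HashMap;

/// A queued node as the scheduler might see it (hypothetical shape).
pub struct Candidate {
    /// Submit order of the owning DAG; lower means submitted earlier.
    pub dag_seq: u64,
    pub agent: &'static str,
    pub needs_build_slot: bool,
    pub needs_lease: bool,
    pub deps_satisfied: bool,
}

pub struct Resources {
    pub free_build_slots: usize,
    /// agent -> dag_seq of the DAG currently holding that agent's lease.
    pub leases: HashMap<&'static str, u64>,
}

/// Ready = deps satisfied and every resource the node needs is free
/// (or, for the lease, already held by the node's own DAG).
pub fn is_ready(c: &Candidate, r: &Resources) -> bool {
    if !c.deps_satisfied {
        return false;
    }
    if c.needs_build_slot && r.free_build_slots == 0 {
        return false;
    }
    if c.needs_lease {
        if let Some(&holder) = r.leases.get(c.agent) {
            if holder != c.dag_seq {
                return false;
            }
        }
    }
    true
}

/// Among ready nodes, the earliest-submitted DAG wins.
pub fn pick_next<'a>(queued: &'a [Candidate], r: &Resources) -> Option<&'a Candidate> {
    queued.iter().filter(|c| is_ready(c, r)).min_by_key(|c| c.dag_seq)
}
```

Note how a lease-exempt, slot-free node from a later DAG can still run while an earlier DAG waits on a build slot; FIFO only orders nodes that are simultaneously ready.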

The queue is in-memory only and lost on hive-c0re restart — deliberate: desired state is re-derived at boot from the DB + rev markers (see Boot reconcile), so there is no durable-recovery machinery to go wrong.

Cancel, history

Submit-time dedup was removed with the agent-per-node move (a multi-agent DAG has no single agent to key a dedup on), so every submit enqueues a fresh DAG; whether any dedup needs reintroducing is tracked as a follow-up.

Cancel only applies to still-fully-queued DAGs (an in-flight nix build isn't interruptible) — each op is one DAG now, so there are no child DAGs to cascade to. Roll-up state: Failed if any node failed, else Running / Queued / Cancelled / Done. The snapshot retains the 5 most recent terminal DAGs per template.
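
A minimal sketch of the roll-up and the cancel precondition, with hypothetical types. Only "Failed wins" is stated above; the precedence among the remaining states (Running, then Queued, then Cancelled, then Done) is an assumption made for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NodeState { Queued, Running, Done, Failed, Cancelled }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DagState { Queued, Running, Done, Failed, Cancelled }

/// Collapses per-node states into the DAG-level state shown on the dashboard.
pub fn roll_up(nodes: &[NodeState]) -> DagState {
    let any = |s: NodeState| nodes.iter().any(|&n| n == s);
    if any(NodeState::Failed) {
        DagState::Failed
    } else if any(NodeState::Running) {
        DagState::Running
    } else if any(NodeState::Queued) {
        DagState::Queued
    } else if any(NodeState::Cancelled) {
        DagState::Cancelled
    } else {
        DagState::Done
    }
}

/// Cancel is only offered while no node has started.
pub fn is_cancellable(nodes: &[NodeState]) -> bool {
    nodes.iter().all(|&n| n == NodeState::Queued)
}
```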

Approvals

MergeConfigPr approvals ride as a four-node deploy subtree:

DeployWindow (root — build slot + lease + meta window, no work of its own)
├── MergeVerify                     drift gate, fetch, verify_commit
├── DeployApply    AfterOk(verify)  park rollback ref, ff-merge, deploy
└── DeployTail     AfterAny(apply)  compensate, mirror to forge

The root's resources are held across the whole subtree, so the two-phase prepare_deploy / finalize_deploy span keeps its staged flake.lock protected even though the phases are separate nodes. Splitting them buys three things a single opaque node couldn't have: per-phase visibility on the dashboard, a MergeVerify failure that provably mutated nothing, and a compensation step that survives a hive-c0re restart — the pre-merge applied/main is parked in refs/hyperhive/rollback/<approval-id>, not in a local variable, so DeployTail can still undo a half-finished deploy after a crash.

Spawn and UpdateMetaInputs approvals map onto the ordinary spawn / meta-update shapes. The scheduler fires actions::resolve_approval_dag exactly once when any approval-carrying DAG settles terminal — deploys included, since their outcome is now the DAG's own state (including cancelled-while-queued, which fails the approval instead of dangling it).

Wire shape

RebuildQueueChanged { seq, queue: [DagView…] } (event name kept). Each DagView carries the entry-level fields (id, kind = template string, roll-up state, source, reason, timestamps, inputs, approval_id) plus nodes: [NodeView…] — per-node agent, kind, deps, state, step, build_log_id, timestamps, error. There is no DAG-level agent (agent is per-node, so a DAG can span agents); consumers derive a DAG's agent(s) from its nodes. Step labels and build logs are per-node; the dashboard renders the node chain on each queue card and keys the live-log panel off the running node.
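
Because there is no DAG-level agent, a consumer has to derive it. The sketch below shows one way to do that and to find the node the live-log panel would follow; the struct definitions carry only a subset of the fields listed above, and their Rust types are assumptions rather than the actual wire types.

```rust
use std::collections::BTreeSet;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NodeState { Queued, Running, Done, Failed, Cancelled }

/// Subset of the per-node view (field types are assumptions).
pub struct NodeView {
    pub agent: String,
    pub kind: String,
    pub state: NodeState,
    pub build_log_id: Option<u64>,
}

/// Subset of the per-DAG view; note the absence of an `agent` field.
pub struct DagView {
    pub id: u64,
    pub kind: String,
    pub nodes: Vec<NodeView>,
}

/// Distinct agents touched by a DAG, sorted for stable rendering.
pub fn dag_agents(dag: &DagView) -> Vec<&str> {
    let set: BTreeSet<&str> = dag.nodes.iter().map(|n| n.agent.as_str()).collect();
    set.into_iter().collect()
}

/// First running node, if any; a dashboard could key its live log off this.
pub fn running_node(dag: &DagView) -> Option<&NodeView> {
    dag.nodes.iter().find(|n| n.state == NodeState::Running)
}
```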

Container view

container_view.rs maintains an in-memory snapshot of every nixos-container's systemd service state. It is polled on coordinator startup and re-scanned after every lifecycle operation (spawn, rebuild, kill) so the dashboard always reflects the actual container status without a live nixos-container list call on each render.

Boot reconcile

On startup, auto_update::run classifies every agent by rev freshness (the per-agent .{name}.hyperhive-rev marker under /var/lib/hyperhive/applied/ vs the current flake path) and persisted wanted intent, then:

Config path — when any marker is stale, submit one Boot DAG: a MetaLock (hyperhive input bump, non-fatal) that grows an in-DAG Rebuild subgraph for each stale agent whose wanted = Up (topology-sorted, parents first). Stale but wanted-offline agents get no boot-time nix work — their rebuild happens on their next start (the start submit path upgrades a stale start to rebuild+start), which is also why the lock bump runs even when every stale agent is offline: those later start-upgrades must build against the bumped lock. Each child rebuild's tail Reconcile brings the agent (back) up, covering both the running-stale and stopped-but-wanted-up cases.
Power path — every agent whose observed state drifted from wanted gets a plain Reconcile DAG (kind = reconcile, source auto_update).

Booting with no config change performs no meta commit — only reconciles. The sweep reason records the rebuild / deferred / up-to-date counts so the operator sees at a glance how much work the boot triggered. Agents without an agent_power row are seeded from observed state during classification (the one-time migration; thereafter the DB is authoritative).
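
The boot classification can be summarized as a function of three per-agent facts. This is a sketch with hypothetical names; it assumes an agent that gets a boot rebuild needs no separate power-path Reconcile, since the rebuild's tail Reconcile already converges it.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Wanted { Up, Offline }

/// Per-agent inputs to the boot sweep.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct AgentBoot {
    /// Rev marker differs from the current flake path.
    pub stale: bool,
    pub wanted: Wanted,
    pub running: bool,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct BootPlan {
    /// Config path: rebuild subgraph grown under the boot MetaLock.
    pub rebuild: bool,
    /// Stale but wanted-offline: rebuild deferred to the next start.
    pub deferred: bool,
    /// Power path: plain Reconcile for observed-vs-wanted drift.
    pub reconcile: bool,
}

pub fn classify(a: AgentBoot) -> BootPlan {
    let rebuild = a.stale && a.wanted == Wanted::Up;
    let deferred = a.stale && a.wanted == Wanted::Offline;
    let drifted = (a.wanted == Wanted::Up) != a.running;
    BootPlan { rebuild, deferred, reconcile: drifted && !rebuild }
}

/// The lock bump runs when any marker is stale, even if every stale agent is offline.
pub fn needs_lock_bump(agents: &[AgentBoot]) -> bool {
    agents.iter().any(|a| a.stale)
}
```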

Meta flake

meta.rs owns the single coordinator-managed flake at /var/lib/hyperhive/meta/. This flake consumes every agent's applied config repo as a flake input and exports one nixosConfiguration per agent. Container lifecycle ops drive the lock file so meta's git log is the system-wide deploy audit trail.

Key operations:

sync_agents (idempotent) — render flake.nix for the current agent set, init the repo on first call, relock if the rendered contents changed, commit. Called by spawn / destroy / startup migration.
prepare_deploy + finalize_deploy / abort_deploy — two-phase for the MergeConfigPr deploy path so a failed nixos-container update leaves no orphan commit in meta. Prepare writes the new lock without committing; finalize commits with the deploy message; abort restores the lock.
lock_update_hyperhive — one-shot for the boot-reconcile path (the sweep DAG's MetaLock node): bumps the hyperhive input lock and commits; on completion the MetaLock node grows the per-agent rebuild subgraphs into the same DAG.

Every public meta.rs operation takes the module's internal META_LOCK mutex, so concurrent job-queue nodes (and the approval deploy pipeline) never race on the repo's .git/index.lock.
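
The two-phase deploy contract can be modelled in memory to make the "no orphan commit" property concrete. This is a toy model only: `MetaRepo` and its fields are hypothetical stand-ins for the git working tree and log, and the real functions operate on the flake at /var/lib/hyperhive/meta/ under META_LOCK.

```rust
/// Toy stand-in for the meta repo: one lock file and a commit log.
pub struct MetaRepo {
    pub committed_lock: String,
    pub working_lock: String,
    pub log: Vec<String>,
}

impl MetaRepo {
    pub fn new(lock: &str) -> Self {
        MetaRepo { committed_lock: lock.to_string(), working_lock: lock.to_string(), log: Vec::new() }
    }

    /// Phase 1: write the new lock into the working tree, commit nothing.
    pub fn prepare_deploy(&mut self, new_lock: &str) {
        self.working_lock = new_lock.to_string();
    }

    /// Phase 2 (success): commit the staged lock with the deploy message.
    pub fn finalize_deploy(&mut self, message: &str) {
        self.committed_lock = self.working_lock.clone();
        self.log.push(message.to_string());
    }

    /// Phase 2 (failure): restore the lock; the log is untouched.
    pub fn abort_deploy(&mut self) {
        self.working_lock = self.committed_lock.clone();
    }

    /// True while a prepared lock is staged but not yet committed.
    pub fn is_dirty(&self) -> bool {
        self.working_lock != self.committed_lock
    }
}
```

The dirty span between prepare and finalize is exactly the window the MetaWindow resource protects from other meta mutations.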

Container lifecycle (`lifecycle.rs`)

Every container operation ultimately calls into lifecycle.rs. Two paths exist: rebuild (existing container) and spawn (first-time creation).

Rebuild path (existing container)

Goal: apply the new system profile and any EXTRA_NSPAWN_FLAGS / drop-in changes in a single start, with minimum downtime.

nixos-container update only runs systemctl reload container@<c> when the container is already up (per isContainerRunning in nixos-container.pl). Stopping first turns update into a boot-style operation: it builds + nix-env --sets the new profile and skips the in-container switch-to-configuration. The subsequent start then applies both the new profile and any EXTRA_NSPAWN_FLAGS changes in one go, rather than the double-bounce a live update would trigger.

Sequence for a rebuild DAG (each step is its own queue node):

MetaSync — rebuild-dir prep, meta sync_agents, and (unless this is a meta-update cascade child) the per-agent relock. Short, and the only step that mutates the meta repo, so it is the only one holding the global deploy window.
Prebuild — build the new system.build.toplevel before stopping. The container keeps serving the previous generation while eval + fetch + build happen out-of-band. nixos-container update then finds the result cached and skips straight to the profile-swap. Build failures surface here, before the running container is touched. (Runs even for a stopped container — same total nix work, one uniform DAG shape.)
StopForUpdate — bring the container down (noop when already stopped).
Swap — nixos-container update --flake meta#<name> profile-swap (near-instant after the prebuild).
PostSwap — the Ok-only bookkeeping tail (rev marker, forge/matrix sync, manager kick, rescan, meta-inputs snapshot); runs only after a successful swap.
Reconcile — boot into the new generation when wanted = Up; the in-container activation script transitions old → new. Holds no build slot, so the next DAG's Prebuild overlaps the container boot — the old "deferred start" split, now structural.

The approval deploy uses this same chain rather than a rebuild path of its own. Its DeployApply node does not build: it merges, opens the two-phase meta deploy, and returns the chain above as a subgraph the scheduler grafts into the live DAG under that node. A FinalizeDeploy node gated on the graft's completion then plants the deploy tag — so "did the agent come back up?" is answered by Reconcile succeeding, the same way it is for every other rebuild, instead of by a fused inline start.

The grafted nodes land inside DeployWindow's subtree, so they re-enter the meta window and build slot it already holds rather than deadlocking against it.

Cold-start fallback

start after update can exit non-zero when packages are removed between generations: the old generation's activation script references units that no longer exist in the new closure, so systemd reports a failure and the container may be left half-started.

Fallback: stop (graceful SIGTERM drain) → kill (SIGKILL any lingering processes) → start (clean cold-start, no generation transition, new activation runs cleanly). Both errors are preserved and surfaced if the cold-start also fails. The fallback lives in lifecycle::start_with_fallback, used by every Reconcile node's start action.
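
The fallback sequence can be sketched against a small trait so it is testable without a container. The `ContainerOps` trait is hypothetical, and ignoring stop/kill errors during the fallback is an assumption; the real lifecycle::start_with_fallback may handle them differently.

```rust
/// Hypothetical abstraction over the nixos-container start/stop/kill calls.
pub trait ContainerOps {
    fn start(&mut self) -> Result<(), String>;
    fn stop(&mut self) -> Result<(), String>;
    fn kill(&mut self) -> Result<(), String>;
}

/// Try a normal start; on failure fall back to stop -> kill -> cold start.
/// If the cold start also fails, both start errors are kept in the message.
pub fn start_with_fallback<C: ContainerOps>(c: &mut C) -> Result<(), String> {
    let first = match c.start() {
        Ok(()) => return Ok(()),
        Err(e) => e,
    };
    // Best-effort cleanup of a possibly half-started container
    // (assumption: failures here do not abort the fallback).
    let _ = c.stop();
    let _ = c.kill();
    c.start().map_err(|second| {
        format!("start failed: {first}; cold-start fallback also failed: {second}")
    })
}
```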

Spawn path (new container)

For a first-time create, nixos-container create is atomic: if the build fails, no container record is left to clean up. A separate prebuild would just duplicate the eval, so it's skipped. Sequence: create --flake meta#<name> → write nspawn flags → systemctl daemon-reload → start.

Prebuild attr path

nix build does not auto-resolve meta#<name> against nixosConfigurations the way nixos-container does internally. The explicit attr path <flake-root>#nixosConfigurations.<name>.config.system.build.toplevel is required; using the bare meta#<name> ref would make nix look in packages, legacyPackages, or the flake root directly — none of which exist in the rendered meta flake.
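
As a small illustration, the explicit installable can be built with a helper like the one below. The function name is hypothetical, and the sketch assumes agent names need no quoting inside the attr path.

```rust
/// Builds the explicit flake installable for an agent's system toplevel.
/// `flake_root` is the meta flake reference, e.g. its path on disk.
pub fn prebuild_attr(flake_root: &str, agent: &str) -> String {
    format!("{flake_root}#nixosConfigurations.{agent}.config.system.build.toplevel")
}
```

For the meta flake at /var/lib/hyperhive/meta and an agent named alice this yields `/var/lib/hyperhive/meta#nixosConfigurations.alice.config.system.build.toplevel`, which is what would be handed to nix build in place of the bare `meta#alice`.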

Host-level resource + performance options

A handful of services.hyperhive.c0re.* options tune container resource limits, build parallelism, and first-spawn latency.

Build slots

buildSlots (default 1) sets how many nix-heavy job-queue nodes (prebuilds, profile swaps, first-spawn creates, meta lock bumps) run concurrently. The default serializes all heavy nix work like the pre-DAG rebuild queue did; raise it on hosts with the cores/RAM to build several agent toplevels at once. Per-agent correctness is independent of the count — each agent's container-affecting ops serialize on its lifecycle lease regardless.

Container resource limits

agentCpuQuota and agentMemoryMax map directly to systemd CPUQuota= and MemoryMax=. hive-c0re writes a container@h-<name>.service.d/ drop-in file on each spawn and rebuild, so changes take effect on the next lifecycle op without requiring a host rebuild.

Option	Default	Description
`services.hyperhive.c0re.agentCpuQuota`	`"200%"`	CPU cap per agent, as a percentage of one core (`"200%"` = 2 cores). Raise if agents hit CPU limits during builds or heavy tool use.
`services.hyperhive.c0re.agentMemoryMax`	`"4G"`	Memory cap per agent. Raise for agents that run large nix builds or hold big in-memory data.
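
A minimal sketch of what the drop-in rendering might look like, using the two systemd directives named above. The function names and the exact file layout are assumptions; only the `container@h-<name>.service.d/` directory name and the CPUQuota= / MemoryMax= keys come from these notes.

```rust
/// Drop-in directory name for an agent's container unit.
pub fn dropin_dir(agent: &str) -> String {
    format!("container@h-{agent}.service.d")
}

/// Renders the [Service] section carrying the per-agent limits.
pub fn render_limits_dropin(cpu_quota: &str, memory_max: &str) -> String {
    format!("[Service]\nCPUQuota={cpu_quota}\nMemoryMax={memory_max}\n")
}
```

With the defaults this renders a `[Service]` section containing `CPUQuota=200%` and `MemoryMax=4G`; systemd picks the file up after the daemon-reload that the WriteDropin node performs.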

For a hive-wide cap across all containers together, set systemd.slices.machine.sliceConfig.CPUQuota in your NixOS config (slice units take sliceConfig, not serviceConfig); all nspawn containers live in machine.slice.

Pre-building agent templates

preBuildAgentTemplates (default false) causes the host NixOS build to pre-fetch the per-container system closures (agent-base + manager toplevels) into /nix/store, instead of leaving that work to the first nixos-container start. The trade-off:

On (recommended for x86_64 hosts that care about first-spawn latency): the first nixos-container start for any new agent completes in seconds because nothing is left to fetch. Cost: the full nixpkgs runtime closure + claude-code + the harness binary are added to the host system closure (low single-digit GB additional).
Off (default): the host closure stays lean; the first spawn does all the eval + fetch work at runtime (can take several minutes on a fresh store).

Note: toplevels are pinned to x86_64-linux. Enabling on an aarch64 host forces a cross-compilation or remote-builder build, which is almost never desired. Leave off on non-x86 hosts.