hive-c0re coordinator internals
Architecture notes for the hive-c0re coordinator daemon's internal subsystems.
For the public API surface (dashboard, socket protocol) see docs/conventions.md
and docs/persistence.md.
Rebuild queue
Every long-running container/meta operation (rebuild, meta-update, first-spawn)
goes through the global rebuild queue (hive-c0re/src/rebuild_queue.rs). A single
background worker drains it in FIFO order so two nixos-container update runs on
the same agent never overlap, and a fresh agent rebuild never races a meta-update's
lock bump.
Why one queue
Before the rebuild queue landed, four independent call paths could fire
auto_update::rebuild_agent concurrently:
- Dashboard manual rebuild button
- update-all / meta-update cascade
- Approval handler (apply-commit / spawn)
- Startup auto-update sweep
Nothing serialised them. nix-daemon serialises the actual store ops, but the rest
of rebuild_agent (token sync, kick, rescan, lock-bump emit) interleaved
unpredictably. The single-worker queue gives operators a visible, ordered runway and
lets the UI render "what's about to happen" instead of "something might be happening
somewhere."
Queue kinds
| Kind | Description |
|---|---|
| Rebuild | Single-agent rebuild. Covers manual, approval-driven, auto-update, and meta-update cascade variants; all funnel through the same path. |
| MetaUpdate | nix flake update on the meta flake. The worker runs the lock bump itself, then enqueues a cascade of Rebuild entries with parent_id set to the meta-update's id. |
| Spawn | First-deploy of a new agent (approval-driven). Same serialisation as Rebuild from the operator's POV. |
| Destroy | For future use (destroy --purge does real I/O). Variant exists so the wire shape doesn't change later; not currently routed through the queue. |
| Restart | Stop + start a container without touching config (~5-10s). Routed through the queue so it serialises against in-flight rebuilds for the same agent, which prevents a restart racing a rebuild mid-flight. Sources: dashboard ↺ button, manager restart MCP tool. |
| PermChange | Write a tool-group or capability change to the shared JSON file (tool-groups.json / capabilities.json), then rebuild the agent so the updated HIVE_TOOL_GROUPS / HIVE_CAPABILITIES env var takes effect. Serialising the file write through the queue prevents concurrent dashboard batch-apply actions from racing on the shared file. After a successful file write, emits a CapabilitiesChanged or ToolGroupsChanged SSE snapshot so the P3RM1SS10NS tab updates live. |

Intentionally not queued (sub-second ops): start, stop, kill.
Dedup
Enqueueing a (kind, agent) pair that already has a Queued entry returns the existing
entry's id and appends the new request as an "also requested by …" line. Running
entries do not dedup: a re-queue during a run is legitimate (something changed
since the current run started).
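The dedup rule can be sketched as follows. This is a minimal illustration and not the code in rebuild_queue.rs: the Queue, Entry, and enqueue names and signatures are hypothetical, and the key is simplified to (kind, agent), whereas the real key carries more fields (see Dependency tracking).

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Kind {
    Rebuild,
    Restart,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum State {
    Queued,
    Running,
    Done,
}

#[derive(Debug)]
pub struct Entry {
    pub id: u64,
    pub kind: Kind,
    pub agent: String,
    pub state: State,
    /// One line per deduplicated request ("also requested by ...").
    pub also_requested_by: Vec<String>,
}

/// Hypothetical in-memory queue; the real one lives in rebuild_queue.rs.
#[derive(Default)]
pub struct Queue {
    pub entries: Vec<Entry>,
    next_id: u64,
}

impl Queue {
    /// Enqueue a request and return the id of the entry that will serve it.
    pub fn enqueue(&mut self, kind: Kind, agent: &str, requester: &str) -> u64 {
        // Only entries that are still Queued absorb duplicates. A Running
        // entry is skipped on purpose, so a re-queue during a run creates
        // a fresh entry.
        if let Some(existing) = self
            .entries
            .iter_mut()
            .find(|e| e.state == State::Queued && e.kind == kind && e.agent == agent)
        {
            existing.also_requested_by.push(requester.to_string());
            return existing.id;
        }
        self.next_id += 1;
        let id = self.next_id;
        self.entries.push(Entry {
            id,
            kind,
            agent: agent.to_string(),
            state: State::Queued,
            also_requested_by: Vec::new(),
        });
        id
    }
}
```

Enqueueing the same (kind, agent) twice while the first entry is still Queued yields one entry and one "also requested by" line; once that entry is Running, a third request gets its own id.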
Sources
| Source | Meaning |
|---|---|
| Manual | Operator clicked rebuild / update-all / meta-update on the dashboard, or any other direct human action (CLI, manager tool). |
| AutoUpdate | Legacy startup-sweep source (flat, no parent). Replaced by StartupSweep for new boots. |
| StartupSweep | Child of a StartupSweep parent entry; boot-time per-agent rebuild with the sweep as the visual group header. |
| Approval | Triggered by an operator-approved ApprovalKind::{Spawn, ApplyCommit}. |

Cascade parent tracking
MetaUpdate and StartupSweep entries fan out Rebuild children, each carrying
parent_id = <parent_id>. The dashboard groups children under their parent
in the queue panel so the operator sees the whole cascade as a tree,
not a flat list.
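A minimal sketch of the grouping the dashboard might perform, assuming a flat list of entries that carry only an id and an optional parent_id. The Entry shape and the group_by_parent helper are hypothetical. Treating a child whose parent is no longer in the list as a top-level row is also an assumption, not documented behaviour.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical, trimmed-down queue entry.
pub struct Entry {
    pub id: u64,
    pub parent_id: Option<u64>,
}

/// Returns (top-level id, child ids) pairs in queue order.
pub fn group_by_parent(entries: &[Entry]) -> Vec<(u64, Vec<u64>)> {
    let ids: HashSet<u64> = entries.iter().map(|e| e.id).collect();
    let mut children: BTreeMap<u64, Vec<u64>> = BTreeMap::new();
    let mut roots = Vec::new();
    for e in entries {
        match e.parent_id {
            // Parent is present in the list: nest the entry under it.
            Some(p) if ids.contains(&p) => children.entry(p).or_default().push(e.id),
            // No parent, or the parent is gone: show as a top-level row.
            _ => roots.push(e.id),
        }
    }
    roots
        .into_iter()
        .map(|id| (id, children.remove(&id).unwrap_or_default()))
        .collect()
}
```

This handles a single level of nesting only, which matches the parent/child cascade described above.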
Step labels
Each queue entry has a mutable step: Option<String> field that the worker updates
as it progresses through lifecycle phases ("nix build", "nixos-container stop",
"nixos-container update", "nixos-container start"). The dashboard polls
/api/state and renders the current step beneath the running entry so the operator
can see which phase is taking time.
Dependency tracking
Each entry carries a depends_on: Vec<u64> field. The worker skips entries whose
dependencies are not yet resolved — a dependency is resolved when its id is either
in the queue as a terminal entry (Done / Failed / Cancelled) or no longer in
the queue at all (evicted by trim_history, which only evicts terminals).
Use cases:
- Chain a Rebuild after an explicit prerequisite step without coupling them through the parent_id cascade mechanism.
- Sequence a PermChange + Rebuild pair where the rebuild must not start until the perm-file write commits (already handled by the single-worker FIFO today, but depends_on allows explicit cross-kind sequencing when parallel workers are added).

depends_on is part of the dedup key: two entries with the same (kind, agent, parent_id, inputs, approval_id) but different dep sets are treated as distinct work.
Worker re-notification: the worker drains take_next() in a tight loop after
each entry finishes. When a dep entry transitions to terminal, the loop re-evaluates
the queue immediately, so downstream entries are unblocked with no extra wakeup. No
additional notify_one() call is needed.
Circular-dep caveat: if A depends on B and B depends on A, neither entry ever becomes runnable — the worker skips both indefinitely with no error. Callers must ensure acyclic dep graphs. Cycle detection is deferred to a future iteration (when parallel workers make a stuck queue more visible).
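The resolution rule and the circular-dep caveat can be illustrated with a small model. This is a sketch under assumptions: Entry is reduced to the fields that matter here, and next_runnable is a hypothetical stand-in for the selection step inside take_next(), not its real signature.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum State {
    Queued,
    Running,
    Done,
    Failed,
    Cancelled,
}

impl State {
    pub fn is_terminal(self) -> bool {
        matches!(self, State::Done | State::Failed | State::Cancelled)
    }
}

/// Hypothetical, trimmed-down queue entry.
pub struct Entry {
    pub id: u64,
    pub state: State,
    pub depends_on: Vec<u64>,
}

/// A dependency is resolved when it is terminal or no longer in the queue.
pub fn dep_resolved(entries: &[Entry], dep: u64) -> bool {
    entries
        .iter()
        .find(|e| e.id == dep)
        .map_or(true, |e| e.state.is_terminal())
}

/// Id of the first Queued entry (FIFO order) whose dependencies are all
/// resolved, or None when every Queued entry is still blocked.
pub fn next_runnable(entries: &[Entry]) -> Option<u64> {
    entries
        .iter()
        .find(|e| {
            e.state == State::Queued
                && e.depends_on.iter().all(|d| dep_resolved(entries, *d))
        })
        .map(|e| e.id)
}
```

With two entries that depend on each other, next_runnable returns None on every pass, which is the silent stall described in the caveat. An entry that depends on an id absent from the queue is runnable immediately.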
Container view
container_view.rs maintains an in-memory snapshot of every nixos-container's
systemd service state. It is polled on coordinator startup and re-scanned after
every lifecycle operation (spawn, rebuild, kill) so the dashboard always reflects
the actual container status without a live nixos-container list call on each
render.
Auto-update sweep
On startup, auto_update.rs rebuilds every known container unconditionally.
nixos-container update is a no-op at the nix level when nothing changed (same
store path), so the cost is low, and the unconditional pass avoids rev-marker
staleness: every agent needs an update pass whenever a meta commit lands.
auto_update::run enqueues a single StartupSweep parent entry (kind = startup_sweep, agent = "hyperhive") followed by per-agent Rebuild children
(source = startup_sweep, parent_id = sweep_id). The worker processes the parent
by bumping the meta hyperhive input lock, then transitions it to Done. The child
rebuilds drain sequentially through the queue; the dashboard renders them nested
under the parent so the operator can see the whole boot-time sweep in one group.
Before this change, each boot enqueued flat Rebuild entries with
source = AutoUpdate and no parent — visible but ungrouped.
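As a rough sketch, the entries produced for one boot could look like the output of the helper below. The Entry struct, the string-typed kind and source fields, the id allocation, and the startup_sweep_entries function are all hypothetical. The source value on the parent entry is likewise an assumption, since only the children's source is stated above.

```rust
/// Hypothetical, trimmed-down queue entry.
#[derive(Debug, PartialEq, Eq)]
pub struct Entry {
    pub id: u64,
    pub kind: &'static str,
    pub agent: String,
    pub source: &'static str,
    pub parent_id: Option<u64>,
}

/// Build one StartupSweep parent followed by one Rebuild child per agent.
/// Ids are handed out sequentially starting at `first_id` (an assumption).
pub fn startup_sweep_entries(first_id: u64, agents: &[&str]) -> Vec<Entry> {
    let sweep_id = first_id;
    let mut out = vec![Entry {
        id: sweep_id,
        kind: "startup_sweep",
        agent: "hyperhive".to_string(),
        source: "startup_sweep", // assumption: the parent's source is not documented
        parent_id: None,
    }];
    for (i, agent) in agents.iter().enumerate() {
        out.push(Entry {
            id: sweep_id + 1 + i as u64,
            kind: "rebuild",
            agent: (*agent).to_string(),
            source: "startup_sweep",
            parent_id: Some(sweep_id),
        });
    }
    out
}
```

Because the parent comes first in FIFO order, the lock bump runs before any child rebuild starts.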
Meta flake
meta.rs owns the single coordinator-managed flake at /var/lib/hyperhive/meta/.
This flake consumes every agent's applied config repo as a flake input and exports
one nixosConfiguration per agent. Container lifecycle ops drive the lock file so
meta's git log is the system-wide deploy audit trail.
Key operations:
- sync_agents (idempotent): render flake.nix for the current agent set, init the repo on first call, relock if the rendered contents changed, commit. Called by spawn / destroy / startup migration.
- prepare_deploy + finalize_deploy / abort_deploy: two-phase for the RequestApplyCommit path so a failed nixos-container update leaves no orphan commit in meta. Prepare writes the new lock without committing; finalize commits with the deploy message; abort restores the lock.
- lock_update_hyperhive: one-shot for the auto-update path. Bumps the hyperhive input lock, commits, cascades agent rebuilds.

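The two-phase deploy can be modelled in memory to show why a failed update leaves no orphan commit. This is a sketch only: the Meta struct, its fields, the method signatures, and the deploy wrapper are hypothetical, and the real meta.rs operates on a git repository and a lock file.

```rust
use std::mem;

/// Hypothetical in-memory stand-in for the meta repo.
pub struct Meta {
    pub lock: String,
    pub commits: Vec<String>,
    saved_lock: Option<String>,
}

impl Meta {
    pub fn new(lock: &str) -> Self {
        Meta { lock: lock.to_string(), commits: Vec::new(), saved_lock: None }
    }

    /// Phase 1: write the new lock without committing; remember the old one.
    pub fn prepare_deploy(&mut self, new_lock: &str) {
        self.saved_lock = Some(mem::replace(&mut self.lock, new_lock.to_string()));
    }

    /// Phase 2 (success): commit with the deploy message.
    pub fn finalize_deploy(&mut self, message: &str) {
        if self.saved_lock.take().is_some() {
            self.commits.push(message.to_string());
        }
    }

    /// Phase 2 (failure): restore the previous lock; nothing is committed.
    pub fn abort_deploy(&mut self) {
        if let Some(old) = self.saved_lock.take() {
            self.lock = old;
        }
    }
}

/// Run `update` (standing in for nixos-container update) between the phases.
pub fn deploy(
    meta: &mut Meta,
    new_lock: &str,
    message: &str,
    update: impl FnOnce() -> Result<(), String>,
) -> Result<(), String> {
    meta.prepare_deploy(new_lock);
    match update() {
        Ok(()) => {
            meta.finalize_deploy(message);
            Ok(())
        }
        Err(e) => {
            meta.abort_deploy();
            Err(e)
        }
    }
}
```

A failing update leaves both the lock and the commit list untouched, so the git log only ever records deploys that actually landed.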
Container lifecycle (lifecycle.rs)
Every container operation ultimately calls into lifecycle.rs. Two paths exist:
rebuild (existing container) and spawn (first-time creation).
Rebuild path (existing container)
Goal: apply the new system profile and any EXTRA_NSPAWN_FLAGS / drop-in changes
in a single start, with minimum downtime.
nixos-container update only runs systemctl reload container@<c> when the
container is already up (per isContainerRunning in nixos-container.pl). Stopping
first turns update into a boot-style operation: it builds + nix-env --sets the
new profile and skips the in-container switch-to-configuration. The subsequent
start then applies both the new profile and any EXTRA_NSPAWN_FLAGS changes in
one go, rather than the double-bounce a live update would trigger.
Sequence for a running container:
1. prebuild_toplevel: build the new system.build.toplevel before stopping. The container keeps serving the previous generation while eval + fetch + build happen out-of-band. nixos-container update then finds the result cached and skips straight to the profile-swap. Build failures surface here, before the running container is touched.
2. nixos-container stop: bring the container down.
3. nixos-container update --flake meta#<name>: profile-swap (near-instant after the prebuild).
4. nixos-container start: boot into the new generation; the in-container activation script transitions old → new.

If the container is already stopped, step 1 is skipped (no downtime to shave — no point evaluating the flake twice).
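The sequence above can be written down as a small planning function. This is a sketch, not the code in lifecycle.rs: the Step enum and rebuild_plan are hypothetical names. Two points are assumptions: the stop step is omitted for an already-stopped container (the text only states that step 1 is skipped), and the prebuild is mapped to the "nix build" step label.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Step {
    PrebuildToplevel,
    Stop,
    Update,
    Start,
}

impl Step {
    /// Step label as it might appear on the queue entry.
    /// Assumption: the prebuild corresponds to the "nix build" label.
    pub fn label(self) -> &'static str {
        match self {
            Step::PrebuildToplevel => "nix build",
            Step::Stop => "nixos-container stop",
            Step::Update => "nixos-container update",
            Step::Start => "nixos-container start",
        }
    }
}

/// Ordered steps for rebuilding an existing container.
pub fn rebuild_plan(running: bool) -> Vec<Step> {
    if running {
        // Build first so failures surface before the container is touched.
        vec![Step::PrebuildToplevel, Step::Stop, Step::Update, Step::Start]
    } else {
        // Already stopped: no downtime to shave, so no separate prebuild.
        // Assumption: there is nothing to stop either.
        vec![Step::Update, Step::Start]
    }
}
```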
Cold-start fallback
start after update can exit non-zero when packages are removed between
generations: the old-generation activation script references units that no longer
exist in the new closure, causing systemd to exit non-zero. The container may be
half-started at that point.
Fallback: stop (graceful SIGTERM drain) → kill (SIGKILL any lingering processes)
→ start (clean cold-start, no generation transition, new activation runs cleanly).
Both errors are preserved and surfaced if the cold-start also fails.
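A minimal sketch of the fallback control flow, assuming the container commands sit behind a trait. The Container trait, start_with_fallback, and FakeContainer are hypothetical names. Ignoring errors from the intermediate stop and kill is an assumption, and the error-message format is illustrative.

```rust
use std::collections::VecDeque;

/// Hypothetical abstraction over the nixos-container commands.
pub trait Container {
    fn start(&mut self) -> Result<(), String>;
    fn stop(&mut self) -> Result<(), String>;
    fn kill(&mut self) -> Result<(), String>;
}

/// Try a normal start; on failure, fall back to stop, kill, then a cold start.
/// If the cold start also fails, both errors are reported.
pub fn start_with_fallback<C: Container>(c: &mut C) -> Result<(), String> {
    let first = match c.start() {
        Ok(()) => return Ok(()),
        Err(e) => e,
    };
    // The container may be half-started: drain gracefully, then force-kill.
    // Assumption: failures here are best-effort and do not abort the fallback.
    let _ = c.stop();
    let _ = c.kill();
    c.start().map_err(|second| {
        format!("start failed: {}; cold-start fallback failed: {}", first, second)
    })
}

/// Test double that replays scripted results for start() and records calls.
pub struct FakeContainer {
    pub start_results: VecDeque<Result<(), String>>,
    pub calls: Vec<&'static str>,
}

impl Container for FakeContainer {
    fn start(&mut self) -> Result<(), String> {
        self.calls.push("start");
        self.start_results.pop_front().unwrap_or(Ok(()))
    }
    fn stop(&mut self) -> Result<(), String> {
        self.calls.push("stop");
        Ok(())
    }
    fn kill(&mut self) -> Result<(), String> {
        self.calls.push("kill");
        Ok(())
    }
}
```

With a fake whose first start fails and second succeeds, the recorded call order is start, stop, kill, start.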
Spawn path (new container)
For a first-time create, nixos-container create is atomic: if the build fails,
no container record is left to clean up. A separate prebuild would just duplicate
the eval, so it's skipped. Sequence: create --flake meta#<name> → write nspawn
flags → systemctl daemon-reload → start.
Prebuild attr path
nix build does not auto-resolve meta#<name> against nixosConfigurations the
way nixos-container does internally. The explicit attr path
<flake-root>#nixosConfigurations.<name>.config.system.build.toplevel is required;
using the bare meta#<name> ref would make nix look in packages, legacyPackages,
or the flake root directly — none of which exist in the rendered meta flake.
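Building the explicit installable is a one-line string format. The helper below is a hypothetical illustration (toplevel_attr is not a name from the codebase), and it assumes agent names are plain identifiers that need no quoting inside a Nix attribute path.

```rust
/// Full attr path for an agent's system closure, suitable for `nix build`.
/// Assumes `name` contains no dots or other characters that need quoting.
pub fn toplevel_attr(flake_root: &str, name: &str) -> String {
    format!(
        "{}#nixosConfigurations.{}.config.system.build.toplevel",
        flake_root, name
    )
}
```

For the meta flake at /var/lib/hyperhive/meta and a hypothetical agent named alice, this yields /var/lib/hyperhive/meta#nixosConfigurations.alice.config.system.build.toplevel.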
Host-level resource + performance options
Three services.hyperhive.c0re.* options tune container resource
limits and first-spawn latency. All three apply uniformly to every
agent container.
Container resource limits
agentCpuQuota and agentMemoryMax map directly to systemd
CPUQuota= and MemoryMax=. hive-c0re writes a
container@h-<name>.service.d/ drop-in file on each spawn and
rebuild, so changes take effect on the next lifecycle op without
requiring a host rebuild.
| Option | Default | Description |
|---|---|---|
| services.hyperhive.c0re.agentCpuQuota | "200%" | CPU cap per agent, as a percentage of one core ("200%" = 2 cores). Raise if agents hit CPU limits during builds or heavy tool use. |
| services.hyperhive.c0re.agentMemoryMax | "4G" | Memory cap per agent. Raise for agents that run large nix builds or hold big in-memory data. |

For a hive-wide cap across all containers together, set
systemd.slices.machine.serviceConfig.CPUQuota in your NixOS
config — all nspawn containers live in machine.slice.
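For the per-agent limits, the drop-in contents might be rendered along these lines. This is a sketch: render_limits_dropin and dropin_dir are hypothetical helpers, and the drop-in's file name and parent directory are not specified here, so they are left out.

```rust
/// Body of a systemd drop-in that applies the per-agent limits.
pub fn render_limits_dropin(cpu_quota: &str, memory_max: &str) -> String {
    format!(
        "[Service]\nCPUQuota={}\nMemoryMax={}\n",
        cpu_quota, memory_max
    )
}

/// Drop-in directory name for an agent's container unit.
pub fn dropin_dir(name: &str) -> String {
    format!("container@h-{}.service.d", name)
}
```

With the defaults, the rendered body sets CPUQuota=200% and MemoryMax=4G under a [Service] section.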
Pre-building agent templates
preBuildAgentTemplates (default false) causes the host NixOS
build to pre-fetch the per-container system closures
(agent-base + manager toplevels) into /nix/store, instead of
leaving that work to the first nixos-container start. The
trade-off:
- On (recommended for x86_64 hosts that care about first-spawn latency): the first nixos-container start for any new agent completes in seconds because nothing is left to fetch. Cost: the full nixpkgs runtime closure + claude-code + the harness binary are added to the host system closure (low single-digit GB additional).
- Off (default): the host closure stays lean; the first spawn does all the eval + fetch work at runtime (can take several minutes on a fresh store).

Note: toplevels are pinned to x86_64-linux. Enabling on an
aarch64 host forces a cross-compilation or remote-builder build,
which is almost never desired. Leave off on non-x86 hosts.
See also
- docs/approvals.md: approval flow + scheduled prompts
- docs/persistence.md: SQLite schema, state-dir layout
- docs/conventions.md: wire protocol, recipient sentinels
- docs/agent-hierarchy.md: topology and parent/child relations