hive-c0re coordinator internals

Architecture notes for the hive-c0re coordinator daemon's internal subsystems. For the public API surface (dashboard, socket protocol) see docs/conventions.md and docs/persistence.md.


Rebuild queue

Every long-running container/meta operation (rebuild, meta-update, first-spawn) goes through the global rebuild queue (hive-c0re/src/rebuild_queue.rs). A single background worker drains it in FIFO order so two nixos-container update runs on the same agent never overlap, and a fresh agent rebuild never races a meta-update's lock bump.

Why one queue

Before the rebuild queue landed, four independent call paths could fire auto_update::rebuild_agent concurrently:

Nothing serialised them. nix-daemon serialises the actual store ops, but the rest of rebuild_agent (token sync, kick, rescan, lock-bump emit) interleaved unpredictably. The single-worker queue gives operators a visible, ordered runway and lets the UI render "what's about to happen" instead of "something might be happening somewhere."

Queue kinds

Kind | Description
Rebuild | Single-agent rebuild. Covers manual, approval-driven, auto-update, and meta-update cascade variants; all funnel through the same path.
MetaUpdate | nix flake update on the meta flake. The worker runs the lock bump itself, then enqueues a cascade of Rebuild entries with parent_id set to the meta-update's id.
Spawn | First-deploy of a new agent (approval-driven). Same serialisation as Rebuild from the operator's POV.
Destroy | For future use (destroy --purge does real I/O). Variant exists so the wire shape doesn't change later; not currently routed through the queue.
Restart | Stop + start a container without touching config (~5-10s). Routed through the queue so it serialises against in-flight rebuilds for the same agent, which prevents a restart from racing a rebuild mid-flight. Sources: dashboard ↺ button, manager restart MCP tool.
PermChange | Write a tool-group or capability change to the shared JSON file (tool-groups.json / capabilities.json), then rebuild the agent so the updated HIVE_TOOL_GROUPS / HIVE_CAPABILITIES env var takes effect. Serialising the file write through the queue prevents concurrent dashboard batch-apply actions from racing on the shared file. After a successful file write, the worker emits a CapabilitiesChanged or ToolGroupsChanged SSE snapshot so the P3RM1SS10NS tab updates live.

Intentionally not queued (sub-second ops): start, stop, kill.

Dedup

Enqueueing a (kind, agent) pair that already has a Queued entry returns the existing entry's id and appends the new request as an "also requested by …" line. Running entries do not dedup: a re-queue during a run is legitimate, because something may have changed since the current run started.
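The dedup rule can be sketched as follows. This is a minimal illustration, not the code in rebuild_queue.rs: the Queue, Entry, and State types, the also_requested_by field, and the requester argument are all hypothetical names, and the key is simplified to (kind, agent) as described in this section.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Kind { Rebuild, MetaUpdate, Spawn, Restart, PermChange }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum State { Queued, Running, Done, Failed, Cancelled }

#[derive(Debug)]
pub struct Entry {
    pub id: u64,
    pub kind: Kind,
    pub agent: String,
    pub state: State,
    /// Hypothetical field: requesters folded into this entry by dedup.
    pub also_requested_by: Vec<String>,
}

#[derive(Default)]
pub struct Queue {
    pub entries: Vec<Entry>,
    next_id: u64,
}

impl Queue {
    /// Returns the id of the entry that will perform the requested work.
    pub fn enqueue(&mut self, kind: Kind, agent: &str, requester: &str) -> u64 {
        // Only Queued entries dedup. A Running entry may have started before
        // the change that triggered this request, so it gets a fresh entry.
        if let Some(existing) = self
            .entries
            .iter_mut()
            .find(|e| e.state == State::Queued && e.kind == kind && e.agent == agent)
        {
            existing.also_requested_by.push(requester.to_string());
            return existing.id;
        }
        self.next_id += 1;
        let id = self.next_id;
        self.entries.push(Entry {
            id,
            kind,
            agent: agent.to_string(),
            state: State::Queued,
            also_requested_by: Vec::new(),
        });
        id
    }
}
```

Two enqueue calls for the same queued (kind, agent) return the same id; once that entry is Running, a third call creates a new entry.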

Sources

Source | Meaning
Manual | Operator clicked rebuild / update-all / meta-update on the dashboard, or any other direct human action (CLI, manager tool).
AutoUpdate | Legacy startup-sweep source (flat, no parent). Replaced by StartupSweep for new boots.
StartupSweep | Child of a StartupSweep parent entry; boot-time per-agent rebuild with the sweep as the visual group header.
Approval | Triggered by an operator-approved ApprovalKind::{Spawn, ApplyCommit}.

Cascade parent tracking

MetaUpdate and StartupSweep entries fan out Rebuild children, each carrying parent_id = <parent_id>. The dashboard groups children under their parent in the queue panel so the operator sees the whole cascade as a tree, not a flat list.
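One way the grouping could work is sketched below. The Entry shape and the group_by_parent function are hypothetical; the real dashboard code is not shown in this document. The sketch assumes a single level of nesting and drops children whose parent is no longer in the list.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub id: u64,
    pub agent: String,
    pub parent_id: Option<u64>,
}

/// Pairs each root entry with its children, both in queue order.
pub fn group_by_parent(entries: &[Entry]) -> Vec<(&Entry, Vec<&Entry>)> {
    // Bucket children by parent id, preserving their queue order.
    let mut children: BTreeMap<u64, Vec<&Entry>> = BTreeMap::new();
    for e in entries {
        if let Some(parent) = e.parent_id {
            children.entry(parent).or_default().push(e);
        }
    }
    // Emit roots in queue order, attaching whatever children they have.
    entries
        .iter()
        .filter(|e| e.parent_id.is_none())
        .map(|root| (root, children.remove(&root.id).unwrap_or_default()))
        .collect()
}
```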

Step labels

Each queue entry has a mutable step: Option<String> field that the worker updates as it progresses through lifecycle phases ("nix build", "nixos-container stop", "nixos-container update", "nixos-container start"). The dashboard polls /api/state and renders the current step beneath the running entry so the operator can see which phase is taking time.

Dependency tracking

Each entry carries a depends_on: Vec<u64> field. The worker skips entries whose dependencies are not yet resolved — a dependency is resolved when its id is either in the queue as a terminal entry (Done / Failed / Cancelled) or no longer in the queue at all (evicted by trim_history, which only evicts terminals).
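The resolution rule can be sketched like this. The types are simplified stand-ins and dep_resolved is a hypothetical helper name; only the depends_on field, take_next, and the terminal states come from the description above.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum State { Queued, Running, Done, Failed, Cancelled }

impl State {
    pub fn is_terminal(self) -> bool {
        matches!(self, State::Done | State::Failed | State::Cancelled)
    }
}

#[derive(Debug)]
pub struct Entry {
    pub id: u64,
    pub state: State,
    pub depends_on: Vec<u64>,
}

/// A dependency is resolved when it is terminal or no longer in the queue.
pub fn dep_resolved(entries: &[Entry], dep: u64) -> bool {
    match entries.iter().find(|e| e.id == dep) {
        Some(e) => e.state.is_terminal(),
        // Absent ids were evicted, and only terminal entries are evicted.
        None => true,
    }
}

/// Marks the first runnable Queued entry (FIFO order) as Running and
/// returns its id, or None when every Queued entry is still blocked.
pub fn take_next(entries: &mut [Entry]) -> Option<u64> {
    let view: &[Entry] = &*entries;
    let idx = view.iter().position(|e| {
        e.state == State::Queued && e.depends_on.iter().all(|&d| dep_resolved(view, d))
    })?;
    entries[idx].state = State::Running;
    Some(entries[idx].id)
}
```

Note that a Failed or Cancelled dependency also counts as resolved under this rule, so a downstream entry runs even when its dependency did not succeed.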

Use cases:

depends_on is part of the dedup key: two entries with the same (kind, agent, parent_id, inputs, approval_id) but different dep sets are treated as distinct work.

Worker re-notification: the worker drains take_next() in a tight loop after each entry finishes. When a dep entry transitions to terminal, the loop re-evaluates the queue immediately, so downstream entries are unblocked with no extra wakeup. No additional notify_one() call is needed.

Circular-dep caveat: if A depends on B and B depends on A, neither entry ever becomes runnable — the worker skips both indefinitely with no error. Callers must ensure acyclic dep graphs. Cycle detection is deferred to a future iteration (when parallel workers make a stuck queue more visible).
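Until the queue detects cycles itself, a caller could check its dependency graph before enqueueing. The following is a hypothetical caller-side sketch using a depth-first search; has_cycle and the id-to-dependencies map are illustrative and not part of hive-c0re.

```rust
use std::collections::{HashMap, HashSet};

/// Returns true if `deps` (entry id -> ids it depends on) contains a cycle.
pub fn has_cycle(deps: &HashMap<u64, Vec<u64>>) -> bool {
    fn visit(
        id: u64,
        deps: &HashMap<u64, Vec<u64>>,
        on_path: &mut HashSet<u64>,
        done: &mut HashSet<u64>,
    ) -> bool {
        if done.contains(&id) {
            return false; // already fully explored, no cycle through here
        }
        if !on_path.insert(id) {
            return true; // id is already on the current DFS path: back edge
        }
        let cyclic = deps
            .get(&id)
            .map_or(false, |ds| ds.iter().any(|&d| visit(d, deps, on_path, done)));
        on_path.remove(&id);
        done.insert(id);
        cyclic
    }

    let mut on_path = HashSet::new();
    let mut done = HashSet::new();
    deps.keys().any(|&id| visit(id, deps, &mut on_path, &mut done))
}
```

Ids that appear only as dependencies (for example, already-evicted entries) are treated as leaves, which matches the rule that an absent dependency is resolved.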


Container view

container_view.rs maintains an in-memory snapshot of every nixos-container's systemd service state. It is polled on coordinator startup and re-scanned after every lifecycle operation (spawn, rebuild, kill) so the dashboard always reflects the actual container status without a live nixos-container list call on each render.


Auto-update sweep

On startup, auto_update.rs rebuilds every known container unconditionally. nixos-container update is a no-op at the nix level when nothing changed (same store path), so the cost is low. The unconditional pass also avoids rev-marker staleness: every agent needs an update pass whenever a meta commit lands.

auto_update::run enqueues a single StartupSweep parent entry (kind = startup_sweep, agent = "hyperhive") followed by per-agent Rebuild children (source = startup_sweep, parent_id = sweep_id). The worker processes the parent by bumping the meta hyperhive input lock, then transitions it to Done. The child rebuilds drain sequentially through the queue; the dashboard renders them nested under the parent so the operator can see the whole boot-time sweep in one group.
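The shape of the entries this produces can be sketched as below. The Entry struct, the id counter, and enqueue_startup_sweep are hypothetical simplifications; the source field (startup_sweep on the children) is omitted for brevity.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind { StartupSweep, Rebuild }

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub id: u64,
    pub kind: Kind,
    pub agent: String,
    pub parent_id: Option<u64>,
}

/// Appends one sweep parent followed by one Rebuild child per agent.
/// Returns the parent's id.
pub fn enqueue_startup_sweep(
    entries: &mut Vec<Entry>,
    next_id: &mut u64,
    agents: &[&str],
) -> u64 {
    *next_id += 1;
    let sweep_id = *next_id;
    // The parent is attributed to the pseudo-agent "hyperhive".
    entries.push(Entry {
        id: sweep_id,
        kind: Kind::StartupSweep,
        agent: "hyperhive".to_string(),
        parent_id: None,
    });
    for agent in agents {
        *next_id += 1;
        entries.push(Entry {
            id: *next_id,
            kind: Kind::Rebuild,
            agent: (*agent).to_string(),
            parent_id: Some(sweep_id),
        });
    }
    sweep_id
}
```

Because the parent is pushed first and the worker drains in FIFO order, the lock bump runs before any child rebuild.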

Before this change, each boot enqueued flat Rebuild entries with source = AutoUpdate and no parent — visible but ungrouped.

Meta flake

meta.rs owns the single coordinator-managed flake at /var/lib/hyperhive/meta/. This flake consumes every agent's applied config repo as a flake input and exports one nixosConfiguration per agent. Container lifecycle ops drive the lock file so meta's git log is the system-wide deploy audit trail.

Key operations:


Container lifecycle (lifecycle.rs)

Every container operation ultimately calls into lifecycle.rs. Two paths exist: rebuild (existing container) and spawn (first-time creation).

Rebuild path (existing container)

Goal: apply the new system profile and any EXTRA_NSPAWN_FLAGS / drop-in changes in a single start, with minimum downtime.

nixos-container update runs systemctl reload container@<c> only when the container is already up (per isContainerRunning in nixos-container.pl). Stopping first turns update into a boot-style operation: it builds the new profile, installs it with nix-env --set, and skips the in-container switch-to-configuration. The subsequent start then applies both the new profile and any EXTRA_NSPAWN_FLAGS changes in one go, rather than the double-bounce a live update would trigger.

Sequence for a running container:

  1. prebuild_toplevel — build the new system.build.toplevel before stopping. The container keeps serving the previous generation while eval + fetch + build happen out-of-band. nixos-container update then finds the result cached and skips straight to the profile-swap. Build failures surface here, before the running container is touched.
  2. nixos-container stop — bring the container down.
  3. nixos-container update --flake meta#<name> — profile-swap (near-instant after the prebuild).
  4. nixos-container start — boot into the new generation; the in-container activation script transitions old → new.

If the container is already stopped, step 1 is skipped (no downtime to shave — no point evaluating the flake twice).
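The step selection can be sketched as a small planning function. Step and rebuild_plan are hypothetical names; the labels reuse the step strings listed under "Step labels". The text only says that step 1 is skipped for a stopped container, so this sketch keeps the remaining steps unchanged; whether the real code also elides the stop is not stated here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step { Prebuild, Stop, Update, Start }

impl Step {
    /// Label shown beneath the running queue entry.
    pub fn label(self) -> &'static str {
        match self {
            Step::Prebuild => "nix build",
            Step::Stop => "nixos-container stop",
            Step::Update => "nixos-container update",
            Step::Start => "nixos-container start",
        }
    }
}

/// Ordered steps for rebuilding an existing container.
pub fn rebuild_plan(running: bool) -> Vec<Step> {
    let mut plan = Vec::new();
    if running {
        // Build while the old generation is still serving, so a build
        // failure surfaces before the container is touched.
        plan.push(Step::Prebuild);
    }
    plan.extend([Step::Stop, Step::Update, Step::Start]);
    plan
}
```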

Cold-start fallback

start after update can exit non-zero when packages are removed between generations: the old-generation activation script references units that no longer exist in the new closure, causing systemd to exit non-zero. The container may be half-started at that point.

Fallback: stop (graceful SIGTERM drain) → kill (SIGKILL any lingering processes) → start (clean cold-start, no generation transition, new activation runs cleanly). Both errors are preserved and surfaced if the cold-start also fails.
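A sketch of the fallback, under stated assumptions: ContainerOps is a hypothetical trait standing in for the real container commands, StartError is an illustrative error type, and failures of the intermediate stop and kill are ignored here because the document does not say how they are handled.

```rust
/// Hypothetical abstraction over the container commands.
pub trait ContainerOps {
    fn start(&mut self, name: &str) -> Result<(), String>;
    fn stop(&mut self, name: &str) -> Result<(), String>;
    fn kill(&mut self, name: &str) -> Result<(), String>;
}

/// Carries both failures so neither is lost when the cold start also fails.
#[derive(Debug, PartialEq, Eq)]
pub struct StartError {
    pub first: String,
    pub cold: String,
}

pub fn start_with_fallback<O: ContainerOps>(
    ops: &mut O,
    name: &str,
) -> Result<(), StartError> {
    let first = match ops.start(name) {
        Ok(()) => return Ok(()),
        Err(e) => e,
    };
    // The container may be half-started: drain gracefully, then make sure
    // nothing lingers. Errors from these two steps are ignored (assumption).
    let _ = ops.stop(name);
    let _ = ops.kill(name);
    // Clean cold start with no generation transition.
    ops.start(name).map_err(|cold| StartError { first, cold })
}
```

Returning both messages in one error value keeps the original failure visible even when the cold start produces a different one.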

Spawn path (new container)

For a first-time create, nixos-container create is atomic: if the build fails, no container record is left to clean up. A separate prebuild would just duplicate the eval, so it's skipped. Sequence: create --flake meta#<name> → write nspawn flags → systemctl daemon-reload → start.

Prebuild attr path

nix build does not auto-resolve meta#<name> against nixosConfigurations the way nixos-container does internally. The explicit attr path <flake-root>#nixosConfigurations.<name>.config.system.build.toplevel is required; using the bare meta#<name> ref would make nix look in packages, legacyPackages, or the flake root directly — none of which exist in the rendered meta flake.
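Building the explicit installable is a one-line string format. The function name toplevel_attr is hypothetical, and the sketch assumes agent names that need no quoting inside an attribute path.

```rust
/// Full flake installable for an agent's system closure.
/// `flake_root` is the path (or flake ref) of the rendered meta flake.
pub fn toplevel_attr(flake_root: &str, name: &str) -> String {
    format!(
        "{}#nixosConfigurations.{}.config.system.build.toplevel",
        flake_root, name
    )
}
```

For example, toplevel_attr("/var/lib/hyperhive/meta", "alice") yields the attr path that would be handed to nix build for the agent alice.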


Host-level resource + performance options

Three services.hyperhive.c0re.* options tune container resource limits and first-spawn latency. All three apply uniformly to every agent container.

Container resource limits

agentCpuQuota and agentMemoryMax map directly to systemd CPUQuota= and MemoryMax=. hive-c0re writes a container@h-<name>.service.d/ drop-in file on each spawn and rebuild, so changes take effect on the next lifecycle op without requiring a host rebuild.

Option | Default | Description
services.hyperhive.c0re.agentCpuQuota | "200%" | CPU cap per agent, as a percentage of one core ("200%" = 2 cores). Raise if agents hit CPU limits during builds or heavy tool use.
services.hyperhive.c0re.agentMemoryMax | "4G" | Memory cap per agent. Raise for agents that run large nix builds or hold big in-memory data.
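The drop-in content these options produce can be sketched as below. Both function names are hypothetical, and the sketch deliberately leaves out the parent directory and the drop-in file name, since neither is specified here.

```rust
/// Name of the drop-in directory for an agent's container unit.
pub fn dropin_dir(name: &str) -> String {
    format!("container@h-{}.service.d", name)
}

/// Unit-file fragment applying the per-agent resource limits.
pub fn render_limits_dropin(cpu_quota: &str, memory_max: &str) -> String {
    format!(
        "[Service]\nCPUQuota={}\nMemoryMax={}\n",
        cpu_quota, memory_max
    )
}
```

With the defaults, render_limits_dropin("200%", "4G") produces a [Service] section containing CPUQuota=200% and MemoryMax=4G.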

For a hive-wide cap across all containers together, set systemd.slices.machine.serviceConfig.CPUQuota in your NixOS config — all nspawn containers live in machine.slice.

Pre-building agent templates

preBuildAgentTemplates (default false) causes the host NixOS build to pre-fetch the per-container system closures (agent-base + manager toplevels) into /nix/store, instead of leaving that work to the first nixos-container start. The trade-off:

Note: toplevels are pinned to x86_64-linux. Enabling this option on an aarch64 host forces a cross-compilation or remote-builder build, which is almost never desired. Leave it off on non-x86 hosts.


See also