Gotchas

NixOS + nspawn quirks and lessons we learned the hard way. If something here looks unmotivated in the code, there's usually a story underneath.

nixos-container doesn't expose --bind on the CLI

The CLI doesn't accept --bind. The way in is EXTRA_NSPAWN_FLAGS in /etc/nixos-containers/<NAME>.conf: the start script (/nix/store/.../container_-start) expands it unquoted into the systemd-nspawn invocation, so each flag must be a single whitespace-free word. lifecycle::set_nspawn_flags() rewrites this line.
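
A minimal sketch of that rewrite, not the actual lifecycle::set_nspawn_flags() body. The function name, the double-quoted value format, and the "append if missing" behaviour are assumptions for illustration.

```rust
/// Hypothetical helper: replace (or append) the EXTRA_NSPAWN_FLAGS line in
/// the text of a nixos-container .conf file and return the new text.
/// Assumption: the value is written double-quoted on a single line.
fn set_extra_nspawn_flags(conf: &str, flags: &[&str]) -> String {
    // The start script expands the value unquoted, so flags are simply
    // space-separated; a flag containing whitespace would be split apart.
    let new_line = format!("EXTRA_NSPAWN_FLAGS=\"{}\"", flags.join(" "));
    let mut replaced = false;
    let mut out: Vec<String> = Vec::new();

    for line in conf.lines() {
        if line.starts_with("EXTRA_NSPAWN_FLAGS=") {
            // Keep only one copy of the key; drop any duplicates.
            if !replaced {
                out.push(new_line.clone());
                replaced = true;
            }
        } else {
            out.push(line.to_string());
        }
    }
    if !replaced {
        out.push(new_line);
    }

    let mut text = out.join("\n");
    text.push('\n');
    text
}
```

Called with flags such as --bind=/host/path:/container/path, every other key in the file is left untouched.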

/run/systemd/nspawn/*.nspawn overrides are ignored

nixos-container's start script builds the nspawn command line directly. Dropping a .nspawn file under /run/systemd/nspawn/ looks like the obvious extension point and does nothing. Use EXTRA_NSPAWN_FLAGS (above).

boot.isNspawnContainer = true

Use this, not boot.isContainer = true. The option was renamed as of nixos-25.11.

nixos-container create auto-assigns HOST_ADDRESS / LOCAL_ADDRESS

…in the .conf. The start script's if HOST_ADDRESS set → --network-veth branch then forces a private netns — silently fatal for our web UIs (the bind is invisible from the host). We force-clear HOST_ADDRESS / LOCAL_ADDRESS / HOST_ADDRESS6 / LOCAL_ADDRESS6 / HOST_BRIDGE and set PRIVATE_NETWORK=0.
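
A rough sketch of the force-clear step. It assumes "clear" means writing the key with an empty value rather than deleting the line; force_host_network is a hypothetical name, not a function from the codebase.

```rust
/// Keys that make the start script take the --network-veth branch.
const CLEARED_KEYS: [&str; 5] = [
    "HOST_ADDRESS",
    "LOCAL_ADDRESS",
    "HOST_ADDRESS6",
    "LOCAL_ADDRESS6",
    "HOST_BRIDGE",
];

/// Hypothetical helper: blank out the veth/bridge keys and pin
/// PRIVATE_NETWORK=0 so the container shares the host network namespace.
fn force_host_network(conf: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    let mut saw_private = false;

    for line in conf.lines() {
        let key = line.split_once('=').map(|(k, _)| k.trim());
        match key {
            // Assumption: an empty value is enough for the start script's
            // "is HOST_ADDRESS set" test to come out false.
            Some(k) if CLEARED_KEYS.contains(&k) => out.push(format!("{k}=")),
            Some("PRIVATE_NETWORK") => {
                out.push("PRIVATE_NETWORK=0".to_string());
                saw_private = true;
            }
            // Anything else (other keys, comments, blank lines) is kept.
            _ => out.push(line.to_string()),
        }
    }
    if !saw_private {
        out.push("PRIVATE_NETWORK=0".to_string());
    }

    out.join("\n") + "\n"
}
```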

systemd service PATH ≠ host PATH

The hive-c0re service sets path = [ pkgs.git "/run/current-system/sw" ]. In-container harness services do the same so anything an agent adds to its own agent.nix (environment.systemPackages) is visible to the mcp__bash__run MCP tool (and any other in-container process) without editing the service definition. On the host, environment.HYPERHIVE_GIT bakes in git's absolute path, which lifecycle::git_command() reads.

systemd.services.*.path appends /bin to every entry

NixOS's systemd.services.<unit>.path list feeds every entry through lib.makeBinPath, which appends /bin unconditionally. That's the right thing for Nix packages (their outPath is the store root, not the bin/ subdir), but it bites when you pass a string that already ends with /bin:

# ❌ /run/wrappers/bin → /run/wrappers/bin/bin (does not exist)
path = [ "/run/wrappers/bin" "/run/current-system/sw" ];

# ✅ /run/wrappers → /run/wrappers/bin  (the real wrappers dir)
path = [ "/run/wrappers" "/run/current-system/sw" ];

The bug is silent: nix eval succeeds and the unit starts, but PATH contains a directory that does not exist. The first symptom is usually sudo failing with "must be owned by uid 0 and have the setuid bit set". The setuid sudo wrapper lives at /run/wrappers/bin/sudo, but the path entry resolves to /run/wrappers/bin/bin, so the lookup skips the wrapper and lands on a non-setuid sudo later in PATH.
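
To make the string mangling concrete, here is a small Rust model of what the path list goes through. It only mimics the "append /bin, join with ':'" behaviour described above; make_bin_path is an illustrative stand-in, not the Nix implementation.

```rust
/// Model of lib.makeBinPath as described above: every entry gets "/bin"
/// appended unconditionally, then the entries are joined with ':'.
fn make_bin_path(entries: &[&str]) -> String {
    entries
        .iter()
        .map(|entry| format!("{entry}/bin"))
        .collect::<Vec<_>>()
        .join(":")
}

/// Hypothetical lint: entries that already end in "/bin" are almost
/// certainly a mistake, because they will become ".../bin/bin".
fn suspicious_entries<'a>(entries: &[&'a str]) -> Vec<&'a str> {
    entries
        .iter()
        .copied()
        .filter(|entry| entry.trim_end_matches('/').ends_with("/bin"))
        .collect()
}
```

Feeding it the broken list from above yields /run/wrappers/bin/bin:/run/current-system/sw/bin, which is exactly the PATH the unit ends up with.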

RuntimeDirectoryPreserve = "yes"

…keeps /run/hyperhive/ (and the per-agent sub-dirs) across hive-c0re restarts. Without it, every restart wipes bind sources and existing containers can't be started.

register_agent is idempotent

Drops any prior socket task before rebinding. Required so a hive-c0re restart followed by rebuild alice recreates the agent's socket without needing a clean reinstall.

claude-code is unfree

The flake pins it to nixpkgs-unstable via overlays.claude-unstable (stable lags too far). The overlay sets config.allowUnfreePredicate on its unstable import to whitelist claude-code specifically — scoped, only this one package. harness-base.nix does the same at the container level because each per-agent nixosConfiguration evaluates its own nixpkgs instance and the operator's host-level allowUnfree does not propagate in. Operators don't need to set anything on their side.

Claude credentials are per-agent

/var/lib/hyperhive/agents/<name>/claude/ bind-mounts to /home/<name>/.claude (RW). Sharing one dir across agents is NOT viable — OAuth refresh tokens rotate, so any sibling refresh invalidates all the others. Login flow runs from the per-agent web UI; creds persist across destroy/recreate (--purge wipes them).

Persistent notes dir per agent

/var/lib/hyperhive/agents/<name>/state/ bind-mounts to /agents/<name>/state (RW; uniform for sub-agents + manager). The harness exposes the same path via $HYPERHIVE_STATE_DIR. System prompts tell agents to keep durable knowledge here (notes.md, anything else). The harness also writes its events log here (hyperhive-events.sqlite). Survives destroy/recreate alongside the claude dir.

Web UI ports collide on hash

Web UI ports are a deterministic FNV-1a hash of the agent name modulo 900, offset into the range 8100..8999. Every agent, including the manager, goes through the same hash; the dashboard sits separately at cfg.dashboardPort (default 7000). With ~30 agents the birthday-paradox chance of at least one collision is already about 39%, and even at 2–3 agents you can get unlucky. The operator resolves a collision by renaming the offending agent (different hash → different port) and rebuilding. There is no state file, no probing, and no port-allocation drift: the value is reproducible from the name alone.
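
A sketch of the port derivation and the collision estimate. The text above only says "FNV-1a", so the 32-bit variant used here is an assumption (the real code may use 64-bit and therefore produce different ports); web_ui_port and collision_probability are illustrative names.

```rust
// Standard 32-bit FNV-1a parameters.
const FNV_OFFSET_BASIS: u32 = 0x811c_9dc5;
const FNV_PRIME: u32 = 0x0100_0193;

/// 32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
fn fnv1a_32(bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .fold(FNV_OFFSET_BASIS, |hash, byte| {
            (hash ^ u32::from(*byte)).wrapping_mul(FNV_PRIME)
        })
}

/// Map an agent name into 8100..=8999 (900 slots).
/// Assumption: the 32-bit hash variant; the project may differ.
fn web_ui_port(agent: &str) -> u16 {
    8100 + (fnv1a_32(agent.as_bytes()) % 900) as u16
}

/// Birthday-paradox estimate: probability that at least two of `agents`
/// uniformly hashed names land in the same one of `slots` ports.
fn collision_probability(agents: u32, slots: u32) -> f64 {
    let mut all_unique = 1.0;
    for taken in 0..agents {
        all_unique *= 1.0 - f64::from(taken) / f64::from(slots);
    }
    1.0 - all_unique
}
```

Because the port is a pure function of the name, a rename is the only knob; collision_probability(30, 900) is where the "meaningful at ~30 agents" estimate comes from.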

Restart races on TCP bind

Both the dashboard and per-agent web UI use tokio::net::TcpSocket with SO_REUSEADDR plus a retry-on-AddrInUse loop (12 tries, exponential backoff capped at 2s, ~22s total). REUSEADDR handles the TIME_WAIT case from a clean previous exit; retry covers the genuine "previous process is still alive during a systemd restart overlap" case. REUSEADDR does not allow two simultaneous LISTEN sockets on the same port (that would be SO_REUSEPORT, which we don't use) — exclusivity is preserved.
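
The retry half of that can be sketched without tokio. This is a simplified, synchronous model: the bind attempt and the sleep are passed in as closures (so socket setup such as SO_REUSEADDR stays with the caller), and the base delay is a free parameter because the text above does not state it.

```rust
use std::io;
use std::time::Duration;

/// Exponential backoff: base * 2^attempt, capped at `cap`.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Call `bind` up to `tries` times, sleeping between attempts, but only
/// while the failure is AddrInUse. Any other error is returned immediately,
/// as is the AddrInUse error from the final attempt.
fn retry_on_addr_in_use<T>(
    tries: u32,
    base: Duration,
    cap: Duration,
    mut bind: impl FnMut() -> io::Result<T>,
    mut sleep: impl FnMut(Duration),
) -> io::Result<T> {
    let mut attempt = 0;
    loop {
        match bind() {
            Err(e) if e.kind() == io::ErrorKind::AddrInUse && attempt + 1 < tries => {
                sleep(backoff_delay(attempt, base, cap));
                attempt += 1;
            }
            other => return other,
        }
    }
}
```

Injecting the sleep keeps the loop testable without waiting; in an async setting the same shape applies with an awaited timer in place of the closure.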

Orphan approvals

If an agent's state dir is wiped out from under a pending approval (test scripts, manual rm -rf), the dashboard's next render marks the approval failed with note "agent state dir missing" so it falls out of pending. The row stays in sqlite for audit.

Nix store cp -r preserves read-only bits

Copying a nix store path with cp -r src/. $out/ inside a pkgs.runCommand derivation carries over the read-only permissions of the store files and directories. Any subsequent write into the copied tree (adding new files in subdirectories) fails with "Permission denied" (EACCES). Fix: pass --no-preserve=mode,ownership so the output tree is writable.

SPA fallback: use Accept header map, not try_files ... /index.html

The naive nginx pattern for a path-prefix SPA (try_files $uri $uri/ /matrix/index.html) silently swallows asset 404s: a missing JS file returns index.html with a 200, so the JS runtime never loads and the page renders blank with no visible error. Extension allowlists (tried as an alternative) trade that for a maintenance problem: any new file extension the SPA ships breaks silently.

The pattern that works (hive-gateway.nix) keys the fallback on the HTTP Accept header:

# Outside the server block (appendHttpConfig):
map $http_accept $matrix_spa_target {
  default          "/__matrix_spa_no_html_fallback";
  "~*text/html"    "/matrix/index.html";
}

# Inside the location:
try_files $uri $uri/ $matrix_spa_target =404;

Top-frame navigations always send Accept: text/html,... (Chrome, Firefox, and Safari are consistent). Asset fetches (image/*, application/javascript, */*) don't carry text/html, so $matrix_spa_target points at a file that does not exist and they fall through to the trailing =404. No extension list to maintain; no named-location indirection needed.
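
The decision the map + try_files pair makes can be modelled in a few lines of Rust. This is a simplified model for reasoning about the behaviour, not nginx's actual algorithm: it treats $uri/ as a plain existence check and takes file existence as an injected predicate.

```rust
/// Mirror of the map block: case-insensitive match on "text/html".
fn spa_fallback_target(accept: Option<&str>) -> &'static str {
    match accept {
        Some(value) if value.to_ascii_lowercase().contains("text/html") => {
            "/matrix/index.html"
        }
        // Sentinel path that is never expected to exist on disk.
        _ => "/__matrix_spa_no_html_fallback",
    }
}

#[derive(Debug, PartialEq)]
enum Served {
    File(String),
    NotFound,
}

/// Model of `try_files $uri $uri/ $matrix_spa_target =404`: serve the first
/// candidate that exists, otherwise answer 404.
fn resolve(uri: &str, accept: Option<&str>, exists: impl Fn(&str) -> bool) -> Served {
    let candidates = [
        uri.to_string(),
        format!("{uri}/"),
        spa_fallback_target(accept).to_string(),
    ];
    for candidate in candidates {
        if exists(&candidate) {
            return Served::File(candidate);
        }
    }
    Served::NotFound
}
```

In this model a missing script requested with Accept: */* ends at NotFound, while a deep-link navigation with Accept: text/html gets index.html.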

nix build flake#name does not walk into nixosConfigurations

nix build resolves the fragment (#name) against the flake's top-level output attrs — not against nixosConfigurations specifically. nixos-container and nixos-rebuild use their own internal convention that routes an agent name to nixosConfigurations.<name>.config.system.build.toplevel, but nix build has no such convention.

# ❌ silently builds the wrong thing (or errors if attr doesn't exist)
nix build /var/lib/hyperhive/meta#argus.config.system.build.toplevel

# ✅ explicit path nix build actually resolves
nix build /var/lib/hyperhive/meta#nixosConfigurations.argus.config.system.build.toplevel

lifecycle::prebuild_toplevel hit this once by constructing the attr path as {flake_ref}.config… — which produced meta#argus.config… instead of meta#nixosConfigurations.argus.config…. The fix: split_once('#') to separate flake path from name, then template {path}#nixosConfigurations.{name}.config.system.build.toplevel.
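
The fix amounts to something like the following. The function name and the Option return are illustrative; the real prebuild_toplevel may handle a malformed ref differently.

```rust
/// Turn a flake ref of the form "<path>#<name>" into the explicit attr path
/// that nix build resolves. Returns None when there is no '#', or when
/// either side of it is empty.
fn toplevel_attr(flake_ref: &str) -> Option<String> {
    let (path, name) = flake_ref.split_once('#')?;
    if path.is_empty() || name.is_empty() {
        return None;
    }
    Some(format!(
        "{path}#nixosConfigurations.{name}.config.system.build.toplevel"
    ))
}
```

For /var/lib/hyperhive/meta#argus this yields the ✅ form shown above.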

hive-forge: prefer over raw curl pipelines

Full CLI reference: docs/tools/forge.md. Never use raw curl for forge access.

Containerized nix-daemon needs sandbox-fallback = true

Agent containers bind-mount the host's nix-daemon socket. nspawn containers don't get user namespaces by default, so nix build invocations inside the container can't set up the build sandbox and fail outright if the host daemon's nix.settings.sandbox-fallback is false (nixpkgs default). nix/templates/harness-base.nix sets it to lib.mkForce true so builds fall back to unsandboxed local builds rather than failing. Security implications: docs/security.md.

Linking workspace binaries locally needs nix develop

The Rust workspace links libsqlite3-sys (rusqlite) against the system libsqlite3. Agent containers carry no system libsqlite3 on the linker path, so a plain cargo build of any binary dies with cannot find -lsqlite3 (deps and ring compile fine — only the final link fails). cargo check / cargo clippy still work in the ambient shell since they never link.

Build + run binaries through the dev shell, which carries sqlite on NIX_LDFLAGS:

nix develop -c cargo build -p hive-c0re --bin hivectl
nix develop -c cargo run -p hive-c0re --bin hivectl -- <args>

This is also how you regenerate committed generated docs locally — e.g. docs/tools/hivectl-cli.md via the hivectl markdown-docs subcommand (its hivectl-docs flake check otherwise only fails in CI on drift).

Split asset derivations away from the rust workspace

nix/assets.nix builds the branding SVG/PNG family + claude system-prompt template + claude-settings JSON as its own derivation, separate from the hive-ag3nt / hive-c0re crates. Reason: when the rust build's src was the whole repo tree, any tweak to branding/agent-configs.svg or hive-ag3nt/prompts/system.md invalidated the cargo cache and forced a full rebuild. crane (and naersk before it) couldn't see "these inputs are unused by rust" on its own — the split breaks the coupling at the derivation boundary. The agent-configs PNG is rendered from the SVG via rsvg-convert at build time; librsvg dependency lives here, not in the rust derivation's nativeBuildInputs.

Weston VNC compositor (per-agent hyperhive.gui.enable)

nix/templates/weston-vnc.nix adds an optional Weston Wayland compositor with the VNC backend, surfaced as hyperhive.gui.enable = true per-agent. The harness's /screen/ws WebSocket relay (docs/web-ui/agent.md::Per-agent endpoints) connects to the compositor at 127.0.0.1:<vnc_port>.

Nix options reference (nix/docs/default.nix)

The reference is generated by running pkgs.nixosOptionsDoc over two evaluated module trees: hostEval (a stub NixOS system loading self.nixosModules.default with every hyperhive subsystem mkForce false so heavy build inputs stay out of the eval) and agentEval (reuses the already-evaluated agent-base container config so the per-agent options tree is identical to what a real agent container sees).

Three output trees consumed by flake.nix, all markdown:

Pipeline:

Host options live entirely under services.hyperhive.*. The pickSubtrees filter is rooted at ["services" "hyperhive"] so the options tree picks up everything under that root — picking against stray roots produces an empty tree and renders the host page as template chrome with no <h2> headers.