Gotchas

NixOS + nspawn quirks and lessons we learned the hard way. If something here looks unmotivated in the code, there's usually a story underneath.

`nixos-container` doesn't expose `--bind` on the CLI

The CLI doesn't accept --bind. The route is EXTRA_NSPAWN_FLAGS in /etc/nixos-containers/<NAME>.conf: the start script (/nix/store/.../container_-start) expands it unquoted into the systemd-nspawn invocation, so the shell word-splits the value. lifecycle::set_nspawn_flags() rewrites this line.
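As a rough illustration of the rewrite step, the sketch below replaces the EXTRA_NSPAWN_FLAGS line in a .conf, or appends one if it is missing. It is not the real lifecycle::set_nspawn_flags(); the function name upsert_extra_nspawn_flags and the assumption that the .conf is a list of shell-style KEY=value lines are mine.

```rust
/// Hypothetical helper (not the project's code): replace the
/// EXTRA_NSPAWN_FLAGS line in a container .conf, or append one if absent.
/// Assumes the file is a list of shell-style KEY=value lines.
fn upsert_extra_nspawn_flags(conf: &str, flags: &str) -> String {
    const KEY: &str = "EXTRA_NSPAWN_FLAGS=";
    let new_line = format!("{}\"{}\"", KEY, flags);
    let mut found = false;

    let mut out: Vec<String> = conf
        .lines()
        .map(|line| {
            if line.trim_start().starts_with(KEY) {
                found = true;
                new_line.clone() // overwrite the existing assignment
            } else {
                line.to_string() // keep every other line untouched
            }
        })
        .collect();

    if !found {
        out.push(new_line);
    }

    let mut text = out.join("\n");
    text.push('\n');
    text
}
```

Because the start script expands the value unquoted, each whitespace-separated token becomes its own nspawn argument, so a bind path containing spaces cannot be expressed this way.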

`/run/systemd/nspawn/*.nspawn` overrides are ignored

nixos-container's start script builds the nspawn command line directly. Dropping a .nspawn file under /run/systemd/nspawn/ looks like the obvious extension point and does nothing. Use EXTRA_NSPAWN_FLAGS (above).

`boot.isNspawnContainer = true`

Use this, not boot.isContainer = true: the option was renamed as of nixos-25.11.

`nixos-container create` auto-assigns `HOST_ADDRESS` / `LOCAL_ADDRESS`

…in the .conf. The start script's "if HOST_ADDRESS is set, add --network-veth" branch then forces a private netns, which is silently fatal for our web UIs (the bind is invisible from the host). We force-clear HOST_ADDRESS / LOCAL_ADDRESS / HOST_ADDRESS6 / LOCAL_ADDRESS6 / HOST_BRIDGE and set PRIVATE_NETWORK=0.
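A minimal sketch of that clean-up, assuming the .conf is plain KEY=value lines and that "clearing" means leaving the key with an empty value. The function name force_host_network is hypothetical; the real code may do this differently.

```rust
/// Keys the text above says are force-cleared.
const CLEARED_KEYS: [&str; 5] = [
    "HOST_ADDRESS",
    "LOCAL_ADDRESS",
    "HOST_ADDRESS6",
    "LOCAL_ADDRESS6",
    "HOST_BRIDGE",
];

/// Hypothetical helper: blank the address/bridge keys and pin
/// PRIVATE_NETWORK=0. Assumes shell-style KEY=value lines.
fn force_host_network(conf: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    for line in conf.lines() {
        let key = line.split('=').next().unwrap_or("").trim();
        if key == "PRIVATE_NETWORK" {
            continue; // re-added below with the value we want
        }
        if CLEARED_KEYS.contains(&key) {
            out.push(format!("{}=", key)); // keep the key, drop the value
        } else {
            out.push(line.to_string());
        }
    }
    out.push("PRIVATE_NETWORK=0".to_string());
    out.join("\n") + "\n"
}
```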

systemd service PATH ≠ host PATH

The hive-c0re service sets path = [ pkgs.git "/run/current-system/sw" ]. In-container harness services do the same so anything an agent adds to its own agent.nix (environment.systemPackages) is visible to the mcp__bash__run MCP tool (and any other in-container process) without editing the service definition. environment.HYPERHIVE_GIT bakes git's absolute path in (read by lifecycle::git_command()) for the host.

`systemd.services.*.path` appends `/bin` to every entry

NixOS's systemd.services.<unit>.path list feeds every entry through lib.makeBinPath, which appends /bin unconditionally. That's the right thing for Nix packages (their outPath is the store root, not the bin/ subdir), but it bites when you pass a string that already ends with /bin:

# ❌ /run/wrappers/bin → /run/wrappers/bin/bin (does not exist)
path = [ "/run/wrappers/bin" "/run/current-system/sw" ];

# ✅ /run/wrappers → /run/wrappers/bin  (the real wrappers dir)
path = [ "/run/wrappers" "/run/current-system/sw" ];

The bug is silent: nix eval succeeds and the unit starts, but PATH contains a non-existent directory. The first symptom is usually the error "sudo: must be owned by uid 0 and have the setuid bit set". The setuid sudo wrapper lives at /run/wrappers/bin/sudo, but the path entry resolves to /run/wrappers/bin/bin, so the lookup presumably falls through to the non-setuid sudo from /run/current-system/sw.
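To make the doubling concrete, here is a small Rust sketch that mimics the behaviour described above (append /bin to every entry, join with ':'). It is an illustration only, not nixpkgs code, and make_bin_path is a name chosen for this example.

```rust
/// Mimics the behaviour attributed to lib.makeBinPath in the text:
/// every entry gets "/bin" appended, then entries are joined with ':'.
fn make_bin_path(entries: &[&str]) -> String {
    entries
        .iter()
        .map(|entry| format!("{}/bin", entry))
        .collect::<Vec<_>>()
        .join(":")
}
```

Feeding it "/run/wrappers/bin" yields "/run/wrappers/bin/bin", which is exactly the dead PATH entry from the ❌ case; "/run/wrappers" yields the real wrappers dir.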

`RuntimeDirectoryPreserve = "yes"`

…keeps /run/hyperhive/ (and the per-agent sub-dirs) across hive-c0re restarts. Without it, every restart wipes bind sources and existing containers can't be started.

`register_agent` is idempotent

Drops any prior socket task before rebinding. Required so a hive-c0re restart followed by rebuild alice recreates the agent's socket without needing a clean reinstall.
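The drop-then-rebind shape can be sketched with std alone. This is not the project's register_agent: the Registry type is hypothetical, and a plain UnixListener stands in for the socket task that the real code manages.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::os::unix::net::UnixListener;
use std::path::Path;

/// Hypothetical registry: agent name -> live listener.
struct Registry {
    sockets: HashMap<String, UnixListener>,
}

impl Registry {
    fn new() -> Self {
        Registry { sockets: HashMap::new() }
    }

    /// Idempotent: a second call for the same agent replaces the first.
    fn register_agent(&mut self, name: &str, socket_path: &Path) -> io::Result<()> {
        // Drop any previous listener so its fd is closed before rebinding.
        self.sockets.remove(name);

        // Closing a Unix socket leaves its file behind, and bind() on an
        // existing path fails with AddrInUse, so remove the stale file.
        match fs::remove_file(socket_path) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }

        let listener = UnixListener::bind(socket_path)?;
        self.sockets.insert(name.to_string(), listener);
        Ok(())
    }
}
```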

`claude-code` is unfree

claude-code comes from the flake's main nixpkgs (nixos-26.05). It's unfree, so the agent modules set config.allowUnfreePredicate at the container level to whitelist claude-code specifically — scoped, only this one package. This is needed because each per-agent nixosConfiguration evaluates its own nixpkgs instance and the operator's host-level allowUnfree does not propagate in. Operators don't need to set anything on their side.

Claude credentials are per-agent

/var/lib/hyperhive/agents/<name>/claude/ bind-mounts to /home/<name>/.claude (RW). Sharing one dir across agents is NOT viable — OAuth refresh tokens rotate, so any sibling refresh invalidates all the others. Login flow runs from the per-agent web UI; creds persist across destroy/recreate (--purge wipes them).

Persistent notes dir per agent

/var/lib/hyperhive/agents/<name>/state/ bind-mounts to /agents/<name>/state (RW; uniform for all agents). The harness exposes the same path via $HYPERHIVE_STATE_DIR. System prompts tell agents to keep durable knowledge here (notes.md, anything else). The harness also writes its events log here (hyperhive-events.sqlite). Survives destroy/recreate alongside the claude dir.

Web UI ports collide on hash

Sub-agent web UI ports are a deterministic FNV-1a hash of the agent name modulo 900 (range 8100..8999). With ~30 agents the birthday-paradox collision rate gets meaningful (roughly a 39% chance that some pair collides, assuming a uniform hash over 900 slots); at 2–3 agents you can still get unlucky. The operator resolves a collision by renaming the offending agent (different hash → different port) and rebuilding. There is no state file, no probing and no port-allocation drift: the value is reproducible from the name alone. The dashboard listens on cfg.dashboardPort (default 7000).
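A minimal sketch of the port derivation and of the birthday arithmetic. The text does not say which FNV-1a width the project uses, so the 32-bit variant here is an assumption, as are the names web_ui_port and collision_probability; the computed port for a given name may therefore differ from the real one.

```rust
const PORT_BASE: u16 = 8100;
const PORT_SPAN: u32 = 900; // ports 8100..=8999

/// 32-bit FNV-1a (the width is an assumption, see above).
fn fnv1a_32(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // FNV offset basis
    for &b in bytes {
        hash ^= u32::from(b);
        hash = hash.wrapping_mul(0x0100_0193); // FNV prime
    }
    hash
}

/// Port derived from the agent name alone: no state, no probing.
fn web_ui_port(agent: &str) -> u16 {
    PORT_BASE + (fnv1a_32(agent.as_bytes()) % PORT_SPAN) as u16
}

/// Chance that at least two of `agents` names land on the same port,
/// assuming the hash spreads names uniformly over PORT_SPAN slots.
fn collision_probability(agents: u32) -> f64 {
    if agents > PORT_SPAN {
        return 1.0; // pigeonhole: more agents than ports
    }
    let mut no_collision = 1.0_f64;
    for i in 0..agents {
        no_collision *= 1.0 - f64::from(i) / f64::from(PORT_SPAN);
    }
    1.0 - no_collision
}
```

Under the uniformity assumption, collision_probability(30) comes out a little under 0.4, and collision_probability(3) is about a third of a percent, which is the "you can still get unlucky" case.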

Restart races on TCP bind

Both the dashboard and per-agent web UI use tokio::net::TcpSocket with SO_REUSEADDR plus a retry-on-AddrInUse loop (12 tries, exponential backoff capped at 2s, ~22s total). REUSEADDR handles the TIME_WAIT case from a clean previous exit; retry covers the genuine "previous process is still alive during a systemd restart overlap" case. REUSEADDR does not allow two simultaneous LISTEN sockets on the same port (that would be SO_REUSEPORT, which we don't use) — exclusivity is preserved.
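The retry shape can be sketched with the standard library. This swaps tokio::net::TcpSocket for the blocking std::net::TcpListener (which, on Unix, sets SO_REUSEADDR itself before binding), and the base delay is left as a parameter because the text only states the try count and the 2s cap. The names backoff_delay and bind_with_retry are hypothetical.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};
use std::thread;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Try to bind up to `tries` times, sleeping between attempts only when
/// the failure is AddrInUse. Any other error is returned immediately.
fn bind_with_retry(
    addr: SocketAddr,
    tries: u32,
    base: Duration,
    cap: Duration,
) -> io::Result<TcpListener> {
    let mut attempt = 0;
    loop {
        match TcpListener::bind(addr) {
            Ok(listener) => return Ok(listener),
            Err(e) if e.kind() == io::ErrorKind::AddrInUse && attempt + 1 < tries => {
                thread::sleep(backoff_delay(attempt, base, cap));
                attempt += 1;
            }
            Err(e) => return Err(e), // out of tries, or a different failure
        }
    }
}
```

A call such as bind_with_retry(addr, 12, base, Duration::from_secs(2)) mirrors the 12-try, 2s-cap policy described above. While the old process still holds its LISTEN socket, every attempt fails with AddrInUse; that is the overlap window the loop is meant to outlast.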

Orphan approvals

If state dirs are wiped out from under a pending approval (test scripts, manual rm -rf), the dashboard's next render marks them failed with note "agent state dir missing" so they fall out of pending. They stay in sqlite for audit.

Nix store `cp -r` preserves read-only bits

Copying a nix store path with cp -r src/. $out/ inside a pkgs.runCommand derivation preserves the read-only permissions of store files. Any subsequent write into the copied tree (adding new files in subdirectories) fails with EPERM. Fix: pass --no-preserve=mode,ownership so the output tree is writable.

SPA fallback: use `Accept` header map, not `try_files ... /index.html`

The naive nginx pattern for a path-prefix SPA (try_files $uri $uri/ /matrix/index.html) silently swallows asset 404s: a missing JS file returns index.html with a 200, so the JS runtime never loads and the page renders blank with no visible error. Extension allowlists (tried as an alternative) trade this for a maintenance problem: any new file extension the SPA ships breaks silently.

The pattern that works (hive-gateway.nix) keys the fallback on the HTTP Accept header:

# Outside the server block (appendHttpConfig):
map $http_accept $matrix_spa_target {
  default          "/__matrix_spa_no_html_fallback";
  "~*text/html"    "/matrix/index.html";
}

# Inside the location:
try_files $uri $uri/ $matrix_spa_target =404;

Top-frame navigations always send Accept: text/html,... (chrome / firefox / safari are consistent). Asset fetches (image/*, application/javascript, */*) don't carry text/html, so they fall through to the trailing =404. No extension list to maintain; no named-location indirection needed.
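The decision the map encodes is small enough to restate as a function. The Rust below is only a model of the nginx behaviour for reasoning and testing, not something the gateway runs; spa_target is a name chosen for this example, and substring matching stands in for the case-insensitive regex ~*text/html.

```rust
const SPA_INDEX: &str = "/matrix/index.html";
const NO_FALLBACK: &str = "/__matrix_spa_no_html_fallback";

/// Model of the `map $http_accept ...` block: pick the try_files fallback
/// from the Accept header, case-insensitively.
fn spa_target(accept: Option<&str>) -> &'static str {
    match accept {
        Some(value) if value.to_ascii_lowercase().contains("text/html") => SPA_INDEX,
        // Missing header, */*, image/*, etc.: point at a path that does not
        // exist so try_files falls through to the trailing =404.
        _ => NO_FALLBACK,
    }
}
```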

`nix build flake#name` does not walk into `nixosConfigurations`

nix build resolves the fragment (#name) against the flake's top-level output attrs — not against nixosConfigurations specifically. nixos-container and nixos-rebuild use their own internal convention that routes an agent name to nixosConfigurations.<name>.config.system.build.toplevel, but nix build has no such convention.

# ❌ silently builds the wrong thing (or errors if attr doesn't exist)
nix build /var/lib/hyperhive/meta#argus.config.system.build.toplevel

# ✅ explicit path nix build actually resolves
nix build /var/lib/hyperhive/meta#nixosConfigurations.argus.config.system.build.toplevel

lifecycle::prebuild_toplevel hit this once by constructing the attr path as {flake_ref}.config… — which produced meta#argus.config… instead of meta#nixosConfigurations.argus.config…. The fix: split_once('#') to separate flake path from name, then template {path}#nixosConfigurations.{name}.config.system.build.toplevel.
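A sketch of the fixed templating, under the assumption that the input is a "path#name" flake reference. The function name toplevel_attr is hypothetical and the real prebuild_toplevel may handle errors differently.

```rust
/// Turn "path#name" into the attr path nix build actually resolves.
/// Returns None if there is no '#' or either side is empty.
fn toplevel_attr(flake_ref: &str) -> Option<String> {
    let (path, name) = flake_ref.split_once('#')?;
    if path.is_empty() || name.is_empty() {
        return None;
    }
    Some(format!(
        "{}#nixosConfigurations.{}.config.system.build.toplevel",
        path, name
    ))
}
```

Returning None for a reference without a fragment keeps the caller from silently building whatever the bare flake path happens to resolve to.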

`hive-forge`: prefer over raw curl pipelines

Full CLI reference: docs/tools/forge.md. Never use raw curl for forge access.

Containerized nix-daemon needs `sandbox-fallback = true`

Agent containers bind-mount the host's nix-daemon socket. nspawn containers don't get user-namespaces by default, so nix build invocations inside the container can't set up the build sandbox and fail outright if the host daemon's nix.settings.sandbox-fallback is false (nixpkgs default). nix/agent-modules/default.nix does lib.mkForce true so builds fall back to unsandboxed local builds rather than failing. Security implications: docs/security.md.

Linking workspace binaries locally needs `nix develop`

The Rust workspace links libsqlite3-sys (rusqlite) against the system libsqlite3. Agent containers carry no system libsqlite3 on the linker path, so a plain cargo build of any binary dies with cannot find -lsqlite3 (deps and ring compile fine — only the final link fails). cargo check / cargo clippy still work in the ambient shell since they never link.

Build + run binaries through the dev shell, which carries sqlite on NIX_LDFLAGS:

nix develop -c cargo build -p hive-c0re --bin hivectl
nix develop -c cargo run -p hive-c0re --bin hivectl -- <args>

This is also how you regenerate committed generated docs locally — e.g. docs/tools/hivectl-cli.md via the hivectl markdown-docs subcommand (its hivectl-docs flake check otherwise only fails in CI on drift).

Split asset derivations away from the rust workspace

nix/packages/assets.nix builds the branding SVG/PNG family + claude system-prompt template + claude-settings JSON as its own derivation, separate from the hive-ag3nt / hive-c0re crates. Reason: when the rust build's src was the whole repo tree, any tweak to branding/agent-configs.svg or hive-ag3nt/prompts/system.md invalidated the cargo cache and forced a full rebuild. crane (and naersk before it) couldn't see "these inputs are unused by rust" on its own — the split breaks the coupling at the derivation boundary. The agent-configs PNG is rendered from the SVG via rsvg-convert at build time; librsvg dependency lives here, not in the rust derivation's nativeBuildInputs.

Weston VNC compositor (per-agent `hyperhive.gui.enable`)

nix/agent-modules/weston-vnc.nix adds an optional Weston Wayland compositor with the VNC backend, surfaced as hyperhive.gui.enable = true per-agent. The harness's /screen/ws WebSocket relay (docs/web-ui/agent.md::Per-agent endpoints) connects to the compositor at 127.0.0.1:<vnc_port>.

Port allocation: a fixed port (hyperhive.gui.vncPort, default 5900). No per-agent hashing: network isolation is unconditional (each agent has its own netns; see docs/network.md#container-isolation), so the VNC port is container-local and can't collide across agents. The harness learns the port from the HIVE_GUI_VNC_PORT env var (set on the harness service when gui.enable is set), with no marker file and no runtime hash. (Unlike the agent web-UI port, which is still an FNV-1a hash because those listen on the shared host stack; see Web UI ports collide on hash.)

Non-root, shared user session: weston runs as the agent's own user (hyperhive.user.name, the same user hive-ag3nt runs as), not root, so the GUI and the agent share one session. The runtime dir is a fixed /run/gui (systemd RuntimeDirectory=gui, 0700, RuntimeDirectoryPreserve=yes so it survives weston restarts for Wayland clients sharing the /run/gui/wayland-0 socket). Wayland clients in the agent's config (e.g. bitburner electron) must run as the same user with XDG_RUNTIME_DIR=/run/gui.

One shared D-Bus session bus (gui-dbus.service): a single persistent dbus-daemon --session bound at /run/gui/bus, run as the agent user, ordered before weston.service (it shares the same RuntimeDirectory=gui, creating the dir first). Chromium/electron via ozone refuse to map an xdg_toplevel without a reachable session bus: they log "Failed to connect to the bus", bind xdg_wm_base and then destroy it, which leaves an invisible window even though CDP works. The fix is not to wrap each client in its own dbus-run-session (a private throwaway bus per process is a separate session, defeating the one-session model); it's this one shared bus, whose address is exported as DBUS_SESSION_BUS_ADDRESS=unix:path=/run/gui/bus via systemd.globalEnvironment so weston, the harness and every GUI client inherit it.

Fixed Wayland socket name (--socket=wayland-0): weston is launched with --socket=wayland-0 so the socket path is deterministic. nix/agent-modules/weston-vnc.nix exports WAYLAND_DISPLAY=wayland-0 and XDG_RUNTIME_DIR=/run/gui as global system environment variables (gated on hyperhive.gui.enable) so every systemd service in the container inherits them. Without this, services starting Wayland clients could not find the compositor: libwayland falls back to a headless display or errors out, the app "works" on a second invisible display, and the VNC session shows a blank weston desktop (#540 double-screen).

VNC bind address: weston's VNC backend has no CLI bind-address flag (unlike the RDP backend's --address), so the listener binds 0.0.0.0. The harness relay only connects via 127.0.0.1; the host firewall blocks the per-agent VNC port range from external access. A future weston.ini [vnc] address= will let us restrict the bind directly once upstream supports it.

PAM service name: literal weston-remote-access, the string libweston passes to pam_start() in libweston/auth.c. Using weston falls back to the system default PAM stack and rejects auth. The service is configured to pam_permit.so for all three module types (auth / account / session) so the browser's empty Apple-DH credentials (type 30) always pass. neatvnc ≥ 0.9 calls the PAM auth callback regardless of weston.ini auth-method=none, so the permit fallback is what actually lets the empty-cred client through.

Type = "simple" (not notify): switch-to-configuration must never block on weston signalling readiness. A misconfigured weston degrades to a Restart=on-failure loop visible in journalctl; it does not abort the nixos-container update. Same reasoning as the tea-login unit in nix/agent-modules/forge.nix.

[core] idle-time=0: disables weston's 300-second idle timeout. Without this setting the VNC desktop fades to black and desktop-shell shows its click-to-unlock screen, which is useless for an agent desktop viewed over /screen. idle-time=0 updates the idle timer with a 0ms delay, which wl_event_source_timer_update treats as "disarm", so the compositor never goes idle and never locks.

Nix options reference (`nix/docs/default.nix`)

pkgs.nixosOptionsDoc over two evaluated module trees: hostEval (a stub NixOS system loading the nix/host-modules/ aggregator with every hyperhive subsystem mkForce false so heavy build inputs stay out of the eval) and agentEval (evaluates agent.nix fresh for the per-agent options tree).

Three output trees consumed by flake.nix, all markdown:

docs-host — operator-facing host module options (services.hyperhive.*)
docs-agent — per-agent harness options (hyperhive.* declared in nix/agent-modules/)
docs — bundle of index.md + host.md + agent.md

Pipeline:

CommonMark from nixosOptionsDoc.optionsCommonMark is the only output — the source of truth, emitted as .md.
HTML + CSS is rendered downstream by the website repo (nix/options.nix there), which consumes this bundle's host.md / agent.md, renders them with cmark-gfm, and shares one stylesheet (docs.css) across /options/ and the prose /docs/ tree. Keeping rendering in the website means the theme has a single home and the colours are shared.
transformOptions strips the nix-store prefix from option declaration paths and rewrites them as forge URLs, so the rendered docs link back to the source.

Host options live entirely under services.hyperhive.*. The pickSubtrees filter is rooted at ["services" "hyperhive"] so the options tree picks up everything under that root — picking against stray roots produces an empty tree and renders the host page as template chrome with no <h2> headers.

Docs drv stability: `nixSrc`

Naively, the docs evaluation depends on self (the flake's store path), so every commit — even Rust-only or frontend-only changes — produces new docs drv hashes. The remote builder must rebuild docs from scratch for every PR branch, and if its store is full the build fails with a cached failure that blocks CI for the whole branch.

The fix (nix/docs/default.nix):

nixSrc — builtins.path on the nix/ directory, wrapped in builtins.unsafeDiscardStringContext to strip self's store-path context. The resulting store path is content-addressed from the nix/ file contents only. Docs drvs only change when a .nix file changes.
The package options the modules consume (hyperhive.packages.*, services.hyperhive.c0re.*) carry no in-module defaults and every default that references them has a defaultText, so the doc walk never forces a package — no stubs needed, and the Rust/frontend build closure stays out of the eval.
Both hostEval and agentEval are evaluated from nixSrc paths (not self), so the docs drv dependency chain ends at nixSrc.

Why builtins.unsafeDiscardStringContext? The path string toString self + "/nix" carries self's string context, which would make builtins.path include self as a build dependency even after content-addressing the directory. Discarding the context makes the resulting nixSrc truly independent of self's store path.

`nix fmt` fails in a git worktree with "object not found"

nix fmt (and any nix command that fetches a git+file:// flake URL) uses libgit2 internally to compute revCount — the number of commits reachable from HEAD. This walk fails with:

error: getting Git object '<hash>': object not found (libgit2 error code = 9)

when a commit that was reachable at some earlier evaluation is now gone (GC'd, rebased away, or pruned). The failure is persistent: clearing ~/.cache/nix/{eval-cache-v6,gitv3,fetcher-cache-v4.sqlite} does not help because the missing object is a structural gap in the git object graph itself, not in nix's caches.

Workaround: use a plain clone, not a git worktree.

git clone http://<forge>/hyperhive/hyperhive.git ~/hh-work
cd ~/hh-work && nix fmt

The root cause is specific to worktrees: a worktree shares the object store with its parent repo. If the parent repo's history was rewritten (rebase, force-push, git gc --prune) while the worktree was checked out at a branch tip that references the pruned commits via its reflog or history, libgit2's rev-walk encounters the gap. A plain clone has its own self-consistent object store and is immune to the issue.
