Gotchas
NixOS + nspawn quirks and lessons we hit the hard way. If something here looks unmotivated in the code, there's usually a story underneath.
nixos-container doesn't expose --bind on the CLI
The CLI doesn't accept --bind. Bind mounts go through
EXTRA_NSPAWN_FLAGS in /etc/nixos-containers/<NAME>.conf instead: the
start script (/nix/store/.../container_-start) expands it unquoted
into the systemd-nspawn invocation, so the value is word-split on
whitespace. lifecycle::set_nspawn_flags() rewrites this line.
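A minimal sketch of that rewrite, operating on the conf text in memory. `with_nspawn_flags` is a hypothetical stand-in, not the real `lifecycle::set_nspawn_flags()`; the double-quoted `KEY="value"` line format and the bind paths in the usage note are assumptions.

```rust
/// Returns `conf` with its EXTRA_NSPAWN_FLAGS line replaced, or appended
/// if the key is absent. Hypothetical helper for illustration only.
fn with_nspawn_flags(conf: &str, flags: &[&str]) -> String {
    // The start script expands the value unquoted, so individual flags
    // must not contain whitespace; we simply join them with spaces.
    let new_line = format!("EXTRA_NSPAWN_FLAGS=\"{}\"", flags.join(" "));
    let mut replaced = false;
    let mut out: Vec<String> = Vec::new();
    for line in conf.lines() {
        if line.starts_with("EXTRA_NSPAWN_FLAGS=") {
            // Keep a single copy even if the key appears more than once.
            if !replaced {
                out.push(new_line.clone());
                replaced = true;
            }
        } else {
            out.push(line.to_string());
        }
    }
    if !replaced {
        out.push(new_line);
    }
    out.join("\n") + "\n"
}
```

Calling it with `&["--bind=/run/hyperhive/alice:/run/hive"]` (a made-up path pair) leaves every other line of the conf untouched and yields exactly one `EXTRA_NSPAWN_FLAGS` line.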
/run/systemd/nspawn/*.nspawn overrides are ignored
nixos-container's start script builds the nspawn command line
directly. Dropping a .nspawn file under /run/systemd/nspawn/
looks like the obvious extension point and does nothing. Use
EXTRA_NSPAWN_FLAGS (above).
boot.isNspawnContainer = true
Not boot.isContainer = true. Renamed in nixos-25.11+.
nixos-container create auto-assigns HOST_ADDRESS / LOCAL_ADDRESS
…in the .conf. The start script's "HOST_ADDRESS is set, so add --network-veth" branch then forces a private netns, which is silently fatal
for our web UIs (the listening socket is invisible from the host). We
force-clear HOST_ADDRESS / LOCAL_ADDRESS / HOST_ADDRESS6 /
LOCAL_ADDRESS6 / HOST_BRIDGE and set PRIVATE_NETWORK=0.
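The force-clear step can be sketched as a pure function over the conf text. This is an illustration under assumptions: `force_host_network` is a hypothetical name, and "clear" is taken to mean an empty `KEY=` assignment, which the source does not spell out.

```rust
/// Keys that would push the start script into the --network-veth branch.
const CLEARED_KEYS: [&str; 5] = [
    "HOST_ADDRESS",
    "LOCAL_ADDRESS",
    "HOST_ADDRESS6",
    "LOCAL_ADDRESS6",
    "HOST_BRIDGE",
];

/// Drops any existing network assignments and re-emits them emptied,
/// followed by PRIVATE_NETWORK=0. Hypothetical helper.
fn force_host_network(conf: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    for line in conf.lines() {
        let key = line.split('=').next().unwrap_or("");
        if CLEARED_KEYS.contains(&key) || key == "PRIVATE_NETWORK" {
            continue; // dropped here, re-emitted below with safe values
        }
        out.push(line.to_string());
    }
    for key in CLEARED_KEYS {
        out.push(format!("{}=", key));
    }
    out.push("PRIVATE_NETWORK=0".to_string());
    out.join("\n") + "\n"
}
```

Re-emitting the keys rather than only deleting them means a later `nixos-container` rewrite that appends defaults is easier to spot in a diff.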
systemd service PATH ≠ host PATH
The hive-c0re service sets path = [ pkgs.git "/run/current-system/sw" ].
In-container harness services do the same so anything an agent adds
to its own agent.nix (environment.systemPackages) is visible to
the mcp__bash__run MCP tool (and any other in-container process) without
editing the service definition.
On the host, environment.HYPERHIVE_GIT bakes in git's absolute path
(read by lifecycle::git_command()).
systemd.services.*.path appends /bin to every entry
NixOS's systemd.services.<unit>.path list feeds every entry through
lib.makeBinPath, which appends /bin unconditionally. That's
the right thing for Nix packages (their outPath is the store root,
not the bin/ subdir), but it bites when you pass a string that
already ends with /bin:
# ❌ /run/wrappers/bin → /run/wrappers/bin/bin (does not exist)
path = [ "/run/wrappers/bin" "/run/current-system/sw" ];
# ✅ /run/wrappers → /run/wrappers/bin (the real wrappers dir)
path = [ "/run/wrappers" "/run/current-system/sw" ];
The bug is silent: nix eval succeeds, the unit starts, but PATH
contains a non-existent directory. The first symptom is usually
sudo: must be owned by uid 0 and have the setuid bit set. The
setuid sudo wrapper lives at /run/wrappers/bin/sudo, but the path
entry resolves to /run/wrappers/bin/bin, so lookup falls through to
the plain non-setuid sudo under /run/current-system/sw/bin.
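The suffixing rule is easy to model outside Nix. The sketch below mimics what `lib.makeBinPath` does to a list of plain strings (append `/bin`, join with `:`) and adds a small lint for the mistake above. Both function names are hypothetical, and the model ignores Nix's multi-output package handling.

```rust
/// Models lib.makeBinPath for plain string entries: every entry gets
/// "/bin" appended, then the results are joined with ':'.
fn make_bin_path(entries: &[&str]) -> String {
    entries
        .iter()
        .map(|entry| format!("{}/bin", entry))
        .collect::<Vec<_>>()
        .join(":")
}

/// Returns the entries that already end in "/bin" and would therefore
/// resolve to a ".../bin/bin" directory.
fn double_bin_entries<'a>(entries: &[&'a str]) -> Vec<&'a str> {
    entries
        .iter()
        .copied()
        .filter(|entry| entry.trim_end_matches('/').ends_with("/bin"))
        .collect()
}
```

A check like `double_bin_entries` could run over the unit's path list in a test, since nothing at eval time or unit start will flag the bad entry.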
RuntimeDirectoryPreserve = "yes"
…keeps /run/hyperhive/ (and the per-agent sub-dirs) across
hive-c0re restarts. Without it, every restart wipes bind sources and
existing containers can't be started.
register_agent is idempotent
Drops any prior socket task before rebinding. Required so a
hive-c0re restart followed by rebuild alice recreates the agent's
socket without needing a clean reinstall.
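The drop-before-rebind shape, reduced to a std-only sketch. `Registry` and `SocketTask` are hypothetical stand-ins: the real socket task is presumably an async task holding a listener, whereas here a flag flipped on drop plays the role of cancellation.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for a running socket task; dropping it signals cancellation.
struct SocketTask {
    cancelled: Arc<AtomicBool>,
}

impl Drop for SocketTask {
    fn drop(&mut self) {
        self.cancelled.store(true, Ordering::SeqCst);
    }
}

#[derive(Default)]
struct Registry {
    tasks: HashMap<String, SocketTask>,
}

impl Registry {
    /// Safe to call repeatedly for the same agent. Returns the new
    /// task's cancellation flag so callers can observe its state.
    fn register_agent(&mut self, name: &str) -> Arc<AtomicBool> {
        // Drop the prior task first so its socket is released before
        // the new one binds.
        self.tasks.remove(name);
        let cancelled = Arc::new(AtomicBool::new(false));
        self.tasks.insert(
            name.to_string(),
            SocketTask { cancelled: Arc::clone(&cancelled) },
        );
        cancelled
    }
}
```

Registering the same name twice leaves exactly one live task; the first task's flag reads as cancelled.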
claude-code is unfree
The flake pins it to nixpkgs-unstable via
overlays.claude-unstable (stable lags too far). The overlay sets
config.allowUnfreePredicate on its unstable import to whitelist
claude-code specifically — scoped, only this one package.
harness-base.nix does the same at the container level because
each per-agent nixosConfiguration evaluates its own nixpkgs
instance and the operator's host-level allowUnfree does not
propagate in. Operators don't need to set anything on their side.
Claude credentials are per-agent
/var/lib/hyperhive/agents/<name>/claude/ bind-mounts to
/home/<name>/.claude (RW). Sharing one dir across agents is NOT viable —
OAuth refresh tokens rotate, so any sibling refresh invalidates all
the others. Login flow runs from the per-agent web UI; creds persist
across destroy/recreate (--purge wipes them).
Persistent notes dir per agent
/var/lib/hyperhive/agents/<name>/state/ bind-mounts to
/agents/<name>/state (RW; uniform for sub-agents + manager).
The harness exposes the same path
via $HYPERHIVE_STATE_DIR. System prompts tell agents to keep
durable knowledge here (notes.md, anything else). The harness also
writes its events log here (hyperhive-events.sqlite).
Survives destroy/recreate alongside the claude dir.
Web UI ports collide on hash
Sub-agent web UI ports are deterministic FNV-1a of the agent name
modulo 900 (range 8100..8999). With ~30 agents the birthday-paradox
collision rate gets meaningful; at 2–3 agents you can still get
unlucky. Operator resolves a collision by renaming the offending
agent (different hash → different port) and rebuilding. No state
file, no probing, no port-allocation drift — the value is
reproducible from just the name. Every agent, including the
manager, hashes into 8100..8999 via the same FNV-1a; the dashboard
listens on cfg.dashboardPort (default 7000).
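A sketch of the port derivation and the collision odds. The 32-bit FNV-1a variant is an assumption (the source does not say which width is used), and `web_ui_port` / `collision_probability` are hypothetical names. The probability treats the hash as uniform over the 900 slots.

```rust
/// 32-bit FNV-1a with the standard published offset basis and prime.
/// Assumption: hyperhive may use a different width.
fn fnv1a_32(data: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5;
    for &byte in data {
        hash ^= u32::from(byte);
        hash = hash.wrapping_mul(0x0100_0193);
    }
    hash
}

/// Maps an agent name into 8100..=8999.
fn web_ui_port(agent: &str) -> u16 {
    8100 + (fnv1a_32(agent.as_bytes()) % 900) as u16
}

/// Chance that at least two of `n` agents land on the same port,
/// assuming the hash is uniform over the 900 slots.
fn collision_probability(n: u32) -> f64 {
    let slots = 900.0;
    let mut all_distinct = 1.0;
    for i in 0..n {
        all_distinct *= (slots - f64::from(i)) / slots;
    }
    1.0 - all_distinct
}
```

Under that uniformity assumption, 2 agents collide with probability 1/900 (about 0.1%), 3 agents about 0.3%, and 30 agents roughly 39%, which is where renaming starts to become a routine chore.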
Restart races on TCP bind
Both the dashboard and per-agent web UI use tokio::net::TcpSocket
with SO_REUSEADDR plus a retry-on-AddrInUse loop (12 tries,
exponential backoff capped at 2s, ~22s total). REUSEADDR handles
the TIME_WAIT case from a clean previous exit; retry covers the
genuine "previous process is still alive during a systemd restart
overlap" case. REUSEADDR does not allow two simultaneous
LISTEN sockets on the same port (that would be SO_REUSEPORT,
which we don't use) — exclusivity is preserved.
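The retry loop, sketched with std's blocking `TcpListener` instead of `tokio::net::TcpSocket`, so the explicit SO_REUSEADDR step is omitted. The function names are hypothetical and the base delay is left as a parameter, because the source gives only the try count, the 2s cap, and the rough total.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};
use std::thread::sleep;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt,
/// never more than `cap`.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).unwrap_or(cap).min(cap)
}

/// Tries to bind up to `tries` times, sleeping between attempts only
/// when the failure is AddrInUse. Any other error is returned at once.
fn bind_with_retry(
    addr: SocketAddr,
    tries: u32,
    base: Duration,
    cap: Duration,
) -> io::Result<TcpListener> {
    let mut attempt = 0;
    loop {
        match TcpListener::bind(addr) {
            Ok(listener) => return Ok(listener),
            Err(err) if err.kind() == io::ErrorKind::AddrInUse && attempt + 1 < tries => {
                sleep(backoff_delay(attempt, base, cap));
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```

Retrying only on `AddrInUse` keeps permission errors and bad addresses failing fast instead of hiding behind the backoff.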
Orphan approvals
If state dirs are wiped out from under pending approvals (test
scripts, manual rm -rf), the dashboard's next render marks those
approvals failed with the note "agent state dir missing" so they
fall out of pending. They stay in sqlite for audit.
Nix store cp -r preserves read-only bits
Copying a nix store path with cp -r src/. $out/ inside a
pkgs.runCommand derivation preserves the read-only permissions of
store files. Any subsequent write into the copied tree (adding new
files in subdirectories) fails with EPERM. Fix: pass
--no-preserve=mode,ownership so the output tree is writable.
SPA fallback: use Accept header map, not try_files ... /index.html
The naive nginx pattern for a path-prefix SPA (try_files $uri $uri/ /matrix/index.html) silently swallows asset 404s — a missing JS file
returns index.html with a 200, so the JS runtime never loads and the
page renders blank with no visible error. Extension allowlists (tried
as an alternative) bring a maintenance problem of their own: any new
file extension the SPA ships breaks silently.
The pattern that works (hive-gateway.nix) keys the fallback on the
HTTP Accept header:
# Outside the server block (appendHttpConfig):
map $http_accept $matrix_spa_target {
    default      "/__matrix_spa_no_html_fallback";
    "~*text/html" "/matrix/index.html";
}
# Inside the location:
try_files $uri $uri/ $matrix_spa_target =404;
Top-frame navigations always send Accept: text/html,... (Chrome /
Firefox / Safari are consistent). Asset fetches (image/*,
application/javascript, */*) don't carry text/html, so they
fall through to the trailing =404. No extension list to maintain;
no named-location indirection needed.
nix build flake#name does not walk into nixosConfigurations
nix build resolves the fragment (#name) against the flake's
top-level output attrs — not against nixosConfigurations
specifically. nixos-container and nixos-rebuild use their own
internal convention that routes an agent name to
nixosConfigurations.<name>.config.system.build.toplevel, but
nix build has no such convention.
# ❌ silently builds the wrong thing (or errors if attr doesn't exist)
nix build /var/lib/hyperhive/meta#argus.config.system.build.toplevel
# ✅ explicit path nix build actually resolves
nix build /var/lib/hyperhive/meta#nixosConfigurations.argus.config.system.build.toplevel
lifecycle::prebuild_toplevel hit this once by constructing the attr
path as {flake_ref}.config… — which produced meta#argus.config…
instead of meta#nixosConfigurations.argus.config…. The fix:
split_once('#') to separate flake path from name, then template
{path}#nixosConfigurations.{name}.config.system.build.toplevel.
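A minimal sketch of that fix. `toplevel_attr` is a hypothetical name and the real `prebuild_toplevel` may handle errors differently; rejecting an empty path or name is an added assumption.

```rust
/// Turns "<path>#<name>" into the fully qualified installable that
/// `nix build` can resolve. Returns None if there is no '#' or either
/// side of it is empty.
fn toplevel_attr(flake_ref: &str) -> Option<String> {
    let (path, name) = flake_ref.split_once('#')?;
    if path.is_empty() || name.is_empty() {
        return None;
    }
    Some(format!(
        "{}#nixosConfigurations.{}.config.system.build.toplevel",
        path, name
    ))
}
```

For `/var/lib/hyperhive/meta#argus` this produces the ✅ form shown above; a ref without a fragment yields `None` rather than a silently wrong attr path.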
hive-forge: prefer over raw curl pipelines
Full CLI reference: docs/tools/forge.md.
Never use raw curl for forge access.
Containerized nix-daemon needs sandbox-fallback = true
Agent containers bind-mount the host's nix-daemon socket. nspawn
containers don't get user-namespaces by default, so nix build
invocations inside the container can't set up the build sandbox
and fail outright if the host daemon's
nix.settings.sandbox-fallback is false (nixpkgs default).
nix/templates/harness-base.nix does lib.mkForce true so builds
fall back to unsandboxed local builds rather than failing. Security
implications: docs/security.md.
Linking workspace binaries locally needs nix develop
The Rust workspace links libsqlite3-sys (rusqlite) against the
system libsqlite3. Agent containers carry no system libsqlite3 on
the linker path, so a plain cargo build of any binary dies with
cannot find -lsqlite3 (deps and ring compile fine — only the
final link fails). cargo check / cargo clippy still work in the
ambient shell since they never link.
Build + run binaries through the dev shell, which carries sqlite
on NIX_LDFLAGS:
nix develop -c cargo build -p hive-c0re --bin hivectl
nix develop -c cargo run -p hive-c0re --bin hivectl -- <args>
This is also how you regenerate committed generated docs locally —
e.g. docs/tools/hivectl-cli.md via the hivectl markdown-docs
subcommand (its hivectl-docs flake check otherwise only fails in
CI on drift).
Split asset derivations away from the rust workspace
nix/assets.nix builds the branding SVG/PNG family + claude
system-prompt template + claude-settings JSON as its own derivation,
separate from the hive-ag3nt / hive-c0re crates. Reason: when the
rust build's src was the whole repo tree, any tweak to
branding/agent-configs.svg or hive-ag3nt/prompts/system.md
invalidated the cargo cache and forced a full rebuild. crane (and
naersk before it) couldn't see "these inputs are unused by rust" on
its own — the split breaks the coupling at the derivation boundary.
The agent-configs PNG is rendered from the SVG via rsvg-convert at
build time; librsvg dependency lives here, not in the rust
derivation's nativeBuildInputs.
Weston VNC compositor (per-agent hyperhive.gui.enable)
nix/templates/weston-vnc.nix adds an optional Weston Wayland
compositor with the VNC backend, surfaced as
hyperhive.gui.enable = true per-agent. The harness's
/screen/ws WebSocket relay (docs/web-ui/agent.md::Per-agent endpoints)
connects to the compositor at 127.0.0.1:<vnc_port>.
- Port allocation: deterministic FNV-1a of the agent name (read from
  /etc/hostname, leading h- stripped) mapped into [15900, 16799].
  Mirrors the agent web-UI port pattern from
  docs/gotchas.md::Web UI ports collide on hash: same FNV-1a
  constant, different range. The compositor's startup script writes
  /etc/hyperhive/gui.json =
  {"vnc_port":N,"auth":"none","wayland_display":"wayland-0"} so the
  harness reads the port at runtime; no nix-side / harness-side hash
  duplication.
- Fixed Wayland socket name (--socket=wayland-0): weston is launched
  with --socket=wayland-0 so the socket path is deterministic.
  harness-base.nix exports WAYLAND_DISPLAY=wayland-0 and
  XDG_RUNTIME_DIR=/run/user/0 as global system environment variables
  (gated on hyperhive.gui.enable) so every systemd service in the
  container inherits them. Without this, services starting Wayland
  clients could not find the compositor: libwayland falls back to a
  headless display or errors out, the app "works" on a second
  invisible display, and the VNC session shows a blank weston desktop
  (#540 double-screen).
- VNC bind address: weston's VNC backend has no CLI bind-address flag
  (unlike the RDP backend's --address), so the listener binds
  0.0.0.0. The harness relay only connects via 127.0.0.1; the host
  firewall blocks the per-agent VNC port range from external access.
  A future weston.ini [vnc] address= will let us restrict the bind
  directly once upstream supports it.
- PAM service name: literal weston-remote-access, the string
  libweston passes to pam_start() in libweston/auth.c. Using weston
  falls back to the system default PAM stack and rejects auth. The
  service is configured to pam_permit.so for all three module types
  (auth / account / session) so the browser's empty Apple-DH
  credentials (type 30) always pass. neatvnc ≥ 0.9 calls the PAM auth
  callback regardless of weston.ini auth-method=none, so the permit
  fallback is what actually lets the empty-cred client through.
- Type = "simple" (not notify): switch-to-configuration must never
  block on weston signalling readiness. A misconfigured weston
  degrades to a Restart=on-failure loop visible in journalctl; it
  does not abort the nixos-container update. Same reasoning as the
  tea-login unit in harness-base.nix.
- [core] idle-time=0: disables weston's 300-second idle timeout.
  Without it the VNC desktop fades to black and desktop-shell shows
  its click-to-unlock screen, which is useless for an agent desktop
  viewed over /screen. idle-time=0 updates the idle timer with a 0ms
  delay, which wl_event_source_timer_update treats as "disarm", so
  the compositor never goes idle and never locks.
Nix options reference (nix/docs/default.nix)
pkgs.nixosOptionsDoc over two evaluated module trees:
hostEval (a stub NixOS system loading self.nixosModules.default
with every hyperhive subsystem mkForce false so heavy build
inputs stay out of the eval) and agentEval (reuses the already-evaluated
agent-base container config so the per-agent options tree is
identical to what a real agent container sees).
Three output trees consumed by flake.nix, all markdown:
- docs-host: operator-facing host module options
  (services.hyperhive.*)
- docs-agent: per-agent harness options (hyperhive.* declared in
  nix/templates/harness-base.nix)
- docs: bundle of index.md + host.md + agent.md
Pipeline:
- CommonMark from nixosOptionsDoc.optionsCommonMark is the only
  output: the source of truth, emitted as .md.
- HTML + CSS is rendered downstream by the website repo
  (nix/options.nix there), which consumes this bundle's host.md /
  agent.md, renders them with cmark-gfm, and shares one stylesheet
  (docs.css) across /options/ and the prose /docs/ tree. Keeping
  rendering in the website means the theme has a single home and the
  colours are shared.
- transformOptions strips the nix-store prefix from option
  declaration paths and rewrites them as forge URLs, so the rendered
  docs link back to the source.
Host options live entirely under services.hyperhive.*. The
pickSubtrees filter is rooted at ["services" "hyperhive"] so the
options tree picks up everything under that root — picking against
stray roots produces an empty tree and renders the host page as
template chrome with no <h2> headers.