Persistence + retention

Where state lives, what survives what, and how it's bounded.

Three sqlite databases

/var/lib/hyperhive/db/broker.sqlite (host)

Six tables, all in one file — four queues plus the schedule header/targets split:

Retention:

/harness/hyperhive-events.sqlite (per agent)

Lives inside each container's bind-mounted /harness/ dir (host path: /var/lib/hyperhive/agents/<name>/harness/hyperhive-events.sqlite). One table:

The harness writes; the host vacuums. hive-c0re::events_vacuum runs hourly and sweeps every existing agent harness dir. Retention is type-scoped: it deletes only the verbose stream rows (the raw claude stream-json deltas — one per text chunk / tool use, the bulk of the file's size) older than 14 days, and keeps every other kind (turn_start, turn_end, note, status_changed, model_changed, token_usage_changed, turn_state_changed) indefinitely — those are small and carry the semantic per-turn history the operator scrolls back through when debugging a regression. Age-only within the stream kind — no row cap — so a chatty turn doesn't lose its stream history sooner than a quiet one. Centralising retention on the host means a misbehaving harness can't disable its own vacuum and agents don't need any cleanup wiring of their own.

Path overridable via HYPERHIVE_EVENTS_DB (for dev / no-/harness setups). On open failure the Bus falls back to no-store mode rather than crashing the harness — events still broadcast over SSE, just nothing persisted.

/harness/hyperhive-turn-stats.sqlite (per agent)

Per-turn analytics sink. One row per claude turn captures identity (model, wake_from, result_kind), timing (started_at, ended_at, duration_ms), cost (input / output / cache_read / cache_creation token counts), behaviour (tool_call_count + tool_call_breakdown_json), and post-turn snapshot metrics (open_threads_count, open_reminders_count — fetched via the same socket the harness already uses for GetOpenThreads + CountPendingReminders). Bin-loop helpers build_row + record land each row at turn_end; writes are best-effort, a sqlite hiccup logs + lets the turn loop continue.

A sibling bash_commands(ts INTEGER, head TEXT) table in the same file is written by the hive-bash-mcp daemon (not the harness): one row per executed bash task recording the normalised command head - the basename of the first real command, looking past cd repo && prefixes, env-assignments, and prefix-runners like sudo/env. It backs the "favorite tools" view on the /stats page (aggregated host-side). Best-effort and created on first write (CREATE TABLE IF NOT EXISTS), so it's simply absent until a bash task runs.

turn-stats.sqlite has no vacuum — it's one tiny row per turn (~hundreds of KB even over months), read directly by the /stats page and the hive-wide stats view, so pruning it would only lose trend history for no space gain.

/state/hyperhive-harness.json (per agent)

Consolidated harness state file written atomically (.tmp + rename) by Bus::emit_status whenever rate-limited or login-failed flags change. Shape:

{ "rate_limited": false, "needs_login": false }

hive-c0re reads this file on each build_all sweep (~10s) via container_view::read_harness_flags. Falls back to the legacy individual sentinel files (hyperhive-rate-limited, hyperhive-needs-login) if the JSON is absent, so existing containers keep working through the transition window before their next rebuild.

/var/lib/hyperhive/db/build_logs.sqlite (host)

Full stdout + stderr capture for every nixos-container / nix build invocation the lifecycle layer fires. One row per invocation; the row accumulates lines as the child runs.

Replaces the legacy 32-line stderr ring buffer that lifecycle::run kept. The ring tail routinely truncated real eval errors ("tried alternatives" blocks alone are often 30+ lines), so failures bailed with an arbitrary tail whose full stream only lived in the host journal. With this table the dashboard can surface the entire log.

Two indices:

Writes are best-effort: append_stdout / append_stderr / finish log a warning on sqlite error and let the build continue. A failed log row never blocks a rebuild.

/harness/hyperhive-model (per agent)

Single-line text file holding the claude model name currently selected for this agent (default haiku when absent). Written by Bus::set_model whenever the operator flips it via /model <name> in the web terminal. Read once at harness boot in Bus::new. Path overridable via HYPERHIVE_MODEL_FILE. Survives destroy/recreate, gone on --purge.

State dirs (per agent)

Under /var/lib/hyperhive/agents/<name>/:

Parent access to child state

A parent agent gets each direct child's state, harness, and config dirs bind-mounted read-write (bind_child_agent_dirs in lifecycle.rs). The RW on state is deliberate, not an oversight: a parent manages its children, which includes writing into a child's state for recovery (e.g. seeding notes, clearing a stuck sentinel) as well as reading it. config is RW because the parent authors proposed config changes for the child (the approval flow commits into the child's config repo), and harness is RW for the same management reasons. Per-child isolation still holds: a container only ever has its own dirs plus its direct children's bind-mounted, never a sibling's.

Under /var/lib/hyperhive/applied/<name>/ — the hive-c0re-only applied repo. Tracks flake.nix (module-only boilerplate; never edited after first spawn) + agent.nix (the actual config; the manager's edits land here via the approval flow) + any other files the manager committed. .git/ carries the proposal / approved / building / deployed / failed / denied tag history.

Under /var/lib/hyperhive/meta/ — the swarm-wide deploy flake plus system-level config files. Single git repo for the whole host; every hive-c0re mutation that should survive a restart is committed here. Contents:

Manager has the meta dir RO-mounted at /meta/.

Marker file /var/lib/hyperhive/.meta-migration-done is written by the startup migration after every container has been repointed at meta#<n>. Removing it forces a re-run on next hive-c0re start (idempotent — only the actual repoint step would re-fire).

Destroy vs purge

The manager is non-destroyable from both paths (declarative container; would fight with the host's NixOS config).

Run-time dirs

/run/hyperhive/ is tmpfs-backed (systemd RuntimeDirectory=) but preserved across hive-c0re restarts via RuntimeDirectoryPreserve=yes. Without that, every restart wipes bind sources and existing containers can't be started.

On startup, Coordinator::register_agent drops any prior socket task before rebinding — idempotent so a hive-c0re restart followed by rebuild alice recreates the agent's socket without a clean reinstall.

First-boot agent-user migration

The harness runs as a per-agent unix user inside the container (hyperhive.user.name, defaults to the agent's logical label so each container has a uniquely-named user). Operators with legacy root-owned state dirs need a one-time data shuffle so they don't lose their claude session.

system.activationScripts.hive-agent-user-migrate (in nix/templates/harness-base.nix) runs on every activation, marker-guarded so the substantive moves only happen once per container lifetime:

  1. ${homeDir} exists with the right ownership — covers the very first boot before useradd's createHome has had a chance to chown. Also re-applies on every rebuild in case the meta-flake's per-agent name evolves (rare).
  2. Migrate any leftover /root/.claude content into ${homeDir}/.claude — legacy claude wrote to root's empty home; the bind mount didn't exist yet. Marker (/var/lib/hive-agent-user-migrated) guards single-shot. cp -an (no-clobber) so any pre-existing files at the new location win — never blow over data already there.
  3. Chown the bind-mounted state dir (/agents/*/state) recursively so the agent user can read/write it. Wildcard matches the single agent that container sees; -h skips symlinks the agent might have planted.
  4. Chown the ~/.claude/ bind-mount recursively. Legacy claude wrote .credentials.json 0600 root:root; the current harness reads ~/.claude/ as the agent user to decide Online vs NeedsLogin in login::has_session. Without the chown the existing credentials get silently treated as "no session" and the operator re-prompts every boot.

The activation script will eventually become unnecessary once no operators have legacy root-owned state dirs left to migrate; drop the body + marker check at that point.

Matrix per-agent daemon + token-arrival trigger

hive-matrix-daemon is a long-running matrix-sdk Client + sync process per agent. Holds the unix socket the stdio hive-matrix-mcp bridge talks to, emits hyperhive wake signals on incoming room events via /run/hive/mcp.sock. Conditional on hyperhive.matrix.enable (which both the daemon AND the auto-injected extraMcpServers.matrix entry read).

Socket path lives inside the systemd-managed runtime dir (RuntimeDirectory = "hive-matrix"/run/hive-matrix/, owned by the agent user) so the daemon can bind without needing root over /run/ itself. Both daemon + bridge agree on the path via the HIVE_MATRIX_SOCKET env var.

First-boot ordering: hive-c0re provisions the matrix token AFTER agent containers come up. Without the path-trigger sibling (systemd.paths.hive-matrix-daemon, PathExistsGlob = /agents/*/state/matrix-token), the daemon would exit 0 quietly the first time it ran and the MCP would have no backend until the next restart. The .path unit makes the appearance of the token re-fire the service so the daemon comes alive in the same boot cycle as provisioning. matrix-avatar-sync.path uses the same pattern for the icon-upload oneshot.

matrix-avatar-sync (two-step media + profile dance)

Mirrors the forge-avatar oneshot's shape (docs/conventions.md:: Best-effort oneshot services) but differs in protocol: matrix avatars are a two-step POST /media/r0/uploadPUT /profile/<user_id>/avatar_url dance, both authenticated by the access_token written by hive-c0re::matrix::ensure_user_for to <state>/matrix-token.

Triggered by EITHER boot (wantedBy = multi-user.target) OR the sibling matrix-avatar-sync.path firing on token appearance. Both paths re-run the oneshot idempotently — running the avatar set twice is harmless.

Critically: RemainAfterExit = false (not the more common true for oneshots). systemd treats RemainAfterExit = true oneshots as "still running" after the first exit, so the second trigger from the .path watcher becomes a no-op. Setting it to false lets re-fires actually re-execute. The trade-off is the service unit shows inactive (dead) between fires — visible in journalctl but harmless; the .path unit drives the lifecycle.