Persistence + retention

Where state lives, what survives what, and how it's bounded.

Sqlite databases

`/var/lib/hyperhive/db/broker.sqlite` (host)

Seven tables, all in one file — four queues, the schedule header/targets split, and the per-agent power-intent registry:

messages — every inter-agent / operator-bound message. sender / recipient / body / sent_at / delivered_at / acked_at / in_reply_to. in_reply_to links a reply to its parent row id; the dashboard and per-agent inbox render these as threaded rows.
reminders — mcp__hyperhive__remind queue. agent / message / file_path / due_at / created_at / sent_at / attempt_count / last_error. file_path set when a body exceeded the inline soft-cap and got auto-spilled to a file under the agent's state dir; the worker delivers a short pointer instead. attempt_count / last_error accumulate on delivery-failed retries.
approvals — the queue. agent / kind (merge_config_pr | spawn | init_config | update_meta_inputs | schedule_prompt) / commit_ref / requested_at / status / resolved_at / note.
operator_questions — ask / answer queue (despite the table name, stores both operator-targeted + agent-to-agent questions since the ask rename). asker / question / options_json / multi / asked_at / deadline_at (ttl) / answered_at / answer / target. target IS NULL = operator path (dashboard); target = '<agent>' = peer Q&A (HelperEvent::QuestionAsked pushed into target's inbox, answered via Answer request). Migrated via ALTER TABLE ADD COLUMN against pragma_table_info.
scheduled_prompts — recurring + one-shot prompt queue. owner / body / interval_seconds (NULL = one-shot) / next_fire_at_unix / created_at_unix / source ("operator" or "approval:<id>") / cancelled_at_unix / description. owner drives cancel-permission checks (operator vs the submitting agent). Cancelled rows are tombstoned and reaped by the worker on its next pass.
scheduled_prompt_targets — per-target state for each schedule. schedule_id / target / cancelled_at_unix / last_fired_at_unix / last_result. ON DELETE CASCADE from scheduled_prompts(id) — requires PRAGMA foreign_keys = ON per connection (set at open).
agent_power — one tiny row per agent: agent PK / wanted (up | offline) / updated_at — the durable power intent behind the job queue's desired-state reconciliation (docs/coordinator.md::Job queue; owner: hive-c0re/src/stores/power.rs). Written synchronously by every operator/agent power action (dashboard start/stop, hivectl stop, the MCP kill/start tools, spawn approval); read by Reconcile nodes and the boot reconcile. Intent survives hive-c0re restarts — in-flight queue work deliberately does not. Agents without a row are seeded from observed state on first touch (running ⇒ up); destroy deletes the row.

Retention:

Broker::vacuum_delivered runs hourly via a tokio task in hive-c0re::main. Drops acked message rows older than 30 days (acked_at IS NOT NULL). Undelivered + delivered-but-not-acked rows are always kept — the harness ack_turns only after a successful turn, so an unacked row can still be requeued via requeue_inflight on a crash.
Approvals and questions are kept indefinitely — both are audit trails. actions::destroy and answered questions stay visible to anything that queries by id.
Reminder rows are kept after sent_at is set (audit trail); no automatic vacuum today.
Scheduled prompts: one-shot rows are deleted on fire by the worker; recurring rows live until the operator cancels them (cancel_schedule MCP / dashboard ✗) which tombstones via cancelled_at_unix, then reap_cancelled drops the row on the next worker pass.
agent_power rows live until the agent is destroyed (one row per agent — nothing to vacuum).

`/harness/hyperhive-events.sqlite` (per agent)

Lives inside each container's bind-mounted /harness/ dir (host path: /var/lib/hyperhive/agents/<name>/harness/hyperhive-events.sqlite). One table:

events(id, ts, kind, payload_json) — every LiveEvent the harness emits during turn loop execution.

The harness writes; the host vacuums. hive-c0re::events_vacuum runs hourly and sweeps every existing agent harness dir. Retention is type-scoped: it deletes only the verbose stream rows (the raw claude stream-json deltas — one per text chunk / tool use, the bulk of the file's size) older than 14 days, and keeps every other kind (turn_start, turn_end, note, status_changed, model_changed, token_usage_changed, turn_state_changed) indefinitely — those are small and carry the semantic per-turn history the operator scrolls back through when debugging a regression. Age-only within the stream kind — no row cap — so a chatty turn doesn't lose its stream history sooner than a quiet one. Centralising retention on the host means a misbehaving harness can't disable its own vacuum and agents don't need any cleanup wiring of their own.

Path overridable via HYPERHIVE_EVENTS_DB (for dev / no-/harness setups). On open failure the Bus falls back to no-store mode rather than crashing the harness — events still broadcast over SSE, just nothing persisted.

`/harness/hyperhive-turn-stats.sqlite` (per agent)

Per-turn analytics sink. One row per claude turn captures identity (model, wake_from, result_kind), timing (started_at, ended_at, duration_ms), cost (input / output / cache_read / cache_creation token counts), behaviour (tool_call_count + tool_call_breakdown_json), and post-turn snapshot metrics (open_threads_count, open_reminders_count — fetched via the same socket the harness already uses for GetOpenThreads + CountPendingReminders). Bin-loop helpers build_row + record land each row at turn_end; writes are best-effort, a sqlite hiccup logs + lets the turn loop continue.

A sibling bash_commands(ts INTEGER, head TEXT) table in the same file is written by the hive-bash-daemon (not the harness): one row per executed bash task recording the normalised command head - the basename of the first real command, looking past cd repo && prefixes, env-assignments, and prefix-runners like sudo/env. It backs the "favorite tools" view on the /stats page (aggregated host-side). Best-effort and created on first write (CREATE TABLE IF NOT EXISTS), so it's simply absent until a bash task runs.

turn-stats.sqlite has no vacuum — it's one tiny row per turn (~hundreds of KB even over months), read directly by the /stats page and the hive-wide stats view, so pruning it would only lose trend history for no space gain.

`/state/hyperhive-harness.json` (per agent)

Consolidated harness state file written atomically (.tmp + rename) by Bus::emit_status whenever rate-limited or login-failed flags change. Shape:

{ "rate_limited": false, "needs_login": false, "active_model": "…", "forge_cursor": { "42": "2026-07-01T18:00:00Z" } }

rate_limited — set when the harness detects a 429 from the Claude API; cleared by any subsequent status emit. Drives ContainerView.rate_limited on the dashboard.
needs_login — set when a turn hits 401 (expired OAuth credentials); cleared by "online" status (re-auth completed). Drives the needs_login flag alongside the claude_has_session check.
active_model — the resolved Claude model for the dashboard badge.
forge_cursor — the forge_notify delivery-dedupe cursor (notification thread id → last-delivered updated_at), so a rebuild/restart doesn't re-deliver the whole currently-unread forge backlog. See forge.md.

Multiple harness tasks write this file (the turn loop for the first three fields, the forge_notify poller for forge_cursor), so every writer goes read-modify-write under a shared in-process lock — each preserves the fields it doesn't own rather than reconstructing the file from scratch.

hive-c0re reads this file on each build_all sweep (~10s) via container_view::read_harness_flags. Falls back to the legacy individual sentinel files (hyperhive-rate-limited, hyperhive-needs-login) if the JSON is absent, so existing containers keep working through the transition window before their next rebuild.

`/var/lib/hyperhive/db/build_logs.sqlite` (host)

Full stdout + stderr capture for every nixos-container / nix build invocation the lifecycle layer fires. One row per invocation; the row accumulates lines as the child runs.

Replaces the legacy 32-line stderr ring buffer that lifecycle::run kept. The ring tail routinely truncated real eval errors ("tried alternatives" blocks alone are often 30+ lines), so failures bailed with an arbitrary tail whose full stream only lived in the host journal. With this table the dashboard can surface the entire log.

Two indices:

(agent, started_at) — backs the per-agent latest-N lookup used by the agent card chip.
(status, finished_at) — backs the retention sweep that runs as part of the existing hourly vacuum.

Writes are best-effort: append_stdout / append_stderr / finish log a warning on sqlite error and let the build continue. A failed log row never blocks a rebuild.

`/harness/hyperhive-model` (per agent)

Single-line text file holding the claude model name currently selected for this agent (default haiku when absent). Written by Bus::set_model whenever the operator flips it via /model <name> in the web terminal. Read once at harness boot in Bus::new. Path overridable via HYPERHIVE_MODEL_FILE. Survives destroy/recreate, gone on --purge.

`/harness/paused` (per agent)

Empty marker file. Its presence parks the agent's turn loop: the harness keeps serving its web UI and MCP daemons but drives no turns, and inbox messages queue unacked until it's removed (see turn loop).

Unusually, it's read and written from both sides of the harness bind-mount, and that's the whole design: the harness stats it in-container via hive_sh4re::paths::paused_marker, while hive-c0re stats it on the host (Coordinator::is_paused) to populate the paused field on the agent card, and creates/removes it (Coordinator::set_paused) for hivectl agents pause|resume and the dashboard toggle. Because the file itself is the only shared state there's no protocol between them, no round-trip into the container, and pause keeps working when the harness is wedged or the container is stopped.

It lives in /harness/ rather than /state/ deliberately: /state/ is the agent's own space to fill, and this is harness control state. Survives destroy/recreate, gone on --purge — so a paused agent comes back paused after a restart, which is the intended behaviour rather than an accident of storage.

State dirs (per agent)

Under /var/lib/hyperhive/agents/<name>/:

config/ — the proposed nix repo (root-agent-editable). Bind-mounted read-only to /agents/<name>/config inside the sub-agent's own container so the agent can inspect what defines it and request precise changes from the root agent; RW into the root agent via the /agents tree bind.
claude/ — claude OAuth credentials, bind-mounted RW to /home/<name>/.claude inside the container.
state/ — durable notes and hyperhive-harness.json. Bind-mounted to /agents/<name>/state inside the container (uniform for all agents). The $HYPERHIVE_STATE_DIR env var exposes the same path to in-container scripts. Notable files written here by the harness:
- hyperhive-status — single-line free-text status string written by set_status; cleared on explicit set_status(""). Read by hive-c0re and the per-agent /api/dashboard-state endpoint to surface the status chip on the dashboard. Absent when no status is set.
- hyperhive-harness.json — rate-limited / needs-login flags read by the dashboard's async container-state fetch. See docs/web-ui/dashboard.md::Container row.
harness/ — harness-internal ephemeral state; not intended for agent consumption. Bind-mounted to /agents/<name>/harness inside the container ($HYPERHIVE_HARNESS_DIR). Contents:
- bash-tasks/ — task JSON + stdout/stderr files for background mcp__bash__run jobs. JSON files are <id>.json (status + tails), <id>.out / <id>.err (full captured output). hive-c0re::bash_tasks_vacuum runs hourly and deletes terminal task trios older than 48 hours; non-terminal (still-running) tasks are never deleted by vacuum.
- hyperhive-todos.sqlite — loose-ends-v2 todo store. In-container MCP daemons (hive-bash-daemon, hive-matrix-daemon) and forge_notify upsert keyed todos here over the harness's in-agent socket (HIVE_AGENT_SOCKET); the harness merges them into get_loose_ends output and clears a row on mark_todo_done. Replaced the old file-based mcp-loose-ends/ scanner.
- hyperhive-events.sqlite — turn-loop event log.
- hyperhive-turn-stats.sqlite — per-turn timing stats.
- hyperhive-model — single-line model name override file.

Parent access to child state

A parent agent gets each direct child's state, harness, and config dirs bind-mounted read-write (bind_child_agent_dirs in lifecycle.rs). The RW on state is deliberate, not an oversight: a parent manages its children, which includes writing into a child's state for recovery (e.g. seeding notes, clearing a stuck sentinel) as well as reading it. config is RW because the parent authors proposed config changes for the child (the approval flow commits into the child's config repo), and harness is RW for the same management reasons. Per-child isolation still holds: a container only ever has its own dirs plus its direct children's bind-mounted, never a sibling's.

Under /var/lib/hyperhive/applied/<name>/ — the hive-c0re-only applied repo. Tracks flake.nix (module-only boilerplate; never edited after first spawn) + agent.nix (the actual config; the root agent's edits land here via the approval flow) + any other files committed via the approval flow. .git/ carries the proposal / approved / building / deployed / failed / denied tag history.

Under /var/lib/hyperhive/meta/ — the swarm-wide deploy flake plus system-level config files. Single git repo for the whole host; every hive-c0re mutation that should survive a restart is committed here. Contents:

flake.nix — declares one nixpkgs input per agent + one nixosConfigurations.<n> output per agent. flake.lock is the canonical "what's deployed where." The git log is the deploy audit trail (one commit per successful deploy or hyperhive bump).
topology.json — parent/child agent graph ({ "alice": "root", "bob": "alice", "root": null }). Written by topology::set_parent; read by the dashboard, the renderer, and <parent> / <children> recipient resolution.
tool-groups.json — per-agent MCP tool group grants ({ "alice": ["messaging", "inbox", "execution"] }). Written by tool_groups::set_groups; injected as HIVE_TOOL_GROUPS env var into each agent's container.
capabilities.json — per-agent capability grants ({ "atlas": ["read_host_journal"] }). Written by capabilities::set_caps; injected as HIVE_CAPABILITIES env var. Absent agents have no extra capabilities.

The root agent has the meta dir RO-mounted at /meta/.

Marker file /var/lib/hyperhive/.meta-migration-done is written by the startup migration after every container has been repointed at meta#<n>. Removing it forces a re-run on next hive-c0re start (idempotent — only the actual repoint step would re-fire).

Destroy vs purge

DESTR0Y (default) — stops + removes the nspawn container, drops the systemd drop-in, fails any pending approvals. State dirs stay put; the agent appears in the dashboard's K3PT ST4T3 section as a tombstone with ⊕ R3V1V3 and PURG3 actions. R3V1V3 queues a Spawn approval that reuses the kept state on approve (no re-login).
PURG3 (opt-in via the dashboard button or hivectl agents destroy --purge <name>) — DESTR0Y plus wipes /var/lib/hyperhive/{agents,applied}/<name>/. Config history, claude creds, /state/ notes, and the harness dir are all gone. No undo.

The root/bootstrap container is imperative infrastructure — managed end-to-end by hive-c0re, not declared in the host's NixOS config. auto_update::ensure_root_agent recreates it on the next hive-c0re startup if it's absent (bypassing the approval queue, as required infrastructure). A soft policy guard in actions::destroy currently refuses to destroy it; even without that guard, destroying it would only be transient — hive-c0re brings it back on the next startup.

btrfs subvolumes for `/var/lib/hyperhive/agents/<name>`

On a btrfs host, a brand-new agent's state root is created as a btrfs subvolume instead of a plain directory (progressive enhancement — see the #1762 lane). This is a no-op fallback on non-btrfs hosts and for any agent whose root already exists, so nothing is auto-migrated: existing agents keep their plain dirs until an explicit opt-in upgrade.

Creation: lifecycle::ensure_agent_state_subvolume runs before the per-agent subdirs are created (spawn / rebuild / InitConfig). It skips the work when the root already exists; otherwise it asks hive-priv (EnsureAgentSubvolume) to btrfs subvolume create the root when the FS is btrfs (statfs magic gate) and chown it to the hive-core user so the normal state/ claude/ harness/ mkdirs succeed inside it.
DESTR0Y keeps the subvolume exactly like a plain dir — revival reuses it untouched.
PURG3 deletes it correctly: a subvolume root can't be removed with rmdir/remove_dir_all, so purge first calls hive-priv (DeleteAgentSubvolume) which btrfs subvolume deletes it iff it's actually a subvolume, then the normal remove_dir_all sweep covers plain-dir agents + the applied dir.

Per-subvolume disk-usage accounting and optional quotas are a follow-up (the qgroup work), not part of the base migration.

Run-time dirs

/run/hyperhive/ is tmpfs-backed (systemd RuntimeDirectory=) but preserved across hive-c0re restarts via RuntimeDirectoryPreserve=yes. Without that, every restart wipes bind sources and existing containers can't be started.

/run/hyperhive/host.sock — admin socket (host-side CLI).
/run/hyperhive/agents/<name>/mcp.sock — per-agent socket (bind-mounted into the container as /run/hive/mcp.sock).

On startup, Coordinator::register_agent drops any prior socket task before rebinding — idempotent so a hive-c0re restart followed by rebuild alice recreates the agent's socket without a clean reinstall.

First-boot agent-user migration

The harness runs as a per-agent unix user inside the container (hyperhive.user.name, defaults to the agent's logical label so each container has a uniquely-named user). Operators with legacy root-owned state dirs need a one-time data shuffle so they don't lose their claude session.

system.activationScripts.hive-agent-user-migrate (in nix/agent-modules/user.nix) runs on every activation, marker-guarded so the substantive moves only happen once per container lifetime:

${homeDir} exists with the right ownership — covers the very first boot before useradd's createHome has had a chance to chown. Also re-applies on every rebuild in case the meta-flake's per-agent name evolves (rare).
Migrate any leftover /root/.claude content into ${homeDir}/.claude — legacy claude wrote to root's empty home; the bind mount didn't exist yet. Marker (/var/lib/hive-agent-user-migrated) guards single-shot. cp -an (no-clobber) so any pre-existing files at the new location win — never blow over data already there.
Chown the bind-mounted state dir (/agents/*/state) recursively so the agent user can read/write it. Wildcard matches the single agent that container sees; -h skips symlinks the agent might have planted.
Chown the ~/.claude/ bind-mount recursively. Legacy claude wrote .credentials.json 0600 root:root; the current harness reads ~/.claude/ as the agent user to decide Online vs NeedsLogin in login::has_session. Without the chown the existing credentials get silently treated as "no session" and the operator re-prompts every boot.

The activation script will eventually become unnecessary once no operators have legacy root-owned state dirs left to migrate; drop the body + marker check at that point.

Matrix per-agent daemon + token-arrival trigger

hive-matrix-daemon is a long-running matrix-sdk Client + sync process per agent. Serves its MCP tools directly over streamable-http (hyperhive.mcp.matrixHttpPort, no stdio bridge — same shape as hive-bash-daemon), emits hyperhive wake signals on incoming room events via /run/hive/mcp.sock. Conditional on hyperhive.matrix.enable (which both the daemon AND the auto-injected extraMcpServers.matrix entry read).

First-boot ordering: hive-c0re provisions the matrix token AFTER agent containers come up. Without the path-trigger sibling (systemd.paths.hive-matrix-daemon, PathExistsGlob = /agents/*/state/matrix-token), the daemon would exit 0 quietly the first time it ran and the MCP would have no backend until the next restart. The .path unit makes the appearance of the token re-fire the service so the daemon comes alive in the same boot cycle as provisioning. The same token watcher also drives avatar setting: on a restart the daemon re-runs each account's bring-up, which sets the avatar (see below).

matrix avatar (set by the daemon over the live Client)

The agent icon (hyperhive.icon, an SVG) is published as each matrix account's profile avatar by hive-matrix-daemon itself (hive-matrix-mcp::client::sync_avatar), not a separate oneshot. After the daemon builds + restores an account's Client (authenticated, pointed at that account's resolved homeserver), it calls matrix-sdk's account().upload_avatar() — one call that uploads the media and sets avatar_url. Because it reuses the live Client, there is no hardcoded homeserver URL, no token re-read, and no token-file globbing: the daemon already iterates every configured + dashboard-discovered account in its bring-up loop, so the avatar is set for every account.

Nix rasterizes the SVG to a 512x512 PNG at build time (iconPng, via librsvg) and forwards its store path as HIVE_ICON_PNG on the daemon unit, gated on hyperhive.icon != null. No icon configured → the env is unset → sync_avatar returns early and no avatar is set.

Idempotency is per-account: an avatar-icon-hash file in each account's matrix-sdk state_dir. The daemon hashes the PNG bytes and skips the upload when unchanged, because every upload mints a fresh mxc:// URI that emits a profile state event in every joined room — re-uploading identical bytes is timeline spam. A dashboard-provisioned account gets its avatar when the systemd.paths.hive-matrix-daemon token watcher restarts the daemon (which re-runs the per-account bring-up), so no separate avatar trigger is needed. Avatar failures are swallowed (logged, non-fatal) so they never break account bring-up or sync.

Persistence + retention

Sqlite databases

/var/lib/hyperhive/db/broker.sqlite (host)

/harness/hyperhive-events.sqlite (per agent)

/harness/hyperhive-turn-stats.sqlite (per agent)

/state/hyperhive-harness.json (per agent)

/var/lib/hyperhive/db/build_logs.sqlite (host)

/harness/hyperhive-model (per agent)

/harness/paused (per agent)