Persistence + retention
Where state lives, what survives what, and how it's bounded.
Three sqlite databases
/var/lib/hyperhive/db/broker.sqlite (host)
Six tables, all in one file — four queues plus the schedule header/targets split:
- messages: every inter-agent / operator-bound message. sender / recipient / body / sent_at / delivered_at / acked_at / in_reply_to. in_reply_to links a reply to its parent row id; the dashboard and per-agent inbox render these as threaded rows.
- reminders: the mcp__hyperhive__remind queue. agent / message / file_path / due_at / created_at / sent_at / attempt_count / last_error. file_path is set when a body exceeded the inline soft-cap and got auto-spilled to a file under the agent's state dir; the worker delivers a short pointer instead. attempt_count / last_error accumulate on delivery-failed retries.
- approvals: the queue. agent / kind (apply_commit | spawn | init_config | update_meta_inputs | schedule_prompt) / commit_ref / requested_at / status / resolved_at / note.
- operator_questions: the ask / answer queue (despite the table name, stores both operator-targeted + agent-to-agent questions since the ask rename). asker / question / options_json / multi / asked_at / deadline_at (ttl) / answered_at / answer / target. target IS NULL = operator path (dashboard); target = '<agent>' = peer Q&A (HelperEvent::QuestionAsked pushed into target's inbox, answered via Answer request). Migrated via ALTER TABLE ADD COLUMN against pragma_table_info.
- scheduled_prompts: recurring + one-shot prompt queue. owner / body / interval_seconds (NULL = one-shot) / next_fire_at_unix / created_at_unix / source ("operator" or "approval:<id>") / cancelled_at_unix / description. owner drives cancel-permission checks (operator vs the submitting agent). Cancelled rows are tombstoned and reaped by the worker on its next pass.
- scheduled_prompt_targets: per-target state for each schedule. schedule_id / target / cancelled_at_unix / last_fired_at_unix / last_result. ON DELETE CASCADE from scheduled_prompts(id); requires PRAGMA foreign_keys = ON per connection (set at open).
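The per-connection pragma in the last bullet is easy to miss, so here is a minimal sketch of why it matters. It uses Python's stdlib sqlite3 rather than the broker's own code, and the schema is a trimmed, hypothetical subset of the columns listed above; open_broker and targets_left_after_delete are illustrative names, not functions from the codebase.

```python
import sqlite3

def open_broker(path=":memory:"):
    """Open a connection with foreign-key enforcement switched on."""
    conn = sqlite3.connect(path)
    # The pragma is per connection, so it has to be issued at every open.
    conn.execute("PRAGMA foreign_keys = ON")
    return conn

# Trimmed, hypothetical schema: only the columns needed to show the cascade.
SCHEMA = """
CREATE TABLE scheduled_prompts (
    id INTEGER PRIMARY KEY,
    owner TEXT NOT NULL,
    body TEXT NOT NULL,
    interval_seconds INTEGER,      -- NULL = one-shot
    cancelled_at_unix INTEGER
);
CREATE TABLE scheduled_prompt_targets (
    schedule_id INTEGER NOT NULL
        REFERENCES scheduled_prompts(id) ON DELETE CASCADE,
    target TEXT NOT NULL
);
"""

def targets_left_after_delete(conn):
    """Create one schedule with two targets, delete it, count leftover targets."""
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO scheduled_prompts (id, owner, body) VALUES (1, 'operator', 'ping')"
    )
    conn.executemany(
        "INSERT INTO scheduled_prompt_targets (schedule_id, target) VALUES (1, ?)",
        [("alice",), ("bob",)],
    )
    conn.execute("DELETE FROM scheduled_prompts WHERE id = 1")
    return conn.execute("SELECT COUNT(*) FROM scheduled_prompt_targets").fetchone()[0]

with_fk = targets_left_after_delete(open_broker())
print("target rows left with foreign_keys = ON:", with_fk)
```

On a connection opened without the pragma, a stock SQLite build would typically leave the two target rows behind as orphans, which is why the pragma is set at open rather than once per database.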
Retention:
- Broker::vacuum_delivered runs hourly via a tokio task in hive-c0re::main. Drops acked message rows older than 30 days (acked_at IS NOT NULL). Undelivered + delivered-but-not-acked rows are always kept: the harness calls ack_turn only after a successful turn, so an unacked row can still be requeued via requeue_inflight on a crash.
- Approvals and questions are kept indefinitely; both are audit trails. Approvals failed by actions::destroy and answered questions stay visible to anything that queries by id.
- Reminder rows are kept after sent_at is set (audit trail); no automatic vacuum today.
- Scheduled prompts: one-shot rows are deleted on fire by the worker; recurring rows live until the operator cancels them (cancel_schedule MCP / dashboard ✗), which tombstones via cancelled_at_unix, then reap_cancelled drops the row on the next worker pass.
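The message retention rule can be sketched as a single DELETE. This is an illustration in Python's stdlib sqlite3, not Broker::vacuum_delivered itself; it assumes timestamps are unix seconds and that age is measured from acked_at, neither of which the text above states.

```python
import sqlite3

RETENTION_SECS = 30 * 24 * 60 * 60  # 30 days

def vacuum_delivered(conn, now):
    """Delete acked rows older than the retention window; return rows removed."""
    cutoff = now - RETENTION_SECS
    cur = conn.execute(
        # Assumption: age is measured from acked_at (unix seconds).
        "DELETE FROM messages WHERE acked_at IS NOT NULL AND acked_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount

# Demo on a trimmed, hypothetical messages table.
DAY = 86400
now = 1_700_000_000
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT,"
    " delivered_at INTEGER, acked_at INTEGER)"
)
conn.executemany(
    "INSERT INTO messages (body, delivered_at, acked_at) VALUES (?, ?, ?)",
    [
        ("old and acked", now - 31 * DAY, now - 31 * DAY),   # eligible
        ("recent and acked", now - DAY, now - DAY),          # too young
        ("old, delivered, not acked", now - 40 * DAY, None), # always kept
        ("old, never delivered", None, None),                # always kept
    ],
)
deleted = vacuum_delivered(conn, now)
remaining = [r[0] for r in conn.execute("SELECT body FROM messages ORDER BY id")]
print(deleted, remaining)
```

Only the first row qualifies; the two unacked rows survive regardless of age, which is what keeps them available for a requeue after a crash.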
/harness/hyperhive-events.sqlite (per agent)
Lives inside each container's bind-mounted /harness/ dir (host
path: /var/lib/hyperhive/agents/<name>/harness/hyperhive-events.sqlite).
One table:
- events(id, ts, kind, payload_json): every LiveEvent the harness emits during turn loop execution.
The harness writes; the host vacuums. hive-c0re::events_vacuum
runs hourly and sweeps every existing agent harness dir. Retention
is type-scoped: it deletes only the verbose stream rows (the
raw claude stream-json deltas — one per text chunk / tool use, the
bulk of the file's size) older than 14 days, and keeps every other
kind (turn_start, turn_end, note, status_changed,
model_changed, token_usage_changed, turn_state_changed)
indefinitely — those are small and carry the semantic per-turn
history the operator scrolls back through when debugging a
regression. Age-only within the stream kind — no row cap — so a
chatty turn doesn't lose its stream history sooner than a quiet one.
Centralising retention on the host means a misbehaving harness can't
disable its own vacuum and agents don't need any cleanup wiring of
their own.
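A host-side sweep of this shape might look like the sketch below. It is written in Python for illustration only; it assumes the verbose kind is stored as the literal string "stream", that ts is unix seconds, and that agent dirs follow the <name>/harness/ layout given earlier. vacuum_one and sweep are hypothetical names.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

STREAM_KIND = "stream"              # assumption: literal value of the verbose kind
RETENTION_SECS = 14 * 24 * 60 * 60  # 14 days
DB_NAME = "hyperhive-events.sqlite"

def vacuum_one(db_path, now):
    """Delete only old stream rows from one agent's event log."""
    with closing(sqlite3.connect(db_path)) as conn:
        cur = conn.execute(
            "DELETE FROM events WHERE kind = ? AND ts < ?",
            (STREAM_KIND, now - RETENTION_SECS),
        )
        conn.commit()
        return cur.rowcount

def sweep(agents_root, now):
    """Vacuum every existing <agent>/harness/ event log under agents_root."""
    deleted = {}
    for db_path in sorted(Path(agents_root).glob(f"*/harness/{DB_NAME}")):
        deleted[db_path.parent.parent.name] = vacuum_one(db_path, now)
    return deleted

# Demo: one agent with an old stream row, an old turn_end row, a fresh stream row.
now = 1_700_000_000
old = now - 15 * 86400
with tempfile.TemporaryDirectory() as root:
    harness = Path(root) / "alice" / "harness"
    harness.mkdir(parents=True)
    with closing(sqlite3.connect(harness / DB_NAME)) as conn:
        conn.execute(
            "CREATE TABLE events (id INTEGER PRIMARY KEY, ts INTEGER,"
            " kind TEXT, payload_json TEXT)"
        )
        conn.executemany(
            "INSERT INTO events (ts, kind, payload_json) VALUES (?, ?, '{}')",
            [(old, "stream"), (old, "turn_end"), (now, "stream")],
        )
        conn.commit()
    swept = sweep(root, now)
    with closing(sqlite3.connect(harness / DB_NAME)) as conn:
        kept = [r[0] for r in conn.execute("SELECT kind FROM events ORDER BY id")]
print(swept, kept)
```

The old turn_end row survives because the filter is on kind first and age second; there is no row cap anywhere in the query.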
Path overridable via HYPERHIVE_EVENTS_DB (for dev / no-/harness
setups). On open failure the Bus falls back to no-store mode
rather than crashing the harness — events still broadcast over SSE,
just nothing persisted.
/harness/hyperhive-turn-stats.sqlite (per agent)
Per-turn analytics sink. One row per claude turn captures
identity (model, wake_from, result_kind), timing
(started_at, ended_at, duration_ms), cost (input / output /
cache_read / cache_creation token counts), behaviour
(tool_call_count + tool_call_breakdown_json), and post-turn
snapshot metrics (open_threads_count,
open_reminders_count — fetched via the same socket the harness
already uses for GetOpenThreads + CountPendingReminders).
Bin-loop helpers build_row + record land each row at
turn_end; writes are best-effort, a sqlite hiccup logs + lets
the turn loop continue.
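The best-effort write pattern can be sketched as follows. This is not the harness's record helper; the table name turns and the column subset are hypothetical, chosen from the fields listed above, and the sketch uses Python's stdlib sqlite3.

```python
import logging
import sqlite3

log = logging.getLogger("turn_stats")

INSERT_SQL = (
    "INSERT INTO turns (model, wake_from, result_kind, started_at, ended_at,"
    " duration_ms, tool_call_count)"
    " VALUES (:model, :wake_from, :result_kind, :started_at, :ended_at,"
    " :duration_ms, :tool_call_count)"
)

def record(conn, row):
    """Insert one turn row; on any sqlite error, log it and return False."""
    try:
        conn.execute(INSERT_SQL, row)
        conn.commit()
        return True
    except sqlite3.Error as err:
        # Swallow the failure: analytics must never take the turn loop down.
        log.warning("turn-stats write failed: %s", err)
        return False

row = {
    "model": "haiku", "wake_from": "inbox", "result_kind": "ok",
    "started_at": 1_700_000_000, "ended_at": 1_700_000_042,
    "duration_ms": 42_000, "tool_call_count": 3,
}

conn = sqlite3.connect(":memory:")
before_schema = record(conn, row)   # table missing: logged, not raised
conn.execute(
    "CREATE TABLE turns (model TEXT, wake_from TEXT, result_kind TEXT,"
    " started_at INTEGER, ended_at INTEGER, duration_ms INTEGER,"
    " tool_call_count INTEGER)"
)
after_schema = record(conn, row)
print(before_schema, after_schema)
```

The caller gets a boolean it is free to ignore, so a sqlite failure costs one missing analytics row and nothing else.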
A sibling bash_commands(ts INTEGER, head TEXT) table in the same
file is written by the hive-bash-mcp daemon (not the harness): one
row per executed bash task recording the normalised command head -
the basename of the first real command, looking past cd repo &&
prefixes, env-assignments, and prefix-runners like sudo/env. It
backs the "favorite tools" view on the /stats page (aggregated
host-side). Best-effort and created on first write
(CREATE TABLE IF NOT EXISTS), so it's simply absent until a bash
task runs.
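The head normalisation described above can be approximated in a few lines. This is a hedged sketch, not the hive-bash-mcp implementation: command_head is a hypothetical name, it only recognises && when it is whitespace-separated, and it does not handle flags passed to the prefix-runner itself (for example sudo -u).

```python
import os
import re
import shlex

PREFIX_RUNNERS = {"sudo", "env"}  # the two runners named in the text
ENV_ASSIGN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=")

def command_head(command):
    """Return the basename of the first real command, or None if unparsable."""
    try:
        tokens = shlex.split(command)
    except ValueError:          # e.g. an unbalanced quote
        return None
    # Split the token stream into segments on a standalone "&&".
    segments, current = [], []
    for tok in tokens:
        if tok == "&&":
            segments.append(current)
            current = []
        else:
            current.append(tok)
    segments.append(current)

    fallback = None
    for words in segments:
        # Drop leading VAR=value assignments and prefix-runners.
        while words and (ENV_ASSIGN.match(words[0]) or words[0] in PREFIX_RUNNERS):
            words = words[1:]
        if not words:
            continue
        if words[0] == "cd":    # look past `cd repo &&` prefixes
            fallback = "cd"
            continue
        return os.path.basename(words[0])
    return fallback

for cmd in ("cd repo && cargo test",
            "FOO=1 sudo /usr/bin/nix build",
            "env RUST_LOG=debug ./target/debug/hive-c0re"):
    print(cmd, "->", command_head(cmd))
```

Falling back to cd when nothing else is found is a choice made for this sketch; the text does not say how a bare cd is recorded.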
turn-stats.sqlite has no vacuum — it's one tiny row per turn
(~hundreds of KB even over months), read directly by the /stats page
and the hive-wide stats view, so pruning it would only lose trend
history for no space gain.
/state/hyperhive-harness.json (per agent)
Consolidated harness state file written atomically (.tmp + rename) by
Bus::emit_status whenever rate-limited or login-failed flags change.
Shape:
{ "rate_limited": false, "needs_login": false }
- rate_limited: set when the harness detects a 429 from the Claude API; cleared by any subsequent status emit. Drives ContainerView.rate_limited on the dashboard.
- needs_login: set when a turn hits 401 (expired OAuth credentials); cleared by "online" status (re-auth completed). Drives the needs_login flag alongside the claude_has_session check.
hive-c0re reads this file on each build_all sweep (~10s) via
container_view::read_harness_flags. Falls back to the legacy individual
sentinel files (hyperhive-rate-limited, hyperhive-needs-login) if the
JSON is absent, so existing containers keep working through the transition
window before their next rebuild.
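The write and read sides can be sketched together. This is an illustration in Python, not Bus::emit_status or container_view::read_harness_flags; it assumes the legacy sentinel files sit in the same directory as the JSON and that a sentinel's mere existence means the flag is set, both of which are guesses.

```python
import json
import os
import tempfile
from pathlib import Path

STATE_FILE = "hyperhive-harness.json"
# Flag name -> legacy sentinel file name (location is an assumption).
LEGACY = {
    "rate_limited": "hyperhive-rate-limited",
    "needs_login": "hyperhive-needs-login",
}

def write_flags(state_dir, rate_limited, needs_login):
    """Write the JSON atomically: fill a .tmp sibling, then rename over the target."""
    state_dir = Path(state_dir)
    tmp = state_dir / (STATE_FILE + ".tmp")
    tmp.write_text(json.dumps({"rate_limited": rate_limited, "needs_login": needs_login}))
    os.replace(tmp, state_dir / STATE_FILE)  # rename is atomic on one filesystem

def read_flags(state_dir):
    """Prefer the JSON file; fall back to sentinel files only when it is absent."""
    state_dir = Path(state_dir)
    try:
        data = json.loads((state_dir / STATE_FILE).read_text())
    except FileNotFoundError:
        return {flag: (state_dir / name).exists() for flag, name in LEGACY.items()}
    return {flag: bool(data.get(flag, False)) for flag in LEGACY}

with tempfile.TemporaryDirectory() as d:
    (Path(d) / "hyperhive-rate-limited").touch()     # legacy container
    legacy_view = read_flags(d)
    write_flags(d, rate_limited=False, needs_login=True)
    json_view = read_flags(d)                        # JSON now wins
    tmp_left = (Path(d) / (STATE_FILE + ".tmp")).exists()
print(legacy_view, json_view, tmp_left)
```

Because readers only ever see the old file or the fully written new one, a sweep that lands mid-write never parses a half-written document.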
/var/lib/hyperhive/db/build_logs.sqlite (host)
Full stdout + stderr capture for every nixos-container / nix build invocation the lifecycle layer fires. One row per invocation;
the row accumulates lines as the child runs.
Replaces the legacy 32-line stderr ring buffer that lifecycle::run
kept. The ring tail routinely truncated real eval errors ("tried
alternatives" blocks alone are often 30+ lines), so failures bailed
with an arbitrary tail whose full stream only lived in the host
journal. With this table the dashboard can surface the entire log.
Two indices:
- (agent, started_at): backs the per-agent latest-N lookup used by the agent card chip.
- (status, finished_at): backs the retention sweep that runs as part of the existing hourly vacuum.
Writes are best-effort: append_stdout / append_stderr / finish
log a warning on sqlite error and let the build continue. A failed
log row never blocks a rebuild.
/harness/hyperhive-model (per agent)
Single-line text file holding the claude model name currently
selected for this agent (default haiku when absent). Written by
Bus::set_model whenever the operator flips it via /model <name> in the web terminal. Read once at harness boot in
Bus::new. Path overridable via HYPERHIVE_MODEL_FILE.
Survives destroy/recreate, gone on --purge.
State dirs (per agent)
Under /var/lib/hyperhive/agents/<name>/:
- config/: the proposed nix repo (manager-editable). Bind-mounted read-only to /agents/<name>/config inside the sub-agent's own container so the agent can inspect what defines it and request precise changes from the manager; RW into the manager via the /agents tree bind.
- claude/: claude OAuth credentials, bind-mounted RW to /home/<name>/.claude inside the container.
- state/: durable notes and hyperhive-harness.json. Bind-mounted to /agents/<name>/state inside the container (uniform for sub-agents + manager). The $HYPERHIVE_STATE_DIR env var exposes the same path to in-container scripts. Notable files written here by the harness:
  - hyperhive-status: single-line free-text status string written by set_status; cleared on explicit set_status(""). Read by hive-c0re and the per-agent /api/dashboard-state endpoint to surface the status chip on the dashboard. Absent when no status is set.
  - hyperhive-harness.json: rate-limited / needs-login flags read by the dashboard's async container-state fetch. See docs/web-ui/dashboard.md::Container row.
- harness/: harness-internal ephemeral state; not intended for agent consumption. Bind-mounted to /agents/<name>/harness inside the container ($HYPERHIVE_HARNESS_DIR). Contents:
  - bash-tasks/: task JSON + stdout/stderr files for background mcp__bash__run jobs. Each task is a trio of files: <id>.json (status + tails) plus <id>.out / <id>.err (full captured output). hive-c0re::bash_tasks_vacuum runs hourly and deletes terminal task trios older than 48 hours; non-terminal (still-running) tasks are never deleted by vacuum.
  - mcp-loose-ends/: JSON files published by external MCP daemons (e.g. hive-bash-mcp, hive-matrix-mcp) listing their active loose-end summary strings. Each file is <daemon>.json containing a JSON array of plain-text lines. Read by get_loose_ends to surface active background work without hardcoding per-MCP knowledge in the harness. Files are created/removed by the external daemons themselves.
  - hyperhive-events.sqlite: turn-loop event log.
  - hyperhive-turn-stats.sqlite: per-turn timing stats.
  - hyperhive-model: single-line model name override file.
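The bash-tasks retention rule lends itself to a short sketch. This is Python for illustration, not hive-c0re::bash_tasks_vacuum; the terminal status names and the use of the JSON file's mtime as the task's age are assumptions, since the text specifies neither.

```python
import json
import os
import tempfile
from pathlib import Path

MAX_AGE_SECS = 48 * 3600
TERMINAL = {"done", "failed", "killed"}  # hypothetical status names

def vacuum_bash_tasks(tasks_dir, now):
    """Remove <id>.json/.out/.err trios for terminal tasks older than 48 hours."""
    removed = []
    for meta in sorted(Path(tasks_dir).glob("*.json")):
        try:
            data = json.loads(meta.read_text())
        except (OSError, ValueError):
            continue                      # unreadable metadata: leave it alone
        status = data.get("status") if isinstance(data, dict) else None
        if status not in TERMINAL:
            continue                      # still running: never deleted
        if now - meta.stat().st_mtime <= MAX_AGE_SECS:
            continue                      # assumption: age = mtime of the JSON
        for suffix in (".json", ".out", ".err"):
            meta.with_suffix(suffix).unlink(missing_ok=True)
        removed.append(meta.stem)
    return removed

def make_task(tasks_dir, task_id, status, mtime):
    """Demo helper: create one task trio with a controlled mtime."""
    for suffix, text in ((".json", json.dumps({"status": status})),
                         (".out", "stdout"), (".err", "stderr")):
        path = Path(tasks_dir) / (task_id + suffix)
        path.write_text(text)
        os.utime(path, (mtime, mtime))

now = 1_700_000_000
with tempfile.TemporaryDirectory() as d:
    make_task(d, "old-done", "done", now - 72 * 3600)
    make_task(d, "old-running", "running", now - 72 * 3600)
    make_task(d, "new-done", "done", now - 3600)
    removed = vacuum_bash_tasks(d, now)
    left = sorted(p.name for p in Path(d).iterdir())
print(removed, left)
```

Only the old, finished task is removed; the old but still-running one keeps all three files no matter how stale it looks.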
Parent access to child state
A parent agent gets each direct child's state, harness, and
config dirs bind-mounted read-write (bind_child_agent_dirs in
lifecycle.rs). The RW on state is deliberate, not an oversight: a
parent manages its children, which includes writing into a child's
state for recovery (e.g. seeding notes, clearing a stuck sentinel) as
well as reading it. config is RW because the parent authors proposed
config changes for the child (the approval flow commits into the
child's config repo), and harness is RW for the same management
reasons. Per-child isolation still holds: a container only ever has
its own dirs plus its direct children's bind-mounted, never a
sibling's.
Under /var/lib/hyperhive/applied/<name>/ — the hive-c0re-only
applied repo. Tracks flake.nix (module-only boilerplate; never
edited after first spawn) + agent.nix (the actual config; the
manager's edits land here via the approval flow) + any other
files the manager committed. .git/ carries the proposal /
approved / building / deployed / failed / denied tag history.
Under /var/lib/hyperhive/meta/ — the swarm-wide deploy flake plus
system-level config files. Single git repo for the whole host; every
hive-c0re mutation that should survive a restart is committed here.
Contents:
- flake.nix: declares one nixpkgs input per agent + one nixosConfigurations.<n> output per agent. flake.lock is the canonical "what's deployed where." The git log is the deploy audit trail (one commit per successful deploy or hyperhive bump).
- topology.json: parent/child agent graph ({ "alice": "manager", "bob": "alice", "manager": null }). Written by topology::set_parent; read by the dashboard, the renderer, and <parent> / <children> recipient resolution.
- tool-groups.json: per-agent MCP tool group grants ({ "alice": ["messaging", "inbox", "execution"] }). Written by tool_groups::set_groups; injected as HIVE_TOOL_GROUPS env var into each agent's container.
- capabilities.json: per-agent capability grants ({ "atlas": ["read_host_journal"] }). Written by capabilities::set_caps; injected as HIVE_CAPABILITIES env var. Absent agents have no extra capabilities.
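As an illustration of how the topology.json shape supports parent and children lookups, here is a minimal Python sketch. It reads the example map as child name to parent name (with null for the root), which is an interpretation of the sample above; parent_of and children_of are hypothetical helpers, not the resolver hive-c0re uses.

```python
import json

def parent_of(topology, agent):
    """Return the agent's parent, or None for the root or an unknown agent."""
    return topology.get(agent)

def children_of(topology, parent):
    """Return the direct children of parent, sorted for stable output."""
    return sorted(name for name, p in topology.items() if p == parent)

# The example document from the list above.
topology = json.loads('{ "alice": "manager", "bob": "alice", "manager": null }')

print(parent_of(topology, "bob"))        # alice
print(children_of(topology, "manager"))  # ['alice']
print(children_of(topology, "bob"))      # []
```

Only direct children are returned: bob is a grandchild of manager and does not appear in manager's list.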
Manager has the meta dir RO-mounted at /meta/.
Marker file /var/lib/hyperhive/.meta-migration-done is
written by the startup migration after every container has
been repointed at meta#<n>. Removing it forces a re-run on
next hive-c0re start (idempotent — only the actual repoint
step would re-fire).
Destroy vs purge
- DESTR0Y (default): stops + removes the nspawn container, drops the systemd drop-in, fails any pending approvals. State dirs stay put; the agent appears in the dashboard's K3PT ST4T3 section as a tombstone with ⊕ R3V1V3 and PURG3 actions. R3V1V3 queues a Spawn approval that reuses the kept state on approve (no re-login).
- PURG3 (opt-in via the dashboard button or hive-c0re destroy --purge <name>): DESTR0Y plus wipes /var/lib/hyperhive/{agents,applied}/<name>/. Config history, claude creds, /state/ notes, and the harness dir are all gone. No undo.
The manager is non-destroyable from both paths (declarative container; would fight with the host's NixOS config).
Run-time dirs
/run/hyperhive/ is tmpfs-backed (systemd RuntimeDirectory=) but
preserved across hive-c0re restarts via RuntimeDirectoryPreserve=yes.
Without that, every restart wipes bind sources and existing
containers can't be started.
- /run/hyperhive/host.sock: admin socket (host-side CLI).
- /run/hyperhive/manager/mcp.sock: manager-privileged socket.
- /run/hyperhive/agents/<name>/mcp.sock: per-sub-agent socket (bind-mounted into the container as /run/hive/mcp.sock).
On startup, Coordinator::register_agent drops any prior socket
task before rebinding — idempotent so a hive-c0re restart followed
by rebuild alice recreates the agent's socket without a clean
reinstall.
First-boot agent-user migration
The harness runs as a per-agent unix user inside the container
(hyperhive.user.name, defaults to the agent's logical label so each
container has a uniquely-named user). Operators with legacy root-owned
state dirs need a one-time data shuffle so they don't lose their claude
session.
system.activationScripts.hive-agent-user-migrate (in
nix/templates/harness-base.nix) runs on every activation,
marker-guarded so the substantive moves only happen once per
container lifetime:
- Ensure ${homeDir} exists with the right ownership: covers the very first boot before useradd's createHome has had a chance to chown. Also re-applies on every rebuild in case the meta-flake's per-agent name evolves (rare).
- Migrate any leftover /root/.claude content into ${homeDir}/.claude: legacy claude wrote to root's empty home; the bind mount didn't exist yet. The marker (/var/lib/hive-agent-user-migrated) guards single-shot. The copy uses cp -an (no-clobber), so any pre-existing files at the new location win; data already there is never overwritten.
- Chown the bind-mounted state dir (/agents/*/state) recursively so the agent user can read/write it. The wildcard matches the single agent that container sees; -h changes ownership of a symlink itself instead of following it, so links the agent might have planted cannot redirect the chown.
- Chown the ~/.claude/ bind-mount recursively. Legacy claude wrote .credentials.json as 0600 root:root; the current harness reads ~/.claude/ as the agent user to decide Online vs NeedsLogin in login::has_session. Without the chown the existing credentials get silently treated as "no session" and the operator is re-prompted to log in on every boot.
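The marker-guarded, no-clobber copy in the second step can be modelled as below. The real step is a shell activation script using cp -an; this Python version is only an analogue of its semantics, and copy_no_clobber and migrate_once are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path

def copy_no_clobber(src, dst):
    """Recursively copy src into dst, never overwriting an existing file."""
    src, dst = Path(src), Path(dst)
    copied = []
    for path in sorted(src.rglob("*")):
        target = dst / path.relative_to(src)
        if path.is_dir():
            target.mkdir(parents=True, exist_ok=True)
        elif not target.exists():           # files already at dst win
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(str(path.relative_to(src)))
    return copied

def migrate_once(src, dst, marker):
    """Run the copy at most once; return None when the marker already exists."""
    marker = Path(marker)
    if marker.exists():
        return None
    copied = copy_no_clobber(src, dst) if Path(src).is_dir() else []
    marker.touch()                          # later activations become no-ops
    return copied

with tempfile.TemporaryDirectory() as d:
    root_claude = Path(d) / "root" / ".claude"
    home_claude = Path(d) / "home" / ".claude"
    root_claude.mkdir(parents=True)
    home_claude.mkdir(parents=True)
    (root_claude / ".credentials.json").write_text("legacy")
    (root_claude / "settings.json").write_text("{}")
    (home_claude / ".credentials.json").write_text("current")
    marker = Path(d) / "migrated"

    first = migrate_once(root_claude, home_claude, marker)
    second = migrate_once(root_claude, home_claude, marker)
    creds = (home_claude / ".credentials.json").read_text()
print(first, second, creds)
```

The credentials already present in the new home are left untouched, and the second call returns early because the marker exists.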
The activation script will eventually become unnecessary once no operators have legacy root-owned state dirs left to migrate; drop the body + marker check at that point.
Matrix per-agent daemon + token-arrival trigger
hive-matrix-daemon is a long-running matrix-sdk Client + sync
process per agent. Holds the unix socket the stdio
hive-matrix-mcp bridge talks to, emits hyperhive wake signals
on incoming room events via /run/hive/mcp.sock. Conditional on
hyperhive.matrix.enable (which both the daemon AND the
auto-injected extraMcpServers.matrix entry read).
Socket path lives inside the systemd-managed runtime dir
(RuntimeDirectory = "hive-matrix" → /run/hive-matrix/, owned by
the agent user) so the daemon can bind without needing root over
/run/ itself. Both daemon + bridge agree on the path via the
HIVE_MATRIX_SOCKET env var.
First-boot ordering: hive-c0re provisions the matrix token AFTER
agent containers come up. Without the path-trigger sibling
(systemd.paths.hive-matrix-daemon, PathExistsGlob = /agents/*/state/matrix-token), the daemon would exit 0 quietly the
first time it ran and the MCP would have no backend until the next
restart. The .path unit makes the appearance of the token re-fire
the service so the daemon comes alive in the same boot cycle as
provisioning. matrix-avatar-sync.path uses the same pattern for
the icon-upload oneshot.
matrix-avatar-sync (two-step media + profile dance)
Mirrors the forge-avatar oneshot's shape (docs/conventions.md:: Best-effort oneshot services) but differs in protocol: matrix
avatars are a two-step POST /media/r0/upload → PUT /profile/<user_id>/avatar_url dance, both authenticated by the
access_token written by hive-c0re::matrix::ensure_user_for to
<state>/matrix-token.
Triggered by EITHER boot (wantedBy = multi-user.target) OR the
sibling matrix-avatar-sync.path firing on token appearance. Both
paths re-run the oneshot idempotently — running the avatar set
twice is harmless.
Critically: RemainAfterExit = false (not the more common
true for oneshots). systemd keeps a RemainAfterExit = true
oneshot in the active (exited) state after its first run, so the second
trigger from the .path watcher becomes a no-op. Setting it to
false lets re-fires actually re-execute. The trade-off is that the
service unit shows inactive (dead) between fires, which is visible in
systemctl status but harmless; the .path unit drives the lifecycle.