Conventions

Code-style and process expectations across the workspace. Most of these exist because something already went wrong without them.

Naming

Containers are length-bounded by nixos-container (≤ 11 chars).
Sub-agents are h-<name> with <name> ≤ 9 chars.
One agent is the bootstrap/root container, with a fixed name (ruth today).
MAX_AGENT_NAME in lifecycle.rs enforces the cap.
Per-agent web UI port = WEB_PORT_BASE + FNV1a(name) % WEB_PORT_RANGE (8100..8999) for every agent; dashboard cfg.dashboardPort (default 7000).

Hive identity (label + domain + display names)

Four env vars cover the identity surface, read by hive_ag3nt::identity:

HIVE_LABEL — short, hive-local agent label (iris, damocles). label() returns it; falls back to empty string if the env var is missing so downstream callers can decide how to surface "unknown agent" rather than getting a panic from this module.
HYPERHIVE_HIVE_DOMAIN — the hive's canonical DNS domain (e.g. darkest.space), set by hive-c0re.nix from services.hyperhive.domain. When configured, qualified_label() returns ${label}@${domain} (e.g. iris@darkest.space); when unset (single-hive deployments, dev/test) it degrades to just the short label so existing callers see no change. The qualified form surfaces in the per-agent web UI title, the system-prompt template, and /api/state.qualified_label.
HYPERHIVE_HIVE_NAME — human display name of this hive (pr1ma). Read by hive_name(); None when unset.
HYPERHIVE_SWARM_NAME — human display name of the wider swarm this hive belongs to (constellat1on). Read by swarm_name(); federated hives at different DNS domains can share a swarm name.

hive_name + swarm_name are distinct from HYPERHIVE_HIVE_DOMAIN: the domain may carry the hive name as its leftmost label by convention, but the convention isn't machine-readable, and federated hives at different DNS domains can share a swarm name. Humans want both: the address (@darkest.space) AND the prose name (pr1ma). Matrix MXIDs still use the domain-based convention untouched.

qualify(label) is the same shape as qualified_label() but applies to an arbitrary label the caller already has (e.g. a peer name from the broker); it's the right surface when rendering a peer's name when the caller knows it's hive-local.

Identity = socket

There are no auth tokens on the per-agent unix sockets. The socket path identifies the principal; perms come from "who has the bind-mount." A sub-agent only sees its own /run/hive/mcp.sock; hive-c0re owns the host admin socket.

Wake injection

AgentRequest::Wake { from, body } (and the manager-flavour mirror) is the wake-event-injection surface. Recipient is implicit — the agent the socket belongs to — and from is caller-chosen so the wake prompt can label the source verbatim ("matrix: new message in #general", "forge: PR #42 opened", etc.). Typical caller: an in-container background task (the matrix daemon, a scraper, the forge-notify webhook subscriber) that needs to signal "external work has arrived" without going through the broker as a peer agent.

Identity = socket means anything that can connect to /run/hive/mcp.sock is implicitly trusted to inject wakes. That's fine: the bind-mount only exposes the socket inside the agent's own container, so the trust boundary is the container's process namespace, not the wire surface.

Recipient sentinels

A few recipient names are reserved by the broker and have special meaning that ordinary agent labels can never collide with — agent name validation rejects any character outside [a-z0-9_-], so the angle-bracket and asterisk shapes below are structurally safe.

* — broadcast: deliver to every running agent except the sender (agent_server::handle_send fans out via Coordinator::broadcast_send).
operator — the human at the dashboard. Messages accumulate in the inbox view; no agent ever recv's them.
<parent> — the sender's parent per topology.json. Rewritten at send time by topology::resolve_recipient: looks up parent_of(sender) and falls back to operator when the sender is a root agent (or absent from topology entirely). Lets agents address their parent without learning the label, so runtime reparenting propagates with zero agent-side restart.
<children> — fan-out to every direct descendant of the sender per topology.json. Resolved in agent_server::handle_send via topology::children_of(sender): one message is delivered to each child, bypassing the allow-list check (structural fan-out targets are never user-listed peers). No-op for leaf agents (returns Ok when the child set is empty). Lets a sub-manager nudge its subtree without enumerating labels.

When a <children> or <parent> send resolves to real recipients, the broker stores the resolved label(s) as the message recipient(s) — the dashboard and recv side see the real routes. The sentinels are purely send-time addressing conveniences.

Wire protocol

JSON line-delimited over unix sockets in both directions (host admin / manager / agent). SSE streams (/dashboard/stream on hive-c0re, /events/stream on the per-agent web UIs) are text/event-stream; each frame carries a seq field for the snapshot-dedupe dance (see docs/web-ui.md). Request/response types live in hive-sh4re — change them in one place. The dashboard event vocabulary lives in hive-c0re::dashboard_events::DashboardEvent.

Broker delivery + ack cycle

AgentRequest::Recv is the only path that delivers messages to an agent. Always returns a list (Messages { messages }) — empty when nothing's pending, single-pop when max = None (default 1, the single-message behaviour), batched up to max when caller asks for more (server-side cap is 5; values above clamp silently). wait_seconds long-polls for the first message; once one arrives — or one is already pending — the call drains up to max in total before returning, so a single Recv call coalesces a burst.

Per-row bookkeeping inside the broker:

delivered_at = NOW set on every popped row.
Each recipient has an in-memory unacked_ids list of every row delivered since the last AckTurn.
redelivered = true on a row if RequeueInflight resurfaced it (the harness prepends a "may already be handled" hint when this flag is set so the per-message warning is visible).

AgentRequest::AckTurn closes out the in-memory list — the harness fires it after TurnOutcome::Ok, marking every message popped since the last ack as fully handled. Claude doesn't see this surface; it's strictly a harness↔broker pairing. On TurnOutcome::Failed the harness intentionally skips the ack so the unacked rows stay in-flight in the DB and get picked up by the next requeue sweep.

AgentRequest::RequeueInflight is the recovery pair: fired by the harness exactly once at boot, before the serve loop starts. Catches the crashed-mid-turn / OOM-killed / container-restarted cases where a previous harness session popped messages but never drove them to a clean turn-end. Resets delivered_at back to NULL on every unacked row (so the next Recv pops them again), and remembers each id in a per-recipient in-memory set so the next Recv can tag the row with redelivered: true. Idempotent + cheap when there's nothing in flight, so the at-boot fire is unconditional.

AgentRequest::AckUntil { up_to } is the agent-facing bulk-triage escape hatch (mcp__hyperhive__ack_until). Unlike AckTurn it IS visible to claude: each recv row and wake prompt carries a [msg #<id>] marker (the broker row id; transient pings show no marker — their sentinel id 0 has nothing to ack), and ack_until(up_to: n) marks every one of the agent's rows with id <= n handled in a single UPDATE — pending and delivered alike. This bounds the redelivered-flood cost after a restart: instead of popping dozens of already-handled messages one turn at a time, the agent notes the highest id it has seen and acks up to it. Recipient-scoped (an agent can only ack its own rows); also drains the in-memory unacked_ids / requeued_ids bookkeeping below the cutoff so a later AckTurn doesn't double-update and a stale redelivery tag can't outlive its row. The operator-side sibling is the dashboard's "mark all read" (unbounded, per-agent).

Question routing (Ask / Answer)

AgentRequest::Ask (and the manager-flavour mirror) surfaces a structured question that either lands in the operator's dashboard queue or in a peer agent's inbox. The recipient is the to field:

to = None or to = Some("operator") — routes to the operator-question queue. The dashboard renders the question with any options as a chip strip plus a free-text fallback (Other…) so the operator is never trapped by an incomplete list. The legacy AskOperator variant collapses into this case.
to = Some(<agent>) — peer Q&A. The target agent receives a HelperEvent::QuestionAsked { id, asker, question, options, multi } in their inbox. They reply via AgentRequest::Answer (or ManagerRequest::Answer if they're the manager); the answer threads back to the asker as a HelperEvent::QuestionAnswered event.

Shape fields are uniform across both targets:

options is advisory — the dashboard chips are decoration over a free-text fallback; peer-agent recipients see the list in their QuestionAsked event and can return any string.
multi = true lets the answerer pick multiple options (checkboxes in the dashboard, a hint in the peer-agent event). The answer comes back as a single string with selections joined by ", ".
ttl_seconds auto-cancels with answer [expired] (and answerer: "ttl-watchdog") when the wait becomes moot. None = wait indefinitely or until manual cancel.

Response shape is always QuestionQueued { id } — the asker stores the id and correlates the asynchronous answer event when it lands. Authorisation on Answer: only the question's target agent (or the operator via the dashboard) is permitted to reply; an answer attempt from anyone else fails the wire-side check.

Loose-ends wire shape

LooseEnd is the per-row response shape for GetLooseEnds (both the agent-flavour and manager-flavour requests). Tagged enum so new thread kinds (forge PRs, long-running approvals from a privileged bot, etc.) can land later without breaking existing handlers. Each row carries enough context that the caller renders it directly as a bulleted list, no follow-up fetch needed.

Per-flavour scoping is uniform across the three variants:

agent-flavour GetLooseEnds only surfaces rows the calling agent has standing in. Approval rows only appear when the calling agent is the manager (sub-agents don't submit approvals). Question rows surface where the agent is asker OR target (the routing semantics from the Ask/Answer subsection above). Reminder rows are scoped to owner == self.
manager-flavour GetLooseEnds lists every pending row in the swarm — full audit view.

Per-variant fields:

Approval { id, agent, commit_ref, description?, age_seconds } — agent is the affected agent (target of the spawn / config commit), not the asker. description is the manager's free-text blurb shown on the dashboard card. commit_ref is the kind-specific payload (see docs/approvals.md::Approval kinds (wire shapes)).
Question { id, asker, target?, question, age_seconds } — target = None = operator-routed (dashboard); Some(agent) = peer-to-peer thread.
Reminder { id, owner, message, due_at, age_seconds } — due_at is the absolute time the scheduler is targeting (RFC 3339 on the wire, see Timestamps on the wire below); clients compute time-until-fire against it.
PendingMessages { count } — undelivered inbox messages the agent still owes itself a recv for. Informational + not cancellable (drain with recv); only emitted when count > 0, and surfaced first in the agent-flavour list as the most actionable signal. Counted host-side from the broker (count_pending), so it reflects what's genuinely still queued — the wake-message that drove the current turn is already delivered and not counted.
UnreadMatrix { rooms, summary } — unread matrix notifications. Informational + not cancellable (clear with mark_read). Unlike the others this is injected by the in-container harness, not hive-c0re, because the matrix daemon lives inside the agent.

age_seconds saturates at zero on any clock anomaly (back-step, unsynchronised wall clock, etc.) so the bulleted list never shows nonsense ages.

CancelLooseEnd { kind, id } is the matching write surface. The kind enum (Question / Reminder / Approval) selects which underlying store the dispatcher reaches into. Question and Reminder cancel from either surface subject to ownership checks (asker for the question, scheduler for the reminder). Approval is manager-only — sub-agents don't submit approvals so they have nothing of their own to withdraw; their wire surface returns a clear error if they try. Cancelling an approval transitions the row to ApprovalStatus::Cancelled and fires ApprovalResolved { status: "cancelled" } so the dashboard pulls the card out of the pending pane.

Agent metadata

AgentRequest::GetAgentMeta { name } returns identity + status for an agent. Self-introspection when name = None (replaces the older Whoami request); target query when name = Some.

Response is AgentMeta { name, running, hyperhive_rev, status_text, status_set_at, hive_name, swarm_name }:

hyperhive_rev: None only when the configured flake URL has no canonical path. Otherwise carries the rev the target is currently pinned at.
running: whether the target's container is currently up. When false, the host clears status_text / status_set_at — on-disk values from before the stop are stale snapshots and shouldn't be shown as live status. Defaults to true on the wire (older harnesses never serialised it, and the host only knew how to ask about live containers — keeps backwards-compat with pre-running-field payloads).
status_text / status_set_at: last value written via SetStatus, plus its unix timestamp. Both None when the target has never set a status, when the agent name is unknown, or when running = false (see above).
hive_name / swarm_name: display names read from HYPERHIVE_HIVE_NAME / HYPERHIVE_SWARM_NAME env (sourced from services.hyperhive.hiveName / services.hyperhive.swarmName). Both None when the options aren't configured.

Timestamps on the wire

Timestamp fields that cross a JSON boundary (dashboard API + SSE, the wire structs in hive-sh4re) serialize as RFC 3339 UTC strings (2026-07-02T18:30:00Z) via hive_sh4re::wire_time — Rust keeps the fields as i64 unix seconds internally, only the JSON representation changes, and deserialization leniently accepts both the string form and the legacy bare integer (rolling-deploy skew, persisted blobs). Input-direction fields agents compute as epoch (first_fire_at_unix, schedule-edit next_fire_at_unix, Wakeup::At) stay integers. The *_unix field names are kept for now — renaming is the wire-types refactor's concern. The dashboard frontend parses via util.js::epochSec wherever it needs arithmetic and feeds the string straight to new Date(s) for display.

Tool groups

The MCP tool surface an agent receives is derived from a set of named ToolGroup values (hive_sh4re::ToolGroup), not from a hardcoded binary flavor.

Group	Tools
`messaging`	`send`, `recv`, `ask`, `answer`
`meta`	`get_agent_meta` (`set_status` is always-on, see below)
`inbox`	`get_loose_ends`, `cancel_loose_end`, `remind`, `request_next_turn`
`execution`	vestigial — `mcp__bash__run` / `mcp__bash__status` are always available unconditionally via `extraMcpServers`; this group's entries expand to non-existent `mcp__hyperhive__run` / `mcp__hyperhive__status` and have no effect. See `docs/tools/bash.md`.
`lifecycle`	`kill`, `start`, `restart`, `update` (privileged)
`approvals`	`request_init_config`, `request_update_meta_inputs` (privileged)
`scheduling`	`request_schedule_prompt`, `fire_schedule_now`, `cancel_schedule`, `edit_schedule`, `list_schedules` (privileged)
`diagnostics`	`get_logs` (privileged)

Always-on tools — set_status is exposed to every agent regardless of which groups it holds (ToolGroup::ALWAYS_ON_TOOLS). The operator dashboard depends on every agent being able to report its status chip, and the server-side SetStatus handler has no tool-group check (only length validation), so gating it would only desync the --allowedTools list from what the host actually accepts. Revoking meta therefore drops get_agent_meta but never set_status.

Config storage — per-agent tool groups live in /var/lib/hyperhive/meta/tool-groups.json (hive-c0re-owned, committed to the meta repo alongside topology.json). Format: { "alice": ["messaging", "meta", "inbox", "lifecycle"], "bob": ["messaging", "meta", "inbox"] }. An absent entry means "use role default". Tool permissions are intentionally NOT configurable from agent.nix — that file goes through the manager's approval flow, so letting it declare its own groups would let the manager grant itself any tool by submitting a config commit, bypassing the operator gate.

Setting groups — the operator sets groups via the dashboard or hive-c0re::tool_groups::set_groups(name, groups). After a change meta::sync_agents commits the updated file; the next agent rebuild picks up the new HIVE_TOOL_GROUPS env var. Agents with no entry get no var.

Runtime resolution — at session start the harness reads HIVE_TOOL_GROUPS (a comma-separated list of snake_case group names injected by the meta renderer from tool-groups.json). Unrecognised tokens are logged and skipped. Falls back to ToolGroup::AGENT_DEFAULT (messaging, meta, inbox, execution) when the var is absent or empty.

Updating the surface — when a new #[tool] fn is added to HiveServer in hive-ag3nt/src/mcp.rs, add its name to the matching ToolGroup::tools() slice in hive-sh4re/src/lib.rs. That's the single source of truth; mcp_config::allowed_mcp_tools (in hive-ag3nt/src/mcp_config.rs) reads it at session start.

Capabilities

Capabilities gate system-level access that goes beyond the MCP tool surface — things an agent can access, not just call. Parallel to tool groups but orthogonal: an agent can have a tool group that registers a tool AND a capability that allows the underlying resource access.

Capability	Effect
`manage_root_agent`	may lifecycle-manage the root/manager agent via `kill`/`start`/`restart`
`read_host_journal`	`get_host_journal` MCP tool is registered + `GET /journal-host` requests are served
`query_agent_state`	may call `get_loose_ends` / `CountPendingReminders` targeting non-child agents
`infra_admin`	may call `restart(name)` on hive infrastructure containers (`hive-ci`, `hive-gateway`, `hive-forge`); each restart is logged to the dashboard AUDIT trail

Config storage — per-agent capabilities live in /var/lib/hyperhive/meta/capabilities.json alongside tool-groups.json. Format: { "atlas": ["read_host_journal"], "ruth": ["manage_root_agent"] }. An absent entry means "no extra capabilities". render_flake in meta.rs reads this file and injects HIVE_CAPABILITIES (comma-separated snake_case names) into each agent's systemd service env; absent entries emit no env var so agents without capabilities don't trigger a spurious rebuild.

Setting capabilities — the operator sets capabilities via the C4P4B1L1T13S section in the dashboard's P3RM1SS10NS tab. hive-c0re::capabilities::set_caps(name, caps) is the write path. After a change meta::sync_agents commits the updated file; the next agent rebuild picks up the new HIVE_CAPABILITIES env var.

Runtime resolution — at session start the harness reads HIVE_CAPABILITIES and resolves each token to a Capability variant. Unrecognised tokens are logged and skipped. An absent or empty var means no extra capabilities.

Capability NOT configurable from agent.nix — same reasoning as tool groups: an agent that could grant its own capabilities via a config commit would bypass the operator approval gate.

Adding a new capability — add a variant to Capability in hive-sh4re/src/lib.rs + an arm to as_str. Add it to Capability::ALL (the source of truth for the permissions UI columns). Implement the access check in the relevant handler (agent_server.rs, mcp.rs, or dashboard.rs).

Async forms

Dashboard + per-agent mutating forms carry data-async; a delegated submit listener in assets/tabs.js (+ assets/app.js for the per-agent UI) intercepts, shows a spinner, POSTs application/x-www-form-urlencoded (axum's Form extractor rejects multipart), calls refreshState() on success. New mutating forms should add data-async and optionally data-confirm (for a JS-side confirm() prompt) or data-prompt="…" (for a window.prompt() whose answer goes into a hidden input named by data-prompt-field, default note).

refreshState defers automatically when document.activeElement sits inside a managed section so the operator's typing isn't lost; collapsible <details data-restore-key=…> survive the re-render via snapshotOpenDetails / restoreOpenDetails.

`rebuild` is the reconcile verb

lifecycle::rebuild idempotently rewrites /etc/nixos-containers/<C>.conf (PRIVATE_NETWORK=0, clears HOST_ADDRESS / LOCAL_ADDRESS, sets EXTRA_NSPAWN_FLAGS), regenerates applied/<name>/flake.nix, writes the systemd limits drop-in, then nixos-container update + stop + start.

Anything that changes per-container state on the host should be re-applied here so a manual ↻ R3BU1LD from the dashboard is sufficient to recover.

Actions are factored

approve / deny / destroy (and the lifecycle helper) live in actions.rs / dashboard.rs. The admin socket and the dashboard POST handlers both call into them so the two surfaces never drift.

Commit messages

Short, lowercase, no Co-Authored-By trailer. Imperative mood, no period. Body explains why if non-obvious; otherwise the subject alone is fine. Wrap at ~72 cols.

Commit before test

Stage and commit when work looks ready, then run validation (cargo check, nix flake check, real deploy). Failures get a follow-up commit rather than an amend. The commit history is the work log; rewriting it loses signal.

Building & local checks

Build through the flake devshell, not a bare toolchain — agent containers ship no global rust. nix develop -c <cmd> runs one command inside the project-pinned env (cargo/clippy/rustfmt plus the C compiler + libsqlite3/ring link deps); outside it a bare cargo build fails with failed to find tool "cc" / cannot find -lsqlite3. One command per invocation — agents run each task as a fresh non-interactive process, so there's no persistent shell to reuse.

nix develop -c cargo clippy --all-targets -- -D warnings
nix develop -c cargo test
nix fmt                       # treefmt — authoritative, NOT bare cargo fmt

nix fmt (treefmt) is the formatter CI gates on; bare cargo fmt misses the non-rust files treefmt also covers, so always run nix fmt before pushing.

Clippy discipline — never add #[allow(clippy::…)]. All lints are CI-fatal at -D warnings (pedantic included); every warning that fires must be fixed, not silenced. Common patterns:

too_many_lines — extract a helper function or a sub-struct (the TurnAccum extraction in stats.rs is a worked example).
doc_markdown (brand name without backticks in a doc comment) — add backticks: `DOMPurify` instead of DOMPurify.
must_use / unused_results — actually handle or explicitly discard the return value (let _ = … is fine when intentional).

If a lint seems wrong for a specific call site, file an issue and ask mara — don't add #[allow] speculatively. The gate is intentional.

The devshell checks are not the full nix flake check. Clippy / fmt / cargo test cover most gates, but nix flake check runs extra check derivations they don't:

hivectl-docs regenerates docs/tools/hivectl-cli.md from hivectl's clap tree and fails if the committed copy is stale. So after any change to a hivectl verb or flag, regenerate it:
```
nix develop -c cargo run --bin hivectl -- markdown-docs > docs/tools/hivectl-cli.md
```
clippy / fmt / cargo test all pass without this — only the flake check catches the drift, and ci-log often can't show you why (it 500s on a fast failure), so you're left guessing "builder flake" when it's a stale doc.
there's also a flake cargo-test check and NixOS module evaluation in the set.

When local clippy/fmt/test pass but CI's nix flake check fails, don't assume a transient builder problem — reproduce the real gate locally: nix flake check (shares the build farm, use sparingly) or build just the suspect check, e.g. nix build .#checks.x86_64-linux.hivectl-docs.

Best-effort oneshot services

The harness ships a family of one-shot systemd services that configure agent-side surfaces from values hive-c0re writes into the state dir at provisioning time:

tea-login — writes ~/.config/tea/config.yml from the forge-token written by hive-c0re::forge::ensure_user_for, so tea repos create / tea pulls create work without interactive prompts.
forge-avatar-sync — uploads hyperhive.icon SVG to the agent's Forgejo profile, so the icon shows up on commits / PRs / issue comments.

(The matrix profile avatar is not a oneshot — hive-matrix-daemon sets it over its live authenticated Client; see docs/persistence.md::matrix avatar.)

Shape contract — every one of these:

Always exit 0, even on internal failure. A non-zero exit would mark the unit failed, which in turn aborts nixos-container update and blocks rebuilds. The agent's capability surface is not allowed to gate the container build.
No set -e in the script body. Subshell failures must not propagate. Use ... || true on every external call that can fail (forge unreachable, missing icon, parse error, etc.)
Skip silently when prerequisites are missing: no token file, no icon, no reachable upstream → echo a short skip line + exit 0. The next boot tries again.
Wired to multi-user.target so they run on every boot (lets a rotated token / new icon take effect without systemctl restart gymnastics).
Re-runnable: a second invocation produces the same final state (idempotent uploads, idempotent config rewrites). Used by the .path watchers that re-fire on token appearance (see docs/persistence.md::Matrix per-agent daemon).

The artefact lives under the agent user's home where applicable (~/.config/tea/config.yml) and is chown'd to that user, but the service itself stays root-owned so the bootstrap ordering doesn't need a user-existence check before each fire.

This pattern keeps the rebuild path resilient: any failure inside these services degrades the corresponding surface (no tea config, no avatar) but never blocks the container from coming up. The operator notices through journalctl -u <unit> rather than a broken switch-to-configuration.