Turn loop + MCP

How the harness wakes up, what it asks claude to do, and what tools claude has access to in return.

The loop

Each agent harness (hive-agent — one serve-loop binary for all agents) runs:

Check the pause marker (<harness>/paused). While it exists the loop does nothing but re-stat it every 5 s — no broker poll, no claude process. Because step 1 is never reached, messages stay queued and unacked, so a resume drains the backlog instead of losing it; reminders and todo wakes buffer in their channels. The check runs before the self-continue slot is consumed, so a pending request_next_turn survives the pause. Set it with hivectl agents pause <name> or the dashboard toggle; see persistence.
Long-poll Recv on its socket. The host-side broker (broker.rs::recv_blocking_batch) returns immediately if there's a pending message, otherwise waits up to 30 s for a broker Sent event for this recipient.
Pop one message. Peek the remaining inbox depth with Status.
Emit LiveEvent::TurnStart { from, body, unread } onto the SSE bus.
Spawn claude (one process per turn) and pipe the wake prompt over stdin.
Stream stdout (JSON lines) into the bus as LiveEvent::Stream(value). Pump stderr as Note.
Wait for claude to exit and classify the turn's outcome from the stream + exit — success, compaction, rate-limit, auth-failure, or hard failure. The outcome drives the post-turn action (see Turn outcomes); compaction is handled inside the session (see Compaction). Rate-limit and auth-failure detection is described below.
Emit LiveEvent::TurnEnd { ok, note }. Sleep poll_ms to avoid tight loops on transient failures.

Rate limit — a 429 / rate_limit marker on stderr, or a parsed {"type":"error"} rate-limit event on stdout (conversation-text mentions don't count), sets the rate_limited sentinel, parks for HIVE_RATE_LIMIT_SLEEP_SECS (default 300), then retries. The UI shows a ⊘ rate limited badge while parked.
Auth failure (401) — drive_turn retries once (transient token-refresh races clear on retry); a second AuthFailed writes {state_dir}/hyperhive-needs-login, requeues the message, and parks in wait_for_login — the same path as a cold boot with no session. The operator re-auths via the per-agent web UI; the queued message then drives the next turn.
Login detection — both boot (login::has_session, Online vs NeedsLogin) and wait_for_login's resume check key off the credential files in login::CRED_FILE_NAMES (the set /logout deletes). wait_for_login resumes only when that set changes (a new file or a newer mtime), so stale credentials on disk at the 401 don't trigger an instant false-resume, and leftover session-history files don't read as a live session after a logout + container recreate.

Harness binary shape

Two sibling binaries out of the one hive-ag3nt crate, all role-agnostic. (The earlier split into hive-ag3nt + hive-m1nd was collapsed because the privilege boundary lives server-side at the broker socket (/run/hive/mcp.sock): ManagerRequest calls are refused by the standard agent socket regardless of who sends them.)

hive-agent — long-running harness loop (the inbox poll + claude-pump + ack/requeue cycle described above).
hive-agent-mcp — MCP server for the built-in hyperhive surface. Run with --http <addr> as a persistent streamable-HTTP daemon (the hive-mcp-http systemd unit, on hyperhive.mcp.httpPort, default 8790); claude connects to its URL via --mcp-config. HTTP is the sole transport — no per-turn stdio child (eliminates the re-registration race).

`Surface` trait + zero-sized type tags

AgentRequest / AgentResponse (= ManagerRequest / ManagerResponse — type aliases) are the wire types. There is one role: agent. bin/hive-agent.rs factors the turn loop through a Surface trait with one zero-sized impl (AgentSurface) wrapping:

One async method per wire op: ack_turn, requeue_inflight, inbox_unread, post_turn_counts, send_to_parent, recv_next.

main() calls serve_main::<AgentSurface> for all roles. The turn loop (serve_loop / handle_turn) has no per-role branches.

Boot wiring

serve_main reads HIVE_PORT (default DEFAULT_WEB_PORT) + HIVE_LABEL (default "hive" for standalone runs; the meta flake sets it unconditionally for any container-deployed agent; see docs/conventions.md::Hive identity for the env stack), opens turn-stats sqlite, prepares the on-boot files (see claude-invocation), installs claude plugins, spawns forge_notify::run + web_ui::serve, and either drops into serve_loop directly (Online) or parks on the login flow first (NeedsLogin).

Plugin install failures are not fatal: each entry comes back as a human-readable failure string that gets routed via Surface::send_to_parent to the agent's topology parent (the broker resolves <parent> per topology::parent_of; root agents and the manager fall through to operator).

Turn outcomes

turn::TurnOutcome (Result<bool, TurnError> — Ok(compacted) on success, else a TurnError) drives the post-claude branch:

Outcome	Action
`Ok(_)` (`false` normal / `true` compacted)	`ack_turn`
`Err(PromptTooLong)`	`drive_turn` archived the session (the lib already compacted + retried and it still overflowed); requeue inflight so the message redelivers into a fresh session that fits — no status park
`Err(RateLimited)`	sleep `HIVE_RATE_LIMIT_SLEEP_SECS` (default 300), requeue inflight, status back to `online`
`Err(AuthFailed)`	emit `needs_login_idle` sentinel, requeue inflight, park in `wait_for_login`
`Err(SessionNotFound)`	resume + create self-heal both missed ("shouldn't happen"); requeue inflight so the next turn creates fresh — no status park, message not dropped
`Err(ApiStall)`	idle watchdog killed claude after `HIVE_TURN_IDLE_SECS` (default 600) of output silence; sleep `HIVE_STALL_SLEEP_SECS` (default 60), requeue inflight, status back to `online`
`Err(Failed(err))`	route `[system] \`` claude turn failed:\n`tovia`send_to_parent`

ApiStall catches an Anthropic API stall — a multi-retry connection storm where the stream goes silent for minutes. The idle watchdog lives in hive-claude's driver (Config::idle_timeout, enforced around child.wait()): the timer resets on every stdout line, so a large but still-streaming turn is never cut — only complete output silence for the window trips it. The harness sets the window from HIVE_TURN_IDLE_SECS (0 disables) and maps the driver's Error::IdleTimeout onto TurnError::ApiStall.

After the outcome handler, the stats sink records a row and the hyperhive-continue sentinel (dropped by the request_next_turn MCP tool) is consumed if present. handle_turn reports the result to serve_loop via TurnControl { auth_failed, continue_requested, pending }. When a continue was requested, the turn did not auth-fail, and the inbox is empty (pending == 0), serve_loop drives the next turn in-process with a synthetic { from: "self", body: "continue" } message (synthetic_continue) — it never goes through the broker, so the self-continue doesn't persist to sqlite or show up as a recv'able inbox message. If real messages are already pending the continue is dropped: those messages drive the next turn(s) via recv_next, so an explicit self-wake isn't needed (this is the request_next_turn contract — "no effect if a new inbox message arrives before this turn ends"). The should_self_continue predicate encodes exactly that decision.

Sub-pages

The rest lives in three topic pages under turn-loop/:

claude-invocation.md — how the harness spawns claude --print each turn, the two-pronged compaction (reactive + proactive), and the on-boot files it materialises (--mcp-config, --system-prompt-file).
config.md — the optional per-agent knobs the meta flake wires in (reference docs, icon, passwordless sudo, dashboard links, custom static files, connectivity overrides, claude plugins, cargo message filtering).
mcp.md — the MCP tool surface claude sees: core tools, privileged tool groups, self-wake, authoritative state, the tool envelope, and the built-in tool whitelist.

Per-subsystem impl detail lives in each module's //! doc-comment; these pages describe present-state behaviour + wiring, not line-level mechanics.