Agent Usage & Status Tracking — Capability Document

Audience: Lead software architect (paired with frontend + business requirement docs)
Scope: Queen agent first (default local-runtime entry point), then downstream colonies/workers
Status: Capability inventory + proposal — no implementation commitments
Date: 2026-05-04


1. Why this document exists

We have a business need to track agent usage (what was consumed: tokens, cost, runtime, calls) and agent status (what state agents are in: alive, phase, progress, blocked) starting from the Queen agent, and surface this on the cloud for product and business consumers. This document inventories the capabilities the runtime can expose today vs. what is net-new, so architecture can pick a scope before frontend and product write against it.

The Queen is the right anchor: every local-runtime session today starts with a Queen, and every colony/worker is forked from a Queen call — so a tracking surface rooted at the Queen automatically covers the whole agent tree.

Headline constraint for the architect: the runtime is local-by-default. Every byte described in §3 — events, runtime logs, progress DBs, session state, LLM cost numbers — is written to the user's machine under ~/.hive/ (or the platform-specific Electron userData directory). Nothing is shipped to the cloud today. The business ask therefore implies a new local→cloud transport boundary, with the data-residency, privacy, and identity decisions that come with it. §4.5 makes the gap explicit per-surface; §8 lists the risks; §9 frames the cloud cut-over as the gating decision for any "Slice 2+" work.


2. Vocabulary — what we actually mean

| Term | Definition in this codebase |
| --- | --- |
| Session | One Queen runtime instance. ID: session_{YYYYMMDD_HHMMSS}_{uuid8}. Persisted at ~/.hive/sessions/{session_id}/. |
| Queen | Long-lived conversational agent. One per session. Single event-loop node. Phases: independent → incubating → working → reviewing. See queen/nodes/__init__.py:494-518. |
| Colony | Persistent stateless container forked by create_colony. Has its own SQLite progress DB at ~/.hive/colonies/{colony_name}/data/progress.db. |
| Worker | Ephemeral agent running inside a colony to execute a task. |
| Run / execution | One trigger-to-completion invocation inside a node. Carries run_id, execution_id, trace_id (OTel-aligned). |
| Usage | Quantitative consumption: input/output/cached tokens, USD cost, wall-clock latency, tool-call counts. |
| Status | Qualitative state: phase, alive/stalled, current task, blocked-on, queue depth, last-heartbeat. |

3. What exists today (capabilities, not commitments)

The runtime is already heavily instrumented. Most of what business wants is already emitted — the gap is persistence, aggregation, and a stable API surface.

3.1 Event Bus — the spine

core/framework/host/event_bus.py:61-177 defines an in-process async pub/sub with 40+ event types scoped by stream_id, session_id, colony_id, execution_id, run_id, correlation_id, timestamp.

Relevant for usage/status:

  • Lifecycle: EXECUTION_STARTED/COMPLETED/FAILED/PAUSED/RESUMED/RESURRECTED
  • Queen: QUEEN_PHASE_CHANGED, QUEEN_IDENTITY_SELECTED
  • Colony/Worker: COLONY_CREATED, WORKER_COLONY_LOADED, WORKER_COMPLETED, WORKER_FAILED, SUBAGENT_REPORT
  • LLM: LLM_TURN_COMPLETE, LLM_TEXT_DELTA, LLM_REASONING_DELTA, CONTEXT_USAGE_UPDATED
  • Tools: TOOL_CALL_STARTED, TOOL_CALL_COMPLETED, TOOL_CALL_REPLAY_DETECTED
  • Health: NODE_STALLED, NODE_TOOL_DOOM_LOOP, STREAM_TTFT_EXCEEDED, STREAM_INACTIVE, STREAM_NUDGE_SENT
  • Tasks (right-rail panel): TASK_CREATED, TASK_UPDATED, TASK_DELETED, TASK_LIST_RESET
  • Triggers: TRIGGER_AVAILABLE/ACTIVATED/DEACTIVATED/FIRED/REMOVED/UPDATED

Persistence today: in-memory only, plus optional JSONL export when HIVE_DEBUG_EVENTS=1 (event_bus.py:33-54). There is no SQL events table.
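
A minimal sketch of how a durable-event tap could hang off that bus. The subscription call, event attribute names, and output path are assumptions, not the real event_bus.py interface; the point is the allowlist, which keeps write volume bounded even while per-token deltas are flowing.

```python
# Sketch only: a durable-event tap on the in-process bus. `bus.subscribe(...)`,
# `event.type`, and `event.payload` are assumed names, not the actual interface.
import json
import time
from pathlib import Path

DURABLE_EVENT_TYPES = {
    "EXECUTION_STARTED", "EXECUTION_COMPLETED", "EXECUTION_FAILED",
    "QUEEN_PHASE_CHANGED", "LLM_TURN_COMPLETE",
    "TOOL_CALL_COMPLETED", "WORKER_COMPLETED", "WORKER_FAILED",
}

OUT_PATH = Path.home() / ".hive" / "event_logs" / "durable_events.jsonl"

async def on_event(event) -> None:
    """Append billing/status-grade events to JSONL; skip high-volume deltas."""
    if event.type not in DURABLE_EVENT_TYPES:
        return  # e.g. LLM_TEXT_DELTA fires per token and is never persisted
    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "type": event.type,
        "session_id": event.session_id,
        "colony_id": event.colony_id,
        "occurred_at": time.time(),
        "payload": event.payload,
    }
    with OUT_PATH.open("a") as fh:
        fh.write(json.dumps(record, default=str) + "\n")

# bus.subscribe(on_event)  # wiring depends on the real EventBus interface
```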

3.2 Three-level runtime logs (per session)

core/framework/tracker/runtime_log_schemas.py defines:

| Level | Schema | File | Granularity |
| --- | --- | --- | --- |
| L1 | RunSummaryLog | summary.json | Per graph run — totals + execution_quality + trace_id |
| L2 | NodeDetail | details.jsonl | Per node — exit_status, input/output tokens, latency_ms, retry/accept/escalate/continue counts |
| L3 | NodeStepLog | tool_logs.jsonl | Per LLM step — tool calls, verdicts, error traces, latency_ms |

Storage: ~/.hive/sessions/{session_id}/logs/ (runtime_log_store.py). Schemas already carry OTel fields (trace_id, span_id, parent_span_id) — wire-ready, not yet exported.

3.3 LLM call accounting

core/framework/llm/provider.py:11-32 — LLMResponse carries: model, input_tokens, output_tokens, cached_tokens, cache_creation_tokens, cost_usd, stop_reason. Cost is computed from model_catalog.py when the model is priced; otherwise 0.0.

Gap: cost lives in the response object and is rolled into L2/L3 logs, but is not in the event bus stream and not in any aggregate query surface.
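
A sketch of what a local rollup over those logs could look like, assuming the L2 rows carry fields named input_tokens, output_tokens, and cost_usd; adjust to the real NodeDetail schema before relying on it.

```python
# Minimal sketch: roll a session's L2 node details into a usage total.
# Field names are assumed to match NodeDetail in runtime_log_schemas.py.
import json
from pathlib import Path

def session_usage(session_id: str, hive_home: Path = Path.home() / ".hive") -> dict:
    totals = {"input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0, "nodes": 0}
    details = hive_home / "sessions" / session_id / "logs" / "details.jsonl"
    if not details.exists():
        return totals
    for line in details.read_text().splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        totals["input_tokens"] += row.get("input_tokens", 0) or 0
        totals["output_tokens"] += row.get("output_tokens", 0) or 0
        totals["cost_usd"] += row.get("cost_usd", 0.0) or 0.0
        totals["nodes"] += 1
    return totals
```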

3.4 Colony Progress DB

core/framework/host/progress_db.py:44-110 — per-colony SQLite (WAL mode):

  • tasks (id, seq, priority, goal, status: pending|claimed|started|completed|failed, worker_id, claimed_at, started_at, completed_at, retry_count, last_error)
  • steps, sop_checklist, colony_meta

This is the closest thing we have to a status SQL store today, but it is per-colony and task-shaped — not session-shaped or usage-shaped.
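
Because the store is plain SQLite, a status read over it is a one-query sketch; table and column names come from the schema listed above, and the path layout follows §2.

```python
# Sketch: read a colony's task progress straight from its SQLite store.
import sqlite3
from contextlib import closing
from pathlib import Path

def colony_task_counts(colony_name: str, hive_home: Path = Path.home() / ".hive") -> dict[str, int]:
    db = hive_home / "colonies" / colony_name / "data" / "progress.db"
    with closing(sqlite3.connect(db)) as conn:
        rows = conn.execute(
            "SELECT status, COUNT(*) FROM tasks GROUP BY status"
        ).fetchall()
    # e.g. {"pending": 3, "started": 1, "completed": 7}
    return {status: count for status, count in rows}
```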

3.5 Queen task system (right-rail panel)

The mechanism the IDE-selection prompt describes is real: each task_update emits TASK_UPDATED on the bus, which a future SSE/WS surface can stream. State transitions: pending → in_progress → completed. Task body carries subject, active_form, blocks, blocked_by, metadata. Source: tasks/events.py:52-159.

3.6 HTTP surface (already shipping)

core/framework/server/routes_sessions.py:

  • POST /api/sessions — create
  • GET /api/sessions/{session_id} — current state including queen_phase, queen_model, colony_id, uptime_seconds
  • GET /api/sessions/{session_id}/stats — runtime statistics (extension point)
  • GET /api/sessions/{session_id}/events/history — replay persisted events

SSE primitive exists at server/sse.py but is not yet wired to a global event-stream route. This is the natural attach point for a real-time status feed.
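
A sketch of that missing wire, written against FastAPI's StreamingResponse. The actual server framework, router layout, and the helper in server/sse.py may differ, and the queue-based subscription is an assumed interface.

```python
# Sketch of the event-bus → SSE bridge; wiring to the real bus is TBD.
import asyncio
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/api/sessions/{session_id}/events/stream")
async def stream_session_events(session_id: str):
    queue: asyncio.Queue = asyncio.Queue()
    # bus.subscribe(lambda e: queue.put_nowait(e), session_id=session_id)  # assumed API

    async def event_source():
        while True:
            event = await queue.get()
            yield f"data: {json.dumps(event, default=str)}\n\n"

    return StreamingResponse(event_source(), media_type="text/event-stream")
```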

3.7 Worker health snapshot

get_worker_health_summary() (worker_monitoring_tools.py:71-99) returns: session_id, session_status, total_steps, recent_verdicts, stall_minutes, evidence_snippet. Used today by Queen during the WORKING phase; can be exposed via API.


3.8 Where every byte lives today (data residency map)

Every storage location below is on the end-user's machine. There is no cloud sink, no telemetry endpoint, no managed database, no analytics service. The HTTP server in core/framework/server/ binds to localhost for the desktop UI; it is not a cloud API.

HIVE_HOME defaults to ~/.hive/ and is overridden by the desktop shell to the platform userData dir (e.g. ~/Library/Application Support/Hive/ on macOS, %APPDATA%\Hive\ on Windows). Source: config.py:20-44.

| Data | On-disk location (per machine) | Format | Lifetime | Currently shipped off-device? |
| --- | --- | --- | --- | --- |
| Event bus stream | in-process memory only | Python objects | Process lifetime | No |
| Event debug log (opt-in) | HIVE_HOME/event_logs/<ts>.jsonl when HIVE_DEBUG_EVENTS=1 | JSONL | Until user deletes | No |
| Session state | HIVE_HOME/sessions/{session_id}/state.json | JSON | Until user deletes | No |
| Conversations | HIVE_HOME/sessions/{session_id}/conversations/ | JSON | Until user deletes | No |
| Artifacts | HIVE_HOME/sessions/{session_id}/artifacts/ | mixed | Until user deletes | No |
| L1 run summary (tokens, cost, quality) | HIVE_HOME/sessions/{session_id}/logs/summary.json | JSON | Until user deletes | No |
| L2 node details | HIVE_HOME/sessions/{session_id}/logs/details.jsonl | JSONL | Until user deletes | No |
| L3 step / tool logs | HIVE_HOME/sessions/{session_id}/logs/tool_logs.jsonl | JSONL | Until user deletes | No |
| Colony task / step / SOP state | HIVE_HOME/colonies/{colony_name}/data/progress.db | SQLite (WAL) | Until user deletes | No |
| Queen / colony / skill / memory configs | HIVE_HOME/{queens,colonies,skills,memories}/ | files | Until user deletes | No |
| LLM cost_usd numbers | computed in-process from model_catalog.py, then written into L1/L2/L3 logs above | — | Same as logs | No |

What this means for the cloud requirement: the question for the architect is not "where do we get the data" — the data is fully captured. The question is "how does it leave the machine, in what shape, with whose consent, and where does it land." That decision is upstream of every endpoint in §6 and every storage option in §5.

Three architectural shapes worth considering (architect to choose):

  • Shape A — On-device only, queried over LAN/tunnel. Cloud product reaches into the runtime via an authenticated tunnel; no data is replicated. Strongest privacy. Hardest for cross-device rollups.
  • Shape B — Outbox push. Runtime keeps the local store as source of truth and asynchronously pushes a redacted, billing-grade subset (no prompts, no tool args by default) to a cloud aggregate. Best fit for the typical "agent status dashboard + usage rollup" product.
  • Shape C — Cloud-first runtime. Runtime writes events directly to a cloud bus and treats local files as a cache. Largest rewrite; not recommended for a desktop-first product.

Shape B is the lowest-friction path to the stated business outcome. The rest of this document is written with Shape B as the default and calls out where Shape A or C would change things.
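
For concreteness, a Shape B flush loop might look like the sketch below. The outbox table, ingest URL, and auth handling are placeholders, and retry/backoff and schema versioning are deliberately elided; if the POST fails, the rows simply stay unsent and are retried on the next flush.

```python
# Shape B sketch: drain a local outbox table and push already-redacted rows
# to a cloud ingest endpoint. URL and column names are hypothetical.
import json
import sqlite3
import urllib.request
from pathlib import Path

OUTBOX_DB = Path.home() / ".hive" / "runtime.db"
INGEST_URL = "https://example.invalid/v1/usage-events"  # hypothetical sink

def flush_outbox(batch_size: int = 200) -> int:
    with sqlite3.connect(OUTBOX_DB) as conn:
        rows = conn.execute(
            "SELECT id, payload FROM outbox WHERE sent_at IS NULL "
            "ORDER BY id LIMIT ?", (batch_size,)
        ).fetchall()
        if not rows:
            return 0
        body = json.dumps([json.loads(payload) for _, payload in rows]).encode()
        req = urllib.request.Request(
            INGEST_URL, data=body,
            headers={"Content-Type": "application/json"}, method="POST",
        )
        urllib.request.urlopen(req, timeout=10)  # raises on failure → rows stay unsent
        conn.executemany(
            "UPDATE outbox SET sent_at = strftime('%s','now') WHERE id = ?",
            [(row_id,) for row_id, _ in rows],
        )
    return len(rows)
```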


4. Capability matrix — what we can offer

Each row is a candidate frontend/business surface, scored by feasibility from current state.

| # | Capability | Status | Backed by |
| --- | --- | --- | --- |
| **Status** | | | |
| S1 | Queen phase indicator (independent/incubating/working/reviewing) | Ready | QUEEN_PHASE_CHANGED event + session detail field |
| S2 | Per-task progress (right-rail) | Ready | TASK_* events |
| S3 | Live LLM streaming indicator (typing, thinking, tool-calling) | Ready | LLM_TEXT_DELTA, LLM_REASONING_DELTA, TOOL_CALL_STARTED/COMPLETED |
| S4 | Stall / stuck-agent detection | Ready | NODE_STALLED, STREAM_INACTIVE, NODE_TOOL_DOOM_LOOP |
| S5 | Colony tree (Queen → colonies → workers) snapshot | Partial | data exists in session/colony stores; need a join query |
| S6 | Worker health roll-up across colonies | Partial | per-worker tool exists; needs aggregation route |
| S7 | Liveness heartbeat ("agent X last seen Y ago") | Net-new | must derive from event timestamps or add a periodic ping |
| S8 | Trigger schedule (when will Queen wake next) | Ready | TRIGGER_* events |
| **Usage** | | | |
| U1 | Tokens per session (input/output/cached) | Partial | captured per-step in L3, summed in L1; no API |
| U2 | USD cost per session / colony / model | Partial | cost_usd per LLM call in logs; no rollup |
| U3 | Tool-call counts and types | Partial | events exist; no aggregate |
| U4 | Wall-clock runtime and active-time per agent | Partial | derivable from EXECUTION_STARTED/COMPLETED |
| U5 | Cost attribution per Queen-spawned colony | Partial | colony_id is on every event; needs a query |
| U6 | Per-user / per-tenant aggregation | Net-new | there is no user/tenant identity in events today |
| U7 | Daily / monthly usage rollups for billing | Net-new | requires persistent event store |
| U8 | Quota / cap enforcement (block when over budget) | Net-new | requires real-time meter + policy hook |

Read of the matrix: ~70% of "status" surfaces are shipping-grade today behind a thin local API. ~70% of "usage" surfaces need a persistence + aggregation layer. The events themselves are not the bottleneck.

Local vs. cloud read of the same matrix. Every "Ready" / "Partial" cell above is ready in-process on the local machine. Making each row visible to a cloud consumer adds an additional step:

| Capability class | Local (today / near-term) | Cloud (business ask) |
| --- | --- | --- |
| Live status (S1–S4, S8) | Stream from in-process event bus over local SSE | Push events through outbox → cloud relay → cloud SSE/WS to product UI. |
| Tree / health (S5, S6) | Join local session + colony stores | Same join, but on a cloud-side replica of the session/colony index. |
| Liveness (S7) | Derive from local event timestamps | Requires the runtime to post a heartbeat; cloud cannot infer aliveness from absence. |
| Per-session usage (U1–U5) | Read L1/L2/L3 logs on disk | Outbox sends durable rows (no deltas) to a cloud usage table. |
| Tenant rollups (U6–U7) | Not possible — no identity in events | Cloud-side aggregation keyed on session→user join, identity attached at outbox time. |
| Quotas (U8) | Local meter feasible, but pointless without cloud truth | Cloud is the meter of record; runtime calls home to check. |

5. Proposed data model (architect to validate)

Three new persisted entities, plus reuse of existing event types:

AgentSession                     UsageEvent                    StatusSnapshot
-----------                      -----------                   ---------------
session_id (PK)                  id (PK)                       session_id (FK)
queen_id                         session_id (FK)               taken_at
queen_model                      colony_id                     phase
started_at                       worker_id                     active_run_id
ended_at                         agent_role  (queen|worker)    active_node
status      (active|done|failed) event_type  (LLM|TOOL|...)    open_task_count
user_id     (when multi-tenant)  model                         in_flight_workers
tenant_id   (when multi-tenant)  input_tokens                  last_event_at
total_input_tokens               output_tokens                 stall_score
total_output_tokens              cached_tokens
total_cached_tokens              cost_usd
total_cost_usd                   latency_ms
total_tool_calls                 tool_name      (nullable)
last_event_at                    occurred_at
                                 trace_id
                                 execution_id

Storage choice (architect call). All three options today are local; only Option C reaches the cloud business surface.

  • Option A — local SQLite outbox at HIVE_HOME/runtime.db. Pros: zero infra, fits desktop, makes local queries cheap. Cons: per-host; no cross-device aggregation; does not satisfy the cloud requirement on its own.
  • Option B — DuckDB on the existing JSONL event logs. Pros: zero ingestion code; analyst-friendly. Cons: cold-start latency on big histories; also local-only.
  • Option C — push events to a managed cloud store (Postgres, ClickHouse, BigQuery) via an outbox pattern. Pros: cross-host rollups, billing-grade, the only option that actually delivers the cloud-visible status/metrics product. Cons: introduces a new transport, identity, and privacy/redaction story; needs explicit user opt-in for desktop builds.

The realistic shape is the hybrid called out in §3.8 Shape B: Option A locally as the durable buffer and source of truth, Option C in the cloud as the business-facing aggregate, with a one-way outbox that moves a redacted, durable-event-only subset over the wire. This document recommends that hybrid; everything in §6 and §7 is written against it.
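
An illustrative (not committed) SQLite DDL for the three entities, in the shape Option A would persist locally; types and constraints are a starting point for the architect to validate, not a schema decision.

```python
# Sketch DDL for the local runtime.db outbox, mirroring the entities above.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS agent_session (
    session_id          TEXT PRIMARY KEY,
    queen_id            TEXT,
    queen_model         TEXT,
    started_at          REAL,
    ended_at            REAL,
    status              TEXT CHECK (status IN ('active','done','failed')),
    user_id             TEXT,   -- populated only when multi-tenant
    tenant_id           TEXT,   -- populated only when multi-tenant
    total_input_tokens  INTEGER DEFAULT 0,
    total_output_tokens INTEGER DEFAULT 0,
    total_cached_tokens INTEGER DEFAULT 0,
    total_cost_usd      REAL    DEFAULT 0.0,
    total_tool_calls    INTEGER DEFAULT 0,
    last_event_at       REAL
);
CREATE TABLE IF NOT EXISTS usage_event (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id    TEXT REFERENCES agent_session(session_id),
    colony_id     TEXT,
    worker_id     TEXT,
    agent_role    TEXT,   -- queen | worker
    event_type    TEXT,   -- LLM | TOOL | ...
    model         TEXT,
    input_tokens  INTEGER,
    output_tokens INTEGER,
    cached_tokens INTEGER,
    cost_usd      REAL,
    latency_ms    INTEGER,
    tool_name     TEXT,
    occurred_at   REAL,
    trace_id      TEXT,
    execution_id  TEXT
);
CREATE TABLE IF NOT EXISTS status_snapshot (
    session_id        TEXT REFERENCES agent_session(session_id),
    taken_at          REAL,
    phase             TEXT,
    active_run_id     TEXT,
    active_node       TEXT,
    open_task_count   INTEGER,
    in_flight_workers INTEGER,
    last_event_at     REAL,
    stall_score       REAL
);
"""

def init_runtime_db(path: str = "runtime.db") -> None:
    with sqlite3.connect(path) as conn:
        conn.executescript(SCHEMA)
```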


6. Surface API — what frontend would consume

All routes assume the event-bus → SSE bridge exists (the one missing wire — see §3.6). Frontend sees this from day one.

Locality note. The /api/... routes below are served by the local runtime HTTP server today. For the cloud product, the same shapes need a cloud-side counterpart fed by the outbox. Two practical patterns: (1) cloud product calls cloud-hosted versions of these routes (against the aggregate), or (2) cloud product proxies authenticated requests back to the user's runtime. §3.8 Shape A vs. Shape B picks between them.

Real-time channel

GET  /api/sessions/{session_id}/events/stream      (SSE)
       ↳ filter=phase,task,llm_stream,tool,worker,trigger,health
GET  /api/agents/queen/stream                      (SSE) — global queen events

Status reads

GET  /api/sessions/{session_id}                    — already shipping
GET  /api/sessions/{session_id}/tree               — Queen → colonies → workers
GET  /api/sessions/{session_id}/health             — stall_score, last_event_at, in_flight
GET  /api/colonies/{colony_id}/workers             — health roll-up

Usage reads

GET  /api/sessions/{session_id}/usage              — tokens, cost, latency, tool-calls
GET  /api/sessions/{session_id}/usage/by-model     — split by model
GET  /api/colonies/{colony_id}/usage               — same shape, colony scope
GET  /api/agents/queen/usage?range=...&group_by=... — rollup view (billing)
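
One possible response shape for the per-session usage read, sketched as a TypedDict. Field names mirror the proposed model in §5; the final contract is a decision to settle with the frontend doc, not a commitment here.

```python
# Illustrative shape for GET /api/sessions/{session_id}/usage (proposal only).
from typing import TypedDict

class ModelUsage(TypedDict):
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    cost_usd: float

class SessionUsage(TypedDict):
    session_id: str
    started_at: str            # ISO-8601
    total_input_tokens: int
    total_output_tokens: int
    total_cached_tokens: int
    total_cost_usd: float      # estimated from model_catalog.py, not invoice-grade
    total_tool_calls: int
    wall_clock_seconds: float
    by_model: list[ModelUsage]
```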

Admin / business

GET  /api/usage/rollup?range=...&group_by=user|tenant|model|colony
POST /api/quotas/{tenant}                          — set caps (if quota work in scope)

7. Net-new work — sized in shirt sizes, not days

| Workstream | Local / Cloud | Size | Depends on | Notes |
| --- | --- | --- | --- | --- |
| Event-bus → local SSE bridge (sse.py exists, route does not) | Local | S | — | Unlocks all real-time status surfaces in the desktop UI. Highest leverage piece. |
| Persisted local event store (SQLite outbox) | Local | M | Decision §5 | One writer, append-only; reuse existing JSONL writer. Source of truth for cloud push. |
| Local aggregation queries + /usage endpoints | Local | M | Persisted store | Per-session usage on disk. |
| Outbox transport (local → cloud) | Boundary | M–L | Local store + auth | New work: durable queue, retry, redaction policy, opt-in switch, schema versioning. This is the bridge to the cloud product. |
| Cloud event ingest + aggregate store | Cloud | L | Outbox transport | New cloud infra (Postgres/ClickHouse/BigQuery). Hosting, ops, retention policy, access controls. |
| Cloud-side status/usage API + dashboards | Cloud | M | Cloud aggregate | Mirrors §6 endpoints against the cloud store; this is what business users actually see. |
| Identity layer (user_id / tenant_id on events) | Boundary | M | Auth model | Currently no user identity in events. Identity attaches at outbox time, not at emit time. |
| OpenTelemetry exporter (schema is ready) | Boundary | S–M | — | trace_id/span_id already populated; an OTel collector can be the cloud sink instead of a custom outbox. |
| Quota / policy hooks | Cloud-authoritative | L | Cloud store + identity | Cloud holds the meter; runtime calls home synchronously on a critical path. |
| Liveness/heartbeat (S7) | Local emit, cloud consume | S | Outbox | Runtime must actively post; cloud cannot infer liveness from absence. |
| Cost attribution UI rollups | Cloud | S | /usage cloud endpoints | Shared with frontend doc. |

Critical path for first frontend release (local desktop UI): SSE bridge → status endpoints (S1–S5) → per-session usage endpoint (U1, U2). Everything else is incremental.

Critical path for first cloud release (business ask): local event store → outbox transport with redaction + opt-in → cloud ingest → cloud /usage and /status endpoints. The local UI work above is not a prerequisite for the cloud cut, but most of the local-side primitives (event store, durable-event filtering) are shared, so doing them in order minimizes rework.
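
As a concrete starting point for the redaction policy named in the outbox workstream, the push path could apply an explicit allowlist so that any new field defaults to staying local. The field set below is a proposal, not the settled answer to open question 2 in §10.

```python
# Sketch of the redaction step applied before a row enters the outbox.
CLOUD_ALLOWLIST = {
    "event_type", "session_id", "colony_id", "worker_id", "agent_role",
    "model", "input_tokens", "output_tokens", "cached_tokens", "cost_usd",
    "latency_ms", "tool_name", "occurred_at", "trace_id", "execution_id",
}

def redact_for_cloud(event: dict) -> dict:
    """Keep only allowlisted keys; prompts, tool args, and file paths never pass."""
    return {k: v for k, v in event.items() if k in CLOUD_ALLOWLIST}
```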


8. Risks and tradeoffs the architect should weigh

  1. Event volume. LLM_TEXT_DELTA fires per token. A persisted store must filter — don't write deltas, write LLM_TURN_COMPLETE. This is the #1 way the table blows up.
  2. Privacy / desktop posture — the central architectural constraint. The runtime is local by default (config.py:20-44). The data inventory in §3.8 confirms that no data leaves the user's machine today, including the data the business ask needs in the cloud. Closing that gap is not "add a metrics push" — it is a new system boundary with: (a) explicit user opt-in (defaults must be safe for OSS / self-hosted users), (b) a documented redaction list (no prompts, no tool args, no file paths in the default payload), (c) schema versioning so cloud aggregates do not break on runtime upgrades, (d) a clear answer for self-hosted / air-gapped deployments where the cloud sink is unreachable, (e) regional data-residency rules if the product is sold internationally. This is the single largest design decision in the document.
  3. Cost-table accuracy. cost_usd is computed from a static catalog. Using it for billing means committing to keeping the catalog current (or pulling from provider invoices). For display, the current approach is fine; for charging, it is not.
  4. Identity coupling. Events are session-scoped today. Adding user_id/tenant_id everywhere is invasive. Recommend pinning identity at the session boundary and joining on session at query time, rather than threading identity through every event payload.
  5. Status vs. heartbeat semantics. "Idle" is not "dead." A Queen sitting in independent waiting for a user message is healthy and should not page anyone. The stall-score in §5 must distinguish idle-by-design from stalled-by-bug — the existing STREAM_INACTIVE / NODE_STALLED events already make this distinction; preserve it.
  6. Backpressure from observability. If usage tracking sits in the LLM call path (for quotas), it must not add latency. Recommend: meter is async/eventual for display; only quota checks are synchronous, and only when the customer has a quota.
  7. Worker-side gap. Worker LLM calls are accounted in their own session's L1–L3 logs but are not automatically rolled into the parent Queen session. Cost attribution from Queen → spawned colony requires either (a) a parent_session_id field on the colony's session row, or (b) walking the COLONY_CREATED event graph at query time. (a) is cleaner.

9. Recommendation

Ship in four thin slices. The first two are local-only and unblock the desktop UI; the last two are what actually deliver the business ask of cloud-visible status and metrics.

  1. Slice 1 — Live local status (1 sprint, fully local). SSE bridge + /sessions/{id}/events/stream + /sessions/{id}/health + /sessions/{id}/tree. Frontend (local UI) gets the right-rail and the agent-tree. No persistence work, no cloud. (S1–S5, S8.)

  2. Slice 2 — Per-session local usage store (1–2 sprints, fully local). Persisted event store (SQLite outbox at HIVE_HOME/runtime.db), filtered to durable event types only. /sessions/{id}/usage + /colonies/{id}/usage. No identity, no rollups, no cloud transport yet. This is the foundation the cloud slice rides on. (U1–U5.)

  3. Slice 3 — Local→cloud outbox + cloud ingest (the cloud cut, scope-defining). Durable outbox queue, redaction policy, opt-in toggle, identity attachment, schema versioning, retry/backoff. Cloud-side ingest service + aggregate store. This is where the local-only world becomes a cloud product. Architect must decide §3.8 Shape, §5 storage, redaction defaults, and identity model before this slice can start.

  4. Slice 4 — Cloud rollups, dashboards, quotas (scope TBD with product). Tenant aggregation, daily/monthly rollups, quota enforcement, OTel export, business dashboards. (U6–U8.) Defer until business confirms billing model — the answer (per-seat vs. per-token vs. per-colony) changes the data model.

Slices 1 and 2 are mostly wiring — the events exist, the schemas exist, the storage paths exist. Slice 3 is the first slice that introduces a new architectural boundary (local→cloud transport + identity + privacy contract); everything novel about the business ask lives there. Slice 4 is business design, not engineering scope.


10. Open questions for the architect

The first four are direct consequences of the local-first / cloud-required gap surfaced in §3.8 and §8.2.

  1. Cloud transport shape — Shape A, B, or C from §3.8? This decision is upstream of the entire data model. Recommend Shape B (outbox push) absent a strong privacy argument for Shape A.
  2. Redaction default for the cloud payload. What goes (model, token counts, latency, tool names, status) vs. what stays local (prompts, tool arguments, tool results, file paths, conversation content)? Need a written allowlist before Slice 3 starts.
  3. Self-hosted / air-gapped users. If the cloud sink is unreachable or disabled, what does the runtime do — buffer indefinitely, drop oldest, or refuse to start? Defaults differ for OSS vs. SaaS distributions.
  4. Identity binding point. Do we attach user_id / tenant_id at event-emit time (invasive, threads identity through every node), at session-create time (clean, requires session-level auth), or at outbox-flush time (simplest, but loses per-event provenance)? Recommend session-create.
  5. Do we need quota enforcement, or only quota visibility in v1?
  6. Frontend doc: are status and usage rendered in the same panel or different surfaces? This affects whether we ship one merged endpoint or two.
  7. Are we willing to pay the cost-table maintenance burden, or should "cost" stay labeled as estimated and not be used for invoicing?

Appendix — Pointers