Agent Usage & Status Tracking — Capability Document
- Audience: Lead software architect (paired with frontend + business requirement docs)
- Scope: Queen agent first (default local-runtime entry point), then downstream colonies/workers
- Status: Capability inventory + proposal; no implementation commitments
- Date: 2026-05-04
1. Why this document exists
We have a business need to track agent usage (what was consumed: tokens, cost, runtime, calls) and agent status (what state agents are in: alive, phase, progress, blocked) starting from the Queen agent, and surface this on the cloud for product and business consumers. This document inventories the capabilities the runtime can expose today vs. what is net-new, so architecture can pick a scope before frontend and product write against it.
The Queen is the right anchor: every local-runtime session today starts with a Queen, and every colony/worker is forked from a Queen call — so a tracking surface rooted at the Queen automatically covers the whole agent tree.
Headline constraint for the architect: the runtime is local-by-default. Every byte described in §3 (events, runtime logs, progress DBs, session state, LLM cost numbers) is written to the user's machine under ~/.hive/ (or the platform-specific Electron userData directory). Nothing is shipped to the cloud today. The business ask therefore implies a new local→cloud transport boundary, with the data-residency, privacy, and identity decisions that come with it. §3.8 maps where the data lives today; the local vs. cloud table in §4 makes the gap explicit per-surface; §8 lists the risks; §9 frames the cloud cut-over as the gating decision for any "Slice 3+" work.
2. Vocabulary — what we actually mean
| Term | Definition in this codebase |
|---|---|
| Session | One Queen runtime instance. ID: session_{YYYYMMDD_HHMMSS}_{uuid8}. Persisted at ~/.hive/sessions/{session_id}/. |
| Queen | Long-lived conversational agent. One per session. Single event-loop node. Phases: independent → incubating → working → reviewing. See queen/nodes/__init__.py:494-518. |
| Colony | Persistent stateless container forked by create_colony. Has its own SQLite progress DB at ~/.hive/colonies/{colony_name}/data/progress.db. |
| Worker | Ephemeral agent running inside a colony to execute a task. |
| Run / execution | One trigger-to-completion invocation inside a node. Carries run_id, execution_id, trace_id (OTel-aligned). |
| Usage | Quantitative consumption: input/output/cached tokens, USD cost, wall-clock latency, tool-call counts. |
| Status | Qualitative state: phase, alive/stalled, current task, blocked-on, queue depth, last-heartbeat. |
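The session ID format in the table can be illustrated with a short sketch. It assumes that uuid8 means the first eight hex characters of a random UUID, which the source does not confirm; the helper names new_session_id and parse_session_id are hypothetical and are not taken from the codebase.

```python
import re
import uuid
from datetime import datetime
from typing import Optional, Tuple

# Assumed shape of session_{YYYYMMDD_HHMMSS}_{uuid8}: "uuid8" is read here as
# eight lowercase hex characters (an assumption, not confirmed by the source).
SESSION_ID_RE = re.compile(r"^session_(\d{8}_\d{6})_([0-9a-f]{8})$")


def new_session_id(now: Optional[datetime] = None) -> str:
    """Build a session ID from a timestamp and a random 8-character suffix."""
    now = now or datetime.now()
    return f"session_{now:%Y%m%d_%H%M%S}_{uuid.uuid4().hex[:8]}"


def parse_session_id(session_id: str) -> Tuple[datetime, str]:
    """Split a session ID into its start timestamp and random suffix."""
    match = SESSION_ID_RE.match(session_id)
    if match is None:
        raise ValueError(f"not a session id: {session_id!r}")
    started_at = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
    return started_at, match.group(2)
```

Because the timestamp is embedded in the ID, a usage store could derive started_at for a session without opening state.json, provided the real format matches this assumption.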
3. What exists today (capabilities, not commitments)
The runtime is already heavily instrumented. Most of what business wants is already emitted — the gap is persistence, aggregation, and a stable API surface.
3.1 Event Bus — the spine
core/framework/host/event_bus.py:61-177 defines an in-process async pub/sub with 40+ event types scoped by stream_id, session_id, colony_id, execution_id, run_id, correlation_id, timestamp.
Relevant for usage/status:
- Lifecycle: EXECUTION_STARTED/COMPLETED/FAILED/PAUSED/RESUMED/RESURRECTED
- Queen: QUEEN_PHASE_CHANGED, QUEEN_IDENTITY_SELECTED
- Colony/Worker: COLONY_CREATED, WORKER_COLONY_LOADED, WORKER_COMPLETED, WORKER_FAILED, SUBAGENT_REPORT
- LLM: LLM_TURN_COMPLETE, LLM_TEXT_DELTA, LLM_REASONING_DELTA, CONTEXT_USAGE_UPDATED
- Tools: TOOL_CALL_STARTED, TOOL_CALL_COMPLETED, TOOL_CALL_REPLAY_DETECTED
- Health: NODE_STALLED, NODE_TOOL_DOOM_LOOP, STREAM_TTFT_EXCEEDED, STREAM_INACTIVE, STREAM_NUDGE_SENT
- Tasks (right-rail panel): TASK_CREATED, TASK_UPDATED, TASK_DELETED, TASK_LIST_RESET
- Triggers: TRIGGER_AVAILABLE/ACTIVATED/DEACTIVATED/FIRED/REMOVED/UPDATED
Persistence today: in-memory only, plus optional JSONL export when HIVE_DEBUG_EVENTS=1 (event_bus.py:33-54). There is no SQL events table.
3.2 Three-level runtime logs (per session)
core/framework/tracker/runtime_log_schemas.py defines:
| Level | Schema | File | Granularity |
|---|---|---|---|
| L1 | RunSummaryLog | summary.json | Per graph run: totals + execution_quality + trace_id |
| L2 | NodeDetail | details.jsonl | Per node: exit_status, input/output tokens, latency_ms, retry/accept/escalate/continue counts |
| L3 | NodeStepLog | tool_logs.jsonl | Per LLM step: tool calls, verdicts, error traces, latency_ms |
Storage: ~/.hive/sessions/{session_id}/logs/ (runtime_log_store.py). Schemas already carry OTel fields (trace_id, span_id, parent_span_id) — wire-ready, not yet exported.
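As a rough illustration of how a per-session usage read could be built on the L2 file, the sketch below sums token and latency fields from a details.jsonl. The JSON key names (input_tokens, output_tokens, latency_ms) are assumptions inferred from the table above; the actual NodeDetail field names should be checked against runtime_log_schemas.py. The function name sum_node_details is hypothetical.

```python
import json
from pathlib import Path


def sum_node_details(details_path: Path) -> dict:
    """Sum assumed token/latency fields across all rows of an L2 JSONL file."""
    totals = {"nodes": 0, "input_tokens": 0, "output_tokens": 0, "latency_ms": 0}
    with details_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            row = json.loads(line)
            totals["nodes"] += 1
            for key in ("input_tokens", "output_tokens", "latency_ms"):
                # Missing or null fields count as zero rather than failing the read.
                totals[key] += int(row.get(key) or 0)
    return totals
```

A local /usage endpoint could start from a read like this before any persisted event store exists, at the cost of re-scanning the file on every request.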
3.3 LLM call accounting
core/framework/llm/provider.py:11-32 — LLMResponse carries: model, input_tokens, output_tokens, cached_tokens, cache_creation_tokens, cost_usd, stop_reason. Cost is computed from model_catalog.py when the model is priced; otherwise 0.0.
Gap: cost lives in the response object and is rolled into L2/L3 logs, but is not in the event bus stream and not in any aggregate query surface.
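The "priced model, otherwise 0.0" behaviour can be sketched as below. This is not the code in model_catalog.py: the ModelPrice type, the per-million-token pricing unit, the example prices, and the function name estimate_cost_usd are all assumptions made for illustration, and cached tokens are deliberately ignored.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical catalog entry: USD per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float


# Placeholder prices for a made-up model; real values would come from the catalog.
EXAMPLE_PRICES: Dict[str, ModelPrice] = {
    "example-model": ModelPrice(input_per_mtok=3.0, output_per_mtok=15.0),
}


def estimate_cost_usd(
    model: str,
    input_tokens: int,
    output_tokens: int,
    prices: Optional[Dict[str, ModelPrice]] = None,
) -> float:
    """Return an estimated cost, or 0.0 when the model has no catalog price."""
    catalog = EXAMPLE_PRICES if prices is None else prices
    price = catalog.get(model)
    if price is None:
        return 0.0  # unpriced model: same fallback the document describes
    total = input_tokens * price.input_per_mtok + output_tokens * price.output_per_mtok
    return total / 1_000_000
```

One consequence worth noting for rollups: a 0.0 from an unpriced model is indistinguishable from a genuinely free call unless the aggregate also records whether a price was found.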
3.4 Colony Progress DB
core/framework/host/progress_db.py:44-110 — per-colony SQLite (WAL mode):
- tasks(id, seq, priority, goal, status: pending|claimed|started|completed|failed, worker_id, claimed_at, started_at, completed_at, retry_count, last_error)
- steps, sop_checklist, colony_meta
This is the closest thing we have to a status SQL store today, but it is per-colony and task-shaped — not session-shaped or usage-shaped.
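A status read against a colony's progress DB might look like the sketch below. It assumes only the tasks table and status column listed above; the function names open_readonly and task_status_counts are hypothetical. The database is opened read-only through an SQLite URI so that a reporting query cannot modify colony state.

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite file in read-only mode via a file: URI."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def task_status_counts(conn: sqlite3.Connection) -> dict:
    """Count tasks per status, e.g. {'pending': 3, 'completed': 7}."""
    rows = conn.execute("SELECT status, COUNT(*) FROM tasks GROUP BY status")
    return {status: count for status, count in rows}
```

Running this per colony and merging the results keyed by colony name would be one way to approximate the session-shaped view that the per-colony layout does not provide directly.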
3.5 Queen task system (right-rail panel)
The mechanism the IDE-selection prompt describes is real: each task_update emits TASK_UPDATED on the bus, which a future SSE/WS surface can stream. State transitions: pending → in_progress → completed. Task body carries subject, active_form, blocks, blocked_by, metadata. Source: tasks/events.py:52-159.
3.6 HTTP surface (already shipping)
core/framework/server/routes_sessions.py:
- POST /api/sessions: create
- GET /api/sessions/{session_id}: current state including queen_phase, queen_model, colony_id, uptime_seconds
- GET /api/sessions/{session_id}/stats: runtime statistics (extension point)
- GET /api/sessions/{session_id}/events/history: replay persisted events
SSE primitive exists at server/sse.py but is not yet wired to a global event-stream route. This is the natural attach point for a real-time status feed.
3.7 Worker health snapshot
get_worker_health_summary() (worker_monitoring_tools.py:71-99) returns: session_id, session_status, total_steps, recent_verdicts, stall_minutes, evidence_snippet. Used today by Queen during the WORKING phase; can be exposed via API.
3.8 Where every byte lives today (data residency map)
Every storage location below is on the end-user's machine. There is no cloud sink, no telemetry endpoint, no managed database, no analytics service. The HTTP server in core/framework/server/ binds to localhost for the desktop UI; it is not a cloud API.
HIVE_HOME defaults to ~/.hive/ and is overridden by the desktop shell to the platform userData dir (e.g. ~/Library/Application Support/Hive/ on macOS, %APPDATA%\Hive\ on Windows). Source: config.py:20-44.
| Data | On-disk location (per machine) | Format | Lifetime | Currently shipped off-device? |
|---|---|---|---|---|
| Event bus stream | in-process memory only | Python objects | Process lifetime | No |
| Event debug log (opt-in) | HIVE_HOME/event_logs/<ts>.jsonl when HIVE_DEBUG_EVENTS=1 | JSONL | Until user deletes | No |
| Session state | HIVE_HOME/sessions/{session_id}/state.json | JSON | Until user deletes | No |
| Conversations | HIVE_HOME/sessions/{session_id}/conversations/ | JSON | Until user deletes | No |
| Artifacts | HIVE_HOME/sessions/{session_id}/artifacts/ | mixed | Until user deletes | No |
| L1 run summary (tokens, cost, quality) | HIVE_HOME/sessions/{session_id}/logs/summary.json | JSON | Until user deletes | No |
| L2 node details | HIVE_HOME/sessions/{session_id}/logs/details.jsonl | JSONL | Until user deletes | No |
| L3 step / tool logs | HIVE_HOME/sessions/{session_id}/logs/tool_logs.jsonl | JSONL | Until user deletes | No |
| Colony task / step / SOP state | HIVE_HOME/colonies/{colony_name}/data/progress.db | SQLite (WAL) | Until user deletes | No |
| Queen / colony / skill / memory configs | HIVE_HOME/{queens,colonies,skills,memories}/ | files | Until user deletes | No |
| LLM cost_usd numbers | computed in-process from model_catalog.py, then written into L1/L2/L3 logs above | n/a | Same as logs | No |
What this means for the cloud requirement: the question for the architect is not "where do we get the data" — the data is fully captured. The question is "how does it leave the machine, in what shape, with whose consent, and where does it land." That decision is upstream of every endpoint in §6 and every storage option in §5.
Three architectural shapes worth considering (architect to choose):
- Shape A — On-device only, queried over LAN/tunnel. Cloud product reaches into the runtime via an authenticated tunnel; no data is replicated. Strongest privacy. Hardest for cross-device rollups.
- Shape B — Outbox push. Runtime keeps the local store as source of truth and asynchronously pushes a redacted, billing-grade subset (no prompts, no tool args by default) to a cloud aggregate. Best fit for the typical "agent status dashboard + usage rollup" product.
- Shape C — Cloud-first runtime. Runtime writes events directly to a cloud bus and treats local files as a cache. Largest rewrite; not recommended for a desktop-first product.
Shape B is the lowest-friction path to the stated business outcome. The rest of this document is written with Shape B as the default and calls out where Shape A or C would change things.
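To make the "redacted, billing-grade subset" in Shape B concrete, here is a minimal allowlist sketch. The field names in CLOUD_ALLOWED_FIELDS are assumptions drawn from the fields this document mentions elsewhere; the actual allowlist is an open decision for the architect, and redact_for_cloud is a hypothetical helper name.

```python
from typing import Any, Dict

# Assumed allowlist: counters and identifiers only. Anything not named here
# (prompts, tool arguments, tool results, file paths) never leaves the machine.
CLOUD_ALLOWED_FIELDS = frozenset({
    "session_id",
    "colony_id",
    "event_type",
    "model",
    "input_tokens",
    "output_tokens",
    "cached_tokens",
    "cost_usd",
    "latency_ms",
    "tool_name",
    "status",
    "occurred_at",
})


def redact_for_cloud(event: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only allowlisted top-level fields; drop everything else."""
    return {key: value for key, value in event.items() if key in CLOUD_ALLOWED_FIELDS}
```

An allowlist is chosen over a denylist here so that a field added to the runtime later stays local until someone explicitly approves it for the cloud payload.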
4. Capability matrix — what we can offer
Each row is a candidate frontend/business surface, scored by feasibility from current state.
| # | Capability | Status | Backed by |
|---|---|---|---|
| | Status | | |
| S1 | Queen phase indicator (independent/incubating/working/reviewing) | Ready | QUEEN_PHASE_CHANGED event + session detail field |
| S2 | Per-task progress (right-rail) | Ready | TASK_* events |
| S3 | Live LLM streaming indicator (typing, thinking, tool-calling) | Ready | LLM_TEXT_DELTA, LLM_REASONING_DELTA, TOOL_CALL_STARTED/COMPLETED |
| S4 | Stall / stuck-agent detection | Ready | NODE_STALLED, STREAM_INACTIVE, NODE_TOOL_DOOM_LOOP |
| S5 | Colony tree (Queen → colonies → workers) snapshot | Partial | Data exists in session/colony stores; needs a join query |
| S6 | Worker health roll-up across colonies | Partial | Per-worker tool exists; needs aggregation route |
| S7 | Liveness heartbeat ("agent X last seen Y ago") | Net-new | Must derive from event timestamps or add a periodic ping |
| S8 | Trigger schedule (when will Queen wake next) | Ready | TRIGGER_* events |
| | Usage | | |
| U1 | Tokens per session (input/output/cached) | Partial | Captured per-step in L3, summed in L1; no API |
| U2 | USD cost per session / colony / model | Partial | cost_usd per LLM call in logs; no rollup |
| U3 | Tool-call counts and types | Partial | Events exist; no aggregate |
| U4 | Wall-clock runtime and active-time per agent | Partial | Derivable from EXECUTION_STARTED/COMPLETED |
| U5 | Cost attribution per Queen-spawned colony | Partial | colony_id is on every event; needs a query |
| U6 | Per-user / per-tenant aggregation | Net-new | No user/tenant identity in events today |
| U7 | Daily / monthly usage rollups for billing | Net-new | Requires persistent event store |
| U8 | Quota / cap enforcement (block when over budget) | Net-new | Requires real-time meter + policy hook |
Read of the matrix: five of the eight status surfaces (S1–S4, S8) are Ready today behind a thin local API. No usage surface is Ready: U1–U5 are Partial and need a persistence + aggregation layer, and U6–U8 are net-new. The events themselves are not the bottleneck.
Local vs. cloud read of the same matrix. Every "Ready" / "Partial" cell above is ready in-process on the local machine. Making each row visible to a cloud consumer adds an additional step:
| Capability class | Local (today / near-term) | Cloud (business ask) |
|---|---|---|
| Live status (S1–S4, S8) | Stream from in-process event bus over local SSE | Push events through outbox → cloud relay → cloud SSE/WS to product UI. |
| Tree / health (S5, S6) | Join local session + colony stores | Same join, but on cloud-side replica of session/colony index. |
| Liveness (S7) | Derive from local event timestamps | Requires the runtime to post a heartbeat; cloud cannot infer aliveness from absence. |
| Per-session usage (U1–U5) | Read L1/L2/L3 logs on disk | Outbox sends durable rows (no deltas) to cloud usage table. |
| Tenant rollups (U6–U7) | Not possible — no identity in events | Cloud-side aggregation keyed on session→user join, identity attached at outbox time. |
| Quotas (U8) | Local meter feasible, but pointless without cloud truth | Cloud is the meter of record; runtime calls home to check. |
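The liveness row deserves a small sketch, because the cloud side can only report what it last heard. The example below classifies a runtime from its most recent heartbeat; the 30-second interval, the three-missed-beats threshold, the state names, and the function name liveness are all illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def liveness(
    last_heartbeat_at: Optional[datetime],
    now: datetime,
    interval: timedelta = timedelta(seconds=30),
    missed_beats: int = 3,
) -> str:
    """Classify a runtime as seen from the cloud.

    Returns "unknown" when no heartbeat was ever received, "alive" when the
    last one is recent enough, and "unreachable" otherwise. It never returns
    "dead": silence may just mean the machine is offline or opted out.
    """
    if last_heartbeat_at is None:
        return "unknown"
    if now - last_heartbeat_at <= interval * missed_beats:
        return "alive"
    return "unreachable"
```

Keeping "unreachable" separate from any failure state avoids paging on laptops that were simply closed.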
5. Proposed data model (architect to validate)
Three new persisted entities, plus reuse of existing event types:
AgentSession UsageEvent StatusSnapshot
----------- ----------- ---------------
session_id (PK) id (PK) session_id (FK)
queen_id session_id (FK) taken_at
queen_model colony_id phase
started_at worker_id active_run_id
ended_at agent_role (queen|worker) active_node
status (active|done|failed) event_type (LLM|TOOL|...) open_task_count
user_id (when multi-tenant) model in_flight_workers
tenant_id (when multi-tenant) input_tokens last_event_at
total_input_tokens output_tokens stall_score
total_output_tokens cached_tokens
total_cached_tokens cost_usd
total_cost_usd latency_ms
total_tool_calls tool_name (nullable)
last_event_at occurred_at
trace_id
execution_id
Storage choice (architect call). Options A and B are local-only; only Option C reaches the cloud business surface.
- Option A: local SQLite outbox at HIVE_HOME/runtime.db. Pros: zero infra, fits desktop, makes local queries cheap. Cons: per-host; no cross-device aggregation; does not satisfy the cloud requirement on its own.
- Option B: DuckDB on the existing JSONL event logs. Pros: zero ingestion code; analyst-friendly. Cons: cold-start latency on big histories; also local-only.
- Option C: push events to a managed cloud store (Postgres, ClickHouse, BigQuery) via an outbox pattern. Pros: cross-host rollups, billing-grade, the only option that actually delivers the cloud-visible status/metrics product. Cons: introduces a new transport, identity, and privacy/redaction story; needs explicit user opt-in for desktop builds.
The realistic shape is the hybrid called out in §3.8 Shape B: A locally as the durable buffer and source of truth, C in the cloud as the business-facing aggregate, with a one-way outbox that moves a redacted, durable-event-only subset over the wire. This document recommends that hybrid; everything in §6 and §7 is written against it.
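The one-way outbox in this hybrid could be sketched with SQLite as follows. The table layout, the enqueue and flush function names, and the send callback are hypothetical; a real implementation would add retry/backoff, schema versioning, and the opt-in check discussed in §8.

```python
import json
import sqlite3
from typing import Any, Callable, Dict, List

# Hypothetical outbox table: rows stay until a send succeeds, then are stamped.
SCHEMA = """
CREATE TABLE IF NOT EXISTS outbox (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    sent_at TEXT
)
"""


def enqueue(conn: sqlite3.Connection, event: Dict[str, Any]) -> None:
    """Append one (already redacted) event to the local outbox."""
    conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
    conn.commit()


def flush(
    conn: sqlite3.Connection,
    send: Callable[[List[Dict[str, Any]]], Any],
    batch_size: int = 100,
) -> int:
    """Send the oldest unsent rows; mark them sent only if send() returns."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent_at IS NULL ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0
    # If send() raises, nothing is marked and the same rows are retried later.
    send([json.loads(payload) for _, payload in rows])
    conn.executemany(
        "UPDATE outbox SET sent_at = datetime('now') WHERE id = ?",
        [(row_id,) for row_id, _ in rows],
    )
    conn.commit()
    return len(rows)
```

This gives at-least-once delivery: a crash between a successful send and the UPDATE re-sends the batch, so the cloud ingest would need to deduplicate, for example on an event ID carried in the payload.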
6. Surface API — what frontend would consume
All routes assume the event-bus → SSE bridge exists (the one missing wire — see §3.6). Frontend sees this from day one.
Locality note. The /api/... routes below are served by the local runtime HTTP server today. For the cloud product, the same shapes need a cloud-side counterpart fed by the outbox. Two practical patterns: (1) cloud product calls cloud-hosted versions of these routes (against the aggregate), or (2) cloud product proxies authenticated requests back to the user's runtime. §3.8 Shape A vs. Shape B picks between them.
Real-time channel
GET /api/sessions/{session_id}/events/stream (SSE)
↳ filter=phase,task,llm_stream,tool,worker,trigger,health
GET /api/agents/queen/stream (SSE) — global queen events
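A sketch of what the stream route's filter and wire format could look like is given below. The mapping from filter categories to event-type prefixes is an assumption (the document names the categories but not their membership), and parse_filter, matches, and sse_frame are hypothetical helpers rather than functions from server/sse.py. The frame layout follows the standard Server-Sent Events text format.

```python
import json
from typing import Any, Optional, Tuple

# Assumed mapping from ?filter= categories to event-type prefixes.
CATEGORY_PREFIXES = {
    "phase": ("QUEEN_PHASE_CHANGED",),
    "task": ("TASK_",),
    "llm_stream": ("LLM_",),
    "tool": ("TOOL_CALL_",),
    "worker": ("WORKER_", "SUBAGENT_"),
    "trigger": ("TRIGGER_",),
    "health": ("NODE_", "STREAM_"),
}


def parse_filter(value: Optional[str]) -> Optional[Tuple[str, ...]]:
    """Turn 'phase,task' into a tuple of prefixes; None means no filtering."""
    if not value:
        return None
    prefixes = []
    for name in value.split(","):
        prefixes.extend(CATEGORY_PREFIXES.get(name.strip(), ()))
    return tuple(prefixes)


def matches(event_type: str, prefixes: Optional[Tuple[str, ...]]) -> bool:
    """True when the event passes the filter (or when there is no filter)."""
    return True if prefixes is None else event_type.startswith(prefixes)


def sse_frame(event_type: str, data: Any, event_id: Optional[int] = None) -> str:
    """Format one Server-Sent Events frame, terminated by a blank line."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event_type}")
    # Compact JSON contains no raw newlines, so one data: line is enough.
    lines.append(f"data: {json.dumps(data, separators=(',', ':'))}")
    return "\n".join(lines) + "\n\n"
```

Sending an id with each frame lets a reconnecting client pass Last-Event-ID, which pairs naturally with the existing events/history route for catch-up.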
Status reads
GET /api/sessions/{session_id} — already shipping
GET /api/sessions/{session_id}/tree — Queen → colonies → workers
GET /api/sessions/{session_id}/health — stall_score, last_event_at, in_flight
GET /api/colonies/{colony_id}/workers — health roll-up
Usage reads
GET /api/sessions/{session_id}/usage — tokens, cost, latency, tool-calls
GET /api/sessions/{session_id}/usage/by-model — split by model
GET /api/colonies/{colony_id}/usage — same shape, colony scope
GET /api/agents/queen/usage?range=...&group_by=... — rollup view (billing)
Admin / business
GET /api/usage/rollup?range=...&group_by=user|tenant|model|colony
POST /api/quotas/{tenant} — set caps (if quota work in scope)
7. Net-new work — sized in shirt-size, not days
| Workstream | Local / Cloud | Size | Depends on | Notes |
|---|---|---|---|---|
| Event-bus → local SSE bridge (sse.py exists, route does not) | Local | S | — | Unlocks all real-time status surfaces in the desktop UI. Highest leverage piece. |
| Persisted local event store (SQLite outbox) | Local | M | Decision §5 | One writer, append-only; reuse existing JSONL writer. Source of truth for cloud push. |
| Local aggregation queries + /usage endpoints | Local | M | Persisted store | Per-session usage on disk. |
| Outbox transport (local → cloud) | Boundary | M–L | Local store + auth | New work: durable queue, retry, redaction policy, opt-in switch, schema versioning. This is the bridge to the cloud product. |
| Cloud event ingest + aggregate store | Cloud | L | Outbox transport | New cloud infra (Postgres/ClickHouse/BigQuery). Hosting, ops, retention policy, access controls. |
| Cloud-side status/usage API + dashboards | Cloud | M | Cloud aggregate | Mirrors §6 endpoints against the cloud store; this is what business users actually see. |
| Identity layer (user_id / tenant_id on events) | Boundary | M | Auth model | Currently no user identity in events. Identity attaches at outbox time, not at emit time. |
| OpenTelemetry exporter (schema is ready) | Boundary | S–M | — | trace_id/span_id already populated; an OTel collector can be the cloud sink instead of a custom outbox. |
| Quota / policy hooks | Cloud-authoritative | L | Cloud store + identity | Cloud holds the meter; runtime calls home synchronously on a critical path. |
| Liveness/heartbeat (S7) | Local emit, cloud consume | S | Outbox | Runtime must actively post; cloud cannot infer liveness from absence. |
| Cost attribution UI rollups | Cloud | S | /usage cloud endpoints | Shared with frontend doc. |
Critical path for first frontend release (local desktop UI): SSE bridge → status endpoints (S1–S5) → per-session usage endpoint (U1, U2). Everything else is incremental.
Critical path for first cloud release (business ask): local event store → outbox transport with redaction + opt-in → cloud ingest → cloud /usage and /status endpoints. The local UI work above is not a prerequisite for the cloud cut, but most of the local-side primitives (event store, durable-event filtering) are shared, so doing them in order minimizes rework.
8. Risks and tradeoffs the architect should weigh
- Event volume. LLM_TEXT_DELTA fires per token. A persisted store must filter: don't write deltas, write LLM_TURN_COMPLETE. This is the #1 way the table blows up.
- Privacy / desktop posture (the central architectural constraint). The runtime is local by default (config.py:20-44). The data inventory in §3.8 confirms that no data leaves the user's machine today, including the data the business ask needs in the cloud. Closing that gap is not "add a metrics push"; it is a new system boundary with: (a) explicit user opt-in (defaults must be safe for OSS / self-hosted users), (b) a documented redaction list (no prompts, no tool args, no file paths in the default payload), (c) schema versioning so cloud aggregates do not break on runtime upgrades, (d) a clear answer for self-hosted / air-gapped deployments where the cloud sink is unreachable, (e) regional data-residency rules if the product is sold internationally. This is the single largest design decision in the document.
- Cost-table accuracy. cost_usd is computed from a static catalog. Using it for billing means committing to keeping the catalog current (or pulling from provider invoices). For display, the current approach is fine; for charging, it is not.
- Identity coupling. Events are session-scoped today. Adding user_id/tenant_id everywhere is invasive. Recommend pinning identity at the session boundary and joining on session at query time, rather than threading identity through every event payload.
- Status vs. heartbeat semantics. "Idle" is not "dead." A Queen sitting in independent waiting for a user message is healthy and should not page anyone. The stall-score in §5 must distinguish idle-by-design from stalled-by-bug. The existing STREAM_INACTIVE / NODE_STALLED events already make this distinction; preserve it.
- Backpressure from observability. If usage tracking sits in the LLM call path (for quotas), it must not add latency. Recommend: meter is async/eventual for display; only quota checks are synchronous, and only when the customer has a quota.
- Worker-side gap. Worker LLM calls are accounted in their own session's L1–L3 logs but are not automatically rolled into the parent Queen session. Cost attribution from Queen → spawned colony requires either (a) a parent_session_id field on the colony's session row, or (b) walking the COLONY_CREATED event graph at query time. (a) is cleaner.
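Option (a) from the worker-side gap can be illustrated with a small rollup. The sketch assumes each session row carries session_id, an optional parent_session_id, and its own cost_usd; parent_session_id does not exist today, and rollup_cost_to_root is a hypothetical name.

```python
from collections import defaultdict
from typing import Any, Dict, List


def rollup_cost_to_root(sessions: List[Dict[str, Any]]) -> Dict[str, float]:
    """Attribute every session's own cost to the root of its parent chain."""
    parent = {s["session_id"]: s.get("parent_session_id") for s in sessions}
    totals: Dict[str, float] = defaultdict(float)
    for session in sessions:
        root = session["session_id"]
        seen = {root}
        # Walk up until a session has no parent or its parent is not in the set.
        while parent.get(root) is not None and parent[root] in parent:
            root = parent[root]
            if root in seen:
                raise ValueError(f"cycle in parent_session_id chain at {root!r}")
            seen.add(root)
        totals[root] += float(session.get("cost_usd") or 0.0)
    return dict(totals)
```

With the field in place this is a single pass over session rows; option (b) would need to replay COLONY_CREATED events to rebuild the same parent map first.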
9. Recommendation
Ship in four thin slices. The first two are local-only and unblock the desktop UI; the last two are what actually deliver the business ask of cloud-visible status and metrics.
- Slice 1: Live local status (1 sprint, fully local). SSE bridge + /sessions/{id}/events/stream + /sessions/{id}/health + /sessions/{id}/tree. Frontend (local UI) gets the right-rail and the agent-tree. No persistence work, no cloud. (S1–S5, S8.)
- Slice 2: Per-session local usage store (1–2 sprints, fully local). Persisted event store (SQLite outbox at HIVE_HOME/runtime.db), filtered to durable event types only. /sessions/{id}/usage + /colonies/{id}/usage. No identity, no rollups, no cloud transport yet. This is the foundation the cloud slice rides on. (U1–U5.)
- Slice 3: Local→cloud outbox + cloud ingest (the cloud cut, scope-defining). Durable outbox queue, redaction policy, opt-in toggle, identity attachment, schema versioning, retry/backoff. Cloud-side ingest service + aggregate store. This is where the local-only world becomes a cloud product. Architect must decide §3.8 Shape, §5 storage, redaction defaults, and identity model before this slice can start.
- Slice 4: Cloud rollups, dashboards, quotas (scope TBD with product). Tenant aggregation, daily/monthly rollups, quota enforcement, OTel export, business dashboards. (U6–U8.) Defer until business confirms billing model, because the answer (per-seat vs. per-token vs. per-colony) changes the data model.
Slices 1 and 2 are mostly wiring — the events exist, the schemas exist, the storage paths exist. Slice 3 is the first slice that introduces a new architectural boundary (local→cloud transport + identity + privacy contract); everything novel about the business ask lives there. Slice 4 is business design, not engineering scope.
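The "durable event types only" filter that Slice 2 depends on could be as small as the sketch below. Which event types count as durable is an assumption made here for illustration; the only firm guidance in this document is that per-token deltas are dropped and LLM_TURN_COMPLETE is kept. The names DURABLE_EVENT_TYPES and durable_only are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator

# Assumed durable set: one row per completed unit of work, never per token.
DURABLE_EVENT_TYPES = frozenset({
    "EXECUTION_STARTED",
    "EXECUTION_COMPLETED",
    "EXECUTION_FAILED",
    "QUEEN_PHASE_CHANGED",
    "COLONY_CREATED",
    "WORKER_COMPLETED",
    "WORKER_FAILED",
    "LLM_TURN_COMPLETE",
    "TOOL_CALL_COMPLETED",
})


def durable_only(events: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield only events whose 'type' is in the durable allowlist."""
    for event in events:
        if event.get("type") in DURABLE_EVENT_TYPES:
            yield event
```

Placing this filter in front of the store writer, rather than at query time, keeps LLM_TEXT_DELTA and LLM_REASONING_DELTA out of the table entirely.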
10. Open questions for the architect
The first four are direct consequences of the local-first / cloud-required gap surfaced in §3.8 and §8.2.
- Cloud transport shape — Shape A, B, or C from §3.8? This decision is upstream of the entire data model. Recommend Shape B (outbox push) absent a strong privacy argument for Shape A.
- Redaction default for the cloud payload. What goes (model, token counts, latency, tool names, status) vs. what stays local (prompts, tool arguments, tool results, file paths, conversation content)? Need a written allowlist before Slice 3 starts.
- Self-hosted / air-gapped users. If the cloud sink is unreachable or disabled, what does the runtime do — buffer indefinitely, drop oldest, or refuse to start? Defaults differ for OSS vs. SaaS distributions.
- Identity binding point. Do we attach user_id/tenant_id at event-emit time (invasive, threads identity through every node), at session-create time (clean, requires session-level auth), or at outbox-flush time (simplest, but loses per-event provenance)? Recommend session-create.
- Do we need quota enforcement, or only quota visibility in v1?
- Frontend doc: are status and usage rendered in the same panel or different surfaces? This affects whether we ship one merged endpoint or two.
- Are we willing to pay the cost-table maintenance burden, or should "cost" stay labeled as estimated and not be used for invoicing?
Appendix — Pointers
- Queen lifecycle: core/framework/agents/queen/nodes/__init__.py
- Event bus + types: core/framework/host/event_bus.py
- Runtime log schemas: core/framework/tracker/runtime_log_schemas.py
- Runtime log store: core/framework/tracker/runtime_log_store.py
- LLM accounting: core/framework/llm/provider.py, model_catalog.py
- Colony progress DB: core/framework/host/progress_db.py
- Task events: core/framework/tasks/events.py
- Session HTTP: core/framework/server/routes_sessions.py
- SSE primitive: core/framework/server/sse.py
- Worker health: core/framework/tools/worker_monitoring_tools.py
- Config / env vars: core/framework/config.py