diff --git a/core/framework/agents/queen/nodes/__init__.py b/core/framework/agents/queen/nodes/__init__.py
index 6ef60eca..fe9d9c4d 100644
--- a/core/framework/agents/queen/nodes/__init__.py
+++ b/core/framework/agents/queen/nodes/__init__.py
@@ -240,19 +240,15 @@ See "Independent execution" for the per-step flow and granularity rule.
 
 ## File I/O (files-tools MCP)
 - read_file, write_file, edit_file, search_files
-- edit_file covers single-file fuzzy find/replace (mode='replace', default) \
+- edit_file covers single-file fuzzy find/replace (mode='replace', default) \
 and multi-file structured patches (mode='patch'). Patch mode supports \
 Update / Add / Delete / Move atomically across many files in one call.
-- search_files covers grep/find/ls in one tool: target='content' to \
+- search_files covers grep/find/ls in one tool: target='content' to \
 search inside files, target='files' (with a glob like '*.py') to list \
-or find files. Mtime-sorted in files mode.
+or find files.
 
 ## Browser Automation (gcu-tools MCP)
-- Use `browser_*` tools — `browser_open(url)` is the cold-start entry point \
-  (lazy-creates the context; no separate "start" call). Then `browser_navigate`, \
-  `browser_click`, `browser_type`, `browser_snapshot`, \
-  `browser_screenshot`, `browser_scroll`, \
-  `browser_tabs`, `browser_close`, `browser_evaluate`, etc.
+- Use `browser_*` tools — `browser_open(url)` is the cold-start entry point
 - MUST Follow the browser-automation skill protocol before using browser tools.
 
 ## Hand off to a colony
@@ -261,9 +257,7 @@ or find files. Mtime-sorted in files mode.
 chat. It does NOT fork on its own; it spawns a one-shot evaluator \
 that reads this conversation and decides whether the spec is settled \
 enough to proceed. On approval your phase flips to INCUBATING and a \
-new tool surface (including create_colony itself) unlocks. On \
-rejection you stay here and keep the conversation going to fill the \
-gaps the evaluator named.
+new tool surface (including create_colony itself) unlocks.
 """
 
 _queen_tools_incubating = """
@@ -411,17 +405,19 @@ asks for specifics. Do not invent a new pass unless the user asks for one.
 
 _queen_behavior_independent = """
 ## Independent execution
-You are the agent. **For multi-step work (2+ atomic actions): call \
-`task_create_batch`** with one entry per atomic action, \
-before you touch any other tool. \
-Then work the list one task at a time:
+You are the agent. You behave this way:
+1. Identify if the user's prompt is a task assignment. If it is, \
+use ask_user to clarify the scope and detail requirements, then always use \
+`task_create_batch` to create a multi-step action plan.
 
-1. `task_update` → in_progress before you start the step.
-2. Do one real inline instance — open the browser, call the real API, \
+2. `task_update` → in_progress before you start the step.
+
+3. Do one real inline instance — open the browser, call the real API, \
 write to the real file. If the action is irreversible or touches \
 shared systems, show and confirm before executing. Report concrete \
 evidence (actual output, what worked / failed) after the run.
-3. `task_update` → completed THE MOMENT it's done. **Do not let \
+
+4. `task_update` → completed THE MOMENT it's done. **Do not let \
 multiple finished tasks pile up unmarked.** There is no batch update \
 tool by design — each `completed` transition is a discrete progress \
 heartbeat in the user's right-rail panel. Without those transitions \
@@ -430,14 +426,14 @@ done.
 **Granularity: one task per atomic action, not one umbrella per project.** \
-Once finishing all current tasks, discuss with user about building \
-a colony so this sucess can be repeated or scaled
+Once finishing a current task, discuss with user about building \
+a colony so this success outcome can be repeated or scaled
 
 ### How to handle large scale tasks
-If the user ask you to finish the same task repeatly or at large scale \
-(more than 10 times), tell the user that you can do it once first then \
+If the user asks you to finish the same task repeatedly or at large scale \
+(more than 3 times), tell the user that you can do it once first then \
 build a colony to fulfill the request but succeeding it once will be \
-beneficial to run it in the future, \
+beneficial to transfer it to a swarm of workers (through start_incubating_colony), \
 then focus on finishing the task once first.
 
 ### How to handle simple task (less then 2 atomic items)
diff --git a/docs/architecture/agent-usage-and-status-tracking.md b/docs/architecture/agent-usage-and-status-tracking.md
new file mode 100644
index 00000000..4faa298e
--- /dev/null
+++ b/docs/architecture/agent-usage-and-status-tracking.md
@@ -0,0 +1,331 @@
# Agent Usage & Status Tracking — Capability Document

**Audience:** Lead software architect (paired with frontend + business requirement docs)
**Scope:** Queen agent first (default local-runtime entry point), then downstream colonies/workers
**Status:** Capability inventory + proposal — no implementation commitments
**Date:** 2026-05-04

---

## 1. Why this document exists

We have a business need to track **agent usage** (what was consumed: tokens, cost, runtime, calls) and **agent status** (what state agents are in: alive, phase, progress, blocked) starting from the Queen agent, and surface this **on the cloud** for product and business consumers. This document inventories the capabilities the runtime can expose **today** vs. what is **net-new**, so architecture can pick a scope before frontend and product write against it.

The Queen is the right anchor: every local-runtime session today starts with a Queen, and every colony/worker is forked from a Queen call — so a tracking surface rooted at the Queen automatically covers the whole agent tree.

> **Headline constraint for the architect:** the runtime is **local-by-default**. Every byte described in §3 — events, runtime logs, progress DBs, session state, LLM cost numbers — is written to the user's machine under `~/.hive/` (or the platform-specific Electron `userData` directory). **Nothing is shipped to the cloud today.** The business ask therefore implies a new local→cloud transport boundary, with the data-residency, privacy, and identity decisions that come with it. §4 makes the gap explicit per-surface; §8 lists the risks; §9 frames the cloud cut-over as the gating decision for any "Slice 2+" work.

---

## 2. Vocabulary — what we actually mean

| Term | Definition in this codebase |
|---|---|
| **Session** | One Queen runtime instance. ID: `session_{YYYYMMDD_HHMMSS}_{uuid8}`. Persisted at `~/.hive/sessions/{session_id}/`. |
| **Queen** | Long-lived conversational agent. One per session. Single event-loop node. Phases: `independent → incubating → working → reviewing`. See [queen/nodes/__init__.py:494-518](../../core/framework/agents/queen/nodes/__init__.py#L494-L518). |
| **Colony** | Persistent stateless container forked by `create_colony`. Has its own SQLite progress DB at `~/.hive/colonies/{colony_name}/data/progress.db`. |
| **Worker** | Ephemeral agent running inside a colony to execute a task. |
| **Run / execution** | One trigger-to-completion invocation inside a node. Carries `run_id`, `execution_id`, `trace_id` (OTel-aligned). |
| **Usage** | Quantitative consumption: input/output/cached tokens, USD cost, wall-clock latency, tool-call counts. |
| **Status** | Qualitative state: phase, alive/stalled, current task, blocked-on, queue depth, last-heartbeat. |

---

## 3. What exists today (capabilities, not commitments)

The runtime is already heavily instrumented. Most of what business wants is **already emitted** — the gap is persistence, aggregation, and a stable API surface.

### 3.1 Event Bus — the spine

[core/framework/host/event_bus.py:61-177](../../core/framework/host/event_bus.py#L61-L177) defines an in-process async pub/sub with **40+ event types** scoped by `stream_id`, `session_id`, `colony_id`, `execution_id`, `run_id`, `correlation_id`, `timestamp`.

Relevant for usage/status:

- **Lifecycle:** `EXECUTION_STARTED/COMPLETED/FAILED/PAUSED/RESUMED/RESURRECTED`
- **Queen:** `QUEEN_PHASE_CHANGED`, `QUEEN_IDENTITY_SELECTED`
- **Colony/Worker:** `COLONY_CREATED`, `WORKER_COLONY_LOADED`, `WORKER_COMPLETED`, `WORKER_FAILED`, `SUBAGENT_REPORT`
- **LLM:** `LLM_TURN_COMPLETE`, `LLM_TEXT_DELTA`, `LLM_REASONING_DELTA`, `CONTEXT_USAGE_UPDATED`
- **Tools:** `TOOL_CALL_STARTED`, `TOOL_CALL_COMPLETED`, `TOOL_CALL_REPLAY_DETECTED`
- **Health:** `NODE_STALLED`, `NODE_TOOL_DOOM_LOOP`, `STREAM_TTFT_EXCEEDED`, `STREAM_INACTIVE`, `STREAM_NUDGE_SENT`
- **Tasks (right-rail panel):** `TASK_CREATED`, `TASK_UPDATED`, `TASK_DELETED`, `TASK_LIST_RESET`
- **Triggers:** `TRIGGER_AVAILABLE/ACTIVATED/DEACTIVATED/FIRED/REMOVED/UPDATED`

> Persistence today: in-memory only, **plus** optional JSONL export when `HIVE_DEBUG_EVENTS=1` ([event_bus.py:33-54](../../core/framework/host/event_bus.py#L33-L54)). There is no SQL events table.

### 3.2 Three-level runtime logs (per session)

[core/framework/tracker/runtime_log_schemas.py](../../core/framework/tracker/runtime_log_schemas.py) defines:

| Level | Schema | File | Granularity |
|---|---|---|---|
| L1 | `RunSummaryLog` | `summary.json` | Per graph run — totals + execution_quality + trace_id |
| L2 | `NodeDetail` | `details.jsonl` | Per node — exit_status, input/output tokens, latency_ms, retry/accept/escalate/continue counts |
| L3 | `NodeStepLog` | `tool_logs.jsonl` | Per LLM step — tool calls, verdicts, error traces, latency_ms |

Storage: `~/.hive/sessions/{session_id}/logs/` ([runtime_log_store.py](../../core/framework/tracker/runtime_log_store.py)). Schemas already carry OTel fields (`trace_id`, `span_id`, `parent_span_id`) — wire-ready, not yet exported.

### 3.3 LLM call accounting

[core/framework/llm/provider.py:11-32](../../core/framework/llm/provider.py#L11-L32) — `LLMResponse` carries: `model`, `input_tokens`, `output_tokens`, `cached_tokens`, `cache_creation_tokens`, `cost_usd`, `stop_reason`. Cost is computed from [model_catalog.py](../../core/framework/llm/model_catalog.py) when the model is priced; otherwise `0.0`.

> Gap: cost lives in the response object and is rolled into L2/L3 logs, but is **not** in the event bus stream and **not** in any aggregate query surface.
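As an illustration of how far the existing logs already reach, a per-session rollup could be derived today by walking the L2 file from §3.2 — a minimal sketch only, assuming each `details.jsonl` record carries the token and cost fields described above (field names are not a stable contract):

```python
# Illustrative only — not a committed API. Rolls up one session's usage from
# the L2 log described in §3.2; field names are assumptions, not a schema.
import json
from pathlib import Path

def local_session_usage(session_id: str, hive_home: Path = Path.home() / ".hive") -> dict:
    details = hive_home / "sessions" / session_id / "logs" / "details.jsonl"
    totals = {"input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0}
    with details.open() as fh:
        for line in fh:
            record = json.loads(line)
            totals["input_tokens"] += record.get("input_tokens", 0)
            totals["output_tokens"] += record.get("output_tokens", 0)
            totals["cost_usd"] += record.get("cost_usd", 0.0)
    return totals
```

The point of the sketch is the gap it exposes: the numbers exist on disk, but only a file walk — not an event stream or an aggregate table — can answer "what did this session cost."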
+ +### 3.4 Colony Progress DB + +[core/framework/host/progress_db.py:44-110](../../core/framework/host/progress_db.py#L44-L110) — per-colony SQLite (WAL mode): + +- `tasks` (id, seq, priority, goal, status: pending|claimed|started|completed|failed, worker_id, claimed_at, started_at, completed_at, retry_count, last_error) +- `steps`, `sop_checklist`, `colony_meta` + +This is the closest thing we have to a status SQL store today, but it is **per-colony** and **task-shaped** — not session-shaped or usage-shaped. + +### 3.5 Queen task system (right-rail panel) + +The mechanism the IDE-selection prompt describes is real: each `task_update` emits `TASK_UPDATED` on the bus, which a future SSE/WS surface can stream. State transitions: `pending → in_progress → completed`. Task body carries `subject`, `active_form`, `blocks`, `blocked_by`, `metadata`. Source: [tasks/events.py:52-159](../../core/framework/tasks/events.py#L52-L159). + +### 3.6 HTTP surface (already shipping) + +[core/framework/server/routes_sessions.py](../../core/framework/server/routes_sessions.py): + +- `POST /api/sessions` — create +- `GET /api/sessions/{session_id}` — current state including `queen_phase`, `queen_model`, `colony_id`, `uptime_seconds` +- `GET /api/sessions/{session_id}/stats` — runtime statistics (extension point) +- `GET /api/sessions/{session_id}/events/history` — replay persisted events + +SSE primitive exists at [server/sse.py](../../core/framework/server/sse.py) but is **not yet wired to a global event-stream route**. This is the natural attach point for a real-time status feed. + +### 3.7 Worker health snapshot + +`get_worker_health_summary()` ([worker_monitoring_tools.py:71-99](../../core/framework/tools/worker_monitoring_tools.py#L71-L99)) returns: `session_id`, `session_status`, `total_steps`, `recent_verdicts`, `stall_minutes`, `evidence_snippet`. Used today by Queen during the WORKING phase; can be exposed via API. + +--- + +## 3.8 Where every byte lives today (data residency map) + +Every storage location below is **on the end-user's machine**. There is no cloud sink, no telemetry endpoint, no managed database, no analytics service. The HTTP server in [core/framework/server/](../../core/framework/server/) binds to localhost for the desktop UI; it is not a cloud API. + +`HIVE_HOME` defaults to `~/.hive/` and is overridden by the desktop shell to the platform `userData` dir (e.g. `~/Library/Application Support/Hive/` on macOS, `%APPDATA%\Hive\` on Windows). Source: [config.py:20-44](../../core/framework/config.py#L20-L44). + +| Data | On-disk location (per machine) | Format | Lifetime | Currently shipped off-device? 
| +|---|---|---|---|---| +| Event bus stream | in-process memory only | Python objects | Process lifetime | No | +| Event debug log (opt-in) | `HIVE_HOME/event_logs/.jsonl` when `HIVE_DEBUG_EVENTS=1` | JSONL | Until user deletes | No | +| Session state | `HIVE_HOME/sessions/{session_id}/state.json` | JSON | Until user deletes | No | +| Conversations | `HIVE_HOME/sessions/{session_id}/conversations/` | JSON | Until user deletes | No | +| Artifacts | `HIVE_HOME/sessions/{session_id}/artifacts/` | mixed | Until user deletes | No | +| L1 run summary (tokens, cost, quality) | `HIVE_HOME/sessions/{session_id}/logs/summary.json` | JSON | Until user deletes | No | +| L2 node details | `HIVE_HOME/sessions/{session_id}/logs/details.jsonl` | JSONL | Until user deletes | No | +| L3 step / tool logs | `HIVE_HOME/sessions/{session_id}/logs/tool_logs.jsonl` | JSONL | Until user deletes | No | +| Colony task / step / SOP state | `HIVE_HOME/colonies/{colony_name}/data/progress.db` | SQLite (WAL) | Until user deletes | No | +| Queen / colony / skill / memory configs | `HIVE_HOME/{queens,colonies,skills,memories}/` | files | Until user deletes | No | +| LLM `cost_usd` numbers | computed in-process from [model_catalog.py](../../core/framework/llm/model_catalog.py), then written into L1/L2/L3 logs above | — | Same as logs | No | + +**What this means for the cloud requirement:** the question for the architect is not "where do we get the data" — the data is fully captured. The question is **"how does it leave the machine, in what shape, with whose consent, and where does it land."** That decision is upstream of every endpoint in §6 and every storage option in §5. + +Three architectural shapes worth considering (architect to choose): + +- **Shape A — On-device only, queried over LAN/tunnel.** Cloud product reaches into the runtime via an authenticated tunnel; no data is replicated. Strongest privacy. Hardest for cross-device rollups. +- **Shape B — Outbox push.** Runtime keeps the local store as source of truth and asynchronously pushes a redacted, billing-grade subset (no prompts, no tool args by default) to a cloud aggregate. Best fit for the typical "agent status dashboard + usage rollup" product. +- **Shape C — Cloud-first runtime.** Runtime writes events directly to a cloud bus and treats local files as a cache. Largest rewrite; not recommended for a desktop-first product. + +Shape B is the lowest-friction path to the stated business outcome. The rest of this document is written with Shape B as the default and calls out where Shape A or C would change things. + +--- + +## 4. Capability matrix — what we can offer + +Each row is a candidate frontend/business surface, scored by feasibility from current state. 
+ +| # | Capability | Status | Backed by | +|---|---|---|---| +| **Status** | | | | +| S1 | Queen phase indicator (independent/incubating/working/reviewing) | **Ready** | `QUEEN_PHASE_CHANGED` event + session detail field | +| S2 | Per-task progress (right-rail) | **Ready** | `TASK_*` events | +| S3 | Live LLM streaming indicator (typing, thinking, tool-calling) | **Ready** | `LLM_TEXT_DELTA`, `LLM_REASONING_DELTA`, `TOOL_CALL_STARTED/COMPLETED` | +| S4 | Stall / stuck-agent detection | **Ready** | `NODE_STALLED`, `STREAM_INACTIVE`, `NODE_TOOL_DOOM_LOOP` | +| S5 | Colony tree (Queen → colonies → workers) snapshot | **Partial** — data exists in session/colony stores; need a join query | +| S6 | Worker health roll-up across colonies | **Partial** — per-worker tool exists; needs aggregation route | +| S7 | Liveness heartbeat ("agent X last seen Y ago") | **Net-new** — must derive from event timestamps or add a periodic ping | +| S8 | Trigger schedule (when will Queen wake next) | **Ready** | `TRIGGER_*` events | +| **Usage** | | | | +| U1 | Tokens per session (input/output/cached) | **Partial** — captured per-step in L3, summed in L1; no API | +| U2 | USD cost per session / colony / model | **Partial** — `cost_usd` per LLM call in logs; no rollup | +| U3 | Tool-call counts and types | **Partial** — events exist; no aggregate | +| U4 | Wall-clock runtime and active-time per agent | **Partial** — derivable from `EXECUTION_STARTED/COMPLETED` | +| U5 | Cost attribution per Queen-spawned colony | **Partial** — `colony_id` is on every event; needs a query | +| U6 | Per-user / per-tenant aggregation | **Net-new** — there is no user/tenant identity in events today | +| U7 | Daily / monthly usage rollups for billing | **Net-new** — requires persistent event store | +| U8 | Quota / cap enforcement (block when over budget) | **Net-new** — requires real-time meter + policy hook | + +**Read of the matrix:** ~70% of "status" surfaces are **shipping-grade today** behind a thin local API. ~70% of "usage" surfaces need a **persistence + aggregation layer**. The events themselves are not the bottleneck. + +**Local vs. cloud read of the same matrix.** Every "Ready" / "Partial" cell above is *ready in-process on the local machine*. Making each row visible to a **cloud** consumer adds an additional step: + +| Capability class | Local (today / near-term) | Cloud (business ask) | +|---|---|---| +| Live status (S1–S4, S8) | Stream from in-process event bus over local SSE | Push events through outbox → cloud relay → cloud SSE/WS to product UI. | +| Tree / health (S5, S6) | Join local session + colony stores | Same join, but on cloud-side replica of session/colony index. | +| Liveness (S7) | Derive from local event timestamps | Requires the runtime to *post* a heartbeat; cloud cannot infer aliveness from absence. | +| Per-session usage (U1–U5) | Read L1/L2/L3 logs on disk | Outbox sends durable rows (no deltas) to cloud usage table. | +| Tenant rollups (U6–U7) | Not possible — no identity in events | Cloud-side aggregation keyed on session→user join, identity attached at outbox time. | +| Quotas (U8) | Local meter feasible, but pointless without cloud truth | Cloud is the meter of record; runtime calls home to check. | + +--- + +## 5. 
---

## 5. Proposed data model (architect to validate)

Three new persisted entities, plus reuse of existing event types:

```
AgentSession                    UsageEvent                   StatusSnapshot
-----------                     ----------                   --------------
session_id (PK)                 id (PK)                      session_id (FK)
queen_id                        session_id (FK)              taken_at
queen_model                     colony_id                    phase
started_at                      worker_id                    active_run_id
ended_at                        agent_role (queen|worker)    active_node
status (active|done|failed)     event_type (LLM|TOOL|...)    open_task_count
user_id (when multi-tenant)     model                        in_flight_workers
tenant_id (when multi-tenant)   input_tokens                 last_event_at
total_input_tokens              output_tokens                stall_score
total_output_tokens             cached_tokens
total_cached_tokens             cost_usd
total_cost_usd                  latency_ms
total_tool_calls                tool_name (nullable)
last_event_at                   occurred_at
                                trace_id
                                execution_id
```

Storage choice (architect call). **All three options today are local; only Option C reaches the cloud business surface.**

- **Option A — local SQLite outbox** at `HIVE_HOME/runtime.db`. Pros: zero infra, fits desktop, makes local queries cheap. Cons: per-host; no cross-device aggregation; **does not satisfy the cloud requirement on its own.**
- **Option B — DuckDB on the existing JSONL event logs.** Pros: zero ingestion code; analyst-friendly. Cons: cold-start latency on big histories; **also local-only.**
- **Option C — push events to a managed cloud store** (Postgres, ClickHouse, BigQuery) via an outbox pattern. Pros: cross-host rollups, billing-grade, the only option that actually delivers the cloud-visible status/metrics product. Cons: introduces a new transport, identity, and privacy/redaction story; needs explicit user opt-in for desktop builds.

The realistic shape is the hybrid called out in §3.8 Shape B: **A locally** as the durable buffer and source of truth, **C in the cloud** as the business-facing aggregate, with a one-way outbox that moves a *redacted, durable-event-only* subset over the wire. This document recommends that hybrid; everything in §6 and §7 is written against it.

---

## 6. Surface API — what frontend would consume

All routes assume the event-bus → SSE bridge exists (the one missing wire — see §3.6). Frontend sees this from day one.

> **Locality note.** The `/api/...` routes below are served by the **local runtime HTTP server** today. For the cloud product, the same shapes need a cloud-side counterpart fed by the outbox. Two practical patterns: (1) cloud product calls cloud-hosted versions of these routes (against the aggregate), or (2) cloud product proxies authenticated requests back to the user's runtime. §3.8 Shape A vs. Shape B picks between them.

### Real-time channel

```
GET /api/sessions/{session_id}/events/stream   (SSE)
    ↳ filter=phase,task,llm_stream,tool,worker,trigger,health
GET /api/agents/queen/stream                   (SSE) — global queen events
```

### Status reads

```
GET /api/sessions/{session_id}           — already shipping
GET /api/sessions/{session_id}/tree      — Queen → colonies → workers
GET /api/sessions/{session_id}/health    — stall_score, last_event_at, in_flight
GET /api/colonies/{colony_id}/workers    — health roll-up
```
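If the stream and health routes ship in roughly this shape, a desktop panel (or a smoke-test script) could consume them as sketched below. This is written against the *proposed* routes above using the `requests` library; the event payload fields are assumptions, not a contract:

```python
# Sketch against the proposed §6 routes — not a shipped client.
import json
import requests

BASE = "http://127.0.0.1:8000/api"  # local runtime server; port is an assumption

def watch_session(session_id: str) -> None:
    """Print phase/task/health events as they arrive over the proposed SSE route."""
    url = f"{BASE}/sessions/{session_id}/events/stream"
    with requests.get(url, params={"filter": "phase,task,health"}, stream=True) as resp:
        resp.raise_for_status()
        for raw in resp.iter_lines():
            if raw and raw.startswith(b"data:"):
                event = json.loads(raw[len(b"data:"):])
                print(event.get("event_type"), event.get("timestamp"))

def session_health(session_id: str) -> dict:
    """One-shot read of the proposed health endpoint."""
    return requests.get(f"{BASE}/sessions/{session_id}/health", timeout=5).json()
```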
### Usage reads

```
GET /api/sessions/{session_id}/usage                 — tokens, cost, latency, tool-calls
GET /api/sessions/{session_id}/usage/by-model        — split by model
GET /api/colonies/{colony_id}/usage                  — same shape, colony scope
GET /api/agents/queen/usage?range=...&group_by=...   — rollup view (billing)
```

### Admin / business

```
GET  /api/usage/rollup?range=...&group_by=user|tenant|model|colony
POST /api/quotas/{tenant}   — set caps (if quota work in scope)
```

---

## 7. Net-new work — sized in shirt sizes, not days

| Workstream | Local / Cloud | Size | Depends on | Notes |
|---|---|---|---|---|
| Event-bus → local SSE bridge ([sse.py](../../core/framework/server/sse.py) exists, route does not) | Local | **S** | — | Unlocks all real-time status surfaces in the desktop UI. Highest leverage piece. |
| Persisted local event store (SQLite outbox) | Local | **M** | Decision §5 | One writer, append-only; reuse existing JSONL writer. Source of truth for cloud push. |
| Local aggregation queries + `/usage` endpoints | Local | **M** | Persisted store | Per-session usage on disk. |
| **Outbox transport (local → cloud)** | **Boundary** | **M–L** | Local store + auth | New work: durable queue, retry, redaction policy, opt-in switch, schema versioning. This is the bridge to the cloud product. |
| **Cloud event ingest + aggregate store** | **Cloud** | **L** | Outbox transport | New cloud infra (Postgres/ClickHouse/BigQuery). Hosting, ops, retention policy, access controls. |
| **Cloud-side status/usage API + dashboards** | **Cloud** | **M** | Cloud aggregate | Mirrors §6 endpoints against the cloud store; this is what business users actually see. |
| Identity layer (user_id / tenant_id on events) | Boundary | **M** | Auth model | Currently no user identity in events. Identity attaches at outbox time, not at emit time. |
| OpenTelemetry exporter (schema is ready) | Boundary | **S–M** | — | `trace_id`/`span_id` already populated; an OTel collector can be the cloud sink instead of a custom outbox. |
| Quota / policy hooks | Cloud-authoritative | **L** | Cloud store + identity | Cloud holds the meter; runtime calls home synchronously on a critical path. |
| Liveness/heartbeat (S7) | Local emit, cloud consume | **S** | Outbox | Runtime must actively post; cloud cannot infer liveness from absence. |
| Cost attribution UI rollups | Cloud | **S** | `/usage` cloud endpoints | Shared with frontend doc. |

**Critical path for first frontend release (local desktop UI):** SSE bridge → status endpoints (S1–S5) → per-session usage endpoint (U1, U2). Everything else is incremental.

**Critical path for first cloud release (business ask):** local event store → outbox transport with redaction + opt-in → cloud ingest → cloud `/usage` and `/status` endpoints. The local UI work above is *not* a prerequisite for the cloud cut, but most of the local-side primitives (event store, durable-event filtering) are shared, so doing them in order minimizes rework.

---

## 8. Risks and tradeoffs the architect should weigh

1. **Event volume.** `LLM_TEXT_DELTA` fires per token. A persisted store must filter — don't write deltas, write `LLM_TURN_COMPLETE`. This is the #1 way the table blows up.
2. **Privacy / desktop posture — the central architectural constraint.** The runtime is local by default ([config.py:20-44](../../core/framework/config.py#L20-L44)). The data inventory in §3.8 confirms that **no data leaves the user's machine today**, including the data the business ask needs in the cloud. Closing that gap is not "add a metrics push" — it is a new system boundary with: (a) explicit user opt-in (defaults must be safe for OSS / self-hosted users), (b) a documented redaction list (no prompts, no tool args, no file paths in the default payload), (c) schema versioning so cloud aggregates do not break on runtime upgrades, (d) a clear answer for self-hosted / air-gapped deployments where the cloud sink is unreachable, (e) regional data-residency rules if the product is sold internationally. This is the single largest design decision in the document.
3. **Cost-table accuracy.** `cost_usd` is computed from a static catalog. Using it for billing means committing to keeping the catalog current (or pulling from provider invoices). For *display*, the current approach is fine; for *charging*, it is not.
4. **Identity coupling.** Events are session-scoped today. Adding `user_id`/`tenant_id` everywhere is invasive. Recommend pinning identity at the *session* boundary and joining on session at query time, rather than threading identity through every event payload.
5. **Status vs. heartbeat semantics.** "Idle" is not "dead." A Queen sitting in `independent` waiting for a user message is healthy and should not page anyone. The stall-score in §5 must distinguish idle-by-design from stalled-by-bug — the existing `STREAM_INACTIVE` / `NODE_STALLED` events already make this distinction; preserve it.
6. **Backpressure from observability.** If usage tracking sits in the LLM call path (for quotas), it must not add latency. Recommend: meter is async/eventual for display; only quota checks are synchronous, and only when the customer has a quota.
7. **Worker-side gap.** Worker LLM calls are accounted in their own session's L1–L3 logs but are *not* automatically rolled into the parent Queen session. Cost attribution from Queen → spawned colony requires either (a) a parent_session_id field on the colony's session row, or (b) walking the `COLONY_CREATED` event graph at query time. (a) is cleaner.

---

## 9. Recommendation

Ship in four thin slices. The first two are local-only and unblock the desktop UI; the last two are what actually deliver the business ask of cloud-visible status and metrics.

1. **Slice 1 — Live local status (1 sprint, fully local).** SSE bridge + `/sessions/{id}/events/stream` + `/sessions/{id}/health` + `/sessions/{id}/tree`. Frontend (local UI) gets the right-rail and the agent-tree. No persistence work, no cloud. (S1–S5, S8.)
2. **Slice 2 — Per-session local usage store (1–2 sprints, fully local).** Persisted event store (SQLite outbox at `HIVE_HOME/runtime.db`), filtered to durable event types only. `/sessions/{id}/usage` + `/colonies/{id}/usage`. No identity, no rollups, **no cloud transport yet**. This is the foundation the cloud slice rides on. (U1–U5.)
3. **Slice 3 — Local→cloud outbox + cloud ingest (the cloud cut, scope-defining).** Durable outbox queue, redaction policy, opt-in toggle, identity attachment, schema versioning, retry/backoff. Cloud-side ingest service + aggregate store. **This is where the local-only world becomes a cloud product.** Architect must decide §3.8 Shape, §5 storage, redaction defaults, and identity model before this slice can start.
4. **Slice 4 — Cloud rollups, dashboards, quotas (scope TBD with product).** Tenant aggregation, daily/monthly rollups, quota enforcement, OTel export, business dashboards. (U6–U8.) Defer until business confirms billing model — the answer (per-seat vs. per-token vs. per-colony) changes the data model.

Slices 1 and 2 are mostly **wiring** — the events exist, the schemas exist, the storage paths exist. Slice 3 is the **first slice that introduces a new architectural boundary** (local→cloud transport + identity + privacy contract); everything novel about the business ask lives there. Slice 4 is **business design**, not engineering scope.

---

## 10. Open questions for the architect

The first four are direct consequences of the local-first / cloud-required gap surfaced in §3.8 and §8.2.

1. **Cloud transport shape — Shape A, B, or C from §3.8?** This decision is upstream of the entire data model. Recommend Shape B (outbox push) absent a strong privacy argument for Shape A.
2. **Redaction default for the cloud payload.** What goes (model, token counts, latency, tool names, status) vs. what stays local (prompts, tool arguments, tool results, file paths, conversation content)? Need a written allowlist before Slice 3 starts.
3. **Self-hosted / air-gapped users.** If the cloud sink is unreachable or disabled, what does the runtime do — buffer indefinitely, drop oldest, or refuse to start? Defaults differ for OSS vs. SaaS distributions.
4. **Identity binding point.** Do we attach `user_id` / `tenant_id` at event-emit time (invasive, threads identity through every node), at session-create time (clean, requires session-level auth), or at outbox-flush time (simplest, but loses per-event provenance)? Recommend session-create.
5. Do we need quota *enforcement*, or only quota *visibility* in v1?
6. Frontend doc: are status and usage rendered in the same panel or different surfaces? This affects whether we ship one merged endpoint or two.
7. Are we willing to pay the cost-table maintenance burden, or should "cost" stay labeled as estimated and not be used for invoicing?

---

## Appendix — Pointers

- Queen lifecycle: [core/framework/agents/queen/nodes/__init__.py](../../core/framework/agents/queen/nodes/__init__.py)
- Event bus + types: [core/framework/host/event_bus.py](../../core/framework/host/event_bus.py)
- Runtime log schemas: [core/framework/tracker/runtime_log_schemas.py](../../core/framework/tracker/runtime_log_schemas.py)
- Runtime log store: [core/framework/tracker/runtime_log_store.py](../../core/framework/tracker/runtime_log_store.py)
- LLM accounting: [core/framework/llm/provider.py](../../core/framework/llm/provider.py), [model_catalog.py](../../core/framework/llm/model_catalog.py)
- Colony progress DB: [core/framework/host/progress_db.py](../../core/framework/host/progress_db.py)
- Task events: [core/framework/tasks/events.py](../../core/framework/tasks/events.py)
- Session HTTP: [core/framework/server/routes_sessions.py](../../core/framework/server/routes_sessions.py)
- SSE primitive: [core/framework/server/sse.py](../../core/framework/server/sse.py)
- Worker health: [core/framework/tools/worker_monitoring_tools.py](../../core/framework/tools/worker_monitoring_tools.py)
- Config / env vars: [core/framework/config.py](../../core/framework/config.py)
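## Appendix — Illustrative redaction allowlist (discussion aid for §10 Q2)

As a starting point for the Slice 3 redaction conversation, the sketch below lists the fields a Shape-B outbox row might be allowed to carry. Field names mirror the §5 `UsageEvent` sketch and are illustrative only — this is not a committed schema:

```python
# Illustrative starting point for the Slice 3 redaction discussion (§10 Q2).
# Field names mirror the §5 UsageEvent sketch; none of this is a committed schema.
OUTBOX_ALLOWLIST = {
    "session_id", "colony_id", "worker_id", "agent_role",
    "event_type", "model",
    "input_tokens", "output_tokens", "cached_tokens", "cost_usd",
    "latency_ms", "tool_name", "occurred_at", "trace_id", "execution_id",
}

# Never shipped off-device by default: prompts, tool arguments, tool results,
# file paths, conversation content.
def redact(event: dict) -> dict:
    return {key: value for key, value in event.items() if key in OUTBOX_ALLOWLIST}
```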