feat: encourage action plan prompts

This commit is contained in:
Richard Tang
2026-05-04 17:53:26 -07:00
parent e7d4ce0057
commit 07600c5ab5
2 changed files with 350 additions and 23 deletions
+19 -23
View File
@@ -240,19 +240,15 @@ See "Independent execution" for the per-step flow and granularity rule.
## File I/O (files-tools MCP) ## File I/O (files-tools MCP)
- read_file, write_file, edit_file, search_files - read_file, write_file, edit_file, search_files
- edit_file covers single-file fuzzy find/replace (mode='replace', default) \ - edit_file covers single-file fuzzy find/replace (mode='replace', default) \
and multi-file structured patches (mode='patch'). Patch mode supports \ and multi-file structured patches (mode='patch'). Patch mode supports \
Update / Add / Delete / Move atomically across many files in one call. Update / Add / Delete / Move atomically across many files in one call.
- search_files covers grep/find/ls in one tool: target='content' to \ - search_files covers grep/find/ls in one tool: target='content' to \
search inside files, target='files' (with a glob like '*.py') to list \ search inside files, target='files' (with a glob like '*.py') to list \
or find files. Mtime-sorted in files mode. or find files.
## Browser Automation (gcu-tools MCP) ## Browser Automation (gcu-tools MCP)
- Use `browser_*` tools `browser_open(url)` is the cold-start entry point \ - Use `browser_*` tools `browser_open(url)` is the cold-start entry point
(lazy-creates the context; no separate "start" call). Then `browser_navigate`, \
`browser_click`, `browser_type`, `browser_snapshot`, \
<!-- vision-only -->`browser_screenshot`, <!-- /vision-only -->`browser_scroll`, \
`browser_tabs`, `browser_close`, `browser_evaluate`, etc.
- MUST Follow the browser-automation skill protocol before using browser tools. - MUST Follow the browser-automation skill protocol before using browser tools.
## Hand off to a colony ## Hand off to a colony
@@ -261,9 +257,7 @@ or find files. Mtime-sorted in files mode.
chat. It does NOT fork on its own; it spawns a one-shot evaluator \ chat. It does NOT fork on its own; it spawns a one-shot evaluator \
that reads this conversation and decides whether the spec is settled \ that reads this conversation and decides whether the spec is settled \
enough to proceed. On approval your phase flips to INCUBATING and a \ enough to proceed. On approval your phase flips to INCUBATING and a \
new tool surface (including create_colony itself) unlocks. On \ new tool surface (including create_colony itself) unlocks.
rejection you stay here and keep the conversation going to fill the \
gaps the evaluator named.
""" """
_queen_tools_incubating = """ _queen_tools_incubating = """
@@ -411,17 +405,19 @@ asks for specifics. Do not invent a new pass unless the user asks for one.
_queen_behavior_independent = """ _queen_behavior_independent = """
## Independent execution ## Independent execution
You are the agent. **For multi-step work (2+ atomic actions): call \ You are the agent. you behave this way:
`task_create_batch`** with one entry per atomic action, \ 1. Identify if the user's prompt is a task assignment. If it is, \
before you touch any other tool. \ Use ask_user to clarify the scope and detail requirements, then always use \
Then work the list one task at a time: the `task_create_batch` to create a multi-step action plan.
1. `task_update` in_progress before you start the step. 2. `task_update` in_progress before you start the step.
2. Do one real inline instance open the browser, call the real API, \
3. Do one real inline instance - either open the browser, call the real API, \
write to the real file. If the action is irreversible or touches \ write to the real file. If the action is irreversible or touches \
shared systems, show and confirm before executing. Report concrete \ shared systems, show and confirm before executing. Report concrete \
evidence (actual output, what worked / failed) after the run. evidence (actual output, what worked / failed) after the run.
3. `task_update` completed THE MOMENT it's done. **Do not let \
4. `task_update` completed THE MOMENT it's done. **Do not let \
multiple finished tasks pile up unmarked.** There is no batch update \ multiple finished tasks pile up unmarked.** There is no batch update \
tool by design each `completed` transition is a discrete progress \ tool by design each `completed` transition is a discrete progress \
heartbeat in the user's right-rail panel. Without those transitions \ heartbeat in the user's right-rail panel. Without those transitions \
@@ -430,14 +426,14 @@ done.
**Granularity: one task per atomic action, not one umbrella per project.** \ **Granularity: one task per atomic action, not one umbrella per project.** \
Once finishing all current tasks, discuss with user about building \ Once finishing a current task, discuss with user about building \
a colony so this sucess can be repeated or scaled a colony so this success outcome can be repeated or scaled
### How to handle large scale tasks ### How to handle large scale tasks
If the user ask you to finish the same task repeatly or at large scale \ If the user ask you to finish the same task repeatedly or at large scale \
(more than 10 times), tell the user that you can do it once first then \ (more than 3 times), tell the user that you can do it once first then \
build a colony to fulfill the request but succeeding it once will be \ build a colony to fulfill the request but succeeding it once will be \
beneficial to run it in the future, \ beneficial to run transfer it to a swarm of workers(through start_incubating_colony), \
then focus on finishing the task once first. then focus on finishing the task once first.
### How to handle simple task (less then 2 atomic items) ### How to handle simple task (less then 2 atomic items)
@@ -0,0 +1,331 @@
# Agent Usage & Status Tracking — Capability Document
**Audience:** Lead software architect (paired with frontend + business requirement docs)
**Scope:** Queen agent first (default local-runtime entry point), then downstream colonies/workers
**Status:** Capability inventory + proposal — no implementation commitments
**Date:** 2026-05-04
---
## 1. Why this document exists
We have a business need to track **agent usage** (what was consumed: tokens, cost, runtime, calls) and **agent status** (what state agents are in: alive, phase, progress, blocked) starting from the Queen agent, and surface this **on the cloud** for product and business consumers. This document inventories the capabilities the runtime can expose **today** vs. what is **net-new**, so architecture can pick a scope before frontend and product write against it.
The Queen is the right anchor: every local-runtime session today starts with a Queen, and every colony/worker is forked from a Queen call — so a tracking surface rooted at the Queen automatically covers the whole agent tree.
> **Headline constraint for the architect:** the runtime is **local-by-default**. Every byte described in §3 — events, runtime logs, progress DBs, session state, LLM cost numbers — is written to the user's machine under `~/.hive/` (or the platform-specific Electron `userData` directory). **Nothing is shipped to the cloud today.** The business ask therefore implies a new local→cloud transport boundary, with the data-residency, privacy, and identity decisions that come with it. §4.5 makes the gap explicit per-surface; §8 lists the risks; §9 frames the cloud cut-over as the gating decision for any "Slice 2+" work.
---
## 2. Vocabulary — what we actually mean
| Term | Definition in this codebase |
|---|---|
| **Session** | One Queen runtime instance. ID: `session_{YYYYMMDD_HHMMSS}_{uuid8}`. Persisted at `~/.hive/sessions/{session_id}/`. |
| **Queen** | Long-lived conversational agent. One per session. Single event-loop node. Phases: `independent → incubating → working → reviewing`. See [queen/nodes/__init__.py:494-518](../../core/framework/agents/queen/nodes/__init__.py#L494-L518). |
| **Colony** | Persistent stateless container forked by `create_colony`. Has its own SQLite progress DB at `~/.hive/colonies/{colony_name}/data/progress.db`. |
| **Worker** | Ephemeral agent running inside a colony to execute a task. |
| **Run / execution** | One trigger-to-completion invocation inside a node. Carries `run_id`, `execution_id`, `trace_id` (OTel-aligned). |
| **Usage** | Quantitative consumption: input/output/cached tokens, USD cost, wall-clock latency, tool-call counts. |
| **Status** | Qualitative state: phase, alive/stalled, current task, blocked-on, queue depth, last-heartbeat. |
---
## 3. What exists today (capabilities, not commitments)
The runtime is already heavily instrumented. Most of what business wants is **already emitted** — the gap is persistence, aggregation, and a stable API surface.
### 3.1 Event Bus — the spine
[core/framework/host/event_bus.py:61-177](../../core/framework/host/event_bus.py#L61-L177) defines an in-process async pub/sub with **40+ event types** scoped by `stream_id`, `session_id`, `colony_id`, `execution_id`, `run_id`, `correlation_id`, `timestamp`.
Relevant for usage/status:
- **Lifecycle:** `EXECUTION_STARTED/COMPLETED/FAILED/PAUSED/RESUMED/RESURRECTED`
- **Queen:** `QUEEN_PHASE_CHANGED`, `QUEEN_IDENTITY_SELECTED`
- **Colony/Worker:** `COLONY_CREATED`, `WORKER_COLONY_LOADED`, `WORKER_COMPLETED`, `WORKER_FAILED`, `SUBAGENT_REPORT`
- **LLM:** `LLM_TURN_COMPLETE`, `LLM_TEXT_DELTA`, `LLM_REASONING_DELTA`, `CONTEXT_USAGE_UPDATED`
- **Tools:** `TOOL_CALL_STARTED`, `TOOL_CALL_COMPLETED`, `TOOL_CALL_REPLAY_DETECTED`
- **Health:** `NODE_STALLED`, `NODE_TOOL_DOOM_LOOP`, `STREAM_TTFT_EXCEEDED`, `STREAM_INACTIVE`, `STREAM_NUDGE_SENT`
- **Tasks (right-rail panel):** `TASK_CREATED`, `TASK_UPDATED`, `TASK_DELETED`, `TASK_LIST_RESET`
- **Triggers:** `TRIGGER_AVAILABLE/ACTIVATED/DEACTIVATED/FIRED/REMOVED/UPDATED`
> Persistence today: in-memory only, **plus** optional JSONL export when `HIVE_DEBUG_EVENTS=1` ([event_bus.py:33-54](../../core/framework/host/event_bus.py#L33-L54)). There is no SQL events table.
### 3.2 Three-level runtime logs (per session)
[core/framework/tracker/runtime_log_schemas.py](../../core/framework/tracker/runtime_log_schemas.py) defines:
| Level | Schema | File | Granularity |
|---|---|---|---|
| L1 | `RunSummaryLog` | `summary.json` | Per graph run — totals + execution_quality + trace_id |
| L2 | `NodeDetail` | `details.jsonl` | Per node — exit_status, input/output tokens, latency_ms, retry/accept/escalate/continue counts |
| L3 | `NodeStepLog` | `tool_logs.jsonl` | Per LLM step — tool calls, verdicts, error traces, latency_ms |
Storage: `~/.hive/sessions/{session_id}/logs/` ([runtime_log_store.py](../../core/framework/tracker/runtime_log_store.py)). Schemas already carry OTel fields (`trace_id`, `span_id`, `parent_span_id`) — wire-ready, not yet exported.
### 3.3 LLM call accounting
[core/framework/llm/provider.py:11-32](../../core/framework/llm/provider.py#L11-L32) — `LLMResponse` carries: `model`, `input_tokens`, `output_tokens`, `cached_tokens`, `cache_creation_tokens`, `cost_usd`, `stop_reason`. Cost is computed from [model_catalog.py](../../core/framework/llm/model_catalog.py) when the model is priced; otherwise `0.0`.
> Gap: cost lives in the response object and is rolled into L2/L3 logs, but is **not** in the event bus stream and **not** in any aggregate query surface.
### 3.4 Colony Progress DB
[core/framework/host/progress_db.py:44-110](../../core/framework/host/progress_db.py#L44-L110) — per-colony SQLite (WAL mode):
- `tasks` (id, seq, priority, goal, status: pending|claimed|started|completed|failed, worker_id, claimed_at, started_at, completed_at, retry_count, last_error)
- `steps`, `sop_checklist`, `colony_meta`
This is the closest thing we have to a status SQL store today, but it is **per-colony** and **task-shaped** — not session-shaped or usage-shaped.
### 3.5 Queen task system (right-rail panel)
The mechanism the IDE-selection prompt describes is real: each `task_update` emits `TASK_UPDATED` on the bus, which a future SSE/WS surface can stream. State transitions: `pending → in_progress → completed`. Task body carries `subject`, `active_form`, `blocks`, `blocked_by`, `metadata`. Source: [tasks/events.py:52-159](../../core/framework/tasks/events.py#L52-L159).
### 3.6 HTTP surface (already shipping)
[core/framework/server/routes_sessions.py](../../core/framework/server/routes_sessions.py):
- `POST /api/sessions` — create
- `GET /api/sessions/{session_id}` — current state including `queen_phase`, `queen_model`, `colony_id`, `uptime_seconds`
- `GET /api/sessions/{session_id}/stats` — runtime statistics (extension point)
- `GET /api/sessions/{session_id}/events/history` — replay persisted events
SSE primitive exists at [server/sse.py](../../core/framework/server/sse.py) but is **not yet wired to a global event-stream route**. This is the natural attach point for a real-time status feed.
### 3.7 Worker health snapshot
`get_worker_health_summary()` ([worker_monitoring_tools.py:71-99](../../core/framework/tools/worker_monitoring_tools.py#L71-L99)) returns: `session_id`, `session_status`, `total_steps`, `recent_verdicts`, `stall_minutes`, `evidence_snippet`. Used today by Queen during the WORKING phase; can be exposed via API.
---
## 3.8 Where every byte lives today (data residency map)
Every storage location below is **on the end-user's machine**. There is no cloud sink, no telemetry endpoint, no managed database, no analytics service. The HTTP server in [core/framework/server/](../../core/framework/server/) binds to localhost for the desktop UI; it is not a cloud API.
`HIVE_HOME` defaults to `~/.hive/` and is overridden by the desktop shell to the platform `userData` dir (e.g. `~/Library/Application Support/Hive/` on macOS, `%APPDATA%\Hive\` on Windows). Source: [config.py:20-44](../../core/framework/config.py#L20-L44).
| Data | On-disk location (per machine) | Format | Lifetime | Currently shipped off-device? |
|---|---|---|---|---|
| Event bus stream | in-process memory only | Python objects | Process lifetime | No |
| Event debug log (opt-in) | `HIVE_HOME/event_logs/<ts>.jsonl` when `HIVE_DEBUG_EVENTS=1` | JSONL | Until user deletes | No |
| Session state | `HIVE_HOME/sessions/{session_id}/state.json` | JSON | Until user deletes | No |
| Conversations | `HIVE_HOME/sessions/{session_id}/conversations/` | JSON | Until user deletes | No |
| Artifacts | `HIVE_HOME/sessions/{session_id}/artifacts/` | mixed | Until user deletes | No |
| L1 run summary (tokens, cost, quality) | `HIVE_HOME/sessions/{session_id}/logs/summary.json` | JSON | Until user deletes | No |
| L2 node details | `HIVE_HOME/sessions/{session_id}/logs/details.jsonl` | JSONL | Until user deletes | No |
| L3 step / tool logs | `HIVE_HOME/sessions/{session_id}/logs/tool_logs.jsonl` | JSONL | Until user deletes | No |
| Colony task / step / SOP state | `HIVE_HOME/colonies/{colony_name}/data/progress.db` | SQLite (WAL) | Until user deletes | No |
| Queen / colony / skill / memory configs | `HIVE_HOME/{queens,colonies,skills,memories}/` | files | Until user deletes | No |
| LLM `cost_usd` numbers | computed in-process from [model_catalog.py](../../core/framework/llm/model_catalog.py), then written into L1/L2/L3 logs above | — | Same as logs | No |
**What this means for the cloud requirement:** the question for the architect is not "where do we get the data" — the data is fully captured. The question is **"how does it leave the machine, in what shape, with whose consent, and where does it land."** That decision is upstream of every endpoint in §6 and every storage option in §5.
Three architectural shapes worth considering (architect to choose):
- **Shape A — On-device only, queried over LAN/tunnel.** Cloud product reaches into the runtime via an authenticated tunnel; no data is replicated. Strongest privacy. Hardest for cross-device rollups.
- **Shape B — Outbox push.** Runtime keeps the local store as source of truth and asynchronously pushes a redacted, billing-grade subset (no prompts, no tool args by default) to a cloud aggregate. Best fit for the typical "agent status dashboard + usage rollup" product.
- **Shape C — Cloud-first runtime.** Runtime writes events directly to a cloud bus and treats local files as a cache. Largest rewrite; not recommended for a desktop-first product.
Shape B is the lowest-friction path to the stated business outcome. The rest of this document is written with Shape B as the default and calls out where Shape A or C would change things.
---
## 4. Capability matrix — what we can offer
Each row is a candidate frontend/business surface, scored by feasibility from current state.
| # | Capability | Status | Backed by |
|---|---|---|---|
| **Status** | | | |
| S1 | Queen phase indicator (independent/incubating/working/reviewing) | **Ready** | `QUEEN_PHASE_CHANGED` event + session detail field |
| S2 | Per-task progress (right-rail) | **Ready** | `TASK_*` events |
| S3 | Live LLM streaming indicator (typing, thinking, tool-calling) | **Ready** | `LLM_TEXT_DELTA`, `LLM_REASONING_DELTA`, `TOOL_CALL_STARTED/COMPLETED` |
| S4 | Stall / stuck-agent detection | **Ready** | `NODE_STALLED`, `STREAM_INACTIVE`, `NODE_TOOL_DOOM_LOOP` |
| S5 | Colony tree (Queen → colonies → workers) snapshot | **Partial** — data exists in session/colony stores; need a join query |
| S6 | Worker health roll-up across colonies | **Partial** — per-worker tool exists; needs aggregation route |
| S7 | Liveness heartbeat ("agent X last seen Y ago") | **Net-new** — must derive from event timestamps or add a periodic ping |
| S8 | Trigger schedule (when will Queen wake next) | **Ready** | `TRIGGER_*` events |
| **Usage** | | | |
| U1 | Tokens per session (input/output/cached) | **Partial** — captured per-step in L3, summed in L1; no API |
| U2 | USD cost per session / colony / model | **Partial**`cost_usd` per LLM call in logs; no rollup |
| U3 | Tool-call counts and types | **Partial** — events exist; no aggregate |
| U4 | Wall-clock runtime and active-time per agent | **Partial** — derivable from `EXECUTION_STARTED/COMPLETED` |
| U5 | Cost attribution per Queen-spawned colony | **Partial**`colony_id` is on every event; needs a query |
| U6 | Per-user / per-tenant aggregation | **Net-new** — there is no user/tenant identity in events today |
| U7 | Daily / monthly usage rollups for billing | **Net-new** — requires persistent event store |
| U8 | Quota / cap enforcement (block when over budget) | **Net-new** — requires real-time meter + policy hook |
**Read of the matrix:** ~70% of "status" surfaces are **shipping-grade today** behind a thin local API. ~70% of "usage" surfaces need a **persistence + aggregation layer**. The events themselves are not the bottleneck.
**Local vs. cloud read of the same matrix.** Every "Ready" / "Partial" cell above is *ready in-process on the local machine*. Making each row visible to a **cloud** consumer adds an additional step:
| Capability class | Local (today / near-term) | Cloud (business ask) |
|---|---|---|
| Live status (S1S4, S8) | Stream from in-process event bus over local SSE | Push events through outbox → cloud relay → cloud SSE/WS to product UI. |
| Tree / health (S5, S6) | Join local session + colony stores | Same join, but on cloud-side replica of session/colony index. |
| Liveness (S7) | Derive from local event timestamps | Requires the runtime to *post* a heartbeat; cloud cannot infer aliveness from absence. |
| Per-session usage (U1U5) | Read L1/L2/L3 logs on disk | Outbox sends durable rows (no deltas) to cloud usage table. |
| Tenant rollups (U6U7) | Not possible — no identity in events | Cloud-side aggregation keyed on session→user join, identity attached at outbox time. |
| Quotas (U8) | Local meter feasible, but pointless without cloud truth | Cloud is the meter of record; runtime calls home to check. |
---
## 5. Proposed data model (architect to validate)
Three new persisted entities, plus reuse of existing event types:
```
AgentSession UsageEvent StatusSnapshot
----------- ----------- ---------------
session_id (PK) id (PK) session_id (FK)
queen_id session_id (FK) taken_at
queen_model colony_id phase
started_at worker_id active_run_id
ended_at agent_role (queen|worker) active_node
status (active|done|failed) event_type (LLM|TOOL|...) open_task_count
user_id (when multi-tenant) model in_flight_workers
tenant_id (when multi-tenant) input_tokens last_event_at
total_input_tokens output_tokens stall_score
total_output_tokens cached_tokens
total_cached_tokens cost_usd
total_cost_usd latency_ms
total_tool_calls tool_name (nullable)
last_event_at occurred_at
trace_id
execution_id
```
Storage choice (architect call). **All three options today are local; only Option C reaches the cloud business surface.**
- **Option A — local SQLite outbox** at `HIVE_HOME/runtime.db`. Pros: zero infra, fits desktop, makes local queries cheap. Cons: per-host; no cross-device aggregation; **does not satisfy the cloud requirement on its own.**
- **Option B — DuckDB on the existing JSONL event logs.** Pros: zero ingestion code; analyst-friendly. Cons: cold-start latency on big histories; **also local-only.**
- **Option C — push events to a managed cloud store** (Postgres, ClickHouse, BigQuery) via an outbox pattern. Pros: cross-host rollups, billing-grade, the only option that actually delivers the cloud-visible status/metrics product. Cons: introduces a new transport, identity, and privacy/redaction story; needs explicit user opt-in for desktop builds.
The realistic shape is the hybrid called out in §3.8 Shape B: **A locally** as the durable buffer and source of truth, **C in the cloud** as the business-facing aggregate, with a one-way outbox that moves a *redacted, durable-event-only* subset over the wire. This document recommends that hybrid; everything in §6 and §7 is written against it.
---
## 6. Surface API — what frontend would consume
All routes assume the event-bus → SSE bridge exists (the one missing wire — see §3.6). Frontend sees this from day one.
> **Locality note.** The `/api/...` routes below are served by the **local runtime HTTP server** today. For the cloud product, the same shapes need a cloud-side counterpart fed by the outbox. Two practical patterns: (1) cloud product calls cloud-hosted versions of these routes (against the aggregate), or (2) cloud product proxies authenticated requests back to the user's runtime. §3.8 Shape A vs. Shape B picks between them.
### Real-time channel
```
GET /api/sessions/{session_id}/events/stream (SSE)
↳ filter=phase,task,llm_stream,tool,worker,trigger,health
GET /api/agents/queen/stream (SSE) — global queen events
```
### Status reads
```
GET /api/sessions/{session_id} — already shipping
GET /api/sessions/{session_id}/tree — Queen → colonies → workers
GET /api/sessions/{session_id}/health — stall_score, last_event_at, in_flight
GET /api/colonies/{colony_id}/workers — health roll-up
```
### Usage reads
```
GET /api/sessions/{session_id}/usage — tokens, cost, latency, tool-calls
GET /api/sessions/{session_id}/usage/by-model — split by model
GET /api/colonies/{colony_id}/usage — same shape, colony scope
GET /api/agents/queen/usage?range=...&group_by=... — rollup view (billing)
```
### Admin / business
```
GET /api/usage/rollup?range=...&group_by=user|tenant|model|colony
POST /api/quotas/{tenant} — set caps (if quota work in scope)
```
---
## 7. Net-new work — sized in shirt-size, not days
| Workstream | Local / Cloud | Size | Depends on | Notes |
|---|---|---|---|---|
| Event-bus → local SSE bridge ([sse.py](../../core/framework/server/sse.py) exists, route does not) | Local | **S** | — | Unlocks all real-time status surfaces in the desktop UI. Highest leverage piece. |
| Persisted local event store (SQLite outbox) | Local | **M** | Decision §5 | One writer, append-only; reuse existing JSONL writer. Source of truth for cloud push. |
| Local aggregation queries + `/usage` endpoints | Local | **M** | Persisted store | Per-session usage on disk. |
| **Outbox transport (local → cloud)** | **Boundary** | **ML** | Local store + auth | New work: durable queue, retry, redaction policy, opt-in switch, schema versioning. This is the bridge to the cloud product. |
| **Cloud event ingest + aggregate store** | **Cloud** | **L** | Outbox transport | New cloud infra (Postgres/ClickHouse/BigQuery). Hosting, ops, retention policy, access controls. |
| **Cloud-side status/usage API + dashboards** | **Cloud** | **M** | Cloud aggregate | Mirrors §6 endpoints against the cloud store; this is what business users actually see. |
| Identity layer (user_id / tenant_id on events) | Boundary | **M** | Auth model | Currently no user identity in events. Identity attaches at outbox time, not at emit time. |
| OpenTelemetry exporter (schema is ready) | Boundary | **SM** | — | `trace_id`/`span_id` already populated; an OTel collector can be the cloud sink instead of a custom outbox. |
| Quota / policy hooks | Cloud-authoritative | **L** | Cloud store + identity | Cloud holds the meter; runtime calls home synchronously on a critical path. |
| Liveness/heartbeat (S7) | Local emit, cloud consume | **S** | Outbox | Runtime must actively post; cloud cannot infer liveness from absence. |
| Cost attribution UI rollups | Cloud | **S** | `/usage` cloud endpoints | Shared with frontend doc. |
**Critical path for first frontend release (local desktop UI):** SSE bridge → status endpoints (S1S5) → per-session usage endpoint (U1, U2). Everything else is incremental.
**Critical path for first cloud release (business ask):** local event store → outbox transport with redaction + opt-in → cloud ingest → cloud `/usage` and `/status` endpoints. The local UI work above is *not* a prerequisite for the cloud cut, but most of the local-side primitives (event store, durable-event filtering) are shared, so doing them in order minimizes rework.
---
## 8. Risks and tradeoffs the architect should weigh
1. **Event volume.** `LLM_TEXT_DELTA` fires per token. A persisted store must filter — don't write deltas, write `LLM_TURN_COMPLETE`. This is the #1 way the table blows up.
2. **Privacy / desktop posture — the central architectural constraint.** The runtime is local by default ([config.py:20-44](../../core/framework/config.py#L20-L44)). The data inventory in §3.8 confirms that **no data leaves the user's machine today**, including the data the business ask needs in the cloud. Closing that gap is not "add a metrics push" — it is a new system boundary with: (a) explicit user opt-in (defaults must be safe for OSS / self-hosted users), (b) a documented redaction list (no prompts, no tool args, no file paths in the default payload), (c) schema versioning so cloud aggregates do not break on runtime upgrades, (d) a clear answer for self-hosted / air-gapped deployments where the cloud sink is unreachable, (e) regional data-residency rules if the product is sold internationally. This is the single largest design decision in the document.
3. **Cost-table accuracy.** `cost_usd` is computed from a static catalog. Using it for billing means committing to keeping the catalog current (or pulling from provider invoices). For *display*, the current approach is fine; for *charging*, it is not.
4. **Identity coupling.** Events are session-scoped today. Adding `user_id`/`tenant_id` everywhere is invasive. Recommend pinning identity at the *session* boundary and joining on session at query time, rather than threading identity through every event payload.
5. **Status vs. heartbeat semantics.** "Idle" is not "dead." A Queen sitting in `independent` waiting for a user message is healthy and should not page anyone. The stall-score in §5 must distinguish idle-by-design from stalled-by-bug — the existing `STREAM_INACTIVE` / `NODE_STALLED` events already make this distinction; preserve it.
6. **Backpressure from observability.** If usage tracking sits in the LLM call path (for quotas), it must not add latency. Recommend: meter is async/eventual for display; only quota checks are synchronous, and only when the customer has a quota.
7. **Worker-side gap.** Worker LLM calls are accounted in their own session's L1L3 logs but are *not* automatically rolled into the parent Queen session. Cost attribution from Queen → spawned colony requires either (a) a parent_session_id field on the colony's session row, or (b) walking the `COLONY_CREATED` event graph at query time. (a) is cleaner.
---
## 9. Recommendation
Ship in four thin slices. The first two are local-only and unblock the desktop UI; the last two are what actually deliver the business ask of cloud-visible status and metrics.
1. **Slice 1 — Live local status (1 sprint, fully local).**
SSE bridge + `/sessions/{id}/events/stream` + `/sessions/{id}/health` + `/sessions/{id}/tree`. Frontend (local UI) gets the right-rail and the agent-tree. No persistence work, no cloud. (S1S5, S8.)
2. **Slice 2 — Per-session local usage store (12 sprints, fully local).**
Persisted event store (SQLite outbox at `HIVE_HOME/runtime.db`), filtered to durable event types only. `/sessions/{id}/usage` + `/colonies/{id}/usage`. No identity, no rollups, **no cloud transport yet**. This is the foundation the cloud slice rides on. (U1U5.)
3. **Slice 3 — Local→cloud outbox + cloud ingest (the cloud cut, scope-defining).**
Durable outbox queue, redaction policy, opt-in toggle, identity attachment, schema versioning, retry/backoff. Cloud-side ingest service + aggregate store. **This is where the local-only world becomes a cloud product.** Architect must decide §3.8 Shape, §5 storage, redaction defaults, and identity model before this slice can start.
4. **Slice 4 — Cloud rollups, dashboards, quotas (scope TBD with product).**
Tenant aggregation, daily/monthly rollups, quota enforcement, OTel export, business dashboards. (U6U8.) Defer until business confirms billing model — the answer (per-seat vs. per-token vs. per-colony) changes the data model.
Slices 1 and 2 are mostly **wiring** — the events exist, the schemas exist, the storage paths exist. Slice 3 is the **first slice that introduces a new architectural boundary** (local→cloud transport + identity + privacy contract); everything novel about the business ask lives there. Slice 4 is **business design**, not engineering scope.
---
## 10. Open questions for the architect
The first four are direct consequences of the local-first / cloud-required gap surfaced in §3.8 and §8.2.
1. **Cloud transport shape — Shape A, B, or C from §3.8?** This decision is upstream of the entire data model. Recommend Shape B (outbox push) absent a strong privacy argument for Shape A.
2. **Redaction default for the cloud payload.** What goes (model, token counts, latency, tool names, status) vs. what stays local (prompts, tool arguments, tool results, file paths, conversation content)? Need a written allowlist before Slice 3 starts.
3. **Self-hosted / air-gapped users.** If the cloud sink is unreachable or disabled, what does the runtime do — buffer indefinitely, drop oldest, or refuse to start? Defaults differ for OSS vs. SaaS distributions.
4. **Identity binding point.** Do we attach `user_id` / `tenant_id` at event-emit time (invasive, threads identity through every node), at session-create time (clean, requires session-level auth), or at outbox-flush time (simplest, but loses per-event provenance)? Recommend session-create.
5. Do we need quota *enforcement*, or only quota *visibility* in v1?
6. Frontend doc: are status and usage rendered in the same panel or different surfaces? This affects whether we ship one merged endpoint or two.
7. Are we willing to pay the cost-table maintenance burden, or should "cost" stay labeled as estimated and not be used for invoicing?
---
## Appendix — Pointers
- Queen lifecycle: [core/framework/agents/queen/nodes/__init__.py](../../core/framework/agents/queen/nodes/__init__.py)
- Event bus + types: [core/framework/host/event_bus.py](../../core/framework/host/event_bus.py)
- Runtime log schemas: [core/framework/tracker/runtime_log_schemas.py](../../core/framework/tracker/runtime_log_schemas.py)
- Runtime log store: [core/framework/tracker/runtime_log_store.py](../../core/framework/tracker/runtime_log_store.py)
- LLM accounting: [core/framework/llm/provider.py](../../core/framework/llm/provider.py), [model_catalog.py](../../core/framework/llm/model_catalog.py)
- Colony progress DB: [core/framework/host/progress_db.py](../../core/framework/host/progress_db.py)
- Task events: [core/framework/tasks/events.py](../../core/framework/tasks/events.py)
- Session HTTP: [core/framework/server/routes_sessions.py](../../core/framework/server/routes_sessions.py)
- SSE primitive: [core/framework/server/sse.py](../../core/framework/server/sse.py)
- Worker health: [core/framework/tools/worker_monitoring_tools.py](../../core/framework/tools/worker_monitoring_tools.py)
- Config / env vars: [core/framework/config.py](../../core/framework/config.py)