feat: encourage action plan prompts

2026-05-04 17:53:26 -07:00
parent e7d4ce0057
commit 07600c5ab5
2 changed files with 350 additions and 23 deletions
@@ -240,19 +240,15 @@ See "Independent execution" for the per-step flow and granularity rule.
 ## File I/O (files-tools MCP)
 - read_file, write_file, edit_file, search_files
-  - edit_file covers single-file fuzzy find/replace (mode='replace', default) \
+- edit_file covers single-file fuzzy find/replace (mode='replace', default) \
 and multi-file structured patches (mode='patch'). Patch mode supports \
 Update / Add / Delete / Move atomically across many files in one call.
-  - search_files covers grep/find/ls in one tool: target='content' to \
+- search_files covers grep/find/ls in one tool: target='content' to \
 search inside files, target='files' (with a glob like '*.py') to list \
-or find files. Mtime-sorted in files mode.
+or find files.
 ## Browser Automation (gcu-tools MCP)
- Use `browser_*` tools — `browser_open(url)` is the cold-start entry point \
+- Use `browser_*` tools — `browser_open(url)` is the cold-start entry point 
  (lazy-creates the context; no separate "start" call). Then `browser_navigate`, \
  `browser_click`, `browser_type`, `browser_snapshot`, \
  <!-- vision-only -->`browser_screenshot`, <!-- /vision-only -->`browser_scroll`, \
  `browser_tabs`, `browser_close`, `browser_evaluate`, etc.
 - MUST Follow the browser-automation skill protocol before using browser tools.
 ## Hand off to a colony
@@ -261,9 +257,7 @@ or find files. Mtime-sorted in files mode.
  chat. It does NOT fork on its own; it spawns a one-shot evaluator \
  that reads this conversation and decides whether the spec is settled \
  enough to proceed. On approval your phase flips to INCUBATING and a \
-  new tool surface (including create_colony itself) unlocks. On \
+  new tool surface (including create_colony itself) unlocks.
  rejection you stay here and keep the conversation going to fill the \
  gaps the evaluator named.
 """
 _queen_tools_incubating = """
@@ -411,17 +405,19 @@ asks for specifics. Do not invent a new pass unless the user asks for one.
 _queen_behavior_independent = """
 ## Independent execution
-You are the agent. **For multi-step work (2+ atomic actions): call \
+You are the agent. you behave this way:
-`task_create_batch`** with one entry per atomic action, \
+1. Identify if the user's prompt is a task assignment. If it is, \
-before you touch any other tool. \
+Use ask_user to clarify the scope and detail requirements, then always use \
-Then work the list one task at a time:
+the `task_create_batch` to create a multi-step action plan. 
-1. `task_update` → in_progress before you start the step.
+2. `task_update` → in_progress before you start the step.
-2. Do one real inline instance — open the browser, call the real API, \
+
 3. Do one real inline instance - either open the browser, call the real API, \
 write to the real file. If the action is irreversible or touches \
 shared systems, show and confirm before executing. Report concrete \
 evidence (actual output, what worked / failed) after the run.
-3. `task_update` → completed THE MOMENT it's done. **Do not let \
+
 4. `task_update` → completed THE MOMENT it's done. **Do not let \
 multiple finished tasks pile up unmarked.** There is no batch update \
 tool by design — each `completed` transition is a discrete progress \
 heartbeat in the user's right-rail panel. Without those transitions \
@@ -430,14 +426,14 @@ done.
 **Granularity: one task per atomic action, not one umbrella per project.** \
-Once finishing all current tasks, discuss with user about building \
+Once finishing a current task, discuss with user about building \
-a colony so this sucess can be repeated or scaled
+a colony so this success outcome can be repeated or scaled
 ### How to handle large scale tasks
-If the user ask you to finish the same task repeatly or at large scale \
+If the user ask you to finish the same task repeatedly or at large scale \
-(more than 10 times), tell the user that you can do it once first then \
+(more than 3 times), tell the user that you can do it once first then \
 build a colony to fulfill the request but succeeding it once will be \
-beneficial to run it in the future, \
+beneficial to run transfer it to a swarm of workers(through start_incubating_colony), \
 then focus on finishing the task once first.
 ### How to handle simple task (less then 2 atomic items)
@@ -0,0 +1,331 @@
 # Agent Usage & Status Tracking — Capability Document
 **Audience:** Lead software architect (paired with frontend + business requirement docs)
 **Scope:** Queen agent first (default local-runtime entry point), then downstream colonies/workers
 **Status:** Capability inventory + proposal — no implementation commitments
 **Date:** 2026-05-04
 ---
 ## 1. Why this document exists
 We have a business need to track **agent usage** (what was consumed: tokens, cost, runtime, calls) and **agent status** (what state agents are in: alive, phase, progress, blocked) starting from the Queen agent, and surface this **on the cloud** for product and business consumers. This document inventories the capabilities the runtime can expose **today** vs. what is **net-new**, so architecture can pick a scope before frontend and product write against it.
 The Queen is the right anchor: every local-runtime session today starts with a Queen, and every colony/worker is forked from a Queen call — so a tracking surface rooted at the Queen automatically covers the whole agent tree.
 > **Headline constraint for the architect:** the runtime is **local-by-default**. Every byte described in §3 — events, runtime logs, progress DBs, session state, LLM cost numbers — is written to the user's machine under `~/.hive/` (or the platform-specific Electron `userData` directory). **Nothing is shipped to the cloud today.** The business ask therefore implies a new local→cloud transport boundary, with the data-residency, privacy, and identity decisions that come with it. §4.5 makes the gap explicit per-surface; §8 lists the risks; §9 frames the cloud cut-over as the gating decision for any "Slice 2+" work.
 ---
 ## 2. Vocabulary — what we actually mean
 | Term | Definition in this codebase |
 |---|---|
 | **Session** | One Queen runtime instance. ID: `session_{YYYYMMDD_HHMMSS}_{uuid8}`. Persisted at `~/.hive/sessions/{session_id}/`. |
 | **Queen** | Long-lived conversational agent. One per session. Single event-loop node. Phases: `independent → incubating → working → reviewing`. See [queen/nodes/__init__.py:494-518](../../core/framework/agents/queen/nodes/__init__.py#L494-L518). |
 | **Colony** | Persistent stateless container forked by `create_colony`. Has its own SQLite progress DB at `~/.hive/colonies/{colony_name}/data/progress.db`. |
 | **Worker** | Ephemeral agent running inside a colony to execute a task. |
 | **Run / execution** | One trigger-to-completion invocation inside a node. Carries `run_id`, `execution_id`, `trace_id` (OTel-aligned). |
 | **Usage** | Quantitative consumption: input/output/cached tokens, USD cost, wall-clock latency, tool-call counts. |
 | **Status** | Qualitative state: phase, alive/stalled, current task, blocked-on, queue depth, last-heartbeat. |
 ---
 ## 3. What exists today (capabilities, not commitments)
 The runtime is already heavily instrumented. Most of what business wants is **already emitted** — the gap is persistence, aggregation, and a stable API surface.
 ### 3.1 Event Bus — the spine
 [core/framework/host/event_bus.py:61-177](../../core/framework/host/event_bus.py#L61-L177) defines an in-process async pub/sub with **40+ event types** scoped by `stream_id`, `session_id`, `colony_id`, `execution_id`, `run_id`, `correlation_id`, `timestamp`.
 Relevant for usage/status:
 - **Lifecycle:** `EXECUTION_STARTED/COMPLETED/FAILED/PAUSED/RESUMED/RESURRECTED`
 - **Queen:** `QUEEN_PHASE_CHANGED`, `QUEEN_IDENTITY_SELECTED`
 - **Colony/Worker:** `COLONY_CREATED`, `WORKER_COLONY_LOADED`, `WORKER_COMPLETED`, `WORKER_FAILED`, `SUBAGENT_REPORT`
 - **LLM:** `LLM_TURN_COMPLETE`, `LLM_TEXT_DELTA`, `LLM_REASONING_DELTA`, `CONTEXT_USAGE_UPDATED`
 - **Tools:** `TOOL_CALL_STARTED`, `TOOL_CALL_COMPLETED`, `TOOL_CALL_REPLAY_DETECTED`
 - **Health:** `NODE_STALLED`, `NODE_TOOL_DOOM_LOOP`, `STREAM_TTFT_EXCEEDED`, `STREAM_INACTIVE`, `STREAM_NUDGE_SENT`
 - **Tasks (right-rail panel):** `TASK_CREATED`, `TASK_UPDATED`, `TASK_DELETED`, `TASK_LIST_RESET`
 - **Triggers:** `TRIGGER_AVAILABLE/ACTIVATED/DEACTIVATED/FIRED/REMOVED/UPDATED`
 > Persistence today: in-memory only, **plus** optional JSONL export when `HIVE_DEBUG_EVENTS=1` ([event_bus.py:33-54](../../core/framework/host/event_bus.py#L33-L54)). There is no SQL events table.
 ### 3.2 Three-level runtime logs (per session)
 [core/framework/tracker/runtime_log_schemas.py](../../core/framework/tracker/runtime_log_schemas.py) defines:
 | Level | Schema | File | Granularity |
 |---|---|---|---|
 | L1 | `RunSummaryLog` | `summary.json` | Per graph run — totals + execution_quality + trace_id |
 | L2 | `NodeDetail` | `details.jsonl` | Per node — exit_status, input/output tokens, latency_ms, retry/accept/escalate/continue counts |
 | L3 | `NodeStepLog` | `tool_logs.jsonl` | Per LLM step — tool calls, verdicts, error traces, latency_ms |
 Storage: `~/.hive/sessions/{session_id}/logs/` ([runtime_log_store.py](../../core/framework/tracker/runtime_log_store.py)). Schemas already carry OTel fields (`trace_id`, `span_id`, `parent_span_id`) — wire-ready, not yet exported.
 ### 3.3 LLM call accounting
 [core/framework/llm/provider.py:11-32](../../core/framework/llm/provider.py#L11-L32) — `LLMResponse` carries: `model`, `input_tokens`, `output_tokens`, `cached_tokens`, `cache_creation_tokens`, `cost_usd`, `stop_reason`. Cost is computed from [model_catalog.py](../../core/framework/llm/model_catalog.py) when the model is priced; otherwise `0.0`.
 > Gap: cost lives in the response object and is rolled into L2/L3 logs, but is **not** in the event bus stream and **not** in any aggregate query surface.
 ### 3.4 Colony Progress DB
 [core/framework/host/progress_db.py:44-110](../../core/framework/host/progress_db.py#L44-L110) — per-colony SQLite (WAL mode):
 - `tasks` (id, seq, priority, goal, status: pending|claimed|started|completed|failed, worker_id, claimed_at, started_at, completed_at, retry_count, last_error)
 - `steps`, `sop_checklist`, `colony_meta`
 This is the closest thing we have to a status SQL store today, but it is **per-colony** and **task-shaped** — not session-shaped or usage-shaped.
 ### 3.5 Queen task system (right-rail panel)
 The mechanism the IDE-selection prompt describes is real: each `task_update` emits `TASK_UPDATED` on the bus, which a future SSE/WS surface can stream. State transitions: `pending → in_progress → completed`. Task body carries `subject`, `active_form`, `blocks`, `blocked_by`, `metadata`. Source: [tasks/events.py:52-159](../../core/framework/tasks/events.py#L52-L159).
 ### 3.6 HTTP surface (already shipping)
 [core/framework/server/routes_sessions.py](../../core/framework/server/routes_sessions.py):
 - `POST /api/sessions` — create
 - `GET /api/sessions/{session_id}` — current state including `queen_phase`, `queen_model`, `colony_id`, `uptime_seconds`
 - `GET /api/sessions/{session_id}/stats` — runtime statistics (extension point)
 - `GET /api/sessions/{session_id}/events/history` — replay persisted events
 SSE primitive exists at [server/sse.py](../../core/framework/server/sse.py) but is **not yet wired to a global event-stream route**. This is the natural attach point for a real-time status feed.
 ### 3.7 Worker health snapshot
 `get_worker_health_summary()` ([worker_monitoring_tools.py:71-99](../../core/framework/tools/worker_monitoring_tools.py#L71-L99)) returns: `session_id`, `session_status`, `total_steps`, `recent_verdicts`, `stall_minutes`, `evidence_snippet`. Used today by Queen during the WORKING phase; can be exposed via API.
 ---
 ## 3.8 Where every byte lives today (data residency map)
 Every storage location below is **on the end-user's machine**. There is no cloud sink, no telemetry endpoint, no managed database, no analytics service. The HTTP server in [core/framework/server/](../../core/framework/server/) binds to localhost for the desktop UI; it is not a cloud API.
 `HIVE_HOME` defaults to `~/.hive/` and is overridden by the desktop shell to the platform `userData` dir (e.g. `~/Library/Application Support/Hive/` on macOS, `%APPDATA%\Hive\` on Windows). Source: [config.py:20-44](../../core/framework/config.py#L20-L44).
 | Data | On-disk location (per machine) | Format | Lifetime | Currently shipped off-device? |
 |---|---|---|---|---|
 | Event bus stream | in-process memory only | Python objects | Process lifetime | No |
 | Event debug log (opt-in) | `HIVE_HOME/event_logs/<ts>.jsonl` when `HIVE_DEBUG_EVENTS=1` | JSONL | Until user deletes | No |
 | Session state | `HIVE_HOME/sessions/{session_id}/state.json` | JSON | Until user deletes | No |
 | Conversations | `HIVE_HOME/sessions/{session_id}/conversations/` | JSON | Until user deletes | No |
 | Artifacts | `HIVE_HOME/sessions/{session_id}/artifacts/` | mixed | Until user deletes | No |
 | L1 run summary (tokens, cost, quality) | `HIVE_HOME/sessions/{session_id}/logs/summary.json` | JSON | Until user deletes | No |
 | L2 node details | `HIVE_HOME/sessions/{session_id}/logs/details.jsonl` | JSONL | Until user deletes | No |
 | L3 step / tool logs | `HIVE_HOME/sessions/{session_id}/logs/tool_logs.jsonl` | JSONL | Until user deletes | No |
 | Colony task / step / SOP state | `HIVE_HOME/colonies/{colony_name}/data/progress.db` | SQLite (WAL) | Until user deletes | No |
 | Queen / colony / skill / memory configs | `HIVE_HOME/{queens,colonies,skills,memories}/` | files | Until user deletes | No |
 | LLM `cost_usd` numbers | computed in-process from [model_catalog.py](../../core/framework/llm/model_catalog.py), then written into L1/L2/L3 logs above | — | Same as logs | No |
 **What this means for the cloud requirement:** the question for the architect is not "where do we get the data" — the data is fully captured. The question is **"how does it leave the machine, in what shape, with whose consent, and where does it land."** That decision is upstream of every endpoint in §6 and every storage option in §5.
 Three architectural shapes worth considering (architect to choose):
 - **Shape A — On-device only, queried over LAN/tunnel.** Cloud product reaches into the runtime via an authenticated tunnel; no data is replicated. Strongest privacy. Hardest for cross-device rollups.
 - **Shape B — Outbox push.** Runtime keeps the local store as source of truth and asynchronously pushes a redacted, billing-grade subset (no prompts, no tool args by default) to a cloud aggregate. Best fit for the typical "agent status dashboard + usage rollup" product.
 - **Shape C — Cloud-first runtime.** Runtime writes events directly to a cloud bus and treats local files as a cache. Largest rewrite; not recommended for a desktop-first product.
 Shape B is the lowest-friction path to the stated business outcome. The rest of this document is written with Shape B as the default and calls out where Shape A or C would change things.
 ---
 ## 4. Capability matrix — what we can offer
 Each row is a candidate frontend/business surface, scored by feasibility from current state.
 | # | Capability | Status | Backed by |
 |---|---|---|---|
 | **Status** | | | |
 | S1 | Queen phase indicator (independent/incubating/working/reviewing) | **Ready** | `QUEEN_PHASE_CHANGED` event + session detail field |
 | S2 | Per-task progress (right-rail) | **Ready** | `TASK_*` events |
 | S3 | Live LLM streaming indicator (typing, thinking, tool-calling) | **Ready** | `LLM_TEXT_DELTA`, `LLM_REASONING_DELTA`, `TOOL_CALL_STARTED/COMPLETED` |
 | S4 | Stall / stuck-agent detection | **Ready** | `NODE_STALLED`, `STREAM_INACTIVE`, `NODE_TOOL_DOOM_LOOP` |
 | S5 | Colony tree (Queen → colonies → workers) snapshot | **Partial** — data exists in session/colony stores; need a join query |
 | S6 | Worker health roll-up across colonies | **Partial** — per-worker tool exists; needs aggregation route |
 | S7 | Liveness heartbeat ("agent X last seen Y ago") | **Net-new** — must derive from event timestamps or add a periodic ping |
 | S8 | Trigger schedule (when will Queen wake next) | **Ready** | `TRIGGER_*` events |
 | **Usage** | | | |
 | U1 | Tokens per session (input/output/cached) | **Partial** — captured per-step in L3, summed in L1; no API |
 | U2 | USD cost per session / colony / model | **Partial** — `cost_usd` per LLM call in logs; no rollup |
 | U3 | Tool-call counts and types | **Partial** — events exist; no aggregate |
 | U4 | Wall-clock runtime and active-time per agent | **Partial** — derivable from `EXECUTION_STARTED/COMPLETED` |
 | U5 | Cost attribution per Queen-spawned colony | **Partial** — `colony_id` is on every event; needs a query |
 | U6 | Per-user / per-tenant aggregation | **Net-new** — there is no user/tenant identity in events today |
 | U7 | Daily / monthly usage rollups for billing | **Net-new** — requires persistent event store |
 | U8 | Quota / cap enforcement (block when over budget) | **Net-new** — requires real-time meter + policy hook |
 **Read of the matrix:** ~70% of "status" surfaces are **shipping-grade today** behind a thin local API. ~70% of "usage" surfaces need a **persistence + aggregation layer**. The events themselves are not the bottleneck.
 **Local vs. cloud read of the same matrix.** Every "Ready" / "Partial" cell above is *ready in-process on the local machine*. Making each row visible to a **cloud** consumer adds an additional step:
 | Capability class | Local (today / near-term) | Cloud (business ask) |
 |---|---|---|
 | Live status (S1–S4, S8) | Stream from in-process event bus over local SSE | Push events through outbox → cloud relay → cloud SSE/WS to product UI. |
 | Tree / health (S5, S6) | Join local session + colony stores | Same join, but on cloud-side replica of session/colony index. |
 | Liveness (S7) | Derive from local event timestamps | Requires the runtime to *post* a heartbeat; cloud cannot infer aliveness from absence. |
 | Per-session usage (U1–U5) | Read L1/L2/L3 logs on disk | Outbox sends durable rows (no deltas) to cloud usage table. |
 | Tenant rollups (U6–U7) | Not possible — no identity in events | Cloud-side aggregation keyed on session→user join, identity attached at outbox time. |
 | Quotas (U8) | Local meter feasible, but pointless without cloud truth | Cloud is the meter of record; runtime calls home to check. |
 ---
 ## 5. Proposed data model (architect to validate)
 Three new persisted entities, plus reuse of existing event types:
 ```
 AgentSession                     UsageEvent                    StatusSnapshot
 -----------                      -----------                   ---------------
 session_id (PK)                  id (PK)                       session_id (FK)
 queen_id                         session_id (FK)               taken_at
 queen_model                      colony_id                     phase
 started_at                       worker_id                     active_run_id
 ended_at                         agent_role  (queen|worker)    active_node
 status      (active|done|failed) event_type  (LLM|TOOL|...)    open_task_count
 user_id     (when multi-tenant)  model                         in_flight_workers
 tenant_id   (when multi-tenant)  input_tokens                  last_event_at
 total_input_tokens               output_tokens                 stall_score
 total_output_tokens              cached_tokens
 total_cached_tokens              cost_usd
 total_cost_usd                   latency_ms
 total_tool_calls                 tool_name      (nullable)
 last_event_at                    occurred_at
                                 trace_id
                                 execution_id
 ```
 Storage choice (architect call). **All three options today are local; only Option C reaches the cloud business surface.**
 - **Option A — local SQLite outbox** at `HIVE_HOME/runtime.db`. Pros: zero infra, fits desktop, makes local queries cheap. Cons: per-host; no cross-device aggregation; **does not satisfy the cloud requirement on its own.**
 - **Option B — DuckDB on the existing JSONL event logs.** Pros: zero ingestion code; analyst-friendly. Cons: cold-start latency on big histories; **also local-only.**
 - **Option C — push events to a managed cloud store** (Postgres, ClickHouse, BigQuery) via an outbox pattern. Pros: cross-host rollups, billing-grade, the only option that actually delivers the cloud-visible status/metrics product. Cons: introduces a new transport, identity, and privacy/redaction story; needs explicit user opt-in for desktop builds.
 The realistic shape is the hybrid called out in §3.8 Shape B: **A locally** as the durable buffer and source of truth, **C in the cloud** as the business-facing aggregate, with a one-way outbox that moves a *redacted, durable-event-only* subset over the wire. This document recommends that hybrid; everything in §6 and §7 is written against it.
 ---
 ## 6. Surface API — what frontend would consume
 All routes assume the event-bus → SSE bridge exists (the one missing wire — see §3.6). Frontend sees this from day one.
 > **Locality note.** The `/api/...` routes below are served by the **local runtime HTTP server** today. For the cloud product, the same shapes need a cloud-side counterpart fed by the outbox. Two practical patterns: (1) cloud product calls cloud-hosted versions of these routes (against the aggregate), or (2) cloud product proxies authenticated requests back to the user's runtime. §3.8 Shape A vs. Shape B picks between them.
 ### Real-time channel
 ```
 GET  /api/sessions/{session_id}/events/stream      (SSE)
       ↳ filter=phase,task,llm_stream,tool,worker,trigger,health
 GET  /api/agents/queen/stream                      (SSE) — global queen events
 ```
 ### Status reads
 ```
 GET  /api/sessions/{session_id}                    — already shipping
 GET  /api/sessions/{session_id}/tree               — Queen → colonies → workers
 GET  /api/sessions/{session_id}/health             — stall_score, last_event_at, in_flight
 GET  /api/colonies/{colony_id}/workers             — health roll-up
 ```
 ### Usage reads
 ```
 GET  /api/sessions/{session_id}/usage              — tokens, cost, latency, tool-calls
 GET  /api/sessions/{session_id}/usage/by-model     — split by model
 GET  /api/colonies/{colony_id}/usage               — same shape, colony scope
 GET  /api/agents/queen/usage?range=...&group_by=... — rollup view (billing)
 ```
 ### Admin / business
 ```
 GET  /api/usage/rollup?range=...&group_by=user|tenant|model|colony
 POST /api/quotas/{tenant}                          — set caps (if quota work in scope)
 ```
 ---
 ## 7. Net-new work — sized in shirt-size, not days
 | Workstream | Local / Cloud | Size | Depends on | Notes |
 |---|---|---|---|---|
 | Event-bus → local SSE bridge ([sse.py](../../core/framework/server/sse.py) exists, route does not) | Local | **S** | — | Unlocks all real-time status surfaces in the desktop UI. Highest leverage piece. |
 | Persisted local event store (SQLite outbox) | Local | **M** | Decision §5 | One writer, append-only; reuse existing JSONL writer. Source of truth for cloud push. |
 | Local aggregation queries + `/usage` endpoints | Local | **M** | Persisted store | Per-session usage on disk. |
 | **Outbox transport (local → cloud)** | **Boundary** | **M–L** | Local store + auth | New work: durable queue, retry, redaction policy, opt-in switch, schema versioning. This is the bridge to the cloud product. |
 | **Cloud event ingest + aggregate store** | **Cloud** | **L** | Outbox transport | New cloud infra (Postgres/ClickHouse/BigQuery). Hosting, ops, retention policy, access controls. |
 | **Cloud-side status/usage API + dashboards** | **Cloud** | **M** | Cloud aggregate | Mirrors §6 endpoints against the cloud store; this is what business users actually see. |
 | Identity layer (user_id / tenant_id on events) | Boundary | **M** | Auth model | Currently no user identity in events. Identity attaches at outbox time, not at emit time. |
 | OpenTelemetry exporter (schema is ready) | Boundary | **S–M** | — | `trace_id`/`span_id` already populated; an OTel collector can be the cloud sink instead of a custom outbox. |
 | Quota / policy hooks | Cloud-authoritative | **L** | Cloud store + identity | Cloud holds the meter; runtime calls home synchronously on a critical path. |
 | Liveness/heartbeat (S7) | Local emit, cloud consume | **S** | Outbox | Runtime must actively post; cloud cannot infer liveness from absence. |
 | Cost attribution UI rollups | Cloud | **S** | `/usage` cloud endpoints | Shared with frontend doc. |
 **Critical path for first frontend release (local desktop UI):** SSE bridge → status endpoints (S1–S5) → per-session usage endpoint (U1, U2). Everything else is incremental.
 **Critical path for first cloud release (business ask):** local event store → outbox transport with redaction + opt-in → cloud ingest → cloud `/usage` and `/status` endpoints. The local UI work above is *not* a prerequisite for the cloud cut, but most of the local-side primitives (event store, durable-event filtering) are shared, so doing them in order minimizes rework.
 ---
 ## 8. Risks and tradeoffs the architect should weigh
 1. **Event volume.** `LLM_TEXT_DELTA` fires per token. A persisted store must filter — don't write deltas, write `LLM_TURN_COMPLETE`. This is the #1 way the table blows up.
 2. **Privacy / desktop posture — the central architectural constraint.** The runtime is local by default ([config.py:20-44](../../core/framework/config.py#L20-L44)). The data inventory in §3.8 confirms that **no data leaves the user's machine today**, including the data the business ask needs in the cloud. Closing that gap is not "add a metrics push" — it is a new system boundary with: (a) explicit user opt-in (defaults must be safe for OSS / self-hosted users), (b) a documented redaction list (no prompts, no tool args, no file paths in the default payload), (c) schema versioning so cloud aggregates do not break on runtime upgrades, (d) a clear answer for self-hosted / air-gapped deployments where the cloud sink is unreachable, (e) regional data-residency rules if the product is sold internationally. This is the single largest design decision in the document.
 3. **Cost-table accuracy.** `cost_usd` is computed from a static catalog. Using it for billing means committing to keeping the catalog current (or pulling from provider invoices). For *display*, the current approach is fine; for *charging*, it is not.
 4. **Identity coupling.** Events are session-scoped today. Adding `user_id`/`tenant_id` everywhere is invasive. Recommend pinning identity at the *session* boundary and joining on session at query time, rather than threading identity through every event payload.
 5. **Status vs. heartbeat semantics.** "Idle" is not "dead." A Queen sitting in `independent` waiting for a user message is healthy and should not page anyone. The stall-score in §5 must distinguish idle-by-design from stalled-by-bug — the existing `STREAM_INACTIVE` / `NODE_STALLED` events already make this distinction; preserve it.
 6. **Backpressure from observability.** If usage tracking sits in the LLM call path (for quotas), it must not add latency. Recommend: meter is async/eventual for display; only quota checks are synchronous, and only when the customer has a quota.
 7. **Worker-side gap.** Worker LLM calls are accounted in their own session's L1–L3 logs but are *not* automatically rolled into the parent Queen session. Cost attribution from Queen → spawned colony requires either (a) a parent_session_id field on the colony's session row, or (b) walking the `COLONY_CREATED` event graph at query time. (a) is cleaner.
 ---
 ## 9. Recommendation
 Ship in four thin slices. The first two are local-only and unblock the desktop UI; the last two are what actually deliver the business ask of cloud-visible status and metrics.
 1. **Slice 1 — Live local status (1 sprint, fully local).**
   SSE bridge + `/sessions/{id}/events/stream` + `/sessions/{id}/health` + `/sessions/{id}/tree`. Frontend (local UI) gets the right-rail and the agent-tree. No persistence work, no cloud. (S1–S5, S8.)
 2. **Slice 2 — Per-session local usage store (1–2 sprints, fully local).**
   Persisted event store (SQLite outbox at `HIVE_HOME/runtime.db`), filtered to durable event types only. `/sessions/{id}/usage` + `/colonies/{id}/usage`. No identity, no rollups, **no cloud transport yet**. This is the foundation the cloud slice rides on. (U1–U5.)
 3. **Slice 3 — Local→cloud outbox + cloud ingest (the cloud cut, scope-defining).**
   Durable outbox queue, redaction policy, opt-in toggle, identity attachment, schema versioning, retry/backoff. Cloud-side ingest service + aggregate store. **This is where the local-only world becomes a cloud product.** Architect must decide §3.8 Shape, §5 storage, redaction defaults, and identity model before this slice can start.
 4. **Slice 4 — Cloud rollups, dashboards, quotas (scope TBD with product).**
   Tenant aggregation, daily/monthly rollups, quota enforcement, OTel export, business dashboards. (U6–U8.) Defer until business confirms billing model — the answer (per-seat vs. per-token vs. per-colony) changes the data model.
 Slices 1 and 2 are mostly **wiring** — the events exist, the schemas exist, the storage paths exist. Slice 3 is the **first slice that introduces a new architectural boundary** (local→cloud transport + identity + privacy contract); everything novel about the business ask lives there. Slice 4 is **business design**, not engineering scope.
 ---
 ## 10. Open questions for the architect
 The first four are direct consequences of the local-first / cloud-required gap surfaced in §3.8 and §8.2.
 1. **Cloud transport shape — Shape A, B, or C from §3.8?** This decision is upstream of the entire data model. Recommend Shape B (outbox push) absent a strong privacy argument for Shape A.
 2. **Redaction default for the cloud payload.** What goes (model, token counts, latency, tool names, status) vs. what stays local (prompts, tool arguments, tool results, file paths, conversation content)? Need a written allowlist before Slice 3 starts.
 3. **Self-hosted / air-gapped users.** If the cloud sink is unreachable or disabled, what does the runtime do — buffer indefinitely, drop oldest, or refuse to start? Defaults differ for OSS vs. SaaS distributions.
 4. **Identity binding point.** Do we attach `user_id` / `tenant_id` at event-emit time (invasive, threads identity through every node), at session-create time (clean, requires session-level auth), or at outbox-flush time (simplest, but loses per-event provenance)? Recommend session-create.
 5. Do we need quota *enforcement*, or only quota *visibility* in v1?
 6. Frontend doc: are status and usage rendered in the same panel or different surfaces? This affects whether we ship one merged endpoint or two.
 7. Are we willing to pay the cost-table maintenance burden, or should "cost" stay labeled as estimated and not be used for invoicing?
 ---
 ## Appendix — Pointers
 - Queen lifecycle: [core/framework/agents/queen/nodes/__init__.py](../../core/framework/agents/queen/nodes/__init__.py)
 - Event bus + types: [core/framework/host/event_bus.py](../../core/framework/host/event_bus.py)
 - Runtime log schemas: [core/framework/tracker/runtime_log_schemas.py](../../core/framework/tracker/runtime_log_schemas.py)
 - Runtime log store: [core/framework/tracker/runtime_log_store.py](../../core/framework/tracker/runtime_log_store.py)
 - LLM accounting: [core/framework/llm/provider.py](../../core/framework/llm/provider.py), [model_catalog.py](../../core/framework/llm/model_catalog.py)
 - Colony progress DB: [core/framework/host/progress_db.py](../../core/framework/host/progress_db.py)
 - Task events: [core/framework/tasks/events.py](../../core/framework/tasks/events.py)
 - Session HTTP: [core/framework/server/routes_sessions.py](../../core/framework/server/routes_sessions.py)
 - SSE primitive: [core/framework/server/sse.py](../../core/framework/server/sse.py)
 - Worker health: [core/framework/tools/worker_monitoring_tools.py](../../core/framework/tools/worker_monitoring_tools.py)
 - Config / env vars: [core/framework/config.py](../../core/framework/config.py)