Compare commits

...

12 Commits

Author SHA1 Message Date
Richard Tang fe74718fd9 chore: lint 2026-05-04 17:57:56 -07:00
Richard Tang 07c97e2e9b feat: llm logging 2026-05-04 17:57:20 -07:00
Richard Tang 07600c5ab5 feat: encourage action plan prompts 2026-05-04 17:55:44 -07:00
Richard Tang e7d4ce0057 chore: lint 2026-05-04 12:36:28 -07:00
Richard Tang d9813288d9 fix: install system mcp when they fail 2026-05-04 12:35:21 -07:00
Richard Tang 41fbdcb940 fix(frontend): mcp tools server title format 2026-05-04 12:35:21 -07:00
Hundao 4a9b22719b fix(antigravity): unblock Gemini chats — schema sanitizer + UA bump (#7170)
* fix(antigravity): translate JSON Schema unions to Gemini nullable

Tool parameter schemas using JSON Schema 2020-12 unions like
"type": ["string", "null"] crash Gemini's function_declarations parser
with HTTP 400. Two existing tools trip this:

- core/framework/tasks/tools/colony_tools.py:52 (owner in _update_schema)
- core/framework/tasks/tools/session_tools.py:84-87 (same shape)

Add an adapter-level sanitizer that walks the schema tree and converts
union-with-null to OpenAPI 3.0 "nullable": true (which Gemini accepts).
Recurses into properties, items, additionalProperties, and the
anyOf/oneOf/allOf combinators. Source schemas remain valid JSON Schema
so OpenAI/Anthropic backends are unaffected.
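
The described translation can be sketched standalone (function name here is illustrative; the real adapter-level sanitizer also handles pure-null and multi-type unions):

```python
def sanitize_for_gemini(schema):
    """Illustrative sketch: convert ["X", "null"] unions to OpenAPI-style
    nullable, recursing into the keywords named above."""
    if isinstance(schema, list):
        return [sanitize_for_gemini(s) for s in schema]
    if not isinstance(schema, dict):
        return schema
    out = dict(schema)
    t = out.get("type")
    if isinstance(t, list) and "null" in t and len(t) == 2:
        out["type"] = next(x for x in t if x != "null")
        out["nullable"] = True
    if isinstance(out.get("properties"), dict):
        out["properties"] = {k: sanitize_for_gemini(v) for k, v in out["properties"].items()}
    if "items" in out:
        out["items"] = sanitize_for_gemini(out["items"])
    for comb in ("anyOf", "oneOf", "allOf"):
        if comb in out:
            out[comb] = sanitize_for_gemini(out[comb])
    return out

# The owner shape that crashed Gemini's parser:
print(sanitize_for_gemini({"type": ["string", "null"]}))
# → {'type': 'string', 'nullable': True}
```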

* fix(antigravity): bump spoofed UA past Google's deprecation cutoff

Google has deprecated client version "Antigravity/1.18.3" — chats now
return "This version of Antigravity is no longer supported" instead of
a real model response.

Bump the spoofed User-Agent to "Antigravity/1.23.2" + "Electron/39.2.3"
(current desktop release) and add a comment that this needs periodic
re-bumping. A more durable fix (auto-detect from the installed app's
Info.plist) is a follow-up.
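
A sketch of that proposed follow-up, using only the stdlib (the app path and helper name are assumptions; `CFBundleShortVersionString` is the standard macOS bundle-version key):

```python
import plistlib
from pathlib import Path

def detect_antigravity_version(app_path="/Applications/Antigravity.app"):
    """Read the installed app's version from Info.plist instead of
    hard-coding the spoofed UA string. Returns None when the app is not
    installed, so the caller can fall back to the pinned version."""
    plist = Path(app_path) / "Contents" / "Info.plist"
    if not plist.is_file():
        return None  # fall back to the pinned version in the base headers
    with plist.open("rb") as f:
        return plistlib.load(f).get("CFBundleShortVersionString")
```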

* fix(antigravity): fail loud on multi-type non-null Gemini schema unions

Per review on PR #7170: silently picking the first type from a union
like ["string", "integer", "null"] changes the contract for callers
that rely on the other types, and the failure is hard to diagnose at
the Gemini side. Replace the silent narrowing with a ValueError that
points the schema author at anyOf or a single type.

A repo scan finds no current Gemini-bound schemas using multi-type
non-null unions, so this branch is preventative for future authors.

* chore(antigravity): drop em dash from test docstring
2026-05-05 01:16:48 +08:00
Hundao 8cb0531959 fix(ci): unblock main CI, sort imports + install Playwright Chromium (#7172)
* fix(lint): organize imports in queen_orchestrator.create_queen

Ruff I001 blocks CI on every PR against main. The deferred imports
inside create_queen were not in alphabetical order between the queen
package and the framework package; ruff auto-fix moves
framework.config below the framework.agents.queen.nodes block.

No behavior change.

* fix(ci): install Playwright Chromium before Test Tools job

The new chart_tools smoke tests added in feabf327 require a Chromium
build for ECharts/Mermaid rendering, but the test-tools workflow only
ran `uv sync` and went straight to pytest. Three tests
(test_render_echarts_bar_chart, test_render_echarts_accepts_string_spec,
test_render_mermaid_flowchart) crash on every PR with:

    BrowserType.launch: Executable doesn't exist at
    /home/runner/.cache/ms-playwright/chromium_headless_shell-1208/...

Split the install/run into separate steps and add `playwright install
chromium` before pytest. Use `--with-deps` on Linux to pull system
libraries; Windows runners only need the browser binary.

* fix(tests): adapt test_file_state_cache to new file_ops API

The file_ops rewrite in feabf327 dropped the standalone hashline_edit
tool (the file_system_toolkits/hashline_edit/ directory was removed)
and switched edit_file to a mode-first signature
(mode, path, old_string, new_string, ...).

The test fixture still tried to look up "hashline_edit" via the MCP
tool manager and crashed with KeyError before any test could run, and
the edit_file calls were positional in the old order so they hit
"unknown mode 'e.py'" once the fixture was fixed.

Drop the stale hashline_edit lookup and pass mode="replace" explicitly
to every edit_file call. All 11 tests pass locally.
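
A hypothetical stub of the mode-first signature makes the failure mode concrete (tool internals are invented for illustration; only the signature shape and the "unknown mode" error come from the commit):

```python
VALID_MODES = {"replace", "patch"}  # the two modes named in this changeset

def edit_file(mode, path, old_string=None, new_string=None):
    """Stub with the described mode-first signature."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}")
    return (mode, path)

# Old positional order (path first) now lands in the mode slot:
try:
    edit_file("e.py", "old text", "new text")
except ValueError as e:
    print(e)  # unknown mode 'e.py'

# The fixed-up call, with mode passed explicitly:
edit_file("replace", "e.py", old_string="old text", new_string="new text")
```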

* fix(tests): skip terminal_tools tests on Windows (POSIX-only)

The new terminal_tools package added in feabf327 imports the Unix-only
`resource` module in tools/src/terminal_tools/common/limits.py to set
RLIMIT_CPU / RLIMIT_AS / RLIMIT_FSIZE on subprocesses. Five of the
six terminal_tools test files therefore crash on windows-latest with
`ModuleNotFoundError: No module named 'resource'` once their fixtures
trigger the import chain.

test_terminal_tools_pty.py already has the right module-level skip
(PTY is POSIX-only). Apply the same `pytestmark = skipif(win32)` to
the other five so the whole suite skips cleanly on Windows. The
terminal-tools package is bash-only by design (zsh refused at the
shell-resolver level), so a Windows port is out of scope.
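
A stdlib-only sketch of the skip decision (the actual fix is the module-level `pytestmark = pytest.mark.skipif(...)` guard the commit describes; the helper name below is hypothetical):

```python
import importlib.util
import sys

# The limits module does `import resource`, which exists only on POSIX.
HAS_RESOURCE = importlib.util.find_spec("resource") is not None

def should_skip_terminal_tools_tests() -> bool:
    """True on platforms where the terminal_tools import chain would crash
    with ModuleNotFoundError, mirroring the module-level skipif guard."""
    return sys.platform == "win32" or not HAS_RESOURCE
```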
2026-05-05 00:32:59 +08:00
Richard Tang feabf32768 fix: worker context token 2026-05-03 11:45:37 -07:00
Richard Tang eee55ea8c7 chore: fix wrong model name 2026-05-03 11:35:05 -07:00
Richard Tang 78fffa63ec chore: ci and release doc 2026-05-01 18:06:39 -07:00
Richard Tang 9a75d45351 chore: lint 2026-05-01 17:53:44 -07:00
38 changed files with 1005 additions and 1266 deletions
+16 -4
View File
@@ -84,11 +84,23 @@ jobs:
with:
enable-cache: true
- name: Install dependencies and run tests
- name: Install dependencies
working-directory: tools
run: |
uv sync --extra dev
uv run pytest tests/ -v
run: uv sync --extra dev
- name: Install Playwright Chromium (Linux)
if: runner.os == 'Linux'
working-directory: tools
run: uv run playwright install --with-deps chromium
- name: Install Playwright Chromium (Windows)
if: runner.os == 'Windows'
working-directory: tools
run: uv run playwright install chromium
- name: Run tests
working-directory: tools
run: uv run pytest tests/ -v
validate:
name: Validate Agent Exports
+2 -2
View File
@@ -407,7 +407,7 @@ Aden Hive supports **100+ LLM providers** via LiteLLM, giving users maximum flex
| **Anthropic** | Claude 3.5 Sonnet, Haiku, Opus | Default provider, best for reasoning |
| **OpenAI** | GPT-4, GPT-4 Turbo, GPT-4o | Function calling, vision |
| **OpenRouter** | Any OpenRouter catalog model | Uses `OPENROUTER_API_KEY` and `https://openrouter.ai/api/v1` |
| **Hive LLM** | `queen`, `kimi-2.5`, `GLM-5` | Uses `HIVE_API_KEY` and the Hive-managed endpoint |
| **Hive LLM** | `queen`, `kimi-k2.5`, `GLM-5` | Uses `HIVE_API_KEY` and the Hive-managed endpoint |
| **Google** | Gemini 1.5 Pro, Flash | Long context windows |
| **DeepSeek** | DeepSeek V3 | Cost-effective, strong reasoning |
| **Mistral** | Mistral Large, Medium, Small | Open weights, EU hosting |
@@ -435,7 +435,7 @@ DEFAULT_MODEL = "claude-haiku-4-5-20251001"
**Provider-Specific Notes**
- **OpenRouter**: store `provider` as `openrouter`, use the raw OpenRouter model ID in `model` (for example `x-ai/grok-4.20-beta`), and use `OPENROUTER_API_KEY`
- **Hive LLM**: store `provider` as `hive`, use Hive model names such as `queen`, `kimi-2.5`, or `GLM-5`, and use `HIVE_API_KEY`
- **Hive LLM**: store `provider` as `hive`, use Hive model names such as `queen`, `kimi-k2.5`, or `GLM-5`, and use `HIVE_API_KEY`
**For Development**
- Use cheaper/faster models (Haiku, GPT-4o-mini)
+19 -23
View File
@@ -240,19 +240,15 @@ See "Independent execution" for the per-step flow and granularity rule.
## File I/O (files-tools MCP)
- read_file, write_file, edit_file, search_files
- edit_file covers single-file fuzzy find/replace (mode='replace', default) \
- edit_file covers single-file fuzzy find/replace (mode='replace', default) \
and multi-file structured patches (mode='patch'). Patch mode supports \
Update / Add / Delete / Move atomically across many files in one call.
- search_files covers grep/find/ls in one tool: target='content' to \
- search_files covers grep/find/ls in one tool: target='content' to \
search inside files, target='files' (with a glob like '*.py') to list \
or find files. Mtime-sorted in files mode.
or find files.
## Browser Automation (gcu-tools MCP)
- Use `browser_*` tools `browser_open(url)` is the cold-start entry point \
(lazy-creates the context; no separate "start" call). Then `browser_navigate`, \
`browser_click`, `browser_type`, `browser_snapshot`, \
<!-- vision-only -->`browser_screenshot`, <!-- /vision-only -->`browser_scroll`, \
`browser_tabs`, `browser_close`, `browser_evaluate`, etc.
- Use `browser_*` tools. `browser_open(url)` is the cold-start entry point
- MUST Follow the browser-automation skill protocol before using browser tools.
## Hand off to a colony
@@ -261,9 +257,7 @@ or find files. Mtime-sorted in files mode.
chat. It does NOT fork on its own; it spawns a one-shot evaluator \
that reads this conversation and decides whether the spec is settled \
enough to proceed. On approval your phase flips to INCUBATING and a \
new tool surface (including create_colony itself) unlocks. On \
rejection you stay here and keep the conversation going to fill the \
gaps the evaluator named.
new tool surface (including create_colony itself) unlocks.
"""
_queen_tools_incubating = """
@@ -411,17 +405,19 @@ asks for specifics. Do not invent a new pass unless the user asks for one.
_queen_behavior_independent = """
## Independent execution
You are the agent. **For multi-step work (2+ atomic actions): call \
`task_create_batch`** with one entry per atomic action, \
before you touch any other tool. \
Then work the list one task at a time:
You are the agent. You behave this way:
1. Identify if the user's prompt is a task assignment. If it is, \
use ask_user to clarify the scope and detail requirements, then always use \
the `task_create_batch` tool to create a multi-step action plan.
1. `task_update` in_progress before you start the step.
2. Do one real inline instance open the browser, call the real API, \
2. `task_update` in_progress before you start the step.
3. Do one real inline instance - either open the browser, call the real API, \
write to the real file. If the action is irreversible or touches \
shared systems, show and confirm before executing. Report concrete \
evidence (actual output, what worked / failed) after the run.
3. `task_update` completed THE MOMENT it's done. **Do not let \
4. `task_update` completed THE MOMENT it's done. **Do not let \
multiple finished tasks pile up unmarked.** There is no batch update \
tool by design; each `completed` transition is a discrete progress \
heartbeat in the user's right-rail panel. Without those transitions \
@@ -430,14 +426,14 @@ done.
**Granularity: one task per atomic action, not one umbrella per project.** \
Once finishing all current tasks, discuss with user about building \
a colony so this sucess can be repeated or scaled
Once finishing a current task, discuss with user about building \
a colony so this success outcome can be repeated or scaled
### How to handle large scale tasks
If the user ask you to finish the same task repeatly or at large scale \
(more than 10 times), tell the user that you can do it once first then \
If the user ask you to finish the same task repeatedly or at large scale \
(more than 3 times), tell the user that you can do it once first then \
build a colony to fulfill the request but succeeding it once will be \
beneficial to run it in the future, \
beneficial to transfer it to a swarm of workers (through start_incubating_colony), \
then focus on finishing the task once first.
### How to handle a simple task (fewer than 2 atomic items)
@@ -129,10 +129,7 @@ _TOOL_CATEGORIES: dict[str, list[str]] = {
# Research — paper search, Wikipedia, ad-hoc web scrape. Pair with
# browser_basic for richer site-by-site research; this category is the
# lightweight always-available fallback.
"research": [
"web_scrape",
"pdf_read"
],
"research": ["web_scrape", "pdf_read"],
# Security — defensive scanning and reconnaissance. Engineering-only
# surface; the rest of the queens shouldn't see port scanners.
"security": [
+61 -7
View File
@@ -61,10 +61,12 @@ _IDE_STATE_DB_KEY = "antigravityUnifiedStateSync.oauthToken"
_BASE_HEADERS: dict[str, str] = {
# Mimic the Antigravity Electron app so the API accepts the request.
# Google deprecates older client versions over time, so this needs periodic
# bumping to match whatever the current Antigravity desktop release advertises.
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
"(KHTML, like Gecko) Antigravity/1.18.3 Chrome/138.0.7204.235 "
"Electron/37.3.1 Safari/537.36"
"(KHTML, like Gecko) Antigravity/1.23.2 Chrome/138.0.7204.235 "
"Electron/39.2.3 Safari/537.36"
),
"X-Goog-Api-Client": "google-cloud-sdk vscode_cloudshelleditor/0.1",
"Client-Metadata": '{"ideType":"ANTIGRAVITY","platform":"MACOS","pluginType":"GEMINI"}',
@@ -254,6 +256,56 @@ def _clean_tool_name(name: str) -> str:
return name[:64]
def _sanitize_schema_for_gemini(schema: Any) -> Any:
    """Convert JSON Schema 2020-12 features to the OpenAPI 3.0 dialect Gemini accepts.

    Gemini's function_declarations parser rejects union ``"type": ["string", "null"]``.
    Translate any such union to a single type plus ``"nullable": true``. Recurse into
    ``properties``, ``items``, and the ``anyOf``/``oneOf``/``allOf`` combinators.
    """
    if isinstance(schema, list):
        return [_sanitize_schema_for_gemini(s) for s in schema]
    if not isinstance(schema, dict):
        return schema
    out = dict(schema)
    t = out.get("type")
    if isinstance(t, list):
        non_null = [x for x in t if x != "null"]
        has_null = "null" in t
        if len(non_null) == 1:
            out["type"] = non_null[0]
            if has_null:
                out["nullable"] = True
        elif not non_null and has_null:
            # Pure null type: fall back to string-nullable.
            out["type"] = "string"
            out["nullable"] = True
        else:
            # Multi-type non-null unions (e.g. ["string", "integer", "null"])
            # have no faithful Gemini equivalent. Silently picking one type
            # changes the contract for callers who rely on the others, so
            # fail loud and let the schema author rewrite it as anyOf or
            # narrow to a single type.
            raise ValueError(
                f"Unsupported Gemini schema union: {t!r}. "
                "Gemini accepts a single primitive type plus optional 'nullable: true'. "
                "Rewrite as anyOf or pick a single type."
            )
    if "properties" in out and isinstance(out["properties"], dict):
        out["properties"] = {k: _sanitize_schema_for_gemini(v) for k, v in out["properties"].items()}
    if "items" in out:
        out["items"] = _sanitize_schema_for_gemini(out["items"])
    if "additionalProperties" in out and isinstance(out["additionalProperties"], dict):
        out["additionalProperties"] = _sanitize_schema_for_gemini(out["additionalProperties"])
    for combinator in ("anyOf", "oneOf", "allOf"):
        if combinator in out:
            out[combinator] = _sanitize_schema_for_gemini(out[combinator])
    return out
def _to_gemini_contents(
messages: list[dict[str, Any]],
thought_sigs: dict[str, str] | None = None,
@@ -555,11 +607,13 @@ class AntigravityProvider(LLMProvider):
{
"name": _clean_tool_name(t.name),
"description": t.description,
"parameters": t.parameters
or {
"type": "object",
"properties": {},
},
"parameters": _sanitize_schema_for_gemini(
t.parameters
or {
"type": "object",
"properties": {},
}
),
}
for t in tools
]
-4
View File
@@ -2346,10 +2346,6 @@ class LiteLLMProvider(LLMProvider):
kwargs["extra_body"]["store"] = False
request_summary = _summarize_request_for_log(kwargs)
logger.debug(
"[stream] prepared request: %s",
json.dumps(request_summary, default=str),
)
if request_summary["system_only"]:
logger.warning(
"[stream] %s request has no non-system chat messages "
+3 -3
View File
@@ -326,7 +326,7 @@
"supports_vision": false
},
{
"id": "kimi-2.5",
"id": "kimi-k2.5",
"label": "Kimi 2.5 - Via Hive",
"recommended": false,
"max_tokens": 32768,
@@ -489,8 +489,8 @@
"recommended": true
},
{
"id": "kimi-2.5",
"label": "kimi-2.5",
"id": "kimi-k2.5",
"label": "kimi-k2.5",
"recommended": false
},
{
+117 -2
View File
@@ -52,11 +52,11 @@ _DEFAULT_LOCAL_SERVERS: dict[str, dict[str, Any]] = {
"args": ["run", "python", "files_server.py", "--stdio"],
},
"terminal-tools": {
"description": "Terminal capabilities: process exec, background jobs, PTY sessions, fs search. Bash-only on POSIX.",
"description": "Terminal capabilities",
"args": ["run", "python", "terminal_tools_server.py", "--stdio"],
},
"chart-tools": {
"description": "BI/financial chart + diagram rendering: ECharts, Mermaid. Returns spec + downloadable PNG; chat embeds live.",
"description": "BI/financial chart + diagram rendering: ECharts, Mermaid",
"args": ["run", "python", "chart_tools_server.py", "--stdio"],
},
}
@@ -137,6 +137,13 @@ class MCPRegistry:
Skips entirely when the source-tree ``tools/`` directory cannot
be located (e.g. wheel installs). Returns the list of names that
were newly registered.
Also runs a self-heal pass over already-registered defaults: if an
entry's stdio cwd is unreachable on this machine (e.g. the registry
was copied from another developer's box and points at their
``/Users/<them>/...`` path), the entry is overwritten with the
canonical config so the queen can actually spawn it. The user's
``enabled`` toggle and ``overrides`` are preserved.
"""
# parents: [0]=loader, [1]=framework, [2]=core, [3]=repo root
tools_dir = Path(__file__).resolve().parents[3] / "tools"
@@ -165,8 +172,31 @@ class MCPRegistry:
)
del existing[stale]
mutated = True
repaired: list[str] = []
for name, spec in _DEFAULT_LOCAL_SERVERS.items():
entry = existing.get(name)
if entry is None:
continue
if self._default_entry_runnable(entry, tools_dir, list(spec["args"])):
continue
existing[name] = self._build_default_entry(
name=name,
spec=spec,
cwd=cwd,
preserve_from=entry,
)
repaired.append(name)
mutated = True
if mutated:
self._write_installed(data)
if repaired:
logger.warning(
"MCPRegistry._seed_defaults: repaired %d default server(s) with unreachable cwd/script: %s",
len(repaired),
repaired,
)
for name, spec in _DEFAULT_LOCAL_SERVERS.items():
if name in existing:
@@ -188,6 +218,91 @@ class MCPRegistry:
logger.info("MCPRegistry: seeded default local servers: %s", added)
return added
@staticmethod
def _default_entry_runnable(entry: dict, tools_dir: Path, canonical_args: list[str]) -> bool:
"""Return True iff ``entry`` can plausibly be spawned on this machine.
Checks:
- transport is stdio (only stdio defaults exist today; non-stdio
gets a free pass since we have nothing to compare against)
- stdio.cwd is an existing directory
- the entry script (the first ``.py`` arg, e.g. ``files_server.py``)
exists relative to that cwd
We deliberately do NOT spawn the subprocess here; this runs on
every read path and must be cheap. A filesystem reachability
check catches the cross-machine `cwd` drift that is the common
failure, without flapping on transient runtime errors.
"""
transport = entry.get("transport") or "stdio"
if transport != "stdio":
return True
manifest = entry.get("manifest") or {}
stdio = manifest.get("stdio") or {}
cwd_str = stdio.get("cwd")
if not cwd_str:
return False
cwd_path = Path(cwd_str)
if not cwd_path.is_dir():
return False
# Find the script: the first arg ending in .py, falling back to the
# canonical spec if the registered args are unrecognizable. Modules
# invoked via `python -m foo.bar` (no .py arg) are accepted as long
# as the cwd exists — we can't cheaply prove the module imports.
registered_args = stdio.get("args") or []
script: str | None = next(
(a for a in registered_args if isinstance(a, str) and a.endswith(".py")),
None,
)
if script is None:
script = next(
(a for a in canonical_args if isinstance(a, str) and a.endswith(".py")),
None,
)
if script is None:
return True
return (cwd_path / script).is_file()
@classmethod
def _build_default_entry(
cls,
*,
name: str,
spec: dict[str, Any],
cwd: str,
preserve_from: dict | None,
) -> dict:
"""Construct a fresh canonical entry for a default server.
When ``preserve_from`` is provided, carries over the user's
``enabled`` flag and ``overrides`` so a deliberate disable or
custom env var survives the repair.
"""
manifest = {
"name": name,
"description": spec["description"],
"transport": {"supported": ["stdio"], "default": "stdio"},
"stdio": {
"command": "uv",
"args": list(spec["args"]),
"env": {},
"cwd": cwd,
},
}
entry = cls._make_entry(
source="local",
manifest=manifest,
transport="stdio",
installed_by="hive mcp init (auto-repair)",
)
if preserve_from is not None:
if "enabled" in preserve_from:
entry["enabled"] = bool(preserve_from["enabled"])
prior_overrides = preserve_from.get("overrides")
if isinstance(prior_overrides, dict):
entry["overrides"] = prior_overrides
return entry
# ── Internal I/O ────────────────────────────────────────────────
def _read_installed(self) -> dict:
+7 -1
View File
@@ -377,6 +377,7 @@ async def create_queen(
_queen_tools_working,
finalize_queen_prompt,
)
from framework.config import get_max_tokens as _get_max_tokens
from framework.host.event_bus import AgentEvent, EventType
from framework.llm.capabilities import supports_image_tool_results
from framework.loader.mcp_registry import MCPRegistry
@@ -982,7 +983,12 @@ async def create_queen(
llm=session.llm,
available_tools=queen_tools,
goal_context=queen_goal.to_prompt_context(),
max_tokens=lc.get("max_tokens", 8192),
# Honor configuration.json (llm.max_tokens) instead of
# hard-defaulting to 8192. The legacy fallback ignored both
# the user's saved ceiling AND the model's actual output
# capacity (e.g. glm-5.1 / kimi-k2.5 both support 32k out),
# which silently truncated long tool-emitting turns.
max_tokens=lc.get("max_tokens", _get_max_tokens()),
stream_id="queen",
execution_id=session.id,
dynamic_tools_provider=phase_state.get_current_tools,
+5 -2
View File
@@ -19,7 +19,7 @@ from datetime import datetime
from pathlib import Path
from typing import Any, Literal
from framework.config import QUEENS_DIR
from framework.config import QUEENS_DIR, get_max_tokens
from framework.host.triggers import TriggerDefinition
logger = logging.getLogger(__name__)
@@ -700,7 +700,10 @@ class SessionManager:
available_tools=all_tools,
goal_context=goal.to_prompt_context(),
goal=goal,
max_tokens=8192,
# Worker output cap — pull from configuration.json instead of
# hard-coding 8192. glm-5.1/kimi-k2.5 both support 32k out, and
# capping at 8k silently truncates long worker turns mid-tool.
max_tokens=get_max_tokens(),
stream_id=worker_name,
execution_id=worker_name,
identity_prompt=worker_data.get("identity_prompt", ""),
+85 -23
View File
@@ -1,11 +1,22 @@
"""Write every LLM turn to ~/.hive/llm_logs/<ts>.jsonl for replay/debugging.
Each line is a JSON object with the full LLM turn: the request payload
(system prompt + messages), assistant text, tool calls, tool results, and
token counts. The file is opened lazily on first call and flushed after every
write. Errors are silently swallowed; this must never break the agent.
Two record kinds, distinguished by ``_kind``:

* ``session_header``: emitted on the first turn of an ``execution_id`` and
any time its ``system_prompt`` or ``tools`` change. Carries those large
fields once instead of per-turn.
* ``turn``: one per LLM call. Carries per-turn outputs plus a
content-addressed message delta: ``message_hashes`` is the full ordered
message sequence for this turn, ``new_messages`` maps hash to body for
messages we haven't emitted before for this ``execution_id``. The reader
reassembles full ``messages`` by accumulating ``new_messages`` across
prior turn records. Content-addressed (not positional) because the agent
prunes messages mid-session; a tail-delta would be wrong.

Errors are silently swallowed; this must never break the agent.
"""
import hashlib
import json
import logging
import os
@@ -28,6 +39,12 @@ def _llm_debug_dir() -> Path:
_log_file: IO[str] | None = None
_log_ready = False # lazy init guard
# Per-execution_id delta state. Reset implicitly on process restart — a fresh
# log file has no prior context, so re-emitting the header on first turn is
# correct.
_session_header_hash: dict[str, str] = {}
_session_seen_msgs: dict[str, set[str]] = {}
def _open_log() -> IO[str] | None:
"""Open the JSONL log file for this process."""
@@ -61,6 +78,17 @@ def _serialize_tools(tools: Any) -> list[dict[str, Any]]:
return out
def _content_hash(payload: Any) -> str:
raw = json.dumps(payload, default=str, sort_keys=True, ensure_ascii=False)
return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
def _write_line(record: dict[str, Any]) -> None:
assert _log_file is not None
_log_file.write(json.dumps(record, default=str) + "\n")
_log_file.flush()
def log_llm_turn(
*,
node_id: str,
@@ -75,7 +103,7 @@ def log_llm_turn(
token_counts: dict[str, Any],
tools: list[Any] | None = None,
) -> None:
"""Write one JSONL line capturing a complete LLM turn.
"""Write JSONL records capturing one LLM turn (header + turn delta).
Never raises.
"""
@@ -89,23 +117,57 @@ def log_llm_turn(
_log_ready = True
if _log_file is None:
return
record = {
# UTC + offset matches tool_call start_timestamp (agent_loop.py)
# so the viewer can render every event in one consistent local zone.
"timestamp": datetime.now(UTC).isoformat(),
"node_id": node_id,
"stream_id": stream_id,
"execution_id": execution_id,
"iteration": iteration,
"system_prompt": system_prompt,
"tools": _serialize_tools(tools),
"messages": messages,
"assistant_text": assistant_text,
"tool_calls": tool_calls,
"tool_results": tool_results,
"token_counts": token_counts,
}
_log_file.write(json.dumps(record, default=str) + "\n")
_log_file.flush()
# UTC + offset matches tool_call start_timestamp (agent_loop.py)
# so the viewer can render every event in one consistent local zone.
timestamp = datetime.now(UTC).isoformat()
serialized_tools = _serialize_tools(tools)
# Re-emit the header on first turn or whenever system/tools change.
# The Queen reflects different prompts across turns, so we can't
# assume strict immutability per execution_id.
header_hash = _content_hash({"system_prompt": system_prompt, "tools": serialized_tools})
if _session_header_hash.get(execution_id) != header_hash:
_write_line(
{
"_kind": "session_header",
"timestamp": timestamp,
"execution_id": execution_id,
"node_id": node_id,
"stream_id": stream_id,
"header_hash": header_hash,
"system_prompt": system_prompt,
"tools": serialized_tools,
}
)
_session_header_hash[execution_id] = header_hash
seen = _session_seen_msgs.setdefault(execution_id, set())
message_hashes: list[str] = []
new_messages: dict[str, dict[str, Any]] = {}
for msg in messages or []:
h = _content_hash(msg)
message_hashes.append(h)
if h not in seen:
seen.add(h)
new_messages[h] = msg
_write_line(
{
"_kind": "turn",
"timestamp": timestamp,
"execution_id": execution_id,
"node_id": node_id,
"stream_id": stream_id,
"iteration": iteration,
"header_hash": header_hash,
"message_hashes": message_hashes,
"new_messages": new_messages,
"assistant_text": assistant_text,
"tool_calls": tool_calls,
"tool_results": tool_results,
"token_counts": token_counts,
}
)
except Exception:
pass # never break the agent
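
The delta format described in the docstring above implies a small reader. A minimal sketch (function name and record handling assumed from the docstring, not from shipped reader code):

```python
import json

def replay_messages(jsonl_lines, execution_id):
    """Reassemble each turn's full message list from delta records.

    Accumulates ``new_messages`` (hash -> body) across turn records for one
    ``execution_id``, then resolves every turn's ``message_hashes`` against
    that store. Works even when the agent prunes messages mid-session,
    because lookups are by content hash, not position.
    """
    bodies: dict[str, dict] = {}
    turns = []
    for line in jsonl_lines:
        rec = json.loads(line)
        if rec.get("execution_id") != execution_id or rec.get("_kind") != "turn":
            continue
        bodies.update(rec.get("new_messages", {}))
        turns.append([bodies[h] for h in rec.get("message_hashes", [])])
    return turns
```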
+1 -1
View File
@@ -80,7 +80,7 @@ function buildGroups(
title = formatCategoryTitle(cat);
} else if (srv.name && srv.name !== "(unknown)") {
key = `srv:${srv.name}`;
title = srv.name;
title = formatCategoryTitle(srv.name);
} else {
key = "other";
title = "Other tools";
+15
View File
@@ -60,6 +60,21 @@ _HIVE_PATH_NAMES = (
)
@pytest.fixture(autouse=True)
def _no_seed_mcp_defaults(monkeypatch):
"""Skip bundled-server seeding in MCPRegistry.initialize() for tests.
Production wants ``initialize()`` to seed ``hive_tools`` / ``gcu-tools``
/ ``files-tools`` / ``terminal-tools`` / ``chart-tools`` so a fresh
HIVE_HOME comes up with working defaults. Tests want a deterministic
empty registry every assertion about counts, "no servers installed"
output, or first-element identity breaks otherwise. Patching here
keeps the production API clean and avoids a test-only flag on
``initialize()``.
"""
monkeypatch.setattr(_mcp_registry.MCPRegistry, "_seed_defaults", lambda self: [])
@pytest.fixture(autouse=True)
def _isolate_hive_home_autouse(tmp_path, monkeypatch):
"""Per-test isolation of ``~/.hive`` to ``tmp_path/.hive``.
+73
View File
@@ -0,0 +1,73 @@
"""Tests for the Antigravity Gemini schema sanitizer.
Run with:
cd core
pytest tests/test_antigravity_schema.py -v
"""
import pytest
from framework.llm.antigravity import _sanitize_schema_for_gemini
def test_union_with_null_becomes_nullable():
assert _sanitize_schema_for_gemini({"type": ["string", "null"]}) == {
"type": "string",
"nullable": True,
}
def test_plain_schema_passthrough():
assert _sanitize_schema_for_gemini({"type": "string"}) == {"type": "string"}
def test_recurses_into_properties():
out = _sanitize_schema_for_gemini(
{
"type": "object",
"properties": {
"id": {"type": "integer"},
"owner": {"type": ["string", "null"]},
},
"required": ["id"],
}
)
assert out["properties"]["id"] == {"type": "integer"}
assert out["properties"]["owner"] == {"type": "string", "nullable": True}
assert out["required"] == ["id"]
def test_recurses_into_items():
assert _sanitize_schema_for_gemini({"type": "array", "items": {"type": ["integer", "null"]}}) == {
"type": "array",
"items": {"type": "integer", "nullable": True},
}
def test_recurses_into_combinators():
assert _sanitize_schema_for_gemini({"anyOf": [{"type": ["string", "null"]}, {"type": "integer"}]}) == {
"anyOf": [{"type": "string", "nullable": True}, {"type": "integer"}]
}
def test_does_not_mutate_input():
schema = {"type": "object", "properties": {"x": {"type": ["string", "null"]}}}
snapshot = {"type": "object", "properties": {"x": {"type": ["string", "null"]}}}
_sanitize_schema_for_gemini(schema)
assert schema == snapshot
def test_pure_null_type_falls_back_to_string():
assert _sanitize_schema_for_gemini({"type": ["null"]}) == {
"type": "string",
"nullable": True,
}
def test_multi_type_non_null_union_raises():
"""Silently picking one type would change the contract; fail loud instead."""
with pytest.raises(ValueError, match="Unsupported Gemini schema union"):
_sanitize_schema_for_gemini({"type": ["string", "integer", "null"]})
with pytest.raises(ValueError, match="Unsupported Gemini schema union"):
_sanitize_schema_for_gemini({"type": ["string", "integer"]})
+1 -1
View File
@@ -889,7 +889,7 @@ def test_concurrency_safe_allowlist_is_conservative():
allowlist = ToolRegistry.CONCURRENCY_SAFE_TOOLS
# Positive assertions: known-safe read operations are present.
for name in ("read_file", "grep", "glob", "search_files", "web_search"):
for name in ("read_file", "terminal_rg", "terminal_find", "search_files", "web_scrape"):
assert name in allowlist, f"{name} should be concurrency-safe"
# Negative assertions: nothing that mutates state is allowed in.
@@ -0,0 +1,331 @@
# Agent Usage & Status Tracking — Capability Document
**Audience:** Lead software architect (paired with frontend + business requirement docs)
**Scope:** Queen agent first (default local-runtime entry point), then downstream colonies/workers
**Status:** Capability inventory + proposal — no implementation commitments
**Date:** 2026-05-04
---
## 1. Why this document exists
We have a business need to track **agent usage** (what was consumed: tokens, cost, runtime, calls) and **agent status** (what state agents are in: alive, phase, progress, blocked) starting from the Queen agent, and surface this **on the cloud** for product and business consumers. This document inventories the capabilities the runtime can expose **today** vs. what is **net-new**, so architecture can pick a scope before frontend and product write against it.
The Queen is the right anchor: every local-runtime session today starts with a Queen, and every colony/worker is forked from a Queen call — so a tracking surface rooted at the Queen automatically covers the whole agent tree.
> **Headline constraint for the architect:** the runtime is **local-by-default**. Every byte described in §3 — events, runtime logs, progress DBs, session state, LLM cost numbers — is written to the user's machine under `~/.hive/` (or the platform-specific Electron `userData` directory). **Nothing is shipped to the cloud today.** The business ask therefore implies a new local→cloud transport boundary, with the data-residency, privacy, and identity decisions that come with it. §4.5 makes the gap explicit per-surface; §8 lists the risks; §9 frames the cloud cut-over as the gating decision for any "Slice 2+" work.
---
## 2. Vocabulary — what we actually mean
| Term | Definition in this codebase |
|---|---|
| **Session** | One Queen runtime instance. ID: `session_{YYYYMMDD_HHMMSS}_{uuid8}`. Persisted at `~/.hive/sessions/{session_id}/`. |
| **Queen** | Long-lived conversational agent. One per session. Single event-loop node. Phases: `independent → incubating → working → reviewing`. See [queen/nodes/__init__.py:494-518](../../core/framework/agents/queen/nodes/__init__.py#L494-L518). |
| **Colony** | Persistent stateless container forked by `create_colony`. Has its own SQLite progress DB at `~/.hive/colonies/{colony_name}/data/progress.db`. |
| **Worker** | Ephemeral agent running inside a colony to execute a task. |
| **Run / execution** | One trigger-to-completion invocation inside a node. Carries `run_id`, `execution_id`, `trace_id` (OTel-aligned). |
| **Usage** | Quantitative consumption: input/output/cached tokens, USD cost, wall-clock latency, tool-call counts. |
| **Status** | Qualitative state: phase, alive/stalled, current task, blocked-on, queue depth, last-heartbeat. |
---
## 3. What exists today (capabilities, not commitments)
The runtime is already heavily instrumented. Most of what business wants is **already emitted** — the gap is persistence, aggregation, and a stable API surface.
### 3.1 Event Bus — the spine
[core/framework/host/event_bus.py:61-177](../../core/framework/host/event_bus.py#L61-L177) defines an in-process async pub/sub with **40+ event types** scoped by `stream_id`, `session_id`, `colony_id`, `execution_id`, `run_id`, `correlation_id`, `timestamp`.
Relevant for usage/status:
- **Lifecycle:** `EXECUTION_STARTED/COMPLETED/FAILED/PAUSED/RESUMED/RESURRECTED`
- **Queen:** `QUEEN_PHASE_CHANGED`, `QUEEN_IDENTITY_SELECTED`
- **Colony/Worker:** `COLONY_CREATED`, `WORKER_COLONY_LOADED`, `WORKER_COMPLETED`, `WORKER_FAILED`, `SUBAGENT_REPORT`
- **LLM:** `LLM_TURN_COMPLETE`, `LLM_TEXT_DELTA`, `LLM_REASONING_DELTA`, `CONTEXT_USAGE_UPDATED`
- **Tools:** `TOOL_CALL_STARTED`, `TOOL_CALL_COMPLETED`, `TOOL_CALL_REPLAY_DETECTED`
- **Health:** `NODE_STALLED`, `NODE_TOOL_DOOM_LOOP`, `STREAM_TTFT_EXCEEDED`, `STREAM_INACTIVE`, `STREAM_NUDGE_SENT`
- **Tasks (right-rail panel):** `TASK_CREATED`, `TASK_UPDATED`, `TASK_DELETED`, `TASK_LIST_RESET`
- **Triggers:** `TRIGGER_AVAILABLE/ACTIVATED/DEACTIVATED/FIRED/REMOVED/UPDATED`
> Persistence today: in-memory only, **plus** optional JSONL export when `HIVE_DEBUG_EVENTS=1` ([event_bus.py:33-54](../../core/framework/host/event_bus.py#L33-L54)). There is no SQL events table.
### 3.2 Three-level runtime logs (per session)
[core/framework/tracker/runtime_log_schemas.py](../../core/framework/tracker/runtime_log_schemas.py) defines:
| Level | Schema | File | Granularity |
|---|---|---|---|
| L1 | `RunSummaryLog` | `summary.json` | Per graph run — totals + execution_quality + trace_id |
| L2 | `NodeDetail` | `details.jsonl` | Per node — exit_status, input/output tokens, latency_ms, retry/accept/escalate/continue counts |
| L3 | `NodeStepLog` | `tool_logs.jsonl` | Per LLM step — tool calls, verdicts, error traces, latency_ms |
Storage: `~/.hive/sessions/{session_id}/logs/` ([runtime_log_store.py](../../core/framework/tracker/runtime_log_store.py)). Schemas already carry OTel fields (`trace_id`, `span_id`, `parent_span_id`) — wire-ready, not yet exported.
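As a sketch of how little is missing for a per-session usage read, the totals can already be computed straight off the on-disk L2 log. The JSON key names below (`input_tokens`, `output_tokens`, `latency_ms`) are assumptions about the `NodeDetail` schema, not confirmed field names:

```python
import json
from pathlib import Path

def session_totals(session_dir: Path) -> dict:
    """Sum token and latency fields out of a session's L2 details.jsonl.

    Key names are assumed to match the NodeDetail schema; adjust them
    if the real serialized keys differ.
    """
    totals = {"input_tokens": 0, "output_tokens": 0, "latency_ms": 0}
    details = session_dir / "logs" / "details.jsonl"
    if not details.exists():
        return totals
    for line in details.read_text().splitlines():
        if line.strip():
            row = json.loads(line)
            for key in totals:
                # Missing or null fields count as zero.
                totals[key] += row.get(key, 0) or 0
    return totals
```

A `/usage` endpoint is essentially this loop plus caching.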
### 3.3 LLM call accounting
[core/framework/llm/provider.py:11-32](../../core/framework/llm/provider.py#L11-L32) — `LLMResponse` carries: `model`, `input_tokens`, `output_tokens`, `cached_tokens`, `cache_creation_tokens`, `cost_usd`, `stop_reason`. Cost is computed from [model_catalog.py](../../core/framework/llm/model_catalog.py) when the model is priced; otherwise `0.0`.
> Gap: cost lives in the response object and is rolled into L2/L3 logs, but is **not** in the event bus stream and **not** in any aggregate query surface.
### 3.4 Colony Progress DB
[core/framework/host/progress_db.py:44-110](../../core/framework/host/progress_db.py#L44-L110) — per-colony SQLite (WAL mode):
- `tasks` (id, seq, priority, goal, status: pending|claimed|started|completed|failed, worker_id, claimed_at, started_at, completed_at, retry_count, last_error)
- `steps`, `sop_checklist`, `colony_meta`
This is the closest thing we have to a status SQL store today, but it is **per-colony** and **task-shaped** — not session-shaped or usage-shaped.
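A status read against this store is nearly a one-liner today. A minimal sketch, assuming the `tasks.status` values listed above:

```python
import sqlite3

def colony_task_rollup(db_path: str) -> dict:
    """Count tasks per status in a colony progress.db.

    Assumes tasks.status holds the values from the schema above:
    pending | claimed | started | completed | failed.
    """
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT status, COUNT(*) FROM tasks GROUP BY status"
        ).fetchall()
    return dict(rows)
```

The gap is not the query; it is that the result is per-colony and there is no route exposing it.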
### 3.5 Queen task system (right-rail panel)
The mechanism the IDE-selection prompt describes is real: each `task_update` emits `TASK_UPDATED` on the bus, which a future SSE/WS surface can stream. State transitions: `pending → in_progress → completed`. Task body carries `subject`, `active_form`, `blocks`, `blocked_by`, `metadata`. Source: [tasks/events.py:52-159](../../core/framework/tasks/events.py#L52-L159).
### 3.6 HTTP surface (already shipping)
[core/framework/server/routes_sessions.py](../../core/framework/server/routes_sessions.py):
- `POST /api/sessions` — create
- `GET /api/sessions/{session_id}` — current state including `queen_phase`, `queen_model`, `colony_id`, `uptime_seconds`
- `GET /api/sessions/{session_id}/stats` — runtime statistics (extension point)
- `GET /api/sessions/{session_id}/events/history` — replay persisted events
SSE primitive exists at [server/sse.py](../../core/framework/server/sse.py) but is **not yet wired to a global event-stream route**. This is the natural attach point for a real-time status feed.
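Whatever framework ends up serving the route, the bridge itself reduces to framing bus events in SSE wire format. A minimal sketch, assuming events arrive as dicts with a `type` key (an assumption about the event shape, not the real serializer):

```python
import json

def to_sse(event: dict) -> str:
    """Frame one bus event as a Server-Sent Events message.

    The event/data line structure and blank-line terminator follow the
    SSE wire format; the `type` key convention is illustrative.
    """
    name = event.get("type", "message")
    payload = {k: v for k, v in event.items() if k != "type"}
    return f"event: {name}\ndata: {json.dumps(payload)}\n\n"
```

The route is then a loop: subscribe to the bus, filter, yield `to_sse(event)` per message.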
### 3.7 Worker health snapshot
`get_worker_health_summary()` ([worker_monitoring_tools.py:71-99](../../core/framework/tools/worker_monitoring_tools.py#L71-L99)) returns: `session_id`, `session_status`, `total_steps`, `recent_verdicts`, `stall_minutes`, `evidence_snippet`. Used today by Queen during the WORKING phase; can be exposed via API.
---
## 3.8 Where every byte lives today (data residency map)
Every storage location below is **on the end-user's machine**. There is no cloud sink, no telemetry endpoint, no managed database, no analytics service. The HTTP server in [core/framework/server/](../../core/framework/server/) binds to localhost for the desktop UI; it is not a cloud API.
`HIVE_HOME` defaults to `~/.hive/` and is overridden by the desktop shell to the platform `userData` dir (e.g. `~/Library/Application Support/Hive/` on macOS, `%APPDATA%\Hive\` on Windows). Source: [config.py:20-44](../../core/framework/config.py#L20-L44).
| Data | On-disk location (per machine) | Format | Lifetime | Currently shipped off-device? |
|---|---|---|---|---|
| Event bus stream | in-process memory only | Python objects | Process lifetime | No |
| Event debug log (opt-in) | `HIVE_HOME/event_logs/<ts>.jsonl` when `HIVE_DEBUG_EVENTS=1` | JSONL | Until user deletes | No |
| Session state | `HIVE_HOME/sessions/{session_id}/state.json` | JSON | Until user deletes | No |
| Conversations | `HIVE_HOME/sessions/{session_id}/conversations/` | JSON | Until user deletes | No |
| Artifacts | `HIVE_HOME/sessions/{session_id}/artifacts/` | mixed | Until user deletes | No |
| L1 run summary (tokens, cost, quality) | `HIVE_HOME/sessions/{session_id}/logs/summary.json` | JSON | Until user deletes | No |
| L2 node details | `HIVE_HOME/sessions/{session_id}/logs/details.jsonl` | JSONL | Until user deletes | No |
| L3 step / tool logs | `HIVE_HOME/sessions/{session_id}/logs/tool_logs.jsonl` | JSONL | Until user deletes | No |
| Colony task / step / SOP state | `HIVE_HOME/colonies/{colony_name}/data/progress.db` | SQLite (WAL) | Until user deletes | No |
| Queen / colony / skill / memory configs | `HIVE_HOME/{queens,colonies,skills,memories}/` | files | Until user deletes | No |
| LLM `cost_usd` numbers | computed in-process from [model_catalog.py](../../core/framework/llm/model_catalog.py), then written into L1/L2/L3 logs above | — | Same as logs | No |
**What this means for the cloud requirement:** the question for the architect is not "where do we get the data" — the data is fully captured. The question is **"how does it leave the machine, in what shape, with whose consent, and where does it land."** That decision is upstream of every endpoint in §6 and every storage option in §5.
Three architectural shapes worth considering (architect to choose):
- **Shape A — On-device only, queried over LAN/tunnel.** Cloud product reaches into the runtime via an authenticated tunnel; no data is replicated. Strongest privacy. Hardest for cross-device rollups.
- **Shape B — Outbox push.** Runtime keeps the local store as source of truth and asynchronously pushes a redacted, billing-grade subset (no prompts, no tool args by default) to a cloud aggregate. Best fit for the typical "agent status dashboard + usage rollup" product.
- **Shape C — Cloud-first runtime.** Runtime writes events directly to a cloud bus and treats local files as a cache. Largest rewrite; not recommended for a desktop-first product.
Shape B is the lowest-friction path to the stated business outcome. The rest of this document is written with Shape B as the default and calls out where Shape A or C would change things.
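A Shape B outbox can be very small. The sketch below uses an illustrative SQLite table (not a real schema) to show the core contract: append locally, push later, mark sent only on a successful push so transient failures retry naturally.

```python
import json
import sqlite3
import time

DDL = """CREATE TABLE IF NOT EXISTS outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    created_at REAL NOT NULL,
    sent_at REAL
)"""

class Outbox:
    """Minimal Shape-B outbox: local SQLite is the source of truth;
    a flush loop moves unsent rows to the cloud and marks them sent.
    Table and column names are illustrative."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(DDL)

    def append(self, event: dict) -> None:
        self.conn.execute(
            "INSERT INTO outbox (payload, created_at) VALUES (?, ?)",
            (json.dumps(event), time.time()),
        )
        self.conn.commit()

    def flush(self, push) -> int:
        """push(events) -> bool. Rows stay queued until a push succeeds."""
        rows = self.conn.execute(
            "SELECT id, payload FROM outbox WHERE sent_at IS NULL ORDER BY id"
        ).fetchall()
        if not rows:
            return 0
        if not push([json.loads(p) for _, p in rows]):
            return 0  # transient failure: retry on the next flush
        self.conn.executemany(
            "UPDATE outbox SET sent_at = ? WHERE id = ?",
            [(time.time(), row_id) for row_id, _ in rows],
        )
        self.conn.commit()
        return len(rows)
```

Redaction and identity attachment would slot in at `append` or `flush` time, per §10.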
---
## 4. Capability matrix — what we can offer
Each row is a candidate frontend/business surface, scored by feasibility from current state.
| # | Capability | Status | Backed by |
|---|---|---|---|
| **Status** | | | |
| S1 | Queen phase indicator (independent/incubating/working/reviewing) | **Ready** | `QUEEN_PHASE_CHANGED` event + session detail field |
| S2 | Per-task progress (right-rail) | **Ready** | `TASK_*` events |
| S3 | Live LLM streaming indicator (typing, thinking, tool-calling) | **Ready** | `LLM_TEXT_DELTA`, `LLM_REASONING_DELTA`, `TOOL_CALL_STARTED/COMPLETED` |
| S4 | Stall / stuck-agent detection | **Ready** | `NODE_STALLED`, `STREAM_INACTIVE`, `NODE_TOOL_DOOM_LOOP` |
| S5 | Colony tree (Queen → colonies → workers) snapshot | **Partial** | Data exists in session/colony stores; need a join query |
| S6 | Worker health roll-up across colonies | **Partial** | Per-worker tool exists; needs aggregation route |
| S7 | Liveness heartbeat ("agent X last seen Y ago") | **Net-new** | Must derive from event timestamps or add a periodic ping |
| S8 | Trigger schedule (when will Queen wake next) | **Ready** | `TRIGGER_*` events |
| **Usage** | | | |
| U1 | Tokens per session (input/output/cached) | **Partial** | Captured per-step in L3, summed in L1; no API |
| U2 | USD cost per session / colony / model | **Partial** | `cost_usd` per LLM call in logs; no rollup |
| U3 | Tool-call counts and types | **Partial** | Events exist; no aggregate |
| U4 | Wall-clock runtime and active-time per agent | **Partial** | Derivable from `EXECUTION_STARTED/COMPLETED` |
| U5 | Cost attribution per Queen-spawned colony | **Partial** | `colony_id` is on every event; needs a query |
| U6 | Per-user / per-tenant aggregation | **Net-new** | There is no user/tenant identity in events today |
| U7 | Daily / monthly usage rollups for billing | **Net-new** | Requires persistent event store |
| U8 | Quota / cap enforcement (block when over budget) | **Net-new** | Requires real-time meter + policy hook |
**Read of the matrix:** ~70% of "status" surfaces are **shipping-grade today** behind a thin local API. ~70% of "usage" surfaces need a **persistence + aggregation layer**. The events themselves are not the bottleneck.
**Local vs. cloud read of the same matrix.** Every "Ready" / "Partial" cell above is *ready in-process on the local machine*. Making each row visible to a **cloud** consumer adds an additional step:
| Capability class | Local (today / near-term) | Cloud (business ask) |
|---|---|---|
| Live status (S1–S4, S8) | Stream from in-process event bus over local SSE | Push events through outbox → cloud relay → cloud SSE/WS to product UI. |
| Tree / health (S5, S6) | Join local session + colony stores | Same join, but on cloud-side replica of session/colony index. |
| Liveness (S7) | Derive from local event timestamps | Requires the runtime to *post* a heartbeat; cloud cannot infer aliveness from absence. |
| Per-session usage (U1–U5) | Read L1/L2/L3 logs on disk | Outbox sends durable rows (no deltas) to cloud usage table. |
| Tenant rollups (U6–U7) | Not possible — no identity in events | Cloud-side aggregation keyed on session→user join, identity attached at outbox time. |
| Quotas (U8) | Local meter feasible, but pointless without cloud truth | Cloud is the meter of record; runtime calls home to check. |
---
## 5. Proposed data model (architect to validate)
Three new persisted entities, plus reuse of existing event types:
```
AgentSession UsageEvent StatusSnapshot
----------- ----------- ---------------
session_id (PK) id (PK) session_id (FK)
queen_id session_id (FK) taken_at
queen_model colony_id phase
started_at worker_id active_run_id
ended_at agent_role (queen|worker) active_node
status (active|done|failed) event_type (LLM|TOOL|...) open_task_count
user_id (when multi-tenant) model in_flight_workers
tenant_id (when multi-tenant) input_tokens last_event_at
total_input_tokens output_tokens stall_score
total_output_tokens cached_tokens
total_cached_tokens cost_usd
total_cost_usd latency_ms
total_tool_calls tool_name (nullable)
last_event_at occurred_at
trace_id
execution_id
```
Storage choice (architect call). **All three options today are local; only Option C reaches the cloud business surface.**
- **Option A — local SQLite outbox** at `HIVE_HOME/runtime.db`. Pros: zero infra, fits desktop, makes local queries cheap. Cons: per-host; no cross-device aggregation; **does not satisfy the cloud requirement on its own.**
- **Option B — DuckDB on the existing JSONL event logs.** Pros: zero ingestion code; analyst-friendly. Cons: cold-start latency on big histories; **also local-only.**
- **Option C — push events to a managed cloud store** (Postgres, ClickHouse, BigQuery) via an outbox pattern. Pros: cross-host rollups, billing-grade, the only option that actually delivers the cloud-visible status/metrics product. Cons: introduces a new transport, identity, and privacy/redaction story; needs explicit user opt-in for desktop builds.
The realistic shape is the hybrid called out in §3.8 Shape B: **A locally** as the durable buffer and source of truth, **C in the cloud** as the business-facing aggregate, with a one-way outbox that moves a *redacted, durable-event-only* subset over the wire. This document recommends that hybrid; everything in §6 and §7 is written against it.
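For the architect's validation, here is the §5 model rendered as DDL. SQLite dialect is an illustrative choice for the local Option A store; column types and constraints are placeholders to be tightened during review.

```python
import sqlite3

# Illustrative DDL for the three §5 entities; types are placeholders.
SCHEMA = """
CREATE TABLE agent_session (
    session_id          TEXT PRIMARY KEY,
    queen_id            TEXT,
    queen_model         TEXT,
    started_at          TEXT,
    ended_at            TEXT,
    status              TEXT CHECK (status IN ('active', 'done', 'failed')),
    user_id             TEXT,    -- when multi-tenant
    tenant_id           TEXT,    -- when multi-tenant
    total_input_tokens  INTEGER DEFAULT 0,
    total_output_tokens INTEGER DEFAULT 0,
    total_cached_tokens INTEGER DEFAULT 0,
    total_cost_usd      REAL    DEFAULT 0.0,
    total_tool_calls    INTEGER DEFAULT 0,
    last_event_at       TEXT
);

CREATE TABLE usage_event (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id    TEXT REFERENCES agent_session(session_id),
    colony_id     TEXT,
    worker_id     TEXT,
    agent_role    TEXT CHECK (agent_role IN ('queen', 'worker')),
    event_type    TEXT,
    model         TEXT,
    input_tokens  INTEGER,
    output_tokens INTEGER,
    cached_tokens INTEGER,
    cost_usd      REAL,
    latency_ms    INTEGER,
    tool_name     TEXT,    -- nullable
    occurred_at   TEXT,
    trace_id      TEXT,
    execution_id  TEXT
);

CREATE TABLE status_snapshot (
    session_id        TEXT REFERENCES agent_session(session_id),
    taken_at          TEXT,
    phase             TEXT,
    active_run_id     TEXT,
    active_node       TEXT,
    open_task_count   INTEGER,
    in_flight_workers INTEGER,
    last_event_at     TEXT,
    stall_score       REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)  # confirms the DDL parses and applies
```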
---
## 6. Surface API — what frontend would consume
All routes assume the event-bus → SSE bridge exists (the one missing wire — see §3.6). Frontend sees this from day one.
> **Locality note.** The `/api/...` routes below are served by the **local runtime HTTP server** today. For the cloud product, the same shapes need a cloud-side counterpart fed by the outbox. Two practical patterns: (1) cloud product calls cloud-hosted versions of these routes (against the aggregate), or (2) cloud product proxies authenticated requests back to the user's runtime. §3.8 Shape A vs. Shape B picks between them.
### Real-time channel
```
GET /api/sessions/{session_id}/events/stream (SSE)
↳ filter=phase,task,llm_stream,tool,worker,trigger,health
GET /api/agents/queen/stream (SSE) — global queen events
```
### Status reads
```
GET /api/sessions/{session_id} — already shipping
GET /api/sessions/{session_id}/tree — Queen → colonies → workers
GET /api/sessions/{session_id}/health — stall_score, last_event_at, in_flight
GET /api/colonies/{colony_id}/workers — health roll-up
```
### Usage reads
```
GET /api/sessions/{session_id}/usage — tokens, cost, latency, tool-calls
GET /api/sessions/{session_id}/usage/by-model — split by model
GET /api/colonies/{colony_id}/usage — same shape, colony scope
GET /api/agents/queen/usage?range=...&group_by=... — rollup view (billing)
```
### Admin / business
```
GET /api/usage/rollup?range=...&group_by=user|tenant|model|colony
POST /api/quotas/{tenant} — set caps (if quota work in scope)
```
---
## 7. Net-new work — sized in shirt-size, not days
| Workstream | Local / Cloud | Size | Depends on | Notes |
|---|---|---|---|---|
| Event-bus → local SSE bridge ([sse.py](../../core/framework/server/sse.py) exists, route does not) | Local | **S** | — | Unlocks all real-time status surfaces in the desktop UI. Highest leverage piece. |
| Persisted local event store (SQLite outbox) | Local | **M** | Decision §5 | One writer, append-only; reuse existing JSONL writer. Source of truth for cloud push. |
| Local aggregation queries + `/usage` endpoints | Local | **M** | Persisted store | Per-session usage on disk. |
| **Outbox transport (local → cloud)** | **Boundary** | **M–L** | Local store + auth | New work: durable queue, retry, redaction policy, opt-in switch, schema versioning. This is the bridge to the cloud product. |
| **Cloud event ingest + aggregate store** | **Cloud** | **L** | Outbox transport | New cloud infra (Postgres/ClickHouse/BigQuery). Hosting, ops, retention policy, access controls. |
| **Cloud-side status/usage API + dashboards** | **Cloud** | **M** | Cloud aggregate | Mirrors §6 endpoints against the cloud store; this is what business users actually see. |
| Identity layer (user_id / tenant_id on events) | Boundary | **M** | Auth model | Currently no user identity in events. Identity attaches at outbox time, not at emit time. |
| OpenTelemetry exporter (schema is ready) | Boundary | **S–M** | — | `trace_id`/`span_id` already populated; an OTel collector can be the cloud sink instead of a custom outbox. |
| Quota / policy hooks | Cloud-authoritative | **L** | Cloud store + identity | Cloud holds the meter; runtime calls home synchronously on a critical path. |
| Liveness/heartbeat (S7) | Local emit, cloud consume | **S** | Outbox | Runtime must actively post; cloud cannot infer liveness from absence. |
| Cost attribution UI rollups | Cloud | **S** | `/usage` cloud endpoints | Shared with frontend doc. |
**Critical path for first frontend release (local desktop UI):** SSE bridge → status endpoints (S1–S5) → per-session usage endpoint (U1, U2). Everything else is incremental.
**Critical path for first cloud release (business ask):** local event store → outbox transport with redaction + opt-in → cloud ingest → cloud `/usage` and `/status` endpoints. The local UI work above is *not* a prerequisite for the cloud cut, but most of the local-side primitives (event store, durable-event filtering) are shared, so doing them in order minimizes rework.
---
## 8. Risks and tradeoffs the architect should weigh
1. **Event volume.** `LLM_TEXT_DELTA` fires per token. A persisted store must filter — don't write deltas, write `LLM_TURN_COMPLETE`. This is the #1 way the table blows up.
2. **Privacy / desktop posture — the central architectural constraint.** The runtime is local by default ([config.py:20-44](../../core/framework/config.py#L20-L44)). The data inventory in §3.8 confirms that **no data leaves the user's machine today**, including the data the business ask needs in the cloud. Closing that gap is not "add a metrics push" — it is a new system boundary with: (a) explicit user opt-in (defaults must be safe for OSS / self-hosted users), (b) a documented redaction list (no prompts, no tool args, no file paths in the default payload), (c) schema versioning so cloud aggregates do not break on runtime upgrades, (d) a clear answer for self-hosted / air-gapped deployments where the cloud sink is unreachable, (e) regional data-residency rules if the product is sold internationally. This is the single largest design decision in the document.
3. **Cost-table accuracy.** `cost_usd` is computed from a static catalog. Using it for billing means committing to keeping the catalog current (or pulling from provider invoices). For *display*, the current approach is fine; for *charging*, it is not.
4. **Identity coupling.** Events are session-scoped today. Adding `user_id`/`tenant_id` everywhere is invasive. Recommend pinning identity at the *session* boundary and joining on session at query time, rather than threading identity through every event payload.
5. **Status vs. heartbeat semantics.** "Idle" is not "dead." A Queen sitting in `independent` waiting for a user message is healthy and should not page anyone. The stall-score in §5 must distinguish idle-by-design from stalled-by-bug — the existing `STREAM_INACTIVE` / `NODE_STALLED` events already make this distinction; preserve it.
6. **Backpressure from observability.** If usage tracking sits in the LLM call path (for quotas), it must not add latency. Recommend: meter is async/eventual for display; only quota checks are synchronous, and only when the customer has a quota.
7. **Worker-side gap.** Worker LLM calls are accounted in their own session's L1–L3 logs but are *not* automatically rolled into the parent Queen session. Cost attribution from Queen → spawned colony requires either (a) a parent_session_id field on the colony's session row, or (b) walking the `COLONY_CREATED` event graph at query time. (a) is cleaner.
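Risk 1 above argues for an explicit durable-event allowlist at the store boundary. A sketch, with event names drawn from §3.1; completeness of the set is an architect call:

```python
# Durable-event allowlist: high-volume streaming deltas never reach the
# persisted store; only turn- and lifecycle-level events do. Names are
# taken from the event inventory in section 3.1; the set is illustrative.
DURABLE_EVENTS = frozenset({
    "EXECUTION_STARTED", "EXECUTION_COMPLETED", "EXECUTION_FAILED",
    "QUEEN_PHASE_CHANGED",
    "COLONY_CREATED", "WORKER_COMPLETED", "WORKER_FAILED",
    "LLM_TURN_COMPLETE",  # not LLM_TEXT_DELTA / LLM_REASONING_DELTA
    "TOOL_CALL_STARTED", "TOOL_CALL_COMPLETED",
    "TASK_CREATED", "TASK_UPDATED", "TASK_DELETED",
})

def is_durable(event_type: str) -> bool:
    """Gate for the persisted store: drop per-token stream noise."""
    return event_type in DURABLE_EVENTS
```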
---
## 9. Recommendation
Ship in four thin slices. The first two are local-only and unblock the desktop UI; the last two are what actually deliver the business ask of cloud-visible status and metrics.
1. **Slice 1 — Live local status (1 sprint, fully local).**
SSE bridge + `/sessions/{id}/events/stream` + `/sessions/{id}/health` + `/sessions/{id}/tree`. Frontend (local UI) gets the right-rail and the agent-tree. No persistence work, no cloud. (S1–S5, S8.)
2. **Slice 2 — Per-session local usage store (1–2 sprints, fully local).**
Persisted event store (SQLite outbox at `HIVE_HOME/runtime.db`), filtered to durable event types only. `/sessions/{id}/usage` + `/colonies/{id}/usage`. No identity, no rollups, **no cloud transport yet**. This is the foundation the cloud slice rides on. (U1–U5.)
3. **Slice 3 — Local→cloud outbox + cloud ingest (the cloud cut, scope-defining).**
Durable outbox queue, redaction policy, opt-in toggle, identity attachment, schema versioning, retry/backoff. Cloud-side ingest service + aggregate store. **This is where the local-only world becomes a cloud product.** Architect must decide §3.8 Shape, §5 storage, redaction defaults, and identity model before this slice can start.
4. **Slice 4 — Cloud rollups, dashboards, quotas (scope TBD with product).**
Tenant aggregation, daily/monthly rollups, quota enforcement, OTel export, business dashboards. (U6–U8.) Defer until business confirms billing model — the answer (per-seat vs. per-token vs. per-colony) changes the data model.
Slices 1 and 2 are mostly **wiring** — the events exist, the schemas exist, the storage paths exist. Slice 3 is the **first slice that introduces a new architectural boundary** (local→cloud transport + identity + privacy contract); everything novel about the business ask lives there. Slice 4 is **business design**, not engineering scope.
---
## 10. Open questions for the architect
The first four are direct consequences of the local-first / cloud-required gap surfaced in §3.8 and §8.2.
1. **Cloud transport shape — Shape A, B, or C from §3.8?** This decision is upstream of the entire data model. Recommend Shape B (outbox push) absent a strong privacy argument for Shape A.
2. **Redaction default for the cloud payload.** What goes (model, token counts, latency, tool names, status) vs. what stays local (prompts, tool arguments, tool results, file paths, conversation content)? Need a written allowlist before Slice 3 starts.
3. **Self-hosted / air-gapped users.** If the cloud sink is unreachable or disabled, what does the runtime do — buffer indefinitely, drop oldest, or refuse to start? Defaults differ for OSS vs. SaaS distributions.
4. **Identity binding point.** Do we attach `user_id` / `tenant_id` at event-emit time (invasive, threads identity through every node), at session-create time (clean, requires session-level auth), or at outbox-flush time (simplest, but loses per-event provenance)? Recommend session-create.
5. Do we need quota *enforcement*, or only quota *visibility* in v1?
6. Frontend doc: are status and usage rendered in the same panel or different surfaces? This affects whether we ship one merged endpoint or two.
7. Are we willing to pay the cost-table maintenance burden, or should "cost" stay labeled as estimated and not be used for invoicing?
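Question 2 ultimately reduces to a field-level allowlist applied at outbox-flush time. A sketch with illustrative field names, pending the written allowlist:

```python
# Default cloud payload allowlist: only these keys cross the wire.
# Prompts, tool arguments, tool results, file paths, and conversation
# content stay on the machine. Field names are illustrative.
CLOUD_FIELD_ALLOWLIST = frozenset({
    "session_id", "colony_id", "worker_id", "event_type", "model",
    "input_tokens", "output_tokens", "cached_tokens", "cost_usd",
    "latency_ms", "tool_name", "occurred_at", "trace_id", "execution_id",
})

def redact_for_cloud(event: dict) -> dict:
    """Keep allowlisted keys; everything else never leaves the machine."""
    return {k: v for k, v in event.items() if k in CLOUD_FIELD_ALLOWLIST}
```

An allowlist (vs. a denylist) fails safe: a new field added to the runtime is private until someone deliberately exports it.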
---
## Appendix — Pointers
- Queen lifecycle: [core/framework/agents/queen/nodes/__init__.py](../../core/framework/agents/queen/nodes/__init__.py)
- Event bus + types: [core/framework/host/event_bus.py](../../core/framework/host/event_bus.py)
- Runtime log schemas: [core/framework/tracker/runtime_log_schemas.py](../../core/framework/tracker/runtime_log_schemas.py)
- Runtime log store: [core/framework/tracker/runtime_log_store.py](../../core/framework/tracker/runtime_log_store.py)
- LLM accounting: [core/framework/llm/provider.py](../../core/framework/llm/provider.py), [model_catalog.py](../../core/framework/llm/model_catalog.py)
- Colony progress DB: [core/framework/host/progress_db.py](../../core/framework/host/progress_db.py)
- Task events: [core/framework/tasks/events.py](../../core/framework/tasks/events.py)
- Session HTTP: [core/framework/server/routes_sessions.py](../../core/framework/server/routes_sessions.py)
- SSE primitive: [core/framework/server/sse.py](../../core/framework/server/sse.py)
- Worker health: [core/framework/tools/worker_monitoring_tools.py](../../core/framework/tools/worker_monitoring_tools.py)
- Config / env vars: [core/framework/config.py](../../core/framework/config.py)
@@ -115,7 +115,7 @@ Hive LLM:
 Notes:
 - Set `provider` to `hive`
-- Common Hive model values are `queen`, `kimi-2.5`, and `GLM-5`
+- Common Hive model values are `queen`, `kimi-k2.5`, and `GLM-5`
 - Hive LLM requests use the Hive endpoint at `https://api.adenhq.com`
 ### Search & Tools (optional)
@@ -0,0 +1,127 @@
# 🐝 Hive Agent v0.11.0: Action Plans, Charts, and a Cleaner Queen
> Major features released in Hive 0.11. Queen now keeps an action plan for everything and can produce charts to do your analytics for you. The overall conversation and agent experience is also much improved, thanks to a major Queen prompt and tools refactor.
---
## ✨ Highlights
### 📋 Queen now keeps an action plan for everything
A new file-backed task system gives Queen a persistent, structured plan for every conversation — visible to the user, editable on the fly, and surviving session reload.
- **File-backed task store** under `core/framework/tasks/` with full CRUD, scoping, hooks, and reminders. Tasks live on disk so they outlast a single agent run and can be inspected, replayed, or shared between Queen and colony workers.
- **Multi-task creation in one call** — Queen can stage a whole plan up front instead of dripping out one task at a time, then tick items off as it works.
- **Colony task templates** — colonies can publish a template task list that Queen picks up when the colony is invoked, so recurring workflows start with the same plan every time.
- **Live task list in the UI** — a new `TaskListPanel` renders the plan in real time next to the chat, with item status flowing through the event bus as Queen marks tasks done.
- **Task reminders + hooks** wire into Queen's loop so the plan stays in front of the model, and structural blockers that prevented tool calls on `task_*` have been resolved.
### 📊 Charting capability for analytics
Queen can now produce real charts inline in the conversation, not just describe them.
- **New `chart_tools` MCP server** with ECharts and Mermaid renderers, an OpenHive theme, and a `chart-creation-foundations` skill that teaches Queen when to chart vs. when to table.
- **Inline chart rendering in chat** — `EChartsBlock` and `MermaidBlock` components render the chart spec directly in the transcript; tool results get a contentful display with `ChartToolDetail` instead of a JSON dump.
- **Chart spec normalization** in the renderer keeps Y-axis scaling, series colors, and theme tokens consistent regardless of how Queen phrases the spec.
### 🧹 Major Queen prompt + tools refactor
The biggest cleanup of Queen's tool surface and prompt since v0.7. Fewer, sharper tools; a shorter, more focused prompt; and a clearer model of what Queen has access to vs. what colonies do.
- **File ops consolidated** — `apply_diff`, `apply_patch`, `hashline_edit`, the old `data_tools`, `grep_search`, and the legacy `coder_tools_server` are gone. A single rewritten `file_ops` module covers read / search / list / edit with a more predictable interface and a net reduction of ~1.7k lines.
- **Search and list-files unified** into one toolkit so Queen stops juggling near-duplicate variants.
- **Browser tools audit** — interactions, navigation, tabs, and lifecycle trimmed and consolidated; `web_scrape` and `browser_open` merged into a single web-search-and-open path.
- **New shell/terminal toolkit** (`shell_tools`) — replaces the old `execute_command_tool` and the inline command sanitizer with a typed module that has proper job control, PTY sessions, ring-buffered output, semantic exit codes, and a destructive-command warning gate. Five new preset skills (`shell-tools-foundations`, `-fs-search`, `-job-control`, `-pty-sessions`, `-troubleshooting`) teach Queen the new surface.
- **Old lifecycle tools removed** — `queen_lifecycle_tools.py` shrunk by ~900 lines as deprecated default tools came out.
- **Prompt simplification + improvements** — Queen's node prompts dropped redundant `_queen_style` blocks, tightened phrasing, and now lean on the task system for plan-keeping instead of restating the plan every turn.
- **Tools editor frontend grouping** — `ToolsEditor.tsx` groups tools by category so configuring a queen profile is no longer a flat scroll through 80+ entries.
---
## 🆕 What's New
### Tasks & Action Plans
- **`core/framework/tasks/`** — full task subsystem: `store`, `models`, `events`, `hooks`, `reminders`, `scoping`, plus a `tools/` package exposing session and colony task tools to Queen. (@RichardTang-Aden)
- **`POST /api/tasks` routes** for the frontend to read and mutate the live plan. (@RichardTang-Aden)
- **`TaskListPanel` + `TaskItem` + `TaskListContext`** on the frontend render the plan in real time. (@RichardTang-Aden)
- **Multi-task creation tool** lets Queen stage a whole plan in one call. (@RichardTang-Aden)
- **Colony task templates** — colonies ship with a default task list that Queen adopts on entry. (@RichardTang-Aden)
- **Hook + reminder fixes** so Queen reliably uses `task_*` tools instead of skipping them. (@RichardTang-Aden)
### Charts
- **`tools/src/chart_tools/`** — new MCP server with `renderer.py`, `theme.py`, `tools.py`, plus bundled `echarts.min.js` and `mermaid.min.js`. (@TimothyZhang7)
- **`chart-creation-foundations` skill** teaches Queen when and how to chart. (@TimothyZhang7)
- **`EChartsBlock` / `MermaidBlock` / `ChartToolDetail`** components render charts inline. (@TimothyZhang7)
- **OpenHive chart theme** (`openhiveTheme.ts`) keeps chart styling consistent with the rest of the UI. (@TimothyZhang7)
- **Chart spec normalization** in the renderer fixes Y-axis edge cases and series defaults. (@TimothyZhang7)
### Queen Prompt & Tools Refactor
- **Major file ops refactor** — single rewritten `file_ops` module replaces `apply_diff`, `apply_patch`, `hashline_edit`, `grep_search`, `data_tools`, and the legacy `coder_tools_server`. (@RichardTang-Aden)
- **Edit-file refactor** with a tighter API surface and ~560 lines of dead `test_file_ops_hashline.py` removed. (@RichardTang-Aden)
- **Search + list-files consolidation** into one toolkit. (@RichardTang-Aden)
- **Browser tools audit** — navigation, interactions, lifecycle, and tabs trimmed; `web_scrape` and browser-open merged. (@RichardTang-Aden)
- **`shell_tools` package** replaces `execute_command_tool` with proper job control, PTY sessions, ring-buffered output, semantic exit codes, and destructive-command warnings. (@TimothyZhang7)
- **Five new shell preset skills** plus reference docs (`exit_codes.md`, `find_predicates.md`, `ripgrep_cheatsheet.md`, `signals.md`). (@TimothyZhang7)
- **Old lifecycle tools removed** — `queen_lifecycle_tools.py` lost ~900 lines. (@RichardTang-Aden)
- **Autocompaction + concurrency tools updated** to play nicely with the new tool registry. (@RichardTang-Aden)
- **Prompt simplification** — `nodes/__init__.py` dropped redundant `_queen_style` block and tightened phrasing across nodes. (@RichardTang-Aden)
- **`ToolsEditor` grouping** — frontend tool-config screen now groups tools by category. (@RichardTang-Aden)
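The semantic exit codes in the new `shell_tools` map raw process results onto statuses an agent can act on. A hedged sketch of that idea — the status names and thresholds here are assumptions, not `shell_tools`' real classification table:

```python
def classify(command: str, exit_code: int, *, timed_out: bool = False,
             signaled: bool = False) -> tuple[str, str]:
    """Map a raw exit code to a (status, message) pair (illustrative only)."""
    if timed_out:
        return "timeout", f"{command!r} exceeded its time limit"
    if signaled:
        # POSIX convention: shells report signal death as 128 + signal number.
        return "killed", f"{command!r} died on signal {exit_code - 128}"
    if exit_code == 0:
        return "ok", ""
    if exit_code == 127:
        return "not_found", f"{command!r}: command not found"
    if exit_code == 126:
        return "not_executable", f"{command!r}: found but not executable"
    return "error", f"{command!r} exited with status {exit_code}"
```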
### Conversation & Agent Experience
- **`ask_user` questions surface in the chat transcript** instead of vanishing into a side panel, and the question bubble now defers until the user actually answers. (@bryan)
- **New-session navigation with Queen warm-up UI** — new `queen-routing.tsx` page handles the warm-up so the user sees progress instead of a blank screen. (@bryan)
- **Structured sync tool-result display** — synchronous tool results render as structured cards (charts, file diffs, etc.) instead of raw JSON. (@TimothyZhang7)
### Vision Fallback
- **Vision model retry + fallback** — non-vision models can now route image inputs through a captioning step instead of failing. (@RichardTang-Aden)
- **Vision fallback with intent** — caption prompts incorporate the user's intent so the caption is task-relevant. (@RichardTang-Aden)
- **Vision fallback auth** — fallback path now uses the right credentials per provider. (@RichardTang-Aden)
- **Looser max-token cap** on vision fallback for models that spend output tokens on internal thinking. (@RichardTang-Aden)
- **Vision fallback model usage logging** for cost visibility. (@RichardTang-Aden)
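The fallback path can be pictured as a pre-flight pass over the message list that swaps image parts for captions before a non-vision model sees them. The message shape and `caption_fn` below are hypothetical stand-ins for the framework's real types:

```python
def caption_images(messages: list[dict], *, supports_vision: bool, caption_fn) -> list[dict]:
    """Replace image parts with task-relevant captions when the target
    model has no vision support (sketch; the real schema may differ)."""
    if supports_vision:
        return messages
    out = []
    for msg in messages:
        parts = msg.get("content")
        if isinstance(parts, list):
            parts = [
                {"type": "text", "text": f"[image: {caption_fn(p)}]"}
                if p.get("type") == "image" else p
                for p in parts
            ]
            msg = {**msg, "content": parts}
        out.append(msg)
    return out
```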
### Colonies
- **`POST /api/colonies/import`** — onboard a colony from a `tar` / `tar.gz` upload. 50 MB cap, manual path-traversal validation (Python 3.11 compatible), symlinks/hardlinks/devices rejected, mode bits masked. Tests cover happy path, name override, replace flag, traversal, absolute paths, and corrupt archives. (@RichardTang-Aden)
- **Refactored colony routes** — `routes_colonies.py` gained ~450 lines of structure for import/export/list flows. (@TimothyZhang7)
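The traversal validation described above can be sketched as a member filter over the archive. `safe_members` is a hypothetical helper, not the route's actual code; newer `tarfile` extraction filters cover much of this, but a hand-rolled check like this works on any supported 3.x:

```python
import os
import tarfile

def safe_members(tar: tarfile.TarFile, dest: str):
    """Yield only archive members that are safe to extract into dest
    (illustrative version of the manual validation, not the real route)."""
    dest = os.path.realpath(dest)
    for member in tar.getmembers():
        if member.issym() or member.islnk() or member.isdev():
            continue  # reject symlinks, hardlinks, and device nodes outright
        target = os.path.realpath(os.path.join(dest, member.name))
        if os.path.commonpath([dest, target]) != dest:
            continue  # traversal via ../ or an absolute path — skip
        member.mode &= 0o755  # mask setuid/setgid/world-writable bits
        yield member
```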
### MCP & Tools
- **SimilarWeb V5 integration** — 29 new MCP tools covering traffic & engagement, competitor intelligence, keywords/SERP, audience demographics, and segment analysis. Includes credential spec, health checker, README, and tests on Ubuntu and Windows. (#7066)
- **MCP registry initialization fix** — registry no longer races on first install. (@RichardTang-Aden)
---
## 🐛 Bug Fixes
- **Initial-install path resolution** — hardcoded `HIVE_HOME` references replaced; all agent paths are now prefixed with the resolved `HIVE_HOME`. (@RichardTang-Aden)
- **Frontend recovery** after a broken state on session reload. (@RichardTang-Aden)
- **Compaction issues** when the agent loop runs into the buffer mid-stream. (@RichardTang-Aden)
- **LiteLLM patch** for a streaming-usage edge case. (@RichardTang-Aden)
- **`ask_user` question bubble** now defers until the user answers. (@bryan)
- **Incubating-mode approval guidance** correctly injects into the prompt. (@RichardTang-Aden)
- **LLM debugger** — fixed timeline order and tool-call display. (@RichardTang-Aden)
- **Shell split-command** parsing fix. (@TimothyZhang7)
- **Chart Y-axis** + **chart spec normalization** edge cases. (@TimothyZhang7)
- **Scroll behavior** on certain element selectors. (@bryan)
- **CI fixes**: skills `HIVE_HOME` refactor regressions, `run_parallel_workers` losing task text on spawn, `test_capabilities` deprecated model identifiers, `test_colony_runtime_overseer` Windows flake. (#7141, #7149)
- **Orphan Zoho CRM test directory** removed under `src/` after the MCP refactor. (#7142)
- **Credentials** — `EnvVarStorage.exists` now matches `load` semantics for empty values. (#5680)
---
## 🚀 Upgrading from v0.10.5
No migration required. Pull `main` at `v0.11.0` and restart Hive — existing `~/.hive/` profiles, queens, colonies, and sessions keep working.
A few things to know:
1. **Queen's default tool surface changed.** If you have a queen profile pinned to a removed tool (e.g. `apply_diff`, `apply_patch`, `hashline_edit`, `grep_search`, the old `execute_command_tool`), it'll fall back to the consolidated replacements. Custom profiles referencing those tool names should be updated.
2. **Old `queen_lifecycle_tools` entries are gone.** If you wired any external code against those defaults, switch to the new task system.
3. **Task plan is now persistent.** Queen will start staging a plan automatically on new sessions — if you don't want the panel, you can collapse it from the layout.
Plan the work. Chart the result. 🐝
File diff suppressed because it is too large.
+70 -9
@@ -64,6 +64,54 @@ def _format_timestamp(raw: str) -> str:
return raw
def _reassemble_records(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Convert new-format (header + turn-delta) records into legacy-shape full turns.

    Records lacking ``_kind`` are passed through unchanged. Inputs must be in
    file order so headers precede the turns that reference them.
    """
    headers: dict[str, dict[str, Any]] = {}  # execution_id -> latest session_header
    pools: dict[str, dict[str, dict[str, Any]]] = {}  # execution_id -> hash -> message body
    out: list[dict[str, Any]] = []
    for rec in records:
        kind = rec.get("_kind")
        if kind == "session_header":
            eid = str(rec.get("execution_id") or "")
            headers[eid] = rec
            pools.setdefault(eid, {})
            continue
        if kind == "turn":
            eid = str(rec.get("execution_id") or "")
            pool = pools.setdefault(eid, {})
            new_msgs = rec.get("new_messages") or {}
            if isinstance(new_msgs, dict):
                pool.update(new_msgs)
            hashes = rec.get("message_hashes") or []
            messages = [pool[h] for h in hashes if h in pool]
            header = headers.get(eid, {})
            out.append(
                {
                    "timestamp": rec.get("timestamp", ""),
                    "execution_id": eid,
                    "node_id": rec.get("node_id", ""),
                    "stream_id": rec.get("stream_id", ""),
                    "iteration": rec.get("iteration", 0),
                    "system_prompt": header.get("system_prompt", ""),
                    "tools": header.get("tools", []),
                    "messages": messages,
                    "assistant_text": rec.get("assistant_text", ""),
                    "tool_calls": rec.get("tool_calls", []),
                    "tool_results": rec.get("tool_results", []),
                    "token_counts": rec.get("token_counts", {}),
                    "_log_file": rec.get("_log_file", ""),
                }
            )
            continue
        out.append(rec)
    return out
def _is_test_session(execution_id: str, records: list[dict[str, Any]]) -> bool:
    if execution_id.startswith("<MagicMock"):
        return True
@@ -100,6 +148,9 @@ def _discover_session_summaries(logs_dir: Path, limit_files: int, include_tests:
                try:
                    payload = json.loads(line)
                except json.JSONDecodeError:
                    continue
                # session_header is metadata, not a turn — don't count it.
                if payload.get("_kind") == "session_header":
                    continue
                eid = str(payload.get("execution_id") or "").strip()
                if not eid:
                    continue
@@ -157,6 +208,10 @@ def _load_session_data(logs_dir: Path, session_id: str, limit_files: int) -> lis
    records: list[dict[str, Any]] = []
    for path in files:
        # Reassemble per-file: each file is self-contained because the writer
        # re-emits the session_header on every process start, so we never need
        # cross-file state to fill in messages/system_prompt/tools.
        file_records: list[dict[str, Any]] = []
        try:
            with path.open(encoding="utf-8") as handle:
                for line_number, raw_line in enumerate(handle, start=1):
@@ -166,17 +221,23 @@ def _load_session_data(logs_dir: Path, session_id: str, limit_files: int) -> lis
                    try:
                        payload = json.loads(line)
                    except json.JSONDecodeError:
                        payload = {
                            "timestamp": "",
                            "execution_id": "",
                            "_parse_error": f"{path.name}:{line_number}",
                            "_raw_line": line,
                        }
                if str(payload.get("execution_id") or "").strip() == session_id:
                    payload["_log_file"] = str(path)
                    records.append(payload)
                        records.append(
                            {
                                "timestamp": "",
                                "execution_id": "",
                                "_parse_error": f"{path.name}:{line_number}",
                                "_raw_line": line,
                                "_log_file": str(path),
                            }
                        )
                        continue
                    if str(payload.get("execution_id") or "").strip() != session_id:
                        continue
                    payload["_log_file"] = str(path)
                    file_records.append(payload)
        except OSError:
            continue
        records.extend(_reassemble_records(file_records))
    if not records:
        return None
+12 -42
@@ -22,13 +22,11 @@ Usage:
from __future__ import annotations
import contextlib
import difflib
import fnmatch
import os
import re
import subprocess
import sys
import threading as _threading
from collections.abc import Callable
from dataclasses import dataclass, field
@@ -924,8 +922,7 @@ def _apply_hunk(content: str, hunk: _Hunk) -> tuple[str, str | None]:
count = content.count(hunk.context_hint)
if count > 1:
return content, (
f"addition-only hunk: context hint "
f"'{hunk.context_hint}' is ambiguous ({count} occurrences)"
f"addition-only hunk: context hint '{hunk.context_hint}' is ambiguous ({count} occurrences)"
)
if count == 1:
idx = content.find(hunk.context_hint)
@@ -1045,9 +1042,7 @@ def _apply_v4a(
for hunk_idx, hunk in enumerate(op.hunks):
new_content, herr = _apply_hunk(content, hunk)
if herr:
errors.append(
f"Op #{op_idx + 1} update {op.path} hunk #{hunk_idx + 1}: {herr}"
)
errors.append(f"Op #{op_idx + 1} update {op.path} hunk #{hunk_idx + 1}: {herr}")
break
content = new_content
fs_state[resolved] = content
@@ -1063,9 +1058,7 @@ def _apply_v4a(
errors.append(f"Op #{op_idx + 1} move {op.path}: {err}")
continue
if os.path.exists(dst_resolved) and fs_exists.get(dst_resolved, True):
errors.append(
f"Op #{op_idx + 1} move {op.path}: destination already exists"
)
errors.append(f"Op #{op_idx + 1} move {op.path}: destination already exists")
continue
fs_state[dst_resolved] = fs_state[resolved]
fs_exists[dst_resolved] = True
@@ -1121,8 +1114,7 @@ def _apply_v4a(
if apply_errors:
return None, (
"Apply phase failed (state may be inconsistent — run `git diff` to assess):\n "
+ "\n ".join(apply_errors)
"Apply phase failed (state may be inconsistent — run `git diff` to assess):\n " + "\n ".join(apply_errors)
)
summary_parts: list[str] = []
@@ -1177,10 +1169,7 @@ def _patch_replace(
f"harness can track its state before you edit it."
)
if _fresh.status is Freshness.STALE:
return (
f"Refusing to edit '{path}': {_fresh.detail}. Re-read the file with "
f"read_file before editing."
)
return f"Refusing to edit '{path}': {_fresh.detail}. Re-read the file with read_file before editing."
try:
with open(resolved, encoding="utf-8") as f:
@@ -1217,9 +1206,7 @@ def _patch_replace(
break
if matched is None:
close = difflib.get_close_matches(
old_string[:200], content.split("\n"), n=3, cutoff=0.4
)
close = difflib.get_close_matches(old_string[:200], content.split("\n"), n=3, cutoff=0.4)
msg = (
f"Error: Could not find a unique match for old_string in {path}. "
f"Use read_file to verify the current content, or search_files "
@@ -1352,14 +1339,8 @@ EDIT_FILE_PARAMS = {
"tabs vs spaces, smart quotes vs ASCII, and literal \\n/\\t/\\r "
"vs real control chars."
),
"new_string": (
"Replace mode only. Replacement text. Pass an empty string to "
"delete the matched text."
),
"replace_all": (
"Replace mode only. Replace every occurrence instead of requiring "
"a unique match. Default False."
),
"new_string": ("Replace mode only. Replacement text. Pass an empty string to delete the matched text."),
"replace_all": ("Replace mode only. Replace every occurrence instead of requiring a unique match. Default False."),
"patch_text": (
"Patch mode only. Structured patch body. File paths are embedded "
"inside the body via '*** Update File: <path>' / "
@@ -1396,18 +1377,14 @@ SEARCH_FILES_DOC = (
)
SEARCH_FILES_PARAMS = {
"pattern": (
"Regex (content mode) or glob (files mode, e.g. '*.py'). For an "
"'ls'-style listing pass '*' or '*.<ext>'."
"Regex (content mode) or glob (files mode, e.g. '*.py'). For an 'ls'-style listing pass '*' or '*.<ext>'."
),
"target": (
"'content' to grep inside files, 'files' to list/find files. "
"Legacy aliases: 'grep' -> 'content', 'find'/'ls' -> 'files'. "
"Default 'content'."
),
"path": (
"Directory (or, in content mode, a single file) to search. "
"Default '.'."
),
"path": ("Directory (or, in content mode, a single file) to search. Default '.'."),
"file_glob": (
"Restrict content search to filenames matching this glob. "
"Ignored in files mode (use the 'pattern' argument instead)."
@@ -1419,14 +1396,8 @@ SEARCH_FILES_PARAMS = {
"default), 'files_only' (paths only), 'count' (per-file match "
"counts)."
),
"context": (
"Lines of context before and after each match (content mode "
"only). Default 0."
),
"hashline": (
"Content mode: include N:hhhh hash anchors in matched lines. "
"Default False."
),
"context": ("Lines of context before and after each match (content mode only). Default 0."),
"hashline": ("Content mode: include N:hhhh hash anchors in matched lines. Default False."),
"task_id": (
"Optional anti-loop scope key. Defaults to a shared bucket; pass "
"a per-task id when multiple agents share a process."
@@ -1719,4 +1690,3 @@ def register_file_tools(
"Results have not changed — use what you have instead of re-searching.]"
)
return result
@@ -137,10 +137,7 @@ def register_tools(mcp: FastMCP) -> None:
"error": f"Blocked by robots.txt: {url}",
"url": url,
"skipped": True,
"hint": (
"Pass respect_robots_txt=False if you have "
"authorization to scrape this site."
),
"hint": ("Pass respect_robots_txt=False if you have authorization to scrape this site."),
}
except Exception:
pass # If robots.txt can't be fetched, proceed anyway
@@ -343,8 +340,19 @@ def register_tools(mcp: FastMCP) -> None:
for br in content_elem.find_all("br"):
br.replace_with(NavigableString("\n"))
block_tags = (
"p", "h1", "h2", "h3", "h4", "h5", "h6",
"li", "tr", "div", "section", "article", "blockquote",
"p",
"h1",
"h2",
"h3",
"h4",
"h5",
"h6",
"li",
"tr",
"div",
"section",
"article",
"blockquote",
)
for block in content_elem.find_all(block_tags):
block.insert_before(NavigableString("\n"))
@@ -375,7 +383,7 @@ def register_tools(mcp: FastMCP) -> None:
truncated = end < total_length
sliced = text[offset:end]
if truncated and len(sliced) >= 3:
sliced = sliced[: -3] + "..."
sliced = sliced[:-3] + "..."
structured_data: dict[str, Any] = {}
if json_ld:
+1 -5
@@ -29,11 +29,7 @@ def register_chart_tools(mcp: FastMCP) -> list[str]:
register_tools(mcp)
return [
name
for name in mcp._tool_manager._tools.keys()
if name.startswith("chart_")
]
return [name for name in mcp._tool_manager._tools.keys() if name.startswith("chart_")]
__all__ = ["register_chart_tools"]
+1 -3
@@ -247,9 +247,7 @@ async def _render_in_page(
# expected to have already coerced JSON-string specs into dicts
# in chart_tools/tools.py — this is a defense-in-depth check.
if isinstance(spec, str):
raise RendererError(
"spec arrived as a string; it should have been parsed to a dict in chart_render"
)
raise RendererError("spec arrived as a string; it should have been parsed to a dict in chart_render")
try:
json.dumps(spec)
except (TypeError, ValueError) as exc:
+2 -2
@@ -168,8 +168,8 @@ def build_theme(theme: str = "light") -> dict:
# being CSS-hello-world green/red.
"candlestick": {
"itemStyle": {
"color": "#3d7a4a", # up body
"color0": "#a8453d", # down body
"color": "#3d7a4a", # up body
"color0": "#a8453d", # down body
"borderColor": "#3d7a4a",
"borderColor0": "#a8453d",
},
+1 -3
@@ -174,9 +174,7 @@ def register_tools(mcp: FastMCP) -> None:
# browser-side flakes. We retry once for the latter; if
# the second attempt fails too, surface the error so the
# agent can fix it.
logger.warning(
"chart_render attempt %d/%d failed: %s", attempt + 1, 2, exc
)
logger.warning("chart_render attempt %d/%d failed: %s", attempt + 1, 2, exc)
if attempt == 0:
await asyncio.sleep(0.15)
continue
@@ -71,17 +71,10 @@ def build_exec_envelope(
# the foundational skill documents). For simplicity we always
# store both when either overflows so the agent can fetch the
# other stream in full too if it wants.
combined = (
b"--- stdout ---\n"
+ stdout_bytes
+ b"\n--- stderr ---\n"
+ stderr_bytes
)
combined = b"--- stdout ---\n" + stdout_bytes + b"\n--- stderr ---\n" + stderr_bytes
output_handle = store.put(combined)
semantic_status, semantic_message = classify(
command, exit_code, timed_out=timed_out, signaled=signaled
)
semantic_status, semantic_message = classify(command, exit_code, timed_out=timed_out, signaled=signaled)
warning = get_warning(command)
+3 -4
@@ -53,9 +53,7 @@ if TYPE_CHECKING:
# directly — the alternative is spawning the first program with the rest
# of the line as junk argv, which either errors or returns fake success
# (e.g. `echo "..." && ps ...` → echo prints the literal command).
_SHELL_METACHARS: frozenset[str] = frozenset(
{"|", "&&", "||", ";", ">", "<", ">>", "<<", "&", "2>", "2>&1", "|&"}
)
_SHELL_METACHARS: frozenset[str] = frozenset({"|", "&&", "||", ";", ">", "<", ">>", "<<", "&", "2>", "2>&1", "|&"})
def register_exec_tools(mcp: FastMCP) -> None:
@@ -126,7 +124,8 @@ def register_exec_tools(mcp: FastMCP) -> None:
return _err_envelope(command, "command was empty")
if any(t in _SHELL_METACHARS for t in tokens) or any(
# globs that shlex left unexpanded (`*`, `?`, `[`)
any(c in t for c in "*?[") and t != "[" for t in tokens
any(c in t for c in "*?[") and t != "["
for t in tokens
):
auto_shell = True
@@ -20,7 +20,6 @@ from gcu.browser.bridge import BeelineBridge
from gcu.browser.tools.advanced import register_advanced_tools
from gcu.browser.tools.inspection import register_inspection_tools
from gcu.browser.tools.interactions import register_interaction_tools
from gcu.browser.tools.lifecycle import register_lifecycle_tools
from gcu.browser.tools.navigation import register_navigation_tools
from gcu.browser.tools.tabs import register_tab_tools
+1 -3
@@ -20,9 +20,7 @@ def test_register_chart_tools_lands_all(mcp):
from chart_tools import register_chart_tools
names = register_chart_tools(mcp)
assert set(names) == EXPECTED_TOOLS, (
f"missing: {EXPECTED_TOOLS - set(names)}, extra: {set(names) - EXPECTED_TOOLS}"
)
assert set(names) == EXPECTED_TOOLS, f"missing: {EXPECTED_TOOLS - set(names)}, extra: {set(names) - EXPECTED_TOOLS}"
def test_all_tools_have_chart_prefix(mcp):
+7 -8
@@ -2,8 +2,8 @@
These tests cover the stale-edit guard added for Gap 4:
- read_file records a per-file hash snapshot
- edit_file / write_file / hashline_edit refuse to run when the on-disk
file has diverged from the last recorded read
- edit_file / write_file refuse to run when the on-disk file has
diverged from the last recorded read
- write_file is allowed without a prior read when the target doesn't
exist yet (brand-new file, nothing to clobber)
- re-recording after a successful write keeps chained edits working
@@ -52,7 +52,6 @@ def tools(sandbox: Path):
"read_file": _find_tool(mcp, "read_file"),
"write_file": _find_tool(mcp, "write_file"),
"edit_file": _find_tool(mcp, "edit_file"),
"hashline_edit": _find_tool(mcp, "hashline_edit"),
}
@@ -129,7 +128,7 @@ def test_edit_file_refuses_without_prior_read(sandbox: Path, tools):
# Clear the cache first so there's definitely no recorded read.
file_state_cache.reset_all()
result = tools["edit_file"]("e.py", "hello", "world")
result = tools["edit_file"]("replace", "e.py", "hello", "world")
assert "Refusing to edit" in result
assert "read_file" in result
@@ -140,7 +139,7 @@ def test_edit_file_proceeds_after_read(sandbox: Path, tools):
file_state_cache.reset_all()
tools["read_file"]("f.py")
result = tools["edit_file"]("f.py", "hello", "world")
result = tools["edit_file"]("replace", "f.py", "hello", "world")
assert "Replaced" in result
assert target.read_text() == "print('world')\n"
@@ -157,7 +156,7 @@ def test_edit_file_refuses_when_file_changed_between_read_and_edit(sandbox: Path
target.write_text("print('bye')\n")
os.utime(str(target), None)
result = tools["edit_file"]("g.py", "hello", "world")
result = tools["edit_file"]("replace", "g.py", "hello", "world")
assert "Refusing to edit" in result
assert "Re-read" in result
@@ -185,10 +184,10 @@ def test_chained_edits_in_same_turn_do_not_self_invalidate(sandbox: Path, tools)
file_state_cache.reset_all()
tools["read_file"]("chained.py")
r1 = tools["edit_file"]("chained.py", "a", "A")
r1 = tools["edit_file"]("replace", "chained.py", "a", "A")
assert "Replaced" in r1
# Immediate second edit must NOT trip the stale guard because
# edit_file re-records the post-write state.
r2 = tools["edit_file"]("chained.py", "b", "B")
r2 = tools["edit_file"]("replace", "chained.py", "b", "B")
assert "Replaced" in r2
assert target.read_text() == "print('A')\nprint('B')\n"
+3
@@ -2,10 +2,13 @@
from __future__ import annotations
import sys
import time
import pytest
pytestmark = pytest.mark.skipif(sys.platform == "win32", reason="terminal_tools is POSIX-only (uses resource module)")
@pytest.fixture
def exec_tool(mcp):
+4 -3
@@ -2,10 +2,13 @@
from __future__ import annotations
import sys
import time
import pytest
pytestmark = pytest.mark.skipif(sys.platform == "win32", reason="terminal_tools is POSIX-only (uses resource module)")
@pytest.fixture
def job_tools(mcp):
@@ -63,9 +66,7 @@ def test_merge_stderr(job_tools):
merge_stderr=True,
)
job_id = started["job_id"]
result = job_tools["logs"](
job_id=job_id, stream="merged", wait_until_exit=True, wait_timeout_sec=5
)
result = job_tools["logs"](job_id=job_id, stream="merged", wait_until_exit=True, wait_timeout_sec=5)
assert "stdout1" in result["data"]
assert "stderr1" in result["data"]
@@ -3,9 +3,12 @@
from __future__ import annotations
import shutil
import sys
import pytest
pytestmark = pytest.mark.skipif(sys.platform == "win32", reason="terminal_tools is POSIX-only (uses resource module)")
@pytest.fixture
def search_tools(mcp):
@@ -2,8 +2,12 @@
from __future__ import annotations
import sys
import pytest
pytestmark = pytest.mark.skipif(sys.platform == "win32", reason="terminal_tools is POSIX-only (uses resource module)")
def test_resolve_shell_rejects_zsh():
from terminal_tools.common.limits import ZshRefused, _resolve_shell
+7 -3
@@ -2,6 +2,12 @@
from __future__ import annotations
import sys
import pytest
pytestmark = pytest.mark.skipif(sys.platform == "win32", reason="terminal_tools is POSIX-only (uses resource module)")
EXPECTED_TOOLS = {
"terminal_exec",
"terminal_job_start",
@@ -20,9 +26,7 @@ def test_register_terminal_tools_lands_all_ten(mcp):
from terminal_tools import register_terminal_tools
names = register_terminal_tools(mcp)
assert set(names) == EXPECTED_TOOLS, (
f"missing: {EXPECTED_TOOLS - set(names)}, extra: {set(names) - EXPECTED_TOOLS}"
)
assert set(names) == EXPECTED_TOOLS, f"missing: {EXPECTED_TOOLS - set(names)}, extra: {set(names) - EXPECTED_TOOLS}"
def test_all_tools_have_terminal_prefix(mcp):
+3 -16
@@ -280,7 +280,7 @@ class TestPatchToolReplaceMode:
result = edit_fn(
mode="replace",
path="b.py",
old_string='print(“hi”)',
old_string="print(“hi”)",
new_string='print("HELLO")',
)
assert "Error" not in result
@@ -331,14 +331,7 @@ class TestPatchToolPatchMode:
"""A V4A Update hunk replaces matched lines and writes."""
target = tmp_path / "u.py"
target.write_text("def f():\n return 1\n", encoding="utf-8")
body = (
"*** Begin Patch\n"
"*** Update File: u.py\n"
" def f():\n"
"- return 1\n"
"+ return 42\n"
"*** End Patch\n"
)
body = "*** Begin Patch\n*** Update File: u.py\n def f():\n- return 1\n+ return 42\n*** End Patch\n"
edit_fn = _get_tool_fn(file_ops_mcp, "edit_file")
result = edit_fn(mode="patch", patch_text=body)
assert "Error" not in result
@@ -347,13 +340,7 @@ class TestPatchToolPatchMode:
def test_patch_add_file(self, file_ops_mcp, tmp_path):
"""Add File: creates a new file from + lines."""
body = (
"*** Begin Patch\n"
"*** Add File: new.py\n"
"+# new\n"
"+x = 1\n"
"*** End Patch\n"
)
body = "*** Begin Patch\n*** Add File: new.py\n+# new\n+x = 1\n*** End Patch\n"
edit_fn = _get_tool_fn(file_ops_mcp, "edit_file")
result = edit_fn(mode="patch", patch_text=body)
assert "Error" not in result
+1 -3
@@ -466,9 +466,7 @@ class TestWebScrapeToolAIFriendlyOutput:
result = await web_scrape_fn(url="https://example.com")
assert "structured_data" in result
assert result["structured_data"]["json_ld"] == [
{"@type": "Article", "headline": "Hello"}
]
assert result["structured_data"]["json_ld"] == [{"@type": "Article", "headline": "Hello"}]
@pytest.mark.asyncio
@patch(_STEALTH_PATH)