fix: gcu system prompt

2026-04-13 10:00:00 -07:00
parent 273d4ec66e
commit 857af8e6a3
2 changed files with 171 additions and 167 deletions
@@ -17,20 +17,41 @@ Use browser nodes (with `tools: {policy: "all"}`) when:
 ## Available Browser Tools

 All tools are prefixed with `browser_`:
- `browser_start`, `browser_open` -- launch/navigate
- `browser_click`, `browser_fill`, `browser_type` -- interact
- `browser_snapshot` -- read page content (preferred over screenshot)
- `browser_screenshot` -- visual capture
- `browser_scroll`, `browser_wait` -- navigation helpers
- `browser_evaluate` -- run JavaScript
+- `browser_start`, `browser_open`, `browser_navigate` — launch/navigate
+- `browser_click`, `browser_click_coordinate`, `browser_fill`, `browser_type` — interact
+- `browser_press` (with optional `modifiers=["ctrl"]` etc.) — keyboard shortcuts
+- `browser_snapshot` — compact accessibility-tree read (structured)
+- `browser_screenshot` — visual capture (annotated PNG)
+- `browser_shadow_query`, `browser_get_rect` — locate elements (shadow-piercing via `>>>`)
+- `browser_coords` — convert image pixels to CSS pixels (always use `css_x/y`, never `physical_x/y`)
+- `browser_scroll`, `browser_wait` — navigation helpers
+- `browser_evaluate` — run JavaScript
+- `browser_close`, `browser_close_finished` — tab cleanup

-## System Prompt Tips for Browser Nodes
+## Pick the right reading tool
+
+**`browser_snapshot`** — compact accessibility tree of interactive elements. Fast, cheap, good for static or form-heavy pages where the DOM matches what's visually rendered (documentation, simple dashboards, search results, settings pages).
+
+**`browser_screenshot`** — visual capture + metadata (`cssWidth`, `devicePixelRatio`, scale fields). **Use this on any complex SPA** — LinkedIn, Twitter/X, Reddit, Gmail, Notion, Slack, Discord, any site using shadow DOM, virtual scrolling, React reconciliation, or dynamic layout. On these pages, snapshot refs go stale in seconds, shadow contents aren't in the AX tree, and virtual-scrolled elements disappear from the tree entirely. Screenshot is the **only** reliable way to orient yourself.
+
+Neither tool is "preferred" universally — they're for different jobs. Default to snapshot on text-heavy static pages, screenshot on SPAs and anything shadow-DOM-heavy. Activate the `browser-automation` skill for the full decision tree.
+
+## Coordinate rule: always CSS pixels
+
+Chrome DevTools Protocol `Input.dispatchMouseEvent` takes **CSS pixels**, not physical pixels. After a screenshot, use `browser_coords(image_x, image_y)` and feed the returned `css_x/y` (NOT `physical_x/y`) to `browser_click_coordinate`, `browser_hover_coordinate`, or `browser_press_at`. Feeding physical pixels on a HiDPI display (DPR=1.6, 2, or 3) overshoots by `DPR×`, so clicks land in the wrong place. `getBoundingClientRect()` already returns CSS pixels, so pass its values through unchanged with no DPR multiplication.
+
+## System prompt tips for browser nodes

 ```
-1. Use browser_snapshot() to read page content (NOT browser_get_text)
-2. Use browser_wait(seconds=2-3) after navigation for page load
-3. If you hit an auth wall, call set_output with an error and move on
-4. Keep tool calls per turn <= 10 for reliability
+1. On LinkedIn / X / Reddit / Gmail / any SPA — use browser_screenshot to orient,
+   not browser_snapshot. Shadow DOM and virtual scrolling make snapshots unreliable.
+2. For static pages (docs, forms, search results), browser_snapshot is fine.
+3. Before typing into a rich-text editor (X compose, LinkedIn DM, Gmail, Reddit),
+   click the input area first with browser_click_coordinate so React / Draft.js /
+   Lexical register a native focus event. Otherwise the send button stays disabled.
+4. Use browser_wait(seconds=2-3) after navigation for SPA hydration.
+5. If you hit an auth wall, call set_output with an error and move on.
+6. Keep tool calls per turn <= 10 for reliability.
 ```

 ## Example
@@ -43,7 +64,7 @@ All tools are prefixed with `browser_`:
  "tools": {"policy": "all"},
  "input_keys": ["search_url"],
  "output_keys": ["profiles"],
-  "system_prompt": "Navigate to the search URL, paginate through results..."
+  "system_prompt": "Navigate to the search URL via browser_navigate(wait_until='load', timeout_ms=20000). Wait 3s for SPA hydration. On LinkedIn, use browser_screenshot to see the page — browser_snapshot misses shadow-DOM and virtual-scrolled content. Paginate through results by scrolling and screenshotting; extract each profile card by reading its visible layout..."
 }
 ```

@@ -51,3 +72,7 @@ Connected via regular edges:
 ```
 search-setup -> scan-profiles -> process-results
 ```
+
+## Further detail
+
+For rich-text editor quirks (Lexical, Draft.js, ProseMirror), shadow-DOM shortcuts, `beforeunload` dialog neutralization, Trusted Types CSP on LinkedIn, keyboard shortcut dispatch, and per-site selector tables — **activate the `browser-automation` skill**. That skill has the full verified guidance and is refreshed against real production sites.
@@ -1,12 +1,19 @@
 """Browser automation best-practices prompt.

-This module provides ``GCU_BROWSER_SYSTEM_PROMPT`` -- a canonical set of
+This module provides ``GCU_BROWSER_SYSTEM_PROMPT`` — a canonical set of
 browser automation guidelines that can be included in any node's system
 prompt that uses browser tools from the gcu-tools MCP server.

 Browser tools are registered via the global MCP registry (gcu-tools).
 Nodes that need browser access declare ``tools: {policy: "all"}`` in their
 agent.json config.
+
+Note: the canonical source of truth for browser automation guidance is
+the ``browser-automation`` default skill at
+``core/framework/skills/_default_skills/browser-automation/SKILL.md``.
+Activate that skill for the full decision tree. This module holds a
+compact subset suitable for direct inlining into a node's system prompt
+when a skill activation is not desired.
 """

 GCU_BROWSER_SYSTEM_PROMPT = """\
@@ -14,172 +21,144 @@ GCU_BROWSER_SYSTEM_PROMPT = """\

 Follow these rules for reliable, efficient browser interaction.

-## Reading Pages
- ALWAYS prefer `browser_snapshot` over `browser_get_text("body")`
-  — it returns a compact ~1-5 KB accessibility tree vs 100+ KB of raw HTML.
- Interaction tools (`browser_click`, `browser_type`, `browser_fill`,
-  `browser_scroll`, etc.) return a page snapshot automatically in their
-  result. Use it to decide your next action — do NOT call
-  `browser_snapshot` separately after every action.
-  Only call `browser_snapshot` when you need a fresh view without
-  performing an action, or after setting `auto_snapshot=false`.
- Do NOT use `browser_screenshot` to read text — use
-  `browser_snapshot` for that (compact, searchable, fast).
- DO use `browser_screenshot` when you need visual context:
-  charts, images, canvas elements, layout verification, or when
-  the snapshot doesn't capture what you need.
- Only fall back to `browser_get_text` for extracting specific
-  small elements by CSS selector.
+## Pick the right reading tool

-## Navigation & Waiting
- `browser_navigate` and `browser_open` already wait for the page to
-  load (`domcontentloaded`). Do NOT call `browser_wait` with no
-  arguments after navigation — it wastes time.
-  Only use `browser_wait` when you need a *specific element* or *text*
-  to appear (pass `selector` or `text`).
- NEVER re-navigate to the same URL after scrolling
-  — this resets your scroll position and loses loaded content.
+- **`browser_snapshot`** — compact accessibility tree. Fast, cheap, good
+  for static / text-heavy pages where the DOM matches what's visually
+  rendered (docs, forms, search results, settings pages).
+- **`browser_screenshot`** — visual capture + scale metadata. Use on any
+  complex SPA (LinkedIn, X / Twitter, Reddit, Gmail, Notion, Slack,
+  Discord) and on any site using shadow DOM or virtual scrolling. On
+  those pages, snapshot refs go stale in seconds, shadow contents
+  aren't in the AX tree, and virtual-scrolled elements disappear from
+  the tree entirely — screenshots are the only reliable way to orient.
+
+Neither tool is "preferred" universally — they're for different jobs.
+Default to snapshot on static pages, screenshot on SPAs and
+shadow-heavy sites. Interaction tools (click/type/fill/scroll) return
+a snapshot automatically, so don't call `browser_snapshot` separately
+after an interaction unless you need a fresh view.
+
+Only fall back to `browser_get_text` for extracting small elements by
+CSS selector.
+
+## Coordinates: always CSS pixels
+
+Chrome DevTools Protocol `Input.dispatchMouseEvent` takes **CSS
+pixels**, not physical pixels. This is critical and easy to get wrong:
+
+| Tool | Unit |
+|---|---|
+| `browser_click_coordinate(x, y)` | **CSS pixels** |
+| `browser_hover_coordinate(x, y)` | **CSS pixels** |
+| `browser_press_at(x, y, key)` | **CSS pixels** |
+| `getBoundingClientRect()` | already CSS pixels — pass straight through |
+| `browser_coords(img_x, img_y)` | returns `css_x/y` (use this) and `physical_x/y` (debug only) |
+
+**Always use `css_x/y`** from `browser_coords`. Feeding `physical_x/y`
+on a HiDPI display overshoots by `DPR×` — clicks land DPR times too
+far right and down. On a DPR=1.6 display that's 60% off.
+
+Never multiply `getBoundingClientRect()` by `devicePixelRatio` — it's
+already in the right unit.
+
+## Rich-text editors (X, LinkedIn DMs, Gmail, Reddit, Slack, Discord)
+
+Click the input area first with `browser_click_coordinate` or
+`browser_click(selector)` BEFORE typing. React / Draft.js / Lexical /
+ProseMirror only register input as "real" after a native pointer-
+sourced focus event; JS `.focus()` is not enough. Without a real click
+first, the editor stays empty and the send button stays disabled.
+
+`browser_type` now does this automatically — it clicks the element,
+then inserts text via CDP `Input.insertText` (IME-commit style), which
+rich editors accept cleanly. Before clicking send, verify the submit
+button's `disabled` / `aria-disabled` state via `browser_evaluate`.
+
+## Shadow DOM
+
+Sites like LinkedIn messaging (`#interop-outlet`), Reddit (faceplate
+Web Components), and some X elements live inside shadow roots.
+`document.querySelector` and `wait_for_selector` do **not** see into
+shadow roots. But `browser_click_coordinate` **does** — CDP hit
+testing walks shadow roots natively, so coordinate-based operations
+reach shadow elements transparently.
+
+**Shadow-heavy site workflow:**
+1. `browser_screenshot()` → visual image
+2. Identify target visually → image coordinate
+3. `browser_coords(x, y)` → CSS px
+4. `browser_click_coordinate(css_x, css_y)` → lands via native hit
+   test; inputs get focused regardless of shadow depth
+5. Type via `browser_type` or, if the selector path can't reach the
+   element, dispatch keys to the focused element
+
+For selector-style access when you know the shadow path:
+`browser_shadow_query("#interop-outlet >>> #msg-overlay >>> p")` —
+returns a CSS-px rect you can feed directly to click tools.
+
+## Navigation & waiting
+
+- `browser_navigate(wait_until="load")` returns when the page fires
+  load. On SPAs (LinkedIn especially — 4–5 seconds), add a 2–3 s sleep
+  after to let React/Vue hydrate before querying for chrome elements.
+- Never re-navigate to the same URL after scrolling — resets scroll.
+- Use `timeout_ms=20000` for heavy SPAs.
+- `wait_for_selector` / `wait_for_text` resolve in milliseconds when
+  the element is already in the DOM — no need to sleep if you can
+  express the wait condition.
+
+## Keyboard shortcuts
+
+`browser_press("a", modifiers=["ctrl"])` for Ctrl+A. Accepted
+modifiers: `"alt"`, `"ctrl"`/`"control"`, `"meta"`/`"cmd"`,
+`"shift"`. The tool dispatches the modifier key first, then the main
+key with `code` and `windowsVirtualKeyCode` populated (Chrome's
+shortcut dispatcher requires both), then releases in reverse order.

 ## Scrolling
- Use large scroll amounts ~2000 when loading more content
-  — sites like twitter and linkedin have lazy loading for paging.
- The scroll result includes a snapshot automatically — no need to call
-  `browser_snapshot` separately.

-## Batching Actions
- You can call multiple tools in a single turn — they execute in parallel.
-  ALWAYS batch independent actions together. Examples:
-  - Fill multiple form fields in one turn.
-  - Navigate + snapshot in one turn.
-  - Click + scroll if targeting different elements.
- When batching, set `auto_snapshot=false` on all but the last action
-  to avoid redundant snapshots.
- Aim for 3-5 tool calls per turn minimum. One tool call per turn is
-  wasteful.
+- Use large amounts (~2000 px) for lazy-loaded sites (X, LinkedIn).
+- Scroll result includes a snapshot — don't call `browser_snapshot`
+  separately.

-## Error Recovery
- If a tool fails, retry once with the same approach.
- If it fails a second time, STOP retrying and switch approach.
- If `browser_snapshot` fails → try `browser_get_text` with a
-  specific small selector as fallback.
- If `browser_open` fails or page seems stale → `browser_stop`,
-  then `browser_start`, then retry.
+## Batching

-## Tab Management
+- Multiple tool calls per turn execute in parallel. Batch independent
+  actions together: fill multiple fields, navigate + snapshot,
+  different-target click + scroll.
+- Set `auto_snapshot=false` on all but the last when batching.
+- Aim for 3–5 tool calls per turn minimum.

-**Close tabs as soon as you are done with them** — not only at the end of the task.
-After reading or extracting data from a tab, close it immediately.
+## Tab management

-**Decision rules:**
- Finished reading/extracting from a tab? → `browser_close(target_id=...)`
- Completed a multi-tab workflow? → `browser_close_finished()` to clean up all your tabs
- More than 3 tabs open? → stop and close finished ones before opening more
- Popup appeared that you didn't need? → close it immediately
+Close tabs as soon as you're done with them — not only at the end of
+the task. `browser_close(target_id=...)` for one, `browser_close_finished()`
+for a full cleanup. Never accumulate more than 3 open tabs.
+`browser_tabs` reports an `origin` field: `"agent"` (you own it, close
+when done), `"popup"` (close after extracting), `"startup"`/`"user"`
+(leave alone).

-**Origin awareness:** `browser_tabs` returns an `origin` field for each tab:
- `"agent"` — you opened it; you own it; close it when done
- `"popup"` — opened by a link or script; close after extracting what you need
- `"startup"` or `"user"` — leave these alone unless the task requires it
+## Login & auth walls

-**Cleanup tools:**
- `browser_close(target_id=...)` — close one specific tab
- `browser_close_finished()` — close all your agent/popup tabs (safe: leaves startup/user tabs)
- `browser_close_all()` — close everything except the active tab (use only for full reset)
+Report the auth wall and stop — do NOT attempt to log in. Dismiss
+cookie consent banners if they block content.

-**Multi-tab workflow pattern:**
-1. Open background tabs with `browser_open(url=..., background=true)` to stay on current tab
-2. Process each tab and close it with `browser_close` when done
-3. When the full workflow completes, call `browser_close_finished()` to confirm cleanup
-4. Check `browser_tabs` at any point — it shows `origin` and `age_seconds` per tab
+## Error recovery

-Never accumulate tabs. Treat every tab you open as a resource you must free.
+- Retry once on failure, then switch approach.
+- If `browser_snapshot` fails, try `browser_get_text` with a narrow
+  selector as fallback.
+- If `browser_open` fails or the page seems stale, `browser_stop` →
+  `browser_start` → retry.

-## Shadow DOM & Overlays
+## `browser_evaluate`

-Some sites (LinkedIn messaging, etc.) render content inside closed shadow roots that are
-invisible to regular DOM queries and `browser_snapshot` coordinates.
-
-**Detecting shadow DOM**: `document.elementFromPoint(x, y)` returns a zero-height host element
-(e.g. `#interop-outlet`) for the entire overlay area — this is normal, not a bug.
-`document.body.innerText` and `document.querySelectorAll` return nothing for shadow content.
-`browser_snapshot` CAN read shadow DOM text but cannot return coordinates.
-
-**Querying into shadow DOM:**
-```
-browser_shadow_query("#interop-outlet >>> #msg-overlay >>> p")
-```
-Uses `>>>` to pierce shadow roots. Returns `rect` in CSS pixels and `physicalRect` ready for
-`browser_click_coordinate` / `browser_hover_coordinate`.
-
-**Getting physical rect for any element (including shadow DOM):**
-```
-browser_get_rect(selector="#interop-outlet >>> .msg-convo-wrapper", pierce_shadow=true)
-```
-
-**Manual JS traversal when selector is dynamic:**
-```js
-const shadow = document.getElementById('interop-outlet').shadowRoot;
-const convo = shadow.querySelector('#ember37');
-const rect = convo.querySelector('p').getBoundingClientRect();
-// rect is in CSS pixels — multiply by DPR for physical pixels
-```
-Pass this as a multi-statement script to `browser_evaluate`; it wraps automatically in an IIFE.
-Use `JSON.stringify(rect)` to serialize the result.
-
-## Coordinate System
-
-There are THREE coordinate spaces. Using the wrong one causes clicks/hovers to land in the
-wrong place.
-
-| Space | Used by | How to get |
-|---|---|---|
-| Physical pixels | `browser_click_coordinate` | `browser_coords` `physical_x/y` |
-| CSS pixels | `getBoundingClientRect()`, `elementFromPoint` | `browser_coords` `css_x/y` |
-| Screenshot pixels | What you see in the 800px image | Raw position in screenshot |
-
-**Converting screenshot → physical**: `browser_coords(x, y)` → use `physical_x/y`.
-**Converting CSS → physical**: multiply by `window.devicePixelRatio` (typically 1.6 on HiDPI).
-**Never** pass raw `getBoundingClientRect()` values to `browser_hover_coordinate` without
-multiplying by DPR first.
-
-## Screenshots
-
-Screenshot data is base64-encoded PNG. To view it:
-```
-run_command("echo '<base64_data>' | base64 -d > /tmp/screenshot.png")
-```
-Then use `read_file("/tmp/screenshot.png")` to view the image.
-
-Always use `full_page=false` (default) unless you specifically need the full scrolled page.
-
-## JavaScript Evaluation
-
-`browser_evaluate` wraps your script in an IIFE automatically:
- Single expression (`document.title`) → wrapped with `return`
- Multi-statement or contains `;`/`\n` → wrapped without return (add explicit `return` yourself)
- Already an IIFE → run as-is
-
-**Avoid**: complex closures with `return` inside `for` loops — Chrome CDP returns `null`.
-**Use instead**: `Array.from(...).map(...).join(...)` chains, or build result objects and
-`JSON.stringify()` them.
-
-**For shadow DOM traversal with dynamic selectors**, write the full JS path:
-```js
-const s = document.getElementById('interop-outlet').shadowRoot;
-const el = s.querySelector('.msg-convo-wrapper');
-return JSON.stringify(el.getBoundingClientRect());
-```
-
-## Login & Auth Walls
- If you see a "Log in" or "Sign up" prompt instead of expected
-  content, report the auth wall immediately — do NOT attempt to log in.
- Check for cookie consent banners and dismiss them if they block content.
-
-## Efficiency
- Minimize tool calls — combine actions where possible.
- When a snapshot result is saved to a spillover file, use
-  `run_command` with grep to extract specific data rather than
-  re-reading the full file.
- Call `set_output` in the same turn as your last browser action
-  when possible — don't waste a turn.
+Use for reading state inside a shadow root that standard tools don't
+handle, for one-shot site-specific actions, or to measure layout the
+tools don't expose. On strict-CSP sites (LinkedIn, some X surfaces),
+do NOT assign to `innerHTML`: Trusted Types silently drops the
+assignment. Always use `createElement` + `appendChild` + `setAttribute`
+for DOM injection on those sites. `style.cssText`, `textContent`, and
+`.value` assignments are fine.
 """