fix: gcu system prompt

This commit is contained in:
Timothy
2026-04-13 10:00:00 -07:00
parent 273d4ec66e
commit 857af8e6a3
2 changed files with 171 additions and 167 deletions
@@ -17,20 +17,41 @@ Use browser nodes (with `tools: {policy: "all"}`) when:
## Available Browser Tools
All tools are prefixed with `browser_`:
- `browser_start`, `browser_open` -- launch/navigate
- `browser_click`, `browser_fill`, `browser_type` -- interact
- `browser_snapshot` -- read page content (preferred over screenshot)
- `browser_screenshot` -- visual capture
- `browser_scroll`, `browser_wait` -- navigation helpers
- `browser_evaluate` -- run JavaScript
- `browser_start`, `browser_open`, `browser_navigate` launch/navigate
- `browser_click`, `browser_click_coordinate`, `browser_fill`, `browser_type` interact
- `browser_press` (with optional `modifiers=["ctrl"]` etc.) — keyboard shortcuts
- `browser_snapshot` — compact accessibility-tree read (structured)
- `browser_screenshot` — visual capture (annotated PNG)
- `browser_shadow_query`, `browser_get_rect` — locate elements (shadow-piercing via `>>>`)
- `browser_coords` — convert image pixels to CSS pixels (always use `css_x/y`, never `physical_x/y`)
- `browser_scroll`, `browser_wait` — navigation helpers
- `browser_evaluate` — run JavaScript
- `browser_close`, `browser_close_finished` — tab cleanup
## System Prompt Tips for Browser Nodes
## Pick the right reading tool
**`browser_snapshot`** — compact accessibility tree of interactive elements. Fast, cheap, good for static or form-heavy pages where the DOM matches what's visually rendered (documentation, simple dashboards, search results, settings pages).
**`browser_screenshot`** — visual capture + metadata (`cssWidth`, `devicePixelRatio`, scale fields). **Use this on any complex SPA** — LinkedIn, Twitter/X, Reddit, Gmail, Notion, Slack, Discord, any site using shadow DOM, virtual scrolling, React reconciliation, or dynamic layout. On these pages, snapshot refs go stale in seconds, shadow contents aren't in the AX tree, and virtual-scrolled elements disappear from the tree entirely. Screenshot is the **only** reliable way to orient yourself.
Neither tool is "preferred" universally — they're for different jobs. Default to snapshot on text-heavy static pages, screenshot on SPAs and anything shadow-DOM-heavy. Activate the `browser-automation` skill for the full decision tree.
## Coordinate rule: always CSS pixels
Chrome DevTools Protocol `Input.dispatchMouseEvent` takes **CSS pixels**, not physical pixels. After a screenshot, use `browser_coords(image_x, image_y)` and feed the returned `css_x/y` (NOT `physical_x/y`) to `browser_click_coordinate`, `browser_hover_coordinate`, `browser_press_at`. Feeding physical pixels on a HiDPI display (DPR=1.6, 2, or 3) overshoots by `DPR×` and clicks land in the wrong place. `getBoundingClientRect()` already returns CSS pixels — pass through unchanged, no DPR multiplication.
## System prompt tips for browser nodes
```
1. Use browser_snapshot() to read page content (NOT browser_get_text)
2. Use browser_wait(seconds=2-3) after navigation for page load
3. If you hit an auth wall, call set_output with an error and move on
4. Keep tool calls per turn <= 10 for reliability
1. On LinkedIn / X / Reddit / Gmail / any SPA — use browser_screenshot to orient,
not browser_snapshot. Shadow DOM and virtual scrolling make snapshots unreliable.
2. For static pages (docs, forms, search results), browser_snapshot is fine.
3. Before typing into a rich-text editor (X compose, LinkedIn DM, Gmail, Reddit),
click the input area first with browser_click_coordinate so React / Draft.js /
Lexical register a native focus event. Otherwise the send button stays disabled.
4. Use browser_wait(seconds=2-3) after navigation for SPA hydration.
5. If you hit an auth wall, call set_output with an error and move on.
6. Keep tool calls per turn <= 10 for reliability.
```
## Example
@@ -43,7 +64,7 @@ All tools are prefixed with `browser_`:
"tools": {"policy": "all"},
"input_keys": ["search_url"],
"output_keys": ["profiles"],
"system_prompt": "Navigate to the search URL, paginate through results..."
"system_prompt": "Navigate to the search URL via browser_navigate(wait_until='load', timeout_ms=20000). Wait 3s for SPA hydration. On LinkedIn, use browser_screenshot to see the page — browser_snapshot misses shadow-DOM and virtual-scrolled content. Paginate through results by scrolling and screenshotting; extract each profile card by reading its visible layout..."
}
```
@@ -51,3 +72,7 @@ Connected via regular edges:
```
search-setup -> scan-profiles -> process-results
```
## Further detail
For rich-text editor quirks (Lexical, Draft.js, ProseMirror), shadow-DOM shortcuts, `beforeunload` dialog neutralization, Trusted Types CSP on LinkedIn, keyboard shortcut dispatch, and per-site selector tables — **activate the `browser-automation` skill**. That skill has the full verified guidance and is refreshed against real production sites.
+134 -155
View File
@@ -1,12 +1,19 @@
"""Browser automation best-practices prompt.
This module provides ``GCU_BROWSER_SYSTEM_PROMPT`` -- a canonical set of
This module provides ``GCU_BROWSER_SYSTEM_PROMPT`` a canonical set of
browser automation guidelines that can be included in any node's system
prompt that uses browser tools from the gcu-tools MCP server.
Browser tools are registered via the global MCP registry (gcu-tools).
Nodes that need browser access declare ``tools: {policy: "all"}`` in their
agent.json config.
Note: the canonical source of truth for browser automation guidance is
the ``browser-automation`` default skill at
``core/framework/skills/_default_skills/browser-automation/SKILL.md``.
Activate that skill for the full decision tree. This module holds a
compact subset suitable for direct inlining into a node's system prompt
when a skill activation is not desired.
"""
GCU_BROWSER_SYSTEM_PROMPT = """\
@@ -14,172 +21,144 @@ GCU_BROWSER_SYSTEM_PROMPT = """\
Follow these rules for reliable, efficient browser interaction.
## Reading Pages
- ALWAYS prefer `browser_snapshot` over `browser_get_text("body")`
it returns a compact ~1-5 KB accessibility tree vs 100+ KB of raw HTML.
- Interaction tools (`browser_click`, `browser_type`, `browser_fill`,
`browser_scroll`, etc.) return a page snapshot automatically in their
result. Use it to decide your next action do NOT call
`browser_snapshot` separately after every action.
Only call `browser_snapshot` when you need a fresh view without
performing an action, or after setting `auto_snapshot=false`.
- Do NOT use `browser_screenshot` to read text use
`browser_snapshot` for that (compact, searchable, fast).
- DO use `browser_screenshot` when you need visual context:
charts, images, canvas elements, layout verification, or when
the snapshot doesn't capture what you need.
- Only fall back to `browser_get_text` for extracting specific
small elements by CSS selector.
## Pick the right reading tool
## Navigation & Waiting
- `browser_navigate` and `browser_open` already wait for the page to
load (`domcontentloaded`). Do NOT call `browser_wait` with no
arguments after navigation it wastes time.
Only use `browser_wait` when you need a *specific element* or *text*
to appear (pass `selector` or `text`).
- NEVER re-navigate to the same URL after scrolling
this resets your scroll position and loses loaded content.
- **`browser_snapshot`** compact accessibility tree. Fast, cheap, good
for static / text-heavy pages where the DOM matches what's visually
rendered (docs, forms, search results, settings pages).
- **`browser_screenshot`** visual capture + scale metadata. Use on any
complex SPA (LinkedIn, X / Twitter, Reddit, Gmail, Notion, Slack,
Discord) and on any site using shadow DOM or virtual scrolling. On
those pages, snapshot refs go stale in seconds, shadow contents
aren't in the AX tree, and virtual-scrolled elements disappear from
the tree entirely screenshots are the only reliable way to orient.
Neither tool is "preferred" universally they're for different jobs.
Default to snapshot on static pages, screenshot on SPAs and
shadow-heavy sites. Interaction tools (click/type/fill/scroll) return
a snapshot automatically, so don't call `browser_snapshot` separately
after an interaction unless you need a fresh view.
Only fall back to `browser_get_text` for extracting small elements by
CSS selector.
## Coordinates: always CSS pixels
Chrome DevTools Protocol `Input.dispatchMouseEvent` takes **CSS
pixels**, not physical pixels. This is critical and often gets wrong:
| Tool | Unit |
|---|---|
| `browser_click_coordinate(x, y)` | **CSS pixels** |
| `browser_hover_coordinate(x, y)` | **CSS pixels** |
| `browser_press_at(x, y, key)` | **CSS pixels** |
| `getBoundingClientRect()` | already CSS pixels pass straight through |
| `browser_coords(img_x, img_y)` | returns `css_x/y` (use this) and `physical_x/y` (debug only) |
**Always use `css_x/y`** from `browser_coords`. Feeding `physical_x/y`
on a HiDPI display overshoots by `DPR×` clicks land DPR times too
far right and down. On a DPR=1.6 display that's 60% off.
Never multiply `getBoundingClientRect()` by `devicePixelRatio` it's
already in the right unit.
## Rich-text editors (X, LinkedIn DMs, Gmail, Reddit, Slack, Discord)
Click the input area first with `browser_click_coordinate` or
`browser_click(selector)` BEFORE typing. React / Draft.js / Lexical /
ProseMirror only register input as "real" after a native pointer-
sourced focus event; JS `.focus()` is not enough. Without a real click
first, the editor stays empty and the send button stays disabled.
`browser_type` now does this automatically it clicks the element,
then inserts text via CDP `Input.insertText` (IME-commit style), which
rich editors accept cleanly. Before clicking send, verify the submit
button's `disabled` / `aria-disabled` state via `browser_evaluate`.
## Shadow DOM
Sites like LinkedIn messaging (`#interop-outlet`), Reddit (faceplate
Web Components), and some X elements live inside shadow roots.
`document.querySelector` and `wait_for_selector` do **not** see into
shadow roots. But `browser_click_coordinate` **does** CDP hit
testing walks shadow roots natively, so coordinate-based operations
reach shadow elements transparently.
**Shadow-heavy site workflow:**
1. `browser_screenshot()` visual image
2. Identify target visually image coordinate
3. `browser_coords(x, y)` CSS px
4. `browser_click_coordinate(css_x, css_y)` lands via native hit
test; inputs get focused regardless of shadow depth
5. Type via `browser_type` or, if the selector path can't reach the
element, dispatch keys to the focused element
For selector-style access when you know the shadow path:
`browser_shadow_query("#interop-outlet >>> #msg-overlay >>> p")`
returns a CSS-px rect you can feed directly to click tools.
## Navigation & waiting
- `browser_navigate(wait_until="load")` returns when the page fires
load. On SPAs (LinkedIn especially 45 seconds), add a 23 s sleep
after to let React/Vue hydrate before querying for chrome elements.
- Never re-navigate to the same URL after scrolling resets scroll.
- Use `timeout_ms=20000` for heavy SPAs.
- `wait_for_selector` / `wait_for_text` resolve in milliseconds when
the element is already in the DOM no need to sleep if you can
express the wait condition.
## Keyboard shortcuts
`browser_press("a", modifiers=["ctrl"])` for Ctrl+A. Accepted
modifiers: `"alt"`, `"ctrl"`/`"control"`, `"meta"`/`"cmd"`,
`"shift"`. The tool dispatches the modifier key first, then the main
key with `code` and `windowsVirtualKeyCode` populated (Chrome's
shortcut dispatcher requires both), then releases in reverse order.
## Scrolling
- Use large scroll amounts ~2000 when loading more content
sites like twitter and linkedin have lazy loading for paging.
- The scroll result includes a snapshot automatically no need to call
`browser_snapshot` separately.
## Batching Actions
- You can call multiple tools in a single turn they execute in parallel.
ALWAYS batch independent actions together. Examples:
- Fill multiple form fields in one turn.
- Navigate + snapshot in one turn.
- Click + scroll if targeting different elements.
- When batching, set `auto_snapshot=false` on all but the last action
to avoid redundant snapshots.
- Aim for 3-5 tool calls per turn minimum. One tool call per turn is
wasteful.
- Use large amounts (~2000 px) for lazy-loaded sites (X, LinkedIn).
- Scroll result includes a snapshot don't call `browser_snapshot`
separately.
## Error Recovery
- If a tool fails, retry once with the same approach.
- If it fails a second time, STOP retrying and switch approach.
- If `browser_snapshot` fails try `browser_get_text` with a
specific small selector as fallback.
- If `browser_open` fails or page seems stale `browser_stop`,
then `browser_start`, then retry.
## Batching
## Tab Management
- Multiple tool calls per turn execute in parallel. Batch independent
actions together: fill multiple fields, navigate + snapshot,
different-target click + scroll.
- Set `auto_snapshot=false` on all but the last when batching.
- Aim for 35 tool calls per turn minimum.
**Close tabs as soon as you are done with them** not only at the end of the task.
After reading or extracting data from a tab, close it immediately.
## Tab management
**Decision rules:**
- Finished reading/extracting from a tab? `browser_close(target_id=...)`
- Completed a multi-tab workflow? `browser_close_finished()` to clean up all your tabs
- More than 3 tabs open? stop and close finished ones before opening more
- Popup appeared that you didn't need? → close it immediately
Close tabs as soon as you're done with them — not only at the end of
the task. `browser_close(target_id=...)` for one, `browser_close_finished()`
for a full cleanup. Never accumulate more than 3 open tabs.
`browser_tabs` reports an `origin` field: `"agent"` (you own it, close
when done), `"popup"` (close after extracting), `"startup"`/`"user"`
(leave alone).
**Origin awareness:** `browser_tabs` returns an `origin` field for each tab:
- `"agent"` you opened it; you own it; close it when done
- `"popup"` opened by a link or script; close after extracting what you need
- `"startup"` or `"user"` leave these alone unless the task requires it
## Login & auth walls
**Cleanup tools:**
- `browser_close(target_id=...)` close one specific tab
- `browser_close_finished()` close all your agent/popup tabs (safe: leaves startup/user tabs)
- `browser_close_all()` close everything except the active tab (use only for full reset)
Report the auth wall and stop do NOT attempt to log in. Dismiss
cookie consent banners if they block content.
**Multi-tab workflow pattern:**
1. Open background tabs with `browser_open(url=..., background=true)` to stay on current tab
2. Process each tab and close it with `browser_close` when done
3. When the full workflow completes, call `browser_close_finished()` to confirm cleanup
4. Check `browser_tabs` at any point it shows `origin` and `age_seconds` per tab
## Error recovery
Never accumulate tabs. Treat every tab you open as a resource you must free.
- Retry once on failure, then switch approach.
- If `browser_snapshot` fails, try `browser_get_text` with a narrow
selector as fallback.
- If `browser_open` fails or the page seems stale, `browser_stop`
`browser_start` retry.
## Shadow DOM & Overlays
## `browser_evaluate`
Some sites (LinkedIn messaging, etc.) render content inside closed shadow roots that are
invisible to regular DOM queries and `browser_snapshot` coordinates.
**Detecting shadow DOM**: `document.elementFromPoint(x, y)` returns a zero-height host element
(e.g. `#interop-outlet`) for the entire overlay area — this is normal, not a bug.
`document.body.innerText` and `document.querySelectorAll` return nothing for shadow content.
`browser_snapshot` CAN read shadow DOM text but cannot return coordinates.
**Querying into shadow DOM:**
```
browser_shadow_query("#interop-outlet >>> #msg-overlay >>> p")
```
Uses `>>>` to pierce shadow roots. Returns `rect` in CSS pixels and `physicalRect` ready for
`browser_click_coordinate` / `browser_hover_coordinate`.
**Getting physical rect for any element (including shadow DOM):**
```
browser_get_rect(selector="#interop-outlet >>> .msg-convo-wrapper", pierce_shadow=true)
```
**Manual JS traversal when selector is dynamic:**
```js
const shadow = document.getElementById('interop-outlet').shadowRoot;
const convo = shadow.querySelector('#ember37');
const rect = convo.querySelector('p').getBoundingClientRect();
// rect is in CSS pixels multiply by DPR for physical pixels
```
Pass this as a multi-statement script to `browser_evaluate`; it wraps automatically in an IIFE.
Use `JSON.stringify(rect)` to serialize the result.
## Coordinate System
There are THREE coordinate spaces. Using the wrong one causes clicks/hovers to land in the
wrong place.
| Space | Used by | How to get |
|---|---|---|
| Physical pixels | `browser_click_coordinate` | `browser_coords` `physical_x/y` |
| CSS pixels | `getBoundingClientRect()`, `elementFromPoint` | `browser_coords` `css_x/y` |
| Screenshot pixels | What you see in the 800px image | Raw position in screenshot |
**Converting screenshot physical**: `browser_coords(x, y)` use `physical_x/y`.
**Converting CSS physical**: multiply by `window.devicePixelRatio` (typically 1.6 on HiDPI).
**Never** pass raw `getBoundingClientRect()` values to `browser_hover_coordinate` without
multiplying by DPR first.
## Screenshots
Screenshot data is base64-encoded PNG. To view it:
```
run_command("echo '<base64_data>' | base64 -d > /tmp/screenshot.png")
```
Then use `read_file("/tmp/screenshot.png")` to view the image.
Always use `full_page=false` (default) unless you specifically need the full scrolled page.
## JavaScript Evaluation
`browser_evaluate` wraps your script in an IIFE automatically:
- Single expression (`document.title`) wrapped with `return`
- Multi-statement or contains `;`/`\n` wrapped without return (add explicit `return` yourself)
- Already an IIFE run as-is
**Avoid**: complex closures with `return` inside `for` loops Chrome CDP returns `null`.
**Use instead**: `Array.from(...).map(...).join(...)` chains, or build result objects and
`JSON.stringify()` them.
**For shadow DOM traversal with dynamic selectors**, write the full JS path:
```js
const s = document.getElementById('interop-outlet').shadowRoot;
const el = s.querySelector('.msg-convo-wrapper');
return JSON.stringify(el.getBoundingClientRect());
```
## Login & Auth Walls
- If you see a "Log in" or "Sign up" prompt instead of expected
content, report the auth wall immediately do NOT attempt to log in.
- Check for cookie consent banners and dismiss them if they block content.
## Efficiency
- Minimize tool calls combine actions where possible.
- When a snapshot result is saved to a spillover file, use
`run_command` with grep to extract specific data rather than
re-reading the full file.
- Call `set_output` in the same turn as your last browser action
when possible don't waste a turn.
Use for reading state inside a shadow root that standard tools don't
handle, for one-shot site-specific actions, or to measure layout the
tools don't expose. Do NOT use it on a strict-CSP site (LinkedIn,
some X surfaces) with `innerHTML` Trusted Types silently drops the
assignment. Always use `createElement` + `appendChild` + `setAttribute`
for DOM injection on those sites. `style.cssText`, `textContent`, and
`.value` assignments are fine.
"""