fix: gcu system prompt
@@ -17,20 +17,41 @@ Use browser nodes (with `tools: {policy: "all"}`) when:
## Available Browser Tools

All tools are prefixed with `browser_`:
- `browser_start`, `browser_open` -- launch/navigate
- `browser_click`, `browser_fill`, `browser_type` -- interact
- `browser_snapshot` -- read page content (preferred over screenshot)
- `browser_screenshot` -- visual capture
- `browser_scroll`, `browser_wait` -- navigation helpers
- `browser_evaluate` -- run JavaScript
- `browser_start`, `browser_open`, `browser_navigate` — launch/navigate
- `browser_click`, `browser_click_coordinate`, `browser_fill`, `browser_type` — interact
- `browser_press` (with optional `modifiers=["ctrl"]` etc.) — keyboard shortcuts
- `browser_snapshot` — compact accessibility-tree read (structured)
- `browser_screenshot` — visual capture (annotated PNG)
- `browser_shadow_query`, `browser_get_rect` — locate elements (shadow-piercing via `>>>`)
- `browser_coords` — convert image pixels to CSS pixels (always use `css_x/y`, never `physical_x/y`)
- `browser_scroll`, `browser_wait` — navigation helpers
- `browser_evaluate` — run JavaScript
- `browser_close`, `browser_close_finished` — tab cleanup

## System Prompt Tips for Browser Nodes
## Pick the right reading tool

**`browser_snapshot`** — compact accessibility tree of interactive elements. Fast, cheap, good for static or form-heavy pages where the DOM matches what's visually rendered (documentation, simple dashboards, search results, settings pages).

**`browser_screenshot`** — visual capture + metadata (`cssWidth`, `devicePixelRatio`, scale fields). **Use this on any complex SPA** — LinkedIn, Twitter/X, Reddit, Gmail, Notion, Slack, Discord, any site using shadow DOM, virtual scrolling, React reconciliation, or dynamic layout. On these pages, snapshot refs go stale in seconds, shadow contents aren't in the AX tree, and virtual-scrolled elements disappear from the tree entirely. Screenshot is the **only** reliable way to orient yourself.

Neither tool is "preferred" universally — they're for different jobs. Default to snapshot on text-heavy static pages, screenshot on SPAs and anything shadow-DOM-heavy. Activate the `browser-automation` skill for the full decision tree.
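
The default rule above can be sketched as a small helper. This is only an illustration: the function name `pick_reading_tool`, its two flags, and the hostname list are assumptions made for this example (none of them are part of gcu-tools), and the `browser-automation` skill's decision tree remains the fuller reference.

```python
from urllib.parse import urlparse

# Hypothetical hostnames for the SPAs named above; adjust for your own targets.
SPA_HOSTS = (
    "linkedin.com", "twitter.com", "x.com", "reddit.com",
    "mail.google.com", "notion.so", "slack.com", "discord.com",
)


def pick_reading_tool(
    url: str,
    uses_shadow_dom: bool = False,
    uses_virtual_scrolling: bool = False,
) -> str:
    """Return the name of the reading tool to default to for this page."""
    host = (urlparse(url).hostname or "").lower()
    # Match the bare domain or any subdomain of it (www.linkedin.com, old.reddit.com).
    is_spa = any(host == h or host.endswith("." + h) for h in SPA_HOSTS)
    if is_spa or uses_shadow_dom or uses_virtual_scrolling:
        return "browser_screenshot"
    return "browser_snapshot"
```

A caller would still override the result when a page behaves unexpectedly, for example a static-looking page that turns out to render its content inside shadow roots.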

## Coordinate rule: always CSS pixels

Chrome DevTools Protocol `Input.dispatchMouseEvent` takes **CSS pixels**, not physical pixels. After a screenshot, use `browser_coords(image_x, image_y)` and feed the returned `css_x/y` (NOT `physical_x/y`) to `browser_click_coordinate`, `browser_hover_coordinate`, or `browser_press_at`. Feeding physical pixels on a HiDPI display (DPR=1.6, 2, or 3) overshoots by `DPR×` and clicks land in the wrong place. `getBoundingClientRect()` already returns CSS pixels — pass through unchanged, no DPR multiplication.
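
The arithmetic behind this rule can be made concrete with a short sketch. It is not the implementation of `browser_coords`: the `ScreenshotMeta` type, its `image_width` field, and both helper functions are hypothetical, and the sketch assumes the screenshot is scaled uniformly in both axes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenshotMeta:
    """Hypothetical container for the scale metadata returned with a screenshot."""
    image_width: int            # width of the screenshot image, in image pixels (assumed field)
    css_width: float            # viewport width in CSS pixels (the `cssWidth` field)
    device_pixel_ratio: float   # the `devicePixelRatio` field


def image_to_css(image_x: float, image_y: float, meta: ScreenshotMeta) -> tuple[float, float]:
    """Convert a point picked in the screenshot to CSS pixels."""
    scale = meta.css_width / meta.image_width
    return image_x * scale, image_y * scale


def css_to_physical(css_x: float, css_y: float, meta: ScreenshotMeta) -> tuple[float, float]:
    """Physical pixels are CSS pixels times DPR; useful for debugging only."""
    return css_x * meta.device_pixel_ratio, css_y * meta.device_pixel_ratio


meta = ScreenshotMeta(image_width=800, css_width=1600.0, device_pixel_ratio=2.0)
css = image_to_css(400, 300, meta)        # the values to click with
physical = css_to_physical(*css, meta)    # DPR times larger; clicking here would overshoot
```

With these example numbers the click target is `(800.0, 600.0)` in CSS pixels, while the physical value is `(1600.0, 1200.0)`, which is the `DPR×` overshoot the paragraph warns about.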

## System prompt tips for browser nodes

```
1. Use browser_snapshot() to read page content (NOT browser_get_text)
2. Use browser_wait(seconds=2-3) after navigation for page load
3. If you hit an auth wall, call set_output with an error and move on
4. Keep tool calls per turn <= 10 for reliability
1. On LinkedIn / X / Reddit / Gmail / any SPA — use browser_screenshot to orient,
   not browser_snapshot. Shadow DOM and virtual scrolling make snapshots unreliable.
2. For static pages (docs, forms, search results), browser_snapshot is fine.
3. Before typing into a rich-text editor (X compose, LinkedIn DM, Gmail, Reddit),
   click the input area first with browser_click_coordinate so React / Draft.js /
   Lexical register a native focus event. Otherwise the send button stays disabled.
4. Use browser_wait(seconds=2-3) after navigation for SPA hydration.
5. If you hit an auth wall, call set_output with an error and move on.
6. Keep tool calls per turn <= 10 for reliability.
```

## Example
@@ -43,7 +64,7 @@ All tools are prefixed with `browser_`:
"tools": {"policy": "all"},
"input_keys": ["search_url"],
"output_keys": ["profiles"],
"system_prompt": "Navigate to the search URL, paginate through results..."
"system_prompt": "Navigate to the search URL via browser_navigate(wait_until='load', timeout_ms=20000). Wait 3s for SPA hydration. On LinkedIn, use browser_screenshot to see the page — browser_snapshot misses shadow-DOM and virtual-scrolled content. Paginate through results by scrolling and screenshotting; extract each profile card by reading its visible layout..."
}
```

@@ -51,3 +72,7 @@ Connected via regular edges:
```
search-setup -> scan-profiles -> process-results
```
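
For readers who assemble node configs in code, the example above could be built along these lines. Treat it as a sketch under assumptions: `build_browser_node` is a hypothetical helper, the `"id"` key is a guess (the excerpt does not show how nodes are named), and the constant defined here is a one-line stand-in for the real `GCU_BROWSER_SYSTEM_PROMPT` so that the snippet runs on its own.

```python
import json

# Stand-in for the real constant; in practice, import it from the module in this commit.
GCU_BROWSER_SYSTEM_PROMPT = "Follow these rules for reliable, efficient browser interaction."


def build_browser_node(node_id, task_prompt, input_keys, output_keys):
    """Return a node config dict that grants browser tools and inlines the guidance."""
    return {
        "id": node_id,                    # hypothetical key, not shown in the excerpt
        "tools": {"policy": "all"},       # required for browser access
        "input_keys": list(input_keys),
        "output_keys": list(output_keys),
        # Task-specific instructions first, shared browser guidance after.
        "system_prompt": f"{task_prompt}\n\n{GCU_BROWSER_SYSTEM_PROMPT}",
    }


node = build_browser_node(
    "scan-profiles",
    "Navigate to the search URL, paginate through results...",
    ["search_url"],
    ["profiles"],
)
print(json.dumps(node, indent=2))
```

Inlining the compact prompt this way suits nodes that should not activate the `browser-automation` skill; nodes that can activate it get the fuller guidance from the skill itself.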

## Further detail

For rich-text editor quirks (Lexical, Draft.js, ProseMirror), shadow-DOM shortcuts, `beforeunload` dialog neutralization, Trusted Types CSP on LinkedIn, keyboard shortcut dispatch, and per-site selector tables — **activate the `browser-automation` skill**. That skill has the full verified guidance and is refreshed against real production sites.

+134
-155
@@ -1,12 +1,19 @@
"""Browser automation best-practices prompt.

This module provides ``GCU_BROWSER_SYSTEM_PROMPT`` -- a canonical set of
This module provides ``GCU_BROWSER_SYSTEM_PROMPT`` — a canonical set of
browser automation guidelines that can be included in any node's system
prompt that uses browser tools from the gcu-tools MCP server.

Browser tools are registered via the global MCP registry (gcu-tools).
Nodes that need browser access declare ``tools: {policy: "all"}`` in their
agent.json config.

Note: the canonical source of truth for browser automation guidance is
the ``browser-automation`` default skill at
``core/framework/skills/_default_skills/browser-automation/SKILL.md``.
Activate that skill for the full decision tree. This module holds a
compact subset suitable for direct inlining into a node's system prompt
when a skill activation is not desired.
"""

GCU_BROWSER_SYSTEM_PROMPT = """\
@@ -14,172 +21,144 @@ GCU_BROWSER_SYSTEM_PROMPT = """\

Follow these rules for reliable, efficient browser interaction.

## Reading Pages
- ALWAYS prefer `browser_snapshot` over `browser_get_text("body")`
  — it returns a compact ~1-5 KB accessibility tree vs 100+ KB of raw HTML.
- Interaction tools (`browser_click`, `browser_type`, `browser_fill`,
  `browser_scroll`, etc.) return a page snapshot automatically in their
  result. Use it to decide your next action — do NOT call
  `browser_snapshot` separately after every action.
  Only call `browser_snapshot` when you need a fresh view without
  performing an action, or after setting `auto_snapshot=false`.
- Do NOT use `browser_screenshot` to read text — use
  `browser_snapshot` for that (compact, searchable, fast).
- DO use `browser_screenshot` when you need visual context:
  charts, images, canvas elements, layout verification, or when
  the snapshot doesn't capture what you need.
- Only fall back to `browser_get_text` for extracting specific
  small elements by CSS selector.
## Pick the right reading tool

## Navigation & Waiting
- `browser_navigate` and `browser_open` already wait for the page to
  load (`domcontentloaded`). Do NOT call `browser_wait` with no
  arguments after navigation — it wastes time.
  Only use `browser_wait` when you need a *specific element* or *text*
  to appear (pass `selector` or `text`).
- NEVER re-navigate to the same URL after scrolling
  — this resets your scroll position and loses loaded content.
- **`browser_snapshot`** — compact accessibility tree. Fast, cheap, good
  for static / text-heavy pages where the DOM matches what's visually
  rendered (docs, forms, search results, settings pages).
- **`browser_screenshot`** — visual capture + scale metadata. Use on any
  complex SPA (LinkedIn, X / Twitter, Reddit, Gmail, Notion, Slack,
  Discord) and on any site using shadow DOM or virtual scrolling. On
  those pages, snapshot refs go stale in seconds, shadow contents
  aren't in the AX tree, and virtual-scrolled elements disappear from
  the tree entirely — screenshots are the only reliable way to orient.

Neither tool is "preferred" universally — they're for different jobs.
Default to snapshot on static pages, screenshot on SPAs and
shadow-heavy sites. Interaction tools (click/type/fill/scroll) return
a snapshot automatically, so don't call `browser_snapshot` separately
after an interaction unless you need a fresh view.

Only fall back to `browser_get_text` for extracting small elements by
CSS selector.

## Coordinates: always CSS pixels

Chrome DevTools Protocol `Input.dispatchMouseEvent` takes **CSS
pixels**, not physical pixels. This is critical and easy to get wrong:

| Tool | Unit |
|---|---|
| `browser_click_coordinate(x, y)` | **CSS pixels** |
| `browser_hover_coordinate(x, y)` | **CSS pixels** |
| `browser_press_at(x, y, key)` | **CSS pixels** |
| `getBoundingClientRect()` | already CSS pixels — pass straight through |
| `browser_coords(img_x, img_y)` | returns `css_x/y` (use this) and `physical_x/y` (debug only) |

**Always use `css_x/y`** from `browser_coords`. Feeding `physical_x/y`
on a HiDPI display overshoots by `DPR×` — clicks land DPR times too
far right and down. On a DPR=1.6 display that's 60% off.

Never multiply `getBoundingClientRect()` by `devicePixelRatio` — it's
already in the right unit.

## Rich-text editors (X, LinkedIn DMs, Gmail, Reddit, Slack, Discord)

Click the input area first with `browser_click_coordinate` or
`browser_click(selector)` BEFORE typing. React / Draft.js / Lexical /
ProseMirror only register input as "real" after a native
pointer-sourced focus event; JS `.focus()` is not enough. Without a real
click first, the editor stays empty and the send button stays disabled.

`browser_type` now does this automatically — it clicks the element,
then inserts text via CDP `Input.insertText` (IME-commit style), which
rich editors accept cleanly. Before clicking send, verify the submit
button's `disabled` / `aria-disabled` state via `browser_evaluate`.

## Shadow DOM

Sites like LinkedIn messaging (`#interop-outlet`), Reddit (faceplate
Web Components), and some X elements live inside shadow roots.
`document.querySelector` and `wait_for_selector` do **not** see into
shadow roots. But `browser_click_coordinate` **does** — CDP hit
testing walks shadow roots natively, so coordinate-based operations
reach shadow elements transparently.

**Shadow-heavy site workflow:**
1. `browser_screenshot()` → visual image
2. Identify target visually → image coordinate
3. `browser_coords(x, y)` → CSS px
4. `browser_click_coordinate(css_x, css_y)` → lands via native hit
   test; inputs get focused regardless of shadow depth
5. Type via `browser_type` or, if the selector path can't reach the
   element, dispatch keys to the focused element

For selector-style access when you know the shadow path:
`browser_shadow_query("#interop-outlet >>> #msg-overlay >>> p")` —
returns a CSS-px rect you can feed directly to click tools.

## Navigation & waiting

- `browser_navigate(wait_until="load")` returns when the page fires
  load. On SPAs (LinkedIn especially — 4–5 seconds), add a 2–3 s sleep
  after to let React/Vue hydrate before querying for chrome elements.
- Never re-navigate to the same URL after scrolling — resets scroll.
- Use `timeout_ms=20000` for heavy SPAs.
- `wait_for_selector` / `wait_for_text` resolve in milliseconds when
  the element is already in the DOM — no need to sleep if you can
  express the wait condition.

## Keyboard shortcuts

`browser_press("a", modifiers=["ctrl"])` for Ctrl+A. Accepted
modifiers: `"alt"`, `"ctrl"`/`"control"`, `"meta"`/`"cmd"`,
`"shift"`. The tool dispatches the modifier key first, then the main
key with `code` and `windowsVirtualKeyCode` populated (Chrome's
shortcut dispatcher requires both), then releases in reverse order.

## Scrolling
- Use large scroll amounts ~2000 when loading more content
  — sites like twitter and linkedin have lazy loading for paging.
- The scroll result includes a snapshot automatically — no need to call
  `browser_snapshot` separately.

## Batching Actions
- You can call multiple tools in a single turn — they execute in parallel.
  ALWAYS batch independent actions together. Examples:
  - Fill multiple form fields in one turn.
  - Navigate + snapshot in one turn.
  - Click + scroll if targeting different elements.
- When batching, set `auto_snapshot=false` on all but the last action
  to avoid redundant snapshots.
- Aim for 3-5 tool calls per turn minimum. One tool call per turn is
  wasteful.
- Use large amounts (~2000 px) for lazy-loaded sites (X, LinkedIn).
- Scroll result includes a snapshot — don't call `browser_snapshot`
  separately.

## Error Recovery
- If a tool fails, retry once with the same approach.
- If it fails a second time, STOP retrying and switch approach.
- If `browser_snapshot` fails → try `browser_get_text` with a
  specific small selector as fallback.
- If `browser_open` fails or page seems stale → `browser_stop`,
  then `browser_start`, then retry.
## Batching

## Tab Management
- Multiple tool calls per turn execute in parallel. Batch independent
  actions together: fill multiple fields, navigate + snapshot,
  different-target click + scroll.
- Set `auto_snapshot=false` on all but the last when batching.
- Aim for 3–5 tool calls per turn minimum.

**Close tabs as soon as you are done with them** — not only at the end of the task.
After reading or extracting data from a tab, close it immediately.
## Tab management

**Decision rules:**
- Finished reading/extracting from a tab? → `browser_close(target_id=...)`
- Completed a multi-tab workflow? → `browser_close_finished()` to clean up all your tabs
- More than 3 tabs open? → stop and close finished ones before opening more
- Popup appeared that you didn't need? → close it immediately
Close tabs as soon as you're done with them — not only at the end of
the task. `browser_close(target_id=...)` for one, `browser_close_finished()`
for a full cleanup. Never accumulate more than 3 open tabs.
`browser_tabs` reports an `origin` field: `"agent"` (you own it, close
when done), `"popup"` (close after extracting), `"startup"`/`"user"`
(leave alone).

**Origin awareness:** `browser_tabs` returns an `origin` field for each tab:
- `"agent"` — you opened it; you own it; close it when done
- `"popup"` — opened by a link or script; close after extracting what you need
- `"startup"` or `"user"` — leave these alone unless the task requires it
## Login & auth walls

**Cleanup tools:**
- `browser_close(target_id=...)` — close one specific tab
- `browser_close_finished()` — close all your agent/popup tabs (safe: leaves startup/user tabs)
- `browser_close_all()` — close everything except the active tab (use only for full reset)
Report the auth wall and stop — do NOT attempt to log in. Dismiss
cookie consent banners if they block content.

**Multi-tab workflow pattern:**
1. Open background tabs with `browser_open(url=..., background=true)` to stay on current tab
2. Process each tab and close it with `browser_close` when done
3. When the full workflow completes, call `browser_close_finished()` to confirm cleanup
4. Check `browser_tabs` at any point — it shows `origin` and `age_seconds` per tab
## Error recovery

Never accumulate tabs. Treat every tab you open as a resource you must free.
- Retry once on failure, then switch approach.
- If `browser_snapshot` fails, try `browser_get_text` with a narrow
  selector as fallback.
- If `browser_open` fails or the page seems stale, `browser_stop` →
  `browser_start` → retry.

## Shadow DOM & Overlays
## `browser_evaluate`

Some sites (LinkedIn messaging, etc.) render content inside closed shadow roots that are
invisible to regular DOM queries and `browser_snapshot` coordinates.

**Detecting shadow DOM**: `document.elementFromPoint(x, y)` returns a zero-height host element
(e.g. `#interop-outlet`) for the entire overlay area — this is normal, not a bug.
`document.body.innerText` and `document.querySelectorAll` return nothing for shadow content.
`browser_snapshot` CAN read shadow DOM text but cannot return coordinates.

**Querying into shadow DOM:**
```
browser_shadow_query("#interop-outlet >>> #msg-overlay >>> p")
```
Uses `>>>` to pierce shadow roots. Returns `rect` in CSS pixels and `physicalRect` ready for
`browser_click_coordinate` / `browser_hover_coordinate`.

**Getting physical rect for any element (including shadow DOM):**
```
browser_get_rect(selector="#interop-outlet >>> .msg-convo-wrapper", pierce_shadow=true)
```

**Manual JS traversal when selector is dynamic:**
```js
const shadow = document.getElementById('interop-outlet').shadowRoot;
const convo = shadow.querySelector('#ember37');
const rect = convo.querySelector('p').getBoundingClientRect();
// rect is in CSS pixels — multiply by DPR for physical pixels
```
Pass this as a multi-statement script to `browser_evaluate`; it wraps automatically in an IIFE.
Use `JSON.stringify(rect)` to serialize the result.

## Coordinate System

There are THREE coordinate spaces. Using the wrong one causes clicks/hovers to land in the
wrong place.

| Space | Used by | How to get |
|---|---|---|
| Physical pixels | `browser_click_coordinate` | `browser_coords` `physical_x/y` |
| CSS pixels | `getBoundingClientRect()`, `elementFromPoint` | `browser_coords` `css_x/y` |
| Screenshot pixels | What you see in the 800px image | Raw position in screenshot |

**Converting screenshot → physical**: `browser_coords(x, y)` → use `physical_x/y`.
**Converting CSS → physical**: multiply by `window.devicePixelRatio` (typically 1.6 on HiDPI).
**Never** pass raw `getBoundingClientRect()` values to `browser_hover_coordinate` without
multiplying by DPR first.

## Screenshots

Screenshot data is base64-encoded PNG. To view it:
```
run_command("echo '<base64_data>' | base64 -d > /tmp/screenshot.png")
```
Then use `read_file("/tmp/screenshot.png")` to view the image.

Always use `full_page=false` (default) unless you specifically need the full scrolled page.

## JavaScript Evaluation

`browser_evaluate` wraps your script in an IIFE automatically:
- Single expression (`document.title`) → wrapped with `return`
- Multi-statement or contains `;`/`\n` → wrapped without return (add explicit `return` yourself)
- Already an IIFE → run as-is

**Avoid**: complex closures with `return` inside `for` loops — Chrome CDP returns `null`.
**Use instead**: `Array.from(...).map(...).join(...)` chains, or build result objects and
`JSON.stringify()` them.

**For shadow DOM traversal with dynamic selectors**, write the full JS path:
```js
const s = document.getElementById('interop-outlet').shadowRoot;
const el = s.querySelector('.msg-convo-wrapper');
return JSON.stringify(el.getBoundingClientRect());
```

## Login & Auth Walls
- If you see a "Log in" or "Sign up" prompt instead of expected
  content, report the auth wall immediately — do NOT attempt to log in.
- Check for cookie consent banners and dismiss them if they block content.

## Efficiency
- Minimize tool calls — combine actions where possible.
- When a snapshot result is saved to a spillover file, use
  `run_command` with grep to extract specific data rather than
  re-reading the full file.
- Call `set_output` in the same turn as your last browser action
  when possible — don't waste a turn.
Use for reading state inside a shadow root that standard tools don't
handle, for one-shot site-specific actions, or to measure layout the
tools don't expose. Do NOT use it on a strict-CSP site (LinkedIn,
some X surfaces) with `innerHTML` — Trusted Types silently drops the
assignment. Always use `createElement` + `appendChild` + `setAttribute`
for DOM injection on those sites. `style.cssText`, `textContent`, and
`.value` assignments are fine.
"""
