Compare commits

2 Commits

Author SHA1 Message Date
Timothy 3a219e27ab fix: two dimensions 2026-04-16 17:08:40 -07:00
Timothy 7f62e7a2d0 fix: split image click vs coordinate click 2026-04-16 16:37:52 -07:00
11 changed files with 366 additions and 257 deletions
+2 -2
@@ -64,7 +64,7 @@ snapshot = await browser_snapshot(tab_id)
|---------|--------------|-------|
| Scroll doesn't move | Nested scroll container | Look for `overflow: scroll` divs |
| Click no effect | Element covered | Check `getBoundingClientRect` vs viewport |
| Type clears | Autocomplete/React | Check for event listeners on input; try `browser_type_focused` |
| Type clears | Autocomplete/React | Check for event listeners on input |
| Snapshot hangs | Huge DOM | Check node count in snapshot |
| Snapshot stale | SPA hydration | Wait after navigation |
@@ -229,7 +229,7 @@ function queryShadow(selector) {
|-------|-------------|----------|
| Scroll not working | Find scrollable container | Mouse wheel at container center |
| Click no effect | JavaScript click() | CDP mouse events |
| Type clears | Add delay_ms | Use `browser_type_focused` (Input.insertText) |
| Type clears | Add delay_ms | Use execCommand |
| Snapshot hangs | Add timeout_s | DOM snapshot fallback |
| Stale content | Wait for selector | Increase wait_until timeout |
| Shadow DOM | Pierce selector | JavaScript traversal |
@@ -18,7 +18,7 @@ Use browser nodes (with `tools: {policy: "all"}`) when:
All tools are prefixed with `browser_`:
- `browser_start`, `browser_open`, `browser_navigate` — launch/navigate
- `browser_click`, `browser_click_coordinate`, `browser_fill`, `browser_type`, `browser_type_focused` — interact
- `browser_click`, `browser_click_coordinate`, `browser_fill`, `browser_type` — interact
- `browser_press` (with optional `modifiers=["ctrl"]` etc.) — keyboard shortcuts
- `browser_snapshot` — compact accessibility-tree read (structured)
<!-- vision-only -->
@@ -50,8 +50,7 @@ Chrome DevTools Protocol `Input.dispatchMouseEvent` takes **CSS pixels**, not ph
2. For static pages (docs, forms, search results), browser_snapshot is fine.
3. Before typing into a rich-text editor (X compose, LinkedIn DM, Gmail, Reddit),
click the input area first with browser_click_coordinate so React / Draft.js /
Lexical register a native focus event, then use browser_type_focused(text=...)
for shadow-DOM inputs or browser_type(selector, text) for light-DOM inputs.
Lexical register a native focus event. Otherwise the send button stays disabled.
4. Use browser_wait(seconds=2-3) after navigation for SPA hydration.
5. If you hit an auth wall, call set_output with an error and move on.
6. Keep tool calls per turn <= 10 for reliability.
+4 -6
@@ -70,12 +70,10 @@ ProseMirror only register input as "real" after a native pointer-
sourced focus event; JS `.focus()` is not enough. Without a real click
first, the editor stays empty and the send button stays disabled.
`browser_type` does this automatically when you have a selector: it
clicks the element, then inserts text via CDP `Input.insertText`.
For shadow-DOM inputs where selectors can't reach, use
`browser_click_coordinate` to focus, then `browser_type_focused(text=...)`
to type into the active element. Before clicking send, verify the
submit button's `disabled` / `aria-disabled` state via `browser_evaluate`.
`browser_type` now does this automatically: it clicks the element,
then inserts text via CDP `Input.insertText` (IME-commit style), which
rich editors accept cleanly. Before clicking send, verify the submit
button's `disabled` / `aria-disabled` state via `browser_evaluate`.
## Shadow DOM
@@ -12,23 +12,49 @@ metadata:
All GCU browser tools drive a real Chrome instance through the Beeline extension and Chrome DevTools Protocol (CDP). That means clicks, keystrokes, and screenshots are processed by the actual browser's native hit testing, focus, and layout engines — **not** a synthetic event layer. Understanding this unlocks strategies that make hard sites easy.
## Coordinates: always CSS pixels
## Coordinates: image-px vs CSS-px — pick the right tool
**Chrome DevTools Protocol `Input.dispatchMouseEvent` operates in CSS pixels, not physical pixels.**
Screenshots are downscaled (800 px wide by default) while the real viewport is typically 1500-1900 CSS px wide on a modern display. So the pixel you read off a screenshot image is **not** the CSS coordinate you pass to CDP: feeding an 800-scale number to a 1717-scale API lands your click at roughly 47% of the intended x and y (800/1717 ≈ 0.47), well up and to the left of where you meant.
When you call `browser_coords(image_x, image_y)` after a screenshot, the returned dict has both `css_x/y` and `physical_x/y`. **Always use `css_x/y` for clicks, hovers, and key presses.**
**The fix is a separate verb for each coord space. You should almost never need to do the math yourself.**
```
browser_screenshot() → image (downscaled to 800/900 px wide)
browser_coords(img_x, img_y) → {css_x, css_y, physical_x, physical_y}
browser_click_coordinate(css_x, css_y) ← USE css_x/y
browser_hover_coordinate(css_x, css_y) ← USE css_x/y
browser_press_at(css_x, css_y, key) ← USE css_x/y
browser_screenshot() → image (downscaled to ~800 px wide)
browser_click_image(img_x, img_y) ← PREFERRED after a screenshot
Reads image pixels straight from
the PNG; the tool auto-converts
to CSS using the cached scale.
Response includes converted_css_x/y
and the cssScale used.
browser_click_coordinate(css_x, css_y) ← CSS pixels only. Use when you
already have CSS coords from
getBoundingClientRect / browser_get_rect.
browser_hover_coordinate(css_x, css_y) ← CSS pixels
browser_press_at(css_x, css_y, key) ← CSS pixels
```
Feeding `physical_x/y` on a HiDPI display overshoots by DPR× — on a DPR=1.6 laptop, clicks land 60% too far right and down. The ratio between `physicalScale` and `cssScale` tells you the effective DPR.
`browser_coords(img_x, img_y)` is still available if you want to *see* the conversion (it returns `{css_x, css_y, physical_x, physical_y}`) — but for ordinary screenshot-then-click work, `browser_click_image` does the whole pipeline in one call and logs the conversion in its response.
`getBoundingClientRect()` already returns CSS pixels — feed those values straight through to click/hover tools without any DPR multiplication.
Never feed `physical_x/y` to any click tool. On a DPR=1.6 display, physical coords overshoot by 60%. The ratio between `physicalScale` and `cssScale` tells you the effective DPR.
`getBoundingClientRect()` already returns CSS pixels — feed those straight into `browser_click_coordinate` (not `browser_click_image`) without any scaling.
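To make the arithmetic behind the image-px vs CSS-px distinction concrete, here is a minimal sketch of the conversion. The function names are hypothetical, and it assumes the screenshot is a uniformly downscaled copy of the viewport; it is not the actual implementation of `browser_click_image` or `browser_coords`.

```python
def image_to_css(img_x, img_y, image_width, viewport_css_width):
    """Scale a screenshot pixel to CSS pixels.

    Assumption: the screenshot is a uniformly downscaled copy of the
    viewport, so a single ratio applies to both axes.
    """
    if image_width <= 0 or viewport_css_width <= 0:
        raise ValueError("widths must be positive")
    scale = viewport_css_width / image_width
    return img_x * scale, img_y * scale


def css_to_physical(css_x, css_y, dpr):
    """Physical pixels are CSS pixels times the device pixel ratio.

    Shown only to illustrate the overshoot; never click with these values.
    """
    return css_x * dpr, css_y * dpr
```

With an 800 px wide screenshot of a 1717 px wide viewport, `image_to_css(490, 680, 800, 1717)` puts the x coordinate at about 1052 CSS px, which is why passing the raw 490 to a CSS-pixel tool lands far to the left.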
### The naming convention
Every coord-returning tool (`browser_coords`, `browser_get_rect`, `browser_shadow_query`) returns parallel blocks — one per coord space. Match the block name to the click tool suffix:
```
rect = browser_get_rect(selector)
# rect.image → browser_click_image ← preferred after a screenshot (scale cached)
# rect.css → browser_click_coordinate (hover_coordinate / press_at)
# rect.physical → DO NOT click — debug only
browser_click_image(rect.image.cx, rect.image.cy)
browser_click_coordinate(rect.css.cx, rect.css.cy)
```
Same shape from `browser_shadow_query` (`sq.image`, `sq.css`, `sq.physical`) and `browser_coords` (`image_x/image_y`, `css_x/css_y`, `physical_x/physical_y`). If the block prefix and the tool suffix don't match, you're about to click the wrong place.
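One way to enforce the block-to-tool pairing in a script is a small lookup that refuses the `physical` block outright. This is only a sketch: the tool names come from the text above, while the helper `click_tool_for` and the mapping are hypothetical.

```python
# Hypothetical guard: map a coordinate block name to the click tool whose
# suffix matches it, per the naming convention described above.
CLICK_TOOL_FOR_BLOCK = {
    "image": "browser_click_image",
    "css": "browser_click_coordinate",
}


def click_tool_for(block):
    """Return the click tool name for a coordinate block, or raise."""
    if block == "physical":
        raise ValueError("physical coordinates are debug-only; do not click with them")
    try:
        return CLICK_TOOL_FOR_BLOCK[block]
    except KeyError:
        raise ValueError(f"unknown coordinate block: {block!r}") from None
```

Routing every click through a check like this turns a silent mis-click into an immediate error.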
**Exception for zoomed elements:** pages that use `zoom` or `transform: scale()` on a container (LinkedIn's `#interop-outlet`, some embedded iframes) render in a scaled local coordinate space. `getBoundingClientRect` there may not match CDP's hit space. Use `browser_shadow_query` which handles the math, or fall back to visually picking coordinates from a screenshot.
@@ -42,25 +68,37 @@ Why:
- **Keyboard dispatch follows focus** into shadow roots. After a click focuses an input (even one three shadow levels deep), `browser_press(...)` with no selector dispatches keys to `document.activeElement`'s computed focus target.
- **Screenshots render the real layout** regardless of DOM implementation.
Whereas `wait_for_selector`, `browser_click(selector=...)`, `browser_type(selector=...)` all use `document.querySelector` under the hood, which **stops at shadow boundaries**. They cannot see elements inside shadow roots. For shadow-DOM inputs, use `browser_type_focused` after focusing via click-coordinate.
Whereas `wait_for_selector`, `browser_click(selector=...)`, `browser_type(selector=...)` all use `document.querySelector` under the hood, which **stops at shadow boundaries**. They cannot see elements inside shadow roots.
### Recommended workflow on shadow-heavy sites
1. `browser_screenshot()` → visual image
1. `browser_screenshot()` → visual image (also caches the image→CSS scale for this tab)
2. Identify the target visually → image pixel `(x, y)` (eyeball from the screenshot)
3. `browser_coords(x, y)` → convert to CSS px
4. `browser_click_coordinate(css_x, css_y)` → lands on the element via native hit testing; inputs get focused. **The response now includes `focused_element: {tag, id, role, contenteditable, rect, ...}`** — use it to verify you actually focused what you intended.
5. `browser_type_focused(text="...")` → dispatches CDP `Input.insertText` to `document.activeElement`. Shadow roots, iframes, Lexical, Draft.js, ProseMirror all just work. Use `browser_type(selector, text)` instead when you have a reliable CSS selector for a light-DOM element.
6. Verify via `browser_screenshot` OR `browser_get_attribute` on a known-reachable marker (e.g. check that the Send button's `aria-disabled` flipped to `false`).
3. `browser_click_image(x, y)` → auto-converts image px → CSS px using the cached scale, then clicks. **The response includes `focused_element: {tag, id, role, contenteditable, rect, ...}`**; use it to verify you actually focused what you intended, plus `converted_css_x/y` and `cssScale` so you can see what the conversion did.
4. `browser_type(text="...")` with **NO selector** → dispatches CDP `Input.insertText` to `document.activeElement`. Shadow roots, iframes, Lexical, Draft.js, ProseMirror all just work. Only pass a selector if you want a DIFFERENT element than the one you just focused (rare).
5. Verify via `browser_screenshot` OR `browser_get_attribute` on a known-reachable marker (e.g. check that the Send button's `aria-disabled` flipped to `false`).
Do **not** pass image pixels to `browser_click_coordinate`. It expects CSS pixels and does no conversion — a common failure mode is eyeballing `(490, 680)` off an 800-wide screenshot, passing it to `browser_click_coordinate`, and landing in the sidebar at CSS-x=490 of a 1717-wide viewport.
### The click→type loop (canonical pattern)
1. Call `browser_click_coordinate(x, y)` to click the target element.
2. Check the `focused_element` field in the response — it tells you what actually received focus (tag, id, role, contenteditable, rect).
3. If the focused element is editable, call `browser_type_focused(text="...")` to insert text, then use tools to verify the text took effect.
4. If it is NOT editable, your click landed on the wrong thing — refine coordinates and retry. Do NOT reach for `browser_evaluate` + `execCommand('insertText')` or shadow-root traversals. The problem is the click target, not the typing method.
```
resp = browser_click_image(x, y)  # x, y are raw image pixels
fe = resp.get("focused_element")
if fe and (fe.get("contenteditable") or fe["tag"] in ("textarea", "input")):
    browser_type(text="...")  # no selector: insertText goes to activeElement
else:
    # You clicked something that isn't editable: refine the pixel and retry.
    # Check resp["converted_css_x"] / ["converted_css_y"] in the response to
    # see where the click actually landed; if it's clearly off, your image
    # pixel was wrong, not the conversion.
    # Do NOT reach for browser_evaluate + execCommand('insertText', ...)
    # or a walk(root) shadow traversal. The problem is your click, not
    # the typing method.
    ...
```
`browser_click` (selector-based) also returns `focused_element`, so the same check works whether you clicked by selector or coordinate.
`browser_click` (selector-based) also returns `focused_element`, so the same check works whether you clicked by selector, image pixel, or CSS coordinate.
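The editability check in the loop above can be factored into a small predicate. The sketch below is an assumption-laden version: the exact shape of `focused_element` is not specified here, so it tolerates upper-case tag names (the Reddit focus trail elsewhere in this document shows `INPUT`) and a `contenteditable` value that may arrive as a boolean or a string. The helper name `is_editable` is hypothetical.

```python
def is_editable(focused_element):
    """Best-effort check that a click focused something you can type into.

    Assumptions (not confirmed by the tool docs): `tag` may be upper- or
    lower-case, and `contenteditable` may be a bool or a string.
    """
    if not focused_element:
        return False
    tag = str(focused_element.get("tag", "")).lower()
    if tag in ("input", "textarea"):
        return True
    ce = focused_element.get("contenteditable")
    if isinstance(ce, str):
        return ce.strip().lower() in ("true", "plaintext-only")
    return bool(ce)
```

If this returns `False`, the guidance above still applies: refine the click target and retry, and leave the typing method alone.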
### Empirically verified (2026-04-11)
@@ -71,6 +109,13 @@ document > reddit-search-large [shadow]
> input[name="q"]
```
- `document.querySelector('input')` → **0 visible inputs** on the page (all in shadow)
- `browser_type('faceplate-search-input input', 'python')` → "Element not found"
- `browser_click_coordinate(617, 28)` → focus trail: `REDDIT-SEARCH-LARGE > FACEPLATE-SEARCH-INPUT > INPUT`
- Char-by-char key dispatch after the click → `input.value === 'python'`
Coordinate pipeline: works perfectly. Selector pipeline: unusable without shadow-piercing syntax.
### Shadow-piercing selectors
When you DO want a selector-based approach and know the shadow structure, `browser_shadow_query` and `browser_get_rect` support `>>>` shadow-piercing syntax:
@@ -88,8 +133,8 @@ Returns the element's rect in **CSS pixels** (feed directly to click tools). Rem
```
browser_navigate(url, wait_until="load") # "load" | "domcontentloaded" | "networkidle"
browser_wait_for_selector("h1", timeout_ms=2000)
browser_wait_for_text("Some text", timeout_ms=2000)
browser_wait_for_selector("h1", timeout_ms=5000)
browser_wait_for_text("Some text", timeout_ms=5000)
browser_go_back()
browser_go_forward()
browser_reload()
@@ -107,7 +152,7 @@ All return real URLs and titles. On a fast page `navigate(wait_until="load")` re
| x.com/twitter | 1.2-1.6 s |
| linkedin.com (logged in) | 4-5 s |
For LinkedIn and other heavy SPAs, rely on `sleep()` after navigation to let the page hydrate.
Use `timeout_ms=20000` for LinkedIn and other heavy SPAs to give them margin.
### After navigate, always let SPA hydrate
@@ -116,7 +161,7 @@ Even after `wait_until="load"`, React/Vue SPAs often render their real chrome in
### Reading pages efficiently
- **Prefer `browser_snapshot` over `browser_get_text("body")`** — returns a compact ~15 KB accessibility tree vs 100+ KB of raw HTML.
- Interaction tools (`browser_click`, `browser_type`, `browser_type_focused`, `browser_fill`, `browser_scroll`, etc.) return a page snapshot automatically in their result. Use it to decide your next action — do NOT call `browser_snapshot` separately after every action. Only call `browser_snapshot` when you need a fresh view without performing an action, or after setting `auto_snapshot=false`.
- Interaction tools (`browser_click`, `browser_type`, `browser_fill`, `browser_scroll`, etc.) return a page snapshot automatically in their result. Use it to decide your next action — do NOT call `browser_snapshot` separately after every action. Only call `browser_snapshot` when you need a fresh view without performing an action, or after setting `auto_snapshot=false`.
- Complex pages (LinkedIn, Twitter/X, SPAs with virtual scrolling) have DOMs that don't match what's visually rendered — snapshot refs may be stale, missing, or misaligned with visible layout. On these pages, `browser_screenshot` is the only reliable way to orient yourself.
- Only fall back to `browser_get_text` for extracting specific small elements by CSS selector.
@@ -136,13 +181,44 @@ The symptom is always the same: **you type, the characters appear visually, and
### Safe "click-then-type-then-verify" pattern
1. **Focus** the real element via a real click (not JS `.focus()`). Use `browser_get_rect(selector)` (or `browser_shadow_query` for shadow sites) to get coordinates, then `browser_click_coordinate(cx, cy)`. Wait ~0.5 s for the editor to open and focus to settle.
```
# 1. Focus the real element via a real click (not JS .focus()).
rect = browser_get_rect(selector) # or browser_shadow_query for shadow sites
browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
sleep(0.5) # let the editor open / focus settle
2. **Type** the text. Use `browser_type(selector, text)` for light-DOM inputs, or `browser_type_focused(text=...)` for shadow-DOM / already-focused inputs. Both use CDP `Input.insertText` by default, which is the most reliable method for rich editors (Lexical, Draft.js, ProseMirror). Wait ~500 ms for framework state to commit.
# 2. Type. browser_type now uses CDP Input.insertText by default, which is
# the most reliable way to insert text into rich editors (Lexical,
# Draft.js, ProseMirror, any React-controlled contenteditable).
browser_type(selector, text)
sleep(1.0) # let framework state commit
3. **Verify** the submit button is enabled before clicking it. Use `browser_evaluate` to check the button's `disabled` or `aria-disabled` attribute. Do NOT trust that typing worked — always check state.
# 3. BEFORE clicking send, verify the submit button is actually enabled.
# Don't trust that typing worked — check state.
state = browser_evaluate("""
(function(){
const btn = document.querySelector('[data-testid="tweetButton"]');
if (!btn) return {exists: false};
return {
exists: true,
disabled: btn.disabled || btn.getAttribute('aria-disabled') === 'true',
text: btn.textContent.trim(),
};
})()
""")
4. **Only click send if the button is enabled.** If the button is still disabled, try the recovery dance: click the textarea again, press `End`, press a space, press `Backspace` — this forces React to recompute `hasRealContent`. Then re-check the button state.
# 4. Only click send if the button exists and is enabled.
if state['exists'] and not state['disabled']:
    browser_click(submit_selector)
else:
    # Recovery: sometimes a click-again + one extra keystroke nudges
    # React into recomputing hasRealContent.
    browser_click_coordinate(rect.css.cx, rect.css.cy)  # matched rect.css pair
    browser_press("End")
    browser_press(" ")
    browser_press("Backspace")
    # re-check state
```
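The verify-then-recover step can be wrapped in a bounded loop so a stuck editor does not retry forever. This is a sketch under stated assumptions: `check_state`, `click_send`, and `recover` are hypothetical callables you would bind to the `browser_evaluate` check, the send click, and the recovery dance, and the default of two retries mirrors the "bail after 2 retries" guidance in the LinkedIn flow.

```python
def send_when_enabled(check_state, click_send, recover, max_retries=2):
    """Click send only once the button reports enabled; retry a bounded number of times.

    check_state() -> dict with "exists" and "disabled" keys (assumed shape).
    Returns True if send was clicked, False if the button never enabled.
    """
    for attempt in range(max_retries + 1):
        state = check_state()
        if state.get("exists") and not state.get("disabled"):
            click_send()
            return True
        if attempt < max_retries:
            recover()  # e.g. re-click the editor, End, space, Backspace
    return False
```

A `False` return is the signal to stop, screenshot, and surface the problem instead of trying a different typing tool.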
### Why `browser_type` uses `Input.insertText` by default
@@ -178,7 +254,7 @@ Always include an equivalent cleanup block in any script that types into a compo
| Site | Editor | Workaround |
|---|---|---|
| **X / Twitter** compose | Draft.js | Click `[data-testid='tweetTextarea_0']` first, then type with `delay_ms=20`. First 1-2 chars may be eaten — accept truncation or prepend a throwaway char. Verify `[data-testid='tweetButton']` has `disabled: false` before clicking. |
| **LinkedIn** messaging | contenteditable (inside `#interop-outlet` shadow root) | Use `browser_shadow_query` to find the rect, click-coordinate to focus, then `browser_type_focused(text=...)` (selector-based `browser_type` can't reach shadow). Send button is `.msg-form__send-button`. |
| **LinkedIn** messaging | contenteditable (inside `#interop-outlet` shadow root) | Use `browser_shadow_query` to find the rect, click-coordinate to focus, then type via focus-based key dispatch (selector-based type can't reach shadow). Send button is `.msg-form__send-button`. |
| **LinkedIn** feed post composer | Quill/LinkedIn custom | Click the "Start a post" trigger first, wait 1s for modal, click the textarea, type. |
| **Reddit** comment/post box | ProseMirror | Click the textarea, wait 0.5s for the toolbar to mount, then type. Submit is `button[slot="submit-button"]` inside a shreddit-composer. |
| **Gmail** compose | Lexical | Click the body first. Gmail has a visible `div[contenteditable=true][aria-label*='Message Body']` after opening a compose window. |
@@ -198,7 +274,7 @@ browser_type(selector, text)
- Fires real `keydown` / `keypress` / `input` / `keyup` events — frameworks that branch on `event.key` or `event.code` see the right values
- Matches what Playwright and Puppeteer send
Works on real `<input>`, `<textarea>`, and `contenteditable` elements. For shadow-DOM inputs, see the "shadow-heavy sites" section above — `browser_type(selector=)` can't see past shadow boundaries; use `browser_type_focused` after click-coordinate focus.
Works on real `<input>`, `<textarea>`, and `contenteditable` elements. For shadow-DOM inputs, see the "shadow-heavy sites" section above: `browser_type(selector=)` can't see past shadow boundaries.
### Keyboard shortcuts (Ctrl+A, Shift+Tab, Cmd+Enter)
@@ -293,8 +369,8 @@ LinkedIn enforces **strict Trusted Types CSP**. Any script you inject via `brows
Reddit's search input lives **two shadow levels deep** inside `reddit-search-large > faceplate-search-input`. You cannot reach it with `browser_type(selector=)`. The working pattern:
1. `browser_shadow_query("reddit-search-large >>> #search-input")` → rect
2. `browser_click_coordinate(rect.cx, rect.cy)` → click lands on the real shadow input via native hit testing; input becomes focused
3. `browser_type_focused(text="query")` → dispatches to focused element via `Input.insertText`
2. `browser_click_coordinate(rect.css.cx, rect.css.cy)` (matched `rect.css` pair) → click lands on the real shadow input via native hit testing; input becomes focused
3. `browser_press(c)` for each character → dispatches to focused element
4. Verify by reading `.value` via `browser_evaluate` walking the shadow path
### X / Twitter
@@ -363,11 +439,12 @@ Then pass the most specific selector that uniquely identifies the right input (e
- **Typing into a rich-text editor without clicking first → send button stays disabled.** Draft.js (X), Lexical (Gmail, LinkedIn DMs), ProseMirror (Reddit), and React-controlled `contenteditable` elements only register input as "real" when the element received a native focus event — JS-sourced `.focus()` is not enough. `browser_type` now does this automatically via a real CDP pointer click before inserting text, but always verify the submit button's `disabled` state before clicking send. See the "ALWAYS click before typing" section above.
- **Using per-character `keyDown` on Lexical / Draft.js editors → keys dispatch but text never appears.** Those editors intercept `beforeinput` and route insertion through their own state machine; raw keyDown events are silently dropped. `browser_type` now uses `Input.insertText` by default (the CDP IME-commit method) which these editors accept cleanly. Only set `use_insert_text=False` when you explicitly need per-keystroke dispatch.
- **Leaving a composer with text then trying to navigate → `beforeunload` dialog hangs the bridge.** LinkedIn and several other sites pop a native "unsent message" confirm. `browser_navigate` and `close_tab` both time out against this. Always strip `window.onbeforeunload = null` via `browser_evaluate` before any navigation after typing in a composer, or wrap your logic in a `try/finally` that runs the cleanup block.
- **Clicking at physical pixels.** CDP uses CSS px. `browser_coords` returns both for debugging, but always feed `css_x/y` to click tools.
- **Passing image pixels to `browser_click_coordinate`.** The tool expects CSS pixels and does no conversion. If you eyeballed a target pixel off the screenshot PNG, use `browser_click_image(x, y)` instead — it auto-converts using the cached image→CSS scale. Symptom when you get this wrong: click lands in a sidebar / left rail because an 800-scale number was interpreted as a 1717-scale CSS coordinate. The response of a mis-aimed click will usually show a `focused_element` that isn't the target (e.g. `tag: "div", className: "msg-conversation-listitem__link"`) — branch on that and retry with the right tool.
- **Clicking at physical pixels.** CDP uses CSS px. `browser_coords` returns both for debugging, but always feed `css_x/y` to `browser_click_coordinate` — or pass the raw image pixel straight into `browser_click_image`.
- **Calling `wait_for_selector` on a shadow element.** It'll always time out. Use `browser_shadow_query` or the screenshot + coordinate strategy.
- **Relying on `innerHTML` in injected scripts on LinkedIn.** Silently discarded. Use `createElement` + `appendChild`.
- **Not waiting for SPA hydration.** `wait_until="load"` fires before React/Vue rendering on many sites. Add a 2-3 s sleep before querying for chrome elements.
- **Using `browser_type(selector)` on LinkedIn DMs or any shadow-DOM input.** Won't find the element. Use `browser_click_coordinate` to focus, then `browser_type_focused(text=...)` to type.
- **Using `browser_type(selector)` on LinkedIn DMs or any shadow-DOM input.** Won't find the element. Fall back to click-to-focus + `browser_press` per character.
- **Clicking a "Photo" / "Attach" / "Upload" button to pick a file.** This opens Chrome's NATIVE OS file picker, which is rendered outside the web page and cannot be interacted with via CDP. Your automation will hang staring at an unreachable dialog. ALWAYS use `browser_upload(selector, file_paths)` against the underlying `<input type='file'>` element — see the "File uploads" section above for the full pattern. This is the single most common way to wedge a browser session on compose-with-media flows (X/LinkedIn/Gmail).
- **Keyboard shortcuts without the `code` field.** Chrome's shortcut dispatcher ignores keyboard events that lack a `code` or `windowsVirtualKeyCode`. `browser_press(..., modifiers=[...])` populates these automatically; raw `Input.dispatchKeyEvent` calls from `browser_evaluate` may not.
- **Taking a screenshot more than 10s after the last interaction** and expecting the highlight to still be visible. The overlay fades after 10s. Take the screenshot sooner, or re-trigger the interaction.
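For the `beforeunload` pitfall in the list above, the `try/finally` wrapping might look like the sketch below. The wrapper name and the injected `evaluate` callable are hypothetical stand-ins for `browser_evaluate`; the JavaScript snippet is the one used in the LinkedIn flows in this diff.

```python
# The cleanup script used elsewhere in this document.
CLEAR_BEFOREUNLOAD_JS = "(function(){window.onbeforeunload=null;})()"


def with_composer_cleanup(evaluate, work):
    """Run composer work, then always strip the beforeunload handler.

    `evaluate` is assumed to behave like browser_evaluate (takes a JS string).
    The finally block runs even if `work` raises, so a later navigation
    does not hang on a native "unsent message" dialog.
    """
    try:
        return work()
    finally:
        evaluate(CLEAR_BEFOREUNLOAD_JS)
```

Call it as `with_composer_cleanup(browser_evaluate, type_and_send)`, where `type_and_send` is whatever function drives the composer, and navigate only after it returns.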
@@ -415,7 +492,7 @@ browser_navigate("https://x.com/explore", wait_until="load")
sleep(3)
browser_wait_for_selector("input[data-testid='SearchBox_Search_Input']", timeout_ms=5000)
rect = browser_get_rect("input[data-testid='SearchBox_Search_Input']")
browser_click_coordinate(rect.cx, rect.cy)
browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
browser_type("input[data-testid='SearchBox_Search_Input']", "openai", clear_first=True)
# Screenshot now shows live search suggestions
browser_screenshot()
@@ -429,9 +506,10 @@ browser_navigate("https://www.reddit.com/r/programming/", wait_until="load")
sleep(2)
# Shadow-pierce the nested search input
sq = browser_shadow_query("reddit-search-large >>> #search-input")
browser_click_coordinate(sq.rect.cx, sq.rect.cy)
# Typing can't use selector (shadow); use browser_type_focused on the focused input
browser_type_focused(text="python")
browser_click_coordinate(sq.css.cx, sq.css.cy) # sq.css.cx/cy — matched pair
# Typing can't use selector (shadow); focused input receives raw key presses
for c in "python":
    browser_press(c)
browser_screenshot()
browser_press("Escape")
```
@@ -439,11 +517,11 @@ browser_press("Escape")
### Search LinkedIn and dismiss without submitting
```
browser_navigate("https://www.linkedin.com/feed/", wait_until="load")
browser_navigate("https://www.linkedin.com/feed/", wait_until="load", timeout_ms=20000)
sleep(3)
browser_wait_for_selector("input[data-testid='typeahead-input']", timeout_ms=5000)
rect = browser_get_rect("input[data-testid='typeahead-input']")
browser_click_coordinate(rect.cx, rect.cy)
browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
browser_type("input[data-testid='typeahead-input']", "anthropic", clear_first=True)
# Dropdown shows real live suggestions
browser_screenshot()
@@ -13,15 +13,15 @@ metadata:
LinkedIn is the hardest mainstream site to automate because it combines **shadow DOM** (`#interop-outlet` for messaging), **strict Trusted Types CSP** (silently drops `innerHTML`), **heavy React reconciliation** (injected nodes get stripped on re-render), **native `beforeunload` draft dialogs** (hang the bridge), and **aggressive spam filters**. Every one of those has bitten us at least once. This skill documents what actually works.
**Always activate `browser-automation` first.** This skill assumes you already know about CSS-px coordinates, `browser_type`/`browser_type_focused`, and `browser_shadow_query`. The guidance below is LinkedIn-specific; general browser rules are there.
**Always activate `browser-automation` first.** This skill assumes you already know about CSS-px coordinates, `browser_type`'s click-first behavior, and `browser_shadow_query`. The guidance below is LinkedIn-specific; general browser rules are there.
## Timing expectations
- `browser_navigate(wait_until="load")`: LinkedIn takes **4-5 seconds** to load the feed cold.
- `browser_navigate(wait_until="load", timeout_ms=20000)`: LinkedIn takes **4-5 seconds** to load the feed cold. Default 30s timeout is fine; use 20s as a floor.
- After navigation, **always `sleep(3)`** to let React hydrate the profile/feed chrome before querying selectors. Without the sleep `wait_for_selector` will flake on elements that exist moments later.
- Composer modal slide-in takes **~2 seconds** after you click the Message button.
## Verified selectors
## Verified selectors (2026-04-11)
| Target | Selector | Notes |
|---|---|---|
@@ -40,8 +40,8 @@ LinkedIn changes class names aggressively. If a class-based selector breaks, fal
```
# 1. Load the profile
browser_navigate("https://www.linkedin.com/in/<username>/", wait_until="load")
sleep(3)
browser_navigate("https://www.linkedin.com/in/<username>/", wait_until="load", timeout_ms=20000)
sleep(4)
# 2. Strip onbeforeunload before any state-mutating work — prevents draft-dialog deadlock later
browser_evaluate("""
@@ -98,18 +98,17 @@ textarea = browser_evaluate("""
browser_click_coordinate(textarea['cx'], textarea['cy'])
sleep(0.6)
# 6. Insert text via browser_type_focused. This dispatches CDP
# Input.insertText to document.activeElement — the same underlying
# 6. Insert text via browser_type WITHOUT a selector. This dispatches
# CDP Input.insertText to document.activeElement — the same underlying
# mechanism as execCommand('insertText') but with no JSON escaping,
# no browser_evaluate round trip, and built-in retry. The click in
# step 5 already focused Lexical, so insertText lands in the editor
# regardless of the shadow wrapping around #interop-outlet.
#
# Use browser_type_focused (not browser_type) here — browser_type
# requires a selector, which cannot see past the #interop-outlet
# shadow root. browser_type_focused targets document.activeElement
# directly, sidestepping shadow boundaries entirely.
browser_type_focused(text=message_text)
# Do NOT pass a selector here. Selector-based browser_type cannot see
# past the #interop-outlet shadow root. No-selector mode sidesteps
# that entirely by routing to activeElement.
browser_type(text=message_text) # no selector — targets document.activeElement
sleep(1.0) # let Lexical commit state + enable Send button
# 7. Find the modal Send button (filter by in-viewport, reject pinned bar)
@@ -144,7 +143,7 @@ send = browser_evaluate("""
# 8. ONLY click Send if it's enabled — if disabled, the insertText
# didn't land. DO NOT retry with a different tool; the fix is
# always: re-click the composer rect, re-run browser_type_focused(text=...),
# always: re-click the composer rect, re-run browser_type(text=...),
# re-check. The Send button's `disabled` state IS the ground truth —
# if Lexical registered your text, it enables the button. If it's
# still disabled, your text did not reach the editor, regardless
@@ -154,7 +153,7 @@ if send['disabled']:
# fall back to browser_type with a selector (see anti-pattern in
# Common Pitfalls — selector-based type can't reach the shadow-DOM
# composer). Instead: re-click the textarea rect from step 4, wait
# a beat, re-run browser_type_focused(text=message_text) from
# a beat, re-run browser_type(text=message_text) (no selector) from
# step 6. If that still fails after 2 retries, bail and surface —
# the modal may have been reclaimed by a stale state or auth wall.
raise Exception("Send button disabled after insertText — editor did not receive input")
@@ -171,7 +170,7 @@ Daily outbound pattern — accept pending connection requests and send a templat
```
browser_navigate("https://www.linkedin.com/mynetwork/invitation-manager/received/",
wait_until="load")
wait_until="load", timeout_ms=20000)
sleep(4)
browser_evaluate("(function(){window.onbeforeunload=null;})()")
@@ -215,7 +214,7 @@ for card in cards[:25]:
## Feed post composer flow
```
browser_navigate("https://www.linkedin.com/feed/", wait_until="load")
browser_navigate("https://www.linkedin.com/feed/", wait_until="load", timeout_ms=20000)
sleep(4)
browser_evaluate("(function(){window.onbeforeunload=null;})()")
@@ -302,7 +301,7 @@ If the image isn't already on disk, write it first with `write_file(absolute_pat
## Rate limits and safety
LinkedIn's abuse detection is aggressive. Be aware of these limits and tell the user about them; exceed them only if the user confirms:
LinkedIn's abuse detection is aggressive. Respect these limits:
| Action | Limit |
|---|---|
@@ -310,7 +309,8 @@ LinkedIn's abuse detection is aggressive. Beware of the limits, let user know bu
| Outbound messages to new 1st-degree connections | **25/day max**, 5-10s randomized delays |
| Connection request sends | **100/week max**, spread across days, warm intros preferred |
| Profile views | Several hundred/day is usually fine but varies by account age |
| Post publications | 1-5/day, no URL-only posts |
| Post publications | 1-3/day, no URL-only posts |
| Feed reactions | Dozens/day is fine; vary your activity mix |
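
A minimal sketch of how a run could enforce one row of the table above: a per-day cap plus a randomized pause between actions. `DailyBudget` and its fields are hypothetical names, not part of the browser tools, and the 5 to 10 second window in the usage note is my reading of the outbound-message row.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class DailyBudget:
    """Hypothetical helper: counts actions against a daily cap and
    picks a randomized delay between consecutive actions."""

    max_per_day: int
    min_delay_s: float
    max_delay_s: float
    used: int = 0

    def try_acquire(self) -> bool:
        # Refuse once the cap is reached; the caller should stop the run.
        if self.used >= self.max_per_day:
            return False
        self.used += 1
        return True

    def next_delay(self, rng: Optional[random.Random] = None) -> float:
        # Uniform jitter inside the configured window.
        source = rng if rng is not None else random
        return source.uniform(self.min_delay_s, self.max_delay_s)
```

Usage would look like `budget = DailyBudget(max_per_day=25, min_delay_s=5.0, max_delay_s=10.0)`, calling `budget.try_acquire()` before each send and sleeping for `budget.next_delay()` afterwards. The counter lives in memory only, so a real run would need to persist it across sessions.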
Signals you're being throttled:
- "Message failed to send" with no error detail
@@ -323,8 +323,9 @@ If any of those show up, **stop the run, screenshot the state, and surface the i
## Common pitfalls
- **`innerHTML` injection is silently dropped** — LinkedIn's Trusted Types CSP discards any `innerHTML = "<...>"` from injected scripts, no console error. Always use `createElement` + `appendChild` + `setAttribute` for DOM injection. `textContent`, `style.cssText`, and `.value` assignments are fine.
- **Use `browser_type_focused` (not `browser_type`) on the message composer.** The Lexical contenteditable lives inside the `#interop-outlet` shadow root which `document.querySelector` (what `browser_type`'s selector path uses under the hood) cannot see. `browser_type` requires a selector and will fail with "Element not found". The reliable insert path is: (1) `browser_click_coordinate` on the composer rect — the response's `focused_element` confirms Lexical received focus → (2) `browser_type_focused(text=message_text)` — CDP `Input.insertText` dispatches to `document.activeElement` regardless of shadow wrapping.
- **Per-char keyDown on the message composer produces empty text** — Lexical intercepts `beforeinput` and drops raw keys. Use `browser_type_focused(text=..., use_insert_text=True)` after click-coordinate focused the composer. The CDP `Input.insertText` method commits as if IME fired, which Lexical accepts cleanly.
- **Do NOT pass a selector to `browser_type` on the message composer — call it with NO selector (`browser_type(text=...)`).** The Lexical contenteditable lives inside the `#interop-outlet` shadow root which `document.querySelector` (what the selector-based path uses under the hood) cannot see. Attempts to work around this with `browser_shadow_query` fail because selector-based `browser_type` doesn't support the `>>>` shadow-pierce syntax. The reliable insert path is: (1) `browser_click_coordinate` on the composer rect — the response's `focused_element` confirms Lexical received focus → (2) `browser_type(text=message_text)` with NO selector — CDP `Input.insertText` dispatches to `document.activeElement` regardless of shadow wrapping. The old `browser_evaluate` + `document.execCommand('insertText', ...)` pattern worked but had JSON-escaping pitfalls and cost ~200 chars of JS per send; `browser_type(text=...)` is the same mechanism with built-in retry.
- **Per-char keyDown on the message composer produces empty text** — Lexical intercepts `beforeinput` and drops raw keys. Use `browser_type(text=..., use_insert_text=True)` with NO selector after click-coordinate focused the composer. The CDP `Input.insertText` method commits as if IME fired, which Lexical accepts cleanly. Do NOT pass a selector; selector-based `browser_type` can't see past `#interop-outlet`.
- **ANTI-PATTERN: "inject a dummy `<div id='dummy-target'>` and pass it as the `selector` arg to `browser_type`".** This looks tempting but fails compoundingly: `browser_type` clicks the **dummy div's** rect (not the editor's), the click lands on the Lexical wrapper's non-editable chrome, the contenteditable never receives focus, and `Input.insertText` fires against nothing. The bridge will still return `{"ok": true, "action": "type", "length": N}` because it has no way to verify the text actually landed. Symptom: Send button stays `disabled: true` forever. Fix: `browser_click_coordinate` on the real composer rect, then `browser_type(text=message_text)` with NO selector — CDP `Input.insertText` dispatches to `document.activeElement`. (See `session_20260414_114820_08bd3c4d` for the failed dummy-div attempt.)
- **Multiple Send buttons on the page** — the pinned bottom-right messaging bar has its own `msg-form__send-button` that's usually below `innerHeight`. Filter by in-viewport before clicking.
- **`window.onbeforeunload` hangs navigation/close** — after typing in a composer, any `browser_navigate` or `close_tab` can pop a native "unsent message, leave?" confirm dialog that deadlocks the bridge. Always strip `onbeforeunload` before any navigation, and wrap composer flows in a `try/finally` that runs the cleanup block:
@@ -345,7 +346,7 @@ browser_evaluate("""
## Auth wall detection
If you see a "Log in" / "Join LinkedIn" prompt instead of the logged-in feed, **stop immediately** and surface the issue to user. Do NOT attempt to log in via automation — LinkedIn's bot detection will flag the account.
If you see a "Log in" / "Join LinkedIn" prompt instead of the logged-in feed, **stop immediately** and surface the issue. Do NOT attempt to log in via automation — LinkedIn's bot detection will flag the account.
Check via:
```
+5 -5
@@ -457,12 +457,12 @@ let currentView = 'grid';
// Tool categories for sidebar grouping
const CATEGORIES = {
'Lifecycle': ['browser_setup', 'browser_start', 'browser_stop', 'browser_status'],
'Tabs': ['browser_tabs', 'browser_open', 'browser_close', 'browser_close_all', 'browser_close_finished', 'browser_focus'],
'Lifecycle': ['browser_start', 'browser_stop', 'browser_status'],
'Tabs': ['browser_tabs', 'browser_open', 'browser_close', 'browser_focus'],
'Navigation': ['browser_navigate', 'browser_go_back', 'browser_go_forward', 'browser_reload'],
'Interactions': ['browser_click', 'browser_click_coordinate', 'browser_type', 'browser_type_focused', 'browser_fill', 'browser_press', 'browser_press_at', 'browser_hover', 'browser_hover_coordinate', 'browser_select', 'browser_scroll', 'browser_drag'],
'Inspection': ['browser_screenshot', 'browser_snapshot', 'browser_console', 'browser_html', 'browser_get_text', 'browser_get_attribute', 'browser_get_rect', 'browser_shadow_query', 'browser_evaluate', 'browser_wait'],
'Advanced': ['browser_resize', 'browser_upload', 'browser_dialog'],
'Interactions': ['browser_click', 'browser_click_coordinate', 'browser_type', 'browser_fill', 'browser_press', 'browser_press_at', 'browser_hover', 'browser_hover_coordinate', 'browser_select', 'browser_scroll'],
'Inspection': ['browser_screenshot', 'browser_snapshot', 'browser_console', 'browser_get_text', 'browser_evaluate', 'browser_wait'],
'Advanced': ['browser_resize', 'browser_upload', 'browser_dialog', 'browser_coords'],
};
async function init() {
+5 -5
@@ -88,13 +88,13 @@ Find Textarea (it is hidden inside shadow DOM):
```
Click that coordinate, `sleep(1)`.
Type the message:
Inject text and Send:
Construct the message: `Hey {first_name}, thanks for the connection invite! I'm currently building a prediction market for jobs: https://honeycomb.open-hive.com/. If you could check it out and share some feedback, I'd really appreciate it.`
Use `browser_type_focused` — it dispatches CDP `Input.insertText` to the already-focused composer (document.activeElement), which works through shadow DOM without JSON-escaping issues:
```
browser_type_focused(text=message_text)
sleep(1.0)
Escape the string properly for JS injection, then run:
```javascript
// Replace MSG_TEXT with your actual string
browser_evaluate("(function(){ document.execCommand('insertText', false, `MSG_TEXT`); return true; })()")
```
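
One way to handle the escaping step, sketched in Python under the assumption that the expression string is built on the caller's side before being handed to `browser_evaluate`. `build_insert_text_js` is a hypothetical helper name. It relies on `json.dumps` producing a double-quoted literal that JavaScript also accepts as a string literal, so backticks, quotes, newlines, and `${...}` in the message need no manual handling.

```python
import json


def build_insert_text_js(message: str) -> str:
    """Hypothetical helper: wrap `message` in the execCommand expression.

    json.dumps escapes quotes, backslashes, newlines, and non-ASCII
    characters, so the result can be embedded without a template literal.
    """
    literal = json.dumps(message)
    return (
        "(function(){ document.execCommand('insertText', false, "
        + literal
        + "); return true; })()"
    )
```

The returned string would then be passed as the single argument to `browser_evaluate`. Using a JSON literal instead of the backtick placeholder avoids the case where the message itself contains a backtick or `${`.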
Find Send button (also inside shadow DOM):
+23 -100
@@ -82,8 +82,8 @@ _interaction_highlights: dict[int, dict] = {}
# Compact descriptor of document.activeElement. Returned by both click()
# and click_coordinate() so the agent can verify it focused what it
# intended, then decide whether to follow up with browser_type_focused(text=...).
# Keeping this as a single shared string avoids drift
# intended, then decide whether to follow up with browser_type(text=...,
# no selector). Keeping this as a single shared string avoids drift
# between the two click paths.
_FOCUSED_ELEMENT_JS = """
(function() {
@@ -1177,24 +1177,16 @@ class BeelineBridge:
if rect:
await self.highlight_rect(tab_id, rect["x"], rect["y"], rect["w"], rect["h"], label=selector)
else:
# Highlight the active element when no selector was provided.
# Drill into same-origin iframes to find the real focused
# element — the top-level activeElement may be a full-screen
# iframe whose rect covers the entire viewport.
# Highlight the active element when no selector was provided
rect_result = await self.evaluate(
tab_id,
"(function(){"
"var el=document.activeElement;"
"try{while(el&&el.tagName==='IFRAME'&&el.contentDocument){"
"el=el.contentDocument.activeElement;"
"}}catch(e){}"
"if(!el||el===document.body||el===document.documentElement)return null;"
"(function(){const el=document.activeElement;if(!el)return null;"
"const r=el.getBoundingClientRect();"
"return{x:r.left,y:r.top,w:r.width,h:r.height};})()",
)
rect = (rect_result or {}).get("result")
if rect:
await self.highlight_rect(tab_id, rect["x"], rect["y"], rect["w"], rect["h"], label="active element", border_style="dashed")
await self.highlight_rect(tab_id, rect["x"], rect["y"], rect["w"], rect["h"], label="active element")
return {"ok": True, "action": "type", "selector": selector, "length": len(text)}
# CDP Input.dispatchKeyEvent modifiers bitmask.
@@ -1564,7 +1556,6 @@ class BeelineBridge:
h: float,
label: str = "",
color: dict | None = None,
border_style: str = "solid",
) -> None:
"""Inject a visible highlight overlay into the page DOM.
@@ -1593,7 +1584,7 @@ class BeelineBridge:
box.id = '__hive_hl';
box.style.cssText = 'position:fixed;z-index:2147483647;pointer-events:none;'
+ 'left:{int(x)}px;top:{int(y)}px;width:{max(1, int(w))}px;height:{max(1, int(h))}px;'
+ 'border:2px {border_style} {border_rgb};background:{bg_rgba};'
+ 'border:2px solid {border_rgb};background:{bg_rgba};'
+ 'border-radius:3px;transition:opacity 0.4s ease;opacity:1;'
+ 'box-shadow:0 0 8px {bg_rgba};';
@@ -1936,7 +1927,7 @@ class BeelineBridge:
"result": value,
}
async def snapshot(self, tab_id: int, timeout_s: float = 30.0, mode: str = "default") -> dict:
async def snapshot(self, tab_id: int, timeout_s: float = 30.0) -> dict:
"""Get an accessibility snapshot of the page.
Uses a hybrid approach:
@@ -1947,7 +1938,6 @@ class BeelineBridge:
Args:
tab_id: The tab ID to snapshot
timeout_s: Maximum time to spend building snapshot (default 30s)
mode: Filtering mode: "default", "simple", or "interactive"
"""
try:
async with asyncio.timeout(timeout_s):
@@ -1979,11 +1969,8 @@ class BeelineBridge:
)
return await self._dom_snapshot(tab_id)
# Clean redundant InlineTextBox children before formatting
nodes = self._clean_inline_text_boxes(nodes)
# Format the accessibility tree (with node limit)
snapshot = self._format_ax_tree(nodes, max_nodes=2000, mode=mode)
snapshot = self._format_ax_tree(nodes, max_nodes=2000)
# Get URL
url_result = await self._cdp(
@@ -2117,78 +2104,13 @@ class BeelineBridge:
"tree": "\n".join(lines),
}
@staticmethod
def _clean_inline_text_boxes(nodes: list[dict]) -> list[dict]:
"""Remove redundant InlineTextBox children from StaticText nodes.
If a StaticText node has 3+ InlineTextBox children and ALL their
text is already contained in the StaticText's name, remove all
the InlineTextBox children (they add no information).
"""
by_id = {n["nodeId"]: n for n in nodes}
children_map: dict[str, list[str]] = {}
for n in nodes:
for child_id in n.get("childIds", []):
children_map.setdefault(n["nodeId"], []).append(child_id)
ids_to_remove: set[str] = set()
for n in nodes:
role_info = n.get("role", {})
role = role_info.get("value", "") if isinstance(role_info, dict) else str(role_info)
if role != "StaticText":
continue
child_ids = children_map.get(n["nodeId"], [])
if len(child_ids) < 3:
continue
name_info = n.get("name", {})
parent_name = name_info.get("value", "") if isinstance(name_info, dict) else str(name_info)
if not parent_name:
continue
all_inline = True
for cid in child_ids:
child = by_id.get(cid)
if not child:
all_inline = False
break
child_role_info = child.get("role", {})
child_role = (
child_role_info.get("value", "") if isinstance(child_role_info, dict) else str(child_role_info)
)
if child_role != "InlineTextBox":
all_inline = False
break
child_name_info = child.get("name", {})
child_name = (
child_name_info.get("value", "") if isinstance(child_name_info, dict) else str(child_name_info)
)
if child_name and child_name not in parent_name:
all_inline = False
break
if all_inline:
ids_to_remove.update(child_ids)
n["childIds"] = []
if not ids_to_remove:
return nodes
return [n for n in nodes if n["nodeId"] not in ids_to_remove]
def _format_ax_tree(self, nodes: list[dict], max_nodes: int = 2000, mode: str = "default") -> str:
def _format_ax_tree(self, nodes: list[dict], max_nodes: int = 2000) -> str:
"""Format a CDP Accessibility.getFullAXTree result.
Args:
nodes: List of accessibility tree nodes
max_nodes: Maximum number of nodes to process (prevents hangs on huge trees)
mode: Filtering mode: "default" (full tree), "simple" (interactive +
content, skip unnamed structural), "interactive" (interactive only)
"""
from .refs import INTERACTIVE_ROLES, STRUCTURAL_ROLES
if not nodes:
return "(empty tree)"
@@ -2228,21 +2150,11 @@ class BeelineBridge:
_walk(cid, depth)
return
node_counter[0] += 1
name_info = node.get("name", {})
name = name_info.get("value", "") if isinstance(name_info, dict) else str(name_info)
# Mode-based filtering — skip node but walk children at same depth
if mode == "interactive" and role not in INTERACTIVE_ROLES:
for cid in children_map.get(node_id, []):
_walk(cid, depth)
return
if mode == "simple" and role in STRUCTURAL_ROLES and not name:
for cid in children_map.get(node_id, []):
_walk(cid, depth)
return
node_counter[0] += 1
# Build property annotations
props: list[str] = []
for prop in node.get("properties", []):
@@ -2259,7 +2171,18 @@ class BeelineBridge:
label = f"- {role}"
# Add ref for interactive elements
if role in INTERACTIVE_ROLES or name:
interactive_roles = {
"button",
"link",
"textbox",
"checkbox",
"radio",
"combobox",
"menuitem",
"tab",
"searchbox",
}
if role in interactive_roles or name:
ref_counter[0] += 1
ref_id = f"e{ref_counter[0]}"
ref_map[ref_id] = f"[{role}]{name}"
+5 -43
@@ -13,13 +13,7 @@ from typing import TYPE_CHECKING
if TYPE_CHECKING:
from .session import BrowserSession
"""Shared ARIA role classification sets.
Keep these in sync across snapshot paths; divergence causes different
drivers to produce different snapshot output for the same page.
"""
# Roles that represent user-interactive elements and always get a ref.
# Role sets for interactive elements
INTERACTIVE_ROLES: frozenset[str] = frozenset(
{
"button",
@@ -32,6 +26,7 @@ INTERACTIVE_ROLES: frozenset[str] = frozenset(
"menuitemradio",
"option",
"radio",
"scrollbar",
"searchbox",
"slider",
"spinbutton",
@@ -42,44 +37,11 @@ INTERACTIVE_ROLES: frozenset[str] = frozenset(
}
)
# Roles that carry meaningful content and get a ref when named.
CONTENT_ROLES: frozenset[str] = frozenset(
NAMED_CONTENT_ROLES: frozenset[str] = frozenset(
{
"article",
"cell",
"columnheader",
"gridcell",
"heading",
"listitem",
"main",
"navigation",
"region",
"rowheader",
}
)
# Structural/container roles — typically skipped in compact mode.
STRUCTURAL_ROLES: frozenset[str] = frozenset(
{
"application",
"directory",
"document",
"generic",
"grid",
"group",
"ignored",
"list",
"menu",
"menubar",
"none",
"presentation",
"row",
"rowgroup",
"table",
"tablist",
"toolbar",
"tree",
"treegrid",
"img",
}
)
@@ -119,7 +81,7 @@ def annotate_snapshot(snapshot: str) -> tuple[str, RefMap]:
role = m.group(2)
name = m.group(3)
if role in INTERACTIVE_ROLES or (role in CONTENT_ROLES and name):
if role in INTERACTIVE_ROLES or (role in NAMED_CONTENT_ROLES and name):
candidates.append((i, role, name))
ref_map: RefMap = {}
+60 -31
@@ -379,7 +379,14 @@ def register_inspection_tools(mcp: FastMCP) -> None:
return {
"ok": True,
# Primary output: CSS pixels. Feed these to click/hover/press.
# Echo the input — you can feed these straight into
# browser_click_image, which does the image→CSS conversion
# internally. This is the simpler path when you just read a
# pixel off a screenshot.
"image_x": round(x, 1),
"image_y": round(y, 1),
# CSS pixels — feed these to browser_click_coordinate /
# hover_coordinate / press_at, which expect CSS px.
"css_x": round(x * css_scale, 1),
"css_y": round(y * css_scale, 1),
# Debug output: raw physical pixels. DO NOT feed to clicks on
@@ -392,12 +399,11 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"cssScale": css_scale,
"tabId": target_tab,
"note": (
"Use css_x/css_y with browser_click_coordinate, "
"browser_hover_coordinate, browser_press_at — "
"Chrome DevTools Protocol Input.dispatchMouseEvent "
"operates in CSS pixels. physical_x/y is for debugging "
"on HiDPI displays only; feeding it to clicks lands "
"them at DPR× the intended coordinate."
"Simpler path: skip browser_coords entirely and call "
"browser_click_image(image_x, image_y) — it does the "
"conversion automatically. Use css_x/css_y only if you "
"need to pass coords to browser_click_coordinate / "
"hover_coordinate / press_at. physical_x/y is debug-only."
),
}
@@ -412,7 +418,8 @@ def register_inspection_tools(mcp: FastMCP) -> None:
Traverses shadow roots to find elements inside closed/open shadow DOM,
overlays, and virtual-rendered components (e.g. LinkedIn's #interop-outlet).
Returns getBoundingClientRect in both CSS and physical pixels.
Returns getBoundingClientRect in image, CSS, and physical pixels;
pass the matching block into the matching click tool.
Args:
selector: CSS selectors joined by ' >>> ' to pierce shadow roots.
@@ -421,7 +428,9 @@ def register_inspection_tools(mcp: FastMCP) -> None:
profile: Browser profile name (default: "default")
Returns:
Dict with rect (CSS px) and physical rect (CSS px × DPR) of the element
Dict with ``image`` (pass to browser_click_image), ``css``
(pass to browser_click_coordinate / hover / press_at), and
``physical`` (debug only).
"""
bridge = get_bridge()
if not bridge or not bridge.is_connected:
@@ -441,11 +450,21 @@ def register_inspection_tools(mcp: FastMCP) -> None:
physical_scale = _screenshot_scales.get(target_tab, 1.0)
css_scale = _screenshot_css_scales.get(target_tab, 1.0)
dpr = physical_scale / css_scale if css_scale else 1.0
# image = css / cssScale — inverse of the conversion browser_click_image does
inv_css = 1.0 / css_scale if css_scale else 1.0
return {
"ok": True,
"selector": selector,
"tag": rect.get("tag"),
"image": {
"x": round(rect["x"] * inv_css, 1),
"y": round(rect["y"] * inv_css, 1),
"w": round(rect["w"] * inv_css, 1),
"h": round(rect["h"] * inv_css, 1),
"cx": round(rect["cx"] * inv_css, 1),
"cy": round(rect["cy"] * inv_css, 1),
},
"css": {
"x": rect["x"],
"y": rect["y"],
@@ -462,12 +481,14 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"cx": round(rect["cx"] * dpr, 1),
"cy": round(rect["cy"] * dpr, 1),
},
"cssScale": css_scale,
"note": (
"Use css.cx/cy with browser_click_coordinate, "
"browser_hover_coordinate, browser_press_at — "
"CDP Input events operate in CSS pixels. "
"physical.* is debug-only; feeding it to clicks "
"lands them DPR× too far on HiDPI displays."
"Pass image.cx/cy → browser_click_image (preferred after a "
"screenshot). Pass css.cx/cy → browser_click_coordinate / "
"hover_coordinate / press_at. physical.* is debug-only; "
"feeding it to clicks lands them DPR× too far on HiDPI. "
"If cssScale=1.0 no screenshot is cached yet — take a "
"browser_screenshot first if you want to use image coords."
),
}
@@ -481,10 +502,12 @@ def register_inspection_tools(mcp: FastMCP) -> None:
Get the bounding rect of an element by CSS selector.
Supports '>>>' shadow-piercing selectors for overlay/shadow DOM content.
Returns coordinates in CSS pixels (for clicks and DOM APIs); the
physical-pixel variant is returned for debugging on HiDPI displays
only; it must not be fed to click/hover/press tools, which use
CSS pixels.
Returns coordinates in image, CSS, and physical pixels. Pass the
``image`` block into browser_click_image (preferred after a
screenshot) or the ``css`` block into browser_click_coordinate /
hover_coordinate / press_at. ``physical`` is debug-only and must
not be fed to click tools: CDP Input events use CSS pixels, not
physical pixels.
Args:
selector: CSS selector, optionally with ' >>> ' to pierce shadow roots.
@@ -493,7 +516,7 @@ def register_inspection_tools(mcp: FastMCP) -> None:
profile: Browser profile name (default: "default")
Returns:
Dict with css and physical bounding rects
Dict with image, css, and physical bounding rects.
"""
bridge = get_bridge()
if not bridge or not bridge.is_connected:
@@ -513,11 +536,20 @@ def register_inspection_tools(mcp: FastMCP) -> None:
physical_scale = _screenshot_scales.get(target_tab, 1.0)
css_scale = _screenshot_css_scales.get(target_tab, 1.0)
dpr = physical_scale / css_scale if css_scale else 1.0
inv_css = 1.0 / css_scale if css_scale else 1.0
return {
"ok": True,
"selector": selector,
"tag": rect.get("tag"),
"image": {
"x": round(rect["x"] * inv_css, 1),
"y": round(rect["y"] * inv_css, 1),
"w": round(rect["w"] * inv_css, 1),
"h": round(rect["h"] * inv_css, 1),
"cx": round(rect["cx"] * inv_css, 1),
"cy": round(rect["cy"] * inv_css, 1),
},
"css": {
"x": rect["x"],
"y": rect["y"],
@@ -534,12 +566,14 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"cx": round(rect["cx"] * dpr, 1),
"cy": round(rect["cy"] * dpr, 1),
},
"cssScale": css_scale,
"note": (
"Use css.cx/cy with browser_click_coordinate, "
"browser_hover_coordinate, browser_press_at — "
"CDP Input events operate in CSS pixels. "
"physical.* is debug-only; feeding it to clicks "
"lands them DPR× too far on HiDPI displays."
"Pass image.cx/cy → browser_click_image (preferred after a "
"screenshot). Pass css.cx/cy → browser_click_coordinate / "
"hover_coordinate / press_at. physical.* is debug-only; "
"feeding it to clicks lands them DPR× too far on HiDPI. "
"If cssScale=1.0 no screenshot is cached yet — take a "
"browser_screenshot first if you want to use image coords."
),
}
@@ -547,7 +581,6 @@ def register_inspection_tools(mcp: FastMCP) -> None:
async def browser_snapshot(
tab_id: int | None = None,
profile: str | None = None,
mode: Literal["default", "simple", "interactive"] = "default",
) -> dict:
"""
Get an accessibility snapshot of the page.
@@ -566,16 +599,12 @@ def register_inspection_tools(mcp: FastMCP) -> None:
Args:
tab_id: Chrome tab ID (default: active tab)
profile: Browser profile name (default: "default")
mode: Snapshot filtering mode (default: "default")
- "default": full accessibility tree
- "simple": interactive + content nodes, skip unnamed structural nodes
- "interactive": only interactive nodes (buttons, links, inputs, etc.)
Returns:
Dict with the snapshot text tree, URL, and tab ID
"""
start = time.perf_counter()
params = {"tab_id": tab_id, "profile": profile, "mode": mode}
params = {"tab_id": tab_id, "profile": profile}
bridge = get_bridge()
if not bridge or not bridge.is_connected:
@@ -596,7 +625,7 @@ def register_inspection_tools(mcp: FastMCP) -> None:
return result
try:
snapshot_result = await bridge.snapshot(target_tab, mode=mode)
snapshot_result = await bridge.snapshot(target_tab)
log_tool_call(
"browser_snapshot",
params,
+119
@@ -173,6 +173,125 @@ def register_interaction_tools(mcp: FastMCP) -> None:
)
return result
@mcp.tool()
async def browser_click_image(
image_x: float,
image_y: float,
tab_id: int | None = None,
profile: str | None = None,
button: Literal["left", "right", "middle"] = "left",
) -> dict:
"""
Click at coordinates read directly from the most recent screenshot.
**Use this after ``browser_screenshot`` when you've eyeballed a
target pixel from the returned image.** It reads the cached
image→CSS scale for the tab and converts automatically, so you
pass in the raw image pixel you saw, not a pre-converted CSS px.
This is the canonical screenshot-then-click path and eliminates
the single most common coordinate bug (passing 800-px-scale
numbers to a 1717-px-scale CDP API, which lands the click on a
sidebar instead of the target).
Pipeline:
img = browser_screenshot() # image (default 800 px wide)
# look at img, pick a target pixel (x, y)
browser_click_image(x, y) # auto-converts to CSS px
If you already hold CSS coordinates (from ``getBoundingClientRect``,
``browser_get_rect``, or an explicit ``browser_coords`` call), use
``browser_click_coordinate`` instead; that path does no conversion.
Fails fast if no screenshot scale is cached for the tab (call
``browser_screenshot`` first so the scale is known).
Args:
image_x: X coordinate in image pixels (the value you read off
the screenshot PNG).
image_y: Y coordinate in image pixels.
tab_id: Chrome tab ID (default: active tab).
profile: Browser profile name (default: "default").
button: Mouse button to click (left, right, middle).
Returns:
Dict with click result, including the converted ``css_x`` /
``css_y`` that were actually dispatched and the ``cssScale``
used for the conversion.
"""
start = time.perf_counter()
params = {
"image_x": image_x,
"image_y": image_y,
"tab_id": tab_id,
"profile": profile,
"button": button,
}
bridge = get_bridge()
if not bridge or not bridge.is_connected:
result = {"ok": False, "error": "Browser extension not connected"}
log_tool_call("browser_click_image", params, result=result)
return result
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
log_tool_call("browser_click_image", params, result=result)
return result
target_tab = tab_id or ctx.get("activeTabId")
if target_tab is None:
result = {"ok": False, "error": "No active tab"}
log_tool_call("browser_click_image", params, result=result)
return result
from .inspection import _screenshot_css_scales
css_scale = _screenshot_css_scales.get(target_tab)
if css_scale is None:
result = {
"ok": False,
"error": (
f"No screenshot scale cached for tab {target_tab}. "
"Call browser_screenshot first so the image→CSS scale "
"is known, or use browser_click_coordinate if you "
"already have CSS px."
),
}
log_tool_call("browser_click_image", params, result=result)
return result
css_x = image_x * css_scale
css_y = image_y * css_scale
try:
click_result = await bridge.click_coordinate(target_tab, css_x, css_y, button=button)
enriched = {
**click_result,
"input_image_x": image_x,
"input_image_y": image_y,
"converted_css_x": css_x,
"converted_css_y": css_y,
"cssScale": css_scale,
}
log_tool_call(
"browser_click_image",
params,
result=enriched,
duration_ms=(time.perf_counter() - start) * 1000,
)
return enriched
except Exception as e:
result = {"ok": False, "error": str(e)}
log_tool_call(
"browser_click_image",
params,
error=e,
duration_ms=(time.perf_counter() - start) * 1000,
)
return result
@mcp.tool()
async def browser_type(
selector: str,