Compare commits

...

4 Commits

Author SHA1 Message Date
Timothy bb2e55f5fa fix: click response 2026-04-16 17:43:21 -07:00
Timothy 5d0106ac34 fix: remove coordinate conversation 2026-04-16 17:36:05 -07:00
Timothy 3a219e27ab fix: two dimensions 2026-04-16 17:08:40 -07:00
Timothy 7f62e7a2d0 fix: split image click vs coordinate click 2026-04-16 16:37:52 -07:00
8 changed files with 201 additions and 330 deletions
@@ -25,7 +25,6 @@ All tools are prefixed with `browser_`:
- `browser_screenshot` — visual capture (annotated PNG)
<!-- /vision-only -->
- `browser_shadow_query`, `browser_get_rect` — locate elements (shadow-piercing via `>>>`)
- `browser_coords` — convert image pixels to CSS pixels (always use `css_x/y`, never `physical_x/y`)
- `browser_scroll`, `browser_wait` — navigation helpers
- `browser_evaluate` — run JavaScript
- `browser_close`, `browser_close_finished` — tab cleanup
@@ -38,9 +37,9 @@ All tools are prefixed with `browser_`:
Neither tool is "preferred" universally — they're for different jobs. Default to snapshot on text-heavy static pages, screenshot on SPAs and anything shadow-DOM-heavy. Activate the `browser-automation` skill for the full decision tree.
## Coordinate rule: always CSS pixels
## Coordinate rule
Chrome DevTools Protocol `Input.dispatchMouseEvent` takes **CSS pixels**, not physical pixels. After a screenshot, use `browser_coords(image_x, image_y)` and feed the returned `css_x/y` (NOT `physical_x/y`) to `browser_click_coordinate`, `browser_hover_coordinate`, `browser_press_at`. Feeding physical pixels on a HiDPI display (DPR=1.6, 2, or 3) overshoots by `DPR×` and clicks land in the wrong place. `getBoundingClientRect()` already returns CSS pixels pass through unchanged, no DPR multiplication.
`browser_screenshot` delivers the image at the CSS viewport's own dimensions, so a pixel you read off the screenshot is the same coordinate `browser_click_coordinate`, `browser_hover_coordinate`, and `browser_press_at` expect — no conversion. `getBoundingClientRect()` likewise returns CSS pixels; pass through unchanged.
## System prompt tips for browser nodes
+11 -20
View File
@@ -42,22 +42,14 @@ after an interaction unless you need a fresh view.
Only fall back to `browser_get_text` for extracting small elements by
CSS selector.
## Coordinates: always CSS pixels
## Coordinates
Chrome DevTools Protocol `Input.dispatchMouseEvent` takes **CSS
pixels**, not physical pixels. This is critical and often gets wrong:
| Tool | Unit |
|---|---|
| `browser_click_coordinate(x, y)` | **CSS pixels** |
| `browser_hover_coordinate(x, y)` | **CSS pixels** |
| `browser_press_at(x, y, key)` | **CSS pixels** |
| `getBoundingClientRect()` | already CSS pixels pass straight through |
| `browser_coords(img_x, img_y)` | returns `css_x/y` (use this) and `physical_x/y` (debug only) |
**Always use `css_x/y`** from `browser_coords`. Feeding `physical_x/y`
on a HiDPI display overshoots by `DPR×` clicks land DPR times too
far right and down. On a DPR=1.6 display that's 60% off.
`browser_screenshot` delivers the image at the CSS viewport's own
dimensions, so a pixel you read off the screenshot is the same number
you pass to `browser_click_coordinate` / `browser_hover_coordinate` /
`browser_press_at`. `browser_get_rect` and `browser_shadow_query` also
return CSS px feed `rect.css.cx` / `rect.css.cy` straight through.
No scale factors to remember.
Never multiply `getBoundingClientRect()` by `devicePixelRatio` it's
already in the right unit.
@@ -86,11 +78,10 @@ reach shadow elements transparently.
**Shadow-heavy site workflow:**
1. `browser_screenshot()` visual image
2. Identify target visually image coordinate
3. `browser_coords(x, y)` CSS px
4. `browser_click_coordinate(css_x, css_y)` lands via native hit
test; inputs get focused regardless of shadow depth
5. Type via `browser_type_focused` (no selector needed types into the
2. Identify target visually pixel `(x, y)` read straight off the image
3. `browser_click_coordinate(x, y)` lands via native hit test; inputs
get focused regardless of shadow depth
4. Type via `browser_type_focused` (no selector needed types into the
already-focused element), or `browser_type` if you have a selector
For selector-style access when you know the shadow path:
@@ -1,6 +1,6 @@
---
name: hive.browser-automation
description: Required before any browser_* tool call. Teaches the screenshot + browser_click_coordinate workflow that reaches shadow-DOM inputs selectors can't see, the CSS-pixel coordinate rule (not physical px), rich-text editor quirks ("send button stays disabled" failures), and CSP gotchas. Covers Chrome via CDP through the GCU Beeline extension. Skipping this causes repeated failures on LinkedIn / Reddit / X. Verified against real production sites 2026-04-11.
description: Required before any browser_* tool call. Teaches the screenshot + browser_click_coordinate workflow that reaches shadow-DOM inputs selectors can't see, rich-text editor quirks ("send button stays disabled" failures), and CSP gotchas. Covers Chrome via CDP through the GCU Beeline extension. Skipping this causes repeated failures on LinkedIn / Reddit / X. Verified against real production sites 2026-04-11.
metadata:
author: hive
type: default-skill
@@ -12,25 +12,20 @@ metadata:
All GCU browser tools drive a real Chrome instance through the Beeline extension and Chrome DevTools Protocol (CDP). That means clicks, keystrokes, and screenshots are processed by the actual browser's native hit testing, focus, and layout engines — **not** a synthetic event layer. Understanding this unlocks strategies that make hard sites easy.
## Coordinates: always CSS pixels
## Coordinates
**Chrome DevTools Protocol `Input.dispatchMouseEvent` operates in CSS pixels, not physical pixels.**
When you call `browser_coords(image_x, image_y)` after a screenshot, the returned dict has both `css_x/y` and `physical_x/y`. **Always use `css_x/y` for clicks, hovers, and key presses.**
Screenshots are delivered at the CSS viewport's own dimensions. A pixel you see in the screenshot is the same coordinate `browser_click_coordinate` expects — no conversion, no scale factors.
```
browser_screenshot() → image (downscaled to 800/900 px wide)
browser_coords(img_x, img_y) → {css_x, css_y, physical_x, physical_y}
browser_click_coordinate(css_x, css_y) ← USE css_x/y
browser_hover_coordinate(css_x, css_y) ← USE css_x/y
browser_press_at(css_x, css_y, key) ← USE css_x/y
browser_screenshot() → image at CSS-viewport size (JPEG)
browser_click_coordinate(x, y) → same (x, y)
browser_hover_coordinate(x, y) → same (x, y)
browser_press_at(x, y, key) → same (x, y)
browser_get_rect(selector) → rect.css → pass rect.css.cx, rect.css.cy to any of the above
browser_shadow_query(...) → sq.css → same
```
Feeding `physical_x/y` on a HiDPI display overshoots by DPR× — on a DPR=1.6 laptop, clicks land 60% too far right and down. The ratio between `physicalScale` and `cssScale` tells you the effective DPR.
`getBoundingClientRect()` already returns CSS pixels — feed those values straight through to click/hover tools without any DPR multiplication.
**Exception for zoomed elements:** pages that use `zoom` or `transform: scale()` on a container (LinkedIn's `#interop-outlet`, some embedded iframes) render in a scaled local coordinate space. `getBoundingClientRect` there may not match CDP's hit space. Use `browser_shadow_query` which handles the math, or fall back to visually picking coordinates from a screenshot.
**Exception for zoomed elements:** pages that use `zoom` or `transform: scale()` on a container (LinkedIn's `#interop-outlet`, some embedded iframes) render in a scaled local coordinate space. `getBoundingClientRect` there may not match CDP's hit space. Use `browser_shadow_query` which handles the math, or visually pick coordinates from a screenshot.
## Screenshot + coordinates is shadow-agnostic — prefer it on shadow-heavy sites
@@ -46,29 +41,28 @@ Whereas `wait_for_selector`, `browser_click(selector=...)`, `browser_type(select
### Recommended workflow on shadow-heavy sites
1. `browser_screenshot()` → visual image
2. Identify the target visually → image pixel `(x, y)` (eyeball from the screenshot)
3. `browser_coords(x, y)` → convert to CSS px
4. `browser_click_coordinate(css_x, css_y)` → lands on the element via native hit testing; inputs get focused. **The response now includes `focused_element: {tag, id, role, contenteditable, rect, ...}`** — use it to verify you actually focused what you intended.
5. `browser_type(text="...")` with **NO selector** → dispatches CDP `Input.insertText` to `document.activeElement`. Shadow roots, iframes, Lexical, Draft.js, ProseMirror all just work. Only pass a selector if you want a DIFFERENT element than the one you just focused (rare).
6. Verify via `browser_screenshot` OR `browser_get_attribute` on a known-reachable marker (e.g. check that the Send button's `aria-disabled` flipped to `false`).
1. `browser_screenshot()` → visual image (delivered at the CSS-viewport's own dimensions).
2. Identify the target visually → pixel `(x, y)` read straight off the image.
3. `browser_click_coordinate(x, y)` → clicks there. **The response includes `focused_element: {tag, id, role, contenteditable, rect, ...}`** — use it to verify you actually focused what you intended.
4. `browser_type_focused(text="...")` → dispatches CDP `Input.insertText` to `document.activeElement`. Shadow roots, iframes, Lexical, Draft.js, ProseMirror all just work. Use `browser_type(selector, text)` only when you want to target a different element than the one you just focused.
5. Verify via `browser_screenshot` OR `browser_get_attribute` on a known-reachable marker (e.g. check that the Send button's `aria-disabled` flipped to `false`).
### The click→type loop (canonical pattern)
```
resp = browser_click_coordinate(x, y)
resp = browser_click_coordinate(x, y) # x, y read straight off the screenshot
fe = resp.get("focused_element")
if fe and (fe.get("contenteditable") or fe["tag"] in ("textarea", "input")):
browser_type(text="...") # no selector — insertText to activeElement
browser_type_focused(text="...") # insertText to activeElement
else:
# you clicked something that isn't editable — refine coords and retry
# do NOT reach for browser_evaluate + execCommand('insertText', ...)
# you clicked something that isn't editable — refine the pixel and retry.
# Do NOT reach for browser_evaluate + execCommand('insertText', ...)
# or a walk(root) shadow traversal. The problem is your click, not
# the typing method.
...
```
`browser_click` (selector-based) also returns `focused_element` now, so the same check works whether you clicked by selector or coordinate.
`browser_click` (selector-based) also returns `focused_element`, so the same check works whether you clicked by selector or by coordinate.
### Empirically verified (2026-04-11)
@@ -154,7 +148,7 @@ The symptom is always the same: **you type, the characters appear visually, and
```
# 1. Focus the real element via a real click (not JS .focus()).
rect = browser_get_rect(selector) # or browser_shadow_query for shadow sites
browser_click_coordinate(rect.cx, rect.cy)
browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
sleep(0.5) # let the editor open / focus settle
# 2. Type. browser_type now uses CDP Input.insertText by default, which is
@@ -183,7 +177,7 @@ if not state['disabled']:
else:
# Recovery: sometimes a click-again + one extra keystroke nudges
# React into recomputing hasRealContent.
browser_click_coordinate(rect.cx, rect.cy)
browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
browser_press("End")
browser_press(" ")
browser_press("Backspace")
@@ -266,25 +260,15 @@ Recognized without modifiers: `Enter`, `Tab`, `Escape`, `Backspace`, `Delete`, `
## Screenshots
```
browser_screenshot() # viewport, 900 px wide by default
browser_screenshot() # viewport, CSS-sized JPEG
browser_screenshot(full_page=True) # full scrollable page
browser_screenshot(selector="#header") # clip to element's rect
```
Returns a PNG with automatic downscaling to a target width (default 900 px) plus a JSON metadata block containing `cssWidth`, `devicePixelRatio`, `physicalScale`, `cssScale`, and a `scaleHint` string. The image is also annotated with a highlight rectangle/dot showing the last interaction (click, hover, type) if one happened on this tab.
Returns a JPEG (quality 75, ~150250 KB for a typical UI) at the CSS viewport's own dimensions, plus a JSON metadata block containing `cssWidth`, `devicePixelRatio`, `imageWidth` (= `cssWidth`), and a `scaleHint` confirming image-px == CSS-px. The image is annotated with a highlight rectangle/dot showing the last interaction (click, hover, type) if one happened on this tab.
The highlight overlay stays visible on the page for **10 seconds** after each interaction, then fades. Before a screenshot is likely, make sure your click / hover / type happens <10 s before the screenshot.
### Anatomy of the scale fields
- `cssWidth` = `window.innerWidth` (CSS px)
- `devicePixelRatio` = `window.devicePixelRatio` (often 1.6, 2, or 3 on modern displays)
- `physicalScale = png_width / image_width` (how many physical-px per image-px)
- `cssScale = cssWidth / image_width` (how many CSS-px per image-px)
- Effective DPR = `physicalScale / cssScale` (should match `devicePixelRatio`)
When converting image coordinates for clicks, always use `cssScale`. The `physicalScale` field is there for debugging HiDPI displays, not for inputs.
## Scrolling
- Use large scroll amounts (~2000) when loading more content — sites like Twitter and LinkedIn have lazy loading for paging.
@@ -339,7 +323,7 @@ LinkedIn enforces **strict Trusted Types CSP**. Any script you inject via `brows
Reddit's search input lives **two shadow levels deep** inside `reddit-search-large > faceplate-search-input`. You cannot reach it with `browser_type(selector=)`. The working pattern:
1. `browser_shadow_query("reddit-search-large >>> #search-input")` → rect
2. `browser_click_coordinate(rect.cx, rect.cy)` → click lands on the real shadow input via native hit testing; input becomes focused
2. `browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair` → click lands on the real shadow input via native hit testing; input becomes focused
3. `browser_press(c)` for each character → dispatches to focused element
4. Verify by reading `.value` via `browser_evaluate` walking the shadow path
@@ -409,7 +393,7 @@ Then pass the most specific selector that uniquely identifies the right input (e
- **Typing into a rich-text editor without clicking first → send button stays disabled.** Draft.js (X), Lexical (Gmail, LinkedIn DMs), ProseMirror (Reddit), and React-controlled `contenteditable` elements only register input as "real" when the element received a native focus event — JS-sourced `.focus()` is not enough. `browser_type` now does this automatically via a real CDP pointer click before inserting text, but always verify the submit button's `disabled` state before clicking send. See the "ALWAYS click before typing" section above.
- **Using per-character `keyDown` on Lexical / Draft.js editors → keys dispatch but text never appears.** Those editors intercept `beforeinput` and route insertion through their own state machine; raw keyDown events are silently dropped. `browser_type` now uses `Input.insertText` by default (the CDP IME-commit method) which these editors accept cleanly. Only set `use_insert_text=False` when you explicitly need per-keystroke dispatch.
- **Leaving a composer with text then trying to navigate → `beforeunload` dialog hangs the bridge.** LinkedIn and several other sites pop a native "unsent message" confirm. `browser_navigate` and `close_tab` both time out against this. Always strip `window.onbeforeunload = null` via `browser_evaluate` before any navigation after typing in a composer, or wrap your logic in a `try/finally` that runs the cleanup block.
- **Clicking at physical pixels.** CDP uses CSS px. `browser_coords` returns both for debugging, but always feed `css_x/y` to click tools.
- **Click landed in the wrong region (sidebar / header instead of target).** The `focused_element` in the click response shows what actually got focused (e.g. `className: "msg-conversation-listitem__link"` means you hit the messaging sidebar). Treat it as ground truth — if it isn't the target, adjust the pixel and retry. Screenshot pixels equal CSS pixels, so the number you passed is the number CDP clicked; a wrong result means you picked the wrong pixel, not that any conversion went sideways.
- **Calling `wait_for_selector` on a shadow element.** It'll always time out. Use `browser_shadow_query` or the screenshot + coordinate strategy.
- **Relying on `innerHTML` in injected scripts on LinkedIn.** Silently discarded. Use `createElement` + `appendChild`.
- **Not waiting for SPA hydration.** `wait_until="load"` fires before React/Vue rendering on many sites. Add a 23 s sleep before querying for chrome elements.
@@ -461,7 +445,7 @@ browser_navigate("https://x.com/explore", wait_until="load")
sleep(3)
browser_wait_for_selector("input[data-testid='SearchBox_Search_Input']", timeout_ms=5000)
rect = browser_get_rect("input[data-testid='SearchBox_Search_Input']")
browser_click_coordinate(rect.cx, rect.cy)
browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
browser_type("input[data-testid='SearchBox_Search_Input']", "openai", clear_first=True)
# Screenshot now shows live search suggestions
browser_screenshot()
@@ -475,7 +459,7 @@ browser_navigate("https://www.reddit.com/r/programming/", wait_until="load")
sleep(2)
# Shadow-pierce the nested search input
sq = browser_shadow_query("reddit-search-large >>> #search-input")
browser_click_coordinate(sq.rect.cx, sq.rect.cy)
browser_click_coordinate(sq.css.cx, sq.css.cy) # sq.css.cx/cy — matched pair
# Typing can't use selector (shadow); focused input receives raw key presses
for c in "python":
browser_press(c)
@@ -490,7 +474,7 @@ browser_navigate("https://www.linkedin.com/feed/", wait_until="load", timeout_ms
sleep(3)
browser_wait_for_selector("input[data-testid='typeahead-input']", timeout_ms=5000)
rect = browser_get_rect("input[data-testid='typeahead-input']")
browser_click_coordinate(rect.cx, rect.cy)
browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
browser_type("input[data-testid='typeahead-input']", "anthropic", clear_first=True)
# Dropdown shows real live suggestions
browser_screenshot()
@@ -34,7 +34,7 @@ LinkedIn is the hardest mainstream site to automate because it combines **shadow
| Pending connection card | `.invitation-card, .invitations-card, [data-test-incoming-invitation-card]` | Filter out "invited you to follow" / "subscribe" cards |
| Accept button | `button[aria-label*="Accept"]` within the card scope | Per-card scoping is critical — there are many Accept buttons on the page |
LinkedIn changes class names aggressively. If a class-based selector breaks, fall back to **`browser_screenshot` → visual identification → `browser_coords``browser_click_coordinate`**. The screenshot + coord path works regardless of class-name churn and regardless of shadow DOM.
LinkedIn changes class names aggressively. If a class-based selector breaks, fall back to **`browser_screenshot` → visual identification → `browser_click_coordinate`** with the pixel you read straight off the image (screenshots are CSS-sized, so no conversion). The screenshot + coord path works regardless of class-name churn and regardless of shadow DOM.
## Profile Message flow (verified end-to-end 2026-04-11)
@@ -108,7 +108,7 @@ sleep(0.6)
# Do NOT pass a selector here. Selector-based browser_type cannot see
# past the #interop-outlet shadow root. No-selector mode sidesteps
# that entirely by routing to activeElement.
browser_type(text=message_text) # no selector — targets document.activeElement
browser_type_focused(text=message_text) # targets document.activeElement
sleep(1.0) # let Lexical commit state + enable Send button
# 7. Find the modal Send button (filter by in-viewport, reject pinned bar)
@@ -143,7 +143,7 @@ send = browser_evaluate("""
# 8. ONLY click Send if it's enabled — if disabled, the insertText
# didn't land. DO NOT retry with a different tool; the fix is
# always: re-click the composer rect, re-run browser_type(text=...),
# always: re-click the composer rect, re-run browser_type_focused(text=...),
# re-check. The Send button's `disabled` state IS the ground truth —
# if Lexical registered your text, it enables the button. If it's
# still disabled, your text did not reach the editor, regardless
@@ -153,7 +153,7 @@ if send['disabled']:
# fall back to browser_type with a selector (see anti-pattern in
# Common Pitfalls — selector-based type can't reach the shadow-DOM
# composer). Instead: re-click the textarea rect from step 4, wait
# a beat, re-run browser_type(text=message_text) (no selector) from
# a beat, re-run browser_type_focused(text=message_text) from
# step 6. If that still fails after 2 retries, bail and surface —
# the modal may have been reclaimed by a stale state or auth wall.
raise Exception("Send button disabled after insertText — editor did not receive input")
@@ -323,9 +323,9 @@ If any of those show up, **stop the run, screenshot the state, and surface the i
## Common pitfalls
- **`innerHTML` injection is silently dropped** — LinkedIn's Trusted Types CSP discards any `innerHTML = "<...>"` from injected scripts, no console error. Always use `createElement` + `appendChild` + `setAttribute` for DOM injection. `textContent`, `style.cssText`, and `.value` assignments are fine.
- **Do NOT pass a selector to `browser_type` on the message composer — call it with NO selector (`browser_type(text=...)`).** The Lexical contenteditable lives inside the `#interop-outlet` shadow root which `document.querySelector` (what the selector-based path uses under the hood) cannot see. Attempts to work around this with `browser_shadow_query` fail because selector-based `browser_type` doesn't support the `>>>` shadow-pierce syntax. The reliable insert path is: (1) `browser_click_coordinate` on the composer rect — the response's `focused_element` confirms Lexical received focus → (2) `browser_type(text=message_text)` with NO selector — CDP `Input.insertText` dispatches to `document.activeElement` regardless of shadow wrapping. The old `browser_evaluate` + `document.execCommand('insertText', ...)` pattern worked but had JSON-escaping pitfalls and cost ~200 chars of JS per send; `browser_type(text=...)` is the same mechanism with built-in retry.
- **Per-char keyDown on the message composer produces empty text** — Lexical intercepts `beforeinput` and drops raw keys. Use `browser_type(text=..., use_insert_text=True)` with NO selector after click-coordinate focused the composer. The CDP `Input.insertText` method commits as if IME fired, which Lexical accepts cleanly. Do NOT pass a selector; selector-based `browser_type` can't see past `#interop-outlet`.
- **ANTI-PATTERN: "inject a dummy `<div id='dummy-target'>` and pass it as the `selector` arg to `browser_type`".** This looks tempting but fails compoundingly: `browser_type` clicks the **dummy div's** rect (not the editor's), the click lands on the Lexical wrapper's non-editable chrome, the contenteditable never receives focus, and `Input.insertText` fires against nothing. The bridge will still return `{"ok": true, "action": "type", "length": N}` because it has no way to verify the text actually landed. Symptom: Send button stays `disabled: true` forever. Fix: `browser_click_coordinate` on the real composer rect, then `browser_type(text=message_text)` with NO selector — CDP `Input.insertText` dispatches to `document.activeElement`. (See `session_20260414_114820_08bd3c4d` for the failed dummy-div attempt.)
- **Do NOT use selector-based `browser_type` on the message composer — use `browser_type_focused(text=...)`.** The Lexical contenteditable lives inside the `#interop-outlet` iframe/shadow wrapper which `document.querySelector` cannot see. `browser_shadow_query` can find it but selector-based `browser_type` doesn't support the `>>>` shadow-pierce syntax. The reliable insert path is: (1) `browser_click_coordinate` on the composer rect — the response's `focused_element` (which recurses into same-origin iframes) confirms what actually received focus → (2) `browser_type_focused(text=message_text)` — CDP `Input.insertText` dispatches to `document.activeElement` regardless of shadow wrapping.
- **Per-char keyDown on the message composer produces empty text** — Lexical intercepts `beforeinput` and drops raw keys. Use `browser_type_focused(text=..., use_insert_text=True)` after click-coordinate focused the composer. The CDP `Input.insertText` method commits as if IME fired, which Lexical accepts cleanly.
- **ANTI-PATTERN: "inject a dummy `<div id='dummy-target'>` and pass it as the `selector` arg to `browser_type`".** This fails compoundingly: `browser_type` clicks the **dummy div's** rect (not the editor's), the click lands on the Lexical wrapper's non-editable chrome, the contenteditable never receives focus, and `Input.insertText` fires against nothing. The bridge will still return `{"ok": true, "action": "type", "length": N}` because it has no way to verify the text actually landed. Symptom: Send button stays `disabled: true` forever. Fix: `browser_click_coordinate` on the real composer rect, then `browser_type_focused(text=message_text)` — CDP `Input.insertText` dispatches to `document.activeElement`.
- **Multiple Send buttons on the page** — the pinned bottom-right messaging bar has its own `msg-form__send-button` that's usually below `innerHeight`. Filter by in-viewport before clicking.
- **`window.onbeforeunload` hangs navigation/close** — after typing in a composer, any `browser_navigate` or `close_tab` can pop a native "unsent message, leave?" confirm dialog that deadlocks the bridge. Always strip `onbeforeunload` before any navigation, and wrap composer flows in a `try/finally` that runs the cleanup block:
+1 -1
View File
@@ -462,7 +462,7 @@ const CATEGORIES = {
'Navigation': ['browser_navigate', 'browser_go_back', 'browser_go_forward', 'browser_reload'],
'Interactions': ['browser_click', 'browser_click_coordinate', 'browser_type', 'browser_fill', 'browser_press', 'browser_press_at', 'browser_hover', 'browser_hover_coordinate', 'browser_select', 'browser_scroll'],
'Inspection': ['browser_screenshot', 'browser_snapshot', 'browser_console', 'browser_get_text', 'browser_evaluate', 'browser_wait'],
'Advanced': ['browser_resize', 'browser_upload', 'browser_dialog', 'browser_coords'],
'Advanced': ['browser_resize', 'browser_upload', 'browser_dialog'],
};
async function init() {
+46 -29
View File
@@ -80,33 +80,57 @@ async def _adaptive_poll_sleep(elapsed_s: float) -> None:
_interaction_highlights: dict[int, dict] = {}
# Compact descriptor of document.activeElement. Returned by both click()
# Compact descriptor of the focused element. Returned by both click()
# and click_coordinate() so the agent can verify it focused what it
# intended, then decide whether to follow up with browser_type(text=...,
# no selector). Keeping this as a single shared string avoids drift
# between the two click paths.
# intended. When the outer document's activeElement is an <iframe>,
# we recurse into the iframe's document (same-origin only) so the
# response describes the real inner element — otherwise the agent
# always sees {tag: "iframe"} and can't tell whether it hit the
# composer or something else inside the frame (e.g. a sidebar item in
# LinkedIn's #interop-outlet messaging overlay).
_FOCUSED_ELEMENT_JS = """
(function() {
function describe(el) {
var rect = el.getBoundingClientRect();
var attrs = {};
for (var i = 0; i < el.attributes.length && i < 10; i++) {
attrs[el.attributes[i].name] = el.attributes[i].value.substring(0, 200);
}
return {
tag: el.tagName.toLowerCase(),
id: el.id || null,
className: el.className || null,
name: el.getAttribute('name') || null,
type: el.getAttribute('type') || null,
role: el.getAttribute('role') || null,
contenteditable: el.getAttribute('contenteditable') || null,
text: (el.innerText || '').substring(0, 200),
value: (el.value !== undefined ? String(el.value).substring(0, 200) : null),
attributes: attrs,
rect: { x: rect.x, y: rect.y, width: rect.width, height: rect.height }
};
}
var el = document.activeElement;
if (!el || el === document.body) return null;
var rect = el.getBoundingClientRect();
var attrs = {};
for (var i = 0; i < el.attributes.length && i < 10; i++) {
attrs[el.attributes[i].name] = el.attributes[i].value.substring(0, 200);
// Descend into same-origin iframes. Capped at 5 levels of nesting
// to bound cost and prevent a pathological loop. If a frame is
// cross-origin, contentDocument throws; catch and report the
// outermost iframe instead.
var framePath = [];
var depth = 0;
while (el && (el.tagName === 'IFRAME' || el.tagName === 'FRAME') && depth < 5) {
framePath.push(el.id || el.getAttribute('data-testid') || el.tagName.toLowerCase());
var innerDoc = null;
try { innerDoc = el.contentDocument; } catch (e) { innerDoc = null; }
if (!innerDoc) break;
var innerActive = innerDoc.activeElement;
if (!innerActive || innerActive === innerDoc.body) break;
el = innerActive;
depth++;
}
return {
tag: el.tagName.toLowerCase(),
id: el.id || null,
className: el.className || null,
name: el.getAttribute('name') || null,
type: el.getAttribute('type') || null,
role: el.getAttribute('role') || null,
contenteditable: el.getAttribute('contenteditable') || null,
text: (el.innerText || '').substring(0, 200),
value: (el.value !== undefined ? String(el.value).substring(0, 200) : null),
attributes: attrs,
rect: { x: rect.x, y: rect.y, width: rect.width, height: rect.height }
};
var out = describe(el);
if (framePath.length) out.inFrame = framePath;
return out;
})()
"""
@@ -959,18 +983,11 @@ class BeelineBridge:
button_map = {"left": "left", "right": "right", "middle": "middle"}
cdp_button = button_map.get(button, "left")
from .tools.inspection import _screenshot_css_scales, _screenshot_scales
phys_scale = _screenshot_scales.get(tab_id, "unset")
css_scale = _screenshot_css_scales.get(tab_id, "unset")
logger.info(
"click_coordinate tab=%d: x=%.1f, y=%.1f → CDP Input.dispatchMouseEvent. "
"stored_scales: physicalScale=%s, cssScale=%s",
"click_coordinate tab=%d: x=%.1f, y=%.1f → CDP Input.dispatchMouseEvent",
tab_id,
x,
y,
phys_scale,
css_scale,
)
await self._cdp(
+84 -200
View File
@@ -12,7 +12,6 @@ import io
import json
import logging
import time
from typing import Literal
from fastmcp import FastMCP
from mcp.types import ImageContent, TextContent
@@ -23,32 +22,31 @@ from .tabs import _get_context
logger = logging.getLogger(__name__)
# Target width for normalized screenshots (px in the delivered image)
_SCREENSHOT_WIDTH = 600
# Maps tab_id -> physical scale: image_coord × scale = physical pixels (for CDP Input events)
_screenshot_scales: dict[int, float] = {}
# Maps tab_id -> CSS scale: image_coord × scale = CSS pixels (for DOM APIs / getBoundingClientRect)
_screenshot_css_scales: dict[int, float] = {}
def _resize_and_annotate(
data: str,
css_width: int,
dpr: float = 1.0,
highlights: list[dict] | None = None,
width: int = _SCREENSHOT_WIDTH,
) -> tuple[str, float, float]:
"""Resize a base64 PNG to _SCREENSHOT_WIDTH wide, annotate highlights.
) -> tuple[str, float]:
"""Resize a captured PNG so that image pixels == CSS pixels, then
re-encode as JPEG quality 75.
Returns (new_b64, physical_scale, css_scale) where:
physical_scale = physical_px_per_image_px (multiply image coords physical px)
css_scale = css_px_per_image_px (multiply image coords CSS px for DOM APIs)
Output is ``css_width × round(orig_h × css_width / orig_w)``. The
1:1 imageCSS mapping means a coord the agent reads off the image
is the same coord CDP expects no conversion, no scale factors to
remember. Highlight annotations are drawn directly in CSS px (which
equal image px after resize).
Highlights have x,y,w,h in CSS pixels (what getBoundingClientRect returns,
and what CDP Input.dispatchMouseEvent accepts).
Falls back to original data if Pillow unavailable or resize fails.
Returns ``(new_b64, physical_scale)`` where
``physical_scale = orig_png_w / css_width`` (= DPR). Kept for logs
and HiDPI debugging only.
"""
if not css_width or css_width <= 0:
# Capture path always supplies css_width; only reach here on a
# degraded bridge response. Return the raw image untouched.
return data, 1.0
try:
from PIL import Image, ImageDraw, ImageFont
except ImportError:
@@ -58,48 +56,39 @@ def _resize_and_annotate(
import struct
orig_w = struct.unpack(">I", raw[16:20])[0]
raw_size_bytes = len(raw)
physical_scale = orig_w / width if orig_w and width else 1.0
css_scale = (css_width / width) if css_width else (physical_scale / max(dpr, 1.0))
physical_scale = orig_w / css_width if orig_w else 1.0
logger.warning(
"PIL not available — screenshot resize SKIPPED (cannot downscale image). "
"raw_size=%d bytes, png_width=%d, css_width=%s, dpr=%s, target_width=%d. "
"Returning ORIGINAL image with computed scales: physicalScale=%.4f, cssScale=%.4f. "
"Agent must use browser_coords() to convert image positions before clicking.",
raw_size_bytes,
orig_w,
"PIL not available — screenshot resize+convert SKIPPED. "
"Returning original physical-px PNG. physicalScale=%.4f, "
"css_width=%d, dpr=%s. Clicks WILL be misaligned; install Pillow.",
physical_scale,
css_width,
dpr,
width,
physical_scale,
css_scale,
)
return data, round(physical_scale, 4), round(css_scale, 4)
return data, round(physical_scale, 4)
try:
raw = base64.b64decode(data)
img = Image.open(io.BytesIO(raw)).convert("RGBA")
orig_w, orig_h = img.size
physical_scale = orig_w / width
css_scale = (css_width / width) if css_width else (physical_scale / max(dpr, 1.0))
physical_scale = orig_w / css_width
new_w = css_width
new_h = round(orig_h * new_w / orig_w)
if (new_w, new_h) != img.size:
img = img.resize((new_w, new_h), Image.LANCZOS)
logger.info(
"Screenshot resize: orig=%dx%dtarget=%dx%d, css_width=%s, dpr=%s, physicalScale=%.4f, cssScale=%.4f",
"Screenshot: orig=%dx%dout=%dx%d (css_width=%d, dpr=%s), physicalScale=%.4f",
orig_w,
orig_h,
width,
round(orig_h * width / orig_w),
new_w,
new_h,
css_width,
dpr,
physical_scale,
css_scale,
)
new_w = width
new_h = round(orig_h * new_w / orig_w)
img = img.resize((new_w, new_h), Image.LANCZOS)
if highlights:
overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
draw = ImageDraw.Draw(overlay)
@@ -111,11 +100,11 @@ def _resize_and_annotate(
for h in highlights:
kind = h.get("kind", "rect")
label = h.get("label", "")
# Highlights are in CSS px → convert to image px
ix = h["x"] / css_scale
iy = h["y"] / css_scale
iw = h.get("w", 0) / css_scale
ih = h.get("h", 0) / css_scale
# Highlights are in CSS px. Image px == CSS px, no conversion.
ix = h["x"]
iy = h["y"]
iw = h.get("w", 0)
ih = h.get("h", 0)
if kind == "point":
cx, cy, r = ix, iy, 10
@@ -135,11 +124,9 @@ def _resize_and_annotate(
width=2,
)
# Label: show image pixel position so user knows where to look
img_coords = f"img:({round(ix)},{round(iy)})"
display_label = f"{img_coords} {label}" if label else img_coords
display_label = f"({round(ix)},{round(iy)}) {label}".strip()
lx, ly = ix, max(2, iy - 16)
lx = max(2, min(lx, width - 120))
lx = max(2, min(lx, new_w - 120))
bbox = draw.textbbox((lx, ly), display_label, font=font)
pad = 3
draw.rectangle(
@@ -153,22 +140,20 @@ def _resize_and_annotate(
img = img.convert("RGB")
buf = io.BytesIO()
img.save(buf, format="PNG", optimize=True)
img.save(buf, format="JPEG", quality=75, optimize=True)
return (
base64.b64encode(buf.getvalue()).decode(),
round(physical_scale, 4),
round(css_scale, 4),
)
except Exception:
logger.warning(
"Screenshot resize/annotate FAILED — returning original image with scale=1.0. "
"css_width=%s, dpr=%s, target_width=%d. Clicks will be misaligned.",
"Screenshot resize/annotate FAILED — returning original image. "
"css_width=%s, dpr=%s.",
css_width,
dpr,
width,
exc_info=True,
)
return data, 1.0, 1.0
return data, 1.0
def register_inspection_tools(mcp: FastMCP) -> None:
@@ -180,26 +165,24 @@ def register_inspection_tools(mcp: FastMCP) -> None:
profile: str | None = None,
full_page: bool = False,
selector: str | None = None,
image_type: Literal["png", "jpeg"] = "png",
annotate: bool = True,
width: int = _SCREENSHOT_WIDTH,
) -> list:
"""
Take a screenshot of the current page.
Returns a normalized image alongside text metadata (URL, size, scale
factors, etc.). Automatically annotates the last interaction (click,
hover, type) with a bounding box overlay.
The image is delivered at the CSS viewport's own dimensions, so
a pixel you see in the screenshot is the same coordinate you
pass to ``browser_click_coordinate`` / ``browser_hover_coordinate``
/ ``browser_press_at``. No conversion, no scale factors.
Output is JPEG quality 75 (~150250 KB for a typical UI).
Args:
tab_id: Chrome tab ID (default: active tab)
profile: Browser profile name (default: "default")
full_page: Capture full scrollable page (default: False)
selector: CSS selector to screenshot a specific element (optional)
image_type: Image format - png or jpeg (default: png)
annotate: Draw bounding box of last interaction on image (default: True)
width: Output image width in pixels (default: 600). Use 800+ for fine
text, 400 for quick layout checks.
Returns:
List of content blocks: text metadata + image
@@ -252,7 +235,6 @@ def register_inspection_tools(mcp: FastMCP) -> None:
return [TextContent(type="text", text=json.dumps(screenshot_result))]
data = screenshot_result.get("data")
mime_type = screenshot_result.get("mimeType", "image/png")
css_width = screenshot_result.get("cssWidth", 0)
dpr = screenshot_result.get("devicePixelRatio", 1.0)
@@ -263,45 +245,38 @@ def register_inspection_tools(mcp: FastMCP) -> None:
if annotate and target_tab in _interaction_highlights:
highlights = [_interaction_highlights[target_tab]]
# Normalize to 800px wide and annotate. Offloaded to a
# thread because PIL Image.open/resize/ImageDraw/composite on
# a 2-megapixel PNG blocks for ~150-300ms of CPU — plenty to
# freeze the asyncio event loop and delay every concurrent
# tool call during a screenshot. The function is reentrant
# (fresh PIL Image per call, no shared state), so to_thread
# is safe.
data, physical_scale, css_scale = await asyncio.to_thread(
# Resize to CSS-viewport dimensions so image px == CSS px,
# and re-encode as the chosen lossy format. Offloaded to a
# thread because PIL Image.open/resize/ImageDraw/composite
# on a 2-megapixel PNG blocks for ~150300 ms of CPU —
# plenty to freeze the asyncio event loop. The function is
# reentrant (fresh PIL Image per call, no shared state), so
# to_thread is safe.
data, physical_scale = await asyncio.to_thread(
_resize_and_annotate,
data,
css_width,
dpr,
highlights,
width,
)
_screenshot_scales[target_tab] = physical_scale
_screenshot_css_scales[target_tab] = css_scale
meta = json.dumps(
{
"ok": True,
"tabId": target_tab,
"url": screenshot_result.get("url", ""),
"imageType": mime_type.split("/")[-1],
"imageType": "jpeg",
"size": len(base64.b64decode(data)) if data else 0,
"imageWidth": width,
"imageWidth": css_width,
"fullPage": full_page,
"devicePixelRatio": dpr,
"physicalScale": physical_scale,
"cssScale": css_scale,
"annotated": bool(highlights),
"scaleHint": (
f"image_coord × {css_scale} = CSS px "
f"→ feed to browser_click_coordinate, "
f"browser_hover_coordinate, browser_press_at "
f"(CDP Input events use CSS pixels). "
f"image_coord × {physical_scale} = physical px "
f"is debug-only on HiDPI displays and must NOT "
f"be used for clicks — it overshoots by DPR×."
"Image pixel = CSS pixel. Feed any coord you see "
"in this image directly to browser_click_coordinate "
"/ browser_hover_coordinate / browser_press_at "
"no conversion needed."
),
}
)
@@ -313,17 +288,15 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"ok": True,
"size": len(base64.b64decode(data)) if data else 0,
"url": screenshot_result.get("url", ""),
"physicalScale": physical_scale,
"cssScale": css_scale,
"debug_cssWidth": css_width,
"debug_dpr": dpr,
"cssWidth": css_width,
"dpr": dpr,
},
duration_ms=(time.perf_counter() - start) * 1000,
)
return [
TextContent(type="text", text=meta),
ImageContent(type="image", data=data, mimeType=mime_type),
ImageContent(type="image", data=data, mimeType="image/jpeg"),
]
except Exception as e:
log_tool_call(
@@ -334,73 +307,6 @@ def register_inspection_tools(mcp: FastMCP) -> None:
)
return [TextContent(type="text", text=json.dumps({"ok": False, "error": str(e)}))]
@mcp.tool()
def browser_coords(
x: float,
y: float,
tab_id: int | None = None,
profile: str | None = None,
) -> dict:
"""
Convert screenshot image coordinates to browser click coordinates.
After browser_screenshot returns a downscaled image, use this to
translate pixel positions you see in the image into the CSS pixel
coordinates that Chrome DevTools Protocol expects.
**CDP Input.dispatchMouseEvent uses CSS pixels**, so you want
``css_x`` / ``css_y`` for every click/hover tool. ``physical_x/y``
is kept in the return for debugging on HiDPI displays do NOT
feed it to clicks; on a DPR=2 screen it lands 2× too far.
Edge case: pages using ``zoom`` or ``transform: scale()`` (e.g.
LinkedIn's ``#interop-outlet`` shadow DOM) render in a scaled
local coordinate space. For those, ``getBoundingClientRect()``
reports pre-zoom coordinates and you may still need to multiply
by the element's effective zoom. Use browser_shadow_query to
get the zoomed rect directly.
Args:
x: X pixel position in the screenshot image
y: Y pixel position in the screenshot image
tab_id: Chrome tab ID (default: active tab for profile)
profile: Browser profile name (default: "default")
Returns:
Dict with css_x, css_y (primary use these), physical_x,
physical_y (debug only), and scale factors.
"""
ctx = _get_context(profile)
target_tab = tab_id or (ctx.get("activeTabId") if ctx else None)
physical_scale = _screenshot_scales.get(target_tab, 1.0) if target_tab else 1.0
# css_scale stored in second slot via _screenshot_css_scales
css_scale = _screenshot_css_scales.get(target_tab, physical_scale) if target_tab else physical_scale
return {
"ok": True,
# Primary output: CSS pixels. Feed these to click/hover/press.
"css_x": round(x * css_scale, 1),
"css_y": round(y * css_scale, 1),
# Debug output: raw physical pixels. DO NOT feed to clicks on
# HiDPI displays — CDP Input events use CSS pixels, so sending
# physical coordinates lands the click at roughly DPR× the
# intended position.
"physical_x": round(x * physical_scale, 1),
"physical_y": round(y * physical_scale, 1),
"physicalScale": physical_scale,
"cssScale": css_scale,
"tabId": target_tab,
"note": (
"Use css_x/css_y with browser_click_coordinate, "
"browser_hover_coordinate, browser_press_at — "
"Chrome DevTools Protocol Input.dispatchMouseEvent "
"operates in CSS pixels. physical_x/y is for debugging "
"on HiDPI displays only; feeding it to clicks lands "
"them at DPR× the intended coordinate."
),
}
@mcp.tool()
async def browser_shadow_query(
selector: str,
@@ -412,7 +318,11 @@ def register_inspection_tools(mcp: FastMCP) -> None:
Traverses shadow roots to find elements inside closed/open shadow DOM,
overlays, and virtual-rendered components (e.g. LinkedIn's #interop-outlet).
Returns getBoundingClientRect in both CSS and physical pixels.
Returns the element's bounding rect in CSS pixels. Screenshot
pixels == CSS pixels, so the same numbers also match whatever
the agent sees in a browser_screenshot feed ``css.cx/cy``
straight to browser_click_coordinate / hover_coordinate /
press_at.
Args:
selector: CSS selectors joined by ' >>> ' to pierce shadow roots.
@@ -421,7 +331,7 @@ def register_inspection_tools(mcp: FastMCP) -> None:
profile: Browser profile name (default: "default")
Returns:
Dict with rect (CSS px) and physical rect (CSS px × DPR) of the element
Dict with ``css`` block (x, y, w, h, cx, cy).
"""
bridge = get_bridge()
if not bridge or not bridge.is_connected:
@@ -438,10 +348,6 @@ def register_inspection_tools(mcp: FastMCP) -> None:
return result
rect = result["rect"]
physical_scale = _screenshot_scales.get(target_tab, 1.0)
css_scale = _screenshot_css_scales.get(target_tab, 1.0)
dpr = physical_scale / css_scale if css_scale else 1.0
return {
"ok": True,
"selector": selector,
@@ -454,20 +360,11 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"cx": rect["cx"],
"cy": rect["cy"],
},
"physical": {
"x": round(rect["x"] * dpr, 1),
"y": round(rect["y"] * dpr, 1),
"w": round(rect["w"] * dpr, 1),
"h": round(rect["h"] * dpr, 1),
"cx": round(rect["cx"] * dpr, 1),
"cy": round(rect["cy"] * dpr, 1),
},
"note": (
"Use css.cx/cy with browser_click_coordinate, "
"browser_hover_coordinate, browser_press_at — "
"CDP Input events operate in CSS pixels. "
"physical.* is debug-only; feeding it to clicks "
"lands them DPR× too far on HiDPI displays."
"Pass css.cx/cy browser_click_coordinate / "
"hover_coordinate / press_at. Screenshot pixels == CSS "
"pixels, so these coords also match anything you see in "
"browser_screenshot."
),
}
@@ -480,11 +377,11 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"""
Get the bounding rect of an element by CSS selector.
Supports '>>>' shadow-piercing selectors for overlay/shadow DOM content.
Returns coordinates in CSS pixels (for clicks and DOM APIs); the
physical-pixel variant is returned for debugging on HiDPI displays
only it must not be fed to click/hover/press tools, which use
CSS pixels.
Supports '>>>' shadow-piercing selectors for overlay/shadow DOM
content. Returns the rect in CSS pixels. Screenshot pixels ==
CSS pixels, so the same numbers match anything visible in
browser_screenshot feed ``css.cx/cy`` straight to
browser_click_coordinate / hover_coordinate / press_at.
Args:
selector: CSS selector, optionally with ' >>> ' to pierce shadow roots.
@@ -493,7 +390,7 @@ def register_inspection_tools(mcp: FastMCP) -> None:
profile: Browser profile name (default: "default")
Returns:
Dict with css and physical bounding rects
Dict with ``css`` block (x, y, w, h, cx, cy).
"""
bridge = get_bridge()
if not bridge or not bridge.is_connected:
@@ -510,10 +407,6 @@ def register_inspection_tools(mcp: FastMCP) -> None:
return result
rect = result["rect"]
physical_scale = _screenshot_scales.get(target_tab, 1.0)
css_scale = _screenshot_css_scales.get(target_tab, 1.0)
dpr = physical_scale / css_scale if css_scale else 1.0
return {
"ok": True,
"selector": selector,
@@ -526,20 +419,11 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"cx": rect["cx"],
"cy": rect["cy"],
},
"physical": {
"x": round(rect["x"] * dpr, 1),
"y": round(rect["y"] * dpr, 1),
"w": round(rect["w"] * dpr, 1),
"h": round(rect["h"] * dpr, 1),
"cx": round(rect["cx"] * dpr, 1),
"cy": round(rect["cy"] * dpr, 1),
},
"note": (
"Use css.cx/cy with browser_click_coordinate, "
"browser_hover_coordinate, browser_press_at — "
"CDP Input events operate in CSS pixels. "
"physical.* is debug-only; feeding it to clicks "
"lands them DPR× too far on HiDPI displays."
"Pass css.cx/cy browser_click_coordinate / "
"hover_coordinate / press_at. Screenshot pixels == CSS "
"pixels, so these coords also match anything you see in "
"browser_screenshot."
),
}
+21 -25
View File
@@ -108,24 +108,24 @@ def register_interaction_tools(mcp: FastMCP) -> None:
button: Literal["left", "right", "middle"] = "left",
) -> dict:
"""
Click at specific viewport coordinates (CSS pixels).
Click at viewport coordinates.
Chrome DevTools Protocol's Input.dispatchMouseEvent operates in
**CSS pixels**, not physical pixels. If you have a screenshot
image coordinate, convert it with ``browser_coords(x, y)`` and
use the returned ``css_x`` / ``css_y`` not ``physical_x/y``.
On a DPR=2 display, feeding physical coordinates lands the click
at 2× the intended position.
Screenshots are delivered at the CSS viewport's own dimensions
(see ``browser_screenshot``), so a pixel you read off the image
is the same number you pass here no conversion, no scale
factors. ``browser_get_rect`` likewise returns coords you can
feed straight through.
Args:
x: X coordinate in CSS pixels (viewport space)
y: Y coordinate in CSS pixels (viewport space)
x: X coordinate in the screenshot / CSS viewport.
y: Y coordinate in the screenshot / CSS viewport.
tab_id: Chrome tab ID (default: active tab)
profile: Browser profile name (default: "default")
button: Mouse button to click (left, right, middle)
Returns:
Dict with click result
Dict with click result, including ``focused_element``
describing what the click focused.
"""
start = time.perf_counter()
params = {"x": x, "y": y, "tab_id": tab_id, "profile": profile, "button": button}
@@ -149,17 +149,11 @@ def register_interaction_tools(mcp: FastMCP) -> None:
return result
try:
from .inspection import _screenshot_css_scales, _screenshot_scales
click_result = await bridge.click_coordinate(target_tab, x, y, button=button)
log_tool_call(
"browser_click_coordinate",
params,
result={
**click_result,
"debug_stored_physicalScale": _screenshot_scales.get(target_tab, "unset"),
"debug_stored_cssScale": _screenshot_css_scales.get(target_tab, "unset"),
},
result=click_result,
duration_ms=(time.perf_counter() - start) * 1000,
)
return click_result
@@ -484,15 +478,16 @@ def register_interaction_tools(mcp: FastMCP) -> None:
profile: str | None = None,
) -> dict:
"""
Hover at CSS pixel coordinates without needing a CSS selector.
Hover at viewport coordinates without needing a CSS selector.
Use this instead of browser_hover when the element is in an overlay,
shadow DOM, or virtual-rendered component that isn't in the regular DOM.
Pair with browser_coords to convert screenshot image positions to CSS pixels.
Screenshot pixels == CSS pixels, so any coord you read off a
browser_screenshot image can be fed straight through.
Args:
x: CSS pixel X coordinate
y: CSS pixel Y coordinate
x: X coordinate in the screenshot / CSS viewport.
y: Y coordinate in the screenshot / CSS viewport.
tab_id: Chrome tab ID (default: active tab)
profile: Browser profile name (default: "default")
@@ -548,16 +543,17 @@ def register_interaction_tools(mcp: FastMCP) -> None:
profile: str | None = None,
) -> dict:
"""
Move mouse to CSS pixel coordinates then press a key.
Move mouse to viewport coordinates then press a key.
Use this instead of browser_press when the focused element is in an overlay
or virtual-rendered component. Moving the mouse first routes the key event
through native browser hit-testing instead of the DOM focus chain.
Pair with browser_coords to convert screenshot image positions to CSS pixels.
Screenshot pixels == CSS pixels, so coords read off a
browser_screenshot image can be fed straight through.
Args:
x: CSS pixel X coordinate to position mouse
y: CSS pixel Y coordinate to position mouse
x: X coordinate in the screenshot / CSS viewport.
y: Y coordinate in the screenshot / CSS viewport.
key: Key to press (e.g. 'Enter', 'Space', 'Escape', 'ArrowDown')
tab_id: Chrome tab ID (default: active tab)
profile: Browser profile name (default: "default")