Compare commits
4 Commits

| Author | SHA1 | Date |
|---|---|---|
| | bb2e55f5fa | |
| | 5d0106ac34 | |
| | 3a219e27ab | |
| | 7f62e7a2d0 | |
@@ -25,7 +25,6 @@ All tools are prefixed with `browser_`:
 - `browser_screenshot` — visual capture (annotated PNG)
 <!-- /vision-only -->
 - `browser_shadow_query`, `browser_get_rect` — locate elements (shadow-piercing via `>>>`)
-- `browser_coords` — convert image pixels to CSS pixels (always use `css_x/y`, never `physical_x/y`)
 - `browser_scroll`, `browser_wait` — navigation helpers
 - `browser_evaluate` — run JavaScript
 - `browser_close`, `browser_close_finished` — tab cleanup
@@ -38,9 +37,9 @@ All tools are prefixed with `browser_`:

 Neither tool is "preferred" universally — they're for different jobs. Default to snapshot on text-heavy static pages, screenshot on SPAs and anything shadow-DOM-heavy. Activate the `browser-automation` skill for the full decision tree.

-## Coordinate rule: always CSS pixels
+## Coordinate rule

-Chrome DevTools Protocol `Input.dispatchMouseEvent` takes **CSS pixels**, not physical pixels. After a screenshot, use `browser_coords(image_x, image_y)` and feed the returned `css_x/y` (NOT `physical_x/y`) to `browser_click_coordinate`, `browser_hover_coordinate`, `browser_press_at`. Feeding physical pixels on a HiDPI display (DPR=1.6, 2, or 3) overshoots by `DPR×` and clicks land in the wrong place. `getBoundingClientRect()` already returns CSS pixels — pass through unchanged, no DPR multiplication.
+`browser_screenshot` delivers the image at the CSS viewport's own dimensions, so a pixel you read off the screenshot is the same coordinate `browser_click_coordinate`, `browser_hover_coordinate`, and `browser_press_at` expect — no conversion. `getBoundingClientRect()` likewise returns CSS pixels; pass through unchanged.

 ## System prompt tips for browser nodes
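As a small illustration of the pass-through rule in the hunk above, here is a sketch that turns a `getBoundingClientRect()`-style rect into a click point. The helper name `rect_center` and the plain-dict rect shape are assumptions made for this example, not part of the tool API.

```python
def rect_center(rect: dict) -> tuple[float, float]:
    """Return the centre of a CSS-pixel rect as (x, y).

    `rect` is assumed to look like {"x", "y", "width", "height"}, the shape
    getBoundingClientRect() reports. The values are already CSS pixels, so
    nothing is multiplied by devicePixelRatio here.
    """
    return (rect["x"] + rect["width"] / 2, rect["y"] + rect["height"] / 2)


# Hypothetical rect for a search box; the numbers are made up.
search_box = {"x": 100, "y": 40, "width": 200, "height": 20}
x, y = rect_center(search_box)
print(x, y)  # these values would be passed to browser_click_coordinate unchanged
```

The same arithmetic applies to a pixel read off a screenshot: under the new rule it is used as-is.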
@@ -42,22 +42,14 @@ after an interaction unless you need a fresh view.
 Only fall back to `browser_get_text` for extracting small elements by
 CSS selector.

-## Coordinates: always CSS pixels
+## Coordinates

-Chrome DevTools Protocol `Input.dispatchMouseEvent` takes **CSS
-pixels**, not physical pixels. This is critical and often gets wrong:
-
-| Tool | Unit |
-|---|---|
-| `browser_click_coordinate(x, y)` | **CSS pixels** |
-| `browser_hover_coordinate(x, y)` | **CSS pixels** |
-| `browser_press_at(x, y, key)` | **CSS pixels** |
-| `getBoundingClientRect()` | already CSS pixels — pass straight through |
-| `browser_coords(img_x, img_y)` | returns `css_x/y` (use this) and `physical_x/y` (debug only) |
-
-**Always use `css_x/y`** from `browser_coords`. Feeding `physical_x/y`
-on a HiDPI display overshoots by `DPR×` — clicks land DPR times too
-far right and down. On a DPR=1.6 display that's 60% off.
+`browser_screenshot` delivers the image at the CSS viewport's own
+dimensions, so a pixel you read off the screenshot is the same number
+you pass to `browser_click_coordinate` / `browser_hover_coordinate` /
+`browser_press_at`. `browser_get_rect` and `browser_shadow_query` also
+return CSS px — feed `rect.css.cx` / `rect.css.cy` straight through.
+No scale factors to remember.

 Never multiply `getBoundingClientRect()` by `devicePixelRatio` — it's
 already in the right unit.
@@ -86,11 +78,10 @@ reach shadow elements transparently.

 **Shadow-heavy site workflow:**
 1. `browser_screenshot()` → visual image
-2. Identify target visually → image coordinate
-3. `browser_coords(x, y)` → CSS px
-4. `browser_click_coordinate(css_x, css_y)` → lands via native hit
-   test; inputs get focused regardless of shadow depth
-5. Type via `browser_type_focused` (no selector needed — types into the
+2. Identify target visually → pixel `(x, y)` read straight off the image
+3. `browser_click_coordinate(x, y)` → lands via native hit test; inputs
+   get focused regardless of shadow depth
+4. Type via `browser_type_focused` (no selector needed — types into the
    already-focused element), or `browser_type` if you have a selector

 For selector-style access when you know the shadow path:
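The shadow-heavy workflow in the hunk above can be sketched as one function. The real `browser_*` tools are not importable here, so the sketch takes them as callables. `click_then_type`, `fake_click`, and `fake_type_focused` are hypothetical names, and the response shape is assumed from the `focused_element` descriptor that appears elsewhere in this diff.

```python
from typing import Callable, Optional


def click_then_type(
    click: Callable[[float, float], dict],
    type_focused: Callable[[str], None],
    x: float,
    y: float,
    text: str,
) -> bool:
    """Click a pixel read off the screenshot, then type only if the focused
    element looks editable. Returns True when text was sent."""
    resp = click(x, y)
    fe: Optional[dict] = resp.get("focused_element")
    if not fe:
        return False
    # The descriptor carries the raw attribute string, so "false" must not
    # be treated as editable even though it is a truthy value.
    ce = fe.get("contenteditable")
    editable = fe.get("tag") in ("input", "textarea") or (ce is not None and ce != "false")
    if editable:
        type_focused(text)
    return editable


# Stand-ins for the real tools so the sketch runs on its own.
typed: list[str] = []


def fake_click(x: float, y: float) -> dict:
    # Pretend the input occupies everything left of x=500.
    tag = "input" if x < 500 else "a"
    return {"focused_element": {"tag": tag}}


def fake_type_focused(text: str) -> None:
    typed.append(text)


print(click_then_type(fake_click, fake_type_focused, 320, 48, "hello"))    # True
print(click_then_type(fake_click, fake_type_focused, 900, 48, "ignored"))  # False
```

When the function returns False, the remedy described in this diff is to pick a better pixel and click again rather than switch typing methods.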
@@ -1,6 +1,6 @@
 ---
 name: hive.browser-automation
-description: Required before any browser_* tool call. Teaches the screenshot + browser_click_coordinate workflow that reaches shadow-DOM inputs selectors can't see, the CSS-pixel coordinate rule (not physical px), rich-text editor quirks ("send button stays disabled" failures), and CSP gotchas. Covers Chrome via CDP through the GCU Beeline extension. Skipping this causes repeated failures on LinkedIn / Reddit / X. Verified against real production sites 2026-04-11.
+description: Required before any browser_* tool call. Teaches the screenshot + browser_click_coordinate workflow that reaches shadow-DOM inputs selectors can't see, rich-text editor quirks ("send button stays disabled" failures), and CSP gotchas. Covers Chrome via CDP through the GCU Beeline extension. Skipping this causes repeated failures on LinkedIn / Reddit / X. Verified against real production sites 2026-04-11.
 metadata:
   author: hive
   type: default-skill
@@ -12,25 +12,20 @@ metadata:

 All GCU browser tools drive a real Chrome instance through the Beeline extension and Chrome DevTools Protocol (CDP). That means clicks, keystrokes, and screenshots are processed by the actual browser's native hit testing, focus, and layout engines — **not** a synthetic event layer. Understanding this unlocks strategies that make hard sites easy.

-## Coordinates: always CSS pixels
+## Coordinates

-**Chrome DevTools Protocol `Input.dispatchMouseEvent` operates in CSS pixels, not physical pixels.**
-
-When you call `browser_coords(image_x, image_y)` after a screenshot, the returned dict has both `css_x/y` and `physical_x/y`. **Always use `css_x/y` for clicks, hovers, and key presses.**
+Screenshots are delivered at the CSS viewport's own dimensions. A pixel you see in the screenshot is the same coordinate `browser_click_coordinate` expects — no conversion, no scale factors.

 ```
-browser_screenshot() → image (downscaled to 800/900 px wide)
-browser_coords(img_x, img_y) → {css_x, css_y, physical_x, physical_y}
-browser_click_coordinate(css_x, css_y) ← USE css_x/y
-browser_hover_coordinate(css_x, css_y) ← USE css_x/y
-browser_press_at(css_x, css_y, key) ← USE css_x/y
+browser_screenshot() → image at CSS-viewport size (JPEG)
+browser_click_coordinate(x, y) → same (x, y)
+browser_hover_coordinate(x, y) → same (x, y)
+browser_press_at(x, y, key) → same (x, y)
+browser_get_rect(selector) → rect.css → pass rect.css.cx, rect.css.cy to any of the above
+browser_shadow_query(...) → sq.css → same
 ```

-Feeding `physical_x/y` on a HiDPI display overshoots by DPR× — on a DPR=1.6 laptop, clicks land 60% too far right and down. The ratio between `physicalScale` and `cssScale` tells you the effective DPR.
-
-`getBoundingClientRect()` already returns CSS pixels — feed those values straight through to click/hover tools without any DPR multiplication.
-
-**Exception for zoomed elements:** pages that use `zoom` or `transform: scale()` on a container (LinkedIn's `#interop-outlet`, some embedded iframes) render in a scaled local coordinate space. `getBoundingClientRect` there may not match CDP's hit space. Use `browser_shadow_query` which handles the math, or fall back to visually picking coordinates from a screenshot.
+**Exception for zoomed elements:** pages that use `zoom` or `transform: scale()` on a container (LinkedIn's `#interop-outlet`, some embedded iframes) render in a scaled local coordinate space. `getBoundingClientRect` there may not match CDP's hit space. Use `browser_shadow_query` which handles the math, or visually pick coordinates from a screenshot.

 ## Screenshot + coordinates is shadow-agnostic — prefer it on shadow-heavy sites
@@ -46,29 +41,28 @@ Whereas `wait_for_selector`, `browser_click(selector=...)`, `browser_type(select

 ### Recommended workflow on shadow-heavy sites

-1. `browser_screenshot()` → visual image
-2. Identify the target visually → image pixel `(x, y)` (eyeball from the screenshot)
-3. `browser_coords(x, y)` → convert to CSS px
-4. `browser_click_coordinate(css_x, css_y)` → lands on the element via native hit testing; inputs get focused. **The response now includes `focused_element: {tag, id, role, contenteditable, rect, ...}`** — use it to verify you actually focused what you intended.
-5. `browser_type(text="...")` with **NO selector** → dispatches CDP `Input.insertText` to `document.activeElement`. Shadow roots, iframes, Lexical, Draft.js, ProseMirror all just work. Only pass a selector if you want a DIFFERENT element than the one you just focused (rare).
-6. Verify via `browser_screenshot` OR `browser_get_attribute` on a known-reachable marker (e.g. check that the Send button's `aria-disabled` flipped to `false`).
+1. `browser_screenshot()` → visual image (delivered at the CSS-viewport's own dimensions).
+2. Identify the target visually → pixel `(x, y)` read straight off the image.
+3. `browser_click_coordinate(x, y)` → clicks there. **The response includes `focused_element: {tag, id, role, contenteditable, rect, ...}`** — use it to verify you actually focused what you intended.
+4. `browser_type_focused(text="...")` → dispatches CDP `Input.insertText` to `document.activeElement`. Shadow roots, iframes, Lexical, Draft.js, ProseMirror all just work. Use `browser_type(selector, text)` only when you want to target a different element than the one you just focused.
+5. Verify via `browser_screenshot` OR `browser_get_attribute` on a known-reachable marker (e.g. check that the Send button's `aria-disabled` flipped to `false`).

 ### The click→type loop (canonical pattern)

 ```
-resp = browser_click_coordinate(x, y)
+resp = browser_click_coordinate(x, y) # x, y read straight off the screenshot
 fe = resp.get("focused_element")
 if fe and (fe.get("contenteditable") or fe["tag"] in ("textarea", "input")):
-    browser_type(text="...") # no selector — insertText to activeElement
+    browser_type_focused(text="...") # insertText to activeElement
 else:
-    # you clicked something that isn't editable — refine coords and retry
-    # do NOT reach for browser_evaluate + execCommand('insertText', ...)
+    # you clicked something that isn't editable — refine the pixel and retry.
+    # Do NOT reach for browser_evaluate + execCommand('insertText', ...)
     # or a walk(root) shadow traversal. The problem is your click, not
     # the typing method.
     ...
 ```

-`browser_click` (selector-based) also returns `focused_element` now, so the same check works whether you clicked by selector or coordinate.
+`browser_click` (selector-based) also returns `focused_element`, so the same check works whether you clicked by selector or by coordinate.

 ### Empirically verified (2026-04-11)
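The final step of the recommended workflow above (verify through a known-reachable marker) can be sketched as a bounded poll. `wait_until_enabled` and the `read_attr` callable are hypothetical. In practice `read_attr` would wrap something like `browser_get_attribute` on the Send button, whose exact signature is not shown in this diff.

```python
import time
from typing import Callable, Optional


def wait_until_enabled(
    read_attr: Callable[[], Optional[str]],
    attempts: int = 5,
    delay_s: float = 0.2,
) -> bool:
    """Poll an aria-disabled style attribute until it reads "false".

    Returns True as soon as the marker reports enabled, and False once
    the attempts run out.
    """
    for attempt in range(attempts):
        if read_attr() == "false":
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # give the editor time to commit its state
    return False


# Stand-in reader: reports "true" twice, then "false".
_values = iter(["true", "true", "false"])
print(wait_until_enabled(lambda: next(_values), attempts=5, delay_s=0.0))  # True
```

A False result corresponds to the "Send button stays disabled" symptom: the text did not reach the editor, so the click should be repeated before typing again.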
@@ -154,7 +148,7 @@ The symptom is always the same: **you type, the characters appear visually, and
 ```
 # 1. Focus the real element via a real click (not JS .focus()).
 rect = browser_get_rect(selector) # or browser_shadow_query for shadow sites
-browser_click_coordinate(rect.cx, rect.cy)
+browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
 sleep(0.5) # let the editor open / focus settle

 # 2. Type. browser_type now uses CDP Input.insertText by default, which is
@@ -183,7 +177,7 @@ if not state['disabled']:
 else:
     # Recovery: sometimes a click-again + one extra keystroke nudges
     # React into recomputing hasRealContent.
-    browser_click_coordinate(rect.cx, rect.cy)
+    browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
     browser_press("End")
     browser_press(" ")
     browser_press("Backspace")
@@ -266,25 +260,15 @@ Recognized without modifiers: `Enter`, `Tab`, `Escape`, `Backspace`, `Delete`, `
 ## Screenshots

 ```
-browser_screenshot() # viewport, 900 px wide by default
+browser_screenshot() # viewport, CSS-sized JPEG
 browser_screenshot(full_page=True) # full scrollable page
 browser_screenshot(selector="#header") # clip to element's rect
 ```

-Returns a PNG with automatic downscaling to a target width (default 900 px) plus a JSON metadata block containing `cssWidth`, `devicePixelRatio`, `physicalScale`, `cssScale`, and a `scaleHint` string. The image is also annotated with a highlight rectangle/dot showing the last interaction (click, hover, type) if one happened on this tab.
+Returns a JPEG (quality 75, ~150–250 KB for a typical UI) at the CSS viewport's own dimensions, plus a JSON metadata block containing `cssWidth`, `devicePixelRatio`, `imageWidth` (= `cssWidth`), and a `scaleHint` confirming image-px == CSS-px. The image is annotated with a highlight rectangle/dot showing the last interaction (click, hover, type) if one happened on this tab.

 The highlight overlay stays visible on the page for **10 seconds** after each interaction, then fades. Before a screenshot is likely, make sure your click / hover / type happens <10 s before the screenshot.

-### Anatomy of the scale fields
-
-- `cssWidth` = `window.innerWidth` (CSS px)
-- `devicePixelRatio` = `window.devicePixelRatio` (often 1.6, 2, or 3 on modern displays)
-- `physicalScale = png_width / image_width` (how many physical-px per image-px)
-- `cssScale = cssWidth / image_width` (how many CSS-px per image-px)
-- Effective DPR = `physicalScale / cssScale` (should match `devicePixelRatio`)
-
-When converting image coordinates for clicks, always use `cssScale`. The `physicalScale` field is there for debugging HiDPI displays, not for inputs.
-
 ## Scrolling

 - Use large scroll amounts (~2000) when loading more content — sites like Twitter and LinkedIn have lazy loading for paging.
@@ -339,7 +323,7 @@ LinkedIn enforces **strict Trusted Types CSP**. Any script you inject via `brows
 Reddit's search input lives **two shadow levels deep** inside `reddit-search-large > faceplate-search-input`. You cannot reach it with `browser_type(selector=)`. The working pattern:

 1. `browser_shadow_query("reddit-search-large >>> #search-input")` → rect
-2. `browser_click_coordinate(rect.cx, rect.cy)` → click lands on the real shadow input via native hit testing; input becomes focused
+2. `browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair` → click lands on the real shadow input via native hit testing; input becomes focused
 3. `browser_press(c)` for each character → dispatches to focused element
 4. Verify by reading `.value` via `browser_evaluate` walking the shadow path
@@ -409,7 +393,7 @@ Then pass the most specific selector that uniquely identifies the right input (e
 - **Typing into a rich-text editor without clicking first → send button stays disabled.** Draft.js (X), Lexical (Gmail, LinkedIn DMs), ProseMirror (Reddit), and React-controlled `contenteditable` elements only register input as "real" when the element received a native focus event — JS-sourced `.focus()` is not enough. `browser_type` now does this automatically via a real CDP pointer click before inserting text, but always verify the submit button's `disabled` state before clicking send. See the "ALWAYS click before typing" section above.
 - **Using per-character `keyDown` on Lexical / Draft.js editors → keys dispatch but text never appears.** Those editors intercept `beforeinput` and route insertion through their own state machine; raw keyDown events are silently dropped. `browser_type` now uses `Input.insertText` by default (the CDP IME-commit method) which these editors accept cleanly. Only set `use_insert_text=False` when you explicitly need per-keystroke dispatch.
 - **Leaving a composer with text then trying to navigate → `beforeunload` dialog hangs the bridge.** LinkedIn and several other sites pop a native "unsent message" confirm. `browser_navigate` and `close_tab` both time out against this. Always strip `window.onbeforeunload = null` via `browser_evaluate` before any navigation after typing in a composer, or wrap your logic in a `try/finally` that runs the cleanup block.
-- **Clicking at physical pixels.** CDP uses CSS px. `browser_coords` returns both for debugging, but always feed `css_x/y` to click tools.
+- **Click landed in the wrong region (sidebar / header instead of target).** The `focused_element` in the click response shows what actually got focused (e.g. `className: "msg-conversation-listitem__link"` means you hit the messaging sidebar). Treat it as ground truth — if it isn't the target, adjust the pixel and retry. Screenshot pixels equal CSS pixels, so the number you passed is the number CDP clicked; a wrong result means you picked the wrong pixel, not that any conversion went sideways.
 - **Calling `wait_for_selector` on a shadow element.** It'll always time out. Use `browser_shadow_query` or the screenshot + coordinate strategy.
 - **Relying on `innerHTML` in injected scripts on LinkedIn.** Silently discarded. Use `createElement` + `appendChild`.
 - **Not waiting for SPA hydration.** `wait_until="load"` fires before React/Vue rendering on many sites. Add a 2–3 s sleep before querying for chrome elements.
@@ -461,7 +445,7 @@ browser_navigate("https://x.com/explore", wait_until="load")
 sleep(3)
 browser_wait_for_selector("input[data-testid='SearchBox_Search_Input']", timeout_ms=5000)
 rect = browser_get_rect("input[data-testid='SearchBox_Search_Input']")
-browser_click_coordinate(rect.cx, rect.cy)
+browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
 browser_type("input[data-testid='SearchBox_Search_Input']", "openai", clear_first=True)
 # Screenshot now shows live search suggestions
 browser_screenshot()
@@ -475,7 +459,7 @@ browser_navigate("https://www.reddit.com/r/programming/", wait_until="load")
 sleep(2)
 # Shadow-pierce the nested search input
 sq = browser_shadow_query("reddit-search-large >>> #search-input")
-browser_click_coordinate(sq.rect.cx, sq.rect.cy)
+browser_click_coordinate(sq.css.cx, sq.css.cy) # sq.css.cx/cy — matched pair
 # Typing can't use selector (shadow); focused input receives raw key presses
 for c in "python":
     browser_press(c)
@@ -490,7 +474,7 @@ browser_navigate("https://www.linkedin.com/feed/", wait_until="load", timeout_ms
 sleep(3)
 browser_wait_for_selector("input[data-testid='typeahead-input']", timeout_ms=5000)
 rect = browser_get_rect("input[data-testid='typeahead-input']")
-browser_click_coordinate(rect.cx, rect.cy)
+browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
 browser_type("input[data-testid='typeahead-input']", "anthropic", clear_first=True)
 # Dropdown shows real live suggestions
 browser_screenshot()
@@ -34,7 +34,7 @@ LinkedIn is the hardest mainstream site to automate because it combines **shadow
 | Pending connection card | `.invitation-card, .invitations-card, [data-test-incoming-invitation-card]` | Filter out "invited you to follow" / "subscribe" cards |
 | Accept button | `button[aria-label*="Accept"]` within the card scope | Per-card scoping is critical — there are many Accept buttons on the page |

-LinkedIn changes class names aggressively. If a class-based selector breaks, fall back to **`browser_screenshot` → visual identification → `browser_coords` → `browser_click_coordinate`**. The screenshot + coord path works regardless of class-name churn and regardless of shadow DOM.
+LinkedIn changes class names aggressively. If a class-based selector breaks, fall back to **`browser_screenshot` → visual identification → `browser_click_coordinate`** with the pixel you read straight off the image (screenshots are CSS-sized, so no conversion). The screenshot + coord path works regardless of class-name churn and regardless of shadow DOM.

 ## Profile Message flow (verified end-to-end 2026-04-11)
@@ -108,7 +108,7 @@ sleep(0.6)
 # Do NOT pass a selector here. Selector-based browser_type cannot see
 # past the #interop-outlet shadow root. No-selector mode sidesteps
 # that entirely by routing to activeElement.
-browser_type(text=message_text) # no selector — targets document.activeElement
+browser_type_focused(text=message_text) # targets document.activeElement
 sleep(1.0) # let Lexical commit state + enable Send button

 # 7. Find the modal Send button (filter by in-viewport, reject pinned bar)
@@ -143,7 +143,7 @@ send = browser_evaluate("""

 # 8. ONLY click Send if it's enabled — if disabled, the insertText
 # didn't land. DO NOT retry with a different tool; the fix is
-# always: re-click the composer rect, re-run browser_type(text=...),
+# always: re-click the composer rect, re-run browser_type_focused(text=...),
 # re-check. The Send button's `disabled` state IS the ground truth —
 # if Lexical registered your text, it enables the button. If it's
 # still disabled, your text did not reach the editor, regardless
@@ -153,7 +153,7 @@ if send['disabled']:
     # fall back to browser_type with a selector (see anti-pattern in
     # Common Pitfalls — selector-based type can't reach the shadow-DOM
     # composer). Instead: re-click the textarea rect from step 4, wait
-    # a beat, re-run browser_type(text=message_text) (no selector) from
+    # a beat, re-run browser_type_focused(text=message_text) from
     # step 6. If that still fails after 2 retries, bail and surface —
     # the modal may have been reclaimed by a stale state or auth wall.
     raise Exception("Send button disabled after insertText — editor did not receive input")
@@ -323,9 +323,9 @@ If any of those show up, **stop the run, screenshot the state, and surface the i
 ## Common pitfalls

 - **`innerHTML` injection is silently dropped** — LinkedIn's Trusted Types CSP discards any `innerHTML = "<...>"` from injected scripts, no console error. Always use `createElement` + `appendChild` + `setAttribute` for DOM injection. `textContent`, `style.cssText`, and `.value` assignments are fine.
-- **Do NOT pass a selector to `browser_type` on the message composer — call it with NO selector (`browser_type(text=...)`).** The Lexical contenteditable lives inside the `#interop-outlet` shadow root which `document.querySelector` (what the selector-based path uses under the hood) cannot see. Attempts to work around this with `browser_shadow_query` fail because selector-based `browser_type` doesn't support the `>>>` shadow-pierce syntax. The reliable insert path is: (1) `browser_click_coordinate` on the composer rect — the response's `focused_element` confirms Lexical received focus → (2) `browser_type(text=message_text)` with NO selector — CDP `Input.insertText` dispatches to `document.activeElement` regardless of shadow wrapping. The old `browser_evaluate` + `document.execCommand('insertText', ...)` pattern worked but had JSON-escaping pitfalls and cost ~200 chars of JS per send; `browser_type(text=...)` is the same mechanism with built-in retry.
-- **Per-char keyDown on the message composer produces empty text** — Lexical intercepts `beforeinput` and drops raw keys. Use `browser_type(text=..., use_insert_text=True)` with NO selector after click-coordinate focused the composer. The CDP `Input.insertText` method commits as if IME fired, which Lexical accepts cleanly. Do NOT pass a selector; selector-based `browser_type` can't see past `#interop-outlet`.
-- **ANTI-PATTERN: "inject a dummy `<div id='dummy-target'>` and pass it as the `selector` arg to `browser_type`".** This looks tempting but fails compoundingly: `browser_type` clicks the **dummy div's** rect (not the editor's), the click lands on the Lexical wrapper's non-editable chrome, the contenteditable never receives focus, and `Input.insertText` fires against nothing. The bridge will still return `{"ok": true, "action": "type", "length": N}` because it has no way to verify the text actually landed. Symptom: Send button stays `disabled: true` forever. Fix: `browser_click_coordinate` on the real composer rect, then `browser_type(text=message_text)` with NO selector — CDP `Input.insertText` dispatches to `document.activeElement`. (See `session_20260414_114820_08bd3c4d` for the failed dummy-div attempt.)
+- **Do NOT use selector-based `browser_type` on the message composer — use `browser_type_focused(text=...)`.** The Lexical contenteditable lives inside the `#interop-outlet` iframe/shadow wrapper which `document.querySelector` cannot see. `browser_shadow_query` can find it but selector-based `browser_type` doesn't support the `>>>` shadow-pierce syntax. The reliable insert path is: (1) `browser_click_coordinate` on the composer rect — the response's `focused_element` (which recurses into same-origin iframes) confirms what actually received focus → (2) `browser_type_focused(text=message_text)` — CDP `Input.insertText` dispatches to `document.activeElement` regardless of shadow wrapping.
+- **Per-char keyDown on the message composer produces empty text** — Lexical intercepts `beforeinput` and drops raw keys. Use `browser_type_focused(text=..., use_insert_text=True)` after click-coordinate focused the composer. The CDP `Input.insertText` method commits as if IME fired, which Lexical accepts cleanly.
+- **ANTI-PATTERN: "inject a dummy `<div id='dummy-target'>` and pass it as the `selector` arg to `browser_type`".** This fails compoundingly: `browser_type` clicks the **dummy div's** rect (not the editor's), the click lands on the Lexical wrapper's non-editable chrome, the contenteditable never receives focus, and `Input.insertText` fires against nothing. The bridge will still return `{"ok": true, "action": "type", "length": N}` because it has no way to verify the text actually landed. Symptom: Send button stays `disabled: true` forever. Fix: `browser_click_coordinate` on the real composer rect, then `browser_type_focused(text=message_text)` — CDP `Input.insertText` dispatches to `document.activeElement`.
 - **Multiple Send buttons on the page** — the pinned bottom-right messaging bar has its own `msg-form__send-button` that's usually below `innerHeight`. Filter by in-viewport before clicking.
 - **`window.onbeforeunload` hangs navigation/close** — after typing in a composer, any `browser_navigate` or `close_tab` can pop a native "unsent message, leave?" confirm dialog that deadlocks the bridge. Always strip `onbeforeunload` before any navigation, and wrap composer flows in a `try/finally` that runs the cleanup block:
@@ -462,7 +462,7 @@ const CATEGORIES = {
   'Navigation': ['browser_navigate', 'browser_go_back', 'browser_go_forward', 'browser_reload'],
   'Interactions': ['browser_click', 'browser_click_coordinate', 'browser_type', 'browser_fill', 'browser_press', 'browser_press_at', 'browser_hover', 'browser_hover_coordinate', 'browser_select', 'browser_scroll'],
   'Inspection': ['browser_screenshot', 'browser_snapshot', 'browser_console', 'browser_get_text', 'browser_evaluate', 'browser_wait'],
-  'Advanced': ['browser_resize', 'browser_upload', 'browser_dialog', 'browser_coords'],
+  'Advanced': ['browser_resize', 'browser_upload', 'browser_dialog'],
 };

 async function init() {
@@ -80,33 +80,57 @@ async def _adaptive_poll_sleep(elapsed_s: float) -> None:
 _interaction_highlights: dict[int, dict] = {}


-# Compact descriptor of document.activeElement. Returned by both click()
+# Compact descriptor of the focused element. Returned by both click()
 # and click_coordinate() so the agent can verify it focused what it
-# intended, then decide whether to follow up with browser_type(text=...,
-# no selector). Keeping this as a single shared string avoids drift
-# between the two click paths.
+# intended. When the outer document's activeElement is an <iframe>,
+# we recurse into the iframe's document (same-origin only) so the
+# response describes the real inner element — otherwise the agent
+# always sees {tag: "iframe"} and can't tell whether it hit the
+# composer or something else inside the frame (e.g. a sidebar item in
+# LinkedIn's #interop-outlet messaging overlay).
 _FOCUSED_ELEMENT_JS = """
 (function() {
+    function describe(el) {
+        var rect = el.getBoundingClientRect();
+        var attrs = {};
+        for (var i = 0; i < el.attributes.length && i < 10; i++) {
+            attrs[el.attributes[i].name] = el.attributes[i].value.substring(0, 200);
+        }
+        return {
+            tag: el.tagName.toLowerCase(),
+            id: el.id || null,
+            className: el.className || null,
+            name: el.getAttribute('name') || null,
+            type: el.getAttribute('type') || null,
+            role: el.getAttribute('role') || null,
+            contenteditable: el.getAttribute('contenteditable') || null,
+            text: (el.innerText || '').substring(0, 200),
+            value: (el.value !== undefined ? String(el.value).substring(0, 200) : null),
+            attributes: attrs,
+            rect: { x: rect.x, y: rect.y, width: rect.width, height: rect.height }
+        };
+    }
     var el = document.activeElement;
     if (!el || el === document.body) return null;
-    var rect = el.getBoundingClientRect();
-    var attrs = {};
-    for (var i = 0; i < el.attributes.length && i < 10; i++) {
-        attrs[el.attributes[i].name] = el.attributes[i].value.substring(0, 200);
+    // Descend into same-origin iframes. Capped at 5 levels of nesting
+    // to bound cost and prevent a pathological loop. If a frame is
+    // cross-origin, contentDocument throws; catch and report the
+    // outermost iframe instead.
+    var framePath = [];
+    var depth = 0;
+    while (el && (el.tagName === 'IFRAME' || el.tagName === 'FRAME') && depth < 5) {
+        framePath.push(el.id || el.getAttribute('data-testid') || el.tagName.toLowerCase());
+        var innerDoc = null;
+        try { innerDoc = el.contentDocument; } catch (e) { innerDoc = null; }
+        if (!innerDoc) break;
+        var innerActive = innerDoc.activeElement;
+        if (!innerActive || innerActive === innerDoc.body) break;
+        el = innerActive;
+        depth++;
     }
-    return {
-        tag: el.tagName.toLowerCase(),
-        id: el.id || null,
-        className: el.className || null,
-        name: el.getAttribute('name') || null,
-        type: el.getAttribute('type') || null,
-        role: el.getAttribute('role') || null,
-        contenteditable: el.getAttribute('contenteditable') || null,
-        text: (el.innerText || '').substring(0, 200),
-        value: (el.value !== undefined ? String(el.value).substring(0, 200) : null),
-        attributes: attrs,
-        rect: { x: rect.x, y: rect.y, width: rect.width, height: rect.height }
-    };
+    var out = describe(el);
+    if (framePath.length) out.inFrame = framePath;
+    return out;
 })()
 """
@@ -959,18 +983,11 @@ class BeelineBridge:
|
||||
button_map = {"left": "left", "right": "right", "middle": "middle"}
|
||||
cdp_button = button_map.get(button, "left")
|
||||
|
||||
from .tools.inspection import _screenshot_css_scales, _screenshot_scales
|
||||
|
||||
phys_scale = _screenshot_scales.get(tab_id, "unset")
|
||||
css_scale = _screenshot_css_scales.get(tab_id, "unset")
|
||||
logger.info(
|
||||
"click_coordinate tab=%d: x=%.1f, y=%.1f → CDP Input.dispatchMouseEvent. "
|
||||
"stored_scales: physicalScale=%s, cssScale=%s",
|
||||
"click_coordinate tab=%d: x=%.1f, y=%.1f → CDP Input.dispatchMouseEvent",
|
||||
tab_id,
|
||||
x,
|
||||
y,
|
||||
phys_scale,
|
||||
css_scale,
|
||||
)
|
||||
|
||||
await self._cdp(
|
||||
|
||||
@@ -12,7 +12,6 @@ import io
|
||||
import json
|
||||
import logging
|
||||
import time
|
||||
from typing import Literal
|
||||
|
||||
from fastmcp import FastMCP
|
||||
from mcp.types import ImageContent, TextContent
|
||||
@@ -23,32 +22,31 @@ from .tabs import _get_context
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Target width for normalized screenshots (px in the delivered image)
|
||||
_SCREENSHOT_WIDTH = 600
|
||||
|
||||
# Maps tab_id -> physical scale: image_coord × scale = physical pixels (for CDP Input events)
|
||||
_screenshot_scales: dict[int, float] = {}
|
||||
# Maps tab_id -> CSS scale: image_coord × scale = CSS pixels (for DOM APIs / getBoundingClientRect)
|
||||
_screenshot_css_scales: dict[int, float] = {}
|
||||
|
||||
|
||||
def _resize_and_annotate(
|
||||
data: str,
|
||||
css_width: int,
|
||||
dpr: float = 1.0,
|
||||
highlights: list[dict] | None = None,
|
||||
width: int = _SCREENSHOT_WIDTH,
|
||||
) -> tuple[str, float, float]:
|
||||
"""Resize a base64 PNG to _SCREENSHOT_WIDTH wide, annotate highlights.
|
||||
) -> tuple[str, float]:
|
||||
"""Resize a captured PNG so that image pixels == CSS pixels, then
|
||||
re-encode as JPEG quality 75.
|
||||
|
||||
Returns (new_b64, physical_scale, css_scale) where:
|
||||
physical_scale = physical_px_per_image_px (multiply image coords → physical px)
|
||||
css_scale = css_px_per_image_px (multiply image coords → CSS px for DOM APIs)
|
||||
Output is ``css_width × round(orig_h × css_width / orig_w)``. The
|
||||
1:1 image↔CSS mapping means a coord the agent reads off the image
|
||||
is the same coord CDP expects — no conversion, no scale factors to
|
||||
remember. Highlight annotations are drawn directly in CSS px (which
|
||||
equal image px after resize).
|
||||
|
||||
Highlights have x,y,w,h in CSS pixels (what getBoundingClientRect returns,
|
||||
and what CDP Input.dispatchMouseEvent accepts).
|
||||
Falls back to original data if Pillow unavailable or resize fails.
|
||||
Returns ``(new_b64, physical_scale)`` where
|
||||
``physical_scale = orig_png_w / css_width`` (= DPR). Kept for logs
|
||||
and HiDPI debugging only.
|
||||
"""
|
||||
if not css_width or css_width <= 0:
|
||||
# Capture path always supplies css_width; only reach here on a
|
||||
# degraded bridge response. Return the raw image untouched.
|
||||
return data, 1.0
|
||||
|
||||
try:
|
||||
from PIL import Image, ImageDraw, ImageFont
|
||||
except ImportError:
|
||||
@@ -58,48 +56,39 @@ def _resize_and_annotate(
|
||||
import struct
|
||||
|
||||
orig_w = struct.unpack(">I", raw[16:20])[0]
|
||||
raw_size_bytes = len(raw)
|
||||
physical_scale = orig_w / width if orig_w and width else 1.0
|
||||
css_scale = (css_width / width) if css_width else (physical_scale / max(dpr, 1.0))
|
||||
physical_scale = orig_w / css_width if orig_w else 1.0
|
||||
logger.warning(
|
||||
"PIL not available — screenshot resize SKIPPED (cannot downscale image). "
|
||||
"raw_size=%d bytes, png_width=%d, css_width=%s, dpr=%s, target_width=%d. "
|
||||
"Returning ORIGINAL image with computed scales: physicalScale=%.4f, cssScale=%.4f. "
|
||||
"Agent must use browser_coords() to convert image positions before clicking.",
|
||||
raw_size_bytes,
|
||||
orig_w,
|
||||
"PIL not available — screenshot resize+convert SKIPPED. "
|
||||
"Returning original physical-px PNG. physicalScale=%.4f, "
|
||||
"css_width=%d, dpr=%s. Clicks WILL be misaligned; install Pillow.",
|
||||
physical_scale,
|
||||
css_width,
|
||||
dpr,
|
||||
width,
|
||||
physical_scale,
|
||||
css_scale,
|
||||
)
|
||||
return data, round(physical_scale, 4), round(css_scale, 4)
|
||||
return data, round(physical_scale, 4)
|
||||
|
||||
try:
|
||||
raw = base64.b64decode(data)
|
||||
img = Image.open(io.BytesIO(raw)).convert("RGBA")
|
||||
orig_w, orig_h = img.size
|
||||
|
||||
physical_scale = orig_w / width
|
||||
css_scale = (css_width / width) if css_width else (physical_scale / max(dpr, 1.0))
|
||||
physical_scale = orig_w / css_width
|
||||
new_w = css_width
|
||||
new_h = round(orig_h * new_w / orig_w)
|
||||
if (new_w, new_h) != img.size:
|
||||
img = img.resize((new_w, new_h), Image.LANCZOS)
|
||||
|
||||
logger.info(
|
||||
"Screenshot resize: orig=%dx%d → target=%dx%d, css_width=%s, dpr=%s, physicalScale=%.4f, cssScale=%.4f",
|
||||
"Screenshot: orig=%dx%d → out=%dx%d (css_width=%d, dpr=%s), physicalScale=%.4f",
|
||||
orig_w,
|
||||
orig_h,
|
||||
width,
|
||||
round(orig_h * width / orig_w),
|
||||
new_w,
|
||||
new_h,
|
||||
css_width,
|
||||
dpr,
|
||||
physical_scale,
|
||||
css_scale,
|
||||
)
|
||||
|
||||
new_w = width
|
||||
new_h = round(orig_h * new_w / orig_w)
|
||||
img = img.resize((new_w, new_h), Image.LANCZOS)
|
||||
|
||||
if highlights:
|
||||
overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
|
||||
draw = ImageDraw.Draw(overlay)
|
||||
@@ -111,11 +100,11 @@ def _resize_and_annotate(
|
||||
for h in highlights:
|
||||
kind = h.get("kind", "rect")
|
||||
label = h.get("label", "")
|
||||
# Highlights are in CSS px → convert to image px
|
||||
ix = h["x"] / css_scale
|
||||
iy = h["y"] / css_scale
|
||||
iw = h.get("w", 0) / css_scale
|
||||
ih = h.get("h", 0) / css_scale
|
||||
# Highlights are in CSS px. Image px == CSS px, no conversion.
|
||||
ix = h["x"]
|
||||
iy = h["y"]
|
||||
iw = h.get("w", 0)
|
||||
ih = h.get("h", 0)
|
||||
|
||||
if kind == "point":
|
||||
cx, cy, r = ix, iy, 10
|
||||
@@ -135,11 +124,9 @@ def _resize_and_annotate(
|
||||
width=2,
|
||||
)
|
||||
|
||||
# Label: show image pixel position so user knows where to look
|
||||
img_coords = f"img:({round(ix)},{round(iy)})"
|
||||
display_label = f"{img_coords} {label}" if label else img_coords
|
||||
display_label = f"({round(ix)},{round(iy)}) {label}".strip()
|
||||
lx, ly = ix, max(2, iy - 16)
|
||||
lx = max(2, min(lx, width - 120))
|
||||
lx = max(2, min(lx, new_w - 120))
|
||||
bbox = draw.textbbox((lx, ly), display_label, font=font)
|
||||
pad = 3
|
||||
draw.rectangle(
|
||||
@@ -153,22 +140,20 @@ def _resize_and_annotate(
|
||||
img = img.convert("RGB")
|
||||
|
||||
buf = io.BytesIO()
|
||||
img.save(buf, format="PNG", optimize=True)
|
||||
img.save(buf, format="JPEG", quality=75, optimize=True)
|
||||
return (
|
||||
base64.b64encode(buf.getvalue()).decode(),
|
||||
round(physical_scale, 4),
|
||||
round(css_scale, 4),
|
||||
)
|
||||
except Exception:
|
||||
logger.warning(
|
||||
"Screenshot resize/annotate FAILED — returning original image with scale=1.0. "
|
||||
"css_width=%s, dpr=%s, target_width=%d. Clicks will be misaligned.",
|
||||
"Screenshot resize/annotate FAILED — returning original image. "
|
||||
"css_width=%s, dpr=%s.",
|
||||
css_width,
|
||||
dpr,
|
||||
width,
|
||||
exc_info=True,
|
||||
)
|
||||
return data, 1.0, 1.0
|
||||
return data, 1.0
|
||||
|
||||
|
||||
def register_inspection_tools(mcp: FastMCP) -> None:
|
||||
@@ -180,26 +165,24 @@ def register_inspection_tools(mcp: FastMCP) -> None:
|
||||
profile: str | None = None,
|
||||
full_page: bool = False,
|
||||
selector: str | None = None,
|
||||
image_type: Literal["png", "jpeg"] = "png",
|
||||
annotate: bool = True,
|
||||
width: int = _SCREENSHOT_WIDTH,
|
||||
) -> list:
|
||||
"""
|
||||
Take a screenshot of the current page.
|
||||
|
||||
Returns a normalized image alongside text metadata (URL, size, scale
|
||||
factors, etc.). Automatically annotates the last interaction (click,
|
||||
hover, type) with a bounding box overlay.
|
||||
The image is delivered at the CSS viewport's own dimensions, so
|
||||
a pixel you see in the screenshot is the same coordinate you
|
||||
pass to ``browser_click_coordinate`` / ``browser_hover_coordinate``
|
||||
/ ``browser_press_at``. No conversion, no scale factors.
|
||||
|
||||
Output is JPEG quality 75 (~150–250 KB for a typical UI).
|
||||
|
||||
Args:
|
||||
tab_id: Chrome tab ID (default: active tab)
|
||||
profile: Browser profile name (default: "default")
|
||||
full_page: Capture full scrollable page (default: False)
|
||||
selector: CSS selector to screenshot a specific element (optional)
|
||||
image_type: Image format - png or jpeg (default: png)
|
||||
annotate: Draw bounding box of last interaction on image (default: True)
|
||||
width: Output image width in pixels (default: 600). Use 800+ for fine
|
||||
text, 400 for quick layout checks.
|
||||
|
||||
Returns:
|
||||
List of content blocks: text metadata + image
|
||||
@@ -252,7 +235,6 @@ def register_inspection_tools(mcp: FastMCP) -> None:
|
||||
return [TextContent(type="text", text=json.dumps(screenshot_result))]
|
||||
|
||||
data = screenshot_result.get("data")
|
||||
mime_type = screenshot_result.get("mimeType", "image/png")
|
||||
css_width = screenshot_result.get("cssWidth", 0)
|
||||
dpr = screenshot_result.get("devicePixelRatio", 1.0)
|
||||
|
||||
@@ -263,45 +245,38 @@ def register_inspection_tools(mcp: FastMCP) -> None:
|
||||
if annotate and target_tab in _interaction_highlights:
|
||||
highlights = [_interaction_highlights[target_tab]]
|
||||
|
||||
# Normalize to 800px wide and annotate. Offloaded to a
|
||||
# thread because PIL Image.open/resize/ImageDraw/composite on
|
||||
# a 2-megapixel PNG blocks for ~150-300ms of CPU — plenty to
|
||||
# freeze the asyncio event loop and delay every concurrent
|
||||
# tool call during a screenshot. The function is reentrant
|
||||
# (fresh PIL Image per call, no shared state), so to_thread
|
||||
# is safe.
|
||||
data, physical_scale, css_scale = await asyncio.to_thread(
|
||||
# Resize to CSS-viewport dimensions so image px == CSS px,
|
||||
# and re-encode as the chosen lossy format. Offloaded to a
|
||||
# thread because PIL Image.open/resize/ImageDraw/composite
|
||||
# on a 2-megapixel PNG blocks for ~150–300 ms of CPU —
|
||||
# plenty to freeze the asyncio event loop. The function is
|
||||
# reentrant (fresh PIL Image per call, no shared state), so
|
||||
# to_thread is safe.
|
||||
data, physical_scale = await asyncio.to_thread(
|
||||
_resize_and_annotate,
|
||||
data,
|
||||
css_width,
|
||||
dpr,
|
||||
highlights,
|
||||
width,
|
||||
)
|
||||
_screenshot_scales[target_tab] = physical_scale
|
||||
_screenshot_css_scales[target_tab] = css_scale
|
||||
|
||||
meta = json.dumps(
|
||||
{
|
||||
"ok": True,
|
||||
"tabId": target_tab,
|
||||
"url": screenshot_result.get("url", ""),
|
||||
"imageType": mime_type.split("/")[-1],
|
||||
"imageType": "jpeg",
|
||||
"size": len(base64.b64decode(data)) if data else 0,
|
||||
"imageWidth": width,
|
||||
"imageWidth": css_width,
|
||||
"fullPage": full_page,
|
||||
"devicePixelRatio": dpr,
|
||||
"physicalScale": physical_scale,
|
||||
"cssScale": css_scale,
|
||||
"annotated": bool(highlights),
|
||||
"scaleHint": (
|
||||
f"image_coord × {css_scale} = CSS px "
|
||||
f"→ feed to browser_click_coordinate, "
|
||||
f"browser_hover_coordinate, browser_press_at "
|
||||
f"(CDP Input events use CSS pixels). "
|
||||
f"image_coord × {physical_scale} = physical px "
|
||||
f"is debug-only on HiDPI displays and must NOT "
|
||||
f"be used for clicks — it overshoots by DPR×."
|
||||
"Image pixel = CSS pixel. Feed any coord you see "
|
||||
"in this image directly to browser_click_coordinate "
|
||||
"/ browser_hover_coordinate / browser_press_at — "
|
||||
"no conversion needed."
|
||||
),
|
||||
}
|
||||
)
|
||||
@@ -313,17 +288,15 @@ def register_inspection_tools(mcp: FastMCP) -> None:
|
||||
"ok": True,
|
||||
"size": len(base64.b64decode(data)) if data else 0,
|
||||
"url": screenshot_result.get("url", ""),
|
||||
"physicalScale": physical_scale,
|
||||
"cssScale": css_scale,
|
||||
"debug_cssWidth": css_width,
|
||||
"debug_dpr": dpr,
|
||||
"cssWidth": css_width,
|
||||
"dpr": dpr,
|
||||
},
|
||||
duration_ms=(time.perf_counter() - start) * 1000,
|
||||
)
|
||||
|
||||
return [
|
||||
TextContent(type="text", text=meta),
|
||||
ImageContent(type="image", data=data, mimeType=mime_type),
|
||||
ImageContent(type="image", data=data, mimeType="image/jpeg"),
|
||||
]
|
||||
except Exception as e:
|
||||
log_tool_call(
|
||||
@@ -334,73 +307,6 @@ def register_inspection_tools(mcp: FastMCP) -> None:
|
||||
)
|
||||
return [TextContent(type="text", text=json.dumps({"ok": False, "error": str(e)}))]
|
||||
|
||||
@mcp.tool()
|
||||
def browser_coords(
|
||||
x: float,
|
||||
y: float,
|
||||
tab_id: int | None = None,
|
||||
profile: str | None = None,
|
||||
) -> dict:
|
||||
"""
|
||||
Convert screenshot image coordinates to browser click coordinates.
|
||||
|
||||
After browser_screenshot returns a downscaled image, use this to
|
||||
translate pixel positions you see in the image into the CSS pixel
|
||||
coordinates that Chrome DevTools Protocol expects.
|
||||
|
||||
**CDP Input.dispatchMouseEvent uses CSS pixels**, so you want
|
||||
``css_x`` / ``css_y`` for every click/hover tool. ``physical_x/y``
|
||||
is kept in the return for debugging on HiDPI displays — do NOT
|
||||
feed it to clicks; on a DPR=2 screen it lands 2× too far.
|
||||
|
||||
Edge case: pages using ``zoom`` or ``transform: scale()`` (e.g.
|
||||
LinkedIn's ``#interop-outlet`` shadow DOM) render in a scaled
|
||||
local coordinate space. For those, ``getBoundingClientRect()``
|
||||
reports pre-zoom coordinates and you may still need to multiply
|
||||
by the element's effective zoom. Use browser_shadow_query to
|
||||
get the zoomed rect directly.
|
||||
|
||||
Args:
|
||||
x: X pixel position in the screenshot image
|
||||
y: Y pixel position in the screenshot image
|
||||
tab_id: Chrome tab ID (default: active tab for profile)
|
||||
profile: Browser profile name (default: "default")
|
||||
|
||||
Returns:
|
||||
Dict with css_x, css_y (primary — use these), physical_x,
|
||||
physical_y (debug only), and scale factors.
|
||||
"""
|
||||
ctx = _get_context(profile)
|
||||
target_tab = tab_id or (ctx.get("activeTabId") if ctx else None)
|
||||
|
||||
physical_scale = _screenshot_scales.get(target_tab, 1.0) if target_tab else 1.0
|
||||
# css_scale stored in second slot via _screenshot_css_scales
|
||||
css_scale = _screenshot_css_scales.get(target_tab, physical_scale) if target_tab else physical_scale
|
||||
|
||||
return {
|
||||
"ok": True,
|
||||
# Primary output: CSS pixels. Feed these to click/hover/press.
|
||||
"css_x": round(x * css_scale, 1),
|
||||
"css_y": round(y * css_scale, 1),
|
||||
# Debug output: raw physical pixels. DO NOT feed to clicks on
|
||||
# HiDPI displays — CDP Input events use CSS pixels, so sending
|
||||
# physical coordinates lands the click at roughly DPR× the
|
||||
# intended position.
|
||||
"physical_x": round(x * physical_scale, 1),
|
||||
"physical_y": round(y * physical_scale, 1),
|
||||
"physicalScale": physical_scale,
|
||||
"cssScale": css_scale,
|
||||
"tabId": target_tab,
|
||||
"note": (
|
||||
"Use css_x/css_y with browser_click_coordinate, "
|
||||
"browser_hover_coordinate, browser_press_at — "
|
||||
"Chrome DevTools Protocol Input.dispatchMouseEvent "
|
||||
"operates in CSS pixels. physical_x/y is for debugging "
|
||||
"on HiDPI displays only; feeding it to clicks lands "
|
||||
"them at DPR× the intended coordinate."
|
||||
),
|
||||
}
|
||||
|
||||
@mcp.tool()
|
||||
async def browser_shadow_query(
|
||||
selector: str,
|
||||
@@ -412,7 +318,11 @@ def register_inspection_tools(mcp: FastMCP) -> None:
|
||||
|
||||
Traverses shadow roots to find elements inside closed/open shadow DOM,
|
||||
overlays, and virtual-rendered components (e.g. LinkedIn's #interop-outlet).
|
||||
Returns getBoundingClientRect in both CSS and physical pixels.
|
||||
Returns the element's bounding rect in CSS pixels. Screenshot
|
||||
pixels == CSS pixels, so the same numbers also match whatever
|
||||
the agent sees in a browser_screenshot — feed ``css.cx/cy``
|
||||
straight to browser_click_coordinate / hover_coordinate /
|
||||
press_at.
|
||||
|
||||
Args:
|
||||
selector: CSS selectors joined by ' >>> ' to pierce shadow roots.
|
||||
@@ -421,7 +331,7 @@ def register_inspection_tools(mcp: FastMCP) -> None:
|
||||
profile: Browser profile name (default: "default")
|
||||
|
||||
Returns:
|
||||
Dict with rect (CSS px) and physical rect (CSS px × DPR) of the element
|
||||
Dict with ``css`` block (x, y, w, h, cx, cy).
|
||||
"""
|
||||
bridge = get_bridge()
|
||||
if not bridge or not bridge.is_connected:
|
||||
@@ -438,10 +348,6 @@ def register_inspection_tools(mcp: FastMCP) -> None:
|
||||
return result
|
||||
|
||||
rect = result["rect"]
|
||||
physical_scale = _screenshot_scales.get(target_tab, 1.0)
|
||||
css_scale = _screenshot_css_scales.get(target_tab, 1.0)
|
||||
dpr = physical_scale / css_scale if css_scale else 1.0
|
||||
|
||||
return {
|
||||
"ok": True,
|
||||
"selector": selector,
|
||||
@@ -454,20 +360,11 @@ def register_inspection_tools(mcp: FastMCP) -> None:
|
||||
"cx": rect["cx"],
|
||||
"cy": rect["cy"],
|
||||
},
|
||||
"physical": {
|
||||
"x": round(rect["x"] * dpr, 1),
|
||||
"y": round(rect["y"] * dpr, 1),
|
||||
"w": round(rect["w"] * dpr, 1),
|
||||
"h": round(rect["h"] * dpr, 1),
|
||||
"cx": round(rect["cx"] * dpr, 1),
|
||||
"cy": round(rect["cy"] * dpr, 1),
|
||||
},
|
||||
"note": (
|
||||
"Use css.cx/cy with browser_click_coordinate, "
|
||||
"browser_hover_coordinate, browser_press_at — "
|
||||
"CDP Input events operate in CSS pixels. "
|
||||
"physical.* is debug-only; feeding it to clicks "
|
||||
"lands them DPR× too far on HiDPI displays."
|
||||
"Pass css.cx/cy → browser_click_coordinate / "
|
||||
"hover_coordinate / press_at. Screenshot pixels == CSS "
|
||||
"pixels, so these coords also match anything you see in "
|
||||
"browser_screenshot."
|
||||
),
|
||||
}
|
||||
|
||||
@@ -480,11 +377,11 @@ def register_inspection_tools(mcp: FastMCP) -> None:
|
||||
"""
|
||||
Get the bounding rect of an element by CSS selector.
|
||||
|
||||
Supports '>>>' shadow-piercing selectors for overlay/shadow DOM content.
|
||||
Returns coordinates in CSS pixels (for clicks and DOM APIs); the
|
||||
physical-pixel variant is returned for debugging on HiDPI displays
|
||||
only — it must not be fed to click/hover/press tools, which use
|
||||
CSS pixels.
|
||||
Supports '>>>' shadow-piercing selectors for overlay/shadow DOM
|
||||
content. Returns the rect in CSS pixels. Screenshot pixels ==
|
||||
CSS pixels, so the same numbers match anything visible in
|
||||
browser_screenshot — feed ``css.cx/cy`` straight to
|
||||
browser_click_coordinate / hover_coordinate / press_at.
|
||||
|
||||
Args:
|
||||
selector: CSS selector, optionally with ' >>> ' to pierce shadow roots.
|
||||
@@ -493,7 +390,7 @@ def register_inspection_tools(mcp: FastMCP) -> None:
|
||||
profile: Browser profile name (default: "default")
|
||||
|
||||
Returns:
|
||||
Dict with css and physical bounding rects
|
||||
Dict with ``css`` block (x, y, w, h, cx, cy).
|
||||
"""
|
||||
bridge = get_bridge()
|
||||
if not bridge or not bridge.is_connected:
|
||||
@@ -510,10 +407,6 @@ def register_inspection_tools(mcp: FastMCP) -> None:
|
||||
return result
|
||||
|
||||
rect = result["rect"]
|
||||
physical_scale = _screenshot_scales.get(target_tab, 1.0)
|
||||
css_scale = _screenshot_css_scales.get(target_tab, 1.0)
|
||||
dpr = physical_scale / css_scale if css_scale else 1.0
|
||||
|
||||
return {
|
||||
"ok": True,
|
||||
"selector": selector,
|
||||
@@ -526,20 +419,11 @@ def register_inspection_tools(mcp: FastMCP) -> None:
|
||||
"cx": rect["cx"],
|
||||
"cy": rect["cy"],
|
||||
},
|
||||
"physical": {
|
||||
"x": round(rect["x"] * dpr, 1),
|
||||
"y": round(rect["y"] * dpr, 1),
|
||||
"w": round(rect["w"] * dpr, 1),
|
||||
"h": round(rect["h"] * dpr, 1),
|
||||
"cx": round(rect["cx"] * dpr, 1),
|
||||
"cy": round(rect["cy"] * dpr, 1),
|
||||
},
|
||||
"note": (
|
||||
"Use css.cx/cy with browser_click_coordinate, "
|
||||
"browser_hover_coordinate, browser_press_at — "
|
||||
"CDP Input events operate in CSS pixels. "
|
||||
"physical.* is debug-only; feeding it to clicks "
|
||||
"lands them DPR× too far on HiDPI displays."
|
||||
"Pass css.cx/cy → browser_click_coordinate / "
|
||||
"hover_coordinate / press_at. Screenshot pixels == CSS "
|
||||
"pixels, so these coords also match anything you see in "
|
||||
"browser_screenshot."
|
||||
),
|
||||
}
|
||||
|
||||
|
||||
@@ -108,24 +108,24 @@ def register_interaction_tools(mcp: FastMCP) -> None:
|
||||
button: Literal["left", "right", "middle"] = "left",
|
||||
) -> dict:
|
||||
"""
|
||||
Click at specific viewport coordinates (CSS pixels).
|
||||
Click at viewport coordinates.
|
||||
|
||||
Chrome DevTools Protocol's Input.dispatchMouseEvent operates in
|
||||
**CSS pixels**, not physical pixels. If you have a screenshot
|
||||
image coordinate, convert it with ``browser_coords(x, y)`` and
|
||||
use the returned ``css_x`` / ``css_y`` — not ``physical_x/y``.
|
||||
On a DPR=2 display, feeding physical coordinates lands the click
|
||||
at 2× the intended position.
|
||||
Screenshots are delivered at the CSS viewport's own dimensions
|
||||
(see ``browser_screenshot``), so a pixel you read off the image
|
||||
is the same number you pass here — no conversion, no scale
|
||||
factors. ``browser_get_rect`` likewise returns coords you can
|
||||
feed straight through.
|
||||
|
||||
Args:
|
||||
x: X coordinate in CSS pixels (viewport space)
|
||||
y: Y coordinate in CSS pixels (viewport space)
|
||||
x: X coordinate in the screenshot / CSS viewport.
|
||||
y: Y coordinate in the screenshot / CSS viewport.
|
||||
tab_id: Chrome tab ID (default: active tab)
|
||||
profile: Browser profile name (default: "default")
|
||||
button: Mouse button to click (left, right, middle)
|
||||
|
||||
Returns:
|
||||
Dict with click result
|
||||
Dict with click result, including ``focused_element``
|
||||
describing what the click focused.
|
||||
"""
|
||||
start = time.perf_counter()
|
||||
params = {"x": x, "y": y, "tab_id": tab_id, "profile": profile, "button": button}
|
||||
@@ -149,17 +149,11 @@ def register_interaction_tools(mcp: FastMCP) -> None:
|
||||
return result
|
||||
|
||||
try:
|
||||
from .inspection import _screenshot_css_scales, _screenshot_scales
|
||||
|
||||
click_result = await bridge.click_coordinate(target_tab, x, y, button=button)
|
||||
log_tool_call(
|
||||
"browser_click_coordinate",
|
||||
params,
|
||||
result={
|
||||
**click_result,
|
||||
"debug_stored_physicalScale": _screenshot_scales.get(target_tab, "unset"),
|
||||
"debug_stored_cssScale": _screenshot_css_scales.get(target_tab, "unset"),
|
||||
},
|
||||
result=click_result,
|
||||
duration_ms=(time.perf_counter() - start) * 1000,
|
||||
)
|
||||
return click_result
|
||||
@@ -484,15 +478,16 @@ def register_interaction_tools(mcp: FastMCP) -> None:
|
||||
profile: str | None = None,
|
||||
) -> dict:
|
||||
"""
|
||||
Hover at CSS pixel coordinates without needing a CSS selector.
|
||||
Hover at viewport coordinates without needing a CSS selector.
|
||||
|
||||
Use this instead of browser_hover when the element is in an overlay,
|
||||
shadow DOM, or virtual-rendered component that isn't in the regular DOM.
|
||||
Pair with browser_coords to convert screenshot image positions to CSS pixels.
|
||||
Screenshot pixels == CSS pixels, so any coord you read off a
|
||||
browser_screenshot image can be fed straight through.
|
||||
|
||||
Args:
|
||||
x: CSS pixel X coordinate
|
||||
y: CSS pixel Y coordinate
|
||||
x: X coordinate in the screenshot / CSS viewport.
|
||||
y: Y coordinate in the screenshot / CSS viewport.
|
||||
tab_id: Chrome tab ID (default: active tab)
|
||||
profile: Browser profile name (default: "default")
|
||||
|
||||
@@ -548,16 +543,17 @@ def register_interaction_tools(mcp: FastMCP) -> None:
|
||||
profile: str | None = None,
|
||||
) -> dict:
|
||||
"""
|
||||
Move mouse to CSS pixel coordinates then press a key.
|
||||
Move mouse to viewport coordinates then press a key.
|
||||
|
||||
Use this instead of browser_press when the focused element is in an overlay
|
||||
or virtual-rendered component. Moving the mouse first routes the key event
|
||||
through native browser hit-testing instead of the DOM focus chain.
|
||||
Pair with browser_coords to convert screenshot image positions to CSS pixels.
|
||||
Screenshot pixels == CSS pixels, so coords read off a
|
||||
browser_screenshot image can be fed straight through.
|
||||
|
||||
Args:
|
||||
x: CSS pixel X coordinate to position mouse
|
||||
y: CSS pixel Y coordinate to position mouse
|
||||
x: X coordinate in the screenshot / CSS viewport.
|
||||
y: Y coordinate in the screenshot / CSS viewport.
|
||||
key: Key to press (e.g. 'Enter', 'Space', 'Escape', 'ArrowDown')
|
||||
tab_id: Chrome tab ID (default: active tab)
|
||||
profile: Browser profile name (default: "default")
|
||||
|
||||
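The sizing arithmetic this change introduces in `_resize_and_annotate` can be sketched on its own. This is a minimal illustration and not the project's code: the helper name `css_sized_output` is hypothetical, and it only reproduces the formulas shown in the diff (`new_w = css_width`, `new_h = round(orig_h * new_w / orig_w)`, `physical_scale = orig_w / css_width`), with no image decoding, annotation, or JPEG encoding.

```python
def css_sized_output(orig_w: int, orig_h: int, css_width: int) -> tuple[int, int, float]:
    """Return (new_w, new_h, physical_scale) for a captured image.

    Hypothetical helper: mirrors the arithmetic in the diff, nothing more.
    """
    if css_width <= 0 or orig_w <= 0:
        # Degraded input: keep the original size and report a scale of 1.0,
        # matching the early-return path in the diff.
        return orig_w, orig_h, 1.0

    # Output width equals the CSS viewport width, so image px == CSS px.
    new_w = css_width
    # Preserve the aspect ratio of the captured image.
    new_h = round(orig_h * new_w / orig_w)
    # Ratio of captured width to CSS width; equals the device pixel ratio
    # when the capture is taken at physical resolution.
    physical_scale = round(orig_w / css_width, 4)
    return new_w, new_h, physical_scale


if __name__ == "__main__":
    # A 2560x1600 capture of a 1280-px-wide CSS viewport (DPR 2).
    print(css_sized_output(2560, 1600, 1280))
```

Because the output width is the CSS width, a coordinate read off the resized image needs no further scaling before it is passed to the click, hover, or press tools; `physical_scale` is only informational.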