Compare commits

...

2 Commits

Author SHA1 Message Date
Timothy 3a219e27ab fix: two dimensions 2026-04-16 17:08:40 -07:00
Timothy 7f62e7a2d0 fix: split image click vs coordinate click 2026-04-16 16:37:52 -07:00
3 changed files with 234 additions and 50 deletions
@@ -12,23 +12,49 @@ metadata:
All GCU browser tools drive a real Chrome instance through the Beeline extension and Chrome DevTools Protocol (CDP). That means clicks, keystrokes, and screenshots are processed by the actual browser's native hit testing, focus, and layout engines — **not** a synthetic event layer. Understanding this unlocks strategies that make hard sites easy.
## Coordinates: always CSS pixels
## Coordinates: image-px vs CSS-px — pick the right tool
**Chrome DevTools Protocol `Input.dispatchMouseEvent` operates in CSS pixels, not physical pixels.**
Screenshots are downscaled (800 px wide by default) while the real viewport is typically 1500–1900 CSS px wide on a modern display. So the pixel you read off a screenshot image is **not** the CSS coordinate you pass to CDP: feeding an 800-scale number to a 1717-scale API lands your click at under half the intended x and y (800/1717 ≈ 0.47), well up and to the left of where you meant.
When you call `browser_coords(image_x, image_y)` after a screenshot, the returned dict has both `css_x/y` and `physical_x/y`. **Always use `css_x/y` for clicks, hovers, and key presses.**
**The fix is a separate verb for each coord space. You should almost never need to do the math yourself.**
```
browser_screenshot() → image (downscaled to 800/900 px wide)
browser_coords(img_x, img_y) → {css_x, css_y, physical_x, physical_y}
browser_click_coordinate(css_x, css_y) ← USE css_x/y
browser_hover_coordinate(css_x, css_y) ← USE css_x/y
browser_press_at(css_x, css_y, key) ← USE css_x/y
browser_screenshot() → image (downscaled to ~800 px wide)
browser_click_image(img_x, img_y) ← PREFERRED after a screenshot
Reads image pixels straight from
the PNG; the tool auto-converts
to CSS using the cached scale.
Response includes converted_css_x/y
and the cssScale used.
browser_click_coordinate(css_x, css_y) ← CSS pixels only. Use when you
already have CSS coords from
getBoundingClientRect / browser_get_rect.
browser_hover_coordinate(css_x, css_y) ← CSS pixels
browser_press_at(css_x, css_y, key) ← CSS pixels
```
Feeding `physical_x/y` on a HiDPI display overshoots by DPR× — on a DPR=1.6 laptop, clicks land 60% too far right and down. The ratio between `physicalScale` and `cssScale` tells you the effective DPR.
`browser_coords(img_x, img_y)` is still available if you want to *see* the conversion (it returns `{css_x, css_y, physical_x, physical_y}`) — but for ordinary screenshot-then-click work, `browser_click_image` does the whole pipeline in one call and logs the conversion in its response.
`getBoundingClientRect()` already returns CSS pixels — feed those values straight through to click/hover tools without any DPR multiplication.
Never feed `physical_x/y` to any click tool. On a DPR=1.6 display, physical coords overshoot by 60%. The ratio between `physicalScale` and `cssScale` tells you the effective DPR.
`getBoundingClientRect()` already returns CSS pixels — feed those straight into `browser_click_coordinate` (not `browser_click_image`) without any scaling.
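As a rough illustration of the three coordinate spaces described above, the sketch below converts one screenshot pixel into CSS and physical pixels. The helper names are hypothetical, and the 800 px image width, 1717 px viewport width, and DPR of 1.6 are simply the example numbers used in this section; this is not the tools' actual implementation.

```python
def image_to_css(image_px: float, css_scale: float) -> float:
    """Image px -> CSS px. css_scale is assumed to be viewport CSS width / image width."""
    return image_px * css_scale


def css_to_physical(css_px: float, dpr: float) -> float:
    """CSS px -> physical px. Shown for debugging only; never click with this value."""
    return css_px * dpr


def effective_dpr(physical_scale: float, css_scale: float) -> float:
    """DPR recovered as the ratio of the two screenshot scales."""
    return physical_scale / css_scale if css_scale else 1.0


css_scale = 1717 / 800            # image px -> CSS px
dpr = 1.6                         # assumed HiDPI laptop
physical_scale = css_scale * dpr  # image px -> physical px

css_x = image_to_css(400, css_scale)      # the value CDP expects
physical_x = css_to_physical(css_x, dpr)  # the value that would overshoot
overshoot = physical_x / css_x - 1.0      # fraction too far right/down
```

With these numbers the physical coordinate is 60% larger than the CSS one, which matches the overshoot warned about above.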
### The naming convention
Every coord-returning tool (`browser_coords`, `browser_get_rect`, `browser_shadow_query`) returns parallel blocks — one per coord space. Match the block name to the click tool suffix:
```
rect = browser_get_rect(selector)
# rect.image → browser_click_image ← preferred after a screenshot (scale cached)
# rect.css → browser_click_coordinate (hover_coordinate / press_at)
# rect.physical → DO NOT click — debug only
browser_click_image(rect.image.cx, rect.image.cy)
browser_click_coordinate(rect.css.cx, rect.css.cy)
```
Same shape from `browser_shadow_query` (`sq.image`, `sq.css`, `sq.physical`) and `browser_coords` (`image_x/image_y`, `css_x/css_y`, `physical_x/physical_y`). If the block prefix and the tool suffix don't match, you're about to click the wrong place.
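One way to make the block-to-tool matching mechanical is a small lookup that refuses mismatched pairs. The sketch below is only an illustration: `TOOLS_BY_BLOCK` and `pick_point` are hypothetical names, and the sample `rect` dict merely mimics the `image` / `css` / `physical` shape described above.

```python
# Hypothetical mapping from coordinate block to the tools that accept it.
TOOLS_BY_BLOCK = {
    "image": {"browser_click_image"},
    "css": {
        "browser_click_coordinate",
        "browser_hover_coordinate",
        "browser_press_at",
    },
    "physical": set(),  # debug only: no click tool accepts physical px
}


def pick_point(rect: dict, tool: str) -> tuple[float, float]:
    """Return the (cx, cy) pair from the block that matches the given tool."""
    for block, tools in TOOLS_BY_BLOCK.items():
        if tool in tools:
            return rect[block]["cx"], rect[block]["cy"]
    raise ValueError(f"No coordinate block matches tool {tool!r}")


# Sample rect in the parallel-block shape (values are made up).
rect = {
    "image": {"cx": 62.5, "cy": 105.0},
    "css": {"cx": 125.0, "cy": 210.0},
    "physical": {"cx": 250.0, "cy": 420.0},
}

image_point = pick_point(rect, "browser_click_image")
css_point = pick_point(rect, "browser_click_coordinate")
```

Because `physical` maps to an empty set, no tool name can ever resolve to the physical block.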
**Exception for zoomed elements:** pages that use `zoom` or `transform: scale()` on a container (LinkedIn's `#interop-outlet`, some embedded iframes) render in a scaled local coordinate space. `getBoundingClientRect` there may not match CDP's hit space. Use `browser_shadow_query` which handles the math, or fall back to visually picking coordinates from a screenshot.
@@ -46,29 +72,33 @@ Whereas `wait_for_selector`, `browser_click(selector=...)`, `browser_type(select
### Recommended workflow on shadow-heavy sites
1. `browser_screenshot()` → visual image
1. `browser_screenshot()` → visual image (also caches the image→CSS scale for this tab)
2. Identify the target visually → image pixel `(x, y)` (eyeball from the screenshot)
3. `browser_coords(x, y)` → convert to CSS px
4. `browser_click_coordinate(css_x, css_y)` → lands on the element via native hit testing; inputs get focused. **The response now includes `focused_element: {tag, id, role, contenteditable, rect, ...}`** — use it to verify you actually focused what you intended.
5. `browser_type(text="...")` with **NO selector** → dispatches CDP `Input.insertText` to `document.activeElement`. Shadow roots, iframes, Lexical, Draft.js, ProseMirror all just work. Only pass a selector if you want a DIFFERENT element than the one you just focused (rare).
6. Verify via `browser_screenshot` OR `browser_get_attribute` on a known-reachable marker (e.g. check that the Send button's `aria-disabled` flipped to `false`).
3. `browser_click_image(x, y)` → auto-converts image px → CSS px using the cached scale, then clicks. **The response includes `focused_element: {tag, id, role, contenteditable, rect, ...}`**. Use it to verify you actually focused what you intended, plus `converted_css_x/y` and `cssScale` so you can see what the conversion did.
4. `browser_type(text="...")` with **NO selector** → dispatches CDP `Input.insertText` to `document.activeElement`. Shadow roots, iframes, Lexical, Draft.js, ProseMirror all just work. Only pass a selector if you want a DIFFERENT element than the one you just focused (rare).
5. Verify via `browser_screenshot` OR `browser_get_attribute` on a known-reachable marker (e.g. check that the Send button's `aria-disabled` flipped to `false`).
Do **not** pass image pixels to `browser_click_coordinate`. It expects CSS pixels and does no conversion — a common failure mode is eyeballing `(490, 680)` off an 800-wide screenshot, passing it to `browser_click_coordinate`, and landing in the sidebar at CSS-x=490 of a 1717-wide viewport.
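The failure mode above can be checked with a few lines of arithmetic. This sketch reuses the example numbers from the paragraph (an 800 px wide screenshot, a 1717 CSS px wide viewport, image x = 490); the constant and function names are hypothetical.

```python
IMAGE_WIDTH = 800           # screenshot width in image px
VIEWPORT_CSS_WIDTH = 1717   # viewport width in CSS px


def fraction_across(x: float, width: float) -> float:
    """How far across the given width the x coordinate sits (0.0 to 1.0)."""
    return x / width


image_x = 490

# Where you meant to click: 490 px into an 800 px wide image.
intended = fraction_across(image_x, IMAGE_WIDTH)

# Where the click lands if 490 is passed unconverted as a CSS coordinate.
actual = fraction_across(image_x, VIEWPORT_CSS_WIDTH)

# The CSS x that would have hit the intended spot.
correct_css_x = image_x * VIEWPORT_CSS_WIDTH / IMAGE_WIDTH
```

The intended target sits about 61% of the way across the page, while the unconverted click lands about 29% of the way across, which is how it ends up in a left-hand sidebar.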
### The click→type loop (canonical pattern)
```
resp = browser_click_coordinate(x, y)
resp = browser_click_image(x, y) # x, y are raw image pixels
fe = resp.get("focused_element")
if fe and (fe.get("contenteditable") or fe["tag"] in ("textarea", "input")):
browser_type(text="...") # no selector — insertText to activeElement
else:
# you clicked something that isn't editable — refine coords and retry
# do NOT reach for browser_evaluate + execCommand('insertText', ...)
# you clicked something that isn't editable — refine the pixel and retry.
# Check resp["converted_css_x"] / ["converted_css_y"] in the response to
# see where the click actually landed; if it's clearly off, your image
# pixel was wrong, not the conversion.
# Do NOT reach for browser_evaluate + execCommand('insertText', ...)
# or a walk(root) shadow traversal. The problem is your click, not
# the typing method.
...
```
`browser_click` (selector-based) also returns `focused_element` now, so the same check works whether you clicked by selector or coordinate.
`browser_click` (selector-based) also returns `focused_element`, so the same check works whether you clicked by selector, image pixel, or CSS coordinate.
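The editability check in the loop can be factored into a helper so the same test runs after any of the click tools. This is a minimal sketch: `is_editable` and `EDITABLE_TAGS` are hypothetical names, and treating the string `"false"` as "not editable" is an assumption about how `contenteditable` might be reported, not documented behaviour.

```python
EDITABLE_TAGS = ("textarea", "input")


def is_editable(resp: dict) -> bool:
    """True if the click response says an editable element received focus."""
    fe = resp.get("focused_element")
    if not fe:
        return False
    tag = str(fe.get("tag") or "").lower()
    ce = fe.get("contenteditable")
    # Assumption: None, False, "" and "false" all mean "not contenteditable".
    ce_on = ce not in (None, False, "", "false")
    return ce_on or tag in EDITABLE_TAGS


# Example responses (shapes are illustrative only).
hit_textarea = {"focused_element": {"tag": "TEXTAREA"}}
hit_editor = {"focused_element": {"tag": "div", "contenteditable": True}}
hit_sidebar = {"focused_element": {"tag": "div", "contenteditable": "false"}}
```

If the helper returns `False`, the advice above still applies: refine the click rather than switching typing methods.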
### Empirically verified (2026-04-11)
@@ -154,7 +184,7 @@ The symptom is always the same: **you type, the characters appear visually, and
```
# 1. Focus the real element via a real click (not JS .focus()).
rect = browser_get_rect(selector) # or browser_shadow_query for shadow sites
browser_click_coordinate(rect.cx, rect.cy)
browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
sleep(0.5) # let the editor open / focus settle
# 2. Type. browser_type now uses CDP Input.insertText by default, which is
@@ -183,7 +213,7 @@ if not state['disabled']:
else:
# Recovery: sometimes a click-again + one extra keystroke nudges
# React into recomputing hasRealContent.
browser_click_coordinate(rect.cx, rect.cy)
browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
browser_press("End")
browser_press(" ")
browser_press("Backspace")
@@ -339,7 +369,7 @@ LinkedIn enforces **strict Trusted Types CSP**. Any script you inject via `brows
Reddit's search input lives **two shadow levels deep** inside `reddit-search-large > faceplate-search-input`. You cannot reach it with `browser_type(selector=)`. The working pattern:
1. `browser_shadow_query("reddit-search-large >>> #search-input")` → rect
2. `browser_click_coordinate(rect.cx, rect.cy)` → click lands on the real shadow input via native hit testing; input becomes focused
2. `browser_click_coordinate(rect.css.cx, rect.css.cy)` (the matched `css` pair) → click lands on the real shadow input via native hit testing; input becomes focused
3. `browser_press(c)` for each character → dispatches to focused element
4. Verify by reading `.value` via `browser_evaluate` walking the shadow path
@@ -409,7 +439,8 @@ Then pass the most specific selector that uniquely identifies the right input (e
- **Typing into a rich-text editor without clicking first → send button stays disabled.** Draft.js (X), Lexical (Gmail, LinkedIn DMs), ProseMirror (Reddit), and React-controlled `contenteditable` elements only register input as "real" when the element received a native focus event — JS-sourced `.focus()` is not enough. `browser_type` now does this automatically via a real CDP pointer click before inserting text, but always verify the submit button's `disabled` state before clicking send. See the "ALWAYS click before typing" section above.
- **Using per-character `keyDown` on Lexical / Draft.js editors → keys dispatch but text never appears.** Those editors intercept `beforeinput` and route insertion through their own state machine; raw keyDown events are silently dropped. `browser_type` now uses `Input.insertText` by default (the CDP IME-commit method) which these editors accept cleanly. Only set `use_insert_text=False` when you explicitly need per-keystroke dispatch.
- **Leaving a composer with text then trying to navigate → `beforeunload` dialog hangs the bridge.** LinkedIn and several other sites pop a native "unsent message" confirm. `browser_navigate` and `close_tab` both time out against this. Always strip `window.onbeforeunload = null` via `browser_evaluate` before any navigation after typing in a composer, or wrap your logic in a `try/finally` that runs the cleanup block.
- **Clicking at physical pixels.** CDP uses CSS px. `browser_coords` returns both for debugging, but always feed `css_x/y` to click tools.
- **Passing image pixels to `browser_click_coordinate`.** The tool expects CSS pixels and does no conversion. If you eyeballed a target pixel off the screenshot PNG, use `browser_click_image(x, y)` instead — it auto-converts using the cached image→CSS scale. Symptom when you get this wrong: click lands in a sidebar / left rail because an 800-scale number was interpreted as a 1717-scale CSS coordinate. The response of a mis-aimed click will usually show a `focused_element` that isn't the target (e.g. `tag: "div", className: "msg-conversation-listitem__link"`) — branch on that and retry with the right tool.
- **Clicking at physical pixels.** CDP uses CSS px. `browser_coords` returns both for debugging, but always feed `css_x/y` to `browser_click_coordinate` — or pass the raw image pixel straight into `browser_click_image`.
- **Calling `wait_for_selector` on a shadow element.** It'll always time out. Use `browser_shadow_query` or the screenshot + coordinate strategy.
- **Relying on `innerHTML` in injected scripts on LinkedIn.** Silently discarded. Use `createElement` + `appendChild`.
- **Not waiting for SPA hydration.** `wait_until="load"` fires before React/Vue rendering on many sites. Add a 2–3 s sleep before querying for chrome elements.
@@ -461,7 +492,7 @@ browser_navigate("https://x.com/explore", wait_until="load")
sleep(3)
browser_wait_for_selector("input[data-testid='SearchBox_Search_Input']", timeout_ms=5000)
rect = browser_get_rect("input[data-testid='SearchBox_Search_Input']")
browser_click_coordinate(rect.cx, rect.cy)
browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
browser_type("input[data-testid='SearchBox_Search_Input']", "openai", clear_first=True)
# Screenshot now shows live search suggestions
browser_screenshot()
@@ -475,7 +506,7 @@ browser_navigate("https://www.reddit.com/r/programming/", wait_until="load")
sleep(2)
# Shadow-pierce the nested search input
sq = browser_shadow_query("reddit-search-large >>> #search-input")
browser_click_coordinate(sq.rect.cx, sq.rect.cy)
browser_click_coordinate(sq.css.cx, sq.css.cy) # sq.css.cx/cy — matched pair
# Typing can't use selector (shadow); focused input receives raw key presses
for c in "python":
browser_press(c)
@@ -490,7 +521,7 @@ browser_navigate("https://www.linkedin.com/feed/", wait_until="load", timeout_ms
sleep(3)
browser_wait_for_selector("input[data-testid='typeahead-input']", timeout_ms=5000)
rect = browser_get_rect("input[data-testid='typeahead-input']")
browser_click_coordinate(rect.cx, rect.cy)
browser_click_coordinate(rect.css.cx, rect.css.cy) # rect.css.cx/cy — matched pair
browser_type("input[data-testid='typeahead-input']", "anthropic", clear_first=True)
# Dropdown shows real live suggestions
browser_screenshot()
+58 -24
@@ -379,7 +379,14 @@ def register_inspection_tools(mcp: FastMCP) -> None:
return {
"ok": True,
# Primary output: CSS pixels. Feed these to click/hover/press.
# Echo the input — you can feed these straight into
# browser_click_image, which does the image→CSS conversion
# internally. This is the simpler path when you just read a
# pixel off a screenshot.
"image_x": round(x, 1),
"image_y": round(y, 1),
# CSS pixels — feed these to browser_click_coordinate /
# hover_coordinate / press_at, which expect CSS px.
"css_x": round(x * css_scale, 1),
"css_y": round(y * css_scale, 1),
# Debug output: raw physical pixels. DO NOT feed to clicks on
@@ -392,12 +399,11 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"cssScale": css_scale,
"tabId": target_tab,
"note": (
"Use css_x/css_y with browser_click_coordinate, "
"browser_hover_coordinate, browser_press_at — "
"Chrome DevTools Protocol Input.dispatchMouseEvent "
"operates in CSS pixels. physical_x/y is for debugging "
"on HiDPI displays only; feeding it to clicks lands "
"them at DPR× the intended coordinate."
"Simpler path: skip browser_coords entirely and call "
"browser_click_image(image_x, image_y) — it does the "
"conversion automatically. Use css_x/css_y only if you "
"need to pass coords to browser_click_coordinate / "
"hover_coordinate / press_at. physical_x/y is debug-only."
),
}
@@ -412,7 +418,8 @@ def register_inspection_tools(mcp: FastMCP) -> None:
Traverses shadow roots to find elements inside closed/open shadow DOM,
overlays, and virtual-rendered components (e.g. LinkedIn's #interop-outlet).
Returns getBoundingClientRect in both CSS and physical pixels.
Returns getBoundingClientRect in image, CSS, and physical pixels;
pass the matching block into the matching click tool.
Args:
selector: CSS selectors joined by ' >>> ' to pierce shadow roots.
@@ -421,7 +428,9 @@ def register_inspection_tools(mcp: FastMCP) -> None:
profile: Browser profile name (default: "default")
Returns:
Dict with rect (CSS px) and physical rect (CSS px × DPR) of the element
Dict with ``image`` (pass to browser_click_image), ``css``
(pass to browser_click_coordinate / hover / press_at), and
``physical`` (debug only).
"""
bridge = get_bridge()
if not bridge or not bridge.is_connected:
@@ -441,11 +450,21 @@ def register_inspection_tools(mcp: FastMCP) -> None:
physical_scale = _screenshot_scales.get(target_tab, 1.0)
css_scale = _screenshot_css_scales.get(target_tab, 1.0)
dpr = physical_scale / css_scale if css_scale else 1.0
# image = css / cssScale — inverse of the conversion browser_click_image does
inv_css = 1.0 / css_scale if css_scale else 1.0
return {
"ok": True,
"selector": selector,
"tag": rect.get("tag"),
"image": {
"x": round(rect["x"] * inv_css, 1),
"y": round(rect["y"] * inv_css, 1),
"w": round(rect["w"] * inv_css, 1),
"h": round(rect["h"] * inv_css, 1),
"cx": round(rect["cx"] * inv_css, 1),
"cy": round(rect["cy"] * inv_css, 1),
},
"css": {
"x": rect["x"],
"y": rect["y"],
@@ -462,12 +481,14 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"cx": round(rect["cx"] * dpr, 1),
"cy": round(rect["cy"] * dpr, 1),
},
"cssScale": css_scale,
"note": (
"Use css.cx/cy with browser_click_coordinate, "
"browser_hover_coordinate, browser_press_at — "
"CDP Input events operate in CSS pixels. "
"physical.* is debug-only; feeding it to clicks "
"lands them DPR× too far on HiDPI displays."
"Pass image.cx/cy → browser_click_image (preferred after a "
"screenshot). Pass css.cx/cy → browser_click_coordinate / "
"hover_coordinate / press_at. physical.* is debug-only; "
"feeding it to clicks lands them DPR× too far on HiDPI. "
"If cssScale=1.0 no screenshot is cached yet — take a "
"browser_screenshot first if you want to use image coords."
),
}
@@ -481,10 +502,12 @@ def register_inspection_tools(mcp: FastMCP) -> None:
Get the bounding rect of an element by CSS selector.
Supports '>>>' shadow-piercing selectors for overlay/shadow DOM content.
Returns coordinates in CSS pixels (for clicks and DOM APIs); the
physical-pixel variant is returned for debugging on HiDPI displays
only; it must not be fed to click/hover/press tools, which use
CSS pixels.
Returns coordinates in image, CSS, and physical pixels. Pass the
``image`` block into browser_click_image (preferred after a
screenshot) or the ``css`` block into browser_click_coordinate /
hover_coordinate / press_at. ``physical`` is debug-only and must
not be fed to click tools: CDP Input events use CSS pixels, not
physical pixels.
Args:
selector: CSS selector, optionally with ' >>> ' to pierce shadow roots.
@@ -493,7 +516,7 @@ def register_inspection_tools(mcp: FastMCP) -> None:
profile: Browser profile name (default: "default")
Returns:
Dict with css and physical bounding rects
Dict with image, css, and physical bounding rects.
"""
bridge = get_bridge()
if not bridge or not bridge.is_connected:
@@ -513,11 +536,20 @@ def register_inspection_tools(mcp: FastMCP) -> None:
physical_scale = _screenshot_scales.get(target_tab, 1.0)
css_scale = _screenshot_css_scales.get(target_tab, 1.0)
dpr = physical_scale / css_scale if css_scale else 1.0
inv_css = 1.0 / css_scale if css_scale else 1.0
return {
"ok": True,
"selector": selector,
"tag": rect.get("tag"),
"image": {
"x": round(rect["x"] * inv_css, 1),
"y": round(rect["y"] * inv_css, 1),
"w": round(rect["w"] * inv_css, 1),
"h": round(rect["h"] * inv_css, 1),
"cx": round(rect["cx"] * inv_css, 1),
"cy": round(rect["cy"] * inv_css, 1),
},
"css": {
"x": rect["x"],
"y": rect["y"],
@@ -534,12 +566,14 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"cx": round(rect["cx"] * dpr, 1),
"cy": round(rect["cy"] * dpr, 1),
},
"cssScale": css_scale,
"note": (
"Use css.cx/cy with browser_click_coordinate, "
"browser_hover_coordinate, browser_press_at — "
"CDP Input events operate in CSS pixels. "
"physical.* is debug-only; feeding it to clicks "
"lands them DPR× too far on HiDPI displays."
"Pass image.cx/cy → browser_click_image (preferred after a "
"screenshot). Pass css.cx/cy → browser_click_coordinate / "
"hover_coordinate / press_at. physical.* is debug-only; "
"feeding it to clicks lands them DPR× too far on HiDPI. "
"If cssScale=1.0 no screenshot is cached yet — take a "
"browser_screenshot first if you want to use image coords."
),
}
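The rect conversion repeated in the two hunks above can be sketched as a standalone function, which makes the inverse relationship (image = css / cssScale, physical = css × DPR) easy to test in isolation. `build_rect_blocks` is a hypothetical name, and the assumption that every block carries the same six keys is inferred from the visible parts of the diff rather than confirmed by it.

```python
RECT_KEYS = ("x", "y", "w", "h", "cx", "cy")


def build_rect_blocks(rect: dict, css_scale: float, physical_scale: float) -> dict:
    """Derive image / css / physical blocks from a CSS-pixel rect."""
    # Guard against a zero scale the same way the diff does.
    dpr = physical_scale / css_scale if css_scale else 1.0
    inv_css = 1.0 / css_scale if css_scale else 1.0
    return {
        # image = css / cssScale, the inverse of what browser_click_image does
        "image": {k: round(rect[k] * inv_css, 1) for k in RECT_KEYS},
        # css values pass through unchanged
        "css": {k: rect[k] for k in RECT_KEYS},
        # physical = css * DPR, debug only
        "physical": {k: round(rect[k] * dpr, 1) for k in RECT_KEYS},
    }


# Made-up rect and scales: image px are half of CSS px, DPR is 2.
sample = {"x": 100, "y": 200, "w": 50, "h": 20, "cx": 125, "cy": 210}
blocks = build_rect_blocks(sample, css_scale=2.0, physical_scale=4.0)
```

Multiplying `blocks["image"]["cx"]` by the CSS scale recovers `blocks["css"]["cx"]`, which is the round trip `browser_click_image` relies on.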
+119
@@ -173,6 +173,125 @@ def register_interaction_tools(mcp: FastMCP) -> None:
)
return result
@mcp.tool()
async def browser_click_image(
image_x: float,
image_y: float,
tab_id: int | None = None,
profile: str | None = None,
button: Literal["left", "right", "middle"] = "left",
) -> dict:
"""
Click at coordinates read directly from the most recent screenshot.
**Use this after ``browser_screenshot`` when you've eyeballed a
target pixel from the returned image.** It reads the cached
image→CSS scale for the tab and converts automatically, so you
pass in the raw image pixel you saw, not a pre-converted CSS px.
This is the canonical screenshot-then-click path and eliminates
the single most common coordinate bug (passing 800-px-scale
numbers to a 1717-px-scale CDP API, which lands the click on a
sidebar instead of the target).
Pipeline:
img = browser_screenshot() # image (default 800 px wide)
# look at img, pick a target pixel (x, y)
browser_click_image(x, y) # auto-converts to CSS px
If you already hold CSS coordinates (from ``getBoundingClientRect``,
``browser_get_rect``, or an explicit ``browser_coords`` call), use
``browser_click_coordinate`` instead; that path does no conversion.
Fails fast if no screenshot scale is cached for the tab (call
``browser_screenshot`` first so the scale is known).
Args:
image_x: X coordinate in image pixels (the value you read off
the screenshot PNG).
image_y: Y coordinate in image pixels.
tab_id: Chrome tab ID (default: active tab).
profile: Browser profile name (default: "default").
button: Mouse button to click (left, right, middle).
Returns:
Dict with click result, including the ``converted_css_x`` /
``converted_css_y`` that were actually dispatched and the
``cssScale`` used for the conversion.
"""
start = time.perf_counter()
params = {
"image_x": image_x,
"image_y": image_y,
"tab_id": tab_id,
"profile": profile,
"button": button,
}
bridge = get_bridge()
if not bridge or not bridge.is_connected:
result = {"ok": False, "error": "Browser extension not connected"}
log_tool_call("browser_click_image", params, result=result)
return result
ctx = _get_context(profile)
if not ctx:
result = {"ok": False, "error": "Browser not started. Call browser_start first."}
log_tool_call("browser_click_image", params, result=result)
return result
target_tab = tab_id or ctx.get("activeTabId")
if target_tab is None:
result = {"ok": False, "error": "No active tab"}
log_tool_call("browser_click_image", params, result=result)
return result
from .inspection import _screenshot_css_scales
css_scale = _screenshot_css_scales.get(target_tab)
if css_scale is None:
result = {
"ok": False,
"error": (
f"No screenshot scale cached for tab {target_tab}. "
"Call browser_screenshot first so the image→CSS scale "
"is known, or use browser_click_coordinate if you "
"already have CSS px."
),
}
log_tool_call("browser_click_image", params, result=result)
return result
css_x = image_x * css_scale
css_y = image_y * css_scale
try:
click_result = await bridge.click_coordinate(target_tab, css_x, css_y, button=button)
enriched = {
**click_result,
"input_image_x": image_x,
"input_image_y": image_y,
"converted_css_x": css_x,
"converted_css_y": css_y,
"cssScale": css_scale,
}
log_tool_call(
"browser_click_image",
params,
result=enriched,
duration_ms=(time.perf_counter() - start) * 1000,
)
return enriched
except Exception as e:
result = {"ok": False, "error": str(e)}
log_tool_call(
"browser_click_image",
params,
error=e,
duration_ms=(time.perf_counter() - start) * 1000,
)
return result
@mcp.tool()
async def browser_type(
selector: str,