feat: fraction-based visual clicks

Timothy
2026-04-16 22:36:41 -07:00
parent aba0ff07ba
commit 558813e7fa
7 changed files with 259 additions and 181 deletions
@@ -39,7 +39,7 @@ Neither tool is "preferred" universally — they're for different jobs. Default
## Coordinate rule
Every browser tool that takes or returns coordinates operates in **screenshot pixels** (the 800 px wide JPEG `browser_screenshot` delivers). Read a pixel off the image, pass it straight to `browser_click_coordinate` / `browser_hover_coordinate` / `browser_press_at`. `browser_get_rect` and `browser_shadow_query` return `rect.cx` / `rect.cy` in the same space. The tools translate to CSS px internally — no scale awareness required. Avoid raw `getBoundingClientRect()` via `browser_evaluate` for coord lookup; use `browser_get_rect` instead.
Every browser tool that takes or returns coordinates operates in **fractions of the viewport (0..1 for both axes)**. Read a target's proportional position off `browser_screenshot` ("~35% from the left, ~20% from the top" → `(0.35, 0.20)`) and pass that to `browser_click_coordinate` / `browser_hover_coordinate` / `browser_press_at`. `browser_get_rect` and `browser_shadow_query` return `rect.cx` / `rect.cy` as fractions. The tools multiply by `cssWidth` / `cssHeight` internally — no scale awareness required. Fractions are used because every vision model (Claude, GPT-4o, Gemini, local VLMs) resizes/tiles images differently; proportions are invariant. Avoid raw `getBoundingClientRect()` via `browser_evaluate` for coord lookup; use `browser_get_rect` instead.
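A minimal sketch of the conversion the tools perform internally (the function and the 1280x720 viewport are illustrative, not the actual implementation):

```python
def fraction_to_css(fx: float, fy: float, css_width: int, css_height: int) -> tuple[float, float]:
    """Multiply viewport fractions (0..1) by the CSS viewport size."""
    return fx * css_width, fy * css_height

# "~35% from the left, ~20% from the top" on an assumed 1280x720 viewport:
x, y = fraction_to_css(0.35, 0.20, 1280, 720)  # roughly (448, 144) CSS px
```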
## System prompt tips for browser nodes
+14 -9
@@ -45,18 +45,23 @@ CSS selector.
## Coordinates
Every browser tool that takes or returns coordinates operates in
**screenshot pixels** (the 800 px wide JPEG `browser_screenshot`
delivers). Read a pixel off the image, pass it to
`browser_click_coordinate` / `browser_hover_coordinate` /
`browser_press_at`. `browser_get_rect` and `browser_shadow_query`
return `rect.cx` / `rect.cy` in the same space. The tools handle the
image-px ↔ CSS-px translation internally; you do not need to know
about CSS pixels, DPR, or any scale factor.
**fractions of the viewport (0..1 for both axes)**. Read a target's
proportional position off `browser_screenshot` — "this button is
~35% from the left, ~20% from the top" → pass `(0.35, 0.20)`.
`browser_get_rect` and `browser_shadow_query` return `rect.cx` /
`rect.cy` as fractions in the same space. The tools handle the
fraction → CSS-px multiplication internally; you do not need to
track image pixels, DPR, or any scale factor.
Why fractions: every vision model (Claude, GPT-4o, Gemini, local
VLMs) resizes or tiles images differently before the model sees the
pixels. Proportions survive every such transform; pixel coordinates
only "work" per-model and break when you swap backends.
Avoid raw `browser_evaluate` + `getBoundingClientRect()` for coord
lookup that returns CSS px and will be mis-scaled when fed to click
lookup that returns CSS px and will be wrong when fed to click
tools. Prefer `browser_get_rect` / `browser_shadow_query`, which
convert for you.
return fractions.
## Rich-text editors (X, LinkedIn DMs, Gmail, Reddit, Slack, Discord)
@@ -14,18 +14,20 @@ All GCU browser tools drive a real Chrome instance through the Beeline extension
## Coordinates
Every browser tool that takes or returns coordinates operates in **screenshot pixels** (the 800 px wide JPEG `browser_screenshot` delivers). Take a screenshot, read a pixel off the image, pass that number to `browser_click_coordinate` / `browser_hover_coordinate` / `browser_press_at`. Rect-returning tools (`browser_get_rect`, `browser_shadow_query`, and the `rect` inside `focused_element`) also return screenshot pixels. You do not need to convert anything, track scale factors, or know about CSS pixels or device pixel ratio — the tools translate internally before dispatching to Chrome.
Every browser tool that takes or returns coordinates operates in **fractions of the viewport (0..1 for both axes)**. Read a target's proportional position off `browser_screenshot` — "this button is about 35% from the left and 20% from the top" → pass `(0.35, 0.20)`. Rect-returning tools (`browser_get_rect`, `browser_shadow_query`, and the `rect` inside `focused_element`) also return fractions. The tools convert to CSS pixels internally before dispatching to Chrome.
```
browser_screenshot() → image (800 px wide JPEG)
browser_click_coordinate(x, y) → x, y are screenshot px
browser_hover_coordinate(x, y) → x, y are screenshot px
browser_press_at(x, y, key) → x, y are screenshot px
browser_get_rect(selector) → rect → rect.cx / rect.cy are screenshot px
browser_screenshot() → image + cssWidth/cssHeight in meta
browser_click_coordinate(x, y) → x, y are fractions 0..1
browser_hover_coordinate(x, y) → fractions
browser_press_at(x, y, key) → fractions
browser_get_rect(selector) → rect → rect.cx / rect.cy are fractions
browser_shadow_query(...) → rect → same
```
**Exception for zoomed elements:** pages that use `zoom` or `transform: scale()` on a container (LinkedIn's `#interop-outlet`, some embedded iframes) render in a scaled local coordinate space. `getBoundingClientRect` there may not match CDP's hit space. Prefer `browser_shadow_query` (which handles the math) or visually pick coordinates from a screenshot. When in doubt, avoid raw `browser_evaluate` + `getBoundingClientRect()` for coord lookup — that returns CSS px and will be mis-scaled when passed to click tools.
**Why fractions:** every vision model (Claude ~1.15 MP target, GPT-4o 512-px tiles, Gemini, local VLMs) resizes or tiles images differently before the model sees the pixels. Proportions survive every such transform; pixel coordinates only "work" per-model and silently break when you swap backends. Four-decimal precision (`0.0001` ≈ 0.17 CSS px on a 1717-wide viewport) is more than enough for the tightest targets.
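The invariance claim can be checked directly: a proportional position survives any uniform resize, while the pixel position changes with the width (a minimal sketch; the widths are illustrative):

```python
import math

FRACTION = 0.35                       # target ~35% from the left

for image_width in (800, 1568, 512):  # raw output, a resize target, a tile size
    pixel_x = FRACTION * image_width  # pixel position differs per width...
    assert math.isclose(pixel_x / image_width, FRACTION)  # ...the fraction does not
```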
**Exception for zoomed elements:** pages that use `zoom` or `transform: scale()` on a container (LinkedIn's `#interop-outlet`, some embedded iframes) render in a scaled local coordinate space. `getBoundingClientRect` there may not match CDP's hit space. Prefer `browser_shadow_query` (which handles the math and returns fractions) or visually pick coordinates from a screenshot. Avoid raw `browser_evaluate` + `getBoundingClientRect()` for coord lookup — that returns CSS px and will be wrong when fed to click tools.
## Screenshot + coordinates is shadow-agnostic — prefer it on shadow-heavy sites
@@ -41,9 +43,9 @@ Whereas `wait_for_selector`, `browser_click(selector=...)`, `browser_type(select
### Recommended workflow on shadow-heavy sites
1. `browser_screenshot()` → 800 px wide JPEG.
2. Identify the target visually → pixel `(x, y)` read straight off the image.
3. `browser_click_coordinate(x, y)` → lands on the element via native hit testing; inputs get focused. **The response includes `focused_element: {tag, id, role, contenteditable, rect, inFrame?, ...}`** — use it to verify you actually focused what you intended. `rect` is in screenshot pixels (same space as the image). When focus is inside a same-origin iframe, the descriptor reports the inner element and adds `inFrame: [...]` breadcrumbs.
1. `browser_screenshot()` → JPEG; meta includes `cssWidth`/`cssHeight` for reference.
2. Identify the target visually → estimate its proportional position `(fx, fy)` where each is in `0..1`.
3. `browser_click_coordinate(fx, fy)` → tool converts to CSS px and dispatches; CDP native hit testing focuses the element. **The response includes `focused_element: {tag, id, role, contenteditable, rect, inFrame?, ...}`** — use it to verify you actually focused what you intended. `rect` is in fractions (same space as your input). When focus is inside a same-origin iframe, the descriptor reports the inner element and adds `inFrame: [...]` breadcrumbs.
4. `browser_type_focused(text="...")` → inserts text into `document.activeElement` (traverses into same-origin iframes automatically). Shadow roots, iframes, Lexical, Draft.js, ProseMirror all just work. Use `browser_type(selector, text)` instead when you have a reliable CSS selector for a light-DOM element.
5. Verify via `browser_screenshot` OR `browser_get_attribute` on a known-reachable marker (e.g. check that the Send button's `aria-disabled` flipped to `false`).
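The five steps read as the following call sequence (coordinates and selector are illustrative):

```
browser_screenshot()                       # 1. look at the page
# 2. target reads as ~40% from the left, ~62% from the top
browser_click_coordinate(0.40, 0.62)       # 3. check focused_element in the response
browser_type_focused(text="hello")         # 4. inserts into document.activeElement
browser_get_attribute("button.send", "aria-disabled")  # 5. verify, or re-screenshot
```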
@@ -74,7 +76,7 @@ browser_shadow_query("reddit-search-large >>> #search-input")
browser_get_rect("#interop-outlet >>> #ember37 >>> p")
```
Returns the element's rect in **screenshot pixels** (feed `rect.cx` / `rect.cy` directly to click tools). Remember: `browser_type` and `wait_for_selector` do **not** support `>>>` — only shadow_query and get_rect do.
Returns the element's rect as **fractions of the viewport** (feed `rect.cx` / `rect.cy` directly to click tools). Remember: `browser_type` and `wait_for_selector` do **not** support `>>>` — only shadow_query and get_rect do.
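A shadow-pierced rect therefore feeds the click tools directly (values illustrative):

```
browser_get_rect("#interop-outlet >>> #ember37 >>> p")
# → {"ok": true, "rect": {"cx": 0.4123, "cy": 0.2851, ...}, "cssWidth": 1717, ...}
browser_click_coordinate(0.4123, 0.2851)
```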
## Navigation and waiting
@@ -215,11 +217,11 @@ Recognized without modifiers: `Enter`, `Tab`, `Escape`, `Backspace`, `Delete`, `
```
browser_screenshot() # viewport, 800 px wide JPEG
browser_screenshot(full_page=True) # full scrollable page
browser_screenshot(full_page=True) # full scrollable page (overview only — don't click off a full-page shot)
browser_screenshot(selector="#header") # clip to element's rect
```
Returns a JPEG (quality 75, ~50–120 KB) fixed at **800 px wide** — well below the vision-API resize threshold, so the model sees the exact pixels we emit. Metadata includes `imageWidth` (800), `cssWidth` (the page's real viewport width), `cssScale` (for debug only), and `physicalScale`. The image is annotated with a highlight rectangle/dot showing the last interaction (click, hover, type) if one happened on this tab.
Returns a JPEG (quality 75, ~50–120 KB) at 800 px wide. The pixel width is purely a bandwidth choice; all tool coordinates are fractions of the viewport and are invariant to image size. Metadata includes `imageWidth` (800), `cssWidth`, `cssHeight` (for reference), and `physicalScale`. The image is annotated with a highlight rectangle/dot showing the last interaction (click, hover, type) if one happened on this tab.
The highlight overlay stays visible on the page for **10 seconds** after each interaction, then fades. If you want the highlight to appear in a screenshot, take the screenshot within 10 s of the click / hover / type.
@@ -347,7 +349,8 @@ Then pass the most specific selector that uniquely identifies the right input (e
- **Typing into a rich-text editor without clicking first → send button stays disabled.** Draft.js (X), Lexical (Gmail, LinkedIn DMs), ProseMirror (Reddit), and React-controlled `contenteditable` elements only register input as "real" when the element received a native focus event — JS-sourced `.focus()` is not enough. `browser_type` now does this automatically via a real CDP pointer click before inserting text, but always verify the submit button's `disabled` state before clicking send. See the "ALWAYS click before typing" section above.
- **Using per-character `keyDown` on Lexical / Draft.js editors → keys dispatch but text never appears.** Those editors intercept `beforeinput` and route insertion through their own state machine; raw keyDown events are silently dropped. `browser_type` now uses `Input.insertText` by default (the CDP IME-commit method) which these editors accept cleanly. Only set `use_insert_text=False` when you explicitly need per-keystroke dispatch.
- **Leaving a composer with text then trying to navigate → `beforeunload` dialog hangs the bridge.** LinkedIn and several other sites pop a native "unsent message" confirm. `browser_navigate` and `close_tab` both time out against this. Always strip `window.onbeforeunload = null` via `browser_evaluate` before any navigation after typing in a composer, or wrap your logic in a `try/finally` that runs the cleanup block.
- **Click landed in the wrong region (sidebar / header instead of target).** Check `focused_element` in the click response — it's ground truth for what actually got focused, including the `inFrame` breadcrumb when focus ends up inside a same-origin iframe. If it isn't the target (e.g. `className: "msg-conversation-listitem__link"` when you meant to hit a composer), adjust the pixel and retry. Coordinates you pass are screenshot pixels; the tool translates to CSS px internally, so a wrong result means you picked the wrong pixel off the image — not that any scale went sideways.
- **Click landed in the wrong region (sidebar / header instead of target).** Check `focused_element` in the click response — it's ground truth for what actually got focused, including the `inFrame` breadcrumb when focus ends up inside a same-origin iframe. If it isn't the target (e.g. `className: "msg-conversation-listitem__link"` when you meant to hit a composer), adjust the fraction and retry. Coordinates you pass are fractions of the viewport; the tool multiplies by `cssWidth` / `cssHeight` internally, so a wrong result means your estimated proportion was off — not that any scale went sideways.
- **Accidentally passing pixels to click / hover / press_at.** The tools reject any coord outside `[-0.1, 1.5]` with a clear error. If you see that error, you passed a pixel (like 815) instead of a fraction (like 0.475). Use `browser_get_rect` to get exact fractional cx/cy, or read proportions off `browser_screenshot`.
- **Calling `wait_for_selector` on a shadow element.** It'll always time out. Use `browser_shadow_query` or the screenshot + coordinate strategy.
- **Relying on `innerHTML` in injected scripts on LinkedIn.** Silently discarded. Use `createElement` + `appendChild`.
- **Not waiting for SPA hydration.** `wait_until="load"` fires before React/Vue rendering on many sites. Add a 2–3 s sleep before querying for chrome elements.
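For the `beforeunload` pitfall above, the cleanup sequence is (a sketch using the tool names from this doc; `url` is a placeholder):

```
browser_evaluate("window.onbeforeunload = null")   # strip the unsent-message confirm
browser_navigate(url)                              # now safe, no native dialog to hang the bridge
```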
+16 -13
@@ -962,9 +962,9 @@ class BeelineBridge:
"""Read document.activeElement and return a compact descriptor.
The JS returns ``rect`` fields in CSS px (they come straight
from ``getBoundingClientRect``). We scale them to screenshot
pixels here so the agent sees a rect in the same coord space
it passed to click / hover / press_at.
from ``getBoundingClientRect``). We convert them to fractions
of the viewport here so the agent sees a rect in the same
coord space it passed to click / hover / press_at.
Returns None on any failure — never raises.
"""
@@ -973,20 +973,23 @@ class BeelineBridge:
result = await self.evaluate(tab_id, _FOCUSED_ELEMENT_JS)
info = (result or {}).get("result")
if info and isinstance(info, dict) and isinstance(info.get("rect"), dict):
# Convert CSS px rect → screenshot px using the cached
# scale. Fall back to 1.0 if no screenshot has been
# taken yet on this tab.
from .tools.inspection import _screenshot_css_scales
from .tools.inspection import _viewport_sizes
scale = _screenshot_css_scales.get(tab_id, 1.0) or 1.0
if scale > 0 and scale != 1.0:
vp = _viewport_sizes.get(tab_id)
if vp and vp[0] > 0 and vp[1] > 0:
cw, ch = float(vp[0]), float(vp[1])
r = info["rect"]
info["rect"] = {
"x": round(r.get("x", 0) / scale, 1),
"y": round(r.get("y", 0) / scale, 1),
"width": round(r.get("width", 0) / scale, 1),
"height": round(r.get("height", 0) / scale, 1),
"x": round(r.get("x", 0) / cw, 4),
"y": round(r.get("y", 0) / ch, 4),
"width": round(r.get("width", 0) / cw, 4),
"height": round(r.get("height", 0) / ch, 4),
}
else:
# Degraded: cache missing (no screenshot taken
# yet). Leave rect in CSS px and flag it so the
# agent can tell.
info["rectSpace"] = "css"
return info
except Exception:
return None
+5 -4
@@ -256,12 +256,13 @@ def register_advanced_tools(mcp: FastMCP) -> None:
try:
result = await bridge.resize(target_tab, width, height)
# Invalidate per-tab scale caches — CSS width changed, so the
# cached image→CSS multiplier is stale. Click / rect tools
# will re-query innerWidth on next use via _ensure_css_scale.
# cached viewport dimensions are stale. Click / rect tools
# will re-query innerWidth / innerHeight on next use via
# _ensure_viewport_size.
try:
from .inspection import _screenshot_css_scales, _screenshot_scales
from .inspection import _screenshot_scales, _viewport_sizes
_screenshot_css_scales.pop(target_tab, None)
_viewport_sizes.pop(target_tab, None)
_screenshot_scales.pop(target_tab, None)
except Exception:
pass
+128 -101
@@ -24,21 +24,25 @@ from .tabs import _get_context
logger = logging.getLogger(__name__)
# Fixed output width for all screenshots. Chosen well below Anthropic's
# ~1568-px vision-API resize threshold so the image the server emits is
# the SAME image (pixel-for-pixel) the LLM sees. That preserves
# image_px == model_px, which is the cornerstone of the "LLM works in
# screenshot pixels only" contract — all click/hover/press/rect tools
# translate between image pixels and CSS pixels internally.
# Fixed output width for all screenshots (bandwidth default). This
# number does NOT affect coordinate semantics — click / hover / press
# and rect tools all work in fractions of the viewport (0..1), which
# are invariant to whatever resize / tile the vision API applies. The
# 800 px width is simply small enough to keep JPEG payloads under
# ~150 KB on typical UI screenshots.
_SCREENSHOT_WIDTH = 800
# Per-tab scale caches populated on every browser_screenshot and on
# lazy-init inside the click tools. Both are ``image_px × scale =
# target_px`` multipliers.
# - _screenshot_scales[tab] → physical scale (image → physical px, debug only)
# - _screenshot_css_scales[tab] → css scale (image → CSS px, used for Input events)
# Per-tab viewport-size cache populated on every browser_screenshot
# and on lazy-init inside the click tools. Stores CSS-pixel viewport
# dimensions (window.innerWidth / window.innerHeight). Click tools
# multiply fractional inputs by these to get CSS coords before
# dispatching CDP events; rect tools divide CSS-pixel DOM rects by
# these to produce fractions for the agent.
_viewport_sizes: dict[int, tuple[int, int]] = {}
# Optional debug cache — physical-px scale per tab (orig_png_w /
# _SCREENSHOT_WIDTH). Logged only; no consumer.
_screenshot_scales: dict[int, float] = {}
_screenshot_css_scales: dict[int, float] = {}
def _resize_and_annotate(
@@ -46,27 +50,24 @@ def _resize_and_annotate(
css_width: int,
dpr: float = 1.0,
highlights: list[dict] | None = None,
) -> tuple[str, float, float]:
) -> tuple[str, float]:
"""Resize the captured PNG down to ``_SCREENSHOT_WIDTH`` (=800 px)
and re-encode as JPEG quality 75.
CDP captures at the physical-pixel resolution (DPR × CSS). We
downscale to 800 px wide so the delivered image stays under
Anthropic's vision-API resize cap — the model sees pixel-for-pixel
what we send.
The image dimensions do NOT determine click coordinates anymore —
the tools work in viewport fractions. This helper exists purely
for bandwidth + annotation overlay. Returns ``(new_b64,
physical_scale)`` where ``physical_scale = orig_png_w / output_w``
is kept for debug logging.
Returns ``(new_b64, physical_scale, css_scale)`` where
- ``physical_scale = orig_png_w / _SCREENSHOT_WIDTH`` (image physical px)
- ``css_scale = css_width / _SCREENSHOT_WIDTH`` (image CSS px)
Highlight rects arrive in CSS px and are divided by ``css_scale``
before drawing so overlays land in the correct spot on the
800-wide output.
Highlight rects arrive in CSS px; they're converted to image-space
for overlay drawing via the local ``css_to_image = css_width /
output_w`` factor (computed inline — no external cache).
"""
if not css_width or css_width <= 0:
# Bridge always supplies css_width from window.innerWidth; only
# reach here on a degraded response. Return the raw PNG.
return data, 1.0, 1.0
return data, 1.0
try:
from PIL import Image, ImageDraw, ImageFont
@@ -78,17 +79,15 @@ def _resize_and_annotate(
orig_w = struct.unpack(">I", raw[16:20])[0]
physical_scale = orig_w / _SCREENSHOT_WIDTH if orig_w else 1.0
css_scale = css_width / _SCREENSHOT_WIDTH
logger.warning(
"PIL not available — screenshot resize SKIPPED. "
"Returning raw physical-px PNG. physicalScale=%.4f, "
"cssScale=%.4f, css_width=%d, dpr=%s. Install Pillow for correct clicks.",
"css_width=%d, dpr=%s. Install Pillow for annotation.",
physical_scale,
css_scale,
css_width,
dpr,
)
return data, round(physical_scale, 4), round(css_scale, 4)
return data, round(physical_scale, 4)
try:
raw = base64.b64decode(data)
@@ -96,14 +95,17 @@ def _resize_and_annotate(
orig_w, orig_h = img.size
physical_scale = orig_w / _SCREENSHOT_WIDTH
css_scale = css_width / _SCREENSHOT_WIDTH
new_w = _SCREENSHOT_WIDTH
new_h = round(orig_h * new_w / orig_w)
if (new_w, new_h) != img.size:
img = img.resize((new_w, new_h), Image.LANCZOS)
# Local CSS → image px factor for overlay draws. Kept local —
# not exported, not stored, not leaked to the agent.
css_to_image = css_width / _SCREENSHOT_WIDTH
logger.info(
"Screenshot: orig=%dx%d → out=%dx%d (css_width=%d, dpr=%s), physicalScale=%.4f, cssScale=%.4f",
"Screenshot: orig=%dx%d → out=%dx%d (css_width=%d, dpr=%s), physicalScale=%.4f, css_to_image=%.4f",
orig_w,
orig_h,
new_w,
@@ -111,7 +113,7 @@ def _resize_and_annotate(
css_width,
dpr,
physical_scale,
css_scale,
css_to_image,
)
if highlights:
@@ -126,10 +128,10 @@ def _resize_and_annotate(
kind = h.get("kind", "rect")
label = h.get("label", "")
# Highlights arrive in CSS px → convert to image px.
ix = h["x"] / css_scale
iy = h["y"] / css_scale
iw = h.get("w", 0) / css_scale
ih = h.get("h", 0) / css_scale
ix = h["x"] / css_to_image
iy = h["y"] / css_to_image
iw = h.get("w", 0) / css_to_image
ih = h.get("h", 0) / css_to_image
if kind == "point":
cx, cy, r = ix, iy, 10
@@ -169,7 +171,6 @@ def _resize_and_annotate(
return (
base64.b64encode(buf.getvalue()).decode(),
round(physical_scale, 4),
round(css_scale, 4),
)
except Exception:
logger.warning(
@@ -179,30 +180,37 @@ def _resize_and_annotate(
dpr,
exc_info=True,
)
return data, 1.0, 1.0
return data, 1.0
async def _ensure_css_scale(tab_id: int) -> float:
"""Return the image→CSS scale for ``tab_id``, populating the cache
via ``window.innerWidth`` if missing. Used by click tools when the
agent clicks before the first screenshot has been taken.
async def _ensure_viewport_size(tab_id: int) -> tuple[int, int]:
"""Return ``(cssWidth, cssHeight)`` for ``tab_id``, populating the
cache via ``window.innerWidth`` / ``window.innerHeight`` on miss.
Used by click / hover / press tools to turn fractional inputs
(0..1) into CSS px, and by rect tools to turn CSS-px rects into
fractions. Degrades to ``(1, 1)`` if the bridge can't be queried —
that makes every coord an identity op, which is a safe no-op
(and preferable to crashing).
"""
cached = _screenshot_css_scales.get(tab_id)
if cached is not None and cached > 0:
cached = _viewport_sizes.get(tab_id)
if cached is not None and cached[0] > 0 and cached[1] > 0:
return cached
bridge = get_bridge()
try:
result = await bridge.evaluate(tab_id, "({w: window.innerWidth})")
inner = float(((result or {}).get("result") or {}).get("w") or 0)
result = await bridge.evaluate(tab_id, "({w: window.innerWidth, h: window.innerHeight})")
inner = (result or {}).get("result") or {}
cw = int(float(inner.get("w") or 0))
ch = int(float(inner.get("h") or 0))
except Exception:
inner = 0.0
if inner <= 0:
# Degraded: no viewport width available. Treat image px as CSS px.
scale = 1.0
else:
scale = inner / _SCREENSHOT_WIDTH
_screenshot_css_scales[tab_id] = scale
return scale
cw, ch = 0, 0
if cw <= 0 or ch <= 0:
# Degraded: bridge didn't return viewport. Cache an identity
# so we don't retry on every call; corrects itself after the
# next successful browser_screenshot.
cw, ch = 1, 1
_viewport_sizes[tab_id] = (cw, ch)
return cw, ch
def register_inspection_tools(mcp: FastMCP) -> None:
@@ -219,22 +227,28 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"""
Take a screenshot of the current page.
Image is 800 px wide (JPEG quality 75, ~50–120 KB). A pixel you
see in this image is the same number you pass to
``browser_click_coordinate`` / ``browser_hover_coordinate`` /
``browser_press_at`` — the tools translate to CSS internally.
Image is 800 px wide (JPEG quality 75, ~50–120 KB). All
coordinate tools work in **fractions of the viewport (0..1)**,
not pixels, so read a target's proportional position off this
image ("~35 % from the left, ~20 % from the top") and pass
``(0.35, 0.20)`` to ``browser_click_coordinate`` /
``browser_hover_coordinate`` / ``browser_press_at``.
``browser_get_rect`` and ``browser_shadow_query`` likewise
return coordinates in screenshot pixels.
return coordinates as fractions.
Args:
tab_id: Chrome tab ID (default: active tab)
profile: Browser profile name (default: "default")
full_page: Capture full scrollable page (default: False)
full_page: Capture full scrollable page (default: False).
Note: full_page images extend beyond the viewport, so
fractions read off them do NOT map cleanly to
viewport-space clicks. Use for reading / overview only,
not for pointing.
selector: CSS selector to screenshot a specific element (optional)
annotate: Draw bounding box of last interaction on image (default: True)
Returns:
List of content blocks: text metadata + image
List of content blocks: text metadata + image.
"""
start = time.perf_counter()
params = {
@@ -299,18 +313,20 @@ def register_inspection_tools(mcp: FastMCP) -> None:
# Image.open/resize/ImageDraw/composite on a 2-megapixel
# PNG blocks for ~150–300 ms of CPU — plenty to freeze the
# asyncio event loop. Reentrant: no shared state.
data, physical_scale, css_scale = await asyncio.to_thread(
data, physical_scale = await asyncio.to_thread(
_resize_and_annotate,
data,
css_width,
dpr,
highlights,
)
# Refresh caches so click / hover / press / rect tools can
# translate image px ↔ CSS px without asking the page again.
if target_tab is not None:
# Cache live viewport dimensions so click / hover / press /
# rect tools can translate fractions ↔ CSS px without
# asking the page again.
css_height = int(screenshot_result.get("cssHeight", 0)) or 0
if target_tab is not None and css_width > 0 and css_height > 0:
_viewport_sizes[target_tab] = (int(css_width), css_height)
_screenshot_scales[target_tab] = physical_scale
_screenshot_css_scales[target_tab] = css_scale
meta = json.dumps(
{
@@ -321,18 +337,21 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"size": len(base64.b64decode(data)) if data else 0,
"imageWidth": _SCREENSHOT_WIDTH,
"cssWidth": css_width,
"cssHeight": css_height,
"fullPage": full_page,
"devicePixelRatio": dpr,
"physicalScale": physical_scale,
"cssScale": css_scale,
"annotated": bool(highlights),
"scaleHint": (
"Image is 800 px wide. Pass pixel coordinates "
"you read off this image straight into "
"Coordinates for click / hover / press are "
"fractions 0..1 of the viewport. Read a "
"target's proportional position off this image "
"(e.g. '~35 % from the left, ~20 % from the top' "
"→ (0.35, 0.20)) and pass that to "
"browser_click_coordinate / "
"browser_hover_coordinate / browser_press_at "
"the tools translate image px → CSS px "
"internally (cssScale is for debug only)."
"browser_hover_coordinate / browser_press_at. "
"browser_get_rect / browser_shadow_query / "
"focused_element.rect return fractions too."
),
}
)
@@ -345,7 +364,7 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"size": len(base64.b64decode(data)) if data else 0,
"url": screenshot_result.get("url", ""),
"cssWidth": css_width,
"cssScale": css_scale,
"cssHeight": css_height,
"physicalScale": physical_scale,
"dpr": dpr,
},
@@ -376,9 +395,9 @@ def register_inspection_tools(mcp: FastMCP) -> None:
Traverses shadow roots to find elements inside closed/open shadow DOM,
overlays, and virtual-rendered components (e.g. LinkedIn's #interop-outlet).
Returns the element's bounding rect in screenshot pixels — feed
``rect.cx`` / ``rect.cy`` straight into browser_click_coordinate
/ hover_coordinate / press_at.
Returns the element's bounding rect as **fractions of the
viewport (0..1)** — feed ``rect.cx`` / ``rect.cy`` straight
into browser_click_coordinate / hover_coordinate / press_at.
Args:
selector: CSS selectors joined by ' >>> ' to pierce shadow roots.
@@ -387,7 +406,8 @@ def register_inspection_tools(mcp: FastMCP) -> None:
profile: Browser profile name (default: "default")
Returns:
Dict with ``rect`` block (x, y, w, h, cx, cy) in screenshot pixels.
Dict with ``rect`` block (x, y, w, h, cx, cy) as fractions,
plus ``cssWidth`` / ``cssHeight`` for reference.
"""
bridge = get_bridge()
if not bridge or not bridge.is_connected:
@@ -404,23 +424,26 @@ def register_inspection_tools(mcp: FastMCP) -> None:
return result
rect = result["rect"]
css_scale = await _ensure_css_scale(target_tab)
s = css_scale if css_scale > 0 else 1.0
cw, ch = await _ensure_viewport_size(target_tab)
cw_f = float(cw) if cw > 0 else 1.0
ch_f = float(ch) if ch > 0 else 1.0
return {
"ok": True,
"selector": selector,
"tag": rect.get("tag"),
"rect": {
"x": round(rect["x"] / s, 1),
"y": round(rect["y"] / s, 1),
"w": round(rect["w"] / s, 1),
"h": round(rect["h"] / s, 1),
"cx": round(rect["cx"] / s, 1),
"cy": round(rect["cy"] / s, 1),
"x": round(rect["x"] / cw_f, 4),
"y": round(rect["y"] / ch_f, 4),
"w": round(rect["w"] / cw_f, 4),
"h": round(rect["h"] / ch_f, 4),
"cx": round(rect["cx"] / cw_f, 4),
"cy": round(rect["cy"] / ch_f, 4),
},
"cssWidth": cw,
"cssHeight": ch,
"note": (
"rect fields are in screenshot pixels. Pass rect.cx / "
"rect.cy to browser_click_coordinate / "
"rect fields are fractions of the viewport (0..1). "
"Pass rect.cx / rect.cy to browser_click_coordinate / "
"hover_coordinate / press_at."
),
}
@@ -435,9 +458,9 @@ def register_inspection_tools(mcp: FastMCP) -> None:
Get the bounding rect of an element by CSS selector.
Supports '>>>' shadow-piercing selectors for overlay/shadow DOM
content. Returns the rect in screenshot pixels — the same
numbers you'd read off a browser_screenshot, and the same
numbers browser_click_coordinate expects.
content. Returns the rect as **fractions of the viewport
(0..1)** — the same coordinate space browser_click_coordinate
/ hover_coordinate / press_at expect.
Args:
selector: CSS selector, optionally with ' >>> ' to pierce shadow roots.
@@ -446,7 +469,8 @@ def register_inspection_tools(mcp: FastMCP) -> None:
profile: Browser profile name (default: "default")
Returns:
Dict with ``rect`` block (x, y, w, h, cx, cy) in screenshot pixels.
Dict with ``rect`` block (x, y, w, h, cx, cy) as fractions,
plus ``cssWidth`` / ``cssHeight`` for reference.
"""
bridge = get_bridge()
if not bridge or not bridge.is_connected:
@@ -463,23 +487,26 @@ def register_inspection_tools(mcp: FastMCP) -> None:
return result
rect = result["rect"]
css_scale = await _ensure_css_scale(target_tab)
s = css_scale if css_scale > 0 else 1.0
cw, ch = await _ensure_viewport_size(target_tab)
cw_f = float(cw) if cw > 0 else 1.0
ch_f = float(ch) if ch > 0 else 1.0
return {
"ok": True,
"selector": selector,
"tag": rect.get("tag"),
"rect": {
"x": round(rect["x"] / s, 1),
"y": round(rect["y"] / s, 1),
"w": round(rect["w"] / s, 1),
"h": round(rect["h"] / s, 1),
"cx": round(rect["cx"] / s, 1),
"cy": round(rect["cy"] / s, 1),
"x": round(rect["x"] / cw_f, 4),
"y": round(rect["y"] / ch_f, 4),
"w": round(rect["w"] / cw_f, 4),
"h": round(rect["h"] / ch_f, 4),
"cx": round(rect["cx"] / cw_f, 4),
"cy": round(rect["cy"] / ch_f, 4),
},
"cssWidth": cw,
"cssHeight": ch,
"note": (
"rect fields are in screenshot pixels. Pass rect.cx / "
"rect.cy to browser_click_coordinate / "
"rect fields are fractions of the viewport (0..1). "
"Pass rect.cx / rect.cy to browser_click_coordinate / "
"hover_coordinate / press_at."
),
}
+78 -39
@@ -108,25 +108,31 @@ def register_interaction_tools(mcp: FastMCP) -> None:
button: Literal["left", "right", "middle"] = "left",
) -> dict:
"""
Click at the given SCREENSHOT pixel.
Click at a FRACTION of the viewport (0..1, 0..1).
``x`` and ``y`` are pixel coordinates read directly off a
``browser_screenshot`` image (800 px wide JPEG). The tool
multiplies them by the cached image → CSS scale for the tab
before dispatching to Chrome — no scale awareness required on
the caller side. ``browser_get_rect`` / ``browser_shadow_query``
return coordinates in the same (screenshot) space.
Coordinates are **fractions of the viewport**, not pixels:
``(0.5, 0.5)`` is the center, ``(0.1, 0.2)`` is 10% from the
left and 20% from the top. Read a target's proportional
position off ``browser_screenshot`` (or pass
``rect.cx`` / ``rect.cy`` from ``browser_get_rect`` /
``browser_shadow_query`` directly; they return fractions too).
Fractions are used because every vision model resizes or tiles
images differently (Claude ~1.15 MP target, GPT-4o 512-px
tiles, etc.). Proportional positions survive every such
transform; pixel coords do not.
Args:
x: X coordinate in screenshot pixels.
y: Y coordinate in screenshot pixels.
x: X fraction of the viewport (0..1).
y: Y fraction of the viewport (0..1).
tab_id: Chrome tab ID (default: active tab)
profile: Browser profile name (default: "default")
button: Mouse button to click (left, right, middle)
Returns:
Dict with click result, including ``focused_element``
describing what the click focused.
describing what the click focused. ``focused_element.rect``
is also in fractions.
"""
start = time.perf_counter()
params = {"x": x, "y": y, "tab_id": tab_id, "profile": profile, "button": button}
@@ -149,18 +155,33 @@ def register_interaction_tools(mcp: FastMCP) -> None:
log_tool_call("browser_click_coordinate", params, result=result)
return result
try:
from .inspection import _ensure_css_scale
# Pixel-input guard: legitimate fractions live in [0, 1]. Allow a
# small overshoot tolerance for edge targets.
if x > 1.5 or y > 1.5 or x < -0.1 or y < -0.1:
result = {
"ok": False,
"error": (
f"Coords ({x}, {y}) look like pixels. This tool expects "
"fractions 0..1 of the viewport. Read the target's "
"proportional position off browser_screenshot, or pass "
"rect.cx / rect.cy from browser_get_rect / "
"browser_shadow_query (they return fractions)."
),
}
log_tool_call("browser_click_coordinate", params, result=result)
return result
css_scale = await _ensure_css_scale(target_tab)
s = css_scale if css_scale > 0 else 1.0
css_x = x * s
css_y = y * s
try:
from .inspection import _ensure_viewport_size
cw, ch = await _ensure_viewport_size(target_tab)
css_x = x * cw
css_y = y * ch
click_result = await bridge.click_coordinate(target_tab, css_x, css_y, button=button)
log_tool_call(
"browser_click_coordinate",
params,
result={**click_result, "cssScale": round(css_scale, 4)},
result={**click_result, "cssWidth": cw, "cssHeight": ch},
duration_ms=(time.perf_counter() - start) * 1000,
)
return click_result
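The guard-then-convert flow of this hunk can be sketched in isolation; a hypothetical helper under the same assumptions (fractions in `[0, 1]` with a small overshoot tolerance, multiplied by CSS viewport size):

```python
def fraction_to_css(x: float, y: float, css_w: int, css_h: int) -> tuple[float, float]:
    """Convert viewport fractions to CSS pixels, rejecting pixel-looking input.

    Same shape as the guard in browser_click_coordinate: legitimate
    fractions live in [0, 1]; values far outside that range are almost
    certainly pixels passed by mistake.
    """
    if x > 1.5 or y > 1.5 or x < -0.1 or y < -0.1:
        raise ValueError(
            f"Coords ({x}, {y}) look like pixels; this expects fractions 0..1"
        )
    return (x * css_w, y * css_h)

print(fraction_to_css(0.35, 0.20, 1280, 720))  # → (448.0, 144.0)
```

Rejecting loudly instead of clamping matters here: a caller that passes `(448, 144)` would otherwise click the far bottom-right corner silently.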
@@ -485,17 +506,16 @@ def register_interaction_tools(mcp: FastMCP) -> None:
profile: str | None = None,
) -> dict:
"""
Hover at the given SCREENSHOT pixel.
Hover at a FRACTION of the viewport (0..1, 0..1).
Use this instead of browser_hover when the element is in an overlay,
shadow DOM, or virtual-rendered component that isn't in the regular DOM.
``x`` / ``y`` are pixel coordinates read directly off a
``browser_screenshot`` image; the tool translates to CSS px
internally before dispatching to Chrome.
``x`` / ``y`` are fractions of the viewport (``0.5`` = center);
the tool converts to CSS px internally.
Args:
x: X coordinate in screenshot pixels.
y: Y coordinate in screenshot pixels.
x: X fraction of the viewport (0..1).
y: Y fraction of the viewport (0..1).
tab_id: Chrome tab ID (default: active tab)
profile: Browser profile name (default: "default")
@@ -523,12 +543,22 @@ def register_interaction_tools(mcp: FastMCP) -> None:
log_tool_call("browser_hover_coordinate", params, result=result)
return result
try:
from .inspection import _ensure_css_scale
if x > 1.5 or y > 1.5 or x < -0.1 or y < -0.1:
result = {
"ok": False,
"error": (
f"Coords ({x}, {y}) look like pixels. This tool expects "
"fractions 0..1 of the viewport."
),
}
log_tool_call("browser_hover_coordinate", params, result=result)
return result
css_scale = await _ensure_css_scale(target_tab)
s = css_scale if css_scale > 0 else 1.0
hover_result = await bridge.hover_coordinate(target_tab, x * s, y * s)
try:
from .inspection import _ensure_viewport_size
cw, ch = await _ensure_viewport_size(target_tab)
hover_result = await bridge.hover_coordinate(target_tab, x * cw, y * ch)
log_tool_call(
"browser_hover_coordinate",
params,
@@ -555,18 +585,17 @@ def register_interaction_tools(mcp: FastMCP) -> None:
profile: str | None = None,
) -> dict:
"""
Move mouse to the given SCREENSHOT pixel, then press a key.
Move mouse to a FRACTION of the viewport (0..1, 0..1), then press a key.
Use this instead of browser_press when the focused element is in an overlay
or virtual-rendered component. Moving the mouse first routes the key event
through native browser hit-testing instead of the DOM focus chain.
``x`` / ``y`` are pixel coordinates read directly off a
``browser_screenshot`` image; the tool translates to CSS px
internally.
``x`` / ``y`` are fractions of the viewport; the tool converts
to CSS px internally.
Args:
x: X coordinate in screenshot pixels.
y: Y coordinate in screenshot pixels.
x: X fraction of the viewport (0..1).
y: Y fraction of the viewport (0..1).
key: Key to press (e.g. 'Enter', 'Space', 'Escape', 'ArrowDown')
tab_id: Chrome tab ID (default: active tab)
profile: Browser profile name (default: "default")
@@ -595,12 +624,22 @@ def register_interaction_tools(mcp: FastMCP) -> None:
log_tool_call("browser_press_at", params, result=result)
return result
try:
from .inspection import _ensure_css_scale
if x > 1.5 or y > 1.5 or x < -0.1 or y < -0.1:
result = {
"ok": False,
"error": (
f"Coords ({x}, {y}) look like pixels. This tool expects "
"fractions 0..1 of the viewport."
),
}
log_tool_call("browser_press_at", params, result=result)
return result
css_scale = await _ensure_css_scale(target_tab)
s = css_scale if css_scale > 0 else 1.0
press_result = await bridge.press_key_at(target_tab, x * s, y * s, key)
try:
from .inspection import _ensure_viewport_size
cw, ch = await _ensure_viewport_size(target_tab)
press_result = await bridge.press_key_at(target_tab, x * cw, y * ch, key)
log_tool_call(
"browser_press_at",
params,
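The invariance argument behind the whole change (proportions survive any resize; pixels do not) can be checked in a few lines. Illustrative widths only, not tied to any specific model's resizer:

```python
def as_fraction(px: float, image_width: float) -> float:
    """Proportional position of a pixel within a rendering of the screenshot."""
    return round(px / image_width, 4)

# The same on-screen point, 35% from the left edge, read off three
# differently sized renderings of one screenshot:
target_fraction = 0.35
for image_width in (800, 1024, 512):
    px = target_fraction * image_width   # where that point lands in each rendering
    assert as_fraction(px, image_width) == target_fraction

# A raw pixel coordinate (e.g. 280 in the 800-wide image) would have
# pointed at a different on-screen spot in the 512-wide one.
```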