feat: fraction-based visual clicks

Timothy
2026-04-16 22:36:41 -07:00
parent aba0ff07ba
commit 558813e7fa
7 changed files with 259 additions and 181 deletions
@@ -39,7 +39,7 @@ Neither tool is "preferred" universally — they're for different jobs. Default
## Coordinate rule
Every browser tool that takes or returns coordinates operates in **screenshot pixels** (the 800 px wide JPEG `browser_screenshot` delivers). Read a pixel off the image, pass it straight to `browser_click_coordinate` / `browser_hover_coordinate` / `browser_press_at`. `browser_get_rect` and `browser_shadow_query` return `rect.cx` / `rect.cy` in the same space. The tools translate to CSS px internally — no scale awareness required. Avoid raw `getBoundingClientRect()` via `browser_evaluate` for coord lookup; use `browser_get_rect` instead.
Every browser tool that takes or returns coordinates operates in **fractions of the viewport (0..1 for both axes)**. Read a target's proportional position off `browser_screenshot` ("~35% from the left, ~20% from the top" → `(0.35, 0.20)`) and pass that to `browser_click_coordinate` / `browser_hover_coordinate` / `browser_press_at`. `browser_get_rect` and `browser_shadow_query` return `rect.cx` / `rect.cy` as fractions. The tools multiply by `cssWidth` / `cssHeight` internally — no scale awareness required. Fractions are used because every vision model (Claude, GPT-4o, Gemini, local VLMs) resizes/tiles images differently; proportions are invariant. Avoid raw `getBoundingClientRect()` via `browser_evaluate` for coord lookup; use `browser_get_rect` instead.
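A minimal sketch of the conversion the tools perform internally (the function and the 1280x720 viewport are illustrative, not the actual implementation):

```python
def fraction_to_css(fx: float, fy: float, css_width: int, css_height: int) -> tuple[float, float]:
    """Multiply viewport fractions (0..1) by the CSS viewport size."""
    return fx * css_width, fy * css_height

# "~35% from the left, ~20% from the top" on an assumed 1280x720 viewport:
x, y = fraction_to_css(0.35, 0.20, 1280, 720)  # roughly (448, 144) CSS px
```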
## System prompt tips for browser nodes
+14 -9
@@ -45,18 +45,23 @@ CSS selector.
## Coordinates
Every browser tool that takes or returns coordinates operates in
**screenshot pixels** (the 800 px wide JPEG `browser_screenshot`
delivers). Read a pixel off the image, pass it to
`browser_click_coordinate` / `browser_hover_coordinate` /
`browser_press_at`. `browser_get_rect` and `browser_shadow_query`
return `rect.cx` / `rect.cy` in the same space. The tools handle the
image-px ↔ CSS-px translation internally; you do not need to know
about CSS pixels, DPR, or any scale factor.
**fractions of the viewport (0..1 for both axes)**. Read a target's
proportional position off `browser_screenshot` — "this button is
~35% from the left, ~20% from the top" → pass `(0.35, 0.20)`.
`browser_get_rect` and `browser_shadow_query` return `rect.cx` /
`rect.cy` as fractions in the same space. The tools handle the
fraction → CSS-px multiplication internally; you do not need to
track image pixels, DPR, or any scale factor.
Why fractions: every vision model (Claude, GPT-4o, Gemini, local
VLMs) resizes or tiles images differently before the model sees the
pixels. Proportions survive every such transform; pixel coordinates
only "work" per-model and break when you swap backends.
Avoid raw `browser_evaluate` + `getBoundingClientRect()` for coord
lookup that returns CSS px and will be mis-scaled when fed to click
lookup that returns CSS px and will be wrong when fed to click
tools. Prefer `browser_get_rect` / `browser_shadow_query`, which
convert for you.
return fractions.
## Rich-text editors (X, LinkedIn DMs, Gmail, Reddit, Slack, Discord)
@@ -14,18 +14,20 @@ All GCU browser tools drive a real Chrome instance through the Beeline extension
## Coordinates
Every browser tool that takes or returns coordinates operates in **screenshot pixels** (the 800 px wide JPEG `browser_screenshot` delivers). Take a screenshot, read a pixel off the image, pass that number to `browser_click_coordinate` / `browser_hover_coordinate` / `browser_press_at`. Rect-returning tools (`browser_get_rect`, `browser_shadow_query`, and the `rect` inside `focused_element`) also return screenshot pixels. You do not need to convert anything, track scale factors, or know about CSS pixels or device pixel ratio — the tools translate internally before dispatching to Chrome.
Every browser tool that takes or returns coordinates operates in **fractions of the viewport (0..1 for both axes)**. Read a target's proportional position off `browser_screenshot` — "this button is about 35% from the left and 20% from the top" → pass `(0.35, 0.20)`. Rect-returning tools (`browser_get_rect`, `browser_shadow_query`, and the `rect` inside `focused_element`) also return fractions. The tools convert to CSS pixels internally before dispatching to Chrome.
```
browser_screenshot() → image (800 px wide JPEG)
browser_click_coordinate(x, y) → x, y are screenshot px
browser_hover_coordinate(x, y) → x, y are screenshot px
browser_press_at(x, y, key) → x, y are screenshot px
browser_get_rect(selector) → rect → rect.cx / rect.cy are screenshot px
browser_screenshot() → image + cssWidth/cssHeight in meta
browser_click_coordinate(x, y) → x, y are fractions 0..1
browser_hover_coordinate(x, y) → fractions
browser_press_at(x, y, key) → fractions
browser_get_rect(selector) → rect → rect.cx / rect.cy are fractions
browser_shadow_query(...) → rect → same
```
**Exception for zoomed elements:** pages that use `zoom` or `transform: scale()` on a container (LinkedIn's `#interop-outlet`, some embedded iframes) render in a scaled local coordinate space. `getBoundingClientRect` there may not match CDP's hit space. Prefer `browser_shadow_query` (which handles the math) or visually pick coordinates from a screenshot. When in doubt, avoid raw `browser_evaluate` + `getBoundingClientRect()` for coord lookup — that returns CSS px and will be mis-scaled when passed to click tools.
**Why fractions:** every vision model (Claude ~1.15 MP target, GPT-4o 512-px tiles, Gemini, local VLMs) resizes or tiles images differently before the model sees the pixels. Proportions survive every such transform; pixel coordinates only "work" per-model and silently break when you swap backends. Four-decimal precision (`0.0001` ≈ 0.17 CSS px on a 1717-wide viewport) is more than enough for the tightest targets.
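The invariance claim can be checked directly: a proportional position survives any uniform resize, while the pixel position changes with the width (a minimal sketch; the widths are illustrative):

```python
import math

FRACTION = 0.35                       # target ~35% from the left

for image_width in (800, 1568, 512):  # raw output, a resize target, a tile size
    pixel_x = FRACTION * image_width  # pixel position differs per width...
    assert math.isclose(pixel_x / image_width, FRACTION)  # ...the fraction does not
```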
**Exception for zoomed elements:** pages that use `zoom` or `transform: scale()` on a container (LinkedIn's `#interop-outlet`, some embedded iframes) render in a scaled local coordinate space. `getBoundingClientRect` there may not match CDP's hit space. Prefer `browser_shadow_query` (which handles the math and returns fractions) or visually pick coordinates from a screenshot. Avoid raw `browser_evaluate` + `getBoundingClientRect()` for coord lookup — that returns CSS px and will be wrong when fed to click tools.
## Screenshot + coordinates is shadow-agnostic — prefer it on shadow-heavy sites
@@ -41,9 +43,9 @@ Whereas `wait_for_selector`, `browser_click(selector=...)`, `browser_type(select
### Recommended workflow on shadow-heavy sites
1. `browser_screenshot()` → 800 px wide JPEG.
2. Identify the target visually → pixel `(x, y)` read straight off the image.
3. `browser_click_coordinate(x, y)` → lands on the element via native hit testing; inputs get focused. **The response includes `focused_element: {tag, id, role, contenteditable, rect, inFrame?, ...}`** — use it to verify you actually focused what you intended. `rect` is in screenshot pixels (same space as the image). When focus is inside a same-origin iframe, the descriptor reports the inner element and adds `inFrame: [...]` breadcrumbs.
1. `browser_screenshot()` → JPEG; meta includes `cssWidth`/`cssHeight` for reference.
2. Identify the target visually → estimate its proportional position `(fx, fy)` where each is in `0..1`.
3. `browser_click_coordinate(fx, fy)` → tool converts to CSS px and dispatches; CDP native hit testing focuses the element. **The response includes `focused_element: {tag, id, role, contenteditable, rect, inFrame?, ...}`** — use it to verify you actually focused what you intended. `rect` is in fractions (same space as your input). When focus is inside a same-origin iframe, the descriptor reports the inner element and adds `inFrame: [...]` breadcrumbs.
4. `browser_type_focused(text="...")` → inserts text into `document.activeElement` (traverses into same-origin iframes automatically). Shadow roots, iframes, Lexical, Draft.js, ProseMirror all just work. Use `browser_type(selector, text)` instead when you have a reliable CSS selector for a light-DOM element.
5. Verify via `browser_screenshot` OR `browser_get_attribute` on a known-reachable marker (e.g. check that the Send button's `aria-disabled` flipped to `false`).
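The five steps read as the following call sequence (coordinates and selector are illustrative):

```
browser_screenshot()                       # 1. look at the page
# 2. target reads as ~40% from the left, ~62% from the top
browser_click_coordinate(0.40, 0.62)       # 3. check focused_element in the response
browser_type_focused(text="hello")         # 4. inserts into document.activeElement
browser_get_attribute("button.send", "aria-disabled")  # 5. verify, or re-screenshot
```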
@@ -74,7 +76,7 @@ browser_shadow_query("reddit-search-large >>> #search-input")
browser_get_rect("#interop-outlet >>> #ember37 >>> p")
```
Returns the element's rect in **screenshot pixels** (feed `rect.cx` / `rect.cy` directly to click tools). Remember: `browser_type` and `wait_for_selector` do **not** support `>>>` — only shadow_query and get_rect do.
Returns the element's rect as **fractions of the viewport** (feed `rect.cx` / `rect.cy` directly to click tools). Remember: `browser_type` and `wait_for_selector` do **not** support `>>>` — only shadow_query and get_rect do.
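A shadow-pierced rect therefore feeds the click tools directly (values illustrative):

```
browser_get_rect("#interop-outlet >>> #ember37 >>> p")
# → {"ok": true, "rect": {"cx": 0.4123, "cy": 0.2851, ...}, "cssWidth": 1717, ...}
browser_click_coordinate(0.4123, 0.2851)
```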
## Navigation and waiting
@@ -215,11 +217,11 @@ Recognized without modifiers: `Enter`, `Tab`, `Escape`, `Backspace`, `Delete`, `
```
browser_screenshot() # viewport, 800 px wide JPEG
browser_screenshot(full_page=True) # full scrollable page
browser_screenshot(full_page=True) # full scrollable page (overview only — don't click off a full-page shot)
browser_screenshot(selector="#header") # clip to element's rect
```
Returns a JPEG (quality 75, ~50–120 KB) fixed at **800 px wide** — well below the vision-API resize threshold, so the model sees the exact pixels we emit. Metadata includes `imageWidth` (800), `cssWidth` (the page's real viewport width), `cssScale` (for debug only), and `physicalScale`. The image is annotated with a highlight rectangle/dot showing the last interaction (click, hover, type) if one happened on this tab.
Returns a JPEG (quality 75, ~50–120 KB) at 800 px wide. The pixel width is purely a bandwidth choice; all tool coordinates are fractions of the viewport and are invariant to image size. Metadata includes `imageWidth` (800), `cssWidth`, `cssHeight` (for reference), and `physicalScale`. The image is annotated with a highlight rectangle/dot showing the last interaction (click, hover, type) if one happened on this tab.
The highlight overlay stays visible on the page for **10 seconds** after each interaction, then fades. If you want the highlight to appear in a screenshot, take the screenshot within 10 s of the click / hover / type.
@@ -347,7 +349,8 @@ Then pass the most specific selector that uniquely identifies the right input (e
- **Typing into a rich-text editor without clicking first → send button stays disabled.** Draft.js (X), Lexical (Gmail, LinkedIn DMs), ProseMirror (Reddit), and React-controlled `contenteditable` elements only register input as "real" when the element received a native focus event — JS-sourced `.focus()` is not enough. `browser_type` now does this automatically via a real CDP pointer click before inserting text, but always verify the submit button's `disabled` state before clicking send. See the "ALWAYS click before typing" section above.
- **Using per-character `keyDown` on Lexical / Draft.js editors → keys dispatch but text never appears.** Those editors intercept `beforeinput` and route insertion through their own state machine; raw keyDown events are silently dropped. `browser_type` now uses `Input.insertText` by default (the CDP IME-commit method) which these editors accept cleanly. Only set `use_insert_text=False` when you explicitly need per-keystroke dispatch.
- **Leaving a composer with text then trying to navigate → `beforeunload` dialog hangs the bridge.** LinkedIn and several other sites pop a native "unsent message" confirm. `browser_navigate` and `close_tab` both time out against this. Always strip `window.onbeforeunload = null` via `browser_evaluate` before any navigation after typing in a composer, or wrap your logic in a `try/finally` that runs the cleanup block.
- **Click landed in the wrong region (sidebar / header instead of target).** Check `focused_element` in the click response — it's ground truth for what actually got focused, including the `inFrame` breadcrumb when focus ends up inside a same-origin iframe. If it isn't the target (e.g. `className: "msg-conversation-listitem__link"` when you meant to hit a composer), adjust the pixel and retry. Coordinates you pass are screenshot pixels; the tool translates to CSS px internally, so a wrong result means you picked the wrong pixel off the image — not that any scale went sideways.
- **Click landed in the wrong region (sidebar / header instead of target).** Check `focused_element` in the click response — it's ground truth for what actually got focused, including the `inFrame` breadcrumb when focus ends up inside a same-origin iframe. If it isn't the target (e.g. `className: "msg-conversation-listitem__link"` when you meant to hit a composer), adjust the fraction and retry. Coordinates you pass are fractions of the viewport; the tool multiplies by `cssWidth` / `cssHeight` internally, so a wrong result means your estimated proportion was off — not that any scale went sideways.
- **Accidentally passing pixels to click / hover / press_at.** The tools reject any coord outside `[-0.1, 1.5]` with a clear error. If you see that error, you passed a pixel (like 815) instead of a fraction (like 0.475). Use `browser_get_rect` to get exact fractional cx/cy, or read proportions off `browser_screenshot`.
- **Calling `wait_for_selector` on a shadow element.** It'll always time out. Use `browser_shadow_query` or the screenshot + coordinate strategy.
- **Relying on `innerHTML` in injected scripts on LinkedIn.** Silently discarded. Use `createElement` + `appendChild`.
- **Not waiting for SPA hydration.** `wait_until="load"` fires before React/Vue rendering on many sites. Add a 2–3 s sleep before querying for chrome elements.
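For the `beforeunload` pitfall above, the cleanup sequence is (a sketch using the tool names from this doc; `url` is a placeholder):

```
browser_evaluate("window.onbeforeunload = null")   # strip the unsent-message confirm
browser_navigate(url)                              # now safe, no native dialog to hang the bridge
```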
+16 -13
@@ -962,9 +962,9 @@ class BeelineBridge:
"""Read document.activeElement and return a compact descriptor.
The JS returns ``rect`` fields in CSS px (they come straight
from ``getBoundingClientRect``). We scale them to screenshot
pixels here so the agent sees a rect in the same coord space
it passed to click / hover / press_at.
from ``getBoundingClientRect``). We convert them to fractions
of the viewport here so the agent sees a rect in the same
coord space it passed to click / hover / press_at.
Returns None on any failure — never raises.
"""
@@ -973,20 +973,23 @@ class BeelineBridge:
result = await self.evaluate(tab_id, _FOCUSED_ELEMENT_JS)
info = (result or {}).get("result")
if info and isinstance(info, dict) and isinstance(info.get("rect"), dict):
# Convert CSS px rect → screenshot px using the cached
# scale. Fall back to 1.0 if no screenshot has been
# taken yet on this tab.
from .tools.inspection import _screenshot_css_scales
from .tools.inspection import _viewport_sizes
scale = _screenshot_css_scales.get(tab_id, 1.0) or 1.0
if scale > 0 and scale != 1.0:
vp = _viewport_sizes.get(tab_id)
if vp and vp[0] > 0 and vp[1] > 0:
cw, ch = float(vp[0]), float(vp[1])
r = info["rect"]
info["rect"] = {
"x": round(r.get("x", 0) / scale, 1),
"y": round(r.get("y", 0) / scale, 1),
"width": round(r.get("width", 0) / scale, 1),
"height": round(r.get("height", 0) / scale, 1),
"x": round(r.get("x", 0) / cw, 4),
"y": round(r.get("y", 0) / ch, 4),
"width": round(r.get("width", 0) / cw, 4),
"height": round(r.get("height", 0) / ch, 4),
}
else:
# Degraded: cache missing (no screenshot taken
# yet). Leave rect in CSS px and flag it so the
# agent can tell.
info["rectSpace"] = "css"
return info
except Exception:
return None
+5 -4
@@ -256,12 +256,13 @@ def register_advanced_tools(mcp: FastMCP) -> None:
try:
result = await bridge.resize(target_tab, width, height)
# Invalidate per-tab scale caches — CSS width changed, so the
# cached image→CSS multiplier is stale. Click / rect tools
# will re-query innerWidth on next use via _ensure_css_scale.
# cached viewport dimensions are stale. Click / rect tools
# will re-query innerWidth / innerHeight on next use via
# _ensure_viewport_size.
try:
from .inspection import _screenshot_css_scales, _screenshot_scales
from .inspection import _screenshot_scales, _viewport_sizes
_screenshot_css_scales.pop(target_tab, None)
_viewport_sizes.pop(target_tab, None)
_screenshot_scales.pop(target_tab, None)
except Exception:
pass
+128 -101
@@ -24,21 +24,25 @@ from .tabs import _get_context
logger = logging.getLogger(__name__)
# Fixed output width for all screenshots. Chosen well below Anthropic's
# ~1568-px vision-API resize threshold so the image the server emits is
# the SAME image (pixel-for-pixel) the LLM sees. That preserves
# image_px == model_px, which is the cornerstone of the "LLM works in
# screenshot pixels only" contract — all click/hover/press/rect tools
# translate between image pixels and CSS pixels internally.
# Fixed output width for all screenshots (bandwidth default). This
# number does NOT affect coordinate semantics — click / hover / press
# and rect tools all work in fractions of the viewport (0..1), which
# are invariant to whatever resize / tile the vision API applies. The
# 800 px width is simply small enough to keep JPEG payloads under
# ~150 KB on typical UI screenshots.
_SCREENSHOT_WIDTH = 800
# Per-tab scale caches populated on every browser_screenshot and on
# lazy-init inside the click tools. Both are ``image_px × scale =
# target_px`` multipliers.
# - _screenshot_scales[tab] → physical scale (image → physical px, debug only)
# - _screenshot_css_scales[tab] → css scale (image → CSS px, used for Input events)
# Per-tab viewport-size cache populated on every browser_screenshot
# and on lazy-init inside the click tools. Stores CSS-pixel viewport
# dimensions (window.innerWidth / window.innerHeight). Click tools
# multiply fractional inputs by these to get CSS coords before
# dispatching CDP events; rect tools divide CSS-pixel DOM rects by
# these to produce fractions for the agent.
_viewport_sizes: dict[int, tuple[int, int]] = {}
# Optional debug cache — physical-px scale per tab (orig_png_w /
# _SCREENSHOT_WIDTH). Logged only; no consumer.
_screenshot_scales: dict[int, float] = {}
_screenshot_css_scales: dict[int, float] = {}
def _resize_and_annotate(
@@ -46,27 +50,24 @@ def _resize_and_annotate(
css_width: int,
dpr: float = 1.0,
highlights: list[dict] | None = None,
) -> tuple[str, float, float]:
) -> tuple[str, float]:
"""Resize the captured PNG down to ``_SCREENSHOT_WIDTH`` (=800 px)
and re-encode as JPEG quality 75.
CDP captures at the physical-pixel resolution (DPR × CSS). We
downscale to 800 px wide so the delivered image stays under
Anthropic's vision-API resize cap — the model sees pixel-for-pixel
what we send.
The image dimensions do NOT determine click coordinates anymore —
the tools work in viewport fractions. This helper exists purely
for bandwidth + annotation overlay. Returns ``(new_b64,
physical_scale)`` where ``physical_scale = orig_png_w / output_w``
is kept for debug logging.
Returns ``(new_b64, physical_scale, css_scale)`` where
- ``physical_scale = orig_png_w / _SCREENSHOT_WIDTH`` (image physical px)
- ``css_scale = css_width / _SCREENSHOT_WIDTH`` (image CSS px)
Highlight rects arrive in CSS px and are divided by ``css_scale``
before drawing so overlays land in the correct spot on the
800-wide output.
Highlight rects arrive in CSS px; they're converted to image-space
for overlay drawing via the local ``css_to_image = css_width /
output_w`` factor (computed inline — no external cache).
"""
if not css_width or css_width <= 0:
# Bridge always supplies css_width from window.innerWidth; only
# reach here on a degraded response. Return the raw PNG.
return data, 1.0, 1.0
return data, 1.0
try:
from PIL import Image, ImageDraw, ImageFont
@@ -78,17 +79,15 @@ def _resize_and_annotate(
orig_w = struct.unpack(">I", raw[16:20])[0]
physical_scale = orig_w / _SCREENSHOT_WIDTH if orig_w else 1.0
css_scale = css_width / _SCREENSHOT_WIDTH
logger.warning(
"PIL not available — screenshot resize SKIPPED. "
"Returning raw physical-px PNG. physicalScale=%.4f, "
"cssScale=%.4f, css_width=%d, dpr=%s. Install Pillow for correct clicks.",
"css_width=%d, dpr=%s. Install Pillow for annotation.",
physical_scale,
css_scale,
css_width,
dpr,
)
return data, round(physical_scale, 4), round(css_scale, 4)
return data, round(physical_scale, 4)
try:
raw = base64.b64decode(data)
@@ -96,14 +95,17 @@ def _resize_and_annotate(
orig_w, orig_h = img.size
physical_scale = orig_w / _SCREENSHOT_WIDTH
css_scale = css_width / _SCREENSHOT_WIDTH
new_w = _SCREENSHOT_WIDTH
new_h = round(orig_h * new_w / orig_w)
if (new_w, new_h) != img.size:
img = img.resize((new_w, new_h), Image.LANCZOS)
# Local CSS → image px factor for overlay draws. Kept local —
# not exported, not stored, not leaked to the agent.
css_to_image = css_width / _SCREENSHOT_WIDTH
logger.info(
"Screenshot: orig=%dx%d → out=%dx%d (css_width=%d, dpr=%s), physicalScale=%.4f, cssScale=%.4f",
"Screenshot: orig=%dx%d → out=%dx%d (css_width=%d, dpr=%s), physicalScale=%.4f, css_to_image=%.4f",
orig_w,
orig_h,
new_w,
@@ -111,7 +113,7 @@ def _resize_and_annotate(
css_width,
dpr,
physical_scale,
css_scale,
css_to_image,
)
if highlights:
@@ -126,10 +128,10 @@ def _resize_and_annotate(
kind = h.get("kind", "rect")
label = h.get("label", "")
# Highlights arrive in CSS px → convert to image px.
ix = h["x"] / css_scale
iy = h["y"] / css_scale
iw = h.get("w", 0) / css_scale
ih = h.get("h", 0) / css_scale
ix = h["x"] / css_to_image
iy = h["y"] / css_to_image
iw = h.get("w", 0) / css_to_image
ih = h.get("h", 0) / css_to_image
if kind == "point":
cx, cy, r = ix, iy, 10
@@ -169,7 +171,6 @@ def _resize_and_annotate(
return (
base64.b64encode(buf.getvalue()).decode(),
round(physical_scale, 4),
round(css_scale, 4),
)
except Exception:
logger.warning(
@@ -179,30 +180,37 @@ def _resize_and_annotate(
dpr,
exc_info=True,
)
return data, 1.0, 1.0
return data, 1.0
async def _ensure_css_scale(tab_id: int) -> float:
"""Return the image→CSS scale for ``tab_id``, populating the cache
via ``window.innerWidth`` if missing. Used by click tools when the
agent clicks before the first screenshot has been taken.
async def _ensure_viewport_size(tab_id: int) -> tuple[int, int]:
"""Return ``(cssWidth, cssHeight)`` for ``tab_id``, populating the
cache via ``window.innerWidth`` / ``window.innerHeight`` on miss.
Used by click / hover / press tools to turn fractional inputs
(0..1) into CSS px, and by rect tools to turn CSS-px rects into
fractions. Degrades to ``(1, 1)`` if the bridge can't be queried —
that makes every coord an identity op, which is a safe no-op
(and preferable to crashing).
"""
cached = _screenshot_css_scales.get(tab_id)
if cached is not None and cached > 0:
cached = _viewport_sizes.get(tab_id)
if cached is not None and cached[0] > 0 and cached[1] > 0:
return cached
bridge = get_bridge()
try:
result = await bridge.evaluate(tab_id, "({w: window.innerWidth})")
inner = float(((result or {}).get("result") or {}).get("w") or 0)
result = await bridge.evaluate(tab_id, "({w: window.innerWidth, h: window.innerHeight})")
inner = (result or {}).get("result") or {}
cw = int(float(inner.get("w") or 0))
ch = int(float(inner.get("h") or 0))
except Exception:
inner = 0.0
if inner <= 0:
# Degraded: no viewport width available. Treat image px as CSS px.
scale = 1.0
else:
scale = inner / _SCREENSHOT_WIDTH
_screenshot_css_scales[tab_id] = scale
return scale
cw, ch = 0, 0
if cw <= 0 or ch <= 0:
# Degraded: bridge didn't return viewport. Cache an identity
# so we don't retry on every call; corrects itself after the
# next successful browser_screenshot.
cw, ch = 1, 1
_viewport_sizes[tab_id] = (cw, ch)
return cw, ch
def register_inspection_tools(mcp: FastMCP) -> None:
@@ -219,22 +227,28 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"""
Take a screenshot of the current page.
Image is 800 px wide (JPEG quality 75, ~50–120 KB). A pixel you
see in this image is the same number you pass to
``browser_click_coordinate`` / ``browser_hover_coordinate`` /
``browser_press_at`` — the tools translate to CSS internally.
Image is 800 px wide (JPEG quality 75, ~50–120 KB). All
coordinate tools work in **fractions of the viewport (0..1)**,
not pixels, so read a target's proportional position off this
image ("~35 % from the left, ~20 % from the top") and pass
``(0.35, 0.20)`` to ``browser_click_coordinate`` /
``browser_hover_coordinate`` / ``browser_press_at``.
``browser_get_rect`` and ``browser_shadow_query`` likewise
return coordinates in screenshot pixels.
return coordinates as fractions.
Args:
tab_id: Chrome tab ID (default: active tab)
profile: Browser profile name (default: "default")
full_page: Capture full scrollable page (default: False)
full_page: Capture full scrollable page (default: False).
Note: full_page images extend beyond the viewport, so
fractions read off them do NOT map cleanly to
viewport-space clicks. Use for reading / overview only,
not for pointing.
selector: CSS selector to screenshot a specific element (optional)
annotate: Draw bounding box of last interaction on image (default: True)
Returns:
List of content blocks: text metadata + image
List of content blocks: text metadata + image.
"""
start = time.perf_counter()
params = {
@@ -299,18 +313,20 @@ def register_inspection_tools(mcp: FastMCP) -> None:
# Image.open/resize/ImageDraw/composite on a 2-megapixel
# PNG blocks for ~150–300 ms of CPU — plenty to freeze the
# asyncio event loop. Reentrant: no shared state.
data, physical_scale, css_scale = await asyncio.to_thread(
data, physical_scale = await asyncio.to_thread(
_resize_and_annotate,
data,
css_width,
dpr,
highlights,
)
# Refresh caches so click / hover / press / rect tools can
# translate image px ↔ CSS px without asking the page again.
if target_tab is not None:
# Cache live viewport dimensions so click / hover / press /
# rect tools can translate fractions ↔ CSS px without
# asking the page again.
css_height = int(screenshot_result.get("cssHeight", 0)) or 0
if target_tab is not None and css_width > 0 and css_height > 0:
_viewport_sizes[target_tab] = (int(css_width), css_height)
_screenshot_scales[target_tab] = physical_scale
_screenshot_css_scales[target_tab] = css_scale
meta = json.dumps(
{
@@ -321,18 +337,21 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"size": len(base64.b64decode(data)) if data else 0,
"imageWidth": _SCREENSHOT_WIDTH,
"cssWidth": css_width,
"cssHeight": css_height,
"fullPage": full_page,
"devicePixelRatio": dpr,
"physicalScale": physical_scale,
"cssScale": css_scale,
"annotated": bool(highlights),
"scaleHint": (
"Image is 800 px wide. Pass pixel coordinates "
"you read off this image straight into "
"Coordinates for click / hover / press are "
"fractions 0..1 of the viewport. Read a "
"target's proportional position off this image "
"(e.g. '~35 % from the left, ~20 % from the top' "
"→ (0.35, 0.20)) and pass that to "
"browser_click_coordinate / "
"browser_hover_coordinate / browser_press_at "
"the tools translate image px → CSS px "
"internally (cssScale is for debug only)."
"browser_hover_coordinate / browser_press_at. "
"browser_get_rect / browser_shadow_query / "
"focused_element.rect return fractions too."
),
}
)
@@ -345,7 +364,7 @@ def register_inspection_tools(mcp: FastMCP) -> None:
"size": len(base64.b64decode(data)) if data else 0,
"url": screenshot_result.get("url", ""),
"cssWidth": css_width,
"cssScale": css_scale,
"cssHeight": css_height,
"physicalScale": physical_scale,
"dpr": dpr,
},
@@ -376,9 +395,9 @@ def register_inspection_tools(mcp: FastMCP) -> None:
Traverses shadow roots to find elements inside closed/open shadow DOM,
overlays, and virtual-rendered components (e.g. LinkedIn's #interop-outlet).
Returns the element's bounding rect in screenshot pixels — feed
``rect.cx`` / ``rect.cy`` straight into browser_click_coordinate
/ hover_coordinate / press_at.
Returns the element's bounding rect as **fractions of the
viewport (0..1)** — feed ``rect.cx`` / ``rect.cy`` straight
into browser_click_coordinate / hover_coordinate / press_at.
Args:
selector: CSS selectors joined by ' >>> ' to pierce shadow roots.
@@ -387,7 +406,8 @@ def register_inspection_tools(mcp: FastMCP) -> None:
profile: Browser profile name (default: "default")
Returns:
Dict with ``rect`` block (x, y, w, h, cx, cy) in screenshot pixels.
Dict with ``rect`` block (x, y, w, h, cx, cy) as fractions,
plus ``cssWidth`` / ``cssHeight`` for reference.
"""
bridge = get_bridge()
if not bridge or not bridge.is_connected:
@@ -404,23 +424,26 @@ def register_inspection_tools(mcp: FastMCP) -> None:
return result
rect = result["rect"]
css_scale = await _ensure_css_scale(target_tab)
s = css_scale if css_scale > 0 else 1.0
cw, ch = await _ensure_viewport_size(target_tab)
cw_f = float(cw) if cw > 0 else 1.0
ch_f = float(ch) if ch > 0 else 1.0
return {
"ok": True,
"selector": selector,
"tag": rect.get("tag"),
"rect": {
"x": round(rect["x"] / s, 1),
"y": round(rect["y"] / s, 1),
"w": round(rect["w"] / s, 1),
"h": round(rect["h"] / s, 1),
"cx": round(rect["cx"] / s, 1),
"cy": round(rect["cy"] / s, 1),
"x": round(rect["x"] / cw_f, 4),
"y": round(rect["y"] / ch_f, 4),
"w": round(rect["w"] / cw_f, 4),
"h": round(rect["h"] / ch_f, 4),
"cx": round(rect["cx"] / cw_f, 4),
"cy": round(rect["cy"] / ch_f, 4),
},
"cssWidth": cw,
"cssHeight": ch,
"note": (
"rect fields are in screenshot pixels. Pass rect.cx / "
"rect.cy to browser_click_coordinate / "
"rect fields are fractions of the viewport (0..1). "
"Pass rect.cx / rect.cy to browser_click_coordinate / "
"hover_coordinate / press_at."
),
}
@@ -435,9 +458,9 @@ def register_inspection_tools(mcp: FastMCP) -> None:
Get the bounding rect of an element by CSS selector.
Supports '>>>' shadow-piercing selectors for overlay/shadow DOM
content. Returns the rect in screenshot pixels — the same
numbers you'd read off a browser_screenshot, and the same
numbers browser_click_coordinate expects.
content. Returns the rect as **fractions of the viewport
(0..1)** — the same coordinate space browser_click_coordinate
/ hover_coordinate / press_at expect.
Args:
selector: CSS selector, optionally with ' >>> ' to pierce shadow roots.
@@ -446,7 +469,8 @@ def register_inspection_tools(mcp: FastMCP) -> None:
profile: Browser profile name (default: "default")
Returns:
Dict with ``rect`` block (x, y, w, h, cx, cy) in screenshot pixels.
Dict with ``rect`` block (x, y, w, h, cx, cy) as fractions,
plus ``cssWidth`` / ``cssHeight`` for reference.
"""
bridge = get_bridge()
if not bridge or not bridge.is_connected:
@@ -463,23 +487,26 @@ def register_inspection_tools(mcp: FastMCP) -> None:
return result
rect = result["rect"]
css_scale = await _ensure_css_scale(target_tab)
s = css_scale if css_scale > 0 else 1.0
cw, ch = await _ensure_viewport_size(target_tab)
cw_f = float(cw) if cw > 0 else 1.0
ch_f = float(ch) if ch > 0 else 1.0
return {
"ok": True,
"selector": selector,
"tag": rect.get("tag"),
"rect": {
"x": round(rect["x"] / s, 1),
"y": round(rect["y"] / s, 1),
"w": round(rect["w"] / s, 1),
"h": round(rect["h"] / s, 1),
"cx": round(rect["cx"] / s, 1),
"cy": round(rect["cy"] / s, 1),
"x": round(rect["x"] / cw_f, 4),
"y": round(rect["y"] / ch_f, 4),
"w": round(rect["w"] / cw_f, 4),
"h": round(rect["h"] / ch_f, 4),
"cx": round(rect["cx"] / cw_f, 4),
"cy": round(rect["cy"] / ch_f, 4),
},
"cssWidth": cw,
"cssHeight": ch,
"note": (
"rect fields are in screenshot pixels. Pass rect.cx / "
"rect.cy to browser_click_coordinate / "
"rect fields are fractions of the viewport (0..1). "
"Pass rect.cx / rect.cy to browser_click_coordinate / "
"hover_coordinate / press_at."
),
}
+78 -39
@@ -108,25 +108,31 @@ def register_interaction_tools(mcp: FastMCP) -> None:
button: Literal["left", "right", "middle"] = "left",
) -> dict:
"""
Click at the given SCREENSHOT pixel.
Click at a FRACTION of the viewport (0..1, 0..1).
``x`` and ``y`` are pixel coordinates read directly off a
``browser_screenshot`` image (800 px wide JPEG). The tool
multiplies them by the cached image → CSS scale for the tab
before dispatching to Chrome — no scale awareness required on
the caller side. ``browser_get_rect`` / ``browser_shadow_query``
return coordinates in the same (screenshot) space.
Coordinates are **fractions of the viewport**, not pixels:
``(0.5, 0.5)`` is the center, ``(0.1, 0.2)`` is 10% from the
left and 20% from the top. Read a target's proportional
position off ``browser_screenshot`` (or pass
``rect.cx`` / ``rect.cy`` from ``browser_get_rect`` /
``browser_shadow_query`` directly; they return fractions too).
Fractions are used because every vision model resizes or tiles
images differently (Claude ~1.15 MP target, GPT-4o 512-px
tiles, etc.). Proportional positions survive every such
transform; pixel coords do not.
Args:
x: X coordinate in screenshot pixels.
y: Y coordinate in screenshot pixels.
x: X fraction of the viewport (0..1).
y: Y fraction of the viewport (0..1).
tab_id: Chrome tab ID (default: active tab)
profile: Browser profile name (default: "default")
button: Mouse button to click (left, right, middle)
Returns:
Dict with click result, including ``focused_element``
describing what the click focused.
describing what the click focused. ``focused_element.rect``
is also in fractions.
"""
start = time.perf_counter()
params = {"x": x, "y": y, "tab_id": tab_id, "profile": profile, "button": button}
@@ -149,18 +155,33 @@ def register_interaction_tools(mcp: FastMCP) -> None:
log_tool_call("browser_click_coordinate", params, result=result)
return result
try:
from .inspection import _ensure_css_scale
# Pixel-input guard: legitimate fractions live in [0, 1]. Allow a
# small overshoot tolerance for edge targets.
if x > 1.5 or y > 1.5 or x < -0.1 or y < -0.1:
result = {
"ok": False,
"error": (
f"Coords ({x}, {y}) look like pixels. This tool expects "
"fractions 0..1 of the viewport. Read the target's "
"proportional position off browser_screenshot, or pass "
"rect.cx / rect.cy from browser_get_rect / "
"browser_shadow_query (they return fractions)."
),
}
log_tool_call("browser_click_coordinate", params, result=result)
return result
css_scale = await _ensure_css_scale(target_tab)
s = css_scale if css_scale > 0 else 1.0
css_x = x * s
css_y = y * s
try:
from .inspection import _ensure_viewport_size
cw, ch = await _ensure_viewport_size(target_tab)
css_x = x * cw
css_y = y * ch
click_result = await bridge.click_coordinate(target_tab, css_x, css_y, button=button)
log_tool_call(
"browser_click_coordinate",
params,
result={**click_result, "cssScale": round(css_scale, 4)},
result={**click_result, "cssWidth": cw, "cssHeight": ch},
duration_ms=(time.perf_counter() - start) * 1000,
)
return click_result
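The guard-then-convert flow of this hunk can be sketched in isolation; a hypothetical helper under the same assumptions (fractions in `[0, 1]` with a small overshoot tolerance, multiplied by CSS viewport size):

```python
def fraction_to_css(x: float, y: float, css_w: int, css_h: int) -> tuple[float, float]:
    """Convert viewport fractions to CSS pixels, rejecting pixel-looking input.

    Same shape as the guard in browser_click_coordinate: legitimate
    fractions live in [0, 1]; values far outside that range are almost
    certainly pixels passed by mistake.
    """
    if x > 1.5 or y > 1.5 or x < -0.1 or y < -0.1:
        raise ValueError(
            f"Coords ({x}, {y}) look like pixels; this expects fractions 0..1"
        )
    return (x * css_w, y * css_h)

print(fraction_to_css(0.35, 0.20, 1280, 720))  # → (448.0, 144.0)
```

Rejecting loudly instead of clamping matters here: a caller that passes `(448, 144)` would otherwise click the far bottom-right corner silently.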
@@ -485,17 +506,16 @@ def register_interaction_tools(mcp: FastMCP) -> None:
profile: str | None = None,
) -> dict:
"""
Hover at the given SCREENSHOT pixel.
Hover at a FRACTION of the viewport (0..1, 0..1).
Use this instead of browser_hover when the element is in an overlay,
shadow DOM, or virtual-rendered component that isn't in the regular DOM.
``x`` / ``y`` are pixel coordinates read directly off a
``browser_screenshot`` image; the tool translates to CSS px
internally before dispatching to Chrome.
``x`` / ``y`` are fractions of the viewport (``0.5`` = center);
the tool converts to CSS px internally.
Args:
x: X coordinate in screenshot pixels.
y: Y coordinate in screenshot pixels.
x: X fraction of the viewport (0..1).
y: Y fraction of the viewport (0..1).
tab_id: Chrome tab ID (default: active tab)
profile: Browser profile name (default: "default")
@@ -523,12 +543,22 @@ def register_interaction_tools(mcp: FastMCP) -> None:
log_tool_call("browser_hover_coordinate", params, result=result)
return result
try:
from .inspection import _ensure_css_scale
if x > 1.5 or y > 1.5 or x < -0.1 or y < -0.1:
result = {
"ok": False,
"error": (
f"Coords ({x}, {y}) look like pixels. This tool expects "
"fractions 0..1 of the viewport."
),
}
log_tool_call("browser_hover_coordinate", params, result=result)
return result
css_scale = await _ensure_css_scale(target_tab)
s = css_scale if css_scale > 0 else 1.0
hover_result = await bridge.hover_coordinate(target_tab, x * s, y * s)
try:
from .inspection import _ensure_viewport_size
cw, ch = await _ensure_viewport_size(target_tab)
hover_result = await bridge.hover_coordinate(target_tab, x * cw, y * ch)
log_tool_call(
"browser_hover_coordinate",
params,
@@ -555,18 +585,17 @@ def register_interaction_tools(mcp: FastMCP) -> None:
profile: str | None = None,
) -> dict:
"""
Move mouse to the given SCREENSHOT pixel, then press a key.
Move mouse to a FRACTION of the viewport (0..1, 0..1), then press a key.
Use this instead of browser_press when the focused element is in an overlay
or virtual-rendered component. Moving the mouse first routes the key event
through native browser hit-testing instead of the DOM focus chain.
``x`` / ``y`` are pixel coordinates read directly off a
``browser_screenshot`` image; the tool translates to CSS px
internally.
``x`` / ``y`` are fractions of the viewport; the tool converts
to CSS px internally.
Args:
x: X coordinate in screenshot pixels.
y: Y coordinate in screenshot pixels.
x: X fraction of the viewport (0..1).
y: Y fraction of the viewport (0..1).
key: Key to press (e.g. 'Enter', 'Space', 'Escape', 'ArrowDown')
tab_id: Chrome tab ID (default: active tab)
profile: Browser profile name (default: "default")
@@ -595,12 +624,22 @@ def register_interaction_tools(mcp: FastMCP) -> None:
log_tool_call("browser_press_at", params, result=result)
return result
try:
from .inspection import _ensure_css_scale
if x > 1.5 or y > 1.5 or x < -0.1 or y < -0.1:
result = {
"ok": False,
"error": (
f"Coords ({x}, {y}) look like pixels. This tool expects "
"fractions 0..1 of the viewport."
),
}
log_tool_call("browser_press_at", params, result=result)
return result
css_scale = await _ensure_css_scale(target_tab)
s = css_scale if css_scale > 0 else 1.0
press_result = await bridge.press_key_at(target_tab, x * s, y * s, key)
try:
from .inspection import _ensure_viewport_size
cw, ch = await _ensure_viewport_size(target_tab)
press_result = await bridge.press_key_at(target_tab, x * cw, y * ch, key)
log_tool_call(
"browser_press_at",
params,
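The invariance argument behind the whole change (proportions survive any resize; pixels do not) can be checked in a few lines. Illustrative widths only, not tied to any specific model's resizer:

```python
def as_fraction(px: float, image_width: float) -> float:
    """Proportional position of a pixel within a rendering of the screenshot."""
    return round(px / image_width, 4)

# The same on-screen point, 35% from the left edge, read off three
# differently sized renderings of one screenshot:
target_fraction = 0.35
for image_width in (800, 1024, 512):
    px = target_fraction * image_width   # where that point lands in each rendering
    assert as_fraction(px, image_width) == target_fraction

# A raw pixel coordinate (e.g. 280 in the 800-wide image) would have
# pointed at a different on-screen spot in the 512-wide one.
```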