fix: simplify canonical workflow

Timothy
2026-04-16 16:02:37 -07:00
parent 916803889f
commit 8222cd306e
4 changed files with 146 additions and 93 deletions
@@ -49,11 +49,26 @@ Whereas `wait_for_selector`, `browser_click(selector=...)`, `browser_type(select
1. `browser_screenshot()` → visual image
2. Identify the target visually → image pixel `(x, y)` (eyeball from the screenshot)
3. `browser_coords(x, y)` → convert to CSS px
4. `browser_click_coordinate(css_x, css_y)` → lands on the element via native hit testing; inputs get focused
5. For typing:
- If the element was reachable via a selector → `browser_type(selector, text)`
- Otherwise → `browser_press(key)` per character (dispatches to focused element, no selector needed)
6. Verify by reading element state via a targeted `browser_evaluate` that walks the shadow tree
4. `browser_click_coordinate(css_x, css_y)` → lands on the element via native hit testing; inputs get focused. **The response now includes `focused_element: {tag, id, role, contenteditable, rect, ...}`** — use it to verify you actually focused what you intended.
5. `browser_type(text="...")` with **NO selector** → dispatches CDP `Input.insertText` to `document.activeElement`. Shadow roots, iframes, Lexical, Draft.js, ProseMirror all just work. Only pass a selector if you want a DIFFERENT element than the one you just focused (rare).
6. Verify via `browser_screenshot` OR `browser_get_attribute` on a known-reachable marker (e.g. check that the Send button's `aria-disabled` flipped to `false`).
### The click→type loop (canonical pattern)
```
resp = browser_click_coordinate(x, y)
fe = resp.get("focused_element")
if fe and (fe.get("contenteditable") not in (None, "false") or fe["tag"] in ("textarea", "input")):
    browser_type(text="...")  # no selector: insertText goes to activeElement
else:
    # You clicked something that isn't editable: refine coords and retry.
    # Do NOT reach for browser_evaluate + execCommand('insertText', ...)
    # or a walk(root) shadow traversal. The problem is your click, not
    # the typing method.
    ...
```
`browser_click` (selector-based) also returns `focused_element` now, so the same check works whether you clicked by selector or coordinate.
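The focus check in the loop above can be factored into a small helper. The sketch below is illustrative only: `is_editable` and `click_then_type` are hypothetical names, and the `click` and `type_text` callables are stand-ins for `browser_click_coordinate` and `browser_type(text=...)`. It assumes `focused_element.contenteditable` carries the raw attribute string, so an explicit `"false"` has to be rejected even though it is truthy in Python.

```python
from typing import Callable, Optional

EDITABLE_TAGS = ("textarea", "input")


def is_editable(focused: Optional[dict]) -> bool:
    """Return True if a focused_element descriptor looks like a text target."""
    if not focused:
        return False
    # The descriptor reports the raw contenteditable attribute, so the
    # string "false" must be treated as not editable.
    ce = focused.get("contenteditable")
    if ce is not None and str(ce).lower() != "false":
        return True
    return focused.get("tag") in EDITABLE_TAGS


def click_then_type(
    click: Callable[[float, float], dict],
    type_text: Callable[[str], dict],
    x: float,
    y: float,
    text: str,
) -> bool:
    """Click, confirm focus landed on an editable, then type with no selector.

    `click` and `type_text` are assumed wrappers around
    browser_click_coordinate and browser_type(text=...).
    Returns True if text was dispatched, False if the click missed.
    """
    resp = click(x, y)
    if not is_editable(resp.get("focused_element")):
        # Focus missed: the caller should refine the coordinates and retry
        # rather than switch to a different typing method.
        return False
    type_text(text)
    return True
```

A `False` return means the click needs new coordinates; the typing call is deliberately never attempted in that case.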
### Empirically verified (2026-04-11)
@@ -98,19 +98,17 @@ textarea = browser_evaluate("""
browser_click_coordinate(textarea['cx'], textarea['cy'])
sleep(0.6)
# 6. Insert text via document.execCommand('insertText') through browser_evaluate.
# This is the ONLY reliable approach for LinkedIn's Lexical composer.
# See the "Lexical composer quirks" section below for why browser_type
# with a selector does NOT work here (the contenteditable lives inside
# the #interop-outlet shadow root which document.querySelector can't
# reach). The click in step 5 already put Lexical into edit mode, so
# execCommand injects straight into the focused editor's state.
browser_evaluate("""
(function(){
document.execCommand('insertText', false, %s);
return true;
})();
""" % json.dumps(message_text)) # json.dumps gives you a safely-escaped JS string literal
# 6. Insert text via browser_type WITHOUT a selector. This dispatches
# CDP Input.insertText to document.activeElement — the same underlying
# mechanism as execCommand('insertText') but with no JSON escaping,
# no browser_evaluate round trip, and built-in retry. The click in
# step 5 already focused Lexical, so insertText lands in the editor
# regardless of the shadow wrapping around #interop-outlet.
#
# Do NOT pass a selector here. Selector-based browser_type cannot see
# past the #interop-outlet shadow root. No-selector mode sidesteps
# that entirely by routing to activeElement.
browser_type(text=message_text) # no selector — targets document.activeElement
sleep(1.0) # let Lexical commit state + enable Send button
# 7. Find the modal Send button (filter by in-viewport, reject pinned bar)
@@ -143,20 +141,21 @@ send = browser_evaluate("""
})();
""")
# 8. ONLY click Send if it's enabled — if disabled, the execCommand
# 8. ONLY click Send if it's enabled — if disabled, the insertText
# didn't land. DO NOT retry with a different tool; the fix is
# always: re-click the composer rect, re-run execCommand, re-check.
# The Send button's `disabled` state IS the ground truth — if
# Lexical registered your text, it enables the button. If it's
# always: re-click the composer rect, re-run browser_type(text=...),
# re-check. The Send button's `disabled` state IS the ground truth —
# if Lexical registered your text, it enables the button. If it's
# still disabled, your text did not reach the editor, regardless
# of what any tool call claims.
if send['disabled']:
# The editor didn't receive your text. Do NOT click Send. Do NOT
# fall back to browser_type with a dummy selector (see anti-pattern
# in Common Pitfalls). Instead: re-click the textarea rect from
# step 4, wait a beat, re-run the execCommand insertText from step
# 6. If that still fails after 2 retries, bail and surface — the
# modal may have been reclaimed by a stale state or auth wall.
# fall back to browser_type with a selector (see anti-pattern in
# Common Pitfalls — selector-based type can't reach the shadow-DOM
# composer). Instead: re-click the textarea rect from step 4, wait
# a beat, re-run browser_type(text=message_text) (no selector) from
# step 6. If that still fails after 2 retries, bail and surface —
# the modal may have been reclaimed by a stale state or auth wall.
raise Exception("Send button disabled after insertText — editor did not receive input")
browser_click_coordinate(send['cx'], send['cy'])
@@ -324,9 +323,9 @@ If any of those show up, **stop the run, screenshot the state, and surface the i
## Common pitfalls
- **`innerHTML` injection is silently dropped** — LinkedIn's Trusted Types CSP discards any `innerHTML = "<...>"` from injected scripts, no console error. Always use `createElement` + `appendChild` + `setAttribute` for DOM injection. `textContent`, `style.cssText`, and `.value` assignments are fine.
- **Do NOT use `browser_type` on the message composer — use `document.execCommand('insertText', false, text)` via `browser_evaluate` instead.** The Lexical contenteditable lives inside the `#interop-outlet` shadow root which `document.querySelector` (what `browser_type` uses under the hood) cannot see. Attempts to work around this with `browser_shadow_query` fail because `browser_type` doesn't support the `>>>` shadow-pierce syntax. The ONLY reliable insert path is: (1) `browser_click_coordinate` on the composer rect (put Lexical in edit mode via a real CDP pointer click) → (2) `browser_evaluate` with `document.execCommand('insertText', false, <message>)` against the focused editor. This pattern is verified end-to-end across 15+ successful sends in session `session_20260414_113244_a98cfd66` (2026-04-14).
- **Per-char keyDown on the message composer produces empty text** — Lexical intercepts `beforeinput` and drops raw keys. Ignore `browser_type` entirely for LinkedIn DMs; use the `execCommand('insertText')` path above.
- **ANTI-PATTERN: "inject a dummy `<div id='dummy-target'>` and pass it as the `selector` arg to `browser_type`".** This looks tempting but fails compoundingly: `browser_type` clicks the **dummy div's** rect (not the editor's), the click lands on the Lexical wrapper's non-editable chrome, the contenteditable never receives focus, and `Input.insertText` fires against nothing. The bridge will still return `{"ok": true, "action": "type", "length": N}` because it has no way to verify the text actually landed. Symptom: Send button stays `disabled: true` forever. Fix: use `execCommand('insertText')` exactly as shown in the profile-message flow above. (See `session_20260414_114820_08bd3c4d` for the failed attempt.)
- **Do NOT pass a selector to `browser_type` on the message composer — call it with NO selector (`browser_type(text=...)`).** The Lexical contenteditable lives inside the `#interop-outlet` shadow root which `document.querySelector` (what the selector-based path uses under the hood) cannot see. Attempts to work around this with `browser_shadow_query` fail because selector-based `browser_type` doesn't support the `>>>` shadow-pierce syntax. The reliable insert path is: (1) `browser_click_coordinate` on the composer rect — the response's `focused_element` confirms Lexical received focus → (2) `browser_type(text=message_text)` with NO selector — CDP `Input.insertText` dispatches to `document.activeElement` regardless of shadow wrapping. The old `browser_evaluate` + `document.execCommand('insertText', ...)` pattern worked but had JSON-escaping pitfalls and cost ~200 chars of JS per send; `browser_type(text=...)` is the same mechanism with built-in retry.
- **Per-char keyDown on the message composer produces empty text** — Lexical intercepts `beforeinput` and drops raw keys. Use `browser_type(text=..., use_insert_text=True)` with NO selector after click-coordinate focused the composer. The CDP `Input.insertText` method commits as if IME fired, which Lexical accepts cleanly. Do NOT pass a selector; selector-based `browser_type` can't see past `#interop-outlet`.
- **ANTI-PATTERN: "inject a dummy `<div id='dummy-target'>` and pass it as the `selector` arg to `browser_type`".** This looks tempting but fails compoundingly: `browser_type` clicks the **dummy div's** rect (not the editor's), the click lands on the Lexical wrapper's non-editable chrome, the contenteditable never receives focus, and `Input.insertText` fires against nothing. The bridge will still return `{"ok": true, "action": "type", "length": N}` because it has no way to verify the text actually landed. Symptom: Send button stays `disabled: true` forever. Fix: `browser_click_coordinate` on the real composer rect, then `browser_type(text=message_text)` with NO selector — CDP `Input.insertText` dispatches to `document.activeElement`. (See `session_20260414_114820_08bd3c4d` for the failed dummy-div attempt.)
- **Multiple Send buttons on the page** — the pinned bottom-right messaging bar has its own `msg-form__send-button` that's usually below `innerHeight`. Filter by in-viewport before clicking.
- **`window.onbeforeunload` hangs navigation/close** — after typing in a composer, any `browser_navigate` or `close_tab` can pop a native "unsent message, leave?" confirm dialog that deadlocks the bridge. Always strip `onbeforeunload` before any navigation, and wrap composer flows in a `try/finally` that runs the cleanup block:
+58 -36
@@ -80,6 +80,37 @@ async def _adaptive_poll_sleep(elapsed_s: float) -> None:
_interaction_highlights: dict[int, dict] = {}
# Compact descriptor of document.activeElement. Returned by both click()
# and click_coordinate() so the agent can verify it focused what it
# intended, then decide whether to follow up with browser_type(text=...,
# no selector). Keeping this as a single shared string avoids drift
# between the two click paths.
_FOCUSED_ELEMENT_JS = """
(function() {
    var el = document.activeElement;
    if (!el || el === document.body) return null;
    var rect = el.getBoundingClientRect();
    var attrs = {};
    for (var i = 0; i < el.attributes.length && i < 10; i++) {
        attrs[el.attributes[i].name] = el.attributes[i].value.substring(0, 200);
    }
    return {
        tag: el.tagName.toLowerCase(),
        id: el.id || null,
        className: el.className || null,
        name: el.getAttribute('name') || null,
        type: el.getAttribute('type') || null,
        role: el.getAttribute('role') || null,
        contenteditable: el.getAttribute('contenteditable') || null,
        text: (el.innerText || '').substring(0, 200),
        value: (el.value !== undefined ? String(el.value).substring(0, 200) : null),
        attributes: attrs,
        rect: { x: rect.x, y: rect.y, width: rect.width, height: rect.height }
    };
})()
"""
def _get_active_profile() -> str:
"""Get the current active profile from context variable."""
try:
@@ -763,7 +794,8 @@ class BeelineBridge:
rx = value.get("x", 0) - value.get("width", 0) / 2
ry = value.get("y", 0) - value.get("height", 0) / 2
await self.highlight_rect(tab_id, rx, ry, value.get("width", 0), value.get("height", 0), label=selector)
return {
focused_info = await self._read_focused_element(tab_id)
resp = {
"ok": True,
"action": "click",
"selector": selector,
@@ -771,6 +803,9 @@ class BeelineBridge:
"y": value.get("y", 0),
"method": "javascript",
}
if focused_info:
resp["focused_element"] = focused_info
return resp
# If JavaScript click failed, try CDP approach
if isinstance(value, dict) and value.get("error"):
@@ -883,7 +918,8 @@ class BeelineBridge:
w = bounds_value.get("width", 0)
h = bounds_value.get("height", 0)
await self.highlight_rect(tab_id, x - w / 2, y - h / 2, w, h, label=selector)
return {
focused_info = await self._read_focused_element(tab_id)
resp = {
"ok": True,
"action": "click",
"selector": selector,
@@ -891,10 +927,29 @@ class BeelineBridge:
"y": y,
"method": "cdp",
}
if focused_info:
resp["focused_element"] = focused_info
return resp
except Exception as e:
return {"ok": False, "error": f"Click failed: {e}"}
    async def _read_focused_element(self, tab_id: int) -> dict | None:
        """Read document.activeElement and return a compact descriptor.

        Returns None on any failure; never raises. Used by both click
        paths (selector-based click() and click_coordinate()) so the
        agent gets the same response shape regardless of which one was
        called. The descriptor lets the agent answer "did my click land
        on an editable?" without a second round-trip.
        """
        try:
            await self._try_enable_domain(tab_id, "Runtime")
            result = await self.evaluate(tab_id, _FOCUSED_ELEMENT_JS)
            return (result or {}).get("result")
        except Exception:
            return None
async def click_coordinate(self, tab_id: int, x: float, y: float, button: str = "left") -> dict:
"""Click at specific coordinates."""
await self.cdp_attach(tab_id)
@@ -931,40 +986,7 @@ class BeelineBridge:
await self.highlight_point(tab_id, x, y, label=f"click ({x},{y})")
# Query the focused element after the click
focused_info = None
try:
await self._try_enable_domain(tab_id, "Runtime")
result = await self.evaluate(
tab_id,
"""
(function() {
var el = document.activeElement;
if (!el || el === document.body) return null;
var rect = el.getBoundingClientRect();
var attrs = {};
for (var i = 0; i < el.attributes.length && i < 10; i++) {
attrs[el.attributes[i].name] = el.attributes[i].value.substring(0, 200);
}
return {
tag: el.tagName.toLowerCase(),
id: el.id || null,
className: el.className || null,
name: el.getAttribute('name') || null,
type: el.getAttribute('type') || null,
role: el.getAttribute('role') || null,
text: (el.innerText || '').substring(0, 200),
value: (el.value !== undefined ? String(el.value).substring(0, 200) : null),
attributes: attrs,
rect: { x: rect.x, y: rect.y, width: rect.width, height: rect.height }
};
})()
""",
)
focused_info = (result or {}).get("result")
except Exception:
pass
focused_info = await self._read_focused_element(tab_id)
resp = {"ok": True, "action": "click_coordinate", "x": x, "y": y}
if focused_info:
resp["focused_element"] = focused_info
+44 -27
@@ -185,42 +185,59 @@ def register_interaction_tools(mcp: FastMCP) -> None:
use_insert_text: bool = True,
) -> dict:
"""
Type text into an input element.
Insert text into the currently focused element via CDP Input.insertText.
Automatically routes through a real CDP pointer click on the
element before inserting text so that rich-text editors like
Lexical (Gmail, LinkedIn DMs), Draft.js (X compose), and
ProseMirror (Reddit) see a native focus event and enable their
submit buttons. See the gcu-browser skill for the full "click-
then-type" pattern.
CANONICAL PATTERN (PREFER THIS):
    browser_click_coordinate(x, y)   # click, then inspect focused_element
    browser_type(text="...")         # NO selector: targets activeElement
When ``selector`` is omitted (None), types into the currently
focused element; useful after ``browser_click_coordinate``
has already focused the target.
The click focuses the element (including through shadow roots and
across iframes), and Input.insertText dispatches to
document.activeElement regardless of DOM structure. This is the
ONLY reliable way to type into:
- LinkedIn's #interop-outlet Lexical composer
- X/Twitter's Draft.js compose box
- Reddit's ProseMirror comment box
- Any site wrapped in Trusted Types CSP (innerHTML silently dropped)
- Any nested-iframe message overlay (LinkedIn invitation manager)
By default uses CDP Input.insertText which is the most reliable
way to insert text into rich editors. Set
``use_insert_text=False`` to fall back to per-character
keyDown/keyUp events (needed only for code editors that fire
on specific keystrokes, or when ``delay_ms`` typing animation
is required).
CDP's Input.insertText takes no target parameter — it operates
implicitly on the focused editable. That is why the no-selector
path is shadow-agnostic and iframe-agnostic. DO NOT reach for
browser_evaluate with document.execCommand('insertText', ...) or
walk(root) shadow traversals; they reinvent this method with
more escaping bugs.
When to pass ``selector`` (rare):
- You want to type into a DIFFERENT element than the one
currently focused, without a prior click.
- The target is a plain <input> in the light DOM and you want
a single-call shortcut (selector-based click is performed
first, then insertText dispatches to the now-focused field).
By default uses CDP Input.insertText (``use_insert_text=True``).
Set ``use_insert_text=False`` only for code editors that watch
specific keystrokes, or when ``delay_ms`` typing animation is
required.
Args:
selector: CSS selector for the input element (None to type
into the already-focused element)
text: Text to type
tab_id: Chrome tab ID (default: active tab)
profile: Browser profile name (default: "default")
text: Text to insert at the current cursor position.
selector: CSS selector (OPTIONAL; prefer omitting). When
omitted, dispatches Input.insertText to
document.activeElement. When provided, performs a CDP
click on the selector first, then inserts.
tab_id: Chrome tab ID (default: active tab).
profile: Browser profile name (default: "default").
delay_ms: Delay between keystrokes in ms (default: 0).
Forces the per-keystroke fallback when > 0.
clear_first: Clear existing text before typing (default: True)
timeout_ms: Timeout waiting for element (default: 30000)
Forces the per-keystroke fallback when > 0.
clear_first: Clear existing text before typing (default: True).
timeout_ms: Timeout waiting for element (default: 30000).
use_insert_text: Use CDP Input.insertText (default: True) for
reliable insertion into rich-text editors.
Set False for per-keystroke dispatch.
reliable insertion into rich-text editors. Set False for
per-keystroke dispatch.
Returns:
Dict with type result
Dict with type result.
"""
start = time.perf_counter()
params = {"selector": selector, "text": text, "tab_id": tab_id, "profile": profile}