fix: simplify canonical workflow
@@ -49,11 +49,26 @@ Whereas `wait_for_selector`, `browser_click(selector=...)`, `browser_type(select
|
||||
1. `browser_screenshot()` → visual image
|
||||
2. Identify the target visually → image pixel `(x, y)` (eyeball from the screenshot)
|
||||
3. `browser_coords(x, y)` → convert to CSS px
|
||||
4. `browser_click_coordinate(css_x, css_y)` → lands on the element via native hit testing; inputs get focused
|
||||
5. For typing:
|
||||
- If the element was reachable via a selector → `browser_type(selector, text)`
|
||||
- Otherwise → `browser_press(key)` per character (dispatches to focused element, no selector needed)
|
||||
6. Verify by reading element state via a targeted `browser_evaluate` that walks the shadow tree
|
||||
4. `browser_click_coordinate(css_x, css_y)` → lands on the element via native hit testing; inputs get focused. **The response now includes `focused_element: {tag, id, role, contenteditable, rect, ...}`** — use it to verify you actually focused what you intended.
|
||||
5. `browser_type(text="...")` with **NO selector** → dispatches CDP `Input.insertText` to `document.activeElement`. Shadow roots, iframes, Lexical, Draft.js, ProseMirror all just work. Only pass a selector if you want a DIFFERENT element than the one you just focused (rare).
|
||||
6. Verify via `browser_screenshot` OR `browser_get_attribute` on a known-reachable marker (e.g. check that the Send button's `aria-disabled` flipped to `false`).
|
||||
|
||||
### The click→type loop (canonical pattern)
|
||||
|
||||
```
resp = browser_click_coordinate(x, y)
fe = resp.get("focused_element")
if fe and (fe.get("contenteditable") or fe["tag"] in ("textarea", "input")):
    browser_type(text="...")   # no selector — insertText to activeElement
else:
    # you clicked something that isn't editable — refine coords and retry
    # do NOT reach for browser_evaluate + execCommand('insertText', ...)
    # or a walk(root) shadow traversal. The problem is your click, not
    # the typing method.
    ...
```
|
||||
|
||||
`browser_click` (selector-based) also returns `focused_element` now, so the same check works whether you clicked by selector or coordinate.
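
The editable check in the loop above can be factored into a small helper. The sketch below is a minimal version, assuming the `focused_element` descriptor has the shape produced by `_FOCUSED_ELEMENT_JS` (`tag`, `contenteditable`, and so on). The name `is_editable` is hypothetical and not part of the bridge. It also treats `contenteditable="false"` as non-editable, which a plain truthiness test would miss because the attribute value arrives as a non-empty string.

```python
from typing import Optional

# Hypothetical helper: not part of the bridge API.
EDITABLE_TAGS = {"textarea", "input"}


def is_editable(fe: Optional[dict]) -> bool:
    """Return True if a focused_element descriptor looks like a text target."""
    if not fe:
        return False  # nothing focused, or focus is on <body>
    ce = fe.get("contenteditable")
    # The descriptor carries the raw attribute string, so "false" must be
    # rejected explicitly; any other non-empty value counts as editable.
    if isinstance(ce, str) and ce and ce.lower() != "false":
        return True
    return fe.get("tag") in EDITABLE_TAGS
```

With this helper, the branch condition in the loop would read `if is_editable(resp.get("focused_element")):`.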
|
||||
|
||||
### Empirically verified (2026-04-11)
|
||||
|
||||
|
||||
@@ -98,19 +98,17 @@ textarea = browser_evaluate("""
|
||||
browser_click_coordinate(textarea['cx'], textarea['cy'])
|
||||
sleep(0.6)
|
||||
|
||||
# 6. Insert text via document.execCommand('insertText') through browser_evaluate.
|
||||
# This is the ONLY reliable approach for LinkedIn's Lexical composer.
|
||||
# See the "Lexical composer quirks" section below for why browser_type
|
||||
# with a selector does NOT work here (the contenteditable lives inside
|
||||
# the #interop-outlet shadow root which document.querySelector can't
|
||||
# reach). The click in step 5 already put Lexical into edit mode, so
|
||||
# execCommand injects straight into the focused editor's state.
|
||||
browser_evaluate("""
|
||||
(function(){
|
||||
document.execCommand('insertText', false, %s);
|
||||
return true;
|
||||
})();
|
||||
""" % json.dumps(message_text)) # json.dumps gives you a safely-escaped JS string literal
|
||||
# 6. Insert text via browser_type WITHOUT a selector. This dispatches
|
||||
# CDP Input.insertText to document.activeElement — the same underlying
|
||||
# mechanism as execCommand('insertText') but with no JSON escaping,
|
||||
# no browser_evaluate round trip, and built-in retry. The click in
|
||||
# step 5 already focused Lexical, so insertText lands in the editor
|
||||
# regardless of the shadow wrapping around #interop-outlet.
|
||||
#
|
||||
# Do NOT pass a selector here. Selector-based browser_type cannot see
|
||||
# past the #interop-outlet shadow root. No-selector mode sidesteps
|
||||
# that entirely by routing to activeElement.
|
||||
browser_type(text=message_text) # no selector — targets document.activeElement
|
||||
sleep(1.0) # let Lexical commit state + enable Send button
|
||||
|
||||
# 7. Find the modal Send button (filter by in-viewport, reject pinned bar)
|
||||
@@ -143,20 +141,21 @@ send = browser_evaluate("""
|
||||
})();
|
||||
""")
|
||||
|
||||
# 8. ONLY click Send if it's enabled — if disabled, the execCommand
|
||||
# 8. ONLY click Send if it's enabled — if disabled, the insertText
|
||||
# didn't land. DO NOT retry with a different tool; the fix is
|
||||
# always: re-click the composer rect, re-run execCommand, re-check.
|
||||
# The Send button's `disabled` state IS the ground truth — if
|
||||
# Lexical registered your text, it enables the button. If it's
|
||||
# always: re-click the composer rect, re-run browser_type(text=...),
|
||||
# re-check. The Send button's `disabled` state IS the ground truth —
|
||||
# if Lexical registered your text, it enables the button. If it's
|
||||
# still disabled, your text did not reach the editor, regardless
|
||||
# of what any tool call claims.
|
||||
if send['disabled']:
|
||||
# The editor didn't receive your text. Do NOT click Send. Do NOT
|
||||
# fall back to browser_type with a dummy selector (see anti-pattern
|
||||
# in Common Pitfalls). Instead: re-click the textarea rect from
|
||||
# step 4, wait a beat, re-run the execCommand insertText from step
|
||||
# 6. If that still fails after 2 retries, bail and surface — the
|
||||
# modal may have been reclaimed by a stale state or auth wall.
|
||||
# fall back to browser_type with a selector (see anti-pattern in
|
||||
# Common Pitfalls — selector-based type can't reach the shadow-DOM
|
||||
# composer). Instead: re-click the textarea rect from step 4, wait
|
||||
# a beat, re-run browser_type(text=message_text) (no selector) from
|
||||
# step 6. If that still fails after 2 retries, bail and surface —
|
||||
# the modal may have been reclaimed by a stale state or auth wall.
|
||||
raise Exception("Send button disabled after insertText — editor did not receive input")
|
||||
|
||||
browser_click_coordinate(send['cx'], send['cy'])
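
The retry rule described in the comments above (re-click, re-type, re-check, give up after two retries) can be sketched as a small loop. This is a sketch only: `click`, `type_text`, and `send_disabled` are hypothetical callables standing in for `browser_click_coordinate`, `browser_type(text=...)`, and the Send-button probe, and the function name `insert_with_retry` is an assumption, not part of the skill.

```python
import time
from typing import Callable


def insert_with_retry(
    click: Callable[[], None],
    type_text: Callable[[], None],
    send_disabled: Callable[[], bool],
    retries: int = 2,
    settle_s: float = 1.0,
) -> int:
    """Click, type, and confirm the Send button enabled; return attempts used.

    Raises RuntimeError if the button is still disabled after the first
    attempt plus `retries` retries.
    """
    for attempt in range(1, retries + 2):
        click()               # re-focus the composer with a real click
        type_text()           # no-selector insert into the focused element
        time.sleep(settle_s)  # give the editor time to commit its state
        if not send_disabled():
            return attempt
    raise RuntimeError("Send button still disabled; editor did not receive input")
```

Passing the tool calls in as callables keeps the loop independent of the bridge, so the same check works whichever tool performed the click.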
|
||||
@@ -324,9 +323,9 @@ If any of those show up, **stop the run, screenshot the state, and surface the i
|
||||
## Common pitfalls
|
||||
|
||||
- **`innerHTML` injection is silently dropped** — LinkedIn's Trusted Types CSP discards any `innerHTML = "<...>"` from injected scripts, no console error. Always use `createElement` + `appendChild` + `setAttribute` for DOM injection. `textContent`, `style.cssText`, and `.value` assignments are fine.
|
||||
- **Do NOT use `browser_type` on the message composer — use `document.execCommand('insertText', false, text)` via `browser_evaluate` instead.** The Lexical contenteditable lives inside the `#interop-outlet` shadow root which `document.querySelector` (what `browser_type` uses under the hood) cannot see. Attempts to work around this with `browser_shadow_query` fail because `browser_type` doesn't support the `>>>` shadow-pierce syntax. The ONLY reliable insert path is: (1) `browser_click_coordinate` on the composer rect (put Lexical in edit mode via a real CDP pointer click) → (2) `browser_evaluate` with `document.execCommand('insertText', false, <message>)` against the focused editor. This pattern is verified end-to-end across 15+ successful sends in session `session_20260414_113244_a98cfd66` (2026-04-14).
|
||||
- **Per-char keyDown on the message composer produces empty text** — Lexical intercepts `beforeinput` and drops raw keys. Ignore `browser_type` entirely for LinkedIn DMs; use the `execCommand('insertText')` path above.
|
||||
- **ANTI-PATTERN: "inject a dummy `<div id='dummy-target'>` and pass it as the `selector` arg to `browser_type`".** This looks tempting but fails compoundingly: `browser_type` clicks the **dummy div's** rect (not the editor's), the click lands on the Lexical wrapper's non-editable chrome, the contenteditable never receives focus, and `Input.insertText` fires against nothing. The bridge will still return `{"ok": true, "action": "type", "length": N}` because it has no way to verify the text actually landed. Symptom: Send button stays `disabled: true` forever. Fix: use `execCommand('insertText')` exactly as shown in the profile-message flow above. (See `session_20260414_114820_08bd3c4d` for the failed attempt.)
|
||||
- **Do NOT pass a selector to `browser_type` on the message composer — call it with NO selector (`browser_type(text=...)`).** The Lexical contenteditable lives inside the `#interop-outlet` shadow root which `document.querySelector` (what the selector-based path uses under the hood) cannot see. Attempts to work around this with `browser_shadow_query` fail because selector-based `browser_type` doesn't support the `>>>` shadow-pierce syntax. The reliable insert path is: (1) `browser_click_coordinate` on the composer rect — the response's `focused_element` confirms Lexical received focus → (2) `browser_type(text=message_text)` with NO selector — CDP `Input.insertText` dispatches to `document.activeElement` regardless of shadow wrapping. The old `browser_evaluate` + `document.execCommand('insertText', ...)` pattern worked but had JSON-escaping pitfalls and cost ~200 chars of JS per send; `browser_type(text=...)` is the same mechanism with built-in retry.
|
||||
- **Per-char keyDown on the message composer produces empty text** — Lexical intercepts `beforeinput` and drops raw keys. Use `browser_type(text=..., use_insert_text=True)` with NO selector after `browser_click_coordinate` has focused the composer. The CDP `Input.insertText` method commits text as if an IME had fired, which Lexical accepts cleanly. Do NOT pass a selector; selector-based `browser_type` can't see past `#interop-outlet`.
|
||||
- **ANTI-PATTERN: "inject a dummy `<div id='dummy-target'>` and pass it as the `selector` arg to `browser_type`".** This looks tempting but fails compoundingly: `browser_type` clicks the **dummy div's** rect (not the editor's), the click lands on the Lexical wrapper's non-editable chrome, the contenteditable never receives focus, and `Input.insertText` fires against nothing. The bridge will still return `{"ok": true, "action": "type", "length": N}` because it has no way to verify the text actually landed. Symptom: Send button stays `disabled: true` forever. Fix: `browser_click_coordinate` on the real composer rect, then `browser_type(text=message_text)` with NO selector — CDP `Input.insertText` dispatches to `document.activeElement`. (See `session_20260414_114820_08bd3c4d` for the failed dummy-div attempt.)
|
||||
- **Multiple Send buttons on the page** — the pinned bottom-right messaging bar has its own `msg-form__send-button` that's usually below `innerHeight`. Filter by in-viewport before clicking.
|
||||
- **`window.onbeforeunload` hangs navigation/close** — after typing in a composer, any `browser_navigate` or `close_tab` can pop a native "unsent message, leave?" confirm dialog that deadlocks the bridge. Always strip `onbeforeunload` before any navigation, and wrap composer flows in a `try/finally` that runs the cleanup block:
|
||||
|
||||
|
||||
@@ -80,6 +80,37 @@ async def _adaptive_poll_sleep(elapsed_s: float) -> None:
|
||||
_interaction_highlights: dict[int, dict] = {}
|
||||
|
||||
|
||||
# Compact descriptor of document.activeElement. Returned by both click()
|
||||
# and click_coordinate() so the agent can verify it focused what it
|
||||
# intended, then decide whether to follow up with browser_type(text=...,
|
||||
# no selector). Keeping this as a single shared string avoids drift
|
||||
# between the two click paths.
|
||||
_FOCUSED_ELEMENT_JS = """
(function() {
  var el = document.activeElement;
  if (!el || el === document.body) return null;
  var rect = el.getBoundingClientRect();
  var attrs = {};
  for (var i = 0; i < el.attributes.length && i < 10; i++) {
    attrs[el.attributes[i].name] = el.attributes[i].value.substring(0, 200);
  }
  return {
    tag: el.tagName.toLowerCase(),
    id: el.id || null,
    className: el.className || null,
    name: el.getAttribute('name') || null,
    type: el.getAttribute('type') || null,
    role: el.getAttribute('role') || null,
    contenteditable: el.getAttribute('contenteditable') || null,
    text: (el.innerText || '').substring(0, 200),
    value: (el.value !== undefined ? String(el.value).substring(0, 200) : null),
    attributes: attrs,
    rect: { x: rect.x, y: rect.y, width: rect.width, height: rect.height }
  };
})()
"""
|
||||
|
||||
|
||||
def _get_active_profile() -> str:
|
||||
"""Get the current active profile from context variable."""
|
||||
try:
|
||||
@@ -763,7 +794,8 @@ class BeelineBridge:
|
||||
rx = value.get("x", 0) - value.get("width", 0) / 2
|
||||
ry = value.get("y", 0) - value.get("height", 0) / 2
|
||||
await self.highlight_rect(tab_id, rx, ry, value.get("width", 0), value.get("height", 0), label=selector)
|
||||
return {
|
||||
focused_info = await self._read_focused_element(tab_id)
|
||||
resp = {
|
||||
"ok": True,
|
||||
"action": "click",
|
||||
"selector": selector,
|
||||
@@ -771,6 +803,9 @@ class BeelineBridge:
|
||||
"y": value.get("y", 0),
|
||||
"method": "javascript",
|
||||
}
|
||||
if focused_info:
|
||||
resp["focused_element"] = focused_info
|
||||
return resp
|
||||
|
||||
# If JavaScript click failed, try CDP approach
|
||||
if isinstance(value, dict) and value.get("error"):
|
||||
@@ -883,7 +918,8 @@ class BeelineBridge:
|
||||
w = bounds_value.get("width", 0)
|
||||
h = bounds_value.get("height", 0)
|
||||
await self.highlight_rect(tab_id, x - w / 2, y - h / 2, w, h, label=selector)
|
||||
return {
|
||||
focused_info = await self._read_focused_element(tab_id)
|
||||
resp = {
|
||||
"ok": True,
|
||||
"action": "click",
|
||||
"selector": selector,
|
||||
@@ -891,10 +927,29 @@ class BeelineBridge:
|
||||
"y": y,
|
||||
"method": "cdp",
|
||||
}
|
||||
if focused_info:
|
||||
resp["focused_element"] = focused_info
|
||||
return resp
|
||||
|
||||
except Exception as e:
|
||||
return {"ok": False, "error": f"Click failed: {e}"}
|
||||
|
||||
    async def _read_focused_element(self, tab_id: int) -> dict | None:
        """Read document.activeElement and return a compact descriptor.

        Returns None on any failure — never raises. Used by both click
        paths (selector-based click() and click_coordinate()) so the
        agent gets the same response shape regardless of which one was
        called. The descriptor lets the agent answer "did my click land
        on an editable?" without a second round-trip.
        """
        try:
            await self._try_enable_domain(tab_id, "Runtime")
            result = await self.evaluate(tab_id, _FOCUSED_ELEMENT_JS)
            return (result or {}).get("result")
        except Exception:
            return None
|
||||
|
||||
async def click_coordinate(self, tab_id: int, x: float, y: float, button: str = "left") -> dict:
|
||||
"""Click at specific coordinates."""
|
||||
await self.cdp_attach(tab_id)
|
||||
@@ -931,40 +986,7 @@ class BeelineBridge:
|
||||
|
||||
await self.highlight_point(tab_id, x, y, label=f"click ({x},{y})")
|
||||
|
||||
# Query the focused element after the click
|
||||
focused_info = None
|
||||
try:
|
||||
await self._try_enable_domain(tab_id, "Runtime")
|
||||
result = await self.evaluate(
|
||||
tab_id,
|
||||
"""
|
||||
(function() {
|
||||
var el = document.activeElement;
|
||||
if (!el || el === document.body) return null;
|
||||
var rect = el.getBoundingClientRect();
|
||||
var attrs = {};
|
||||
for (var i = 0; i < el.attributes.length && i < 10; i++) {
|
||||
attrs[el.attributes[i].name] = el.attributes[i].value.substring(0, 200);
|
||||
}
|
||||
return {
|
||||
tag: el.tagName.toLowerCase(),
|
||||
id: el.id || null,
|
||||
className: el.className || null,
|
||||
name: el.getAttribute('name') || null,
|
||||
type: el.getAttribute('type') || null,
|
||||
role: el.getAttribute('role') || null,
|
||||
text: (el.innerText || '').substring(0, 200),
|
||||
value: (el.value !== undefined ? String(el.value).substring(0, 200) : null),
|
||||
attributes: attrs,
|
||||
rect: { x: rect.x, y: rect.y, width: rect.width, height: rect.height }
|
||||
};
|
||||
})()
|
||||
""",
|
||||
)
|
||||
focused_info = (result or {}).get("result")
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
focused_info = await self._read_focused_element(tab_id)
|
||||
resp = {"ok": True, "action": "click_coordinate", "x": x, "y": y}
|
||||
if focused_info:
|
||||
resp["focused_element"] = focused_info
|
||||
|
||||
@@ -185,42 +185,59 @@ def register_interaction_tools(mcp: FastMCP) -> None:
|
||||
use_insert_text: bool = True,
|
||||
) -> dict:
|
||||
"""
|
||||
Type text into an input element.
|
||||
Insert text into the currently focused element via CDP Input.insertText.
|
||||
|
||||
Automatically routes through a real CDP pointer click on the
|
||||
element before inserting text — so that rich-text editors like
|
||||
Lexical (Gmail, LinkedIn DMs), Draft.js (X compose), and
|
||||
ProseMirror (Reddit) see a native focus event and enable their
|
||||
submit buttons. See the gcu-browser skill for the full "click-
|
||||
then-type" pattern.
|
||||
CANONICAL PATTERN — PREFER THIS:
|
||||
browser_click_coordinate(x, y) # click inspects focused_element
|
||||
browser_type(text="...") # NO selector — targets activeElement
|
||||
|
||||
When ``selector`` is omitted (None), types into the currently
|
||||
focused element — useful after ``browser_click_coordinate``
|
||||
has already focused the target.
|
||||
The click focuses the element (including through shadow roots and
|
||||
across iframes), and Input.insertText dispatches to
|
||||
document.activeElement regardless of DOM structure. This is the
|
||||
ONLY reliable way to type into:
|
||||
- LinkedIn's #interop-outlet Lexical composer
|
||||
- X/Twitter's Draft.js compose box
|
||||
- Reddit's ProseMirror comment box
|
||||
- Any site wrapped in Trusted Types CSP (innerHTML silently dropped)
|
||||
- Any nested-iframe message overlay (LinkedIn invitation manager)
|
||||
|
||||
By default uses CDP Input.insertText which is the most reliable
|
||||
way to insert text into rich editors. Set
|
||||
``use_insert_text=False`` to fall back to per-character
|
||||
keyDown/keyUp events (needed only for code editors that fire
|
||||
on specific keystrokes, or when ``delay_ms`` typing animation
|
||||
is required).
|
||||
CDP's Input.insertText takes no target parameter — it operates
|
||||
implicitly on the focused editable. That is why the no-selector
|
||||
path is shadow-agnostic and iframe-agnostic. DO NOT reach for
|
||||
browser_evaluate with document.execCommand('insertText', ...) or
|
||||
walk(root) shadow traversals; they reinvent this method with
|
||||
more escaping bugs.
|
||||
|
||||
When to pass ``selector`` (rare):
|
||||
- You want to type into a DIFFERENT element than the one
|
||||
currently focused, without a prior click.
|
||||
- The target is a plain <input> in the light DOM and you want
|
||||
a single-call shortcut (selector-based click is performed
|
||||
first, then insertText dispatches to the now-focused field).
|
||||
|
||||
By default uses CDP Input.insertText (``use_insert_text=True``).
|
||||
Set ``use_insert_text=False`` only for code editors that watch
|
||||
specific keystrokes, or when ``delay_ms`` typing animation is
|
||||
required.
|
||||
|
||||
Args:
|
||||
selector: CSS selector for the input element (None to type
|
||||
into the already-focused element)
|
||||
text: Text to type
|
||||
tab_id: Chrome tab ID (default: active tab)
|
||||
profile: Browser profile name (default: "default")
|
||||
text: Text to insert at the current cursor position.
|
||||
selector: CSS selector (OPTIONAL — prefer omitting). When
|
||||
omitted, dispatches Input.insertText to
|
||||
document.activeElement. When provided, performs a CDP
|
||||
click on the selector first, then inserts.
|
||||
tab_id: Chrome tab ID (default: active tab).
|
||||
profile: Browser profile name (default: "default").
|
||||
delay_ms: Delay between keystrokes in ms (default: 0).
|
||||
Forces the per-keystroke fallback when > 0.
|
||||
clear_first: Clear existing text before typing (default: True)
|
||||
timeout_ms: Timeout waiting for element (default: 30000)
|
||||
Forces the per-keystroke fallback when > 0.
|
||||
clear_first: Clear existing text before typing (default: True).
|
||||
timeout_ms: Timeout waiting for element (default: 30000).
|
||||
use_insert_text: Use CDP Input.insertText (default: True) for
|
||||
reliable insertion into rich-text editors.
|
||||
Set False for per-keystroke dispatch.
|
||||
reliable insertion into rich-text editors. Set False for
|
||||
per-keystroke dispatch.
|
||||
|
||||
Returns:
|
||||
Dict with type result
|
||||
Dict with type result.
|
||||
"""
|
||||
start = time.perf_counter()
|
||||
params = {"selector": selector, "text": text, "tab_id": tab_id, "profile": profile}
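
The dispatch rules described in the docstring above can be summarized as a small pure function. This is a sketch of the documented behavior only, not the bridge's actual code; `plan_type_action` and the returned labels are hypothetical names chosen for illustration.

```python
from typing import Optional


def plan_type_action(
    selector: Optional[str] = None,
    delay_ms: int = 0,
    use_insert_text: bool = True,
) -> dict:
    """Describe how a browser_type call would be routed, per the docstring."""
    # A selector means "click that element first"; without one, the text
    # goes to whatever element already has focus.
    focus = "click_selector" if selector is not None else "active_element"
    # delay_ms > 0 forces the per-keystroke fallback, as does
    # use_insert_text=False.
    if delay_ms > 0 or not use_insert_text:
        method = "per_keystroke"
    else:
        method = "insert_text"
    return {"focus": focus, "method": method}
```

Under these assumptions, the canonical call `browser_type(text="...")` corresponds to `plan_type_action()`, which yields the active-element focus path with `Input.insertText`.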
|
||||
|
||||