merge: PR #4636 feature/memory-inheritance into feat/inbox-management
Brings in append_data tool, continuous conversation mode, conversation judge, phase compaction, and prompt composer from the memory-inheritance feature branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -553,6 +553,26 @@ AskUserQuestion(questions=[{
- condition_expr (Python expression, only if conditional)
- priority (positive = forward, negative = feedback/loop-back)
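To make these fields concrete, here is a small sketch of a forward edge and a feedback edge. It is illustrative only; the node names, expressions, and priorities are assumptions, not the deep_research_agent's actual edges.

```python
from framework.graph import EdgeSpec, EdgeCondition

# Illustrative sketch only: node names, expressions, and priorities are assumed.
edges = [
    EdgeSpec(
        id="review-to-report",
        source="review",
        target="report",
        condition=EdgeCondition.CONDITIONAL,
        condition_expr="needs_more_research == False",  # Python expression over output keys
        priority=2,   # positive priority: forward progress
    ),
    EdgeSpec(
        id="review-to-research",
        source="review",
        target="research",
        condition=EdgeCondition.CONDITIONAL,
        condition_expr="needs_more_research == True",
        priority=-1,  # negative priority: feedback / loop-back
    ),
]
```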
**DETERMINE the graph lifecycle.** Not every agent needs a terminal node:

| Pattern | `terminal_nodes` | When to Use |
|---------|------------------|-------------|
| **Linear (finish)** | `["last-node"]` | Agent completes a task and exits (batch processing, one-shot generation) |
| **Forever-alive (loop)** | `[]` (empty) | Agent stays alive for continuous interaction (research assistant, personal assistant, monitoring) |

**Forever-alive pattern:** The deep_research_agent example uses `terminal_nodes=[]`. Every leaf node has edges that loop back to earlier nodes, creating a perpetual session. The agent only stops when the user explicitly exits. This is the preferred pattern for interactive, multi-turn agents.

**Key design rules for forever-alive graphs:**
- Every node must have at least one outgoing edge (no dead ends)
- Client-facing nodes block for user input — these are the natural "pause points"
- The user controls when to stop, not the graph
- Sessions accumulate memory across loops — plan for conversation compaction
- Use `conversation_mode="continuous"` to preserve conversation history across node transitions
- `max_iterations` should be set high (e.g., 100) since the agent is designed to run indefinitely
- The agent will NOT enter a "completed" execution state — this is intentional, not a bug
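A minimal sketch of what this configuration can look like, reusing the node names from the deep_research_agent example; the exact `GraphSpec` constructor arguments shown here are assumptions based on this change, not a verified signature.

```python
from framework.graph.edge import GraphSpec

# Sketch only: argument names are assumed from this change, not a verified signature.
graph = GraphSpec(
    nodes=[intake_node, research_node, review_node, report_node],  # specs defined in nodes.py
    edges=edges,                      # every node keeps at least one outgoing edge
    entry_node="intake",
    terminal_nodes=[],                # empty list = forever-alive; never reaches "completed"
    conversation_mode="continuous",   # keep one conversation thread across transitions
    loop_config={
        "max_iterations": 100,        # set high; the agent is designed to run indefinitely
        "max_history_tokens": 32000,  # plan for compaction on long sessions
    },
)
```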
**Ask the user** which lifecycle pattern fits their agent. Default to forever-alive for interactive agents, linear for batch/one-shot tasks.

**RENDER the complete graph as ASCII art.** Make it large and clear — the user needs to see and understand the full workflow at a glance.

**IMPORTANT: Make the ASCII art BIG and READABLE.** Use a box-and-arrow style with generous spacing. Do NOT make it tiny or compressed. Example format:
@@ -912,6 +932,46 @@ result = await executor.execute(graph=graph, goal=goal, input_data=input_data)
---

## REFERENCE: Graph Lifecycle & Conversation Memory

### Terminal vs Forever-Alive Graphs

Agents have two lifecycle patterns:

**Linear (terminal) graphs** have `terminal_nodes=["last-node"]`. Execution ends when the terminal node completes. The session enters a "completed" state. Use for batch processing, one-shot generation, and fire-and-forget tasks.

**Forever-alive graphs** have `terminal_nodes=[]` (empty). Every node has at least one outgoing edge — the graph loops indefinitely. The session **never enters a "completed" state** — this is intentional. The agent stays alive until the user explicitly exits. Use for interactive assistants, research tools, and any agent where the user drives the conversation.

The deep_research_agent example demonstrates this: `report` loops back to either `research` (dig deeper) or `intake` (new topic). The agent is a persistent, interactive assistant.

### Continuous Conversation Mode

When `conversation_mode="continuous"` is set on the GraphSpec, the framework preserves a **single conversation thread** across all node transitions:

**What the framework does automatically:**
- **Inherits conversation**: Same message history carries forward to the next node
- **Composes layered system prompts**: Identity (agent-level) + Narrative (auto-generated state summary) + Focus (per-node instructions)
- **Inserts transition markers**: At each node boundary, a "State of the World" message showing completed phases, current memory, and available data files
- **Accumulates tools**: Once a tool becomes available, it stays available in subsequent nodes
- **Compacts opportunistically**: At phase transitions, old tool results are pruned to stay within token budget

**What this means for agent builders:**
- Nodes don't need to re-explain context — the conversation carries it forward
- Output keys from earlier nodes are available in memory for edge conditions and later nodes
- For forever-alive agents, conversation memory persists across the entire session lifetime
- Plan for compaction: very long sessions will have older tool results summarized automatically

**When to use continuous mode:**
- Interactive agents with client-facing nodes (always)
- Multi-phase workflows where context matters across phases
- Forever-alive agents that loop indefinitely

**When NOT to use continuous mode:**
- Embarrassingly parallel fan-out nodes (each branch should be independent)
- Stateless utility agents that process items independently
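The prompt layering can be pictured as a simple composition step. The snippet below is a conceptual illustration of the three layers, not the framework's actual prompt-composer API.

```python
def compose_system_prompt(identity: str, narrative: str, focus: str) -> str:
    """Conceptual illustration of the layered ("onion") prompt in continuous mode.

    identity:  Layer 1, the agent-level identity_prompt, static across all nodes
    narrative: Layer 2, the auto-generated "State of the World" summary
    focus:     Layer 3, the current node's per-node instructions
    """
    return "\n\n".join(part for part in (identity, narrative, focus) if part)
```

In isolated mode there is no layering: each node's system_prompt is used as-is and the conversation starts fresh.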
---

## REFERENCE: Framework Capabilities for Qualification

Use this reference during STEP 2 to give accurate, honest assessments.
@@ -944,7 +1004,7 @@ Use this reference during STEP 2 to give accurate, honest assessments.
| Use Case | Why It's Problematic | Alternative |
|----------|---------------------|-------------|
| Long-running daemons | Framework is request-response, not persistent | External scheduler + agent |
| Persistent background daemons (no user) | Forever-alive graphs need a user at client-facing nodes; no autonomous background polling without user | External scheduler triggering agent runs |
| Sub-second responses | LLM latency is inherent | Traditional code, no LLM |
| Processing millions of items | Context windows and rate limits | Batch processing + sampling |
| Real-time streaming data | No built-in pub/sub or streaming input | Custom MCP server + agent |
@@ -979,3 +1039,6 @@ Use this reference during STEP 2 to give accurate, honest assessments.
11. **Adding framework gating for LLM behavior** - Fix prompts or use judges, not ad-hoc code
12. **Writing code before user approves the graph** - Always get approval on goal, nodes, and graph BEFORE writing any agent code
13. **Wrong mcp_servers.json format** - Use flat format (no `"mcpServers"` wrapper), `cwd` must be `"../../tools"`, and `command` must be `"uv"` with args `["run", "python", ...]` (see the sketch after this list)
14. **Assuming all agents need terminal nodes** - Interactive agents often work best with `terminal_nodes=[]` (forever-alive pattern). The agent never enters "completed" state — this is intentional. Only batch/one-shot agents need terminal nodes
15. **Creating dead-end nodes in forever-alive graphs** - Every node must have at least one outgoing edge. A node with no outgoing edges will cause execution to end unexpectedly, breaking the forever-alive loop
16. **Not using continuous conversation mode for interactive agents** - Multi-phase interactive agents should use `conversation_mode="continuous"` to preserve context across node transitions. Without it, each node starts with a blank conversation and loses all prior context
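For item 13, a minimal sketch of the expected flat layout; the server name and script filename below are placeholders, not entries from this repository.

```json
{
  "web_tools": {
    "command": "uv",
    "args": ["run", "python", "web_tools_server.py"],
    "cwd": "../../tools"
  }
}
```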
@@ -1,12 +1,15 @@
|
||||
"""Agent graph construction for Deep Research Agent."""
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
from framework.graph import EdgeSpec, EdgeCondition, Goal, SuccessCriterion, Constraint
|
||||
from framework.graph.edge import GraphSpec
|
||||
from framework.graph.executor import ExecutionResult, GraphExecutor
|
||||
from framework.runtime.event_bus import EventBus
|
||||
from framework.runtime.core import Runtime
|
||||
from framework.graph.executor import ExecutionResult
|
||||
from framework.graph.checkpoint_config import CheckpointConfig
|
||||
from framework.llm import LiteLLMProvider
|
||||
from framework.runner.tool_registry import ToolRegistry
|
||||
from framework.runtime.agent_runtime import AgentRuntime, create_agent_runtime
|
||||
from framework.runtime.execution_stream import EntryPointSpec
|
||||
|
||||
from .config import default_config, metadata
|
||||
from .nodes import (
|
||||
@@ -120,13 +123,31 @@ edges = [
|
||||
condition_expr="needs_more_research == False",
|
||||
priority=2,
|
||||
),
|
||||
# report -> research (user wants deeper research on current topic)
|
||||
EdgeSpec(
|
||||
id="report-to-research",
|
||||
source="report",
|
||||
target="research",
|
||||
condition=EdgeCondition.CONDITIONAL,
|
||||
condition_expr="str(next_action).lower() == 'more_research'",
|
||||
priority=2,
|
||||
),
|
||||
# report -> intake (user wants a new topic — default when not more_research)
|
||||
EdgeSpec(
|
||||
id="report-to-intake",
|
||||
source="report",
|
||||
target="intake",
|
||||
condition=EdgeCondition.CONDITIONAL,
|
||||
condition_expr="str(next_action).lower() != 'more_research'",
|
||||
priority=1,
|
||||
),
|
||||
]
|
||||
|
||||
# Graph configuration
|
||||
entry_node = "intake"
|
||||
entry_points = {"start": "intake"}
|
||||
pause_nodes = []
|
||||
terminal_nodes = ["report"]
|
||||
terminal_nodes = []
|
||||
|
||||
|
||||
class DeepResearchAgent:
|
||||
@@ -136,6 +157,12 @@ class DeepResearchAgent:
|
||||
Flow: intake -> research -> review -> report
|
||||
^ |
|
||||
+-- feedback loop (if user wants more)
|
||||
|
||||
Uses AgentRuntime for proper session management:
|
||||
- Session-scoped storage (sessions/{session_id}/)
|
||||
- Checkpointing for resume capability
|
||||
- Runtime logging
|
||||
- Data folder for save_data/load_data
|
||||
"""
|
||||
|
||||
def __init__(self, config=None):
|
||||
@@ -147,10 +174,10 @@ class DeepResearchAgent:
|
||||
self.entry_points = entry_points
|
||||
self.pause_nodes = pause_nodes
|
||||
self.terminal_nodes = terminal_nodes
|
||||
self._executor: GraphExecutor | None = None
|
||||
self._graph: GraphSpec | None = None
|
||||
self._event_bus: EventBus | None = None
|
||||
self._agent_runtime: AgentRuntime | None = None
|
||||
self._tool_registry: ToolRegistry | None = None
|
||||
self._storage_path: Path | None = None
|
||||
|
||||
def _build_graph(self) -> GraphSpec:
|
||||
"""Build the GraphSpec."""
|
||||
@@ -171,16 +198,20 @@ class DeepResearchAgent:
|
||||
"max_tool_calls_per_turn": 20,
|
||||
"max_history_tokens": 32000,
|
||||
},
|
||||
conversation_mode="continuous",
|
||||
identity_prompt=(
|
||||
"You are a rigorous research agent. You search for information "
|
||||
"from diverse, authoritative sources, analyze findings critically, "
|
||||
"and produce well-cited reports. You never fabricate information — "
|
||||
"every claim must trace back to a source you actually retrieved."
|
||||
),
|
||||
)
|
||||
|
||||
def _setup(self, mock_mode=False) -> GraphExecutor:
|
||||
"""Set up the executor with all components."""
|
||||
from pathlib import Path
|
||||
def _setup(self, mock_mode=False) -> None:
|
||||
"""Set up the agent runtime with sessions, checkpoints, and logging."""
|
||||
self._storage_path = Path.home() / ".hive" / "agents" / "deep_research_agent"
|
||||
self._storage_path.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
storage_path = Path.home() / ".hive" / "agents" / "deep_research_agent"
|
||||
storage_path.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
self._event_bus = EventBus()
|
||||
self._tool_registry = ToolRegistry()
|
||||
|
||||
mcp_config_path = Path(__file__).parent / "mcp_servers.json"
|
||||
@@ -199,47 +230,63 @@ class DeepResearchAgent:
|
||||
tools = list(self._tool_registry.get_tools().values())
|
||||
|
||||
self._graph = self._build_graph()
|
||||
runtime = Runtime(storage_path)
|
||||
|
||||
self._executor = GraphExecutor(
|
||||
runtime=runtime,
|
||||
checkpoint_config = CheckpointConfig(
|
||||
enabled=True,
|
||||
checkpoint_on_node_start=False,
|
||||
checkpoint_on_node_complete=True,
|
||||
checkpoint_max_age_days=7,
|
||||
async_checkpoint=True,
|
||||
)
|
||||
|
||||
entry_point_specs = [
|
||||
EntryPointSpec(
|
||||
id="default",
|
||||
name="Default",
|
||||
entry_node=self.entry_node,
|
||||
trigger_type="manual",
|
||||
isolation_level="shared",
|
||||
)
|
||||
]
|
||||
|
||||
self._agent_runtime = create_agent_runtime(
|
||||
graph=self._graph,
|
||||
goal=self.goal,
|
||||
storage_path=self._storage_path,
|
||||
entry_points=entry_point_specs,
|
||||
llm=llm,
|
||||
tools=tools,
|
||||
tool_executor=tool_executor,
|
||||
event_bus=self._event_bus,
|
||||
storage_path=storage_path,
|
||||
loop_config=self._graph.loop_config,
|
||||
checkpoint_config=checkpoint_config,
|
||||
)
|
||||
|
||||
return self._executor
|
||||
|
||||
async def start(self, mock_mode=False) -> None:
|
||||
"""Set up the agent (initialize executor and tools)."""
|
||||
if self._executor is None:
|
||||
"""Set up and start the agent runtime."""
|
||||
if self._agent_runtime is None:
|
||||
self._setup(mock_mode=mock_mode)
|
||||
if not self._agent_runtime.is_running:
|
||||
await self._agent_runtime.start()
|
||||
|
||||
async def stop(self) -> None:
|
||||
"""Clean up resources."""
|
||||
self._executor = None
|
||||
self._event_bus = None
|
||||
"""Stop the agent runtime and clean up."""
|
||||
if self._agent_runtime and self._agent_runtime.is_running:
|
||||
await self._agent_runtime.stop()
|
||||
self._agent_runtime = None
|
||||
|
||||
async def trigger_and_wait(
|
||||
self,
|
||||
entry_point: str,
|
||||
input_data: dict,
|
||||
entry_point: str = "default",
|
||||
input_data: dict | None = None,
|
||||
timeout: float | None = None,
|
||||
session_state: dict | None = None,
|
||||
) -> ExecutionResult | None:
|
||||
"""Execute the graph and wait for completion."""
|
||||
if self._executor is None:
|
||||
if self._agent_runtime is None:
|
||||
raise RuntimeError("Agent not started. Call start() first.")
|
||||
if self._graph is None:
|
||||
raise RuntimeError("Graph not built. Call start() first.")
|
||||
|
||||
return await self._executor.execute(
|
||||
graph=self._graph,
|
||||
goal=self.goal,
|
||||
input_data=input_data,
|
||||
return await self._agent_runtime.trigger_and_wait(
|
||||
entry_point_id=entry_point,
|
||||
input_data=input_data or {},
|
||||
session_state=session_state,
|
||||
)
|
||||
|
||||
@@ -250,7 +297,7 @@ class DeepResearchAgent:
|
||||
await self.start(mock_mode=mock_mode)
|
||||
try:
|
||||
result = await self.trigger_and_wait(
|
||||
"start", context, session_state=session_state
|
||||
"default", context, session_state=session_state
|
||||
)
|
||||
return result or ExecutionResult(success=False, error="Execution timeout")
|
||||
finally:
|
||||
|
||||
@@ -10,8 +10,13 @@ intake_node = NodeSpec(
|
||||
description="Discuss the research topic with the user, clarify scope, and confirm direction",
|
||||
node_type="event_loop",
|
||||
client_facing=True,
|
||||
max_node_visits=0,
|
||||
input_keys=["topic"],
|
||||
output_keys=["research_brief"],
|
||||
success_criteria=(
|
||||
"The research brief is specific and actionable: it states the topic, "
|
||||
"the key questions to answer, the desired scope, and depth."
|
||||
),
|
||||
system_prompt="""\
|
||||
You are a research intake specialist. The user wants to research a topic.
|
||||
Have a brief conversation to clarify what they need.
|
||||
@@ -38,10 +43,14 @@ research_node = NodeSpec(
|
||||
name="Research",
|
||||
description="Search the web, fetch source content, and compile findings",
|
||||
node_type="event_loop",
|
||||
max_node_visits=3,
|
||||
max_node_visits=0,
|
||||
input_keys=["research_brief", "feedback"],
|
||||
output_keys=["findings", "sources", "gaps"],
|
||||
nullable_output_keys=["feedback"],
|
||||
success_criteria=(
|
||||
"Findings reference at least 3 distinct sources with URLs. "
|
||||
"Key claims are substantiated by fetched content, not generated."
|
||||
),
|
||||
system_prompt="""\
|
||||
You are a research agent. Given a research brief, find and analyze sources.
|
||||
|
||||
@@ -56,18 +65,19 @@ Work in phases:
|
||||
and any contradictions between sources.
|
||||
|
||||
Important:
|
||||
- Work in batches of 3-4 tool calls at a time to manage context
|
||||
- Work in batches of 3-4 tool calls at a time — never more than 10 per turn
|
||||
- After each batch, assess whether you have enough material
|
||||
- Prefer quality over quantity — 5 good sources beat 15 thin ones
|
||||
- Track which URL each finding comes from (you'll need citations later)
|
||||
- Call set_output for each key in a SEPARATE turn (not in the same turn as other tool calls)
|
||||
|
||||
When done, use set_output:
|
||||
When done, use set_output (one key at a time, separate turns):
|
||||
- set_output("findings", "Structured summary: key findings with source URLs for each claim. \
|
||||
Include themes, contradictions, and confidence levels.")
|
||||
- set_output("sources", [{"url": "...", "title": "...", "summary": "..."}])
|
||||
- set_output("gaps", "What aspects of the research brief are NOT well-covered yet, if any.")
|
||||
""",
|
||||
tools=["web_search", "web_scrape", "load_data", "save_data", "list_data_files"],
|
||||
tools=["web_search", "web_scrape", "load_data", "save_data", "append_data", "list_data_files"],
|
||||
)
|
||||
|
||||
# Node 3: Review (client-facing)
|
||||
@@ -78,9 +88,13 @@ review_node = NodeSpec(
|
||||
description="Present findings to user and decide whether to research more or write the report",
|
||||
node_type="event_loop",
|
||||
client_facing=True,
|
||||
max_node_visits=3,
|
||||
max_node_visits=0,
|
||||
input_keys=["findings", "sources", "gaps", "research_brief"],
|
||||
output_keys=["needs_more_research", "feedback"],
|
||||
success_criteria=(
|
||||
"The user has been presented with findings and has explicitly indicated "
|
||||
"whether they want more research or are ready for the report."
|
||||
),
|
||||
system_prompt="""\
|
||||
Present the research findings to the user clearly and concisely.
|
||||
|
||||
@@ -109,49 +123,66 @@ report_node = NodeSpec(
|
||||
description="Write a cited HTML report from the findings and present it to the user",
|
||||
node_type="event_loop",
|
||||
client_facing=True,
|
||||
max_node_visits=0,
|
||||
input_keys=["findings", "sources", "research_brief"],
|
||||
output_keys=["delivery_status"],
|
||||
output_keys=["delivery_status", "next_action"],
|
||||
success_criteria=(
|
||||
"An HTML report has been saved, the file link has been presented to the user, "
|
||||
"and the user has indicated what they want to do next."
|
||||
),
|
||||
system_prompt="""\
|
||||
Write a comprehensive research report as an HTML file and present it to the user.
|
||||
Write a research report as an HTML file and present it to the user.
|
||||
|
||||
**STEP 1 — Write the HTML report (tool calls, NO text to user yet):**
|
||||
IMPORTANT: save_data requires TWO separate arguments: filename and data.
|
||||
Call it like: save_data(filename="report.html", data="<html>...</html>")
|
||||
Do NOT use _raw, do NOT nest arguments inside a JSON string.
|
||||
|
||||
1. Compose a complete, self-contained HTML document with embedded CSS styling.
|
||||
Use a clean, readable design: max-width container, pleasant typography,
|
||||
numbered citation links, a table of contents, and a references section.
|
||||
**STEP 1 — Write and save the HTML report (tool calls, NO text to user yet):**
|
||||
|
||||
Report structure inside the HTML:
|
||||
- Title & date
|
||||
- Executive Summary (2-3 paragraphs)
|
||||
- Table of Contents
|
||||
- Findings (organized by theme, with [n] citation links)
|
||||
- Analysis (synthesis, implications, areas of debate)
|
||||
- Conclusion (key takeaways, confidence assessment)
|
||||
- References (numbered list with clickable URLs)
|
||||
Build a clean HTML document. Keep the HTML concise — aim for clarity over length.
|
||||
Use minimal embedded CSS (a few lines of style, not a full framework).
|
||||
|
||||
Requirements:
|
||||
- Every factual claim must cite its source with [n] notation
|
||||
- Be objective — present multiple viewpoints where sources disagree
|
||||
- Distinguish well-supported conclusions from speculation
|
||||
- Answer the original research questions from the brief
|
||||
Report structure:
|
||||
- Title & date
|
||||
- Executive Summary (2-3 paragraphs)
|
||||
- Key Findings (organized by theme, with [n] citation links)
|
||||
- Analysis (synthesis, implications)
|
||||
- Conclusion (key takeaways)
|
||||
- References (numbered list with clickable URLs)
|
||||
|
||||
2. Save the HTML file:
|
||||
save_data(filename="report.html", data=<your_html>)
|
||||
Requirements:
|
||||
- Every factual claim must cite its source with [n] notation
|
||||
- Be objective — present multiple viewpoints where sources disagree
|
||||
- Answer the original research questions from the brief
|
||||
|
||||
3. Get the clickable link:
|
||||
serve_file_to_user(filename="report.html", label="Research Report")
|
||||
Save the HTML:
|
||||
save_data(filename="report.html", data="<html>...</html>")
|
||||
|
||||
Then get the clickable link:
|
||||
serve_file_to_user(filename="report.html", label="Research Report")
|
||||
|
||||
If save_data fails, simplify and shorten the HTML, then retry.
|
||||
|
||||
**STEP 2 — Present the link to the user (text only, NO tool calls):**
|
||||
|
||||
Tell the user the report is ready and include the file:// URI from
|
||||
serve_file_to_user so they can click it to open. Give a brief summary
|
||||
of what the report covers. Ask if they have questions.
|
||||
of what the report covers. Ask if they have questions or want to continue.
|
||||
|
||||
**STEP 3 — After the user responds:**
|
||||
- Answer follow-up questions from the research material
|
||||
- When the user is satisfied: set_output("delivery_status", "completed")
|
||||
- Answer any follow-up questions from the research material
|
||||
- When the user is ready to move on, ask what they'd like to do next:
|
||||
- Research a new topic?
|
||||
- Dig deeper into the current topic?
|
||||
- Then call set_output:
|
||||
- set_output("delivery_status", "completed")
|
||||
- set_output("next_action", "new_topic") — if they want a new topic
|
||||
- set_output("next_action", "more_research") — if they want deeper research
|
||||
""",
|
||||
tools=["save_data", "serve_file_to_user", "load_data", "list_data_files"],
|
||||
tools=[
|
||||
"save_data", "append_data", "edit_data",
|
||||
"serve_file_to_user", "load_data", "list_data_files",
|
||||
],
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
|
||||
@@ -26,6 +26,17 @@ Use `/hive-debugger` when:
This skill works alongside agents running in TUI mode and provides supervisor-level insights into execution behavior.

### Forever-Alive Agent Awareness

Some agents use `terminal_nodes=[]` (the "forever-alive" pattern), meaning they loop indefinitely and never enter a "completed" execution state. For these agents:
- Sessions with status "in_progress" or "paused" are **normal**, not failures
- High step counts, long durations, and many node visits are expected behavior
- The agent stops only when the user explicitly exits — there is no graph-driven completion
- Debug focus should be on **quality of individual node visits and iterations**, not whether the session reached a terminal state
- Conversation memory accumulates across loops — watch for context overflow and stale data issues

**How to identify forever-alive agents:** Check `agent.py` or `agent.json` for `terminal_nodes=[]` (empty list). If empty, the agent is forever-alive.

---

## Prerequisites
@@ -142,6 +153,7 @@ Store the selected mode for the session.
- Check `attention_summary.categories` for issue types
- Note the `run_id` of problematic sessions
- Check `status` field: "degraded", "failure", "in_progress"
- **For forever-alive agents:** Sessions with status "in_progress" or "paused" are normal — these agents never reach "completed". Only flag sessions with `needs_attention: true` or actual error indicators (tool failures, retry loops, missing outputs). High step counts alone do not indicate a problem.

3. **Attention flag triggers to understand:**
From runtime_logger.py, runs are flagged when:
@@ -199,13 +211,20 @@ Which run would you like to investigate?
| **Tool Errors** | `tool_error_count > 0`, `attention_reasons` contains "tool_failures" | Tool calls failed (API errors, timeouts, auth issues) |
| **Retry Loops** | `retry_count > 3`, `verdict_counts.RETRY > 5` | Judge repeatedly rejecting outputs |
| **Guard Failures** | `guard_reject_count > 0` | Output validation failed (wrong types, missing keys) |
| **Stalled Execution** | `total_steps > 20`, `verdict_counts.CONTINUE > 10` | EventLoopNode not making progress |
| **Stalled Execution** | `total_steps > 20`, `verdict_counts.CONTINUE > 10` | EventLoopNode not making progress. **Caveat:** Forever-alive agents may legitimately have high step counts — check if agent is blocked at a client-facing node (normal) vs genuinely stuck in a loop |
| **High Latency** | `latency_ms > 60000`, `avg_step_latency > 5000` | Slow tool calls or LLM responses |
| **Client-Facing Issues** | `client_input_requested` but no `user_input_received` | Premature set_output before user input |
| **Edge Routing Errors** | `exit_status == "no_valid_edge"`, `attention_reasons` contains "routing_issue" | No edges match current state |
| **Memory/Context Issues** | `tokens_used > 100000`, `context_overflow_count > 0` | Conversation history too long |
| **Constraint Violations** | Compare output against goal constraints | Agent violated goal-level rules |

**Forever-Alive Agent Caveat:** If the agent uses `terminal_nodes=[]`, sessions will never reach "completed" status. This is by design. When debugging these agents, focus on:
- Whether individual node visits succeed (not whether the graph "finishes")
- Quality of each loop iteration — are outputs improving or degrading across loops?
- Whether client-facing nodes are correctly blocking for user input
- Memory accumulation issues: stale data from previous loops, context overflow across many iterations
- Conversation compaction behavior: is the conversation growing unbounded?

3. **Analyze each flagged node:**
- Node ID and name
- Exit status
@@ -1015,6 +1034,9 @@ Your agent should now work correctly!"
3. **Don't ignore edge conditions** - Missing edges cause routing failures
4. **Don't overlook judge configuration** - Mismatched expectations cause retry loops
5. **Don't forget nullable_output_keys** - Optional inputs need explicit marking
6. **Don't diagnose "in_progress" as a failure for forever-alive agents** - Agents with `terminal_nodes=[]` are designed to never enter "completed" state. This is intentional. Focus on quality of individual node visits, not session completion status
7. **Don't ignore conversation memory issues in long-running sessions** - In continuous conversation mode, history grows across node transitions and loop iterations. Watch for context overflow (tokens_used > 100K), stale data from previous loops affecting edge conditions, and compaction failures that cause the LLM to lose important context
8. **Don't confuse "waiting for user" with "stalled"** - Client-facing nodes in forever-alive agents block for user input by design. A session paused at a client-facing node is working correctly, not stalled

---
@@ -27,6 +27,9 @@ class Message:
|
||||
tool_use_id: str | None = None
|
||||
tool_calls: list[dict[str, Any]] | None = None
|
||||
is_error: bool = False
|
||||
# Phase-aware compaction metadata (continuous mode)
|
||||
phase_id: str | None = None
|
||||
is_transition_marker: bool = False
|
||||
|
||||
def to_llm_dict(self) -> dict[str, Any]:
|
||||
"""Convert to OpenAI-format message dict."""
|
||||
@@ -60,6 +63,10 @@ class Message:
|
||||
d["tool_calls"] = self.tool_calls
|
||||
if self.is_error:
|
||||
d["is_error"] = self.is_error
|
||||
if self.phase_id is not None:
|
||||
d["phase_id"] = self.phase_id
|
||||
if self.is_transition_marker:
|
||||
d["is_transition_marker"] = self.is_transition_marker
|
||||
return d
|
||||
|
||||
@classmethod
|
||||
@@ -72,6 +79,8 @@ class Message:
|
||||
tool_use_id=data.get("tool_use_id"),
|
||||
tool_calls=data.get("tool_calls"),
|
||||
is_error=data.get("is_error", False),
|
||||
phase_id=data.get("phase_id"),
|
||||
is_transition_marker=data.get("is_transition_marker", False),
|
||||
)
|
||||
|
||||
|
||||
@@ -188,6 +197,7 @@ class NodeConversation:
|
||||
self._next_seq: int = 0
|
||||
self._meta_persisted: bool = False
|
||||
self._last_api_input_tokens: int | None = None
|
||||
self._current_phase: str | None = None
|
||||
|
||||
# --- Properties --------------------------------------------------------
|
||||
|
||||
@@ -195,6 +205,33 @@ class NodeConversation:
|
||||
def system_prompt(self) -> str:
|
||||
return self._system_prompt
|
||||
|
||||
def update_system_prompt(self, new_prompt: str) -> None:
|
||||
"""Update the system prompt.
|
||||
|
||||
Used in continuous conversation mode at phase transitions to swap
|
||||
Layer 3 (focus) while preserving the conversation history.
|
||||
"""
|
||||
self._system_prompt = new_prompt
|
||||
|
||||
def set_current_phase(self, phase_id: str) -> None:
|
||||
"""Set the current phase ID. Subsequent messages will be stamped with it."""
|
||||
self._current_phase = phase_id
|
||||
|
||||
async def switch_store(self, new_store: ConversationStore) -> None:
|
||||
"""Switch to a new persistence store at a phase transition.
|
||||
|
||||
Subsequent messages are written to *new_store*. Meta (system
|
||||
prompt, config) is re-persisted on the next write so the new
|
||||
store's ``meta.json`` reflects the updated prompt.
|
||||
"""
|
||||
self._store = new_store
|
||||
self._meta_persisted = False
|
||||
await new_store.write_cursor({"next_seq": self._next_seq})
|
||||
|
||||
@property
|
||||
def current_phase(self) -> str | None:
|
||||
return self._current_phase
|
||||
|
||||
@property
|
||||
def messages(self) -> list[Message]:
|
||||
"""Return a defensive copy of the message list."""
|
||||
@@ -216,8 +253,19 @@ class NodeConversation:
|
||||
|
||||
# --- Add messages ------------------------------------------------------
|
||||
|
||||
async def add_user_message(self, content: str) -> Message:
|
||||
msg = Message(seq=self._next_seq, role="user", content=content)
|
||||
async def add_user_message(
|
||||
self,
|
||||
content: str,
|
||||
*,
|
||||
is_transition_marker: bool = False,
|
||||
) -> Message:
|
||||
msg = Message(
|
||||
seq=self._next_seq,
|
||||
role="user",
|
||||
content=content,
|
||||
phase_id=self._current_phase,
|
||||
is_transition_marker=is_transition_marker,
|
||||
)
|
||||
self._messages.append(msg)
|
||||
self._next_seq += 1
|
||||
await self._persist(msg)
|
||||
@@ -233,6 +281,7 @@ class NodeConversation:
|
||||
role="assistant",
|
||||
content=content,
|
||||
tool_calls=tool_calls,
|
||||
phase_id=self._current_phase,
|
||||
)
|
||||
self._messages.append(msg)
|
||||
self._next_seq += 1
|
||||
@@ -251,6 +300,7 @@ class NodeConversation:
|
||||
content=content,
|
||||
tool_use_id=tool_use_id,
|
||||
is_error=is_error,
|
||||
phase_id=self._current_phase,
|
||||
)
|
||||
self._messages.append(msg)
|
||||
self._next_seq += 1
|
||||
@@ -380,6 +430,11 @@ class NodeConversation:
|
||||
spillover filename reference (if any). Message structure (role,
|
||||
seq, tool_use_id) stays valid for the LLM API.
|
||||
|
||||
Phase-aware behavior (continuous mode): when messages have ``phase_id``
|
||||
metadata, all messages in the current phase are protected regardless of
|
||||
token budget. Transition markers are never pruned. Older phases' tool
|
||||
results are pruned more aggressively.
|
||||
|
||||
Error tool results are never pruned — they prevent re-calling
|
||||
failing tools.
|
||||
|
||||
@@ -388,13 +443,18 @@ class NodeConversation:
|
||||
if not self._messages:
|
||||
return 0
|
||||
|
||||
# Phase 1: Walk backward, classify tool results as protected vs pruneable
|
||||
# Walk backward, classify tool results as protected vs pruneable
|
||||
protected_tokens = 0
|
||||
pruneable: list[int] = [] # indices into self._messages
|
||||
pruneable_tokens = 0
|
||||
|
||||
for i in range(len(self._messages) - 1, -1, -1):
|
||||
msg = self._messages[i]
|
||||
|
||||
# Transition markers are never pruned (any role)
|
||||
if msg.is_transition_marker:
|
||||
continue
|
||||
|
||||
if msg.role != "tool":
|
||||
continue
|
||||
if msg.is_error:
|
||||
@@ -402,6 +462,10 @@ class NodeConversation:
|
||||
if msg.content.startswith("[Pruned tool result"):
|
||||
continue # already pruned
|
||||
|
||||
# Phase-aware: protect current phase messages
|
||||
if self._current_phase and msg.phase_id == self._current_phase:
|
||||
continue
|
||||
|
||||
est = len(msg.content) // 4
|
||||
if protected_tokens < protect_tokens:
|
||||
protected_tokens += est
|
||||
@@ -409,11 +473,11 @@ class NodeConversation:
|
||||
pruneable.append(i)
|
||||
pruneable_tokens += est
|
||||
|
||||
# Phase 2: Only prune if enough to be worthwhile
|
||||
# Only prune if enough to be worthwhile
|
||||
if pruneable_tokens < min_prune_tokens:
|
||||
return 0
|
||||
|
||||
# Phase 3: Replace content with compact placeholder
|
||||
# Replace content with compact placeholder
|
||||
count = 0
|
||||
for i in pruneable:
|
||||
msg = self._messages[i]
|
||||
@@ -436,6 +500,8 @@ class NodeConversation:
|
||||
tool_use_id=msg.tool_use_id,
|
||||
tool_calls=msg.tool_calls,
|
||||
is_error=msg.is_error,
|
||||
phase_id=msg.phase_id,
|
||||
is_transition_marker=msg.is_transition_marker,
|
||||
)
|
||||
count += 1
|
||||
|
||||
|
||||
@@ -0,0 +1,177 @@
|
||||
"""Level 2 Conversation-Aware Judge.
|
||||
|
||||
When a node has `success_criteria` set, the implicit judge upgrades:
|
||||
after Level 0 passes (all output keys set), a fast LLM call evaluates
|
||||
whether the conversation actually meets the criteria.
|
||||
|
||||
This prevents nodes from "checking boxes" (setting output keys) without
|
||||
doing quality work. The LLM reads the recent conversation and assesses
|
||||
whether the phase's goal was genuinely accomplished.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
from dataclasses import dataclass
|
||||
from typing import Any
|
||||
|
||||
from framework.graph.conversation import NodeConversation
|
||||
from framework.llm.provider import LLMProvider
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
@dataclass
|
||||
class PhaseVerdict:
|
||||
"""Result of Level 2 conversation-aware evaluation."""
|
||||
|
||||
action: str # "ACCEPT" or "RETRY"
|
||||
confidence: float = 0.8
|
||||
feedback: str = ""
|
||||
|
||||
|
||||
async def evaluate_phase_completion(
|
||||
llm: LLMProvider,
|
||||
conversation: NodeConversation,
|
||||
phase_name: str,
|
||||
phase_description: str,
|
||||
success_criteria: str,
|
||||
accumulator_state: dict[str, Any],
|
||||
max_history_tokens: int = 8_196,
|
||||
) -> PhaseVerdict:
|
||||
"""Level 2 judge: read the conversation and evaluate quality.
|
||||
|
||||
Only called after Level 0 passes (all output keys set).
|
||||
|
||||
Args:
|
||||
llm: LLM provider for evaluation
|
||||
conversation: The current conversation to evaluate
|
||||
phase_name: Name of the current phase/node
|
||||
phase_description: Description of the phase
|
||||
success_criteria: Natural-language criteria for phase completion
|
||||
accumulator_state: Current output key values
|
||||
max_history_tokens: Main conversation token budget (judge gets 20%)
|
||||
|
||||
Returns:
|
||||
PhaseVerdict with action and optional feedback
|
||||
"""
|
||||
# Build a compact view of the recent conversation
|
||||
recent_messages = _extract_recent_context(conversation, max_messages=10)
|
||||
outputs_summary = _format_outputs(accumulator_state)
|
||||
|
||||
system_prompt = (
|
||||
"You are a quality judge evaluating whether a phase of work is complete. "
|
||||
"Be concise. Evaluate based on the success criteria, not on style."
|
||||
)
|
||||
|
||||
user_prompt = f"""Evaluate this phase:
|
||||
|
||||
PHASE: {phase_name}
|
||||
DESCRIPTION: {phase_description}
|
||||
|
||||
SUCCESS CRITERIA:
|
||||
{success_criteria}
|
||||
|
||||
OUTPUTS SET:
|
||||
{outputs_summary}
|
||||
|
||||
RECENT CONVERSATION:
|
||||
{recent_messages}
|
||||
|
||||
Has this phase accomplished its goal based on the success criteria?
|
||||
|
||||
Respond in exactly this format:
|
||||
ACTION: ACCEPT or RETRY
|
||||
CONFIDENCE: 0.X
|
||||
FEEDBACK: (reason if RETRY, empty if ACCEPT)"""
|
||||
|
||||
try:
|
||||
response = llm.complete(
|
||||
messages=[{"role": "user", "content": user_prompt}],
|
||||
system=system_prompt,
|
||||
max_tokens=max(1024, max_history_tokens // 5),
|
||||
max_retries=1,
|
||||
)
|
||||
if not response.content or not response.content.strip():
|
||||
logger.debug("Level 2 judge: empty response, accepting by default")
|
||||
return PhaseVerdict(action="ACCEPT", confidence=0.5, feedback="")
|
||||
return _parse_verdict(response.content)
|
||||
except Exception as e:
|
||||
logger.warning(f"Level 2 judge failed, accepting by default: {e}")
|
||||
# On failure, don't block — Level 0 already passed
|
||||
return PhaseVerdict(action="ACCEPT", confidence=0.5, feedback="")
|
||||
|
||||
|
||||
def _extract_recent_context(conversation: NodeConversation, max_messages: int = 10) -> str:
|
||||
"""Extract recent conversation messages for evaluation."""
|
||||
messages = conversation.messages
|
||||
recent = messages[-max_messages:] if len(messages) > max_messages else messages
|
||||
|
||||
parts = []
|
||||
for msg in recent:
|
||||
role = msg.role.upper()
|
||||
content = msg.content or ""
|
||||
# Truncate long tool results
|
||||
if msg.role == "tool" and len(content) > 200:
|
||||
content = content[:200] + "..."
|
||||
if content.strip():
|
||||
parts.append(f"[{role}]: {content.strip()}")
|
||||
|
||||
return "\n".join(parts) if parts else "(no messages)"
|
||||
|
||||
|
||||
def _format_outputs(accumulator_state: dict[str, Any]) -> str:
|
||||
"""Format output key values for evaluation.
|
||||
|
||||
Lists and dicts get structural formatting so the judge can assess
|
||||
quantity and structure, not just a truncated stringification.
|
||||
"""
|
||||
if not accumulator_state:
|
||||
return "(none)"
|
||||
parts = []
|
||||
for key, value in accumulator_state.items():
|
||||
if isinstance(value, list):
|
||||
# Show count + brief per-item preview so the judge can
|
||||
# verify quantity without the full serialization.
|
||||
items_preview = []
|
||||
for i, item in enumerate(value[:8]):
|
||||
item_str = str(item)
|
||||
if len(item_str) > 150:
|
||||
item_str = item_str[:150] + "..."
|
||||
items_preview.append(f" [{i}]: {item_str}")
|
||||
val_str = f"list ({len(value)} items):\n" + "\n".join(items_preview)
|
||||
if len(value) > 8:
|
||||
val_str += f"\n ... and {len(value) - 8} more"
|
||||
elif isinstance(value, dict):
|
||||
val_str = str(value)
|
||||
if len(val_str) > 400:
|
||||
val_str = val_str[:400] + "..."
|
||||
else:
|
||||
val_str = str(value)
|
||||
if len(val_str) > 300:
|
||||
val_str = val_str[:300] + "..."
|
||||
parts.append(f" {key}: {val_str}")
|
||||
return "\n".join(parts)
|
||||
|
||||
|
||||
def _parse_verdict(response: str) -> PhaseVerdict:
|
||||
"""Parse LLM response into PhaseVerdict."""
|
||||
action = "ACCEPT"
|
||||
confidence = 0.8
|
||||
feedback = ""
|
||||
|
||||
for line in response.strip().split("\n"):
|
||||
line = line.strip()
|
||||
if line.startswith("ACTION:"):
|
||||
action_str = line.split(":", 1)[1].strip().upper()
|
||||
if action_str in ("ACCEPT", "RETRY"):
|
||||
action = action_str
|
||||
elif line.startswith("CONFIDENCE:"):
|
||||
try:
|
||||
confidence = float(line.split(":", 1)[1].strip())
|
||||
except ValueError:
|
||||
pass
|
||||
elif line.startswith("FEEDBACK:"):
|
||||
feedback = line.split(":", 1)[1].strip()
|
||||
|
||||
return PhaseVerdict(action=action, confidence=confidence, feedback=feedback)
|
||||
@@ -435,6 +435,25 @@ class GraphSpec(BaseModel):
|
||||
description="EventLoopNode configuration (max_iterations, max_tool_calls_per_turn, etc.)",
|
||||
)
|
||||
|
||||
# Conversation mode
|
||||
conversation_mode: str = Field(
|
||||
default="continuous",
|
||||
description=(
|
||||
"How conversations flow between event_loop nodes. "
|
||||
"'continuous' (default): one conversation threads through all "
|
||||
"event_loop nodes with cumulative tools and layered prompt composition. "
|
||||
"'isolated': each node gets a fresh conversation."
|
||||
),
|
||||
)
|
||||
identity_prompt: str | None = Field(
|
||||
default=None,
|
||||
description=(
|
||||
"Agent-level identity prompt (Layer 1 of the onion model). "
|
||||
"In continuous mode, this is the static identity that persists "
|
||||
"unchanged across all node transitions. In isolated mode, ignored."
|
||||
),
|
||||
)
|
||||
|
||||
# Metadata
|
||||
description: str = ""
|
||||
created_by: str = "" # "human" or "builder_agent"
|
||||
|
||||
@@ -149,7 +149,7 @@ class EventLoopNode(NodeProtocol):
|
||||
1. Try to restore from durable state (crash recovery)
|
||||
2. If no prior state, init from NodeSpec.system_prompt + input_keys
|
||||
3. Loop: drain injection queue -> stream LLM -> execute tools
|
||||
-> if client_facing + ask_user called: block for user input
|
||||
-> if client_facing: block for user input (see below)
|
||||
-> judge evaluates (acceptance criteria)
|
||||
(each add_* and set_output writes through to store immediately)
|
||||
4. Publish events to EventBus at each stage
|
||||
@@ -157,11 +157,17 @@ class EventLoopNode(NodeProtocol):
|
||||
6. Terminate when judge returns ACCEPT, shutdown signaled, or max iterations
|
||||
7. Build output dict from OutputAccumulator
|
||||
|
||||
Client-facing blocking: When ``client_facing=True``, a synthetic
|
||||
``ask_user`` tool is injected. The node blocks for user input ONLY
|
||||
when the LLM explicitly calls ``ask_user()``. Text-only turns
|
||||
without ``ask_user`` flow through without blocking, allowing the LLM
|
||||
to stream progress updates and summaries freely.
|
||||
Client-facing blocking (``client_facing=True``):
|
||||
|
||||
- **Text-only turns** (no real tool calls, no set_output)
|
||||
automatically block for user input. If the LLM is talking to the
|
||||
user (not calling tools or setting outputs), it should wait for
|
||||
the user's response before the judge runs.
|
||||
- **Work turns** (tool calls or set_output) flow through without
|
||||
blocking — the LLM is making progress, not asking the user.
|
||||
- A synthetic ``ask_user`` tool is also injected for explicit
|
||||
blocking when the LLM wants to be deliberate about requesting
|
||||
input (e.g. mid-tool-call).
|
||||
|
||||
Always returns NodeResult with retryable=False semantics. The executor
|
||||
must NOT retry event loop nodes -- retry is handled internally by the
|
||||
@@ -234,23 +240,44 @@ class EventLoopNode(NodeProtocol):
|
||||
return NodeResult(success=False, error=error_msg)
|
||||
|
||||
# 2. Restore or create new conversation + accumulator
|
||||
conversation, accumulator, start_iteration = await self._restore(ctx)
|
||||
if conversation is None:
|
||||
system_prompt = ctx.node_spec.system_prompt or ""
|
||||
# Track whether we're in continuous mode (conversation threaded across nodes)
|
||||
_is_continuous = getattr(ctx, "continuous_mode", False)
|
||||
|
||||
conversation = NodeConversation(
|
||||
system_prompt=system_prompt,
|
||||
max_history_tokens=self._config.max_history_tokens,
|
||||
output_keys=ctx.node_spec.output_keys or None,
|
||||
store=self._conversation_store,
|
||||
)
|
||||
if _is_continuous and ctx.inherited_conversation is not None:
|
||||
# Continuous mode with inherited conversation from prior node.
|
||||
# This takes priority over store restoration — when the graph loops
|
||||
# back to a previously-visited node, the inherited conversation
|
||||
# carries forward the full thread rather than restoring stale state.
|
||||
# System prompt already updated by executor. Transition marker
|
||||
# already inserted by executor. Fresh accumulator for this phase.
|
||||
# Phase already set by executor via set_current_phase().
|
||||
conversation = ctx.inherited_conversation
|
||||
conversation._output_keys = ctx.node_spec.output_keys or None
|
||||
accumulator = OutputAccumulator(store=self._conversation_store)
|
||||
start_iteration = 0
|
||||
else:
|
||||
# Try crash-recovery restore from store, then fall back to fresh.
|
||||
conversation, accumulator, start_iteration = await self._restore(ctx)
|
||||
if conversation is None:
|
||||
# Fresh conversation: either isolated mode or first node in continuous mode.
|
||||
system_prompt = ctx.node_spec.system_prompt or ""
|
||||
|
||||
# Add initial user message from input data
|
||||
initial_message = self._build_initial_message(ctx)
|
||||
if initial_message:
|
||||
await conversation.add_user_message(initial_message)
|
||||
conversation = NodeConversation(
|
||||
system_prompt=system_prompt,
|
||||
max_history_tokens=self._config.max_history_tokens,
|
||||
output_keys=ctx.node_spec.output_keys or None,
|
||||
store=self._conversation_store,
|
||||
)
|
||||
# Stamp phase for first node in continuous mode
|
||||
if _is_continuous:
|
||||
conversation.set_current_phase(ctx.node_id)
|
||||
accumulator = OutputAccumulator(store=self._conversation_store)
|
||||
start_iteration = 0
|
||||
|
||||
# Add initial user message from input data
|
||||
initial_message = self._build_initial_message(ctx)
|
||||
if initial_message:
|
||||
await conversation.add_user_message(initial_message)
|
||||
|
||||
# 3. Build tool list: node tools + synthetic set_output + ask_user tools
|
||||
tools = list(ctx.available_tools)
|
||||
@@ -305,6 +332,7 @@ class EventLoopNode(NodeProtocol):
|
||||
output=accumulator.to_dict(),
|
||||
tokens_used=total_input_tokens + total_output_tokens,
|
||||
latency_ms=latency_ms,
|
||||
conversation=conversation if _is_continuous else None,
|
||||
)
|
||||
|
||||
# 6b. Drain injection queue
|
||||
@@ -429,6 +457,7 @@ class EventLoopNode(NodeProtocol):
|
||||
output=accumulator.to_dict(),
|
||||
tokens_used=total_input_tokens + total_output_tokens,
|
||||
latency_ms=latency_ms,
|
||||
conversation=conversation if _is_continuous else None,
|
||||
)
|
||||
|
||||
# 6f. Stall detection
|
||||
@@ -479,6 +508,7 @@ class EventLoopNode(NodeProtocol):
|
||||
output=accumulator.to_dict(),
|
||||
tokens_used=total_input_tokens + total_output_tokens,
|
||||
latency_ms=latency_ms,
|
||||
conversation=conversation if _is_continuous else None,
|
||||
)
|
||||
|
||||
# 6g. Write cursor checkpoint
|
||||
@@ -486,12 +516,27 @@ class EventLoopNode(NodeProtocol):
|
||||
|
||||
# 6h. Client-facing input blocking
|
||||
#
|
||||
# Block ONLY when the LLM explicitly calls ask_user().
|
||||
# Text-only turns and set_output-only turns flow through
|
||||
# without blocking, allowing progress updates and summaries
|
||||
# to stream freely. After user input arrives, fall through
|
||||
# to judge evaluation (6i) — the judge handles acceptance.
|
||||
if ctx.node_spec.client_facing and user_input_requested:
|
||||
# Two triggers:
|
||||
# (a) Explicit ask_user() — always blocks, then falls through
|
||||
# to judge evaluation (6i).
|
||||
# (b) Auto-block — a text-only turn (no real tools, no
|
||||
# set_output) from a client-facing node is addressed to the
|
||||
# user. Block for their response, then *skip* judge so the
|
||||
# next LLM turn can process the reply without confusing
|
||||
# "missing outputs" feedback.
|
||||
#
|
||||
# Turns that include tool calls or set_output are *work*, not
|
||||
# conversation — they flow through without blocking.
|
||||
_cf_block = False
|
||||
_cf_auto = False
|
||||
if ctx.node_spec.client_facing:
|
||||
if user_input_requested:
|
||||
_cf_block = True
|
||||
elif assistant_text and not real_tool_results and not outputs_set:
|
||||
_cf_block = True
|
||||
_cf_auto = True
|
||||
|
||||
if _cf_block:
|
||||
if self._shutdown:
|
||||
await self._publish_loop_completed(stream_id, node_id, iteration + 1)
|
||||
latency_ms = int((time.time() - start_time) * 1000)
|
||||
@@ -531,9 +576,15 @@ class EventLoopNode(NodeProtocol):
|
||||
output=accumulator.to_dict(),
|
||||
tokens_used=total_input_tokens + total_output_tokens,
|
||||
latency_ms=latency_ms,
|
||||
conversation=conversation if _is_continuous else None,
|
||||
)
|
||||
|
||||
logger.info("[%s] iter=%d: blocking for user input...", node_id, iteration)
|
||||
logger.info(
|
||||
"[%s] iter=%d: blocking for user input (auto=%s)...",
|
||||
node_id,
|
||||
iteration,
|
||||
_cf_auto,
|
||||
)
|
||||
got_input = await self._await_user_input(ctx)
|
||||
logger.info("[%s] iter=%d: unblocked, got_input=%s", node_id, iteration, got_input)
|
||||
if not got_input:
|
||||
@@ -575,11 +626,34 @@ class EventLoopNode(NodeProtocol):
|
||||
output=accumulator.to_dict(),
|
||||
tokens_used=total_input_tokens + total_output_tokens,
|
||||
latency_ms=latency_ms,
|
||||
conversation=conversation if _is_continuous else None,
|
||||
)
|
||||
|
||||
user_interaction_count += 1
|
||||
recent_responses.clear()
|
||||
# Fall through to judge evaluation (6i)
|
||||
|
||||
if _cf_auto:
|
||||
# Auto-block: skip judge — let the LLM process the
|
||||
# user's response on the next turn without confusing
|
||||
# "missing outputs" feedback injected between the
|
||||
# assistant's question and the user's answer.
|
||||
_continue_count += 1
|
||||
if ctx.runtime_logger:
|
||||
iter_latency_ms = int((time.time() - iter_start) * 1000)
|
||||
ctx.runtime_logger.log_step(
|
||||
node_id=node_id,
|
||||
node_type="event_loop",
|
||||
step_index=iteration,
|
||||
verdict="CONTINUE",
|
||||
verdict_feedback="Auto-blocked for user input (pre-interaction)",
|
||||
tool_calls=logged_tool_calls,
|
||||
llm_text=assistant_text,
|
||||
input_tokens=turn_tokens.get("input", 0),
|
||||
output_tokens=turn_tokens.get("output", 0),
|
||||
latency_ms=iter_latency_ms,
|
||||
)
|
||||
continue
|
||||
# Explicit ask_user: fall through to judge evaluation (6i)
|
||||
|
||||
# 6i. Judge evaluation
|
||||
should_judge = (
|
||||
@@ -702,6 +776,7 @@ class EventLoopNode(NodeProtocol):
|
||||
output=accumulator.to_dict(),
|
||||
tokens_used=total_input_tokens + total_output_tokens,
|
||||
latency_ms=latency_ms,
|
||||
conversation=conversation if _is_continuous else None,
|
||||
)
|
||||
|
||||
elif verdict.action == "ESCALATE":
|
||||
@@ -746,6 +821,7 @@ class EventLoopNode(NodeProtocol):
|
||||
output=accumulator.to_dict(),
|
||||
tokens_used=total_input_tokens + total_output_tokens,
|
||||
latency_ms=latency_ms,
|
||||
conversation=conversation if _is_continuous else None,
|
||||
)
|
||||
|
||||
elif verdict.action == "RETRY":
|
||||
@@ -795,6 +871,7 @@ class EventLoopNode(NodeProtocol):
|
||||
output=accumulator.to_dict(),
|
||||
tokens_used=total_input_tokens + total_output_tokens,
|
||||
latency_ms=latency_ms,
|
||||
conversation=conversation if _is_continuous else None,
|
||||
)
|
||||
|
||||
async def inject_event(self, content: str) -> None:
|
||||
@@ -819,8 +896,11 @@ class EventLoopNode(NodeProtocol):
|
||||
async def _await_user_input(self, ctx: NodeContext) -> bool:
|
||||
"""Block until user input arrives or shutdown is signaled.
|
||||
|
||||
Called when a client_facing node explicitly calls ask_user() —
|
||||
an intentional conversational turn boundary.
|
||||
Called in two situations:
|
||||
- The LLM explicitly calls ask_user().
|
||||
- Auto-block: any text-only turn (no real tools, no set_output)
|
||||
from a client-facing node — ensures the user sees and responds
|
||||
before the judge runs.
|
||||
|
||||
Returns True if input arrived, False if shutdown was signaled.
|
||||
"""
|
||||
@@ -893,6 +973,21 @@ class EventLoopNode(NodeProtocol):
|
||||
await self._compact_tiered(ctx, conversation, accumulator)
|
||||
|
||||
messages = conversation.to_llm_messages()
|
||||
|
||||
# Defensive guard: ensure messages don't end with an assistant
|
||||
# message. The Anthropic API rejects "assistant message prefill"
|
||||
# (conversations must end with a user or tool message). This can
|
||||
# happen after compaction trims messages leaving an assistant tail,
|
||||
# or when a conversation is inherited without a transition marker
|
||||
# (e.g. parallel-branch execution).
|
||||
if messages and messages[-1].get("role") == "assistant":
|
||||
logger.info(
|
||||
"[%s] Messages end with assistant — injecting continuation prompt",
|
||||
node_id,
|
||||
)
|
||||
await conversation.add_user_message("[Continue working on your current task.]")
|
||||
messages = conversation.to_llm_messages()
|
||||
|
||||
accumulated_text = ""
|
||||
tool_calls: list[ToolCallEvent] = []
|
||||
|
||||
@@ -1027,7 +1122,25 @@ class EventLoopNode(NodeProtocol):
|
||||
)
|
||||
else:
|
||||
# --- Real tool execution ---
|
||||
result = await self._execute_tool(tc)
|
||||
# Guard: detect truncated tool arguments (_raw fallback
|
||||
# from litellm when json.loads fails on max_tokens hit).
|
||||
if "_raw" in tc.tool_input:
|
||||
result = ToolResult(
|
||||
tool_use_id=tc.tool_use_id,
|
||||
content=(
|
||||
f"Tool call to '{tc.tool_name}' failed: your arguments "
|
||||
"were truncated (hit output token limit). "
|
||||
"Simplify or shorten your arguments and try again."
|
||||
),
|
||||
is_error=True,
|
||||
)
|
||||
logger.warning(
|
||||
"[%s] Blocked truncated _raw tool call: %s",
|
||||
node_id,
|
||||
tc.tool_name,
|
||||
)
|
||||
else:
|
||||
result = await self._execute_tool(tc)
|
||||
result = self._truncate_tool_result(result, tc.tool_name)
|
||||
tool_entry = {
|
||||
"tool_use_id": tc.tool_use_id,
|
||||
@@ -1307,6 +1420,26 @@ class EventLoopNode(NodeProtocol):
|
||||
f"Use set_output to set at least one of: {output_keys}"
|
||||
),
|
||||
)
|
||||
|
||||
# Level 2: conversation-aware quality check (if success_criteria set)
|
||||
if ctx.node_spec.success_criteria and ctx.llm:
|
||||
from framework.graph.conversation_judge import evaluate_phase_completion
|
||||
|
||||
verdict = await evaluate_phase_completion(
|
||||
llm=ctx.llm,
|
||||
conversation=conversation,
|
||||
phase_name=ctx.node_spec.name,
|
||||
phase_description=ctx.node_spec.description,
|
||||
success_criteria=ctx.node_spec.success_criteria,
|
||||
accumulator_state=accumulator.to_dict(),
|
||||
max_history_tokens=self._config.max_history_tokens,
|
||||
)
|
||||
if verdict.action != "ACCEPT":
|
||||
return JudgeVerdict(
|
||||
action=verdict.action,
|
||||
feedback=verdict.feedback or "Phase criteria not met.",
|
||||
)
|
||||
|
||||
return JudgeVerdict(action="ACCEPT")
|
||||
else:
|
||||
return JudgeVerdict(
|
||||
@@ -1332,26 +1465,43 @@ class EventLoopNode(NodeProtocol):
|
||||
|
||||
Used in compaction summaries to prevent the LLM from re-calling
|
||||
tools it already called. Extracts:
|
||||
- Tool call counts (e.g. "github_list_pull_requests (6x)")
|
||||
- Tool call details: name, count, and *inputs* for key tools
|
||||
(search queries, scrape URLs, loaded filenames)
|
||||
- Files saved via save_data
|
||||
- Outputs set via set_output
|
||||
- Errors encountered
|
||||
"""
|
||||
tool_counts: dict[str, int] = {}
|
||||
# Per-tool: list of input summaries (one per call)
|
||||
tool_calls_detail: dict[str, list[str]] = {}
|
||||
files_saved: list[str] = []
|
||||
outputs_set: list[str] = []
|
||||
errors: list[str] = []
|
||||
|
||||
# Tool-specific input extractors: return a short summary string
|
||||
def _summarize_input(name: str, args: dict) -> str:
|
||||
if name == "web_search":
|
||||
return args.get("query", "")
|
||||
if name == "web_scrape":
|
||||
return args.get("url", "")
|
||||
if name == "load_data":
|
||||
return args.get("filename", "")
|
||||
if name == "save_data":
|
||||
return args.get("filename", "")
|
||||
return ""
|
||||
|
||||
for msg in conversation.messages:
|
||||
if msg.role == "assistant" and msg.tool_calls:
|
||||
for tc in msg.tool_calls:
|
||||
func = tc.get("function", {})
|
||||
name = func.get("name", "unknown")
|
||||
tool_counts[name] = tool_counts.get(name, 0) + 1
|
||||
try:
|
||||
args = json.loads(func.get("arguments", "{}"))
|
||||
except (json.JSONDecodeError, TypeError):
|
||||
args = {}
|
||||
|
||||
summary = _summarize_input(name, args)
|
||||
tool_calls_detail.setdefault(name, []).append(summary)
|
||||
|
||||
if name == "save_data" and args.get("filename"):
|
||||
files_saved.append(args["filename"])
|
||||
if name == "set_output" and args.get("key"):
|
||||
@@ -1362,9 +1512,18 @@ class EventLoopNode(NodeProtocol):
|
||||
errors.append(preview)
|
||||
|
||||
parts: list[str] = []
|
||||
if tool_counts:
|
||||
lines = [f" {n} ({c}x)" for n, c in tool_counts.items()]
|
||||
parts.append("TOOLS ALREADY CALLED:\n" + "\n".join(lines[:max_entries]))
|
||||
if tool_calls_detail:
|
||||
lines: list[str] = []
|
||||
for name, inputs in list(tool_calls_detail.items())[:max_entries]:
|
||||
count = len(inputs)
|
||||
# Include input details for tools where inputs matter
|
||||
non_empty = [s for s in inputs if s]
|
||||
if non_empty:
|
||||
detail_lines = [f" - {s[:120]}" for s in non_empty[:8]]
|
||||
lines.append(f" {name} ({count}x):\n" + "\n".join(detail_lines))
|
||||
else:
|
||||
lines.append(f" {name} ({count}x)")
|
||||
parts.append("TOOLS ALREADY CALLED:\n" + "\n".join(lines))
|
||||
if files_saved:
|
||||
unique = list(dict.fromkeys(files_saved))
|
||||
parts.append("FILES SAVED: " + ", ".join(unique))
|
||||
@@ -1453,6 +1612,12 @@ class EventLoopNode(NodeProtocol):
|
||||
if limit <= 0 or result.is_error or len(result.content) <= limit:
|
||||
return result
|
||||
|
||||
# load_data is the designated mechanism for reading spilled files.
# The LLM controls chunk size via offset/limit — re-spilling its
# result would create a loop.
if tool_name == "load_data":
return result
|
||||
|
||||
# Determine a preview size — leave room for the metadata wrapper
|
||||
preview_chars = max(limit - 300, limit // 2)
|
||||
preview = result.content[:preview_chars]
|
||||
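# Worked example (illustrative) of the preview sizing above:
#   limit = 2000  ->  preview_chars = max(2000 - 300, 2000 // 2) = 1700
# i.e. most of the result stays inline, with ~300 chars reserved for the
# metadata wrapper; very small limits fall back to half the limit.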
@@ -1698,12 +1863,28 @@ class EventLoopNode(NodeProtocol):
|
||||
if spec.tools:
|
||||
parts.append(f"AVAILABLE TOOLS: {', '.join(spec.tools)}")
|
||||
|
||||
# 5. Spillover files hint
|
||||
# 5. Spillover files — list actual files so the LLM can load
|
||||
# them immediately instead of having to call list_data_files first.
|
||||
if self._config.spillover_dir:
|
||||
parts.append(
|
||||
"NOTE: Large tool results were saved to files. "
|
||||
"Use load_data(filename='<filename>') to read them."
|
||||
)
|
||||
try:
|
||||
from pathlib import Path
|
||||
|
||||
data_dir = Path(self._config.spillover_dir)
|
||||
if data_dir.is_dir():
|
||||
files = sorted(f.name for f in data_dir.iterdir() if f.is_file())
|
||||
if files:
|
||||
file_list = "\n".join(f" - {f}" for f in files[:30])
|
||||
parts.append("DATA FILES (use load_data to read):\n" + file_list)
|
||||
else:
|
||||
parts.append(
|
||||
"NOTE: Large tool results may have been saved to files. "
|
||||
"Use list_data_files() to check."
|
||||
)
|
||||
except Exception:
|
||||
parts.append(
|
||||
"NOTE: Large tool results were saved to files. "
|
||||
"Use load_data(filename='<filename>') to read them."
|
||||
)
|
||||
|
||||
# 6. Tool call history (prevent re-calling tools)
|
||||
if conversation is not None:
|
||||
|
||||
@@ -186,6 +186,52 @@ class GraphExecutor:
|
||||
# Pause/resume control
|
||||
self._pause_requested = asyncio.Event()
|
||||
|
||||
def _write_progress(
|
||||
self,
|
||||
current_node: str,
|
||||
path: list[str],
|
||||
memory: Any,
|
||||
node_visit_counts: dict[str, int],
|
||||
) -> None:
|
||||
"""Update state.json with live progress at node transitions.
|
||||
|
||||
Reads the existing state.json (written by ExecutionStream at session
|
||||
start) and patches the progress fields in-place. This keeps
|
||||
state.json as the single source of truth — readers always see
|
||||
current progress, not stale initial values.
|
||||
|
||||
The write is synchronous and best-effort: failures are swallowed so it never blocks execution.
|
||||
"""
|
||||
if not self._storage_path:
|
||||
return
|
||||
try:
|
||||
import json as _json
|
||||
from datetime import datetime
|
||||
|
||||
state_path = self._storage_path / "state.json"
|
||||
if state_path.exists():
|
||||
state_data = _json.loads(state_path.read_text(encoding="utf-8"))
|
||||
else:
|
||||
state_data = {}
|
||||
|
||||
# Patch progress fields
|
||||
progress = state_data.setdefault("progress", {})
|
||||
progress["current_node"] = current_node
|
||||
progress["path"] = list(path)
|
||||
progress["node_visit_counts"] = dict(node_visit_counts)
|
||||
progress["steps_executed"] = len(path)
|
||||
|
||||
# Update timestamp
|
||||
timestamps = state_data.setdefault("timestamps", {})
|
||||
timestamps["updated_at"] = datetime.now().isoformat()
|
||||
|
||||
# Memory keys (lightweight — just keys, not values)
|
||||
state_data["memory_keys"] = list(memory.read_all().keys())
|
||||
|
||||
state_path.write_text(_json.dumps(state_data, indent=2), encoding="utf-8")
|
||||
except Exception:
|
||||
pass # Best-effort — never block execution
|
||||
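# Sketch (illustrative values) of the fields _write_progress patches into
# state.json; the key names come from the code above, and the surrounding
# document is created by ExecutionStream at session start:
#
#   {
#     "progress": {
#       "current_node": "research",
#       "path": ["intake", "research"],
#       "node_visit_counts": {"intake": 1, "research": 1},
#       "steps_executed": 2
#     },
#     "timestamps": {"updated_at": "2025-01-01T12:00:00"},
#     "memory_keys": ["brief", "findings"]
#   }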
|
||||
def _validate_tools(self, graph: GraphSpec) -> list[str]:
|
||||
"""
|
||||
Validate that all tools declared by nodes are available.
|
||||
@@ -257,6 +303,12 @@ class GraphExecutor:
|
||||
# Initialize execution state
|
||||
memory = SharedMemory()
|
||||
|
||||
# Continuous conversation mode state
|
||||
is_continuous = getattr(graph, "conversation_mode", "isolated") == "continuous"
|
||||
continuous_conversation = None # NodeConversation threaded across nodes
|
||||
cumulative_tools: list = [] # Tools accumulate, never removed
|
||||
cumulative_tool_names: set[str] = set()
|
||||
|
||||
# Initialize checkpoint store if checkpointing is enabled
|
||||
checkpoint_store: CheckpointStore | None = None
|
||||
if checkpoint_config and checkpoint_config.enabled and self._storage_path:
|
||||
@@ -273,13 +325,20 @@ class GraphExecutor:
|
||||
f"{type(memory_data).__name__}, expected dict"
|
||||
)
|
||||
else:
|
||||
# Restore memory from previous session
|
||||
# Restore memory from previous session.
|
||||
# Skip validation — this data was already validated when
|
||||
# originally written, and research text triggers false
|
||||
# positives on the code-indicator heuristic.
|
||||
for key, value in memory_data.items():
|
||||
memory.write(key, value)
|
||||
memory.write(key, value, validate=False)
|
||||
self.logger.info(f"📥 Restored session state with {len(memory_data)} memory keys")
|
||||
|
||||
# Write new input data to memory (each key individually)
|
||||
if input_data:
|
||||
# Write new input data to memory (each key individually).
|
||||
# Skip when resuming from a paused session — restored memory already
|
||||
# contains all state including the original input, and re-writing
|
||||
# input_data would overwrite intermediate results with stale values.
|
||||
_is_resuming = bool(session_state and session_state.get("paused_at"))
|
||||
if input_data and not _is_resuming:
|
||||
for key, value in input_data.items():
|
||||
memory.write(key, value)
|
||||
|
||||
@@ -530,6 +589,13 @@ class GraphExecutor:
|
||||
self.logger.info(f" Inputs: {node_spec.input_keys}")
|
||||
self.logger.info(f" Outputs: {node_spec.output_keys}")
|
||||
|
||||
# Continuous mode: accumulate tools from this node
|
||||
if is_continuous and node_spec.tools:
|
||||
for t in self.tools:
|
||||
if t.name in node_spec.tools and t.name not in cumulative_tool_names:
|
||||
cumulative_tools.append(t)
|
||||
cumulative_tool_names.add(t.name)
|
||||
|
||||
# Build context for node
|
||||
ctx = self._build_context(
|
||||
node_spec=node_spec,
|
||||
@@ -537,6 +603,9 @@ class GraphExecutor:
|
||||
goal=goal,
|
||||
input_data=input_data or {},
|
||||
max_tokens=graph.max_tokens,
|
||||
continuous_mode=is_continuous,
|
||||
inherited_conversation=continuous_conversation if is_continuous else None,
|
||||
override_tools=cumulative_tools if is_continuous else None,
|
||||
)
|
||||
|
||||
# Log actual input data being read
|
||||
@@ -763,6 +832,7 @@ class GraphExecutor:
|
||||
"memory": saved_memory,
|
||||
"execution_path": list(path),
|
||||
"node_visit_counts": dict(node_visit_counts),
|
||||
"resume_from": current_node_id,
|
||||
}
|
||||
|
||||
return ExecutionResult(
|
||||
@@ -843,6 +913,7 @@ class GraphExecutor:
|
||||
# Router explicitly set next node
|
||||
self.logger.info(f" → Router directing to: {result.next_node}")
|
||||
current_node_id = result.next_node
|
||||
self._write_progress(current_node_id, path, memory, node_visit_counts)
|
||||
else:
|
||||
# Get all traversable edges for fan-out detection
|
||||
traversable_edges = self._get_all_traversable_edges(
|
||||
@@ -886,6 +957,7 @@ class GraphExecutor:
|
||||
if fan_in_node:
|
||||
self.logger.info(f" ⑃ Fan-in: converging at {fan_in_node}")
|
||||
current_node_id = fan_in_node
|
||||
self._write_progress(current_node_id, path, memory, node_visit_counts)
|
||||
else:
|
||||
# No convergence point - branches are terminal
|
||||
self.logger.info(" → Parallel branches completed (no convergence)")
|
||||
@@ -940,6 +1012,68 @@ class GraphExecutor:
|
||||
|
||||
current_node_id = next_node
|
||||
|
||||
# Write progress snapshot at node transition
|
||||
self._write_progress(current_node_id, path, memory, node_visit_counts)
|
||||
|
||||
# Continuous mode: thread conversation forward with transition marker
|
||||
if is_continuous and result.conversation is not None:
|
||||
continuous_conversation = result.conversation
|
||||
|
||||
# Look up the next node spec for the transition marker
|
||||
next_spec = graph.get_node(current_node_id)
|
||||
if next_spec and next_spec.node_type == "event_loop":
|
||||
from framework.graph.prompt_composer import (
|
||||
build_narrative,
|
||||
build_transition_marker,
|
||||
compose_system_prompt,
|
||||
)
|
||||
|
||||
# Build Layer 2 (narrative) from current state
|
||||
narrative = build_narrative(memory, path, graph)
|
||||
|
||||
# Compose new system prompt (Layer 1 + 2 + 3)
|
||||
new_system = compose_system_prompt(
|
||||
identity_prompt=getattr(graph, "identity_prompt", None),
|
||||
focus_prompt=next_spec.system_prompt,
|
||||
narrative=narrative,
|
||||
)
|
||||
continuous_conversation.update_system_prompt(new_system)
|
||||
|
||||
# Switch conversation store to the next node's directory
|
||||
# so the transition marker and all subsequent messages are
|
||||
# persisted there instead of the first node's directory.
|
||||
if self._storage_path:
|
||||
from framework.storage.conversation_store import (
|
||||
FileConversationStore,
|
||||
)
|
||||
|
||||
next_store_path = self._storage_path / "conversations" / next_spec.id
|
||||
next_store = FileConversationStore(base_path=next_store_path)
|
||||
await continuous_conversation.switch_store(next_store)
|
||||
|
||||
# Insert transition marker into conversation
|
||||
data_dir = str(self._storage_path / "data") if self._storage_path else None
|
||||
marker = build_transition_marker(
|
||||
previous_node=node_spec,
|
||||
next_node=next_spec,
|
||||
memory=memory,
|
||||
cumulative_tool_names=sorted(cumulative_tool_names),
|
||||
data_dir=data_dir,
|
||||
)
|
||||
await continuous_conversation.add_user_message(
|
||||
marker,
|
||||
is_transition_marker=True,
|
||||
)
|
||||
|
||||
# Set current phase for phase-aware compaction
|
||||
continuous_conversation.set_current_phase(next_spec.id)
|
||||
|
||||
# Opportunistic compaction at transition
|
||||
if continuous_conversation.usage_ratio() > 0.5:
|
||||
await continuous_conversation.prune_old_tool_results(
|
||||
protect_tokens=2000,
|
||||
)
|
||||
|
||||
# Update input_data for next node
|
||||
input_data = result.output
|
||||
|
||||
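# Minimal sketch of opting a graph into continuous mode (mirrors the tests
# added in this PR; node/edge values are illustrative):
from framework.graph.edge import EdgeCondition, EdgeSpec, GraphSpec
from framework.graph.node import NodeSpec

graph = GraphSpec(
    id="g1",
    goal_id="g1",
    entry_node="research",
    nodes=[
        NodeSpec(id="research", name="Research", description="Find sources",
                 node_type="event_loop", output_keys=["brief"], tools=["web_search"]),
        NodeSpec(id="report", name="Report", description="Write report",
                 node_type="event_loop", input_keys=["brief"], output_keys=["report"]),
    ],
    edges=[EdgeSpec(id="e1", source="research", target="report",
                    condition=EdgeCondition.ON_SUCCESS)],
    terminal_nodes=["report"],
    conversation_mode="continuous",   # thread one conversation across nodes
    identity_prompt="You are a thorough research agent.",  # Layer 1 identity
)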
@@ -1082,6 +1216,7 @@ class GraphExecutor:
|
||||
"memory": saved_memory,
|
||||
"execution_path": list(path),
|
||||
"node_visit_counts": dict(node_visit_counts),
|
||||
"resume_from": current_node_id,
|
||||
}
|
||||
|
||||
# Mark latest checkpoint for resume on failure
|
||||
@@ -1134,12 +1269,19 @@ class GraphExecutor:
|
||||
goal: Goal,
|
||||
input_data: dict[str, Any],
|
||||
max_tokens: int = 4096,
|
||||
continuous_mode: bool = False,
|
||||
inherited_conversation: Any = None,
|
||||
override_tools: list | None = None,
|
||||
) -> NodeContext:
|
||||
"""Build execution context for a node."""
|
||||
# Filter tools to those available to this node
|
||||
available_tools = []
|
||||
if node_spec.tools:
|
||||
available_tools = [t for t in self.tools if t.name in node_spec.tools]
|
||||
if override_tools is not None:
|
||||
# Continuous mode: use cumulative tool set
|
||||
available_tools = list(override_tools)
|
||||
else:
|
||||
available_tools = []
|
||||
if node_spec.tools:
|
||||
available_tools = [t for t in self.tools if t.name in node_spec.tools]
|
||||
|
||||
# Create scoped memory view
|
||||
scoped_memory = memory.with_permissions(
|
||||
@@ -1160,6 +1302,8 @@ class GraphExecutor:
|
||||
max_tokens=max_tokens,
|
||||
runtime_logger=self.runtime_logger,
|
||||
pause_event=self._pause_requested, # Pass pause event for granular control
|
||||
continuous_mode=continuous_mode,
|
||||
inherited_conversation=inherited_conversation,
|
||||
)
|
||||
|
||||
# Valid node types - no ambiguous "llm" type allowed
|
||||
|
||||
@@ -238,6 +238,16 @@ class NodeSpec(BaseModel):
|
||||
description="If True, this node streams output to the end user and can request input.",
|
||||
)
|
||||
|
||||
# Phase completion criteria for conversation-aware judge (Level 2)
|
||||
success_criteria: str | None = Field(
|
||||
default=None,
|
||||
description=(
|
||||
"Natural-language criteria for phase completion. When set, the "
|
||||
"implicit judge upgrades to Level 2: after output keys are satisfied, "
|
||||
"a fast LLM evaluates whether the conversation meets these criteria."
|
||||
),
|
||||
)
|
||||
|
||||
model_config = {"extra": "allow", "arbitrary_types_allowed": True}
|
||||
|
||||
|
||||
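# Minimal sketch of the new field in use (values are illustrative, mirroring
# the judge tests added in this PR):
from framework.graph.node import NodeSpec

spec = NodeSpec(
    id="research",
    name="Research",
    description="Do research",
    node_type="event_loop",
    output_keys=["result"],
    # After all output keys are set (Level 0), a fast LLM checks the
    # conversation against these criteria (Level 2).
    success_criteria="Provide thorough research with multiple sources.",
)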
@@ -483,6 +493,10 @@ class NodeContext:
|
||||
# Pause control (optional) - asyncio.Event for pause requests
|
||||
pause_event: Any = None # asyncio.Event | None
|
||||
|
||||
# Continuous conversation mode
|
||||
continuous_mode: bool = False # True when graph has conversation_mode="continuous"
|
||||
inherited_conversation: Any = None # NodeConversation | None (from prior node)
|
||||
|
||||
|
||||
@dataclass
|
||||
class NodeResult:
|
||||
@@ -511,6 +525,9 @@ class NodeResult:
|
||||
# Pydantic validation errors (if any)
|
||||
validation_errors: list[str] = field(default_factory=list)
|
||||
|
||||
# Continuous conversation mode: return conversation for threading to next node
|
||||
conversation: Any = None # NodeConversation | None
|
||||
|
||||
def to_summary(self, node_spec: Any = None) -> str:
|
||||
"""
|
||||
Generate a human-readable summary of this node's execution and output.
|
||||
|
||||
@@ -0,0 +1,185 @@
"""Prompt composition for continuous agent mode.

Composes the three-layer system prompt (onion model) and generates
transition markers inserted into the conversation at phase boundaries.

Layer 1 — Identity (static, defined at agent level, never changes):
    "You are a thorough research agent. You prefer clarity over jargon..."

Layer 2 — Narrative (auto-generated from conversation/memory state):
    "We've finished scoping the project. The user wants to focus on..."

Layer 3 — Focus (per-node system_prompt, reframed as focus directive):
    "Your current attention: synthesize findings into a report..."
"""

from __future__ import annotations

import logging
from pathlib import Path
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from framework.graph.edge import GraphSpec
    from framework.graph.node import NodeSpec, SharedMemory

logger = logging.getLogger(__name__)

def compose_system_prompt(
|
||||
identity_prompt: str | None,
|
||||
focus_prompt: str | None,
|
||||
narrative: str | None = None,
|
||||
) -> str:
|
||||
"""Compose the three-layer system prompt.
|
||||
|
||||
Args:
|
||||
identity_prompt: Layer 1 — static agent identity (from GraphSpec).
|
||||
focus_prompt: Layer 3 — per-node focus directive (from NodeSpec.system_prompt).
|
||||
narrative: Layer 2 — auto-generated from conversation state.
|
||||
|
||||
Returns:
|
||||
Composed system prompt with all layers present.
|
||||
"""
|
||||
parts: list[str] = []
|
||||
|
||||
# Layer 1: Identity (always first, anchors the personality)
|
||||
if identity_prompt:
|
||||
parts.append(identity_prompt)
|
||||
|
||||
# Layer 2: Narrative (what's happened so far)
|
||||
if narrative:
|
||||
parts.append(f"\n--- Context (what has happened so far) ---\n{narrative}")
|
||||
|
||||
# Layer 3: Focus (current phase directive)
|
||||
if focus_prompt:
|
||||
parts.append(f"\n--- Current Focus ---\n{focus_prompt}")
|
||||
|
||||
return "\n".join(parts) if parts else ""
|
||||
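# Usage sketch (illustrative strings; in practice the executor calls this at
# each phase transition, passing build_narrative() output as the middle layer):
#
#   prompt = compose_system_prompt(
#       identity_prompt="You are a thorough research agent.",        # Layer 1
#       focus_prompt="Your current attention: write the report.",    # Layer 3
#       narrative="Phases completed:\n- Research: found 5 sources",  # Layer 2
#   )
#
# Identity always comes first; narrative and focus are appended as sections.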
|
||||
|
||||
def build_narrative(
|
||||
memory: SharedMemory,
|
||||
execution_path: list[str],
|
||||
graph: GraphSpec,
|
||||
) -> str:
|
||||
"""Build Layer 2 (narrative) from structured state.
|
||||
|
||||
Deterministic — no LLM call. Reads SharedMemory and execution path
|
||||
to describe what has happened so far. Cheap and fast.
|
||||
|
||||
Args:
|
||||
memory: Current shared memory state.
|
||||
execution_path: List of node IDs visited so far.
|
||||
graph: Graph spec (for node names/descriptions).
|
||||
|
||||
Returns:
|
||||
Narrative string describing the session state.
|
||||
"""
|
||||
parts: list[str] = []
|
||||
|
||||
# Describe execution path
|
||||
if execution_path:
|
||||
phase_descriptions: list[str] = []
|
||||
for node_id in execution_path:
|
||||
node_spec = graph.get_node(node_id)
|
||||
if node_spec:
|
||||
phase_descriptions.append(f"- {node_spec.name}: {node_spec.description}")
|
||||
else:
|
||||
phase_descriptions.append(f"- {node_id}")
|
||||
parts.append("Phases completed:\n" + "\n".join(phase_descriptions))
|
||||
|
||||
# Describe key memory values (skip very long values)
|
||||
all_memory = memory.read_all()
|
||||
if all_memory:
|
||||
memory_lines: list[str] = []
|
||||
for key, value in all_memory.items():
|
||||
if value is None:
|
||||
continue
|
||||
val_str = str(value)
|
||||
if len(val_str) > 200:
|
||||
val_str = val_str[:200] + "..."
|
||||
memory_lines.append(f"- {key}: {val_str}")
|
||||
if memory_lines:
|
||||
parts.append("Current state:\n" + "\n".join(memory_lines))
|
||||
|
||||
return "\n\n".join(parts) if parts else ""
|
||||
|
||||
|
||||
def build_transition_marker(
|
||||
previous_node: NodeSpec,
|
||||
next_node: NodeSpec,
|
||||
memory: SharedMemory,
|
||||
cumulative_tool_names: list[str],
|
||||
data_dir: Path | str | None = None,
|
||||
) -> str:
|
||||
"""Build a 'State of the World' transition marker.
|
||||
|
||||
Inserted into the conversation as a user message at phase boundaries.
|
||||
Gives the LLM full situational awareness: what happened, what's stored,
|
||||
what tools are available, and what to focus on next.
|
||||
|
||||
Args:
|
||||
previous_node: NodeSpec of the phase just completed.
|
||||
next_node: NodeSpec of the phase about to start.
|
||||
memory: Current shared memory state.
|
||||
cumulative_tool_names: All tools available (cumulative set).
|
||||
data_dir: Path to spillover data directory.
|
||||
|
||||
Returns:
|
||||
Transition marker message text.
|
||||
"""
|
||||
sections: list[str] = []
|
||||
|
||||
# Header
|
||||
sections.append(f"--- PHASE TRANSITION: {previous_node.name} → {next_node.name} ---")
|
||||
|
||||
# What just completed
|
||||
sections.append(f"\nCompleted: {previous_node.name}")
|
||||
sections.append(f" {previous_node.description}")
|
||||
|
||||
# Outputs in memory
|
||||
all_memory = memory.read_all()
|
||||
if all_memory:
|
||||
memory_lines: list[str] = []
|
||||
for key, value in all_memory.items():
|
||||
if value is None:
|
||||
continue
|
||||
val_str = str(value)
|
||||
if len(val_str) > 300:
|
||||
val_str = val_str[:300] + "..."
|
||||
memory_lines.append(f" {key}: {val_str}")
|
||||
if memory_lines:
|
||||
sections.append("\nOutputs available:\n" + "\n".join(memory_lines))
|
||||
|
||||
# Files in data directory
|
||||
if data_dir:
|
||||
data_path = Path(data_dir)
|
||||
if data_path.exists():
|
||||
files = sorted(data_path.iterdir())
|
||||
if files:
|
||||
file_lines = [
|
||||
f" {f.name} ({f.stat().st_size:,} bytes)" for f in files if f.is_file()
|
||||
]
|
||||
if file_lines:
|
||||
sections.append(
|
||||
"\nData files (use load_data to access):\n" + "\n".join(file_lines)
|
||||
)
|
||||
|
||||
# Available tools
|
||||
if cumulative_tool_names:
|
||||
sections.append("\nAvailable tools: " + ", ".join(sorted(cumulative_tool_names)))
|
||||
|
||||
# Next phase
|
||||
sections.append(f"\nNow entering: {next_node.name}")
|
||||
sections.append(f" {next_node.description}")
|
||||
|
||||
# Reflection prompt (engineered metacognition)
|
||||
sections.append(
|
||||
"\nBefore proceeding, briefly reflect: what went well in the "
|
||||
"previous phase? Are there any gaps or surprises worth noting?"
|
||||
)
|
||||
|
||||
sections.append("\n--- END TRANSITION ---")
|
||||
|
||||
return "\n".join(sections)
|
||||
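# Roughly what a rendered marker looks like (illustrative values, not taken
# from a real session):
#
#   --- PHASE TRANSITION: Research → Report ---
#
#   Completed: Research
#     Find sources
#
#   Outputs available:
#     findings: important stuff
#
#   Data files (use load_data to access):
#     sources.json (1,234 bytes)
#
#   Available tools: save_data, web_search
#
#   Now entering: Report
#     Write report
#
#   Before proceeding, briefly reflect: what went well in the
#   previous phase? Are there any gaps or surprises worth noting?
#
#   --- END TRANSITION ---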
@@ -70,6 +70,7 @@ class AnthropicProvider(LLMProvider):
|
||||
max_tokens: int = 1024,
|
||||
response_format: dict[str, Any] | None = None,
|
||||
json_mode: bool = False,
|
||||
max_retries: int | None = None,
|
||||
) -> LLMResponse:
|
||||
"""Generate a completion from Claude (via LiteLLM)."""
|
||||
return self._provider.complete(
|
||||
@@ -79,6 +80,7 @@ class AnthropicProvider(LLMProvider):
|
||||
max_tokens=max_tokens,
|
||||
response_format=response_format,
|
||||
json_mode=json_mode,
|
||||
max_retries=max_retries,
|
||||
)
|
||||
|
||||
def complete_with_tools(
|
||||
|
||||
@@ -150,10 +150,13 @@ class LiteLLMProvider(LLMProvider):
|
||||
"LiteLLM is not installed. Please install it with: uv pip install litellm"
|
||||
)
|
||||
|
||||
def _completion_with_rate_limit_retry(self, **kwargs: Any) -> Any:
|
||||
def _completion_with_rate_limit_retry(
|
||||
self, max_retries: int | None = None, **kwargs: Any
|
||||
) -> Any:
|
||||
"""Call litellm.completion with retry on 429 rate limit errors and empty responses."""
|
||||
model = kwargs.get("model", self.model)
|
||||
for attempt in range(RATE_LIMIT_MAX_RETRIES + 1):
|
||||
retries = max_retries if max_retries is not None else RATE_LIMIT_MAX_RETRIES
|
||||
for attempt in range(retries + 1):
|
||||
try:
|
||||
response = litellm.completion(**kwargs) # type: ignore[union-attr]
|
||||
|
||||
@@ -194,9 +197,9 @@ class LiteLLMProvider(LLMProvider):
|
||||
f"Full request dumped to: {dump_path}"
|
||||
)
|
||||
|
||||
if attempt == RATE_LIMIT_MAX_RETRIES:
|
||||
if attempt == retries:
|
||||
logger.error(
|
||||
f"[retry] GAVE UP on {model} after {RATE_LIMIT_MAX_RETRIES + 1} "
|
||||
f"[retry] GAVE UP on {model} after {retries + 1} "
|
||||
f"attempts — empty response "
|
||||
f"(finish_reason={finish_reason}, "
|
||||
f"choices={len(response.choices) if response.choices else 0})"
|
||||
@@ -209,7 +212,7 @@ class LiteLLMProvider(LLMProvider):
|
||||
f"choices={len(response.choices) if response.choices else 0}) — "
|
||||
f"likely rate limited or quota exceeded. "
|
||||
f"Retrying in {wait}s "
|
||||
f"(attempt {attempt + 1}/{RATE_LIMIT_MAX_RETRIES})"
|
||||
f"(attempt {attempt + 1}/{retries})"
|
||||
)
|
||||
time.sleep(wait)
|
||||
continue
|
||||
@@ -225,9 +228,9 @@ class LiteLLMProvider(LLMProvider):
|
||||
error_type="rate_limit",
|
||||
attempt=attempt,
|
||||
)
|
||||
if attempt == RATE_LIMIT_MAX_RETRIES:
|
||||
if attempt == retries:
|
||||
logger.error(
|
||||
f"[retry] GAVE UP on {model} after {RATE_LIMIT_MAX_RETRIES + 1} "
|
||||
f"[retry] GAVE UP on {model} after {retries + 1} "
|
||||
f"attempts — rate limit error: {e!s}. "
|
||||
f"~{token_count} tokens ({token_method}). "
|
||||
f"Full request dumped to: {dump_path}"
|
||||
@@ -239,7 +242,7 @@ class LiteLLMProvider(LLMProvider):
|
||||
f"~{token_count} tokens ({token_method}). "
|
||||
f"Full request dumped to: {dump_path}. "
|
||||
f"Retrying in {wait}s "
|
||||
f"(attempt {attempt + 1}/{RATE_LIMIT_MAX_RETRIES})"
|
||||
f"(attempt {attempt + 1}/{retries})"
|
||||
)
|
||||
time.sleep(wait)
|
||||
# unreachable, but satisfies type checker
|
||||
@@ -253,6 +256,7 @@ class LiteLLMProvider(LLMProvider):
|
||||
max_tokens: int = 1024,
|
||||
response_format: dict[str, Any] | None = None,
|
||||
json_mode: bool = False,
|
||||
max_retries: int | None = None,
|
||||
) -> LLMResponse:
|
||||
"""Generate a completion using LiteLLM."""
|
||||
# Prepare messages with system prompt
|
||||
@@ -293,7 +297,7 @@ class LiteLLMProvider(LLMProvider):
|
||||
kwargs["response_format"] = response_format
|
||||
|
||||
# Make the call
|
||||
response = self._completion_with_rate_limit_retry(**kwargs)
|
||||
response = self._completion_with_rate_limit_retry(max_retries=max_retries, **kwargs)
|
||||
|
||||
# Extract content
|
||||
content = response.choices[0].message.content or ""
|
||||
|
||||
@@ -120,6 +120,7 @@ class MockLLMProvider(LLMProvider):
|
||||
max_tokens: int = 1024,
|
||||
response_format: dict[str, Any] | None = None,
|
||||
json_mode: bool = False,
|
||||
max_retries: int | None = None,
|
||||
) -> LLMResponse:
|
||||
"""
|
||||
Generate a mock completion without calling a real LLM.
|
||||
|
||||
@@ -65,6 +65,7 @@ class LLMProvider(ABC):
|
||||
max_tokens: int = 1024,
|
||||
response_format: dict[str, Any] | None = None,
|
||||
json_mode: bool = False,
|
||||
max_retries: int | None = None,
|
||||
) -> LLMResponse:
|
||||
"""
|
||||
Generate a completion from the LLM.
|
||||
@@ -79,6 +80,8 @@ class LLMProvider(ABC):
|
||||
- {"type": "json_schema", "json_schema": {"name": "...", "schema": {...}}}
|
||||
for strict JSON schema enforcement
|
||||
json_mode: If True, request structured JSON output from the LLM
|
||||
max_retries: Override retry count for rate-limit/empty-response retries.
|
||||
None uses the provider default.
|
||||
|
||||
Returns:
|
||||
LLMResponse with content and metadata
|
||||
|
||||
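# Call sketch: `llm` is assumed to be any LLMProvider instance; the other
# arguments follow the existing complete() signature, and max_retries is the
# new per-call override added in this PR.
response = llm.complete(
    messages=[{"role": "user", "content": "ACTION: ACCEPT or RETRY?"}],
    system="You are a strict phase judge.",
    max_tokens=256,
    max_retries=1,  # fail fast instead of using the provider default
)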
@@ -362,6 +362,7 @@ def _load_resume_state(
|
||||
except (json.JSONDecodeError, OSError):
|
||||
return None
|
||||
return {
|
||||
"resume_session_id": session_id,
|
||||
"memory": cp_data.get("shared_memory", {}),
|
||||
"paused_at": cp_data.get("next_node") or cp_data.get("current_node"),
|
||||
"execution_path": cp_data.get("execution_path", []),
|
||||
@@ -379,6 +380,7 @@ def _load_resume_state(
|
||||
progress = state_data.get("progress", {})
|
||||
paused_at = progress.get("paused_at") or progress.get("resume_from")
|
||||
return {
|
||||
"resume_session_id": session_id,
|
||||
"memory": state_data.get("memory", {}),
|
||||
"paused_at": paused_at,
|
||||
"execution_path": progress.get("path", []),
|
||||
|
||||
@@ -293,8 +293,13 @@ class ExecutionStream:
|
||||
if not self._running:
|
||||
raise RuntimeError(f"ExecutionStream '{self.stream_id}' is not running")
|
||||
|
||||
# Generate execution ID using unified session format
|
||||
if self._session_store:
|
||||
# When resuming, reuse the original session ID so the execution
|
||||
# continues in the same session directory instead of creating a new one.
|
||||
resume_session_id = session_state.get("resume_session_id") if session_state else None
|
||||
|
||||
if resume_session_id:
|
||||
execution_id = resume_session_id
|
||||
elif self._session_store:
|
||||
execution_id = self._session_store.generate_session_id()
|
||||
else:
|
||||
# Fallback to old format if SessionStore not available (shouldn't happen)
|
||||
@@ -597,10 +602,22 @@ class ExecutionStream:
|
||||
entry_point=self.entry_spec.id,
|
||||
)
|
||||
else:
|
||||
# Create initial state
|
||||
from framework.schemas.session_state import SessionTimestamps
|
||||
# Create initial state — when resuming, preserve the previous
|
||||
# execution's progress so crashes don't lose track of state.
|
||||
from framework.schemas.session_state import (
|
||||
SessionProgress,
|
||||
SessionTimestamps,
|
||||
)
|
||||
|
||||
now = datetime.now().isoformat()
|
||||
ss = ctx.session_state or {}
|
||||
progress = SessionProgress(
|
||||
current_node=ss.get("paused_at") or ss.get("resume_from"),
|
||||
paused_at=ss.get("paused_at"),
|
||||
resume_from=ss.get("paused_at") or ss.get("resume_from"),
|
||||
path=ss.get("execution_path", []),
|
||||
node_visit_counts=ss.get("node_visit_counts", {}),
|
||||
)
|
||||
state = SessionState(
|
||||
session_id=execution_id,
|
||||
stream_id=self.stream_id,
|
||||
@@ -613,6 +630,8 @@ class ExecutionStream:
|
||||
started_at=ctx.started_at.isoformat(),
|
||||
updated_at=now,
|
||||
),
|
||||
progress=progress,
|
||||
memory=ss.get("memory", {}),
|
||||
input_data=ctx.input_data,
|
||||
)
|
||||
|
||||
|
||||
@@ -156,8 +156,14 @@ class SessionState(BaseModel):
|
||||
@computed_field
|
||||
@property
|
||||
def is_resumable(self) -> bool:
|
||||
"""Can this session be resumed?"""
|
||||
return self.status == SessionStatus.PAUSED and self.progress.resume_from is not None
|
||||
"""Can this session be resumed?
|
||||
|
||||
A session is resumable when it stopped mid-execution (paused or
|
||||
failed) and we know which node to resume from.
|
||||
"""
|
||||
if self.status not in (SessionStatus.PAUSED, SessionStatus.FAILED):
|
||||
return False
|
||||
return self.progress.resume_from is not None or self.progress.paused_at is not None
|
||||
|
||||
@computed_field
|
||||
@property
|
||||
|
||||
@@ -451,6 +451,7 @@ class ChatRepl(Vertical):
|
||||
if paused_at:
|
||||
# Has paused_at - resume from there
|
||||
resume_session_state = {
|
||||
"resume_session_id": session_id,
|
||||
"paused_at": paused_at,
|
||||
"memory": state.get("memory", {}),
|
||||
"execution_path": progress.get("path", []),
|
||||
@@ -458,8 +459,13 @@ class ChatRepl(Vertical):
|
||||
}
|
||||
resume_info = f"From node: [cyan]{paused_at}[/cyan]"
|
||||
else:
|
||||
# No paused_at - just retry with same input
|
||||
resume_session_state = {}
|
||||
# No paused_at - retry with same input but reuse session directory
|
||||
resume_session_state = {
|
||||
"resume_session_id": session_id,
|
||||
"memory": state.get("memory", {}),
|
||||
"execution_path": progress.get("path", []),
|
||||
"node_visit_counts": progress.get("node_visit_counts", {}),
|
||||
}
|
||||
resume_info = "Retrying with same input"
|
||||
|
||||
# Display resume info
|
||||
@@ -563,6 +569,7 @@ class ChatRepl(Vertical):
|
||||
|
||||
# Create session_state for checkpoint recovery
|
||||
recover_session_state = {
|
||||
"resume_session_id": session_id,
|
||||
"resume_from_checkpoint": checkpoint_id,
|
||||
}
|
||||
|
||||
|
||||
@@ -0,0 +1,538 @@
|
||||
"""Tests for the Continuous Agent architecture (conversation threading + cumulative tools).
|
||||
|
||||
Validates:
|
||||
- conversation_mode="isolated" preserves existing behavior
|
||||
- conversation_mode="continuous" threads one conversation across nodes
|
||||
- Transition markers are inserted at phase boundaries
|
||||
- System prompt updates at each transition (layered prompt composition)
|
||||
- Tools accumulate across nodes in continuous mode
|
||||
- prompt_composer functions work correctly
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from collections.abc import AsyncIterator
|
||||
from typing import Any
|
||||
from unittest.mock import MagicMock
|
||||
|
||||
import pytest
|
||||
|
||||
from framework.graph.conversation import NodeConversation
|
||||
from framework.graph.edge import EdgeCondition, EdgeSpec, GraphSpec
|
||||
from framework.graph.executor import GraphExecutor
|
||||
from framework.graph.goal import Goal
|
||||
from framework.graph.node import NodeResult, NodeSpec, SharedMemory
|
||||
from framework.graph.prompt_composer import (
|
||||
build_narrative,
|
||||
build_transition_marker,
|
||||
compose_system_prompt,
|
||||
)
|
||||
from framework.llm.provider import LLMProvider, LLMResponse, Tool
|
||||
from framework.llm.stream_events import FinishEvent, TextDeltaEvent, ToolCallEvent
|
||||
from framework.runtime.core import Runtime
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Mock LLM
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class MockStreamingLLM(LLMProvider):
|
||||
"""Mock LLM that yields pre-programmed StreamEvent sequences."""
|
||||
|
||||
def __init__(self, scenarios: list[list] | None = None):
|
||||
self.scenarios = scenarios or []
|
||||
self._call_index = 0
|
||||
self.stream_calls: list[dict] = []
|
||||
|
||||
async def stream(
|
||||
self,
|
||||
messages: list[dict[str, Any]],
|
||||
system: str = "",
|
||||
tools: list[Tool] | None = None,
|
||||
max_tokens: int = 4096,
|
||||
) -> AsyncIterator:
|
||||
self.stream_calls.append({"messages": messages, "system": system, "tools": tools})
|
||||
if not self.scenarios:
|
||||
return
|
||||
events = self.scenarios[self._call_index % len(self.scenarios)]
|
||||
self._call_index += 1
|
||||
for event in events:
|
||||
yield event
|
||||
|
||||
def complete(self, messages, system="", **kwargs) -> LLMResponse:
|
||||
return LLMResponse(content="Summary.", model="mock", stop_reason="stop")
|
||||
|
||||
def complete_with_tools(self, messages, system, tools, tool_executor, **kwargs) -> LLMResponse:
|
||||
return LLMResponse(content="", model="mock", stop_reason="stop")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _set_output_scenario(key: str, value: str) -> list:
|
||||
"""LLM calls set_output then finishes."""
|
||||
return [
|
||||
ToolCallEvent(
|
||||
tool_use_id=f"call_{key}",
|
||||
tool_name="set_output",
|
||||
tool_input={"key": key, "value": value},
|
||||
),
|
||||
FinishEvent(stop_reason="tool_calls", input_tokens=10, output_tokens=5, model="mock"),
|
||||
]
|
||||
|
||||
|
||||
def _text_then_set_output(text: str, key: str, value: str) -> list:
|
||||
"""LLM produces text, then calls set_output, then finishes (2 turns needed)."""
|
||||
return [
|
||||
TextDeltaEvent(content=text, snapshot=text),
|
||||
ToolCallEvent(
|
||||
tool_use_id=f"call_{key}",
|
||||
tool_name="set_output",
|
||||
tool_input={"key": key, "value": value},
|
||||
),
|
||||
FinishEvent(stop_reason="tool_calls", input_tokens=10, output_tokens=5, model="mock"),
|
||||
]
|
||||
|
||||
|
||||
def _text_finish(text: str) -> list:
|
||||
"""LLM produces text and stops (triggers judge)."""
|
||||
return [
|
||||
TextDeltaEvent(content=text, snapshot=text),
|
||||
FinishEvent(stop_reason="stop", input_tokens=10, output_tokens=5, model="mock"),
|
||||
]
|
||||
|
||||
|
||||
def _make_runtime():
|
||||
rt = MagicMock(spec=Runtime)
|
||||
rt.start_run = MagicMock(return_value="run_1")
|
||||
rt.end_run = MagicMock()
|
||||
rt.report_problem = MagicMock()
|
||||
rt.decide = MagicMock(return_value="dec_1")
|
||||
rt.record_outcome = MagicMock()
|
||||
rt.set_node = MagicMock()
|
||||
return rt
|
||||
|
||||
|
||||
def _make_goal():
|
||||
return Goal(id="g1", name="test", description="test goal")
|
||||
|
||||
|
||||
def _make_tool(name: str) -> Tool:
|
||||
return Tool(
|
||||
name=name,
|
||||
description=f"Tool {name}",
|
||||
parameters={"type": "object", "properties": {}},
|
||||
)
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
# prompt_composer unit tests
|
||||
# ===========================================================================
|
||||
|
||||
|
||||
class TestComposeSystemPrompt:
|
||||
def test_all_layers(self):
|
||||
result = compose_system_prompt(
|
||||
identity_prompt="I am a research agent.",
|
||||
focus_prompt="Focus on writing the report.",
|
||||
narrative="We found 5 sources on topic X.",
|
||||
)
|
||||
assert "I am a research agent." in result
|
||||
assert "Focus on writing the report." in result
|
||||
assert "We found 5 sources on topic X." in result
|
||||
# Identity comes first
|
||||
assert result.index("I am a research agent.") < result.index("Focus on writing")
|
||||
|
||||
def test_identity_only(self):
|
||||
result = compose_system_prompt(identity_prompt="I am an agent.", focus_prompt=None)
|
||||
assert result == "I am an agent."
|
||||
|
||||
def test_focus_only(self):
|
||||
result = compose_system_prompt(identity_prompt=None, focus_prompt="Do the thing.")
|
||||
assert "Current Focus" in result
|
||||
assert "Do the thing." in result
|
||||
|
||||
def test_empty(self):
|
||||
result = compose_system_prompt(identity_prompt=None, focus_prompt=None)
|
||||
assert result == ""
|
||||
|
||||
|
||||
class TestBuildNarrative:
|
||||
def test_with_execution_path(self):
|
||||
memory = SharedMemory()
|
||||
memory.write("findings", "some findings")
|
||||
|
||||
node_a = NodeSpec(
|
||||
id="a", name="Research", description="Research the topic", node_type="event_loop"
|
||||
)
|
||||
node_b = NodeSpec(id="b", name="Report", description="Write report", node_type="event_loop")
|
||||
graph = GraphSpec(
|
||||
id="g1",
|
||||
goal_id="g1",
|
||||
entry_node="a",
|
||||
nodes=[node_a, node_b],
|
||||
edges=[],
|
||||
)
|
||||
|
||||
result = build_narrative(memory, ["a"], graph)
|
||||
assert "Research" in result
|
||||
assert "findings" in result
|
||||
|
||||
def test_empty_state(self):
|
||||
memory = SharedMemory()
|
||||
graph = GraphSpec(id="g1", goal_id="g1", entry_node="a", nodes=[], edges=[])
|
||||
result = build_narrative(memory, [], graph)
|
||||
assert result == ""
|
||||
|
||||
|
||||
class TestBuildTransitionMarker:
|
||||
def test_basic_marker(self):
|
||||
prev = NodeSpec(
|
||||
id="research", name="Research", description="Find sources", node_type="event_loop"
|
||||
)
|
||||
next_n = NodeSpec(
|
||||
id="report", name="Report", description="Write report", node_type="event_loop"
|
||||
)
|
||||
memory = SharedMemory()
|
||||
memory.write("findings", "important stuff")
|
||||
|
||||
marker = build_transition_marker(
|
||||
previous_node=prev,
|
||||
next_node=next_n,
|
||||
memory=memory,
|
||||
cumulative_tool_names=["web_search", "save_data"],
|
||||
)
|
||||
|
||||
assert "PHASE TRANSITION" in marker
|
||||
assert "Research" in marker
|
||||
assert "Report" in marker
|
||||
assert "findings" in marker
|
||||
assert "web_search" in marker
|
||||
assert "reflect" in marker.lower()
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
# NodeConversation.update_system_prompt
|
||||
# ===========================================================================
|
||||
|
||||
|
||||
class TestUpdateSystemPrompt:
|
||||
def test_update(self):
|
||||
conv = NodeConversation(system_prompt="original")
|
||||
assert conv.system_prompt == "original"
|
||||
conv.update_system_prompt("updated")
|
||||
assert conv.system_prompt == "updated"
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
# Conversation threading through executor
|
||||
# ===========================================================================
|
||||
|
||||
|
||||
class TestContinuousConversation:
|
||||
"""Test that conversation_mode='continuous' threads a single conversation."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_isolated_mode_no_conversation_in_result(self):
|
||||
"""In isolated mode, NodeResult.conversation should be None."""
|
||||
runtime = _make_runtime()
|
||||
llm = MockStreamingLLM(
|
||||
scenarios=[
|
||||
_set_output_scenario("result", "done"),
|
||||
_text_finish("accepted"),
|
||||
]
|
||||
)
|
||||
|
||||
spec = NodeSpec(
|
||||
id="n1",
|
||||
name="Node1",
|
||||
description="test",
|
||||
node_type="event_loop",
|
||||
output_keys=["result"],
|
||||
)
|
||||
graph = GraphSpec(
|
||||
id="g1",
|
||||
goal_id="g1",
|
||||
entry_node="n1",
|
||||
nodes=[spec],
|
||||
edges=[],
|
||||
conversation_mode="isolated",
|
||||
)
|
||||
|
||||
executor = GraphExecutor(runtime=runtime, llm=llm)
|
||||
result = await executor.execute(graph=graph, goal=_make_goal())
|
||||
assert result.success
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_continuous_threads_conversation(self):
|
||||
"""In continuous mode, second node sees messages from first node."""
|
||||
runtime = _make_runtime()
|
||||
|
||||
# Node A: set_output("brief", "the brief"), then finish (accept)
|
||||
# Node B: set_output("report", "the report"), then finish (accept)
|
||||
llm = MockStreamingLLM(
|
||||
scenarios=[
|
||||
_text_then_set_output("I'll research this.", "brief", "the brief"),
|
||||
_text_finish(""), # triggers accept for node A (all keys set)
|
||||
_text_then_set_output("Here's the report.", "report", "the report"),
|
||||
_text_finish(""), # triggers accept for node B
|
||||
]
|
||||
)
|
||||
|
||||
node_a = NodeSpec(
|
||||
id="a",
|
||||
name="Intake",
|
||||
description="Gather requirements",
|
||||
node_type="event_loop",
|
||||
output_keys=["brief"],
|
||||
)
|
||||
node_b = NodeSpec(
|
||||
id="b",
|
||||
name="Report",
|
||||
description="Write report",
|
||||
node_type="event_loop",
|
||||
input_keys=["brief"],
|
||||
output_keys=["report"],
|
||||
)
|
||||
|
||||
graph = GraphSpec(
|
||||
id="g1",
|
||||
goal_id="g1",
|
||||
entry_node="a",
|
||||
nodes=[node_a, node_b],
|
||||
edges=[EdgeSpec(id="e1", source="a", target="b", condition=EdgeCondition.ON_SUCCESS)],
|
||||
terminal_nodes=["b"],
|
||||
conversation_mode="continuous",
|
||||
identity_prompt="You are a thorough research agent.",
|
||||
)
|
||||
|
||||
executor = GraphExecutor(runtime=runtime, llm=llm)
|
||||
result = await executor.execute(graph=graph, goal=_make_goal())
|
||||
|
||||
assert result.success
|
||||
assert result.path == ["a", "b"]
|
||||
|
||||
# Verify the LLM saw the identity prompt in system messages
|
||||
# The second node's system prompt should contain the identity
|
||||
if len(llm.stream_calls) >= 3:
|
||||
system_at_node_b = llm.stream_calls[2]["system"]
|
||||
assert "thorough research agent" in system_at_node_b
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_continuous_transition_marker_present(self):
|
||||
"""Transition marker should appear in messages when switching nodes."""
|
||||
runtime = _make_runtime()
|
||||
|
||||
llm = MockStreamingLLM(
|
||||
scenarios=[
|
||||
_text_then_set_output("Research done.", "brief", "the brief"),
|
||||
_text_finish(""),
|
||||
_text_then_set_output("Report done.", "report", "the report"),
|
||||
_text_finish(""),
|
||||
]
|
||||
)
|
||||
|
||||
node_a = NodeSpec(
|
||||
id="a",
|
||||
name="Research",
|
||||
description="Do research",
|
||||
node_type="event_loop",
|
||||
output_keys=["brief"],
|
||||
)
|
||||
node_b = NodeSpec(
|
||||
id="b",
|
||||
name="Report",
|
||||
description="Write report",
|
||||
node_type="event_loop",
|
||||
input_keys=["brief"],
|
||||
output_keys=["report"],
|
||||
)
|
||||
|
||||
graph = GraphSpec(
|
||||
id="g1",
|
||||
goal_id="g1",
|
||||
entry_node="a",
|
||||
nodes=[node_a, node_b],
|
||||
edges=[EdgeSpec(id="e1", source="a", target="b", condition=EdgeCondition.ON_SUCCESS)],
|
||||
terminal_nodes=["b"],
|
||||
conversation_mode="continuous",
|
||||
)
|
||||
|
||||
executor = GraphExecutor(runtime=runtime, llm=llm)
|
||||
result = await executor.execute(graph=graph, goal=_make_goal())
|
||||
assert result.success
|
||||
|
||||
# When node B's first LLM call happens, its messages should contain
|
||||
# the transition marker from the executor
|
||||
if len(llm.stream_calls) >= 3:
|
||||
node_b_messages = llm.stream_calls[2]["messages"]
|
||||
all_content = " ".join(
|
||||
m.get("content", "") for m in node_b_messages if isinstance(m.get("content"), str)
|
||||
)
|
||||
assert "PHASE TRANSITION" in all_content
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
# Cumulative tools
|
||||
# ===========================================================================
|
||||
|
||||
|
||||
class TestCumulativeTools:
|
||||
"""Test that tools accumulate in continuous mode."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_isolated_mode_tools_scoped(self):
|
||||
"""In isolated mode, each node only gets its own declared tools."""
|
||||
runtime = _make_runtime()
|
||||
tool_a = _make_tool("web_search")
|
||||
tool_b = _make_tool("save_data")
|
||||
|
||||
llm = MockStreamingLLM(
|
||||
scenarios=[
|
||||
_text_then_set_output("Done.", "brief", "brief"),
|
||||
_text_finish(""),
|
||||
_text_then_set_output("Done.", "report", "report"),
|
||||
_text_finish(""),
|
||||
]
|
||||
)
|
||||
|
||||
node_a = NodeSpec(
|
||||
id="a",
|
||||
name="Research",
|
||||
description="Research",
|
||||
node_type="event_loop",
|
||||
output_keys=["brief"],
|
||||
tools=["web_search"],
|
||||
)
|
||||
node_b = NodeSpec(
|
||||
id="b",
|
||||
name="Report",
|
||||
description="Report",
|
||||
node_type="event_loop",
|
||||
input_keys=["brief"],
|
||||
output_keys=["report"],
|
||||
tools=["save_data"],
|
||||
)
|
||||
|
||||
graph = GraphSpec(
|
||||
id="g1",
|
||||
goal_id="g1",
|
||||
entry_node="a",
|
||||
nodes=[node_a, node_b],
|
||||
edges=[EdgeSpec(id="e1", source="a", target="b", condition=EdgeCondition.ON_SUCCESS)],
|
||||
terminal_nodes=["b"],
|
||||
conversation_mode="isolated",
|
||||
)
|
||||
|
||||
executor = GraphExecutor(
|
||||
runtime=runtime,
|
||||
llm=llm,
|
||||
tools=[tool_a, tool_b],
|
||||
)
|
||||
result = await executor.execute(graph=graph, goal=_make_goal())
|
||||
assert result.success
|
||||
|
||||
# In isolated mode, node B should NOT have web_search
|
||||
if len(llm.stream_calls) >= 3:
|
||||
node_b_tools = llm.stream_calls[2].get("tools") or []
|
||||
tool_names = [t.name for t in node_b_tools]
|
||||
assert "save_data" in tool_names or "set_output" in tool_names
|
||||
# web_search should NOT be present (only set_output + save_data)
|
||||
real_tools = [n for n in tool_names if n != "set_output"]
|
||||
assert "web_search" not in real_tools
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_continuous_mode_tools_accumulate(self):
|
||||
"""In continuous mode, node B should have both web_search and save_data."""
|
||||
runtime = _make_runtime()
|
||||
tool_a = _make_tool("web_search")
|
||||
tool_b = _make_tool("save_data")
|
||||
|
||||
llm = MockStreamingLLM(
|
||||
scenarios=[
|
||||
_text_then_set_output("Done.", "brief", "brief"),
|
||||
_text_finish(""),
|
||||
_text_then_set_output("Done.", "report", "report"),
|
||||
_text_finish(""),
|
||||
]
|
||||
)
|
||||
|
||||
node_a = NodeSpec(
|
||||
id="a",
|
||||
name="Research",
|
||||
description="Research",
|
||||
node_type="event_loop",
|
||||
output_keys=["brief"],
|
||||
tools=["web_search"],
|
||||
)
|
||||
node_b = NodeSpec(
|
||||
id="b",
|
||||
name="Report",
|
||||
description="Report",
|
||||
node_type="event_loop",
|
||||
input_keys=["brief"],
|
||||
output_keys=["report"],
|
||||
tools=["save_data"],
|
||||
)
|
||||
|
||||
graph = GraphSpec(
|
||||
id="g1",
|
||||
goal_id="g1",
|
||||
entry_node="a",
|
||||
nodes=[node_a, node_b],
|
||||
edges=[EdgeSpec(id="e1", source="a", target="b", condition=EdgeCondition.ON_SUCCESS)],
|
||||
terminal_nodes=["b"],
|
||||
conversation_mode="continuous",
|
||||
)
|
||||
|
||||
executor = GraphExecutor(
|
||||
runtime=runtime,
|
||||
llm=llm,
|
||||
tools=[tool_a, tool_b],
|
||||
)
|
||||
result = await executor.execute(graph=graph, goal=_make_goal())
|
||||
assert result.success
|
||||
|
||||
# In continuous mode, node B should have BOTH tools
|
||||
if len(llm.stream_calls) >= 3:
|
||||
node_b_tools = llm.stream_calls[2].get("tools") or []
|
||||
tool_names = [t.name for t in node_b_tools]
|
||||
real_tools = [n for n in tool_names if n != "set_output"]
|
||||
assert "web_search" in real_tools
|
||||
assert "save_data" in real_tools
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
# Schema field defaults
|
||||
# ===========================================================================
|
||||
|
||||
|
||||
class TestSchemaDefaults:
|
||||
def test_graphspec_defaults(self):
|
||||
"""New fields should have safe defaults."""
|
||||
graph = GraphSpec(
|
||||
id="g1",
|
||||
goal_id="g1",
|
||||
entry_node="n1",
|
||||
nodes=[],
|
||||
edges=[],
|
||||
)
|
||||
assert graph.conversation_mode == "continuous"
|
||||
assert graph.identity_prompt is None
|
||||
|
||||
def test_nodespec_defaults(self):
|
||||
"""NodeSpec.success_criteria should default to None."""
|
||||
spec = NodeSpec(
|
||||
id="n1",
|
||||
name="test",
|
||||
description="test",
|
||||
node_type="event_loop",
|
||||
)
|
||||
assert spec.success_criteria is None
|
||||
|
||||
def test_noderesult_defaults(self):
|
||||
"""NodeResult.conversation should default to None."""
|
||||
result = NodeResult(success=True)
|
||||
assert result.conversation is None
|
||||
@@ -0,0 +1,380 @@
|
||||
"""Tests for Level 2 conversation-aware judge.
|
||||
|
||||
Validates:
|
||||
- No success_criteria → Level 0 only (existing behavior)
|
||||
- success_criteria set, good conversation → Level 2 ACCEPT
|
||||
- success_criteria set, poor conversation → Level 2 RETRY with feedback
|
||||
- Custom explicit judge takes priority over Level 2
|
||||
- Level 2 fires only when Level 0 passes (all keys set)
|
||||
- _parse_verdict correctly parses LLM responses
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from collections.abc import AsyncIterator
|
||||
from typing import Any
|
||||
from unittest.mock import MagicMock
|
||||
|
||||
import pytest
|
||||
|
||||
from framework.graph.conversation import NodeConversation
|
||||
from framework.graph.conversation_judge import (
|
||||
_parse_verdict,
|
||||
evaluate_phase_completion,
|
||||
)
|
||||
from framework.graph.edge import GraphSpec
|
||||
from framework.graph.executor import GraphExecutor
|
||||
from framework.graph.goal import Goal
|
||||
from framework.graph.node import NodeSpec
|
||||
from framework.llm.provider import LLMProvider, LLMResponse, Tool
|
||||
from framework.llm.stream_events import FinishEvent, TextDeltaEvent, ToolCallEvent
|
||||
from framework.runtime.core import Runtime
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Mock LLM
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class MockStreamingLLM(LLMProvider):
|
||||
"""Mock LLM that yields pre-programmed StreamEvent sequences."""
|
||||
|
||||
def __init__(self, scenarios: list[list] | None = None, complete_response: str = ""):
|
||||
self.scenarios = scenarios or []
|
||||
self._call_index = 0
|
||||
self.stream_calls: list[dict] = []
|
||||
self.complete_response = complete_response
|
||||
self.complete_calls: list[dict] = []
|
||||
|
||||
async def stream(
|
||||
self,
|
||||
messages: list[dict[str, Any]],
|
||||
system: str = "",
|
||||
tools: list[Tool] | None = None,
|
||||
max_tokens: int = 4096,
|
||||
) -> AsyncIterator:
|
||||
self.stream_calls.append({"messages": messages, "system": system, "tools": tools})
|
||||
if not self.scenarios:
|
||||
return
|
||||
events = self.scenarios[self._call_index % len(self.scenarios)]
|
||||
self._call_index += 1
|
||||
for event in events:
|
||||
yield event
|
||||
|
||||
def complete(self, messages, system="", **kwargs) -> LLMResponse:
|
||||
self.complete_calls.append({"messages": messages, "system": system})
|
||||
return LLMResponse(content=self.complete_response, model="mock", stop_reason="stop")
|
||||
|
||||
def complete_with_tools(self, messages, system, tools, tool_executor, **kwargs) -> LLMResponse:
|
||||
return LLMResponse(content="", model="mock", stop_reason="stop")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _set_output_scenario(key: str, value: str) -> list:
|
||||
return [
|
||||
ToolCallEvent(
|
||||
tool_use_id=f"call_{key}",
|
||||
tool_name="set_output",
|
||||
tool_input={"key": key, "value": value},
|
||||
),
|
||||
FinishEvent(stop_reason="tool_calls", input_tokens=10, output_tokens=5, model="mock"),
|
||||
]
|
||||
|
||||
|
||||
def _text_then_set_output(text: str, key: str, value: str) -> list:
|
||||
return [
|
||||
TextDeltaEvent(content=text, snapshot=text),
|
||||
ToolCallEvent(
|
||||
tool_use_id=f"call_{key}",
|
||||
tool_name="set_output",
|
||||
tool_input={"key": key, "value": value},
|
||||
),
|
||||
FinishEvent(stop_reason="tool_calls", input_tokens=10, output_tokens=5, model="mock"),
|
||||
]
|
||||
|
||||
|
||||
def _text_finish(text: str) -> list:
|
||||
return [
|
||||
TextDeltaEvent(content=text, snapshot=text),
|
||||
FinishEvent(stop_reason="stop", input_tokens=10, output_tokens=5, model="mock"),
|
||||
]
|
||||
|
||||
|
||||
def _make_runtime():
|
||||
rt = MagicMock(spec=Runtime)
|
||||
rt.start_run = MagicMock(return_value="run_1")
|
||||
rt.end_run = MagicMock()
|
||||
rt.report_problem = MagicMock()
|
||||
rt.decide = MagicMock(return_value="dec_1")
|
||||
rt.record_outcome = MagicMock()
|
||||
rt.set_node = MagicMock()
|
||||
return rt
|
||||
|
||||
|
||||
def _make_goal():
|
||||
return Goal(id="g1", name="test", description="test goal")
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
# Unit tests for _parse_verdict
|
||||
# ===========================================================================
|
||||
|
||||
|
||||
class TestParseVerdict:
|
||||
def test_accept(self):
|
||||
v = _parse_verdict("ACTION: ACCEPT\nCONFIDENCE: 0.9\nFEEDBACK:")
|
||||
assert v.action == "ACCEPT"
|
||||
assert v.confidence == 0.9
|
||||
assert v.feedback == ""
|
||||
|
||||
def test_retry_with_feedback(self):
|
||||
v = _parse_verdict("ACTION: RETRY\nCONFIDENCE: 0.6\nFEEDBACK: Research is too shallow.")
|
||||
assert v.action == "RETRY"
|
||||
assert v.confidence == 0.6
|
||||
assert "shallow" in v.feedback
|
||||
|
||||
def test_defaults_on_garbage(self):
|
||||
v = _parse_verdict("some random text\nno structured output")
|
||||
assert v.action == "ACCEPT" # default
|
||||
assert v.confidence == 0.8 # default
|
||||
|
||||
def test_invalid_action_defaults_to_accept(self):
|
||||
v = _parse_verdict("ACTION: ESCALATE\nCONFIDENCE: 0.5")
|
||||
assert v.action == "ACCEPT" # ESCALATE not valid for Level 2
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
# Unit tests for evaluate_phase_completion
|
||||
# ===========================================================================
|
||||
|
||||
|
||||
class TestEvaluatePhaseCompletion:
|
||||
@pytest.mark.asyncio
|
||||
async def test_accept_on_good_response(self):
|
||||
"""LLM says ACCEPT → verdict is ACCEPT."""
|
||||
llm = MockStreamingLLM(complete_response="ACTION: ACCEPT\nCONFIDENCE: 0.95\nFEEDBACK:")
|
||||
conv = NodeConversation(system_prompt="test")
|
||||
await conv.add_user_message("Do research on topic X")
|
||||
await conv.add_assistant_message("I found 5 high-quality sources on X.")
|
||||
|
||||
verdict = await evaluate_phase_completion(
|
||||
llm=llm,
|
||||
conversation=conv,
|
||||
phase_name="Research",
|
||||
phase_description="Research the topic",
|
||||
success_criteria="Find at least 3 credible sources",
|
||||
accumulator_state={"findings": "5 sources found"},
|
||||
)
|
||||
assert verdict.action == "ACCEPT"
|
||||
assert verdict.confidence == 0.95
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_retry_on_poor_response(self):
|
||||
"""LLM says RETRY → verdict is RETRY with feedback."""
|
||||
llm = MockStreamingLLM(
|
||||
complete_response=(
|
||||
"ACTION: RETRY\nCONFIDENCE: 0.4\nFEEDBACK: Only found 1 source, need 3."
|
||||
)
|
||||
)
|
||||
conv = NodeConversation(system_prompt="test")
|
||||
await conv.add_user_message("Do research")
|
||||
await conv.add_assistant_message("I found 1 source.")
|
||||
|
||||
verdict = await evaluate_phase_completion(
|
||||
llm=llm,
|
||||
conversation=conv,
|
||||
phase_name="Research",
|
||||
phase_description="Research the topic",
|
||||
success_criteria="Find at least 3 credible sources",
|
||||
accumulator_state={"findings": "1 source"},
|
||||
)
|
||||
assert verdict.action == "RETRY"
|
||||
assert "1 source" in verdict.feedback
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_llm_failure_defaults_to_accept(self):
|
||||
"""When LLM fails, Level 2 should not block (Level 0 already passed)."""
|
||||
llm = MockStreamingLLM()
|
||||
# Make complete() raise an exception
|
||||
llm.complete = MagicMock(side_effect=RuntimeError("LLM unavailable"))
|
||||
|
||||
conv = NodeConversation(system_prompt="test")
|
||||
await conv.add_assistant_message("Done.")
|
||||
|
||||
verdict = await evaluate_phase_completion(
|
||||
llm=llm,
|
||||
conversation=conv,
|
||||
phase_name="Test",
|
||||
phase_description="Test phase",
|
||||
success_criteria="Do the thing",
|
||||
accumulator_state={"result": "done"},
|
||||
)
|
||||
assert verdict.action == "ACCEPT"
|
||||
assert verdict.confidence == 0.5
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
# Integration: Level 2 in EventLoopNode implicit judge
|
||||
# ===========================================================================
|
||||
|
||||
|
||||
class TestLevel2InImplicitJudge:
|
||||
@pytest.mark.asyncio
|
||||
async def test_no_success_criteria_level0_only(self):
|
||||
"""Without success_criteria, Level 0 accepts normally (existing behavior)."""
|
||||
runtime = _make_runtime()
|
||||
llm = MockStreamingLLM(
|
||||
scenarios=[
|
||||
_set_output_scenario("result", "done"),
|
||||
_text_finish("accepted"),
|
||||
]
|
||||
)
|
||||
|
||||
spec = NodeSpec(
|
||||
id="n1",
|
||||
name="Node1",
|
||||
description="test",
|
||||
node_type="event_loop",
|
||||
output_keys=["result"],
|
||||
# No success_criteria!
|
||||
)
|
||||
graph = GraphSpec(
|
||||
id="g1",
|
||||
goal_id="g1",
|
||||
entry_node="n1",
|
||||
nodes=[spec],
|
||||
edges=[],
|
||||
)
|
||||
|
||||
executor = GraphExecutor(runtime=runtime, llm=llm)
|
||||
result = await executor.execute(graph=graph, goal=_make_goal())
|
||||
assert result.success
|
||||
# LLM.complete should NOT have been called for Level 2
|
||||
assert len(llm.complete_calls) == 0
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_success_criteria_accept(self):
|
||||
"""With success_criteria and good work, Level 2 accepts."""
|
||||
runtime = _make_runtime()
|
||||
llm = MockStreamingLLM(
|
||||
scenarios=[
|
||||
_text_then_set_output("I did thorough research.", "result", "done"),
|
||||
_text_finish(""), # triggers judge
|
||||
],
|
||||
complete_response="ACTION: ACCEPT\nCONFIDENCE: 0.9\nFEEDBACK:",
|
||||
)
|
||||
|
||||
spec = NodeSpec(
|
||||
id="n1",
|
||||
name="Research",
|
||||
description="Do research",
|
||||
node_type="event_loop",
|
||||
output_keys=["result"],
|
||||
success_criteria="Provide thorough research with multiple sources.",
|
||||
)
|
||||
graph = GraphSpec(
|
||||
id="g1",
|
||||
goal_id="g1",
|
||||
entry_node="n1",
|
||||
nodes=[spec],
|
||||
edges=[],
|
||||
)
|
||||
|
||||
executor = GraphExecutor(runtime=runtime, llm=llm)
|
||||
result = await executor.execute(graph=graph, goal=_make_goal())
|
||||
assert result.success
|
||||
# LLM.complete should have been called for Level 2
|
||||
assert len(llm.complete_calls) >= 1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_success_criteria_retry_then_accept(self):
|
||||
"""Level 2 rejects first attempt, LLM tries again, Level 2 accepts."""
|
||||
runtime = _make_runtime()
|
||||
|
||||
# Track complete calls to alternate responses
|
||||
complete_responses = [
|
||||
"ACTION: RETRY\nCONFIDENCE: 0.4\nFEEDBACK: Need more detail.",
|
||||
"ACTION: ACCEPT\nCONFIDENCE: 0.9\nFEEDBACK:",
|
||||
]
|
||||
call_count = [0]
|
||||
|
||||
class SequentialLLM(MockStreamingLLM):
|
||||
def complete(self, messages, system="", **kwargs):
|
||||
idx = call_count[0]
|
||||
call_count[0] += 1
|
||||
resp = complete_responses[idx % len(complete_responses)]
|
||||
return LLMResponse(content=resp, model="mock", stop_reason="stop")
|
||||
|
||||
llm = SequentialLLM(
|
||||
scenarios=[
|
||||
# Turn 1: set output, then stop → Level 2 RETRY
|
||||
_text_then_set_output("Brief research.", "result", "brief"),
|
||||
_text_finish(""), # triggers judge → Level 2 RETRY
|
||||
# Turn 2: after retry feedback, set output again, stop → Level 2 ACCEPT
|
||||
_text_then_set_output("Much more detailed research.", "result", "detailed"),
|
||||
_text_finish(""), # triggers judge → Level 2 ACCEPT
|
||||
]
|
||||
)
|
||||
|
||||
spec = NodeSpec(
|
||||
id="n1",
|
||||
name="Research",
|
||||
description="Do research",
|
||||
node_type="event_loop",
|
||||
output_keys=["result"],
|
||||
success_criteria="Provide thorough research with multiple sources.",
|
||||
)
|
||||
graph = GraphSpec(
|
||||
id="g1",
|
||||
goal_id="g1",
|
||||
entry_node="n1",
|
||||
nodes=[spec],
|
||||
edges=[],
|
||||
)
|
||||
|
||||
executor = GraphExecutor(runtime=runtime, llm=llm)
|
||||
result = await executor.execute(graph=graph, goal=_make_goal())
|
||||
assert result.success
|
||||
# Should have had 2 complete calls (first RETRY, second ACCEPT)
|
||||
assert call_count[0] >= 2
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_level2_only_fires_when_level0_passes(self):
|
||||
"""Level 2 should NOT fire when output keys are missing."""
|
||||
runtime = _make_runtime()
|
||||
|
||||
llm = MockStreamingLLM(
|
||||
scenarios=[
|
||||
# Turn 1: just text, no set_output → Level 0 RETRY (missing keys)
|
||||
_text_finish("I did some thinking."),
|
||||
# Turn 2: set output → Level 0 ACCEPT, Level 2 check
|
||||
_text_then_set_output("Now I have output.", "result", "done"),
|
||||
_text_finish(""), # triggers judge
|
||||
],
|
||||
complete_response="ACTION: ACCEPT\nCONFIDENCE: 0.9\nFEEDBACK:",
|
||||
)
|
||||
|
||||
spec = NodeSpec(
|
||||
id="n1",
|
||||
name="Research",
|
||||
description="Do research",
|
||||
node_type="event_loop",
|
||||
output_keys=["result"],
|
||||
success_criteria="Provide results.",
|
||||
)
|
||||
graph = GraphSpec(
|
||||
id="g1",
|
||||
goal_id="g1",
|
||||
entry_node="n1",
|
||||
nodes=[spec],
|
||||
edges=[],
|
||||
)
|
||||
|
||||
executor = GraphExecutor(runtime=runtime, llm=llm)
|
||||
result = await executor.execute(graph=graph, goal=_make_goal())
|
||||
assert result.success
|
||||
# Level 2 should only fire once (when Level 0 passes)
|
||||
assert len(llm.complete_calls) == 1
|
||||
@@ -0,0 +1,280 @@
"""Tests for phase-aware compaction in continuous conversation mode.

Validates:
- Phase tags persist through storage roundtrip
- Transition markers survive compaction
- Current phase messages protected during compaction
- Older phase tool results pruned first
- Phase metadata fields have safe defaults
"""

from __future__ import annotations

import pytest

from framework.graph.conversation import Message, NodeConversation


class TestPhaseMetadata:
    """Phase metadata on Message dataclass."""

    def test_defaults(self):
        msg = Message(seq=0, role="user", content="hello")
        assert msg.phase_id is None
        assert msg.is_transition_marker is False

    def test_set_phase(self):
        msg = Message(seq=0, role="user", content="hello", phase_id="research")
        assert msg.phase_id == "research"

    def test_transition_marker(self):
        msg = Message(
            seq=0,
            role="user",
            content="PHASE TRANSITION",
            is_transition_marker=True,
            phase_id="report",
        )
        assert msg.is_transition_marker is True
        assert msg.phase_id == "report"

def test_storage_roundtrip(self):
|
||||
"""Phase metadata should survive to_storage_dict → from_storage_dict."""
|
||||
msg = Message(
|
||||
seq=5,
|
||||
role="user",
|
||||
content="transition",
|
||||
phase_id="review",
|
||||
is_transition_marker=True,
|
||||
)
|
||||
d = msg.to_storage_dict()
|
||||
assert d["phase_id"] == "review"
|
||||
assert d["is_transition_marker"] is True
|
||||
|
||||
restored = Message.from_storage_dict(d)
|
||||
assert restored.phase_id == "review"
|
||||
assert restored.is_transition_marker is True
|
||||
|
||||
def test_storage_roundtrip_no_phase(self):
|
||||
"""Messages without phase metadata should roundtrip cleanly."""
|
||||
msg = Message(seq=0, role="assistant", content="hello")
|
||||
d = msg.to_storage_dict()
|
||||
assert "phase_id" not in d
|
||||
assert "is_transition_marker" not in d
|
||||
|
||||
restored = Message.from_storage_dict(d)
|
||||
assert restored.phase_id is None
|
||||
assert restored.is_transition_marker is False
|
||||
|
||||
def test_to_llm_dict_no_metadata(self):
|
||||
"""Phase metadata should NOT appear in LLM-facing dicts."""
|
||||
msg = Message(
|
||||
seq=0,
|
||||
role="user",
|
||||
content="hello",
|
||||
phase_id="research",
|
||||
is_transition_marker=True,
|
||||
)
|
||||
d = msg.to_llm_dict()
|
||||
assert "phase_id" not in d
|
||||
assert "is_transition_marker" not in d
|
||||
assert d == {"role": "user", "content": "hello"}
|
||||
|
||||
|
||||
class TestPhaseStamping:
|
||||
"""Messages are stamped with current phase."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_messages_stamped_with_phase(self):
|
||||
conv = NodeConversation(system_prompt="test")
|
||||
conv.set_current_phase("research")
|
||||
|
||||
msg1 = await conv.add_user_message("search for X")
|
||||
msg2 = await conv.add_assistant_message("Found it.")
|
||||
|
||||
assert msg1.phase_id == "research"
|
||||
assert msg2.phase_id == "research"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_phase_changes_stamp(self):
|
||||
conv = NodeConversation(system_prompt="test")
|
||||
conv.set_current_phase("research")
|
||||
|
||||
msg1 = await conv.add_user_message("research msg")
|
||||
|
||||
conv.set_current_phase("report")
|
||||
msg2 = await conv.add_user_message("report msg")
|
||||
|
||||
assert msg1.phase_id == "research"
|
||||
assert msg2.phase_id == "report"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_no_phase_no_stamp(self):
|
||||
conv = NodeConversation(system_prompt="test")
|
||||
msg = await conv.add_user_message("no phase")
|
||||
assert msg.phase_id is None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_transition_marker_flag(self):
|
||||
conv = NodeConversation(system_prompt="test")
|
||||
conv.set_current_phase("report")
|
||||
|
||||
msg = await conv.add_user_message(
|
||||
"PHASE TRANSITION: Research → Report",
|
||||
is_transition_marker=True,
|
||||
)
|
||||
assert msg.is_transition_marker is True
|
||||
assert msg.phase_id == "report"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_tool_result_stamped(self):
|
||||
conv = NodeConversation(system_prompt="test")
|
||||
conv.set_current_phase("research")
|
||||
|
||||
msg = await conv.add_tool_result("call_1", "tool output here")
|
||||
assert msg.phase_id == "research"
|
||||
|
||||
|
||||
class TestPhaseAwareCompaction:
|
||||
"""prune_old_tool_results protects current phase and transition markers."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_transition_marker_survives_compaction(self):
|
||||
"""Transition markers should never be pruned."""
|
||||
conv = NodeConversation(system_prompt="test")
|
||||
|
||||
# Old phase with a big tool result
|
||||
conv.set_current_phase("research")
|
||||
await conv.add_assistant_message(
|
||||
"calling tool",
|
||||
tool_calls=[
|
||||
{
|
||||
"id": "call_1",
|
||||
"type": "function",
|
||||
"function": {"name": "search", "arguments": "{}"},
|
||||
}
|
||||
],
|
||||
)
|
||||
await conv.add_tool_result("call_1", "x" * 20000) # big tool result
|
||||
|
||||
# Transition marker
|
||||
await conv.add_user_message(
|
||||
"PHASE TRANSITION: Research → Report",
|
||||
is_transition_marker=True,
|
||||
)
|
||||
|
||||
# New phase
|
||||
conv.set_current_phase("report")
|
||||
await conv.add_assistant_message(
|
||||
"calling another tool",
|
||||
tool_calls=[
|
||||
{
|
||||
"id": "call_2",
|
||||
"type": "function",
|
||||
"function": {"name": "save", "arguments": "{}"},
|
||||
}
|
||||
],
|
||||
)
|
||||
await conv.add_tool_result("call_2", "y" * 200)
|
||||
|
||||
pruned = await conv.prune_old_tool_results(protect_tokens=0, min_prune_tokens=100)
|
||||
assert pruned >= 1
|
||||
|
||||
# Transition marker should still be intact
|
||||
marker_msgs = [m for m in conv.messages if m.is_transition_marker]
|
||||
assert len(marker_msgs) == 1
|
||||
assert "PHASE TRANSITION" in marker_msgs[0].content
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_current_phase_protected(self):
|
||||
"""Tool results in the current phase should not be pruned."""
|
||||
conv = NodeConversation(system_prompt="test")
|
||||
|
||||
# Old phase
|
||||
conv.set_current_phase("research")
|
||||
await conv.add_assistant_message(
|
||||
"tool call",
|
||||
tool_calls=[
|
||||
{"id": "c1", "type": "function", "function": {"name": "s", "arguments": "{}"}}
|
||||
],
|
||||
)
|
||||
await conv.add_tool_result("c1", "old_data " * 5000)
|
||||
|
||||
# Current phase
|
||||
conv.set_current_phase("report")
|
||||
await conv.add_assistant_message(
|
||||
"tool call",
|
||||
tool_calls=[
|
||||
{"id": "c2", "type": "function", "function": {"name": "s", "arguments": "{}"}}
|
||||
],
|
||||
)
|
||||
await conv.add_tool_result("c2", "current_data " * 5000)
|
||||
|
||||
await conv.prune_old_tool_results(protect_tokens=0, min_prune_tokens=100)
|
||||
|
||||
# Old phase's tool result should be pruned
|
||||
msgs = conv.messages
|
||||
old_tool = [m for m in msgs if m.role == "tool" and m.phase_id == "research"]
|
||||
assert len(old_tool) == 1
|
||||
assert old_tool[0].content.startswith("[Pruned tool result")
|
||||
|
||||
# Current phase's tool result should be intact
|
||||
current_tool = [m for m in msgs if m.role == "tool" and m.phase_id == "report"]
|
||||
assert len(current_tool) == 1
|
||||
assert "current_data" in current_tool[0].content
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_no_phase_metadata_works_normally(self):
|
||||
"""Without phase metadata, compaction works as before (no regression)."""
|
||||
conv = NodeConversation(system_prompt="test")
|
||||
|
||||
# No phase set — messages have phase_id=None
|
||||
await conv.add_assistant_message(
|
||||
"tool call",
|
||||
tool_calls=[
|
||||
{"id": "c1", "type": "function", "function": {"name": "s", "arguments": "{}"}}
|
||||
],
|
||||
)
|
||||
await conv.add_tool_result("c1", "data " * 5000) # ~6250 tokens
|
||||
|
||||
await conv.add_assistant_message(
|
||||
"another tool call",
|
||||
tool_calls=[
|
||||
{"id": "c2", "type": "function", "function": {"name": "s", "arguments": "{}"}}
|
||||
],
|
||||
)
|
||||
await conv.add_tool_result("c2", "more " * 100) # ~125 tokens
|
||||
|
||||
# protect_tokens=100: c2 (~125 tokens) fills the budget,
|
||||
# c1 (~6250 tokens) becomes pruneable
|
||||
pruned = await conv.prune_old_tool_results(protect_tokens=100, min_prune_tokens=100)
|
||||
assert pruned >= 1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_pruned_message_preserves_phase_metadata(self):
|
||||
"""Pruned messages should keep their phase_id."""
|
||||
conv = NodeConversation(system_prompt="test")
|
||||
conv.set_current_phase("research")
|
||||
|
||||
await conv.add_assistant_message(
|
||||
"tool call",
|
||||
tool_calls=[
|
||||
{"id": "c1", "type": "function", "function": {"name": "s", "arguments": "{}"}}
|
||||
],
|
||||
)
|
||||
await conv.add_tool_result("c1", "data " * 5000)
|
||||
|
||||
# Switch to new phase so research messages become pruneable
|
||||
conv.set_current_phase("report")
|
||||
await conv.add_assistant_message(
|
||||
"recent",
|
||||
tool_calls=[
|
||||
{"id": "c2", "type": "function", "function": {"name": "s", "arguments": "{}"}}
|
||||
],
|
||||
)
|
||||
await conv.add_tool_result("c2", "x" * 200)
|
||||
|
||||
await conv.prune_old_tool_results(protect_tokens=0, min_prune_tokens=100)
|
||||
|
||||
pruned_msg = [m for m in conv.messages if m.content.startswith("[Pruned")][0]
|
||||
assert pruned_msg.phase_id == "research"
|
||||
@@ -17,6 +17,8 @@
|
||||
"terminal_nodes": [
|
||||
"report"
|
||||
],
|
||||
"conversation_mode": "continuous",
|
||||
"identity_prompt": "You are a rigorous research agent. You search for information from diverse, authoritative sources, analyze findings critically, and produce well-cited reports. You never fabricate information \u2014 every claim must trace back to a source you actually retrieved.",
|
||||
"nodes": [
|
||||
{
|
||||
"id": "intake",
|
||||
@@ -32,7 +34,8 @@
|
||||
"nullable_output_keys": [],
|
||||
"input_schema": {},
|
||||
"output_schema": {},
|
||||
"system_prompt": "You are a research intake specialist. The user wants to research a topic.\nHave a brief conversation to clarify what they need.\n\n**STEP 1 \u2014 Read and respond (text only, NO tool calls):**\n1. Read the topic provided\n2. If it's vague, ask 1-2 clarifying questions (scope, angle, depth)\n3. If it's already clear, confirm your understanding and ask the user to confirm\n\nKeep it short. Don't over-ask.\n\nAfter your message, call ask_user() to wait for the user's response.\n\n**STEP 2 \u2014 After the user confirms, call set_output:**\n- set_output(\"research_brief\", \"A clear paragraph describing exactly what to research, what questions to answer, what scope to cover, and how deep to go.\")",
|
||||
"success_criteria": "The research brief is specific and actionable: it states the topic, the key questions to answer, the desired scope, and depth.",
|
||||
"system_prompt": "You are a research intake specialist. The user wants to research a topic.\nHave a brief conversation to clarify what they need.\n\n**STEP 1 \u2014 Read and respond (text only, NO tool calls):**\n1. Read the topic provided\n2. If it's vague, ask 1-2 clarifying questions (scope, angle, depth)\n3. If it's already clear, confirm your understanding and ask the user to confirm\n\nKeep it short. Don't over-ask.\n\n**STEP 2 \u2014 After the user confirms, call set_output:**\n- set_output(\"research_brief\", \"A clear paragraph describing exactly what to research, what questions to answer, what scope to cover, and how deep to go.\")",
|
||||
"tools": [],
|
||||
"model": null,
|
||||
"function": null,
|
||||
@@ -63,7 +66,8 @@
|
||||
],
|
||||
"input_schema": {},
|
||||
"output_schema": {},
|
||||
"system_prompt": "You are a research agent. Given a research brief, find and analyze sources.\n\nIf feedback is provided, this is a follow-up round \u2014 focus on the gaps identified.\n\nWork in phases:\n1. **Search**: Use web_search with 3-5 diverse queries covering different angles.\n Prioritize authoritative sources (.edu, .gov, established publications).\n2. **Fetch**: Use web_scrape on the most promising URLs (aim for 5-8 sources).\n Skip URLs that fail. Extract the substantive content.\n3. **Analyze**: Review what you've collected. Identify key findings, themes,\n and any contradictions between sources.\n\nImportant:\n- Work in batches of 3-4 tool calls at a time to manage context\n- After each batch, assess whether you have enough material\n- Prefer quality over quantity \u2014 5 good sources beat 15 thin ones\n- Track which URL each finding comes from (you'll need citations later)\n\nWhen done, use set_output:\n- set_output(\"findings\", \"Structured summary: key findings with source URLs for each claim. Include themes, contradictions, and confidence levels.\")\n- set_output(\"sources\", [{\"url\": \"...\", \"title\": \"...\", \"summary\": \"...\"}])\n- set_output(\"gaps\", \"What aspects of the research brief are NOT well-covered yet, if any.\")",
|
||||
"success_criteria": "Findings reference at least 3 distinct sources with URLs. Key claims are substantiated by fetched content, not generated.",
|
||||
"system_prompt": "You are a research agent. Given a research brief, find and analyze sources.\n\nIf feedback is provided, this is a follow-up round \u2014 focus on the gaps identified.\n\nWork in phases:\n1. **Search**: Use web_search with 3-5 diverse queries covering different angles.\n Prioritize authoritative sources (.edu, .gov, established publications).\n2. **Fetch**: Use web_scrape on the most promising URLs (aim for 5-8 sources).\n Skip URLs that fail. Extract the substantive content.\n3. **Analyze**: Review what you've collected. Identify key findings, themes,\n and any contradictions between sources.\n\nImportant:\n- Work in batches of 3-4 tool calls at a time \u2014 never more than 10 per turn\n- After each batch, assess whether you have enough material\n- Prefer quality over quantity \u2014 5 good sources beat 15 thin ones\n- Track which URL each finding comes from (you'll need citations later)\n- Call set_output for each key in a SEPARATE turn (not in the same turn as other tool calls)\n\nWhen done, use set_output (one key at a time, separate turns):\n- set_output(\"findings\", \"Structured summary: key findings with source URLs for each claim. Include themes, contradictions, and confidence levels.\")\n- set_output(\"sources\", [{\"url\": \"...\", \"title\": \"...\", \"summary\": \"...\"}])\n- set_output(\"gaps\", \"What aspects of the research brief are NOT well-covered yet, if any.\")",
|
||||
"tools": [
|
||||
"web_search",
|
||||
"web_scrape",
|
||||
@@ -99,7 +103,8 @@
|
||||
"nullable_output_keys": [],
|
||||
"input_schema": {},
|
||||
"output_schema": {},
|
||||
"system_prompt": "Present the research findings to the user clearly and concisely.\n\n**STEP 1 \u2014 Present (your first message, text only, NO tool calls):**\n1. **Summary** (2-3 sentences of what was found)\n2. **Key Findings** (bulleted, with confidence levels)\n3. **Sources Used** (count and quality assessment)\n4. **Gaps** (what's still unclear or under-covered)\n\nEnd by asking: Are they satisfied, or do they want deeper research? Should we proceed to writing the final report?\n\nAfter your presentation, call ask_user() to wait for the user's response.\n\n**STEP 2 \u2014 After the user responds, call set_output:**\n- set_output(\"needs_more_research\", \"true\") \u2014 if they want more\n- set_output(\"needs_more_research\", \"false\") \u2014 if they're satisfied\n- set_output(\"feedback\", \"What the user wants explored further, or empty string\")",
|
||||
"success_criteria": "The user has been presented with findings and has explicitly indicated whether they want more research or are ready for the report.",
|
||||
"system_prompt": "Present the research findings to the user clearly and concisely.\n\n**STEP 1 \u2014 Present (your first message, text only, NO tool calls):**\n1. **Summary** (2-3 sentences of what was found)\n2. **Key Findings** (bulleted, with confidence levels)\n3. **Sources Used** (count and quality assessment)\n4. **Gaps** (what's still unclear or under-covered)\n\nEnd by asking: Are they satisfied, or do they want deeper research? Should we proceed to writing the final report?\n\n**STEP 2 \u2014 After the user responds, call set_output:**\n- set_output(\"needs_more_research\", \"true\") \u2014 if they want more\n- set_output(\"needs_more_research\", \"false\") \u2014 if they're satisfied\n- set_output(\"feedback\", \"What the user wants explored further, or empty string\")",
|
||||
"tools": [],
|
||||
"model": null,
|
||||
"function": null,
|
||||
@@ -127,7 +132,8 @@
|
||||
"nullable_output_keys": [],
|
||||
"input_schema": {},
|
||||
"output_schema": {},
|
||||
"system_prompt": "Write a comprehensive research report as an HTML file and present it to the user.\n\n**STEP 1 \u2014 Write the HTML report (tool calls, NO text to user yet):**\n\n1. Compose a complete, self-contained HTML document with embedded CSS styling.\n Use a clean, readable design: max-width container, pleasant typography,\n numbered citation links, a table of contents, and a references section.\n\n Report structure inside the HTML:\n - Title & date\n - Executive Summary (2-3 paragraphs)\n - Table of Contents\n - Findings (organized by theme, with [n] citation links)\n - Analysis (synthesis, implications, areas of debate)\n - Conclusion (key takeaways, confidence assessment)\n - References (numbered list with clickable URLs)\n\n Requirements:\n - Every factual claim must cite its source with [n] notation\n - Be objective \u2014 present multiple viewpoints where sources disagree\n - Distinguish well-supported conclusions from speculation\n - Answer the original research questions from the brief\n\n2. Save the HTML file:\n save_data(filename=\"report.html\", data=<your_html>)\n\n3. Get the clickable link:\n serve_file_to_user(filename=\"report.html\", label=\"Research Report\")\n\n**STEP 2 \u2014 Present the link to the user (text only, NO tool calls):**\n\nTell the user the report is ready and include the file:// URI from\nserve_file_to_user so they can click it to open. Give a brief summary\nof what the report covers. Ask if they have questions.\n\nAfter presenting the link, call ask_user() to wait for the user's response.\n\n**STEP 3 \u2014 After the user responds:**\n- Answer follow-up questions from the research material\n- Call ask_user() again if they might have more questions\n- When the user is satisfied: set_output(\"delivery_status\", \"completed\")",
|
||||
"success_criteria": "An HTML report has been saved, the file link has been presented to the user, and the user has acknowledged receipt.",
|
||||
"system_prompt": "Write a research report as an HTML file and present it to the user.\n\nIMPORTANT: save_data requires TWO separate arguments: filename and data.\nCall it like: save_data(filename=\"report.html\", data=\"<html>...</html>\")\nDo NOT use _raw, do NOT nest arguments inside a JSON string.\n\n**STEP 1 \u2014 Write and save the HTML report (tool calls, NO text to user yet):**\n\nBuild a clean HTML document. Keep the HTML concise \u2014 aim for clarity over length.\nUse minimal embedded CSS (a few lines of style, not a full framework).\n\nReport structure:\n- Title & date\n- Executive Summary (2-3 paragraphs)\n- Key Findings (organized by theme, with [n] citation links)\n- Analysis (synthesis, implications)\n- Conclusion (key takeaways)\n- References (numbered list with clickable URLs)\n\nRequirements:\n- Every factual claim must cite its source with [n] notation\n- Be objective \u2014 present multiple viewpoints where sources disagree\n- Answer the original research questions from the brief\n\nSave the HTML:\n save_data(filename=\"report.html\", data=\"<html>...</html>\")\n\nThen get the clickable link:\n serve_file_to_user(filename=\"report.html\", label=\"Research Report\")\n\nIf save_data fails, simplify and shorten the HTML, then retry.\n\n**STEP 2 \u2014 Present the link to the user (text only, NO tool calls):**\n\nTell the user the report is ready and include the file:// URI from\nserve_file_to_user so they can click it to open. Give a brief summary\nof what the report covers. Ask if they have questions.\n\n**STEP 3 \u2014 After the user responds:**\n- Answer follow-up questions from the research material\n- When the user is satisfied: set_output(\"delivery_status\", \"completed\")",
|
||||
"tools": [
|
||||
"save_data",
|
||||
"serve_file_to_user",
|
||||
|
||||
@@ -1,12 +1,15 @@
|
||||
"""Agent graph construction for Deep Research Agent."""
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
from framework.graph import EdgeSpec, EdgeCondition, Goal, SuccessCriterion, Constraint
|
||||
from framework.graph.edge import GraphSpec
|
||||
from framework.graph.executor import ExecutionResult, GraphExecutor
|
||||
from framework.runtime.event_bus import EventBus
|
||||
from framework.runtime.core import Runtime
|
||||
from framework.graph.executor import ExecutionResult
|
||||
from framework.graph.checkpoint_config import CheckpointConfig
|
||||
from framework.llm import LiteLLMProvider
|
||||
from framework.runner.tool_registry import ToolRegistry
|
||||
from framework.runtime.agent_runtime import AgentRuntime, create_agent_runtime
|
||||
from framework.runtime.execution_stream import EntryPointSpec
|
||||
|
||||
from .config import default_config, metadata
|
||||
from .nodes import (
|
||||
@@ -102,22 +105,40 @@ edges = [
        condition=EdgeCondition.ON_SUCCESS,
        priority=1,
    ),
    # review -> research (feedback loop, checked first)
    # review -> research (feedback loop)
    EdgeSpec(
        id="review-to-research-feedback",
        source="review",
        target="research",
        condition=EdgeCondition.CONDITIONAL,
        condition_expr="str(needs_more_research).lower() == 'true'",
        priority=2,
        condition_expr="needs_more_research == True",
        priority=1,
    ),
    # review -> report (complementary condition — proceed to report when no more research needed)
    # review -> report (user satisfied)
    EdgeSpec(
        id="review-to-report",
        source="review",
        target="report",
        condition=EdgeCondition.CONDITIONAL,
        condition_expr="str(needs_more_research).lower() != 'true'",
        condition_expr="needs_more_research == False",
        priority=2,
    ),
    # report -> research (user wants deeper research on current topic)
    EdgeSpec(
        id="report-to-research",
        source="report",
        target="research",
        condition=EdgeCondition.CONDITIONAL,
        condition_expr="str(next_action).lower() == 'more_research'",
        priority=2,
    ),
    # report -> intake (user wants a new topic — default when not more_research)
    EdgeSpec(
        id="report-to-intake",
        source="report",
        target="intake",
        condition=EdgeCondition.CONDITIONAL,
        condition_expr="str(next_action).lower() != 'more_research'",
        priority=1,
    ),
]
@@ -126,7 +147,7 @@ edges = [
entry_node = "intake"
entry_points = {"start": "intake"}
pause_nodes = []
terminal_nodes = ["report"]
terminal_nodes = []


class DeepResearchAgent:
|
||||
@@ -136,6 +157,12 @@ class DeepResearchAgent:
|
||||
Flow: intake -> research -> review -> report
|
||||
^ |
|
||||
+-- feedback loop (if user wants more)
|
||||
|
||||
Uses AgentRuntime for proper session management:
|
||||
- Session-scoped storage (sessions/{session_id}/)
|
||||
- Checkpointing for resume capability
|
||||
- Runtime logging
|
||||
- Data folder for save_data/load_data
|
||||
"""
|
||||
|
||||
def __init__(self, config=None):
|
||||
@@ -147,10 +174,10 @@ class DeepResearchAgent:
|
||||
self.entry_points = entry_points
|
||||
self.pause_nodes = pause_nodes
|
||||
self.terminal_nodes = terminal_nodes
|
||||
self._executor: GraphExecutor | None = None
|
||||
self._graph: GraphSpec | None = None
|
||||
self._event_bus: EventBus | None = None
|
||||
self._agent_runtime: AgentRuntime | None = None
|
||||
self._tool_registry: ToolRegistry | None = None
|
||||
self._storage_path: Path | None = None
|
||||
|
||||
def _build_graph(self) -> GraphSpec:
|
||||
"""Build the GraphSpec."""
|
||||
@@ -177,84 +204,95 @@ class DeepResearchAgent:
|
||||
"""Set up the executor with all components."""
|
||||
from pathlib import Path
|
||||
|
||||
from framework.credentials.validation import (
|
||||
ensure_credential_key_env,
|
||||
validate_agent_credentials,
|
||||
)
|
||||
|
||||
ensure_credential_key_env()
|
||||
validate_agent_credentials(self.nodes)
|
||||
|
||||
storage_path = Path.home() / ".hive" / "agents" / "deep_research_agent"
|
||||
storage_path.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
self._event_bus = EventBus()
|
||||
self._tool_registry = ToolRegistry()
|
||||
|
||||
mcp_config_path = Path(__file__).parent / "mcp_servers.json"
|
||||
if mcp_config_path.exists():
|
||||
self._tool_registry.load_mcp_config(mcp_config_path)
|
||||
|
||||
llm = LiteLLMProvider(
|
||||
model=self.config.model,
|
||||
api_key=self.config.api_key,
|
||||
api_base=self.config.api_base,
|
||||
)
|
||||
llm = None
|
||||
if not mock_mode:
|
||||
llm = LiteLLMProvider(
|
||||
model=self.config.model,
|
||||
api_key=self.config.api_key,
|
||||
api_base=self.config.api_base,
|
||||
)
|
||||
|
||||
tool_executor = self._tool_registry.get_executor()
|
||||
tools = list(self._tool_registry.get_tools().values())
|
||||
|
||||
self._graph = self._build_graph()
|
||||
runtime = Runtime(storage_path)
|
||||
|
||||
self._executor = GraphExecutor(
|
||||
runtime=runtime,
|
||||
checkpoint_config = CheckpointConfig(
|
||||
enabled=True,
|
||||
checkpoint_on_node_start=False,
|
||||
checkpoint_on_node_complete=True,
|
||||
checkpoint_max_age_days=7,
|
||||
async_checkpoint=True,
|
||||
)
|
||||
|
||||
entry_point_specs = [
|
||||
EntryPointSpec(
|
||||
id="default",
|
||||
name="Default",
|
||||
entry_node=self.entry_node,
|
||||
trigger_type="manual",
|
||||
isolation_level="shared",
|
||||
)
|
||||
]
|
||||
|
||||
self._agent_runtime = create_agent_runtime(
|
||||
graph=self._graph,
|
||||
goal=self.goal,
|
||||
storage_path=self._storage_path,
|
||||
entry_points=entry_point_specs,
|
||||
llm=llm,
|
||||
tools=tools,
|
||||
tool_executor=tool_executor,
|
||||
event_bus=self._event_bus,
|
||||
storage_path=storage_path,
|
||||
loop_config=self._graph.loop_config,
|
||||
checkpoint_config=checkpoint_config,
|
||||
)
|
||||
|
||||
return self._executor
|
||||
|
||||
async def start(self) -> None:
|
||||
"""Set up the agent (initialize executor and tools)."""
|
||||
if self._executor is None:
|
||||
self._setup()
|
||||
async def start(self, mock_mode=False) -> None:
|
||||
"""Set up and start the agent runtime."""
|
||||
if self._agent_runtime is None:
|
||||
self._setup(mock_mode=mock_mode)
|
||||
if not self._agent_runtime.is_running:
|
||||
await self._agent_runtime.start()
|
||||
|
||||
async def stop(self) -> None:
|
||||
"""Clean up resources."""
|
||||
self._executor = None
|
||||
self._event_bus = None
|
||||
"""Stop the agent runtime and clean up."""
|
||||
if self._agent_runtime and self._agent_runtime.is_running:
|
||||
await self._agent_runtime.stop()
|
||||
self._agent_runtime = None
|
||||
|
||||
async def trigger_and_wait(
|
||||
self,
|
||||
entry_point: str,
|
||||
input_data: dict,
|
||||
entry_point: str = "default",
|
||||
input_data: dict | None = None,
|
||||
timeout: float | None = None,
|
||||
session_state: dict | None = None,
|
||||
) -> ExecutionResult | None:
|
||||
"""Execute the graph and wait for completion."""
|
||||
if self._executor is None:
|
||||
if self._agent_runtime is None:
|
||||
raise RuntimeError("Agent not started. Call start() first.")
|
||||
if self._graph is None:
|
||||
raise RuntimeError("Graph not built. Call start() first.")
|
||||
|
||||
return await self._executor.execute(
|
||||
graph=self._graph,
|
||||
goal=self.goal,
|
||||
input_data=input_data,
|
||||
return await self._agent_runtime.trigger_and_wait(
|
||||
entry_point_id=entry_point,
|
||||
input_data=input_data or {},
|
||||
session_state=session_state,
|
||||
)
|
||||
|
||||
async def run(self, context: dict, session_state=None) -> ExecutionResult:
|
||||
async def run(
|
||||
self, context: dict, mock_mode=False, session_state=None
|
||||
) -> ExecutionResult:
|
||||
"""Run the agent (convenience method for single execution)."""
|
||||
await self.start()
|
||||
await self.start(mock_mode=mock_mode)
|
||||
try:
|
||||
result = await self.trigger_and_wait(
|
||||
"start", context, session_state=session_state
|
||||
"default", context, session_state=session_state
|
||||
)
|
||||
return result or ExecutionResult(success=False, error="Execution timeout")
|
||||
finally:
|
||||
|
||||
@@ -10,8 +10,13 @@ intake_node = NodeSpec(
|
||||
description="Discuss the research topic with the user, clarify scope, and confirm direction",
|
||||
node_type="event_loop",
|
||||
client_facing=True,
|
||||
max_node_visits=0,
|
||||
input_keys=["topic"],
|
||||
output_keys=["research_brief"],
|
||||
success_criteria=(
|
||||
"The research brief is specific and actionable: it states the topic, "
|
||||
"the key questions to answer, the desired scope, and depth."
|
||||
),
|
||||
system_prompt="""\
|
||||
You are a research intake specialist. The user wants to research a topic.
|
||||
Have a brief conversation to clarify what they need.
|
||||
@@ -23,8 +28,6 @@ Have a brief conversation to clarify what they need.
|
||||
|
||||
Keep it short. Don't over-ask.
|
||||
|
||||
After your message, call ask_user() to wait for the user's response.
|
||||
|
||||
**STEP 2 — After the user confirms, call set_output:**
|
||||
- set_output("research_brief", "A clear paragraph describing exactly what to research, \
|
||||
what questions to answer, what scope to cover, and how deep to go.")
|
||||
@@ -40,10 +43,14 @@ research_node = NodeSpec(
|
||||
name="Research",
|
||||
description="Search the web, fetch source content, and compile findings",
|
||||
node_type="event_loop",
|
||||
max_node_visits=3,
|
||||
max_node_visits=0,
|
||||
input_keys=["research_brief", "feedback"],
|
||||
output_keys=["findings", "sources", "gaps"],
|
||||
nullable_output_keys=["feedback"],
|
||||
success_criteria=(
|
||||
"Findings reference at least 3 distinct sources with URLs. "
|
||||
"Key claims are substantiated by fetched content, not generated."
|
||||
),
|
||||
system_prompt="""\
|
||||
You are a research agent. Given a research brief, find and analyze sources.
|
||||
|
||||
@@ -58,18 +65,19 @@ Work in phases:
|
||||
and any contradictions between sources.
|
||||
|
||||
Important:
|
||||
- Work in batches of 3-4 tool calls at a time to manage context
|
||||
- Work in batches of 3-4 tool calls at a time — never more than 10 per turn
|
||||
- After each batch, assess whether you have enough material
|
||||
- Prefer quality over quantity — 5 good sources beat 15 thin ones
|
||||
- Track which URL each finding comes from (you'll need citations later)
|
||||
- Call set_output for each key in a SEPARATE turn (not in the same turn as other tool calls)
|
||||
|
||||
When done, use set_output:
|
||||
When done, use set_output (one key at a time, separate turns):
|
||||
- set_output("findings", "Structured summary: key findings with source URLs for each claim. \
|
||||
Include themes, contradictions, and confidence levels.")
|
||||
- set_output("sources", [{"url": "...", "title": "...", "summary": "..."}])
|
||||
- set_output("gaps", "What aspects of the research brief are NOT well-covered yet, if any.")
|
||||
""",
|
||||
tools=["web_search", "web_scrape", "load_data", "save_data", "list_data_files"],
|
||||
tools=["web_search", "web_scrape", "load_data", "save_data", "append_data", "list_data_files"],
|
||||
)
|
||||
|
||||
# Node 3: Review (client-facing)
|
||||
@@ -80,9 +88,13 @@ review_node = NodeSpec(
|
||||
description="Present findings to user and decide whether to research more or write the report",
|
||||
node_type="event_loop",
|
||||
client_facing=True,
|
||||
max_node_visits=3,
|
||||
max_node_visits=0,
|
||||
input_keys=["findings", "sources", "gaps", "research_brief"],
|
||||
output_keys=["needs_more_research", "feedback"],
|
||||
success_criteria=(
|
||||
"The user has been presented with findings and has explicitly indicated "
|
||||
"whether they want more research or are ready for the report."
|
||||
),
|
||||
system_prompt="""\
|
||||
Present the research findings to the user clearly and concisely.
|
||||
|
||||
@@ -95,8 +107,6 @@ Present the research findings to the user clearly and concisely.
|
||||
End by asking: Are they satisfied, or do they want deeper research? \
|
||||
Should we proceed to writing the final report?
|
||||
|
||||
After your presentation, call ask_user() to wait for the user's response.
|
||||
|
||||
**STEP 2 — After the user responds, call set_output:**
|
||||
- set_output("needs_more_research", "true") — if they want more
|
||||
- set_output("needs_more_research", "false") — if they're satisfied
|
||||
@@ -113,52 +123,66 @@ report_node = NodeSpec(
|
||||
description="Write a cited HTML report from the findings and present it to the user",
|
||||
node_type="event_loop",
|
||||
client_facing=True,
|
||||
max_node_visits=0,
|
||||
input_keys=["findings", "sources", "research_brief"],
|
||||
output_keys=["delivery_status"],
|
||||
output_keys=["delivery_status", "next_action"],
|
||||
success_criteria=(
|
||||
"An HTML report has been saved, the file link has been presented to the user, "
|
||||
"and the user has indicated what they want to do next."
|
||||
),
|
||||
system_prompt="""\
|
||||
Write a comprehensive research report as an HTML file and present it to the user.
|
||||
Write a research report as an HTML file and present it to the user.
|
||||
|
||||
**STEP 1 — Write the HTML report (tool calls, NO text to user yet):**
|
||||
IMPORTANT: save_data requires TWO separate arguments: filename and data.
|
||||
Call it like: save_data(filename="report.html", data="<html>...</html>")
|
||||
Do NOT use _raw, do NOT nest arguments inside a JSON string.
|
||||
|
||||
1. Compose a complete, self-contained HTML document with embedded CSS styling.
|
||||
Use a clean, readable design: max-width container, pleasant typography,
|
||||
numbered citation links, a table of contents, and a references section.
|
||||
**STEP 1 — Write and save the HTML report (tool calls, NO text to user yet):**
|
||||
|
||||
Report structure inside the HTML:
|
||||
- Title & date
|
||||
- Executive Summary (2-3 paragraphs)
|
||||
- Table of Contents
|
||||
- Findings (organized by theme, with [n] citation links)
|
||||
- Analysis (synthesis, implications, areas of debate)
|
||||
- Conclusion (key takeaways, confidence assessment)
|
||||
- References (numbered list with clickable URLs)
|
||||
Build a clean HTML document. Keep the HTML concise — aim for clarity over length.
|
||||
Use minimal embedded CSS (a few lines of style, not a full framework).
|
||||
|
||||
Requirements:
|
||||
- Every factual claim must cite its source with [n] notation
|
||||
- Be objective — present multiple viewpoints where sources disagree
|
||||
- Distinguish well-supported conclusions from speculation
|
||||
- Answer the original research questions from the brief
|
||||
Report structure:
|
||||
- Title & date
|
||||
- Executive Summary (2-3 paragraphs)
|
||||
- Key Findings (organized by theme, with [n] citation links)
|
||||
- Analysis (synthesis, implications)
|
||||
- Conclusion (key takeaways)
|
||||
- References (numbered list with clickable URLs)
|
||||
|
||||
2. Save the HTML file:
|
||||
save_data(filename="report.html", data=<your_html>)
|
||||
Requirements:
|
||||
- Every factual claim must cite its source with [n] notation
|
||||
- Be objective — present multiple viewpoints where sources disagree
|
||||
- Answer the original research questions from the brief
|
||||
|
||||
3. Get the clickable link:
|
||||
serve_file_to_user(filename="report.html", label="Research Report")
|
||||
Save the HTML:
|
||||
save_data(filename="report.html", data="<html>...</html>")
|
||||
|
||||
Then get the clickable link:
|
||||
serve_file_to_user(filename="report.html", label="Research Report")
|
||||
|
||||
If save_data fails, simplify and shorten the HTML, then retry.
|
||||
|
||||
**STEP 2 — Present the link to the user (text only, NO tool calls):**
|
||||
|
||||
Tell the user the report is ready and include the file:// URI from
|
||||
serve_file_to_user so they can click it to open. Give a brief summary
|
||||
of what the report covers. Ask if they have questions.
|
||||
|
||||
After presenting the link, call ask_user() to wait for the user's response.
|
||||
of what the report covers. Ask if they have questions or want to continue.
|
||||
|
||||
**STEP 3 — After the user responds:**
|
||||
- Answer follow-up questions from the research material
|
||||
- Call ask_user() again if they might have more questions
|
||||
- When the user is satisfied: set_output("delivery_status", "completed")
|
||||
- Answer any follow-up questions from the research material
|
||||
- When the user is ready to move on, ask what they'd like to do next:
|
||||
- Research a new topic?
|
||||
- Dig deeper into the current topic?
|
||||
- Then call set_output:
|
||||
- set_output("delivery_status", "completed")
|
||||
- set_output("next_action", "new_topic") — if they want a new topic
|
||||
- set_output("next_action", "more_research") — if they want deeper research
|
||||
""",
|
||||
tools=["save_data", "serve_file_to_user", "load_data", "list_data_files"],
|
||||
tools=[
|
||||
"save_data", "append_data", "edit_data",
|
||||
"serve_file_to_user", "load_data", "list_data_files",
|
||||
],
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
|
||||
@@ -112,6 +112,8 @@ def register_all_tools(
        "execute_command_tool",
        "load_data",
        "save_data",
        "append_data",
        "edit_data",
        "list_data_files",
        "serve_file_to_user",
        "csv_read",

@@ -237,3 +237,115 @@ def register_tools(mcp: FastMCP) -> None:
            return {"files": files}
        except Exception as e:
            return {"error": f"Failed to list data files: {str(e)}"}

    @mcp.tool()
    def append_data(filename: str, data: str, data_dir: str) -> dict:
        """
        Purpose
            Append data to the end of an existing file, or create it if it
            doesn't exist yet.

        When to use
            Build large files incrementally instead of writing everything in
            one save_data call. For example, write an HTML skeleton first,
            then append each section separately to stay within token limits.

        Rules & Constraints
            filename must be a simple name like 'report.html' — no paths or '..'

        Args:
            filename: Simple filename to append to. No paths or '..'.
            data: The string data to append.
            data_dir: Absolute path to the data directory.

        Returns:
            Dict with success status, new total size, and bytes appended
        """
        if not filename or ".." in filename or "/" in filename or "\\" in filename:
            return {"error": "Invalid filename. Use simple names like 'report.html'"}
        if not data_dir:
            return {"error": "data_dir is required"}

        try:
            dir_path = Path(data_dir)
            dir_path.mkdir(parents=True, exist_ok=True)
            path = dir_path / filename
            with open(path, "a", encoding="utf-8") as f:
                f.write(data)
            appended_bytes = len(data.encode("utf-8"))
            total_bytes = path.stat().st_size
            return {
                "success": True,
                "filename": filename,
                "size_bytes": total_bytes,
                "appended_bytes": appended_bytes,
            }
        except Exception as e:
            return {"error": f"Failed to append data: {str(e)}"}

    @mcp.tool()
    def edit_data(filename: str, old_text: str, new_text: str, data_dir: str) -> dict:
        """
        Purpose
            Find and replace a specific text segment in an existing file.
            Works like a surgical diff — only the matched portion changes.

        When to use
            Update a section of a previously saved file without rewriting
            the entire content. For example, replace a placeholder in an
            HTML report or fix a specific paragraph.

        Rules & Constraints
            old_text must appear exactly once in the file. If it appears
            zero times or more than once, the edit is rejected with an
            error message.

        Args:
            filename: The file to edit. Must exist in data_dir.
            old_text: The exact text to find (must match exactly once).
            new_text: The replacement text.
            data_dir: Absolute path to the data directory.

        Returns:
            Dict with success status and updated file size
        """
        if not filename or ".." in filename or "/" in filename or "\\" in filename:
            return {"error": "Invalid filename. Use simple names like 'report.html'"}
        if not data_dir:
            return {"error": "data_dir is required"}

        try:
            path = Path(data_dir) / filename
            if not path.exists():
                return {"error": f"File not found: {filename}"}

            content = path.read_text(encoding="utf-8")
            count = content.count(old_text)

            if count == 0:
                return {
                    "error": (
                        "old_text not found in the file. "
                        "Make sure you're matching the exact text, "
                        "including whitespace and newlines."
                    )
                }
            if count > 1:
                return {
                    "error": (
                        f"old_text found {count} times — it must be unique. "
                        "Include more surrounding context to match exactly once."
                    )
                }

            updated = content.replace(old_text, new_text, 1)
            path.write_text(updated, encoding="utf-8")

            return {
                "success": True,
                "filename": filename,
                "size_bytes": len(updated.encode("utf-8")),
                "replacements": 1,
            }
        except Exception as e:
            return {"error": f"Failed to edit data: {str(e)}"}