feat: automated test agent skill

Timothy
2026-02-09 12:39:20 -08:00
parent 9d11f834b8
commit faf534511b
4 changed files with 132 additions and 14 deletions
+12 -11
@@ -138,10 +138,10 @@ Two execution paths, use the right one for your situation.
Run the agent via CLI. This creates sessions with checkpoints at `~/.hive/agents/{agent_name}/sessions/`:
```bash
-PYTHONPATH=core:exports uv run python -m {agent_name} --tui
+uv run hive run exports/{agent_name} --input '{"query": "test topic"}'
```
-The TUI lets you interact with client-facing nodes and see real-time execution. Sessions and checkpoints are saved automatically.
+Sessions and checkpoints are saved automatically. For agents with client-facing nodes that require user interaction, the user must launch the TUI manually in a separate terminal (Claude Code cannot interact with TUI apps).
#### Automated regression (for CI or final verification)
@@ -334,7 +334,7 @@ Resume when ALL of these are true:
```bash
# Resume from the last clean checkpoint before the failing node
-PYTHONPATH=core:exports uv run python -m {agent_name} --tui \
+uv run hive run exports/{agent_name} \
--resume-session session_20260209_143022_abc12345 \
--checkpoint cp_node_complete_research_143030
```
@@ -350,7 +350,7 @@ Re-run when ANY of these are true:
- You changed the graph structure (added/removed nodes/edges)
```bash
-PYTHONPATH=core:exports uv run python -m {agent_name} --tui
+uv run hive run exports/{agent_name} --input '{"query": "test topic"}'
```
#### Inspecting a checkpoint before resuming
@@ -696,7 +696,7 @@ run_tests(goal_id, agent_path, test_types='["success"]')
```bash
# Iterative debugging with checkpoints (via CLI)
-PYTHONPATH=core:exports uv run python -m {agent_name} --tui
+uv run hive run exports/{agent_name} --input '{"query": "test"}'
```
### Phase 3: Analysis
@@ -739,8 +739,8 @@ get_agent_checkpoint(agent_work_dir, session_id, checkpoint_id)
```
```bash
-# Resume from checkpoint via CLI
-PYTHONPATH=core:exports uv run python -m {agent_name} --tui \
+# Resume from checkpoint via CLI (headless)
+uv run hive run exports/{agent_name} \
--resume-session {session_id} --checkpoint {checkpoint_id}
```
@@ -757,8 +757,9 @@ PYTHONPATH=core:exports uv run python -m {agent_name} --tui \
| Write 30+ tests | Write 8-15 focused tests |
| Skip credential check | Use `/hive-credentials` before testing |
| Confuse `exports/` with `~/.hive/agents/` | Code in `exports/`, runtime data in `~/.hive/` |
-| Use `run_tests` for iterative debugging | Use CLI with checkpoints for iterative debugging |
-| Use CLI for final regression | Use `run_tests` for automated regression |
+| Use `run_tests` for iterative debugging | Use headless CLI with checkpoints for iterative debugging |
+| Use headless CLI for final regression | Use `run_tests` for automated regression |
+| Use `--tui` from Claude Code | Use headless `run` command — TUI hangs in non-interactive shells |
| Run tests without reading goal first | Always understand the goal before writing tests |
| Skip Phase 3 analysis and guess | Use session + log tools to identify root cause |
@@ -866,7 +867,7 @@ list_agent_checkpoints(
# → cp_node_complete_intake_150005
# Resume from after intake, re-run research with fixed prompt
-PYTHONPATH=core:exports uv run python -m deep_research_agent --tui \
+uv run hive run exports/deep_research_agent \
--resume-session session_20260209_150000_abc12345 \
--checkpoint cp_node_complete_intake_150005
```
@@ -874,7 +875,7 @@ PYTHONPATH=core:exports uv run python -m deep_research_agent --tui \
Or for this simple case (intake is fast), just re-run:
```bash
-PYTHONPATH=core:exports uv run python -m deep_research_agent --tui
+uv run hive run exports/deep_research_agent --input '{"topic": "test"}'
```
### Phase 6: Final verification
@@ -259,7 +259,7 @@ The fix is to the `report` node (the last node). To demonstrate checkpoint recov
```bash
# Run via CLI to get checkpoints
-PYTHONPATH=core:exports uv run python -m deep_research_agent --tui
+uv run hive run exports/deep_research_agent --input '{"topic": "climate change effects"}'
# After it runs, find the clean checkpoint before report
list_agent_checkpoints(
@@ -270,7 +270,7 @@ list_agent_checkpoints(
# → cp_node_complete_review_152100 (after review, before report)
# Resume — skips intake, research, review entirely
-PYTHONPATH=core:exports uv run python -m deep_research_agent --tui \
+uv run hive run exports/deep_research_agent \
--resume-session session_20260209_152000_ghi34567 \
--checkpoint cp_node_complete_review_152100
```
+76 -1
@@ -332,6 +332,60 @@ def register_commands(subparsers: argparse._SubParsersAction) -> None:
resume_parser.set_defaults(func=cmd_resume)
def _load_resume_state(
agent_path: str, session_id: str, checkpoint_id: str | None = None
) -> dict | None:
"""Load session or checkpoint state for headless resume.
Args:
agent_path: Path to the agent folder (e.g., exports/my_agent)
session_id: Session ID to resume from
checkpoint_id: Optional checkpoint ID within the session
Returns:
session_state dict for executor, or None if not found
"""
agent_name = Path(agent_path).name
agent_work_dir = Path.home() / ".hive" / "agents" / agent_name
session_dir = agent_work_dir / "sessions" / session_id
if not session_dir.exists():
return None
if checkpoint_id:
# Checkpoint-based resume: load checkpoint and extract state
cp_path = session_dir / "checkpoints" / f"{checkpoint_id}.json"
if not cp_path.exists():
return None
try:
cp_data = json.loads(cp_path.read_text())
except (json.JSONDecodeError, OSError):
return None
return {
"memory": cp_data.get("shared_memory", {}),
"paused_at": cp_data.get("next_node") or cp_data.get("current_node"),
"execution_path": cp_data.get("execution_path", []),
"node_visit_counts": {},
}
else:
# Session state resume: load state.json
state_path = session_dir / "state.json"
if not state_path.exists():
return None
try:
state_data = json.loads(state_path.read_text())
except (json.JSONDecodeError, OSError):
return None
progress = state_data.get("progress", {})
paused_at = progress.get("paused_at") or progress.get("resume_from")
return {
"memory": state_data.get("memory", {}),
"paused_at": paused_at,
"execution_path": progress.get("path", []),
"node_visit_counts": progress.get("node_visit_counts", {}),
}
def cmd_run(args: argparse.Namespace) -> int:
"""Run an exported agent."""
import logging
@@ -421,6 +475,27 @@ def cmd_run(args: argparse.Namespace) -> int:
print(f"Error: {e}", file=sys.stderr)
return 1
# Load session/checkpoint state for resume (headless mode)
session_state = None
resume_session = getattr(args, "resume_session", None)
checkpoint = getattr(args, "checkpoint", None)
if resume_session:
session_state = _load_resume_state(args.agent_path, resume_session, checkpoint)
if session_state is None:
print(
f"Error: Could not load session state for {resume_session}",
file=sys.stderr,
)
return 1
if not args.quiet:
# paused_at may be stored as None; fall back to "unknown" in that case too
resume_node = session_state.get("paused_at") or "unknown"
if checkpoint:
print(f"Resuming from checkpoint: {checkpoint}")
else:
print(f"Resuming session: {resume_session}")
print(f"Resume point: {resume_node}")
print()
# Auto-inject user_id if the agent expects it but it's not provided
entry_input_keys = runner.graph.nodes[0].input_keys if runner.graph.nodes else []
if "user_id" in entry_input_keys and context.get("user_id") is None:
@@ -440,7 +515,7 @@ def cmd_run(args: argparse.Namespace) -> int:
print("=" * 60)
print()
-result = asyncio.run(runner.run(context))
+result = asyncio.run(runner.run(context, session_state=session_state))
# Format output
output = {
+42
@@ -0,0 +1,42 @@
# Why Conditional Edges Need Priority (Function Nodes)
## The problem
Function nodes return everything they computed. They don't pick one output key — they return all of them.
```python
def score_lead(inputs):
score = compute_score(inputs["profile"])
return {
"score": score,
"is_high_value": score > 80,
"needs_enrichment": score > 50 and not inputs["profile"].get("company"),
}
```
Lead comes in: score 92, no company on file. Output: `{"score": 92, "is_high_value": True, "needs_enrichment": True}`.
Two conditional edges leaving this node:
```
Edge A: needs_enrichment == True → enrichment node
Edge B: is_high_value == True → outreach node
```
Both are true. Without priority, the graph either fans out to both (wrong — you'd email someone while still enriching their data) or picks one randomly (wrong — non-deterministic).
## Priority fixes it
```
Edge A: needs_enrichment == True priority=2 (higher = checked first)
Edge B: is_high_value == True priority=1
Edge C: is_high_value == False priority=0
```
Executor keeps only the highest-priority matching group. A wins. Lead gets enriched first, loops back, gets re-scored — now `needs_enrichment` is false, B wins, outreach happens.
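As a sketch, the selection rule amounts to "filter edges whose condition matches, then take the max by priority." This is an illustrative standalone function with hypothetical names, not the actual executor API:

```python
# Illustrative sketch of priority-based edge selection (hypothetical names,
# not the real executor). Each edge is (output_key, expected_value, priority).
def select_edge(output: dict, edges: list[tuple[str, object, int]]):
    matching = [edge for edge in edges if output.get(edge[0]) == edge[1]]
    # Keep only the highest-priority matching edge; None if nothing matches.
    return max(matching, key=lambda edge: edge[2]) if matching else None

edges = [
    ("needs_enrichment", True, 2),   # Edge A: checked first
    ("is_high_value", True, 1),      # Edge B
    ("is_high_value", False, 0),     # Edge C
]

# First pass: both A and B match; A wins on priority, so enrichment runs.
first = select_edge(
    {"score": 92, "is_high_value": True, "needs_enrichment": True}, edges
)
# After enrichment and re-scoring, only B matches, so outreach runs.
second = select_edge(
    {"score": 92, "is_high_value": True, "needs_enrichment": False}, edges
)
```

The two calls trace the loop described above: the first pass routes to enrichment, and once re-scoring flips `needs_enrichment` to false, the same rule falls through to Edge B.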
## Why event loop nodes don't need this
The LLM understands "if/else." You tell it in the prompt: "if needs enrichment, set `needs_enrichment`. Otherwise if high value, set `approved`." It picks one. Only one conditional edge matches.
A function just returns a dict. It doesn't do "otherwise." Priority is the "otherwise" for function nodes.