Framework
A goal-driven agent runtime with Builder-friendly observability.
Overview
Framework is an agent runtime that captures decisions, not just actions. This lets a "Builder" LLM analyze and improve agent behavior by understanding:
- What the agent was trying to accomplish
- What options it considered
- What it chose and why
- What happened as a result
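Concretely, each decision record bundles those four elements. The hypothetical record below is for illustration only; the field names mirror the Runtime API shown later, but the actual on-disk storage format is framework-internal and may differ:

# Hypothetical decision record, for illustration only.
# Keys mirror Runtime.decide() / record_outcome() arguments.
decision_record = {
    "intent": "Choose how to process the data",      # what the agent was trying to accomplish
    "options": ["fast", "thorough"],                  # what options it considered
    "chosen": "thorough",                             # what it chose
    "reasoning": "Accuracy matters more here",        # and why
    "outcome": {"success": True, "summary": "..."},   # what happened as a result
}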
Installation
uv pip install -e .
Agent Building
See the Getting Started Guide for building agents.
Quick Start
Calculator Agent
Run an LLM-powered calculator:
# Run an exported agent
uv run python -m framework run exports/calculator --input '{"expression": "2 + 3 * 4"}'
# Interactive shell session
uv run python -m framework shell exports/calculator
# Show agent info
uv run python -m framework info exports/calculator
Using the Runtime
from framework import Runtime
runtime = Runtime("/path/to/storage")
# Start a run
run_id = runtime.start_run("my_goal", "Description of what we're doing")
# Record a decision
decision_id = runtime.decide(
    intent="Choose how to process the data",
    options=[
        {"id": "fast", "description": "Quick processing", "pros": ["Fast"], "cons": ["Less accurate"]},
        {"id": "thorough", "description": "Detailed processing", "pros": ["Accurate"], "cons": ["Slower"]},
    ],
    chosen="thorough",
    reasoning="Accuracy is more important for this task",
)
# Record the outcome
runtime.record_outcome(
    decision_id=decision_id,
    success=True,
    result={"processed": 100},
    summary="Processed 100 items with detailed analysis",
)
# End the run
runtime.end_run(success=True, narrative="Successfully processed all data")
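Failures are recorded through the same calls. A minimal sketch, assuming the same Runtime API as above; the error details in result are illustrative:

# Record a failed outcome and end the run unsuccessfully.
runtime.record_outcome(
    decision_id=decision_id,
    success=False,
    result={"processed": 0, "error": "upstream timeout"},  # free-form result dict
    summary="Processing aborted after repeated timeouts",
)
runtime.end_run(success=False, narrative="Run failed: data source unavailable")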
Testing Agents
The framework includes a goal-based testing harness for validating agent behavior.
Tests are generated using MCP tools (generate_constraint_tests, generate_success_tests) that return guidelines; Claude then writes the tests directly with the Write tool based on those guidelines. A sketch of what such a test can look like follows the commands below.
# Run tests against an agent
uv run python -m framework test-run <agent_path> --goal <goal_id> --parallel 4
# Debug failed tests
uv run python -m framework test-debug <agent_path> <test_name>
# List tests for an agent
uv run python -m framework test-list <agent_path>
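For orientation, a test written from generated guidelines might exercise the Runtime API documented above. The sketch below is hypothetical: the fixture, file layout, and assertions are illustrative, and the real guidelines define the exact shape:

# Hypothetical goal-based test written from generated guidelines.
from framework import Runtime

def test_my_goal_records_successful_run(tmp_path):
    runtime = Runtime(str(tmp_path))
    run_id = runtime.start_run("my_goal", "Smoke test for my_goal")
    decision_id = runtime.decide(
        intent="Pick a processing strategy",
        options=[{"id": "thorough", "description": "Detailed processing"}],
        chosen="thorough",
        reasoning="Only one viable option in this scenario",
    )
    runtime.record_outcome(decision_id=decision_id, success=True, result={}, summary="ok")
    runtime.end_run(success=True, narrative="Smoke test completed")
    assert run_id is not None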
For detailed testing workflows, see developer-guide.md.
Analyzing Agent Behavior with Builder
The BuilderQuery interface allows you to analyze agent runs and identify improvements:
from framework import BuilderQuery
query = BuilderQuery("/path/to/storage")
# Find patterns across runs
patterns = query.find_patterns("my_goal")
print(f"Success rate: {patterns.success_rate:.1%}")
# Analyze a failure
analysis = query.analyze_failure("run_123")
print(f"Root cause: {analysis.root_cause}")
print(f"Suggestions: {analysis.suggestions}")
# Get improvement recommendations
suggestions = query.suggest_improvements("my_goal")
for s in suggestions:
    print(f"[{s['priority']}] {s['recommendation']}")
Architecture
┌─────────────────┐
│ Human Engineer │ ← Supervision, approval
└────────┬────────┘
│
┌────────▼────────┐
│ Builder LLM │ ← Analyzes runs, suggests improvements
│ (BuilderQuery) │
└────────┬────────┘
│
┌────────▼────────┐
│ Agent LLM │ ← Executes tasks, records decisions
│ (Runtime) │
└─────────────────┘
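The layers meet at a shared storage path: the agent writes runs and decisions through Runtime, and the Builder reads them back through BuilderQuery. A minimal sketch, assuming both sides point at the same directory:

from framework import BuilderQuery, Runtime

STORAGE = "/path/to/storage"

# Agent side: writes runs and decisions.
runtime = Runtime(STORAGE)
run_id = runtime.start_run("my_goal", "One run the Builder can later inspect")
runtime.end_run(success=True, narrative="Nothing to decide this time")

# Builder side: reads the same storage to find patterns.
query = BuilderQuery(STORAGE)
patterns = query.find_patterns("my_goal")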
Key Concepts
- Decision: The atomic unit of agent behavior. Captures intent, options, choice, and reasoning.
- Run: A complete execution with all decisions and outcomes.
- Runtime: Interface agents use to record their behavior.
- BuilderQuery: Interface Builder uses to analyze agent behavior.
Requirements
- Python 3.11+
- pydantic >= 2.0
- anthropic >= 0.40.0 (for LLM-powered agents)