Hundao 8cb0531959 fix(ci): unblock main CI, sort imports + install Playwright Chromium (#7172)
* fix(lint): organize imports in queen_orchestrator.create_queen

Ruff I001 blocks CI on every PR against main. The deferred imports
inside create_queen were not in alphabetical order across the queen
and framework packages; ruff's auto-fix moves framework.config below
the framework.agents.queen.nodes block.

No behavior change.

* fix(ci): install Playwright Chromium before Test Tools job

The new chart_tools smoke tests added in feabf327 require a Chromium
build for ECharts/Mermaid rendering, but the test-tools workflow only
ran `uv sync` and went straight to pytest. Three tests
(test_render_echarts_bar_chart, test_render_echarts_accepts_string_spec,
test_render_mermaid_flowchart) crash on every PR with:

    BrowserType.launch: Executable doesn't exist at
    /home/runner/.cache/ms-playwright/chromium_headless_shell-1208/...

Split the install/run into separate steps and add `playwright install
chromium` before pytest. Use `--with-deps` on Linux to pull system
libraries; Windows runners only need the browser binary.
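
Roughly, the added step amounts to the following (the uv invocation is an
assumption; only `playwright install chromium` and `--with-deps` come from
this change):

    uv run playwright install --with-deps chromium   # Linux: browser + system libs
    uv run playwright install chromium               # Windows: browser binary only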

* fix(tests): adapt test_file_state_cache to new file_ops API

The file_ops rewrite in feabf327 dropped the standalone hashline_edit
tool (the file_system_toolkits/hashline_edit/ directory was removed)
and switched edit_file to a mode-first signature
(mode, path, old_string, new_string, ...).

The test fixture still tried to look up "hashline_edit" via the MCP
tool manager and crashed with a KeyError before any test could run.
The edit_file calls were also still positional in the old argument
order, so they hit "unknown mode 'e.py'" once the fixture was fixed.

Drop the stale hashline_edit lookup and pass mode="replace" explicitly
to every edit_file call. All 11 tests pass locally.
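
For illustration, a call in the new shape looks roughly like this (the path
and strings are made up; only the mode-first keyword usage reflects the new
signature):

    # Old positional calls put the path where the mode now goes
    # ("unknown mode 'e.py'"); pass mode explicitly instead.
    edit_file(mode="replace", path="e.py", old_string="a = 1", new_string="a = 2")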

* fix(tests): skip terminal_tools tests on Windows (POSIX-only)

The new terminal_tools package added in feabf327 imports the Unix-only
`resource` module in tools/src/terminal_tools/common/limits.py to set
RLIMIT_CPU / RLIMIT_AS / RLIMIT_FSIZE on subprocesses. Five of the
six terminal_tools test files therefore crash on windows-latest with
`ModuleNotFoundError: No module named 'resource'` once their fixtures
trigger the import chain.

test_terminal_tools_pty.py already has the right module-level skip
(PTY is POSIX-only). Apply the same `pytestmark = skipif(win32)` to
the other five so the whole suite skips cleanly on Windows. The
terminal_tools package is bash-only by design (zsh is refused at the
shell-resolver level), so a Windows port is out of scope.
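
The marker itself is the standard pytest module-level skip, roughly (the
reason string here is an assumption):

    import sys

    import pytest

    # Skip the whole module on Windows: these tests pull in the
    # POSIX-only 'resource' module via terminal_tools' limits helper.
    pytestmark = pytest.mark.skipif(
        sys.platform == "win32",
        reason="terminal_tools is POSIX-only (uses the 'resource' module)",
    )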

Framework

A goal-driven agent runtime with Builder-friendly observability.

Overview

Framework provides an agent runtime that captures decisions, not just actions. This enables a "Builder" LLM to analyze and improve agent behavior by understanding:

  • What the agent was trying to accomplish
  • What options it considered
  • What it chose and why
  • What happened as a result

Installation

uv pip install -e .

Agent Building

See the Getting Started Guide for building agents.

Quick Start

Calculator Agent

Run an LLM-powered calculator:

# Run an exported agent
uv run python -m framework run exports/calculator --input '{"expression": "2 + 3 * 4"}'

# Interactive shell session
uv run python -m framework shell exports/calculator

# Show agent info
uv run python -m framework info exports/calculator

Using the Runtime

from framework import Runtime

runtime = Runtime("/path/to/storage")

# Start a run
run_id = runtime.start_run("my_goal", "Description of what we're doing")

# Record a decision
decision_id = runtime.decide(
    intent="Choose how to process the data",
    options=[
        {"id": "fast", "description": "Quick processing", "pros": ["Fast"], "cons": ["Less accurate"]},
        {"id": "thorough", "description": "Detailed processing", "pros": ["Accurate"], "cons": ["Slower"]},
    ],
    chosen="thorough",
    reasoning="Accuracy is more important for this task"
)

# Record the outcome
runtime.record_outcome(
    decision_id=decision_id,
    success=True,
    result={"processed": 100},
    summary="Processed 100 items with detailed analysis"
)

# End the run
runtime.end_run(success=True, narrative="Successfully processed all data")

Testing Agents

The framework includes goal-based testing for validating agent behavior.

Tests are generated using MCP tools (generate_constraint_tests, generate_success_tests), which return guidelines; Claude then writes the tests directly with the Write tool based on those guidelines.

# Run tests against an agent
uv run python -m framework test-run <agent_path> --goal <goal_id> --parallel 4

# Debug failed tests
uv run python -m framework test-debug <agent_path> <test_name>

# List tests for an agent
uv run python -m framework test-list <agent_path>

For detailed testing workflows, see developer-guide.md.

Analyzing Agent Behavior with Builder

The BuilderQuery interface allows you to analyze agent runs and identify improvements:

from framework import BuilderQuery

query = BuilderQuery("/path/to/storage")

# Find patterns across runs
patterns = query.find_patterns("my_goal")
print(f"Success rate: {patterns.success_rate:.1%}")

# Analyze a failure
analysis = query.analyze_failure("run_123")
print(f"Root cause: {analysis.root_cause}")
print(f"Suggestions: {analysis.suggestions}")

# Get improvement recommendations
suggestions = query.suggest_improvements("my_goal")
for s in suggestions:
    print(f"[{s['priority']}] {s['recommendation']}")

Architecture

┌─────────────────┐
│  Human Engineer │  ← Supervision, approval
└────────┬────────┘
         │
┌────────▼────────┐
│   Builder LLM   │  ← Analyzes runs, suggests improvements
│  (BuilderQuery) │
└────────┬────────┘
         │
┌────────▼────────┐
│   Agent LLM     │  ← Executes tasks, records decisions
│    (Runtime)    │
└─────────────────┘

Key Concepts

  • Decision: The atomic unit of agent behavior. Captures intent, options, choice, and reasoning.
  • Run: A complete execution with all decisions and outcomes.
  • Runtime: Interface agents use to record their behavior.
  • BuilderQuery: Interface Builder uses to analyze agent behavior.

Requirements

  • Python 3.11+
  • pydantic >= 2.0
  • anthropic >= 0.40.0 (for LLM-powered agents)