Kavin a881fe68da fix(llm): ensure store=False is passed to Codex Responses API (#7089)

Forces store: false into the extra_body payload for Codex-style models
so that LiteLLM successfully passes it down to the ChatGPT Responses
API backend, fixing the BadRequestError.

Fixes #7056.

Original investigation and first PR by @Darshan174 (#7065).

Co-authored-by: Darshan174 <Darshan002321@gmail.com>

2026-04-20 17:54:41 +08:00

Name | Last commit | Date
examples | refactor: big test cleanup | 2026-04-09 22:04:23 -07:00
framework | fix(llm): ensure store=False is passed to Codex Responses API (#7089) | 2026-04-20 17:54:41 +08:00
frontend | fix: resolve merge conflict markers and ruff issues | 2026-04-18 21:45:11 -07:00
tests | fix(ci): unbreak main, ruff format browser and refresh test_model_catalog (#7095) | 2026-04-20 17:23:26 +08:00
.gitignore | refactor: move the coder tools | 2026-03-06 14:56:19 -08:00
.mcp.json | refactor: move the coder tools | 2026-03-06 14:56:19 -08:00
antigravity_auth.py | fix: resolve all ruff lint and format errors across codebase (#7058) | 2026-04-16 19:30:01 +08:00
codex_oauth.py | chore: pull latest changes | 2026-03-26 21:54:56 +05:30
MCP_BUILDER_TOOLS_GUIDE.md | fix(arch): remove all deprecated concepts and deadcodes | 2026-02-17 10:59:15 -08:00
MCP_INTEGRATION_GUIDE.md | docs: clarify required vs optional fields for Unix and SSE transports | 2026-03-29 10:29:47 -07:00
MCP_SERVER_GUIDE.md | fix: replace old tool name reference | 2026-03-09 18:40:01 -07:00
pyproject.toml | test: align stale tests with current behavior | 2026-04-18 22:02:03 -07:00
README.md | fix: replace old tool name reference | 2026-03-09 18:40:01 -07:00
setup_mcp.sh | refactor: move the coder tools | 2026-03-06 14:56:19 -08:00
uv.lock | update ci to use uv, updated linting | 2026-02-03 12:14:13 -08:00

README.md

Framework

A goal-driven agent runtime with Builder-friendly observability.

Overview

Framework is an agent runtime that captures decisions, not just actions. Recording decisions lets a "Builder" LLM analyze and improve agent behavior by understanding:

What the agent was trying to accomplish
What options it considered
What it chose and why
What happened as a result

Installation

uv pip install -e .

Agent Building

Agent scaffolding is handled by the coder-tools MCP server, which provides the initialize_and_build_agent tool and related utilities. Both the server and the package generation logic live in tools/coder_tools_server.py.

See the Getting Started Guide for building agents.

Quick Start

Calculator Agent

Run an LLM-powered calculator:

# Run an exported agent
uv run python -m framework run exports/calculator --input '{"expression": "2 + 3 * 4"}'

# Interactive shell session
uv run python -m framework shell exports/calculator

# Show agent info
uv run python -m framework info exports/calculator
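
When the run command is launched from a script rather than typed by hand, quoting the JSON payload by hand is easy to get wrong. The sketch below builds the same argument list as the first command above and leaves JSON encoding to the standard library. The helper name build_run_command is hypothetical and not part of the framework; only the CLI arguments shown above are assumed.

```python
import json
import shlex


def build_run_command(agent_path: str, payload: dict) -> list[str]:
    """Return the argv list for `framework run` with a JSON-encoded --input.

    Hypothetical helper: it only assembles the command shown in the README.
    """
    return [
        "uv", "run", "python", "-m", "framework", "run",
        agent_path,
        "--input", json.dumps(payload),
    ]


cmd = build_run_command("exports/calculator", {"expression": "2 + 3 * 4"})

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd) avoids shell quoting altogether, since each element reaches the process as a separate argument.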

Using the Runtime

from framework import Runtime

runtime = Runtime("/path/to/storage")

# Start a run
run_id = runtime.start_run("my_goal", "Description of what we're doing")

# Record a decision
decision_id = runtime.decide(
    intent="Choose how to process the data",
    options=[
        {"id": "fast", "description": "Quick processing", "pros": ["Fast"], "cons": ["Less accurate"]},
        {"id": "thorough", "description": "Detailed processing", "pros": ["Accurate"], "cons": ["Slower"]},
    ],
    chosen="thorough",
    reasoning="Accuracy is more important for this task"
)

# Record the outcome
runtime.record_outcome(
    decision_id=decision_id,
    success=True,
    result={"processed": 100},
    summary="Processed 100 items with detailed analysis"
)

# End the run
runtime.end_run(success=True, narrative="Successfully processed all data")

Testing Agents

The framework includes goal-based testing for validating agent behavior.

Test authoring is guided by two MCP tools, generate_constraint_tests and generate_success_tests, which return guidelines rather than test code. Claude then writes the tests directly with the Write tool, following those guidelines.

# Run tests against an agent
uv run python -m framework test-run <agent_path> --goal <goal_id> --parallel 4

# Debug failed tests
uv run python -m framework test-debug <agent_path> <test_name>

# List tests for an agent
uv run python -m framework test-list <agent_path>

For detailed testing workflows, see developer-guide.md.

Analyzing Agent Behavior with Builder

The BuilderQuery interface allows you to analyze agent runs and identify improvements:

from framework import BuilderQuery

query = BuilderQuery("/path/to/storage")

# Find patterns across runs
patterns = query.find_patterns("my_goal")
print(f"Success rate: {patterns.success_rate:.1%}")

# Analyze a failure
analysis = query.analyze_failure("run_123")
print(f"Root cause: {analysis.root_cause}")
print(f"Suggestions: {analysis.suggestions}")

# Get improvement recommendations
suggestions = query.suggest_improvements("my_goal")
for s in suggestions:
    print(f"[{s['priority']}] {s['recommendation']}")
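
The loop above prints suggestions in whatever order they arrive. A small sketch of grouping them by priority first is shown below. It relies only on the 'priority' and 'recommendation' keys used in the loop; the helper name group_by_priority and the sample data, including the priority labels "high" and "low", are hypothetical.

```python
from collections import defaultdict


def group_by_priority(suggestions: list[dict]) -> dict[str, list[str]]:
    """Map each priority label to its recommendations, preserving input order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for s in suggestions:
        grouped[s["priority"]].append(s["recommendation"])
    return dict(grouped)


# Hypothetical sample data shaped like the dictionaries in the loop above.
sample = [
    {"priority": "high", "recommendation": "Add input validation"},
    {"priority": "low", "recommendation": "Shorten the prompt"},
    {"priority": "high", "recommendation": "Retry on timeout"},
]

for priority, recommendations in group_by_priority(sample).items():
    print(priority, recommendations)
```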

Architecture

┌─────────────────┐
│  Human Engineer │  ← Supervision, approval
└────────┬────────┘
         │
┌────────▼────────┐
│   Builder LLM   │  ← Analyzes runs, suggests improvements
│  (BuilderQuery) │
└────────┬────────┘
         │
┌────────▼────────┐
│   Agent LLM     │  ← Executes tasks, records decisions
│    (Runtime)    │
└─────────────────┘

Key Concepts

Decision: The atomic unit of agent behavior. Captures intent, options, choice, and reasoning.
Run: A complete execution with all decisions and outcomes.
Runtime: Interface agents use to record their behavior.
BuilderQuery: Interface Builder uses to analyze agent behavior.
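
To make the relationship between these concepts concrete, here is an illustrative data model. It is a sketch under stated assumptions, not the framework's actual schema: every class and field name is hypothetical, chosen to mirror the keyword arguments in the Runtime example, and it uses standard-library dataclasses rather than pydantic to stay dependency-free.

```python
from dataclasses import dataclass, field


@dataclass
class Option:
    id: str
    description: str
    pros: list[str] = field(default_factory=list)
    cons: list[str] = field(default_factory=list)


@dataclass
class Decision:
    """Atomic unit: intent, options considered, the choice, and why."""
    id: str
    intent: str
    options: list[Option]
    chosen: str
    reasoning: str


@dataclass
class Outcome:
    decision_id: str
    success: bool
    summary: str


@dataclass
class Run:
    """A complete execution: all decisions plus their outcomes."""
    goal_id: str
    decisions: list[Decision] = field(default_factory=list)
    outcomes: dict[str, Outcome] = field(default_factory=dict)

    def pending(self) -> list[Decision]:
        """Decisions that have no recorded outcome yet."""
        return [d for d in self.decisions if d.id not in self.outcomes]


run = Run(goal_id="my_goal")
run.decisions.append(
    Decision(
        id="d1",
        intent="Choose how to process the data",
        options=[Option("fast", "Quick processing"), Option("thorough", "Detailed processing")],
        chosen="thorough",
        reasoning="Accuracy is more important for this task",
    )
)
print(len(run.pending()))  # one decision awaiting an outcome
run.outcomes["d1"] = Outcome("d1", True, "Processed 100 items")
print(len(run.pending()))  # none left
```

Keeping outcomes separate from decisions reflects the two-step recording flow, where a decision is logged before its result is known.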

Requirements

Python 3.11+
pydantic >= 2.0
anthropic >= 0.40.0 (for LLM-powered agents)