docs: add key concept section

# Evolution

## Evolution Is the Mechanism; Adaptiveness Is the Result

Agents don't just fail; they fail inevitably. Real-world variables—private LinkedIn profiles, shifting API schemas, or LLM hallucinations—are impossible to predict in a vacuum. The first version of any agent is merely a "happy path" draft.

Evolution is how Hive handles this. When an agent fails, the framework captures what went wrong — which node failed, which success criteria weren't met, what the agent tried and why it didn't work. Then a coding agent (Claude Code, Cursor, or similar) uses that failure data to generate an improved version of the agent. The new version gets deployed, runs, encounters new edge cases, and the cycle continues.

Over generations, the agent gets more reliable. Not because someone sat down and anticipated every possible failure, but because each failure teaches the next version something specific.

## How It Works

The evolution loop has four stages:

**1. Execute** — The worker agent runs against real inputs. Sessions produce outcomes, decisions, and metrics.

**2. Evaluate** — The framework checks outcomes against the goal's success criteria and constraints. Did the agent produce the desired result? Which criteria were satisfied and which weren't? Were any constraints violated?

**3. Diagnose** — Failure data is structured and specific. It's not just "the agent failed" — it's "node `draft_message` failed to produce personalized content because the research node returned insufficient data about the prospect's recent activity." The decision log, problem reports, and execution trace provide the full picture.

**4. Regenerate** — A coding agent receives the diagnosis and the current agent code. It modifies the graph — adding nodes, adjusting prompts, changing edge conditions, adding tools — to address the specific failure. The new version is deployed and the cycle restarts.

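To make the diagnosis stage concrete, here is a minimal sketch of what structured failure data might look like. The `Diagnosis` shape and its field names are illustrative assumptions, not Hive's actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class Diagnosis:
    """Hypothetical shape of the failure data handed to the coding agent."""
    failed_node: str                     # which node failed
    unmet_criteria: list[str]            # success criteria that scored too low
    root_cause: str                      # what the execution trace points to
    attempts: list[str] = field(default_factory=list)  # what the agent tried


diagnosis = Diagnosis(
    failed_node="draft_message",
    unmet_criteria=["personalized"],
    root_cause="research node returned insufficient data about recent activity",
    attempts=["generic template fallback", "retry with same research output"],
)
```
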
## Adaptiveness ≠ Intelligence or Intent

An important distinction: evolution makes agents more adaptive, but not more intelligent in any general sense. The agent isn't learning to reason better — it's being rewritten to handle more situations correctly.

This is closer to how biological evolution works than how learning works. A species doesn't "learn" to survive winter — individuals that happen to have thicker fur survive, and that trait gets selected for. Similarly, agent versions that handle more edge cases correctly survive in production, and the patterns that made them successful get carried forward.

The practical implication: don't expect evolution to make an agent smarter about problems it's never seen. Evolution improves reliability on the *kinds* of problems the agent has already encountered. For genuinely novel situations, that's what human-in-the-loop is for — and every time a human steps in, that interaction becomes potential fuel for the next evolution cycle.

## What Gets Evolved

Evolution can change almost anything about an agent:

**Prompts** — The most common fix. A node's system prompt gets refined based on the specific ways the LLM misunderstood its instructions.

**Graph structure** — Adding a validation node before a critical step, splitting a node that's trying to do too much, adding a fallback path for a common failure mode.

**Edge conditions** — Adjusting routing logic based on observed patterns. If low-confidence research results consistently lead to bad drafts, add a conditional edge that routes them back for another research pass.

**Tool selection** — Swapping in a better tool, adding a new one, or removing one that causes more problems than it solves.

**Constraints and criteria** — Tightening or loosening based on what's actually achievable and what matters in practice.

## The Role of Decision Logging

Evolution depends on good data. The runtime captures every decision an agent makes: what it was trying to do, what options it considered, what it chose, and what happened as a result. This isn't overhead — it's the signal that makes evolution possible.

Without decision logging, failure analysis is guesswork. With it, the coding agent can trace a failure back to its root cause and make a targeted fix rather than a blind change.

# Goals & Outcome-Driven Development

## The Core Idea

Business processes are outcome-driven. A sales team doesn't follow a rigid script — they adapt their approach until the deal closes. A support agent doesn't execute a flowchart — they resolve the customer's issue. The outcome is what matters, not the specific steps taken to get there.

Hive is built on this principle. Instead of hardcoding agent workflows step by step, you define the outcome you want, and the framework figures out how to get there. We call this **Outcome-Driven Development (ODD)**.

## Task-Driven vs Goal-Driven vs Outcome-Driven

These three paradigms represent different levels of abstraction for building agents:

**Task-Driven Development (TDD)** asks: *"Is the code correct?"*

You define explicit steps. The agent follows them. Success means the steps ran without errors. The problem: an agent can execute every step perfectly and still produce a useless result. The steps become the goal, not the actual outcome.

**Goal-Driven Development (GDD)** asks: *"Are we solving the right problem?"*

You define what you want to achieve. The agent plans and executes toward that goal. Better than TDD because it captures intent. But goals can be vague — "improve customer satisfaction" doesn't tell you when you're done.

**Outcome-Driven Development (ODD)** asks: *"Did the system produce the desired result?"*

You define measurable success criteria, hard constraints, and the context the agent needs. The agent is evaluated against the actual outcome, not whether it followed the right steps or aimed at the right goal. This is what Hive implements.

## Goals as First-Class Citizens

In Hive, a `Goal` is not a string description. It's a structured object with three components:

### Success Criteria

Each goal has weighted success criteria that define what "done" looks like. These aren't binary pass/fail checks — they're multi-dimensional measures of quality.

```python
Goal(
    id="twitter-outreach",
    name="Personalized Twitter Outreach",
    success_criteria=[
        SuccessCriterion(
            id="personalized",
            description="Messages reference specific details from the prospect's profile",
            metric="llm_judge",
            weight=0.4
        ),
        SuccessCriterion(
            id="compliant",
            description="Messages follow brand voice guidelines",
            metric="llm_judge",
            weight=0.3
        ),
        SuccessCriterion(
            id="actionable",
            description="Each message includes a clear call to action",
            metric="output_contains",
            target="CTA",
            weight=0.3
        ),
    ],
    ...
)
```

Metrics can be `output_contains`, `output_equals`, `llm_judge`, or `custom`. Weights let you express what matters most — a perfectly compliant message that isn't personalized still falls short.

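As an illustration of how weighted criteria combine (the scoring code here is a sketch, not Hive's implementation), the overall score is just the weight-sum of per-criterion scores:

```python
# Per-criterion scores in [0, 1], e.g. from an LLM judge or a string check.
scores = {"personalized": 0.2, "compliant": 1.0, "actionable": 1.0}
weights = {"personalized": 0.4, "compliant": 0.3, "actionable": 0.3}

# Weighted sum: a perfectly compliant but impersonal message still scores low.
overall = sum(weights[c] * scores[c] for c in weights)
print(overall)  # 0.4 * 0.2 + 0.3 * 1.0 + 0.3 * 1.0 = 0.68
```
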
### Constraints

Constraints define what must **not** happen. They're the guardrails.

```python
constraints=[
    Constraint(
        id="no_spam",
        description="Never send more than 3 messages to the same person per week",
        constraint_type="hard",  # Violation = immediate escalation
        category="safety"
    ),
    Constraint(
        id="budget_limit",
        description="Total LLM cost must not exceed $5 per run",
        constraint_type="soft",  # Violation = warning, not a hard stop
        category="cost"
    ),
]
```

Hard constraints are non-negotiable — violating one triggers escalation or failure. Soft constraints are preferences that the agent should respect but can bend when necessary. Constraint categories include `time`, `cost`, `safety`, `scope`, and `quality`.

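The hard/soft distinction implies enforcement logic roughly like the following sketch. `check_constraints` and the namedtuple stand-in are hypothetical, for illustration only:

```python
from collections import namedtuple

# Minimal stand-in for the Constraint object shown above.
Constraint = namedtuple("Constraint", ["id", "constraint_type"])


def check_constraints(constraints, violated_ids):
    """Illustrative enforcement: hard violations escalate, soft ones warn."""
    for c in constraints:
        if c.id not in violated_ids:
            continue
        if c.constraint_type == "hard":
            raise RuntimeError(f"hard constraint violated: {c.id}")  # escalate / fail
        print(f"warning: soft constraint bent: {c.id}")  # note it and keep going


check_constraints(
    [Constraint("no_spam", "hard"), Constraint("budget_limit", "soft")],
    violated_ids={"budget_limit"},  # only the soft one tripped: warning, no stop
)
```
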
### Context

Goals carry context — domain knowledge, preferences, and background information that the agent needs to make good decisions. This context is injected into every LLM call the agent makes, so the agent is always reasoning with the full picture.

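One way to picture that injection, as a sketch rather than Hive's actual API: prepend the goal's context to each node's system prompt.

```python
def build_system_prompt(node_instructions: str, goal_context: dict) -> str:
    """Illustrative: prepend goal context so every LLM call sees the full picture."""
    context_lines = "\n".join(f"- {k}: {v}" for k, v in goal_context.items())
    return f"Context:\n{context_lines}\n\nInstructions:\n{node_instructions}"


print(build_system_prompt(
    "Draft a personalized outreach message.",
    {"brand_voice": "friendly, concise", "audience": "developer-tool founders"},
))
```
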
## Why This Matters

When you define goals with weighted criteria and constraints, three things happen:

1. **The agent can self-correct.** Goals are injected into every LLM call, so the agent is always reasoning against its success criteria. Within a [graph execution](./graph.md), nodes use these criteria to decide whether to accept their output, retry, or escalate — self-correction in real time.

2. **Evolution has a target.** When an agent fails, the framework knows *which criteria* it fell short on, which gives the coding agent specific information to improve the next generation (see [Evolution](./evolution.md)).

3. **Humans stay in control.** Constraints define the boundaries. The agent has freedom to find creative solutions within those boundaries, but it can't cross the lines you've drawn.

The goal lifecycle flows through `DRAFT → READY → ACTIVE → COMPLETED / FAILED / SUSPENDED`, giving you visibility into where each objective stands at any point during execution.

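As a sketch, the lifecycle could be modeled as a state enum with an explicit transition table. The state names come from the flow above; the allowed transitions are an assumption:

```python
from enum import Enum


class GoalStatus(Enum):
    DRAFT = "draft"
    READY = "ready"
    ACTIVE = "active"
    COMPLETED = "completed"
    FAILED = "failed"
    SUSPENDED = "suspended"


# Illustrative transition table: which states each state may move to.
TRANSITIONS = {
    GoalStatus.DRAFT: {GoalStatus.READY},
    GoalStatus.READY: {GoalStatus.ACTIVE},
    GoalStatus.ACTIVE: {GoalStatus.COMPLETED, GoalStatus.FAILED, GoalStatus.SUSPENDED},
    GoalStatus.SUSPENDED: {GoalStatus.ACTIVE},
}
```
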
# The Agent Graph

## Why a Graph

Real business processes aren't linear. A sales outreach might go: research a prospect, draft a message, realize the research is thin, go back and dig deeper, draft again, get human approval, send. There are loops, branches, fallbacks, and decision points.

Hive models this as a directed graph. Nodes do work, edges connect them, and shared memory lets them pass data. The framework walks this structure — running nodes, following edges, managing retries — until the agent reaches its goal or exhausts its step budget.

Edges can loop back, creating feedback cycles where an agent retries a step or takes a different path. That's intentional. A graph that only moves forward can't self-correct.

## Nodes

A node is a unit of work. Each node reads inputs from shared memory, does something, and writes outputs back. There are a handful of node types, each suited to a different kind of work:

**`event_loop`** — The workhorse. This is a multi-turn LLM loop: the model reasons about the current state, calls tools, observes results, and keeps going until it has produced the required outputs. Most of the interesting agent behavior happens in these nodes. They handle long-running tasks, manage their own context window, and can recover from crashes mid-conversation.

**`function`** — A plain Python function. No LLM involved. Use these for anything deterministic: data transformation, API calls with known parameters, validation logic, or any step where you don't want a language model making judgment calls.

**`router`** — A decision point that directs execution down different paths. Can be rule-based ("if confidence is high, go left; otherwise, go right") or LLM-powered ("given the goal and what we know so far, which path makes sense?").

**`human_input`** — A pause point where the agent stops and asks a human for input before continuing. See [Human-in-the-Loop](#human-in-the-loop) below.

There are also simpler LLM node types (`llm_tool_use` for a single LLM call with tools, `llm_generate` for pure text generation) for steps that don't need the full event loop.

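To make the roster concrete, here is a hedged sketch of how a tiny graph might be declared. The dict-based layout, the `fn` hook, and all field names are illustrative assumptions, not Hive's real API; only the node type names come from the list above:

```python
# Hypothetical sketch, not Hive's real API: one node of each kind from the roster.

def validate_research(memory: dict) -> dict:
    """A `function` node body: deterministic, no LLM judgment involved."""
    profile = memory.get("prospect_profile", {})
    ok = bool(profile.get("name")) and bool(profile.get("recent_activity"))
    return {"research_ok": ok}


graph = {
    "research": {"type": "event_loop", "outputs": ["prospect_profile"]},
    "validate": {"type": "function", "fn": validate_research},
    "route":    {"type": "router", "paths": ["draft", "research"]},
    "approve":  {"type": "human_input", "question": "Send this message?"},
}
```
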
### Self-Correction Within a Node

The most important behavior in an `event_loop` node is the ability to self-correct. After each iteration, the node evaluates its own output: did it produce what was needed? If yes, it's done. If not, it tries again — but this time it sees what went wrong and adjusts.

This is the **reflexion pattern**: try, evaluate, learn from the result, try again. It's cheaper and more effective than starting over. An agent that takes three attempts to get something right is still more useful than one that fails on the first try and gives up.

Within a single node, the outcomes are:

- **Accept** — Output meets the bar. Move on.
- **Retry** — Not good enough, but recoverable. Try again with feedback.
- **Escalate** — Something is fundamentally broken. Hand off to error handling.

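In pseudocode-flavored Python, the pattern reads roughly like this (a sketch of reflexion in general, not the framework's implementation):

```python
def run_event_loop_node(produce, evaluate, max_iterations=3):
    """Reflexion sketch: try, evaluate, feed the critique back, try again."""
    feedback = None
    for _ in range(max_iterations):
        output = produce(feedback)            # attempt, informed by prior critique
        verdict, feedback = evaluate(output)  # "accept", or "retry" with feedback
        if verdict == "accept":
            return output
    raise RuntimeError("escalate: iteration budget exhausted")  # error handling takes over
```

The key detail is that `produce` receives the previous critique, so each attempt is informed by the last failure rather than starting blind.
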
This is self-correction *within a session* — the agent adapting in real time. It's different from [evolution](./evolution.md), which improves the agent *across sessions* by rewriting its code between generations. Both matter: reflexion handles the bumps in a single run, evolution handles the patterns that keep recurring across many runs.

## Edges

Edges control flow between nodes. Each edge has a condition:

- **On success** — follow this edge if the source node succeeded
- **On failure** — follow if the source failed (this is how you wire up fallback paths and error recovery)
- **Conditional** — follow if an expression is true (e.g., route high-confidence results one way, low-confidence results another)
- **LLM-decided** — let the LLM choose which path based on the [goal](./goals_outcome.md) and current context

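A hedged sketch of how such conditions might be expressed, with each edge as a target plus a predicate over shared memory (the table layout and key names are assumptions):

```python
# Illustrative edge table: each edge pairs a target node with a condition.
edges = {
    "research": [
        ("draft", lambda m: m.get("research_ok", False)),  # conditional: results look solid
        ("research", lambda m: True),                      # otherwise loop back for another pass
    ],
    "draft": [("human_review", lambda m: True)],           # on success, go to review
}


def next_node(current: str, memory: dict) -> str:
    """Follow the first outgoing edge whose condition holds."""
    for target, condition in edges[current]:
        if condition(memory):
            return target
    raise RuntimeError(f"no edge matched from {current}")
```
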
Edges also handle data plumbing between nodes — mapping one node's outputs to another node's expected inputs, so each node has a clean interface without needing to know where its data came from.

When a node has multiple outgoing edges, the framework can run those branches in parallel and reconverge when they're all done. This is useful for tasks like researching a prospect from multiple sources simultaneously.

## Shared Memory

Shared memory is how nodes communicate. It's a key-value store scoped to a single [session](./worker_agent.md). Every node declares which keys it reads and which it writes, and the framework enforces those boundaries — a node can't quietly access data it hasn't declared.

Data flows through the graph in a natural way: input arrives at the start, each node reads what it needs and writes what it produces, and edges map outputs to inputs as data moves between nodes. At the end, the full memory state is the execution result.

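A minimal sketch of the declared-keys idea (illustrative; Hive's actual store will differ):

```python
class ScopedMemory:
    """Key-value store that only honors a node's declared reads and writes."""

    def __init__(self):
        self._data = {}

    def read(self, node_reads: set[str], key: str):
        if key not in node_reads:
            raise PermissionError(f"undeclared read: {key}")
        return self._data.get(key)

    def write(self, node_writes: set[str], key: str, value):
        if key not in node_writes:
            raise PermissionError(f"undeclared write: {key}")
        self._data[key] = value
```
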
## Human-in-the-Loop

Human-in-the-loop (HITL) nodes are where the agent pauses and asks a person for input. This isn't a blunt "stop everything" — the framework supports structured questions: open-ended text, multiple choice, yes/no approvals, and multi-field forms.

When the agent hits a HITL node, it saves its entire state and presents the questions. The session can sit paused for minutes, hours, or days. When the human responds, execution picks up exactly where it left off.

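The structured questions might be represented as payloads like these. The `kind` values and field names are hypothetical, chosen to mirror the question types listed above:

```python
# Hypothetical question payloads a HITL node could present while paused.
questions = [
    {"kind": "approval", "prompt": "Send this outreach message?",
     "options": ["yes", "no"]},
    {"kind": "multiple_choice", "prompt": "Which tone fits this prospect?",
     "options": ["formal", "casual", "playful"]},
    {"kind": "free_text", "prompt": "Anything to add before we send?"},
]
```
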
This is what makes Hive agents supervisable in production. You place HITL nodes at critical decision points — before sending a message, before making a purchase, before any action that's hard to undo. The agent handles the routine work autonomously; humans weigh in on the decisions that matter. And every time a human provides input, that decision becomes data the [evolution](./evolution.md) process can learn from.

## The Shape of an Agent

A typical agent graph looks something like this:

```
intake → research → draft → [human review] → send → done
            ↑                                  |
            └────────── on failure ───────────┘
```

An entry node where work begins. A chain of nodes that do the real work. HITL nodes at approval gates. Failure edges that loop back for another attempt. Terminal nodes where execution ends.

The framework tracks everything as it walks the graph: which nodes ran, how many retries each needed, how much the LLM calls cost, how long each step took. This metadata feeds into the [worker agent runtime](./worker_agent.md) for monitoring and into the [evolution](./evolution.md) process for improvement.

# The Worker Agent

## What a Worker Agent Is

A worker agent is a specialized AI agent built to perform a specific business process. It's not a general-purpose assistant — it's purpose-built, like hiring someone for a defined role. A sales outreach agent knows how to research prospects, craft personalized messages, and follow up. A support triage agent knows how to categorize tickets, pull customer context, and route to the right team.

In Hive, a **Coding Agent** (like Claude Code or Cursor) generates worker agents from a natural language goal description. You describe what you want the agent to do, and the coding agent produces the graph, nodes, edges, and configuration. The worker agent is the thing that actually runs.

## Sessions

A session is a single execution of a worker agent against a specific input. If your outreach agent processes 50 prospects, that's 50 sessions.

Each session is isolated — it has its own shared memory, its own execution state, and its own history. This matters because sessions can be long-running. An agent might start researching a prospect, pause for human approval, wait hours or days, and then resume to send the message. The session preserves everything across that gap.

Sessions also make debugging straightforward. Every decision the agent made, every tool it called, every retry it attempted — it's all captured in the session. When something goes wrong, you can trace exactly what happened.

## Iterations

Within a session, nodes (especially `event_loop` nodes) work in iterations. An iteration is one turn of the loop: the LLM reasons about the current state, possibly calls tools, observes results, and produces output. Then the judge evaluates: is this good enough?

If not, the node iterates again. The LLM sees what went wrong and adjusts its approach. This is how agents self-correct without human intervention — through rapid iteration within a single node, not by restarting the whole process.

Iterations have limits. You set a maximum per node to prevent runaway loops. If a node can't produce acceptable output within its iteration budget, it fails and the graph's error-handling edges take over.

## Headless Execution

A lot of business processes need to run continuously — monitoring inboxes, processing incoming leads, watching for events. These agents run **headless**: no UI, no human sitting at a terminal, just the agent doing its job in the background.

Headless doesn't mean unsupervised. HITL (human-in-the-loop) nodes still pause execution and wait for human input when the agent hits a decision it shouldn't make alone. The difference is that instead of a live conversation, the agent sends a notification, waits for a response through whatever channel you've configured, and resumes when the human weighs in.

This is the operational model Hive is designed for: agents that run 24/7 as part of your business infrastructure, with humans stepping in only when needed. The goal is to automate the routine and escalate the exceptions.

## The Runtime

The worker agent runtime manages the lifecycle: starting sessions, executing the graph, handling pauses and resumes, tracking costs, and collecting metrics. It coordinates everything the agent needs — LLM access, tool execution, shared memory, credential management — so individual nodes can focus on their specific job.

Key things the runtime handles:

**Cost tracking** — Every LLM call is metered. You set budget constraints on the goal, and the runtime enforces them. An agent can't silently burn through your API credits.

**Decision logging** — Every meaningful choice the agent makes is recorded: what it was trying to do, what options it considered, what it chose, and what happened. This isn't just for debugging — it's the raw material that evolution uses to improve future generations (a sketch of one such entry follows after this list).

**Event streaming** — The runtime emits events as the agent works. You can wire these up to dashboards, logs, or alerting systems to monitor agents in real time.

**Crash recovery** — If execution is interrupted (process crash, deployment, anything), the runtime can resume from the last checkpoint. Conversation state and memory are persisted, so the agent picks up where it left off rather than starting over.

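The decision log entry sketched below shows the kind of record this implies. All field names are hypothetical; the point is that each entry captures intent, alternatives, choice, and result:

```python
# Hypothetical shape of one decision log entry.
decision = {
    "session_id": "sess_042",
    "node": "route",
    "intent": "choose next step after research",
    "options_considered": ["draft now", "research again", "escalate"],
    "chosen": "research again",
    "reason": "prospect profile missing recent activity",
    "outcome": "second research pass found two recent posts",
}
```
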
## The Big Picture

The worker agent model is Hive's answer to a simple question: how do you run AI agents like you'd run a team?

You hire for a role (define the goal), you onboard them with context (provide tools, credentials, domain knowledge), you set expectations (success criteria and constraints), you let them work independently (headless execution), and you check in when something unusual comes up (HITL). When they're not performing well, you don't debug them line by line — you evolve them (see [Evolution](./evolution.md)).