AI Agent Observability & Monitoring: The Complete Guide
How to know what your AI agents are actually doing
AI agents are autonomous systems that make decisions, call tools, and interact with the world. Without proper observability, they become black boxes. This guide covers everything you need to monitor AI agents effectively.
Why Agent Observability Is Different
Traditional application monitoring tracks requests and responses. Agent monitoring must track:
| Traditional Apps | AI Agents |
|---|---|
| Request/Response | Multi-step reasoning chains |
| Deterministic behavior | Probabilistic decisions |
| Fixed execution paths | Dynamic tool selection |
| Predictable costs | Variable LLM spending |
| Clear errors | Subtle quality degradation |
The Four Pillars of Agent Observability
┌─────────────────────────────────────────────────────────────┐
│ Agent Observability Stack │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Metrics │ │ Logs │ │ Traces │ │
│ │ (Numbers) │ │ (Events) │ │ (Execution Flow) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Quality Evals │ │
│ │ (Output Assessment) │ │
│ └───────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
1. Metrics
Quantitative measurements over time:
- Requests per minute
- Success/failure rates
- Latency distributions
- Token usage
- Cost per request
- Tool call frequencies
2. Logs
Discrete events with context:
- Agent decisions
- Tool inputs/outputs
- Error messages
- User interactions
- System events
3. Traces
End-to-end execution flows:
- Full reasoning chains
- Token-by-token generation
- Tool call sequences
- Parent-child relationships
- Cross-agent communication
4. Quality Evals
Output quality assessment:
- Accuracy scoring
- Hallucination detection
- Task completion rates
- User satisfaction
- Regression detection
Key Metrics to Track
Performance Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| agent.latency.p50 | Median response time | > 5s |
| agent.latency.p99 | 99th percentile latency | > 30s |
| agent.throughput | Requests/second | < baseline * 0.5 |
| agent.queue.depth | Pending requests | > 100 |
| agent.timeout.rate | Timeout percentage | > 5% |
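The latency percentiles above are computed from raw samples rather than averaged. A minimal nearest-rank sketch (the `percentile` helper is illustrative, not from any particular monitoring library):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Example latency samples in milliseconds
latencies_ms = [120, 340, 95, 4800, 210, 180, 150, 900, 260, 310]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

Note how one slow outlier barely moves the p50 but dominates the p99, which is why both are worth alerting on.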
Reliability Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| agent.success.rate | Successful completions | < 95% |
| agent.error.rate | Error percentage | > 5% |
| agent.retry.rate | Retries needed | > 10% |
| agent.fallback.rate | Fallback usage | > 20% |
| agent.circuit.open | Circuit breaker status | true |
Cost Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| agent.cost.total | Total spend | > budget * 0.9 |
| agent.cost.per.request | Cost per request | > $0.50 |
| agent.tokens.input | Input tokens used | anomaly detection |
| agent.tokens.output | Output tokens used | anomaly detection |
| agent.model.usage | Calls by model | unusual patterns |
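Cost per request is derived from token counts and a per-model price table. A sketch with hypothetical per-1K-token prices (real prices vary by provider and change over time, so treat the numbers here as placeholders):

```python
# Hypothetical (input, output) prices per 1K tokens -- check your provider's current pricing.
PRICES_PER_1K = {
    "gpt-4": (0.03, 0.06),
    "gpt-3.5": (0.0005, 0.0015),
}

def request_cost(model, input_tokens, output_tokens):
    """Estimate the cost of a single LLM call from its token counts."""
    in_price, out_price = PRICES_PER_1K[model]
    return (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price
```

Summing `request_cost` per session, agent, and model gives the cost breakdowns shown in the dashboards below.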
Quality Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| agent.quality.score | Output quality (0-1) | < 0.7 |
| agent.hallucination.rate | Detected hallucinations | > 5% |
| agent.task.completion | Tasks fully completed | < 80% |
| agent.user.satisfaction | User ratings | < 4.0/5.0 |
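A composite quality score like `agent.quality.score` is typically a weighted combination of per-criterion eval results. A minimal sketch; the criteria and weights below are illustrative, not a standard rubric:

```python
def quality_score(evals, weights=None):
    """Collapse per-criterion eval results (each scored 0-1) into one 0-1 score.
    Criteria and weights are illustrative placeholders."""
    weights = weights or {"accuracy": 0.5, "completeness": 0.3, "format": 0.2}
    return sum(weights[criterion] * evals[criterion] for criterion in weights)
```

Keeping the per-criterion scores alongside the composite makes regressions easier to diagnose: a drop in the composite can then be traced to the criterion that caused it.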
Logging Best Practices
Structured Logging Format
    {
      "timestamp": "2025-01-15T10:30:00Z",
      "level": "info",
      "event": "agent_tool_call",
      "agent_id": "agent-123",
      "session_id": "session-456",
      "trace_id": "trace-789",
      "tool": "search_web",
      "input": {"query": "latest AI news"},
      "output_tokens": 150,
      "latency_ms": 1200,
      "success": true
    }
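Logs in this shape can be emitted from Python with only the standard library. A sketch; the `JsonFormatter` class and the `fields` convention are illustrative, not a specific logging framework's API:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line, merging structured fields."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Structured context passed via `extra={"fields": {...}}`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("agent")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("agent_tool_call", extra={"fields": {
    "agent_id": "agent-123",
    "tool": "search_web",
    "latency_ms": 1200,
    "success": True,
}})
```

One JSON object per line keeps the logs trivially parseable by downstream aggregators.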
What to Log
Always Log:
- Agent start/stop
- Tool calls (name, duration, success)
- LLM calls (model, tokens, latency)
- Errors and exceptions
- Human interventions
- Budget events
Log Carefully (PII concerns):
- User inputs (may need redaction)
- Agent outputs (may contain sensitive data)
- Full prompts (can be large)
Never Log:
- API keys
- User credentials
- Full conversation transcripts in production
- Raw model weights
Log Levels for Agents
| Level | Use Case |
|---|---|
| DEBUG | Full prompts, token-level details |
| INFO | Tool calls, completions, metrics |
| WARN | Retries, degradation, budget warnings |
| ERROR | Failures, exceptions, circuit breaks |
| FATAL | System crashes, unrecoverable errors |
Distributed Tracing for Agents
Why Tracing Matters
Agents involve multiple steps, LLM calls, and tool invocations. Tracing connects them all.
Trace: "Process customer refund"
├── Span: Agent Initialize (5ms)
├── Span: LLM Planning Call (800ms)
│ └── Attribute: model=gpt-4, tokens=500
├── Span: Tool: fetch_order (200ms)
│ └── Attribute: order_id=12345
├── Span: Tool: check_policy (50ms)
├── Span: LLM Decision Call (600ms)
│ └── Attribute: decision=approve
├── Span: Tool: process_refund (300ms)
└── Span: Agent Complete (10ms)
└── Attribute: success=true, cost=$0.08
Key Trace Attributes
- agent.id: Unique agent identifier
- agent.type: Agent type/role
- session.id: User session
- parent.agent: For multi-agent systems
- llm.model: Model used
- llm.tokens: Token counts
- tool.name: Tool being called
- tool.success: Tool outcome
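Structurally, a trace like the one above is just nested spans with parent links and attributes. A stdlib-only sketch of the idea (production systems would use something like OpenTelemetry; the `Tracer` class here is an illustration, not a real API):

```python
import contextlib
import time
import uuid

class Tracer:
    """Minimal span tracker showing parent-child trace structure (illustrative only)."""
    def __init__(self):
        self.spans = []   # completed spans, in completion order
        self._stack = []  # currently open spans

    @contextlib.contextmanager
    def span(self, name, **attributes):
        record = {
            "span_id": uuid.uuid4().hex[:8],
            "parent_id": self._stack[-1]["span_id"] if self._stack else None,
            "name": name,
            "attributes": attributes,
        }
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(record)

tracer = Tracer()
with tracer.span("process_refund"):
    with tracer.span("llm_planning", model="gpt-4", tokens=500):
        pass
    with tracer.span("tool:fetch_order", order_id=12345):
        pass
```

The context-manager nesting mirrors the span tree: each child records its parent's ID, which is exactly the linkage a trace viewer uses to reconstruct the execution flow.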
Dashboard Design
Dashboard 1: Operations Overview
┌─────────────────────────────────────────────────────────────┐
│ Agent Operations │
├─────────────────┬─────────────────┬─────────────────────────┤
│ Active Agents │ Requests/Min │ Error Rate │
│ 42 │ 1,234 │ 0.3% ✓ │
├─────────────────┴─────────────────┴─────────────────────────┤
│ │
│ Request Latency (p50/p99) Success Rate (24h) │
│ ████████████████░░░░ ██████████████████████ │
│ 1.2s / 4.5s 99.2% │
│ │
├─────────────────────────────────────────────────────────────┤
│ Top Errors Active Alerts │
│ • Rate limit exceeded (12) ⚠️ High latency p99 │
│ • Tool timeout (5) ⚠️ Budget at 85% │
│ • Validation failed (3) │
└─────────────────────────────────────────────────────────────┘
Dashboard 2: Cost & Usage
┌─────────────────────────────────────────────────────────────┐
│ Cost & Usage │
├─────────────────┬─────────────────┬─────────────────────────┤
│ Today's Spend │ Budget Used │ Projected Monthly │
│ $127.50 │ 67% │ $3,825 │
├─────────────────┴─────────────────┴─────────────────────────┤
│ │
│ Cost by Model │ Cost by Agent │
│ ■ GPT-4: $89 │ ■ Support: $45 │
│ ■ Claude: $28 │ ■ Research: $52 │
│ ■ GPT-3.5: $10 │ ■ Writer: $30 │
│ │
├─────────────────────────────────────────────────────────────┤
│ Token Usage Trend (7 days) │
│ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆ │
└─────────────────────────────────────────────────────────────┘
Dashboard 3: Quality & Reliability
┌─────────────────────────────────────────────────────────────┐
│ Quality & Reliability │
├─────────────────┬─────────────────┬─────────────────────────┤
│ Quality Score │ Task Complete │ User Satisfaction │
│ 0.92/1.0 │ 94.5% │ 4.6/5.0 │
├─────────────────┴─────────────────┴─────────────────────────┤
│ │
│ Quality Trend (30 days) │ Failure Analysis │
│ ████████████████████████ │ ■ LLM errors: 2% │
│ ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔ │ ■ Tool errors: 1% │
│ Target: 0.90 │ ■ Timeouts: 0.5% │
│ │ ■ Logic errors: 0.5% │
├─────────────────────────────────────────────────────────────┤
│ Recent Quality Issues │
│ • Agent-42 hallucination detected (15 min ago) │
│ • Agent-17 task incomplete (1 hour ago) │
└─────────────────────────────────────────────────────────────┘
Alerting Strategy
Critical Alerts (Page immediately)
- Error rate > 10% for 5 minutes
- All agents offline
- Budget exceeded
- Security anomaly detected
Warning Alerts (Notify during business hours)
- Error rate > 5% for 15 minutes
- Latency p99 > 30s
- Budget > 90% of limit
- Quality score drops > 10%
Informational (Daily digest)
- Token usage trends
- Cost projections
- Quality score changes
- New error types detected
Alert Fatigue Prevention
- Prefer anomaly detection over fixed thresholds
- Group related alerts
- Implement progressive escalation
- Review and tune alert thresholds monthly
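Anomaly detection over a rolling baseline can be as simple as a z-score check. A minimal sketch; the window size, minimum-sample count, and threshold below are illustrative starting points, not recommended values:

```python
import statistics
from collections import deque

class AnomalyDetector:
    """Rolling z-score detector: flags values far from the recent baseline.
    Parameters are illustrative placeholders."""
    def __init__(self, window=100, z_threshold=3.0, min_samples=10):
        self.history = deque(maxlen=window)
        self.z_threshold = z_threshold
        self.min_samples = min_samples

    def observe(self, value):
        """Record a value; return True if it is anomalous vs. recent history."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history)
            if stdev > 0 and abs(value - mean) / stdev > self.z_threshold:
                anomalous = True
        self.history.append(value)
        return anomalous
```

Because the baseline moves with the data, this fires on genuine deviations (a latency spike, a token-usage surge) rather than on a fixed number that drifts out of date.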
Tool Comparison
| Tool | Best For | Agent-Specific Features |
|---|---|---|
| Datadog | Enterprise, full-stack | APM for LLM calls |
| Grafana | Self-hosted, flexibility | Custom dashboards |
| LangSmith | LangChain users | Prompt tracing |
| Weights & Biases | ML teams | Experiment tracking |
| Helicone | LLM-focused | Token analytics |
| Aden | Production agents | Built-in observability |
How Aden Handles Observability
Aden provides built-in observability without additional setup:
Automatic Collection
┌─────────────────────────────────────────────────────────────┐
│ Aden Observability │
│ │
│ ┌───────────────┐ ┌───────────────────────────────┐ │
│ │ SDK-Wrapped │──────▶│ Event Stream │ │
│ │ Nodes │ │ • Metrics • Logs • Traces │ │
│ └───────────────┘ └───────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Honeycomb Dashboard │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Metrics │ │ Costs │ │ Quality │ │ Alerts │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
What Aden Tracks Automatically
- Every LLM call (model, tokens, latency, cost)
- Every tool invocation (name, duration, success)
- Agent lifecycle events (start, stop, error)
- Budget consumption in real-time
- Quality metrics via failure tracking
- HITL intervention points
Built-in Dashboards
- Real-time agent status
- Cost breakdown by agent/model
- Quality trends over time
- Failure analysis
- Self-improvement metrics
No Configuration Required
Unlike external tools, Aden's observability requires no setup:
    # Just wrap your node with the SDK
    from aden import sdk

    @sdk.node
    async def my_agent(input):
        # All metrics automatically collected
        return await process(input)
Implementation Checklist
Phase 1: Basic (Week 1)
- Structured logging in place
- Basic metrics: latency, errors, throughput
- Cost tracking per request
- Simple dashboard with key metrics
Phase 2: Comprehensive (Week 2-3)
- Distributed tracing implemented
- Quality evaluation pipeline
- Alerting rules configured
- Full dashboards built
Phase 3: Advanced (Week 4+)
- Anomaly detection
- Automated regression detection
- Cost optimization insights
- Self-healing triggers
Common Pitfalls
1. Logging Too Much
Problem: Full prompts in production logs.
Solution: Log hashes or summaries; capture full content only for debugging.
2. Alert Fatigue
Problem: Too many non-actionable alerts.
Solution: Use anomaly detection, tune thresholds, and require an action plan for each alert.
3. Missing Context
Problem: Can't correlate events across agents.
Solution: Propagate trace IDs and use correlation IDs.
4. Ignoring Quality
Problem: Only operational metrics are tracked.
Solution: Implement quality scoring and track user feedback.
5. No Baselines
Problem: No sense of what "normal" looks like.
Solution: Establish baselines before alerting and use relative thresholds.
Conclusion
Effective agent observability requires:
- Metrics: Know your numbers (latency, errors, cost)
- Logs: Capture events with context
- Traces: Follow execution flows end-to-end
- Quality: Assess output, not just uptime
Modern agent platforms like Aden provide this built-in. For other frameworks, plan to invest significant effort in observability infrastructure.
The goal: Never wonder what your agents are doing—always know.
Last updated: January 2025