Overview

klaw is designed to handle failures gracefully. This page covers the resilience mechanisms built into the provider layer, context management, structured error handling, and observability.

Provider Retry

When an LLM API call fails with a retryable error, klaw automatically retries with exponential backoff and jitter.

Retry Flow

Request ──→ Provider API
              │
              ├── Success ──→ Return response
              │
              ├── Retryable error ──→ Wait (backoff + jitter)
              │                          │
              │                          └──→ Retry (up to max_retries)
              │                                  │
              │                                  ├── Success ──→ Return
              │                                  └── Exhausted ──→ Try fallback
              │
              └── Non-retryable error ──→ Return error immediately

Backoff Calculation

backoff = initial_backoff * (backoff_factor ^ attempt) + jitter
Parameter        Default  Description
initial_backoff  1s       Starting delay
backoff_factor   2.0      Exponential multiplier
max_backoff      30s      Maximum delay cap
jitter           0-25%    Random addition to prevent thundering herd
Example progression (3 retries):
  1. Attempt 1: ~1s wait (1s + jitter)
  2. Attempt 2: ~2s wait (2s + jitter)
  3. Attempt 3: ~4s wait (4s + jitter)

Retryable Errors

Error                Description
HTTP 429             Rate limit exceeded
HTTP 500             Internal server error
HTTP 502             Bad gateway
HTTP 503             Service unavailable
HTTP 529             Overloaded
timeout / timed out  Request timeout
connection reset     Connection dropped
connection refused   Server unavailable
EOF                  Unexpected disconnect
All other errors (e.g., HTTP 400, 401, 403) fail immediately without retry.

Fallback Chain

When retries are exhausted on the primary provider, klaw tries fallback providers in order:
Primary (anthropic)
  │ retries exhausted
  ▼
Fallback 1 (openrouter)
  │ retries exhausted
  ▼
Fallback 2 (eachlabs)
  │ retries exhausted
  ▼
Return error

Configuration

[provider.anthropic]
api_key = "${ANTHROPIC_API_KEY}"
model = "claude-sonnet-4-20250514"
max_retries = 3
fallback = "openrouter"

[provider.openrouter]
api_key = "${OPENROUTER_API_KEY}"
model = "anthropic/claude-sonnet-4"
max_retries = 2
Each fallback provider uses its own retry configuration independently.

Context Window Compaction

When conversation history approaches the context window limit, klaw automatically compacts it via LLM-based summarization.

Compaction Lifecycle

Message history grows
        │
        ▼
Token estimate > (max_tokens - reserve) * threshold?

    ┌────┴────┐
    │ No      │ Yes
    │         ▼
    │    Keep first user message (verbatim)
    │    Keep last 6 messages (verbatim)
    │    Summarize middle via LLM call
    │    Replace middle with summary
    │         │
    └────┬────┘

Continue with compacted history

Configuration

Parameter            Default  Description
MaxContextTokens     200,000  Maximum context window size
CompactionThreshold  0.75     Fraction of available tokens that triggers compaction
ReserveTokens        8,192    Tokens reserved for the LLM response
Trigger point: Compaction occurs when estimated tokens exceed (200,000 - 8,192) * 0.75 ≈ 143,856 tokens. Token estimation uses a ~4 characters per token heuristic. The summarized section is prefixed with [Previous conversation summary].

Structured Errors

All agent errors use the AgentError type with machine-readable codes:
type AgentError struct {
    Code    ErrorCode
    Message string
    Cause   error
}

Error Codes

Code             Constant          Trigger
max_iterations   ErrMaxIterations  Agent loop exceeded iteration limit
provider_error   ErrProvider       LLM API failure after retries and fallbacks
tool_execution   ErrToolExec       Tool returned an error
context_limit    ErrContextLimit   Context window exceeded after compaction
budget_exceeded  ErrBudgetExceed   Session cost exceeded max_session_cost

Error Format

Errors serialize as [CODE] message, with : cause appended when a cause is present:
[budget_exceeded] session cost $5.12 exceeds limit $5.00
[max_iterations] reached 50 iterations without completion
[provider_error] anthropic API error: rate limit exceeded
Errors implement Unwrap() for Go error chain inspection.

Observability

Structured Logging

klaw uses Go’s slog package for structured JSON logging:
[logging]
level = "info"
file = "~/.klaw/logs/klaw.log"
format = "json"
Example log entries:
{"level":"DEBUG","msg":"provider response","model":"claude-sonnet-4-20250514","input_tokens":1523,"output_tokens":847,"cost":"$0.014"}
{"level":"WARN","msg":"approaching budget limit","session_cost":4.12,"max_session_cost":5.00}
{"level":"ERROR","msg":"provider request failed","error":"HTTP 429","attempt":2,"backoff":"2.3s"}

Metrics

Global and per-session metrics are tracked using atomic counters.

Global Metrics:

Counter            Description
TotalInputTokens   Cumulative input tokens
TotalOutputTokens  Cumulative output tokens
TotalRequests      Total LLM API requests
TotalErrors        Total error occurrences
TotalToolCalls     Total tool invocations
Per-Session Metrics:
Field         Description
InputTokens   Session input tokens
OutputTokens  Session output tokens
Requests      Session request count
Errors        Session error count
ToolCalls     Per-tool call counts (map)
StartedAt     Session start timestamp
Metrics are recorded via:
  • RecordRequest(sessionID, inputTokens, outputTokens) — after each LLM call
  • RecordToolCall(sessionID, toolName) — after each tool execution
  • RecordError(sessionID, errorCode) — on any error
Use metrics.Summary() to get a snapshot of all global counters.

Next Steps

Agent Loop

How these mechanisms fit into the execution cycle

Cost & Safety Guide

Practical guide to configuring safety features