This document describes all metrics exported by Kelpie for Prometheus/Grafana monitoring.
Kelpie exposes metrics in Prometheus text format at:
GET /metrics
Example:
curl http://localhost:8283/metrics
kelpie_agents_active_count
Type: Gauge
Description: Current number of active agents in memory
Use Case: Monitor memory pressure, detect leaks
# Alert if too many agents active
kelpie_agents_active_count > 1000
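For a quick check outside Prometheus (for example in a smoke test), the same threshold can be applied to the raw text returned by /metrics. The following is a minimal sketch using only the Rust standard library. The helpers `find_unlabeled_sample` and `too_many_agents` are hypothetical and not part of Kelpie, and the sketch assumes the gauge is exported without labels, as the query above suggests.

```rust
/// Returns the value of the first unlabeled sample named `name` in a
/// Prometheus text-format body. Comment lines (`# HELP`, `# TYPE`) are skipped.
fn find_unlabeled_sample(body: &str, name: &str) -> Option<f64> {
    for line in body.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut parts = line.split_whitespace();
        if parts.next() == Some(name) {
            // The second field is the sample value; an optional timestamp may follow.
            return parts.next()?.parse::<f64>().ok();
        }
    }
    None
}

/// Mirrors the alert expression `kelpie_agents_active_count > limit`.
/// Returns false when the metric is missing from the body.
fn too_many_agents(body: &str, limit: f64) -> bool {
    find_unlabeled_sample(body, "kelpie_agents_active_count")
        .map_or(false, |value| value > limit)
}
```

Calling `too_many_agents(&body, 1000.0)` on the response body from the curl example would reproduce the alert condition for a single scrape.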
kelpie_agents_activated_total
Type: Counter
Description: Total number of agent activations since server start
Use Case: Track churn rate, identify hot agents
# Activation rate per second
rate(kelpie_agents_activated_total[5m])
kelpie_agents_deactivated_total
Type: Counter
Description: Total number of agent deactivations since server start
Use Case: Track churn rate, verify cleanup
# Deactivation rate per second
rate(kelpie_agents_deactivated_total[5m])
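To make clear what a per-second rate over a counter means, here is a simplified sketch of the calculation. It is not Prometheus's implementation: PromQL's `rate()` also extrapolates to the window boundaries, which is omitted here. The `Sample` alias and `simple_rate` function are hypothetical names chosen for illustration.

```rust
/// One scraped counter sample: (unix timestamp in seconds, counter value).
type Sample = (f64, f64);

/// Simplified per-second rate over a window of counter samples.
///
/// A drop in value is treated as a counter reset (the server restarted and
/// the counter began again from zero). Unlike PromQL's `rate()`, this sketch
/// does not extrapolate to the window boundaries.
fn simple_rate(samples: &[Sample]) -> Option<f64> {
    if samples.len() < 2 {
        return None; // at least two points are needed to measure an increase
    }
    let mut increase = 0.0;
    for pair in samples.windows(2) {
        let (prev, cur) = (pair[0].1, pair[1].1);
        // After a reset, everything counted since zero is new increase.
        increase += if cur >= prev { cur - prev } else { cur };
    }
    let elapsed = samples[samples.len() - 1].0 - samples[0].0;
    if elapsed > 0.0 {
        Some(increase / elapsed)
    } else {
        None
    }
}
```

Subtracting the deactivation rate from the activation rate gives the net growth in active agents per second over the same window.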
kelpie_invocations_total
Type: Counter
Labels:
- operation - The operation name (e.g., "increment", "get")
- status - Result status ("success" or "error")
Description: Total number of actor invocations
Use Case: Track throughput, error rates
# Overall invocation rate
sum(rate(kelpie_invocations_total[5m]))
# Error rate by operation
sum(rate(kelpie_invocations_total{status="error"}[5m])) by (operation)
# Success rate percentage
sum(rate(kelpie_invocations_total{status="success"}[5m]))
/
sum(rate(kelpie_invocations_total[5m])) * 100
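The grouping done by `sum(...) by (operation)` and the ratio in the percentage query can be sketched in plain Rust. This is an illustration of the arithmetic only: the function name and the input shape (operation, status, per-second rate) are assumptions made for the example. Operations with no traffic are left out, since the equivalent PromQL division would yield NaN for them.

```rust
use std::collections::HashMap;

/// Computes the error percentage per operation from
/// (operation, status, per-second rate) triples.
fn error_percent_by_operation(rates: &[(&str, &str, f64)]) -> HashMap<String, f64> {
    // operation -> (error rate, total rate)
    let mut totals: HashMap<String, (f64, f64)> = HashMap::new();
    for &(operation, status, rate) in rates {
        let entry = totals.entry(operation.to_string()).or_insert((0.0, 0.0));
        if status == "error" {
            entry.0 += rate;
        }
        entry.1 += rate;
    }
    totals
        .into_iter()
        // Skip operations with zero traffic to avoid dividing by zero.
        .filter(|(_, (_, all))| *all > 0.0)
        .map(|(operation, (errors, all))| (operation, errors / all * 100.0))
        .collect()
}
```

The success percentage for an operation is then 100 minus its error percentage, provided "success" and "error" are the only status values, as listed above.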
kelpie_invocation_duration_seconds
Type: Histogram
Labels:
- operation - The operation name
Description: Invocation latency distribution in seconds
Buckets: Default OpenTelemetry buckets
Use Case: P50, P95, P99 latency tracking, SLA monitoring
# P95 latency by operation
histogram_quantile(0.95,
sum(rate(kelpie_invocation_duration_seconds_bucket[5m])) by (le, operation)
)
# P99 latency (SLA monitoring)
histogram_quantile(0.99,
sum(rate(kelpie_invocation_duration_seconds_bucket[5m])) by (le)
)
# Average latency
rate(kelpie_invocation_duration_seconds_sum[5m])
/
rate(kelpie_invocation_duration_seconds_count[5m])
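Quantiles from a histogram are estimates: the query finds the bucket containing the requested rank and interpolates linearly inside it. The sketch below follows that general approach for classic cumulative buckets. It is a simplified illustration rather than Prometheus's exact implementation, and the bucket boundaries in the usage note are invented for the example, not Kelpie's actual buckets.

```rust
/// Estimates quantile `q` from cumulative buckets given as
/// (upper bound `le`, cumulative count), sorted by `le`, with `+Inf` last.
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    if !(0.0..=1.0).contains(&q) || buckets.len() < 2 {
        return None;
    }
    let total = buckets.last()?.1;
    if total <= 0.0 {
        return None; // no observations in the window
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the rank.
    let idx = buckets.iter().position(|&(_, count)| count >= rank)?;
    if idx == buckets.len() - 1 {
        // Rank falls in the +Inf bucket: report the highest finite bound.
        return Some(buckets[idx - 1].0);
    }
    // The first bucket is assumed to start at zero.
    let (lower, below) = if idx == 0 { (0.0, 0.0) } else { buckets[idx - 1] };
    let (upper, cumulative) = buckets[idx];
    if cumulative <= below {
        return Some(upper); // empty bucket at q = 0: avoid 0/0
    }
    // Linear interpolation inside the selected bucket.
    Some(lower + (upper - lower) * (rank - below) / (cumulative - below))
}
```

With buckets `[(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (f64::INFINITY, 100.0)]`, the P95 rank is 95 of 100 observations, which lands halfway through the 0.5 to 1.0 bucket, so the estimate is about 0.75 seconds. The result can never be more precise than the bucket layout allows.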
kelpie_invocations_pending_count
Type: Gauge
Description: Current number of queued invocations waiting to be processed
Use Case: Monitor queue depth, detect backpressure
# Alert if queue is backing up
kelpie_invocations_pending_count > 100
kelpie_memory_usage_bytes
Type: Gauge
Labels:
- tier - Memory tier ("core", "working", or "archival")
Description: Estimated memory usage in bytes by tier
Tiers:
- core - Agent memory blocks (persona, human, etc.)
- working - Recent messages in conversation history
- archival - Long-term archival memory entries
Use Case: Track memory growth, capacity planning
# Total memory usage across all tiers
sum(kelpie_memory_usage_bytes)
# Memory by tier
kelpie_memory_usage_bytes{tier="core"}
kelpie_memory_usage_bytes{tier="working"}
kelpie_memory_usage_bytes{tier="archival"}
# Alert if core memory exceeds threshold
kelpie_memory_usage_bytes{tier="core"} > 100 * 1024 * 1024 # 100 MiB
kelpie_memory_blocks_total
Type: Gauge
Description: Total number of memory blocks across all agents
Use Case: Track block count growth
# Average blocks per agent (approximate)
kelpie_memory_blocks_total / kelpie_agents_active_count
Type: Histogram
Description: Storage operation latency distribution
Use Case: Monitor storage backend performance (not yet fully implemented)
Type: Counter
Description: Total number of storage operations
Use Case: Track storage throughput (not yet fully implemented)
kelpie_server_uptime_seconds
Type: Gauge
Description: Server uptime in seconds since start
Use Case: Track restarts, uptime monitoring
# Server uptime in hours
kelpie_server_uptime_seconds / 3600
# Alert if server restarted recently
kelpie_server_uptime_seconds < 300 # Less than 5 minutes
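Because uptime only ever increases while the server is running, a restart shows up as a drop between consecutive scrapes. A minimal sketch of both checks follows. The function names are hypothetical, and the input is assumed to be a series of scraped uptime values in chronological order.

```rust
/// Mirrors the alert expression `kelpie_server_uptime_seconds < window`.
fn restarted_recently(uptime_seconds: f64, window_seconds: f64) -> bool {
    uptime_seconds < window_seconds
}

/// Returns the index of every sample that is lower than the one before it,
/// i.e. the first scrape observed after each restart.
fn restart_indices(uptime_samples: &[f64]) -> Vec<usize> {
    uptime_samples
        .windows(2)
        .enumerate()
        .filter(|(_, pair)| pair[1] < pair[0])
        .map(|(i, _)| i + 1)
        .collect()
}
```

The threshold check can miss a restart if the server has already been up longer than the window by the time it is evaluated, whereas the drop-based check sees any restart that occurred between two scrapes.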
Metrics are enabled by default. Configure the metrics port via TelemetryConfig:
use kelpie_core::telemetry::TelemetryConfig;
let config = TelemetryConfig::new("kelpie-server")
    .with_metrics(9090); // Custom metrics port (default: 9090)
Environment variables:
- OTEL_EXPORTER_OTLP_ENDPOINT - OTLP collector endpoint (optional, for traces)
- RUST_LOG - Log level (affects trace output, not metrics)
Add this to your prometheus.yml:
scrape_configs:
  - job_name: 'kelpie'
    static_configs:
      - targets: ['localhost:8283']
    metrics_path: '/metrics'
    scrape_interval: 15s
# Active agents
kelpie_agents_active_count
# Invocation rate
sum(rate(kelpie_invocations_total[5m]))
# Error rate
sum(rate(kelpie_invocations_total{status="error"}[5m])) / sum(rate(kelpie_invocations_total[5m])) * 100
# P95 latency
histogram_quantile(0.95, sum(rate(kelpie_invocation_duration_seconds_bucket[5m])) by (le))
# Total memory by tier
sum(kelpie_memory_usage_bytes) by (tier)
# Active agents over time
kelpie_agents_active_count
# Latency by operation (heatmap)
sum(rate(kelpie_invocation_duration_seconds_bucket[5m])) by (le, operation)
# Throughput by operation
sum(rate(kelpie_invocations_total[5m])) by (operation)
- Set up alerts for high error rates, latency spikes, memory growth
- Monitor trends over time to catch slow degradation
- Use labels to drill down into specific operations, statuses, or memory tiers (per-agent labels are not exported)
- Scrape interval should be 15-60s for production workloads
- Retention depends on your monitoring setup (Prometheus default: 15 days)
- In-memory only - Metrics reset on server restart (standard Prometheus behavior)
- Estimation - Memory metrics are estimated based on string lengths, not actual allocations
- No per-agent metrics - Metrics are aggregated across all agents for efficiency
- Sampling - High cardinality labels (like agent_id) are avoided to prevent metric explosion
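The estimation limitation above can be made concrete with a small sketch. The `Block` struct is a hypothetical stand-in, not Kelpie's actual type, and the source does not show how Kelpie computes its estimate. The example only contrasts a string-length estimate with the capacity the strings have actually reserved.

```rust
/// Hypothetical stand-in for an agent memory block; not Kelpie's real type.
struct Block {
    label: String,
    value: String,
}

/// Estimate based on string lengths: counts the UTF-8 bytes of the content only.
fn estimated_bytes(blocks: &[Block]) -> usize {
    blocks.iter().map(|b| b.label.len() + b.value.len()).sum()
}

/// Bytes reserved on the heap by the same strings, which can exceed their length.
/// Struct overhead and allocator bookkeeping are still not counted.
fn allocated_string_bytes(blocks: &[Block]) -> usize {
    blocks
        .iter()
        .map(|b| b.label.capacity() + b.value.capacity())
        .sum()
}
```

A string that was created with extra capacity, or that shrank after edits, reports a small length while still holding its larger allocation, so a length-based gauge is best read as a lower bound for capacity planning.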