Runtime Services
The Esper CLI manages the complete runtime stack for self-hosted deployments. The architecture separates concerns across specialized services for scalability and fault isolation.
Architecture Overview
Esper's runtime follows a pipeline architecture with clear service boundaries:
graph LR
C[Client] --> S[esper-server]
S --> IB[Ingestion Broker]
IB --> E[Engine Worker]
E --> MB[Mitigation Broker]
MB --> S
S --> C
Each service has distinct responsibilities:
- esper-server: API gateway and control plane
- Ingestion Broker: Request queuing and distribution
- Engine Worker: Policy evaluation and state management
- Mitigation Broker: Decision caching and enforcement
Control Plane Server
The control plane manages configuration and routes traffic.
Basic Operation
Start the server with default configuration:
esper server run
Default bindings:
- HTTP: 0.0.0.0:8080
- Metrics: 0.0.0.0:9090
- Health: 0.0.0.0:8081
Configuration Management
Override defaults with environment files:
# Development configuration
esper server run --config ./configs/dev.env
# Production with specific overrides
esper server run --config ./configs/prod.env --port 8443 --tls
Configuration precedence (highest to lowest):
- Command-line flags
- Environment variables
- Config file
- Defaults
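The precedence rules above amount to a layered lookup where each higher layer overrides the one below it. A minimal sketch in Python (the keys and values here are illustrative, not Esper's actual option names):

```python
def resolve_config(flags, env, file_cfg, defaults):
    """Return the effective config: flags > env vars > config file > defaults."""
    effective = dict(defaults)
    # Apply layers from lowest to highest precedence; None means "not set".
    for layer in (file_cfg, env, flags):
        effective.update({k: v for k, v in layer.items() if v is not None})
    return effective

effective = resolve_config(
    flags={"port": 8443},
    env={"port": None, "log_level": "debug"},
    file_cfg={"port": 9000, "tls": True},
    defaults={"port": 8080, "tls": False, "log_level": "info"},
)
# port comes from the flag, log_level from the environment, tls from the file
```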
Use esper server config to generate a complete configuration template with all available options.
Health Monitoring
The server exposes health endpoints for orchestration:
# Liveness check
curl http://localhost:8081/healthz
# Readiness check (includes dependency checks)
curl http://localhost:8081/readyz
# Detailed health with component status
curl http://localhost:8081/healthz/detailed
Response codes:
- 200: Healthy
- 503: Unhealthy or dependencies unavailable
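A readiness handler typically maps a set of dependency checks onto those two codes: all checks pass or the endpoint reports 503. A simplified sketch (the check names are illustrative, not Esper's actual components):

```python
def readyz(checks):
    """Run named dependency checks; return (status_code, per-check results)."""
    results = {name: check() for name, check in checks.items()}
    code = 200 if all(results.values()) else 503
    return code, results

code, results = readyz({
    "database": lambda: True,
    "redis": lambda: False,  # simulate an unreachable dependency
})
# → 503, with redis reported as the failing component
```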
Graceful Shutdown
The server handles shutdown signals intelligently:
# Send SIGTERM for graceful shutdown
kill -TERM $(pgrep esper-server)
# Or use the CLI
esper server stop --graceful --timeout 30s
Shutdown sequence:
- Stop accepting new requests
- Wait for in-flight requests (up to timeout)
- Close database connections
- Flush metrics
- Exit
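The sequence above can be modeled as a signal handler that flips an "accepting" flag, followed by a drain loop. A simplified sketch, not the server's actual implementation (real code would block on in-flight requests with the configured timeout):

```python
import signal

class Server:
    def __init__(self):
        self.accepting = True
        self.in_flight = 2  # pretend two requests are active
        self.shutdown_steps = []

    def handle_sigterm(self, signum, frame):
        self.accepting = False  # step 1: stop accepting new requests
        self.shutdown_steps.append("stop_accepting")

    def drain_and_exit(self):
        while self.in_flight > 0:  # step 2: wait for in-flight requests
            self.in_flight -= 1
        # steps 3-5: close connections, flush metrics, exit
        self.shutdown_steps += ["close_db", "flush_metrics", "exit"]

srv = Server()
signal.signal(signal.SIGTERM, srv.handle_sigterm)
signal.raise_signal(signal.SIGTERM)  # same effect as `kill -TERM <pid>`
srv.drain_and_exit()
```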
Engine Worker
The engine worker processes evaluation workloads from the ingestion broker.
Worker Configuration
Start workers with appropriate resource limits:
# Basic worker
esper worker run engine
# Production worker with tuning
esper worker run engine \
--concurrency 16 \
--batch-size 100 \
--memory-limit 4G
Key parameters:
- --concurrency: Parallel evaluation threads
- --batch-size: Events per processing batch
- --memory-limit: Maximum heap size
- --state-backend: State storage (redis|memory|postgres)
Scaling Workers
Deploy multiple workers for horizontal scaling:
# Start worker pool
for i in {1..4}; do
esper worker run engine \
--worker-id "worker-$i" \
--config ./engine.env &
done
# Monitor worker status
esper worker status
# Scale based on queue depth
esper worker autoscale \
--min 2 \
--max 10 \
--target-queue-depth 1000
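The core autoscaling decision reduces to sizing the pool from the queue-depth ratio and clamping it between the min and max bounds. A hedged sketch of that arithmetic (the real policy may also factor in utilization and scale-down cooldowns):

```python
import math

def desired_workers(queue_depth, target_depth, min_workers, max_workers):
    """Size the pool so each worker sees roughly target_depth queued events."""
    want = math.ceil(queue_depth / target_depth)
    return max(min_workers, min(max_workers, want))

desired_workers(3500, 1000, 2, 10)   # → 4
desired_workers(200, 1000, 2, 10)    # → 2 (floored at --min)
desired_workers(50000, 1000, 2, 10)  # → 10 (capped at --max)
```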
Workers must have unique IDs when running multiple instances on the same host.
State Management
Workers maintain hot state for entity tracking and rate limiting:
# Configure Redis state backend
export ESPER_STATE_BACKEND=redis
export ESPER_REDIS_URL=redis://localhost:6379/0
# Configure state TTL and cleanup
esper worker run engine \
--state-ttl 3600 \
--state-cleanup-interval 300
State backends comparison:
| Backend | Use Case | Pros | Cons |
|---|---|---|---|
| Memory | Development | Fast, simple | No persistence |
| Redis | Production | Fast, shared state | Requires Redis |
| Postgres | High durability | Persistent, queryable | Slower |
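The --state-ttl / --state-cleanup-interval pair behaves like a time-bounded map: reads past the TTL miss, and a periodic pass evicts expired entries. A minimal in-memory sketch of that semantics (illustrative only; the actual backends' data structures are not documented here):

```python
import time

class TTLState:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, stored_at)

    def put(self, key, value, now=None):
        self.entries[key] = (value, time.time() if now is None else now)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        item = self.entries.get(key)
        if item is None or now - item[1] > self.ttl:
            return None  # missing or expired
        return item[0]

    def cleanup(self, now=None):
        """What the periodic cleanup pass does: drop expired entries."""
        now = time.time() if now is None else now
        self.entries = {k: v for k, v in self.entries.items()
                        if now - v[1] <= self.ttl}

state = TTLState(ttl_seconds=3600)
state.put("entity:42", {"count": 7}, now=0)
state.get("entity:42", now=1800)  # still live within the TTL
state.get("entity:42", now=7200)  # → None, past the TTL
```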
Performance Tuning
Optimize worker performance for your workload:
# CPU-bound workloads (complex policies)
esper worker run engine \
--concurrency $(nproc) \
--batch-size 50 \
--evaluation-timeout 100ms
# Memory-bound workloads (large state)
esper worker run engine \
--concurrency 4 \
--memory-limit 8G \
--state-cache-size 100000
# I/O-bound workloads (external enrichment)
esper worker run engine \
--concurrency 32 \
--io-threads 16 \
--connection-pool-size 100
Broker Services
Brokers provide durable queuing and service decoupling.
Ingestion Broker
The ingestion broker queues incoming requests for processing.
# Run with default configuration
cargo run --manifest-path esper-rs/Cargo.toml \
--package esper-ingestion-broker --bin main
# Production configuration
BROKER_PORT=8082 \
REDIS_URL=redis://localhost:6379 \
MAX_QUEUE_SIZE=1000000 \
BATCH_TIMEOUT_MS=100 \
cargo run --release --package esper-ingestion-broker --bin main
Queue management:
# View queue metrics
curl http://localhost:8082/metrics | grep queue
# Inspect queue depth
curl http://localhost:8082/api/queue/status
# Pause ingestion (for maintenance)
curl -X POST http://localhost:8082/api/queue/pause
# Resume ingestion
curl -X POST http://localhost:8082/api/queue/resume
The ingestion broker implements backpressure. It returns 503 when the queue is full.
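That backpressure behavior amounts to a bounded queue that rejects with 503 rather than blocking the caller. A simplified sketch (the 503-on-full contract matches the note above; everything else is illustrative):

```python
from collections import deque

class BoundedIngestQueue:
    def __init__(self, max_size):
        self.max_size = max_size
        self.queue = deque()

    def enqueue(self, event):
        """Return an HTTP-style status: 202 accepted, 503 when full."""
        if len(self.queue) >= self.max_size:
            return 503  # backpressure: caller should retry with backoff
        self.queue.append(event)
        return 202

q = BoundedIngestQueue(max_size=2)
statuses = [q.enqueue(e) for e in ("a", "b", "c")]  # → [202, 202, 503]
```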
Mitigation Broker
The mitigation broker caches and serves policy decisions.
# Run with configuration
cargo run --manifest-path esper-rs/Cargo.toml \
--package esper-mitigation-broker --bin main
# Production with persistence
BROKER_PORT=8083 \
REDIS_URL=redis://localhost:6379 \
CACHE_TTL_SECONDS=900 \
PERSISTENCE_ENABLED=true \
cargo run --release --package esper-mitigation-broker --bin main
Cache operations:
# Query active mitigations
curl http://localhost:8083/api/mitigations/active
# Get mitigation for entity
curl http://localhost:8083/api/mitigations/entity/<entity-id>
# Clear mitigation cache (careful!)
curl -X POST http://localhost:8083/api/cache/clear \
-H "Authorization: Bearer $ADMIN_TOKEN"
Broker High Availability
Deploy brokers in HA configuration:
# docker-compose-ha.yml
version: "3.8"
services:
ingestion-broker-1:
image: esperr/ingestion-broker
environment:
REDIS_URL: redis://redis:6379
CLUSTER_MODE: true
NODE_ID: broker-1
ingestion-broker-2:
image: esperr/ingestion-broker
environment:
REDIS_URL: redis://redis:6379
CLUSTER_MODE: true
NODE_ID: broker-2
haproxy:
image: haproxy
volumes:
- ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg
ports:
- "8082:8082"
Local Stack Orchestration
Docker Compose Deployment
Run the complete stack with Docker Compose:
# Start all services
docker-compose up -d
# Scale workers
docker-compose up -d --scale engine-worker=4
# View logs
docker-compose logs -f engine-worker
# Stop gracefully
docker-compose down --timeout 30
Example docker-compose.yml:
version: "3.8"
services:
postgres:
image: postgres:15
environment:
POSTGRES_DB: esper
POSTGRES_PASSWORD: ${DB_PASSWORD}
volumes:
- postgres-data:/var/lib/postgresql/data
redis:
image: redis:7
command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
esper-server:
build: .
command: esper server run
environment:
DATABASE_URL: postgresql://postgres:${DB_PASSWORD}@postgres/esper
REDIS_URL: redis://redis:6379
depends_on:
- postgres
- redis
ports:
- "8080:8080"
ingestion-broker:
build: ./esper-rs
command: cargo run --package esper-ingestion-broker --bin main
environment:
REDIS_URL: redis://redis:6379
depends_on:
- redis
engine-worker:
build: .
command: esper worker run engine
environment:
DATABASE_URL: postgresql://postgres:${DB_PASSWORD}@postgres/esper
REDIS_URL: redis://redis:6379
depends_on:
- postgres
- redis
- ingestion-broker
deploy:
replicas: 2
mitigation-broker:
build: ./esper-rs
command: cargo run --package esper-mitigation-broker --bin main
environment:
REDIS_URL: redis://redis:6379
depends_on:
- redis
volumes:
postgres-data:
Process Management
Use process managers for production deployments:
Systemd
# /etc/systemd/system/esper-server.service
[Unit]
Description=Esper Control Plane Server
After=network.target postgresql.service redis.service
[Service]
Type=simple
User=esper
WorkingDirectory=/opt/esper
ExecStart=/opt/esper/bin/esper server run --config /etc/esper/server.env
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Manage with systemctl:
# Enable and start
sudo systemctl enable esper-server
sudo systemctl start esper-server
# Check status
sudo systemctl status esper-server
# View logs
journalctl -u esper-server -f
# Reload configuration
sudo systemctl reload esper-server
Supervisor
# /etc/supervisor/conf.d/esper.conf
[program:esper-server]
command=/opt/esper/bin/esper server run
directory=/opt/esper
user=esper
autostart=true
autorestart=true
stdout_logfile=/var/log/esper/server.log
stderr_logfile=/var/log/esper/server-error.log
environment=PATH="/opt/esper/bin:%(ENV_PATH)s"
[program:engine-worker]
command=/opt/esper/bin/esper worker run engine
process_name=%(program_name)s_%(process_num)02d
numprocs=4
directory=/opt/esper
user=esper
autostart=true
autorestart=true
stdout_logfile=/var/log/esper/worker-%(process_num)02d.log
Monitoring & Observability
Metrics Collection
All services expose Prometheus metrics:
# prometheus.yml
scrape_configs:
- job_name: esper-server
static_configs:
- targets: ["localhost:9090"]
- job_name: engine-workers
static_configs:
- targets: ["localhost:9091", "localhost:9092"]
- job_name: brokers
static_configs:
- targets: ["localhost:8082", "localhost:8083"]
Key metrics to monitor:
# Request rate
rate(esper_requests_total[5m])
# Policy evaluation latency
histogram_quantile(0.99, rate(esper_evaluation_duration_seconds_bucket[5m]))
# Queue depth
esper_queue_depth{queue="ingestion"}
# Worker utilization
esper_worker_utilization_ratio
Distributed Tracing
Enable tracing for request flow visibility:
# Configure OpenTelemetry
export OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
export OTEL_SERVICE_NAME=esper-server
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
esper server run --tracing
Logging
Configure structured logging:
# JSON logging for production
export LOG_FORMAT=json
export LOG_LEVEL=info
# Verbose logging for debugging
export LOG_LEVEL=debug
export LOG_INCLUDE_CALLER=true
# Log to file
esper server run 2>&1 | tee -a /var/log/esper/server.log
Log aggregation with Fluentd:
# fluent.conf
<source>
@type tail
path /var/log/esper/*.log
pos_file /var/log/td-agent/esper.pos
tag esper.*
format json
</source>
<match esper.**>
@type elasticsearch
host elasticsearch
port 9200
index_name esper
type_name logs
</match>
Use correlation IDs to trace requests across services. The CLI automatically propagates X-Request-ID headers.
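Propagating a correlation ID just means reusing an inbound X-Request-ID when one exists and minting a fresh one otherwise. A minimal sketch of that rule, not the CLI's internal code:

```python
import uuid

def with_request_id(inbound_headers):
    """Reuse the caller's X-Request-ID if present, otherwise mint one."""
    rid = inbound_headers.get("X-Request-ID") or str(uuid.uuid4())
    return {**inbound_headers, "X-Request-ID": rid}

out = with_request_id({"X-Request-ID": "req-123", "Accept": "application/json"})
# out["X-Request-ID"] == "req-123" — the same ID flows to the next service
```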
Troubleshooting
Common Issues
Service won't start
Error: Cannot bind to port 8080: Address already in use
Solution: Check for conflicting services with lsof -i :8080.
Worker connection failures
Error: Cannot connect to ingestion broker: Connection refused
Solution: Verify broker is running and check firewall rules.
State inconsistency
Warning: State divergence detected
Solution: Clear Redis state and restart workers.
Debug Mode
Enable comprehensive debugging:
# Maximum verbosity
export LOG_LEVEL=trace
export ESPER_DEBUG=true
export RUST_BACKTRACE=full
# Debug specific subsystem
export ESPER_DEBUG_SUBSYSTEM=evaluation
# Capture debug output
esper worker run engine 2>&1 | tee debug.log
Performance Profiling
Profile runtime performance:
# CPU profiling
esper worker run engine --cpuprofile cpu.prof
# Memory profiling
esper worker run engine --memprofile mem.prof
# Analyze profiles
go tool pprof -http=:8080 cpu.prof