Runtime Services
The Esper CLI manages the complete runtime stack for self-hosted deployments. The architecture separates concerns across specialized services for scalability and fault isolation.
Architecture Overview
Esper's runtime follows a pipeline architecture with clear service boundaries:
graph LR
C[Client] --> S[esper-server]
S --> IB[Ingestion Broker]
IB --> E[Engine Worker]
E --> MB[Mitigation Broker]
MB --> S
S --> C
Each service has distinct responsibilities:
- esper-server: API gateway and control plane
- Ingestion Broker: Request queuing and distribution
- Engine Worker: Policy evaluation and state management
- Mitigation Broker: Decision caching and enforcement
Control Plane Server
The control plane manages configuration and routes traffic.
Basic Operation
Start the server with default configuration:
esper server run
Default bindings:
- HTTP: 0.0.0.0:8080
- Metrics: 0.0.0.0:9090
- Health: 0.0.0.0:8081
Configuration Management
Override defaults with environment files:
# Development configuration
esper server run --config ./configs/dev.env
# Production with specific overrides
esper server run --config ./configs/prod.env --port 8443 --tls
Configuration precedence (highest to lowest):
- Command-line flags
- Environment variables
- Config file
- Defaults
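The precedence rules above amount to a layered lookup where each higher layer overrides the one below it. A minimal sketch in Python (the keys and values here are illustrative, not Esper's actual option names):

```python
def resolve_config(flags, env, file_cfg, defaults):
    """Return the effective config: flags > env vars > config file > defaults."""
    effective = dict(defaults)
    # Apply layers from lowest to highest precedence; None means "not set".
    for layer in (file_cfg, env, flags):
        effective.update({k: v for k, v in layer.items() if v is not None})
    return effective

effective = resolve_config(
    flags={"port": 8443},
    env={"port": None, "log_level": "debug"},
    file_cfg={"port": 9000, "tls": True},
    defaults={"port": 8080, "tls": False, "log_level": "info"},
)
# port comes from the flag, log_level from the environment, tls from the file
```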
Use esper server config to generate a complete configuration template with all available options.
Health Monitoring
The server exposes health endpoints for orchestration:
# Liveness check
curl http://localhost:8081/healthz
# Readiness check (includes dependency checks)
curl http://localhost:8081/readyz
# Detailed health with component status
curl http://localhost:8081/healthz/detailed
Response codes:
- 200: Healthy
- 503: Unhealthy or dependencies unavailable
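A readiness handler typically maps a set of dependency checks onto those two codes: all checks pass or the endpoint reports 503. A simplified sketch (the check names are illustrative, not Esper's actual components):

```python
def readyz(checks):
    """Run named dependency checks; return (status_code, per-check results)."""
    results = {name: check() for name, check in checks.items()}
    code = 200 if all(results.values()) else 503
    return code, results

code, results = readyz({
    "database": lambda: True,
    "redis": lambda: False,  # simulate an unreachable dependency
})
# → 503, with redis reported as the failing component
```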
Graceful Shutdown
The server handles shutdown signals intelligently:
# Send SIGTERM for graceful shutdown
kill -TERM $(pgrep esper-server)
# Or use the CLI
esper server stop --graceful --timeout 30s
Shutdown sequence:
- Stop accepting new requests
- Wait for in-flight requests (up to timeout)
- Close database connections
- Flush metrics
- Exit
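The sequence above can be modeled as a signal handler that flips an "accepting" flag, followed by a drain loop. A simplified sketch, not the server's actual implementation (real code would block on in-flight requests with the configured timeout):

```python
import signal

class Server:
    def __init__(self):
        self.accepting = True
        self.in_flight = 2  # pretend two requests are active
        self.shutdown_steps = []

    def handle_sigterm(self, signum, frame):
        self.accepting = False  # step 1: stop accepting new requests
        self.shutdown_steps.append("stop_accepting")

    def drain_and_exit(self):
        while self.in_flight > 0:  # step 2: wait for in-flight requests
            self.in_flight -= 1
        # steps 3-5: close connections, flush metrics, exit
        self.shutdown_steps += ["close_db", "flush_metrics", "exit"]

srv = Server()
signal.signal(signal.SIGTERM, srv.handle_sigterm)
signal.raise_signal(signal.SIGTERM)  # same effect as `kill -TERM <pid>`
srv.drain_and_exit()
```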
Engine Worker
The engine worker processes evaluation workloads from the ingestion broker.
Worker Configuration
Start workers with appropriate resource limits:
# Basic worker
esper worker run engine
# Production worker with tuning
esper worker run engine \
--concurrency 16 \
--batch-size 100 \
--memory-limit 4G
Key parameters:
- --concurrency: Parallel evaluation threads
- --batch-size: Events per processing batch
- --memory-limit: Maximum heap size
- --state-backend: State storage (redis|memory|postgres)
Scaling Workers
Deploy multiple workers for horizontal scaling:
# Start worker pool
for i in {1..4}; do
esper worker run engine \
--worker-id "worker-$i" \
--config ./engine.env &
done
# Monitor worker status
esper worker status
# Scale based on queue depth
esper worker autoscale \
--min 2 \
--max 10 \
--target-queue-depth 1000
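The core autoscaling decision reduces to sizing the pool from the queue-depth ratio and clamping it between the min and max bounds. A hedged sketch of that arithmetic (the real policy may also factor in utilization and scale-down cooldowns):

```python
import math

def desired_workers(queue_depth, target_depth, min_workers, max_workers):
    """Size the pool so each worker sees roughly target_depth queued events."""
    want = math.ceil(queue_depth / target_depth)
    return max(min_workers, min(max_workers, want))

desired_workers(3500, 1000, 2, 10)   # → 4
desired_workers(200, 1000, 2, 10)    # → 2 (floored at --min)
desired_workers(50000, 1000, 2, 10)  # → 10 (capped at --max)
```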
Workers must have unique IDs when running multiple instances on the same host.
State Management
Workers maintain hot state for entity tracking and rate limiting:
# Configure Redis state backend
export ESPER_STATE_BACKEND=redis
export ESPER_REDIS_URL=redis://localhost:6379/0
# Configure state TTL and cleanup
esper worker run engine \
--state-ttl 3600 \
--state-cleanup-interval 300
State backends comparison:
| Backend | Use Case | Pros | Cons |
|---|---|---|---|
| Memory | Development | Fast, simple | No persistence |
| Redis | Production | Fast, shared state | Requires Redis |
| Postgres | High durability | Persistent, queryable | Slower |
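The --state-ttl / --state-cleanup-interval pair behaves like a time-bounded map: reads past the TTL miss, and a periodic pass evicts expired entries. A minimal in-memory sketch of that semantics (illustrative only; the actual backends' data structures are not documented here):

```python
import time

class TTLState:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, stored_at)

    def put(self, key, value, now=None):
        self.entries[key] = (value, time.time() if now is None else now)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        item = self.entries.get(key)
        if item is None or now - item[1] > self.ttl:
            return None  # missing or expired
        return item[0]

    def cleanup(self, now=None):
        """What the periodic cleanup pass does: drop expired entries."""
        now = time.time() if now is None else now
        self.entries = {k: v for k, v in self.entries.items()
                        if now - v[1] <= self.ttl}

state = TTLState(ttl_seconds=3600)
state.put("entity:42", {"count": 7}, now=0)
state.get("entity:42", now=1800)  # still live within the TTL
state.get("entity:42", now=7200)  # → None, past the TTL
```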
Performance Tuning
Optimize worker performance for your workload:
# CPU-bound workloads (complex policies)
esper worker run engine \
--concurrency $(nproc) \
--batch-size 50 \
--evaluation-timeout 100ms
# Memory-bound workloads (large state)
esper worker run engine \
--concurrency 4 \
--memory-limit 8G \
--state-cache-size 100000
# I/O-bound workloads (external enrichment)
esper worker run engine \
--concurrency 32 \
--io-threads 16 \
--connection-pool-size 100
Broker Services
Brokers provide durable queuing and service decoupling.
Ingestion Broker
The ingestion broker queues incoming requests for processing.
# Run with default configuration
cargo run --manifest-path esper-rs/Cargo.toml \
--package esper-ingestion-broker --bin main
# Production configuration
BROKER_PORT=8082 \
REDIS_URL=redis://localhost:6379 \
MAX_QUEUE_SIZE=1000000 \
BATCH_TIMEOUT_MS=100 \
cargo run --release --package esper-ingestion-broker --bin main
Queue management:
# View queue metrics
curl http://localhost:8082/metrics | grep queue
# Inspect queue depth
curl http://localhost:8082/api/queue/status
# Pause ingestion (for maintenance)
curl -X POST http://localhost:8082/api/queue/pause
# Resume ingestion
curl -X POST http://localhost:8082/api/queue/resume
The ingestion broker implements backpressure. It returns 503 when the queue is full.
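That backpressure behavior amounts to a bounded queue that rejects with 503 rather than blocking the caller. A simplified sketch (the 503-on-full contract matches the note above; everything else is illustrative):

```python
from collections import deque

class BoundedIngestQueue:
    def __init__(self, max_size):
        self.max_size = max_size
        self.queue = deque()

    def enqueue(self, event):
        """Return an HTTP-style status: 202 accepted, 503 when full."""
        if len(self.queue) >= self.max_size:
            return 503  # backpressure: caller should retry with backoff
        self.queue.append(event)
        return 202

q = BoundedIngestQueue(max_size=2)
statuses = [q.enqueue(e) for e in ("a", "b", "c")]  # → [202, 202, 503]
```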
Mitigation Broker
The mitigation broker caches and serves policy decisions.
# Run with configuration
cargo run --manifest-path esper-rs/Cargo.toml \
--package esper-mitigation-broker --bin main
# Production with persistence
BROKER_PORT=8083 \
REDIS_URL=redis://localhost:6379 \
CACHE_TTL_SECONDS=900 \
PERSISTENCE_ENABLED=true \
cargo run --release --package esper-mitigation-broker --bin main
Cache operations:
# Query active mitigations
curl http://localhost:8083/api/mitigations/active
# Get mitigation for entity
curl http://localhost:8083/api/mitigations/entity/<entity-id>
# Clear mitigation cache (careful!)
curl -X POST http://localhost:8083/api/cache/clear \
-H "Authorization: Bearer $ADMIN_TOKEN"
Broker High Availability
Deploy brokers in HA configuration:
# docker-compose-ha.yml
version: "3.8"
services:
ingestion-broker-1:
image: esperr/ingestion-broker
environment:
REDIS_URL: redis://redis:6379
CLUSTER_MODE: true
NODE_ID: broker-1
ingestion-broker-2:
image: esperr/ingestion-broker
environment:
REDIS_URL: redis://redis:6379
CLUSTER_MODE: true
NODE_ID: broker-2
haproxy:
image: haproxy
volumes:
- ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg
ports:
- "8082:8082"
Local Stack Orchestration
Docker Compose Deployment
Run the complete stack with Docker Compose:
# Start all services
docker-compose up -d
# Scale workers
docker-compose up -d --scale engine-worker=4
# View logs
docker-compose logs -f engine-worker
# Stop gracefully
docker-compose down --timeout 30
Example docker-compose.yml:
version: "3.8"
services:
postgres:
image: postgres:15
environment:
POSTGRES_DB: esper
POSTGRES_PASSWORD: ${DB_PASSWORD}
volumes:
- postgres-data:/var/lib/postgresql/data
redis:
image: redis:7
command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
esper-server:
build: .
command: esper server run
environment:
DATABASE_URL: postgresql://postgres:${DB_PASSWORD}@postgres/esper
REDIS_URL: redis://redis:6379
depends_on:
- postgres
- redis
ports:
- "8080:8080"
ingestion-broker:
build: ./esper-rs
command: cargo run --package esper-ingestion-broker --bin main
environment:
REDIS_URL: redis://redis:6379
depends_on:
- redis
engine-worker:
build: .
command: esper worker run engine
environment:
DATABASE_URL: postgresql://postgres:${DB_PASSWORD}@postgres/esper
REDIS_URL: redis://redis:6379
depends_on:
- postgres
- redis
- ingestion-broker
deploy:
replicas: 2
mitigation-broker:
build: ./esper-rs
command: cargo run --package esper-mitigation-broker --bin main
environment:
REDIS_URL: redis://redis:6379
depends_on:
- redis
volumes:
postgres-data:
Process Management
Use process managers for production deployments:
Systemd
# /etc/systemd/system/esper-server.service
[Unit]
Description=Esper Control Plane Server
After=network.target postgresql.service redis.service
[Service]
Type=simple
User=esper
WorkingDirectory=/opt/esper
ExecStart=/opt/esper/bin/esper server run --config /etc/esper/server.env
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Manage with systemctl:
# Enable and start
sudo systemctl enable esper-server
sudo systemctl start esper-server
# Check status
sudo systemctl status esper-server
# View logs
journalctl -u esper-server -f
# Reload configuration
sudo systemctl reload esper-server
Supervisor
# /etc/supervisor/conf.d/esper.conf
[program:esper-server]
command=/opt/esper/bin/esper server run
directory=/opt/esper
user=esper
autostart=true
autorestart=true
stdout_logfile=/var/log/esper/server.log
stderr_logfile=/var/log/esper/server-error.log
environment=PATH="/opt/esper/bin:%(ENV_PATH)s"
[program:engine-worker]
command=/opt/esper/bin/esper worker run engine
process_name=%(program_name)s_%(process_num)02d
numprocs=4
directory=/opt/esper
user=esper
autostart=true
autorestart=true
stdout_logfile=/var/log/esper/worker-%(process_num)02d.log
Monitoring & Observability
Metrics Collection
All services expose Prometheus metrics:
# prometheus.yml
scrape_configs:
- job_name: esper-server
static_configs:
- targets: ["localhost:9090"]
- job_name: engine-workers
static_configs:
- targets: ["localhost:9091", "localhost:9092"]
- job_name: brokers
static_configs:
- targets: ["localhost:8082", "localhost:8083"]
Key metrics to monitor:
# Request rate
rate(esper_requests_total[5m])
# Policy evaluation latency
histogram_quantile(0.99, rate(esper_evaluation_duration_seconds_bucket[5m]))
# Queue depth
esper_queue_depth{queue="ingestion"}
# Worker utilization
esper_worker_utilization_ratio
Distributed Tracing
Enable tracing for request flow visibility:
# Configure OpenTelemetry
export OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
export OTEL_SERVICE_NAME=esper-server
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
esper server run --tracing
Logging
Configure structured logging:
# JSON logging for production
export LOG_FORMAT=json
export LOG_LEVEL=info
# Verbose logging for debugging
export LOG_LEVEL=debug
export LOG_INCLUDE_CALLER=true
# Log to file
esper server run 2>&1 | tee -a /var/log/esper/server.log
Log aggregation with Fluentd:
# fluent.conf
<source>
@type tail
path /var/log/esper/*.log
pos_file /var/log/td-agent/esper.pos
tag esper.*
format json
</source>
<match esper.**>
@type elasticsearch
host elasticsearch
port 9200
index_name esper
type_name logs
</match>
Use correlation IDs to trace requests across services. The CLI automatically propagates X-Request-ID headers.
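Propagating a correlation ID just means reusing an inbound X-Request-ID when one exists and minting a fresh one otherwise. A minimal sketch of that rule, not the CLI's internal code:

```python
import uuid

def with_request_id(inbound_headers):
    """Reuse the caller's X-Request-ID if present, otherwise mint one."""
    rid = inbound_headers.get("X-Request-ID") or str(uuid.uuid4())
    return {**inbound_headers, "X-Request-ID": rid}

out = with_request_id({"X-Request-ID": "req-123", "Accept": "application/json"})
# out["X-Request-ID"] == "req-123" — the same ID flows to the next service
```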
Troubleshooting
Common Issues
Service won't start
Error: Cannot bind to port 8080: Address already in use
Solution: Check for conflicting services with lsof -i :8080.
Worker connection failures
Error: Cannot connect to ingestion broker: Connection refused
Solution: Verify broker is running and check firewall rules.
State inconsistency
Warning: State divergence detected
Solution: Clear Redis state and restart workers.
Debug Mode
Enable comprehensive debugging:
# Maximum verbosity
export LOG_LEVEL=trace
export ESPER_DEBUG=true
export RUST_BACKTRACE=full
# Debug specific subsystem
export ESPER_DEBUG_SUBSYSTEM=evaluation
# Capture debug output
esper worker run engine 2>&1 | tee debug.log
Performance Profiling
Profile runtime performance:
# CPU profiling
esper worker run engine --cpuprofile cpu.prof
# Memory profiling
esper worker run engine --memprofile mem.prof
# Analyze profiles
go tool pprof -http=:8080 cpu.prof