Kubernetes Deployment

Heddle — Config-driven multi-LLM workflows


Overview

Heddle ships with Kubernetes manifests in k8s/ that are ready for Minikube. The manifests deploy NATS, Valkey, the router, an orchestrator, worker pods, and the Workshop web UI into a dedicated heddle namespace.


Minikube Deployment

Start Minikube

minikube start --cpus=4 --memory=8192 --driver=docker
eval $(minikube docker-env)

Build Container Images

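Build the images inside Minikube's Docker daemon so they are available to pods without a registry. Run these commands in the same shell where you ran eval $(minikube docker-env); a new shell points back at the host's Docker daemon: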

docker build -f docker/Dockerfile.worker -t heddle-worker:latest .
docker build -f docker/Dockerfile.router -t heddle-router:latest .
docker build -f docker/Dockerfile.orchestrator -t heddle-orchestrator:latest .
docker build -f docker/Dockerfile.workshop -t heddle-workshop:latest .

Create Namespace and Secrets

kubectl create namespace heddle
kubectl create secret generic heddle-secrets \
  --namespace heddle \
  --from-literal=anthropic-api-key="$ANTHROPIC_API_KEY"

Deploy

kubectl apply -k k8s/
kubectl get pods -n heddle -w

Access Workshop

The Workshop is exposed via NodePort on port 30080:

# Minikube
minikube service heddle-workshop -n heddle

# Or access directly
open http://$(minikube ip):30080

Manifest Structure

k8s/
├── namespace.yaml              # heddle namespace
├── nats-deployment.yaml        # NATS server
├── redis-deployment.yaml       # Valkey server
├── router-deployment.yaml      # Heddle router
├── orchestrator-deployment.yaml # Heddle orchestrator
├── worker-deployment.yaml      # Heddle worker(s)
├── workshop-deployment.yaml    # Heddle Workshop web UI (NodePort 30080)
└── kustomization.yaml          # Kustomize entry point (used by kubectl apply -k)

Local LLM runtimes on Mac with Minikube

For local LLM inference, run LM Studio or Ollama natively on the host and point workers at the host address:

# Option A: LM Studio (start the local server in the LM Studio UI)
LM_STUDIO_URL=http://host.minikube.internal:1234/v1
LM_STUDIO_MODEL=google/gemma-3-4b   # any id from /v1/models

# Option B: Ollama
ollama serve &
OLLAMA_URL=http://host.minikube.internal:11434

When both are configured, set HEDDLE_LOCAL_BACKEND=lmstudio (or ollama) on the worker pods to choose which one serves the local tier.
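
The selection rule described above can be sketched as a small function. This is an illustration only, not Heddle's actual code: the helper name resolve_local_backend is hypothetical, and raising an error when both URLs are set without HEDDLE_LOCAL_BACKEND is an assumption, since the behavior in that case is not documented here.

```python
from typing import Mapping, Optional


def resolve_local_backend(env: Mapping[str, str]) -> Optional[str]:
    """Pick the local-tier backend from environment variables.

    Hypothetical helper that mirrors the rule in the text above.
    Returns "lmstudio", "ollama", or None when no local runtime is configured.
    """
    available = []
    if env.get("LM_STUDIO_URL"):
        available.append("lmstudio")
    if env.get("OLLAMA_URL"):
        available.append("ollama")

    choice = env.get("HEDDLE_LOCAL_BACKEND", "").strip().lower()
    if choice:
        # An explicit choice must point at a runtime that has a URL configured.
        if choice not in available:
            raise ValueError(
                f"HEDDLE_LOCAL_BACKEND={choice!r} has no matching URL configured"
            )
        return choice

    if len(available) == 2:
        # ASSUMPTION: treat the ambiguous case as a configuration error.
        raise ValueError("both local runtimes are set; set HEDDLE_LOCAL_BACKEND")
    return available[0] if available else None
```

With only LM_STUDIO_URL set, the sketch returns "lmstudio" without needing HEDDLE_LOCAL_BACKEND; with both URLs set, it insists on an explicit choice.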


Environment Variables

Workers, router, and orchestrator containers use the following environment variables:

| Variable | Required | Description |
|---|---|---|
| WORKER_CONFIG | Workers | Path to worker YAML config |
| MODEL_TIER | Workers | Model tier (local, standard, frontier) |
| NATS_URL | All | NATS server URL |
| LM_STUDIO_URL | Optional | LM Studio /v1/ endpoint |
| LM_STUDIO_MODEL | Optional | LM Studio model id |
| OLLAMA_URL | Optional | Ollama API endpoint |
| OLLAMA_MODEL | Optional | Ollama model name |
| HEDDLE_LOCAL_BACKEND | Optional | lmstudio or ollama (when both URLs are set) |
| ANTHROPIC_API_KEY | Optional | Anthropic API key (from secret) |
| FRONTIER_MODEL | Optional | Model name for frontier tier |
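
A quick way to catch a misconfigured manifest is to check the required variables from the table before rolling out. The sketch below is a hypothetical pre-deployment check, not part of Heddle: the component names and the function missing_env are made up for illustration, and only the "Required" column and the three documented tiers are encoded.

```python
from typing import List, Mapping

# Required variables per component, taken from the table above.
REQUIRED = {
    "worker": ("WORKER_CONFIG", "MODEL_TIER", "NATS_URL"),
    "router": ("NATS_URL",),
    "orchestrator": ("NATS_URL",),
}
VALID_TIERS = ("local", "standard", "frontier")


def missing_env(component: str, env: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty means the env looks complete."""
    problems = [
        f"{name} is not set" for name in REQUIRED[component] if not env.get(name)
    ]
    tier = env.get("MODEL_TIER")
    if component == "worker" and tier and tier not in VALID_TIERS:
        problems.append(f"MODEL_TIER={tier!r} is not one of {VALID_TIERS}")
    return problems
```

Passing os.environ from inside a container, or the env section parsed from a manifest, gives an empty list when nothing required is missing.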

Resource Requests and Limits

Configure resource requests and limits for each component type:

| Component | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Router | 100m | 500m | 128Mi | 256Mi |
| Orchestrator | 200m | 1000m | 256Mi | 512Mi |
| Worker (local) | 200m | 1000m | 256Mi | 512Mi |
| Worker (standard) | 100m | 500m | 128Mi | 256Mi |
| NATS | 100m | 500m | 128Mi | 256Mi |
| Valkey | 100m | 500m | 128Mi | 256Mi |

Workers configured for the local tier (LM Studio or Ollama) generally need more resources because they proxy LLM calls. Workers using remote APIs (Anthropic) are lighter.

Example in a deployment spec:

resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "1000m"
    memory: "512Mi"
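
To see how a planned replica count compares with the Minikube node started earlier (4 CPUs, 8192 MB), the requests from the table can be summed. The following is a rough sizing sketch under stated assumptions: the component keys are illustrative names, the Workshop is omitted because the table does not list it, and the parser handles only the quantity forms used on this page.

```python
from typing import Dict, Tuple

# (CPU request, memory request) per replica, copied from the table above.
REQUESTS: Dict[str, Tuple[str, str]] = {
    "router": ("100m", "128Mi"),
    "orchestrator": ("200m", "256Mi"),
    "worker-local": ("200m", "256Mi"),
    "worker-standard": ("100m", "128Mi"),
    "nats": ("100m", "128Mi"),
    "valkey": ("100m", "128Mi"),
}


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "200m" or "1" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_mib(quantity: str) -> int:
    """Convert a memory quantity with a Ki, Mi, or Gi suffix to MiB."""
    factors = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    suffix = quantity[-2:]
    if suffix not in factors:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return round(float(quantity[:-2]) * factors[suffix])


def total_requests(replicas: Dict[str, int]) -> Tuple[int, int]:
    """Sum CPU (millicores) and memory (MiB) requests for the given replica counts."""
    cpu = sum(cpu_millicores(REQUESTS[n][0]) * c for n, c in replicas.items())
    mem = sum(memory_mib(REQUESTS[n][1]) * c for n, c in replicas.items())
    return cpu, mem
```

For one replica each of the router, orchestrator, NATS, and Valkey plus five local-tier workers, total_requests returns (1500, 1920), that is 1500m CPU and 1920Mi of memory. The node's allocatable capacity is lower than its raw size because system pods also reserve resources.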

Health Checks

Heddle actors are long-running async processes. Use liveness and readiness probes to detect stuck or unresponsive actors.

Important — the probes below are placeholders. Heddle does not yet ship a built-in /healthz endpoint or standalone healthcheck CLI. The commands shown exit 0 as long as the Python interpreter starts; they confirm the container is alive but say nothing about whether the actor is connected to NATS or processing messages. Treat them as scaffolding, not as production-grade probes.

Operators running Heddle in production should replace them with one of:

  • A sidecar exporter that publishes actor state to a /healthz endpoint and uses httpGet-style probes.
  • A TCP probe against the actor's NATS server port from inside the pod (catches a NATS outage; does not catch a hung actor).
  • A custom exec probe that reads a file the actor touches on each message-loop iteration (catches a hung actor; needs the actor to cooperate by touching the file).
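
The third option, a heartbeat file, can be sketched as follows. Everything here is an assumption for illustration: Heddle does not ship these helpers, the function names are hypothetical, and the actor would need to call touch_heartbeat from its message loop for the probe to mean anything.

```python
import os
import time
from typing import Optional


def touch_heartbeat(path: str) -> None:
    """Actor side: create the file if needed and update its modification time."""
    with open(path, "a"):
        os.utime(path, None)


def heartbeat_is_fresh(
    path: str, max_age_seconds: float, now: Optional[float] = None
) -> bool:
    """Probe side: True if the file exists and was touched recently enough."""
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        # A missing or unreadable file counts as unhealthy.
        return False
    current = time.time() if now is None else now
    return (current - mtime) <= max_age_seconds


def probe_exit_code(path: str, max_age_seconds: float) -> int:
    """Exit code for an exec probe: 0 is healthy, 1 is stale or missing."""
    return 0 if heartbeat_is_fresh(path, max_age_seconds) else 1
```

An exec probe would run a small entry-point script that calls sys.exit(probe_exit_code(...)). The maximum age should exceed the longest gap the actor can legitimately go without touching the file; otherwise a worker waiting on a slow LLM call would be restarted mid-request.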

A built-in /healthz endpoint is on the roadmap; until then, the manifests use the placeholders below, labelled as such so the next operator does not mistake them for a real check:

# PLACEHOLDER: only checks the Python interpreter starts, not that
# the actor has a live NATS subscription.  Replace before production
# rollout — see Health Checks section above for guidance.
livenessProbe:
  exec:
    command: ["python", "-c", "import sys; sys.exit(0)"]
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3
# PLACEHOLDER: same caveat.  A real readiness probe should confirm
# the actor's subscribe() has completed, not just that the process
# is up — otherwise the service routes traffic to a not-yet-ready
# actor and the first messages are dropped (NATS is at-most-once,
# see Design Invariant 17).
readinessProbe:
  exec:
    command: ["python", "-c", "import sys; sys.exit(0)"]
  initialDelaySeconds: 5
  periodSeconds: 10

For NATS connectivity at startup (rather than at probe time), the CLI runs a pre-flight connectivity check internally before every actor command; see docs/runbooks/verify-nats-connectivity.md for the operator workflow.
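
A TCP-level reachability check of the kind listed among the probe options can be written with the standard library alone. This is a minimal sketch, not Heddle's pre-flight check: the function names are hypothetical, and it only confirms that the port accepts connections, not that the NATS protocol handshake succeeds.

```python
import socket
from typing import Tuple
from urllib.parse import urlparse

DEFAULT_NATS_PORT = 4222  # NATS default client port


def nats_endpoint(nats_url: str) -> Tuple[str, int]:
    """Extract (host, port) from a URL such as nats://nats:4222."""
    parsed = urlparse(nats_url)
    if not parsed.hostname:
        raise ValueError(f"cannot find a host in {nats_url!r}")
    return parsed.hostname, parsed.port or DEFAULT_NATS_PORT


def nats_port_open(nats_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the NATS endpoint can be opened."""
    host, port = nats_endpoint(nats_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Called with the value of NATS_URL, this catches a NATS outage or a wrong service name, but it says nothing about whether the actor itself is making progress.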


Horizontal Scaling

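Heddle actors scale horizontally via NATS queue groups with zero code changes. NATS delivers each message on a subject to only one member of a queue group, so multiple replicas of the same actor type share the load automatically instead of each processing every message.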

# Scale workers manually
kubectl scale deployment/heddle-worker --replicas=5 -n heddle
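
The delivery rule behind queue groups can be illustrated with a toy in-memory model. This is not NATS client code and involves no network: it only shows that when each message goes to one randomly chosen member, every message is handled once and the work spreads across replicas. The function name and replica names are made up for the example.

```python
import random
from collections import Counter
from typing import List


def deliver(message_count: int, members: List[str], rng: random.Random) -> Counter:
    """Deliver each message to exactly one randomly chosen group member.

    Returns how many messages each member handled.
    """
    handled = Counter({member: 0 for member in members})
    for _ in range(message_count):
        handled[rng.choice(members)] += 1
    return handled


replicas = [f"heddle-worker-{i}" for i in range(5)]
counts = deliver(1000, replicas, random.Random(0))
# Every message is counted once, regardless of the number of replicas.
total_handled = sum(counts.values())
```

Here total_handled equals the number of messages sent, and each replica ends up with roughly a fifth of them. Adding a replica changes only the members list, which is why scaling needs no code changes.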

HPA Auto-Scaling

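Use a HorizontalPodAutoscaler for CPU-based scaling. It requires a metrics source in the cluster (on Minikube, enable one with minikube addons enable metrics-server) and CPU requests on the worker containers, because utilization is measured as a percentage of the requested CPU: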

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: heddle-worker-hpa
  namespace: heddle
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: heddle-worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
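
To reason about what this HPA will do, it helps to work through the scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas × current utilization / target utilization), skipped when the ratio is within a tolerance of 1.0 (0.1 by default) and clamped to minReplicas and maxReplicas. The sketch below is a simplified model of that rule; it ignores stabilization windows, pods that are not ready, and missing metrics.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA calculation for a single Utilization metric."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds from the HPA spec.
    return max(min_replicas, min(max_replicas, desired))
```

With the manifest above (target 70, bounds 1 to 10), two workers averaging 140% of their CPU request scale to four, and eight workers at the same load are capped at ten.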

Pipeline orchestrators also support concurrent goal processing via max_concurrent_goals in config, which can complement horizontal scaling.


Persistent Volumes

Valkey requires persistent storage for checkpoint data:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
  namespace: heddle
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi

Mount the PVC in the Valkey deployment's pod spec:

volumes:
  - name: redis-data
    persistentVolumeClaim:
      claimName: redis-data
containers:
  - name: redis
    volumeMounts:
      - name: redis-data
        mountPath: /data

For local development setup, see Getting Started. For architecture details, see Architecture.