Inference Layer

PrivateRouter's inference layer is two things:

A pool of GPU nodes running OpenAI-compatible model servers (Ollama in dev, vLLM in production).
A LiteLLM proxy in front of that pool — it owns model routing, per-key budgets, retries, fallbacks, and cost telemetry.

This doc covers the operator surface: how nodes register, heartbeat, get benchmarked, get routed to, and what the config looks like.

Node lifecycle

register  →  heartbeat (every 60s)  →  benchmark  →  serve traffic
   │                │                                      │
   │                │   missed heartbeats > 180s            │
   │                ▼                                       │
   │           health → unhealthy ─────────────────────────┘
   │                                                        │
   └─── admin marks status='draining' → bleeds traffic off  │
                                                            │
                          admin marks status='offline'  ────┘

Register

Operators register nodes through the admin UI (/admin/gpu-nodes) or POST /api/admin/gpu-nodes. Required fields: name, provider, region, gpu_type, hourly_cost_usd, endpoint_url. The endpoint URL is the internal address (e.g. http://192.168.6.55:11434); it is never exposed to customers.
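
For the API path, a minimal sketch of the registration call is shown here. It assumes the endpoint takes a JSON body whose keys are the required field names listed above. The bearer-token header, the `admin_token` argument, and the sample node values are illustrative assumptions, not confirmed by this doc.

```python
import json
import urllib.request

REQUIRED_FIELDS = ("name", "provider", "region", "gpu_type",
                   "hourly_cost_usd", "endpoint_url")


def build_register_request(api_base: str, admin_token: str,
                           node: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /api/admin/gpu-nodes request."""
    missing = [f for f in REQUIRED_FIELDS if node.get(f) in (None, "")]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return urllib.request.Request(
        url=f"{api_base.rstrip('/')}/api/admin/gpu-nodes",
        data=json.dumps(node).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: admin endpoints accept a bearer token.
            "Authorization": f"Bearer {admin_token}",
        },
    )


# Sample values are made up for illustration.
example_node = {
    "name": "dev-ollama-0",
    "provider": "on-prem",
    "region": "lab",
    "gpu_type": "example-gpu",
    "hourly_cost_usd": 0.0,
    "endpoint_url": "http://192.168.6.55:11434",
}
request = build_register_request(
    "https://api.privaterouter.com", "ADMIN_TOKEN", example_node
)
# urllib.request.urlopen(request) would actually send it.
```

Validating the required fields client-side keeps a typo from turning into a half-registered node.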

Heartbeat

Every node runs the bundled scripts/heartbeat-agent.sh:

PRIVATEROUTER_API_URL=https://api.privaterouter.com \
NODE_ID=<uuid-from-admin-UI> \
HEARTBEAT_SECRET=<value-of-GPU_HEARTBEAT_SECRET> \
LOADED_MODELS=qwen3.6,deepseek-r1 \
INTERVAL_SECONDS=60 \
./scripts/heartbeat-agent.sh

The agent sends POST /api/internal/gpu-nodes/{node_id}/heartbeat via curl. When nvidia-smi is available, the JSON body carries the GPU probe data:

{
  "gpu_utilization": 73.5,
  "memory_used_gb": 18.25,
  "loaded_models": ["qwen3.6", "deepseek-r1"]
}

All probe fields are optional — a bare empty POST is still a valid liveness ping. The server stamps last_heartbeat_at = now() and sets health_status = 'healthy'.
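
For nodes where a shell agent is inconvenient, the same request can be sketched in Python. This is not the bundled agent. It assumes the secret is sent as an `Authorization: Bearer` header (the environment table describes it as a bearer for `/api/internal/*`), and the function name is hypothetical.

```python
import json
import urllib.request
from typing import Optional, Sequence


def build_heartbeat_request(
    api_url: str,
    node_id: str,
    secret: str,
    gpu_utilization: Optional[float] = None,
    memory_used_gb: Optional[float] = None,
    loaded_models: Optional[Sequence[str]] = None,
) -> urllib.request.Request:
    """Build (but do not send) one heartbeat POST for a node."""
    body = {}
    # Every probe field is optional, so only include what was measured.
    if gpu_utilization is not None:
        body["gpu_utilization"] = gpu_utilization
    if memory_used_gb is not None:
        body["memory_used_gb"] = memory_used_gb
    if loaded_models is not None:
        body["loaded_models"] = list(loaded_models)

    headers = {"Authorization": f"Bearer {secret}"}
    if body:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        url=f"{api_url.rstrip('/')}/api/internal/gpu-nodes/{node_id}/heartbeat",
        # With no probe data, send a bare empty POST as a liveness ping.
        data=json.dumps(body).encode("utf-8") if body else b"",
        method="POST",
        headers=headers,
    )


full = build_heartbeat_request(
    "https://api.privaterouter.com", "NODE_ID", "SECRET",
    gpu_utilization=73.5, memory_used_gb=18.25,
    loaded_models=["qwen3.6", "deepseek-r1"],
)
bare = build_heartbeat_request("https://api.privaterouter.com", "NODE_ID", "SECRET")
# urllib.request.urlopen(full) would actually send it.
```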

Stale detection

A background asyncio task in the API service runs every GPU_SWEEP_INTERVAL_SECONDS (default 30). It sets health_status = 'unhealthy' on any node whose last_heartbeat_at is older than GPU_STALE_AFTER_SECONDS (default 180). The sweep is idempotent: already-unhealthy nodes are skipped.

The sweep is disabled under pytest (PYTEST_CURRENT_TEST env var) so tests don't race the scheduler.
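
The sweep logic can be sketched as a pure function plus a loop. This is an illustration only: the real task works against the database, while this version uses in-memory dicts, and the function names are hypothetical.

```python
import asyncio
from datetime import datetime, timedelta, timezone

GPU_STALE_AFTER_SECONDS = 180
GPU_SWEEP_INTERVAL_SECONDS = 30


def sweep_stale(nodes: list, now: datetime,
                stale_after: int = GPU_STALE_AFTER_SECONDS) -> list:
    """Flip stale nodes to 'unhealthy'; return the ids that were flipped."""
    cutoff = now - timedelta(seconds=stale_after)
    flipped = []
    for node in nodes:
        if node["health_status"] == "unhealthy":
            continue  # idempotent: already-unhealthy nodes are skipped
        if node["last_heartbeat_at"] < cutoff:
            node["health_status"] = "unhealthy"
            flipped.append(node["id"])
    return flipped


async def sweep_loop(nodes: list,
                     interval: int = GPU_SWEEP_INTERVAL_SECONDS) -> None:
    """Run the sweep forever, sleeping `interval` seconds between passes."""
    while True:
        sweep_stale(nodes, datetime.now(timezone.utc))
        await asyncio.sleep(interval)
```

Keeping the comparison in a plain function makes it testable without starting the scheduler, which fits the choice to disable the loop under pytest.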

Benchmark

Operators run scripts/benchmark.py against a fresh node to capture its tokens/sec and latency profile:

python3 scripts/benchmark.py \
  --api-base https://api.privaterouter.com \
  --node-id <uuid> \
  --model qwen3.6 \
  --target-url http://192.168.6.55:11434/v1 \
  --target-key '' \
  --heartbeat-secret <value-of-GPU_HEARTBEAT_SECRET> \
  --concurrency 4 \
  --total 20

The script:

Fires --total non-streaming /v1/chat/completions requests against --target-url with --concurrency parallel workers.
Computes aggregate tokens/sec across the run, plus p50/p95 per-request latency.
POSTs the result to /api/internal/benchmarks.

Each run is persisted as one row in model_benchmarks and surfaced via GET /api/admin/benchmarks (full history) and GET /api/admin/benchmarks/latest (one row per node × model).
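
The arithmetic behind those numbers can be sketched as follows. This is not the code in scripts/benchmark.py. The nearest-rank percentile method and the result key names are assumptions, and the real payload schema for /api/internal/benchmarks is not shown in this doc.

```python
import math


def percentile(sorted_values: list, pct: int) -> float:
    """Nearest-rank percentile of an ascending list (pct in 1..100)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_s: list, total_tokens: int, wall_seconds: float) -> dict:
    """Aggregate one benchmark run.

    latencies_s  : per-request latency in seconds
    total_tokens : tokens produced across all requests
    wall_seconds : elapsed wall-clock time for the whole run
    """
    ordered = sorted(latencies_s)
    return {
        # Throughput is measured over the run, so concurrency is included.
        "tokens_per_second": total_tokens / wall_seconds,
        "p50_latency_s": percentile(ordered, 50),
        "p95_latency_s": percentile(ordered, 95),
    }


# Made-up sample: 20 requests with latencies of 1s through 20s.
result = summarize([float(n) for n in range(1, 21)],
                   total_tokens=4000, wall_seconds=10.0)
```

Dividing by wall-clock time rather than summed request time is what makes the figure reflect the --concurrency setting.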

LiteLLM config

The proxy config lives in infra/litellm/config.yaml. It binds public model names (e.g. privaterouter/qwen-pro) to a list of internal deployments. Anatomy of one entry:

- model_name: privaterouter/qwen-pro
  litellm_params:
    model: ollama_chat/qwen3.6:latest      # provider-prefixed model id
    api_base: http://192.168.6.55:11434    # internal endpoint
    input_cost_per_token: 0.0000005        # $0.50 / 1M tokens
    output_cost_per_token: 0.0000015       # $1.50 / 1M tokens
  model_info:
    id: dev-qwen-pro-0                     # unique per deployment
    group: privaterouter/qwen-pro          # === model_name
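
The two invariants in the comments above (a unique model_info.id per deployment, and group equal to model_name) are easy to break by copy-paste. Below is a small lint sketch, assuming the entries have already been parsed from YAML into Python dicts; the function name is hypothetical.

```python
def check_model_list(model_list: list) -> list:
    """Return human-readable problems found in parsed deployment entries."""
    problems = []
    seen_ids = set()
    for entry in model_list:
        name = entry.get("model_name")
        info = entry.get("model_info", {})
        dep_id = info.get("id")
        # group must mirror the public model_name.
        if info.get("group") != name:
            problems.append(
                f"{dep_id}: group {info.get('group')!r} != model_name {name!r}"
            )
        # id must be unique across all deployments.
        if dep_id in seen_ids:
            problems.append(f"{dep_id}: duplicate model_info.id")
        seen_ids.add(dep_id)
    return problems


entries = [
    {
        "model_name": "privaterouter/qwen-pro",
        "litellm_params": {
            "model": "ollama_chat/qwen3.6:latest",
            "api_base": "http://192.168.6.55:11434",
        },
        "model_info": {"id": "dev-qwen-pro-0", "group": "privaterouter/qwen-pro"},
    },
]
problems = check_model_list(entries)  # empty list when the entries are consistent
```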

Routing & reliability

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 2
  retry_after: 5
  allowed_fails: 3
  cooldown_time: 60
  fallbacks:
    - { "privaterouter/qwen-pro":        ["privaterouter/qwen-fast"] }
    - { "privaterouter/deepseek-code":   ["privaterouter/qwen-pro"] }
    - { "privaterouter/deepseek-reason": ["privaterouter/qwen-pro"] }
    - { "privaterouter/fast":            ["privaterouter/qwen-fast"] }

simple-shuffle picks a deployment at random inside a group (weighted by rpm/tpm when those are set); it is not a strict round-robin.
num_retries=2 + retry_after=5 — transient connection resets are common on Ollama; retry twice with a 5s gap.
allowed_fails=3 / cooldown_time=60: a deployment that fails more than three times within a minute is parked for 60s.
fallbacks — only fire when every deployment in the primary group is cooling down. Keep chains short: silently re-routing to a different family of model is a surprise to callers.
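
A fallback that names a group with no deployments cannot help. The sketch below lints for that, assuming the fallbacks list and the set of defined model_name groups have already been loaded from the config. The function name and the sample group set are illustrative.

```python
def undefined_fallback_groups(fallbacks: list, defined_groups: set) -> list:
    """Return groups named in fallbacks that have no deployments defined."""
    missing = []
    for mapping in fallbacks:
        for primary, targets in mapping.items():
            for group in [primary, *targets]:
                if group not in defined_groups and group not in missing:
                    missing.append(group)
    return missing


fallbacks = [
    {"privaterouter/qwen-pro": ["privaterouter/qwen-fast"]},
    {"privaterouter/deepseek-code": ["privaterouter/qwen-pro"]},
    {"privaterouter/deepseek-reason": ["privaterouter/qwen-pro"]},
    {"privaterouter/fast": ["privaterouter/qwen-fast"]},
]
# Illustrative: suppose only these groups have deployments in model_list.
defined = {
    "privaterouter/qwen-pro",
    "privaterouter/qwen-fast",
    "privaterouter/deepseek-code",
    "privaterouter/deepseek-reason",
}
missing = undefined_fallback_groups(fallbacks, defined)
```

With the sample set above, the check reports privaterouter/fast, since it appears as a primary but has no deployments in the illustrative group set.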

Adding a node to a group

Pick the public model group it should serve under.
Append a new entry to that model_name group's deployment list with a fresh model_info.id and the node's internal api_base.
Run docker compose restart litellm; the proxy reads its config only at boot.

Pricing in two places — keep in sync

Cost telemetry shows up in two places and they MUST match:

infra/litellm/config.yaml → input_cost_per_token / output_cost_per_token (per-token, dollars). What LiteLLM reports.
apps/api/app/services/seed_models.py → input_price_per_million / output_price_per_million (per million tokens, dollars). What the DB ledger uses to compute charges.

If you change one, change the other.
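
The two files use different units, so a direct comparison needs a conversion: the per-million price is the per-token cost times 1,000,000. Here is a minimal check, assuming the values have already been read out of both files; the function name is hypothetical.

```python
import math


def prices_in_sync(cost_per_token: float, price_per_million: float) -> bool:
    """True when the LiteLLM per-token cost matches the DB per-million price."""
    # isclose avoids false alarms from float rounding in tiny per-token values.
    return math.isclose(cost_per_token * 1_000_000, price_per_million,
                        rel_tol=1e-9)


# Values from the qwen-pro entry: $0.50 and $1.50 per 1M tokens.
input_ok = prices_in_sync(0.0000005, 0.50)
output_ok = prices_in_sync(0.0000015, 1.50)
drifted = prices_in_sync(0.0000005, 0.60)  # someone edited only one file
```

A check like this could run in CI so that a one-sided price change fails before it ships.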

Environment variables

| Var | Purpose | Default |
| --- | --- | --- |
| `LITELLM_MASTER_KEY` | Admin key for the LiteLLM proxy | required |
| `LITELLM_PROXY_URL` | How the API talks to LiteLLM | `http://litellm:4000` |
| `GPU_HEARTBEAT_SECRET` | Bearer for `/api/internal/*` | unset → 503 |
| `GPU_STALE_AFTER_SECONDS` | Heartbeat staleness threshold | `180` |
| `GPU_SWEEP_INTERVAL_SECONDS` | Sweep cadence | `30` |
| `PYTEST_DISABLE_BACKGROUND_TASKS` | Force-disable sweep loop | unset |

Migrating Ollama → vLLM

See docs/inference/vllm-migration.md for the conversion plan, expected throughput delta, and the docker-compose profile that runs vLLM side-by-side with Ollama for shadow testing.

Bringing up a production node

See docs/inference/production-bringup.md for the step-by-step DO/OVH runbook, including cloud-init scripts and the per-provider cost/perf comparison.
