Inference Layer
PrivateRouter's inference layer is two things:
- A pool of GPU nodes running OpenAI-compatible model servers (Ollama in dev, vLLM in production).
- A LiteLLM proxy in front of that pool — it owns model routing, per-key budgets, retries, fallbacks, and cost telemetry.
This doc covers the operator surface: how nodes register, heartbeat, get benchmarked, get routed to, and what the config looks like.
Node lifecycle
register → heartbeat (every 60s) → benchmark → serve traffic
- missed heartbeats > 180s: health → unhealthy
- admin marks status='draining': bleeds traffic off
- admin marks status='offline'
Register
Operators register nodes through the admin UI (/admin/gpu-nodes) or
POST /api/admin/gpu-nodes. Required fields: name, provider, region,
gpu_type, hourly_cost_usd, endpoint_url. The endpoint URL is the
internal address (e.g. http://192.168.6.55:11434); it is never exposed
to customers.
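A minimal sketch of checking a registration payload before sending it, assuming the endpoint accepts a flat JSON object carrying the six required fields. The helper name `build_registration` and the sample values are illustrative only and are not part of PrivateRouter.

```python
import json

# Required fields as listed in this doc.
REQUIRED_FIELDS = ("name", "provider", "region", "gpu_type",
                   "hourly_cost_usd", "endpoint_url")


def build_registration(**fields):
    """Return a JSON body for POST /api/admin/gpu-nodes.

    Hypothetical helper: raises ValueError when a required field is
    missing or empty, so a bad request fails before it hits the network.
    """
    missing = [f for f in REQUIRED_FIELDS if fields.get(f) in (None, "")]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return json.dumps(fields, sort_keys=True)


body = build_registration(
    name="dev-ollama-0",              # sample values, not real nodes
    provider="self-hosted",
    region="home-lab",
    gpu_type="RTX 4090",
    hourly_cost_usd=0.35,
    endpoint_url="http://192.168.6.55:11434",
)
```

Validating on the client side is only a convenience; the server remains the authority on what a valid node looks like.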
Heartbeat
Every node runs the bundled scripts/heartbeat-agent.sh:
PRIVATEROUTER_API_URL=https://api.privaterouter.com \
NODE_ID=<uuid-from-admin-UI> \
HEARTBEAT_SECRET=<value-of-GPU_HEARTBEAT_SECRET> \
LOADED_MODELS=qwen3.6,deepseek-r1 \
INTERVAL_SECONDS=60 \
./scripts/heartbeat-agent.sh
It sends POST /api/internal/gpu-nodes/{node_id}/heartbeat. When
nvidia-smi is available, the JSON body carries the GPU probe fields:
{
  "gpu_utilization": 73.5,
  "memory_used_gb": 18.25,
  "loaded_models": ["qwen3.6", "deepseek-r1"]
}
All probe fields are optional — a bare empty POST is still a valid
liveness ping. The server stamps last_heartbeat_at = now() and sets
health_status = 'healthy'.
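The body construction can be sketched in Python as follows. This is not the bundled shell agent, whose probing may differ; it assumes one line of `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits` output (percent, MiB) and assumes that `memory_used_gb` means MiB divided by 1024. The function name `heartbeat_body` is hypothetical.

```python
import json


def heartbeat_body(smi_line=None, loaded_models=None):
    """Build the heartbeat JSON body.

    smi_line: one CSV line such as "73, 18688" (utilization %, memory MiB),
    or None when nvidia-smi is not available on the node.
    """
    body = {}
    if smi_line:
        util, mem_mib = (part.strip() for part in smi_line.split(","))
        body["gpu_utilization"] = float(util)
        # Assumption: the API expects MiB / 1024, rounded to 2 decimals.
        body["memory_used_gb"] = round(float(mem_mib) / 1024, 2)
    if loaded_models:
        body["loaded_models"] = list(loaded_models)
    return json.dumps(body)


with_gpu = heartbeat_body("73, 18688", ["qwen3.6", "deepseek-r1"])
liveness_only = heartbeat_body()  # no probe fields at all
```

Because every field is optional, a node without a working GPU probe still reports liveness with the empty object.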
Stale detection
A background asyncio task in the API service runs every
GPU_SWEEP_INTERVAL_SECONDS (default 30). It flips any node whose
last_heartbeat_at is older than GPU_STALE_AFTER_SECONDS (default 180)
back to health_status = 'unhealthy'. Idempotent — already-unhealthy
nodes are skipped.
The sweep is disabled under pytest (PYTEST_CURRENT_TEST env var) so
tests don't race the scheduler.
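One pass of the sweep might look like the sketch below. It uses a list of dicts in place of database rows, which is an assumption made for illustration; the real service presumably issues an UPDATE and calls something similar every GPU_SWEEP_INTERVAL_SECONDS. The function name `sweep_stale` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

GPU_STALE_AFTER_SECONDS = 180  # default from this doc


def sweep_stale(nodes, now, stale_after=GPU_STALE_AFTER_SECONDS):
    """Flip stale healthy nodes to 'unhealthy'; return the ids that changed."""
    cutoff = now - timedelta(seconds=stale_after)
    flipped = []
    for node in nodes:
        if node["health_status"] == "unhealthy":
            continue  # idempotent: already-unhealthy nodes are skipped
        if node["last_heartbeat_at"] < cutoff:
            node["health_status"] = "unhealthy"
            flipped.append(node["id"])
    return flipped


now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
nodes = [
    {"id": "a", "health_status": "healthy",
     "last_heartbeat_at": now - timedelta(seconds=30)},
    {"id": "b", "health_status": "healthy",
     "last_heartbeat_at": now - timedelta(seconds=200)},
    {"id": "c", "health_status": "unhealthy",
     "last_heartbeat_at": now - timedelta(seconds=900)},
]
first = sweep_stale(nodes, now)    # only "b" is stale and still healthy
second = sweep_stale(nodes, now)   # a repeat pass changes nothing
```

Passing `now` in explicitly keeps the function deterministic, which sidesteps the scheduler race that motivates disabling the loop under pytest.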
Benchmark
Operators run scripts/benchmark.py against a fresh node to capture its
tokens/sec and latency profile:
python3 scripts/benchmark.py \
--api-base https://api.privaterouter.com \
--node-id <uuid> \
--model qwen3.6 \
--target-url http://192.168.6.55:11434/v1 \
--target-key '' \
--heartbeat-secret <value-of-GPU_HEARTBEAT_SECRET> \
--concurrency 4 \
--total 20
The script:
- Fires --total non-streaming /v1/chat/completions requests against --target-url with --concurrency parallel workers.
- Computes total tokens/sec across the run, plus p50/p95 per-request latency.
- POSTs the result to /api/internal/benchmarks.
Persisted as one row in model_benchmarks; surfaced via
GET /api/admin/benchmarks (full history) and
GET /api/admin/benchmarks/latest (one row per node × model).
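The aggregation step can be sketched as below. The real scripts/benchmark.py may count tokens or define percentiles differently; this version assumes one (tokens, latency) pair per request and uses the nearest-rank percentile, which is one of several common definitions. The names `percentile` and `summarize` are illustrative.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples, wall_seconds):
    """samples: list of (tokens, latency_seconds), one pair per request.

    wall_seconds is the elapsed time of the whole run, so the throughput
    figure reflects the parallel workers rather than a single request.
    """
    tokens = sum(t for t, _ in samples)
    latencies = [lat for _, lat in samples]
    return {
        "tokens_per_sec": tokens / wall_seconds,
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
    }


samples = [(100, 1.0), (120, 1.5), (80, 0.9), (100, 2.0)]
result = summarize(samples, wall_seconds=4.0)
```

Dividing by wall-clock time rather than summed latency matters once --concurrency is above 1, since requests overlap.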
LiteLLM config
The proxy config lives in infra/litellm/config.yaml. It binds public
model names (e.g. privaterouter/qwen-pro) to a list of internal
deployments. Anatomy of one entry:
- model_name: privaterouter/qwen-pro
  litellm_params:
    model: ollama_chat/qwen3.6:latest       # provider-prefixed model id
    api_base: http://192.168.6.55:11434     # internal endpoint
    input_cost_per_token: 0.0000005         # $0.50 / 1M tokens
    output_cost_per_token: 0.0000015        # $1.50 / 1M tokens
  model_info:
    id: dev-qwen-pro-0                      # unique per deployment
    group: privaterouter/qwen-pro           # === model_name
Routing & reliability
router_settings:
  routing_strategy: simple-shuffle
  num_retries: 2
  retry_after: 5
  allowed_fails: 3
  cooldown_time: 60
  fallbacks:
    - { "privaterouter/qwen-pro": ["privaterouter/qwen-fast"] }
    - { "privaterouter/deepseek-code": ["privaterouter/qwen-pro"] }
    - { "privaterouter/deepseek-reason": ["privaterouter/qwen-pro"] }
    - { "privaterouter/fast": ["privaterouter/qwen-fast"] }
- simple-shuffle picks a deployment at random from the group on each request (it is not a strict round-robin).
- num_retries=2 + retry_after=5: transient connection resets are common on Ollama; retry twice with a 5s gap.
- allowed_fails=3 / cooldown_time=60: three failures in a row parks the deployment for 60s.
- fallbacks: only fire when every deployment in the primary group is cooling down. Keep chains short: silently re-routing to a different family of model is a surprise to callers.
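To make the cooldown and fallback interplay concrete, here is a toy model of the selection step. It is not LiteLLM's code: it simply picks at random among deployments that are not cooling down (which is how LiteLLM documents simple-shuffle) and walks the fallback chain only when the primary group has none left. The deployment ids and the `pick_deployment` helper are hypothetical.

```python
import random

FALLBACKS = {
    "privaterouter/qwen-pro": ["privaterouter/qwen-fast"],
}


def pick_deployment(group, deployments, cooling_until, now, rng=random):
    """Return (group_used, deployment_id), or None if nothing is available.

    deployments:   {group: [deployment_id, ...]}
    cooling_until: {deployment_id: time at which its cooldown ends}
    """
    for candidate in [group] + FALLBACKS.get(group, []):
        live = [d for d in deployments.get(candidate, [])
                if cooling_until.get(d, 0) <= now]
        if live:
            return candidate, rng.choice(live)
    return None


deployments = {
    "privaterouter/qwen-pro": ["pro-0", "pro-1"],
    "privaterouter/qwen-fast": ["fast-0"],
}
# Both pro deployments were parked at t=100 for cooldown_time=60.
cooling = {"pro-0": 160, "pro-1": 160}
during = pick_deployment("privaterouter/qwen-pro", deployments, cooling, now=120)
after = pick_deployment("privaterouter/qwen-pro", deployments, cooling, now=200)
```

During the cooldown the caller is silently served by privaterouter/qwen-fast, which is exactly the surprise that short fallback chains are meant to limit.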
Adding a node to a group
- Pick the public model group it should serve under.
- Append a new entry to that model_name group's deployment list with a fresh model_info.id and the node's internal api_base.
- docker compose restart litellm (config is reloaded at boot).
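The append step can be sketched against an in-memory copy of the deployment list. This assumes the YAML has already been parsed into a list of dicts with the same keys as the entry shown earlier; cost fields are omitted for brevity, and the second node's address and id are made up for the example. `add_deployment` is a hypothetical helper.

```python
def add_deployment(model_list, group, model, api_base, deployment_id):
    """Append a deployment entry, refusing to reuse a model_info.id."""
    if any(e["model_info"]["id"] == deployment_id for e in model_list):
        raise ValueError(f"model_info.id already in use: {deployment_id}")
    entry = {
        "model_name": group,
        "litellm_params": {"model": model, "api_base": api_base},
        "model_info": {"id": deployment_id, "group": group},
    }
    model_list.append(entry)
    return entry


model_list = [{
    "model_name": "privaterouter/qwen-pro",
    "litellm_params": {"model": "ollama_chat/qwen3.6:latest",
                       "api_base": "http://192.168.6.55:11434"},
    "model_info": {"id": "dev-qwen-pro-0",
                   "group": "privaterouter/qwen-pro"},
}]

# Hypothetical second node joining the same public group.
add_deployment(model_list, "privaterouter/qwen-pro",
               "ollama_chat/qwen3.6:latest",
               "http://192.168.6.56:11434", "dev-qwen-pro-1")
```

Checking the id before appending catches the easiest mistake when copy-pasting an existing entry.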
Pricing in two places — keep in sync
Cost telemetry shows up in two places and they MUST match:
- infra/litellm/config.yaml → input_cost_per_token / output_cost_per_token (per-token, dollars). What LiteLLM reports.
- apps/api/app/services/seed_models.py → input_price_per_million / output_price_per_million (per million tokens, dollars). What the DB ledger uses to compute charges.
If you change one, change the other.
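The two units differ by a factor of one million, so a drift check is a single comparison. The sketch below is a possible CI guard, not something the repository is known to ship; `prices_in_sync` is a hypothetical name, and the tolerance only absorbs floating-point rounding.

```python
import math


def prices_in_sync(cost_per_token, price_per_million, rel_tol=1e-9):
    """True if a LiteLLM per-token cost matches a per-million ledger price."""
    return math.isclose(cost_per_token * 1_000_000, price_per_million,
                        rel_tol=rel_tol)


# Values from the qwen-pro entry: $0.50 and $1.50 per 1M tokens.
input_ok = prices_in_sync(0.0000005, 0.50)
output_ok = prices_in_sync(0.0000015, 1.50)
drifted = prices_in_sync(0.0000005, 0.60)  # ledger changed, config was not
```

Using `math.isclose` instead of `==` avoids false alarms from binary floating-point representation of values like 0.0000005.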
Environment variables
| Var | Purpose | Default |
|---|---|---|
| LITELLM_MASTER_KEY | Admin key for the LiteLLM proxy | required |
| LITELLM_PROXY_URL | How the API talks to LiteLLM | http://litellm:4000 |
| GPU_HEARTBEAT_SECRET | Bearer for /api/internal/* | unset → 503 |
| GPU_STALE_AFTER_SECONDS | Heartbeat staleness threshold | 180 |
| GPU_SWEEP_INTERVAL_SECONDS | Sweep cadence | 30 |
| PYTEST_DISABLE_BACKGROUND_TASKS | Force-disable sweep loop | unset |
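A minimal sketch of reading these variables with the documented defaults. How the API service actually loads its settings is not shown in this doc, so the function name `load_gpu_settings` and the returned key names are assumptions.

```python
import os


def load_gpu_settings(env=os.environ):
    """Read the inference-related variables, applying the table's defaults."""
    return {
        "litellm_proxy_url": env.get("LITELLM_PROXY_URL",
                                     "http://litellm:4000"),
        "stale_after_seconds": int(env.get("GPU_STALE_AFTER_SECONDS", "180")),
        "sweep_interval_seconds": int(env.get("GPU_SWEEP_INTERVAL_SECONDS",
                                              "30")),
        # None means unset; per the table, /api/internal/* then answers 503.
        "heartbeat_secret": env.get("GPU_HEARTBEAT_SECRET"),
    }


defaults = load_gpu_settings({})
tuned = load_gpu_settings({"GPU_STALE_AFTER_SECONDS": "300",
                           "GPU_HEARTBEAT_SECRET": "example-secret"})
```

Accepting the environment as a parameter makes the loader easy to exercise in tests without mutating `os.environ`.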
Migrating Ollama → vLLM
See docs/inference/vllm-migration.md for the conversion plan, expected
throughput delta, and the docker-compose profile that runs vLLM
side-by-side with Ollama for shadow testing.
Bringing up a production node
See docs/inference/production-bringup.md for the step-by-step DO/OVH
runbook, including cloud-init scripts and the per-provider cost/perf
comparison.