Autoscaling & Vast.ai
PrivateRouter ships with an autonomous per-model capacity-watcher system that decides when to add or remove GPU capacity, talks to Vast.ai to rent or destroy instances, and writes every decision to an auditable ledger.
This page covers what the system does, how the admin UI surfaces it, and the safety rails that keep it from spending money you didn't authorize.
What it does
For each model in the models table where agent_enabled = TRUE,
PrivateRouter spawns an asyncio.Task (a "capacity agent") that ticks
every ~30 seconds. On every tick the agent:
- Samples per-user usage from the last 60s of usage_events: counts requests, tokens in/out, latency p95, in-flight concurrency, current active replicas.
- Forecasts 60 minutes ahead using ordinary least-squares regression over the last 7 days of 5-minute usage buckets.
- Reads PnL: 24h rolling revenue (sum of total_cost_usd for this model) minus 24h cost (sum of hourly_cost_usd × 24 for agent-created active nodes).
- Decides scale_up, scale_down, or nop per the policy below.
- Writes a capacity_signals row with every metric + the decision + the reason. This happens always, regardless of action.
- Enqueues a scale_events row with status='planned' if the decision is non-nop and VAST_AUTOSCALER_ENABLED=true.
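The forecast step can be illustrated with a minimal sketch. It assumes the input is a chronological list of requests-per-minute values, one per 5-minute bucket; the function name forecast_rpm and the clamp at zero are illustrative choices, not taken from the PrivateRouter source.

```python
from typing import Sequence

BUCKET_MINUTES = 5  # bucket width described above


def forecast_rpm(bucket_rpm: Sequence[float], horizon_minutes: int = 60) -> float:
    """Fit y = a + b*x by ordinary least squares over bucket indices,
    then extrapolate horizon_minutes past the last bucket.

    Hypothetical helper: the real agent may weight or filter buckets differently.
    """
    n = len(bucket_rpm)
    if n == 0:
        return 0.0
    if n == 1:
        # A single point has no slope; carry the value forward.
        return max(0.0, float(bucket_rpm[0]))

    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(bucket_rpm) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, bucket_rpm))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Index of the point horizon_minutes after the most recent bucket.
    target_x = (n - 1) + horizon_minutes / BUCKET_MINUTES
    # Traffic cannot be negative, so clamp the extrapolation (assumption).
    return max(0.0, intercept + slope * target_x)
```

Seven days of 5-minute buckets is 2,016 points, so a closed-form fit like this is cheap enough to run on every tick.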
A separate provisioner loop drains planned scale events. For
scale_up it searches Vast.ai offers (POST /bundles/), picks the
cheapest one meeting the model's VRAM requirement, rents it
(PUT /asks/{id}/), and inserts a gpu_nodes row with
created_by_agent set. For scale_down it picks the lowest-utilization
agent-created node, drains it, and destroys it (DELETE /instances/{id}/).
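As a rough sketch of the offer-selection step, the function below filters offers by VRAM and takes the cheapest. The dictionary keys vram_gb and dph_usd are hypothetical names chosen for illustration; they are not the field names of the Vast.ai response, which the provisioner would need to map first.

```python
from typing import Optional


def pick_cheapest_offer(offers: list[dict], min_vram_gb: float) -> Optional[dict]:
    """Return the lowest $/hr offer that satisfies the VRAM requirement.

    Assumes each offer was already normalised to carry
    'vram_gb' and 'dph_usd' keys (illustrative names).
    """
    eligible = [o for o in offers if o["vram_gb"] >= min_vram_gb]
    if not eligible:
        # Nothing fits; the caller decides whether to retry or fail the event.
        return None
    return min(eligible, key=lambda o: o["dph_usd"])
```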
Decision math
# Scale UP if any of:
# - active_concurrency >= 0.8 × target_concurrency × current_replicas
# - forecast_rpm_60m >= 2.0 × capacity_rpm (and forecast > 1 rpm)
# - current_replicas == 0 AND any traffic detected (cold start)
#
# AND ALL of:
# - current_replicas < max_replicas
# - total_agent_dph + new_offer_dph <= VAST_MAX_HOURLY_USD
# - vast_spendable >= 2 × new_offer_dph (≥ 2h runway)
# - 24h_revenue - 24h_cost >= 0 (skipped if revenue == 0)
#
# Scale DOWN if all of:
# - active_concurrency <= 0.2 × target_concurrency × current_replicas
# - that condition has been true for ≥ 15 minutes
# - current_replicas > min_replicas
Tunables live in
apps/api/app/services/autoscaler/decision.py at
the top of the module (SCALE_UP_CONCURRENCY_RATIO,
SCALE_UP_FORECAST_RATIO, SCALE_DOWN_MIN_MINUTES, etc.). Changing them
requires updating this page and the smoke probe's expectations.
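The rules above can be read as a single pure function. The sketch below is one possible reading, not the contents of decision.py: the Snapshot fields, SCALE_DOWN_CONCURRENCY_RATIO, and the choice to return nop when a scale-up is wanted but blocked are all assumptions made for illustration.

```python
from dataclasses import dataclass

# SCALE_DOWN_CONCURRENCY_RATIO is a hypothetical name; the other three
# constants are the ones named on this page.
SCALE_UP_CONCURRENCY_RATIO = 0.8
SCALE_UP_FORECAST_RATIO = 2.0
SCALE_DOWN_CONCURRENCY_RATIO = 0.2
SCALE_DOWN_MIN_MINUTES = 15


@dataclass
class Snapshot:
    """Illustrative per-tick inputs; field names are assumptions."""
    active_concurrency: float = 0.0
    target_concurrency: float = 4.0
    current_replicas: int = 1
    min_replicas: int = 0
    max_replicas: int = 3
    forecast_rpm_60m: float = 0.0
    capacity_rpm: float = 60.0
    recent_requests: int = 0
    total_agent_dph: float = 0.5
    new_offer_dph: float = 0.5
    max_hourly_usd: float = 2.5
    vast_spendable: float = 10.0
    revenue_24h: float = 0.0
    cost_24h: float = 0.0
    low_util_minutes: float = 0.0


def decide(s: Snapshot) -> str:
    budget = s.target_concurrency * s.current_replicas

    wants_up = (
        # Guarded on replicas > 0 (assumption): with zero replicas the
        # comparison 0 >= 0 would otherwise fire without any traffic.
        (s.current_replicas > 0
         and s.active_concurrency >= SCALE_UP_CONCURRENCY_RATIO * budget)
        or (s.forecast_rpm_60m > 1
            and s.forecast_rpm_60m >= SCALE_UP_FORECAST_RATIO * s.capacity_rpm)
        or (s.current_replicas == 0 and s.recent_requests > 0)  # cold start
    )
    may_up = (
        s.current_replicas < s.max_replicas
        and s.total_agent_dph + s.new_offer_dph <= s.max_hourly_usd
        and s.vast_spendable >= 2 * s.new_offer_dph  # at least 2h runway
        and (s.revenue_24h == 0 or s.revenue_24h - s.cost_24h >= 0)
    )
    if wants_up:
        # Assumption: a blocked scale-up is a nop, never a scale-down.
        return "scale_up" if may_up else "nop"

    wants_down = (
        s.active_concurrency <= SCALE_DOWN_CONCURRENCY_RATIO * budget
        and s.low_util_minutes >= SCALE_DOWN_MIN_MINUTES
        and s.current_replicas > s.min_replicas
    )
    return "scale_down" if wants_down else "nop"
```

Keeping the decision free of I/O makes each rule easy to exercise with a hand-built snapshot, which is one way to keep a smoke probe's expectations in step with the tunables.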
Provisioner lifecycle
A gpu_nodes row's provisioning_state walks through these states for
an agent-created node:
| State | Meaning |
|---|---|
| searching | (transient, in-memory) Provisioner is calling /bundles/. |
| provisioning | Vast rent request sent, waiting for response. |
| installing | Instance is up, ollama serve + ollama pull running. |
| ready | Instance reachable, model loaded, in rotation. |
| dry_run | Synthetic node created under VAST_AUTOSCALER_DRY_RUN=true. |
| draining | Scale-down picked this node, waiting for in-flight to drain. |
| destroying | DELETE /instances/{id}/ sent, awaiting confirmation. |
| destroyed | Vast confirmed the instance is gone. |
| failed | Lifecycle aborted; see the matching scale_events.error. |
| manual | Operator-created node, never auto-touched. |
Safety rails
| Control | Default | Effect |
|---|---|---|
| VAST_AUTOSCALER_ENABLED | false | Master kill-switch. No scale_events are ever enqueued when this is false. Flip from the admin UI for a runtime-only toggle, or edit .env to persist. |
| VAST_AUTOSCALER_DRY_RUN | true | All Vast mutating calls return a VastDryRunResponse and write status='skipped_dry_run'. Edit .env and restart to flip. |
| VAST_MAX_HOURLY_USD | 2.50 | Provisioner refuses to plan any rent that would push the sum of agent-owned hourly cost above this. |
| VAST_MAX_INSTANCES | 3 | Cap on concurrent agent-created instances. |
| vast_spendable ≥ 2h runway | (computed) | Each rent requires ≥ 2 hours of runway on Vast at the offer's $/hr. |
| PnL ≥ 0 | (computed) | A model with measured 24h PnL below zero cannot trigger a scale-up. Brand-new models with zero revenue are exempt (cold start). |
| created_by_agent IS NOT NULL | (db) | Manual nodes are never auto-destroyed. Provisioner only touches rows it owns. |
| vast_contract_id UNIQUE | (db) | A duplicate provisioning attempt would 23505-fail before any double-billing could land. |
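To make the defaults in the table concrete, here is a small sketch of reading the four VAST_* variables with fail-safe parsing. The VastSafetyConfig class, the accepted boolean spellings, and the may_spend helper are assumptions for illustration; the actual settings loader may look different.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


@dataclass(frozen=True)
class VastSafetyConfig:
    enabled: bool
    dry_run: bool
    max_hourly_usd: float
    max_instances: int


def load_safety_config(env: Mapping[str, str] = os.environ) -> VastSafetyConfig:
    """Hypothetical loader using the defaults from the table above."""
    enabled_raw = env.get("VAST_AUTOSCALER_ENABLED", "false").strip().lower()
    dry_run_raw = env.get("VAST_AUTOSCALER_DRY_RUN", "true").strip().lower()
    return VastSafetyConfig(
        # Enabled only on an explicit truthy value.
        enabled=enabled_raw in _TRUE,
        # Dry-run stays on unless explicitly disabled, so a typo fails safe.
        dry_run=dry_run_raw not in _FALSE,
        max_hourly_usd=float(env.get("VAST_MAX_HOURLY_USD", "2.50")),
        max_instances=int(env.get("VAST_MAX_INSTANCES", "3")),
    )


def may_spend(cfg: VastSafetyConfig) -> bool:
    """Real money moves only when the kill-switch is on and dry-run is off."""
    return cfg.enabled and not cfg.dry_run
```

With an empty environment this yields enabled=False and dry_run=True, which matches the committed dry-run default described on this page.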
Admin UI
/admin/autoscaler
Live dashboard. Auto-refreshes every 10s. Shows:
- Kill-switch banner with one-click toggle, DRY-RUN vs LIVE badge, and a NO VAST API KEY banner if VAST_API_KEY is unset.
- Stat cards: agents enabled/running, agent-managed nodes, current burn vs cap, live Vast spendable balance.
- Per-model agents table: each row shows the agent badge (OFF, ENABLED · NOT RUNNING, RUNNING), current vs max replicas, last-tick RPM, in-flight concurrency, 60min forecast, capacity, $/hr burn, and the decision pill (with scale_up / scale_down / nop and the reason).
- Recent scale events table: the last 25 scale events with action, status pill (planned / executing / succeeded / failed / skipped_dry_run), Vast contract id, cost, and any error.
- Reconcile now button forces an immediate supervisor pass, which is useful after manual DB edits.
/admin/models
The catalog editor now has an Agent column with a per-row toggle. When
you click OFF on a model, a confirmation dialog appears explaining what
binding the agent does. When you flip an agent on, the supervisor is
poked immediately so the agent task spawns within ~1 second instead of
waiting for the 60s reconcile tick.
Creating a new model via "Add model" also has a "Bind autoscaler agent" checkbox.
Going live
The default committed state is fully dry-run. To start spending real Vast credit:
# Optional sanity check first — confirm the API key is right.
curl -s -H "Authorization: Bearer $VAST_API_KEY" https://console.vast.ai/api/v0/users/current/ | jq
# Edit .env (persistent across restarts).
sed -i 's/^VAST_AUTOSCALER_DRY_RUN=true$/VAST_AUTOSCALER_DRY_RUN=false/' .env
# (Or open it and review the four VAST_* vars together.)
docker compose restart api
Then flip the master kill-switch from the /admin/autoscaler page —
that change is runtime-only so you can disable in an emergency without
restarting. Re-flipping survives a process restart only if you also keep
.env consistent.
What it deliberately does NOT do
- It does not OpenRouter-resell. No closed-model providers are brokered; agents only provision instances of self-hosted open models whose weights they can pull through Ollama.
- It does not destroy manual nodes. Even when a model has plenty of manual capacity, the scale-down path won't touch a gpu_nodes row with created_by_agent IS NULL.
- It does not optimise for latency alone. The decision math weighs PnL: a model losing money over 24h cannot scale up, even at 100% concurrency utilisation. Override by upping its output_price_per_million_usd.
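A worked example of the PnL gate may help. The sketch below assumes revenue is the sum of per-request total_cost_usd values and cost is each agent-created node's hourly rate times 24, as described earlier on this page; the function names are hypothetical.

```python
def pnl_24h(request_costs_usd: list[float],
            node_hourly_costs_usd: list[float]) -> tuple[float, float]:
    """Return (revenue, cost) for the trailing 24 hours."""
    revenue = sum(request_costs_usd)
    cost = sum(hourly * 24 for hourly in node_hourly_costs_usd)
    return revenue, cost


def pnl_allows_scale_up(revenue: float, cost: float) -> bool:
    """Zero revenue is exempt (cold start); otherwise PnL must be non-negative."""
    if revenue == 0:
        return True
    return revenue - cost >= 0


# One agent-created node at $0.40/hr costs about $9.60 per day.
# $8.00 of billed usage leaves PnL at roughly -$1.60, so scale-up is blocked.
revenue, cost = pnl_24h([3.0, 5.0], [0.40])
blocked = not pnl_allows_scale_up(revenue, cost)
```

Raising the model's price increases the revenue side of this comparison, which is why the override above works.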
Tables behind the scenes
- capacity_signals: append-only, one row per agent tick. Time-series-shaped; the (model_id, ts) composite index is the hot path.
- scale_events: audit log. planned → executing → succeeded | failed | skipped_dry_run. Each row carries the offer snapshot JSON, the Vast contract id (or NULL on dry-run + a negative synthetic id on the materialised gpu_nodes row).
- gpu_nodes: extended with vast_contract_id (unique), vast_offer_id, provisioning_state, cost_basis_dph_usd, and created_by_agent. Existing manual node rows keep provisioning_state='manual' + created_by_agent IS NULL and are untouched by the autoscaler.
- models: extended with agent_enabled, min_replicas, max_replicas, target_concurrency (per-replica desired stream count).
M31 — Lifecycle tracking (expiration + auto-redeploy)
Vast.ai instances aren't open-ended rentals. Every host announces an
end_date — the moment they'll stop offering the box. For shared
3060/3090 hosts this can be as short as a few days; for stable
A100/H100 hosts it's usually 3–6 months. When the date hits, the host
powers the instance off and our model endpoint goes dark. Sometimes a
host also reboots its box mid-rental (maintenance, power blip), in
which case the instance comes back on its own — but if it stays down,
we need to replace it.
The NodeLifecyclePoller (started by AgentSupervisor alongside the
existing InstanceReadyPoller) closes that gap. Once every
GPU_LIFECYCLE_POLL_INTERVAL_SECONDS (default 300s = 5 min) it walks
every status='active' Vast-backed node and:
- Fetches GET /instances/{id} from Vast.
- Copies the upstream end_date into gpu_nodes.expires_at.
- Updates last_seen_running_at when the box reports running; clears powered_off_at if it previously went down.
- Stamps powered_off_at = now the first time it sees a non-running state. Once now - powered_off_at > GPU_POWER_OFF_GRACE_SECONDS (default 600s = 10 min, tolerating one reboot), the row is retired: status='inactive', health_status='auto_retired:powered_off:<vast_status>'.
- If expires_at is in the past, retires with reason auto_retired:expired_end_date.
- If Vast returns 404 (contract destroyed remotely), retires with reason auto_retired:vast_404.
Retirement is not a destroy — either the instance is already gone
or the host powered it off. We just stop counting the row toward the
model's replica total. The autoscaler's existing min_replicas=1 floor
on base models picks up the slack on its next tick and rents a fresh
instance.
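The retirement rules can be sketched as one function that returns a reason string or None. This is an illustration under assumptions: the function name, its parameters, the "exited" status value, and the order in which the checks run are not taken from the NodeLifecyclePoller source.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def retirement_reason(
    now: datetime,
    *,
    http_status: int,
    vast_status: Optional[str],
    expires_at: Optional[datetime],
    powered_off_at: Optional[datetime],
    grace_seconds: int = 600,
) -> Optional[str]:
    """Return the health_status reason if the row should be retired, else None."""
    # A 404 carries no instance payload, so it is checked first.
    if http_status == 404:
        return "auto_retired:vast_404"
    # Powered off for longer than the grace window (one reboot is tolerated).
    if vast_status != "running" and powered_off_at is not None:
        if (now - powered_off_at).total_seconds() > grace_seconds:
            return f"auto_retired:powered_off:{vast_status}"
    # The host's announced end_date has passed.
    if expires_at is not None and expires_at < now:
        return "auto_retired:expired_end_date"
    return None


# Usage: a box first seen down 11 minutes ago is past the 10-minute grace.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
reason = retirement_reason(
    now,
    http_status=200,
    vast_status="exited",  # hypothetical non-running status value
    expires_at=None,
    powered_off_at=now - timedelta(minutes=11),
)
```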
Admin endpoints (M31)
- GET /api/admin/autoscaler: the overview now includes a node_lifecycle array with one entry per active Vast-backed node: {name, current_model, expires_at, seconds_until_expiry, powered_off_at, ...}. The admin dashboard renders the expiration countdown directly from these fields.
- POST /api/admin/autoscaler/lifecycle-sweep: forces an immediate sweep instead of waiting for the next 5-min tick. Useful right after adding a new base model so the countdown shows up in the UI without delay.
Configuration
- GPU_LIFECYCLE_POLL_INTERVAL_SECONDS (default 300): sweep interval. Hard-floored to 30 seconds.
- GPU_POWER_OFF_GRACE_SECONDS (default 600): how long an instance may stay non-running before we retire its row. Hard-floored to 60.
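The hard floors amount to clamping the configured value from below. A minimal sketch, assuming the settings arrive as optional strings and that an unparseable value falls back to the default (that fallback is an assumption, as is the helper name):

```python
from typing import Optional


def effective_seconds(raw: Optional[str], default: int, floor: int) -> int:
    """Parse a seconds setting, fall back to the default, then apply the floor."""
    try:
        value = int(raw) if raw is not None else default
    except ValueError:
        # Assumption: garbage input is treated like an unset variable.
        value = default
    return max(floor, value)


# GPU_LIFECYCLE_POLL_INTERVAL_SECONDS: default 300, floor 30
poll_interval = effective_seconds("5", default=300, floor=30)
# GPU_POWER_OFF_GRACE_SECONDS: default 600, floor 60
grace = effective_seconds(None, default=600, floor=60)
```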
Base models — min_replicas=1 contract
For models flagged as "base" we set min_replicas=1 in seed.py:
- privaterouter/qwen3 (general)
- privaterouter/qwen3-coder (code)
- privaterouter/gpt-oss-120b (flagship reasoning)
The autoscaler decision logic guards scale_down behind
current_replicas > snap.min_replicas, so these can never be scaled
below 1 replica via traffic-driven decisions. When the lifecycle poller
retires their node (expiry / power-off / 404), the autoscaler immediately
re-rents on the next agent tick to honor the floor.
This is the only mechanism that protects base models from going
offline. If you want a model to always have a warm replica, give it
min_replicas=1. If you want it to scale to zero when idle, give it
min_replicas=0 — but then expect cold-starts on first traffic.