Autoscaling & Vast.ai

Per-model capacity-watcher agents, Vast.ai provisioning, money safety rails

PrivateRouter ships with an autonomous per-model capacity-watcher system that decides when to add or remove GPU capacity, talks to Vast.ai to rent or destroy instances, and writes every decision to an auditable ledger.

This page covers what the system does, how the admin UI surfaces it, and the safety rails that keep it from spending money you didn't authorize.

What it does

For each model in the models table where agent_enabled = TRUE, PrivateRouter spawns an asyncio.Task (a "capacity agent") that ticks every ~30 seconds. On every tick the agent:

  1. Samples per-user usage from the last 60s of usage_events — counts requests, tokens in/out, latency p95, in-flight concurrency, current active replicas.
  2. Forecasts 60 minutes ahead using ordinary least-squares regression over the last 7 days of 5-minute usage buckets.
  3. Reads PnL — 24h rolling revenue (sum of total_cost_usd for this model) minus 24h cost (sum of hourly_cost_usd × 24 for agent-created active nodes).
  4. Decides scale_up, scale_down, or nop per the policy below.
  5. Writes a capacity_signals row with every metric + the decision + the reason — always, regardless of action.
  6. Enqueues a scale_events row with status='planned' if the decision is non-nop and VAST_AUTOSCALER_ENABLED=true.
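The forecast in step 2 can be sketched as a plain least-squares line fit over the bucket series. This is a minimal illustration, not the shipped implementation: the function name, the choice of bucket index as the x-axis, and the clamp to zero are all assumptions. Seven days of 5-minute buckets is 2016 points.

```python
from typing import Sequence

BUCKET_MINUTES = 5  # bucket width named in the text


def forecast_rpm(bucket_rpm: Sequence[float], horizon_minutes: int = 60) -> float:
    """Fit y = a + b*x by ordinary least squares and extrapolate.

    bucket_rpm holds one requests-per-minute value per 5-minute bucket,
    oldest first; x is the bucket index (an assumption of this sketch).
    """
    n = len(bucket_rpm)
    if n == 0:
        return 0.0
    if n == 1:
        # A single point has no slope; carry it forward unchanged.
        return max(0.0, bucket_rpm[0])

    mean_x = (n - 1) / 2
    mean_y = sum(bucket_rpm) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(bucket_rpm))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Project from the newest bucket index out to the horizon.
    x_future = (n - 1) + horizon_minutes / BUCKET_MINUTES
    # Traffic cannot be negative, so clamp a falling trend at zero.
    return max(0.0, intercept + slope * x_future)
```

A straight-line fit is cheap enough to run on every tick, but it ignores daily seasonality, so treat the result as a trend indicator rather than a precise prediction.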

A separate provisioner loop drains planned scale events. For scale_up it searches Vast.ai offers (POST /bundles/), picks the cheapest one meeting the model's VRAM requirement, rents it (PUT /asks/{id}/), and inserts a gpu_nodes row with created_by_agent set. For scale_down it picks the lowest-utilization agent-created node, drains it, and destroys it (DELETE /instances/{id}/).
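The offer-selection step for scale_up might look like the sketch below. The dictionary keys `vram_gb` and `dph` are hypothetical names for this example; they are not claimed to match the field names in a Vast.ai search response.

```python
from typing import Optional


def pick_cheapest_offer(offers: list[dict], min_vram_gb: float) -> Optional[dict]:
    """Return the cheapest offer that satisfies the model's VRAM requirement.

    Each offer is assumed to carry 'vram_gb' and 'dph' (dollars per hour);
    both keys are illustrative.
    """
    eligible = [o for o in offers if o["vram_gb"] >= min_vram_gb]
    if not eligible:
        # Nothing fits; the caller would presumably mark the scale event failed.
        return None
    return min(eligible, key=lambda o: o["dph"])
```

Filtering before sorting by price keeps the VRAM requirement a hard constraint, so a cheap but undersized box can never win.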

Decision math

# Scale UP if any of:
#   - active_concurrency >= 0.8 × target_concurrency × current_replicas
#   - forecast_rpm_60m >= 2.0 × capacity_rpm   (and forecast > 1 rpm)
#   - current_replicas == 0 AND any traffic detected (cold start)
#
# AND ALL of:
#   - current_replicas < max_replicas
#   - total_agent_dph + new_offer_dph <= VAST_MAX_HOURLY_USD
#   - vast_spendable >= 2 × new_offer_dph (≥ 2h runway)
#   - 24h_revenue - 24h_cost >= 0   (skipped if revenue == 0)
#
# Scale DOWN if all of:
#   - active_concurrency <= 0.2 × target_concurrency × current_replicas
#   - that condition has been true for ≥ 15 minutes
#   - current_replicas > min_replicas

Tunables live in apps/api/app/services/autoscaler/decision.py at the top of the module (SCALE_UP_CONCURRENCY_RATIO, SCALE_UP_FORECAST_RATIO, SCALE_DOWN_MIN_MINUTES, etc.). Changing them requires updating this page and the smoke probe's expectations.
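As a rough sketch, the policy above could be expressed in Python as follows. The `Snapshot` container, its field names, and `SCALE_DOWN_CONCURRENCY_RATIO` are hypothetical; the other three constant names come from the text, with values taken from the decision math. Two choices here are assumptions of the sketch: the concurrency trigger is only evaluated when at least one replica exists (read literally, the threshold is zero with no replicas), and a blocked scale-up returns nop instead of falling through to scale-down.

```python
from dataclasses import dataclass

SCALE_UP_CONCURRENCY_RATIO = 0.8
SCALE_UP_FORECAST_RATIO = 2.0
SCALE_DOWN_MIN_MINUTES = 15
SCALE_DOWN_CONCURRENCY_RATIO = 0.2  # hypothetical name


@dataclass
class Snapshot:  # hypothetical container for one tick's metrics
    active_concurrency: float
    target_concurrency: float
    current_replicas: int
    min_replicas: int
    max_replicas: int
    forecast_rpm_60m: float
    capacity_rpm: float
    recent_rpm: float
    low_util_minutes: float  # how long the scale-down condition has held
    total_agent_dph: float
    new_offer_dph: float
    max_hourly_usd: float
    vast_spendable: float
    revenue_24h: float
    cost_24h: float


def decide(s: Snapshot) -> str:
    slots = s.target_concurrency * s.current_replicas

    # Any one trigger is enough to want more capacity.
    wants_up = (
        (s.current_replicas > 0
         and s.active_concurrency >= SCALE_UP_CONCURRENCY_RATIO * slots)
        or (s.forecast_rpm_60m > 1
            and s.forecast_rpm_60m >= SCALE_UP_FORECAST_RATIO * s.capacity_rpm)
        or (s.current_replicas == 0 and s.recent_rpm > 0)  # cold start
    )
    if wants_up:
        # The PnL guard is skipped when there is no revenue yet.
        pnl_ok = s.revenue_24h == 0 or s.revenue_24h - s.cost_24h >= 0
        can_up = (
            s.current_replicas < s.max_replicas
            and s.total_agent_dph + s.new_offer_dph <= s.max_hourly_usd
            and s.vast_spendable >= 2 * s.new_offer_dph  # at least 2h runway
            and pnl_ok
        )
        return "scale_up" if can_up else "nop"

    wants_down = (
        s.active_concurrency <= SCALE_DOWN_CONCURRENCY_RATIO * slots
        and s.low_util_minutes >= SCALE_DOWN_MIN_MINUTES
        and s.current_replicas > s.min_replicas
    )
    return "scale_down" if wants_down else "nop"


busy = Snapshot(
    active_concurrency=9, target_concurrency=4, current_replicas=2,
    min_replicas=1, max_replicas=3, forecast_rpm_60m=0, capacity_rpm=100,
    recent_rpm=50, low_util_minutes=0, total_agent_dph=0.8,
    new_offer_dph=0.4, max_hourly_usd=2.5, vast_spendable=10,
    revenue_24h=5, cost_24h=3,
)
print(decide(busy))  # scale_up
```

Keeping the triggers and the guards in separate expressions makes it easy to log which guard blocked a scale-up, which is the kind of reason string a capacity_signals row would carry.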

Provisioner lifecycle

A gpu_nodes row's provisioning_state walks through these states for an agent-created node:

| State | Meaning |
| --- | --- |
| searching | (transient, in-memory) Provisioner is calling /bundles/. |
| provisioning | Vast rent request sent, waiting for response. |
| installing | Instance is up, ollama serve + ollama pull running. |
| ready | Instance reachable, model loaded, in rotation. |
| dry_run | Synthetic node created under VAST_AUTOSCALER_DRY_RUN=true. |
| draining | Scale-down picked this node, waiting for in-flight to drain. |
| destroying | DELETE /instances/{id}/ sent, awaiting confirmation. |
| destroyed | Vast confirmed the instance is gone. |
| failed | Lifecycle aborted; see the matching scale_events.error. |
| manual | Operator-created node; never auto-touched. |

Safety rails

| Control | Default | Effect |
| --- | --- | --- |
| VAST_AUTOSCALER_ENABLED | false | Master kill-switch. No scale_events are ever enqueued when this is false. Flip from the admin UI for a runtime-only toggle, or edit .env to persist. |
| VAST_AUTOSCALER_DRY_RUN | true | All Vast mutating calls return a VastDryRunResponse and write status='skipped_dry_run'. Edit .env and restart to flip. |
| VAST_MAX_HOURLY_USD | 2.50 | Provisioner refuses to plan any rent that would push the sum of agent-owned hourly cost above this. |
| VAST_MAX_INSTANCES | 3 | Cap on concurrent agent-created instances. |
| vast_spendable ≥ 2h runway | (computed) | Each rent requires ≥ 2 hours of runway on Vast at the offer's $/hr. |
| PnL ≥ 0 | (computed) | A model with measured 24h PnL below zero cannot trigger a scale-up. Brand-new models with zero revenue are exempt (cold start). |
| created_by_agent IS NOT NULL | (db) | Manual nodes are never auto-destroyed. Provisioner only touches rows it owns. |
| vast_contract_id UNIQUE | (db) | A duplicate provisioning attempt would 23505-fail before any double-billing could land. |
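One way to load the four environment-driven rails so that a missing or mistyped value falls back to the safe default is sketched below. The `VastRails` class, the `load_rails` function, and the accepted true/false spellings are assumptions of this example; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def _flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean; unset or unrecognised values keep the safe default."""
    raw = env.get(name, "").strip().lower()
    if raw in _TRUE:
        return True
    if raw in _FALSE:
        return False
    return default


@dataclass(frozen=True)
class VastRails:  # hypothetical settings holder
    enabled: bool
    dry_run: bool
    max_hourly_usd: float
    max_instances: int


def load_rails(env: Mapping[str, str] = os.environ) -> VastRails:
    return VastRails(
        enabled=_flag(env, "VAST_AUTOSCALER_ENABLED", False),
        dry_run=_flag(env, "VAST_AUTOSCALER_DRY_RUN", True),
        max_hourly_usd=float(env.get("VAST_MAX_HOURLY_USD", "2.50")),
        max_instances=int(env.get("VAST_MAX_INSTANCES", "3")),
    )
```

With this shape, a typo such as `VAST_AUTOSCALER_DRY_RUN=flase` leaves dry-run on instead of silently going live.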

Admin UI

/admin/autoscaler

Live dashboard. Auto-refreshes every 10s. Shows:

  • Kill-switch banner with one-click toggle, DRY-RUN vs LIVE badge, and a NO VAST API KEY banner if VAST_API_KEY is unset.
  • Stat cards: agents enabled/running, agent-managed nodes, current burn vs cap, live Vast spendable balance.
  • Per-model agents table: each row shows the agent badge (OFF, ENABLED · NOT RUNNING, RUNNING), current vs max replicas, last-tick RPM, in-flight concurrency, 60min forecast, capacity, $/hr burn, and the decision pill (with scale_up/scale_down/nop and the reason).
  • Recent scale events table: the last 25 scale events with action, status pill (planned/executing/succeeded/failed/skipped_dry_run), Vast contract id, cost, and any error.
  • Reconcile now button forces an immediate supervisor pass — useful after manual DB edits.

/admin/models

The catalog editor now has an Agent column with a per-row toggle. Clicking OFF on a model opens a confirmation dialog that explains what binding the agent does. When you flip an agent on, the supervisor is poked immediately, so the agent task spawns within ~1 second instead of waiting for the 60s reconcile tick.

Creating a new model via "Add model" also has a "Bind autoscaler agent" checkbox.

Going live

The default committed state is fully dry-run. To start spending real Vast credit:

# Optional sanity check first — confirm the API key is right.
curl -s -H "Authorization: Bearer $VAST_API_KEY" https://console.vast.ai/api/v0/users/current/ | jq

# Edit .env (persistent across restarts).
sed -i 's/^VAST_AUTOSCALER_DRY_RUN=true$/VAST_AUTOSCALER_DRY_RUN=false/' .env
# (Or open it and review the four VAST_* vars together.)

docker compose restart api

Then flip the master kill-switch from the /admin/autoscaler page. That change is runtime-only, so you can disable the autoscaler in an emergency without restarting. The runtime toggle is lost on a process restart; to make the setting stick, set VAST_AUTOSCALER_ENABLED to the same value in .env.

What it deliberately does NOT do

  • It does not resell OpenRouter. No closed-model providers are brokered; agents only provision instances of self-hosted open models whose weights they can pull through Ollama.
  • It does not destroy manual nodes. Even when a model has plenty of manual capacity, the scale-down path won't touch a gpu_nodes row with created_by_agent IS NULL.
  • It does not optimise for latency alone. The decision math weighs PnL — a model losing money over 24h cannot scale up, even at 100% concurrency utilisation. Override by upping its output_price_per_million_usd.

Tables behind the scenes

  • capacity_signals: append-only, one row per agent tick. Time-series-shaped; the (model_id, ts) composite index is the hot path.
  • scale_events: audit log. Status moves planned → executing → succeeded | failed | skipped_dry_run. Each row carries the offer snapshot JSON and the Vast contract id. On a dry run the contract id is NULL, and the materialised gpu_nodes row gets a negative synthetic id.
  • gpu_nodes: extended with vast_contract_id (unique), vast_offer_id, provisioning_state, cost_basis_dph_usd, and created_by_agent. Existing manual node rows keep provisioning_state='manual' and created_by_agent IS NULL, and are untouched by the autoscaler.
  • models: extended with agent_enabled, min_replicas, max_replicas, target_concurrency (per-replica desired stream count).

M31 — Lifecycle tracking (expiration + auto-redeploy)

Vast.ai instances aren't open-ended rentals. Every host announces an end_date — the moment they'll stop offering the box. For shared 3060/3090 hosts this can be as short as a few days; for stable A100/H100 hosts it's usually 3–6 months. When the date hits, the host powers the instance off and our model endpoint goes dark. Sometimes a host also reboots its box mid-rental (maintenance, power blip), in which case the instance comes back on its own — but if it stays down, we need to replace it.

The NodeLifecyclePoller (started by AgentSupervisor alongside the existing InstanceReadyPoller) closes that gap. Once every GPU_LIFECYCLE_POLL_INTERVAL_SECONDS (default 300s = 5 min) it walks every status='active' Vast-backed node and:

  1. Fetches GET /instances/{id} from Vast.
  2. Copies the upstream end_date into gpu_nodes.expires_at.
  3. Updates last_seen_running_at when the box reports running; clears powered_off_at if it previously went down.
  4. Stamps powered_off_at = now the first time it sees a non-running state. Once now - powered_off_at > GPU_POWER_OFF_GRACE_SECONDS (default 600s = 10 min, tolerating one reboot), the row is retired: status='inactive', health_status='auto_retired:powered_off:<vast_status>'.
  5. If expires_at is in the past, retires with reason auto_retired:expired_end_date.
  6. If Vast returns 404 (contract destroyed remotely), retires with reason auto_retired:vast_404.

Retirement is not a destroy — either the instance is already gone or the host powered it off. We just stop counting the row toward the model's replica total. The autoscaler's existing min_replicas=1 floor on base models picks up the slack on its next tick and rents a fresh instance.
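The retirement rules in the sweep can be sketched as a single pure function. The function name and signature are hypothetical, and the order in which the three checks run is an assumption; the reason strings, the "running" status, and the grace default are taken from the text.

```python
from datetime import datetime, timedelta
from typing import Optional

GPU_POWER_OFF_GRACE_SECONDS = 600  # default from the text
GRACE = timedelta(seconds=GPU_POWER_OFF_GRACE_SECONDS)


def retirement_reason(
    now: datetime,
    http_status: int,
    vast_status: Optional[str],
    expires_at: Optional[datetime],
    powered_off_at: Optional[datetime],
) -> Optional[str]:
    """Return the health_status reason if the row should be retired, else None."""
    if http_status == 404:
        # Contract destroyed remotely; there is no instance body to inspect.
        return "auto_retired:vast_404"
    if expires_at is not None and expires_at < now:
        return "auto_retired:expired_end_date"
    if vast_status != "running" and powered_off_at is not None:
        # Only retire once the box has stayed down past the grace window.
        if now - powered_off_at > GRACE:
            return f"auto_retired:powered_off:{vast_status}"
    return None
```

Keeping this logic free of I/O means the poller can fetch the instance, update expires_at and powered_off_at, then call the function, and the rules stay easy to unit-test with fixed timestamps.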

Admin endpoints (M31)

  • GET /api/admin/autoscaler — overview now includes a node_lifecycle array with one entry per active Vast-backed node: {name, current_model, expires_at, seconds_until_expiry, powered_off_at, ...}. The admin dashboard renders the expiration countdown directly from these fields.
  • POST /api/admin/autoscaler/lifecycle-sweep — forces an immediate sweep instead of waiting for the next 5-min tick. Useful right after adding a new base model so the countdown shows up in the UI without delay.

Configuration

  • GPU_LIFECYCLE_POLL_INTERVAL_SECONDS (default 300) — sweep interval. Hard-floored to 30 seconds.
  • GPU_POWER_OFF_GRACE_SECONDS (default 600) — how long an instance may stay non-running before we retire its row. Hard-floored to 60.
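The hard floors could be applied when the settings are read, roughly as in this sketch. The helper name is hypothetical; the defaults and floors are the ones listed above.

```python
import os
from typing import Optional


def clamp_seconds(raw: Optional[str], default: int, floor: int) -> int:
    """Use the default when unset, then never go below the hard floor."""
    value = int(raw) if raw not in (None, "") else default
    return max(floor, value)


poll_interval = clamp_seconds(
    os.environ.get("GPU_LIFECYCLE_POLL_INTERVAL_SECONDS"), default=300, floor=30
)
power_off_grace = clamp_seconds(
    os.environ.get("GPU_POWER_OFF_GRACE_SECONDS"), default=600, floor=60
)
```

Under this sketch, setting GPU_LIFECYCLE_POLL_INTERVAL_SECONDS=5 yields an effective interval of 30 seconds.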

Base models — min_replicas=1 contract

For models flagged as "base" we set min_replicas=1 in seed.py:

  • privaterouter/qwen3 (general)
  • privaterouter/qwen3-coder (code)
  • privaterouter/gpt-oss-120b (flagship reasoning)

The autoscaler decision logic guards scale_down behind current_replicas > snap.min_replicas, so these can never be scaled below 1 replica via traffic-driven decisions. When the lifecycle poller retires their node (expiry / power-off / 404), the autoscaler re-rents on the next agent tick to honor the floor.

This is the only mechanism that protects base models from going offline. If you want a model to always have a warm replica, give it min_replicas=1. If you want it to scale to zero when idle, give it min_replicas=0 — but then expect cold-starts on first traffic.