Observability — Prometheus + Grafana

PrivateRouter ships with an opt-in observability stack: Prometheus scrapes the API's /metrics endpoint, and Grafana displays a pre-provisioned dashboard. Both run under a docker-compose profile, so they consume no resources in normal dev unless you start them.

Start the stack

docker compose --profile observability up -d

This adds two containers alongside the existing api, web, postgres, redis, and litellm services:

Service	URL	Default credentials
Prometheus	http://localhost:9090	none (read-only)
Grafana	http://localhost:3001	`admin` / `admin`

To use a different Grafana admin password, set GRAFANA_ADMIN_PASSWORD in .env before the first start. If you keep the default, Grafana prompts you to change it after the first login.
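
Once the containers are up, a quick reachability check can confirm that both services answer. The sketch below assumes the default ports from the table above and relies on Prometheus's /-/healthy and Grafana's /api/health endpoints. The names HEALTH_URLS, is_up, and stack_status, and the injectable fetch parameter, are illustrative and not part of PrivateRouter.

```python
from urllib.request import urlopen
from urllib.error import URLError

# Health endpoints on the default ports listed in the table above.
HEALTH_URLS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3001/api/health",
}


def is_up(url, fetch=urlopen, timeout=3):
    """Return True if the URL answers with HTTP 200, False on any network error."""
    try:
        with fetch(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx responses.
        return False


def stack_status(fetch=urlopen):
    """Map each service name to True (reachable) or False (unreachable)."""
    return {name: is_up(url, fetch) for name, url in HEALTH_URLS.items()}
```

Calling stack_status() from the host returns a dict with one boolean per service. The fetch parameter exists only so the check can be exercised without a running stack.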

Stop the stack

docker compose --profile observability stop prometheus grafana
# or to remove containers (volumes survive):
docker compose --profile observability rm -sf prometheus grafana

The base stack (api/web/postgres/redis/litellm) is unaffected.

What gets collected

The API emits the following custom Prometheus metrics (all prefixed pr_):

HTTP request metrics (every API request)

pr_http_requests_total{method,route,status} — counter
pr_http_request_duration_seconds{method,route,status} — histogram
pr_http_requests_in_flight{method,route} — gauge
pr_http_exceptions_total{method,route,exception_type} — counter

route is the matched FastAPI route template (e.g. /api/keys/{id}), not the raw path. This keeps label cardinality bounded.
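
To illustrate why the template keeps cardinality bounded, here is a stand-alone sketch that collapses raw paths onto route templates. FastAPI resolves the matched route itself, so this only imitates the effect. The TEMPLATES list and the "unmatched" fallback label are assumptions made for the example.

```python
import re

# Hypothetical route templates; the real ones come from the FastAPI router.
TEMPLATES = ["/api/keys/{id}", "/api/billing/balance"]


def compile_template(template):
    """Turn '/api/keys/{id}' into a regex that matches one path segment per placeholder."""
    parts = re.split(r"\{[^}]+\}", template)
    pattern = "[^/]+".join(re.escape(part) for part in parts)
    return re.compile(f"^{pattern}$")


COMPILED = [(template, compile_template(template)) for template in TEMPLATES]


def route_label(raw_path):
    """Return the matching template, or one catch-all label for unknown paths."""
    for template, regex in COMPILED:
        if regex.match(raw_path):
            return template
    return "unmatched"


# Four distinct raw paths collapse to two label values.
raw_paths = ["/api/keys/1", "/api/keys/2", "/api/keys/abc", "/api/billing/balance"]
labels = {route_label(path) for path in raw_paths}
```

However many key IDs exist, the route label for that endpoint takes a single value, so the number of time series depends on the number of routes rather than the number of resources.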

LLM / completion metrics

pr_llm_calls_total{model,endpoint,status} — counter
pr_llm_tokens_total{model,direction} — counter (direction is input or output)
pr_llm_cost_usd_total{model} — counter (running sum of usage row costs)
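
Because these are counters, they are normally read as rates rather than raw values. The following sketch shows the arithmetic behind a spend-per-hour figure derived from two samples of pr_llm_cost_usd_total. It is a simplified approximation of what a Prometheus rate() does, and the function name is hypothetical.

```python
def usd_per_hour(earlier, later):
    """Average spend rate between two (unix_ts, counter_value) samples."""
    (t0, v0), (t1, v1) = earlier, later
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    # A counter only goes up. A lower value means the process restarted and
    # the counter began again from zero, so the new value is the increase.
    delta = v1 - v0 if v1 >= v0 else v1
    return delta * 3600 / (t1 - t0)


# 2.50 USD spent over 30 minutes is a rate of 5 USD per hour.
rate = usd_per_hour((0, 10.0), (1800, 12.5))
```

The same calculation applies to pr_llm_tokens_total, with the result read as tokens per hour.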

Domain counters

pr_chat_conversations_created_total
pr_credit_transactions_total{kind} — topup / usage / admin_adjust / refund

GPU node health

pr_gpu_node_up{node_id,name} — gauge, 1 if healthy
pr_gpu_node_utilization_pct{node_id,name} — gauge, 0-100

User identity (user_id) is never a Prometheus label — high cardinality would blow up the time series database. Per-user correlation lives in the structured JSON access log instead (request_id + user_id).
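
A rough estimate shows the scale of the problem. The upper bound on time series for one metric is the product of the number of distinct values of each label. The counts below are hypothetical and only illustrate the order of magnitude.

```python
from math import prod


def series_upper_bound(label_values):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical counts of distinct values per label.
http_labels = {"method": 5, "route": 40, "status": 10}
with_user_id = {**http_labels, "user_id": 10_000}

bounded = series_upper_bound(http_labels)     # 2,000 series at most
unbounded = series_upper_bound(with_user_id)  # 20,000,000 series at most
```

Real deployments stay well under the upper bound because most label combinations never occur, but the multiplier contributed by a per-user label grows with every new account.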

Structured access logs

Every HTTP request emits one JSON line to stdout via the pr.access logger:

{
  "ts": 1778985270.62,
  "request_id": "c51fcef36f24437dba83982590ba5763",
  "method": "GET",
  "route": "/api/billing/balance",
  "raw_path": "/api/billing/balance",
  "status": 401,
  "latency_ms": 0.89,
  "user_id": null,
  "ip": "172.24.0.6",
  "user_agent": "Mozilla/5.0...",
  "bytes": 30
}

The same request_id is echoed back to the client as X-Request-ID. To correlate a customer report to a log line:

docker logs privaterouter-api 2>&1 | grep -F '"request_id": "c51fcef3'
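
When a text match is too brittle, the log lines can be parsed as JSON instead. The sketch below assumes one JSON object per line, as described above, and skips anything that is not valid JSON. The function name find_requests is hypothetical.

```python
import json


def find_requests(lines, request_id_prefix):
    """Yield parsed access-log records whose request_id starts with the prefix."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip non-JSON output such as startup banners
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if str(record.get("request_id", "")).startswith(request_id_prefix):
            yield record


sample = [
    "INFO: application startup complete",
    '{"request_id": "c51fcef36f24437dba83982590ba5763", "status": 401, '
    '"route": "/api/billing/balance", "user_id": null}',
]
matches = list(find_requests(sample, "c51fcef3"))
```

Any iterable of lines works as input, so sys.stdin can be passed when the output of docker logs is piped into a script.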

Clients can also send their own X-Request-ID header (e.g. from upstream tracing); the server will use it verbatim.
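
The rule described here can be sketched as a small helper: reuse the client's value when present, otherwise generate one. The 32-character hex format for generated IDs is an assumption inferred from the example log line, and resolve_request_id is a hypothetical name rather than the server's actual function.

```python
import uuid


def resolve_request_id(headers):
    """Return the client's X-Request-ID if present, else a fresh 32-char hex ID."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-request-id" and value:
            return value  # used verbatim, as the text above describes
    return uuid.uuid4().hex


from_client = resolve_request_id({"x-request-id": "trace-123"})
generated = resolve_request_id({"Accept": "application/json"})
```

Because the client's value is used verbatim, an upstream tracing system and the access log share one identifier without any mapping step.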

The Grafana dashboard

The pre-provisioned PrivateRouter — Overview dashboard (infra/observability/grafana/dashboards/privaterouter-overview.json) has 12 panels organised as:

Top-row stat tiles: rps, error rate, p95 latency, in-flight
Latency percentiles (p50/p95/p99) over time
Requests/sec split by status class (2xx / 4xx / 5xx)
Top 10 routes by request volume + top 10 by p95 latency
LLM tokens/sec by model and direction
LLM spend (USD/hour) by model
GPU node up/down and utilization %
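
For ad-hoc checks outside Grafana, panels like these can be approximated with PromQL sent to the Prometheus HTTP API. The queries below are illustrative guesses built from the metric names listed earlier, and the dashboard JSON holds the authoritative ones. They assume the status label carries numeric codes such as 401, and they count only 5xx responses as errors.

```python
from urllib.parse import urlencode

PROMETHEUS = "http://localhost:9090"

# Illustrative PromQL, not copied from the dashboard.
QUERIES = {
    "rps": "sum(rate(pr_http_requests_total[5m]))",
    "error_rate": (
        'sum(rate(pr_http_requests_total{status=~"5.."}[5m]))'
        " / sum(rate(pr_http_requests_total[5m]))"
    ),
    "p95_latency": (
        "histogram_quantile(0.95, "
        "sum by (le) (rate(pr_http_request_duration_seconds_bucket[5m])))"
    ),
    "in_flight": "sum(pr_http_requests_in_flight)",
    "llm_usd_per_hour": "sum by (model) (rate(pr_llm_cost_usd_total[1h])) * 3600",
}


def instant_query_url(name, base=PROMETHEUS):
    """Build a Prometheus instant-query URL for one named query."""
    return f"{base}/api/v1/query?{urlencode({'query': QUERIES[name]})}"


url = instant_query_url("in_flight")
```

Fetching the resulting URL returns a JSON document with the current value, which is handy for scripts and smoke tests.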

To add panels, edit in Grafana — allowUiUpdates: true is set so changes persist in the Grafana volume. Export and commit the JSON back to infra/observability/grafana/dashboards/ when ready to share.

Production hardening (when you put this on the internet)

The defaults are sane for development. Before exposing publicly:

Bind /metrics to the internal network only. Currently the API serves it on port 8000 alongside the public API. Either put nginx in front and block external access to /metrics, or run a separate uvicorn worker on a different port for metrics.
Rotate GRAFANA_ADMIN_PASSWORD from the default admin.
Enable HTTPS on the Prometheus + Grafana ingress.
Increase TSDB retention from the default 15d (--storage.tsdb.retention.time=...).
Set up alerting rules in Prometheus (infra/observability/prometheus.yml) or Grafana for high error rates, latency spikes, and GPU node down.
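
As an extra layer for the first item, the application could also refuse /metrics to non-internal clients. This is a different technique from the nginx or separate-worker options above and is not something PrivateRouter is stated to do. The sketch shows only the decision logic, with the hypothetical name may_read_metrics. Behind a reverse proxy the application sees the proxy's address, so this check cannot replace the proxy rule.

```python
import ipaddress


def may_read_metrics(path, client_ip):
    """Allow /metrics only for private or loopback addresses; other paths pass through."""
    if path != "/metrics":
        return True
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # unparseable address: deny
    return addr.is_private or addr.is_loopback


internal = may_read_metrics("/metrics", "172.24.0.6")  # Docker network address
external = may_read_metrics("/metrics", "8.8.8.8")     # public address
```

A middleware would call this with the request path and peer address, and return a 403 or 404 when it yields False.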

LiteLLM metrics

The Prometheus config includes a litellm scrape job pointed at http://litellm:4000/metrics. LiteLLM's metrics endpoint requires the master key as a bearer token, so the scrape job shows as a down/unknown target until the token is configured, which is left as a next step. See the LiteLLM Prometheus docs to enable it.
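
Before wiring the token into the scrape job, it can help to confirm by hand that the endpoint accepts it. The sketch below builds such a request with the standard library. The function name is hypothetical, and the key value is whatever your LiteLLM deployment uses as its master key.

```python
from urllib.request import Request

LITELLM_METRICS_URL = "http://litellm:4000/metrics"


def litellm_metrics_request(master_key):
    """Build a GET request for LiteLLM's metrics endpoint with bearer auth."""
    if not master_key:
        raise ValueError("master key is required")
    return Request(
        LITELLM_METRICS_URL,
        headers={"Authorization": f"Bearer {master_key}"},
    )


# Placeholder key for illustration only.
request = litellm_metrics_request("sk-example")
```

Passing the request to urllib.request.urlopen fetches the metrics text. The litellm hostname resolves only inside the compose network, so run the check from a container on that network.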