Token + cost estimator

PrivateRouter exposes a fast, authenticated endpoint that returns an estimated token count and USD cost for any prompt + model combination. The dashboard chat composer and the playground prompt textarea call it live (debounced by 300 ms) to render the small ghost label shown under the input, for example: ~47 tokens · ≈$0.000024.

This page documents the public contract, the tokenization caveats, and the caching strategy.

Endpoint

POST /api/tokens/estimate

Authentication: standard PrivateRouter session cookie or Authorization: Bearer <jwt>. Same auth surface as /api/usage.

Request

{
  "text": "Summarize this paragraph for me…",
  "model_public_name": "privaterouter/fast"
}

Field	Type	Notes
`text`	string	Up to 50,000 characters. Longer payloads → `422`.
`model_public_name`	string	A model from the catalog. Unknown → `404`.
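
As a minimal sketch of the request contract above, the following builds the HTTP request with Python's standard library. The base URL, the `build_estimate_request` helper name, and the `Content-Type: application/json` header are assumptions for illustration; only the path, the two body fields, the Bearer header, and the 50,000-character limit come from this page. Python's `len` counts Unicode code points, which may not match how the server counts characters.

```python
import json
import urllib.request

MAX_TEXT_CHARS = 50_000  # limit stated in the field table above


def build_estimate_request(base_url, jwt, text, model_public_name):
    """Build (but do not send) a POST /api/tokens/estimate request.

    `base_url` and this helper are hypothetical; they are not part of the API.
    """
    # Fail early on payloads the server would reject with 422.
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")

    body = json.dumps(
        {"text": text, "model_public_name": model_public_name}
    ).encode("utf-8")

    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/tokens/estimate",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {jwt}",
            "Content-Type": "application/json",  # assumed; the body is JSON
        },
    )
```

Passing the returned object to `urllib.request.urlopen` and decoding the body with `json.load` would yield the response object described in the next section.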

Response

{
  "tokens": 47,
  "cost_input_usd": "0.000024",
  "cost_output_estimated_usd": "0.000048",
  "model_public_name": "privaterouter/fast",
  "cached": false
}

tokens — estimated prompt token count.
cost_input_usd — tokens × input_price_per_million_usd / 1_000_000, formatted to 6 decimal places.
cost_output_estimated_usd — a rough projection of the response cost, assuming the completion is about twice as long as the prompt (roughly 2 × tokens) and is billed at the model's output price. The projection is intentionally generous so the label doesn't undersell the bill.
cached — true if the result was served from the 5-minute Redis cache.
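
The arithmetic behind the two cost fields can be sketched as follows. This is an illustration under stated assumptions, not the service's actual code: the `estimate_costs` name, the use of `Decimal`, the half-up rounding mode, and the exact 2× multiplier are all assumptions, and the prices in the usage note are made-up values chosen to reproduce the sample response.

```python
from decimal import Decimal, ROUND_HALF_UP

SIX_PLACES = Decimal("0.000001")
OUTPUT_MULTIPLIER = 2  # assumed value for the "~2x prompt" projection


def _format_usd(amount):
    # Fixed-point string with exactly 6 decimal places (rounding mode assumed).
    return f"{amount.quantize(SIX_PLACES, rounding=ROUND_HALF_UP):f}"


def estimate_costs(tokens, input_price_per_million_usd, output_price_per_million_usd):
    """Return both cost fields as strings, given per-million-token prices."""
    million = Decimal(1_000_000)
    cost_input = Decimal(tokens) * input_price_per_million_usd / million
    # Project a completion about twice as long as the prompt.
    projected_output_tokens = Decimal(tokens * OUTPUT_MULTIPLIER)
    cost_output = projected_output_tokens * output_price_per_million_usd / million
    return {
        "cost_input_usd": _format_usd(cost_input),
        "cost_output_estimated_usd": _format_usd(cost_output),
    }
```

With a hypothetical price of $0.51 per million tokens for both input and output, `estimate_costs(47, Decimal("0.51"), Decimal("0.51"))` gives `"0.000024"` and `"0.000048"`, matching the sample response above.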

Errors

Code	Reason
`401`	Missing / invalid auth.
`404`	`model_public_name` not in the catalog.
`422`	`text` longer than 50,000 chars, or body fails validation.

Tokenization

The server uses tiktoken with the cl100k_base encoding (GPT-3.5 / GPT-4 family). If tiktoken cannot be imported for any reason, the service falls back to a heuristic of roughly four characters per token: max(1, len(text) // 4).
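
The fallback behaviour described above might look roughly like this. It is a sketch, not the server's code: the function names are hypothetical, and passing `disallowed_special=()` (so that strings such as `<|endoftext|>` in user text are encoded as ordinary text instead of raising) is an assumption about how the service treats special tokens.

```python
try:
    import tiktoken  # optional dependency
except Exception:  # "not importable for any reason"
    tiktoken = None


def heuristic_token_count(text):
    # Roughly four characters per token, never less than one token.
    return max(1, len(text) // 4)


def estimate_token_count(text):
    """Count tokens with cl100k_base, or fall back to the heuristic."""
    if tiktoken is None:
        return heuristic_token_count(text)
    encoding = tiktoken.get_encoding("cl100k_base")
    # disallowed_special=() treats special-token strings as plain text (assumed).
    return len(encoding.encode(text, disallowed_special=()))
```

Note that the heuristic reports one token even for an empty string, which follows directly from the `max(1, ...)` in the expression quoted above.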

cl100k_base is only an approximation for non-OpenAI model families. Qwen, DeepSeek, Gemma, Llama and similar models tokenize with their own BPE vocabularies, so the actual token count from the upstream model can differ from our estimate by roughly ±15%. Treat the dollar figure as a ballpark; exact billing always uses the real token count returned by the model.

Caching + performance

Results are cached in Redis for 5 minutes under the key pr:token_est:<sha256[:16]>, where the hash is derived from the exact (text, model) pair. The hot path is essentially free:

p50 (cached): < 50 ms — single Redis GET.
p50 (uncached): a few ms of CPU for tiktoken + one DB row for the model lookup + one Redis SET.

Admin price edits propagate within 5 minutes (next cache expiry).
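
The caching behaviour in this section can be sketched as a cache-aside lookup. Several details here are assumptions: how the (text, model) pair is serialized before hashing is not specified on this page (the sketch joins them with a NUL byte), and a small in-memory class stands in for Redis so the example runs without a server. The `compute` callback is a hypothetical placeholder for the tokenization and pricing step.

```python
import hashlib
import time

CACHE_TTL_SECONDS = 300  # the 5-minute TTL stated above


def cache_key(text, model_public_name):
    # Serialization of the pair is an assumption; only the prefix and the
    # 16-character SHA-256 truncation come from the documentation.
    payload = f"{model_public_name}\x00{text}".encode("utf-8")
    return f"pr:token_est:{hashlib.sha256(payload).hexdigest()[:16]}"


class InMemoryTTLCache:
    """Stand-in for Redis GET / SET-with-expiry, for illustration only."""

    def __init__(self):
        self._store = {}

    def get(self, key, now):
        entry = self._store.get(key)
        if entry is None or now >= entry[1]:
            self._store.pop(key, None)  # drop expired entries
            return None
        return entry[0]

    def set(self, key, value, ttl, now):
        self._store[key] = (value, now + ttl)


def estimate_with_cache(cache, text, model_public_name, compute, now=None):
    """Return a cached estimate if present, otherwise compute and store it."""
    now = time.monotonic() if now is None else now
    key = cache_key(text, model_public_name)
    hit = cache.get(key, now)
    if hit is not None:
        return {**hit, "cached": True}
    result = compute(text, model_public_name)
    cache.set(key, result, CACHE_TTL_SECONDS, now)
    return {**result, "cached": False}
```

Because the computed cost is stored alongside the token count, a price change only becomes visible once the entry expires, which is consistent with the 5-minute propagation delay noted above.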

Frontend usage

import { estimateTokens } from "@/lib/api";
import { useTokenEstimate } from "@/lib/useTokenEstimate";

// One-shot:
const est = await estimateTokens("hello", "privaterouter/fast");

// Live (300ms-debounced, AbortController-cancellable):
function Composer({ prompt, model }) {
  const { tokens, costEstimate, loading } = useTokenEstimate(prompt, model);
  return <span>~{tokens} tokens · ≈${costEstimate}</span>;
}

The hook bails out cleanly when prompt is empty or model is null, and it resets immediately whenever either value changes, so stale numbers never linger.