Token + cost estimator
PrivateRouter exposes a fast, authenticated endpoint that returns an
estimated token count and USD cost for any prompt + model combination.
The dashboard chat composer and the playground prompt textarea call it
live (300ms-debounced) to render the small ghost label you see under the
input, for example ~47 tokens · ≈$0.000024.
This page documents the public contract, the tokenization caveats, and the caching strategy.
Endpoint
POST /api/tokens/estimate
Authentication: standard PrivateRouter session cookie or
Authorization: Bearer <jwt>. Same auth surface as /api/usage.
Request
{
  "text": "Summarize this paragraph for me…",
  "model_public_name": "privaterouter/fast"
}
| Field | Type | Notes |
|---|---|---|
| text | string | Up to 50,000 characters. Longer payloads → 422. |
| model_public_name | string | A model from the catalog. Unknown → 404. |
Response
{
  "tokens": 47,
  "cost_input_usd": "0.000024",
  "cost_output_estimated_usd": "0.000048",
  "model_public_name": "privaterouter/fast",
  "cached": false
}
- tokens: estimated prompt token count.
- cost_input_usd: tokens × input_price_per_million_usd / 1_000_000, formatted to 6 decimal places.
- cost_output_estimated_usd: a rough projection of the response cost, assuming a completion of about 2× the prompt length at the model's output price. This is intentionally generous so the label doesn't undersell the bill.
- cached: true if the result was served from the 5-minute Redis cache.
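The arithmetic behind the two cost fields can be sketched in a few lines of Python. This is an illustration, not the service's code: the function names, the half-up rounding mode, and the price of $0.51 per million tokens are assumptions. That price is hypothetical and was picked only because it reproduces the sample response above.

```python
from decimal import Decimal, ROUND_HALF_UP

SIX_PLACES = Decimal("0.000001")   # responses are formatted to 6 decimal places
OUTPUT_MULTIPLIER = 2              # completion assumed to be ~2x the prompt length


def format_usd(amount: Decimal) -> str:
    """Round to 6 decimal places and render without scientific notation."""
    # Assumption: half-up rounding; the page does not state the rounding mode.
    return f"{amount.quantize(SIX_PLACES, rounding=ROUND_HALF_UP):f}"


def estimate_costs(tokens: int,
                   input_price_per_million_usd: Decimal,
                   output_price_per_million_usd: Decimal) -> tuple[str, str]:
    """Return (cost_input_usd, cost_output_estimated_usd) as strings."""
    million = Decimal(1_000_000)
    cost_in = Decimal(tokens) * input_price_per_million_usd / million
    # Projected completion size: twice the prompt's token count.
    cost_out = Decimal(tokens * OUTPUT_MULTIPLIER) * output_price_per_million_usd / million
    return format_usd(cost_in), format_usd(cost_out)


# Hypothetical prices, not taken from the real catalog.
print(estimate_costs(47, Decimal("0.51"), Decimal("0.51")))
```

Using Decimal rather than float keeps the sixth decimal place stable, which matters when the amounts are fractions of a cent.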
Errors
| Code | Reason |
|---|---|
| 401 | Missing / invalid auth. |
| 404 | model_public_name not in the catalog. |
| 422 | text longer than 50,000 chars, or body fails validation. |
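A caller can avoid the most predictable 422 by checking the length limit before sending, and can map the documented status codes to readable messages. The sketch below is a client-side illustration only. The helper names are hypothetical, and Python's len counts Unicode code points, which may not match how the server counts characters.

```python
import json

MAX_TEXT_CHARS = 50_000  # documented limit for the "text" field

# Reasons copied from the error table; any other status is unexpected.
ERROR_REASONS = {
    401: "Missing or invalid auth.",
    404: "model_public_name not in the catalog.",
    422: "text longer than 50,000 chars, or body fails validation.",
}


def build_estimate_body(text: str, model_public_name: str) -> str:
    """Serialize the request body, rejecting oversized text locally."""
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    return json.dumps({"text": text, "model_public_name": model_public_name})


def describe_error(status: int) -> str:
    """Translate a documented error status into a human-readable reason."""
    return ERROR_REASONS.get(status, f"Unexpected status {status}.")


body = build_estimate_body("hello", "privaterouter/fast")
print(body)
print(describe_error(404))
```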
Tokenization
The server tokenizes with tiktoken using
the cl100k_base encoding (GPT-3.5 / GPT-4 family). If tiktoken
cannot be imported for any reason, the service falls back to a heuristic
of roughly four characters per token: max(1, len(text) // 4).
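A minimal sketch of that fallback logic, assuming a lazily loaded encoder, might look like the following. The function names are hypothetical. The broad exception handler, which also covers a failed download of the encoding file, and the choice to treat special-token strings as plain text are assumptions rather than documented behavior.

```python
from functools import lru_cache


def heuristic_tokens(text: str) -> int:
    """Fallback estimate: about four characters per token, never below 1."""
    return max(1, len(text) // 4)


@lru_cache(maxsize=1)
def load_encoder():
    """Return the cl100k_base encoder, or None if tiktoken is unusable."""
    try:
        import tiktoken
        return tiktoken.get_encoding("cl100k_base")
    except Exception:
        # Assumption: any failure (missing package, no network for the
        # vocabulary file) routes to the heuristic.
        return None


def estimate_tokens(text: str) -> int:
    """Count tokens with tiktoken when available, else use the heuristic."""
    encoder = load_encoder()
    if encoder is None:
        return heuristic_tokens(text)
    # disallowed_special=() encodes strings such as "<|endoftext|>" as
    # ordinary text instead of raising ValueError.
    return len(encoder.encode(text, disallowed_special=()))
```

Loading the encoder lazily and caching it means the import cost is paid at most once per process, and a missing dependency degrades the estimate instead of failing the request.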
For non-OpenAI families, cl100k_base is only an approximation. Qwen,
DeepSeek, Gemma, Llama and similar models tokenize with their own BPE
vocabularies, so the actual token count from the upstream model can
differ from our estimate by roughly ±15%. Treat the dollar figure as a
ballpark; exact billing always uses the real token count returned by
the model.
Caching + performance
Results are cached in Redis under pr:token_est:<sha256[:16]> with a
5-minute TTL keyed by the exact (text, model) pair. The hot path is
essentially free:
- p50 (cached): < 50 ms, a single Redis GET.
- p50 (uncached): a few ms of CPU for tiktoken + one DB row for the model lookup + one Redis SET.
Admin price edits propagate within 5 minutes (next cache expiry).
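The key scheme and the cached flag can be illustrated with a small sketch that uses a plain dict in place of Redis. How the (text, model) pair is serialized before hashing is not documented, so the NUL-separated layout below is an assumption, as are the function names and the use of the hex digest.

```python
import hashlib

KEY_PREFIX = "pr:token_est:"
TTL_SECONDS = 5 * 60  # 5-minute TTL


def cache_key(text: str, model_public_name: str) -> str:
    """Build pr:token_est:<first 16 hex chars of SHA-256>."""
    # Assumption: model and text are joined with a NUL byte before hashing.
    payload = f"{model_public_name}\x00{text}".encode("utf-8")
    return KEY_PREFIX + hashlib.sha256(payload).hexdigest()[:16]


def estimate_with_cache(store: dict, text: str, model: str, compute, now: float) -> dict:
    """Serve from the store while the entry is fresh, else compute and save."""
    key = cache_key(text, model)
    entry = store.get(key)
    if entry is not None and entry[0] > now:
        return {**entry[1], "cached": True}
    result = compute(text)
    store[key] = (now + TTL_SECONDS, result)  # stands in for a Redis SET with expiry
    return {**result, "cached": False}


store = {}
count = lambda t: {"tokens": max(1, len(t) // 4)}
print(estimate_with_cache(store, "hello world", "privaterouter/fast", count, now=0.0))
print(estimate_with_cache(store, "hello world", "privaterouter/fast", count, now=10.0))
```

Because the cached value embeds the price at computation time, a price change only becomes visible once the entry expires, which matches the 5-minute propagation window described above.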
Frontend usage
import { estimateTokens } from "@/lib/api";
import { useTokenEstimate } from "@/lib/useTokenEstimate";
// One-shot:
const est = await estimateTokens("hello", "privaterouter/fast");
// Live (300ms-debounced, AbortController-cancellable):
function Composer({ prompt, model }) {
  const { tokens, costEstimate, loading } = useTokenEstimate(prompt, model);
  return <span>~{tokens} tokens · ≈${costEstimate}</span>;
}
The hook bails out cleanly when prompt is empty or model is null,
and resets immediately when either changes, so stale numbers never
linger.