Context Window Limits
PrivateRouter enforces dynamic context window caps to keep inference throughput high, latency stable, and pricing honest. Bigger models get more context room; smaller models trade context for batching density.
Why this exists
KV-cache VRAM scales as:
vram_used = concurrent_streams × (input_tokens + output_tokens) × bytes_per_token
If every user could send 128K-token prompts at every model, even an H100 would only serve a handful of concurrent requests before OOM-ing. By capping context per request, we keep:
- Throughput high (more streams per GPU)
- Latency predictable (smaller KV caches → faster prefill)
- Pricing stable (no surprise bills from a 100K-token agent loop)
The policy
Caps scale with model parameter count:
| Model size | Hard cap |
|---|---|
| ≤ 7B | 16,384 tokens |
| ≤ 14B | 32,768 tokens |
| ≤ 32B | 65,536 tokens |
| ≤ 70B | 131,072 tokens |
| Embedding (Nomic) | 8,192 tokens |
Your plan multiplier scales the cap on top of that:
| Plan | Multiplier |
|---|---|
| Free | 0.5× |
| Starter | 0.75× |
| Pro | 1.0× |
| Developer | 1.5× |
| Team | 2.0× |
The effective cap for any request is:
effective_cap = min(model.max_context_tokens × plan.multiplier,
model.native_context_window)
The native window is always the absolute ceiling — a 2.0× multiplier on a model with a 32K native window still caps at 32K.
Current catalog
| Model | Params | Hard cap | Concurrency | VRAM |
|---|---|---|---|---|
privaterouter/fast (Gemma4) | 9B | 16,384 | 48 | 20 GB |
privaterouter/qwen-fast | 14B | 32,768 | 24 | 32 GB |
privaterouter/deepseek-code | 30B | 65,536 | 8 | 64 GB |
privaterouter/deepseek-reason | 32B | 65,536 | 6 | 72 GB |
privaterouter/qwen-pro (MoE) | 36B | 65,536 | 8 | 80 GB |
privaterouter/embed (Nomic) | 0.14B | 8,192 | 128 | 4 GB |
Live values are always visible on the Leaderboard per-model page.
What happens when you exceed the cap
Requests that estimate over your effective cap are rejected with a clear, OpenAI-shaped error:
HTTP/1.1 413 Payload Too Large
X-Context-Tokens-Estimated: 15234
X-Context-Cap-Effective: 8192
X-Context-Cap-Model: 32000
X-Context-Plan-Multiplier: 0.50
X-Context-Max-Reply-Tokens: 0
Content-Type: application/json
{
"error": {
"message": "Request input estimated at 15,234 tokens exceeds the 8,192-token cap for privaterouter/fast on this plan.",
"type": "invalid_request_error",
"code": "context_window_exceeded"
}
}
The X-Context-* headers give you everything you need to decide what to do — trim the prompt, upgrade your plan, or switch to a larger model.
Reply budget clipping
If your prompt fits but your max_tokens ask would push you over the cap, we silently clip it down to fit the headroom. The new value is reflected in the X-Context-Max-Reply-Tokens response header on the success response, and your billing only reflects the actual tokens generated.
How to estimate tokens client-side
A rough heuristic that matches ours: chars ÷ 4 for English-heavy text. For more precision, run tiktoken locally with the cl100k_base encoding — close enough for any model in our catalog.
def estimate_tokens(text: str) -> int:
return max(1, (len(text) + 3) // 4)
We add ~4 tokens of overhead per message (role tag + delimiters) and ~765 tokens per image (vision inputs).
Tips for working with the caps
- Long conversations: trim history client-side, or use the Chat UI which auto-trims to 24K tokens.
- Long documents: chunk + retrieve. Embeddings on Pro+ are bottomless within rate limits.
- Agent loops: cap your tool-output sizes; agents that copy whole pages will hit the cap fast.
- Codegen: pick
privaterouter/deepseek-code— 65K cap is enough for most repo-scale prompts on Pro+.
Roadmap
- Per-team caps (M24) — team admins can lower caps to manage shared spend
- Streaming token counter — see context usage in real time in the dashboard
- Custom plans — enterprise can negotiate non-standard multipliers and explicit hard caps