Context Window Limits

PrivateRouter enforces dynamic context window caps to keep inference throughput high, latency stable, and pricing honest. Bigger models get more context room; smaller models trade context for batching density.

Why this exists

KV-cache VRAM scales as:

vram_used = concurrent_streams × (input_tokens + output_tokens) × bytes_per_token

If every user could send 128K-token prompts at every model, even an H100 would only serve a handful of concurrent requests before OOM-ing. By capping context per request, we keep:

Throughput high (more streams per GPU)
Latency predictable (smaller KV caches → faster prefill)
Pricing stable (no surprise bills from a 100K-token agent loop)

The policy

Caps scale with model parameter count:

Model size	Hard cap
≤ 7B	16,384 tokens
≤ 14B	32,768 tokens
≤ 32B	65,536 tokens
≤ 70B	131,072 tokens
Embedding (Nomic)	8,192 tokens

Your plan multiplier scales the cap on top of that:

Plan	Multiplier
Free	0.5×
Starter	0.75×
Pro	1.0×
Developer	1.5×
Team	2.0×

The effective cap for any request is:

effective_cap = min(model.max_context_tokens × plan.multiplier,
                    model.native_context_window)

The native window is always the absolute ceiling — a 2.0× multiplier on a model with a 32K native window still caps at 32K.

Current catalog

Model	Params	Hard cap	Concurrency	VRAM
`privaterouter/fast` (Gemma4)	9B	16,384	48	20 GB
`privaterouter/qwen-fast`	14B	32,768	24	32 GB
`privaterouter/deepseek-code`	30B	65,536	8	64 GB
`privaterouter/deepseek-reason`	32B	65,536	6	72 GB
`privaterouter/qwen-pro` (MoE)	36B	65,536	8	80 GB
`privaterouter/embed` (Nomic)	0.14B	8,192	128	4 GB

Live values are always visible on the Leaderboard per-model page.

What happens when you exceed the cap

Requests that estimate over your effective cap are rejected with a clear, OpenAI-shaped error:

HTTP/1.1 413 Payload Too Large
X-Context-Tokens-Estimated: 15234
X-Context-Cap-Effective: 8192
X-Context-Cap-Model: 32000
X-Context-Plan-Multiplier: 0.50
X-Context-Max-Reply-Tokens: 0
Content-Type: application/json

{
  "error": {
    "message": "Request input estimated at 15,234 tokens exceeds the 8,192-token cap for privaterouter/fast on this plan.",
    "type": "invalid_request_error",
    "code": "context_window_exceeded"
  }
}

The X-Context-* headers give you everything you need to decide what to do — trim the prompt, upgrade your plan, or switch to a larger model.

Reply budget clipping

If your prompt fits but your max_tokens ask would push you over the cap, we silently clip it down to fit the headroom. The new value is reflected in the X-Context-Max-Reply-Tokens response header on the success response, and your billing only reflects the actual tokens generated.

How to estimate tokens client-side

A rough heuristic that matches ours: chars ÷ 4 for English-heavy text. For more precision, run tiktoken locally with the cl100k_base encoding — close enough for any model in our catalog.

def estimate_tokens(text: str) -> int:
    return max(1, (len(text) + 3) // 4)

We add ~4 tokens of overhead per message (role tag + delimiters) and ~765 tokens per image (vision inputs).

Tips for working with the caps

Long conversations: trim history client-side, or use the Chat UI which auto-trims to 24K tokens.
Long documents: chunk + retrieve. Embeddings on Pro+ are bottomless within rate limits.
Agent loops: cap your tool-output sizes; agents that copy whole pages will hit the cap fast.
Codegen: pick privaterouter/deepseek-code — 65K cap is enough for most repo-scale prompts on Pro+.

Roadmap

Per-team caps (M24) — team admins can lower caps to manage shared spend
Streaming token counter — see context usage in real time in the dashboard
Custom plans — enterprise can negotiate non-standard multipliers and explicit hard caps