Back to docs

Context Window Limits

Dynamic per-model + per-plan context caps for predictable throughput and pricing

Context Window Limits

PrivateRouter enforces dynamic context window caps to keep inference throughput high, latency stable, and pricing honest. Bigger models get more context room; smaller models trade context for batching density.

Why this exists

KV-cache VRAM scales as:

vram_used = concurrent_streams × (input_tokens + output_tokens) × bytes_per_token

If every user could send 128K-token prompts at every model, even an H100 would only serve a handful of concurrent requests before OOM-ing. By capping context per request, we keep:

  • Throughput high (more streams per GPU)
  • Latency predictable (smaller KV caches → faster prefill)
  • Pricing stable (no surprise bills from a 100K-token agent loop)

The policy

Caps scale with model parameter count:

Model sizeHard cap
≤ 7B16,384 tokens
≤ 14B32,768 tokens
≤ 32B65,536 tokens
≤ 70B131,072 tokens
Embedding (Nomic)8,192 tokens

Your plan multiplier scales the cap on top of that:

PlanMultiplier
Free0.5×
Starter0.75×
Pro1.0×
Developer1.5×
Team2.0×

The effective cap for any request is:

effective_cap = min(model.max_context_tokens × plan.multiplier,
                    model.native_context_window)

The native window is always the absolute ceiling — a 2.0× multiplier on a model with a 32K native window still caps at 32K.

Current catalog

ModelParamsHard capConcurrencyVRAM
privaterouter/fast (Gemma4)9B16,3844820 GB
privaterouter/qwen-fast14B32,7682432 GB
privaterouter/deepseek-code30B65,536864 GB
privaterouter/deepseek-reason32B65,536672 GB
privaterouter/qwen-pro (MoE)36B65,536880 GB
privaterouter/embed (Nomic)0.14B8,1921284 GB

Live values are always visible on the Leaderboard per-model page.

What happens when you exceed the cap

Requests that estimate over your effective cap are rejected with a clear, OpenAI-shaped error:

HTTP/1.1 413 Payload Too Large
X-Context-Tokens-Estimated: 15234
X-Context-Cap-Effective: 8192
X-Context-Cap-Model: 32000
X-Context-Plan-Multiplier: 0.50
X-Context-Max-Reply-Tokens: 0
Content-Type: application/json

{
  "error": {
    "message": "Request input estimated at 15,234 tokens exceeds the 8,192-token cap for privaterouter/fast on this plan.",
    "type": "invalid_request_error",
    "code": "context_window_exceeded"
  }
}

The X-Context-* headers give you everything you need to decide what to do — trim the prompt, upgrade your plan, or switch to a larger model.

Reply budget clipping

If your prompt fits but your max_tokens ask would push you over the cap, we silently clip it down to fit the headroom. The new value is reflected in the X-Context-Max-Reply-Tokens response header on the success response, and your billing only reflects the actual tokens generated.

How to estimate tokens client-side

A rough heuristic that matches ours: chars ÷ 4 for English-heavy text. For more precision, run tiktoken locally with the cl100k_base encoding — close enough for any model in our catalog.

def estimate_tokens(text: str) -> int:
    return max(1, (len(text) + 3) // 4)

We add ~4 tokens of overhead per message (role tag + delimiters) and ~765 tokens per image (vision inputs).

Tips for working with the caps

  • Long conversations: trim history client-side, or use the Chat UI which auto-trims to 24K tokens.
  • Long documents: chunk + retrieve. Embeddings on Pro+ are bottomless within rate limits.
  • Agent loops: cap your tool-output sizes; agents that copy whole pages will hit the cap fast.
  • Codegen: pick privaterouter/deepseek-code — 65K cap is enough for most repo-scale prompts on Pro+.

Roadmap

  • Per-team caps (M24) — team admins can lower caps to manage shared spend
  • Streaming token counter — see context usage in real time in the dashboard
  • Custom plans — enterprise can negotiate non-standard multipliers and explicit hard caps