KV cache & prefix caching
The KV cache is often the largest single consumer of HBM during inference, particularly at long context or high concurrency. How you manage it decides how many concurrent requests you can serve, how long your context window can be, and how cheaply you can serve repeated system prompts.
The math of the KV cache
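For every token in the context window, every transformer layer stores two tensors (Key and Value). Per token, per layer, the size is 2 × num_kv_heads × head_dim × dtype_bytes. On models with grouped-query attention (GQA), num_kv_heads is smaller than the number of query heads.
For Llama 3.3 70B in BF16, which has 64 query heads but only 8 KV heads: 80 layers × 8 KV heads × 128 dim × 2 bytes × 2 (K+V) ≈ 0.33 MB per token. A single 128K-context request needs ~43 GB just for KV, more than half of an 80 GB H100 before any weights are loaded. Without GQA (64 KV heads) the same request would need ~340 GB.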
This is why naïve inference servers can't handle long context at high batch sizes. The fix isn't a faster GPU — it's smarter memory layout.
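The per-token formula above can be sketched as a small calculator. This is only an illustration: `KvConfig`, `kvBytesPerToken`, and `kvBytesPerRequest` are hypothetical names, and the configurations assume 80 layers and a head dimension of 128, once with 64 KV heads and once with the 8 KV heads that Llama 3.3 70B's published configuration uses.

```typescript
// Illustrative calculator for KV cache size; names are hypothetical, not a platform API.
interface KvConfig {
  layers: number;
  kvHeads: number;    // key/value heads, which can be fewer than query heads under GQA
  headDim: number;
  dtypeBytes: number; // 2 for BF16, 1 for FP8, 0.5 for INT4
}

function kvBytesPerToken(c: KvConfig): number {
  // The leading 2 covers the separate Key and Value tensors.
  return 2 * c.layers * c.kvHeads * c.headDim * c.dtypeBytes;
}

function kvBytesPerRequest(c: KvConfig, contextTokens: number): number {
  return kvBytesPerToken(c) * contextTokens;
}

function report(label: string, c: KvConfig, contextTokens: number): void {
  const perTokenMB = kvBytesPerToken(c) / 1e6;
  const perRequestGB = kvBytesPerRequest(c, contextTokens) / 1e9;
  console.log(`${label}: ${perTokenMB.toFixed(2)} MB/token, ${perRequestGB.toFixed(1)} GB per request`);
}

const fullMha: KvConfig = { layers: 80, kvHeads: 64, headDim: 128, dtypeBytes: 2 };
const gqa: KvConfig = { ...fullMha, kvHeads: 8 };
const contextTokens = 128 * 1024;

report("64 KV heads", fullMha, contextTokens);
report("8 KV heads", gqa, contextTokens);
```

With 64 KV heads this gives about 2.6 MB per token and roughly 340 GB at 128K tokens; with 8 KV heads it gives about 0.33 MB per token and roughly 43 GB.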
PagedAttention
PagedAttention is inspired by virtual memory in operating systems. Instead of allocating one contiguous KV buffer per request (which fragments horribly), it splits each request's KV into fixed-size blocks (16 tokens each by default). The blocks live in a global pool and are allocated on demand.
Benefits compound:
- Near-zero fragmentation: blocks are uniform, so there is no external fragmentation, and only the last block of each sequence is partially filled.
- Block-level sharing — two sequences with identical prefixes physically share the same blocks.
- Copy-on-write branching — beam search and parallel sampling reuse the prefix blocks untouched.
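The bookkeeping behind these three properties can be sketched with a block table per sequence and a reference count per physical block. This is a simplified model under stated assumptions, not the real kernel or scheduler code: `BlockPool` and `Sequence` are hypothetical names, and the sketch tracks block ids only, without storing any KV data.

```typescript
// Simplified model of paged KV bookkeeping; class names are hypothetical.
const BLOCK_SIZE = 16; // tokens per block, the default mentioned above

class BlockPool {
  private free: number[];
  private refCount: number[];

  constructor(numBlocks: number) {
    // Reverse order so that pop() hands out block 0 first.
    this.free = Array.from({ length: numBlocks }, (_, i) => numBlocks - 1 - i);
    this.refCount = new Array<number>(numBlocks).fill(0);
  }

  allocate(): number {
    const id = this.free.pop();
    if (id === undefined) throw new Error("KV pool exhausted");
    this.refCount[id] = 1;
    return id;
  }

  share(id: number): number {
    this.refCount[id] += 1;
    return id;
  }

  release(id: number): void {
    this.refCount[id] -= 1;
    if (this.refCount[id] === 0) this.free.push(id);
  }

  refs(id: number): number { return this.refCount[id]; }
  freeCount(): number { return this.free.length; }
}

class Sequence {
  blockTable: number[] = []; // logical block index -> physical block id
  length = 0;                // tokens stored so far
  private pool: BlockPool;

  constructor(pool: BlockPool) { this.pool = pool; }

  appendToken(): void {
    if (this.length % BLOCK_SIZE === 0) {
      // No block yet, or the last one is full: allocate on demand.
      this.blockTable.push(this.pool.allocate());
    } else {
      // Writing into a partially filled block: take a private copy if it is shared.
      const last = this.blockTable.length - 1;
      const id = this.blockTable[last];
      if (this.pool.refs(id) > 1) {
        this.pool.release(id);
        this.blockTable[last] = this.pool.allocate(); // a real engine would copy the KV data here
      }
    }
    this.length += 1;
  }

  fork(): Sequence {
    // The child shares every physical block with the parent until it writes.
    const child = new Sequence(this.pool);
    child.blockTable = this.blockTable.map((id) => this.pool.share(id));
    child.length = this.length;
    return child;
  }
}

const pool = new BlockPool(8);
const parent = new Sequence(pool);
for (let i = 0; i < 40; i++) parent.appendToken(); // 40 tokens occupy 3 blocks (16 + 16 + 8)
const child = parent.fork();                        // shares all 3 blocks
child.appendToken();                                // copy-on-write on the partial last block
console.log(parent.blockTable, child.blockTable, pool.freeCount());
```

After the fork and one appended token, the two sequences still share their two full blocks and differ only in the last one, so the branch costs a single extra block.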
Prefix caching (the killer feature)
If 10,000 requests in your queue all start with the same 4,000-token system prompt, naïve inference recomputes those 4K tokens 10,000 times. With prefix caching, we compute it once, hash it, and reuse the KV blocks for every request that matches the prefix.
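One common way to implement the "hash it and reuse" step is to hash each full block of tokens together with the hash of the block before it, so that a block only matches when the entire prefix up to it matches. The sketch below assumes that scheme; `PrefixCache`, `hashBlock`, and the 32-bit FNV-1a hash are illustrative choices, not a description of the platform's internals. A production cache would use a wider hash and handle eviction.

```typescript
// Sketch of block-level prefix matching with chained hashes (assumed scheme).
const BLOCK_TOKENS = 16;

// FNV-1a over the parent hash plus this block's token ids.
function hashBlock(parentHash: number, tokens: number[]): number {
  let h = (0x811c9dc5 ^ parentHash) >>> 0;
  for (const t of tokens) {
    // Mix in the four bytes of each token id.
    for (let shift = 0; shift < 32; shift += 8) {
      h ^= (t >>> shift) & 0xff;
      h = Math.imul(h, 0x01000193) >>> 0;
    }
  }
  return h;
}

class PrefixCache {
  private blocks = new Map<number, number>(); // chained block hash -> physical block id
  private nextId = 0;

  // Counts prompt tokens served from cache and registers the blocks that were missing.
  lookupAndInsert(prompt: number[]): { cachedTokens: number; computedTokens: number } {
    let parent = 0;
    let cachedTokens = 0;
    let stillMatching = true;
    const fullBlocks = Math.floor(prompt.length / BLOCK_TOKENS);
    for (let b = 0; b < fullBlocks; b++) {
      const tokens = prompt.slice(b * BLOCK_TOKENS, (b + 1) * BLOCK_TOKENS);
      const h = hashBlock(parent, tokens);
      if (stillMatching && this.blocks.has(h)) {
        cachedTokens += BLOCK_TOKENS;
      } else {
        stillMatching = false;
        this.blocks.set(h, this.nextId++); // stands in for computing the KV once and publishing it
      }
      parent = h;
    }
    return { cachedTokens, computedTokens: prompt.length - cachedTokens };
  }
}

// Stand-in for a 4,000-token system prompt (exactly 250 blocks of 16 tokens).
const systemPrompt = Array.from({ length: 4000 }, (_, i) => i % 1000);
const cache = new PrefixCache();
const first = cache.lookupAndInsert([...systemPrompt, 7, 8, 9]);
const second = cache.lookupAndInsert([...systemPrompt, 42, 43]);
console.log(first, second);
```

The first request computes everything; the second finds all 4,000 prefix tokens in cache and only has to compute its own two trailing tokens. Only full blocks are cached in this sketch, so a trailing partial block is always recomputed.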
Real production numbers from a customer running RAG over a 6K-token template:
| Metric | No prefix cache | With prefix cache | Δ |
|---|---|---|---|
| TTFT (P50) | 342 ms | 48 ms | -86% |
| TTFT (P99) | 1.1 s | 210 ms | -81% |
| Cost per request | $0.0042 | $0.0007 | -83% |
| Effective batch capacity | 32 | 96 | +200% |
How to opt in
Prefix caching is on by default for every hosted model. The cache is per-organisation (your prompts never collide with another customer's) and uses a 5-minute TTL.
To get the best hit rate, structure your messages so the static prefix comes first and the variable user content comes last:
```typescript
// ✅ Good: the system prompt is byte-identical across requests, so it hits the cache
messages: [
  { role: "system", content: LARGE_SYSTEM_PROMPT_4K_TOKENS },
  { role: "user", content: userQuery }, // varies
]

// ❌ Bad: a per-user value at the start of the system prompt changes the prefix for every user
messages: [
  { role: "system", content: `You are helping ${userName}. ${LARGE_PROMPT}` },
  { role: "user", content: userQuery },
]
```
KV cache quantization
Even with paging, the KV cache is the biggest HBM consumer at long context. Quantizing the KV cache from 16-bit to 8-bit (separately from weight quantization) halves its memory footprint at minimal quality cost.
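To make the memory-versus-precision trade-off concrete, here is a minimal sketch of symmetric absmax quantization, a generic technique in which a group of values shares one scale. It is a stand-in for illustration only: hardware floating-point formats round differently, the function names are hypothetical, and the sample values are made up.

```typescript
// Generic symmetric absmax quantization; an illustration, not the platform's KV format.
function quantize(values: number[], bits: number): { q: Int8Array; scale: number } {
  const qmax = 2 ** (bits - 1) - 1;               // 127 for 8-bit, 7 for 4-bit
  const absMax = Math.max(...values.map((v) => Math.abs(v)));
  const scale = absMax === 0 ? 1 : absMax / qmax; // one scale shared by the whole group
  // Stored one value per byte for simplicity; a real 4-bit layout would pack two per byte.
  const q = Int8Array.from(values, (v) =>
    Math.max(-qmax, Math.min(qmax, Math.round(v / scale))),
  );
  return { q, scale };
}

function dequantize(q: Int8Array, scale: number): number[] {
  return Array.from(q, (v) => v * scale);
}

function maxAbsError(a: number[], b: number[]): number {
  return Math.max(...a.map((v, i) => Math.abs(v - b[i])));
}

const sampleKeys = [0.82, -0.41, 0.05, -1.27, 0.6, 0.0, 0.33, -0.9]; // made-up values
for (const bits of [8, 4]) {
  const { q, scale } = quantize(sampleKeys, bits);
  const err = maxAbsError(sampleKeys, dequantize(q, scale));
  console.log(`${bits}-bit: scale=${scale.toFixed(4)}, max round-trip error=${err.toFixed(4)}`);
}
```

The round-trip error is bounded by half the scale, so dropping from 8 to 4 bits trades a much coarser grid for half the storage again.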
Default for hosted models: FP8 KV (E4M3 for keys, E5M2 for values; a research trick that empirically preserves quality better than uniform FP8). To force an INT4 KV cache for ultra-long context:
```typescript
const resp = await client.chat.completions.create({
  model: "meta/llama-4-scout", // 10M context model
  messages: [
    { role: "user", content: hugeDocument /* 4M tokens */ },
  ],
  extra_body: {
    kv_cache_dtype: "int4", // 4× memory savings vs BF16
  },
});
```
Cache hit observability
Every response includes an x-luminet-cache-hit header with the percentage of prompt tokens served from cache. We surface a 24-hour rolling average in the dashboard so you can monitor and tune your prompt structure.
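If you want a client-side view as well, the header can be parsed and averaged locally. The sketch below is an assumption-laden example: the header name comes from the text above, but the exact value format (a bare number such as "87.5", possibly with a trailing "%") is a guess, and `parseCacheHitPercent` and `RollingHitRate` are hypothetical helpers.

```typescript
// Hypothetical helpers for reading the cache-hit header; the value format is assumed.
interface HeaderReader {
  get(name: string): string | null; // matches the shape of fetch's Headers.get
}

function parseCacheHitPercent(headers: HeaderReader): number | null {
  const raw = headers.get("x-luminet-cache-hit");
  if (raw === null) return null;
  const value = Number.parseFloat(raw.trim().replace(/%$/, ""));
  return Number.isFinite(value) ? value : null;
}

// Mean over the most recent `windowSize` responses.
class RollingHitRate {
  private samples: number[] = [];
  private windowSize: number;

  constructor(windowSize: number) { this.windowSize = windowSize; }

  record(percent: number): void {
    this.samples.push(percent);
    if (this.samples.length > this.windowSize) this.samples.shift();
  }

  mean(): number | null {
    if (this.samples.length === 0) return null;
    return this.samples.reduce((sum, v) => sum + v, 0) / this.samples.length;
  }
}

// Example with stubbed headers; with fetch you would pass `response.headers` instead.
const stub = (value: string | null): HeaderReader => ({ get: () => value });
const tracker = new RollingHitRate(2);
for (const headerValue of ["10", "20%", "30"]) {
  const pct = parseCacheHitPercent(stub(headerValue));
  if (pct !== null) tracker.record(pct);
}
console.log(tracker.mean());
```

A falling local average after a prompt change is a quick signal that something variable has crept into the prefix.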
TL;DR
Put your stable prompt prefix first. Watch the x-luminet-cache-hit header. Use FP8 KV (the default) for normal workloads and INT4 KV for ultra-long context. Prefix caching alone often cuts TTFT by 80%+ and bills by 70% or more.