Long-context inference

From 128K to 10M tokens — without melting your wallet. When to use Llama 4 Scout's 10M context vs RAG, prefix caching tradeoffs, KV quantization at scale, and chunked-prefill tuning.

Long context vs RAG: which to use when

Many teams reach for RAG because "the context window is too small." With Llama 4 Scout's 10M-token window, that constraint is gone — but RAG is still the right call sometimes. Here's how to decide:

Scenario	Use long context	Use RAG
Single document chat (book, contract, codebase)	✅	❌ overkill
Knowledge base of millions of docs	❌ won't fit	✅
Real-time updated facts (news, prices)	❌ context is stale	✅
Multi-doc synthesis where you know which docs	✅	❌ retrieval errors
Latency budget < 200 ms TTFT	❌ prefill is slow	✅
Same docs, many users (B2B SaaS)	✅ + prefix caching	❌ duplicates retrieval
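
The hard "no" rows of the table can be read as a small rule set. The sketch below is one way to encode them, not a definitive policy: the `Workload` shape and `chooseStrategy` are hypothetical names introduced here, and the thresholds come straight from the table (10M-token window, 200 ms TTFT).

```typescript
// Hypothetical workload description; field names are illustrative only.
interface Workload {
  corpusTokens: number;    // total tokens the model would need in context
  needsFreshData: boolean; // facts change in real time (news, prices)
  ttftBudgetMs: number;    // time-to-first-token budget in milliseconds
}

type Strategy = "long-context" | "rag";

// Llama 4 Scout's window, as stated in this doc.
const MAX_CONTEXT_TOKENS = 10_000_000;

function chooseStrategy(w: Workload): Strategy {
  if (w.corpusTokens > MAX_CONTEXT_TOKENS) return "rag"; // won't fit
  if (w.needsFreshData) return "rag";                    // context goes stale
  if (w.ttftBudgetMs < 200) return "rag";                // prefill is too slow
  return "long-context";
}
```

Anything that passes all three checks (a single book, a contract, a known set of documents) falls through to long context, ideally with prefix caching when many users share the same documents.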

The cost curve

A 1M-token prefill on Llama 4 Maverick FP8 takes ~12 seconds and costs about $0.27. That's fine for one-shot analysis but expensive at scale.

The trick: prefix caching (see the dedicated KV cache doc). If your 1M-token document is the system prompt and only the user question varies, the second request serves almost the entire prompt from cache, and the prefill cost drops by ~94% (the 16× ratio in the table below). A warm request costs pennies.
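
A prefix cache can generally only match when the leading part of the request is identical across calls, so the order of messages matters. Here is a minimal sketch of that discipline, assuming a chat-style message array; `buildMessages` and `sharedPrefixMessages` are hypothetical helpers written for this example, not part of any SDK.

```typescript
type Message = { role: "system" | "user"; content: string };

// Keep the large shared document first and byte-identical across requests;
// put everything that varies (the user question) after it.
function buildMessages(sharedDocument: string, question: string): Message[] {
  return [
    { role: "system", content: sharedDocument },
    { role: "user", content: question },
  ];
}

// Number of leading messages that two requests share exactly.
// Only this leading run is a candidate for a prefix-cache hit.
function sharedPrefixMessages(a: Message[], b: Message[]): number {
  let n = 0;
  while (
    n < a.length &&
    n < b.length &&
    a[n].role === b[n].role &&
    a[n].content === b[n].content
  ) {
    n++;
  }
  return n;
}
```

The common way to break this by accident is to prepend something that changes per request (a timestamp, a user name) to the system prompt. That makes the shared prefix empty, and every request pays the cold price.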

Workload	Cold cost (prefill + output)	Warm cost (prefill + output)	Prefill cost ratio
100K context, 500 out	$0.027 + $0.00043	$0.0016 + $0.00043	16×
1M context, 500 out	$0.27 + $0.00043	$0.016 + $0.00043	16×
10M context, 1K out	$2.70 + $0.00085	$0.16 + $0.00085	16×
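
To plug your own numbers into the table, the arithmetic can be sketched as below. The per-million-token prices are back-derived from the rows above (about $0.27 cold input, $0.016 cached input, $0.85 output) and are rounded assumptions, not a published price list; `estimateCostUSD` is a hypothetical helper.

```typescript
// Per-million-token prices back-derived from the table above (assumed, rounded).
const PRICE_PER_M = { coldInput: 0.27, cachedInput: 0.016, output: 0.85 };

// Cost of one request. `cachedTokens` is how many context tokens hit the prefix cache.
function estimateCostUSD(
  contextTokens: number,
  outputTokens: number,
  cachedTokens = 0,
): number {
  const cached = Math.min(cachedTokens, contextTokens);
  const cold = contextTokens - cached;
  return (
    (cold * PRICE_PER_M.coldInput +
      cached * PRICE_PER_M.cachedInput +
      outputTokens * PRICE_PER_M.output) /
    1_000_000
  );
}

// Total for `requests` calls over the same document: the first is cold, the rest are warm.
function estimateSessionCostUSD(
  contextTokens: number,
  outputTokens: number,
  requests: number,
): number {
  if (requests <= 0) return 0;
  const first = estimateCostUSD(contextTokens, outputTokens, 0);
  const warm = estimateCostUSD(contextTokens, outputTokens, contextTokens);
  return first + (requests - 1) * warm;
}
```

For example, `estimateCostUSD(1_000_000, 500)` reproduces the cold 1M row, and passing `1_000_000` as the third argument reproduces the warm one. This assumes every warm request actually hits the cache.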

KV cache quantization at long context

At 10M tokens on Llama 4 Scout, the KV cache alone wants ~640 GB in BF16, which won't fit even on an 8× H200 box. INT4 KV brings that down to ~160 GB and runs comfortably.

Quality impact at 10M context: we measured a 1.8% drop in Needle-in-a-Haystack recall with INT4 KV vs FP8 KV, which is acceptable for most document QA workloads. For golden-output regression tests, stick with FP8 KV (the default) and accept the smaller usable context.
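
The memory numbers above scale linearly with token count and bit width, so you can estimate other configurations yourself. A rough sketch, assuming the 640 GB BF16 figure at 10M tokens quoted in this doc (64,000 bytes per token) and ignoring any per-block scale or zero-point overhead that quantized formats add; the helper names are hypothetical.

```typescript
type KvDtype = "bf16" | "fp8" | "int4";

const BITS: Record<KvDtype, number> = { bf16: 16, fp8: 8, int4: 4 };

// 640 GB at 10M tokens in BF16 (figure from this doc) => 64,000 bytes per token.
const BF16_BYTES_PER_TOKEN = 640e9 / 10_000_000;

function bytesPerToken(dtype: KvDtype): number {
  return BF16_BYTES_PER_TOKEN * (BITS[dtype] / 16);
}

// Approximate KV cache size in GB (1 GB = 1e9 bytes) for a given context length.
function kvCacheGB(tokens: number, dtype: KvDtype): number {
  return (tokens * bytesPerToken(dtype)) / 1e9;
}

// Largest context that fits in a given KV memory budget.
function maxTokensForBudget(budgetGB: number, dtype: KvDtype): number {
  return Math.floor((budgetGB * 1e9) / bytesPerToken(dtype));
}
```

Under these assumptions, the ~160 GB that holds 10M tokens in INT4 holds about 5M tokens in FP8. That is the "smaller usable context" trade-off mentioned above.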

See quantization for the full quality-vs-memory trade-off table.

Chunked prefill tuning

Long prompts are split into chunks at the inference engine so they don't starve other requests' decode steps. Default chunk size is 2048 tokens.
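
To get a feel for what the chunk size means, here is a minimal sketch of the splitting step on its own. It only illustrates the arithmetic; a real engine also decides how to interleave these chunks with other requests' decode steps, which this example does not model. `planPrefillChunks` is a hypothetical name.

```typescript
// Split a prompt of `promptTokens` tokens into [start, end) ranges
// of at most `chunkSize` tokens each. The default matches the 2048 above.
function planPrefillChunks(
  promptTokens: number,
  chunkSize = 2048,
): Array<[number, number]> {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const chunks: Array<[number, number]> = [];
  for (let start = 0; start < promptTokens; start += chunkSize) {
    chunks.push([start, Math.min(start + chunkSize, promptTokens)]);
  }
  return chunks;
}
```

A 4M-token prompt becomes 1,954 chunks at the default size and 489 chunks at 8192. Fewer, larger chunks mean less per-chunk overhead for the long prompt, but each chunk occupies the engine for longer before other requests get a turn.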

If your workload is dominated by long-prompt analysis (and you can tolerate higher TTFT for other concurrent requests), bump it up:

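// Assumes `client` is an already-configured OpenAI-compatible SDK client
// and `bigDocument` is a string holding the full document text.
const resp = await client.chat.completions.create({
  model: "meta/llama-4-scout",
  messages: [
    { role: "system", content: bigDocument },  // ~4M tokens, kept first so it can be prefix-cached
    { role: "user",   content: "Summarize chapter 12." },
  ],
  extra_body: {
    prefill_chunk_size: 8192,    // larger chunks: faster prefill, higher TTFT for concurrent requests
    kv_cache_dtype: "int4",      // INT4 KV so the cache fits in HBM
  },
});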

Position interpolation: how 10M is even possible

Llama 4 Scout was trained with a context of up to 1M tokens, then extended to 10M via YaRN-style position interpolation plus a small fine-tuning pass on synthetic ultra-long-context tasks. It works well for retrieval and summarization, and less well for tasks that require long-range reasoning at the deep end of the context.

Practical advice: trust facts extracted from a 10M-token context. Be cautious about asking the model to do multi-step deduction across content separated by more than 2M tokens.
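
For intuition about what position interpolation does, here is a sketch of the plain linear variant applied to rotary (RoPE) angles: positions are divided by a scale factor so that a 10M-token position lands inside the range seen in training. This is deliberately the simpler linear scheme, not YaRN itself, which scales frequency bands differently. The `headDim` and `base` values are illustrative assumptions, not Scout's actual configuration.

```typescript
// Rotary angles for one token position, with optional linear position interpolation.
// scale = targetContext / trainedContext; scale = 1 means no interpolation.
function ropeAngles(
  position: number,
  headDim: number,
  scale = 1,
  base = 10_000,
): number[] {
  const angles: number[] = [];
  for (let i = 0; i < headDim / 2; i++) {
    // Standard RoPE inverse frequency for dimension pair i.
    const invFreq = Math.pow(base, (-2 * i) / headDim);
    // Interpolation squeezes the position into the trained range.
    angles.push((position / scale) * invFreq);
  }
  return angles;
}

// Extending a 1M-token training range to 10M gives a scale factor of 10.
const SCALE_1M_TO_10M = 10_000_000 / 1_000_000;
```

With that scale, position 10,000,000 produces the same angles that position 1,000,000 did before interpolation. The model never sees out-of-range angles, but neighbouring tokens end up closer together in angle space, which is one plausible reason fine-grained long-range reasoning degrades before simple retrieval does.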

Ranking long-context options

For most teams, this is our recommended progression:

1. Start with a 200K-context model (DeepSeek V3.2, Qwen3-Max) + RAG. Cheap, fast, and the embeddings give you fine-grained retrieval.
2. Move to 1M (Llama 4 Maverick) when retrieval is hurting you (you keep retrieving the wrong chunks). Use prefix caching aggressively.
3. Reach for 10M (Llama 4 Scout) only when 1M genuinely isn't enough. Use INT4 KV. Verify quality on your specific task with Needle-in-a-Haystack.

TL;DR

Long context isn't free, but with prefix caching it can be 16× cheaper than the headline price. Use INT4 KV at > 1M tokens. Don't reach for 10M before you've maxed out 1M with caching. RAG still wins for KB-style search and freshness.
