Long-context inference

From 128K to 10M tokens — without melting your wallet. When to use Llama 4 Scout's 10M context vs RAG, prefix caching tradeoffs, KV quantization at scale, and chunked-prefill tuning.

Long context vs RAG: which to use when

Many teams reach for RAG because "the context window is too small." With Llama 4 Scout's 10M-token window, that constraint is gone — but RAG is still the right call sometimes. Here's how to decide:

| Scenario | Use long context | Use RAG |
|---|---|---|
| Single document chat (book, contract, codebase) | ✅ | ❌ overkill |
| Knowledge base of millions of docs | ❌ won't fit | ✅ |
| Real-time updated facts (news, prices) | ❌ context is stale | ✅ |
| Multi-doc synthesis where you know which docs | ✅ | ❌ retrieval errors |
| Latency budget < 200 ms TTFT | ❌ prefill is slow | ✅ |
| Same docs, many users (B2B SaaS) | ✅ + prefix caching | ❌ duplicates retrieval |
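
The decision rules in the table can be sketched as a small helper. This is only an illustration: the `Workload` fields, the `chooseStrategy` name, and the order in which the checks run are assumptions made for this example, not part of any API.

```typescript
// Hypothetical workload descriptor; field names are illustrative only.
interface Workload {
  corpusTokens: number;       // total tokens across all candidate documents
  needsFreshFacts: boolean;   // news, prices, anything updated in real time
  ttftBudgetMs: number;       // time-to-first-token budget in milliseconds
  sharedAcrossUsers: boolean; // the same documents are served to many users
}

type Strategy = "rag" | "long-context" | "long-context + prefix caching";

// Llama 4 Scout's window, as stated in the text.
const MAX_CONTEXT_TOKENS = 10_000_000;

function chooseStrategy(w: Workload): Strategy {
  if (w.corpusTokens > MAX_CONTEXT_TOKENS) return "rag"; // won't fit
  if (w.needsFreshFacts) return "rag";                   // context would be stale
  if (w.ttftBudgetMs < 200) return "rag";                // prefill is too slow
  return w.sharedAcrossUsers
    ? "long-context + prefix caching"
    : "long-context";
}

// Example: one contract shared by many users of a B2B product.
const contractChat = chooseStrategy({
  corpusTokens: 400_000,
  needsFreshFacts: false,
  ttftBudgetMs: 3000,
  sharedAcrossUsers: true,
});
```

The hard constraints (size, freshness, latency) are checked first because any one of them rules out long context regardless of the other factors.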

The cost curve

A 1M-token prefill on Llama 4 Maverick FP8 takes ~12 seconds and costs about $0.27. That's fine for one-shot analysis but expensive at scale.

The trick is prefix caching (see the dedicated KV cache doc). If your 1M-token document is the system prompt and only the user question varies, the second request hits the cache for ~94% of the tokens, and the prompt costs about $0.016 instead of $0.27.
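
A cache hit depends on the prompt starting with exactly the same content every time. The sketch below shows one way to keep the document as a stable prefix and put the varying question last. `buildMessages` and `sharedPrefixLength` are hypothetical helpers written for this example, and the character-level comparison is only a rough stand-in for however the engine actually matches cached prefixes.

```typescript
type ChatMessage = { role: "system" | "user"; content: string };

// Keep the large document first and byte-identical across requests;
// only the trailing user message changes.
function buildMessages(document: string, question: string): ChatMessage[] {
  return [
    { role: "system", content: document },
    { role: "user", content: question },
  ];
}

// Length of the common leading run of two serialized prompts (in characters).
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a.charCodeAt(i) === b.charCodeAt(i)) i++;
  return i;
}

const doc = "FULL TEXT OF THE CONTRACT ...";
const first = buildMessages(doc, "Who are the parties?");
const second = buildMessages(doc, "What is the termination clause?");

// Everything up to the end of the document is shared between the two prompts.
const shared = sharedPrefixLength(
  first.map((m) => m.content).join("\n"),
  second.map((m) => m.content).join("\n"),
);
```

Anything that changes early in the prompt (a timestamp, a per-user greeting) shortens the shared prefix, so per-request content belongs after the document.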

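| Workload | Cold cost (input + output) | Warm cost (input + output) | Ratio (input cost) |
|---|---|---|---|
| 100K context, 500 out | $0.027 + $0.00043 | $0.0016 + $0.00043 | ~16× |
| 1M context, 500 out | $0.27 + $0.00043 | $0.016 + $0.00043 | ~16× |
| 10M context, 1K out | $2.70 + $0.00085 | $0.16 + $0.00085 | ~16× |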
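
The table can be reproduced with a few lines of arithmetic. The per-million-token prices below are back-derived from the table's own figures ($0.27 cold input, $0.016 cached input, about $0.85 output) and are assumptions for illustration, not an official price list. The sketch also assumes the entire context is a cache hit on the warm request.

```typescript
// Assumed prices in USD per million tokens, back-derived from the table.
const PRICE_PER_M = { input: 0.27, cachedInput: 0.016, output: 0.85 };

interface RequestCost {
  cold: number;       // total cost with no cache hit
  warm: number;       // total cost when the whole context is cached
  inputRatio: number; // cold input cost / warm input cost
  totalRatio: number; // cold total / warm total
}

function requestCost(contextTokens: number, outputTokens: number): RequestCost {
  const output = (outputTokens / 1e6) * PRICE_PER_M.output;
  const coldInput = (contextTokens / 1e6) * PRICE_PER_M.input;
  const warmInput = (contextTokens / 1e6) * PRICE_PER_M.cachedInput;
  return {
    cold: coldInput + output,
    warm: warmInput + output,
    inputRatio: coldInput / warmInput,
    totalRatio: (coldInput + output) / (warmInput + output),
  };
}

const oneMillion = requestCost(1_000_000, 500);
const hundredK = requestCost(100_000, 500);
```

Output tokens cost the same either way, so the saving on the whole request is a little smaller than the input-only ratio, and the gap widens for short contexts: with these assumed prices the 100K row comes out at roughly 13.5× end to end.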

KV cache quantization at long context

At 10M tokens on Llama 4 Scout, the KV cache alone needs ~640 GB in BF16, which won't fit even on an 8× H200 box. INT4 KV cuts that by 4× to ~160 GB and runs comfortably.

Quality impact at 10M context: we measured a 1.8% drop in Needle-in-a-Haystack recall with INT4 KV vs FP8 KV, which is acceptable for most document QA workloads. For golden-output regression tests, stick with FP8 KV (the default) and accept the smaller usable context.

See quantization for the full quality-vs-memory trade-off table.
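
The memory figures scale linearly with both token count and bits per element, which a short sketch makes easy to check. The per-token size is derived from the ~640 GB at 10M tokens quoted above. The function names are hypothetical, and the sketch ignores the small overhead of quantization scales and zero-points.

```typescript
type KvDtype = "bf16" | "fp8" | "int4";

const BITS_PER_ELEMENT: Record<KvDtype, number> = { bf16: 16, fp8: 8, int4: 4 };

// ~640 GB for 10M tokens in BF16 (figure from the text) => 64,000 bytes per token.
const BF16_BYTES_PER_TOKEN = 640e9 / 10_000_000;

function bytesPerToken(dtype: KvDtype): number {
  return BF16_BYTES_PER_TOKEN * (BITS_PER_ELEMENT[dtype] / 16);
}

// KV cache size in GB for a given context length and cache dtype.
function kvCacheGB(tokens: number, dtype: KvDtype): number {
  return (tokens * bytesPerToken(dtype)) / 1e9;
}

// Largest context that fits in a given KV memory budget.
function maxTokensForBudget(budgetGB: number, dtype: KvDtype): number {
  return Math.floor((budgetGB * 1e9) / bytesPerToken(dtype));
}

const atTenMillion = {
  bf16: kvCacheGB(10_000_000, "bf16"), // 640
  fp8: kvCacheGB(10_000_000, "fp8"),   // 320
  int4: kvCacheGB(10_000_000, "int4"), // 160
};
```

The same arithmetic shows the "smaller usable context" cost of staying on FP8: a 160 GB KV budget holds 10M tokens at INT4 but only 5M at FP8.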

Chunked prefill tuning

Long prompts are split into chunks by the inference engine so they don't starve other requests' decode steps. The default chunk size is 2048 tokens.

If your workload is dominated by long-prompt analysis (and you can tolerate higher TTFT for other concurrent requests), bump it up:

const resp = await client.chat.completions.create({
  model: "meta/llama-4-scout",
  messages: [
    { role: "system", content: bigDocument },  // 4M tokens
    { role: "user",   content: "Summarize chapter 12." },
  ],
  extra_body: {
    prefill_chunk_size: 8192,    // bigger chunks = faster prefill
    kv_cache_dtype: "int4",      // fit in HBM
  },
});
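
To get a feel for what the chunk size changes, the sketch below counts how many prefill chunks a prompt is split into. It assumes simple fixed-size splitting with a shorter final chunk; real schedulers are engine-specific, and `prefillChunkRanges` is a hypothetical helper, not an engine API.

```typescript
// Token ranges [start, end) for each prefill chunk, assuming fixed-size splitting.
function prefillChunkRanges(promptTokens: number, chunkSize: number): Array<[number, number]> {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  const ranges: Array<[number, number]> = [];
  for (let start = 0; start < promptTokens; start += chunkSize) {
    ranges.push([start, Math.min(start + chunkSize, promptTokens)]);
  }
  return ranges;
}

const promptTokens = 4_000_000; // the 4M-token document from the example above

const defaultChunks = prefillChunkRanges(promptTokens, 2048).length; // 1954
const largerChunks = prefillChunkRanges(promptTokens, 8192).length;  // 489
```

Fewer, larger chunks mean fewer scheduling round trips for the long prompt, but each chunk occupies the GPU for longer before other requests get a decode step, which is where the higher TTFT for concurrent requests comes from.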

Position interpolation: how 10M is even possible

Llama 4 Scout was trained with a context of up to 1M tokens, then extended to 10M via YaRN-style position interpolation plus a small fine-tuning pass on synthetic ultra-long-context tasks. It works well for retrieval and summarization, and less well for tasks that require long-range reasoning at the deep end of the context.

Practical advice: trust extracted facts at 10M. Be cautious about asking the model to do multi-step deduction across content separated by > 2M tokens.
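
For intuition, here is a minimal sketch of plain linear position interpolation, the simplest member of this family of techniques. It is not YaRN itself, which additionally treats frequency bands differently and adjusts attention scaling, and nothing here is taken from Llama 4's actual implementation. The function names, the RoPE base of 10000, and the head dimension are assumptions for illustration.

```typescript
// Rotary angle for one frequency pair: theta = pos * base^(-2i / d).
function ropeAngle(pos: number, pairIndex: number, headDim: number, base = 10000): number {
  return pos * Math.pow(base, (-2 * pairIndex) / headDim);
}

// Linear position interpolation: squeeze positions from the extended window
// back into the range the model saw during training.
function interpolatedPosition(pos: number, trainedLen: number, extendedLen: number): number {
  if (extendedLen <= trainedLen) return pos; // nothing to compress
  return (pos * trainedLen) / extendedLen;
}

const trainedLen = 1_000_000;   // training context from the text
const extendedLen = 10_000_000; // extended context from the text

// The last position of the 10M window lands on the edge of the trained range.
const deepest = interpolatedPosition(extendedLen, trainedLen, extendedLen); // 1_000_000

// Angle the model actually sees for a token 4M positions in (head dim assumed 128).
const angle = ropeAngle(interpolatedPosition(4_000_000, trainedLen, extendedLen), 0, 128);
```

Compressing positions by 10× also compresses the differences between them, which is one intuition for why fine-grained reasoning across distant spans suffers more than simple lookup.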

Ranking long-context options

For most teams, this is our recommended progression:

  1. Start with a 200K-context model (DeepSeek V3.2, Qwen3-Max) + RAG. Cheap, fast, and the embeddings give you fine-grained retrieval.
  2. Move to 1M (Llama 4 Maverick) when retrieval is hurting you (you keep retrieving the wrong chunks). Use prefix caching aggressively.
  3. Reach for 10M (Llama 4 Scout) only when 1M genuinely isn't enough. Use INT4 KV. Verify quality on your specific task with Needle-in-a-Haystack.
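
For the verification step, a Needle-in-a-Haystack check needs little more than a prompt builder and a scorer. The sketch below covers only those two pieces; `buildHaystack` and `recall` are hypothetical helpers, substring matching is a deliberately crude scoring rule, and the call to the model is left out.

```typescript
// Insert the needle at a relative depth in [0, 1] among the filler sentences.
function buildHaystack(filler: string[], needle: string, depth: number): string {
  const clamped = Math.min(1, Math.max(0, depth));
  const index = Math.round(clamped * filler.length);
  return [...filler.slice(0, index), needle, ...filler.slice(index)].join(" ");
}

// Fraction of answers that contain the expected string (case-insensitive).
function recall(answers: string[], expected: string): number {
  if (answers.length === 0) return 0;
  const target = expected.toLowerCase();
  const hits = answers.filter((a) => a.toLowerCase().indexOf(target) !== -1).length;
  return hits / answers.length;
}

const filler = ["Alpha.", "Beta.", "Gamma.", "Delta."];
const needle = "The vault code is 4217.";

// One prompt per depth; in a real run each would be sent to the model.
const prompts = [0, 0.5, 1].map((depth) => buildHaystack(filler, needle, depth));

// Scoring two example answers against the expected value.
const score = recall(["The code is 4217.", "I could not find it."], "4217");
```

Sweeping the depth across the full context, and using filler drawn from your own documents, gives a recall curve that is more informative for your task than a single headline number.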

TL;DR

Long context isn't free, but with prefix caching it can be 16× cheaper than the headline price. Use INT4 KV at > 1M tokens. Don't reach for 10M before you've maxed out 1M with caching. RAG still wins for KB-style search and freshness.