
Disaggregated prefill and decode

Prefill and decode have completely different bottlenecks. Forcing them to share one homogeneous deployment leaves performance on the table for both. Here's how we split them and why it matters more on Blackwell than on Hopper.

Two workloads, two profiles

A single inference request goes through two phases:

Phase	Bottleneck	Optimal batch	What it wants
Prefill	Compute (matmul)	Large (32-128)	Maximum FLOPs per second, big GEMMs
Decode	HBM bandwidth + collective latency	Small (1-16)	Low inter-token latency, fast per-token wallclock
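A rough roofline-style check makes the split concrete. The sketch below is not from any particular engine: it looks at a single 16-bit projection matrix, and the peak-FLOPs and HBM-bandwidth constants are illustrative round numbers we assume for the example, not the spec of any GPU.

```python
# Illustrative round numbers, NOT the spec of any particular GPU.
ASSUMED_PEAK_FLOPS = 1.0e15       # dense 16-bit FLOP/s
ASSUMED_HBM_BYTES_PER_S = 4.0e12  # HBM bandwidth in bytes/s


def gemm_arithmetic_intensity(tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for a (tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * tokens * d_in * d_out
    # Read activations and weights once, write the output once.
    bytes_moved = bytes_per_elem * (tokens * d_in + d_in * d_out + tokens * d_out)
    return flops / bytes_moved


def is_compute_bound(tokens, d_in, d_out):
    """True when the matmul sits at or above the roofline ridge point."""
    ridge = ASSUMED_PEAK_FLOPS / ASSUMED_HBM_BYTES_PER_S  # FLOPs per byte
    return gemm_arithmetic_intensity(tokens, d_in, d_out) >= ridge


# One decode step for a single sequence: 1 token through an 8192 x 8192 projection.
decode_bound = is_compute_bound(tokens=1, d_in=8192, d_out=8192)      # False
# Prefill of an 8192-token prompt through the same projection.
prefill_bound = is_compute_bound(tokens=8192, d_in=8192, d_out=8192)  # True
```

Under these assumed numbers the ridge point is 250 FLOPs per byte. A decode batch of 16 tokens reaches only about 16 FLOPs per byte, so it is limited by how fast the weights stream out of HBM, while a single long prompt clears the ridge comfortably.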

Why one cluster can't serve both well

Continuous batching mitigates the conflict but doesn't solve it. When a long-context prefill (say 100K tokens) starts, it dominates the GPU for hundreds of milliseconds — every concurrent decode request stalls until prefill yields. Chunked prefill helps but introduces its own scheduling overhead.
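A toy model shows the size of the stall and what chunking buys. It assumes prefill runs at a constant token rate (a hypothetical 200,000 tokens/s, picked only to land in the "hundreds of milliseconds" range) and that decode steps can run only between prefill slices. It ignores attention's growing per-token cost and any per-chunk launch overhead.

```python
import math


def worst_decode_stall_ms(prompt_tokens, prefill_tokens_per_s, chunk_tokens=None):
    """Longest wait a co-located decode step sees behind one slice of prefill work.

    With chunk_tokens=None the whole prompt runs as a single slice.
    """
    if chunk_tokens is None:
        slice_tokens = prompt_tokens
    else:
        slice_tokens = min(chunk_tokens, prompt_tokens)
    return 1000.0 * slice_tokens / prefill_tokens_per_s


def num_prefill_slices(prompt_tokens, chunk_tokens):
    """How many scheduling rounds the chunked prefill needs."""
    return math.ceil(prompt_tokens / chunk_tokens)


ASSUMED_PREFILL_RATE = 200_000  # tokens/s, hypothetical

monolithic_ms = worst_decode_stall_ms(100_000, ASSUMED_PREFILL_RATE)      # 500.0
chunked_ms = worst_decode_stall_ms(100_000, ASSUMED_PREFILL_RATE, 2_048)  # about 10.24
rounds = num_prefill_slices(100_000, 2_048)                               # 49
```

Chunking shrinks the worst stall from about 500 ms to about 10 ms in this model, but the scheduler now has to interleave 49 rounds for one prompt, and every one of them still competes with decode for the same GPU.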

The cleaner answer: two physically separate pools, each tuned for its phase. The KV cache is handed off across the network at the phase boundary.

Disaggregated topology

A typical disaggregated deployment for a 235B-class MoE on Blackwell NVL72 looks like:

Pool	Parallelism	Topology
Prefillers	TP=4 + EP=4	Single node (4 GPUs), intra-node NVLink
Decoders	DP + EP=16	4 nodes (16 GPUs), one NVLink domain

The KV cache produced at the end of prefill is shipped over ConnectX-7 InfiniBand (400 Gb/s) to the decoder pool, where generation resumes at token N+1 (N being the prompt length).
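To get a feel for the hand-off cost, here is a back-of-the-envelope sketch. The model shape (96 layers, 8 KV heads, head dimension 128, 16-bit cache) is a hypothetical grouped-query-attention configuration chosen for illustration, not the shape of any specific model, and the link is assumed to run at its full 400 Gb/s line rate with no protocol overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Size of the KV cache for one sequence (keys plus values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def transfer_time_ms(num_bytes, link_gbit_per_s=400.0, efficiency=1.0):
    """Time to push num_bytes over one link; efficiency < 1 models overhead."""
    bytes_per_s = link_gbit_per_s * 1e9 / 8 * efficiency
    return 1000.0 * num_bytes / bytes_per_s


# Hypothetical model shape with a 6,000-token prompt.
size = kv_cache_bytes(num_layers=96, num_kv_heads=8, head_dim=128, num_tokens=6_000)
one_link_ms = transfer_time_ms(size)  # roughly 47 ms at full line rate
```

For this assumed shape the cache is about 2.36 GB, so a single link needs tens of milliseconds per request. That is why real deployments tend to spread the transfer across several NICs or overlap it with prefill layer by layer; architectures with compressed KV representations would move far fewer bytes.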

The hard problem: shard mismatch at hand-off

Prefill TP groups are typically smaller (e.g., 4 GPUs) than decode TP/EP groups (e.g., 16 GPUs in EP-heavy setups). The KV cache produced by prefill is sharded for the smaller layout, so it has to be re-sharded before decode can use it. A naive re-shard, for example gathering the full cache on one GPU and scattering it again, can erase most of the gains.

A small session-routing layer in the inference engine handles the re-sharding at hand-off. The shape of the API typically looks like:

```python
# Illustrative pseudocode: `engine`, `model`, `prompt`, and `req_id`
# come from the surrounding serving code and are not defined here.

# Prefill side: produce KV with current shard layout, push to decoder
session = engine.create_session(
    session_id=req_id,
    layout=PrefillLayout(tp=4, ep=4),
)
kv = run_prefill(model, prompt)
session.push(kv, target_layout=DecoderLayout(dp=4, ep=16))

# Decoder side: pull KV, automatically re-sharded
kv = engine.pull(req_id)
generate_tokens(kv, max_new_tokens=2048)
```
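What the routing layer has to compute underneath is a transfer plan. The sketch below is one possible way to build it, not the engine's actual implementation: it assumes the cache is sharded evenly by KV head on both sides, and `Transfer`, `shard_bounds`, and `reshard_plan` are names made up for this example.

```python
from typing import List, NamedTuple, Tuple


class Transfer(NamedTuple):
    src_rank: int    # prefill rank that owns the heads
    dst_rank: int    # decoder rank that needs them
    head_start: int  # inclusive
    head_end: int    # exclusive


def shard_bounds(num_heads: int, num_shards: int) -> List[Tuple[int, int]]:
    """Even split of KV heads into contiguous [start, end) ranges per rank."""
    if num_heads % num_shards != 0:
        raise ValueError("num_heads must be divisible by num_shards")
    per_shard = num_heads // num_shards
    return [(r * per_shard, (r + 1) * per_shard) for r in range(num_shards)]


def reshard_plan(num_heads: int, src_shards: int, dst_shards: int) -> List[Transfer]:
    """Point-to-point sends that move each head straight to its new owner."""
    plan = []
    for src, (s0, s1) in enumerate(shard_bounds(num_heads, src_shards)):
        for dst, (d0, d1) in enumerate(shard_bounds(num_heads, dst_shards)):
            lo, hi = max(s0, d0), min(s1, d1)
            if lo < hi:  # the two ranks' head ranges overlap
                plan.append(Transfer(src, dst, lo, hi))
    return plan


# 8 KV heads, sharded 4 ways at prefill, 2 ways at decode.
plan = reshard_plan(num_heads=8, src_shards=4, dst_shards=2)
```

Here `plan` holds four sends: prefill ranks 0 and 1 feed decoder rank 0, ranks 2 and 3 feed decoder rank 1. Each source sends only the slice its destination needs, so no single GPU ever has to hold the whole cache. Setting `dst_shards=1` models a data-parallel decoder rank that keeps the full cache for its own requests.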

When NOT to disaggregate

Small models (≤ 13B): collective overhead is small enough that one pool wins.
Pure-chat workloads with short prompts: prefill is too cheap to justify the extra machinery.
Limited NVLink domain (e.g. 8-GPU): decode pools beyond the domain cross InfiniBand and lose most of the benefit. The technique works best on rack-scale NVLink fabrics.

Typical impact at production scale

For a 200B-class MoE with 6K-token average prompts and 512 output tokens, going from one homogeneous pool to a disaggregated layout tends to give:

P50 TTFT: −34% (decoders no longer stall on long prefills)
P99 TTFT: −58% (tail evaporates because prefill bursts can't starve decoders)
Throughput per GPU: +22% (each pool runs at its preferred batch size)
Cost per million tokens: −18% (better GPU utilisation across the rack)

TL;DR

Disaggregation is the right answer for ≥ 70B models on rack-scale NVLink hardware. Not worth the complexity below that, and not worth it at all when the decode pool can't scale past one NVLink domain.
