Inference optimizationMoE

Expert parallelism for MoE

TP, DP, EP — three ways to split a big model across GPUs, each optimal for a different shape. For dense models, TP and DP cover everything. For MoE, you need EP. Here's when to use which, and the all-to-all communication cost that determines the ceiling.

Three ways to split a model

Strategy	What gets split	Communication	Best for
DP — Data	Requests	None per layer	Small models that fit on 1 GPU
TP — Tensor	Weight matrices	All-reduce per layer	Dense models too big for 1 GPU
EP — Expert	MoE experts	All-to-all (dispatch + combine)	MoE models with many experts

The all-to-all problem

MoE inference at every layer involves a dispatch step (route each token to its top-K experts) and a combine step (gather expert outputs back to source GPUs). With EP=N, every GPU sends data to every other GPU — an all-to-all collective whose latency dominates decode time.

The latency of all-to-all scales with the slowest hop in the interconnect path. On Hopper, the NVLink domain stops at 8 GPUs. Past EP=8, dispatch/combine has to cross InfiniBand — an order of magnitude slower than NVLink.

Hardware	Domain size	Bandwidth peer-to-peer	Practical EP cap
H100 (HGX)	8 GPUs	900 GB/s (NVLink4)	EP=8
GB200 NVL72	72 GPUs	1,800 GB/s (NVLink5)	EP=16-32

Choosing EP for a given model

The right EP value balances two opposing forces:

Higher EP → smaller per-expert GEMMs (faster matmul, but…)
Higher EP → more all-to-all latency (more peers to talk to)

For Qwen3 235B (128 experts), at our typical 6,000-token prefill batch, the per-expert workload looks like:

EP value	Tokens per expert	Dispatch latency	GEMM time	Verdict
EP=4	~1,500	+0 µs	1.0×	Compute-bound
EP=8	~750	+2 µs	0.62×	Sweet spot on Hopper
EP=16	~375	+5 µs	0.38×	Sweet spot on NVL72
EP=32	~187	+18 µs	0.32×	Memory-bound

Mixing strategies in practice

Real production deployments rarely use one pure strategy. For Qwen3 235B on NVL72:

Attention block: TP=4 (the dense matmul fits comfortably; tensor parallel keeps collectives small)
MoE block: EP=16 (the expert layer dominates parameter count; EP spreads the memory)
Across replicas: DP (each replica handles independent requests; no cross-replica comm)

Hot-expert load balancing

In production traffic, expert utilisation isn't uniform. A handful of experts get hit far more than the average — a long-tail distribution that destroys performance if you naïvely shard 1 expert per GPU.

We monitor per-expert load and rebalance dynamically: hot experts get replicated across multiple GPUs (a form of micro-DP within EP), and the dispatch step uses a hash that respects current load weights. The replication factor changes over a 5-minute rolling window.

Implementation: dispatch / combine kernels

The dispatch and combine collectives are the hottest code in the MoE forward pass. Production engines hand-tune them with NVLink topology awareness — the kernel knows which peers are intra-NVSwitch (one hop) vs cross-NVSwitch (two hops) and routes data accordingly.

Aggressive specialisation by token-count bucket (small / medium / large) lets the engine pick the right reduction primitive per request — warp shuffle for small payloads, shared-memory reduce for medium, in-network reduction (where supported) for large.

TL;DR

For MoE, EP is required. EP=8 is the sweet spot on Hopper (limited by 8-GPU NVLink domain). EP=16 becomes practical on rack-scale NVLink fabrics (limited by per-expert GEMM going memory-bound). Past that, expert parallelism stops paying for the all-to-all overhead it costs.

Re-read disaggregation