
Expert parallelism for MoE

TP, DP, and EP are three ways to split a big model across GPUs, each suited to a different model shape. For dense models, TP and DP cover everything. For MoE, you need EP. Here's when to use which, and the all-to-all communication cost that sets the ceiling on EP.

Three ways to split a model

| Strategy | What gets split | Communication | Best for |
|---|---|---|---|
| DP (Data) | Requests | None per layer | Small models that fit on 1 GPU |
| TP (Tensor) | Weight matrices | All-reduce per layer | Dense models too big for 1 GPU |
| EP (Expert) | MoE experts | All-to-all (dispatch + combine) | MoE models with many experts |

The all-to-all problem

Every MoE layer runs a dispatch step (route each token to its top-K experts) and a combine step (gather expert outputs back to the source GPUs). With EP=N, every GPU exchanges data with every other GPU in both steps. This is an all-to-all collective, and its latency dominates decode time.
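To make the dispatch traffic concrete, here is a minimal sketch that counts how many token copies each GPU sends to each other GPU. It assumes contiguous sharding (expert `e` lives on rank `e // experts_per_gpu`); the function names and the toy routing are illustrative, not taken from any particular engine.

```python
def expert_owner(expert_id: int, num_experts: int, ep: int) -> int:
    """Rank hosting an expert, assuming contiguous sharding across EP ranks."""
    experts_per_gpu = num_experts // ep
    return expert_id // experts_per_gpu


def dispatch_counts(token_src, token_experts, num_experts, ep):
    """Return counts[src][dst]: token copies sent from rank src to rank dst."""
    counts = [[0] * ep for _ in range(ep)]
    for src, experts in zip(token_src, token_experts):
        for expert_id in experts:
            counts[src][expert_owner(expert_id, num_experts, ep)] += 1
    return counts


# Toy example: 4 tokens, 8 experts, EP=4 (2 experts per GPU), top-2 routing.
token_src = [0, 0, 1, 3]                          # rank holding each token
token_experts = [(0, 5), (2, 3), (7, 1), (4, 6)]  # top-2 expert ids per token
counts = dispatch_counts(token_src, token_experts, num_experts=8, ep=4)
for row in counts:
    print(row)
# [1, 2, 1, 0]
# [1, 0, 0, 1]
# [0, 0, 0, 0]
# [0, 0, 1, 1]
```

The combine step moves the same volume in the opposite direction, so its send matrix is the transpose of this one.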

The latency of all-to-all scales with the slowest hop in the interconnect path. On Hopper, the NVLink domain stops at 8 GPUs. Past EP=8, dispatch/combine has to cross InfiniBand — an order of magnitude slower than NVLink.

| Hardware | Domain size | Bandwidth peer-to-peer | Practical EP cap |
|---|---|---|---|
| H100 (HGX) | 8 GPUs | 900 GB/s (NVLink4) | EP=8 |
| GB200 NVL72 | 72 GPUs | 1,800 GB/s (NVLink5) | EP=16-32 |
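A rough, bandwidth-only comparison shows where the "order of magnitude" gap comes from. The NVLink figures are the nominal ones from the table; the InfiniBand figure (50 GB/s, one 400 Gb/s port per GPU) and the 1 MiB payload are assumptions for illustration. Real all-to-all latency also depends on message count, topology, and contention, which this sketch ignores.

```python
def transfer_time_us(payload_bytes: int, bandwidth_gb_s: float) -> float:
    """Bandwidth-only transfer time in microseconds (ignores per-message latency)."""
    return payload_bytes / (bandwidth_gb_s * 1e9) * 1e6


PAYLOAD = 1 << 20  # hypothetical 1 MiB dispatch payload per GPU

LINKS_GB_S = {
    "NVLink4 (H100)": 900.0,       # nominal figure from the table
    "NVLink5 (GB200)": 1800.0,     # nominal figure from the table
    "InfiniBand (assumed)": 50.0,  # assumption: one 400 Gb/s port per GPU
}

times = {name: transfer_time_us(PAYLOAD, bw) for name, bw in LINKS_GB_S.items()}
for name, t in times.items():
    print(f"{name}: {t:.2f} us")

# Ratio of InfiniBand to NVLink4 transfer time under these assumptions.
slowdown = times["InfiniBand (assumed)"] / times["NVLink4 (H100)"]
print(f"InfiniBand vs NVLink4: {slowdown:.0f}x slower")
```

Under these assumptions the ratio is 900 / 50 = 18, which is consistent with crossing InfiniBand being roughly an order of magnitude slower.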

Choosing EP for a given model

The right EP value balances two opposing forces:

  • Higher EP → smaller per-expert GEMMs (faster matmul, until the GEMM becomes memory-bound)
  • Higher EP → more all-to-all latency (more peers to talk to)

For Qwen3 235B (128 experts), at our typical 6,000-token prefill batch, the per-expert workload looks like:

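| EP value | Tokens per expert | Dispatch latency | GEMM time | Verdict |
|---|---|---|---|---|
| EP=4 | ~1,500 | +0 µs | 1.0× | Compute-bound |
| EP=8 | ~750 | +2 µs | 0.62× | Sweet spot on Hopper |
| EP=16 | ~375 | +5 µs | 0.38× | Sweet spot on NVL72 |
| EP=32 | ~187 | +18 µs | 0.32× | Memory-bound |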
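The token counts in the table match the batch size divided by EP (for example, 6,000 / 8 = 750). The sketch below reproduces that column under this assumption and also shows how many of the 128 experts each GPU hosts, assuming an even split. The function name `per_rank_workload` is illustrative.

```python
def per_rank_workload(batch_tokens: int, num_experts: int, ep: int) -> dict:
    """Per-GPU workload for a given EP degree, assuming an even expert split."""
    if num_experts % ep != 0:
        raise ValueError("num_experts must be divisible by ep")
    return {
        "ep": ep,
        "experts_per_gpu": num_experts // ep,
        # Assumed to be the quantity in the table's token column.
        "tokens": batch_tokens // ep,
    }


rows = [per_rank_workload(6000, 128, ep) for ep in (4, 8, 16, 32)]
for row in rows:
    print(row)
```

The GEMM-time and latency columns are measurements and cannot be derived from this arithmetic.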

Mixing strategies in practice

Real production deployments rarely use one pure strategy. For Qwen3 235B on NVL72:

  • Attention block: TP=4 (the dense matmul fits comfortably; tensor parallel keeps collectives small)
  • MoE block: EP=16 (the expert layer dominates parameter count; EP spreads the memory)
  • Across replicas: DP (each replica handles independent requests; no cross-replica comm)
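One way to picture how these degrees fit together is a small plan object. This is a sketch under assumptions the list above does not state: each replica occupies `moe_ep` GPUs, and attention runs as `moe_ep / attn_tp` independent tensor-parallel groups on those same GPUs. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelPlan:
    attn_tp: int     # tensor-parallel degree for attention blocks
    moe_ep: int      # expert-parallel degree for MoE blocks
    total_gpus: int  # GPUs available in the NVLink domain

    def __post_init__(self):
        # Assumption: attention TP groups tile the EP group exactly.
        if self.moe_ep % self.attn_tp != 0:
            raise ValueError("moe_ep must be a multiple of attn_tp")

    @property
    def attn_groups_per_replica(self) -> int:
        """Independent attention TP groups inside one replica."""
        return self.moe_ep // self.attn_tp

    @property
    def dp_replicas(self) -> int:
        """Whole replicas that fit; any remainder is left unassigned here."""
        return self.total_gpus // self.moe_ep


plan = ParallelPlan(attn_tp=4, moe_ep=16, total_gpus=72)
print(plan.attn_groups_per_replica, plan.dp_replicas)  # 4 4
```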

Hot-expert load balancing

In production traffic, expert utilisation isn't uniform. A handful of experts receive far more tokens than the average. If you naïvely shard 1 expert per GPU, the GPUs hosting those hot experts become stragglers, and every other GPU waits for them at the combine step.

We monitor per-expert load and rebalance dynamically: hot experts get replicated across multiple GPUs (a form of micro-DP within EP), and the dispatch step uses a hash that respects current load weights. The replication factor is recomputed from a 5-minute rolling window of load.
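The replication idea can be sketched as follows. This is an illustration, not the production algorithm: `replication_plan` greedily hands each spare slot to the expert with the highest load per replica, and `pick_replica` is a simple weighted hash standing in for the load-aware dispatch hash. The load numbers, slot count, and function names are all hypothetical.

```python
import bisect
from itertools import accumulate


def replication_plan(load: dict, spare_slots: int) -> dict:
    """Greedy: each spare slot goes to the expert with the highest load per replica."""
    replicas = {expert: 1 for expert in load}
    for _ in range(spare_slots):
        hottest = max(load, key=lambda e: load[e] / replicas[e])
        replicas[hottest] += 1
    return replicas


def pick_replica(token_id: int, weights: list) -> int:
    """Deterministically map a token to a replica, proportionally to weights."""
    # Multiplicative hash to a pseudo-uniform value in [0, 1).
    u = (token_id * 2654435761 % 2**32) / 2**32
    cumulative = list(accumulate(weights))
    idx = bisect.bisect_right(cumulative, u * cumulative[-1])
    return min(idx, len(weights) - 1)


# Token counts per expert over the monitoring window (made-up numbers).
window_load = {0: 900, 1: 100, 2: 120, 3: 80}
plan = replication_plan(window_load, spare_slots=2)
print(plan)  # {0: 3, 1: 1, 2: 1, 3: 1}

# Spread tokens routed to expert 0 over its three replicas with equal weights.
print([pick_replica(t, [1, 1, 1]) for t in range(6)])
```

In this toy case both spare slots go to expert 0, which cuts its per-replica load from 900 to 300 tokens.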

Implementation: dispatch / combine kernels

The dispatch and combine collectives are the hottest code in the MoE forward pass. Production engines hand-tune them with NVLink topology awareness — the kernel knows which peers are intra-NVSwitch (one hop) vs cross-NVSwitch (two hops) and routes data accordingly.

Aggressive specialisation by token-count bucket (small / medium / large) lets the engine pick the right reduction primitive per request — warp shuffle for small payloads, shared-memory reduce for medium, in-network reduction (where supported) for large.
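The bucket selection can be pictured as a small host-side chooser. The thresholds (64 and 1,024 tokens), the string labels, and the fallback to shared-memory reduce when in-network reduction is unavailable are assumptions for illustration; a real engine would tune these per GPU and interconnect.

```python
def pick_primitive(num_tokens: int, in_network_supported: bool,
                   small_max: int = 64, medium_max: int = 1024) -> str:
    """Choose a reduction primitive by token-count bucket (hypothetical thresholds)."""
    if num_tokens <= small_max:
        return "warp_shuffle"
    if num_tokens <= medium_max or not in_network_supported:
        # Assumed fallback for large payloads without in-network support.
        return "shared_memory_reduce"
    return "in_network_reduction"


for n in (16, 512, 6000):
    print(n, pick_primitive(n, in_network_supported=True))
```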

TL;DR

For MoE, EP is required. EP=8 is the sweet spot on Hopper, where the 8-GPU NVLink domain sets the limit. EP=16 becomes practical on rack-scale NVLink fabrics, where the limit is instead the per-expert GEMM going memory-bound. Past that, the smaller GEMMs no longer repay the added all-to-all overhead.