Expert parallelism for MoE
TP, DP, EP — three ways to split a big model across GPUs, each optimal for a different shape. For dense models, TP and DP cover everything. For MoE, you need EP. Here's when to use which, and the all-to-all communication cost that determines the ceiling.
Three ways to split a model
| Strategy | What gets split | Communication | Best for |
|---|---|---|---|
| DP — Data | Requests | None per layer | Small models that fit on 1 GPU |
| TP — Tensor | Weight matrices | All-reduce per layer | Dense models too big for 1 GPU |
| EP — Expert | MoE experts | All-to-all (dispatch + combine) | MoE models with many experts |
The all-to-all problem
MoE inference at every layer involves a dispatch step (route each token to its top-K experts) and a combine step (gather expert outputs back to source GPUs). With EP=N, every GPU sends data to every other GPU — an all-to-all collective whose latency dominates decode time.
The latency of all-to-all scales with the slowest hop in the interconnect path. On Hopper, the NVLink domain stops at 8 GPUs. Past EP=8, dispatch/combine has to cross InfiniBand — an order of magnitude slower than NVLink.
| Hardware | Domain size | Bandwidth peer-to-peer | Practical EP cap |
|---|---|---|---|
| H100 (HGX) | 8 GPUs | 900 GB/s (NVLink4) | EP=8 |
| GB200 NVL72 | 72 GPUs | 1,800 GB/s (NVLink5) | EP=16-32 |
Choosing EP for a given model
The right EP value balances two opposing forces:
- Higher EP → smaller per-expert GEMMs (faster matmul, but…)
- Higher EP → more all-to-all latency (more peers to talk to)
For Qwen3 235B (128 experts), at our typical 6,000-token prefill batch, the per-expert workload looks like:
| EP value | Tokens per expert | Dispatch latency | GEMM time | Verdict |
|---|---|---|---|---|
| EP=4 | ~1,500 | +0 µs | 1.0× | Compute-bound |
| EP=8 | ~750 | +2 µs | 0.62× | Sweet spot on Hopper |
| EP=16 | ~375 | +5 µs | 0.38× | Sweet spot on NVL72 |
| EP=32 | ~187 | +18 µs | 0.32× | Memory-bound |
Mixing strategies in practice
Real production deployments rarely use one pure strategy. For Qwen3 235B on NVL72:
- Attention block: TP=4 (the dense matmul fits comfortably; tensor parallel keeps collectives small)
- MoE block: EP=16 (the expert layer dominates parameter count; EP spreads the memory)
- Across replicas: DP (each replica handles independent requests; no cross-replica comm)
Hot-expert load balancing
In production traffic, expert utilisation isn't uniform. A handful of experts get hit far more than the average — a long-tail distribution that destroys performance if you naïvely shard 1 expert per GPU.
We monitor per-expert load and rebalance dynamically: hot experts get replicated across multiple GPUs (a form of micro-DP within EP), and the dispatch step uses a hash that respects current load weights. The replication factor changes over a 5-minute rolling window.
Implementation: dispatch / combine kernels
The dispatch and combine collectives are the hottest code in the MoE forward pass. Production engines hand-tune them with NVLink topology awareness — the kernel knows which peers are intra-NVSwitch (one hop) vs cross-NVSwitch (two hops) and routes data accordingly.
Aggressive specialisation by token-count bucket (small / medium / large) lets the engine pick the right reduction primitive per request — warp shuffle for small payloads, shared-memory reduce for medium, in-network reduction (where supported) for large.
TL;DR
For MoE, EP is required. EP=8 is the sweet spot on Hopper (limited by 8-GPU NVLink domain). EP=16 becomes practical on rack-scale NVLink fabrics (limited by per-expert GEMM going memory-bound). Past that, expert parallelism stops paying for the all-to-all overhead it costs.