Speculative decoding
A fast draft model proposes tokens; the slow target model verifies them in parallel. When the draft is right, you get several tokens per target-model forward pass, which gives a 1.9-2.6× wallclock speedup at batch=1 on the models listed below, with an identical output distribution.
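The "identical output distribution" property depends on how rejected draft tokens are handled. The sketch below shows the standard rejection-sampling rule from the speculative sampling literature: accept a draft token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(0, p - q). It is an illustration only. The text does not say which verification rule Luminet uses, and the names `verifyToken`, `residual`, `sampleFrom`, and `outputDistribution` are hypothetical.

```typescript
// Probability vectors over the same vocabulary; the index is the token id.
type Dist = number[];

// Normalized residual max(0, p - q): what we sample from after a rejection.
function residual(p: Dist, q: Dist): Dist {
  const raw = p.map((pi, i) => Math.max(0, pi - (q[i] ?? 0)));
  const total = raw.reduce((a, b) => a + b, 0);
  // If p equals q the residual is empty; a rejection cannot occur then, so return p.
  return total > 0 ? raw.map((r) => r / total) : p.slice();
}

// Inverse-CDF sampling with a caller-supplied uniform number u in [0, 1).
function sampleFrom(dist: Dist, u: number): number {
  let cumulative = 0;
  for (let i = 0; i < dist.length; i++) {
    cumulative += dist[i] ?? 0;
    if (u < cumulative) return i;
  }
  return dist.length - 1; // guard against rounding error
}

// p = target distribution, q = draft distribution at the same position.
function verifyToken(
  draftToken: number,
  p: Dist,
  q: Dist,
  rand: () => number = Math.random
): { token: number; accepted: boolean } {
  const pTok = p[draftToken] ?? 0;
  const qTok = q[draftToken] ?? 0;
  const acceptProb = qTok > 0 ? Math.min(1, pTok / qTok) : 0;
  if (rand() < acceptProb) return { token: draftToken, accepted: true };
  return { token: sampleFrom(residual(p, q), rand()), accepted: false };
}

// Analytic check: the distribution of the emitted token, which should equal p.
function outputDistribution(p: Dist, q: Dist): Dist {
  // q(x) * min(1, p(x)/q(x)) simplifies to min(q(x), p(x)).
  const acceptMass = q.map((qi, i) => Math.min(qi, p[i] ?? 0));
  const rejectProb = 1 - acceptMass.reduce((a, b) => a + b, 0);
  const res = residual(p, q);
  return acceptMass.map((a, i) => a + rejectProb * (res[i] ?? 0));
}
```

For p = [0.7, 0.2, 0.1] and q = [0.5, 0.4, 0.1], `outputDistribution(p, q)` recovers p up to floating-point rounding, even though the draft over-proposes token 1.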
Why it works
Most autoregressive generation is memory-bandwidth bound, not compute bound. A single forward pass of an 80B model reads ~160 GB of weights from HBM (at 2 bytes per parameter) whether it produces 1 token or 8. If a 1.5B draft model can guess the next 8 tokens in < 5 ms, and the target model verifies them in one batched forward pass, you skip 7 of 8 target forward passes whenever all 8 guesses are accepted.
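The arithmetic above can be sketched as a small calculation. The 80B parameter count, the 8-token draft, and the 5 ms draft time come from the text; the bandwidth figure is a hypothetical round number, not a measured value, and the function names are illustrative. The sketch also assumes the verify pass over 8 tokens costs the same as a 1-token pass, which is the bandwidth-bound premise stated above.

```typescript
// Weight bytes read per forward pass, in GB: billions of params * bytes per param.
function weightsGB(paramsBillions: number, bytesPerParam: number): number {
  return paramsBillions * bytesPerParam;
}

// Time to stream the weights once, in ms, if the pass is purely bandwidth bound.
function passMs(gb: number, bandwidthGBps: number): number {
  return (gb / bandwidthGBps) * 1000;
}

// Best-case speedup for one speculation cycle: k tokens for one draft run plus
// one verify pass, versus k sequential target passes.
function bestCaseSpeedup(targetPassMs: number, draftMs: number, k: number): number {
  return (k * targetPassMs) / (draftMs + targetPassMs);
}

const ASSUMED_BANDWIDTH_GB_PER_S = 10000; // hypothetical aggregate HBM bandwidth
const gb = weightsGB(80, 2); // 160 GB, as in the text
const targetMs = passMs(gb, ASSUMED_BANDWIDTH_GB_PER_S); // about 16 ms
const speedup = bestCaseSpeedup(targetMs, 5, 8); // about 6.1x if all 8 guesses are accepted
console.log({ gb, targetMs, speedup });
```

This is an upper bound for a fully accepted draft. Real speedups are lower because some guesses are rejected, which is why measured numbers sit well below it.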
What Luminet ships by default
Every hosted target model has a paired draft model trained specifically for it. The draft is distilled from 40B tokens of target-model completions across web, code, and reasoning corpora.
| Target | Draft | Accept rate | Wallclock speedup (batch=1) |
|---|---|---|---|
| Llama 4 Maverick | Llama 3.2 1B (distilled) | 74% | 2.1× |
| DeepSeek V3.2 | DeepSeek 1.5B-draft | 71% | 2.3× |
| Qwen3-Max 235B | Qwen3 1.5B-draft | 76% | 2.6× |
| GLM-4.6 | GLM-4 9B (distilled) | 68% | 1.9× |
| Kimi K2 Instruct | Kimi 3B-draft | 72% | 2.2× |
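To build intuition for what an accept rate means in tokens per target pass, here is a rough estimate. It assumes each draft token is accepted independently with probability alpha (the table does not define its accept rate this precisely) and uses a draft length of k = 8 borrowed from the example above. The function names are hypothetical, and the estimate ignores draft cost and verification overhead, so it does not reproduce the speedup column.

```typescript
// Expected number of draft tokens accepted per cycle: sum over i = 1..k of alpha^i.
function expectedAcceptedDraftTokens(alpha: number, k: number): number {
  let expected = 0;
  let prefixProb = 1;
  for (let i = 1; i <= k; i++) {
    prefixProb *= alpha; // probability that the first i draft tokens all pass
    expected += prefixProb;
  }
  return expected;
}

// Tokens emitted per target pass: accepted draft tokens plus the one token the
// target contributes itself (a correction after a rejection, or a bonus token).
function expectedTokensPerPass(alpha: number, k: number): number {
  return expectedAcceptedDraftTokens(alpha, k) + 1;
}

// Accept rates copied from the table above.
const acceptRates: Array<[string, number]> = [
  ["Llama 4 Maverick", 0.74],
  ["DeepSeek V3.2", 0.71],
  ["Qwen3-Max 235B", 0.76],
  ["GLM-4.6", 0.68],
  ["Kimi K2 Instruct", 0.72],
];
acceptRates.forEach(([name, alpha]) => {
  console.log(name, expectedTokensPerPass(alpha, 8).toFixed(2));
});
```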
Tree-based vs linear speculation
Linear speculation proposes one chain of k tokens. If the target rejects token i, you waste tokens i+1 through k.
Tree-based speculation proposes multiple parallel branches at each step. The target verifies the entire tree in one forward pass and accepts the branch with the longest matching prefix. We ship tree speculation by default for chat workloads (batch ≤ 4), which raises the effective accept rate from ~70% to ~84%.
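The difference between the two schemes can be sketched with greedy verification, where a draft token is accepted only if it equals the token the target would have picked for the same prefix. This is a simplified illustration, not Luminet's implementation: `targetNext` is a hypothetical stand-in for what the target computes for every tree node in a single batched pass, and the tree is flattened into a list of candidate branches.

```typescript
// Hypothetical stand-in for the target model: the token it would pick after `prefix`.
type NextToken = (prefix: number[]) => number;

// Number of leading tokens in `branch` that the target agrees with.
// Linear speculation is the special case of a single branch.
function acceptedLength(branch: number[], targetNext: NextToken): number {
  let n = 0;
  while (n < branch.length && branch[n] === targetNext(branch.slice(0, n))) n++;
  return n;
}

// Tree speculation: check every branch and keep the one with the longest accepted prefix.
function bestBranch(
  branches: number[][],
  targetNext: NextToken
): { index: number; accepted: number } {
  let best = { index: -1, accepted: 0 };
  branches.forEach((branch, index) => {
    const accepted = acceptedLength(branch, targetNext);
    if (best.index === -1 || accepted > best.accepted) best = { index, accepted };
  });
  return best;
}

// Toy target that always continues 5, 6, 7, ...
const toyTarget: NextToken = (prefix) => 5 + prefix.length;
const branches = [
  [5, 9, 7], // diverges at position 1
  [5, 6, 2], // diverges at position 2
  [4, 6, 7], // wrong from the start
];
console.log(acceptedLength([5, 9, 7], toyTarget)); // linear: only this one chain
console.log(bestBranch(branches, toyTarget)); // tree: best of three chains
```

With a single chain, the first mismatch discards everything after it. With several branches, one early wrong guess costs less because a sibling branch may still match.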
When speculation hurts
- High batch (> 16): the target is already compute-bound, so the extra verification compute costs more than the skipped passes save. We auto-disable above batch 16.
- Highly random outputs: temperature > 1.5 or large top-k makes the draft accept rate collapse. Disable for creative workloads.
- Out-of-distribution prompts: highly domain-specific prompts (legal or medical jargon, for example) can have a low draft accept rate. We auto-detect this and degrade gracefully.
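Batch size is handled on the server side, but the sampling settings are known to the caller, so a client can decide up front whether to request speculation. The helper below is a hypothetical sketch: the 1.5 temperature threshold comes from the list above, while the top-k limit and the default temperature of 1.0 are placeholder assumptions, since the text only says "large top-k".

```typescript
interface SamplingParams {
  temperature?: number;
  topK?: number;
}

const TEMPERATURE_LIMIT = 1.5; // threshold from the text
const ASSUMED_TOP_K_LIMIT = 100; // hypothetical; the text only says "large top-k"

// Returns the value to send as `speculative_decoding` for this request.
function shouldSpeculate(
  params: SamplingParams,
  topKLimit: number = ASSUMED_TOP_K_LIMIT
): boolean {
  const temperature = params.temperature ?? 1.0; // assumed default when unset
  const topK = params.topK ?? 0; // 0 means "not set" in this sketch
  if (temperature > TEMPERATURE_LIMIT) return false;
  if (topK > topKLimit) return false;
  return true;
}

const extraBody = { speculative_decoding: shouldSpeculate({ temperature: 1.8 }) };
console.log(extraBody);
```

Tune the top-k limit against your own measurements rather than treating the placeholder as a recommendation.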
Disable per request
const resp = await client.chat.completions.create({
  model: "deepseek/deepseek-v3.2-exp",
  messages: [...],
  // Disable speculation for this request
  extra_body: {
    speculative_decoding: false,
  },
});
TL;DR
Speculation is on by default at batch ≤ 16 and gives a 1.9-2.6× wallclock speedup on chat. The output distribution is identical to the target model's. Disable it per request if you measure regressions on your specific workload.