
Speculative decoding

A fast draft model proposes tokens; the slow target model verifies them in parallel. When the draft is right, you get several tokens per target-model forward pass: roughly a 2-2.6× wallclock speedup at batch=1, with an identical output distribution.

Why it works

Most autoregressive generation is memory-bandwidth bound, not compute bound. A single forward pass of an 80B model reads ~160 GB of weights from HBM (at 16-bit precision) whether it produces 1 token or 8. If a 1.5B draft model can guess the next 8 tokens in < 5 ms, and the target model verifies them in one batched forward pass, you skip 7 of 8 target forward passes whenever every draft token is accepted.
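The arithmetic behind that claim can be sketched in a few lines. The 2 bytes per parameter corresponds to the 80B → ~160 GB figure above; the aggregate HBM bandwidth is a hypothetical round number chosen only for illustration, not a measurement of any deployment, and the function names are ours.

```typescript
// Back-of-envelope model of a memory-bandwidth-bound forward pass.
// The bandwidth figure below is an assumption for illustration only.

function weightBytes(params: number, bytesPerParam: number): number {
  return params * bytesPerParam;
}

// Lower bound on forward-pass time in ms: every weight is streamed from HBM once.
function minPassMs(bytes: number, bandwidthBytesPerSec: number): number {
  return (bytes / bandwidthBytesPerSec) * 1000;
}

const targetBytes = weightBytes(80e9, 2);      // 80B params at 16-bit
const assumedBandwidth = 16e12;                // hypothetical: 16 TB/s aggregate
const passMs = minPassMs(targetBytes, assumedBandwidth);

// Without speculation, 8 tokens cost 8 target passes.
const baselineMs = 8 * passMs;
// Best case with speculation: one draft phase (5 ms upper bound from the text)
// plus a single verify pass.
const speculativeMs = 5 + passMs;

console.log({ passMs, baselineMs, speculativeMs });
```

This is the best case only. Rejected draft tokens and the extra compute in the verify pass both eat into the gap.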

What Luminet ships by default

Every hosted target model has a paired draft model trained specifically for it. The draft is distilled from 40B tokens of target-model completions across web, code, and reasoning corpora.

Target              Draft                       Accept rate   Wallclock speedup (batch=1)
Llama 4 Maverick    Llama 3.2 1B (distilled)    74%           2.1×
DeepSeek V3.2       DeepSeek 1.5B-draft         71%           2.3×
Qwen3-Max 235B      Qwen3 1.5B-draft            76%           2.6×
GLM-4.6             GLM-4 9B (distilled)        68%           1.9×
Kimi K2 Instruct    Kimi 3B-draft               72%           2.2×
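To get a feel for how an accept rate translates into tokens per target pass, here is a rough sketch. It assumes the accept rate is a per-token probability α that is independent across positions (the table does not say how the rate is measured), in which case a chain of k drafted tokens yields (1 − α^(k+1)) / (1 − α) tokens per pass in expectation. The k = 8 used here is borrowed from the example in "Why it works" and is not a documented setting.

```typescript
// Expected tokens emitted per target forward pass for a linear draft of length k,
// assuming each draft token is accepted independently with probability alpha.
// Counts the accepted prefix plus the one token the target always contributes.
function expectedTokensPerPass(alpha: number, k: number): number {
  if (alpha < 0 || alpha > 1) throw new RangeError("alpha must be in [0, 1]");
  if (alpha === 1) return k + 1; // every draft token accepted, plus one bonus token
  return (1 - Math.pow(alpha, k + 1)) / (1 - alpha);
}

// Accept rates from the table above; k = 8 is an illustrative assumption.
for (const alpha of [0.74, 0.71, 0.76, 0.68, 0.72]) {
  console.log(alpha, expectedTokensPerPass(alpha, 8).toFixed(2));
}
```

Tokens per pass is an upper bound on speedup, not the speedup itself: draft-model time and the slightly more expensive verify pass both reduce the wallclock gain.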

Tree-based vs linear speculation

Linear speculation proposes one chain of k tokens. If the target rejects token i, you waste tokens i+1 through k.
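A minimal sketch of linear verification follows, using the standard accept/reject rule from the speculative sampling literature: accept draft token x with probability min(1, p(x)/q(x)), and on the first rejection resample from the renormalized residual max(0, p − q). This rule is what keeps the output distribution equal to the target's. It is an illustration, not a description of Luminet's implementation; all names and the injected `rand` parameter are hypothetical.

```typescript
type Dist = number[]; // probabilities over the vocabulary, indexed by token id

// Pick an index from a probability vector given a uniform draw u in [0, 1).
function sampleFrom(dist: Dist, u: number): number {
  let acc = 0;
  for (let i = 0; i < dist.length; i++) {
    acc += dist[i];
    if (u < acc) return i;
  }
  return dist.length - 1; // guard against floating-point rounding
}

// draftTokens[i] was sampled from draftDists[i] (q). targetDists (p) has k + 1
// entries: one per drafted position plus one for the bonus token.
function verifyChain(
  draftTokens: number[],
  draftDists: Dist[],
  targetDists: Dist[],
  rand: () => number = Math.random,
): number[] {
  const out: number[] = [];
  for (let i = 0; i < draftTokens.length; i++) {
    const x = draftTokens[i];
    const p = targetDists[i][x];
    const q = draftDists[i][x];
    if (rand() < Math.min(1, p / q)) {
      out.push(x); // accepted
      continue;
    }
    // Rejected: resample this position from max(0, p - q), renormalized.
    const residual = targetDists[i].map((pt, t) => Math.max(0, pt - draftDists[i][t]));
    const z = residual.reduce((a, b) => a + b, 0);
    out.push(sampleFrom(residual.map((r) => r / z), rand()));
    return out; // draft tokens i+1..k are discarded
  }
  // Every draft token accepted: the same pass also yields one extra target token.
  out.push(sampleFrom(targetDists[draftTokens.length], rand()));
  return out;
}
```

Each call returns between 1 and k + 1 tokens, so even a fully rejected draft still makes one token of progress.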

Tree-based speculation proposes multiple parallel branches at each step. The target verifies the entire tree in one forward pass and accepts the branch with the longest matching prefix. We ship tree speculation by default for chat workloads (batch ≤ 4), which raises the effective accept rate from ~70% to ~84%.
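The branch selection can be sketched for the greedy (temperature 0) case, where the target's choice after any prefix is deterministic and at most one branch can match. The `DraftNode` shape and the `targetChoice` callback are hypothetical stand-ins: in a real system the target's choices for every node come out of a single batched forward pass over the tree, and sampling-based tree verification is more involved than this.

```typescript
interface DraftNode {
  token: number;
  children: DraftNode[];
}

// Walk the draft tree, following the child that matches the target's choice
// at each step. Stops at the first level where no drafted child matches.
function longestAcceptedPath(
  roots: DraftNode[],
  targetChoice: (prefix: number[]) => number,
): number[] {
  const accepted: number[] = [];
  let candidates = roots;
  while (candidates.length > 0) {
    const want = targetChoice(accepted);
    const match = candidates.find((n) => n.token === want);
    if (!match) break; // no branch guessed this position correctly
    accepted.push(match.token);
    candidates = match.children;
  }
  return accepted;
}
```

Compared with a single chain, a wrong guess at one position only costs the progress if no sibling branch covered the correct token.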

When speculation hurts

  • High batch (≥ 16): the target is already compute-bound, so the verify pass costs more than it saves. We auto-disable above batch 16.
  • Highly random outputs: temperature > 1.5 or a large top-k makes the draft accept rate collapse. Disable for creative workloads.
  • Out-of-distribution prompts: prompts heavy in domain-specific jargon (legal, medical) may have a low draft accept rate. We auto-detect and degrade gracefully.
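The two numeric rules above can be written down as a small decision helper. This is a hypothetical sketch, not part of any SDK: it only encodes the stated thresholds (batch above 16, temperature above 1.5). The top-k and out-of-distribution cases are left out because the text gives no concrete cutoff for them.

```typescript
interface RequestProfile {
  batchSize: number;
  temperature: number;
}

// Returns false when one of the documented numeric thresholds is exceeded.
function shouldSpeculate({ batchSize, temperature }: RequestProfile): boolean {
  if (batchSize > 16) return false;    // target is compute-bound at high batch
  if (temperature > 1.5) return false; // draft accept rate collapses
  return true;
}

// Example: a high-temperature creative request.
const creative = shouldSpeculate({ batchSize: 1, temperature: 1.8 });
console.log({ speculative_decoding: creative });
```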

Disable per request

const resp = await client.chat.completions.create({
  model: "deepseek/deepseek-v3.2-exp",
  messages: [...],
  // Disable speculation for this request
  extra_body: {
    speculative_decoding: false,
  },
});
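To decide whether a workload benefits, one option is to time the same prompts with the flag on and off and compare medians. The helpers below are a hypothetical sketch that only does the comparison; collecting the latency samples (for example by timing the call above) is left to you.

```typescript
// Median of a list of latency samples in milliseconds.
function median(xs: number[]): number {
  if (xs.length === 0) throw new Error("no samples");
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Ratio of median latencies. Above 1 means speculation is faster;
// below 1 means it is a regression for this workload.
function speculationSpeedup(withSpecMs: number[], withoutSpecMs: number[]): number {
  return median(withoutSpecMs) / median(withSpecMs);
}
```

Medians are used rather than means so that a single slow request does not dominate the comparison.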

TL;DR

Speculation is on by default at batch ≤ 16 and cuts wallclock latency by roughly 2-2.6× on chat. The output distribution is identical to the target model's. Disable it per request if you measure regressions on your specific workload.