FireAttention v3: 2.4× faster inference for open-weight LLMs
We're shipping FireAttention v3 today — our third-generation inference kernel that drops time-to-first-token by 38% and lifts steady-state throughput by 2.4× compared to stock vLLM, on the same hardware. This post walks through what changed under the hood and what it means if you serve open-weight LLMs in production.
The bottleneck wasn't where we expected
Most inference engines optimise the attention kernel and call it a day. We did too in v1. But after profiling six months of production traffic across DeepSeek V3.2, Llama 4 Maverick, and Qwen3-Max, two things jumped out:
- ~32% of GPU time was spent in the prefill stage, dominated by softmax + matmul fusion overhead.
- ~18% was idle waiting for KV-cache eviction during continuous-batching scheduling.
v3 attacks both. Here's how.
1. Fused FP8 prefill with on-the-fly quantization
Prefill kernels in vLLM and TGI run BF16 matmuls and quantize activations as a separate pass. We fuse them: the FP8 quantization happens inside the attention kernel, in registers, before the softmax accumulator. Net result: 1.7× speedup on prefill with identical numerical output (we tested across MMLU, GSM8K, HumanEval — <0.4% delta vs BF16).
// Before: 3 separate kernel launches
attention_bf16<<<...>>>(); // 4.2 ms
quantize_to_fp8<<<...>>>(); // 0.9 ms
matmul_fp8<<<...>>>(); // 2.1 ms
// total: 7.2 ms
// After: 1 fused kernel
fused_fa3_fp8<<<...>>>(); // 4.1 ms
// total: 4.1 ms (1.76x)2. Speculative KV eviction
Continuous batching is a well-known win — but the scheduler typically waits for one request to finish before reclaiming its KV cache slot. On bursty traffic, this caused a measurable tail latency spike.
v3 introduces speculative eviction: when we predict a request is within 95% confidence of completing in the next 4 steps, we proactively allocate the new incoming request into the soon-to-be-free slot. If the prediction is wrong (rare, ~1.2% of the time), we fall back gracefully without dropping tokens.
This alone added 14% steady-state throughput on our DeepSeek V3.2 fleet under realistic mixed loads.
3. Tree-based speculative decoding for chat workloads
For chat-style requests (single completion, low-batch), we ship a draft-model speculative decoder by default. Our draft is a 1.5B variant of the target model trained on 40B tokens of distilled completions — accept rate is ~74% on Llama 4 Maverick, which translates to a ~2.1× wallclock speedup at batch=1.
The numbers
Reproducible benchmarks, 8× H100 SXM, identical sampling parameters, 1K input / 256 output:
| Model | vLLM 0.6 | v2 (Apr) | v3 (today) | Δ vs vLLM |
|---|---|---|---|---|
| DeepSeek V3.2 | 178 tok/s | 320 tok/s | 410 tok/s | 2.30× |
| Llama 4 Maverick | 152 tok/s | 280 tok/s | 380 tok/s | 2.50× |
| Qwen3-Max | 118 tok/s | 215 tok/s | 280 tok/s | 2.37× |
| Qwen3-Next 80B | 245 tok/s | 480 tok/s | 640 tok/s | 2.61× |
| GLM-4.6 | 175 tok/s | 312 tok/s | 420 tok/s | 2.40× |
| Kimi K2 Instruct | 122 tok/s | 220 tok/s | 295 tok/s | 2.42× |
What this means for you
If you're already on Luminet, you don't need to do anything — v3 is rolled out across all hosted clusters as of today, and the new prices are live on the pricing page. Existing customers see ~30% lower bills automatically with no throughput regressions.
If you're running your own inference, well — open the quickstart and migrate the workload that hurts the most. Your finance team will thank you.
What's next
v3.1 (June): MoE expert prefetching for DeepSeek-style models. v4 (Q3): dynamic precision — automatically pick FP8 vs INT4 per layer based on residual error. Subscribe to the changelog.
Try v3 in under 60 seconds
Same OpenAI SDK, just point it at our endpoint.