All posts
EngineeringInferenceCUDA

FireAttention v3: 2.4× faster inference for open-weight LLMs

May 8, 2026 Sasha Petrov, Inference Eng· 9 min read

We're shipping FireAttention v3 today — our third-generation inference kernel that drops time-to-first-token by 38% and lifts steady-state throughput by 2.4× compared to stock vLLM, on the same hardware. This post walks through what changed under the hood and what it means if you serve open-weight LLMs in production.

The bottleneck wasn't where we expected

Most inference engines optimise the attention kernel and call it a day. We did too in v1. But after profiling six months of production traffic across DeepSeek V3.2, Llama 4 Maverick, and Qwen3-Max, two things jumped out:

  • ~32% of GPU time was spent in the prefill stage, dominated by softmax + matmul fusion overhead.
  • ~18% was idle waiting for KV-cache eviction during continuous-batching scheduling.

v3 attacks both. Here's how.

1. Fused FP8 prefill with on-the-fly quantization

Prefill kernels in vLLM and TGI run BF16 matmuls and quantize activations as a separate pass. We fuse them: the FP8 quantization happens inside the attention kernel, in registers, before the softmax accumulator. Net result: 1.7× speedup on prefill with identical numerical output (we tested across MMLU, GSM8K, HumanEval — <0.4% delta vs BF16).

// Before: 3 separate kernel launches
attention_bf16<<<...>>>();        // 4.2 ms
quantize_to_fp8<<<...>>>();       // 0.9 ms
matmul_fp8<<<...>>>();            // 2.1 ms
                                  // total: 7.2 ms

// After: 1 fused kernel
fused_fa3_fp8<<<...>>>();         // 4.1 ms
                                  // total: 4.1 ms (1.76x)

2. Speculative KV eviction

Continuous batching is a well-known win — but the scheduler typically waits for one request to finish before reclaiming its KV cache slot. On bursty traffic, this caused a measurable tail latency spike.

v3 introduces speculative eviction: when we predict a request is within 95% confidence of completing in the next 4 steps, we proactively allocate the new incoming request into the soon-to-be-free slot. If the prediction is wrong (rare, ~1.2% of the time), we fall back gracefully without dropping tokens.

This alone added 14% steady-state throughput on our DeepSeek V3.2 fleet under realistic mixed loads.

3. Tree-based speculative decoding for chat workloads

For chat-style requests (single completion, low-batch), we ship a draft-model speculative decoder by default. Our draft is a 1.5B variant of the target model trained on 40B tokens of distilled completions — accept rate is ~74% on Llama 4 Maverick, which translates to a ~2.1× wallclock speedup at batch=1.

The numbers

Reproducible benchmarks, 8× H100 SXM, identical sampling parameters, 1K input / 256 output:

ModelvLLM 0.6v2 (Apr)v3 (today)Δ vs vLLM
DeepSeek V3.2178 tok/s320 tok/s410 tok/s2.30×
Llama 4 Maverick152 tok/s280 tok/s380 tok/s2.50×
Qwen3-Max118 tok/s215 tok/s280 tok/s2.37×
Qwen3-Next 80B245 tok/s480 tok/s640 tok/s2.61×
GLM-4.6175 tok/s312 tok/s420 tok/s2.40×
Kimi K2 Instruct122 tok/s220 tok/s295 tok/s2.42×

What this means for you

If you're already on Luminet, you don't need to do anything — v3 is rolled out across all hosted clusters as of today, and the new prices are live on the pricing page. Existing customers see ~30% lower bills automatically with no throughput regressions.

If you're running your own inference, well — open the quickstart and migrate the workload that hurts the most. Your finance team will thank you.

What's next

v3.1 (June): MoE expert prefetching for DeepSeek-style models. v4 (Q3): dynamic precision — automatically pick FP8 vs INT4 per layer based on residual error. Subscribe to the changelog.

Try v3 in under 60 seconds

Same OpenAI SDK, just point it at our endpoint.

Get an API key