All numbers, no spin.
Reproducible head-to-head benchmarks across every hosted model. Run on identical hardware (8× H100 SXM, batch 32, FP8) against stock vLLM 0.6 baselines. Public dataset, public methodology.
Methodology
- • Hardware: 8× NVIDIA H100 SXM (80 GB), NVLink, IB
- • Batch size: 32 concurrent requests
- • Input length: 1,024 tokens · Output: 256 tokens
- • Quantization: FP8 (E4M3) for weights & KV
- • Sampling: temperature 0.7, top_p 0.95
- • Baseline: vLLM 0.6 with default settings, same hardware
- • Run window: 100 warmup + 1,000 measured requests
- • Reproducibility: full scripts on github.com/luminet/benchmarks
Chat throughput
Sustained tokens/sec at production batch sizes. Lower P50 TTFT means faster first-token-out for users.
| Model | Luminet (tok/s) | vLLM | Speedup | P50 TTFT | $/1M out |
|---|---|---|---|---|---|
| DeepSeek V3.2 Exp | 410 | 178 | 2.30× | 110 ms | $0.27 |
| Llama 4 Maverick | 380 | 152 | 2.50× | 130 ms | $0.27 |
| Llama 4 Scout | 460 | 198 | 2.32× | 95 ms | $0.18 |
| Qwen3-Max 235B | 280 | 118 | 2.37× | 175 ms | $1.60 |
| Qwen3-Next 80B A3B | 640 | 245 | 2.61× | 78 ms | $0.14 |
| GLM-4.6 | 420 | 175 | 2.40× | 105 ms | $0.50 |
| GLM-4.5 Air | 540 | 222 | 2.43× | 88 ms | $0.20 |
| Kimi K2 Instruct | 295 | 122 | 2.42× | 145 ms | $0.60 |
| Mistral Medium 3.1 | 350 | 145 | 2.41× | 125 ms | $0.40 |
Long-context prefill
Cold prefill cost vs warm (prefix-cached) prefill. Warm cost reveals how much you save by reusing system prompts across requests.
| Model (max ctx) | Test prompt | Cold prefill | Warm prefill | Cold $ | Cache speedup |
|---|---|---|---|---|---|
| Llama 4 Maverick (1M) | 1000K tokens | 12.4 s | 0.40 s | $0.27 | 31× |
| Llama 4 Scout (10M) | 10000K tokens | 142.0 s | 4.10 s | $1.80 | 35× |
| GLM-4.6 (200K) | 200K tokens | 2.8 s | 0.18 s | $0.10 | 16× |
| Qwen3-Max (256K) | 256K tokens | 3.6 s | 0.22 s | $0.41 | 16× |
Tool use & agents
BFCL v3 (Berkeley Function Calling Leaderboard) accuracy. Parallel calls measures the model's ability to emit multiple independent tool calls in one shot.
| Model | BFCL v3 overall | Parallel calls | Hallucination rate |
|---|---|---|---|
| Kimi K2 Instruct | 78.4 | 82.1 | 3.2% |
| Llama 4 Maverick | 76.9 | 80.5 | 4.1% |
| DeepSeek V3.2 | 76.2 | 78.8 | 3.8% |
| Qwen3-Max 235B | 75.5 | 79.4 | 4.5% |
| GLM-4.6 | 73.8 | 76.2 | 5.1% |
| Magistral Medium | 70.4 | 72.1 | 6.3% |
Code generation
HumanEval+ for synthesis quality, SWE-Bench Verified for end-to-end repo-fix tasks. FIM latency is the round-trip for fill-in-the-middle code completion.
| Model | HumanEval+ | SWE-Bench V | FIM latency |
|---|---|---|---|
| Qwen3-Coder 480B | 88.4 | 56.2 | 95 ms |
| DeepSeek V3.2 Exp | 87.1 | 52.8 | 110 ms |
| Devstral Small | 79.3 | 48.5 | 65 ms |
| Codestral 2 | 81.6 | 41.2 | 42 ms |
| Llama 4 Maverick | 82.9 | 44.7 | 130 ms |
Reproduce these on your own hardware
Full benchmark scripts, datasets, and Docker images are public. Run them locally — or take our word for it and skip straight to the API.