Benchmarks

All numbers, no spin.

Reproducible head-to-head benchmarks across every hosted model. Run on identical hardware (8× H100 SXM, batch 32, FP8) against stock vLLM 0.6 baselines. Public dataset, public methodology.

Peak throughput

640 tok/s

Lowest P50 TTFT

42 ms

Models tested

Avg cost vs vLLM

-42%

Methodology

• Hardware: 8× NVIDIA H100 SXM (80 GB), NVLink, InfiniBand
• Batch size: 32 concurrent requests
• Input length: 1,024 tokens · Output: 256 tokens
• Quantization: FP8 (E4M3) for weights and KV cache
• Sampling: temperature 0.7, top_p 0.95
• Baseline: vLLM 0.6 with default settings, same hardware
• Run window: 100 warmup + 1,000 measured requests
• Reproducibility: full scripts on github.com/luminet/benchmarks
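The run window above can be sketched in a few lines of Python. This is a minimal illustration and not the published scripts: `send_request` is a hypothetical stub that only sleeps, and reporting throughput as aggregate output tokens divided by wall-clock time is an assumption about how tok/s is defined.

```python
from __future__ import annotations

import asyncio
import statistics
import time

WARMUP_REQUESTS = 100      # from the methodology list
MEASURED_REQUESTS = 1000   # from the methodology list
CONCURRENCY = 32           # batch size from the methodology list
INPUT_TOKENS = 1024
OUTPUT_TOKENS = 256


async def send_request(input_tokens: int, output_tokens: int) -> tuple[float, int]:
    """Hypothetical stand-in for one streaming request.

    Returns (time to first token in seconds, output tokens received).
    Replace the sleeps with a real client call.
    """
    start = time.perf_counter()
    await asyncio.sleep(0.001)            # pretend to wait for the first token
    ttft = time.perf_counter() - start
    await asyncio.sleep(0.001)            # pretend to stream the remaining tokens
    return ttft, output_tokens


async def run_window(n_requests: int) -> tuple[list[float], int, float]:
    """Issue n_requests with at most CONCURRENCY requests in flight."""
    limiter = asyncio.Semaphore(CONCURRENCY)

    async def one() -> tuple[float, int]:
        async with limiter:
            return await send_request(INPUT_TOKENS, OUTPUT_TOKENS)

    start = time.perf_counter()
    results = await asyncio.gather(*(one() for _ in range(n_requests)))
    elapsed = time.perf_counter() - start
    ttfts = [ttft for ttft, _ in results]
    tokens = sum(count for _, count in results)
    return ttfts, tokens, elapsed


async def benchmark() -> dict[str, float]:
    await run_window(WARMUP_REQUESTS)     # warmup results are discarded
    ttfts, tokens, elapsed = await run_window(MEASURED_REQUESTS)
    return {
        "requests": len(ttfts),
        "output_tokens": tokens,
        "tok_per_s": tokens / elapsed,    # assumed definition: aggregate tokens / wall time
        "p50_ttft_ms": statistics.median(ttfts) * 1000,
    }


if __name__ == "__main__":
    print(asyncio.run(benchmark()))
```

With the stub in place the numbers are meaningless; the point is the shape of the loop: a discarded warmup window, a measured window, and a median over per-request TTFT.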

Chat throughput

Sustained tokens/sec at the production batch size (32 concurrent requests). P50 TTFT is the median time to first token; lower means users see the first token sooner.

Model	Luminet (tok/s)	vLLM (tok/s)	Speedup	P50 TTFT	$/1M output tokens
DeepSeek V3.2 Exp	410	178	2.30×	110 ms	$0.27
Llama 4 Maverick	380	152	2.50×	130 ms	$0.27
Llama 4 Scout	460	198	2.32×	95 ms	$0.18
Qwen3-Max 235B	280	118	2.37×	175 ms	$1.60
Qwen3-Next 80B A3B	640	245	2.61×	78 ms	$0.14
GLM-4.6	420	175	2.40×	105 ms	$0.50
GLM-4.5 Air	540	222	2.43×	88 ms	$0.20
Kimi K2 Instruct	295	122	2.42×	145 ms	$0.60
Mistral Medium 3.1	350	145	2.41×	125 ms	$0.40
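The Speedup column is consistent with a simple ratio of the two throughput columns. A short check, using three rows copied from the table (the `speedup` helper is illustrative only):

```python
# (model, Luminet tok/s, vLLM tok/s) copied from the chat throughput table
rows = [
    ("DeepSeek V3.2 Exp", 410, 178),
    ("Llama 4 Maverick", 380, 152),
    ("Qwen3-Next 80B A3B", 640, 245),
]


def speedup(luminet_tok_s: float, vllm_tok_s: float) -> float:
    """Throughput ratio: how many times faster than the baseline."""
    return luminet_tok_s / vllm_tok_s


for model, luminet, baseline in rows:
    print(f"{model}: {speedup(luminet, baseline):.2f}x")
# DeepSeek V3.2 Exp: 2.30x
# Llama 4 Maverick: 2.50x
# Qwen3-Next 80B A3B: 2.61x
```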

Long-context prefill

Cold prefill versus warm (prefix-cached) prefill for a prompt at each model's maximum context. The gap between the two times shows how much you save by reusing system prompts across requests.

Model (max ctx)	Test prompt	Cold prefill	Warm prefill	Cold $	Cache speedup
Llama 4 Maverick (1M)	1M tokens	12.4 s	0.40 s	$0.27	31×
Llama 4 Scout (10M)	10M tokens	142.0 s	4.10 s	$1.80	35×
GLM-4.6 (200K)	200K tokens	2.8 s	0.18 s	$0.10	16×
Qwen3-Max (256K)	256K tokens	3.6 s	0.22 s	$0.41	16×
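The Cache speedup column matches cold prefill time divided by warm prefill time, rounded to the nearest integer. The sketch below recomputes it from the table and also prints the seconds saved per request; the helper name is illustrative.

```python
# (model, cold prefill seconds, warm prefill seconds) copied from the table
rows = [
    ("Llama 4 Maverick", 12.4, 0.40),
    ("Llama 4 Scout", 142.0, 4.10),
    ("GLM-4.6", 2.8, 0.18),
    ("Qwen3-Max", 3.6, 0.22),
]


def cache_speedup(cold_s: float, warm_s: float) -> float:
    """How many times faster a prefix-cached prefill is than a cold one."""
    return cold_s / warm_s


for model, cold, warm in rows:
    ratio = round(cache_speedup(cold, warm))
    saved = cold - warm
    print(f"{model}: {ratio}x faster, {saved:.2f} s saved per request")
```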

Tool use & agents

Accuracy on BFCL v3 (Berkeley Function Calling Leaderboard). The Parallel calls column measures the model's ability to emit multiple independent tool calls in one shot.

Model	BFCL v3 overall	Parallel calls	Hallucination rate
Kimi K2 Instruct	78.4	82.1	3.2%
Llama 4 Maverick	76.9	80.5	4.1%
DeepSeek V3.2	76.2	78.8	3.8%
Qwen3-Max 235B	75.5	79.4	4.5%
GLM-4.6	73.8	76.2	5.1%
Magistral Medium	70.4	72.1	6.3%
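To make "multiple independent tool calls in one shot" concrete, here is one plausible way to check a parallel-call response: compare the predicted and expected calls as unordered collections, since independent calls have no required order. This is an illustrative sketch, not BFCL's actual evaluator, and the call format (`name`, `arguments`) is an assumption.

```python
import json
from collections import Counter


def canonical(call: dict) -> str:
    """Serialize a call so that argument key order does not matter."""
    return json.dumps(
        {"name": call["name"], "arguments": call["arguments"]},
        sort_keys=True,
    )


def parallel_calls_match(predicted: list, expected: list) -> bool:
    """True when both lists contain the same calls, in any order."""
    return Counter(map(canonical, predicted)) == Counter(map(canonical, expected))


expected = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_weather", "arguments": {"city": "Tokyo"}},
]
reordered = list(reversed(expected))
incomplete = expected[:1]

print(parallel_calls_match(reordered, expected))   # True
print(parallel_calls_match(incomplete, expected))  # False
```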

Code generation

HumanEval+ measures synthesis quality; SWE-Bench Verified measures end-to-end repo-fix tasks. FIM latency is the round-trip time for a fill-in-the-middle code completion.

Model	HumanEval+	SWE-Bench Verified	FIM latency
Qwen3-Coder 480B	88.4	56.2	95 ms
DeepSeek V3.2 Exp	87.1	52.8	110 ms
Devstral Small	79.3	48.5	65 ms
Codestral 2	81.6	41.2	42 ms
Llama 4 Maverick	82.9	44.7	130 ms

Reproduce these on your own hardware

Full benchmark scripts, datasets, and Docker images are public. Run them locally, or take our word for it and skip straight to the API.
