Benchmarks

All numbers, no spin.

Reproducible head-to-head benchmarks across every hosted model. Each model runs on the same hardware and settings (8× H100 SXM, batch size 32, FP8) as a stock vLLM 0.6 baseline. Public dataset, public methodology.

Peak throughput: 640 tok/s
Lowest P50 TTFT: 42 ms
Models tested: 20
Avg cost vs vLLM: -42%

Methodology

  • Hardware: 8× NVIDIA H100 SXM (80 GB), NVLink, InfiniBand
  • Batch size: 32 concurrent requests
  • Input length: 1,024 tokens · Output length: 256 tokens
  • Quantization: FP8 (E4M3) for weights and KV cache
  • Sampling: temperature 0.7, top_p 0.95
  • Baseline: vLLM 0.6 with default settings, same hardware
  • Run window: 100 warmup requests, then 1,000 measured requests
  • Reproducibility: full scripts on github.com/luminet/benchmarks
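
As a rough illustration of how the run window turns into the headline metrics, the sketch below drops the 100 warmup requests and computes P50 TTFT and throughput from per-request timings. The `RequestTiming` record and `summarize` function are hypothetical names, not taken from the published scripts, and throughput is assumed here to mean total output tokens divided by the wall-clock span of the measured requests.

```python
import statistics
from dataclasses import dataclass

WARMUP_REQUESTS = 100  # from the methodology: discarded before measuring


@dataclass
class RequestTiming:
    """Timing for one request (hypothetical record format)."""
    start_s: float        # when the request was sent
    first_token_s: float  # when the first output token arrived
    end_s: float          # when the last output token arrived
    output_tokens: int    # number of tokens generated


def summarize(timings: list[RequestTiming]) -> dict[str, float]:
    """Compute P50 TTFT (ms) and aggregate tokens/sec over measured requests."""
    measured = timings[WARMUP_REQUESTS:]  # skip the warmup window
    ttft_ms = [(t.first_token_s - t.start_s) * 1000 for t in measured]
    # Assumption: throughput is aggregate output tokens over wall-clock time.
    wall_s = max(t.end_s for t in measured) - min(t.start_s for t in measured)
    total_tokens = sum(t.output_tokens for t in measured)
    return {
        "p50_ttft_ms": statistics.median(ttft_ms),
        "tokens_per_s": total_tokens / wall_s,
    }
```

Using the median for TTFT keeps a few slow outliers from dominating the figure, which is why P50 is reported instead of the mean.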

Chat throughput

Sustained output tokens per second at batch size 32. P50 TTFT is the median time to first token; lower means users see the first token sooner.

Model | Luminet (tok/s) | vLLM (tok/s) | Speedup | P50 TTFT | $/1M out
DeepSeek V3.2 Exp | 410 | 178 | 2.30× | 110 ms | $0.27
Llama 4 Maverick | 380 | 152 | 2.50× | 130 ms | $0.27
Llama 4 Scout | 460 | 198 | 2.32× | 95 ms | $0.18
Qwen3-Max 235B | 280 | 118 | 2.37× | 175 ms | $1.60
Qwen3-Next 80B A3B | 640 | 245 | 2.61× | 78 ms | $0.14
GLM-4.6 | 420 | 175 | 2.40× | 105 ms | $0.50
GLM-4.5 Air | 540 | 222 | 2.43× | 88 ms | $0.20
Kimi K2 Instruct | 295 | 122 | 2.42× | 145 ms | $0.60
Mistral Medium 3.1 | 350 | 145 | 2.41× | 125 ms | $0.40
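
The Speedup column is consistent with dividing the Luminet throughput by the vLLM throughput. The sketch below recomputes it for two rows and, assuming the $/1M out price applies linearly per token, estimates the cost of one 256-token response. The function names are illustrative only.

```python
def speedup(luminet_tok_s: float, vllm_tok_s: float) -> float:
    """Throughput ratio of Luminet over the vLLM baseline."""
    return luminet_tok_s / vllm_tok_s


def output_cost_usd(output_tokens: int, price_per_million_usd: float) -> float:
    """Cost of generated tokens, assuming linear per-token pricing."""
    return output_tokens * price_per_million_usd / 1_000_000


# (Luminet tok/s, vLLM tok/s) taken from the table above
rows = {
    "DeepSeek V3.2 Exp": (410, 178),
    "Qwen3-Next 80B A3B": (640, 245),
}
for model, (ours, baseline) in rows.items():
    print(f"{model}: {speedup(ours, baseline):.2f}x")
# DeepSeek V3.2 Exp: 2.30x
# Qwen3-Next 80B A3B: 2.61x

# One 256-token response (the benchmark output length) at $0.27 per 1M tokens
cost = output_cost_usd(256, 0.27)
```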

Long-context prefill

Cold prefill versus warm (prefix-cached) prefill on a maximum-length prompt. The warm figure shows how much time you save by reusing system prompts across requests; cache speedup is cold time divided by warm time.

Model (max ctx) | Test prompt | Cold prefill | Warm prefill | Cold cost ($) | Cache speedup
Llama 4 Maverick (1M) | 1M tokens | 12.4 s | 0.40 s | $0.27 | 31×
Llama 4 Scout (10M) | 10M tokens | 142.0 s | 4.10 s | $1.80 | 35×
GLM-4.6 (200K) | 200K tokens | 2.8 s | 0.18 s | $0.10 | 16×
Qwen3-Max (256K) | 256K tokens | 3.6 s | 0.22 s | $0.41 | 16×
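
To make the cache numbers concrete, the sketch below recomputes the cache speedup as cold time over warm time and estimates total prefill time for a series of requests sharing one prefix. It assumes only the first request pays the cold cost and every later request hits the cache, which is a simplification; the function names are hypothetical.

```python
def cache_speedup(cold_s: float, warm_s: float) -> float:
    """Ratio of cold prefill time to warm (prefix-cached) prefill time."""
    return cold_s / warm_s


def total_prefill_s(cold_s: float, warm_s: float, n_requests: int) -> float:
    """Total prefill time for n requests that share one cached prefix.

    Assumption: the first request is cold, all later ones are warm.
    """
    if n_requests < 1:
        return 0.0
    return cold_s + (n_requests - 1) * warm_s


# Llama 4 Maverick row: 12.4 s cold, 0.40 s warm
print(round(cache_speedup(12.4, 0.40)))        # 31
with_cache = total_prefill_s(12.4, 0.40, 10)   # about 16 s for 10 requests
without_cache = 12.4 * 10                      # about 124 s if every request is cold
```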

Tool use & agents

BFCL v3 (Berkeley Function Calling Leaderboard) accuracy. Parallel calls measures the model's ability to emit multiple independent tool calls in a single response.

Model | BFCL v3 overall | Parallel calls | Hallucination rate
Kimi K2 Instruct | 78.4 | 82.1 | 3.2%
Llama 4 Maverick | 76.9 | 80.5 | 4.1%
DeepSeek V3.2 | 76.2 | 78.8 | 3.8%
Qwen3-Max 235B | 75.5 | 79.4 | 4.5%
GLM-4.6 | 73.8 | 76.2 | 5.1%
Magistral Medium | 70.4 | 72.1 | 6.3%
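
For readers unfamiliar with parallel calls, the sketch below shows one simplified way to check that a response contains the expected set of independent tool calls regardless of order. It is an illustration only, not BFCL's actual scorer, and the `get_weather` tool and the call format are hypothetical.

```python
import json
from collections import Counter


def normalize(call: dict) -> str:
    """Canonical string for one tool call, so argument key order does not matter."""
    return json.dumps(
        {"name": call["name"], "arguments": call["arguments"]},
        sort_keys=True,
    )


def parallel_calls_match(predicted: list[dict], expected: list[dict]) -> bool:
    """True when both lists contain the same calls, in any order."""
    return Counter(map(normalize, predicted)) == Counter(map(normalize, expected))


# Hypothetical example: the user asks for the weather in two cities at once.
expected = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_weather", "arguments": {"city": "Tokyo"}},
]
predicted = [
    {"name": "get_weather", "arguments": {"city": "Tokyo"}},
    {"name": "get_weather", "arguments": {"city": "Paris"}},
]
print(parallel_calls_match(predicted, expected))  # True
```

Comparing multisets instead of ordered lists reflects that independent calls have no required order, while still catching a missing or duplicated call.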

Code generation

HumanEval+ for synthesis quality, SWE-Bench Verified for end-to-end repo-fix tasks. FIM latency is the round-trip for fill-in-the-middle code completion.

Model | HumanEval+ | SWE-Bench Verified | FIM latency
Qwen3-Coder 480B | 88.4 | 56.2 | 95 ms
DeepSeek V3.2 Exp | 87.1 | 52.8 | 110 ms
Devstral Small | 79.3 | 48.5 | 65 ms
Codestral 2 | 81.6 | 41.2 | 42 ms
Llama 4 Maverick | 82.9 | 44.7 | 130 ms

Reproduce these on your own hardware

Full benchmark scripts, datasets, and Docker images are public. Run them locally — or take our word for it and skip straight to the API.