Inference optimizationAdvanced

Quantization: pick the right precision

Lower precision means smaller weights, less memory bandwidth, and higher throughput — at the cost of some quality. Here's how Luminet handles quantization across the catalog and how to pick the right format for your workload.

The four formats we serve

All hosted models ship with at least two quantization variants. By default Luminet routes to the FP8 variant unless you explicitly override it.

BF16

Reference / baseline · fine-tuning · debugging numerical issues

Bytes/param
2
Quality Δ
0% (baseline)
Speedup
1.0×

FP8 (E4M3 / E5M2)

Default

Default for hosted inference · maximum throughput at near-lossless quality

Bytes/param
1
Quality Δ
−0.4% avg
Speedup
1.7×

INT8 W8A8

Hardware without FP8 (Ampere) · activation outlier-tolerant models

Bytes/param
1
Quality Δ
−1.2% avg
Speedup
1.6×

INT4 AWQ / GPTQ

Edge deployment · memory-constrained GPUs · long-context KV cache compression

Bytes/param
0.5
Quality Δ
−2.1% avg
Speedup
2.3×

How to override the default

Append @fp8, @int8, or @int4 to the model ID:

// Default (FP8)
model: "deepseek/deepseek-v3.2-exp"

// Force INT4 for memory-constrained dev cluster
model: "deepseek/deepseek-v3.2-exp@int4"

// Force BF16 for golden-output regression tests
model: "deepseek/deepseek-v3.2-exp@bf16"

Quality validation methodology

Every quantized variant we publish is validated against the BF16 reference on:

  • MMLU-Pro (general knowledge, 14K multiple-choice across 14 domains)
  • GSM8K (math word problems, chain-of-thought)
  • HumanEval+ (code generation, extended test suite)
  • MT-Bench (multi-turn chat quality, judge-scored)

We publish per-model quality deltas in our quality report. Any regression > 1.5% triggers a rollback.

KV cache quantization (separate knob)

Weight quantization compresses parameters. KV cache quantization compresses the activations stored across generation steps — important for long-context workloads.

By default we use FP8 KV cache for all models. Switch to INT4 KV cache via extra_body: { kv_cache_dtype: 'int4' } to halve KV memory and unlock 2× longer context windows on the same GPU.

When to NOT quantize

  • Reasoning-heavy benchmarks: FP8 sometimes loses 2-3% on MATH or AIME. Use BF16 if you depend on these.
  • Adversarial / safety evals: Quantization can subtly shift refusal boundaries. Re-run your eval suite when switching formats.
  • Fine-tuning loop: Always train in BF16. Quantize only after training.

TL;DR

Stick with the default FP8 unless you have a measured reason to switch. INT4 trades 2-3% quality for 2× memory savings — great for long-context. BF16 only when you're debugging numerical issues or running golden tests.