Quantization: pick the right precision
Lower precision means smaller weights, less memory bandwidth, and higher throughput — at the cost of some quality. Here's how Luminet handles quantization across the catalog and how to pick the right format for your workload.
The four formats we serve
All hosted models ship with at least two quantization variants. By default Luminet routes to the FP8 variant unless you explicitly override it.
BF16
Reference / baseline · fine-tuning · debugging numerical issues
FP8 (E4M3 / E5M2)
Default for hosted inference · maximum throughput at near-lossless quality
INT8 W8A8
Hardware without FP8 support (e.g. Ampere) · models that tolerate activation outliers
INT4 AWQ / GPTQ
Edge deployment · memory-constrained GPUs · long-context workloads where smaller weights leave more room for the KV cache
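As a rough guide to what each format costs in memory, weight storage scales with bits per parameter. The sketch below is back-of-the-envelope arithmetic, not a Luminet API: the function name and the 70B parameter count are illustrative assumptions, and the estimate ignores the small overhead of quantization scales and zero-points.

```python
# Nominal bits per weight for each format named above.
BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "int8": 8, "int4": 4}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Ignores per-group scales and zero-points, so real INT4/INT8
    checkpoints are slightly larger than this estimate.
    """
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


# Hypothetical 70B-parameter model, for illustration only.
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {weight_memory_gb(70e9, fmt):.0f} GB")
```

For this hypothetical model, BF16 needs about twice the weight memory of FP8 or INT8, and four times that of INT4.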
How to override the default
Append @bf16, @fp8, @int8, or @int4 to the model ID:
// Default (FP8)
model: "deepseek/deepseek-v3.2-exp"

// Force INT4 for memory-constrained dev cluster
model: "deepseek/deepseek-v3.2-exp@int4"

// Force BF16 for golden-output regression tests
model: "deepseek/deepseek-v3.2-exp@bf16"
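If you build model IDs programmatically, a small helper keeps the suffix logic in one place. This is a minimal sketch: `with_precision` and `SUPPORTED_PRECISIONS` are hypothetical names, and the accepted suffixes are simply the ones shown on this page.

```python
from typing import Optional

# Suffixes taken from this page; the helper itself is hypothetical.
SUPPORTED_PRECISIONS = {"bf16", "fp8", "int8", "int4"}


def with_precision(model_id: str, precision: Optional[str] = None) -> str:
    """Return model_id with an @precision suffix, or unchanged for the default."""
    if precision is None:
        return model_id  # no suffix: routed to the default (FP8) variant
    precision = precision.lower()
    if precision not in SUPPORTED_PRECISIONS:
        raise ValueError(f"unknown precision {precision!r}")
    if "@" in model_id:
        raise ValueError(f"{model_id!r} already carries a precision suffix")
    return f"{model_id}@{precision}"


print(with_precision("deepseek/deepseek-v3.2-exp"))
print(with_precision("deepseek/deepseek-v3.2-exp", "int4"))
```

Rejecting unknown suffixes locally turns a typo such as `@fp4` into an immediate error rather than a failed request.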
Quality validation methodology
Every quantized variant we publish is validated against the BF16 reference on:
- MMLU-Pro (general knowledge, 14K multiple-choice across 14 domains)
- GSM8K (math word problems, chain-of-thought)
- HumanEval+ (code generation, extended test suite)
- MT-Bench (multi-turn chat quality, judge-scored)
We publish per-model quality deltas in our quality report. Any regression greater than 1.5% against the BF16 reference triggers a rollback.
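You can run the same kind of check against your own eval results. The sketch below is not Luminet's validation pipeline: the function name is hypothetical, the scores are made-up placeholders on a 0-100 scale, and it assumes the 1.5% threshold means absolute percentage points, which this page does not specify.

```python
ROLLBACK_THRESHOLD = 1.5  # from this page; assumed to be percentage points


def regressions(reference: dict, candidate: dict,
                threshold: float = ROLLBACK_THRESHOLD) -> dict:
    """Return benchmarks where the candidate trails the reference by more than threshold."""
    flagged = {}
    for name, ref_score in reference.items():
        drop = ref_score - candidate[name]
        if drop > threshold:
            flagged[name] = round(drop, 2)
    return flagged


# Placeholder scores for illustration only, not published results.
bf16 = {"MMLU-Pro": 75.0, "GSM8K": 95.0, "HumanEval+": 80.0, "MT-Bench": 85.0}
int4 = {"MMLU-Pro": 74.5, "GSM8K": 93.0, "HumanEval+": 79.0, "MT-Bench": 85.0}

print(regressions(bf16, int4))  # {'GSM8K': 2.0}
```

An empty result means every benchmark stayed within the threshold.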
KV cache quantization (separate knob)
Weight quantization compresses parameters. KV cache quantization compresses the attention keys and values cached across generation steps, which matters most for long-context workloads.
By default we use FP8 KV cache for all models. Switch to INT4 KV cache via extra_body: { kv_cache_dtype: 'int4' } to roughly halve KV memory, which allows up to about 2× longer context on the same GPU when the KV cache is the limiting factor.
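The halving follows from the per-element size. Here is a rough sketch of the arithmetic, assuming a standard multi-head or grouped-query attention layout; the model shape is hypothetical, the estimate ignores quantization scale overhead, and architectures with compressed attention caches will differ.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bits_per_elem: int) -> int:
    """Bytes of keys plus values for one sequence, ignoring quantization scales."""
    elems = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = keys + values
    return elems * bits_per_elem // 8


# Hypothetical model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

fp8_32k = kv_cache_bytes(seq_len=32_768, bits_per_elem=8, **shape)
int4_32k = kv_cache_bytes(seq_len=32_768, bits_per_elem=4, **shape)
int4_64k = kv_cache_bytes(seq_len=65_536, bits_per_elem=4, **shape)

print(fp8_32k / 2**30, "GiB for FP8 at 32K tokens")
print(int4_32k / 2**30, "GiB for INT4 at 32K tokens")
print(int4_64k / 2**30, "GiB for INT4 at 64K tokens")

# Request fields mirroring the override named on this page; how extra_body
# is passed depends on the client you use.
request_fields = {
    "model": "deepseek/deepseek-v3.2-exp",
    "extra_body": {"kv_cache_dtype": "int4"},
}
```

With this shape, INT4 at 64K tokens occupies the same memory as FP8 at 32K, which is where the 2× context figure comes from.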
When not to quantize
- Reasoning-heavy benchmarks: FP8 sometimes loses 2-3% on MATH or AIME. Use BF16 if you depend on these.
- Adversarial / safety evals: Quantization can subtly shift refusal boundaries. Re-run your eval suite when switching formats.
- Fine-tuning loop: Always train in BF16. Quantize only after training.
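The guidance on this page can be condensed into a small decision helper. This is only a sketch of one reasonable reading: the function, its flags, and the priority order between rules are assumptions, not a Luminet API.

```python
def recommend_precision(*, fine_tuning: bool = False,
                        reasoning_critical: bool = False,
                        golden_tests: bool = False,
                        memory_constrained: bool = False,
                        has_fp8_hardware: bool = True) -> str:
    """Map workload traits to one of the format suffixes used on this page."""
    # Cases where quality must match the reference take priority.
    if fine_tuning or reasoning_critical or golden_tests:
        return "bf16"
    # Tight memory budgets favour the smallest weights.
    if memory_constrained:
        return "int4"
    # Without FP8 support, INT8 is the 8-bit fallback.
    if not has_fp8_hardware:
        return "int8"
    return "fp8"  # the default


print(recommend_precision())
print(recommend_precision(memory_constrained=True))
print(recommend_precision(reasoning_critical=True, memory_constrained=True))
```

Whichever format the helper suggests, re-run your own eval suite after switching, since quantization can shift refusal boundaries.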
TL;DR
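Stick with the default FP8 unless you have a measured reason to switch. INT4 trades 2-3% quality for roughly 2× memory savings relative to the FP8 default, which suits long-context and memory-constrained deployments. Use BF16 for fine-tuning, golden tests, debugging numerical issues, or workloads that depend on reasoning-heavy benchmarks.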