Multi-LoRA serving
Serve hundreds of fine-tuned LoRA adapters on a single base-model GPU. Adapter switch overhead under 2 ms. Per-request adapter selection. This is how SaaS teams give every customer a custom model without going bankrupt on GPUs.
The problem multi-LoRA solves
You have 200 customers. Each wants a fine-tuned model for their domain. Naïvely, that's 200 separate model deployments × $5/hr per GPU = $24K/day idle before you serve a single request. Most of those models also see near-zero traffic — but you pay for the GPU regardless.
With multi-LoRA, you deploy one base model and load 200 tiny adapters (typically 50-200 MB each) into GPU memory. Each incoming request specifies which adapter to apply. Applying an adapter adds only a small low-rank matmul on top of the base computation, so switching between adapters is close to free.
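The cost comparison above can be written out as a quick calculation. This is a rough sketch: the function name is made up for illustration, and it assumes the shared deployment runs on one GPU at the same $5/hr rate used in the example.

```typescript
// Idle cost of keeping `deployments` GPUs up for 24 hours.
// Hypothetical helper, using the $5/hr figure from the example above.
function idleCostPerDay(deployments: number, gpuDollarsPerHour: number): number {
  return deployments * gpuDollarsPerHour * 24;
}

// 200 dedicated deployments, one GPU each.
const dedicatedPerDay = idleCostPerDay(200, 5); // 24000

// One shared base-model GPU holding all 200 adapters (assumed same hourly rate).
const sharedPerDay = idleCostPerDay(1, 5); // 120

console.log(`dedicated: $${dedicatedPerDay}/day, shared: $${sharedPerDay}/day`);
```

The gap scales linearly with the number of customers, which is why the dedicated approach breaks down once most adapters see little traffic.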
How LoRA works (60-second refresher)
LoRA freezes the base model weights and learns a low-rank update ΔW = A · B, where A is d × r, B is r × k, and the rank r is much smaller than d or k. Inference becomes y = (W + ΔW) x = Wx + (A · B) x. The first term is the regular base forward pass. The second term is a small extra computation, evaluated as A (B x) in the same forward step so the full ΔW never has to be materialized.
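To make the formula concrete, here is a minimal sketch of the LoRA forward pass on plain arrays. It is an illustration only, not how a GPU kernel is written: the helper names are invented for this example, and the matrices are toy-sized. It also checks that the low-rank path gives the same result as merging ΔW into W first.

```typescript
type Vector = number[];
type Matrix = number[][];

// Matrix-vector product: each output element is a row of m dotted with v.
function matVec(m: Matrix, v: Vector): Vector {
  return m.map((row) => row.reduce((sum, w, j) => sum + w * v[j], 0));
}

// Low-rank path: y = Wx + A(Bx). Only r-dimensional intermediates are created.
function loraForward(W: Matrix, A: Matrix, B: Matrix, x: Vector): Vector {
  const base = matVec(W, x);        // regular base forward pass
  const projected = matVec(B, x);   // r-dimensional vector
  const delta = matVec(A, projected);
  return base.map((value, i) => value + delta[i]);
}

// Reference path: materialize ΔW = A · B, then compute (W + ΔW) x.
function mergedForward(W: Matrix, A: Matrix, B: Matrix, x: Vector): Vector {
  const merged = W.map((row, i) =>
    row.map((w, j) => w + A[i].reduce((sum, a, k) => sum + a * B[k][j], 0)),
  );
  return matVec(merged, x);
}

// Toy example: d = k = 2, rank r = 1.
const W: Matrix = [[1, 2], [3, 4]];
const A: Matrix = [[1], [2]];
const B: Matrix = [[1, 1]];
const x: Vector = [1, 1];

console.log(loraForward(W, A, B, x));   // → [5, 11]
console.log(mergedForward(W, A, B, x)); // → [5, 11]
```

Because W is shared and only A and B differ per adapter, requests for different adapters can reuse the same base pass and each add their own small delta.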
For a 70B model with rank-32 LoRA, an adapter is ~120 MB, compared with roughly 140 GB for a full 16-bit copy of the weights. You can fit hundreds in GPU memory.
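A back-of-the-envelope estimate shows where a figure in that range comes from. The shapes below are assumptions, not numbers from this page: hidden size 8192, 80 layers, adapters on the query and value projections only (with a 1024-wide value projection), and 16-bit adapter weights. Adapting more layers or using a different precision changes the result considerably.

```typescript
// One adapted weight matrix: LoRA adds rank * (inDim + outDim) parameters to it.
interface LoraTarget {
  inDim: number;
  outDim: number;
}

// Total adapter size in bytes for the given per-layer targets.
function loraAdapterBytes(
  targets: LoraTarget[],
  rank: number,
  layers: number,
  bytesPerParam: number,
): number {
  const paramsPerLayer = targets.reduce(
    (sum, t) => sum + rank * (t.inDim + t.outDim),
    0,
  );
  return paramsPerLayer * layers * bytesPerParam;
}

// Assumed shapes for a 70B-class model (illustrative only).
const assumedTargets: LoraTarget[] = [
  { inDim: 8192, outDim: 8192 }, // query projection
  { inDim: 8192, outDim: 1024 }, // value projection
];

const adapterBytes = loraAdapterBytes(assumedTargets, 32, 80, 2);
console.log(`${(adapterBytes / 1e6).toFixed(0)} MB`); // prints "131 MB"
```

Under these assumptions the estimate lands at about 131 MB, the same order of magnitude as the ~120 MB quoted above.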
Per-request adapter selection
// Each request picks its own adapter.
// The model string has the form "<base-model>:<adapter-id>".
const resp = await client.chat.completions.create({
  model: "meta/llama-4-maverick:my-org/customer-acme-v3",
  messages: [
    { role: "user", content: "Summarize this support ticket" },
  ],
});

// In one batch, you can mix many adapters:
//   request 1 → my-org/customer-acme-v3
//   request 2 → my-org/customer-globex-v1
//   request 3 → my-org/internal-classifier
// They all run on the same forward pass through the base model.
Performance characteristics
| Adapters in batch | Throughput vs base-only | P50 latency overhead |
|---|---|---|
| 1 (homogeneous) | 98% | +1 ms |
| 4 (mixed) | 94% | +2 ms |
| 16 (mixed) | 88% | +4 ms |
| 64 (mixed) | 76% | +9 ms |
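For capacity planning, the table can be turned into a simple lookup. This is a sketch under one assumption that is not stated above: a batch size that falls between two rows is rounded up to the next published row, which gives a conservative estimate. The function names are illustrative, and nothing is extrapolated beyond 64 adapters.

```typescript
interface OverheadRow {
  maxAdapters: number;   // distinct adapters in the batch
  throughput: number;    // fraction of base-only throughput
  p50OverheadMs: number; // added P50 latency in milliseconds
}

// Rows copied from the table above.
const OVERHEAD_TABLE: OverheadRow[] = [
  { maxAdapters: 1, throughput: 0.98, p50OverheadMs: 1 },
  { maxAdapters: 4, throughput: 0.94, p50OverheadMs: 2 },
  { maxAdapters: 16, throughput: 0.88, p50OverheadMs: 4 },
  { maxAdapters: 64, throughput: 0.76, p50OverheadMs: 9 },
];

// Returns the first row that covers the batch, or undefined past the table.
function estimateOverhead(adaptersInBatch: number): OverheadRow | undefined {
  if (adaptersInBatch < 1 || Math.floor(adaptersInBatch) !== adaptersInBatch) {
    throw new Error("adaptersInBatch must be a positive integer");
  }
  for (const row of OVERHEAD_TABLE) {
    if (adaptersInBatch <= row.maxAdapters) return row;
  }
  return undefined; // no published figure beyond 64 adapters
}

// Scales a measured base-only throughput by the table's fraction.
function effectiveTokensPerSec(
  baseTokensPerSec: number,
  adaptersInBatch: number,
): number | undefined {
  const row = estimateOverhead(adaptersInBatch);
  return row === undefined ? undefined : baseTokensPerSec * row.throughput;
}

// Example: a replica measured at 1,000 tokens/s base-only, 10 adapters mixed.
console.log(effectiveTokensPerSec(1000, 10)); // uses the 16-adapter row
```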
Adapter management
Adapters are first-class resources, like API keys. Manage them via dashboard or CLI:
# Upload an adapter
luminet adapters create \
  --base meta/llama-4-maverick \
  --name my-org/customer-acme-v3 \
  --source ./acme-finetune.safetensors

# List active adapters
luminet adapters list

# Promote / demote
luminet adapters set-traffic my-org/customer-acme-v3 --weight 100

# Delete
luminet adapters delete my-org/customer-acme-v3
Hot-swap & pinning
Adapters not currently in use are evicted from GPU memory and held on local NVMe. The first request after eviction pays a ~50 ms cold-load penalty (one-time per adapter per replica). Subsequent requests see the overheads listed in the performance table above.
To eliminate cold loads entirely for a small set of always-hot adapters, pin them:
luminet adapters set-pinned my-org/customer-acme-v3
# Now stays in HBM permanently. Up to 16 pins per replica.
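The interaction between eviction and pinning can be modeled with a small cache. This is a conceptual sketch only: it assumes least-recently-used eviction, which this page does not specify, and the class and method names are invented for illustration. The 16-pin default mirrors the per-replica limit above.

```typescript
// Toy model of one replica's adapter residency (assumed LRU policy).
class AdapterCache {
  private readonly capacity: number;
  private readonly maxPins: number;
  private order: string[] = [];  // resident adapter IDs, least recently used first
  private pinned: string[] = []; // adapter IDs that must never be evicted

  constructor(capacity: number, maxPins: number = 16) {
    // Keep at least one slot for unpinned adapters so eviction always has a victim.
    if (capacity <= maxPins) throw new Error("capacity must exceed maxPins");
    this.capacity = capacity;
    this.maxPins = maxPins;
  }

  // Pins an adapter and loads it if it is not resident yet.
  pin(id: string): void {
    if (this.pinned.indexOf(id) === -1) {
      if (this.pinned.length >= this.maxPins) throw new Error("pin limit reached");
      this.pinned.push(id);
    }
    this.acquire(id);
  }

  // Marks the adapter as used. Returns true when it had to be cold-loaded.
  acquire(id: string): boolean {
    const wasResident = this.isResident(id);
    this.order = this.order.filter((other) => other !== id);
    this.order.push(id);
    if (this.order.length > this.capacity) {
      // Evict the least recently used adapter that is not pinned.
      for (let i = 0; i < this.order.length; i++) {
        if (this.pinned.indexOf(this.order[i]) === -1) {
          this.order.splice(i, 1);
          break;
        }
      }
    }
    return !wasResident;
  }

  isResident(id: string): boolean {
    return this.order.indexOf(id) !== -1;
  }
}

// Example: room for 3 adapters, 1 pin allowed.
const cache = new AdapterCache(3, 1);
cache.pin("my-org/customer-acme-v3");
cache.acquire("my-org/customer-globex-v1");  // cold load
cache.acquire("my-org/internal-classifier"); // cold load
cache.acquire("my-org/customer-globex-v1");  // already resident
cache.acquire("my-org/customer-initech-v2"); // evicts internal-classifier
```

In this model the pinned adapter survives even though it is the least recently used entry, which is the behavior pinning is meant to guarantee.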
Pricing
Adapters are free to upload, store, and switch. You pay only the base model's per-token inference rate. For example, 200 customers each generating 1M tokens/month against Llama 4 Maverick comes to 200M tokens and ~$170 total (about $0.85 per million tokens), the same as serving a single shared model at that volume.
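The pricing example reduces to a one-line formula. In this sketch the $0.85 per million tokens is not a quoted price: it is simply the rate implied by the ~$170 and 200M-token figures above, and the function name is illustrative.

```typescript
// Monthly inference cost when only base-model tokens are billed.
// The number of adapters does not appear in the formula at all.
function monthlyInferenceCost(
  customers: number,
  tokensPerCustomer: number,
  dollarsPerMillionTokens: number,
): number {
  const totalTokens = customers * tokensPerCustomer;
  return (totalTokens / 1e6) * dollarsPerMillionTokens;
}

// Rate implied by the example above: $170 / 200M tokens.
const impliedRate = 170 / 200; // dollars per million tokens

const perCustomerAdapters = monthlyInferenceCost(200, 1e6, impliedRate);
const singleSharedModel = monthlyInferenceCost(1, 200e6, impliedRate);

console.log(perCustomerAdapters.toFixed(2)); // prints "170.00"
console.log(singleSharedModel.toFixed(2));   // prints "170.00"
```

Both calls bill the same 200M tokens, so 200 per-customer adapters and one shared model cost the same.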
When NOT to use multi-LoRA
- Massive domain shift: if your fine-tune changes > 5% of weights, full fine-tuning beats LoRA. See fine-tuning.
- Base model upgrades: adapters are tied to a specific base model version. When you upgrade Llama 4 → Llama 5, you re-train adapters. Plan accordingly.
- Few high-traffic adapters (≤ 5): at that scale, just deploy each as a separate tenant.
TL;DR
Multi-LoRA lets you serve a per-customer fine-tune for the price of one shared model. Mixing 16 adapters in a batch keeps about 88% of base-only throughput at +4 ms P50 latency. Pin the always-hot ones and let the long tail evict automatically. Pricing is the base-model per-token rate, full stop.