Multi-LoRA serving

Serve hundreds of fine-tuned LoRA adapters on a single base-model GPU. Adapter switch overhead under 2 ms. Per-request adapter selection. This is how SaaS teams give every customer a custom model without going bankrupt on GPUs.

The problem multi-LoRA solves

You have 200 customers. Each wants a fine-tuned model for their domain. Naïvely, that's 200 separate model deployments × $5/hr per GPU × 24 hr = $24K/day in GPU cost before you serve a single request. Most of those models also see near-zero traffic, but you pay for the GPU regardless.
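The arithmetic behind that figure, as a minimal sketch. It assumes one GPU per dedicated deployment and a single GPU for the shared deployment; the function name is illustrative, not part of any SDK.

```typescript
// Daily GPU cost for a number of always-on deployments.
// Assumption: each deployment occupies exactly one GPU, billed around the clock.
function dailyGpuCostUsd(deployments: number, usdPerGpuHour: number): number {
  const hoursPerDay = 24;
  return deployments * usdPerGpuHour * hoursPerDay;
}

// Figures from the text: 200 customers, $5 per GPU-hour.
const dedicatedCost = dailyGpuCostUsd(200, 5); // one deployment per customer
const sharedCost = dailyGpuCostUsd(1, 5);      // one shared base-model deployment

console.log({ dedicatedCost, sharedCost }); // { dedicatedCost: 24000, sharedCost: 120 }
```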

With multi-LoRA, you deploy one base model and load 200 small adapters (typically 50-200 MB each, so roughly 10-40 GB for all 200) into GPU memory. Each incoming request specifies which adapter to apply. Applying an adapter adds only a small low-rank matmul on top of the base computation, so switching between adapters is cheap.

How LoRA works (60-second refresher)

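LoRA freezes the base model weights and learns a low-rank update ΔW = A · B, where A and B are small matrices whose shared inner dimension (the rank r) is far smaller than the model's hidden size. Inference becomes y = (W + ΔW) x = Wx + (A · B) x. The first term is the regular base forward pass. The second term is a small extra matmul, computed in the same forward step; in practice it is evaluated as A (B x), so the full-size ΔW is never materialized.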
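The formula can be sketched in a few lines. This is a toy illustration on plain arrays, not how a serving engine implements it. It assumes W is d_out × d_in, A is d_out × r and B is r × d_in, which is one convention consistent with ΔW = A · B; all names are illustrative.

```typescript
type Matrix = number[][];
type Vector = number[];

// Plain matrix-vector product: one dot product per row of m.
function matVec(m: Matrix, v: Vector): Vector {
  return m.map((row) => row.reduce((sum, w, j) => sum + w * (v[j] ?? 0), 0));
}

// y = Wx + A(Bx). The low-rank path goes through a length-r vector,
// so the full d_out x d_in matrix A·B is never built.
function loraForward(W: Matrix, A: Matrix, B: Matrix, x: Vector): Vector {
  const base = matVec(W, x);     // regular base forward pass, Wx
  const down = matVec(B, x);     // project down to rank r, Bx
  const delta = matVec(A, down); // project back up, A(Bx)
  return base.map((b, i) => b + (delta[i] ?? 0));
}

// Toy example: d_out = d_in = 2, rank r = 1.
const W: Matrix = [[1, 0], [0, 1]];
const A: Matrix = [[1], [2]];
const B: Matrix = [[3, 4]];
const x: Vector = [1, 1];

const y = loraForward(W, A, B, x);
console.log(y); // [8, 15], the same as (W + A·B) x with A·B = [[3, 4], [6, 8]]
```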

For a 70B model with rank-32 LoRA, an adapter is ~120 MB instead of 140 GB. You can fit hundreds in GPU memory.
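A rough way to sanity-check the adapter size is to count parameters: each adapted weight matrix of shape d_out × d_in adds r · (d_in + d_out) parameters. The sketch below assumes 80 layers, a hidden size of 8192, fp16 storage, and adapters on only two projections per layer. Those shapes and target modules are assumptions for illustration; the text does not say which modules are adapted.

```typescript
interface TargetModule {
  name: string;
  dIn: number;
  dOut: number;
}

// Parameters added by LoRA: rank * (dIn + dOut) per adapted matrix, per layer.
function loraParamCount(rank: number, layers: number, targets: TargetModule[]): number {
  const perLayer = targets.reduce((sum, t) => sum + rank * (t.dIn + t.dOut), 0);
  return perLayer * layers;
}

// ASSUMED shapes for a 70B-class transformer (not taken from the text).
const assumedTargets: TargetModule[] = [
  { name: "q_proj", dIn: 8192, dOut: 8192 },
  { name: "v_proj", dIn: 8192, dOut: 1024 },
];

const bytesPerParam = 2; // fp16
const adapterParams = loraParamCount(32, 80, assumedTargets);
const adapterBytes = adapterParams * bytesPerParam;
const baseBytes = 70e9 * bytesPerParam; // 140 GB, as in the text

console.log((adapterBytes / 1e6).toFixed(0) + " MB"); // "131 MB"
console.log((baseBytes / 1e9).toFixed(0) + " GB");    // "140 GB"
```

Under these assumptions the adapter comes out at about 131 MB, the same order of magnitude as the ~120 MB quoted above. Adapting more modules or raising the rank increases the size proportionally.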

Per-request adapter selection

// Each request picks its own adapter.
// Assumes `client` is an already-configured OpenAI-compatible SDK client.
const resp = await client.chat.completions.create({
  model: "meta/llama-4-maverick:my-org/customer-acme-v3",
  //                          \______ adapter ID
  messages: [
    { role: "user", content: "Summarize this support ticket" },
  ],
});

// In one batch, you can mix many adapters:
//   request 1 → my-org/customer-acme-v3
//   request 2 → my-org/customer-globex-v1
//   request 3 → my-org/internal-classifier
// They all run on the same forward pass through the base model.
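A small sketch of how a caller might build requests for several adapters at once, using the `base-model:adapter-id` format shown above. The `withAdapter` helper and the `ChatRequest` type are hypothetical and not part of any SDK. Whether concurrent requests actually end up in the same batch is decided by the server, not the caller.

```typescript
interface ChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

// Hypothetical helper: joins base model and adapter ID as "base:adapter".
function withAdapter(baseModel: string, adapterId: string, prompt: string): ChatRequest {
  return {
    model: `${baseModel}:${adapterId}`,
    messages: [{ role: "user", content: prompt }],
  };
}

const baseModel = "meta/llama-4-maverick";
const requests: ChatRequest[] = [
  withAdapter(baseModel, "my-org/customer-acme-v3", "Summarize this support ticket"),
  withAdapter(baseModel, "my-org/customer-globex-v1", "Summarize this support ticket"),
  withAdapter(baseModel, "my-org/internal-classifier", "Classify this support ticket"),
];

const models = requests.map((r) => r.model);
console.log(models);

// With a configured client, the requests could be sent concurrently, e.g.:
//   await Promise.all(requests.map((r) => client.chat.completions.create(r)));
```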

Performance characteristics

Adapters in batch    Throughput vs base-only    P50 latency overhead
1 (homogeneous)      98%                        +1 ms
4 (mixed)            94%                        +2 ms
16 (mixed)           88%                        +4 ms
64 (mixed)           76%                        +9 ms

Adapter management

Adapters are first-class resources, like API keys. Manage them via dashboard or CLI:

# Upload an adapter
luminet adapters create \
  --base meta/llama-4-maverick \
  --name my-org/customer-acme-v3 \
  --source ./acme-finetune.safetensors

# List active adapters
luminet adapters list

# Promote / demote
luminet adapters set-traffic my-org/customer-acme-v3 --weight 100

# Delete
luminet adapters delete my-org/customer-acme-v3

Hot-swap & pinning

Adapters not currently in use are evicted from GPU memory and held on local NVMe. The first request after eviction pays a ~50 ms cold-load penalty (one-time per adapter per replica). Subsequent requests see only the overheads listed under Performance characteristics.

To eliminate cold-load entirely for a small set of always-hot adapters, pin them with set-pinned:

luminet adapters set-pinned my-org/customer-acme-v3
# Now stays in HBM permanently. Up to 16 pins per replica.
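The interaction between eviction and pinning can be pictured with a small in-memory model. This is a conceptual sketch only: it assumes a least-recently-used eviction policy, which the text does not specify, and the class and method names are made up for illustration.

```typescript
type LoadResult = "hit" | "cold-load";

class AdapterCache {
  private readonly capacity: number;
  private readonly maxPins: number;
  // Set iteration order doubles as recency order: first = least recently used.
  private readonly resident = new Set<string>();
  private readonly pinned = new Set<string>();

  constructor(capacity: number, maxPins: number = 16) {
    this.capacity = capacity;
    this.maxPins = maxPins;
  }

  // Serve a request for an adapter, loading it (and evicting another) if needed.
  use(id: string): LoadResult {
    if (this.resident.has(id)) {
      this.resident.delete(id); // re-insert to mark as most recently used
      this.resident.add(id);
      return "hit";
    }
    if (this.resident.size >= this.capacity) this.evictOne();
    this.resident.add(id);
    return "cold-load";
  }

  // Pin an adapter so eviction skips it; also makes sure it is loaded.
  pin(id: string): void {
    if (!this.pinned.has(id) && this.pinned.size >= this.maxPins) {
      throw new Error(`pin limit of ${this.maxPins} reached`);
    }
    this.pinned.add(id);
    this.use(id);
  }

  isResident(id: string): boolean {
    return this.resident.has(id);
  }

  // Evict the least recently used adapter that is not pinned.
  private evictOne(): void {
    for (const id of this.resident) {
      if (!this.pinned.has(id)) {
        this.resident.delete(id);
        return;
      }
    }
    throw new Error("all resident adapters are pinned");
  }
}

// Capacity of 2 to force evictions quickly.
const cache = new AdapterCache(2);
cache.pin("my-org/customer-acme-v3");
const first = cache.use("my-org/customer-globex-v1");   // "cold-load"
const second = cache.use("my-org/internal-classifier"); // "cold-load", evicts globex
const third = cache.use("my-org/customer-acme-v3");     // "hit", pinned adapters stay loaded
console.log(first, second, third);
```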

Pricing

Adapters are free to upload, store, and switch. You only pay for the per-token inference rate of the base model. For example, 200 customers each generating 1M tokens/month against Llama 4 Maverick (200M tokens in total) cost you ~$170, which implies a rate of about $0.85 per 1M tokens. That is the same as serving a single shared model.
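The pricing example works out as follows. The rate of $0.85 per 1M tokens is not quoted anywhere on this page; it is back-derived from the ~$170 total and should be treated as an assumption. The function name is illustrative.

```typescript
// Monthly cost when only base-model tokens are billed (adapters cost nothing).
function monthlyCostUsd(
  customers: number,
  tokensPerCustomer: number,
  usdPerMillionTokens: number,
): number {
  const totalTokens = customers * tokensPerCustomer;
  return (totalTokens / 1e6) * usdPerMillionTokens;
}

// ASSUMED rate, back-derived from the example: $170 / 200M tokens.
const assumedRateUsdPerMillion = 0.85;

const multiLoraCost = monthlyCostUsd(200, 1e6, assumedRateUsdPerMillion);
// The same 200M tokens through one shared model, with no adapters at all.
const sharedModelCost = monthlyCostUsd(1, 200e6, assumedRateUsdPerMillion);

console.log(multiLoraCost.toFixed(2), sharedModelCost.toFixed(2)); // "170.00" "170.00"
```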

When NOT to use multi-LoRA

  • Massive domain shift: if your fine-tune changes more than 5% of weights, full fine-tuning beats LoRA. See fine-tuning.
  • Base model upgrades: adapters are tied to a specific base model version. When you upgrade Llama 4 → Llama 5, you re-train adapters. Plan accordingly.
  • Few high-traffic adapters (≤ 5): at that scale, just deploy each as a separate tenant.

TL;DR

Multi-LoRA lets you serve a per-customer fine-tune for the price of one shared model. With up to 16 mixed adapters per batch, throughput stays at 88% or more of base-only and P50 latency rises by at most 4 ms. Pin the always-hot ones, evict the long-tail automatically. Pricing is the base-model per-token rate, full stop.