Bring your own model

Deploy any open-weight checkpoint in minutes

Push a HuggingFace model, your own fine-tune, or a LoRA adapter. Luminet handles quantization, batching, scaling, and monitoring on the same FireAttention runtime that powers our hosted catalog.

1

Push the checkpoint

Upload from HuggingFace, S3, or your local machine via the CLI. We accept Llama, Qwen, Mistral, and Gemma architectures out of the box.

2

We quantize & batch

FP8 quantization runs automatically (BF16 / INT4 also available). We benchmark your model end-to-end and pick the optimal kernel layout.

3

Get an OpenAI-compatible URL

Your model gets a private endpoint at api.luminet.ai/v1 with a unique model ID. Same SDK, same response shape, your weights.

deploy.sh
# 1. Install the CLI
bun add -g @luminet/cli

# 2. Log in
luminet auth login

# 3. Deploy from HuggingFace
luminet deploy \
  --source hf://your-org/your-llama-finetune \
  --name custom-llama-70b \
  --quantization fp8

# Output:
# ✓ Model uploaded (12.4 GB)
# ✓ Quantized FP8 in 2m 14s
# ✓ Benchmarked: 285 tok/s @ batch 32
# ✓ Deployed to api.luminet.ai/v1
# Model ID: yourorg/custom-llama-70b

Supported architectures

Llama (1, 2, 3, 4)
Qwen (1.5, 2, 3)
Mistral / Mixtral
Gemma (1, 2, 3)
DeepSeek (V2, V3, V4)
GLM (4, 4.6)
Yi (6B-34B, Large)
Phi (3, 4)
Custom HF transformer

Pricing

Custom models are billed at the dedicated GPU-hour rate of the cluster they run on. See the pricing page for current GPU rates. Auto-scaling included; you only pay for the GPU-hours consumed.