Bring your own model

Deploy any open-weight checkpoint in minutes

Push a HuggingFace model, your own fine-tune, or a LoRA adapter. Luminet handles quantization, batching, scaling, and monitoring on the same FireAttention runtime that powers our hosted catalog.

Push the checkpoint

Upload from HuggingFace, S3, or your local machine via the CLI. We accept Llama, Qwen, Mistral, and Gemma architectures out of the box.

We quantize & batch

FP8 quantization runs automatically (BF16 / INT4 also available). We benchmark your model end-to-end and pick the optimal kernel layout.

Get an OpenAI-compatible URL

Your model gets a private endpoint at api.luminet.ai/v1 with a unique model ID. Same SDK, same response shape, your weights.

deploy.sh

# 1. Install the CLI
bun add -g @luminet/cli

# 2. Log in
luminet auth login

# 3. Deploy from HuggingFace
luminet deploy \
  --source hf://your-org/your-llama-finetune \
  --name custom-llama-70b \
  --quantization fp8

# Output:
# ✓ Model uploaded (12.4 GB)
# ✓ Quantized FP8 in 2m 14s
# ✓ Benchmarked: 285 tok/s @ batch 32
# ✓ Deployed to api.luminet.ai/v1
# Model ID: yourorg/custom-llama-70b

Supported architectures

Llama (1, 2, 3, 4)

Qwen (1.5, 2, 3)

Mistral / Mixtral

Gemma (1, 2, 3)

DeepSeek (V2, V3, V4)

GLM (4, 4.6)

Yi (6B-34B, Large)

Phi (3, 4)

Custom HF transformer

Pricing

Custom models are billed at the dedicated GPU-hour rate of the cluster they run on. See the pricing page for current GPU rates. Auto-scaling included; you only pay for the GPU-hours consumed.

Start deploying Or fine-tune first →