Deploy any open-weight checkpoint in minutes
Push a HuggingFace model, your own fine-tune, or a LoRA adapter. Luminet handles quantization, batching, scaling, and monitoring on the same FireAttention runtime that powers our hosted catalog.
Push the checkpoint
Upload from HuggingFace, S3, or your local machine via the CLI. We accept Llama, Qwen, Mistral, and Gemma architectures out of the box.
We quantize & batch
FP8 quantization runs automatically (BF16 / INT4 also available). We benchmark your model end-to-end and pick the optimal kernel layout.
Get an OpenAI-compatible URL
Your model gets a private endpoint at api.luminet.ai/v1 with a unique model ID. Same SDK, same response shape, your weights.
# 1. Install the CLI bun add -g @luminet/cli # 2. Log in luminet auth login # 3. Deploy from HuggingFace luminet deploy \ --source hf://your-org/your-llama-finetune \ --name custom-llama-70b \ --quantization fp8 # Output: # ✓ Model uploaded (12.4 GB) # ✓ Quantized FP8 in 2m 14s # ✓ Benchmarked: 285 tok/s @ batch 32 # ✓ Deployed to api.luminet.ai/v1 # Model ID: yourorg/custom-llama-70b
Supported architectures
Pricing
Custom models are billed at the dedicated GPU-hour rate of the cluster they run on. See the pricing page for current GPU rates. Auto-scaling included; you only pay for the GPU-hours consumed.