Build with confidence
From your first request to deploying custom fine-tunes — and the engineering deep-dives behind FireAttention's 2.4× speedup.
Getting started
Make your first call. Understand the API surface.
Models & deployment
What we host, how routing works, and how to bring your own.
Inference optimization
Engineer-grade guides to the techniques behind FireAttention.
Quantization
FP8 vs INT4 vs BF16 — when to use what.
Speculative decoding
Draft models for a 2-3× wall-clock speedup.
Continuous batching
Iteration-level scheduling, the four knobs.
Structured output
JSON Schema & grammar-constrained sampling.
KV cache & prefix caching
PagedAttention, prefix reuse, KV quantization.
Expert parallelism (MoE)
TP / DP / EP, all-to-all cost, EP=16 on Blackwell.
Disaggregated prefill / decode
Two pools, two parallelism strategies, one rack.
Multi-LoRA serving
Hundreds of adapters, one base GPU.
Function calling
Tool definitions, parallel calls, agent loops.
Streaming optimization
SSE chunking, partial-token handling, TTFT.
Long-context inference
From 128K to 10M tokens — cost & speed.
Guides
End-to-end tutorials for common patterns.
Hello world (TypeScript)
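import OpenAI from "openai";

// The endpoint is OpenAI-compatible, so the official SDK works with a custom baseURL.
const client = new OpenAI({
  baseURL: "https://api.luminet.ai/v1",
  apiKey: process.env.LUMINET_API_KEY,
});

// Top-level await requires an ES module context.
const stream = await client.chat.completions.create({
  model: "deepseek/deepseek-v3.2-exp",
  messages: [{ role: "user", content: "Write a haiku about FP8." }],
  stream: true,
});

// Print each content delta as it arrives; some chunks carry no content.
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}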
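
Measuring TTFT (TypeScript)
A minimal sketch of how the streaming loop above could also record time to first token (TTFT) and keep the full reply. `collectStream`, `ChunkLike`, and `StreamResult` are hypothetical names for this illustration, not part of the openai SDK or the Luminet API. The only assumption is that each chunk has the `choices[0].delta.content` shape used in the hello-world loop.

```typescript
// Hypothetical helper (not an SDK function): drains a chat-completion stream,
// concatenates the content deltas, and records time to first token.

// Minimal structural type for the chunks the hello-world loop reads.
interface ChunkLike {
  choices: Array<{ delta?: { content?: string | null } }>;
}

interface StreamResult {
  text: string;          // full concatenated reply
  ttftMs: number | null; // ms until the first non-empty delta; null if none arrived
  chunkCount: number;    // total chunks seen, including content-free ones
}

async function collectStream(
  stream: AsyncIterable<ChunkLike>,
  now: () => number = () => Date.now(), // injectable clock, handy for testing
): Promise<StreamResult> {
  const start = now();
  let text = "";
  let ttftMs: number | null = null;
  let chunkCount = 0;

  for await (const chunk of stream) {
    chunkCount += 1;
    // Role-only or final chunks may have no content; treat them as empty.
    const piece = chunk.choices[0]?.delta?.content ?? "";
    if (piece.length > 0 && ttftMs === null) {
      ttftMs = now() - start;
    }
    text += piece;
  }

  return { text, ttftMs, chunkCount };
}
```

With the `stream` from the hello-world example, `const { text, ttftMs } = await collectStream(stream);` would return the reply and the measured latency instead of printing deltas as they arrive. Note that this measures TTFT from the moment the helper starts reading, so it excludes the time spent awaiting `create(...)` itself; start the clock before that call if request latency should count.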