Luminet Embeddings Inference (LEI)
Embeddings at 2× the throughput.
LEI is a purpose-built embeddings runtime, separate from the LLM stack. Custom CUDA kernels for sequence pooling, FP8 weights, and dynamic batching deliver 2× the throughput and 12% lower latency compared with reference servers such as TEI and Infinity.
Throughput vs. TEI: 2.0×
P50 latency (BGE-M3): 8.4 ms
Cost per 1M tokens: −40%
Models
| Model | Model ID | Dim | Context | $/1M tokens | Embeds/sec/replica |
|---|---|---|---|---|---|
| BGE-M3 | `embeddings/bge-m3` | 1024 | 8K | $0.012 | 2,400 |
| NV-Embed v2 | `embeddings/nv-embed-v2` | 4096 | 32K | $0.018 | 1,650 |
| Jina Embeddings v3 | `embeddings/jina-v3` | 1024 | 8K | $0.010 | 2,800 |
| Qwen3 Embedding 8B | `embeddings/qwen3-embed` | 4096 | 32K | $0.024 | 1,400 |
| Voyage-3 Large (routed) | `embeddings/voyage-3` | 1024 | 32K | $0.18 | — |
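
To make the pricing and throughput columns concrete, here is a rough sizing sketch. The prices and per-replica rates are copied from the table above; the helper names (`estimateCostUsd`, `replicasNeeded`) are hypothetical, and the replica estimate assumes throughput scales linearly with replica count, which the table does not state.

```typescript
// Figures copied from the Models table; helper names are illustrative only.
interface ModelSpec {
  pricePerMTokens: number; // USD per 1M tokens
  embedsPerSecPerReplica: number | null; // null where the table lists no figure
}

const MODELS: Record<string, ModelSpec> = {
  "embeddings/bge-m3": { pricePerMTokens: 0.012, embedsPerSecPerReplica: 2400 },
  "embeddings/nv-embed-v2": { pricePerMTokens: 0.018, embedsPerSecPerReplica: 1650 },
  "embeddings/jina-v3": { pricePerMTokens: 0.01, embedsPerSecPerReplica: 2800 },
  "embeddings/qwen3-embed": { pricePerMTokens: 0.024, embedsPerSecPerReplica: 1400 },
  "embeddings/voyage-3": { pricePerMTokens: 0.18, embedsPerSecPerReplica: null },
};

function lookup(modelId: string): ModelSpec {
  const spec = MODELS[modelId];
  if (!spec) throw new Error(`unknown model: ${modelId}`);
  return spec;
}

// Cost in USD to embed `totalTokens` tokens at the listed price.
function estimateCostUsd(modelId: string, totalTokens: number): number {
  return (totalTokens / 1_000_000) * lookup(modelId).pricePerMTokens;
}

// Replicas needed for a target rate, ASSUMING linear scaling across replicas.
// Returns null when the table gives no per-replica figure.
function replicasNeeded(modelId: string, targetEmbedsPerSec: number): number | null {
  const rate = lookup(modelId).embedsPerSecPerReplica;
  if (rate === null) return null;
  return Math.ceil(targetEmbedsPerSec / rate);
}

// 500M tokens on BGE-M3 at $0.012 per 1M tokens.
console.log(estimateCostUsd("embeddings/bge-m3", 500_000_000).toFixed(2)); // "6.00"
// 5,000 embeds/sec at 2,400 per replica rounds up to 3 replicas.
console.log(replicasNeeded("embeddings/bge-m3", 5000)); // 3
```

Real throughput depends on input length and batch shape, so treat the replica count as a starting point for a load test, not a capacity guarantee.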
Drop-in OpenAI-compatible API
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.luminet.ai/v1",
  apiKey: process.env.LUMINET_API_KEY,
});

const resp = await client.embeddings.create({
  model: "embeddings/bge-m3",
  input: ["hello world", "luminet ships embeddings"],
});
// resp.data[0].embedding → number[] of length 1024
```

Need a vector DB to go with it?
We integrate cleanly with Pinecone, Weaviate, Qdrant, and Turbopuffer. One-line index update when you swap embedding models.
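
One way to keep a model swap down to a single changed line on the application side is to derive everything else (index name, expected vector width) from the model ID. The sketch below is an assumption about how you might structure that, not part of the LEI API: `ACTIVE_MODEL`, `indexNameFor`, and `checkDimension` are hypothetical names, and the `docs-` index prefix is made up. Dimensions come from the Models table. No vector DB client calls are shown, since those differ per provider.

```typescript
// Dimensions copied from the Models table; all helper names are hypothetical.
const EMBEDDING_DIMS = {
  "embeddings/bge-m3": 1024,
  "embeddings/nv-embed-v2": 4096,
  "embeddings/jina-v3": 1024,
  "embeddings/qwen3-embed": 4096,
  "embeddings/voyage-3": 1024,
} as const;

type ModelId = keyof typeof EMBEDDING_DIMS;

// The single line to change when swapping embedding models.
const ACTIVE_MODEL: ModelId = "embeddings/bge-m3";

// Derive an index name that encodes both the model and its dimension,
// so vectors from different models never share an index.
function indexNameFor(model: ModelId): string {
  const shortName = model.split("/")[1];
  return `docs-${shortName}-${EMBEDDING_DIMS[model]}`;
}

// Guard against writing a vector of the wrong width into an index.
function checkDimension(model: ModelId, vector: number[]): void {
  const expected = EMBEDDING_DIMS[model];
  if (vector.length !== expected) {
    throw new Error(`expected ${expected} dims for ${model}, got ${vector.length}`);
  }
}

console.log(indexNameFor(ACTIVE_MODEL)); // "docs-bge-m3-1024"
```

Vectors produced by different models are not comparable even when their dimensions match, so a swap still means re-embedding the existing documents into the new index.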