Luminet Embeddings Inference (LEI)

Embeddings at 2× the throughput.

LEI is a purpose-built embeddings runtime, separate from the LLM stack. Custom CUDA kernels for sequence pooling, FP8 weights, and dynamic batching deliver 2× higher throughput and 12% lower latency than reference servers such as TEI and Infinity.

Throughput vs. TEI: 2.0×
P50 latency (BGE-M3): 8.4 ms
Cost per 1M tokens: −40%

Models

| Model | Model ID | Dim | Context | $/1M tokens | Embeds/sec/replica |
|---|---|---|---|---|---|
| BGE-M3 | embeddings/bge-m3 | 1024 | 8K | $0.012 | 2,400 |
| NV-Embed v2 | embeddings/nv-embed-v2 | 4096 | 32K | $0.018 | 1,650 |
| Jina Embeddings v3 | embeddings/jina-v3 | 1024 | 8K | $0.010 | 2,800 |
| Qwen3 Embedding 8B | embeddings/qwen3-embed | 4096 | 32K | $0.024 | 1,400 |
| Voyage-3 Large (routed) | embeddings/voyage-3 | 1024 | 32K | $0.18 | |
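
To get a feel for what these numbers mean for a real job, here is a rough back-of-the-envelope estimator. It is a sketch, not an official calculator: the prices and throughput figures are copied from the Models table above, while `estimateJob`, `numChunks`, and `avgTokensPerChunk` are hypothetical names introduced for illustration. It also assumes the published embeds/sec figure applies to your chunk length, which may not hold for long inputs.

```typescript
// Figures copied from the Models table; everything else here is illustrative.
interface ModelSpec {
  pricePerMTokens: number;        // USD per 1M input tokens
  embedsPerSecPerReplica: number; // embeddings per second, per replica
}

const MODELS: Record<string, ModelSpec> = {
  "embeddings/bge-m3": { pricePerMTokens: 0.012, embedsPerSecPerReplica: 2400 },
  "embeddings/nv-embed-v2": { pricePerMTokens: 0.018, embedsPerSecPerReplica: 1650 },
  "embeddings/jina-v3": { pricePerMTokens: 0.010, embedsPerSecPerReplica: 2800 },
  "embeddings/qwen3-embed": { pricePerMTokens: 0.024, embedsPerSecPerReplica: 1400 },
};

// Hypothetical helper: estimates cost and wall-clock time for a bulk embedding job.
function estimateJob(
  model: string,
  numChunks: number,
  avgTokensPerChunk: number,
  replicas: number = 1,
): { costUsd: number; seconds: number } {
  const spec = MODELS[model];
  if (!spec) throw new Error(`unknown model: ${model}`);
  const totalTokens = numChunks * avgTokensPerChunk;
  return {
    // Cost scales with tokens: (tokens / 1M) * price per 1M tokens.
    costUsd: (totalTokens / 1_000_000) * spec.pricePerMTokens,
    // Time scales with the number of embeddings, not tokens.
    seconds: numChunks / (spec.embedsPerSecPerReplica * replicas),
  };
}

// 1.2M chunks of ~500 tokens on one BGE-M3 replica:
// roughly $7.20 and about 500 seconds.
const estimate = estimateJob("embeddings/bge-m3", 1_200_000, 500);
console.log(estimate);
```

Doubling `replicas` halves the time estimate but leaves the cost estimate unchanged, since cost in this sketch depends only on token count.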

Drop-in OpenAI-compatible API

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.luminet.ai/v1",
  apiKey: process.env.LUMINET_API_KEY,
});

const resp = await client.embeddings.create({
  model: "embeddings/bge-m3",
  input: ["hello world", "luminet ships embeddings"],
});

// resp.data[0].embedding is a number[] of length 1024 (the BGE-M3 dimension)
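
Once you have two embeddings back, the usual next step is to compare them. The sketch below computes cosine similarity between two vectors of the kind found in `resp.data[i].embedding`. `cosineSimilarity` is a hypothetical helper, not part of the OpenAI SDK or the LEI API, and the toy 3-dimensional vectors stand in for real 1024-dimensional embeddings so the example runs without a network call.

```typescript
// Hypothetical helper: cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  if (denom === 0) throw new Error("cannot compare a zero vector");
  return dot / denom;
}

// Toy vectors for illustration only; real embeddings come from the API response,
// e.g. cosineSimilarity(resp.data[0].embedding, resp.data[1].embedding).
const sameDirection = cosineSimilarity([1, 2, 3], [2, 4, 6]);
const orthogonal = cosineSimilarity([1, 0, 0], [0, 1, 0]);
console.log(sameDirection, orthogonal);
```

The explicit dimension check matters when you mix models: a 1024-dimensional vector and a 4096-dimensional vector cannot be compared, and the helper fails loudly instead of returning a misleading score.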

Need a vector DB to go with it?

We integrate cleanly with Pinecone, Weaviate, Qdrant, and Turbopuffer. Swapping embedding models takes a one-line index update.
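
Before swapping models, it helps to check whether the new model's output dimension matches your existing index, since most vector databases fix the dimension per index or collection. The following is a minimal sketch under that assumption. The dimensions are copied from the Models table; `planModelSwap` and its return values are hypothetical and do not correspond to any LEI or vector DB API. Vectors from different models are generally not comparable, so the sketch assumes existing documents are re-embedded in either case.

```typescript
// Output dimensions copied from the Models table.
const MODEL_DIMS: Record<string, number> = {
  "embeddings/bge-m3": 1024,
  "embeddings/nv-embed-v2": 4096,
  "embeddings/jina-v3": 1024,
  "embeddings/qwen3-embed": 4096,
  "embeddings/voyage-3": 1024,
};

type SwapPlan = "reuse-index-and-reembed" | "create-new-index-and-reembed";

// Hypothetical helper: decides whether an existing index can hold the new model's vectors.
function planModelSwap(currentIndexDim: number, newModel: string): SwapPlan {
  const newDim = MODEL_DIMS[newModel];
  if (newDim === undefined) throw new Error(`unknown model: ${newModel}`);
  // Same dimension: the index schema still fits, but old vectors must be replaced.
  // Different dimension: the index itself has to be recreated at the new size.
  return newDim === currentIndexDim
    ? "reuse-index-and-reembed"
    : "create-new-index-and-reembed";
}

// An index built for BGE-M3 (1024 dims):
console.log(planModelSwap(1024, "embeddings/jina-v3"));     // reuse-index-and-reembed
console.log(planModelSwap(1024, "embeddings/nv-embed-v2")); // create-new-index-and-reembed
```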