Luminet Embeddings Inference (LEI)

Embeddings at 2× the throughput.

LEI is a purpose-built embeddings runtime, separate from the LLM stack. Custom CUDA kernels for sequence pooling, FP8 weights, and dynamic batching deliver 2× the throughput and 12% lower latency than reference servers such as TEI and Infinity.

Start embedding | Read the docs

Throughput vs. TEI: 2.0×
P50 latency (BGE-M3): 8.4 ms
Cost per 1M tokens: −40%

Models

Model	Model ID	Dim	Context	$/1M tokens	Embeds/sec/replica
BGE-M3	embeddings/bge-m3	1024	8K	$0.012	2,400
NV-Embed v2	embeddings/nv-embed-v2	4096	32K	$0.018	1,650
Jina Embeddings v3	embeddings/jina-v3	1024	8K	$0.010	2,800
Qwen3 Embedding 8B	embeddings/qwen3-embed	4096	32K	$0.024	1,400
Voyage-3 Large (routed)	embeddings/voyage-3	1024	32K	$0.18	—
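
To get a feel for what the table means for a real workload, here is a rough back-of-the-envelope sketch. The prices, dimensions, and throughput figures are copied from the table above; the helper names (`estimateCostUsd`, `replicasNeeded`, `indexBytes`) are hypothetical and not part of any Luminet SDK. It assumes throughput scales linearly with replica count and that vectors are stored as float32, neither of which the table itself states.

```typescript
// Figures copied from the models table; everything else here is illustrative.
interface ModelSpec {
  dim: number;
  usdPerMillionTokens: number;
  embedsPerSecPerReplica: number | null; // null where the table lists no figure
}

const MODELS: Record<string, ModelSpec> = {
  "embeddings/bge-m3": { dim: 1024, usdPerMillionTokens: 0.012, embedsPerSecPerReplica: 2400 },
  "embeddings/nv-embed-v2": { dim: 4096, usdPerMillionTokens: 0.018, embedsPerSecPerReplica: 1650 },
  "embeddings/jina-v3": { dim: 1024, usdPerMillionTokens: 0.01, embedsPerSecPerReplica: 2800 },
  "embeddings/qwen3-embed": { dim: 4096, usdPerMillionTokens: 0.024, embedsPerSecPerReplica: 1400 },
  "embeddings/voyage-3": { dim: 1024, usdPerMillionTokens: 0.18, embedsPerSecPerReplica: null },
};

function getSpec(modelId: string): ModelSpec {
  const spec = MODELS[modelId];
  if (!spec) throw new Error(`unknown model: ${modelId}`);
  return spec;
}

// Token cost for a corpus: (tokens / 1M) * price per 1M tokens.
function estimateCostUsd(modelId: string, totalTokens: number): number {
  return (totalTokens / 1_000_000) * getSpec(modelId).usdPerMillionTokens;
}

// Replicas needed for a target rate, ASSUMING linear scaling across replicas.
function replicasNeeded(modelId: string, targetEmbedsPerSec: number): number | null {
  const perReplica = getSpec(modelId).embedsPerSecPerReplica;
  if (perReplica === null) return null; // no throughput figure published
  return Math.ceil(targetEmbedsPerSec / perReplica);
}

// Raw vector storage, ASSUMING 4-byte float32 components and no index overhead.
function indexBytes(modelId: string, vectorCount: number): number {
  return vectorCount * getSpec(modelId).dim * 4;
}

// Example: 500M tokens through BGE-M3, served at 5,000 embeds/sec.
console.log(estimateCostUsd("embeddings/bge-m3", 500_000_000).toFixed(2));
console.log(replicasNeeded("embeddings/bge-m3", 5000));
console.log(indexBytes("embeddings/bge-m3", 1_000_000));
```

Note that the 4096-dimension models take four times the vector storage of the 1024-dimension ones for the same number of documents, which can matter more than the per-token price at index scale.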

Drop-in OpenAI-compatible API

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.luminet.ai/v1",
  apiKey: process.env.LUMINET_API_KEY,
});

const resp = await client.embeddings.create({
  model: "embeddings/bge-m3",
  input: ["hello world", "luminet ships embeddings"],
});

// resp.data[0].embedding is a number[] of length 1024 (the BGE-M3 dimension)
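
Once the vectors are back, a common next step is to compare them. The sketch below is one minimal way to do that with cosine similarity. It is plain TypeScript with no Luminet-specific calls, and `cosineSimilarity` and `rankBySimilarity` are hypothetical helper names introduced here for illustration.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank document vectors against a query vector, most similar first.
function rankBySimilarity(
  query: number[],
  docs: number[][],
): { index: number; score: number }[] {
  return docs
    .map((doc, index) => ({ index, score: cosineSimilarity(query, doc) }))
    .sort((x, y) => y.score - x.score);
}

// Toy 2-dimensional vectors stand in for real embeddings here.
const ranked = rankBySimilarity([1, 0], [[0, 1], [1, 0], [1, 1]]);
console.log(ranked.map((r) => r.index));
```

With the response from the snippet above, the input vectors would come from `resp.data.map((d) => d.embedding)`. The dimension check is there because vectors from different models (for example 1024 vs. 4096 dimensions) cannot be compared directly.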

Need a vector DB to go with it?

We integrate cleanly with Pinecone, Weaviate, Qdrant, and Turbopuffer. One-line index update when you swap embedding models.

Talk to us about your stack