
Streaming optimization

SSE chunking, partial-token handling, parsing structured output as it streams, backpressure, and how to bridge to WebSockets if you need duplex streams.

Why streaming, exactly?

Two reasons:

Perceived latency. Users see tokens as they are generated. The first tokens of a 2,000-token answer arrive in ~200 ms; without streaming, nothing appears until the full answer is ready at ~8 s.
Cancellation. Users can abort the request once they have what they need, so you stop paying for tokens that would never be read.

Basic streaming

const stream = await client.chat.completions.create({
  model: "deepseek/deepseek-v3.2-exp",
  messages: [{ role: "user", content: "Write a haiku about FP8." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}

Chunk size & flush cadence

We flush an SSE chunk every ~20 ms or every 8 tokens, whichever comes first. This balances:

Network efficiency: too many small chunks = TCP/TLS overhead dominates.
Perceived smoothness: too few chunks = users see jerky bursts.

Override per request (rarely needed):

const stream = await client.chat.completions.create({
  model: "...",
  messages: [...],
  stream: true,
  stream_options: {
    include_usage: true,        // get token counts in final chunk
    chunk_size: 4,              // flush every 4 tokens (smoother UI)
  },
});
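The "every ~20 ms or every 8 tokens, whichever comes first" rule can be sketched as a pure function over token arrival times. This is only an illustration of the policy, not the gateway's implementation: the name planFlushes, its parameters, and the choice to start the timer at the first buffered token are all assumptions.

```typescript
// Hypothetical sketch of a "flush every N tokens or T ms" policy.
// Input: arrival time (in ms) of each token.
// Output: how many tokens each flushed chunk carries, in order.
function planFlushes(
  arrivalsMs: number[],
  maxTokens = 8,
  maxWaitMs = 20,
): number[] {
  const flushes: number[] = [];
  let pending = 0;      // tokens buffered but not yet flushed
  let windowStart = 0;  // arrival time of the first buffered token

  for (const t of arrivalsMs) {
    // The time limit expired before this token arrived: flush what we have.
    if (pending > 0 && t - windowStart >= maxWaitMs) {
      flushes.push(pending);
      pending = 0;
    }
    if (pending === 0) windowStart = t;
    pending += 1;
    // The token limit was reached: flush immediately.
    if (pending === maxTokens) {
      flushes.push(pending);
      pending = 0;
    }
  }
  if (pending > 0) flushes.push(pending); // final flush at end of stream
  return flushes;
}
```

With fast generation the token limit dominates (ten tokens arriving 1 ms apart produce chunks of 8 and 2); with slow generation the time limit dominates and each token tends to ship on its own.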

Partial-token handling (Unicode safety)

A single Unicode character (Chinese, emoji, math symbols) often spans multiple BPE tokens. Naïve streaming displays partial bytes as replacement characters (�).

Our SSE encoder buffers token boundaries that span an incomplete UTF-8 sequence and flushes them together. You can write the delta.content directly to a UTF-8 sink and never see broken characters.
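If you read raw bytes yourself (for example from a fetch() body) rather than delta.content, the same buffering idea applies on the client. The sketch below uses the standard TextDecoder in streaming mode; decodeChunks is a hypothetical helper name, and the byte split is contrived to land in the middle of an emoji.

```typescript
// Decode byte chunks whose boundaries may fall inside a multi-byte UTF-8 character.
function decodeChunks(chunks: Uint8Array[]): string[] {
  const decoder = new TextDecoder("utf-8");
  const out: string[] = [];
  for (const chunk of chunks) {
    // stream: true keeps an incomplete trailing sequence inside the decoder
    // instead of emitting a replacement character for it.
    out.push(decoder.decode(chunk, { stream: true }));
  }
  out.push(decoder.decode()); // final flush once the stream has ended
  return out;
}

// "FP8 " is 4 bytes and the rocket emoji is 4 bytes, so cutting at byte 6
// splits the emoji in half.
const bytes = new TextEncoder().encode("FP8 🚀");
const pieces = decodeChunks([bytes.slice(0, 6), bytes.slice(6)]);
console.log(pieces.join("")); // the emoji arrives whole, in the second piece
```

Decoding each chunk with a fresh TextDecoder (or without stream: true) is what produces the replacement characters described above.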

Streaming structured output

When using JSON Schema or grammar-constrained output, you can parse incrementally. Constrained decoding guarantees that the text emitted so far is always a valid prefix of a schema-conforming document (not that it is complete JSON at every token), so a streaming JSON parser like partial-json can render fields as they complete:

import { parse as parsePartial } from "partial-json";

const stream = await client.chat.completions.create({
  model: "deepseek/deepseek-v3.2-exp",
  messages: [...],
  response_format: { type: "json_schema", json_schema: { ... } },
  stream: true,
});

let buffer = "";
for await (const chunk of stream) {
  buffer += chunk.choices[0]?.delta?.content ?? "";
  if (!buffer) continue;  // nothing to parse yet (some chunks carry no content)
  const partial = parsePartial(buffer);
  // partial is a plain object that fills in as fields stream;
  // keys that have not arrived yet are simply absent:
  //   { name: "Alice", ... }            → after 80 tokens
  //   { name: "Alice", age: 32, ... }   → after 90 tokens
  renderUI(partial);
}
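To make the idea of parsing a prefix concrete, here is a deliberately small sketch that closes whatever strings, arrays, and objects are still open and then hands the result to JSON.parse. It is not how partial-json works internally and it handles fewer cases; parsePrefix is a hypothetical name used only for illustration.

```typescript
// Hypothetical helper: try to make a JSON prefix parseable by closing any open
// string, array, or object. Returns undefined when closing is not enough
// (for example, when the prefix ends in the middle of a key).
function parsePrefix(prefix: string): unknown {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;

  for (let i = 0; i < prefix.length; i++) {
    const ch = prefix[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }

  let candidate = prefix;
  if (inString) {
    if (escaped) candidate = candidate.slice(0, -1); // drop a dangling backslash
    candidate += '"';
  }
  candidate += closers.reverse().join("");

  try {
    return JSON.parse(candidate);
  } catch {
    return undefined; // caller keeps the last successfully parsed value
  }
}
```

Note that a value can still be incomplete when it parses: the prefix {"age": 3 yields age 3 even if the next token turns it into 32, so treat a field as final only once the parser has moved past it.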

Backpressure & cancellation

The OpenAI SDK closes the stream when the consumer drops out of the for-await loop. Under the hood that triggers an HTTP/2 RST_STREAM frame. Our gateway sees it within ~5 ms and aborts the upstream generation. You only pay for tokens generated up to the cancellation.

For React Server Components or other cases where you can't break out of an async iterator, use AbortController:

const ctrl = new AbortController();
const stream = await client.chat.completions.create(
  { model: "...", messages: [...], stream: true },
  { signal: ctrl.signal },
);

setTimeout(() => ctrl.abort(), 5000);  // hard timeout
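A bare setTimeout like the one above keeps its timer pending after the stream finishes, and it does not react to the client going away. One way to handle both is a small helper that combines a timeout with an optional parent signal. This is a sketch under assumptions: createDeadline and its return shape are hypothetical names, not part of the SDK.

```typescript
// Hypothetical helper: one AbortSignal that fires on timeout, on manual
// cancel, or when a parent signal (e.g. the incoming request's) aborts.
function createDeadline(timeoutMs: number, parent?: AbortSignal) {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), timeoutMs);

  const cancel = (): void => {
    clearTimeout(timer); // do not leave the timer pending
    ctrl.abort();
  };

  if (parent) {
    if (parent.aborted) cancel();
    else parent.addEventListener("abort", cancel, { once: true });
  }
  return { signal: ctrl.signal, cancel };
}

// Usage: a client disconnect propagates to the deadline signal.
const clientGone = new AbortController();
const deadline = createDeadline(5000, clientGone.signal);
console.log(deadline.signal.aborted); // false
clientGone.abort();
console.log(deadline.signal.aborted); // true
```

Pass deadline.signal as the signal option exactly as in the example above, and call deadline.cancel() in a finally block once you are done with the stream; aborting a request that has already completed should be harmless.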

WebSocket bridge for duplex

SSE is one-way: server → client. If you need to interrupt a long generation with new context (e.g. a voice assistant where the user starts speaking again), use our WebSocket gateway at wss://api.luminet.ai/v1/chat/ws. Same auth, but you can push new messages mid-stream and the model will gracefully cut over.

See the API reference for full WebSocket protocol details.

Edge & SSE caveats

Cloudflare Workers: SSE works fine but the Workers runtime caps response duration at 30 s on free / 60 s on paid. For long generations, run the LLM call from a longer-lived edge function (e.g. Vercel Functions, Fly).
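Some corporate proxies and reverse proxies buffer SSE responses, which defeats streaming. If nginx sits in front of your app, send the response header X-Accel-Buffering: no; for proxies you don't control, use the WebSocket bridge.
Browser EventSource doesn't support custom headers. Use fetch() with a ReadableStream reader instead, or pass auth via query string (query strings tend to end up in access logs and browser history, so prefer short-lived tokens there).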
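If you take the fetch() route, you have to split the byte stream into SSE events yourself. The sketch below is a minimal incremental parser for that job: feed it decoded text and it returns the data payload of every event completed so far. SseParser is a hypothetical name; the parser ignores event:, id:, and retry: fields and bare-CR line endings, and the "[DONE]" sentinel in the comments assumes the OpenAI-style convention rather than anything stated on this page.

```typescript
// Minimal incremental SSE parser (illustrative, not a full spec implementation).
class SseParser {
  private buffer = "";

  // Accepts the next piece of decoded text; returns completed `data:` payloads.
  feed(text: string): string[] {
    // Normalize CRLF after concatenating, so a CR/LF pair split across two
    // chunks is still handled.
    this.buffer = (this.buffer + text).replace(/\r\n/g, "\n");
    const events: string[] = [];
    let end: number;
    // A blank line terminates an event.
    while ((end = this.buffer.indexOf("\n\n")) !== -1) {
      const block = this.buffer.slice(0, end);
      this.buffer = this.buffer.slice(end + 2);
      const data = block
        .split("\n")
        .filter((line) => line.indexOf("data:") === 0) // skips comments like ": ping"
        .map((line) => line.slice(5).replace(/^ /, ""))
        .join("\n");
      if (data !== "") events.push(data);
    }
    return events;
  }
}

// Usage: chunk boundaries do not need to line up with event boundaries.
const parser = new SseParser();
const first = parser.feed('data: {"a"');                   // event not complete yet
const second = parser.feed(':1}\n\ndata: [DONE]\n\n');     // two events complete
```

In a browser you would obtain the text by reading response.body with getReader() and decoding each chunk with a TextDecoder in streaming mode, then JSON.parse each payload until you see the end-of-stream sentinel.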

TL;DR

Always stream when serving humans. Parse structured output incrementally for better UX. Use AbortController to enforce hard timeouts. If you need to interrupt mid-stream (voice, agents), switch to the WebSocket gateway.
