
Structured output

Constrain model output to valid JSON, a regular expression, or any context-free grammar, enforced at the token-sampling level. No retries, no prompt engineering, no parsing failures.

How it works

During sampling, instead of choosing from all 100K+ tokens in the vocabulary, we mask out tokens that would violate your schema. The model is forced to pick from valid tokens only. The generated text is guaranteed to parse — by construction, not by hope.
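
The masking step can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the six-token vocabulary, the `isAllowed` predicate, and the greedy pick are hypothetical stand-ins, not the production sampler.

```typescript
// Toy illustration of constrained sampling: disallowed tokens get a logit of
// -Infinity, so they can never be selected. All names here are hypothetical.
type Allowed = (prefix: string, token: string) => boolean;

function maskLogits(
  logits: number[],
  vocab: string[],
  prefix: string,
  isAllowed: Allowed,
): number[] {
  return logits.map((logit, i) => (isAllowed(prefix, vocab[i]) ? logit : -Infinity));
}

// Greedy pick over the masked logits (a real sampler would apply softmax and sample).
function argmax(values: number[]): number {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

// Toy constraint: the output must consist of digits only.
const vocab = ["hello", "7", "{", "42", " ", "x"];
const digitsOnly: Allowed = (_prefix, token) => /^[0-9]+$/.test(token);

// The unconstrained model prefers "hello" (highest logit), but it is masked out.
const logits = [5.0, 1.2, 0.3, 2.5, 0.1, 4.0];
const masked = maskLogits(logits, vocab, "", digitsOnly);
const chosen = vocab[argmax(masked)];
console.log(chosen); // "42"
```

The point of the sketch is that the constraint acts before a token is chosen, so an invalid token is never emitted and there is nothing to repair afterwards.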

Three modes are supported, in order of increasing control over the output shape:

  • JSON mode: guarantees output is parseable JSON of any shape.
  • JSON Schema: guarantees output matches a specific JSON Schema (Draft 2020-12).
  • Grammar mode: guarantees output matches an arbitrary EBNF-style grammar (regex, custom languages, code-completion).

JSON Schema example

extract.ts
// Assumes `client` is an already-configured SDK client (setup not shown).
const resp = await client.chat.completions.create({
  model: "deepseek/deepseek-v3.2-exp",
  messages: [
    {
      role: "user",
      content: "Extract: 'Alice is 32 and works as a fintech CTO in NYC.'",
    },
  ],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "person",
      strict: true,
      schema: {
        type: "object",
        properties: {
          name: { type: "string" },
          age: { type: "integer", minimum: 0 },
          role: { type: "string" },
          industry: { type: "string", enum: ["fintech", "healthtech", "edtech", "other"] },
          location: { type: "string" },
        },
        required: ["name", "age", "role", "industry", "location"],
      },
    },
  },
});

// resp.choices[0].message.content is GUARANTEED to parse and match the schema
const person = JSON.parse(resp.choices[0].message.content);
//   { name: "Alice", age: 32, role: "CTO", industry: "fintech", location: "NYC" }
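
One practical wrinkle: token-level constraints keep every emitted token valid, but a response that stops at the token limit can still end mid-object. The sketch below adds a typed, guarded parse for that case. It assumes the response reports truncation as `finish_reason: "length"`, as OpenAI-compatible APIs do; the `Person`, `Choice`, and `parsePerson` names are hypothetical.

```typescript
// TypeScript type mirroring the "person" schema above.
interface Person {
  name: string;
  age: number;
  role: string;
  industry: "fintech" | "healthtech" | "edtech" | "other";
  location: string;
}

// Minimal shape of one choice; assumes an OpenAI-compatible response.
interface Choice {
  finish_reason: string | null;
  message: { content: string | null };
}

// Hypothetical helper: refuse to parse output that was cut off or is empty.
function parsePerson(choice: Choice): Person {
  if (choice.finish_reason === "length") {
    throw new Error("Output hit the token limit and may be incomplete JSON");
  }
  if (choice.message.content === null) {
    throw new Error("Response contained no content");
  }
  return JSON.parse(choice.message.content) as Person;
}

// Stand-in for resp.choices[0], so the sketch runs without a network call.
const example: Choice = {
  finish_reason: "stop",
  message: {
    content: '{"name":"Alice","age":32,"role":"CTO","industry":"fintech","location":"NYC"}',
  },
};
const parsed = parsePerson(example);
console.log(parsed.name, parsed.age); // Alice 32
```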

Grammar mode (advanced)

For shapes that JSON Schema can't express, such as a SQL subset, a domain-specific language, or a bare IPv4 address, use grammar mode with an EBNF grammar.

ipv4.ts
const grammar = `
  root ::= ipv4
  ipv4 ::= octet "." octet "." octet "." octet
  octet ::= "25" [0-5] | "2" [0-4] [0-9] | "1" [0-9] [0-9] | [1-9]? [0-9]
`;

const resp = await client.chat.completions.create({
  model: "meta/llama-4-scout",
  messages: [
    { role: "user", content: "Give me an IP address." },
  ],
  extra_body: { grammar },
});

// Output is guaranteed valid IPv4, e.g. "192.168.1.42"
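
To see exactly which strings the grammar accepts, its rules can be mirrored as a regular expression and checked locally. This is a sanity-check sketch only; the `isIPv4` helper is hypothetical and not part of any API.

```typescript
// Mirrors the octet rule: 250-255 | 200-249 | 100-199 | 0-99 (no leading zeros).
const octet = "(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])";

// Mirrors the ipv4 rule: four octets separated by literal dots.
const ipv4Pattern = new RegExp(`^${octet}(?:\\.${octet}){3}$`);

function isIPv4(text: string): boolean {
  return ipv4Pattern.test(text);
}

console.log(isIPv4("192.168.1.42")); // true
console.log(isIPv4("256.0.0.1"));    // false: 256 is out of range
console.log(isIPv4("01.2.3.4"));     // false: leading zero
console.log(isIPv4("1.2.3"));        // false: only three octets
```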

Performance impact

Naive constrained decoding adds 20-40% overhead per token, because the mask must be recomputed against the whole vocabulary at every step. Luminet's implementation instead compiles the schema into a finite-state automaton once, at request time, and then applies the mask in O(1) per token. Net overhead: < 4%.
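
The idea of paying the vocabulary scan once, at compile time, can be sketched as follows. This is a simplified illustration under stated assumptions, not Luminet's actual implementation: the pattern is a small regular one (digits, a dot, digits), the vocabulary has six tokens, and `step`, `stepToken`, and `compileMasks` are hypothetical names.

```typescript
// Sketch: precompute, for every automaton state, which vocabulary tokens are
// legal. Decoding then does a table lookup instead of scanning the vocabulary.
type State = number;
const DEAD: State = -1;

// Character-level DFA for the pattern [0-9]+ "." [0-9]+ (for example "3.14").
// States: 0 = start, 1 = integer digits, 2 = just saw ".", 3 = fraction digits.
function step(state: State, ch: string): State {
  const isDigit = ch >= "0" && ch <= "9";
  if (state === 0) return isDigit ? 1 : DEAD;
  if (state === 1) return isDigit ? 1 : ch === "." ? 2 : DEAD;
  if (state === 2) return isDigit ? 3 : DEAD;
  if (state === 3) return isDigit ? 3 : DEAD;
  return DEAD;
}

// Run a whole (possibly multi-character) token through the DFA.
function stepToken(state: State, token: string): State {
  let s = state;
  for (let i = 0; i < token.length; i++) {
    s = step(s, token[i]);
    if (s === DEAD) return DEAD;
  }
  return s;
}

// One-time compile: O(states x vocabulary). Done once per schema, not per token.
function compileMasks(vocab: string[], numStates: number): number[][] {
  const masks: number[][] = [];
  for (let s = 0; s < numStates; s++) {
    const allowed: number[] = [];
    for (let id = 0; id < vocab.length; id++) {
      if (stepToken(s, vocab[id]) !== DEAD) allowed.push(id);
    }
    masks.push(allowed);
  }
  return masks;
}

const toyVocab = ["1", "23", ".", ".5", "abc", "4.2"];
const masks = compileMasks(toyVocab, 4);

// Per-token work at decode time: a single array lookup by current state.
const allowedAtStart = masks[0].map((id) => toyVocab[id]);
console.log(allowedAtStart); // ["1", "23", "4.2"]

const allowedAfterDot = masks[2].map((id) => toyVocab[id]);
console.log(allowedAfterDot); // ["1", "23"]
```

The compile cost grows with both the number of states and the vocabulary size, which is consistent with very large schemas taking noticeably longer to compile.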

When NOT to use

  • Free-form chat: there is no structure to enforce.
  • Very large schemas (≥ 50 KB): compile time becomes noticeable. Either simplify the schema or pre-compile it via our Grammars API.
  • When you need an explanation: JSON-only output gives you the answer but not the reasoning. Add a reasoning field to your schema if you need both.
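
A schema that carries both the reasoning and the answer might look like the sketch below. The field names and the enum values are illustrative, and listing `reasoning` first only helps if fields are emitted in schema order, which is an assumption here rather than documented behavior.

```typescript
// Sketch of a response_format that asks for reasoning alongside the answer.
// Field names ("reasoning", "answer") are illustrative, not required by the API.
const responseFormat = {
  type: "json_schema",
  json_schema: {
    name: "answer_with_reasoning",
    strict: true,
    schema: {
      type: "object",
      properties: {
        // Listed first so that, if fields are emitted in schema order,
        // the model writes its reasoning before committing to an answer.
        reasoning: { type: "string" },
        answer: { type: "string", enum: ["yes", "no", "unknown"] },
      },
      required: ["reasoning", "answer"],
    },
  },
} as const;

console.log(Object.keys(responseFormat.json_schema.schema.properties));
```

The object is passed as `response_format` in the same call shape as the JSON Schema example above.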

TL;DR

Use response_format: json_schema to guarantee parseable output. Overhead is < 4%. Works on every hosted model and most routed models (where the upstream provider supports it).