Function calling & tool use

This page covers tool definitions, parallel function calls, tool-choice forcing, and multi-step agent loops. The API is OpenAI-compatible and works against every hosted model and most routed ones.

The simplest possible example

const tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get current weather for a city",
      parameters: {
        type: "object",
        properties: {
          city: { type: "string" },
          unit: { type: "string", enum: ["c", "f"] },
        },
        required: ["city"],
      },
    },
  },
];

const resp = await client.chat.completions.create({
  model: "deepseek/deepseek-v3.2-exp",
  messages: [{ role: "user", content: "What's the weather in Tokyo?" }],
  tools,
});

// resp.choices[0].message.tool_calls is populated:
// [{ id: "call_x", function: { name: "get_weather",
//                              arguments: '{"city":"Tokyo","unit":"c"}' } }]
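
Note that `arguments` arrives as a JSON string rather than an object, so it has to be parsed before the tool can run. The following is a minimal sketch of that step, not part of the API: the `ToolCall` and `WeatherArgs` types and the `parseWeatherArgs` helper are hypothetical names introduced here, written to mirror the shape shown above.

```typescript
// Local type mirroring the tool-call shape shown above (an assumption, not an SDK type).
type ToolCall = {
  id: string;
  function: { name: string; arguments: string };
};

type WeatherArgs = { city: string; unit?: "c" | "f" };

// Hypothetical helper: parse the JSON argument string for get_weather and
// check it against the fields declared in the tool definition.
function parseWeatherArgs(call: ToolCall): WeatherArgs {
  if (call.function.name !== "get_weather") {
    throw new Error(`Unexpected tool: ${call.function.name}`);
  }
  const parsed: unknown = JSON.parse(call.function.arguments);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Arguments must be a JSON object");
  }
  const fields = parsed as Record<string, unknown>;

  const city = fields["city"];
  if (typeof city !== "string") {
    throw new Error("Missing required field: city");
  }

  const unitRaw = fields["unit"];
  if (unitRaw === undefined) {
    return { city };
  }
  if (typeof unitRaw === "string" && (unitRaw === "c" || unitRaw === "f")) {
    return { city, unit: unitRaw };
  }
  throw new Error('unit must be "c" or "f"');
}
```

Checking the parsed value on the client is optional, but it keeps a malformed call from reaching the code that actually performs the lookup.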

Parallel function calling

The newer hosted models (Llama 4 Maverick, DeepSeek V3.2, Qwen3-Max, Kimi K2) can emit multiple tool calls in a single response when a request breaks down into independent subtasks. Each call has a unique id; execute the calls in parallel and return one result per id.

// User: "Compare weather in Tokyo and Singapore"
// Model emits two tool calls in one shot:
[
  { id: "call_1", function: { name: "get_weather", arguments: '{"city":"Tokyo"}' } },
  { id: "call_2", function: { name: "get_weather", arguments: '{"city":"Singapore"}' } },
]

// Execute both in parallel, then send results back. `messages` must already
// include the assistant message that contained these tool calls:
messages.push(
  { role: "tool", tool_call_id: "call_1", content: "23°C, sunny" },
  { role: "tool", tool_call_id: "call_2", content: "31°C, humid" },
);

const finalResp = await client.chat.completions.create({
  model: "deepseek/deepseek-v3.2-exp",
  messages, tools,
});
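
One way to run the calls concurrently while keeping each result tied to its `id` is sketched below. It rests on assumptions: `runToolCalls` and the `runTool` callback are hypothetical names, and turning a failed call into an error payload (rather than aborting the whole batch) is a design choice of this sketch, not documented platform behavior.

```typescript
// Local types mirroring the shapes shown above (assumptions, not SDK types).
type ToolCall = { id: string; function: { name: string; arguments: string } };
type ToolMessage = { role: "tool"; tool_call_id: string; content: string };

// Hypothetical helper: run every tool call concurrently and return one
// tool message per call, in the same order as the calls were emitted.
async function runToolCalls(
  calls: ToolCall[],
  runTool: (name: string, args: unknown) => Promise<unknown>,
): Promise<ToolMessage[]> {
  return Promise.all(
    calls.map(async (call): Promise<ToolMessage> => {
      let content: string;
      try {
        const args: unknown = JSON.parse(call.function.arguments);
        const result = await runTool(call.function.name, args);
        // JSON.stringify(undefined) is not a string, so fall back to null.
        content = JSON.stringify(result ?? null);
      } catch (err) {
        // Report the failure for this call only; the other calls still complete.
        content = JSON.stringify({ error: String(err) });
      }
      return { role: "tool", tool_call_id: call.id, content };
    }),
  );
}
```

Because each call catches its own errors, every `tool_call_id` gets a matching tool message even when one tool fails, and the resulting array can be spread into `messages.push(...)`.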

Tool-choice forcing

Three modes:

tool_choice: "auto" (default) — model decides whether to call a tool or reply directly.
tool_choice: "required" — model must call at least one tool.
tool_choice: { type: "function", function: { name: "get_weather" } } — model must call this specific tool.

Forced tool calls are implemented at the constrained-decoding layer (see structured output), so the emitted tool call is guaranteed to be valid by construction.
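
As an illustration of the third mode, the sketch below builds a request body that forces `get_weather`. The `forceTool` helper, the local `ToolDef` and `ToolChoice` types, and the client-side check that the forced name exists are all assumptions made for this example; only the `tool_choice` shapes come from the list above.

```typescript
// Local types for the example (assumptions, not SDK types).
type ToolDef = {
  type: "function";
  function: { name: string; description?: string; parameters: Record<string, unknown> };
};

type ToolChoice =
  | "auto"
  | "required"
  | { type: "function"; function: { name: string } };

// Hypothetical helper: build a tool_choice that forces `name`,
// failing fast if no tool with that name was defined.
function forceTool(tools: ToolDef[], name: string): ToolChoice {
  if (!tools.some((t) => t.function.name === name)) {
    throw new Error(`Cannot force unknown tool: ${name}`);
  }
  return { type: "function", function: { name } };
}

const weatherTools: ToolDef[] = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get current weather for a city",
      parameters: {
        type: "object",
        properties: {
          city: { type: "string" },
          unit: { type: "string", enum: ["c", "f"] },
        },
        required: ["city"],
      },
    },
  },
];

// Request body for a call that must invoke get_weather.
const forcedRequest = {
  model: "deepseek/deepseek-v3.2-exp",
  messages: [{ role: "user" as const, content: "Tokyo, in Fahrenheit" }],
  tools: weatherTools,
  tool_choice: forceTool(weatherTools, "get_weather"),
};
```

Passing `forcedRequest` to `client.chat.completions.create` should then yield a `get_weather` call instead of a free-text reply, which is a common way to use a tool as a structured-extraction step.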

Tool-call accuracy benchmarks

We track tool-call correctness on BFCL v3 (Berkeley Function Calling Leaderboard) for every hosted model:

Model	BFCL v3 (overall)	Parallel calls	Hallucination rate
Kimi K2 Instruct	78.4	82.1	3.2%
Llama 4 Maverick	76.9	80.5	4.1%
DeepSeek V3.2	76.2	78.8	3.8%
Qwen3-Max	75.5	79.4	4.5%
GLM-4.6	73.8	76.2	5.1%
Magistral Medium	70.4	72.1	6.3%

Multi-step agent loops

Most agent frameworks (LangChain, LlamaIndex, Mastra, or custom code) loop on the same chat-completion call until the model stops requesting tools:

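import OpenAI from "openai";  // assumes `client` is an OpenAI SDK instance pointed at this API

const MAX_STEPS = 10;  // illustrative cap; pick a limit that suits your agent

async function runAgent(userMessage: string) {
  // `client`, `tools`, and `executeTool` are assumed to be defined elsewhere.
  const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
    { role: "user", content: userMessage },
  ];
  for (let step = 0; step < MAX_STEPS; step++) {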
    const resp = await client.chat.completions.create({
      model: "kimi/k2-instruct",       // best for agent loops
      messages, tools,
      tool_choice: "auto",
    });
    const msg = resp.choices[0].message;
    messages.push(msg);
    if (!msg.tool_calls?.length) return msg.content;  // done
    const results = await Promise.all(
      msg.tool_calls.map(call => executeTool(call.function.name, call.function.arguments))
    );
    for (const [i, result] of results.entries()) {
      messages.push({
        role: "tool",
        tool_call_id: msg.tool_calls[i].id,
        content: JSON.stringify(result),
      });
    }
  }
  throw new Error("Max steps reached");
}
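
The loop leaves `executeTool` undefined. One possible shape is sketched below, assuming an in-process registry keyed by tool name; the `toolHandlers` map and the stubbed `get_weather` handler (which returns canned data) are hypothetical and only illustrate the dispatch step.

```typescript
type ToolHandler = (args: Record<string, unknown>) => unknown;

// Hypothetical registry; the get_weather handler returns canned data for illustration.
const toolHandlers = new Map<string, ToolHandler>([
  [
    "get_weather",
    (args) => ({
      city: args["city"],
      unit: args["unit"] ?? "c",
      conditions: "sunny",
    }),
  ],
]);

// Same call shape as in the loop: tool name plus the raw JSON argument string.
async function executeTool(name: string, rawArgs: string): Promise<unknown> {
  const handler = toolHandlers.get(name);
  if (handler === undefined) {
    // Report the problem to the model instead of throwing, so the loop can continue.
    return { error: `Unknown tool: ${name}` };
  }
  const args = JSON.parse(rawArgs) as Record<string, unknown>;
  // Handlers may be sync or async; returning from an async function awaits either.
  return handler(args);
}
```

Using a `Map` rather than a plain object avoids accidentally resolving names such as `constructor` to inherited properties when the model asks for a tool that was never registered.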

What we handle automatically

Tool-call retries: if the model emits invalid JSON arguments, we re-sample with constrained decoding and don't bill you for the failed attempt.
Schema validation: tool definitions are validated up front against the JSON Schema specification (Draft 2020-12). Malformed schemas return HTTP 400 immediately.
Prefix caching: tool definitions are part of the cached system prefix (see KV cache), so you don't pay for recomputation on every step of an agent loop.

TL;DR

Pass tools as you would to OpenAI; parallel calls work on every newer hosted model. Use Kimi K2 or Llama 4 Maverick for agent loops; they are the two highest-scoring models on BFCL v3 (78.4 and 76.9 overall). We handle retries, validation, and prefix caching automatically.
