Changelog

Recent updates

Every shipped change. Subscribe via RSS for new releases.

v1.21

2026-05-11

model: Qwen3-Next 80B A3B added (3B-active MoE, 640 tok/s peak)
feat: Long-context inference doc shipped with chunked-prefill tuning guide
perf: Expert prefetching for Qwen3-Next reduces cold-start TTFT by 28%
feat: WebSocket gateway at /v1/chat/ws now generally available

v1.20

2026-05-08

perf: FireAttention v3 ships with 2.4× throughput and 38% lower TTFT vs vLLM 0.6
feat: Tree-based speculative decoding enabled by default (batch ≤ 4)
feat: Speculative KV eviction in continuous batching scheduler
perf: FP8 fused prefill (single kernel for quantize + matmul + softmax)

v1.19

2026-04-22

model: DeepSeek R1.5 added (verifier-aligned reasoning model)
model: Llama 4 Behemoth (288B active / 2T total) added to hosted catalog
feat: Multi-LoRA serving GA, up to 64 mixed adapters per batch
fix: Resolved P99 latency spike for Llama 4 Maverick in eu-west

v1.18

2026-04-08

feat: Structured output with JSON Schema (Draft 2020-12) and EBNF grammar modes
feat: Per-key spend alerts via webhook
perf: Prefix caching now per-organisation (was per-deployment)
fix: Tool calls no longer drop the first chunk on Gemini 2.5 Pro routing
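
For readers new to the JSON Schema mode, the sketch below shows what a Draft 2020-12 schema looks like and how a client might sanity-check a model response against it. The schema contents (`city`, `temperature_c`) and the `conforms` helper are hypothetical and only illustrate the idea. How a schema is attached to a request is not described in this changelog, so it is not shown here.

```python
import json

# Hypothetical schema for illustration; only the "$schema" URI is standard.
schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temperature_c": {"type": "number"},
    },
    "required": ["city", "temperature_c"],
    "additionalProperties": False,
}

# Maps the JSON Schema type names used above to Python types.
_PY_TYPES = {"string": str, "number": (int, float)}


def conforms(raw: str) -> bool:
    """Check a raw response against the schema above.

    Covers only the keywords used in this example (type, properties,
    required, additionalProperties); it is not a full validator.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    if set(schema["required"]) - obj.keys():
        return False  # a required property is missing
    if obj.keys() - schema["properties"].keys():
        return False  # additionalProperties is False
    for key, spec in schema["properties"].items():
        value = obj[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _PY_TYPES[spec["type"]]):
            return False
    return True


if __name__ == "__main__":
    print(conforms('{"city": "Oslo", "temperature_c": 4.5}'))
    print(conforms('{"city": "Oslo"}'))
```

In production a complete validator library would be a better fit than this hand-rolled check, which exists only to keep the example dependency-free.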

v1.17

2026-03-25

model: Claude Opus 4.7 with 1M context routed via Anthropic
model: Magistral Medium added (Mistral's reasoning model with multilingual CoT)
feat: Audit log export to S3-compatible buckets
perf: INT4 KV cache quantization opt-in for ultra-long-context workloads

v1.16

2026-03-08

model: DeepSeek V3.2 Exp added (685B sparse, 410 tok/s on FireAttention)
feat: BYOK for Together, Fireworks, Replicate (no routing fee)
feat: Region pinning (us-east, eu-west, apac) for compliance
fix: Cancelled streams no longer keep the upstream connection open

v1.15

2026-02-18

model: Claude Sonnet 4.6 routed via Anthropic
feat: Bring Your Own Model (BYOM) GA, with HF checkpoint upload via CLI
feat: Quarterly Business Review program for Enterprise customers

v1.14

2026-01-29

model: GPT-5 / GPT-5 Mini / GPT-5 Nano routed via OpenAI
feat: TypeScript SDK 0.6 with native JSON streaming
perf: Continuous batching scheduler rewrite (30% higher steady-state throughput)

v1.13

2026-01-12

feat: Fine-tuning GA for LoRA, full fine-tune, DPO, and continued pretraining
feat: $50 free fine-tuning credits for every new account
model: Llama 4 Scout (10M context) with INT4 KV support

v1.12

2025-12-15

model: Llama 3.3 70B hosted with FireAttention v2
model: Gemini 2.5 Pro routing live
perf: PagedAttention block size lowered to 16 (from 32), reducing KV-cache fragmentation by 8%
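
To illustrate why a smaller PagedAttention block size tends to reduce fragmentation, the sketch below counts the KV slots left empty in each sequence's final, partially filled block. The sequence lengths are made up and the helper names are hypothetical. This is a back-of-the-envelope model of internal fragmentation only, not the scheduler's actual accounting, and it does not reproduce the 8% figure above.

```python
def wasted_slots(seq_len: int, block_size: int) -> int:
    """KV slots allocated but unused in the last block of one sequence."""
    return (-seq_len) % block_size


def fragmentation(seq_lens: list[int], block_size: int) -> float:
    """Fraction of all allocated KV slots that hold no token."""
    wasted = sum(wasted_slots(s, block_size) for s in seq_lens)
    allocated = sum(seq_lens) + wasted
    return wasted / allocated


if __name__ == "__main__":
    # Hypothetical mix of sequence lengths, in tokens.
    seq_lens = [100, 250, 33, 1000, 17]
    for block_size in (32, 16):
        share = fragmentation(seq_lens, block_size)
        print(f"block_size={block_size}: {share:.2%} of slots unused")
```

Each sequence wastes at most `block_size - 1` slots, so halving the block size lowers the worst-case waste per sequence, at the cost of more blocks to track per sequence.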

v1.11

2025-11-20

model: GLM-4.6 hosted (200K context)
model: Grok 4 routing via xAI
feat: Dashboard analytics for cache hit rate and queue depth

v1.10

2025-10-30

feat: Public benchmarks page with reproducible scripts
perf: FireAttention v2 (1.7× throughput vs v1)
model: Yi-Lightning hosted, Cohere Command A routed

Older releases archived at /changelog/archive.