Changelog
Recent updates
Every shipped change. Subscribe via RSS for new releases.
v1.21
2026-05-11
- model: Qwen3-Next 80B A3B added (3B-active MoE, 640 tok/s peak)
- feat: Long-context inference doc shipped with chunked-prefill tuning guide
- perf: Expert prefetching for Qwen3-Next reduces cold-start TTFT by 28%
- feat: WebSocket gateway at /v1/chat/ws now generally available
v1.20
2026-05-08
- perf: FireAttention v3 ships (2.4× throughput, -38% TTFT vs vLLM 0.6)
- feat: Tree-based speculative decoding enabled by default (batch ≤ 4)
- feat: Speculative KV eviction in continuous batching scheduler
- perf: FP8 fused prefill (single kernel for quantize + matmul + softmax)
v1.19
2026-04-22
- model: DeepSeek R1.5 added, a verifier-aligned reasoning model
- model: Llama 4 Behemoth (288B active / 2T total) added to hosted catalog
- feat: Multi-LoRA serving GA: up to 64 mixed adapters per batch
- fix: Resolved P99 latency spike for Llama 4 Maverick in eu-west
v1.18
2026-04-08
- feat: Structured output: JSON Schema (Draft 2020-12) and EBNF grammar modes
- feat: Per-key spend alerts via webhook
- perf: Prefix caching now per-organisation (was per-deployment)
- fix: Tool calls no longer drop the first chunk on Gemini 2.5 Pro routing
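
For readers new to the JSON Schema mode, here is a minimal sketch of a Draft 2020-12 schema built in Python. The field names (`city`, `temperature_c`) are hypothetical, and this changelog does not say how a schema is attached to a request, so the example stops at producing the schema document itself.

```python
import json

# Hypothetical schema for illustration; field names are not from the changelog.
schema = {
    # Dialect identifier for JSON Schema Draft 2020-12.
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temperature_c": {"type": "number"},
    },
    "required": ["city", "temperature_c"],
    # Reject any keys not listed under "properties".
    "additionalProperties": False,
}

# Serialise to the JSON text form of the schema.
payload = json.dumps(schema, indent=2)
print(payload)
```

Declaring the dialect with `$schema` makes it explicit which draft's keywords are meant.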
v1.17
2026-03-25
- model: Claude Opus 4.7 with 1M context routed via Anthropic
- model: Magistral Medium added, Mistral's reasoning model with multilingual CoT
- feat: Audit log export to S3-compatible buckets
- perf: INT4 KV cache quantization opt-in for ultra-long-context workloads
v1.16
2026-03-08
- model: DeepSeek V3.2 Exp added (685B sparse, 410 tok/s on FireAttention)
- feat: BYOK for Together, Fireworks, Replicate (no routing fee)
- feat: Region pinning (us-east, eu-west, apac) for compliance
- fix: Fixed an edge case where cancelled streams kept the upstream connection open
v1.15
2026-02-18
- model: Claude Sonnet 4.6 routed via Anthropic
- feat: Bring Your Own Model (BYOM) GA: HF checkpoint upload via CLI
- feat: Quarterly Business Review program for Enterprise customers
v1.14
2026-01-29
- model: GPT-5 / GPT-5 Mini / GPT-5 Nano routed via OpenAI
- feat: TypeScript SDK 0.6 with native JSON streaming
- perf: Continuous batching scheduler rewrite (30% higher steady-state throughput)
v1.13
2026-01-12
- feat: Fine-tuning GA: LoRA, full fine-tune, DPO, continued pretraining
- feat: $50 free fine-tuning credits for every new account
- model: Llama 4 Scout (10M context) with INT4 KV support
v1.12
2025-12-15
- model: Llama 3.3 70B hosted with FireAttention v2
- model: Gemini 2.5 Pro routing live
- perf: PagedAttention block size lowered from 32 to 16, reducing fragmentation by 8%
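
To illustrate why a smaller block size can reduce fragmentation, the sketch below counts the KV slots left unused in the last block of each sequence. The workload (one sequence at every length from 1 to 1024) is hypothetical, and the sketch models internal fragmentation only; it is not intended to reproduce the 8% figure.

```python
def wasted_slots(seq_len: int, block_size: int) -> int:
    """KV slots allocated but unused in the final block of one sequence."""
    blocks = -(-seq_len // block_size)  # ceiling division on integers
    return blocks * block_size - seq_len


def mean_waste(seq_lens, block_size: int) -> float:
    """Average unused slots per sequence across a workload."""
    seq_lens = list(seq_lens)
    return sum(wasted_slots(n, block_size) for n in seq_lens) / len(seq_lens)


# Hypothetical workload: one sequence of every length from 1 to 1024 tokens.
lengths = range(1, 1025)
waste_32 = mean_waste(lengths, 32)
waste_16 = mean_waste(lengths, 16)

print(f"block size 32: {waste_32} unused slots per sequence on average")
print(f"block size 16: {waste_16} unused slots per sequence on average")
```

In this toy model the average waste per sequence is about half a block, so halving the block size roughly halves the unused slots. Smaller blocks also mean more blocks to track per sequence.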
v1.11
2025-11-20
- model: GLM-4.6 hosted (200K context)
- model: Grok 4 routing via xAI
- feat: Dashboard analytics for cache hit rate and queue depth
v1.10
2025-10-30
- feat: Public benchmarks page with reproducible scripts
- perf: FireAttention v2 (1.7× throughput vs v1)
- model: Yi-Lightning hosted, Cohere Command A routed
Older releases archived at /changelog/archive.