Changelog

Recent updates

Every shipped change. Subscribe via RSS for new releases.

v1.21

2026-05-11
  • [model] Qwen3-Next 80B A3B added: 3B-active MoE, 640 tok/s peak
  • [feat] Long-context inference doc shipped with chunked-prefill tuning guide
  • [perf] Expert prefetching for Qwen3-Next reduces cold-start TTFT by 28%
  • [feat] WebSocket gateway at /v1/chat/ws now generally available

v1.20

2026-05-08
  • [perf] FireAttention v3 ships: 2.4× throughput and 38% lower TTFT vs vLLM 0.6
  • [feat] Tree-based speculative decoding enabled by default (batch ≤ 4)
  • [feat] Speculative KV eviction in continuous batching scheduler
  • [perf] FP8 fused prefill: single kernel for quantize + matmul + softmax

v1.19

2026-04-22
  • [model] DeepSeek R1.5 added: verifier-aligned reasoning model
  • [model] Llama 4 Behemoth (288B active / 2T total) added to hosted catalog
  • [feat] Multi-LoRA serving GA: up to 64 mixed adapters per batch
  • [fix] Resolved P99 latency spike for Llama 4 Maverick in eu-west

v1.18

2026-04-08
  • [feat] Structured output: JSON Schema (Draft 2020-12) and EBNF grammar modes
  • [feat] Per-key spend alerts via webhook
  • [perf] Prefix caching now per-organisation (was per-deployment)
  • [fix] Tool calls no longer drop the first chunk on Gemini 2.5 Pro routing

v1.17

2026-03-25
  • [model] Claude Opus 4.7 with 1M context routed via Anthropic
  • [model] Magistral Medium added: Mistral's reasoning model with multilingual CoT
  • [feat] Audit log export to S3-compatible buckets
  • [perf] INT4 KV cache quantization opt-in for ultra-long-context workloads

v1.16

2026-03-08
  • [model] DeepSeek V3.2 Exp added: 685B sparse, 410 tok/s on FireAttention
  • [feat] BYOK for Together, Fireworks, Replicate (no routing fee)
  • [feat] Region pinning (us-east, eu-west, apac) for compliance
  • [fix] Fixed an edge case where cancelled streams kept the upstream connection open

v1.15

2026-02-18
  • [model] Claude Sonnet 4.6 routed via Anthropic
  • [feat] Bring Your Own Model (BYOM) GA: HF checkpoint upload via CLI
  • [feat] Quarterly Business Review program for Enterprise customers

v1.14

2026-01-29
  • [model] GPT-5 / GPT-5 Mini / GPT-5 Nano routed via OpenAI
  • [feat] TypeScript SDK 0.6 with native JSON streaming
  • [perf] Continuous batching scheduler rewrite: 30% higher steady-state throughput

v1.13

2026-01-12
  • [feat] Fine-tuning GA: LoRA, full fine-tune, DPO, continued pretraining
  • [feat] $50 in free fine-tuning credits for every new account
  • [model] Llama 4 Scout (10M context) with INT4 KV support

v1.12

2025-12-15
  • [model] Llama 3.3 70B hosted with FireAttention v2
  • [model] Gemini 2.5 Pro routing live
  • [perf] PagedAttention block size lowered to 16 (from 32): 8% less fragmentation

v1.11

2025-11-20
  • [model] GLM-4.6 hosted (200K context)
  • [model] Grok 4 routing via xAI
  • [feat] Dashboard analytics for cache hit rate and queue depth

v1.10

2025-10-30
  • [feat] Public benchmarks page with reproducible scripts
  • [perf] FireAttention v2: 1.7× throughput vs v1
  • [model] Yi-Lightning hosted; Cohere Command A routed

Older releases archived at /changelog/archive.