Free playbook

The GPU Inference Optimization Playbook

Everything we've learned shipping FireAttention and designing production inference stacks — distilled into 8 chapters, 104 pages, zero filler. Free PDF.

104 pages · PDF · 8 chapters · ~3-hour read

Download free PDF

We'll send the playbook plus very occasional updates from the inference team. No spam, unsubscribe with one click.

Inside the playbook

8 chapters. Each one stands alone — read in any order. Code samples, real benchmarks, decision matrices, and the kind of details you only get from someone who's shipped this in production.

01

Why your inference is slow (and where to look first)

Memory bandwidth vs compute, the autoregressive bottleneck, and a 5-minute profiling checklist that finds 80% of issues.

12 pages

02

Quantization: the practical guide

When FP8 is enough vs when you need INT4. Quality validation methodology that's actually rigorous. Decision matrix per workload type.

18 pages

03

Continuous batching, deeply

PagedAttention internals, the four scheduling knobs that matter, and how to tell whether you're memory-bound or compute-bound.

16 pages

04

Speculative decoding playbook

Linear vs tree speculation, training a custom draft model, and why you should disable speculative decoding above batch size 16. Real accept-rate data.

14 pages

05

Long-context economics

When to use 1M / 10M context vs RAG. Prefix caching deep dive. The 16× cost reduction nobody tells you about.

10 pages

06

Multi-LoRA serving for SaaS

How to give every customer a custom model without going broke on GPU. Adapter management, hot-swap, eviction policies.

12 pages

07

Migration from closed APIs

Cost models for switching off OpenAI / Anthropic. Compatibility shims. The 7-day shadow-traffic protocol.

10 pages

08

Scaling past one node

When tensor parallelism beats pipeline parallelism. NVLink topology gotchas. Multi-region deployment patterns.

12 pages

Who this is for

✓ You'll get value if…

  • You run open-weight LLMs in production
  • You manage a GPU cluster of any size
  • You're evaluating a migration off closed APIs
  • You ship AI features and care about COGS
  • You're an engineering leader scoping inference work

✗ Probably skip if…

  • You're a researcher (this is about deployment)
  • You only consume hosted APIs and don't care about the layer below
  • You want vendor-agnostic content (we're honest about being opinionated)

Want us to ship the optimizations for you?

The playbook tells you what to do. The Solutions team does it for a fixed fee, with the code delivered to your repo.