The GPU Inference Optimization Playbook
Everything we've learned shipping FireAttention and designing production inference stacks — distilled into 8 chapters, 104 pages, zero filler. Free PDF.
Download free PDF
Inside the playbook
8 chapters. Each one stands alone — read in any order. Code samples, real benchmarks, decision matrices, and the kind of details you only get from someone who's shipped this in production.
Why your inference is slow (and where to look first)
Memory bandwidth vs compute, the autoregressive bottleneck, and a 5-minute profiling checklist that finds 80% of issues.
Quantization: the practical guide
When FP8 is enough vs when you need INT4. Quality validation methodology that's actually rigorous. Decision matrix per workload type.
Continuous batching, deeply
PagedAttention internals, the four scheduling knobs that matter, and how to detect when you're memory-bound vs compute-bound.
Speculative decoding playbook
Linear vs tree speculation, training a custom draft model, and why you should disable it above batch size 16. Real accept-rate data.
Long-context economics
When to use 1M / 10M context vs RAG. Prefix caching deep dive. The 16× cost reduction nobody tells you about.
Multi-LoRA serving for SaaS
How to give every customer a custom model without going broke on GPU. Adapter management, hot-swap, eviction policies.
Migration from closed APIs
Cost models for switching away from OpenAI / Anthropic. Compatibility shims. The 7-day shadow-traffic protocol.
Scaling past one node
When tensor parallelism beats pipeline parallelism. NVLink topology gotchas. Multi-region deployment patterns.
Who this is for
✓ You'll get value if…
- You run open-weight LLMs in production
- You manage a GPU cluster of any size
- You're evaluating a migration off closed APIs
- You ship AI features and care about COGS
- You're an engineering leader scoping inference work
✗ Probably skip if…
- You're a researcher (this is about deployment)
- You only consume hosted APIs and don't care about the layer below
- You want vendor-agnostic content (we're honest about being opinionated)
Want us to ship the optimizations for you?
The playbook tells you what to do. The Solutions team does it, fixed-fee, with deliverable code in your repo.