Continuous batching

Static batching wastes GPU cycles. Continuous batching schedules requests at the iteration level, slotting new requests into the running batch as soon as a sequence slot opens up. This is how we keep H100s at 90%+ utilization under spiky load.

Why static batching fails

A static batch waits for every request to finish before starting the next batch. If request A produces 50 tokens and request B produces 500, A's slot does nothing but padding work from token 50 onward: 450 of the batch's 500 decode steps, or 90% of its lifetime. Average GPU utilization in production with static batching: ~35%.
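The arithmetic behind the A/B example can be checked with a few lines of Python. This is a minimal sketch that counts "slot-steps" (one sequence slot for one decode step) and ignores prefill cost; the function name is illustrative and not part of any API.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-steps that emit a real token in one static batch."""
    if not output_lengths:
        raise ValueError("need at least one request")
    steps = max(output_lengths)          # the batch runs until the longest request ends
    useful = sum(output_lengths)         # slot-steps that produce a real token
    total = steps * len(output_lengths)  # slot-steps the GPU actually executes
    return useful / total


# Request A produces 50 tokens, request B produces 500.
print(static_batch_utilization([50, 500]))  # 0.55
```

Even in this two-request case, 45% of the slot-steps are padding; wider spreads in output length push the number lower.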

Continuous batching, in one paragraph

The scheduler runs a token-level loop. Each iteration: (1) sample the next token for every active request, (2) check which requests finished, (3) admit new requests into the freed slots. New requests join at the next iteration, while the rest of the batch is still running, instead of waiting for the next batch boundary. The batch size grows and shrinks dynamically based on traffic.
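The loop above can be sketched as a toy simulation. This is an illustration under simplifying assumptions, not the production scheduler: each request is reduced to a countdown of tokens left to generate, there is no prefill phase, and `Request` and `run_continuous` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    tokens_left: int  # stand-in for "until EOS or max_tokens"


def run_continuous(requests, max_batch_size):
    """Toy iteration-level scheduler; returns {name: iteration it finished on}."""
    waiting = deque(requests)
    active = []
    finished_at = {}
    iteration = 0
    while waiting or active:
        # (3) admit waiting requests into any free slots at the iteration boundary
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        iteration += 1
        # (1) one decode step for every active request
        for req in active:
            req.tokens_left -= 1
        # (2) retire requests that just finished, freeing their slots
        still_running = []
        for req in active:
            if req.tokens_left <= 0:
                finished_at[req.name] = iteration
            else:
                still_running.append(req)
        active = still_running
    return finished_at


reqs = [Request("A", 2), Request("B", 5), Request("C", 3)]
print(run_continuous(reqs, max_batch_size=2))  # {'A': 2, 'B': 5, 'C': 5}
```

With two slots, C takes over A's slot at iteration 3 and finishes at iteration 5. Under static batching, C would not start until B finished and would complete at iteration 8.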

The four knobs that matter

max_batch_size (default: 64 on H100, 32 on L40S)

Hard cap on concurrent requests in flight. Increase if you have spare HBM and your traffic is latency-tolerant; decrease if P99 TTFT matters more.

max_num_batched_tokens (default: 8192)

Total tokens (across all sequences) processed per iteration. Bigger = better throughput at the cost of TTFT. We recommend leaving this alone unless you've profiled your workload.

prefill_chunk_size (default: 2048)

Long prompts are split into chunks so they don't starve decode steps. This is what keeps long-context requests from blocking everyone else.
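As a rough illustration of the splitting, the sketch below cuts a prompt into token ranges of at most `prefill_chunk_size`, one range per scheduler iteration. The function name is hypothetical, and how chunks are interleaved with decode steps is not shown.

```python
def prefill_chunks(prompt_len, chunk_size=2048):
    """Return (start, end) token ranges covering the prompt, one per iteration."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


print(prefill_chunks(5000))  # [(0, 2048), (2048, 4096), (4096, 5000)]
```

A 5000-token prompt is prefilled over three iterations instead of one, so decode steps for other requests can run in between.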

kv_block_size (default: 16 tokens)

PagedAttention block size. Smaller blocks = less internal fragmentation but more bookkeeping overhead. 16 is the sweet spot for most workloads.
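The block-size trade-off can be made concrete with a small calculation. This sketch assumes each sequence wastes only the unused token slots in its last block; the helper names are illustrative.

```python
def kv_blocks_needed(seq_len, block_size=16):
    """Number of KV blocks a sequence of seq_len tokens occupies (ceiling division)."""
    return -(-seq_len // block_size)


def wasted_slots(seq_len, block_size=16):
    """Token slots allocated but unused in the sequence's last block."""
    return kv_blocks_needed(seq_len, block_size) * block_size - seq_len


for block_size in (8, 16, 128):
    print(block_size, kv_blocks_needed(1000, block_size), wasted_slots(1000, block_size))
# 8 125 0
# 16 63 8
# 128 8 24
```

For a 1000-token sequence, smaller blocks waste fewer slots but require tracking many more blocks per sequence.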

PagedAttention (KV cache management)

Continuous batching alone isn't enough — you also need PagedAttention to manage the KV cache. Each sequence's KV cache is stored in fixed-size blocks (16 tokens each by default), allocated on-demand from a global pool. When a sequence finishes, its blocks return to the pool immediately.

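The result: no external fragmentation, internal fragmentation limited to the last partially filled block of each sequence, dynamic memory allocation, and the ability to run at very high effective batch sizes. We typically run batch 96+ on a single H100 with 80 GB HBM serving Llama 3.3 70B FP8, which requires raising max_batch_size above its default of 64.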
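The allocation pattern described above can be sketched as a block pool with a per-sequence block table. This is a minimal bookkeeping model, not the actual implementation: it tracks block IDs only and holds no tensor data, and `BlockPool` and its methods are hypothetical names.

```python
class BlockPool:
    """Global pool of fixed-size KV blocks, handed out on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # IDs of unallocated blocks
        self.tables = {}                     # seq_id -> list of block IDs
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; False if the pool is exhausted."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:         # no block yet, or the last block is full
            if not self.free:
                return False
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return True

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


pool = BlockPool(num_blocks=4)
for _ in range(17):                  # 17 tokens need two 16-token blocks
    pool.append_token("seq-a")
print(len(pool.tables["seq-a"]), len(pool.free))  # 2 2
pool.release("seq-a")
print(len(pool.free))                # 4
```

Because a block is only taken when the previous one is full, a sequence never holds more than one partially filled block.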

Speculative KV eviction

A v3 addition: when we predict a request will finish in the next 4 steps with > 95% confidence, we proactively allocate the KV blocks for an incoming request into the soon-to-be-free slots. If the prediction is wrong (~1.2% of the time), we fall back without dropping tokens. This adds ~14% steady-state throughput on bursty traffic.

What you should monitor

  • Effective batch size (running average): if it's < 8 you're leaving throughput on the table; investigate why traffic isn't reaching the worker.
  • Queue depth: if requests sit in the admission queue for > 20 ms, you're overcommitted and need more replicas.
  • KV cache utilization: > 95% means you're evicting requests; either reduce max context or scale up.

All three are exposed in your dashboard and via Prometheus metrics.
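The three thresholds above can be expressed as a simple check. This is a sketch only: the parameter names are placeholders, not the actual Prometheus metric names, and the values are assumed to have already been scraped.

```python
def check_worker(effective_batch_size, queue_wait_ms, kv_cache_utilization):
    """Return a list of warnings based on the three thresholds above."""
    alerts = []
    if effective_batch_size < 8:
        alerts.append("effective batch size < 8: check that traffic reaches the worker")
    if queue_wait_ms > 20:
        alerts.append("admission queue wait > 20 ms: overcommitted, add replicas")
    if kv_cache_utilization > 0.95:
        alerts.append("KV cache > 95%: reduce max context or scale up")
    return alerts


print(check_worker(effective_batch_size=32, queue_wait_ms=5, kv_cache_utilization=0.70))  # []
```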

TL;DR

Continuous batching + PagedAttention + speculative eviction is what makes hosted inference 2.4× cheaper than DIY. You don't need to tune anything; the defaults are good. If you do tune, watch the four knobs above and the three metrics under "What you should monitor".