Optimizing vLLM Token Throughput: KV-Cache Techniques#
Introduction#
In large-language-model (LLM) inference serving contexts, once the model compute becomes sufficiently fast, the performance bottleneck often shifts to the
key-value (KV) cache management layer. The inference engine vLLM illustrates this: without deliberate KV-cache tuning, tokens per second (tokens/sec) can stagnate. The following document outlines practical optimization strategies for production-grade deployment.
Reserve sufficient GPU memory for the KV-cache#
vLLM pre-allocates a block of GPU memory for KV-cache based on the --gpu_memory_utilization parameter. By increasing this parameter, you free up more contiguous space for KV blocks, reducing fragmentation and enabling larger batch sizes.
- Simultaneously, set --max_num_seqs to constrain the number of parallel sequences so that the allocation remains dense and avoids frequent preemption.
- Starting point: set gpu_memory_utilization high, then increase max_num_seqs gradually; see the sketch below.
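A minimal sketch of this starting point using vLLM's offline Python API, assuming a single-GPU deployment; the model name and the specific utilization and sequence-count values are placeholders to tune against your own workload:

```python
from vllm import LLM, SamplingParams

# Reserve most of the GPU for weights plus KV-cache blocks, then cap concurrency.
# Values here are illustrative starting points, not recommendations.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.92,   # fraction of GPU memory vLLM may claim
    max_num_seqs=128,              # upper bound on concurrently scheduled sequences
)

params = SamplingParams(max_tokens=64)
print(llm.generate(["Hello, world"], params)[0].outputs[0].text)
```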
Quantise the KV-cache (FP8) to increase capacity#
vLLM supports quantizing the KV-cache itself (not only the model weights) via the --kv_cache_dtype flag (e.g., fp8, fp8_e4m3, fp8_e5m2). This reduces the per-token KV memory footprint and enables larger context windows or larger batch sizes. However, enable it only if your hardware and attention backend provide fast FP8 kernels; otherwise you may fall back to slower paths and lose throughput.
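A sketch of enabling FP8 KV-cache storage through the Python engine arguments, assuming your build's attention backend has FP8 KV kernels (see the backend section below); the model name is a placeholder:

```python
from vllm import LLM

# Store KV-cache entries in FP8; model weights keep their original precision
# unless they are quantized separately.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",        # or "fp8_e4m3" / "fp8_e5m2" where supported
    gpu_memory_utilization=0.92,
)
```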
Enable chunked prefill to overlap prefill & decode#
Long prompts (the prefill phase) can monopolize GPU compute while the decode phase sits idle. By splitting prefill into smaller “chunks” (i.e., chunked prefill), you allow decode tokens to interleave with prefill, improving GPU utilization and overall throughput. Recent vLLM versions enable chunked prefill by default, but you should still monitor scheduling and ensure decode is not starved.
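A sketch of the relevant engine arguments; the token budget per scheduler step is a placeholder that trades prefill throughput against decode latency:

```python
from vllm import LLM

# Chunked prefill: cap the tokens scheduled per step so decode requests keep
# getting batched in alongside long prompts instead of waiting behind them.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,               # already the default in recent versions
    max_num_batched_tokens=2048,               # smaller chunks favor decode latency
)
```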
Standardize prompts to maximize prefix-cache reuse#
vLLM’s prefix-caching mechanism hashes past KV blocks at block granularity (via PagedAttention), so even a single differing token within a block invalidates reuse. To increase the hit rate, structure prompts (particularly system prompts and the leading parts of user templates) so that shared prefixes align to block boundaries: use identical system prompts, consistently formatted templates, and identical token counts up to the block boundary.
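A sketch of the reuse pattern, assuming automatic prefix caching is enabled (it is the default in recent versions); the system prompt is only an illustration:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,
)

# A byte-for-byte identical prefix: its KV blocks are computed once and reused
# by every request that shares them.
SYSTEM = "You are a concise assistant. Answer in at most three sentences.\n\n"
prompts = [
    SYSTEM + "Summarize the KV-cache design of vLLM.",
    SYSTEM + "Explain chunked prefill in one paragraph.",
]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
```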
Use sliding-window attention + hybrid KV manager for long sessions#
When handling long dialogues or long-sequence contexts, the KV-cache grows without bound under full self-attention. vLLM’s hybrid KV-cache manager supports mixing sliding-window (local) attention layers with full-attention layers, thereby bounding the KV working set. This keeps accesses “hot” and stabilizes sustained throughput under long sequences; research on cache-management frameworks corroborates this direction.
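A minimal sketch; no dedicated flag is assumed here, since recent vLLM versions apply the hybrid KV-cache manager automatically when the checkpoint interleaves sliding-window and full-attention layers. The model name is only an illustration of such an architecture:

```python
from vllm import LLM

# The model config declares which layers use sliding-window attention; the
# hybrid KV-cache manager then bounds those layers' KV blocks automatically.
llm = LLM(
    model="google/gemma-2-9b-it",  # illustrative: interleaves local and global attention layers
    max_model_len=8192,
)
```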
Deploy RoPE scaling with cost awareness#
The rotary positional embedding (RoPE) scaling configuration (e.g., linear, dynamic) extends the effective context length. However, every attended token still occupies a KV slot, so scaling does not reduce the memory footprint by itself. vLLM’s --rope_scaling flag accepts, e.g., {"type":"dynamic","factor":4.0}. Use it only when you need large-context support, and benchmark the memory-versus-throughput trade-off.
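A sketch of passing the scaling configuration from Python; recent vLLM releases accept it as an hf_overrides entry that mirrors the --rope_scaling JSON above (older releases exposed a dedicated rope_scaling engine argument), and the model, factor, and context length are placeholders:

```python
from vllm import LLM

# Dynamic RoPE scaling to stretch an 8k-context base model toward 32k. Every
# attended token still occupies a KV slot, so size max_model_len realistically.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder 8k-context model
    hf_overrides={
        # Newer transformers configs use the key "rope_type"; older ones use "type".
        "rope_scaling": {"rope_type": "dynamic", "factor": 4.0},
    },
    max_model_len=32768,
)
```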
Apply speculative decoding to reduce memory-bound latency#
In memory-bandwidth-bound decoding regimes, the “draft model + large model” speculative-decoding pattern can yield roughly 2–3× throughput improvement: a lightweight model drafts tokens, and the main model verifies them in a single batched pass. vLLM provides first-class support for speculative decoding. Use it particularly when per-token latency is dominated by memory transfers.
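A sketch using a small draft model; the speculative_config dict is the interface in recent vLLM releases (earlier ones used separate speculative_model / num_speculative_tokens arguments), and both model names are placeholders that must share a tokenizer:

```python
from vllm import LLM, SamplingParams

# Draft-model speculative decoding: the small model proposes a few tokens per
# step, and the target model verifies them in one batched forward pass.
llm = LLM(
    model="facebook/opt-6.7b",                 # target model (placeholder)
    speculative_config={
        "model": "facebook/opt-125m",          # draft model (placeholder)
        "num_speculative_tokens": 5,
    },
)
outputs = llm.generate(["Explain paged attention briefly."], SamplingParams(max_tokens=128))
```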
Persist KV-cache across sessions to avoid cold-starts#
In production, if inference pods restart frequently or scale aggressively, cached KV blocks are lost and requests suffer “cold-start” penalties. Integrating an external KV-cache persistence layer (e.g., LMCache) allows KV fragments to be reused across sessions and requests, which is especially helpful when many requests share headers or prompt prefixes. Example usage of LMCache with vLLM is sketched below.
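A sketch modeled on the published LMCache-with-vLLM examples; the connector name, config fields, and environment variable are assumptions that vary across LMCache and vLLM versions, so check them against the integration docs:

```python
import os

from vllm import LLM
from vllm.config import KVTransferConfig

# Route KV blocks through LMCache so they can be reused beyond a single request
# (and, with a remote backend, beyond a single engine process).
os.environ.setdefault("LMCACHE_CHUNK_SIZE", "256")  # assumed LMCache setting
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",       # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",          # name taken from LMCache's vLLM example
        kv_role="kv_both",                          # both store and load KV
    ),
)
```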
Account for multi-modal token consumption in KV-slots#
When your workload includes multi-modal inputs (e.g., image tokens + text tokens), each “image token” also consumes a KV slot just like a text token. This reduces the effective batch width or max number of sequences. Make sure to include multimodal token counts when configuring max_num_seqs and memory utilization.
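A back-of-the-envelope helper (plain Python, no vLLM calls) for sizing max_num_seqs when image tokens share the KV budget; the layer, head, and token counts are illustrative and should be read from your model's config and processor:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV-cache one token occupies: K and V tensors for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative 8B-class model: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 KV.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

kv_budget_bytes = 40 * 1024**3   # e.g. ~40 GiB left for KV blocks after weights
text_tokens_per_seq = 1024
image_tokens_per_seq = 1500      # multimodal tokens occupy KV slots too
tokens_per_seq = text_tokens_per_seq + image_tokens_per_seq

max_concurrent_seqs = kv_budget_bytes // (tokens_per_seq * per_token)
print(f"{per_token} B/token -> roughly {max_concurrent_seqs} concurrent sequences")
```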
Match backend implementation to KV-dtype#
Not all attention back-ends support all KV-cache dtypes with equal performance. For example:
- FlashAttention-2: high performance when fully supported, but some builds may lack FP8 KV support.
- XFormers / FlashInfer: may support FP8, but with some speed trade-offs.
If you set --kv_cache_dtype fp8* and see a throughput regression, you most likely landed on a slow path: switch back-ends or verify kernel support. Benchmarks confirm significant performance variation across back-ends.
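A sketch for checking whether the FP8 KV path is fast on a given build: VLLM_ATTENTION_BACKEND is the environment variable vLLM reads to force a backend, while the backend name, model, and prompt counts are placeholders to swap while benchmarking:

```python
import os
import time

# Set before the engine is constructed so the backend choice takes effect.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"   # or "FLASH_ATTN", "XFORMERS"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",
)

# Crude throughput probe: generate a fixed batch and count output tokens/sec.
prompts = ["Benchmark prompt."] * 32
start = time.perf_counter()
outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
elapsed = time.perf_counter() - start
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tokens/sec")
```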