Large Language Models (LLMs) are pushing hardware limits in ways no other workload has before. Unlike training, where the bottleneck is dense compute across huge batches, inference is often constrained by memory and latency. Every generated token requires the model to reference all previous tokens through the self-attention mechanism. Without optimizations, this would mean recomputing attention states repeatedly—a recipe for sluggish responses and astronomical GPU bills.
Enter the KV cache (short for Key-Value cache). Instead of recomputing the keys and values of every previous token at each decoding step, inference engines store them in GPU memory and reuse them for subsequent tokens. This simple trick makes autoregressive decoding feasible at scale.
But there’s a catch: KV caches grow linearly with sequence length and model size. A 70B parameter model running with 8K–32K context can demand tens of gigabytes of GPU memory just for caching. That quickly exceeds the VRAM capacity of most accelerators, forcing trade-offs between batch size, latency, and context length.
This has spurred a new class of infrastructure innovation: specialized tensor caches. Unlike traditional web or database caches, these are optimized to handle large tensor objects, manage GPU/CPU memory hierarchies, and minimize recomputation. Three projects stand out in this emerging space:
- vLLM’s Paged KV Cache — a memory-efficient paging system for GPU VRAM.
- LMCache — an open-source cache layer focused on reuse and sharing across requests.
- NVIDIA Dynamo — an enterprise-ready system enabling KV cache offloading to RAM, SSDs, and remote storage.
In this post, we’ll dive into each approach, compare their features, and explore how they reshape the economics of LLM inference.
The Challenge: Why KV Cache is Hard to Manage
Before comparing solutions, it’s worth revisiting why KV cache management is non-trivial:
- Explosive Growth with Context Length
  - Memory use scales as O(N × L × D), where N = number of layers, L = sequence length, and D = hidden dimension.
  - For long-context models (e.g., 128K tokens), the cache can dwarf the base model weights.
- GPU Memory is Scarce and Expensive
  - An H100 GPU has 80 GB of VRAM. That sounds large until a few concurrent 32K-context sessions fill it up.
  - Buying more GPUs to solve a memory problem is economically unsustainable.
- Reuse Across Sessions is Non-Trivial
  - Many workloads (RAG, chatbots, customer support) involve repeated prefixes. Without sharing, caches are rebuilt from scratch every time.
- Data Movement Bottlenecks
  - Shuttling cache tensors between GPU, CPU RAM, and storage adds transfer latency.
  - Any cache system must balance compute savings against I/O overhead (see the sizing sketch after this list).
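To make the first and last constraints concrete, here is a back-of-the-envelope sizing sketch. The shapes and throughput numbers are illustrative assumptions (a Llama-2-70B-like configuration with grouped-query attention, so the per-token KV width is the number of KV heads times the head dimension rather than the full hidden dimension), not measurements from any of the systems below.

```python
# Rough KV-cache sizing and offload break-even estimate.
# All shapes and rates are illustrative assumptions, not benchmarks.

def kv_cache_bytes(n_layers: int, seq_len: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Per-sequence KV cache size: keys + values, all layers, FP16/BF16."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Assumed Llama-2-70B-like shape: 80 layers, 8 KV heads (GQA), head_dim 128.
size = kv_cache_bytes(n_layers=80, seq_len=32_768, n_kv_heads=8, head_dim=128)
print(f"KV cache per 32K-token sequence: {size / 1e9:.1f} GB")  # ~10.7 GB

# Moving the cache only pays off if reloading it beats recomputing the
# prefix (prefill). Both rates below are assumed, not measured.
nvme_read_gbps = 7.0           # assumed NVMe sequential read bandwidth (GB/s)
prefill_tokens_per_s = 5_000   # assumed prefill throughput on one GPU
reload_s = size / (nvme_read_gbps * 1e9)
recompute_s = 32_768 / prefill_tokens_per_s
print(f"reload ~{reload_s:.1f}s vs. recompute ~{recompute_s:.1f}s")  # ~1.5s vs. ~6.6s
```

Even this rough estimate shows why offloading can pay off: reloading a multi-gigabyte cache from fast NVMe can be several times cheaper than re-running prefill, but only if the storage tier keeps up. That trade-off is exactly what the systems below manage in different ways.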
With these constraints in mind, let’s evaluate the leading solutions.
vLLM’s Paged KV Cache
What it is:
vLLM (developed at UC Berkeley) introduced the idea of a paged KV cache, inspired by operating systems’ virtual memory paging. Instead of allocating one giant contiguous memory block per request, KV tensors are broken into smaller pages that can be dynamically allocated and reused.
Key Features:
- Paging Mechanism: Each token’s KV data is stored in fixed-size pages, making it easy to reuse memory across requests.
- Reduced Fragmentation: Contiguous per-request allocations waste VRAM through fragmentation; paging bounds waste to at most one partially filled block per sequence, keeping utilization close to 100%.
- Dynamic Batching: Works hand-in-hand with vLLM’s scheduling layer to maximize GPU throughput.
- GPU-Resident: The cache remains in VRAM; vLLM doesn’t natively offload to slower tiers.
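A minimal sketch of the paging idea, assuming a toy block size and ignoring the actual tensor storage (this is not vLLM's implementation): each sequence holds a block table mapping its tokens to fixed-size physical blocks, so VRAM is claimed on demand rather than reserved up front.

```python
# Sketch of a paged KV allocator: fixed-size blocks, a free list, and a
# per-sequence block table mapping logical token positions to physical blocks.

BLOCK_SIZE = 16  # tokens per block (assumed for illustration)

class PagedKVAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))    # physical block IDs
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> block IDs
        self.seq_lens: dict[str, int] = {}            # seq_id -> tokens cached

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a (block, offset) slot for the next token's K/V vectors."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:                  # current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or reject")
            table.append(self.free_blocks.pop())      # grab any free block
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE         # physical slot to write

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are interchangeable, memory freed by a finished request is immediately reusable by any other request, which is where the near-100% utilization comes from.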
Strengths:
- Best-in-class GPU efficiency.
- Open source and widely adopted in production (Baseten, Modal, etc.).
- Transparent to users: no changes needed in prompt or API usage.
Limitations:
- Still bounded by GPU VRAM capacity.
- No persistent cache across sessions — caches vanish when processes restart.
Use Case Fit:
Ideal for inference platforms that prioritize throughput and latency within GPU-only environments (short- to mid-context workloads).
LMCache
What it is:
LMCache is an open-source project designed to persist and share KV caches across sessions and serving instances. Think of it as a distributed caching system purpose-built for LLM inference.
Key Features:
- Persistent Storage: KV caches can be written to CPU RAM or disk and reloaded later.
- Cross-Session Reuse: If multiple users send requests starting with the same prefix (e.g., a long system prompt), LMCache allows sharing rather than recomputation.
- APIs for Management: Provides a cache management layer developers can integrate into inference stacks.
- Integration with NVIDIA Dynamo: Recently merged efforts to unify offloading and reuse strategies.
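The reuse mechanism can be sketched independently of LMCache's actual API (the names and storage format below are hypothetical): cached KV segments are keyed by a hash of the token prefix they cover, so a request that shares a long system prompt loads the cached segment and only computes the remainder.

```python
# Hypothetical sketch of prefix-keyed KV reuse (not LMCache's real API):
# entries are keyed by a hash of the token prefix they cover, so any request
# sharing that prefix can skip recomputing it.
import hashlib

class PrefixKVStore:
    def __init__(self):
        self.store: dict[str, bytes] = {}  # prefix hash -> serialized KV blob

    @staticmethod
    def _key(token_ids: list[int]) -> str:
        return hashlib.sha256(repr(token_ids).encode()).hexdigest()

    def put(self, token_ids: list[int], kv_blob: bytes) -> None:
        self.store[self._key(token_ids)] = kv_blob

    def longest_cached_prefix(self, token_ids: list[int]) -> tuple[int, bytes | None]:
        """Return (length, blob) of the longest cached prefix of token_ids."""
        for end in range(len(token_ids), 0, -1):
            blob = self.store.get(self._key(token_ids[:end]))
            if blob is not None:
                return end, blob   # prefill resumes from position `end`
        return 0, None             # nothing reusable: full prefill needed
```

A production system would hash fixed-size chunks incrementally rather than probing every prefix length, and would keep the blobs in CPU RAM or on disk instead of a Python dict, but the lookup contract is the same: find the longest reusable prefix, then prefill only the tail.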
Strengths:
- Focuses on reuse across workloads, not just memory efficiency.
- Particularly valuable for applications with repeated prefixes (RAG pipelines, fine-tuned prompts).
- Open source, with an extensible design.
Limitations:
- Performance heavily depends on storage medium speed (NVMe vs. network storage).
- Introduces extra complexity into serving pipelines.
Use Case Fit:
Great for enterprise deployments where repeated prompts are common and where lowering GPU recomputation costs is worth additional system complexity.
NVIDIA Dynamo
What it is:
NVIDIA Dynamo is a production-grade framework for tiered KV cache offloading, designed to scale inference beyond the limits of VRAM by intelligently moving caches across the memory hierarchy.
Key Features:
- Tiered Memory Management: Supports GPU VRAM, CPU DRAM, NVMe SSDs, and even remote storage clusters.
- Smart Prefetching: Uses heuristics and I/O optimization to prefetch needed KV data before the GPU stalls.
- Integration with Storage Vendors: Partnerships with VAST Data and WEKA aim to ensure that offloading to storage can actually keep up with inference workloads.
- LMCache Integration: Gains cache reuse semantics from LMCache on top of its tiered storage.
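The tiering idea can be illustrated with a toy LRU spill-and-promote policy (the tier names, capacities, and methods below are assumptions for illustration, not Dynamo's API): when a faster tier overflows, its coldest entries demote to the next tier down, and a hit in a slower tier promotes the entry back toward the GPU.

```python
# Illustrative sketch of tiered KV-cache offload (not NVIDIA Dynamo's API):
# entries spill from VRAM to DRAM to NVMe as capacity runs out, and are
# promoted back toward the GPU when a request needs them again.
from collections import OrderedDict

TIERS = ["vram", "dram", "nvme"]                   # fastest to slowest
CAPACITY = {"vram": 8, "dram": 64, "nvme": 4096}   # entries per tier (assumed)

class TieredKVCache:
    def __init__(self):
        # One OrderedDict per tier gives us LRU ordering for eviction.
        self.tiers = {t: OrderedDict() for t in TIERS}

    def put(self, key: str, kv_blob: bytes, tier: str = "vram") -> None:
        self.tiers[tier][key] = kv_blob
        self.tiers[tier].move_to_end(key)
        self._spill(tier)

    def _spill(self, tier: str) -> None:
        """Demote least-recently-used entries when a tier overflows."""
        idx = TIERS.index(tier)
        while len(self.tiers[tier]) > CAPACITY[tier]:
            key, blob = self.tiers[tier].popitem(last=False)   # LRU entry
            if idx + 1 < len(TIERS):
                self.put(key, blob, TIERS[idx + 1])            # demote
            # else: dropped entirely; will be recomputed on next use

    def get(self, key: str) -> bytes | None:
        """Find an entry in any tier and promote it back into VRAM."""
        for tier in TIERS:
            if key in self.tiers[tier]:
                blob = self.tiers[tier].pop(key)
                self.put(key, blob, "vram")                    # promote
                return blob
        return None                                            # cache miss
```

Real systems add prefetching on top of this, issuing promotions ahead of the decode steps that will need them so transfer latency overlaps with computation, which is what Dynamo's smart prefetching targets.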
Strengths:
- Handles the longest contexts (100K+ tokens) without requiring massive GPUs.
- Enterprise-ready, with vendor validation and deployment guides.
- Supports multi-node and distributed inference.
Limitations:
- Complexity: requires careful tuning and fast storage hardware to see benefits.
- Some workloads (short, one-off prompts) may see little gain compared to vLLM-only.
Use Case Fit:
Best for large-scale, long-context inference (enterprise chatbots, legal document analysis, retrieval-heavy apps) where context lengths exceed VRAM by orders of magnitude.
Feature Comparison
| Feature | vLLM Paged KV Cache | LMCache | NVIDIA Dynamo |
|---|---|---|---|
| Core Idea | GPU memory paging | Persistent + shared cache | Tiered offload across GPU/CPU/SSD/remote |
| Scope | Within VRAM | Cross-session reuse | Full memory hierarchy |
| Open Source | Yes | Yes | Yes (NVIDIA-led, with vendor integration points) |
| Best For | High throughput GPU inference | Repeated prefixes, RAG pipelines | Enterprise-scale, long-context workloads |
| Dependencies | GPU only | CPU RAM / disk | Fast NVMe / distributed storage |
| Integration Maturity | Very high | Growing (integrated with Dynamo) | Early enterprise adoption |
| Tradeoff | Limited by VRAM size | Depends on storage latency | Complexity, infra cost |
Why This Matters
Specialized tensor caches may seem like a niche optimization, but they’re becoming central to the economics of inference.
- Cost Savings: Every avoided recomputation saves GPU cycles, lowering cloud bills.
- Throughput Gains: Efficient memory usage means more concurrent users per GPU.
- Unlocking Long Context: Without cache offloading, serving 128K+ context windows at scale is practically impossible.
- Ecosystem Effects: By integrating with storage vendors (Weka, Vast Data), NVIDIA is pushing the boundaries of what “memory hierarchy” means in AI — GPUs are no longer islands, but part of a broader memory fabric.
Conclusion
The KV cache began as a clever software trick to make transformers practical. Today, it’s evolving into a new layer of infrastructure innovation, with specialized tensor caches emerging as a competitive differentiator among inference platforms.
- vLLM’s Paged KV Cache solved the GPU fragmentation problem and set the bar for efficient utilization.
- LMCache recognized that repeated prefixes are everywhere and built mechanisms for cross-session reuse.
- NVIDIA Dynamo extended the hierarchy outward, enabling KV cache offloading to RAM, SSDs, and remote storage for truly long-context inference.
Each approach reflects a different philosophy: efficiency (vLLM), reuse (LMCache), and scalability (Dynamo). Together, they are turning KV cache from an internal model detail into an explicit system design choice.
As LLMs continue to scale in size and context, expect tensor caching to become as standard as web caching. Just as Varnish and CDNs became indispensable in the web era, specialized KV cache systems will be the backbone of real-time AI services. The winners in inference infrastructure won’t just have the biggest GPUs—they’ll have the smartest caches.