Introduction: The Memory Wall

Modern LLMs can process context windows of 100K+ tokens. But there’s a hidden cost: the KV cache.

As context grows, the memory required to store the attention keys and values grows linearly with sequence length, multiplied across every layer and every head (while attention compute grows quadratically). This creates a bottleneck:

  • Memory: At long contexts and large batch sizes, the KV cache can consume several times more memory than the model weights themselves
  • Bandwidth: Moving KV cache data becomes the primary latency source
  • Cost: Serving long-context models requires expensive high-memory GPUs

Two innovations address this: Grouped Query Attention (GQA) and Multi-Head Latent Attention (MLA). They shrink the KV cache by roughly 4-16× while largely maintaining quality.

Let’s understand why the KV cache exists, and how these optimizations work.

The KV Cache Problem

Why Do We Need a KV Cache?

In autoregressive generation, the model generates tokens one at a time. Each new token needs to attend to all previous tokens.

Without caching, we’d recompute keys and values for all previous tokens at every step:

graph TB subgraph "Without KV Cache (Inefficient)" A1[Generate Token 1] --> B1[Compute K,V for token 1] B1 --> C1[Generate Token 2] C1 --> D1[Recompute K,V for token 1
Compute K,V for token 2] D1 --> E1[Generate Token 3] E1 --> F1[Recompute K,V for tokens 1,2
Compute K,V for token 3] end style D1 fill:#e74c3c style F1 fill:#e74c3c

This is wasteful! Keys and values for previous tokens never change.

Solution: Cache them:

graph TB subgraph "With KV Cache (Efficient)" A2[Generate Token 1] --> B2[Compute & Cache K,V
for token 1] B2 --> C2[Generate Token 2] C2 --> D2[Reuse cached K,V for token 1
Compute K,V for token 2] D2 --> E2[Generate Token 3] E2 --> F2[Reuse cached K,V for tokens 1,2
Compute K,V for token 3] end style B2 fill:#2ecc71 style D2 fill:#2ecc71 style F2 fill:#2ecc71

Perfect! But now we have a memory problem.
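
To make this concrete, here's a minimal sketch of a cached decode loop in PyTorch (hypothetical single-head shapes, random matrices standing in for learned projections). Notice that the cache gains exactly one K row and one V row per generated token:

import torch

d_model, d_head = 64, 64
W_q, W_k, W_v = (torch.randn(d_model, d_head) for _ in range(3))  # stand-in projection weights
k_cache, v_cache = [], []

for step in range(3):
    x = torch.randn(1, d_model)                      # hidden state of the newly generated token
    k_cache.append(x @ W_k)                          # K for this token: computed once, cached forever
    v_cache.append(x @ W_v)
    K, V = torch.cat(k_cache), torch.cat(v_cache)    # cache grows by one row per step
    q = x @ W_q
    out = ((q @ K.T) / d_head ** 0.5).softmax(dim=-1) @ V  # attend over all cached tokens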

Standard Multi-Head Attention: The Memory Breakdown

Architecture

graph TB
    A[Input Token<br/>Dimension: d_model] --> B[Linear Projections]
    B --> Q1[Q1] & Q2[Q2] & Q3[Q3] & Q4[Q4] & Q5[Q5] & Q6[Q6] & Q7[Q7] & Q8[Q8]
    B --> K1[K1] & K2[K2] & K3[K3] & K4[K4] & K5[K5] & K6[K6] & K7[K7] & K8[K8]
    B --> V1[V1] & V2[V2] & V3[V3] & V4[V4] & V5[V5] & V6[V6] & V7[V7] & V8[V8]
    Q1 & K1 & V1 --> H1[Head 1<br/>Attention]
    Q2 & K2 & V2 --> H2[Head 2<br/>Attention]
    Q3 & K3 & V3 --> H3[Head 3<br/>Attention]
    Q4 & K4 & V4 --> H4[Head 4<br/>Attention]
    Q5 & K5 & V5 --> H5[Head 5<br/>Attention]
    Q6 & K6 & V6 --> H6[Head 6<br/>Attention]
    Q7 & K7 & V7 --> H7[Head 7<br/>Attention]
    Q8 & K8 & V8 --> H8[Head 8<br/>Attention]
    H1 & H2 & H3 & H4 & H5 & H6 & H7 & H8 --> C[Concatenate]
    C --> O[Output]
    classDef cachedK fill:#e74c3c
    classDef cachedV fill:#f39c12
    class K1,K2,K3,K4,K5,K6,K7,K8 cachedK
    class V1,V2,V3,V4,V5,V6,V7,V8 cachedV

Memory Calculation

For a model with:

  • n_heads = 32 (number of attention heads)
  • d_model = 4096 (model dimension)
  • d_head = d_model / n_heads = 128 (dimension per head)
  • seq_len = context length in tokens
  • n_layers = 32 (number of transformer layers)

KV Cache Size per Layer:

2 (K and V) × n_heads × d_head × seq_len × sizeof(float16)
= 2 × 32 × 128 × seq_len × 2 bytes
= 16,384 × seq_len bytes

Total KV Cache Size:

= 16,384 × seq_len × n_layers
= 16,384 × seq_len × 32
= 524,288 × seq_len bytes

For seq_len = 100,000 tokens:
= 52.4 GB just for the KV cache!

This is massive. Most of the memory isn’t model weights—it’s the KV cache.
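
You can sanity-check this arithmetic in a few lines of Python (kv_cache_bytes is just an illustrative helper, not a library function):

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, d_head=128, bytes_per_elem=2):
    # factor of 2 accounts for keys and values; bytes_per_elem=2 assumes float16
    return 2 * n_kv_heads * d_head * seq_len * bytes_per_elem * n_layers

print(kv_cache_bytes(100_000) / 1e9)                 # ≈ 52.4 GB for standard MHA
print(kv_cache_bytes(100_000, n_kv_heads=4) / 1e9)   # ≈ 6.55 GB for GQA with 4 KV heads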

Visualization: KV Cache Growth

graph LR
    A[1K tokens<br/>524 MB] --> B[10K tokens<br/>5.24 GB]
    B --> C[100K tokens<br/>52.4 GB]
    C --> D[1M tokens<br/>524 GB]
    style A fill:#2ecc71
    style B fill:#f39c12
    style C fill:#e74c3c
    style D fill:#8e44ad

Problem: Long context models require enormous memory, most of which is KV cache.

Solution 1: Grouped Query Attention (GQA)

Key Insight: Do all query heads really need their own K and V?

In standard multi-head attention, each head has its own Q, K, and V. But what if multiple query heads shared the same K and V?

Standard Multi-Head Attention (MHA)

graph TB subgraph "8 Query Heads" Q1[Q1] & Q2[Q2] & Q3[Q3] & Q4[Q4] & Q5[Q5] & Q6[Q6] & Q7[Q7] & Q8[Q8] end subgraph "8 Key Heads (Cached)" K1[K1] & K2[K2] & K3[K3] & K4[K4] & K5[K5] & K6[K6] & K7[K7] & K8[K8] end subgraph "8 Value Heads (Cached)" V1[V1] & V2[V2] & V3[V3] & V4[V4] & V5[V5] & V6[V6] & V7[V7] & V8[V8] end Q1 -.-> K1 & V1 Q2 -.-> K2 & V2 Q3 -.-> K3 & V3 Q4 -.-> K4 & V4 Q5 -.-> K5 & V5 Q6 -.-> K6 & V6 Q7 -.-> K7 & V7 Q8 -.-> K8 & V8 style K1 fill:#e74c3c style K2 fill:#e74c3c style K3 fill:#e74c3c style K4 fill:#e74c3c style K5 fill:#e74c3c style K6 fill:#e74c3c style K7 fill:#e74c3c style K8 fill:#e74c3c

KV Cache: 8 K heads + 8 V heads = 16 cached tensors

Grouped Query Attention (GQA)

graph TB subgraph "8 Query Heads" Q1[Q1] & Q2[Q2] & Q3[Q3] & Q4[Q4] & Q5[Q5] & Q6[Q6] & Q7[Q7] & Q8[Q8] end subgraph "2 Key-Value Groups (Cached)" subgraph "Group 1" K1[K1] V1[V1] end subgraph "Group 2" K2[K2] V2[V2] end end Q1 & Q2 & Q3 & Q4 -.-> K1 & V1 Q5 & Q6 & Q7 & Q8 -.-> K2 & V2 style K1 fill:#2ecc71 style K2 fill:#2ecc71 style V1 fill:#2ecc71 style V2 fill:#2ecc71

KV Cache: 2 K heads + 2 V heads = 4 cached tensors

Reduction: 4× smaller KV cache!

GQA Formula

n_kv_heads = n_query_heads / group_size

Example:
- 32 query heads
- group_size = 8
- n_kv_heads = 32 / 8 = 4

KV cache reduction: 32/4 = 8× smaller
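
In code, GQA boils down to broadcasting each cached KV head across its group of query heads before doing ordinary attention. Here is a minimal sketch, assuming [batch, heads, seq, d_head] tensors (gqa_attention is illustrative, not a library API; the causal mask is omitted for brevity):

import torch

def gqa_attention(q, k, v):
    # q: [batch, n_q_heads, seq, d_head];  k, v: [batch, n_kv_heads, seq, d_head]
    group_size = q.shape[1] // k.shape[1]          # query heads per KV head
    k = k.repeat_interleave(group_size, dim=1)     # share each KV head across its query group
    v = v.repeat_interleave(group_size, dim=1)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v

# Only the 4 KV heads are cached; the expansion to 32 happens at compute time.
q = torch.randn(1, 32, 128, 64)    # hypothetical shapes: 32 query heads, 128 tokens
k = torch.randn(1, 4, 128, 64)     # 4 cached key heads
v = torch.randn(1, 4, 128, 64)     # 4 cached value heads
out = gqa_attention(q, k, v)       # [1, 32, 128, 64]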

GQA Variants

graph TB
    A[Multi-Head Attention<br/>MHA] --> B[n_kv = n_query<br/>e.g., 32 KV heads]
    C[Grouped Query Attention<br/>GQA] --> D[n_kv = n_query / group_size<br/>e.g., 4 KV heads]
    E[Multi-Query Attention<br/>MQA] --> F[n_kv = 1<br/>1 KV head shared by all queries]
    style A fill:#e74c3c
    style C fill:#f39c12
    style E fill:#2ecc71

Trade-off:

  • MHA: Largest cache, best quality
  • GQA: Medium cache, good quality
  • MQA: Smallest cache, some quality loss

Memory Savings with GQA

Standard (32 KV heads): 52.4 GB for 100K tokens
GQA (4 KV heads):       6.55 GB for 100K tokens
Reduction:              8×

This is huge! Now a 100K-token context fits on much smaller GPUs.

Solution 2: Multi-Head Latent Attention (MLA)

GQA reduces the number of KV heads. MLA goes further: compress the KV cache itself using a latent representation.

Key Insight: Keys and values across heads contain redundant information. Can we compress them into a smaller “latent” space?

Standard Attention: Full KV Cache

graph TB
    A[Input<br/>d_model = 4096] --> B[Project to K,V]
    B --> K[Keys<br/>32 heads × 128 dim<br/>= 4096 total]
    B --> V[Values<br/>32 heads × 128 dim<br/>= 4096 total]
    K --> C[Cache Keys<br/>4096 dimensions]
    V --> D[Cache Values<br/>4096 dimensions]
    style C fill:#e74c3c
    style D fill:#e74c3c

Cached: 4096 (K) + 4096 (V) = 8192 dimensions per token, per layer

Multi-Head Latent Attention: Compressed KV Cache

graph TB
    A[Input<br/>d_model = 4096] --> B[Project to<br/>Latent KV<br/>d_latent = 512]
    B --> L[Latent KV<br/>512 dimensions<br/>COMPRESSED]
    L --> C[Cache Latent<br/>512 dimensions]
    C --> D[Expand to K<br/>32 heads × 128]
    C --> E[Expand to V<br/>32 heads × 128]
    D --> F[Attention Computation]
    E --> F
    style L fill:#2ecc71
    style C fill:#2ecc71

Cached: Only 512 dimensions per token, per layer (instead of 8192)

Reduction: 16× smaller KV cache!

MLA Architecture

graph TB subgraph "Compression (During Generation)" A[Input Token
4096-dim] --> B[Down-projection
W_down: 4096 → 512] B --> C[Latent KV
512-dim
CACHED] end subgraph "Decompression (During Attention)" C --> D[Up-projection K
W_up_k: 512 → 4096] C --> E[Up-projection V
W_up_v: 512 → 4096] D --> F[Reshape to heads
32 × 128] E --> G[Reshape to heads
32 × 128] end subgraph "Attention" H[Query] & F & G --> I[Scaled Dot-Product
Attention] I --> J[Output] end style C fill:#2ecc71 style D fill:#9b59b6 style E fill:#9b59b6

MLA Mathematics

Compression:

latent_kv = W_down × input
latent_kv: [seq_len, 512]  (cached)

Decompression:

K = W_up_k × latent_kv  → [seq_len, 4096] → reshape to [seq_len, 32, 128]
V = W_up_v × latent_kv  → [seq_len, 4096] → reshape to [seq_len, 32, 128]

Attention:

output = Attention(Q, K, V)
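
Here is a shape-level sketch of that math, using the dimensions from this post (the weights are random stand-ins for W_down, W_up_k, W_up_v). Note that in DeepSeek's published formulation the up-projections can be absorbed into the query and output projections at inference time, so K and V never need to be fully materialized; the explicit version below is just for clarity:

import torch

d_model, d_latent, n_heads, d_head, seq_len = 4096, 512, 32, 128, 16
W_down = torch.randn(d_model, d_latent)
W_up_k = torch.randn(d_latent, n_heads * d_head)
W_up_v = torch.randn(d_latent, n_heads * d_head)

x = torch.randn(seq_len, d_model)
latent_kv = x @ W_down                                   # [seq_len, 512] — the only thing cached
K = (latent_kv @ W_up_k).view(seq_len, n_heads, d_head)  # [seq_len, 32, 128]
V = (latent_kv @ W_up_v).view(seq_len, n_heads, d_head)  # [seq_len, 32, 128]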

Memory Savings with MLA

Standard (8192 dims per token): 52.4 GB for 100K tokens
MLA (512 dims per token):       3.28 GB for 100K tokens
Reduction:                      16×

This is extraordinary! Now 1M+ token contexts can fit on reasonably sized GPUs.

Comparison: MHA vs. GQA vs. MLA

Side-by-Side Architecture

graph TB subgraph "MHA: Standard" A1[32 Query Heads] A2[32 Key Heads
CACHED] A3[32 Value Heads
CACHED] A1 -.-> A2 A1 -.-> A3 end subgraph "GQA: Shared KV" B1[32 Query Heads] B2[4 Key Heads
CACHED] B3[4 Value Heads
CACHED] B1 -.-> B2 B1 -.-> B3 end subgraph "MLA: Compressed KV" C1[32 Query Heads] C2[Latent KV
512-dim
CACHED] C3[Expand to
K & V on-the-fly] C1 -.-> C3 C2 --> C3 end style A2 fill:#e74c3c style A3 fill:#e74c3c style B2 fill:#f39c12 style B3 fill:#f39c12 style C2 fill:#2ecc71

Memory Comparison Table

For 100K token context, 32 layers, 32 heads, d_model=4096:

Method          KV Heads   Dims per Token   Total KV Cache   Reduction
MHA             32         8192             52.4 GB          1× (baseline)
GQA (group=8)   4          1024             6.55 GB          8×
MLA             Latent     512              3.28 GB          16×

Quality vs. Efficiency Trade-off

graph LR
    A[MHA<br/>Best Quality<br/>Highest Memory] --> B[GQA<br/>Good Quality<br/>Medium Memory]
    B --> C[MLA<br/>Good Quality<br/>Lowest Memory]
    style A fill:#e74c3c
    style B fill:#f39c12
    style C fill:#2ecc71

Empirically (exact numbers vary by model and benchmark):

  • MHA → GQA: Minimal quality loss (~1% worse)
  • MHA → MLA: Small quality loss (~2-3% worse), but with proper training can match MHA

Real-World Example: DeepSeek-V2

DeepSeek-V2 uses MLA to achieve massive efficiency gains:

graph TB
    A[DeepSeek-V2<br/>236B parameters] --> B[Multi-Head Latent<br/>Attention MLA]
    B --> C[KV Cache Reduction<br/>16× smaller]
    C --> D1[Support 128K context<br/>on 1× A100 80GB]
    C --> D2[Support 1M context<br/>on 8× A100]
    B --> E[Inference Speed<br/>5× faster than standard]
    style B fill:#2ecc71
    style C fill:#f39c12
    style D1 fill:#9b59b6
    style D2 fill:#9b59b6
    style E fill:#3498db

Results:

  • 128K context: Fits on a single A100 80GB GPU
  • 1M context: Possible on 8× A100s
  • 5× faster inference than standard attention
  • Competitive quality with standard models

Combining GQA and MLA

Some models combine both techniques:

graph TB
    A[Input] --> B[Grouped Queries<br/>32 → 4 groups]
    B --> C[Latent Compression<br/>512-dim latent KV]
    C --> D[Cache Latent<br/>Only 512 dims]
    D --> E[Expand per group<br/>4 KV heads]
    style C fill:#2ecc71
    style D fill:#2ecc71

Combined reduction: theoretically 50-100× smaller than a standard MHA cache.

Implementation Considerations

1. Training

MLA requires training from scratch or careful fine-tuning:

  • Learn compression matrices (W_down)
  • Learn decompression matrices (W_up_k, W_up_v)
  • Ensure minimal information loss

2. Inference

# Sketch of MLA decoding with explicit decompression (PyTorch).
# Projection weights are passed in for brevity; RoPE handling is omitted.
import torch

def generate_token(x, latent_cache, W_down, W_up_k, W_up_v, W_q, n_heads=32, d_head=128):
    # x: [1, d_model] hidden state of the current token
    # latent_cache: [t, d_latent] cached latents for the t previous tokens
    latent = x @ W_down                                     # [1, d_latent] compress to latent
    latent_cache = torch.cat([latent_cache, latent])        # [t+1, d_latent] append to cache
    K = (latent_cache @ W_up_k).view(-1, n_heads, d_head)   # decompress keys on the fly
    V = (latent_cache @ W_up_v).view(-1, n_heads, d_head)   # decompress values on the fly
    Q = (x @ W_q).view(1, n_heads, d_head)                  # query for the current token
    scores = torch.einsum("qhd,khd->hqk", Q, K) / d_head ** 0.5
    out = torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), V)
    return out.reshape(1, -1), latent_cache                 # attention output, updated cache

3. Hardware Optimization

MLA shifts computation:

  • Less memory bandwidth: Smaller cache = less data movement
  • More compute: Decompression requires matrix multiplications

Trade-off works well on modern GPUs where compute is abundant but memory bandwidth is limited.

Benchmarks: Memory and Speed

Memory Usage (100K tokens, 32 layers)

graph LR
    A[MHA<br/>52.4 GB] --> B[GQA group=8<br/>6.55 GB<br/>8× reduction]
    B --> C[MLA 512-dim<br/>3.28 GB<br/>16× reduction]
    style A fill:#e74c3c
    style B fill:#f39c12
    style C fill:#2ecc71

Throughput (tokens/second, batch size 1, illustrative figures)

graph LR
    A[MHA<br/>10 tok/s<br/>Baseline] --> B[GQA<br/>25 tok/s<br/>2.5× faster]
    B --> C[MLA<br/>50 tok/s<br/>5× faster]
    style A fill:#e74c3c
    style B fill:#f39c12
    style C fill:#2ecc71

Why is MLA faster?

  • Smaller KV cache = less memory movement
  • Memory bandwidth is often the bottleneck, not compute

When to Use Each Method

Use Standard MHA When:

  • Memory is not a constraint
  • Maximum quality is critical
  • Short contexts (<8K tokens)

Use GQA When:

  • Memory is limited
  • Need good quality with some savings
  • Easy drop-in replacement for MHA
  • Models: Llama 3, Mistral, Gemma

Use MLA When:

  • Memory is severely constrained
  • Long contexts (100K+ tokens)
  • Willing to train from scratch or fine-tune
  • Models: DeepSeek-V2, DeepSeek-V3

Future Directions

1. Adaptive Compression

Dynamically adjust compression ratio based on content:

graph LR
    A[Important Tokens<br/>Low compression<br/>1024-dim] --> C[Adaptive Cache]
    B[Less Important<br/>High compression<br/>128-dim] --> C
    style A fill:#e74c3c
    style B fill:#2ecc71

2. Quantization

Combine with lower precision:

FP16 latent: 512 × 2 bytes = 1024 bytes/token
INT8 latent:  512 × 1 byte  = 512 bytes/token
INT4 latent:  512 × 0.5 byte = 256 bytes/token

64× reduction from baseline!
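
A sketch of what per-token INT8 quantization of the latent cache might look like (quantize_int8 and dequantize are illustrative helpers, not from any particular library):

import torch

def quantize_int8(latent):
    # symmetric per-token int8 quantization of the latent KV cache
    scale = latent.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = (latent / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale                        # store int8 values plus one scale per token

def dequantize(q, scale):
    return q.float() * scale               # recover an approximate latent before expansion

latent = torch.randn(4, 512)               # 4 cached tokens, 512-dim latents
q, scale = quantize_int8(latent)
print(q.element_size() * q.numel())        # 2048 bytes, vs. 4096 bytes if stored in fp16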

3. Sparse Attention + MLA

Combine sparse attention patterns with latent compression for even larger contexts.

Conclusion

The KV cache is the hidden bottleneck in long-context LLMs. As context windows grow from 4K to 100K to 1M+ tokens, memory becomes the limiting factor.

Grouped Query Attention (GQA) and Multi-Head Latent Attention (MLA) solve this by reducing KV cache size:

  • GQA: Share KV heads across query heads (4-8× reduction)
  • MLA: Compress KV cache into latent space (10-16× reduction)

These innovations enable:

  • Longer context windows
  • Faster inference
  • Lower serving costs
  • Deployment on smaller GPUs

The future of LLMs isn’t just about bigger models—it’s about smarter memory management.

Key Takeaways

  • KV Cache Problem: Memory grows linearly with context length
  • GQA: Share KV heads among query heads
  • MLA: Compress KV into low-dimensional latent space
  • Trade-offs: Memory/speed vs. quality
  • Real-world Impact: 10-100× memory savings enable 1M+ token contexts
  • Future: Adaptive compression, quantization, and sparse attention

Understanding KV cache optimization is essential for building and deploying modern long-context LLMs.

Further Reading