Introduction: The Quadratic Bottleneck

Transformers revolutionized AI, but they have a fundamental flaw: quadratic scaling.

Processing a sequence of length n requires O(n²) operations due to self-attention. Every token attends to every other token, creating an all-to-all comparison:

Context length:     1K      10K     100K    1M
Operations:         1M      100M    10B     1T
Time (relative):    1×      100×    10,000× 1,000,000×

This makes long-context processing prohibitively expensive.

Enter State Space Models (SSMs), specifically Mamba: a new architecture that processes sequences in linear time O(n) while maintaining long-range dependencies.

The future isn’t Transformers vs. SSMs; it’s Transformers + SSMs working together in hybrid architectures.

The Core Problem: Attention Complexity

Self-Attention: All-to-All Communication

graph TB
    subgraph "Transformer: O(n²) Complexity"
        T1[Token 1] -.-> T1 & T2 & T3 & T4 & T5 & T6
        T2[Token 2] -.-> T1 & T2 & T3 & T4 & T5 & T6
        T3[Token 3] -.-> T1 & T2 & T3 & T4 & T5 & T6
        T4[Token 4] -.-> T1 & T2 & T3 & T4 & T5 & T6
        T5[Token 5] -.-> T1 & T2 & T3 & T4 & T5 & T6
        T6[Token 6] -.-> T1 & T2 & T3 & T4 & T5 & T6
    end
    style T1 fill:#e74c3c
    style T2 fill:#e74c3c
    style T3 fill:#e74c3c
    style T4 fill:#e74c3c
    style T5 fill:#e74c3c
    style T6 fill:#e74c3c

Complexity: 6 tokens × 6 tokens = 36 comparisons

General case: n tokens → n² comparisons
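To make the gap concrete, here is a small back-of-envelope script (illustrative only: it counts raw pairwise comparisons versus per-token state updates and ignores constant factors):

# Count attention comparisons (n²) vs. SSM state updates (n) at several context lengths.
for n in [1_000, 10_000, 100_000, 1_000_000]:
    attention_ops = n * n   # every token compared against every other token
    ssm_ops = n             # one state update per token
    print(f"{n:>9,} tokens: attention ~{attention_ops:.1e} comparisons, "
          f"SSM ~{ssm_ops:.1e} updates ({attention_ops // ssm_ops:,}x ratio)")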

The Scaling Crisis

graph LR
    A["Context: 1K<br/>1M ops<br/>Fast"] --> B["Context: 10K<br/>100M ops<br/>Slow"]
    B --> C["Context: 100K<br/>10B ops<br/>Very Slow"]
    C --> D["Context: 1M<br/>1T ops<br/>Impossible"]
    style A fill:#2ecc71
    style B fill:#f39c12
    style C fill:#e74c3c
    style D fill:#8e44ad

This is why we need alternatives to attention.

State Space Models: Linear-Time Sequences

What Are SSMs?

State Space Models are inspired by control theory. They maintain a hidden state that evolves over time, capturing information from the sequence.

Key Idea: Instead of comparing all tokens to each other, maintain a compressed state that summarizes the past.

SSM Formulation

h_t = A × h_{t-1} + B × x_t    (State update)
y_t = C × h_t + D × x_t         (Output)

Where:
- x_t: Input at time t
- h_t: Hidden state at time t
- y_t: Output at time t
- A, B, C, D: Learned parameters
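As a minimal sketch, the recurrence can be written in a few lines of Python (NumPy, scalar inputs, randomly chosen parameters standing in for learned ones):

import numpy as np

def ssm_scan(x, A, B, C, D):
    """Run h_t = A·h_{t-1} + B·x_t, y_t = C·h_t + D·x_t over a sequence of scalars."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                   # one pass over the sequence: O(n)
        h = A @ h + B * x_t         # state update
        ys.append(C @ h + D * x_t)  # output
    return np.array(ys)

# Toy usage: length-10 input, 4-dimensional hidden state.
x = np.random.randn(10)
A, B, C, D = 0.9 * np.eye(4), np.ones(4), np.ones(4) / 4, 0.0
y = ssm_scan(x, A, B, C, D)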

Visual Comparison

graph TB
    subgraph "Transformer: All-to-All"
        A1[x1] & A2[x2] & A3[x3] & A4[x4] & A5[x5] -.-> O1[y1]
        A1 & A2 & A3 & A4 & A5 -.-> O2[y2]
        A1 & A2 & A3 & A4 & A5 -.-> O3[y3]
        A1 & A2 & A3 & A4 & A5 -.-> O4[y4]
        A1 & A2 & A3 & A4 & A5 -.-> O5[y5]
    end
    subgraph "SSM: Sequential State"
        B1[x1] --> H1[h1] --> C1[y1]
        B2[x2] --> H2[h2] --> C2[y2]
        B3[x3] --> H3[h3] --> C3[y3]
        B4[x4] --> H4[h4] --> C4[y4]
        B5[x5] --> H5[h5] --> C5[y5]
        H1 --> H2
        H2 --> H3
        H3 --> H4
        H4 --> H5
    end
    style O1 fill:#e74c3c
    style O2 fill:#e74c3c
    style O3 fill:#e74c3c
    style C1 fill:#2ecc71
    style C2 fill:#2ecc71
    style C3 fill:#2ecc71

Transformer: Every output depends on all inputs (quadratic).
SSM: Each output depends on the current input and the previous state (linear).

Complexity Comparison

graph LR
    A[Sequence Length] --> B["Transformer<br/>O(n²)"]
    A --> C["SSM / Mamba<br/>O(n)"]
    style B fill:#e74c3c
    style C fill:#2ecc71

Context Length    Transformer    SSM (Mamba)    Speedup
1K                1M ops         1K ops         1,000×
10K               100M ops       10K ops        10,000×
100K              10B ops        100K ops       100,000×
1M                1T ops         1M ops         1,000,000×

Conclusion: SSMs scale linearly, making million-token contexts feasible!

Mamba: The Modern SSM

What Makes Mamba Special?

Traditional SSMs (like S4) use fixed parameters A, B, C. Mamba makes them input-dependent:

A_t = f_A(x_t)
B_t = f_B(x_t)
C_t = f_C(x_t)

h_t = A_t × h_{t-1} + B_t × x_t
y_t = C_t × h_t + D × x_t

(Strictly speaking, Mamba keeps A itself fixed and makes the discretization step Δ_t, along with B_t and C_t, functions of the input; the effective transition Ā_t = exp(Δ_t × A) then varies per token. The simplified notation above captures that net effect.)

This selective state space allows Mamba to:

  • Focus on relevant information
  • Ignore irrelevant context
  • Adapt to different inputs
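A minimal sketch of this selection mechanism, following the stricter formulation above (A fixed and diagonal, with the step size Δ_t and the projections B_t, C_t computed from each input token). The weights w_delta, W_B, W_C are hypothetical placeholders, not Mamba's actual parameterization:

import numpy as np

def selective_scan(x, A, w_delta, W_B, W_C):
    """Selective scan over a scalar input channel; A is fixed (diagonal), Δ_t, B_t, C_t depend on x_t."""
    d_state = A.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:
        delta_t = np.log1p(np.exp(w_delta * x_t))      # softplus keeps the step size positive
        B_t = W_B * x_t                                # input-dependent input projection
        C_t = W_C * x_t                                # input-dependent readout
        A_bar = np.diag(np.exp(delta_t * np.diag(A)))  # effective transition varies per token
        h = A_bar @ h + delta_t * B_t * x_t            # selective state update
        ys.append(C_t @ h)
    return np.array(ys)

# Toy usage: 4-dimensional state, decaying dynamics.
x = np.random.randn(12)
A = -np.eye(4)
y = selective_scan(x, A, w_delta=0.5, W_B=np.ones(4), W_C=np.ones(4) / 4)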

Mamba Architecture

graph TB
    A["Input x_t<br/>Dimension: d"] --> B["Linear Projection<br/>→ 2d"]
    B --> C1["Branch 1:<br/>Pass-through<br/>Dimension: d"]
    B --> C2["Branch 2:<br/>SSM Processing<br/>Dimension: d"]
    C2 --> D["Selective Scan<br/>Compute A, B, C<br/>from input"]
    D --> E["State Update<br/>h_t = A×h_{t-1} + B×x_t"]
    E --> F["Output Projection<br/>y_t = C×h_t"]
    C1 & F --> G["Element-wise<br/>Multiplication ⊙"]
    G --> H["Output<br/>Dimension: d"]
    style D fill:#f39c12
    style E fill:#2ecc71
    style G fill:#9b59b6
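A simplified sketch of this block structure (up-projection, a per-channel SSM branch, a gating branch, elementwise multiplication, down-projection). The real Mamba block also includes a depthwise convolution and SiLU activations, and its SSM is the selective scan above; everything here is an illustrative placeholder:

import numpy as np

def ssm_channel(x_c, a=0.9, b=1.0, c=1.0):
    """Scalar-state scan h_t = a·h_{t-1} + b·x_t, y_t = c·h_t for a single channel."""
    h, ys = 0.0, []
    for x_t in x_c:
        h = a * h + b * x_t
        ys.append(c * h)
    return np.array(ys)

def mamba_block(x, W_in, W_out):
    """x: (seq_len, d). W_in: (2d, d). W_out: (d, d)."""
    u = x @ W_in.T                                   # project up to 2d
    d = x.shape[1]
    ssm_in, gate = u[:, :d], u[:, d:]                # SSM branch and gating branch
    ssm_out = np.stack([ssm_channel(ssm_in[:, i]) for i in range(d)], axis=1)
    gated = ssm_out * (gate / (1 + np.abs(gate)))    # soft gate standing in for SiLU
    return gated @ W_out.T                           # project back down to d

# Toy usage
d_model, seq_len = 8, 16
x = np.random.randn(seq_len, d_model)
out = mamba_block(x, np.random.randn(2 * d_model, d_model), np.random.randn(d_model, d_model))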

Selective State Updates

sequenceDiagram
    participant X as Input Token
    participant P as Parameter Network
    participant S as State h_t
    participant O as Output
    X->>P: Compute selection params
    P->>P: A_t = f_A(x_t)
    P->>P: B_t = f_B(x_t)
    P->>P: C_t = f_C(x_t)
    P->>S: Update state:<br/>h_t = A_t × h_{t-1} + B_t × x_t
    S->>O: y_t = C_t × h_t
    Note over P: Parameters adapt<br/>to each input!
    style P fill:#f39c12
    style S fill:#2ecc71

Why Not Just Use Mamba?

If Mamba is linear and Transformers are quadratic, why not replace Transformers entirely?

Strengths of Transformers

1. In-Context Learning: Attention excels at using examples in-context
2. Copying: Directly attending to and copying previous tokens
3. Associative Recall: Looking up information by key
4. Parallel Training: All tokens processed simultaneously

Strengths of Mamba

1. Long-Range Dependencies: Linear scaling to million+ tokens
2. Efficient Inference: No KV cache needed
3. Memory Efficiency: Constant state size regardless of context length
4. Fast Sequential Processing: Natural for autoregressive generation

The Hybrid Solution

Combine both: use each where it excels!

graph LR
    A[Hybrid Model] --> B["Transformer Layers<br/>In-context learning<br/>Associative recall"]
    A --> C["Mamba Layers<br/>Long-range context<br/>Efficient processing"]
    style B fill:#3498db
    style C fill:#2ecc71

Hybrid Architecture Patterns

Pattern 1: Interleaved Layers

Alternate between Transformer and Mamba layers:

graph TB
    A[Input Embeddings] --> B[Mamba Layer 1]
    B --> C[Transformer Layer 1]
    C --> D[Mamba Layer 2]
    D --> E[Transformer Layer 2]
    E --> F[Mamba Layer 3]
    F --> G[Transformer Layer 3]
    G --> H[Output Head]
    style C fill:#3498db
    style E fill:#3498db
    style G fill:#3498db
    style B fill:#2ecc71
    style D fill:#2ecc71
    style F fill:#2ecc71

Example: Jamba (AI21 Labs)

  • 32 layers total
  • Pattern: seven Mamba layers, then one attention layer, repeated
  • Ratio: 7:1 (Mamba:Attention)
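As a toy illustration of this interleaving pattern (not AI21's actual code), a Jamba-style layer layout can be written as a simple list of layer types:

def build_layer_pattern(n_layers: int, mamba_per_attention: int = 7) -> list[str]:
    """Place one attention layer after every `mamba_per_attention` Mamba layers."""
    period = mamba_per_attention + 1
    return ["attention" if (i + 1) % period == 0 else "mamba" for i in range(n_layers)]

print(build_layer_pattern(8))
# ['mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'attention']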

Pattern 2: Task-Specific Placement

graph TB
    subgraph "Local Processing (Mamba)"
        A[Token Embeddings] --> B["Mamba × 8<br/>Local context"]
    end
    subgraph "Global Reasoning (Transformer)"
        B --> C["Transformer × 4<br/>Global attention"]
    end
    subgraph "Output Processing (Mamba)"
        C --> D["Mamba × 4<br/>Sequential generation"]
    end
    D --> E[Output]
    style B fill:#2ecc71
    style C fill:#3498db
    style D fill:#2ecc71

Design Principle:

  • Bottom layers (Mamba): Efficient local feature extraction
  • Middle layers (Transformer): Global reasoning and in-context learning
  • Top layers (Mamba): Fast sequential output generation

Pattern 3: Mixture-of-Depths

Dynamically route tokens to either Transformer or Mamba:

graph TB
    A[Input Token] --> B{Router}
    B -->|Needs attention| C["Transformer Path<br/>Expensive but powerful"]
    B -->|Sequential processing| D["Mamba Path<br/>Efficient"]
    C --> E[Output]
    D --> E
    style B fill:#f39c12
    style C fill:#3498db
    style D fill:#2ecc71

Idea: Not all tokens need attention. Route simple tokens through Mamba, complex tokens through Transformer.
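A hedged sketch of that routing idea: a tiny learned scorer decides, per token, whether to take the attention path or the cheaper Mamba path. The router weights, threshold, and the two path functions are hypothetical placeholders:

import numpy as np

def route_tokens(hidden, w_router, threshold=0.0):
    """hidden: (seq_len, d). Returns a boolean mask; True = send to the attention path."""
    scores = hidden @ w_router            # one scalar routing score per token
    return scores > threshold

def mixture_of_paths(hidden, w_router, attention_path, mamba_path):
    use_attention = route_tokens(hidden, w_router)
    out = np.empty_like(hidden)
    out[use_attention] = attention_path(hidden[use_attention])    # expensive path
    out[~use_attention] = mamba_path(hidden[~use_attention])      # cheap path
    return out

# Toy usage with identity "paths", just to show the plumbing.
hidden = np.random.randn(16, 8)
out = mixture_of_paths(hidden, np.random.randn(8), lambda h: h, lambda h: h)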

State Management in Mamba

The Hidden State

Unlike Transformers that maintain a KV cache, Mamba maintains a hidden state:

graph LR
    A["h0<br/>Initial state<br/>d_state dims"] -->|Input: x1| B["h1<br/>Updated state<br/>d_state dims"]
    B -->|Input: x2| C["h2<br/>Updated state<br/>d_state dims"]
    C -->|Input: x3| D["h3<br/>Updated state<br/>d_state dims"]
    style A fill:#95a5a6
    style B fill:#2ecc71
    style C fill:#2ecc71
    style D fill:#2ecc71

Memory: Fixed size regardless of sequence length!

Transformer KV cache: O(n × d_model)        (grows with context)
Mamba state:          O(d_model × d_state)  (fixed, independent of context)

For 100K context:
Transformer: ~50 GB
Mamba:       ~50 MB (constant!)
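The numbers above can be reproduced with a back-of-envelope calculation. The configuration below (32 layers, d_model = 4096, d_state = 128, fp16) is an illustrative assumption; exact figures depend on the model, but the scaling behaviour does not:

N_LAYERS, D_MODEL, D_STATE, BYTES = 32, 4096, 128, 2   # assumed config, fp16

def kv_cache_gb(n_tokens):
    # keys + values, for every layer and every cached token
    return n_tokens * N_LAYERS * 2 * D_MODEL * BYTES / 1e9

def mamba_state_gb():
    # one fixed-size state per layer, independent of context length
    return N_LAYERS * D_MODEL * D_STATE * BYTES / 1e9

for n in [1_000, 10_000, 100_000]:
    print(f"{n:>7,} tokens   KV cache: {kv_cache_gb(n):6.2f} GB   Mamba state: {mamba_state_gb():.2f} GB")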

State Update Visualization

sequenceDiagram
    participant T1 as Token 1
    participant T2 as Token 2
    participant T3 as Token 3
    participant H as Hidden State
    Note over H: h0 = [0, 0, ..., 0]
    T1->>H: Process "The"
    H->>H: h1 = A×h0 + B×embed("The")
    T2->>H: Process "cat"
    H->>H: h2 = A×h1 + B×embed("cat")
    Note over H: State now encodes<br/>"The cat"
    T3->>H: Process "sat"
    H->>H: h3 = A×h2 + B×embed("sat")
    Note over H: State now encodes<br/>"The cat sat"
    style H fill:#2ecc71

The state acts as a lossy compression of all previous tokens.

Memory Comparison: Transformer vs. Mamba vs. Hybrid

graph TB
    subgraph "100K Token Context"
        A["Transformer Only<br/>32 layers"]
        B["Hybrid 50/50<br/>16 Transformer + 16 Mamba"]
        C["Mamba Only<br/>32 layers"]
    end
    A --> D["KV Cache: 52 GB<br/>State: 0 GB<br/>Total: 52 GB"]
    B --> E["KV Cache: 26 GB<br/>State: 0.05 GB<br/>Total: 26 GB"]
    C --> F["KV Cache: 0 GB<br/>State: 0.05 GB<br/>Total: 0.05 GB"]
    style D fill:#e74c3c
    style E fill:#f39c12
    style F fill:#2ecc71

Hybrid models offer a sweet spot: better than pure Transformers on memory, better than pure Mamba on quality.

Training Dynamics

Transformers: Parallel Training

All tokens processed simultaneously:

graph TB
    A["Sequence:<br/>The cat sat on mat"] --> B[Parallel Processing]
    B --> C1[Token 1: The]
    B --> C2[Token 2: cat]
    B --> C3[Token 3: sat]
    B --> C4[Token 4: on]
    B --> C5[Token 5: mat]
    C1 & C2 & C3 & C4 & C5 --> D[Backpropagation]
    style B fill:#3498db

Advantage: Highly parallelizable, fast training

Mamba: Parallel Training via Convolutions and Scans

Despite being recurrent at inference, SSMs can still be trained in parallel. A time-invariant SSM (fixed A, B, C) can be rewritten as a convolution over the whole sequence; Mamba's input-dependent parameters break that equivalence, so it uses a hardware-aware parallel scan to the same effect:

graph TB
    A[Sequence Input] --> B["Rewrite as<br/>Convolution or Scan"]
    B --> C["Parallel Computation<br/>on GPUs"]
    C --> D["Equivalent to<br/>Sequential Processing"]
    style C fill:#2ecc71

This allows Mamba to train as fast as Transformers!
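For the time-invariant case, unrolling the recurrence gives y_t = Σ_k C·A^k·B·x_{t−k}, i.e., a causal convolution with kernel K = (CB, CAB, CA²B, …). The toy check below verifies that the recurrent and convolutional views produce identical outputs (this is the S4-style trick; Mamba's selective parameters rule out a fixed kernel, so it relies on a parallel scan instead):

import numpy as np

rng = np.random.default_rng(0)
d_state, seq_len = 4, 8
A = 0.9 * np.eye(d_state)
B = rng.standard_normal(d_state)
C = rng.standard_normal(d_state)
x = rng.standard_normal(seq_len)

# 1) Sequential recurrence: h_t = A·h_{t-1} + B·x_t,  y_t = C·h_t
h, y_rec = np.zeros(d_state), []
for x_t in x:
    h = A @ h + B * x_t
    y_rec.append(C @ h)
y_rec = np.array(y_rec)

# 2) Equivalent causal convolution with the unrolled kernel K_k = C·A^k·B
K = np.array([C @ np.linalg.matrix_power(A, k) @ B for k in range(seq_len)])
y_conv = np.array([K[: t + 1][::-1] @ x[: t + 1] for t in range(seq_len)])

assert np.allclose(y_rec, y_conv)   # same outputs, two very different compute patterns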

Real-World Hybrid Models

Jamba (AI21 Labs)

graph TB
    A[Jamba Architecture] --> B[52B parameters]
    B --> C["Mamba Layers<br/>7 of every 8 layers"]
    B --> D["Attention Layers<br/>1 of every 8 layers"]
    B --> E["MoE Feed-Forward<br/>16 experts per MoE layer"]
    C & D & E --> F["256K context window<br/>Single 80GB GPU"]
    style C fill:#2ecc71
    style D fill:#3498db
    style E fill:#f39c12

Key Features:

  • 52B parameters with MoE (only 12B active)
  • 256K context window
  • 7:1 ratio of Mamba to attention layers
  • Fits on single GPU due to efficient Mamba layers

Striped Hyena (Together AI)

graph LR
    A[Input] --> B[Hyena Block 1]
    B --> C[Attention Block 1]
    C --> D[Hyena Block 2]
    D --> E[Attention Block 2]
    E --> F[...]
    style B fill:#2ecc71
    style D fill:#2ecc71
    style C fill:#3498db
    style E fill:#3498db

Pattern: Strict alternation between SSM-style Hyena blocks and attention blocks. (Striped Hyena uses Hyena gated-convolution operators rather than Mamba, but follows the same hybrid recipe.)

Inference: Transformer vs. Mamba vs. Hybrid

Transformer Inference

sequenceDiagram
    participant I as Input
    participant K as KV Cache
    participant A as Attention
    participant O as Output
    I->>K: Append new token KV
    Note over K: KV cache grows:<br/>52 GB for 100K tokens
    K->>A: Attend to entire cache
    A->>O: Generate token
    Note over A: Cost per token: O(n)<br/>grows with context length
    style K fill:#e74c3c

Mamba Inference

sequenceDiagram
    participant I as Input
    participant S as State
    participant M as Mamba
    participant O as Output
    I->>S: Update state
    Note over S: State size: constant<br/>~50 MB regardless of context
    S->>M: Process with state
    M->>O: Generate token
    Note over M: Cost per token: O(1)<br/>constant regardless of context
    style S fill:#2ecc71

Hybrid Inference

graph TB
    A[New Token] --> B{Layer Type}
    B -->|Mamba Layer| C["Update State<br/>O(1)"]
    B -->|Transformer Layer| D["Update KV Cache<br/>Attend<br/>O(n)"]
    C --> E[Next Layer]
    D --> E
    E --> F{More Layers?}
    F -->|Yes| B
    F -->|No| G[Output Token]
    style C fill:#2ecc71
    style D fill:#e74c3c

Result: Fewer attention layers = smaller KV cache = lower memory and latency!
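A hedged sketch of one decoding step through a hybrid stack, reduced to the state/cache bookkeeping: Mamba layers update a fixed-size state in place, while attention layers append to a cache that grows with every generated token. The layer internals are placeholders, not a real implementation:

import numpy as np

d_model, d_state = 8, 4
layer_kinds = ["mamba", "mamba", "mamba", "attention"] * 2        # 3:1 hybrid, 8 layers

states   = [np.zeros(d_state) if k == "mamba" else None for k in layer_kinds]
kv_cache = [[] if k == "attention" else None for k in layer_kinds]

def decode_step(h):
    for i, kind in enumerate(layer_kinds):
        if kind == "mamba":
            states[i] = 0.9 * states[i] + h[:d_state]             # constant-size state update
            h = h + np.tile(states[i], d_model // d_state)        # placeholder mixing
        else:
            kv_cache[i].append(h.copy())                          # cache grows with every token
            h = h + np.mean(kv_cache[i], axis=0)                  # placeholder "attention" readout
    return h

h = np.random.randn(d_model)
for _ in range(5):                                                # decode 5 tokens
    h = decode_step(h)

print(len(kv_cache[3]), states[0].shape)   # the cache has grown to 5 entries; the state stays (4,)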

Performance Benchmarks

Latency (Time to First Token)

graph LR
    A["Transformer<br/>100K context<br/>5.0s"] --> B["Hybrid 50/50<br/>100K context<br/>2.5s"]
    B --> C["Mamba<br/>100K context<br/>0.5s"]
    style A fill:#e74c3c
    style B fill:#f39c12
    style C fill:#2ecc71

Throughput (Tokens/Second)

graph LR
    A["Transformer<br/>10 tok/s"] --> B["Hybrid<br/>30 tok/s"]
    B --> C["Mamba<br/>100 tok/s"]
    style A fill:#e74c3c
    style B fill:#f39c12
    style C fill:#2ecc71

Quality (Benchmark Accuracy)

graph LR
    A["Transformer<br/>90% accuracy"] --> B["Hybrid<br/>88% accuracy"]
    B --> C["Mamba<br/>85% accuracy"]
    style A fill:#2ecc71
    style B fill:#f39c12
    style C fill:#e74c3c

Insight: Hybrids offer the best balance of speed, memory, and quality.

Design Considerations

How Many Transformer vs. Mamba Layers?

graph TB
    A{Task Requirements} --> B["Heavy In-Context<br/>Learning?"]
    A --> C["Long Context<br/>Efficiency?"]
    A --> D[Balanced?]
    B --> E["More Transformer<br/>70% Attention<br/>30% Mamba"]
    C --> F["More Mamba<br/>20% Attention<br/>80% Mamba"]
    D --> G["Balanced Hybrid<br/>50% Attention<br/>50% Mamba"]
    style E fill:#3498db
    style F fill:#2ecc71
    style G fill:#9b59b6

Layer Placement Strategy

Strategy 1: Uniform Interleaving
[M, A, M, A, M, A, ...]

Strategy 2: Blocked
[M, M, M, A, M, M, M, A, ...]

Strategy 3: Task-Specific
[M, M, M, M, A, A, A, A, M, M, M, M]
 └─ Local ─┘ └─ Global ─┘ └─ Output ─┘

Future Directions

1. Learned Layer Placement

Let the model learn optimal layer placement:

graph TB
    A[Training] --> B{Per-layer<br/>Decision}
    B -->|Layer 1| C[Mamba]
    B -->|Layer 2| D[Transformer]
    B -->|Layer 3| E[Mamba]
    C & D & E --> F[Evaluate Performance]
    F --> G{Optimize Layer<br/>Selection}
    G --> A
    style G fill:#f39c12

2. Adaptive Switching

Dynamically choose architecture based on input:

If input requires long-range dependencies:
    Use more Mamba layers
Else:
    Use more Transformer layers

3. Continuous State Space

Extend Mamba to continuous-time processing for irregular sequences (e.g., time-series with varying intervals).

Conclusion

The future of sequence modeling isn’t Transformers OR State Space Models; it’s Transformers AND State Space Models working together.

Hybrid architectures leverage the best of both worlds:

  • Transformers: In-context learning, associative recall, powerful reasoning
  • Mamba (SSMs): Linear scaling, efficient long-context, constant memory

By combining them strategically, we can build models that:

  • Process million-token contexts efficiently
  • Maintain strong performance on complex reasoning tasks
  • Run on reasonable hardware
  • Achieve better quality/cost trade-offs

The next generation of foundation models will almost certainly be hybrid. Understanding both paradigms and how to combine them is essential for building state-of-the-art AI systems.

Key Takeaways

  • Transformer Bottleneck: O(n²) attention limits long contexts
  • Mamba Advantage: O(n) processing with linear scaling
  • Hybrid Solution: Combine both architectures strategically
  • Memory Efficiency: Mamba uses constant state vs. growing KV cache
  • Design Patterns: Interleaved, task-specific, or mixture-of-depths
  • Real-World Models: Jamba, Striped Hyena demonstrate viability
  • Trade-offs: Speed/memory vs. quality; hybrids offer the best balance

The era of pure Transformer models is ending. Hybrid architectures are the future.

Further Reading