Introduction: The Quadratic Bottleneck

Transformers revolutionized AI, but they have a fundamental flaw: quadratic scaling.

Processing a sequence of length n requires O(n²) operations due to self-attention. Every token attends to every other token, creating an all-to-all comparison:

Context length:     1K      10K     100K    1M
Operations:         1M      100M    10B     1T
Time (relative):    1×      100×    10,000× 1,000,000×

This makes long-context processing prohibitively expensive.

Enter State Space Models (SSMs), specifically Mamba: a new architecture that processes sequences in linear time O(n) while maintaining long-range dependencies.

The future isn’t Transformers vs. SSMs; it’s Transformers + SSMs working together in hybrid architectures.

The Core Problem: Attention Complexity

Self-Attention: All-to-All Communication

graph TB
    subgraph "Transformer: O(n²) Complexity"
        T1[Token 1] -.-> T1 & T2 & T3 & T4 & T5 & T6
        T2[Token 2] -.-> T1 & T2 & T3 & T4 & T5 & T6
        T3[Token 3] -.-> T1 & T2 & T3 & T4 & T5 & T6
        T4[Token 4] -.-> T1 & T2 & T3 & T4 & T5 & T6
        T5[Token 5] -.-> T1 & T2 & T3 & T4 & T5 & T6
        T6[Token 6] -.-> T1 & T2 & T3 & T4 & T5 & T6
    end
    style T1 fill:#e74c3c
    style T2 fill:#e74c3c
    style T3 fill:#e74c3c
    style T4 fill:#e74c3c
    style T5 fill:#e74c3c
    style T6 fill:#e74c3c

Complexity: 6 tokens × 6 tokens = 36 comparisons

General case: n tokens → n² comparisons
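To make the gap concrete, here is a small back-of-envelope script (illustrative only: it counts raw pairwise comparisons versus per-token state updates and ignores constant factors):

# Count attention comparisons (n²) vs. SSM state updates (n) at several context lengths.
for n in [1_000, 10_000, 100_000, 1_000_000]:
    attention_ops = n * n   # every token compared against every other token
    ssm_ops = n             # one state update per token
    print(f"{n:>9,} tokens: attention ~{attention_ops:.1e} comparisons, "
          f"SSM ~{ssm_ops:.1e} updates ({attention_ops // ssm_ops:,}x ratio)")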

The Scaling Crisis

graph LR
    A["Context: 1K<br/>1M ops<br/>Fast"] --> B["Context: 10K<br/>100M ops<br/>Slow"]
    B --> C["Context: 100K<br/>10B ops<br/>Very Slow"]
    C --> D["Context: 1M<br/>1T ops<br/>Impossible"]
    style A fill:#2ecc71
    style B fill:#f39c12
    style C fill:#e74c3c
    style D fill:#8e44ad

This is why we need alternatives to attention.

State Space Models: Linear-Time Sequences

What Are SSMs?

State Space Models are inspired by control theory. They maintain a hidden state that evolves over time, capturing information from the sequence.

Key Idea: Instead of comparing all tokens to each other, maintain a compressed state that summarizes the past.

SSM Formulation

h_t = A × h_{t-1} + B × x_t    (State update)
y_t = C × h_t + D × x_t         (Output)

Where:
- x_t: Input at time t
- h_t: Hidden state at time t
- y_t: Output at time t
- A, B, C, D: Learned parameters
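As a minimal sketch, the recurrence can be written in a few lines of Python (NumPy, scalar inputs, randomly chosen parameters standing in for learned ones):

import numpy as np

def ssm_scan(x, A, B, C, D):
    """Run h_t = A·h_{t-1} + B·x_t, y_t = C·h_t + D·x_t over a sequence of scalars."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                   # one pass over the sequence: O(n)
        h = A @ h + B * x_t         # state update
        ys.append(C @ h + D * x_t)  # output
    return np.array(ys)

# Toy usage: length-10 input, 4-dimensional hidden state.
x = np.random.randn(10)
A, B, C, D = 0.9 * np.eye(4), np.ones(4), np.ones(4) / 4, 0.0
y = ssm_scan(x, A, B, C, D)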

Visual Comparison

graph TB
    subgraph "Transformer: All-to-All"
        A1[x1] & A2[x2] & A3[x3] & A4[x4] & A5[x5] -.-> O1[y1]
        A1 & A2 & A3 & A4 & A5 -.-> O2[y2]
        A1 & A2 & A3 & A4 & A5 -.-> O3[y3]
        A1 & A2 & A3 & A4 & A5 -.-> O4[y4]
        A1 & A2 & A3 & A4 & A5 -.-> O5[y5]
    end
    subgraph "SSM: Sequential State"
        B1[x1] --> H1[h1] --> C1[y1]
        B2[x2] --> H2[h2] --> C2[y2]
        B3[x3] --> H3[h3] --> C3[y3]
        B4[x4] --> H4[h4] --> C4[y4]
        B5[x5] --> H5[h5] --> C5[y5]
        H1 --> H2
        H2 --> H3
        H3 --> H4
        H4 --> H5
    end
    style O1 fill:#e74c3c
    style O2 fill:#e74c3c
    style O3 fill:#e74c3c
    style C1 fill:#2ecc71
    style C2 fill:#2ecc71
    style C3 fill:#2ecc71

Transformer: Every output depends on all inputs (quadratic).
SSM: Each output depends on the current input and the previous state (linear).

Complexity Comparison

graph LR
    A[Sequence Length] --> B["Transformer<br/>O(n²)"]
    A --> C["SSM / Mamba<br/>O(n)"]
    style B fill:#e74c3c
    style C fill:#2ecc71

Context Length    Transformer    SSM (Mamba)    Speedup
1K                1M ops         1K ops         1,000×
10K               100M ops       10K ops        10,000×
100K              10B ops        100K ops       100,000×
1M                1T ops         1M ops         1,000,000×

Conclusion: SSMs scale linearly, making million-token contexts feasible!

Mamba: The Modern SSM

What Makes Mamba Special?

Traditional SSMs (like S4) use fixed parameters A, B, C. Mamba makes them input-dependent:

A_t = f_A(x_t)
B_t = f_B(x_t)
C_t = f_C(x_t)

h_t = A_t × h_{t-1} + B_t × x_t
y_t = C_t × h_t + D × x_t

(Strictly speaking, Mamba keeps A itself fixed and makes the discretization step Δ_t, along with B_t and C_t, functions of the input; the effective transition Ā_t = exp(Δ_t × A) then varies per token. The simplified notation above captures that net effect.)

This selective state space allows Mamba to:

  • Focus on relevant information
  • Ignore irrelevant context
  • Adapt to different inputs
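A minimal sketch of this selection mechanism, following the stricter formulation above (A fixed and diagonal, with the step size Δ_t and the projections B_t, C_t computed from each input token). The weights w_delta, W_B, W_C are hypothetical placeholders, not Mamba's actual parameterization:

import numpy as np

def selective_scan(x, A, w_delta, W_B, W_C):
    """Selective scan over a scalar input channel; A is fixed (diagonal), Δ_t, B_t, C_t depend on x_t."""
    d_state = A.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:
        delta_t = np.log1p(np.exp(w_delta * x_t))      # softplus keeps the step size positive
        B_t = W_B * x_t                                # input-dependent input projection
        C_t = W_C * x_t                                # input-dependent readout
        A_bar = np.diag(np.exp(delta_t * np.diag(A)))  # effective transition varies per token
        h = A_bar @ h + delta_t * B_t * x_t            # selective state update
        ys.append(C_t @ h)
    return np.array(ys)

# Toy usage: 4-dimensional state, decaying dynamics.
x = np.random.randn(12)
A = -np.eye(4)
y = selective_scan(x, A, w_delta=0.5, W_B=np.ones(4), W_C=np.ones(4) / 4)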

Mamba Architecture

graph TB
    A["Input x_t<br/>Dimension: d"] --> B["Linear Projection<br/>→ 2d"]
    B --> C1["Branch 1:<br/>Pass-through<br/>Dimension: d"]
    B --> C2["Branch 2:<br/>SSM Processing<br/>Dimension: d"]
    C2 --> D["Selective Scan<br/>Compute A, B, C<br/>from input"]
    D --> E["State Update<br/>h_t = A×h_{t-1} + B×x_t"]
    E --> F["Output Projection<br/>y_t = C×h_t"]
    C1 & F --> G["Element-wise<br/>Multiplication ⊙"]
    G --> H["Output<br/>Dimension: d"]
    style D fill:#f39c12
    style E fill:#2ecc71
    style G fill:#9b59b6
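A simplified sketch of this block structure (up-projection, a per-channel SSM branch, a gating branch, elementwise multiplication, down-projection). The real Mamba block also includes a depthwise convolution and SiLU activations, and its SSM is the selective scan above; everything here is an illustrative placeholder:

import numpy as np

def ssm_channel(x_c, a=0.9, b=1.0, c=1.0):
    """Scalar-state scan h_t = a·h_{t-1} + b·x_t, y_t = c·h_t for a single channel."""
    h, ys = 0.0, []
    for x_t in x_c:
        h = a * h + b * x_t
        ys.append(c * h)
    return np.array(ys)

def mamba_block(x, W_in, W_out):
    """x: (seq_len, d). W_in: (2d, d). W_out: (d, d)."""
    u = x @ W_in.T                                   # project up to 2d
    d = x.shape[1]
    ssm_in, gate = u[:, :d], u[:, d:]                # SSM branch and gating branch
    ssm_out = np.stack([ssm_channel(ssm_in[:, i]) for i in range(d)], axis=1)
    gated = ssm_out * (gate / (1 + np.abs(gate)))    # soft gate standing in for SiLU
    return gated @ W_out.T                           # project back down to d

# Toy usage
d_model, seq_len = 8, 16
x = np.random.randn(seq_len, d_model)
out = mamba_block(x, np.random.randn(2 * d_model, d_model), np.random.randn(d_model, d_model))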

Selective State Updates

sequenceDiagram
    participant X as Input Token
    participant P as Parameter Network
    participant S as State h_t
    participant O as Output
    X->>P: Compute selection params
    P->>P: A_t = f_A(x_t)
    P->>P: B_t = f_B(x_t)
    P->>P: C_t = f_C(x_t)
    P->>S: Update state:<br/>h_t = A_t × h_{t-1} + B_t × x_t
    S->>O: y_t = C_t × h_t
    Note over P: Parameters adapt<br/>to each input!
    style P fill:#f39c12
    style S fill:#2ecc71

Why Not Just Use Mamba?

If Mamba is linear and Transformers are quadratic, why not replace Transformers entirely?

Strengths of Transformers

1. In-Context Learning: Attention excels at using examples in-context
2. Copying: Directly attending to and copying previous tokens
3. Associative Recall: Looking up information by key
4. Parallel Training: All tokens processed simultaneously

Strengths of Mamba

1. Long-Range Dependencies: Linear scaling to million+ tokens
2. Efficient Inference: No KV cache needed
3. Memory Efficiency: Constant state size regardless of context length
4. Fast Sequential Processing: Natural for autoregressive generation

The Hybrid Solution

Combine both: use each where it excels!

graph LR
    A[Hybrid Model] --> B["Transformer Layers<br/>In-context learning<br/>Associative recall"]
    A --> C["Mamba Layers<br/>Long-range context<br/>Efficient processing"]
    style B fill:#3498db
    style C fill:#2ecc71

Hybrid Architecture Patterns

Pattern 1: Interleaved Layers

Alternate between Transformer and Mamba layers:

graph TB
    A[Input Embeddings] --> B[Mamba Layer 1]
    B --> C[Transformer Layer 1]
    C --> D[Mamba Layer 2]
    D --> E[Transformer Layer 2]
    E --> F[Mamba Layer 3]
    F --> G[Transformer Layer 3]
    G --> H[Output Head]
    style C fill:#3498db
    style E fill:#3498db
    style G fill:#3498db
    style B fill:#2ecc71
    style D fill:#2ecc71
    style F fill:#2ecc71

Example: Jamba (AI21 Labs)

  • 32 layers total
  • Pattern: seven Mamba layers, then one attention layer, repeated
  • Ratio: 7:1 (Mamba:Attention)
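As a toy illustration of this interleaving pattern (not AI21's actual code), a Jamba-style layer layout can be written as a simple list of layer types:

def build_layer_pattern(n_layers: int, mamba_per_attention: int = 7) -> list[str]:
    """Place one attention layer after every `mamba_per_attention` Mamba layers."""
    period = mamba_per_attention + 1
    return ["attention" if (i + 1) % period == 0 else "mamba" for i in range(n_layers)]

print(build_layer_pattern(8))
# ['mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'attention']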

Pattern 2: Task-Specific Placement

graph TB
    subgraph "Local Processing (Mamba)"
        A[Token Embeddings] --> B["Mamba × 8<br/>Local context"]
    end
    subgraph "Global Reasoning (Transformer)"
        B --> C["Transformer × 4<br/>Global attention"]
    end
    subgraph "Output Processing (Mamba)"
        C --> D["Mamba × 4<br/>Sequential generation"]
    end
    D --> E[Output]
    style B fill:#2ecc71
    style C fill:#3498db
    style D fill:#2ecc71

Design Principle:

  • Bottom layers (Mamba): Efficient local feature extraction
  • Middle layers (Transformer): Global reasoning and in-context learning
  • Top layers (Mamba): Fast sequential output generation

Pattern 3: Mixture-of-Depths

Dynamically route tokens to either Transformer or Mamba:

graph TB
    A[Input Token] --> B{Router}
    B -->|Needs attention| C["Transformer Path<br/>Expensive but powerful"]
    B -->|Sequential processing| D["Mamba Path<br/>Efficient"]
    C --> E[Output]
    D --> E
    style B fill:#f39c12
    style C fill:#3498db
    style D fill:#2ecc71

Idea: Not all tokens need attention. Route simple tokens through Mamba, complex tokens through Transformer.
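A hedged sketch of that routing idea: a tiny learned scorer decides, per token, whether to take the attention path or the cheaper Mamba path. The router weights, threshold, and the two path functions are hypothetical placeholders:

import numpy as np

def route_tokens(hidden, w_router, threshold=0.0):
    """hidden: (seq_len, d). Returns a boolean mask; True = send to the attention path."""
    scores = hidden @ w_router            # one scalar routing score per token
    return scores > threshold

def mixture_of_paths(hidden, w_router, attention_path, mamba_path):
    use_attention = route_tokens(hidden, w_router)
    out = np.empty_like(hidden)
    out[use_attention] = attention_path(hidden[use_attention])    # expensive path
    out[~use_attention] = mamba_path(hidden[~use_attention])      # cheap path
    return out

# Toy usage with identity "paths", just to show the plumbing.
hidden = np.random.randn(16, 8)
out = mixture_of_paths(hidden, np.random.randn(8), lambda h: h, lambda h: h)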

State Management in Mamba

The Hidden State

Unlike Transformers that maintain a KV cache, Mamba maintains a hidden state:

graph LR
    A["h0<br/>Initial state<br/>d_state dims"] -->|Input: x1| B["h1<br/>Updated state<br/>d_state dims"]
    B -->|Input: x2| C["h2<br/>Updated state<br/>d_state dims"]
    C -->|Input: x3| D["h3<br/>Updated state<br/>d_state dims"]
    style A fill:#95a5a6
    style B fill:#2ecc71
    style C fill:#2ecc71
    style D fill:#2ecc71

Memory: Fixed size regardless of sequence length!

Transformer KV cache: O(n × d_model)        (grows with context)
Mamba state:          O(d_model × d_state)  (fixed, independent of context)

For 100K context:
Transformer: ~50 GB
Mamba:       ~50 MB (constant!)
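The numbers above can be reproduced with a back-of-envelope calculation. The configuration below (32 layers, d_model = 4096, d_state = 128, fp16) is an illustrative assumption; exact figures depend on the model, but the scaling behaviour does not:

N_LAYERS, D_MODEL, D_STATE, BYTES = 32, 4096, 128, 2   # assumed config, fp16

def kv_cache_gb(n_tokens):
    # keys + values, for every layer and every cached token
    return n_tokens * N_LAYERS * 2 * D_MODEL * BYTES / 1e9

def mamba_state_gb():
    # one fixed-size state per layer, independent of context length
    return N_LAYERS * D_MODEL * D_STATE * BYTES / 1e9

for n in [1_000, 10_000, 100_000]:
    print(f"{n:>7,} tokens   KV cache: {kv_cache_gb(n):6.2f} GB   Mamba state: {mamba_state_gb():.2f} GB")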

State Update Visualization

sequenceDiagram
    participant T1 as Token 1
    participant T2 as Token 2
    participant T3 as Token 3
    participant H as Hidden State
    Note over H: h0 = [0, 0, ..., 0]
    T1->>H: Process "The"
    H->>H: h1 = A×h0 + B×embed("The")
    T2->>H: Process "cat"
    H->>H: h2 = A×h1 + B×embed("cat")
    Note over H: State now encodes<br/>"The cat"
    T3->>H: Process "sat"
    H->>H: h3 = A×h2 + B×embed("sat")
    Note over H: State now encodes<br/>"The cat sat"
    style H fill:#2ecc71

The state acts as a lossy compression of all previous tokens.

Memory Comparison: Transformer vs. Mamba vs. Hybrid

graph TB
    subgraph "100K Token Context"
        A["Transformer Only<br/>32 layers"]
        B["Hybrid 50/50<br/>16 Transformer + 16 Mamba"]
        C["Mamba Only<br/>32 layers"]
    end
    A --> D["KV Cache: 52 GB<br/>State: 0 GB<br/>Total: 52 GB"]
    B --> E["KV Cache: 26 GB<br/>State: 0.05 GB<br/>Total: 26 GB"]
    C --> F["KV Cache: 0 GB<br/>State: 0.05 GB<br/>Total: 0.05 GB"]
    style D fill:#e74c3c
    style E fill:#f39c12
    style F fill:#2ecc71

Hybrid models offer a sweet spot: better than pure Transformers on memory, better than pure Mamba on quality.

Training Dynamics

Transformers: Parallel Training

All tokens processed simultaneously:

graph TB
    A["Sequence:<br/>The cat sat on mat"] --> B[Parallel Processing]
    B --> C1[Token 1: The]
    B --> C2[Token 2: cat]
    B --> C3[Token 3: sat]
    B --> C4[Token 4: on]
    B --> C5[Token 5: mat]
    C1 & C2 & C3 & C4 & C5 --> D[Backpropagation]
    style B fill:#3498db

Advantage: Highly parallelizable, fast training

Mamba: Parallel Training via Convolutions and Scans

Despite being recurrent at inference, SSMs can still be trained in parallel. A time-invariant SSM (fixed A, B, C) can be rewritten as a convolution over the whole sequence; Mamba's input-dependent parameters break that equivalence, so it uses a hardware-aware parallel scan to the same effect:

graph TB
    A[Sequence Input] --> B["Rewrite as<br/>Convolution or Scan"]
    B --> C["Parallel Computation<br/>on GPUs"]
    C --> D["Equivalent to<br/>Sequential Processing"]
    style C fill:#2ecc71

This allows Mamba to train as fast as Transformers!
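For the time-invariant case, unrolling the recurrence gives y_t = Σ_k C·A^k·B·x_{t−k}, i.e., a causal convolution with kernel K = (CB, CAB, CA²B, …). The toy check below verifies that the recurrent and convolutional views produce identical outputs (this is the S4-style trick; Mamba's selective parameters rule out a fixed kernel, so it relies on a parallel scan instead):

import numpy as np

rng = np.random.default_rng(0)
d_state, seq_len = 4, 8
A = 0.9 * np.eye(d_state)
B = rng.standard_normal(d_state)
C = rng.standard_normal(d_state)
x = rng.standard_normal(seq_len)

# 1) Sequential recurrence: h_t = A·h_{t-1} + B·x_t,  y_t = C·h_t
h, y_rec = np.zeros(d_state), []
for x_t in x:
    h = A @ h + B * x_t
    y_rec.append(C @ h)
y_rec = np.array(y_rec)

# 2) Equivalent causal convolution with the unrolled kernel K_k = C·A^k·B
K = np.array([C @ np.linalg.matrix_power(A, k) @ B for k in range(seq_len)])
y_conv = np.array([K[: t + 1][::-1] @ x[: t + 1] for t in range(seq_len)])

assert np.allclose(y_rec, y_conv)   # same outputs, two very different compute patterns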

Real-World Hybrid Models

Jamba (AI21 Labs)

graph TB
    A[Jamba Architecture] --> B[52B parameters]
    B --> C["Mamba Layers<br/>7 of every 8 layers"]
    B --> D["Attention Layers<br/>1 of every 8 layers"]
    B --> E["MoE Feed-Forward<br/>16 experts per MoE layer"]
    C & D & E --> F["256K context window<br/>Single 80GB GPU"]
    style C fill:#2ecc71
    style D fill:#3498db
    style E fill:#f39c12

Key Features:

  • 52B parameters with MoE (only 12B active)
  • 256K context window
  • 7:1 ratio of Mamba to attention layers
  • Fits on single GPU due to efficient Mamba layers

Striped Hyena (Together AI)

graph LR
    A[Input] --> B[Hyena Block 1]
    B --> C[Attention Block 1]
    C --> D[Hyena Block 2]
    D --> E[Attention Block 2]
    E --> F[...]
    style B fill:#2ecc71
    style D fill:#2ecc71
    style C fill:#3498db
    style E fill:#3498db

Pattern: Strict alternation between SSM-style Hyena blocks and attention blocks. (Striped Hyena uses Hyena gated-convolution operators rather than Mamba, but follows the same hybrid recipe.)

Inference: Transformer vs. Mamba vs. Hybrid

Transformer Inference

sequenceDiagram
    participant I as Input
    participant K as KV Cache
    participant A as Attention
    participant O as Output
    I->>K: Append new token KV
    Note over K: KV cache grows:<br/>52 GB for 100K tokens
    K->>A: Attend to entire cache
    A->>O: Generate token
    Note over A: Cost per token: O(n)<br/>grows with context length
    style K fill:#e74c3c

Mamba Inference

sequenceDiagram
    participant I as Input
    participant S as State
    participant M as Mamba
    participant O as Output
    I->>S: Update state
    Note over S: State size: constant<br/>~50 MB regardless of context
    S->>M: Process with state
    M->>O: Generate token
    Note over M: Cost per token: O(1)<br/>constant regardless of context
    style S fill:#2ecc71

Hybrid Inference

graph TB
    A[New Token] --> B{Layer Type}
    B -->|Mamba Layer| C["Update State<br/>O(1)"]
    B -->|Transformer Layer| D["Update KV Cache<br/>Attend<br/>O(n)"]
    C --> E[Next Layer]
    D --> E
    E --> F{More Layers?}
    F -->|Yes| B
    F -->|No| G[Output Token]
    style C fill:#2ecc71
    style D fill:#e74c3c

Result: Fewer attention layers = smaller KV cache = lower memory and latency!
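A hedged sketch of one decoding step through a hybrid stack, reduced to the state/cache bookkeeping: Mamba layers update a fixed-size state in place, while attention layers append to a cache that grows with every generated token. The layer internals are placeholders, not a real implementation:

import numpy as np

d_model, d_state = 8, 4
layer_kinds = ["mamba", "mamba", "mamba", "attention"] * 2        # 3:1 hybrid, 8 layers

states   = [np.zeros(d_state) if k == "mamba" else None for k in layer_kinds]
kv_cache = [[] if k == "attention" else None for k in layer_kinds]

def decode_step(h):
    for i, kind in enumerate(layer_kinds):
        if kind == "mamba":
            states[i] = 0.9 * states[i] + h[:d_state]             # constant-size state update
            h = h + np.tile(states[i], d_model // d_state)        # placeholder mixing
        else:
            kv_cache[i].append(h.copy())                          # cache grows with every token
            h = h + np.mean(kv_cache[i], axis=0)                  # placeholder "attention" readout
    return h

h = np.random.randn(d_model)
for _ in range(5):                                                # decode 5 tokens
    h = decode_step(h)

print(len(kv_cache[3]), states[0].shape)   # the cache has grown to 5 entries; the state stays (4,)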

Performance Benchmarks

Latency (Time to First Token)

graph LR
    A["Transformer<br/>100K context<br/>5.0s"] --> B["Hybrid 50/50<br/>100K context<br/>2.5s"]
    B --> C["Mamba<br/>100K context<br/>0.5s"]
    style A fill:#e74c3c
    style B fill:#f39c12
    style C fill:#2ecc71

Throughput (Tokens/Second)

graph LR
    A["Transformer<br/>10 tok/s"] --> B["Hybrid<br/>30 tok/s"]
    B --> C["Mamba<br/>100 tok/s"]
    style A fill:#e74c3c
    style B fill:#f39c12
    style C fill:#2ecc71

Quality (Benchmark Accuracy)

graph LR
    A["Transformer<br/>90% accuracy"] --> B["Hybrid<br/>88% accuracy"]
    B --> C["Mamba<br/>85% accuracy"]
    style A fill:#2ecc71
    style B fill:#f39c12
    style C fill:#e74c3c

Insight: Hybrids offer the best balance of speed, memory, and quality.

Design Considerations

How Many Transformer vs. Mamba Layers?

graph TB
    A{Task Requirements} --> B["Heavy In-Context<br/>Learning?"]
    A --> C["Long Context<br/>Efficiency?"]
    A --> D[Balanced?]
    B --> E["More Transformer<br/>70% Attention<br/>30% Mamba"]
    C --> F["More Mamba<br/>20% Attention<br/>80% Mamba"]
    D --> G["Balanced Hybrid<br/>50% Attention<br/>50% Mamba"]
    style E fill:#3498db
    style F fill:#2ecc71
    style G fill:#9b59b6

Layer Placement Strategy

Strategy 1: Uniform Interleaving
[M, A, M, A, M, A, ...]

Strategy 2: Blocked
[M, M, M, A, M, M, M, A, ...]

Strategy 3: Task-Specific
[M, M, M, M, A, A, A, A, M, M, M, M]
 └─ Local ─┘ └─ Global ─┘ └─ Output ─┘

Future Directions

1. Learned Layer Placement

Let the model learn optimal layer placement:

graph TB
    A[Training] --> B{Per-layer<br/>Decision}
    B -->|Layer 1| C[Mamba]
    B -->|Layer 2| D[Transformer]
    B -->|Layer 3| E[Mamba]
    C & D & E --> F[Evaluate Performance]
    F --> G{Optimize Layer<br/>Selection}
    G --> A
    style G fill:#f39c12

2. Adaptive Switching

Dynamically choose architecture based on input:

If input requires long-range dependencies:
    Use more Mamba layers
Else:
    Use more Transformer layers

3. Continuous State Space

Extend Mamba to continuous-time processing for irregular sequences (e.g., time-series with varying intervals).

Conclusion

The future of sequence modeling isn’t Transformers OR State Space Models; it’s Transformers AND State Space Models working together.

Hybrid architectures leverage the best of both worlds:

  • Transformers: In-context learning, associative recall, powerful reasoning
  • Mamba (SSMs): Linear scaling, efficient long-context, constant memory

By combining them strategically, we can build models that:

  • Process million-token contexts efficiently
  • Maintain strong performance on complex reasoning tasks
  • Run on reasonable hardware
  • Achieve better quality/cost trade-offs

The next generation of foundation models will almost certainly be hybrid. Understanding both paradigms and how to combine them is essential for building state-of-the-art AI systems.

Key Takeaways

  • Transformer Bottleneck: O(n²) attention limits long contexts
  • Mamba Advantage: O(n) processing with linear scaling
  • Hybrid Solution: Combine both architectures strategically
  • Memory Efficiency: Mamba uses constant state vs. growing KV cache
  • Design Patterns: Interleaved, task-specific, or mixture-of-depths
  • Real-World Models: Jamba, Striped Hyena demonstrate viability
  • Trade-offs: Speed/memory vs. quality; hybrids offer the best balance

The era of pure Transformer models is ending. Hybrid architectures are the future.

Further Reading