The 'System 2' LLM: How Models Learn to Reason (o1, R1)

    Introduction: Two Systems of Thinking. In cognitive science, Nobel laureate Daniel Kahneman described human thinking as two distinct systems. System 1 is fast, automatic, and intuitive (e.g., recognizing faces, reading emotions); System 2 is slow, deliberate, and analytical (e.g., solving math problems, planning). Traditional LLMs operate almost entirely in System 1 mode: they generate responses instantly, token by token, with no deliberate planning or self-reflection. Ask GPT-4 a question, and it starts answering immediately, with no visible “thinking time.” ...

    February 10, 2025 · 11 min · Rafiul Alam

    Unpacking KV Cache Optimization: MLA and GQA Explained

    Introduction: The Memory Wall. Modern LLMs can process context windows of 100K+ tokens, but there’s a hidden cost: the KV cache. The memory required to store attention’s key-value pairs grows linearly with context length (and with batch size), and at long contexts it quickly dwarfs everything else. This creates a three-way bottleneck: memory (the KV cache can consume 10-100× more memory than the model weights), bandwidth (moving KV cache data becomes the primary source of latency), and cost (serving long-context models requires expensive high-memory GPUs). Two innovations address this: Grouped Query Attention (GQA) and Multi-Head Latent Attention (MLA). They reduce KV cache size by 4-8× while maintaining quality. ...
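
    To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch (not from the post; the model dimensions are illustrative assumptions, roughly 70B-class) of why the cache dominates at long contexts and how GQA’s shared KV heads shrink it:

    ```python
    # Back-of-the-envelope KV cache sizing. Dimensions are illustrative
    # assumptions, not taken from the article. Note the cache grows
    # linearly with sequence length.

    def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
        """Bytes needed to cache keys and values for one sequence.

        The factor of 2 covers K and V; dtype_bytes=2 assumes fp16/bf16.
        """
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

    seq_len = 100_000                 # a 100K-token context window
    n_layers, head_dim = 80, 128      # assumed 70B-class model shape

    mha = kv_cache_bytes(seq_len, n_layers, n_kv_heads=64, head_dim=head_dim)
    gqa = kv_cache_bytes(seq_len, n_layers, n_kv_heads=8, head_dim=head_dim)

    print(f"MHA KV cache: {mha / 1e9:.0f} GB per sequence")  # ~262 GB
    print(f"GQA KV cache: {gqa / 1e9:.0f} GB per sequence")  # ~33 GB, an 8x cut
    ```

    GQA’s saving is exactly the ratio of query heads to KV-head groups (64/8 = 8× in this sketch); MLA instead caches a compressed low-rank latent per token, which is how it can shrink the cache further at comparable quality.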

    January 31, 2025 · 11 min · Rafiul Alam