Introduction: Two Systems of Thinking
In cognitive science, Nobel laureate Daniel Kahneman described human thinking as two distinct systems:
- System 1: Fast, automatic, intuitive (e.g., recognizing faces, reading emotions)
- System 2: Slow, deliberate, analytical (e.g., solving math problems, planning)
Traditional LLMs operate almost entirely in System 1 mode: they generate responses instantly, token by token, with no deliberate planning or self-reflection. Ask GPT-4 a question, and it starts answering immediately—no visible “thinking time.”
But what if LLMs could engage System 2 thinking—spending time to reason, plan, critique, and refine before answering?
Enter reasoning-focused models like OpenAI’s o1 and DeepSeek’s R1: models trained to think before they speak.
System 1 vs. System 2 LLMs
System 1: Traditional LLMs
[Diagram: the prompt "What is 17 × 24?" goes straight to the model, which generates immediately and outputs 408.]
Characteristics:
- Instant response generation
- No visible intermediate reasoning
- Token-by-token streaming
- Fast but potentially shallow
System 2: Reasoning Models (o1, R1)
[Diagram: the prompt "What is 17 × 24?" first passes through a hidden internal reasoning phase before the final output.]

Scratchpad (hidden):
- Let me break this down...
- 17 × 20 = 340
- 17 × 4 = 68
- 340 + 68 = 408
- Let me verify: 408 ÷ 17 = 24 ✓

Final output: 408
Characteristics:
- Deliberate reasoning phase
- Self-verification and correction
- Multi-step planning
- Slower but more accurate
The Reasoning Process: Internal Monologue
Reasoning models generate an internal thought process before producing the final answer. This is often called a scratchpad or chain-of-thought.
User: Solve: If 5 machines make 5 widgets in 5 minutes, how long for 100 machines to make 100 widgets?

Model (internal scratchpad):
- Let me think about this... Initial thought: maybe 100 minutes? Wait, that doesn't sound right.
- Let's reason step by step: 5 machines → 5 widgets in 5 min, so 1 machine → 1 widget in 5 min.
- If 1 machine makes 1 widget in 5 minutes, then 100 machines make 100 widgets in... 5 minutes!
- Verification: rate per machine = 1 widget / 5 min, so 100 machines × (1 widget / 5 min) = 100 widgets in 5 minutes ✓

Reasoning complete → final answer to the user: 5 minutes (with or without the reasoning shown).
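For a model without a built-in scratchpad, you can approximate this structure with prompting. The sketch below is illustrative only: `generate` is a hypothetical stand-in for whatever LLM call you use, and the <think>/<answer> tags are just a convention (parsing them is covered later in the architecture section).

```python
# A sketch only: `generate` stands in for your LLM call
# (assumed to take a prompt string and return the raw completion text).

SCRATCHPAD_PROMPT = """Solve the problem below.
First write out your reasoning inside <think>...</think>,
then give only the final answer inside <answer>...</answer>.

Problem: {question}
"""

def ask_with_scratchpad(question: str, generate):
    """Build a prompt that elicits a scratchpad-style reasoning phase before the answer."""
    prompt = SCRATCHPAD_PROMPT.format(question=question)
    return generate(prompt)  # raw text containing both the reasoning and the answer
```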
How System 2 Models Are Trained
Traditional LLMs are trained to predict the next token. Reasoning models add an extra step: learn to generate reasoning traces that lead to correct answers.
Training Pipeline
- Data collection: math, coding, and logic problems; human annotators write step-by-step solutions; the result is a reasoning dataset of question + thought process + answer.
- Training: supervised fine-tuning teaches the model to generate reasoning traces, then reinforcement learning rewards correct final answers, producing a reasoning-optimized model.
- Inference: the user's question triggers a hidden reasoning scratchpad, and the final answer is extracted from the reasoning trace.
Step 1: Supervised Learning on Reasoning Traces
Train the model on examples with explicit reasoning:
Input: "What is the capital of the country where the Eiffel Tower is located?"
Reasoning trace (training data):
- The Eiffel Tower is in Paris
- Paris is in France
- The capital of France is Paris
Output: Paris
The model learns to generate intermediate reasoning steps.
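As a rough illustration of what such a training example might look like once serialized, here is a minimal sketch; the tag names and the `format_for_sft` helper are assumptions for this article, not the actual o1/R1 data format.

```python
# A minimal sketch of formatting a reasoning example for supervised fine-tuning.
# The model is then trained with the usual next-token prediction loss on this text.

example = {
    "question": "What is the capital of the country where the Eiffel Tower is located?",
    "reasoning": [
        "The Eiffel Tower is in Paris",
        "Paris is in France",
        "The capital of France is Paris",
    ],
    "answer": "Paris",
}

def format_for_sft(ex: dict) -> str:
    """Serialize question, reasoning trace, and answer into a single training string."""
    thoughts = "\n".join(f"- {step}" for step in ex["reasoning"])
    return (
        f"Question: {ex['question']}\n"
        f"<think>\n{thoughts}\n</think>\n"
        f"<answer>{ex['answer']}</answer>"
    )

print(format_for_sft(example))
```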
Step 2: Reinforcement Learning with Outcome Rewards
Use RL to optimize for correct final answers, not just plausible-sounding reasoning:
Reward = +1 if final answer is correct, -1 if wrong
This encourages the model to:
- Try multiple approaches
- Self-correct mistakes
- Verify answers
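A minimal sketch of such an outcome reward, assuming each training problem comes with a checkable ground-truth answer:

```python
# Outcome-based reward: only the final answer is scored, not the reasoning trace.

def outcome_reward(model_answer: str, ground_truth: str) -> float:
    """+1 if the extracted final answer matches the reference, -1 otherwise."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else -1.0

# During RL fine-tuning, this scalar reward is attached to the whole sampled
# trace: reasoning tokens plus the final answer.
assert outcome_reward("408", "408") == 1.0
assert outcome_reward("418", "408") == -1.0
```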
Chain-of-Thought vs. Scratchpad Reasoning
Chain-of-Thought (CoT)
User sees the reasoning process:
Example:
Q: What is 15% of 80?
A: Let me calculate this step by step:
- 10% of 80 = 8
- 5% of 80 = 4
- 15% = 10% + 5% = 8 + 4 = 12
Answer: 12
Scratchpad Reasoning (o1, R1)
Reasoning is hidden from the user:
[Diagram: the question flows into hidden internal reasoning, and only the final answer is shown to the user.]
Example:
Q: What is 15% of 80?
[Internal scratchpad - not shown to user]
- Let me think... 15% = 0.15
- 0.15 × 80 = ?
- 80 × 0.1 = 8
- 80 × 0.05 = 4
- 8 + 4 = 12
- Verify: 12/80 = 0.15 ✓
A: 12
Branching Reasoning: Exploring Multiple Paths
Advanced reasoning models can explore multiple reasoning paths and select the best one:
[Diagram: a complex problem branches into multiple reasoning strategies. Path 1: a direct-calculation approach. Path 2: break the problem into subproblems. Path 3: work backwards from the answer. Each path is verified; failed paths are discarded, the path that passes verification is kept, and the best reasoning path is selected to produce the final answer.]
This is similar to Monte Carlo Tree Search in game-playing AI (AlphaGo), but for reasoning.
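A minimal sketch of this branch-and-verify idea, closer to best-of-N sampling than full tree search; `generate_reasoning` and `verify` are hypothetical stand-ins (a sampled reasoning path plus a checker such as a calculator, unit test, or learned verifier):

```python
import random
from typing import Callable, Optional, Tuple

def solve_with_branching(
    question: str,
    generate_reasoning: Callable[[str, float], Tuple[str, str]],  # (reasoning, answer)
    verify: Callable[[str, str], bool],
    n_paths: int = 3,
) -> Optional[str]:
    """Sample several reasoning paths, verify each candidate answer,
    and return the first one that passes verification."""
    for _ in range(n_paths):
        temperature = random.uniform(0.5, 1.0)  # some randomness for diverse strategies
        reasoning, answer = generate_reasoning(question, temperature)
        if verify(question, answer):
            return answer  # keep the verified path, discard the rest
    return None  # no path survived verification
```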
Self-Correction: The Critique Loop
One of the most powerful features of System 2 models is self-correction:
Example: Self-Correction in Action
Question: If 3x + 6 = 45, what is x?
[Internal Reasoning]
Initial attempt:
- 3x = 45 - 6
- 3x = 39
- x = 13
Verification:
- Check: 3(13) + 6 = 39 + 6 = 45 ✓
- This is correct!
Final Answer: x = 13
Compare to a case where the model catches its own mistake:
Question: If 3x + 6 = 45, what is x?
[Internal Reasoning]
Initial attempt:
- 3x = 45 + 6 [MISTAKE]
- 3x = 51
- x = 17
Verification:
- Check: 3(17) + 6 = 51 + 6 = 57
- Wait, that's 57, not 45!
- I made an error.
Corrected attempt:
- 3x = 45 - 6 [FIXED]
- 3x = 39
- x = 13
Re-verification:
- Check: 3(13) + 6 = 39 + 6 = 45 ✓
- Correct!
Final Answer: x = 13
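A minimal sketch of the same loop in code, assuming the task has a cheap checker (here, substituting the candidate x back into the equation); `propose_solution` is a hypothetical stand-in for the model call:

```python
def check_linear_equation(x: float, a: float = 3, b: float = 6, rhs: float = 45) -> bool:
    """Verify a candidate solution of a*x + b = rhs by substitution."""
    return abs(a * x + b - rhs) < 1e-9

def solve_with_self_correction(propose_solution, max_attempts: int = 3):
    """Keep asking for a solution until verification passes (or attempts run out)."""
    for _ in range(max_attempts):
        x = propose_solution("If 3x + 6 = 45, what is x?")
        if check_linear_equation(x):
            return x  # verification passed
        # otherwise the critique failed, so try again
    return None

# e.g. check_linear_equation(17) -> False (3*17 + 6 = 57), check_linear_equation(13) -> True
```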
Real-World Example: OpenAI o1 vs. GPT-4
Problem: Complex Logic Puzzle
Question: “In a room, there are three light switches. Each switch controls one of three light bulbs in another room. You can flip the switches as many times as you want, but you can only enter the other room once. How do you determine which switch controls which bulb?”
GPT-4 (System 1) Response
You can determine which switch controls which bulb by:
1. Flipping switches in different combinations
2. Entering the room to see which lights are on
3. Matching the patterns
[Incomplete/incorrect reasoning]
o1 (System 2) Response
[Internal Reasoning - simplified]
- Direct observation won't work since I can only enter once
- Need to use multiple signals...
- Idea: Switches affect bulbs, but what if I use heat?
Step-by-step plan:
1. Turn on switch 1, leave it on for 10 minutes
2. Turn off switch 1, turn on switch 2
3. Leave switch 3 off
4. Enter the room
Results interpretation:
- Bulb that is ON → controlled by switch 2
- Bulb that is OFF but WARM → controlled by switch 1
- Bulb that is OFF and COOL → controlled by switch 3
This works because the first bulb was on long enough to heat up!
Final Answer: [Complete solution using heat as a signal]
The reasoning model explores the problem space more thoroughly and finds creative solutions.
Architecture: How Is This Implemented?
While exact details are proprietary, we can infer the architecture:
[Diagram: a reasoning loop, hidden from the user, keeps generating thoughts until the model judges its reasoning complete and marks the end of reasoning. An output-extraction stage then pulls the final answer from the reasoning trace and generates the user-facing response. Special tokens delimit the phases: reasoning start (<think>), reasoning end (</think>), and answer (<answer>).]
Special Tokens for Reasoning Control
Models likely use special tokens to delineate reasoning:
<think>
Let me break this problem down step by step.
First, I need to understand what is being asked...
[... extensive reasoning ...]
So the answer is 42.
</think>
<answer>
The answer is 42.
</answer>
During inference:
- Everything between <think> and </think> is hidden from the user
- Only content in <answer> tags is shown
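A minimal sketch of how a serving layer might split one raw generation into hidden reasoning and the visible answer, assuming the tag convention shown above (the exact delimiters used by o1 are not public; DeepSeek-R1's open weights do emit <think>…</think>):

```python
import re
from typing import Tuple

def split_reasoning_and_answer(raw_output: str) -> Tuple[str, str]:
    """Return (hidden_reasoning, user_visible_answer) from one model generation."""
    reasoning = ""
    think_match = re.search(r"<think>(.*?)</think>", raw_output, re.DOTALL)
    if think_match:
        reasoning = think_match.group(1).strip()
    answer_match = re.search(r"<answer>(.*?)</answer>", raw_output, re.DOTALL)
    answer = answer_match.group(1).strip() if answer_match else raw_output
    return reasoning, answer

raw = "<think>\n17 × 20 = 340, 17 × 4 = 68, total 408\n</think>\n<answer>408</answer>"
hidden, shown = split_reasoning_and_answer(raw)
print(shown)  # "408" -- only this is sent to the user
```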
Training for Self-Correction: Process Supervision
Instead of rewarding only the final answer, reward each intermediate reasoning step.
This approach is known as process supervision (process-supervised reward models, from OpenAI's research).
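A minimal sketch of the difference, with placeholder per-step scores standing in for labels from a process reward model or human annotators:

```python
from typing import List

def process_reward(step_scores: List[float]) -> float:
    """Score a reasoning trace by its steps (e.g., each step labeled correct=1 / incorrect=0),
    instead of scoring only the final answer."""
    if not step_scores:
        return 0.0
    return sum(step_scores) / len(step_scores)  # average step quality

# A trace whose first steps are sound but whose last step slips:
scores = [1.0, 1.0, 1.0, 0.0]
print(process_reward(scores))  # 0.75 -- partial credit, unlike a pure outcome reward
```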
Compute-Time Scaling: More Thinking = Better Answers
A key insight: inference-time computation can improve accuracy:
- Quick response (~1 second of thinking): ~70% accuracy
- Medium response (~10 seconds of thinking): ~85% accuracy
- Deep reasoning (~60 seconds of thinking): ~95% accuracy
Traditional scaling: bigger models = better performance.
New paradigm: more thinking time = better performance.
This allows dynamic quality/cost trade-offs at inference time!
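A minimal sketch of such a trade-off knob, capping the number of hidden reasoning tokens; the `generate` callable, the budget values, and the `max_reasoning_tokens` parameter are assumptions, not a specific vendor API:

```python
BUDGETS = {"low": 256, "medium": 2048, "high": 16384}  # max hidden-reasoning tokens

def answer_with_budget(question: str, effort: str, generate):
    """Allow more reasoning tokens for harder questions;
    cost and latency scale with the chosen budget."""
    max_reasoning_tokens = BUDGETS[effort]
    return generate(question, max_reasoning_tokens=max_reasoning_tokens)

# Usage idea: route simple lookups to "low" and hard math or coding
# questions to "high", paying for extra thinking only when it helps.
```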
Comparison: Traditional vs. Reasoning Models
[Diagram: a traditional model produces its response in a single fast pass, while a reasoning model loops through reasoning generation (many tokens) and self-critique until the answer checks out, then extracts the final answer and returns the response.]
Limitations and Challenges
1. Computational Cost
More reasoning = more tokens = higher cost:
Traditional model: 50 tokens → $0.001
Reasoning model: 500 reasoning + 50 output = 550 tokens → $0.011
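A minimal sketch of that arithmetic, using the illustrative rate implied by the numbers above ($0.02 per 1K tokens); real pricing varies by model and provider:

```python
PRICE_PER_1K_TOKENS = 0.02  # assumed illustrative rate

def request_cost(reasoning_tokens: int, output_tokens: int) -> float:
    """Hidden reasoning tokens are billed like any other generated tokens."""
    return round((reasoning_tokens + output_tokens) / 1000 * PRICE_PER_1K_TOKENS, 6)

print(request_cost(0, 50))    # 0.001 -- traditional model
print(request_cost(500, 50))  # 0.011 -- reasoning model, ~10x the cost
```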
2. Latency
Reasoning takes time:
GPT-4: ~2 seconds
o1: ~30 seconds for complex problems
3. Hallucination in Reasoning
The model might generate plausible-sounding but incorrect reasoning:
[Diagram: a long chain of plausible but wrong reasoning leads to an incorrect answer presented confidently.]
4. Opacity
Even if reasoning is shown, it may not reflect the model’s “true” process—it’s just more generated text.
Use Cases: When to Use Reasoning Models
Excellent For:
- Complex math problems: Multi-step calculations
- Coding challenges: Algorithm design, debugging
- Logic puzzles: Requires deep reasoning
- Scientific problem-solving: Hypothesis generation and testing
- Strategic planning: Multi-step decision making
Not Ideal For:
- Simple questions: “What is the capital of France?”
- Creative writing: May over-analyze
- Speed-critical applications: Too slow
- Cost-sensitive applications: 10× more expensive
The Future: Hybrid Models
Future models may dynamically choose reasoning depth:
[Diagram: a difficulty classifier routes each query: simple questions get a System 1 fast response, medium questions get light reasoning (a few seconds), complex questions get deep reasoning (up to a minute), and all paths converge on an answer.]
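A minimal sketch of such a router; the keyword heuristic is purely illustrative, and a real system would use a learned difficulty classifier or a quick self-assessment by the model itself:

```python
def classify_difficulty(question: str) -> str:
    """Toy heuristic: certain keywords or very long questions suggest harder problems."""
    hard_markers = ("prove", "debug", "optimize", "step by step", "puzzle")
    if any(marker in question.lower() for marker in hard_markers):
        return "complex"
    if len(question.split()) > 30:
        return "medium"
    return "simple"

def route(question: str) -> str:
    """Pick a reasoning depth per query: fast System 1 answer, light reasoning, or deep reasoning."""
    return {
        "simple": "system1",          # answer immediately
        "medium": "light_reasoning",  # a few seconds of thinking
        "complex": "deep_reasoning",  # extended hidden scratchpad
    }[classify_difficulty(question)]

print(route("What is the capital of France?"))        # system1
print(route("Prove that the algorithm terminates."))  # deep_reasoning
```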
Conclusion: The Dawn of Thinking Machines
Reasoning models like o1 and R1 represent a paradigm shift in AI:
From: predict the next token as fast as possible.
To: think carefully before responding.
This mirrors human cognition more closely—we don’t always answer immediately; sometimes we need to stop, think, plan, and verify.
The implications are profound:
- Accuracy: Better performance on complex tasks
- Reliability: Self-correction reduces errors
- Transparency: Reasoning traces provide interpretability
- Flexibility: Tune reasoning depth based on task difficulty
We’re moving from reactive AI (System 1) to deliberative AI (System 2). The next frontier is teaching models not just what to think, but how to think.
Key Takeaways
- System 1 vs. System 2: Fast intuition vs. slow deliberation
- Scratchpad Reasoning: Hidden internal thought process
- Self-Correction: Models can verify and fix their own mistakes
- Process Supervision: Reward intermediate reasoning steps
- Compute-Time Scaling: More thinking time = better accuracy
- Trade-offs: Accuracy and reliability vs. cost and latency
The future of LLMs is not just bigger models—it’s smarter reasoning.