Introduction: Two Systems of Thinking
In cognitive science, Nobel laureate Daniel Kahneman described human thinking as two distinct systems:
- System 1: Fast, automatic, intuitive (e.g., recognizing faces, reading emotions)
- System 2: Slow, deliberate, analytical (e.g., solving math problems, planning)
Traditional LLMs operate almost entirely in System 1 mode: they generate responses instantly, token by token, with no deliberate planning or self-reflection. Ask GPT-4 a question, and it starts answering immediately—no visible “thinking time.”
But what if LLMs could engage System 2 thinking—spending time to reason, plan, critique, and refine before answering?
Enter reasoning-focused models like OpenAI’s o1 and DeepSeek’s R1: models trained to think before they speak.
System 1 vs. System 2 LLMs
System 1: Traditional LLMs
[Diagram: the prompt "What is 17 × 24?" goes straight to the model, which generates immediately and outputs 408.]
Characteristics:
- Instant response generation
- No visible intermediate reasoning
- Token-by-token streaming
- Fast but potentially shallow
System 2: Reasoning Models (o1, R1)
[Diagram: the prompt "What is 17 × 24?" first passes through a hidden internal reasoning phase before the final output.]

Scratchpad (hidden):
- Let me break this down...
- 17 × 20 = 340
- 17 × 4 = 68
- 340 + 68 = 408
- Let me verify: 408 ÷ 17 = 24 ✓

Final output: 408
Characteristics:
- Deliberate reasoning phase
- Self-verification and correction
- Multi-step planning
- Slower but more accurate
The Reasoning Process: Internal Monologue
Reasoning models generate an internal thought process before producing the final answer. This is often called a scratchpad or chain-of-thought.
User: Solve: If 5 machines make 5 widgets in 5 minutes, how long for 100 machines to make 100 widgets?

Model (internal scratchpad):
- Let me think about this... Initial thought: maybe 100 minutes? Wait, that doesn't sound right.
- Let's reason step by step: 5 machines → 5 widgets in 5 min, so 1 machine → 1 widget in 5 min.
- If 1 machine makes 1 widget in 5 minutes, then 100 machines make 100 widgets in... 5 minutes!
- Verification: rate per machine = 1 widget / 5 min, so 100 machines × (1 widget / 5 min) = 100 widgets in 5 minutes ✓

Reasoning complete → final answer to the user: 5 minutes (with or without the reasoning shown).
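For a model without a built-in scratchpad, you can approximate this structure with prompting. The sketch below is illustrative only: `generate` is a hypothetical stand-in for whatever LLM call you use, and the <think>/<answer> tags are just a convention (parsing them is covered later in the architecture section).

```python
# A sketch only: `generate` stands in for your LLM call
# (assumed to take a prompt string and return the raw completion text).

SCRATCHPAD_PROMPT = """Solve the problem below.
First write out your reasoning inside <think>...</think>,
then give only the final answer inside <answer>...</answer>.

Problem: {question}
"""

def ask_with_scratchpad(question: str, generate):
    """Build a prompt that elicits a scratchpad-style reasoning phase before the answer."""
    prompt = SCRATCHPAD_PROMPT.format(question=question)
    return generate(prompt)  # raw text containing both the reasoning and the answer
```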
How System 2 Models Are Trained
Traditional LLMs are trained to predict the next token. Reasoning models add an extra step: learn to generate reasoning traces that lead to correct answers.
Training Pipeline
- Data collection: math, coding, and logic problems; human annotators write step-by-step solutions; the result is a reasoning dataset of question + thought process + answer.
- Training: supervised fine-tuning teaches the model to generate reasoning traces, then reinforcement learning rewards correct final answers, producing a reasoning-optimized model.
- Inference: the user's question triggers a hidden reasoning scratchpad, and the final answer is extracted from the reasoning trace.
Step 1: Supervised Learning on Reasoning Traces
Train the model on examples with explicit reasoning:
Input: "What is the capital of the country where the Eiffel Tower is located?"
Reasoning trace (training data):
- The Eiffel Tower is in Paris
- Paris is in France
- The capital of France is Paris
Output: Paris
The model learns to generate intermediate reasoning steps.
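As a rough illustration of what such a training example might look like once serialized, here is a minimal sketch; the tag names and the `format_for_sft` helper are assumptions for this article, not the actual o1/R1 data format.

```python
# A minimal sketch of formatting a reasoning example for supervised fine-tuning.
# The model is then trained with the usual next-token prediction loss on this text.

example = {
    "question": "What is the capital of the country where the Eiffel Tower is located?",
    "reasoning": [
        "The Eiffel Tower is in Paris",
        "Paris is in France",
        "The capital of France is Paris",
    ],
    "answer": "Paris",
}

def format_for_sft(ex: dict) -> str:
    """Serialize question, reasoning trace, and answer into a single training string."""
    thoughts = "\n".join(f"- {step}" for step in ex["reasoning"])
    return (
        f"Question: {ex['question']}\n"
        f"<think>\n{thoughts}\n</think>\n"
        f"<answer>{ex['answer']}</answer>"
    )

print(format_for_sft(example))
```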
Step 2: Reinforcement Learning with Outcome Rewards
Use RL to optimize for correct final answers, not just plausible-sounding reasoning:
Reward = +1 if final answer is correct, -1 if wrong
This encourages the model to:
- Try multiple approaches
- Self-correct mistakes
- Verify answers
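A minimal sketch of such an outcome reward, assuming each training problem comes with a checkable ground-truth answer:

```python
# Outcome-based reward: only the final answer is scored, not the reasoning trace.

def outcome_reward(model_answer: str, ground_truth: str) -> float:
    """+1 if the extracted final answer matches the reference, -1 otherwise."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else -1.0

# During RL fine-tuning, this scalar reward is attached to the whole sampled
# trace: reasoning tokens plus the final answer.
assert outcome_reward("408", "408") == 1.0
assert outcome_reward("418", "408") == -1.0
```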
Chain-of-Thought vs. Scratchpad Reasoning
Chain-of-Thought (CoT)
User sees the reasoning process:
Example:
Q: What is 15% of 80?
A: Let me calculate this step by step:
- 10% of 80 = 8
- 5% of 80 = 4
- 15% = 10% + 5% = 8 + 4 = 12
Answer: 12
Scratchpad Reasoning (o1, R1)
Reasoning is hidden from the user:
[Diagram: the question flows into hidden internal reasoning, and only the final answer is shown to the user.]
Example:
Q: What is 15% of 80?
[Internal scratchpad - not shown to user]
- Let me think... 15% = 0.15
- 0.15 × 80 = ?
- 80 × 0.1 = 8
- 80 × 0.05 = 4
- 8 + 4 = 12
- Verify: 12/80 = 0.15 ✓
A: 12
Branching Reasoning: Exploring Multiple Paths
Advanced reasoning models can explore multiple reasoning paths and select the best one:
[Diagram: a complex problem branches into multiple reasoning strategies. Path 1: a direct-calculation approach. Path 2: break the problem into subproblems. Path 3: work backwards from the answer. Each path is verified; failed paths are discarded, the path that passes verification is kept, and the best reasoning path is selected to produce the final answer.]
This is similar to Monte Carlo Tree Search in game-playing AI (AlphaGo), but for reasoning.
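A minimal sketch of this branch-and-verify idea, closer to best-of-N sampling than full tree search; `generate_reasoning` and `verify` are hypothetical stand-ins (a sampled reasoning path plus a checker such as a calculator, unit test, or learned verifier):

```python
import random
from typing import Callable, Optional, Tuple

def solve_with_branching(
    question: str,
    generate_reasoning: Callable[[str, float], Tuple[str, str]],  # (reasoning, answer)
    verify: Callable[[str, str], bool],
    n_paths: int = 3,
) -> Optional[str]:
    """Sample several reasoning paths, verify each candidate answer,
    and return the first one that passes verification."""
    for _ in range(n_paths):
        temperature = random.uniform(0.5, 1.0)  # some randomness for diverse strategies
        reasoning, answer = generate_reasoning(question, temperature)
        if verify(question, answer):
            return answer  # keep the verified path, discard the rest
    return None  # no path survived verification
```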
Self-Correction: The Critique Loop
One of the most powerful features of System 2 models is self-correction:
Example: Self-Correction in Action
Question: If 3x + 6 = 45, what is x?
[Internal Reasoning]
Initial attempt:
- 3x = 45 - 6
- 3x = 39
- x = 13
Verification:
- Check: 3(13) + 6 = 39 + 6 = 45 ✓
- This is correct!
Final Answer: x = 13
Compare to a case where the model catches its own mistake:
Question: If 3x + 6 = 45, what is x?
[Internal Reasoning]
Initial attempt:
- 3x = 45 + 6 [MISTAKE]
- 3x = 51
- x = 17
Verification:
- Check: 3(17) + 6 = 51 + 6 = 57
- Wait, that's 57, not 45!
- I made an error.
Corrected attempt:
- 3x = 45 - 6 [FIXED]
- 3x = 39
- x = 13
Re-verification:
- Check: 3(13) + 6 = 39 + 6 = 45 ✓
- Correct!
Final Answer: x = 13
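A minimal sketch of the same loop in code, assuming the task has a cheap checker (here, substituting the candidate x back into the equation); `propose_solution` is a hypothetical stand-in for the model call:

```python
def check_linear_equation(x: float, a: float = 3, b: float = 6, rhs: float = 45) -> bool:
    """Verify a candidate solution of a*x + b = rhs by substitution."""
    return abs(a * x + b - rhs) < 1e-9

def solve_with_self_correction(propose_solution, max_attempts: int = 3):
    """Keep asking for a solution until verification passes (or attempts run out)."""
    for _ in range(max_attempts):
        x = propose_solution("If 3x + 6 = 45, what is x?")
        if check_linear_equation(x):
            return x  # verification passed
        # otherwise the critique failed, so try again
    return None

# e.g. check_linear_equation(17) -> False (3*17 + 6 = 57), check_linear_equation(13) -> True
```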
Real-World Example: OpenAI o1 vs. GPT-4
Problem: Complex Logic Puzzle
Question: “In a room, there are three light switches. Each switch controls one of three light bulbs in another room. You can flip the switches as many times as you want, but you can only enter the other room once. How do you determine which switch controls which bulb?”
GPT-4 (System 1) Response
You can determine which switch controls which bulb by:
1. Flipping switches in different combinations
2. Entering the room to see which lights are on
3. Matching the patterns
[Incomplete/incorrect reasoning]
o1 (System 2) Response
[Internal Reasoning - simplified]
- Direct observation won't work since I can only enter once
- Need to use multiple signals...
- Idea: Switches affect bulbs, but what if I use heat?
Step-by-step plan:
1. Turn on switch 1, leave it on for 10 minutes
2. Turn off switch 1, turn on switch 2
3. Leave switch 3 off
4. Enter the room
Results interpretation:
- Bulb that is ON → controlled by switch 2
- Bulb that is OFF but WARM → controlled by switch 1
- Bulb that is OFF and COOL → controlled by switch 3
This works because the first bulb was on long enough to heat up!
Final Answer: [Complete solution using heat as a signal]
The reasoning model explores the problem space more thoroughly and finds creative solutions.
Architecture: How Is This Implemented?
While exact details are proprietary, we can infer the architecture:
[Diagram: a reasoning loop, hidden from the user, keeps generating thoughts until the model judges its reasoning complete and marks the end of reasoning. An output-extraction stage then pulls the final answer from the reasoning trace and generates the user-facing response. Special tokens delimit the phases: reasoning start (<think>), reasoning end (</think>), and answer (<answer>).]
Special Tokens for Reasoning Control
Models likely use special tokens to delineate reasoning:
<think>
Let me break this problem down step by step.
First, I need to understand what is being asked...
[... extensive reasoning ...]
So the answer is 42.
</think>
<answer>
The answer is 42.
</answer>
During inference:
- Everything between <think> and </think> is hidden from the user
- Only content in <answer> tags is shown
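A minimal sketch of how a serving layer might split one raw generation into hidden reasoning and the visible answer, assuming the tag convention shown above (the exact delimiters used by o1 are not public; DeepSeek-R1's open weights do emit <think>…</think>):

```python
import re
from typing import Tuple

def split_reasoning_and_answer(raw_output: str) -> Tuple[str, str]:
    """Return (hidden_reasoning, user_visible_answer) from one model generation."""
    reasoning = ""
    think_match = re.search(r"<think>(.*?)</think>", raw_output, re.DOTALL)
    if think_match:
        reasoning = think_match.group(1).strip()
    answer_match = re.search(r"<answer>(.*?)</answer>", raw_output, re.DOTALL)
    answer = answer_match.group(1).strip() if answer_match else raw_output
    return reasoning, answer

raw = "<think>\n17 × 20 = 340, 17 × 4 = 68, total 408\n</think>\n<answer>408</answer>"
hidden, shown = split_reasoning_and_answer(raw)
print(shown)  # "408" -- only this is sent to the user
```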
Training for Self-Correction: Process Supervision
Instead of rewarding only the final answer, reward each intermediate reasoning step.
This approach is known as process supervision (process-supervised reward models, from OpenAI's research).
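A minimal sketch of the difference, with placeholder per-step scores standing in for labels from a process reward model or human annotators:

```python
from typing import List

def process_reward(step_scores: List[float]) -> float:
    """Score a reasoning trace by its steps (e.g., each step labeled correct=1 / incorrect=0),
    instead of scoring only the final answer."""
    if not step_scores:
        return 0.0
    return sum(step_scores) / len(step_scores)  # average step quality

# A trace whose first steps are sound but whose last step slips:
scores = [1.0, 1.0, 1.0, 0.0]
print(process_reward(scores))  # 0.75 -- partial credit, unlike a pure outcome reward
```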
Compute-Time Scaling: More Thinking = Better Answers
A key insight: inference-time computation can improve accuracy:
- Quick response (~1 second of thinking): ~70% accuracy
- Medium response (~10 seconds of thinking): ~85% accuracy
- Deep reasoning (~60 seconds of thinking): ~95% accuracy
Traditional scaling: bigger models = better performance.
New paradigm: more thinking time = better performance.
This allows dynamic quality/cost trade-offs at inference time!
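A minimal sketch of such a trade-off knob, capping the number of hidden reasoning tokens; the `generate` callable, the budget values, and the `max_reasoning_tokens` parameter are assumptions, not a specific vendor API:

```python
BUDGETS = {"low": 256, "medium": 2048, "high": 16384}  # max hidden-reasoning tokens

def answer_with_budget(question: str, effort: str, generate):
    """Allow more reasoning tokens for harder questions;
    cost and latency scale with the chosen budget."""
    max_reasoning_tokens = BUDGETS[effort]
    return generate(question, max_reasoning_tokens=max_reasoning_tokens)

# Usage idea: route simple lookups to "low" and hard math or coding
# questions to "high", paying for extra thinking only when it helps.
```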
Comparison: Traditional vs. Reasoning Models
[Diagram: a traditional model produces its response in a single fast pass, while a reasoning model loops through reasoning generation (many tokens) and self-critique until the answer checks out, then extracts the final answer and returns the response.]
Limitations and Challenges
1. Computational Cost
More reasoning = more tokens = higher cost:
Traditional model: 50 tokens → $0.001
Reasoning model: 500 reasoning + 50 output = 550 tokens → $0.011
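A minimal sketch of that arithmetic, using the illustrative rate implied by the numbers above ($0.02 per 1K tokens); real pricing varies by model and provider:

```python
PRICE_PER_1K_TOKENS = 0.02  # assumed illustrative rate

def request_cost(reasoning_tokens: int, output_tokens: int) -> float:
    """Hidden reasoning tokens are billed like any other generated tokens."""
    return round((reasoning_tokens + output_tokens) / 1000 * PRICE_PER_1K_TOKENS, 6)

print(request_cost(0, 50))    # 0.001 -- traditional model
print(request_cost(500, 50))  # 0.011 -- reasoning model, ~10x the cost
```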
2. Latency
Reasoning takes time:
GPT-4: ~2 seconds
o1: ~30 seconds for complex problems
3. Hallucination in Reasoning
The model might generate plausible-sounding but incorrect reasoning:
[Diagram: a long chain of plausible but wrong reasoning leads to an incorrect answer presented confidently.]
4. Opacity
Even if reasoning is shown, it may not reflect the model’s “true” process—it’s just more generated text.
Use Cases: When to Use Reasoning Models
Excellent For:
- Complex math problems: Multi-step calculations
- Coding challenges: Algorithm design, debugging
- Logic puzzles: Requires deep reasoning
- Scientific problem-solving: Hypothesis generation and testing
- Strategic planning: Multi-step decision making
Not Ideal For:
- Simple questions: “What is the capital of France?”
- Creative writing: May over-analyze
- Speed-critical applications: Too slow
- Cost-sensitive applications: 10× more expensive
The Future: Hybrid Models
Future models may dynamically choose reasoning depth:
[Diagram: a difficulty classifier routes each query: simple questions get a System 1 fast response, medium questions get light reasoning (a few seconds), complex questions get deep reasoning (up to a minute), and all paths converge on an answer.]
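A minimal sketch of such a router; the keyword heuristic is purely illustrative, and a real system would use a learned difficulty classifier or a quick self-assessment by the model itself:

```python
def classify_difficulty(question: str) -> str:
    """Toy heuristic: certain keywords or very long questions suggest harder problems."""
    hard_markers = ("prove", "debug", "optimize", "step by step", "puzzle")
    if any(marker in question.lower() for marker in hard_markers):
        return "complex"
    if len(question.split()) > 30:
        return "medium"
    return "simple"

def route(question: str) -> str:
    """Pick a reasoning depth per query: fast System 1 answer, light reasoning, or deep reasoning."""
    return {
        "simple": "system1",          # answer immediately
        "medium": "light_reasoning",  # a few seconds of thinking
        "complex": "deep_reasoning",  # extended hidden scratchpad
    }[classify_difficulty(question)]

print(route("What is the capital of France?"))        # system1
print(route("Prove that the algorithm terminates."))  # deep_reasoning
```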
Conclusion: The Dawn of Thinking Machines
Reasoning models like o1 and R1 represent a paradigm shift in AI:
From: predict the next token as fast as possible.
To: think carefully before responding.
This mirrors human cognition more closely—we don’t always answer immediately; sometimes we need to stop, think, plan, and verify.
The implications are profound:
- Accuracy: Better performance on complex tasks
- Reliability: Self-correction reduces errors
- Transparency: Reasoning traces provide interpretability
- Flexibility: Tune reasoning depth based on task difficulty
We’re moving from reactive AI (System 1) to deliberative AI (System 2). The next frontier is teaching models not just what to think, but how to think.
Key Takeaways
- System 1 vs. System 2: Fast intuition vs. slow deliberation
- Scratchpad Reasoning: Hidden internal thought process
- Self-Correction: Models can verify and fix their own mistakes
- Process Supervision: Reward intermediate reasoning steps
- Compute-Time Scaling: More thinking time = better accuracy
- Trade-offs: Accuracy and reliability vs. cost and latency
The future of LLMs is not just bigger models—it’s smarter reasoning.