Introduction: It’s All About the Data

The secret to building great language models isn’t just architecture or compute—it’s data.

Every decision in the LLM lifecycle revolves around data:

  • What data do we train on?
  • How do we clean and filter it?
  • How do we align the model with human preferences?
  • How do we measure success?

Let’s trace the complete journey from raw text to a production-ready model, with data at the center.

The LLM Lifecycle: End-to-End

```mermaid
graph TB
    A[Raw Data<br/>Web crawl, books, code] --> B[Data Curation<br/>Filter, dedupe, clean]
    B --> C[Pre-training<br/>Learn language patterns<br/>100B+ tokens]
    C --> D[Base Model<br/>Completes text<br/>Not aligned]
    D --> E[Supervised Fine-Tuning<br/>SFT<br/>Instruction following]
    E --> F[SFT Model<br/>Follows instructions<br/>Not optimized]
    F --> G[Alignment<br/>RLHF or DPO<br/>Human preferences]
    G --> H[Aligned Model<br/>Helpful, harmless, honest]
    H --> I[Evaluation<br/>Benchmarks, human eval]
    I -->|Iterate| B
    style A fill:#95a5a6
    style B fill:#3498db
    style C fill:#e74c3c
    style E fill:#f39c12
    style G fill:#9b59b6
    style I fill:#2ecc71
```

Phase 1: Data Collection and Curation

Raw Data Sources

```mermaid
mindmap
  root((LLM Training Data Sources))
    Web Data
      CommonCrawl
      Wikipedia
      Reddit
      GitHub
      StackOverflow
    Books
      Project Gutenberg
      Published books
      Academic papers
    Code
      GitHub repositories
      StackOverflow code
      Documentation
    Specialized
      Scientific papers
      Legal documents
      Medical records
```

Data Curation Pipeline

```mermaid
graph TB
    A[Raw Data<br/>10 TB web crawl] --> B[Language Detection<br/>Filter non-English]
    B --> C[Quality Filtering<br/>Remove low-quality]
    C --> D[Deduplication<br/>Remove exact/near duplicates]
    D --> E[PII Removal<br/>Personal info redaction]
    E --> F[Toxicity Filtering<br/>Remove harmful content]
    F --> G[Curated Dataset<br/>2 TB high-quality]
    style A fill:#e74c3c
    style G fill:#2ecc71
```

Quality Filtering

```mermaid
graph LR
    A[Document] --> B{Quality<br/>Checks}
    B -->|Pass| C[Keep:<br/>Well-formed sentences<br/>Coherent content<br/>Proper grammar]
    B -->|Fail| D[Discard:<br/>Gibberish<br/>SEO spam<br/>Boilerplate]
    style C fill:#2ecc71
    style D fill:#e74c3c
```

Metrics:

  • Perplexity score (vs. reference model)
  • Length distribution
  • Symbol-to-word ratio
  • Repetition detection
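
A minimal sketch of how two of these metrics (symbol-to-word ratio and repetition) can be computed with plain Python. The thresholds are illustrative assumptions, not values from any production filter, and perplexity scoring would require a reference language model that is not shown here.

```python
# Illustrative quality heuristics; thresholds below are assumptions.
from collections import Counter

def symbol_to_word_ratio(text: str) -> float:
    words = text.split()
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return symbols / max(len(words), 1)

def repetition_fraction(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that occur more than once (high = spam/boilerplate)."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

def passes_quality(text: str) -> bool:
    return (
        len(text.split()) >= 50                   # length floor (assumed)
        and symbol_to_word_ratio(text) < 1.0      # assumed cutoff
        and repetition_fraction(text) < 0.3       # assumed cutoff
    )
```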

Deduplication

```mermaid
sequenceDiagram
    participant D1 as Document 1
    participant H as Hash Function
    participant DB as Dedup Database
    participant D2 as Document 2
    D1->>H: Hash content
    H->>DB: Check if hash exists
    DB->>H: Not found
    H->>DB: Store hash
    Note over D1: Keep Document 1
    D2->>H: Hash content (same)
    H->>DB: Check if hash exists
    DB->>H: Found!
    Note over D2: Discard Document 2<br/>(duplicate)
```

Why deduplication matters:

  • Prevents memorization of repeated content
  • Reduces training data size
  • Improves generalization
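
The sequence above reduces to a few lines of code for the exact-match case; a sketch is shown below. Near-duplicate detection (the "near duplicates" in the pipeline diagram) typically uses MinHash plus locality-sensitive hashing and is not shown.

```python
# Exact deduplication via content hashing, mirroring the sequence diagram above.
import hashlib

def dedupe(documents):
    seen = set()                      # stands in for the "dedup database"
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue                  # discard duplicate
        seen.add(digest)
        yield doc                     # keep first occurrence

docs = ["the same page", "a different page", "the same page"]
print(list(dedupe(docs)))             # -> ['the same page', 'a different page']
```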

Phase 2: Pre-Training

Objective: Next-Token Prediction

```mermaid
graph LR
    A[The cat sat on] --> B[Model]
    B --> C[Predict: the]
    D[The cat sat on the] --> E[Model]
    E --> F[Predict: mat]
    style B fill:#3498db
    style E fill:#3498db
```

Loss Function:

Loss = -log P(next_token | previous_tokens)

Minimize cross-entropy between predicted and actual next token
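
In PyTorch terms this is ordinary cross-entropy between the model's logits at each position and the token that actually comes next; the shift-by-one is the only subtle part. A minimal sketch:

```python
# Next-token prediction loss. `logits` has shape (batch, seq_len, vocab_size).
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    # Predict token t+1 from positions <= t: drop the last logit and the first target.
    shifted_logits = logits[:, :-1, :]          # (B, T-1, V)
    targets = token_ids[:, 1:]                  # (B, T-1)
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
    )

# Tiny smoke test with random "model outputs".
logits = torch.randn(2, 8, 1000)
tokens = torch.randint(0, 1000, (2, 8))
print(next_token_loss(logits, tokens))          # roughly log(1000) ≈ 6.9 at random init
```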

Training Infrastructure

```mermaid
graph TB
    A[Training Data<br/>1 trillion tokens] --> B[Distributed Training<br/>1000s of GPUs]
    B --> C[Data Parallel<br/>Split batches]
    B --> D[Model Parallel<br/>Split model layers]
    B --> E[Pipeline Parallel<br/>Split forward/backward]
    C & D & E --> F[Gradient Accumulation]
    F --> G[Optimizer Update<br/>Adam, AdamW]
    G --> H[Checkpoint<br/>Every 1000 steps]
    style B fill:#e74c3c
    style F fill:#f39c12
    style H fill:#2ecc71
```
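
Gradient accumulation, one box in the diagram above, is easy to sketch: run several micro-batches, scale each loss, and only step the optimizer once the accumulated gradients represent the full effective batch. The toy model, loss, and batch sizes below are assumptions chosen purely for illustration.

```python
# Gradient accumulation sketch: 8 micro-batches per optimizer update.
import torch

model = torch.nn.Linear(16, 16)                       # stand-in for an LLM
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8

for step in range(100):
    opt.zero_grad()
    for _ in range(accum_steps):
        x = torch.randn(4, 16)                        # one micro-batch
        loss = model(x).pow(2).mean() / accum_steps   # scale so gradients average
        loss.backward()                               # gradients accumulate in .grad
    opt.step()                                        # one update per 8 micro-batches
```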

Training Dynamics

```mermaid
graph LR
    A[Iteration 0<br/>Random weights<br/>Loss: 10.5] --> B[Iteration 10K<br/>Basic patterns<br/>Loss: 5.2]
    B --> C[Iteration 100K<br/>Grammar learned<br/>Loss: 3.1]
    C --> D[Iteration 1M<br/>Facts, reasoning<br/>Loss: 2.3]
    style A fill:#e74c3c
    style D fill:#2ecc71
```

Compute Requirements

Model: GPT-3 scale (175B parameters)
Data: 300B tokens
Hardware: ~1000 A100 GPUs
Time: ~1 month
Cost: ~$5M
Energy: ~1,300 MWh
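
These numbers are roughly consistent with the standard back-of-the-envelope estimate, training FLOPs ≈ 6 × parameters × tokens. A quick check, with an assumed sustained per-GPU throughput:

```python
# Back-of-the-envelope training cost: FLOPs ≈ 6 * N_params * N_tokens.
params = 175e9            # 175B parameters
tokens = 300e9            # 300B training tokens
flops = 6 * params * tokens                        # ≈ 3.15e23 FLOPs

per_gpu_flops = 150e12    # assumed sustained A100 throughput (utilization-dependent)
gpus = 1000
seconds = flops / (per_gpu_flops * gpus)
print(f"{flops:.2e} FLOPs ≈ {seconds / 86400:.0f} days on {gpus} GPUs")  # ~24 days
```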

Result: Base Model

```mermaid
graph TB
    A[Prompt:<br/>Translate to French:<br/>Hello] --> B[Base Model]
    B --> C[Output:<br/>world, how are you?<br/>I am fine, thank you.]
    C -.- N["Completes text<br/>but doesn't follow the instruction!"]
    style C fill:#e74c3c
```

Problem: Base models predict text, but don’t follow instructions.

Phase 3: Supervised Fine-Tuning (SFT)

Goal: Instruction Following

Train on examples of input → desired output:

```mermaid
graph LR
    A[Instruction:<br/>Translate to French:<br/>Hello] --> B[SFT Model]
    B --> C[Output:<br/>Bonjour]
    style C fill:#2ecc71
```

SFT Data Format

```mermaid
graph TB
    subgraph "Instruction Dataset"
        A1[Example 1:<br/>Input: Summarize this article<br/>Output: summary text]
        A2[Example 2:<br/>Input: Write a poem about cats<br/>Output: poem text]
        A3[Example 3:<br/>Input: Explain gravity<br/>Output: explanation]
    end
    A1 & A2 & A3 --> B[Fine-tune base model<br/>10K-100K examples]
    B --> C[SFT Model<br/>Follows instructions]
    style C fill:#2ecc71
```
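
Concretely, SFT data is usually stored as instruction/response records and flattened into one token sequence per example with a prompt template. The field names and template below are assumptions for illustration, not any particular dataset's schema.

```python
# Illustrative SFT records and a simple prompt template (both assumed).
sft_examples = [
    {"instruction": "Translate to French: Hello", "output": "Bonjour"},
    {"instruction": "Explain gravity", "output": "Gravity is the attraction..."},
]

TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{output}"

def to_training_text(example: dict) -> str:
    return TEMPLATE.format(**example)

print(to_training_text(sft_examples[0]))
```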

Data Sources

```mermaid
mindmap
  root((SFT Data Sources))
    Human-Written
      Annotators write examples
      Quality: High
      Cost: Expensive
    Synthetic
      GPT-4 generates examples
      Quality: Good
      Cost: Low
    Distillation
      Distill from stronger model
      Quality: Very good
      Cost: Medium
    Public Datasets
      FLAN, T0, OpenAssistant
      Quality: Variable
      Cost: Free
```

Training Process

```mermaid
sequenceDiagram
    participant D as Dataset
    participant M as Model
    participant L as Loss
    D->>M: Input: "Translate to French: Hello"
    M->>M: Generate: "Bonjour"
    D->>L: Compare with target: "Bonjour"
    L->>M: Loss = 0.01 (good!)
    D->>M: Input: "Explain gravity"
    M->>M: Generate: "Gravity is a fundamental force..."
    D->>L: Compare with target
    L->>M: Loss = 0.15 (update weights)
    M->>M: Update parameters<br/>via backpropagation
```
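
The loss is still next-token cross-entropy, but it is typically computed only on the response tokens; the prompt tokens are masked out so the model isn't trained to reproduce its own inputs. A sketch, assuming token IDs and a boolean response mask are already available:

```python
# SFT loss with prompt masking: positions whose target token belongs to the
# prompt get label -100, which cross_entropy ignores.
import torch
import torch.nn.functional as F

def sft_loss(logits, token_ids, response_mask):
    # logits: (B, T, V); token_ids: (B, T) long; response_mask: (B, T) bool,
    # True where the token is part of the response.
    labels = token_ids[:, 1:].clone()
    labels[~response_mask[:, 1:]] = -100            # don't train on the prompt
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```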

Result: Instruction-Following Model

Prompt: Write a haiku about AI

Output:
Silicon neurons
Learning patterns, weaving thought
Intelligence blooms

Better! But still has issues:

  • May generate harmful content
  • Inconsistent quality
  • Doesn’t match human preferences

Phase 4: Alignment (RLHF vs. DPO)

The Alignment Problem

```mermaid
graph TB
    A[Question:<br/>How do I break into a car?] --> B[SFT Model]
    B --> C[Response:<br/>detailed instructions<br/>for breaking into cars]
    C -.- N[Technically correct<br/>but harmful!]
    style C fill:#e74c3c
```

Goal: Align model outputs with human preferences (helpful, harmless, honest).

Alignment Method 1: RLHF (Reinforcement Learning from Human Feedback)

```mermaid
graph TB
    subgraph "Step 1: Collect Preference Data"
        A1[Prompt:<br/>Explain quantum computing] --> B1[Model Output A]
        A1 --> B2[Model Output B]
        B1 & B2 --> C[Human Evaluator]
        C --> D[Preference:<br/>B is better than A]
    end
    subgraph "Step 2: Train Reward Model"
        D --> E[Reward Model<br/>Predicts human preferences]
    end
    subgraph "Step 3: RL Optimization"
        E --> F[Use reward model<br/>to train policy via PPO]
        F --> G[Aligned Model]
    end
    style E fill:#f39c12
    style G fill:#2ecc71
```
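
Step 2 is usually trained with a pairwise (Bradley-Terry style) objective: the reward of the preferred response should exceed the reward of the rejected one. A minimal sketch of that loss, assuming the reward model already maps each (prompt, response) pair to a scalar:

```python
# Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # r_chosen, r_rejected: (batch,) scalar rewards for the preferred and
    # rejected responses to the same prompt.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example rewards for three preference pairs.
print(reward_model_loss(torch.tensor([1.2, 0.3, 2.0]),
                        torch.tensor([0.1, 0.5, 1.5])))
```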

RLHF: Detailed Flow

```mermaid
sequenceDiagram
    participant P as Prompt
    participant M as Model
    participant R as Reward Model
    participant O as Optimizer
    P->>M: Generate response
    M->>M: Response: "Here's how..."
    M->>R: Evaluate response
    R->>R: Score: 0.8 (good, helpful)
    R->>O: Reward signal
    O->>M: Update policy to maximize reward
    Note over M: Model learns to<br/>generate high-reward<br/>responses
```

RLHF Challenges

```mermaid
graph TB
    A[RLHF Challenges] --> B[Reward Hacking<br/>Model exploits reward<br/>without being helpful]
    A --> C[Training Instability<br/>RL is notoriously unstable]
    A --> D[Computational Cost<br/>Requires reward model + policy]
    style B fill:#e74c3c
    style C fill:#e74c3c
    style D fill:#e74c3c
```

Alignment Method 2: DPO (Direct Preference Optimization)

Key Insight: Skip the reward model, optimize directly from preferences!

```mermaid
graph TB
    A["Preference Data:<br/>Response A > Response B"] --> B[DPO Loss Function]
    B --> C[Increase prob of A<br/>Decrease prob of B]
    C --> D[Aligned Model<br/>No reward model needed!]
    style D fill:#2ecc71
```
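
The DPO objective can be written directly in terms of policy and reference-model log-probabilities of the chosen and rejected responses. A sketch is below; β controls how far the policy may drift from the reference model, and each log-probability is summed over the response tokens. Only the policy terms receive gradients; the reference log-probs come from the frozen pre-DPO model.

```python
# DPO loss: -log sigmoid(beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)))
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Each argument: (batch,) summed log-probabilities of the full response.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```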

DPO vs. RLHF

```mermaid
graph LR
    subgraph "RLHF Pipeline"
        A1[Preference Data] --> A2[Train Reward Model]
        A2 --> A3[RL with PPO]
        A3 --> A4[Aligned Model]
    end
    subgraph "DPO Pipeline"
        B1[Preference Data] --> B2[Direct Optimization]
        B2 --> B3[Aligned Model]
    end
    style A2 fill:#e74c3c
    style A3 fill:#e74c3c
    style B2 fill:#2ecc71
```

Advantages of DPO:

  • Simpler (no reward model)
  • More stable training
  • Lower compute cost
  • Comparable or better performance

Preference Data Format

```mermaid
graph TB
    A[Prompt:<br/>Write a professional email<br/>declining a meeting] --> B[Response A:<br/>Short, rude tone]
    A --> C[Response B:<br/>Professional, polite,<br/>offers alternatives]
    B & C --> D[Human Annotation]
    D --> E["Preference:<br/>B > A"]
    style E fill:#2ecc71
```
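
Each annotation is typically stored as one record per comparison. The field names below are a common convention rather than a fixed standard:

```python
# One preference comparison as a record (field names are an assumption).
preference_example = {
    "prompt": "Write a professional email declining a meeting",
    "chosen": "Thank you for the invitation. Unfortunately I can't attend...",
    "rejected": "Can't make it.",
}
```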

Phase 5: Evaluation

Evaluation Dimensions

```mermaid
mindmap
  root((LLM Evaluation))
    Capability
      Reasoning
        MMLU
      Math
        GSM8K
      Code
        HumanEval
      Commonsense
        HellaSwag
    Safety
      Toxicity
      Bias
      Refusal rate
    Alignment
      Helpfulness
      Harmlessness
      Honesty
    Efficiency
      Latency
      Throughput
      Memory usage
```

Benchmark Performance

```mermaid
graph LR
    A[Base Model<br/>MMLU: 45%<br/>GSM8K: 10%<br/>HumanEval: 15%] --> B["+ SFT<br/>MMLU: 65%<br/>GSM8K: 40%<br/>HumanEval: 45%"]
    B --> C["+ Alignment<br/>MMLU: 70%<br/>GSM8K: 55%<br/>HumanEval: 50%"]
    style A fill:#e74c3c
    style B fill:#f39c12
    style C fill:#2ecc71
```
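
Benchmarks like MMLU boil down to scoring the model's chosen answer against a gold label. A minimal accuracy loop follows, where `generate_answer` is a hypothetical stand-in for calling whatever model is being evaluated:

```python
# Minimal multiple-choice accuracy loop. `generate_answer(question, choices)`
# is hypothetical: it should return the model's chosen option letter.
def evaluate(questions, generate_answer) -> float:
    correct = 0
    for q in questions:
        prediction = generate_answer(q["question"], q["choices"]).strip().upper()
        correct += prediction == q["answer"]
    return correct / len(questions)

sample = [{"question": "2 + 2 = ?", "choices": ["A. 3", "B. 4"], "answer": "B"}]
print(evaluate(sample, lambda q, c: "B"))   # 1.0 with a dummy "model"
```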

Human Evaluation

```mermaid
graph TB
    A[Model Output] --> B[Human Raters]
    B --> C[Helpfulness:<br/>1-5 stars]
    B --> D[Harmlessness:<br/>1-5 stars]
    B --> E[Honesty:<br/>1-5 stars]
    C & D & E --> F[Aggregate Scores]
    F --> G[Model Ranking]
    style G fill:#2ecc71
```

A/B Testing

```mermaid
sequenceDiagram
    participant U as User
    participant A as Model A
    participant B as Model B
    participant E as Evaluator
    U->>A: Send prompt
    U->>B: Send prompt (same)
    A->>E: Response A
    B->>E: Response B
    E->>E: Which is better?
    E->>U: Show preferred response
    Note over E: Track preference rate:<br/>Model B preferred 65%
```
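
An A/B result is just a preference rate plus an uncertainty estimate. A small sketch using a normal-approximation confidence interval; the win counts are made up for illustration:

```python
# Preference rate with a 95% normal-approximation confidence interval.
import math

def preference_rate(wins_b: int, total: int, z: float = 1.96):
    p = wins_b / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, (p - half_width, p + half_width)

rate, ci = preference_rate(wins_b=650, total=1000)   # illustrative counts
print(f"Model B preferred {rate:.0%}, 95% CI {ci[0]:.0%}-{ci[1]:.0%}")
```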

The Feedback Loop: Continuous Improvement

```mermaid
graph TB
    A[Deploy Model v1] --> B[Collect User Interactions]
    B --> C[Identify Failures<br/>Incorrect responses<br/>Harmful outputs]
    C --> D[Create New Training Data<br/>Fix failure modes]
    D --> E[Fine-tune Model v2]
    E --> F[Evaluate Improvement]
    F -->|Better| A
    F -->|Not better| D
    style E fill:#2ecc71
    style F fill:#f39c12
```

Failure Mode Analysis

```mermaid
graph LR
    A[User Reports:<br/>Model refuses<br/>valid requests] --> B[Analyze Patterns]
    B --> C[Root Cause:<br/>Overly cautious<br/>safety filter]
    C --> D[Solution:<br/>Add nuanced examples<br/>to training data]
    D --> E[Retrain]
    E --> F[Validate Fix]
    style F fill:#2ecc71
```

Comparing Training Strategies

SFT vs. DPO vs. RLHF

```mermaid
graph TB
    A[Training Method] --> B[SFT<br/>Supervised Fine-Tuning]
    A --> C[DPO<br/>Direct Preference]
    A --> D[RLHF<br/>Reinforcement Learning]
    B --> B1["Data: Input → Output"]
    B --> B2[Simple, stable]
    B --> B3[Limited alignment]
    C --> C1["Data: A > B preferences"]
    C --> C2[Direct optimization]
    C --> C3[Good alignment]
    D --> D1[Data: Preference rankings]
    D --> D2[Complex RL training]
    D --> D3[Best alignment]
    style B2 fill:#2ecc71
    style C3 fill:#2ecc71
    style D3 fill:#2ecc71
    style D2 fill:#e74c3c
```

Data Requirements

| Method | Data Type | Data Size | Annotation Cost |
| --- | --- | --- | --- |
| Pre-training | Raw text | 1T tokens | Low (automated) |
| SFT | Input → Output | 10K-100K | Medium |
| DPO | A > B preferences | 10K-50K | High |
| RLHF | Preference rankings | 50K-200K | Very High |

The Data Flywheel

```mermaid
graph TB
    A[More Users] --> B[More Interactions]
    B --> C[More Feedback Data]
    C --> D[Better Model]
    D --> A
    style D fill:#2ecc71
```

Companies with large user bases (OpenAI, Google, Anthropic) have a massive advantage: continuous data collection from real usage.

Advanced Techniques

Constitutional AI

```mermaid
graph TB
    A[Model Output] --> B{Violates<br/>Constitution?}
    B -->|Yes| C[Self-Critique:<br/>Why is this harmful?]
    C --> D[Self-Revise:<br/>Generate safer version]
    B -->|No| E[Accept Output]
    D --> E
    style E fill:#2ecc71
```

Constitution: Set of principles (e.g., “Be helpful and harmless”)
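
The critique-and-revise loop above can be sketched as plain control flow around a model call. `ask_model` is a hypothetical function standing in for whatever LLM API is used, and the constitution text and acceptance check are illustrative assumptions:

```python
# Constitutional-AI-style self-revision sketch. `ask_model(prompt) -> str` is a
# hypothetical stand-in for an LLM call.
CONSTITUTION = "Be helpful and harmless; do not assist with illegal activity."

def constitutional_revision(ask_model, prompt: str) -> str:
    draft = ask_model(prompt)
    critique = ask_model(
        f"Constitution: {CONSTITUTION}\n\nResponse: {draft}\n\n"
        "Does this response violate the constitution? Answer yes or no, then explain."
    )
    if critique.lower().lstrip().startswith("no"):   # crude acceptance check (assumed)
        return draft
    return ask_model(
        f"Constitution: {CONSTITUTION}\n\nOriginal response: {draft}\n\n"
        f"Critique: {critique}\n\nRewrite the response so it follows the constitution."
    )
```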

Synthetic Data Generation

```mermaid
sequenceDiagram
    participant S as Seed Examples
    participant M as Strong Model (GPT-4)
    participant D as Dataset
    participant T as Target Model
    S->>M: Generate variations
    M->>M: Create 10K examples<br/>from 100 seeds
    M->>D: Synthetic dataset
    D->>T: Fine-tune target model
    Note over T: Learns from GPT-4<br/>without API costs
```
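
A seed-expansion loop along these lines (in the spirit of Self-Instruct) is sketched below. `strong_model` is again a hypothetical generation function, and the prompt wording and counts are assumptions; real pipelines also filter out near-duplicate generations:

```python
# Seed-expansion sketch. `strong_model(prompt) -> str` is a hypothetical
# stand-in for a stronger model's generate call.
import random

def expand_seeds(strong_model, seeds, target_size: int = 10_000):
    dataset = list(seeds)
    while len(dataset) < target_size:
        examples = random.sample(dataset, k=min(3, len(dataset)))
        prompt = (
            "Here are example instructions:\n"
            + "\n".join(f"- {e}" for e in examples)
            + "\nWrite one new, different instruction."
        )
        dataset.append(strong_model(prompt).strip())
    return dataset
```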

Mixture of Experts (MoE) Fine-Tuning

```mermaid
graph TB
    A[General Data] --> B[Pre-train Base Model]
    B --> C[Expert 1:<br/>Fine-tune on Code]
    B --> D[Expert 2:<br/>Fine-tune on Math]
    B --> E[Expert 3:<br/>Fine-tune on Creative]
    C & D & E --> F[Ensemble or Router]
    F --> G[Specialized Model]
    style G fill:#2ecc71
```
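
At inference time, "ensemble or router" can be as simple as sending the prompt to whichever expert a classifier picks. The keyword router below is a deliberately naive stand-in; production MoE routing is a learned gating network inside the model:

```python
# Naive keyword router over specialist models (illustrative only).
def route(prompt: str, experts: dict, default: str = "creative") -> str:
    lowered = prompt.lower()
    if any(k in lowered for k in ("def ", "class ", "bug", "compile")):
        return experts["code"](prompt)
    if any(k in lowered for k in ("solve", "integral", "equation")):
        return experts["math"](prompt)
    return experts[default](prompt)
```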

Real-World Pipeline: OpenAI GPT-4

```mermaid
graph TB
    A[Web Crawl<br/>+ Books + Code<br/>~10 TB] --> B["Filter & Dedupe<br/>~2 TB"]
    B --> C[Pre-training<br/>~13T tokens<br/>1000s GPUs, months]
    C --> D[Base Model<br/>Completes text]
    D --> E[SFT<br/>~10K human-written<br/>instructions]
    E --> F[SFT Model<br/>Follows instructions]
    F --> G[RLHF<br/>~50K preference<br/>comparisons]
    G --> H[GPT-4<br/>Aligned model]
    H --> I[Red Teaming<br/>Adversarial testing]
    I --> J[Deploy]
    style C fill:#e74c3c
    style E fill:#f39c12
    style G fill:#9b59b6
    style J fill:#2ecc71
```

Cost Breakdown

```mermaid
graph LR
    A[Total Cost:<br/>~$100M] --> B[Pre-training: $80M<br/>80%]
    A --> C[SFT: $5M<br/>5%]
    A --> D[RLHF: $10M<br/>10%]
    A --> E[Evaluation: $5M<br/>5%]
    style B fill:#e74c3c
    style D fill:#f39c12
```

Insight: Pre-training dominates cost, but alignment is critical for quality.

Key Decisions in LLM Development

1. Data Quality vs. Quantity

```mermaid
graph LR
    A[Strategy A:<br/>10T tokens<br/>Lower quality] --> C[Model Performance]
    B[Strategy B:<br/>1T tokens<br/>Higher quality] --> C
    C --> D["Strategy B often wins!<br/>Quality > Quantity"]
    style D fill:#2ecc71
```

2. SFT Data: Human vs. Synthetic

```mermaid
graph TB
    A[Human-Written<br/>10K examples<br/>Cost: $500K] --> C[Model Quality]
    B[Synthetic GPT-4<br/>100K examples<br/>Cost: $50K] --> C
    C --> D[Human better for nuance<br/>Synthetic better for scale]
    style D fill:#f39c12
```

3. Alignment: DPO vs. RLHF

```mermaid
graph LR
    A{Choose<br/>Alignment} --> B[DPO<br/>Simpler, stable<br/>Good results]
    A --> C[RLHF<br/>Complex, powerful<br/>Best results]
    B --> D[Most teams<br/>choose DPO]
    C --> E[Large labs<br/>use RLHF]
    style B fill:#2ecc71
    style C fill:#f39c12
```

The Future: Data-Centric AI

```mermaid
mindmap
  root((Future of LLM Training))
    Automated Data Curation
      AI-powered filtering
      Quality scoring models
      Automatic deduplication
    Synthetic Data
      Self-improvement loops
      AI tutors generate data
      Simulation environments
    Active Learning
      Model identifies gaps
      Request specific data
      Curriculum learning
    Multimodal
      Text + images + audio
      Unified training
      Cross-modal reasoning
```

Conclusion

Building production-quality LLMs is a data engineering challenge as much as a machine learning challenge:

  1. Data Curation: Filter trillions of tokens to high-quality datasets
  2. Pre-training: Learn language patterns from raw text
  3. SFT: Teach instruction following with curated examples
  4. Alignment: Shape behavior with human preferences (DPO/RLHF)
  5. Evaluation: Measure capabilities and safety
  6. Iteration: Continuous improvement from user feedback

The models that win aren’t just the biggest—they’re the ones with the best data pipelines.

Key Insight: Every breakthrough in LLM performance has been driven by better data, not just bigger models.

Key Takeaways

  • Pre-training: Learn from trillions of tokens (80% of compute)
  • SFT: Instruction following from curated examples
  • DPO vs. RLHF: DPO simpler, RLHF more powerful
  • Evaluation: Benchmarks + human eval + A/B testing
  • Feedback Loop: Continuous improvement from user data
  • Data Quality: More important than quantity
  • Alignment: Critical for safety and usability

The future of AI is data-centric. Master the data pipeline, and you master LLMs.

Further Reading