Introduction: It’s All About the Data
The secret to building great language models isn’t just architecture or compute—it’s data.
Every decision in the LLM lifecycle revolves around data:
- What data do we train on?
- How do we clean and filter it?
- How do we align the model with human preferences?
- How do we measure success?
Let’s trace the complete journey from raw text to a production-ready model, with data at the center.
The LLM Lifecycle: End-to-End
The lifecycle, start to finish:

1. Raw Data: web crawl, books, code
2. Data Curation: filter, dedupe, clean
3. Pre-training: learn language patterns from 100B+ tokens
4. Base Model: completes text, but is not aligned
5. Supervised Fine-Tuning (SFT): instruction following
6. SFT Model: follows instructions, but is not yet optimized
7. Alignment (RLHF or DPO): human preferences
8. Aligned Model: helpful, harmless, honest
9. Evaluation: benchmarks and human eval, feeding back into data curation for the next iteration
Phase 1: Data Collection and Curation
Raw Data Sources
- Web data: CommonCrawl, Wikipedia, Reddit, GitHub, StackOverflow
- Books: Project Gutenberg, published books, academic papers
- Code: GitHub repositories, StackOverflow code, documentation
- Specialized: scientific papers, legal documents, medical records
Data Curation Pipeline
1. Raw data: ~10 TB web crawl
2. Language detection: filter non-English text
3. Quality filtering: remove low-quality documents
4. Deduplication: remove exact and near duplicates
5. PII removal: redact personal information
6. Toxicity filtering: remove harmful content
7. Curated dataset: ~2 TB of high-quality text
Quality Filtering
Each document passes through a set of quality checks:

- Pass → keep: well-formed sentences, coherent content, proper grammar
- Fail → discard: gibberish, SEO spam, boilerplate
Metrics:
- Perplexity score (vs. reference model)
- Length distribution
- Symbol-to-word ratio
- Repetition detection
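A minimal sketch of how the checks above might be wired together; the thresholds and the repetition heuristic are illustrative assumptions, not production values.

```python
import re

def passes_quality_checks(doc: str,
                          min_words: int = 50,
                          max_symbol_ratio: float = 0.3,
                          max_repetition: float = 0.3) -> bool:
    """Cheap heuristic filters; thresholds are illustrative only."""
    words = doc.split()
    if len(words) < min_words:                       # too short to be useful
        return False

    # Symbol-to-word ratio: SEO spam and boilerplate tend to be symbol-heavy.
    symbols = len(re.findall(r"[^\w\s]", doc))
    if symbols / max(len(words), 1) > max_symbol_ratio:
        return False

    # Repetition: fraction of duplicated 3-word shingles.
    trigrams = [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]
    if trigrams and 1 - len(set(trigrams)) / len(trigrams) > max_repetition:
        return False

    return True

print(passes_quality_checks("buy cheap pills " * 40))  # repetitive spam -> False
```

Real pipelines also score each document with a small reference language model and drop high-perplexity outliers, which is harder to show in a few lines.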
Deduplication
Why deduplication matters:
- Prevents memorization of repeated content
- Reduces training data size
- Improves generalization
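A minimal sketch of the idea, using normalized hashes for exact duplicates and n-gram Jaccard similarity for near duplicates; real pipelines typically use MinHash/LSH instead of the pairwise comparison shown here.

```python
import hashlib

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def exact_hash(text: str) -> str:
    """Hash of normalized text, for exact-duplicate removal."""
    return hashlib.sha256(normalize(text).encode()).hexdigest()

def jaccard(a: str, b: str, n: int = 5) -> float:
    """N-gram (shingle) Jaccard similarity, for near-duplicate detection."""
    def shingles(t: str) -> set:
        toks = normalize(t).split()
        return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / max(len(sa | sb), 1)

docs = ["The cat sat on the mat today.",
        "the cat  sat on the mat today.",            # exact duplicate after normalization
        "The cat sat on the mat yesterday evening."]

seen, kept = set(), []
for d in docs:
    h = exact_hash(d)
    if h in seen or any(jaccard(d, k) > 0.8 for k in kept):
        continue                                      # drop exact or near duplicate
    seen.add(h)
    kept.append(d)
print(kept)
```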
Phase 2: Pre-Training
Objective: Next-Token Prediction
Example: given "The cat sat on", the model predicts "the"; given "The cat sat on the", it predicts "mat".
Loss Function:
Loss = -log P(next_token | previous_tokens)
Minimize cross-entropy between predicted and actual next token
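A minimal PyTorch sketch of this objective; the tiny vocabulary and random logits stand in for a real model's outputs.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 100, 8, 2

# Stand-ins for a real model: random logits over the vocabulary at each position.
logits = torch.randn(batch, seq_len, vocab_size)           # model output
tokens = torch.randint(0, vocab_size, (batch, seq_len))    # input token ids

# Predict token t+1 from positions up to t: shift the targets left by one.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)

# Cross-entropy = -log P(next_token | previous_tokens), averaged over positions.
loss = F.cross_entropy(pred, target)
print(loss.item())   # for comparison, a uniform predictor scores log(vocab_size) ≈ 4.6
```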
Training Infrastructure
- Training data: ~1 trillion tokens
- Distributed training across thousands of GPUs, combining:
  - Data parallelism: split batches across devices
  - Model parallelism: split model layers across devices
  - Pipeline parallelism: split the forward/backward pass into stages
- Gradient accumulation to build large effective batch sizes
- Optimizer updates (Adam or AdamW)
- Checkpointing every ~1,000 steps
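A minimal sketch of the gradient-accumulation part of this loop with a toy PyTorch model; in a real setup this would be wrapped in data/model/pipeline parallelism (e.g., FSDP or DeepSpeed), which is omitted here.

```python
import torch

model = torch.nn.Linear(512, 512)                     # toy stand-in for an LLM
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
accum_steps = 8                                       # micro-batches per optimizer step

for step in range(32):                                # fake training loop
    x = torch.randn(4, 512)                           # one micro-batch
    loss = model(x).pow(2).mean()                     # toy loss
    (loss / accum_steps).backward()                   # accumulate scaled gradients

    if (step + 1) % accum_steps == 0:
        opt.step()                                    # one update per accum_steps micro-batches
        opt.zero_grad()
        # In practice: save a model/optimizer checkpoint every N optimizer steps.
```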
Training Dynamics
Random weights
Loss: 10.5] --> B[Iteration 10K
Basic patterns
Loss: 5.2] B --> C[Iteration 100K
Grammar learned
Loss: 3.1] C --> D[Iteration 1M
Facts, reasoning
Loss: 2.3] style A fill:#e74c3c style D fill:#2ecc71
Compute Requirements
- Model: GPT-3 scale (175B parameters)
- Data: 300B tokens
- Hardware: ~1,000 A100 GPUs
- Time: ~1 month
- Cost: ~$5M
- Energy: ~1,300 MWh
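A rough back-of-envelope check of these numbers, using the common ≈6 FLOPs per parameter per token approximation for training compute; the GPU utilization figure is an assumption.

```python
params = 175e9                         # model parameters
tokens = 300e9                         # training tokens
flops = 6 * params * tokens            # ≈ 6·N·D training FLOPs approximation
print(f"{flops:.2e} FLOPs")            # ≈ 3.2e23

gpus = 1000
peak = 312e12                          # A100 peak dense BF16 throughput (FLOP/s)
util = 0.40                            # assumed utilization; real values vary
seconds = flops / (gpus * peak * util)
print(f"{seconds / 86400:.0f} days")   # ≈ 29 days, consistent with "~1 month"
```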
Result: Base Model
Prompt: "Translate to French: Hello"

Base model output: "world, how are you? I am fine, thank you."
Problem: Base models predict text, but don’t follow instructions.
Phase 3: Supervised Fine-Tuning (SFT)
Goal: Instruction Following
Train on examples of input → desired output:
Prompt: "Translate to French: Hello"

SFT model output: "Bonjour"
SFT Data Format
- Example 1: input "Summarize this article" → output: summary text
- Example 2: input "Write a poem about cats" → output: poem text
- Example 3: input "Explain gravity" → output: explanation

The base model is fine-tuned on 10K-100K examples like these to produce an SFT model that follows instructions.
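A sketch of how such examples might be stored on disk as JSONL; the field names (`instruction`, `input`, `response`) are a common convention, not a fixed standard.

```python
import json

sft_examples = [
    {"instruction": "Summarize this article", "input": "<article text>", "response": "<summary text>"},
    {"instruction": "Write a poem about cats", "input": "", "response": "<poem text>"},
    {"instruction": "Explain gravity", "input": "", "response": "<explanation>"},
]

# One JSON object per line (JSONL), the usual format for fine-tuning pipelines.
with open("sft_data.jsonl", "w") as f:
    for ex in sft_examples:
        f.write(json.dumps(ex) + "\n")
```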
Data Sources
| Source | Description | Quality | Cost |
|---|---|---|---|
| Human-written | Annotators write examples | High | Expensive |
| Synthetic | GPT-4 generates examples | Good | Low |
| Distillation | Distill from a stronger model | Very good | Medium |
| Public datasets | FLAN, T0, OpenAssistant | Variable | Free |
Training Process
During SFT, the base model's weights are updated via backpropagation on these instruction-response pairs.
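A minimal sketch of the key detail in SFT loss computation: the loss is usually masked so it only applies to response tokens, not prompt tokens. The token ids here are toy values.

```python
import torch
import torch.nn.functional as F

vocab = 100
prompt_ids = torch.tensor([5, 17, 42])          # "Translate to French: Hello" (toy ids)
response_ids = torch.tensor([7, 99])            # "Bonjour" (toy ids)

input_ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)
labels = input_ids.clone()
labels[0, :len(prompt_ids)] = -100              # -100 = ignored by cross_entropy

logits = torch.randn(1, input_ids.shape[1], vocab)   # stand-in for model(input_ids)
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                       labels[:, 1:].reshape(-1),
                       ignore_index=-100)       # gradient only on response tokens
print(loss.item())
```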
Result: Instruction-Following Model
Prompt: Write a haiku about AI
Output:
Silicon neurons
Learning patterns, weaving thought
Intelligence blooms
Better! But still has issues:
- May generate harmful content
- Inconsistent quality
- Doesn’t match human preferences
Phase 4: Alignment (RLHF vs. DPO)
The Alignment Problem
Prompt: "How do I break into a car?"

SFT model response: detailed instructions for breaking into cars. Technically a correct completion, but harmful.
Goal: Align model outputs with human preferences (helpful, harmless, honest).
Alignment Method 1: RLHF (Reinforcement Learning from Human Feedback)
1. Collect comparisons: for a prompt (e.g., "Explain quantum computing"), sample two model outputs, A and B; a human evaluator records a preference (e.g., B is better than A).
2. Train a reward model on these comparisons so it predicts human preferences.
3. RL optimization: use the reward model to train the policy via PPO, producing the aligned model.
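A minimal sketch of step 2, the reward model's pairwise (Bradley-Terry) loss; `reward_chosen` and `reward_rejected` stand in for the scalar scores a reward-model head would produce for the preferred and rejected responses.

```python
import torch
import torch.nn.functional as F

# Stand-ins for reward-model outputs on a batch of preference pairs.
reward_chosen = torch.randn(16, requires_grad=True)    # r(x, y_preferred)
reward_rejected = torch.randn(16, requires_grad=True)  # r(x, y_rejected)

# Bradley-Terry: maximize P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()    # gradients push chosen rewards up, rejected rewards down
print(loss.item())
```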
RLHF: Detailed Flow
For each prompt, the policy generates a response, the reward model scores it (e.g., "good, helpful"), and the optimizer updates the policy to maximize reward. Over many iterations, the model learns to generate high-reward responses.
RLHF Challenges
- Reward hacking: the model exploits the reward model without actually being helpful
- Training instability: RL is notoriously unstable
- Computational cost: requires a reward model in addition to the policy
Alignment Method 2: DPO (Direct Preference Optimization)
Key Insight: Skip the reward model, optimize directly from preferences!
Preference data (Response A > Response B) goes straight into the DPO loss function, which increases the probability of A and decreases the probability of B. The result is an aligned model with no reward model needed.
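A minimal sketch of the DPO loss, assuming you have already computed summed log-probabilities of each chosen/rejected response under the policy and under the frozen reference (SFT) model; `beta` controls how far the policy may drift from the reference.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: raise the chosen response's likelihood relative to the reference,
    lower the rejected one's, with no explicit reward model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch of summed sequence log-probs (normally computed from model logits).
pc = torch.randn(8, requires_grad=True)   # policy, chosen responses
pr = torch.randn(8, requires_grad=True)   # policy, rejected responses
rc, rr = torch.randn(8), torch.randn(8)   # frozen reference model
loss = dpo_loss(pc, pr, rc, rr)
loss.backward()
print(loss.item())
```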
DPO vs. RLHF
Advantages of DPO:
- Simpler (no reward model)
- More stable training
- Lower compute cost
- Comparable or better performance
Preference Data Format
- Prompt: "Write a professional email declining a meeting"
- Response A: short, rude tone
- Response B: professional, polite, offers alternatives
- Human annotation: B > A
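A sketch of one preference record as it might be stored for DPO or reward-model training; the `prompt`/`chosen`/`rejected` field names are a common convention, not a requirement, and the response texts are illustrative.

```python
import json

record = {
    "prompt": "Write a professional email declining a meeting",
    "chosen": "Thank you for the invitation. Unfortunately I can't make it on "
              "Thursday; could we meet Friday morning or early next week instead?",
    "rejected": "Can't come. Don't reschedule.",
}

# Append one JSON object per line to the preference dataset.
with open("preferences.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```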
Phase 5: Evaluation
Evaluation Dimensions
- Capability: reasoning (MMLU), math (GSM8K), code (HumanEval), commonsense (HellaSwag)
- Safety: toxicity, bias, refusal rate
- Alignment: helpfulness, harmlessness, honesty
- Efficiency: latency, throughput, memory usage
Benchmark Performance
| Stage | MMLU | GSM8K | HumanEval |
|---|---|---|---|
| Base model | 45% | 10% | 15% |
| + SFT | 65% | 40% | 45% |
| + Alignment | 70% | 55% | 50% |
Human Evaluation
Human raters score model outputs on helpfulness, harmlessness, and honesty (1-5 stars each); the scores are aggregated to produce an overall model ranking.
A/B Testing
Serve two model versions to real users and measure which responses they prefer (e.g., Model B preferred 65% of the time).
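A quick sketch of checking whether a 65% preference rate is statistically meaningful, using a normal approximation to the binomial; the sample size here is an assumption.

```python
import math

n = 1000            # assumed number of head-to-head comparisons
wins_b = 650        # Model B preferred 65% of the time

p_hat = wins_b / n
# Under H0 (no preference, p = 0.5), the standard error of p_hat:
se = math.sqrt(0.5 * 0.5 / n)
z = (p_hat - 0.5) / se
print(f"z = {z:.1f}")   # ≈ 9.5: far beyond 1.96, so the preference is significant
```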
The Feedback Loop: Continuous Improvement
1. Identify failures in deployment: incorrect responses, harmful outputs
2. Create new training data that targets those failure modes
3. Fine-tune model v2
4. Evaluate the improvement: if the new model is better, deploy it; if not, go back and revise the data
Failure Mode Analysis
- Failure: the model refuses valid requests
- Analyze patterns across the failures
- Root cause: an overly cautious safety filter
- Solution: add nuanced examples to the training data
- Retrain and validate the fix
Comparing Training Strategies
SFT vs. DPO vs. RLHF
| Method | Data | Training | Alignment |
|---|---|---|---|
| SFT (supervised fine-tuning) | Input → output pairs | Simple, stable | Limited |
| DPO (direct preference optimization) | A > B preferences | Direct optimization | Good |
| RLHF (reinforcement learning) | Preference rankings | Complex RL training | Best |
Data Requirements
| Method | Data Type | Data Size | Annotation Cost |
|---|---|---|---|
| Pre-training | Raw text | 1T tokens | Low (automated) |
| SFT | Input → Output | 10K-100K examples | Medium |
| DPO | A > B preferences | 10K-50K pairs | High |
| RLHF | Preference rankings | 50K-200K comparisons | Very High |
The Data Flywheel
Companies with large user bases (OpenAI, Google, Anthropic) have a massive advantage: continuous data collection from real usage.
Advanced Techniques
Constitutional AI
Each model output is checked against the constitution. If it violates a principle, the model self-critiques ("Why is this harmful?") and self-revises to produce a safer version; otherwise the output is accepted as-is.
Constitution: Set of principles (e.g., “Be helpful and harmless”)
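A minimal sketch of the critique-and-revise loop; `generate(prompt)` is a hypothetical stand-in for whatever call your model-serving stack exposes, and the constitution text is just an example principle.

```python
CONSTITUTION = "Be helpful and harmless; do not provide dangerous instructions."

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model being aligned."""
    raise NotImplementedError

def constitutional_revision(user_prompt: str) -> str:
    draft = generate(user_prompt)

    # Ask the model to critique its own draft against the constitution.
    critique = generate(
        f"Constitution: {CONSTITUTION}\n\nResponse: {draft}\n\n"
        "Does this response violate the constitution? "
        "Answer 'no violation' if it does not; otherwise explain the problem."
    )
    if "no violation" in critique.lower():
        return draft                      # accept the original output

    # Otherwise ask for a revision that satisfies the constitution.
    return generate(
        f"Constitution: {CONSTITUTION}\n\nOriginal response: {draft}\n\n"
        f"Critique: {critique}\n\nRewrite the response so it follows the constitution."
    )
```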
Synthetic Data Generation
A strong teacher model (e.g., GPT-4) expands roughly 100 seed prompts into a large synthetic dataset, which is then used to fine-tune the target model. The target learns GPT-4-like behavior without incurring ongoing API costs.
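A sketch of seed-based synthetic data generation in the spirit described above; `teacher_generate` is a hypothetical wrapper around whatever strong model you use as the teacher, and the seed prompts are illustrative.

```python
import json
import random

SEEDS = [
    "Explain gravity to a 10-year-old.",
    "Write a polite email declining a meeting.",
    "Summarize the plot of Romeo and Juliet.",
]  # ~100 seed prompts in a real run

def teacher_generate(prompt: str) -> str:
    """Hypothetical call to the teacher model (e.g., GPT-4)."""
    raise NotImplementedError

def make_synthetic_dataset(n_examples: int, path: str = "synthetic_sft.jsonl") -> None:
    with open(path, "w") as f:
        for _ in range(n_examples):
            # Ask the teacher to invent a new instruction similar to a few seeds...
            examples = "\n".join(random.sample(SEEDS, k=2))
            instruction = teacher_generate(
                f"Here are example instructions:\n{examples}\n"
                "Write one new, different instruction."
            )
            # ...then answer it, producing an (instruction, response) training pair.
            response = teacher_generate(instruction)
            f.write(json.dumps({"instruction": instruction, "response": response}) + "\n")
```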
Mixture of Experts (MoE) Fine-Tuning
A base model is fine-tuned into separate experts (e.g., Expert 1 on code, Expert 2 on math, Expert 3 on creative writing), which are then combined through an ensemble or a router into a single specialized model.
Real-World Pipeline: OpenAI GPT-4
1. Web crawl + books + code (~10 TB)
2. Filter & dedupe (~2 TB)
3. Pre-training (~13T tokens, thousands of GPUs, months)
4. Base model (completes text)
5. SFT (~10K human-written instructions)
6. SFT model (follows instructions)
7. RLHF (~50K preference comparisons)
8. GPT-4 (aligned model)
9. Red teaming (adversarial testing)
10. Deploy
Cost Breakdown
- Total: ~$100M
- Pre-training: ~$80M (80%)
- SFT: ~$5M (5%)
- RLHF: ~$10M (10%)
- Evaluation: ~$5M (5%)
Insight: Pre-training dominates cost, but alignment is critical for quality.
Key Decisions in LLM Development
1. Data Quality vs. Quantity
- Strategy A: 10T tokens, lower quality
- Strategy B: 1T tokens, higher quality

Strategy B often wins: quality beats quantity.
2. SFT Data: Human vs. Synthetic
- Human-written: 10K examples, cost ~$500K
- Synthetic (GPT-4): 100K examples, cost ~$50K

Human data is better for nuance; synthetic data is better for scale.
3. Alignment: DPO vs. RLHF
- DPO: simpler and more stable, with good results; most teams choose DPO
- RLHF: complex but powerful, with the best results; large labs use RLHF
The Future: Data-Centric AI
- Automated data curation: AI-powered filtering, quality-scoring models, automatic deduplication
- Synthetic data: self-improvement loops, AI tutors that generate data, simulation environments
- Active learning: the model identifies gaps and requests specific data; curriculum learning
- Multimodal: text + images + audio, unified training, cross-modal reasoning
Conclusion
Building production-quality LLMs is a data engineering challenge as much as a machine learning challenge:
- Data Curation: Filter trillions of tokens to high-quality datasets
- Pre-training: Learn language patterns from raw text
- SFT: Teach instruction following with curated examples
- Alignment: Shape behavior with human preferences (DPO/RLHF)
- Evaluation: Measure capabilities and safety
- Iteration: Continuous improvement from user feedback
The models that win aren’t just the biggest—they’re the ones with the best data pipelines.
Key Insight: Every breakthrough in LLM performance has been driven by better data, not just bigger models.
Key Takeaways
- Pre-training: Learn from trillions of tokens (80% of compute)
- SFT: Instruction following from curated examples
- DPO vs. RLHF: DPO simpler, RLHF more powerful
- Evaluation: Benchmarks + human eval + A/B testing
- Feedback Loop: Continuous improvement from user data
- Data Quality: More important than quantity
- Alignment: Critical for safety and usability
The future of AI is data-centric. Master the data pipeline, and you master LLMs.