Introduction: Words as Numbers
How do language models understand meaning? The answer lies in embeddings: representing words, sentences, and entire documents as vectors of numbers in high-dimensional space.
In this space:
- Similar words cluster together
- Analogies emerge as geometric relationships
- Meaning becomes computable through vector arithmetic
Let’s visualize this invisible geometry where meaning is distance.
From Words to Vectors
Traditional Approach: One-Hot Encoding
Vocabulary: cat, dog, king, queen, apple
- cat = [1, 0, 0, 0, 0]
- dog = [0, 1, 0, 0, 0]
- king = [0, 0, 1, 0, 0]
- queen = [0, 0, 0, 1, 0]
- apple = [0, 0, 0, 0, 1]
Problem: No semantic relationship!
- In one-hot space, “cat” and “dog” are exactly as far apart as “cat” and “apple”
- Sparse: 50,000 word vocabulary = 50,000 dimensions
- No generalization
Embedding Approach: Dense Vectors
Vocabulary: cat, dog, king, queen, apple, each mapped to a 768-dimensional dense vector:
- cat = [0.2, -0.5, 0.8, ..., 0.1]
- dog = [0.3, -0.4, 0.7, ..., 0.0]
- king = [-0.9, 0.2, -0.3, ..., 0.5]
- queen = [-0.8, 0.3, -0.2, ..., 0.6]
- apple = [0.1, 0.9, -0.6, ..., -0.4]
Advantages:
- Dense: Fixed dimensions (typically 384-4096) regardless of vocabulary size
- Semantic: Similar words have similar vectors
- Computable: Vector arithmetic captures meaning
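To see the difference concretely, here is a minimal numpy sketch comparing one-hot vectors with small illustrative dense vectors (toy values, not real model outputs):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors: every distinct word is orthogonal to every other word.
one_hot = {
    "cat":   np.array([1, 0, 0, 0, 0]),
    "dog":   np.array([0, 1, 0, 0, 0]),
    "apple": np.array([0, 0, 0, 0, 1]),
}
print(cosine(one_hot["cat"], one_hot["dog"]))    # 0.0 -- no notion of similarity
print(cosine(one_hot["cat"], one_hot["apple"]))  # 0.0 -- every word pair looks the same

# Toy dense vectors (illustrative 2D values only):
dense = {
    "cat":   np.array([0.2, 0.8]),
    "dog":   np.array([0.3, 0.7]),
    "apple": np.array([0.8, -0.6]),
}
print(cosine(dense["cat"], dense["dog"]))    # high -- related words point the same way
print(cosine(dense["cat"], dense["apple"]))  # low -- unrelated words diverge
```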
The Embedding Space: High-Dimensional Geometry
2D Projection (Simplified)
While real embeddings have 768+ dimensions, we can visualize the concept in 2D:
Toy 2D coordinates:
- Animals: cat (0.2, 0.8), dog (0.3, 0.7), mouse (0.1, 0.9)
- Royalty: king (-0.9, -0.3), queen (-0.8, -0.2), prince (-0.85, -0.25)
- Fruits: apple (0.8, -0.6), orange (0.7, -0.7), banana (0.75, -0.65)
Each category forms its own cluster.
Distance = Similarity: Words in the same cluster are semantically similar.
Measuring Similarity: Cosine Distance
Vector Similarity
Cosine similarity between vectors A and B: cos θ = (A · B) / (‖A‖ ‖B‖)
- cat vs. dog → similarity 0.92 (very similar)
- cat vs. apple → similarity 0.12 (not similar)
Similarity Scale
Cosine Similarity:
1.0 = Identical
0.9 = Very similar (synonyms)
0.7 = Related (same domain)
0.3 = Somewhat related
0.0 = Unrelated
-1.0 = Opposite (rare)
Example Similarity Matrix
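A small numpy sketch that builds such a matrix from the toy 2D coordinates in the projection above (illustrative values, not real embeddings):

```python
import numpy as np

# Toy 2D "embeddings" from the projection above (illustrative values only).
words = {
    "cat":    (0.2, 0.8),   "dog":    (0.3, 0.7),   "mouse":  (0.1, 0.9),
    "king":   (-0.9, -0.3), "queen":  (-0.8, -0.2), "prince": (-0.85, -0.25),
    "apple":  (0.8, -0.6),  "orange": (0.7, -0.7),  "banana": (0.75, -0.65),
}
names = list(words)
vecs = np.array([words[w] for w in names], dtype=float)

# Normalize the rows; one matrix product then gives all pairwise cosine similarities.
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sim = unit @ unit.T

print("        " + " ".join(f"{w:>7}" for w in names))
for w, row in zip(names, sim):
    print(f"{w:>7} " + " ".join(f"{s:7.2f}" for s in row))
# Within-cluster pairs (cat/dog, king/queen, apple/orange) score near 1.0;
# cross-cluster pairs score much lower or negative.
```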
Vector Arithmetic: The Magic of Embeddings
The Famous Analogy: King - Man + Woman = Queen
- Start with the king vector: [-0.9, 0.2, -0.3]
- Subtract the man vector [0.1, 0.1, 0.0] → result: [-1.0, 0.1, -0.3]
- Add the woman vector [0.2, 0.2, 0.0] → final vector: [-0.8, 0.3, -0.3]
- Find the nearest neighbor → the queen vector: [-0.8, 0.3, -0.2]
Geometric Interpretation
The direction from “man” to “king” encodes the concept of “royalty”. Apply the same direction to “woman” and you get “queen”!
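A minimal numpy sketch of this arithmetic, using the toy 3-dimensional vectors from the example above (illustrative values only):

```python
import numpy as np

# Toy 3-dimensional vectors from the example above (illustrative values only).
emb = {
    "king":  np.array([-0.9, 0.2, -0.3]),
    "man":   np.array([ 0.1, 0.1,  0.0]),
    "woman": np.array([ 0.2, 0.2,  0.0]),
    "queen": np.array([-0.8, 0.3, -0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman, then find the nearest word that wasn't part of the query.
target = emb["king"] - emb["man"] + emb["woman"]
candidates = {w: v for w, v in emb.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```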
More Analogies
- Geography: Paris - France + Italy = Rome; Tokyo - Japan + China = Beijing
- Tense: walking - walk + swim = swimming; ran - run + jump = jumped
- Plural: cats - cat + dog = dogs; mice - mouse + house = houses
- Comparative: better - good + bad = worse; bigger - big + small = smaller
Clustering: Semantic Neighborhoods
t-SNE Projection (Simplified)
t-SNE reduces high-dimensional embeddings to 2D while preserving local structure:
How Embeddings Are Created
Word2Vec: Context Prediction
Predict words from their context:
With a context window of size 2 around “fox” in “… quick brown fox jumps …”:
- Input: (quick, brown) → target: fox
- Input: (brown, jumps) → target: fox
A neural network learns embeddings from these (context, target) pairs, so the “fox” embedding ends up capturing its typical contexts.
Insight: Words appearing in similar contexts get similar embeddings.
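A short training sketch with gensim, assuming gensim 4.x is installed (older versions name the vector_size parameter size); the toy corpus is illustrative:

```python
from gensim.models import Word2Vec

# A tiny toy corpus; real training uses millions of sentences.
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "brown", "fox", "runs", "through", "the", "forest"],
    ["the", "lazy", "dog", "sleeps", "near", "the", "fox"],
]

# Skip-gram (sg=1) with a context window of 2, matching the example above.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["fox"][:5])            # first few dimensions of the learned vector
print(model.wv.most_similar("fox"))   # neighbors reflect shared contexts
```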
Transformer Embeddings: Contextual
Modern transformers create contextual embeddings—the same word has different embeddings in different contexts:
- “I went to the bank to deposit money” → “bank” embedding ≈ financial institution
- “I sat on the river bank watching the water” → “bank” embedding ≈ riverside edge
Contextualization: the final embedding of “bank” depends on the full surrounding context, which is why it lands near “financial institution” in the first sentence and near the riverside meaning in the second.
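A sketch of how to observe this with the transformers library, assuming torch, transformers, and the bert-base-uncased checkpoint are available; the two sentences are the ones above:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

a = bank_vector("I went to the bank to deposit money")
b = bank_vector("I sat on the river bank watching the water")
sim = F.cosine_similarity(a, b, dim=0)
print(f"same word, different contexts, cosine similarity: {sim.item():.2f}")
```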
Sentence and Document Embeddings
From Words to Sentences
Sentence: “The cat sat on the mat” → tokenize → one embedding per token (the, cat, sat, on, the, mat) → combine into a single vector with one of several strategies:
- Mean pooling: average all token embeddings
- CLS token: use the embedding of the special [CLS] token
- Max pooling: take the maximum of each dimension
The result is a single sentence embedding (e.g., 768 dimensions).
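A minimal mean-pooling sketch with transformers and torch, assuming the bert-base-uncased checkpoint is available:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool token embeddings into one fixed-size sentence vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1).float()  # ignore padding positions
    # CLS pooling would instead take hidden[0, 0]; max pooling, hidden.max(dim=1).values
    return (hidden * mask).sum(dim=1).squeeze(0) / mask.sum(dim=1).squeeze(0)

vec = sentence_embedding("The cat sat on the mat")
print(vec.shape)  # torch.Size([768]) -- one vector for the whole sentence
```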
Sentence Similarity
- “The cat sat on the mat” vs. “A feline rested on a rug” → similarity 0.89 (semantically similar)
- “The cat sat on the mat” vs. “Quantum mechanics is fascinating” → similarity 0.12 (not similar)
Applications: Putting Embeddings to Work
1. Semantic Search
Query: “best pizza recipes” → encoded into a 768-dim embedding. Candidate documents:
- Document 1: “How to make pizza” → Embedding 1
- Document 2: “Italian cooking guide” → Embedding 2
- Document 3: “Car maintenance” → Embedding 3
Cosine similarity between the query embedding and each document embedding gives the ranking: Doc 1 (0.94), Doc 2 (0.78), Doc 3 (0.05).
Traditional search: keyword matching.
Semantic search: meaning-based matching.
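A semantic-search sketch, assuming the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint (384 dimensions rather than 768) are available; the query and documents follow the example above:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "best pizza recipes"
docs = ["How to make pizza", "Italian cooking guide", "Car maintenance"]

q = model.encode(query)   # shape (384,)
d = model.encode(docs)    # shape (3, 384)

# Cosine similarity = dot product of L2-normalized vectors.
q = q / np.linalg.norm(q)
d = d / np.linalg.norm(d, axis=1, keepdims=True)
scores = d @ q

for doc, score in sorted(zip(docs, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {doc}")   # pizza-related documents rank first
```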
2. Clustering and Topic Modeling
Documents → embeddings → K-Means clustering (k = 5) → clusters:
- Cluster 1: Sports
- Cluster 2: Politics
- Cluster 3: Technology
- Cluster 4: Entertainment
- Cluster 5: Business
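A clustering sketch with scikit-learn's KMeans on sentence embeddings; the six documents and k = 3 are illustrative stand-ins for the larger news corpus and k = 5 above, and the checkpoint is the same assumed one as before:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = [
    "The team won the championship game",
    "The striker scored twice in the final",
    "Parliament passed the new budget bill",
    "The senator announced her campaign",
    "The new GPU doubles training throughput",
    "The startup released an open-source compiler",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs)

# Cluster the embedding vectors; similar topics should share a label.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(embeddings)
for doc, label in zip(docs, kmeans.labels_):
    print(label, doc)
```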
3. Recommendation Systems
User read an article about Python → Python article embedding. Candidate articles:
- Article 1: JavaScript tutorial → similarity 0.82 (programming-related)
- Article 2: Machine learning → similarity 0.91 (Python + ML)
- Article 3: Cooking recipes → similarity 0.03 (unrelated)
Recommendation: Article 2.
4. Anomaly Detection
Normal messages sit close to their cluster center (distance ≈ 0.15), while an outlier mentioning “blockchain synergy” sits far away (distance ≈ 0.98) and gets flagged as a potential spam/fraud anomaly.
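A centroid-distance sketch of this idea; the four messages are invented toy inputs, and the sentence-transformers checkpoint is again assumed to be available:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

messages = [
    "Your package has shipped and arrives Tuesday",
    "Meeting moved to 3pm, see the updated invite",
    "Lunch tomorrow? The usual place works for me",
    "EXCLUSIVE blockchain synergy opportunity, act NOW!!!",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(messages)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Cosine distance of each message from the centroid of all messages.
center = emb.mean(axis=0)
center = center / np.linalg.norm(center)
distances = 1.0 - emb @ center

for msg, dist in zip(messages, distances):
    print(f"{dist:.2f}  {msg}")
print("anomaly candidate:", messages[int(np.argmax(distances))])
```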
Dimensionality Reduction: Seeing the Invisible
The Challenge
Real embeddings have 768-4096 dimensions. Humans can’t visualize beyond 3D. Solution: Reduce dimensions while preserving structure.
PCA (Principal Component Analysis)
Find directions of maximum variance:
High-dimensional embeddings → PCA → principal components ranked by variance:
- Component 1: most variance (e.g., topic)
- Component 2: second most (e.g., sentiment)
- Component 3: third most (e.g., formality)
Projecting onto the first two components gives a 2D visualization.
Result: Projects high-dimensional data to 2D while preserving global structure.
t-SNE (t-Distributed Stochastic Neighbor Embedding)
Preserves local neighborhoods:
High-dimensional embedding space → t-SNE (preserves local structure) → 2D projection with visible clusters.
Advantage: better at revealing clusters than PCA
Disadvantage: doesn't preserve global distances
UMAP (Uniform Manifold Approximation and Projection)
Best of both worlds:
- Preserves local structure
- Preserves global structure
- Faster than t-SNE
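A sketch of both projections with scikit-learn, run on synthetic clustered data standing in for real embeddings (UMAP lives in the separate umap-learn package but follows the same fit_transform pattern):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for embeddings: 3 clusters of 30 points in 100 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 100)) * 5
X = np.vstack([c + rng.normal(size=(30, 100)) for c in centers])

# PCA: linear projection that preserves global variance.
xy_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: nonlinear projection that preserves local neighborhoods.
xy_tsne = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X)

print(xy_pca.shape, xy_tsne.shape)  # both (90, 2), ready to scatter-plot
```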
Visualization Example: Word Embeddings
Concept: Visualizing Word2Vec
Observations:
- Royalty terms cluster together (blue)
- Gender terms form parallel structures
- Geographic pairs maintain relationships
- Semantic categories remain grouped
Multilingual Embeddings: Cross-Language Geometry
Aligned Embedding Spaces
Modern multilingual models (e.g., multilingual BERT) create shared embedding spaces in which a sentence and its translations land near each other.
Application: Zero-shot cross-lingual transfer
- Train sentiment classifier on English
- Apply directly to Spanish, French, Japanese
Fine-Tuning Embeddings: Task-Specific Geometry
Generic Embeddings
Pre-trained on general text:
Generic embeddings capture broad semantic relationships:
- apple ≈ orange (fruits)
- bank ≈ finance (general association)
Fine-Tuned for Medical Domain
Fine-tuning on medical text produces domain-specific relationships:
- diabetes ≈ insulin (strong medical link)
- hypertension ≈ cardiovascular (clinical relationship)
Embedding Space Shift: Fine-tuning reshapes the geometry to emphasize domain-relevant relationships.
Probing Embeddings: What Do They Encode?
Linear Probing
Train simple classifiers on embeddings to detect encoded information:
Frozen embeddings feed several linear probes:
- Part-of-speech probe: accuracy 94% → POS is encoded
- Sentiment probe: accuracy 78% → sentiment is partly encoded
- Named-entity probe: accuracy 89% → entity type is encoded
Discovery: Embeddings implicitly encode linguistic properties even without explicit training!
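A toy linear probe with scikit-learn, probing frozen sentence embeddings for sentiment; the eight labeled sentences are illustrative and far too few for a real probing study:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

texts = [
    "I loved this movie", "Absolutely wonderful experience", "Great food and service",
    "What a fantastic day", "I hated this movie", "Terrible experience overall",
    "The food was awful", "What a miserable day",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

model = SentenceTransformer("all-MiniLM-L6-v2")
X = model.encode(texts)                      # embeddings stay frozen

probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.score(X, labels))                # accuracy of the linear probe
# A real probing study trains and evaluates on held-out data with thousands of examples.
```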
Embedding Quality: Metrics
Intrinsic Evaluation: Analogies
Test: king - man + woman = ?
Expected: queen
Model output: queen ✓
Test: paris - france + italy = ?
Expected: rome
Model output: rome ✓
Accuracy: 85% on 10,000 analogy pairs
Extrinsic Evaluation: Downstream Tasks
Feeding the embeddings into downstream tasks, for example:
- Classification-style task: accuracy 92%
- Named Entity Recognition: F1 88%
- Question Answering: EM 76%
Better embeddings → Better downstream performance
Challenges and Limitations
1. Bias in Embeddings
Embeddings reflect biases in their training data; for example, gender-stereotyped analogies such as “man : computer programmer :: woman : homemaker” have been reported for word embeddings.
Solution: Debiasing techniques, careful data curation
2. Polysemy (Multiple Meanings)
Static embeddings struggle with words with multiple meanings:
"bank" embedding is average of:
- Financial institution (70%)
- River bank (20%)
- Memory bank (10%)
Result: Embedding doesn't fit any meaning well
Solution: Contextual embeddings (BERT, GPT)
3. Out-of-Vocabulary Words
Unknown words simply have no embedding: a word seen during training maps to its stored vector, while an unseen word has no entry at all.
Solutions: Subword tokenization (BPE), character-level models
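A quick look at subword tokenization, assuming the transformers package and the GPT-2 tokenizer (byte-level BPE) are available:

```python
from transformers import AutoTokenizer

# GPT-2 uses byte-level BPE; even unseen words decompose into known subword pieces,
# so every word gets some representation.
tok = AutoTokenizer.from_pretrained("gpt2")

print(tok.tokenize("embedding"))              # a few frequent pieces
print(tok.tokenize("hyperparameterization"))  # rare word -> several subword units
```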
Future Directions
1. Dynamic Embeddings
Embeddings that evolve with time:
"tweet" embedding in 2005: Bird sound
"tweet" embedding in 2025: Social media post
2. Multimodal Embeddings
Unified space for text, images, audio:
A shared embedding space places “cat” as text, an image of a cat, and a “meow” audio clip near one another.
Example: CLIP (OpenAI) aligns images and text
3. Compression
Smaller embeddings without quality loss:
Traditional: 768 dimensions
Compressed: 128 dimensions (6× smaller)
Quality loss: <2%
Conclusion
Embeddings transform language from discrete symbols into continuous geometric space where meaning is computable.
This invisible geometry enables:
- Semantic similarity: Finding related concepts
- Analogies: Reasoning through vector arithmetic
- Clustering: Automatic topic discovery
- Search: Meaning-based retrieval
- Transfer learning: Reusing knowledge across tasks
Understanding embeddings is understanding how modern AI represents knowledge. Every LLM, every semantic search system, every recommendation engine relies on this geometric foundation.
The next time you search for something and get eerily relevant results, or an AI understands your paraphrased question—thank embeddings. They’re the invisible scaffolding of modern AI.
Key Takeaways
- Embeddings: Dense vector representations of text
- Semantic Space: Similar meanings → similar vectors
- Vector Arithmetic: Analogies emerge from geometry
- Clustering: Automatic discovery of semantic categories
- Dimensionality Reduction: Visualizing high-dimensional spaces
- Contextual Embeddings: Same word, different contexts, different vectors
- Applications: Search, recommendations, classification, clustering
- Trade-offs: Generic vs. specialized, static vs. contextual
Embeddings are the lingua franca of modern NLP. Master them, and you master the geometry of meaning.