Introduction: Words as Numbers

How do language models understand meaning? The answer lies in embeddings: representing words, sentences, and entire documents as vectors of numbers in high-dimensional space.

In this space:

  • Similar words cluster together
  • Analogies emerge as geometric relationships
  • Meaning becomes computable through vector arithmetic

Let’s visualize this invisible geometry where meaning is distance.

From Words to Vectors

Traditional Approach: One-Hot Encoding

graph TB
    A["Vocabulary:<br/>cat, dog, king, queen, apple"]
    A --> B["cat = 1,0,0,0,0"]
    A --> C["dog = 0,1,0,0,0"]
    A --> D["king = 0,0,1,0,0"]
    A --> E["queen = 0,0,0,1,0"]
    A --> F["apple = 0,0,0,0,1"]
    style B fill:#e74c3c
    style C fill:#e74c3c
    style D fill:#e74c3c
    style E fill:#e74c3c
    style F fill:#e74c3c

Problem: No semantic relationship!

  • “cat” and “dog” are just as far apart as “cat” and “apple” (see the sketch after this list)
  • Sparse: 50,000 word vocabulary = 50,000 dimensions
  • No generalization
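
A minimal NumPy sketch of the “equal distance” problem: with one-hot vectors, every pair of distinct words sits at exactly the same distance, so the encoding carries no notion of similarity.

import numpy as np

vocab = ["cat", "dog", "king", "queen", "apple"]
one_hot = np.eye(len(vocab))               # each word is a row with a single 1

cat, dog, apple = one_hot[0], one_hot[1], one_hot[4]
print(np.linalg.norm(cat - dog))           # 1.414... (square root of 2)
print(np.linalg.norm(cat - apple))         # 1.414... exactly the same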

Embedding Approach: Dense Vectors

graph TB
    A["Vocabulary:<br/>cat, dog, king, queen, apple"]
    A --> B["cat = 0.2, -0.5, 0.8, ..., 0.1<br/>768 dimensions"]
    A --> C["dog = 0.3, -0.4, 0.7, ..., 0.0<br/>768 dimensions"]
    A --> D["king = -0.9, 0.2, -0.3, ..., 0.5<br/>768 dimensions"]
    A --> E["queen = -0.8, 0.3, -0.2, ..., 0.6<br/>768 dimensions"]
    A --> F["apple = 0.1, 0.9, -0.6, ..., -0.4<br/>768 dimensions"]
    style B fill:#2ecc71
    style C fill:#2ecc71
    style D fill:#2ecc71
    style E fill:#2ecc71
    style F fill:#2ecc71

Advantages:

  • Dense: Fixed dimensions (typically 384-4096) regardless of vocabulary size
  • Semantic: Similar words have similar vectors
  • Computable: Vector arithmetic captures meaning
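
A minimal sketch of producing dense vectors, assuming the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint (384 dimensions); any embedding model works the same way.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed checkpoint, 384 dims
words = ["cat", "dog", "king", "queen", "apple"]

vectors = model.encode(words)    # dense float vectors, shape (5, 384)
print(vectors.shape)
print(vectors[0][:5])            # first few coordinates of "cat"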

The Embedding Space: High-Dimensional Geometry

2D Projection (Simplified)

While real embeddings have 768+ dimensions, we can visualize the concept in 2D:

graph TB
    subgraph "Semantic Clusters in 2D Space"
        A1["cat<br/>0.2, 0.8"] -.-> A2["dog<br/>0.3, 0.7"]
        A2 -.-> A3["mouse<br/>0.1, 0.9"]
        A1 -.-> A3
        B1["king<br/>-0.9, -0.3"] -.-> B2["queen<br/>-0.8, -0.2"]
        B2 -.-> B3["prince<br/>-0.85, -0.25"]
        B1 -.-> B3
        C1["apple<br/>0.8, -0.6"] -.-> C2["orange<br/>0.7, -0.7"]
        C2 -.-> C3["banana<br/>0.75, -0.65"]
        C1 -.-> C3
    end
    style A1 fill:#e74c3c
    style A2 fill:#e74c3c
    style A3 fill:#e74c3c
    style B1 fill:#3498db
    style B2 fill:#3498db
    style B3 fill:#3498db
    style C1 fill:#2ecc71
    style C2 fill:#2ecc71
    style C3 fill:#2ecc71

Clusters:

  • Red: Animals
  • Blue: Royalty
  • Green: Fruits

Distance = Similarity: Words in the same cluster are semantically similar.

Measuring Similarity: Cosine Distance

Vector Similarity

graph LR
    A["cat<br/>Vector"] --> C["Cosine Similarity<br/>cos θ = A·B / (‖A‖ ‖B‖)"]
    B["dog<br/>Vector"] --> C
    C --> D["Similarity: 0.92<br/>Very similar"]
    E["cat<br/>Vector"] --> F["Cosine Similarity"]
    G["apple<br/>Vector"] --> F
    F --> H["Similarity: 0.12<br/>Not similar"]
    style D fill:#2ecc71
    style H fill:#e74c3c

Similarity Scale

Cosine Similarity:
1.0  = Identical
0.9  = Very similar (synonyms)
0.7  = Related (same domain)
0.3  = Somewhat related
0.0  = Unrelated
-1.0 = Opposite (rare)
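
A small NumPy implementation of the formula above; the toy 3-dimensional vectors are made up for illustration.

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (|A| * |B|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat   = np.array([0.2, -0.5, 0.8])
dog   = np.array([0.3, -0.4, 0.7])
apple = np.array([0.8,  0.3, 0.1])

print(round(cosine_similarity(cat, dog), 2))    # high: same neighborhood
print(round(cosine_similarity(cat, apple), 2))  # low: different neighborhood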

Example Similarity Matrix

graph TB
    A[Word Similarity Matrix]
    A --> B["cat ↔ dog: 0.92"]
    A --> C["cat ↔ mouse: 0.85"]
    A --> D["cat ↔ king: 0.15"]
    A --> E["cat ↔ apple: 0.08"]
    A --> F["king ↔ queen: 0.94"]
    A --> G["apple ↔ orange: 0.91"]
    style B fill:#2ecc71
    style C fill:#2ecc71
    style F fill:#2ecc71
    style G fill:#2ecc71
    style D fill:#e74c3c
    style E fill:#e74c3c

Vector Arithmetic: The Magic of Embeddings

The Famous Analogy: King - Man + Woman = Queen

graph TB
    A["King Vector<br/>-0.9, 0.2, -0.3"] --> B["Subtract Man Vector<br/>0.1, 0.1, 0.0"]
    B --> C["Result:<br/>-1.0, 0.1, -0.3"]
    C --> D["Add Woman Vector<br/>0.2, 0.2, 0.0"]
    D --> E["Final Vector:<br/>-0.8, 0.3, -0.3"]
    E --> F[Find Nearest Neighbor]
    F --> G["Queen Vector<br/>-0.8, 0.3, -0.2"]
    style E fill:#f39c12
    style G fill:#2ecc71

Geometric Interpretation

graph LR
    subgraph "Vector Space Relationships"
        A[Man] -->|+ Royalty| B[King]
        C[Woman] -->|+ Royalty| D[Queen]
        B -.->|"- Man + Woman"| D
    end
    style A fill:#95a5a6
    style B fill:#3498db
    style C fill:#95a5a6
    style D fill:#e74c3c

The direction from “man” to “king” encodes the concept of “royalty”. Apply the same direction to “woman” and you get “queen”!
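
The same arithmetic in a few lines of NumPy, using toy 3-dimensional vectors that mirror the diagram above; real models do this over their full vocabulary.

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vocab = {
    "king":  np.array([-0.9, 0.2, -0.3]),
    "queen": np.array([-0.8, 0.3, -0.2]),
    "man":   np.array([ 0.1, 0.1,  0.0]),
    "woman": np.array([ 0.2, 0.2,  0.0]),
    "apple": np.array([ 0.1, 0.9, -0.6]),
}

target = vocab["king"] - vocab["man"] + vocab["woman"]
best = max((w for w in vocab if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vocab[w]))
print(best)  # queen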

More Analogies

mindmap
  root((Vector Analogies))
    Geography
      Paris - France + Italy = Rome
      Tokyo - Japan + China = Beijing
    Tense
      walking - walk + swim = swimming
      ran - run + jump = jumped
    Plural
      cats - cat + dog = dogs
      mice - mouse + house = houses
    Comparative
      better - good + bad = worse
      bigger - big + small = smaller

Clustering: Semantic Neighborhoods

Hierarchical Clustering

graph TB
    A[All Words] --> B[Living Things]
    A --> C[Objects]
    A --> D[Concepts]
    B --> E[Animals]
    B --> F[Plants]
    E --> G[Mammals]
    E --> H[Birds]
    G --> I["cat, dog, mouse"]
    H --> J["sparrow, eagle"]
    F --> K["tree, flower"]
    C --> L[Furniture]
    C --> M[Food]
    L --> N["chair, table"]
    M --> O["apple, bread"]
    D --> P[Emotions]
    D --> Q[Actions]
    style I fill:#e74c3c
    style J fill:#f39c12
    style K fill:#2ecc71
    style N fill:#3498db
    style O fill:#9b59b6

t-SNE Projection (Simplified)

t-SNE reduces high-dimensional embeddings to 2D while preserving local structure:

graph TB
    subgraph "t-SNE 2D Projection"
        A1[dog] & A2[cat] & A3[wolf] & A4[fox]
        B1[car] & B2[truck] & B3[bus]
        C1[happy] & C2[joyful] & C3[excited]
        D1[red] & D2[blue] & D3[green]
        E1[run] & E2[walk] & E3[sprint]
    end
    style A1 fill:#e74c3c
    style A2 fill:#e74c3c
    style A3 fill:#e74c3c
    style A4 fill:#e74c3c
    style B1 fill:#3498db
    style B2 fill:#3498db
    style B3 fill:#3498db
    style C1 fill:#f39c12
    style C2 fill:#f39c12
    style C3 fill:#f39c12
    style D1 fill:#2ecc71
    style D2 fill:#2ecc71
    style D3 fill:#2ecc71
    style E1 fill:#9b59b6
    style E2 fill:#9b59b6
    style E3 fill:#9b59b6

How Embeddings Are Created

Word2Vec: Context Prediction

Predict words from their context:

graph LR
    A[The quick brown fox jumps] --> B["Context Window<br/>Size = 2"]
    B --> C["Input: quick, brown<br/>Target: fox"]
    B --> D["Input: brown, jumps<br/>Target: fox"]
    C & D --> E["Neural Network<br/>Learn Embeddings"]
    E --> F["fox embedding<br/>captures context"]
    style E fill:#f39c12
    style F fill:#2ecc71

Insight: Words appearing in similar contexts get similar embeddings.
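
A sketch of this idea with gensim’s Word2Vec (assumed installed); real embeddings need millions of sentences, so this only shows the mechanics.

from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps"],
    ["the", "lazy", "dog", "sleeps"],
    ["a", "quick", "brown", "dog", "jumps"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["fox"][:5])                    # learned 50-d vector for "fox"
print(model.wv.most_similar("fox", topn=2))   # neighbors (noisy on toy data)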

Transformer Embeddings: Contextual

Modern transformers create contextual embeddings—the same word has different embeddings in different contexts:

graph TB
    subgraph "Bank in Different Contexts"
        A1["I went to the bank<br/>to deposit money"] --> B1["bank embedding<br/>Financial institution"]
        A2["I sat on the river bank<br/>watching the water"] --> B2["bank embedding<br/>Riverside edge"]
    end
    style B1 fill:#3498db
    style B2 fill:#2ecc71

Contextualization Process:

sequenceDiagram
    participant T as Token "bank"
    participant E as Initial Embedding
    participant A as Attention Layers
    participant O as Contextualized Embedding
    T->>E: Look up static embedding
    E->>A: Process through 12 layers
    A->>A: Attend to "deposit money"
    A->>O: Adjust embedding toward "financial institution"
    Note over O: Final embedding depends on full context!
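
A sketch of contextualization in code, assuming the transformers and torch packages and the bert-base-uncased checkpoint; the two “bank” vectors come out measurably different.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (tokens, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = inputs["input_ids"][0].tolist().index(bank_id)
    return hidden[position]

money = bank_vector("I went to the bank to deposit money")
river = bank_vector("I sat on the river bank watching the water")
print(torch.cosine_similarity(money, river, dim=0))  # well below 1.0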

Sentence and Document Embeddings

From Words to Sentences

graph TB
    A["Sentence:<br/>The cat sat on the mat"] --> B[Tokenize]
    B --> C1["the<br/>embedding"]
    B --> C2["cat<br/>embedding"]
    B --> C3["sat<br/>embedding"]
    B --> C4["on<br/>embedding"]
    B --> C5["the<br/>embedding"]
    B --> C6["mat<br/>embedding"]
    C1 & C2 & C3 & C4 & C5 & C6 --> D[Combine Strategy]
    D --> E1["Mean Pooling<br/>Average all tokens"]
    D --> E2["CLS Token<br/>Use special token"]
    D --> E3["Max Pooling<br/>Max of each dimension"]
    E1 & E2 & E3 --> F["Sentence Embedding<br/>768 dimensions"]
    style F fill:#2ecc71
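
A mean-pooling sketch in NumPy; the random matrix stands in for real token embeddings, and the attention mask is how padding tokens would be ignored.

import numpy as np

token_embeddings = np.random.rand(6, 768)       # 6 tokens x 768 dims
attention_mask = np.array([1, 1, 1, 1, 1, 1])   # 1 = real token, 0 = padding

masked = token_embeddings * attention_mask[:, None]
sentence_embedding = masked.sum(axis=0) / attention_mask.sum()
print(sentence_embedding.shape)                 # (768,)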

Sentence Similarity

graph LR
    A["The cat sat<br/>on the mat"] --> C["Similarity: 0.89<br/>Semantically similar"]
    B["A feline rested<br/>on a rug"] --> C
    D["The cat sat<br/>on the mat"] --> E["Similarity: 0.12<br/>Not similar"]
    F["Quantum mechanics<br/>is fascinating"] --> E
    style C fill:#2ecc71
    style E fill:#e74c3c

Applications: Putting Embeddings to Work

1. Semantic Search

graph TB
    A["User Query:<br/>best pizza recipes"] --> B["Encode to Embedding<br/>768-dim vector"]
    C["Document 1:<br/>How to make pizza"] --> D[Embedding 1]
    E["Document 2:<br/>Italian cooking guide"] --> F[Embedding 2]
    G["Document 3:<br/>Car maintenance"] --> H[Embedding 3]
    B --> I[Compute Cosine Similarity]
    D & F & H --> I
    I --> J["Rankings:<br/>1. Doc 1: 0.94<br/>2. Doc 2: 0.78<br/>3. Doc 3: 0.05"]
    style J fill:#2ecc71

Traditional Search: Keyword matching
Semantic Search: Meaning-based matching
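
A semantic-search sketch matching the diagram above, again assuming sentence-transformers and the all-MiniLM-L6-v2 model; normalizing the vectors makes the dot product equal to cosine similarity.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["How to make pizza", "Italian cooking guide", "Car maintenance"]
query = "best pizza recipes"

doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

scores = doc_vecs @ query_vec                   # cosine similarity per document
for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {doc}")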

2. Clustering and Topic Modeling

graph TB
    A[1000 News Articles] --> B["Encode Each to<br/>Embedding"]
    B --> C["K-Means Clustering<br/>k=5"]
    C --> D1["Cluster 1:<br/>Sports"]
    C --> D2["Cluster 2:<br/>Politics"]
    C --> D3["Cluster 3:<br/>Technology"]
    C --> D4["Cluster 4:<br/>Entertainment"]
    C --> D5["Cluster 5:<br/>Business"]
    style D1 fill:#e74c3c
    style D2 fill:#3498db
    style D3 fill:#2ecc71
    style D4 fill:#f39c12
    style D5 fill:#9b59b6
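
A clustering sketch assuming scikit-learn and sentence-transformers; the headlines are invented, and in practice k is chosen by inspection or a metric such as silhouette score.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

headlines = [
    "Team wins championship final", "Striker signs record transfer",
    "Parliament passes new budget", "Election debate heats up",
    "Chipmaker unveils faster GPU", "New smartphone announced",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
X = model.encode(headlines)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for label, text in zip(labels, headlines):
    print(label, text)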

3. Recommendation Systems

graph LR
    A["User liked:<br/>Article about Python"] --> B["Python article<br/>embedding"]
    C["Article 1:<br/>JavaScript tutorial"] --> D["Similarity: 0.82<br/>Programming-related"]
    E["Article 2:<br/>Machine learning"] --> F["Similarity: 0.91<br/>Python + ML"]
    G["Article 3:<br/>Cooking recipes"] --> H["Similarity: 0.03<br/>Unrelated"]
    B --> D & F & H
    D & F & H --> I[Recommend Article 2]
    style I fill:#2ecc71

4. Anomaly Detection

graph TB
    subgraph "Normal Customer Support Tickets"
        A1[Password reset] & A2[Login issue] & A3[Account locked]
    end
    subgraph "Anomaly"
        B["Quantum encryption<br/>blockchain synergy"]
    end
    A1 & A2 & A3 --> C[Cluster Center]
    C -.->|"Distance: 0.15"| A1
    C -.->|"Distance: 0.98"| B
    B --> D["Flag as Anomaly<br/>Potential spam/fraud"]
    style D fill:#e74c3c
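
A distance-from-centroid sketch in NumPy; the 2-d vectors stand in for real ticket embeddings and the threshold is made up (it would be tuned on held-out data).

import numpy as np

normal_tickets = np.array([[0.90, 0.10],       # "Password reset"
                           [0.85, 0.15],       # "Login issue"
                           [0.95, 0.05]])      # "Account locked"
new_ticket = np.array([0.10, 0.90])            # the odd one out

center = normal_tickets.mean(axis=0)
distance = np.linalg.norm(new_ticket - center)
threshold = 0.5                                # assumed cutoff for illustration

print("anomaly" if distance > threshold else "normal", round(distance, 2))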

Dimensionality Reduction: Seeing the Invisible

The Challenge

Real embeddings have 768-4096 dimensions. Humans can’t visualize beyond 3D. Solution: Reduce dimensions while preserving structure.

PCA (Principal Component Analysis)

Find directions of maximum variance:

graph TB
    A["768-Dimensional<br/>Embeddings"] --> B["PCA Analysis<br/>Find principal components"]
    B --> C["Component 1<br/>Most variance<br/>e.g., Topic"]
    B --> D["Component 2<br/>Second most<br/>e.g., Sentiment"]
    B --> E["Component 3<br/>Third most<br/>e.g., Formality"]
    C & D --> F["Project to 2D<br/>for Visualization"]
    style F fill:#2ecc71

Result: Projects high-dimensional data to 2D while preserving global structure.
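
A PCA sketch using scikit-learn; the random matrix stands in for real embeddings.

import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(100, 768)          # 100 items, 768 dims each

pca = PCA(n_components=2)
points_2d = pca.fit_transform(embeddings)      # shape (100, 2), ready to plot
print(points_2d.shape)
print(pca.explained_variance_ratio_)           # variance captured per component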

t-SNE (t-Distributed Stochastic Neighbor Embedding)

Preserves local neighborhoods:

graph LR
    A["High-Dimensional<br/>Space"] --> B["t-SNE Algorithm<br/>Preserve local structure"]
    B --> C["2D Projection<br/>Clusters visible"]
    style C fill:#2ecc71

Advantage: Better at revealing clusters than PCA
Disadvantage: Doesn’t preserve global distances
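
A t-SNE sketch with scikit-learn; the random matrix stands in for real embeddings, and perplexity usually needs tuning per dataset.

import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.rand(200, 768)
points_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print(points_2d.shape)                         # (200, 2)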

UMAP (Uniform Manifold Approximation and Projection)

Best of both worlds:

graph TB
    A[UMAP] --> B["Preserves Local<br/>Structure"]
    A --> C["Preserves Global<br/>Structure"]
    A --> D[Faster than t-SNE]
    style A fill:#2ecc71
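
A matching UMAP sketch, assuming the third-party umap-learn package is installed.

import numpy as np
import umap

embeddings = np.random.rand(200, 768)
points_2d = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(embeddings)
print(points_2d.shape)                         # (200, 2)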

Visualization Example: Word Embeddings

Concept: Visualizing Word2Vec

graph TB
    subgraph "2D Embedding Space After t-SNE"
        A1[king] --> A2[queen]
        A2 --> A3[prince]
        A3 --> A4[princess]
        A1 --> A4
        B1[man] --> B2[woman]
        B2 --> B3[boy]
        B3 --> B4[girl]
        B1 --> B4
        C1[france] --> C2[paris]
        C3[italy] --> C4[rome]
        C5[japan] --> C6[tokyo]
        D1[apple] --> D2[orange]
        D2 --> D3[banana]
        D3 --> D1
        E1[happy] --> E2[joyful]
        E2 --> E3[excited]
        E3 --> E1
    end
    style A1 fill:#3498db
    style A2 fill:#3498db
    style A3 fill:#3498db
    style A4 fill:#3498db
    style B1 fill:#e74c3c
    style B2 fill:#e74c3c
    style B3 fill:#e74c3c
    style B4 fill:#e74c3c

Observations:

  • Royalty terms cluster together (blue)
  • Gender terms form parallel structures
  • Geographic pairs maintain relationships
  • Semantic categories remain grouped

Multilingual Embeddings: Cross-Language Geometry

Aligned Embedding Spaces

Modern multilingual models (e.g., multilingual BERT) create shared embedding spaces:

graph TB
    subgraph "Shared Multilingual Space"
        A1[cat EN] -.-> A2[gato ES]
        A2 -.-> A3[chat FR]
        A3 -.-> A4[猫 JP]
        A4 -.-> A1
        B1[king EN] -.-> B2[rey ES]
        B2 -.-> B3[roi FR]
        B3 -.-> B4[王 JP]
        B4 -.-> B1
    end
    style A1 fill:#e74c3c
    style A2 fill:#e74c3c
    style A3 fill:#e74c3c
    style A4 fill:#e74c3c
    style B1 fill:#3498db
    style B2 fill:#3498db
    style B3 fill:#3498db
    style B4 fill:#3498db

Application: Zero-shot cross-lingual transfer

  • Train sentiment classifier on English
  • Apply directly to Spanish, French, Japanese
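
A cross-lingual sketch, assuming sentence-transformers and the multilingual paraphrase-multilingual-MiniLM-L12-v2 checkpoint; translations of the same phrase should land close together.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
phrases = ["the cat sleeps", "el gato duerme", "le chat dort"]

vecs = model.encode(phrases, normalize_embeddings=True)
print(np.round(vecs @ vecs.T, 2))              # all pairs should score high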

Fine-Tuning Embeddings: Task-Specific Geometry

Generic Embeddings

Pre-trained on general text:

graph TB
    A["Generic BERT<br/>Embeddings"] --> B["General semantic<br/>relationships"]
    B --> C["apple ≈ orange<br/>Fruits"]
    B --> D["bank ≈ finance<br/>General association"]
    style B fill:#95a5a6

Fine-Tuned for Medical Domain

graph TB
    A["Fine-Tuned on<br/>Medical Text"] --> B["Domain-specific<br/>relationships"]
    B --> C["diabetes ≈ insulin<br/>Strong medical link"]
    B --> D["hypertension ≈ cardiovascular<br/>Clinical relationship"]
    style B fill:#2ecc71

Embedding Space Shift: Fine-tuning reshapes the geometry to emphasize domain-relevant relationships.
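
A fine-tuning sketch using the sentence-transformers training API (InputExample and MultipleNegativesRankingLoss); the medical pairs are invented, and a real run needs thousands of examples.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed base checkpoint

# Pairs that should be pulled together in the fine-tuned space (invented examples).
train_examples = [
    InputExample(texts=["diabetes management", "insulin dosing guidelines"]),
    InputExample(texts=["hypertension treatment", "cardiovascular risk reduction"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)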

Probing Embeddings: What Do They Encode?

Linear Probing

Train simple classifiers on embeddings to detect encoded information:

graph LR
    A[Word Embeddings] --> B["Linear Probe:<br/>Detect Part-of-Speech"]
    A --> C["Linear Probe:<br/>Detect Sentiment"]
    A --> D["Linear Probe:<br/>Detect Named Entity"]
    B --> E["Accuracy: 94%<br/>POS is encoded!"]
    C --> F["Accuracy: 78%<br/>Sentiment partly encoded"]
    D --> G["Accuracy: 89%<br/>Entity type encoded"]
    style E fill:#2ecc71
    style F fill:#f39c12
    style G fill:#2ecc71

Discovery: Embeddings implicitly encode linguistic properties even without explicit training!
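
A linear-probe sketch assuming scikit-learn; the embeddings and labels are synthetic here, while real probes use frozen model embeddings and labeled corpora.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 768)                   # stands in for frozen embeddings
y = (X[:, 0] > 0.5).astype(int)                # a property hidden in one dimension

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))   # high => property is encoded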

Embedding Quality: Metrics

Intrinsic Evaluation: Analogies

Test: king - man + woman = ?
Expected: queen
Model output: queen ✓

Test: paris - france + italy = ?
Expected: rome
Model output: rome ✓

Accuracy: 85% on 10,000 analogy pairs
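
A tiny version of this evaluation, assuming gensim and its downloadable glove-wiki-gigaword-50 vectors; real benchmarks run thousands of such questions.

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")        # small pre-trained GloVe vectors
tests = [("king", "man", "woman", "queen"),    # a - b + c = expected
         ("paris", "france", "italy", "rome")]

correct = 0
for a, b, c, expected in tests:
    predicted = wv.most_similar(positive=[a, c], negative=[b], topn=1)[0][0]
    correct += int(predicted == expected)

print(f"accuracy: {correct / len(tests):.0%}")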

Extrinsic Evaluation: Downstream Tasks

graph TB
    A[Embedding Quality] --> B["Sentiment Analysis<br/>Accuracy: 92%"]
    A --> C["Named Entity Recognition<br/>F1: 88%"]
    A --> D["Question Answering<br/>EM: 76%"]
    style B fill:#2ecc71
    style C fill:#2ecc71
    style D fill:#f39c12

Better embeddings → Better downstream performance

Challenges and Limitations

1. Bias in Embeddings

Embeddings reflect biases in training data:

graph LR
    A[doctor] --> B[Closer to he]
    C[nurse] --> D[Closer to she]
    style A fill:#3498db
    style C fill:#e74c3c

Solution: Debiasing techniques, careful data curation

2. Polysemy (Multiple Meanings)

Static embeddings struggle with words that have multiple meanings:

"bank" embedding is average of:
- Financial institution (70%)
- River bank (20%)
- Memory bank (10%)

Result: Embedding doesn't fit any meaning well

Solution: Contextual embeddings (BERT, GPT)

3. Out-of-Vocabulary Words

Unknown words have no embedding:

graph TB
    A["Known: cat, dog, mouse"] --> B[Have Embeddings]
    C["Unknown: flibbertigibbet"] --> D["No Embedding<br/>Problem!"]
    style B fill:#2ecc71
    style D fill:#e74c3c

Solutions: Subword tokenization (BPE), character-level models
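
A quick look at the subword fix, assuming the transformers package and the bert-base-uncased tokenizer; the rare word is split into known pieces that all have embeddings.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("flibbertigibbet"))
# Splits into known subword pieces (the exact split depends on the vocabulary),
# so the model can still build an embedding for the word.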

Future Directions

1. Dynamic Embeddings

Embeddings that evolve with time:

"tweet" embedding in 2005: Bird sound
"tweet" embedding in 2025: Social media post

2. Multimodal Embeddings

Unified space for text, images, audio:

graph TB
    A["Shared Embedding<br/>Space"] --> B[cat text]
    A --> C[cat image]
    A --> D[meow audio]
    B -.-> C
    C -.-> D
    D -.-> B
    style A fill:#9b59b6

Example: CLIP (OpenAI) aligns images and text

3. Compression

Smaller embeddings without quality loss:

Traditional: 768 dimensions
Compressed: 128 dimensions (6× smaller)
Quality loss: <2%

Conclusion

Embeddings transform language from discrete symbols into continuous geometric space where meaning is computable.

This invisible geometry enables:

  • Semantic similarity: Finding related concepts
  • Analogies: Reasoning through vector arithmetic
  • Clustering: Automatic topic discovery
  • Search: Meaning-based retrieval
  • Transfer learning: Reusing knowledge across tasks

Understanding embeddings is understanding how modern AI represents knowledge. Every LLM, every semantic search system, every recommendation engine relies on this geometric foundation.

The next time you search for something and get eerily relevant results, or an AI understands your paraphrased question—thank embeddings. They’re the invisible scaffolding of modern AI.

Key Takeaways

  • Embeddings: Dense vector representations of text
  • Semantic Space: Similar meanings → similar vectors
  • Vector Arithmetic: Analogies emerge from geometry
  • Clustering: Automatic discovery of semantic categories
  • Dimensionality Reduction: Visualizing high-dimensional spaces
  • Contextual Embeddings: Same word, different contexts, different vectors
  • Applications: Search, recommendations, classification, clustering
  • Trade-offs: Generic vs. specialized, static vs. contextual

Embeddings are the lingua franca of modern NLP. Master them, and you master the geometry of meaning.

Further Reading