01 Core Principles (Plain English)

Computers don't understand the relationship between "cat" and "dog". But if you turn each word into a coordinate point, where similar words have nearby coordinates — then the computer can do "word arithmetic".

Word2Vec's idea: a word's meaning is determined by its neighbors (the distributional hypothesis). Words that appear in the same contexts, such as the blank in "eat ___" (rice, noodles, meat), tend to mean similar things, so they get similar vectors.

Two Training Methods

CBOW (Predict Center Word from Context)

Given "I ___ apples", guess the middle word is "eat". Uses surrounding words to predict the target word. Trains faster and works well for frequent words.

Skip-gram (Predict Context from Center Word)

Given "eat", guess that "I" and "apples" might appear nearby. Uses the target word to predict surrounding words. Better for rare words and smaller datasets.

1. Randomly Initialize Vectors

Each word in the vocabulary is assigned a random 100–300 dimensional vector (just a string of random numbers).

2. Sliding Window Scanning

Slide a window over large amounts of text to construct (center word, context word) training pairs, letting the model learn which words often co-occur.

3. Gradient Update Vectors

If a prediction is already good, the update is tiny; if it is wrong, the center word's vector and the true context word's vector are pushed slightly closer. After many passes over the corpus, words with similar meanings naturally cluster together.
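The sliding-window idea is easiest to see on a toy example. Below is a minimal sketch of step 2, extracting (center word, context word) pairs from a made-up sentence with a window of size 2 (both the sentence and the window size are illustrative):

```python
# Slide a window of size 2 over a toy sentence to produce
# (center word, context word) training pairs.
sentence = "i eat red apples every day".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    # Context = up to `window` words on each side of the center word.
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:4])
```

Every such pair becomes one training example: the model is asked to predict the context word from the center word (Skip-gram) or vice versa (CBOW).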

Word vectors are the foundation of all NLP. The input layers of LSTM, Transformer, and BERT are essentially Embeddings (word vectors). Word2Vec is the starting point for understanding all of this.

Build Word2Vec Step by Step

From corpus to word vectors, build Skip-gram step by step.

Step 1 Prepare Corpus and Vocabulary

Split sentences into words, build word-to-index mapping.
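A minimal sketch of this step, using a tiny made-up corpus:

```python
# Step 1 sketch: tokenize a toy corpus and build word <-> index mappings.
corpus = ["i eat apples", "i eat noodles", "cats eat fish"]
tokens = [sentence.split() for sentence in corpus]

vocab = sorted({word for sent in tokens for word in sent})
word2idx = {word: i for i, word in enumerate(vocab)}
idx2word = {i: word for word, i in word2idx.items()}

print(word2idx)
```

Real pipelines add lowercasing, punctuation handling, and often drop very rare words; none of that is needed for the sketch.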

Step 2 Initialize Word Vector Matrices

Input matrix W (center-word vectors) and output matrix C (context-word vectors), both randomly initialized.
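A sketch of the initialization, assuming a vocabulary of 6 words and 50-dimensional vectors (both sizes are arbitrary here; the context matrix is called C):

```python
import numpy as np

# Step 2 sketch: one matrix of center-word vectors (W) and one of
# context-word vectors (C). Small random values break symmetry.
vocab_size, dim = 6, 50
rng = np.random.default_rng(0)

W = rng.normal(0, 0.1, size=(vocab_size, dim))  # "input" / center vectors
C = rng.normal(0, 0.1, size=(vocab_size, dim))  # "output" / context vectors
```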

Step 3 Skip-gram Negative Sampling Training Step

Pull positive samples (real context words) closer, push negative samples (random words) away.
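One possible implementation of this update, written as gradient descent on the binary logistic loss (the function name, matrix names, and learning rate are illustrative, not from the original):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W, C, center, context, negatives, lr=0.025):
    """One skip-gram negative-sampling update (a teaching sketch).

    center/context are word indices; negatives holds k indices of
    randomly drawn non-context words.
    """
    v = W[center]
    grad_v = np.zeros_like(v)
    # The positive sample has label 1, negatives label 0.
    for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        g = sigmoid(C[idx] @ v) - label   # prediction error
        grad_v += g * C[idx]
        C[idx] -= lr * g * v              # pull positive closer, push negatives away
    W[center] -= lr * grad_v
```

Only the center vector and the k+1 touched context vectors change, which is exactly what makes negative sampling cheap.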

Step 4 Training Loop

Iterate over all words and windows, repeatedly updating word vectors until semantic relationships emerge.

02 Code
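Putting the four steps together, a minimal runnable Skip-gram with negative sampling might look like this. It is a plain-NumPy teaching sketch (uniform negative sampling, fixed learning rate, made-up toy corpus), not a production implementation such as gensim's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: corpus and vocabulary.
corpus = ["i eat apples", "i eat noodles", "i eat meat",
          "cats eat fish", "dogs eat meat"]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
word2idx = {w: i for i, w in enumerate(vocab)}

# Step 2: randomly initialized center (W) and context (C) matrices.
dim = 50
W = rng.normal(0, 0.1, (len(vocab), dim))
C = rng.normal(0, 0.1, (len(vocab), dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Steps 3-4: sliding window + negative-sampling updates.
# Negatives are drawn uniformly here for brevity (real Word2Vec uses the
# frequency^(3/4) distribution described in section 03).
window, k, lr = 2, 5, 0.05
for epoch in range(200):
    for sent in tokens:
        ids = [word2idx[w] for w in sent]
        for i, center in enumerate(ids):
            for j in range(max(0, i - window), min(len(ids), i + window + 1)):
                if j == i:
                    continue
                targets = [(ids[j], 1.0)]  # the real context word
                targets += [(int(n), 0.0) for n in rng.integers(0, len(vocab), k)]
                v = W[center]
                grad_v = np.zeros_like(v)
                for idx, label in targets:
                    g = sigmoid(C[idx] @ v) - label
                    grad_v += g * C[idx]
                    C[idx] -= lr * g * v
                W[center] -= lr * grad_v

# "apples" and "noodles" share the same contexts ("i", "eat"),
# so their vectors should end up close together.
def cos(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(W[word2idx["apples"]], W[word2idx["noodles"]]))
```

On this toy corpus the similarity between "apples" and "noodles" should exceed that between "apples" and "cats", since the first pair shares all of its contexts.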

03 Academic Explanation

Skip-gram Objective Function

Given center word wₜ, maximize the probability of context words appearing:

J = (1/T) Σₜ Σ_{-c≤j≤c, j≠0} log P(wₜ₊ⱼ | wₜ)

where c is the window size and T is the total number of words in the corpus. The conditional probability is computed using softmax:

P(o|c) = exp(uₒᵀvc) / Σ_{w∈V} exp(u_wᵀvc)

Each word has two sets of vectors: v when serving as a center word, and u when serving as a context word. The final representation takes the average of both or uses only v.

Negative Sampling

Full vocabulary softmax is computationally expensive (vocabulary can reach millions). Negative sampling converts multi-class classification into binary classification: for each positive sample (real context word), randomly sample k negative samples (non-context words), and only update vectors for those k+1 words:

J = log σ(uₒᵀvc) + Σₖ E[log σ(−uₖᵀvc)]

Negative samples are drawn proportional to word frequency raised to the 3/4 power, which moderately boosts the sampling probability of rare words.
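A quick numeric illustration with hypothetical counts, showing how the 3/4 power shifts probability mass from frequent words to rare ones:

```python
import numpy as np

# Hypothetical word counts for illustration.
counts = {"the": 1000, "eat": 100, "quinoa": 10}
freq = np.array(list(counts.values()), dtype=float)

raw = freq / freq.sum()        # sampling by raw frequency
smoothed = freq ** 0.75        # frequency raised to the 3/4 power...
smoothed /= smoothed.sum()     # ...renormalized into a distribution

print(dict(zip(counts, raw.round(3))))       # frequent word dominates
print(dict(zip(counts, smoothed.round(3))))  # rare word's share grows
```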

Comparison with GloVe

Word2Vec is based on local context windows (word-by-word prediction). GloVe is based on global co-occurrence matrices, directly factorizing the word-word co-occurrence frequency matrix, with objective function:

J = Σᵢⱼ f(Xᵢⱼ)(wᵢᵀw̃ⱼ + bᵢ + b̃ⱼ − log Xᵢⱼ)²

Both perform similarly on downstream tasks. Word2Vec is better suited for incremental updates, while GloVe performs slightly better on small corpora.

Geometric Properties of Word Vectors

Linear relationships exist in the trained vector space:

v(king) − v(man) + v(woman) ≈ v(queen)

This shows that the vector space captures the directional nature of semantic relationships (gender axis, capital axis, etc.). Similarity is commonly measured by cosine similarity: cos(θ) = (v₁·v₂)/(‖v₁‖‖v₂‖).
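Both formulas can be checked on made-up 3-dimensional vectors deliberately chosen so the analogy works (real word vectors are 100–300 dimensional and learned, not hand-picked):

```python
import numpy as np

# Hand-picked toy vectors: dimension 3 plays the role of a "gender axis".
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.8, 0.1, 0.1]),
    "woman": np.array([0.8, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cos(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land nearest to queen
# (the query word itself is excluded, as is standard).
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w != "king"), key=lambda w: cos(target, vecs[w]))
print(best)  # queen
```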