Word2Vec Word Vectors
Turn every word into a string of numbers — similar words get similar numbers: "king − man + woman ≈ queen"
01 Core Principles (Plain English)
Computers don't understand the relationship between "cat" and "dog". But if you turn each word into a coordinate point, where similar words have nearby coordinates — then the computer can do "word arithmetic".
Word2Vec's idea: a word's meaning is determined by its neighbors. Words that appear in the same context like "eat ___" (rice, noodles, meat) should have similar meanings, so they get similar vectors.
Two Training Methods
CBOW (Predict Center Word from Context)
Given "I ___ apples", guess the middle word is "eat". Uses surrounding words to predict the target word. Better suited for small datasets.
Skip-gram (Predict Context from Center Word)
Given "eat", guess that "I" and "apples" might appear nearby. Uses the target word to predict surrounding words. Better for rare words.
Each word in the vocabulary is assigned a random 100-300-dimensional vector (just a string of random numbers).
Use a sliding window over large amounts of text to construct (center word, context word) training pairs, letting the model learn "which words often co-occur".
If the prediction is correct, the vectors barely change; if wrong, pull the true (center, context) pair's vectors slightly closer and push wrong pairs slightly apart. After repeated training, words with similar meanings naturally cluster together.
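The sliding-window pair construction described above can be sketched directly. The toy corpus and window size here are illustrative assumptions, not from the original text:

```python
# Sketch of skip-gram training-pair construction over a toy corpus.
corpus = "i eat apples you eat noodles".split()

def make_pairs(tokens, window=2):
    """Slide a window over the tokens, yielding (center, context) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = make_pairs(corpus)
# e.g. ("i", "eat"), ("eat", "apples"), ... — each pair is one training example
```

Every pair becomes one "does this context word belong to this center word?" training example.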
Word vectors are the foundation of all NLP. The input layers of LSTM, Transformer, and BERT are essentially Embeddings (word vectors). Word2Vec is the starting point for understanding all of this.
Build Word2Vec Step by Step
From raw corpus to trained word vectors: the steps below build Skip-gram from scratch.
Split sentences into words, build word-to-index mapping.
Input matrix W (center words) and output matrix CM (context words), randomly initialized.
Pull positive samples (real context words) closer, push negative samples (random words) away.
Iterate over all words and windows, repeatedly updating word vectors until semantic relationships emerge.
02 Code
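A minimal NumPy sketch of Skip-gram with negative sampling, following the four steps above (tokenize, initialize W and CM, pull positives closer / push negatives away, iterate). The corpus, dimensions, and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: split sentences into words, build word-to-index mapping
corpus = ("i eat apples . i eat noodles . you eat rice . "
          "cats eat fish . dogs eat meat .").split()
vocab = sorted(set(corpus))
word2idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 16                # vocabulary size, embedding dimension

# Step 2: input matrix W (center words) and output matrix CM (context words)
W = rng.normal(0, 0.1, (V, D))
CM = rng.normal(0, 0.1, (V, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Unigram counts raised to the 3/4 power for negative sampling
counts = np.array([corpus.count(w) for w in vocab], dtype=float)
p_neg = counts ** 0.75
p_neg /= p_neg.sum()

def train(epochs=100, window=2, k=5, lr=0.05):
    for _ in range(epochs):
        for i, center in enumerate(corpus):
            c = word2idx[center]
            for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
                if j == i:
                    continue
                o = word2idx[corpus[j]]
                # Step 3a: positive sample — pull u_o and v_c together
                g = sigmoid(CM[o] @ W[c]) - 1.0
                grad_c = g * CM[o]
                CM[o] -= lr * g * W[c]
                # Step 3b: k negative samples — push u_neg away from v_c
                for n in rng.choice(V, size=k, p=p_neg):
                    gn = sigmoid(CM[n] @ W[c])
                    grad_c += gn * CM[n]
                    CM[n] -= lr * gn * W[c]
                W[c] -= lr * grad_c     # Step 4: update the center vector

train()
```

On a corpus this tiny the learned geometry is noisy; the point is the update loop, not the resulting vectors. A production run would use a real corpus and a library such as gensim.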
03 Academic Explanation
Skip-gram Objective Function
Given center word wₜ, maximize the average log-probability of the context words appearing:

$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)$$

where c is the window size and T is the total number of words in the corpus. The conditional probability is computed using a softmax over the vocabulary:

$$p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}$$
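The softmax over u_wᵀv_c can be computed directly; the vectors below are random illustrative stand-ins, and the key observation is that the denominator sums over the entire vocabulary (cost O(V) per prediction):

```python
import numpy as np

rng = np.random.default_rng(1)
V, D = 5, 4                      # toy vocabulary size and dimension (assumed)
U = rng.normal(size=(V, D))      # context ("output") vectors u_w
v_c = rng.normal(size=D)         # center-word vector v_c

scores = U @ v_c                 # u_w^T v_c for every word w in the vocabulary
scores -= scores.max()           # subtract max for numerical stability
probs = np.exp(scores) / np.exp(scores).sum()   # softmax: sums to 1
```

This per-word normalization cost is exactly what negative sampling (next section) avoids.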
Each word has two sets of vectors: v when serving as a center word, and u when serving as a context word. The final representation takes the average of both or uses only v.
Negative Sampling
Full vocabulary softmax is computationally expensive (vocabulary can reach millions). Negative sampling converts multi-class classification into binary classification: for each positive sample (real context word o), randomly sample k negative samples (non-context words), and only update vectors for those k+1 words, maximizing:

$$\log \sigma(u_o^{\top} v_c) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-u_{w_i}^{\top} v_c) \right]$$
Negative samples are drawn proportional to word frequency raised to the 3/4 power, which moderately boosts the sampling probability of rare words.
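The effect of the 3/4-power smoothing is easy to verify numerically. The raw counts below are made-up illustrative frequencies:

```python
import numpy as np

# Illustrative word frequencies: one common, one medium, one rare word
counts = np.array([1000.0, 100.0, 10.0])

uniform_freq = counts / counts.sum()   # raw frequency distribution
smoothed = counts ** 0.75              # 3/4-power smoothing
p_neg = smoothed / smoothed.sum()      # negative-sampling distribution
# p_neg gives the rare word a higher probability than its raw frequency,
# and the common word a lower one
```

Sampling from `p_neg` instead of raw frequency keeps very common words from dominating the negative samples.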
Comparison with GloVe
Word2Vec is based on local context windows (word-by-word prediction). GloVe is based on global co-occurrence matrices, directly factorizing the word-word co-occurrence frequency matrix X, with objective function:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$
Both perform similarly on downstream tasks. Word2Vec is better suited for incremental updates, while GloVe performs slightly better on small corpora.
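GloVe's weighting function f(Xᵢⱼ) can be sketched in a few lines; the constants x_max = 100 and α = 3/4 are the values reported in the GloVe paper:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X_ij): zero for pairs that never co-occur,
    grows sublinearly, and caps at 1 for very frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```

The cap keeps extremely common pairs (e.g. "of the") from dominating the loss, while f(0) = 0 drops pairs with no co-occurrence evidence.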
Geometric Properties of Word Vectors
Linear relationships exist in the trained vector space:

vec(king) − vec(man) + vec(woman) ≈ vec(queen)
vec(Paris) − vec(France) + vec(Italy) ≈ vec(Rome)
This shows that the vector space captures the directional nature of semantic relationships (gender axis, capital axis, etc.). Similarity is commonly measured by cosine similarity: cos(θ) = (v₁·v₂)/(‖v₁‖‖v₂‖).
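The analogy-by-arithmetic procedure can be demonstrated with hand-crafted 2-D vectors whose "gender axis" is made explicit; real Word2Vec vectors are learned, not laid out by hand:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: cos(theta) = (a . b) / (|a| |b|)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors: first axis ~ "maleness", second axis ~ "royalty" (assumed layout)
vec = {
    "king":  np.array([1.0, 1.0]),
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 0.2]),
    "queen": np.array([0.0, 1.0]),
    "apple": np.array([0.9, -0.3]),
}

result = vec["king"] - vec["man"] + vec["woman"]
# Nearest word by cosine similarity, excluding the query words themselves
best = max((w for w in vec if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(result, vec[w]))
# best == "queen"
```

Excluding the query words is standard practice, since the result vector is usually closest to one of its own inputs.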