Q-Learning
Let an agent explore a maze on its own — no rules, no labels, only rewards and penalties, until it learns the optimal path
✦ See It in Action: Agent Navigates a Maze
Click "Train" and watch the agent go from random wandering to finding the optimal path — nobody tells it where the path is, it only learns from the reward signal upon reaching the goal.
Q-Value Heatmap
After training, each cell shows its highest Q-value (warmer colors mean the cell is more worth visiting):
01 Core Principles (Plain English)
You arrive in an unfamiliar city with no GPS navigation. You can only learn by "trial and error": take this road — traffic jam, noted; take that road — arrive quickly, noted. After running several times, you naturally know which route is best.
Q-Learning does exactly the same thing. It gives an agent an environment (like a maze) and lets it explore repeatedly. Every good decision earns a reward; every bad one gets a penalty. Over time, the agent builds up an "experience table", and by consulting it the agent can answer: "from my current position, which direction is the best move?"
What Does the Experience Table Look Like?
This table is called the Q-Table. Each row is a "situation" (state), each column is a "choice" (action), and the number in each cell represents "how much total reward you can expect in the long run if you make this choice".
| Position | Left | Right ✓ | Up | Down |
|---|---|---|---|---|
| (0,0) Start | -0.1 | 2.4 | -0.1 | 0.6 |
| (1,0) | 0.3 | 3.8 | -0.1 | 1.2 |
The ✓ marks the best action for each state (the highest Q-value in its row); the agent picks that one.
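The Q-table above is just a 2D lookup. A minimal sketch in Python (the maze size and action encoding are illustrative assumptions, not the demo's actual code):

```python
import numpy as np

# Hypothetical 5x5 maze flattened to 25 states, 4 actions: 0=Left, 1=Right, 2=Up, 3=Down.
n_states, n_actions = 25, 4
q_table = np.zeros((n_states, n_actions))  # all zeros before training

# Suppose training has filled in the row for the start cell (0,0):
q_table[0] = [-0.1, 2.4, -0.1, 0.6]

# "Consulting the table" = picking the highest-scoring action for the current state.
best_action = int(np.argmax(q_table[0]))
print(best_action)  # 1, i.e. "Right", matching the ✓ column above
```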
How Do You Fill In This Table?
At the beginning the Q-table is all zeros. The agent picks actions randomly. Walking into dead ends or hitting walls — penalty; accidentally reaching the goal — reward!
After each step, update the Q-value for that step using the formula: "how much reward you got this step + the maximum reward you can still get from the next step". The scores for good paths gradually get "propagated" backward.
The longer the training, the more accurate the Q-table becomes, and the more the agent tends to directly look up the highest-scoring action instead of continuing to explore randomly.
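The fill-in rule described above is one line of arithmetic: nudge the old estimate toward "this step's reward + the best reward still reachable from the next step". A minimal sketch with illustrative values for the learning rate and discount factor:

```python
alpha, gamma = 0.1, 0.9  # learning rate and discount factor (illustrative values)

def update(q_table, state, action, reward, next_state):
    # "reward this step + the maximum reward you can still get from the next step"
    target = reward + gamma * max(q_table[next_state])
    # Move the old estimate a small step (alpha) toward that target.
    q_table[state][action] += alpha * (target - q_table[state][action])

q = {0: [0.0, 0.0], 1: [0.0, 0.0]}  # tiny 2-state, 2-action table
update(q, state=0, action=1, reward=1.0, next_state=1)
print(q[0][1])  # 0.1: the reward has started to propagate back into state 0
```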
Exploration vs. Exploitation Balance (ε-greedy): If you always explore randomly, you'll never learn well; if you switch to the "known best path" too early, you might miss better routes. The parameter ε controls this: with probability ε, explore randomly; with probability 1-ε, choose the known best action. During training, ε is gradually decreased, transitioning from "curious explorer" to "experienced driver".
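The ε-greedy choice and its decay schedule might look like this (the decay rate and floor are illustrative assumptions):

```python
import random

def epsilon_greedy(q_row, epsilon):
    """Pick an action for one state: explore with probability epsilon,
    otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_row))                    # explore: any action
    return max(range(len(q_row)), key=lambda a: q_row[a])      # exploit: argmax

# Decay epsilon over training: "curious explorer" -> "experienced driver".
epsilon, decay, min_epsilon = 1.0, 0.99, 0.05
for episode in range(100):
    epsilon = max(min_epsilon, epsilon * decay)
print(round(epsilon, 3))  # 0.366: mostly exploiting by now
```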
What Can Q-Learning Do?
- Foundational algorithm for Atari games and board games
- Learning to find paths in unknown maps
- Data center and traffic light optimization
- Using user interactions as reward signals for optimization
Build Q-Learning Step by Step
From defining the environment to implementing two temporal-difference (TD) algorithms, build it up step by step.
- Cliff Walking environment: states are grid coordinates, actions are up/down/left/right, and falling off the cliff gives a reward of -100.
- Q-table and ε-greedy: store Q-values in a 2D array; with probability ε explore a random action, otherwise pick the action with the highest Q-value.
- Q-Learning update: update the current Q-value using the best Q-value of the next state. The target is aggressive, so the agent takes shortcuts right along the cliff edge.
- SARSA update: update using the Q-value of the next action actually executed. More conservative, so the agent takes a safer path away from the cliff.
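The difference between the two TD targets described above can be sketched side by side (the toy Q-values near the cliff edge are made up for illustration):

```python
# Q-Learning (off-policy): the target uses the BEST next action,
# even if the epsilon-greedy agent then actually explores a different one.
def q_learning_target(q, s_next, reward, gamma):
    return reward + gamma * max(q[s_next])

# SARSA (on-policy): the target uses the action the agent ACTUALLY takes next,
# chosen by the same epsilon-greedy behavior policy.
def sarsa_target(q, s_next, a_next, reward, gamma):
    return reward + gamma * q[s_next][a_next]

# Toy next-state values beside the cliff: action 1 is the shortcut,
# action 2 steps off the edge.
q = {"edge": [0.0, 5.0, -100.0]}
print(q_learning_target(q, "edge", -1, 0.9))   # 3.5: optimistic, hugs the cliff
print(sarsa_target(q, "edge", 2, -1, 0.9))     # -91.0: the actual slip is priced in
```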
02 Code
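A minimal, self-contained sketch of tabular Q-Learning on a Cliff Walking grid, following the steps above (grid size, reward values, and hyperparameters are illustrative assumptions, not the demo's actual settings):

```python
import random

# Tiny Cliff Walking grid, 4 rows x 6 cols (an assumed layout).
# Start bottom-left, goal bottom-right, the cells between them are the cliff.
ROWS, COLS = 4, 6
START, GOAL = (3, 0), (3, 5)
CLIFF = {(3, c) for c in range(1, 5)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, a):
    r, c = state
    dr, dc = ACTIONS[a]
    nr = min(max(r + dr, 0), ROWS - 1)  # bumping a wall keeps you in place
    nc = min(max(c + dc, 0), COLS - 1)
    if (nr, nc) in CLIFF:
        return START, -100, False       # fall off: big penalty, back to start
    if (nr, nc) == GOAL:
        return (nr, nc), 0, True
    return (nr, nc), -1, False          # every ordinary step costs -1

def train(episodes=500, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    random.seed(seed)
    q = {(r, c): [0.0] * 4 for r in range(ROWS) for c in range(COLS)}
    for _ in range(episodes):
        s, done = START, False
        while not done:
            # epsilon-greedy behavior policy
            a = (random.randrange(4) if random.random() < eps
                 else max(range(4), key=lambda i: q[s][i]))
            s2, reward, done = step(s, a)
            # Q-Learning update: bootstrap from the best next action
            q[s][a] += alpha * (reward + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train()
# A greedy rollout from the start should now reach the goal.
s, path = START, [START]
for _ in range(50):
    s, _, done = step(s, max(range(4), key=lambda i: q[s][i]))
    path.append(s)
    if done:
        break
print(path[-1])  # should be the goal, (3, 5)
```

Swapping the bootstrap line for the actually-chosen next action turns this into SARSA, which learns the safer path described earlier.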
03 Academic Explanation
Q-Learning is a reinforcement learning algorithm where an agent learns an optimal policy by interacting with an environment. The core idea is to learn a Q-table that records the expected cumulative reward for taking each action in each state.
What Is the Q-Table?
The Q-table is a lookup table where rows represent states, columns represent actions, and the values Q(s, a) represent the expected cumulative reward for taking action a in state s.
| State | Left | Right | Up | Down |
|---|---|---|---|---|
| (0,0) | 0 | 0 | 0 | 0 |
| (0,1) | 0 | 0 | 0 | 0 |
Q-Learning Algorithm
The agent updates the Q-table with the rule

Q(s,a) ← Q(s,a) + α·[r + γ·max Q(s',a') − Q(s,a)]

repeating the following loop each step:

- Choose an action a from the current state s using the ε-greedy policy
- Execute the action, receiving immediate reward r and new state s'
- Compute the TD error δ = r + γ·max Q(s',a') − Q(s,a) and update Q(s,a) by α·δ
ε-Greedy Policy
Balancing exploration and exploitation:
- With probability ε, choose a random action (exploration)
- With probability 1-ε, choose the current best known action (exploitation)
As learning progresses, ε is typically decreased (ε-decay), transitioning the agent from mostly exploring to mostly exploiting.
Hyperparameter Meanings
Learning Rate α
Controls the step size of each update. When α=1, completely overwrite old values with new estimates; when α→0, barely update at all. Typically 0.1–0.5.
Discount Factor γ
Determines the importance of future rewards. γ=0 considers only immediate rewards; γ→1 values long-term returns. Typically 0.9–0.99.
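The effect of γ can be checked directly by computing a discounted return for a fixed reward sequence (the rewards are illustrative):

```python
# Discounted return G = r0 + γ·r1 + γ²·r2 + ... for a fixed reward sequence.
rewards = [1, 1, 1, 1, 1]

def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 0.9))  # ≈ 4.1: future rewards still matter
```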
Convergence Guarantee
Under the following conditions, Q-Learning can be proven to converge to the optimal Q* function:
- Every state-action pair is visited infinitely often (sufficient exploration)
- The learning rate satisfies the Robbins-Monro conditions: ∑α = ∞, ∑α² < ∞
- Rewards are bounded
Q-Learning is an off-policy algorithm: it updates using max Q(s',a') (greedy policy) while the behavior policy is ε-greedy — the two can differ. This is the key distinction from SARSA (on-policy).
Summary
- State-action value function
- Temporal difference learning rule
- Behavior policy decoupled from target policy
- Requires DQN for large state spaces