Q-Learning
Let an agent explore a maze on its own — no rules, no labels, only rewards and penalties, until it learns the optimal path
✦ See It in Action: Agent Navigates a Maze
Click "Train" and watch the agent go from random wandering to finding the optimal path — nobody tells it where the path is, it only learns from the reward signal upon reaching the goal.
Q-Value Heatmap
After training, each cell shows its highest Q-value (warmer colors mean the cell is more worth visiting):
01 Core Principles (Plain English)
You arrive in an unfamiliar city with no GPS navigation. You can only learn by "trial and error": take this road — traffic jam, noted; take that road — arrive quickly, noted. After running several times, you naturally know which route is best.
Q-Learning does exactly the same thing. It gives an agent an environment (like a maze) and lets it explore repeatedly. Every good decision earns a reward; every bad one gets a penalty. Over time, the agent builds up an "experience table", and by consulting it the agent can answer: "from my current position, which direction is the best move?"
What Does the Experience Table Look Like?
This table is called the Q-Table. Each row is a "situation" (state), each column is a "choice" (action), and the number in each cell represents "how much total reward you can expect in the long run if you make this choice".
| Position | Left | Right ✓ | Up | Down |
|---|---|---|---|---|
| (0,0) Start | -0.1 | 2.4 | -0.1 | 0.6 |
| (1,0) | 0.3 | 3.8 | -0.1 | 1.2 |
The ✓ marks the best action for each state (the highest Q-value in its row); the agent picks that one.
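The Q-table above is just a 2D lookup. A minimal sketch in Python (the maze size and action encoding are illustrative assumptions, not the demo's actual code):

```python
import numpy as np

# Hypothetical 5x5 maze flattened to 25 states, 4 actions: 0=Left, 1=Right, 2=Up, 3=Down.
n_states, n_actions = 25, 4
q_table = np.zeros((n_states, n_actions))  # all zeros before training

# Suppose training has filled in the row for the start cell (0,0):
q_table[0] = [-0.1, 2.4, -0.1, 0.6]

# "Consulting the table" = picking the highest-scoring action for the current state.
best_action = int(np.argmax(q_table[0]))
print(best_action)  # 1, i.e. "Right", matching the ✓ column above
```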
How Do You Fill In This Table?
At the beginning the Q-table is all zeros. The agent picks actions randomly. Walking into dead ends or hitting walls — penalty; accidentally reaching the goal — reward!
After each step, update the Q-value for that step using the formula: "how much reward you got this step + the maximum reward you can still get from the next step". The scores for good paths gradually get "propagated" backward.
The longer the training, the more accurate the Q-table becomes, and the more the agent tends to directly look up the highest-scoring action instead of continuing to explore randomly.
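The fill-in rule described above is one line of arithmetic: nudge the old estimate toward "this step's reward + the best reward still reachable from the next step". A minimal sketch with illustrative values for the learning rate and discount factor:

```python
alpha, gamma = 0.1, 0.9  # learning rate and discount factor (illustrative values)

def update(q_table, state, action, reward, next_state):
    # "reward this step + the maximum reward you can still get from the next step"
    target = reward + gamma * max(q_table[next_state])
    # Move the old estimate a small step (alpha) toward that target.
    q_table[state][action] += alpha * (target - q_table[state][action])

q = {0: [0.0, 0.0], 1: [0.0, 0.0]}  # tiny 2-state, 2-action table
update(q, state=0, action=1, reward=1.0, next_state=1)
print(q[0][1])  # 0.1: the reward has started to propagate back into state 0
```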
Exploration vs. Exploitation Balance (ε-greedy): If you always explore randomly, you'll never learn well; if you switch to the "known best path" too early, you might miss better routes. The parameter ε controls this: with probability ε, explore randomly; with probability 1-ε, choose the known best action. During training, ε is gradually decreased, transitioning from "curious explorer" to "experienced driver".
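The ε-greedy choice and its decay schedule might look like this (the decay rate and floor are illustrative assumptions):

```python
import random

def epsilon_greedy(q_row, epsilon):
    """Pick an action for one state: explore with probability epsilon,
    otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_row))                    # explore: any action
    return max(range(len(q_row)), key=lambda a: q_row[a])      # exploit: argmax

# Decay epsilon over training: "curious explorer" -> "experienced driver".
epsilon, decay, min_epsilon = 1.0, 0.99, 0.05
for episode in range(100):
    epsilon = max(min_epsilon, epsilon * decay)
print(round(epsilon, 3))  # 0.366: mostly exploiting by now
```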
What Can Q-Learning Do?
- Foundational algorithm for Atari games and board games
- Learning to find paths in unknown maps
- Data center and traffic light optimization
- Using user interactions as reward signals for optimization
Build Q-Learning Step by Step
From defining the environment to implementing two temporal-difference (TD) algorithms, build it up step by step.
- Cliff Walking environment: states are grid coordinates, actions are up/down/left/right, and falling off the cliff gives a reward of -100.
- Q-table and ε-greedy: store Q-values in a 2D array; with probability ε explore a random action, otherwise pick the action with the highest Q-value.
- Q-Learning update: update the current Q-value using the best Q-value of the next state. The target is aggressive, so the agent takes shortcuts right along the cliff edge.
- SARSA update: update using the Q-value of the next action actually executed. More conservative, so the agent takes a safer path away from the cliff.
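The difference between the two TD targets described above can be sketched side by side (the toy Q-values near the cliff edge are made up for illustration):

```python
# Q-Learning (off-policy): the target uses the BEST next action,
# even if the epsilon-greedy agent then actually explores a different one.
def q_learning_target(q, s_next, reward, gamma):
    return reward + gamma * max(q[s_next])

# SARSA (on-policy): the target uses the action the agent ACTUALLY takes next,
# chosen by the same epsilon-greedy behavior policy.
def sarsa_target(q, s_next, a_next, reward, gamma):
    return reward + gamma * q[s_next][a_next]

# Toy next-state values beside the cliff: action 1 is the shortcut,
# action 2 steps off the edge.
q = {"edge": [0.0, 5.0, -100.0]}
print(q_learning_target(q, "edge", -1, 0.9))   # 3.5: optimistic, hugs the cliff
print(sarsa_target(q, "edge", 2, -1, 0.9))     # -91.0: the actual slip is priced in
```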
02 Code
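A minimal, self-contained sketch of tabular Q-Learning on a Cliff Walking grid, following the steps above (grid size, reward values, and hyperparameters are illustrative assumptions, not the demo's actual settings):

```python
import random

# Tiny Cliff Walking grid, 4 rows x 6 cols (an assumed layout).
# Start bottom-left, goal bottom-right, the cells between them are the cliff.
ROWS, COLS = 4, 6
START, GOAL = (3, 0), (3, 5)
CLIFF = {(3, c) for c in range(1, 5)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, a):
    r, c = state
    dr, dc = ACTIONS[a]
    nr = min(max(r + dr, 0), ROWS - 1)  # bumping a wall keeps you in place
    nc = min(max(c + dc, 0), COLS - 1)
    if (nr, nc) in CLIFF:
        return START, -100, False       # fall off: big penalty, back to start
    if (nr, nc) == GOAL:
        return (nr, nc), 0, True
    return (nr, nc), -1, False          # every ordinary step costs -1

def train(episodes=500, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    random.seed(seed)
    q = {(r, c): [0.0] * 4 for r in range(ROWS) for c in range(COLS)}
    for _ in range(episodes):
        s, done = START, False
        while not done:
            # epsilon-greedy behavior policy
            a = (random.randrange(4) if random.random() < eps
                 else max(range(4), key=lambda i: q[s][i]))
            s2, reward, done = step(s, a)
            # Q-Learning update: bootstrap from the best next action
            q[s][a] += alpha * (reward + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train()
# A greedy rollout from the start should now reach the goal.
s, path = START, [START]
for _ in range(50):
    s, _, done = step(s, max(range(4), key=lambda i: q[s][i]))
    path.append(s)
    if done:
        break
print(path[-1])  # should be the goal, (3, 5)
```

Swapping the bootstrap line for the actually-chosen next action turns this into SARSA, which learns the safer path described earlier.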
03 Academic Explanation
Q-Learning is a reinforcement learning algorithm where an agent learns an optimal policy by interacting with an environment. The core idea is to learn a Q-table that records the expected cumulative reward for taking each action in each state.
What Is the Q-Table?
The Q-table is a lookup table where rows represent states, columns represent actions, and the values Q(s, a) represent the expected cumulative reward for taking action a in state s.
| State | Left | Right | Up | Down |
|---|---|---|---|---|
| (0,0) | 0 | 0 | 0 | 0 |
| (0,1) | 0 | 0 | 0 | 0 |
Q-Learning Algorithm
The agent updates the Q-table with the rule

Q(s,a) ← Q(s,a) + α·[r + γ·max Q(s',a') − Q(s,a)]

repeating the following loop each step:

- Choose an action a from the current state s using the ε-greedy policy
- Execute the action, receiving immediate reward r and new state s'
- Compute the TD error δ = r + γ·max Q(s',a') − Q(s,a) and update Q(s,a) by α·δ
ε-Greedy Policy
Balancing exploration and exploitation:
- With probability ε, choose a random action (exploration)
- With probability 1-ε, choose the current best known action (exploitation)
As learning progresses, ε is typically decreased (ε-decay), transitioning the agent from mostly exploring to mostly exploiting.
Hyperparameter Meanings
Learning Rate α
Controls the step size of each update. When α=1, completely overwrite old values with new estimates; when α→0, barely update at all. Typically 0.1–0.5.
Discount Factor γ
Determines the importance of future rewards. γ=0 considers only immediate rewards; γ→1 values long-term returns. Typically 0.9–0.99.
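The effect of γ can be checked directly by computing a discounted return for a fixed reward sequence (the rewards are illustrative):

```python
# Discounted return G = r0 + γ·r1 + γ²·r2 + ... for a fixed reward sequence.
rewards = [1, 1, 1, 1, 1]

def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 0.9))  # ≈ 4.1: future rewards still matter
```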
Convergence Guarantee
Under the following conditions, Q-Learning can be proven to converge to the optimal Q* function:
- Every state-action pair is visited infinitely often (sufficient exploration)
- The learning rate satisfies the Robbins-Monro conditions: ∑α = ∞, ∑α² < ∞
- Rewards are bounded
Q-Learning is an off-policy algorithm: it updates using max Q(s',a') (greedy policy) while the behavior policy is ε-greedy — the two can differ. This is the key distinction from SARSA (on-policy).
Summary
- State-action value function
- Temporal difference learning rule
- Behavior policy decoupled from target policy
- Requires DQN for large state spaces