See It in Action: CartPole

The agent must move the cart left or right to keep the pole balanced. A PD controller simulates a learned policy — to see real DQN training, run the code section below.

01 Plain English Explanation of DQN

What problem does Q-Learning run into?

Q-Learning uses a large table to record "how many points each action is worth in each state". In a 5×5 maze, this table only has 25 rows — no problem at all.

But when playing Atari games, each frame of the screen is a state — the number of pixel combinations is astronomical, and this table simply can't be stored.

Q-Learning's Limit

State space explosion: each Atari frame is 210×160 pixels with 3 color channels, so the number of distinct states is astronomically large; a Q table simply can't hold them all

DQN's Solution

Replace the table with a neural network. Input a state, directly output Q-values for each action — the neural network automatically generalizes patterns, giving similar predictions for similar states

Two key techniques for stable training

Directly replacing the Q table with a neural network causes training to collapse: it's like simultaneously changing your exam answers and the grading criteria, so you can never learn stably. DQN solves this with two techniques:

1. Experience Replay

Every step the agent takes, it stores (current state, action, reward, next state) into a "memory bank". During training, it randomly samples a batch of data from this memory bank.

It's like reviewing flashcards in random order instead of sequentially — shuffling the temporal order prevents the network from only remembering recent events.
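The memory bank above can be sketched in a few lines (the class and method names here are illustrative, not from any library):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        # A bounded queue: once full, the oldest experience is dropped
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(150):                 # overfill: only the newest 100 survive
    buf.push(t, 0, 1.0, t + 1, False)
batch = buf.sample(32)
print(len(buf), len(batch))  # 100 32
```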

2. Target Network

Maintain two identical networks simultaneously: an "online network" updated in real-time, and a "target network" whose parameters are only synced every few hundred steps.

It's like having a fixed answer key during an exam. If the answer key kept changing in real-time based on your responses, you'd never know which direction to improve — the target network provides a stable "reference frame".
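The bookkeeping for the two networks is simple; a toy sketch (a 4-weight vector stands in for the network's parameters, and the random nudge stands in for a gradient update):

```python
import numpy as np

rng = np.random.default_rng(0)
online = np.zeros(4)        # online-network parameters (toy 4-weight vector)
target = online.copy()      # target network starts as an exact copy
C = 100                     # sync period, in steps

for step in range(1, 301):
    online += 0.01 * rng.standard_normal(4)  # stand-in for a gradient update
    if step % C == 0:
        target = online.copy()               # hard sync every C steps

# Between syncs the two drift apart; right after a sync they match again.
print(np.allclose(online, target))  # True: step 300 is a sync step
```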

DQN's Complete Training Pipeline

Select action using ε-greedy

Most of the time, follow the network's predicted optimal action; with a small probability, explore randomly to avoid getting stuck in local optima

Store experience, random sampling

Store (s, a, r, s') into the Replay Buffer; once a batch (e.g., 32 entries) is ready, start training

Compute target Q-value

Use the "target network" to compute y = r + γ·max Q(s', a'), as the training label

Backpropagation to update online network

Minimize the mean squared error between the online network's output Q(s,a) and target y, taking one gradient descent step

Periodically sync target network

Every C steps, copy the online network parameters to the target network
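The five steps above in miniature (a plain weight matrix W, mapping 4 toy states to 2 action values, stands in for the neural network so the gradient step fits in one line; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, eps, lr = 0.99, 0.1, 0.1
W_online = rng.standard_normal((4, 2))   # online network
W_target = W_online.copy()               # target network

# 1. epsilon-greedy action in state s
s = 2
a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(W_online[s]))

# 2. a replay buffer of stored (s, a, r, s') transitions; sample a batch
buffer = [(0, 1, 0.0, 1), (1, 0, 1.0, 2), (2, 1, 0.0, 3), (s, a, 0.5, 0)]
batch = [buffer[i] for i in rng.integers(len(buffer), size=3)]

for bs, ba, br, bs2 in batch:
    # 3. TD target computed with the frozen *target* network
    y = br + gamma * np.max(W_target[bs2])
    # 4. one gradient step on (Q(s,a) - y)^2 w.r.t. the online network
    W_online[bs, ba] -= lr * (W_online[bs, ba] - y)

# 5. every C steps the target network is hard-copied (shown once here)
W_target = W_online.copy()
print(np.allclose(W_online, W_target))  # True right after a sync
```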

What can DQN do?

Atari Games

In 2013, DeepMind introduced DQN, directly inputting game screen pixels; by 2015 it surpassed human-level performance on many Atari games

Robot Control

Motion control in continuous state spaces, replacing hand-written controllers

Recommendation Systems

Treating user behavior sequences as states, learning optimal recommendation policies

Traffic Scheduling

Traffic light control and route planning under complex road network conditions

Building DQN Step by Step

From CartPole physics to experience replay training, build it piece by piece.

Step 1 CartPole Physics Environment

Using Newtonian mechanics to simulate a cart-pole system, the state is [position, velocity, angle, angular velocity].

Step 2 Network and Replay Buffer

Two networks (online + target) + experience replay queue, solving the data correlation problem.

Step 3 Experience Replay + TD Update

Random sampling from the buffer, computing TD targets, training the online network to approximate Q*.
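Step 1's dynamics can be sketched with the classic cart-pole equations (the constants below follow the commonly used Gym values; this is a simplified Euler-integration sketch, not a full environment):

```python
import math

# state = [position x, velocity, angle theta, angular velocity]
GRAVITY, M_CART, M_POLE = 9.8, 1.0, 0.1
TOTAL_M = M_CART + M_POLE
LENGTH = 0.5                       # half the pole's length
POLEMASS_LENGTH = M_POLE * LENGTH
FORCE_MAG, TAU = 10.0, 0.02        # push force and integration time step

def step(state, action):
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG   # 1 = push right
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    # Newtonian dynamics of the cart-pole system, integrated with Euler steps
    temp = (force + POLEMASS_LENGTH * theta_dot**2 * sin_t) / TOTAL_M
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        LENGTH * (4.0 / 3.0 - M_POLE * cos_t**2 / TOTAL_M))
    x_acc = temp - POLEMASS_LENGTH * theta_acc * cos_t / TOTAL_M
    x, x_dot = x + TAU * x_dot, x_dot + TAU * x_acc
    theta, theta_dot = theta + TAU * theta_dot, theta_dot + TAU * theta_acc
    done = abs(x) > 2.4 or abs(theta) > 0.2095   # out of bounds: episode ends
    return [x, x_dot, theta, theta_dot], 1.0, done

state = [0.0, 0.0, 0.05, 0.0]      # pole tilted slightly right
for _ in range(100):
    state, reward, done = step(state, 0)   # always push left: the pole falls
    if done:
        break
print(done, round(state[2], 3))
```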

02 Code

03 Academic Explanation

DQN (Deep Q-Network) is the deep learning version of Q-Learning, using neural networks to approximate the Q-function, solving the problem that Q-Learning's Q table cannot be stored when the state space is large.

Why do we need DQN?

Q-Learning's Q table cannot be stored when the state space is large (e.g., playing chess, autonomous driving). DQN uses neural networks to approximate the Q-function:

Q(s, a) ≈ NeuralNetwork(s)[a] (the network takes only the state as input and outputs one Q-value per action)
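A minimal sketch of this function approximation (a two-layer numpy MLP with illustrative layer sizes; a real DQN would use a deep-learning framework):

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, hidden, n_actions = 4, 16, 2
W1 = rng.standard_normal((state_dim, hidden)) * 0.1
b1 = np.zeros(hidden)
W2 = rng.standard_normal((hidden, n_actions)) * 0.1
b2 = np.zeros(n_actions)

def q_network(state):
    h = np.maximum(0.0, state @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                     # one Q-value per action

q = q_network(np.array([0.0, 0.1, 0.05, -0.1]))  # a CartPole-like state
print(q.shape)            # (2,): one Q-value per action
print(int(np.argmax(q)))  # index of the greedy action
```

A single forward pass scores every action at once, which is why the max over actions in the TD target is cheap to compute.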

DQN's Two Key Techniques

1. Experience Replay

Store experiences in a replay buffer, randomly sample for training, breaking temporal correlations between data

2. Target Network

Use two networks: an online network, updated every step, and a target network that computes target values, with parameters periodically synced from the online network

Loss Function

DQN's loss function is the temporal difference error:

L(θ) = (r + γ · max Q_θ⁻(s', a') − Q_θ(s, a))²
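A worked numeric example (the values are chosen purely for illustration):

```python
r, gamma = 1.0, 0.99
max_q_next, q_current = 2.5, 3.0     # illustrative Q-values
y = r + gamma * max_q_next           # TD target: 1 + 0.99 * 2.5 = 3.475
loss = (y - q_current) ** 2          # squared TD error: 0.475^2
print(y, round(loss, 6))  # 3.475 0.225625
```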

Why does experience replay stabilize training?

In reinforcement learning, consecutively collected samples are highly correlated (consecutive frames are similar). If trained sequentially, the network overfits to recent experiences and forgets previously learned knowledge (catastrophic forgetting). Replay Buffer's random sampling disrupts temporal dependencies, making the sample distribution closer to i.i.d., resulting in more stable gradient estimates.

Why do we need a target network?

In the TD target y = r + γ·max Q_θ(s', a'), if Q_θ is the same network being updated, each update step changes the target, creating a "chasing a moving target" problem, leading to training oscillation or even divergence. The target network Q_θ⁻'s parameters are only copied from the online network every C steps, providing a short-term stable supervision signal.

Algorithm Comparison

Q-Learning

Q table stores all state-action values; only suitable for small state spaces; off-policy

DQN

Neural network approximates Q; experience replay + target network; suitable for high-dimensional states

Double DQN

Uses the online network for action selection and the target network for value estimation, reducing Q-value overestimation bias

Dueling DQN

Decomposes Q into state value V(s) and advantage function A(s,a), improving sample efficiency
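The two variants can be illustrated with toy numbers (the arrays below are made-up Q-values, not trained ones):

```python
import numpy as np

gamma, r = 0.99, 1.0
q_online_next = np.array([1.0, 2.0, 0.5])   # online net's Q(s', .)
q_target_next = np.array([1.6, 1.5, 0.7])   # target net's Q(s', .)

# Vanilla DQN target: max over the *target* network (prone to overestimation)
y_dqn = r + gamma * np.max(q_target_next)
# Double DQN: the online net *selects* the action, the target net *evaluates* it
a_star = int(np.argmax(q_online_next))
y_double = r + gamma * q_target_next[a_star]
print(round(y_dqn, 3), round(y_double, 3))  # 2.584 2.485

# Dueling DQN head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
V = 1.2
A = np.array([0.5, -0.1, -0.4])
Q = V + A - A.mean()
print(np.round(Q, 2))  # [1.7 1.1 0.8]
```

Note how the Double DQN target (2.485) is lower than the vanilla one (2.584): decoupling selection from evaluation avoids always taking the most optimistic estimate.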

Convergence Conditions and Limitations

Convergence Guarantee

Tabular Q-Learning is guaranteed to converge when the Robbins-Monro step-size conditions are satisfied; DQN's function approximation carries weaker theoretical guarantees, but experience replay and target networks make it sufficiently stable in practice

Main Limitations

Only suitable for discrete action spaces (continuous actions require Actor-Critic methods such as DDPG or TD3); relatively low sample efficiency, requiring extensive environment interaction

Summary

Network

Approximates Q-function

Replay

Experience replay buffer

Target

Target network

Optimization

Gradient descent