DQN
The deep learning upgrade to Q-Learning: instead of tabulating every possible situation, a neural network generalizes across states, enabling agents to learn to play Atari games and control robots
✦ See It in Action: CartPole
The agent must move the cart left or right to keep the pole balanced. A PD controller simulates a learned policy — to see real DQN training, run the code section below.
01 Plain English Explanation of DQN
What problem does Q-Learning run into?
Q-Learning uses a large table to record "how many points each action is worth in each state". In a 5×5 maze, this table only has 25 rows — no problem at all.
But when playing Atari games, each frame of the screen is a state — the number of pixel combinations is astronomical, and this table simply can't be stored.
State space explosion: Atari games have 210×160×3 pixels per frame, with more unique combinations than atoms in the universe — the Q table simply can't hold it all
Replace the table with a neural network. Input a state, directly output Q-values for each action — the neural network automatically generalizes patterns, giving similar predictions for similar states
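As a sketch of this idea, here is a tiny fully connected Q-network in NumPy. The sizes, the single ReLU hidden layer, and the random weights are illustrative assumptions (not the architecture used for Atari); the point is the shape of the mapping: one state vector in, one Q-value per action out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes for CartPole: 4 state dimensions, 2 actions.
STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 2

# One hidden layer with ReLU; weights are randomly initialized here
# just to show the shapes involved.
W1 = rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, (HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def q_values(state):
    """Map a state vector to one Q-value per action."""
    h = np.maximum(0.0, state @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                     # linear output head

s = np.array([0.0, 0.1, -0.05, 0.2])       # [position, velocity, angle, angular velocity]
q = q_values(s)
best_action = int(np.argmax(q))            # greedy choice: index of the largest Q-value
```

Because the network is a smooth function of the state, nearby states produce nearby Q-values, which is exactly the generalization a table cannot provide.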
Two key techniques for stable training
Directly replacing the Q table with a neural network causes training to collapse: it is like changing your exam answers and the grading criteria at the same time, so you can never learn stably. DQN solves this with two techniques:
Technique 1: Experience Replay. Every step the agent takes, it stores (current state, action, reward, next state) in a "memory bank". During training, it randomly samples a batch of data from this memory bank.
It's like reviewing flashcards in random order instead of sequentially — shuffling the temporal order prevents the network from only remembering recent events.
Technique 2: Target Network. Maintain two identical networks: an "online network" updated in real time, and a "target network" whose parameters are synced only every few hundred steps.
It's like having a fixed answer key during an exam. If the answer key kept changing in real-time based on your responses, you'd never know which direction to improve — the target network provides a stable "reference frame".
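The two techniques above can be sketched in a few lines of Python. The `ReplayBuffer` class and `sync_target` helper are illustrative names, assuming network parameters are stored as a dict of NumPy arrays:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory bank of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random sampling shuffles away the temporal order.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def sync_target(online_params, target_params):
    """Copy online-network parameters into the target network (done every C steps)."""
    for name, value in online_params.items():
        target_params[name] = value.copy()
```

In a full implementation the parameter copy would go through the deep-learning framework's own state-dict mechanism; the dict-of-arrays version here just shows that the target network is a periodic snapshot, not a second learner.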
DQN's Complete Training Pipeline
1. Act: most of the time, follow the network's predicted best action; with a small probability, explore randomly to avoid getting stuck in local optima.
2. Store: put (s, a, r, s') into the Replay Buffer; once a batch (e.g., 32 entries) is ready, start training.
3. Compute targets: use the target network to compute y = r + γ·max Q(s', a') as the training label.
4. Update: minimize the mean squared error between the online network's output Q(s, a) and the target y, taking one gradient descent step.
5. Sync: every C steps, copy the online network's parameters into the target network.
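The pipeline above can be sketched end to end. To stay self-contained, this sketch assumes a linear Q-function (`s @ W`) instead of a deep network, and the constants and function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
GAMMA, EPSILON, LR = 0.99, 0.1, 0.01
STATE_DIM, N_ACTIONS = 4, 2

# Linear Q-function for brevity: Q(s, ·) = s @ W. A real DQN uses a deep network.
W_online = rng.normal(0.0, 0.1, (STATE_DIM, N_ACTIONS))
W_target = W_online.copy()

def select_action(state):
    """Step 1: epsilon-greedy. Usually exploit, occasionally explore at random."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(state @ W_online))

def train_step(batch):
    """Steps 3-4: TD targets from the frozen target network, then one
    gradient descent step on the online network's mean squared error."""
    global W_online
    s, a, r, s_next, done = (np.asarray(x) for x in zip(*batch))
    # y = r + γ·max_a' Q_target(s', a'); terminal states get no bootstrap term.
    y = r + GAMMA * (1.0 - done) * np.max(s_next @ W_target, axis=1)
    q = (s @ W_online)[np.arange(len(a)), a]   # Q(s, a) for the actions actually taken
    grad = np.zeros_like(W_online)             # gradient of 0.5 * mean((q - y)^2)
    for i in range(len(a)):
        grad[:, a[i]] += (q[i] - y[i]) * s[i]
    W_online -= LR * grad / len(a)
    return float(np.mean((q - y) ** 2))

def sync_target():
    """Step 5: every C steps, copy online parameters into the target network."""
    W_target[...] = W_online
```

Note that `train_step` never touches `W_target`: between syncs, the labels y stay fixed, which is exactly the stable "answer key" described above.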
What can DQN do?
Atari games: in 2013 DeepMind showed DQN learning to play Atari games directly from screen-pixel input, and by 2015 it surpassed human-level performance on many of them
Robot control: motion control in continuous state spaces, replacing hand-written controllers
Recommendation systems: treating user behavior sequences as states, learning optimal recommendation policies
Traffic management: traffic light control and route planning under complex road network conditions
Building DQN Step by Step
From CartPole physics to experience replay training, build it piece by piece.
Step 1, Environment: use Newtonian mechanics to simulate a cart-pole system; the state is [position, velocity, angle, angular velocity].
Step 2, Networks: two networks (online + target) plus an experience replay queue, solving the data-correlation problem.
Step 3, Training: randomly sample from the buffer, compute TD targets, and train the online network to approximate Q*.
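A minimal sketch of the environment step, using the standard cart-pole equations of motion with assumed constants (cart mass 1.0 kg, pole mass 0.1 kg, pole half-length 0.5 m, 10 N push force) and explicit Euler integration:

```python
import math

# Assumed constants mirroring the classic cart-pole setup.
GRAVITY, M_CART, M_POLE = 9.8, 1.0, 0.1
LENGTH, FORCE_MAG, DT = 0.5, 10.0, 0.02   # pole half-length, push force, time step
TOTAL_MASS = M_CART + M_POLE
POLEMASS_LENGTH = M_POLE * LENGTH

def step(state, action):
    """Advance the cart-pole one time step.
    state = [position, velocity, angle, angular velocity]; action 0 pushes left, 1 pushes right."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Newtonian equations of motion for the coupled cart and pole.
    temp = (force + POLEMASS_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        LENGTH * (4.0 / 3.0 - M_POLE * cos_t ** 2 / TOTAL_MASS))
    x_acc = temp - POLEMASS_LENGTH * theta_acc * cos_t / TOTAL_MASS
    # Explicit Euler integration.
    return [x + DT * x_dot,
            x_dot + DT * x_acc,
            theta + DT * theta_dot,
            theta_dot + DT * theta_acc]
```

An episode typically ends when the angle or position leaves a fixed band; the agent receives +1 reward for every step the pole stays up.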
02 Code
03 Academic Explanation
DQN (Deep Q-Network) is the deep learning version of Q-Learning: it uses a neural network to approximate the Q-function, solving the problem that Q-Learning's Q table cannot be stored when the state space is large.
Why do we need DQN?
Q-Learning's Q table cannot be stored when the state space is large (e.g., playing chess, autonomous driving). DQN instead uses a neural network with weights θ to approximate the Q-function: Q(s, a; θ) ≈ Q*(s, a).
DQN's Two Key Techniques
Store experiences in a replay buffer, randomly sample for training, breaking temporal correlations between data
Use two networks: one for selecting actions, one for computing target values, periodically syncing parameters
Loss Function
DQN's loss function is the mean squared temporal-difference error: L(θ) = E[(r + γ·max_a' Q_θ⁻(s', a') − Q_θ(s, a))²], where θ⁻ denotes the target network's parameters.
Why does experience replay stabilize training?
In reinforcement learning, consecutively collected samples are highly correlated (consecutive frames are similar). If trained sequentially, the network overfits to recent experiences and forgets previously learned knowledge (catastrophic forgetting). Replay Buffer's random sampling disrupts temporal dependencies, making the sample distribution closer to i.i.d., resulting in more stable gradient estimates.
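A quick way to see the effect: compare how far apart in time the samples of a sequential minibatch and a uniformly sampled minibatch are. The buffer size and batch size below are illustrative numbers:

```python
import random
import numpy as np

random.seed(0)  # fixed seed so the comparison is reproducible

N, BATCH = 10_000, 32   # buffer size and minibatch size (illustrative)

# Training on the last 32 steps: every sample comes from one narrow slice of time.
sequential_batch = list(range(100, 100 + BATCH))
# Uniform random sampling from the whole buffer, as a replay buffer does.
random_batch = sorted(random.sample(range(N), BATCH))

# Average time gap between samples within each batch: 1 step for the
# sequential batch versus hundreds of steps for the random one, so the
# random batch spans the whole buffer instead of one correlated stretch.
seq_gap = float(np.mean(np.diff(sequential_batch)))
rand_gap = float(np.mean(np.diff(random_batch)))
```

The larger the spread in time, the closer the minibatch distribution is to i.i.d. draws from the agent's overall experience.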
Why do we need a target network?
In the TD target y = r + γ·max Q_θ(s', a'), if Q_θ is the same network being updated, every update step changes the target, creating a "chasing a moving target" problem that leads to training oscillation or even divergence. The target network Q_θ⁻'s parameters are only copied from the online network every C steps, providing a short-term stable supervision signal.
Algorithm Comparison
Q-Learning: Q table stores all state-action values; only suitable for small state spaces; off-policy
DQN: neural network approximates Q; experience replay + target network; suitable for high-dimensional states
Double DQN: uses the online network for action selection and the target network for value estimation, reducing Q-value overestimation bias
Dueling DQN: decomposes Q into a state value V(s) and an advantage function A(s, a), improving sample efficiency
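The Double DQN row can be illustrated with the target computation alone. The two arrays below are stand-ins for the networks' outputs on a batch of next states:

```python
import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.99

# Stand-ins for network outputs on a batch of 5 next states and 3 actions.
q_online_next = rng.normal(size=(5, 3))   # online network's Q(s', ·)
q_target_next = rng.normal(size=(5, 3))   # target network's Q(s', ·)
r = np.ones(5)                            # rewards for the batch

# Vanilla DQN: the target network both selects and evaluates the best action,
# so the max operator tends to overestimate.
y_dqn = r + GAMMA * np.max(q_target_next, axis=1)

# Double DQN: the online network selects the action, the target network
# evaluates it, decoupling selection from evaluation.
best = np.argmax(q_online_next, axis=1)
y_double = r + GAMMA * q_target_next[np.arange(5), best]
```

Because the evaluated value can never exceed the target network's own max, the Double DQN target is never larger than the vanilla one on the same batch.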
Convergence Conditions and Limitations
Tabular Q-Learning is guaranteed to converge when the Robbins-Monro step-size conditions are satisfied; DQN uses function approximation and has weaker theoretical convergence guarantees, but in practice experience replay and target networks make it sufficiently stable.
DQN is only suitable for discrete action spaces; continuous actions require DDPG, TD3, and other Actor-Critic methods. Its sample efficiency is relatively low, requiring extensive environment interaction.
Summary
Approximates the Q-function with a neural network instead of a table
Experience replay buffer breaks temporal correlations in the data
Target network provides a stable TD target
Trained by gradient descent on the TD error