LSTM: Long Short-Term Memory
Give neural networks "selective memory" — instead of remembering everything, intelligently decide what's worth keeping and what to forget
✦ See It in Action: Tang Poetry Continuation
Enter a few characters and the model continues writing character by character — purely based on patterns learned from 50,000 Tang poems, without any rule-based system.
Higher temperature = more creative; lower = more conservative
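The temperature knob can be made concrete with a few lines of code. This is a minimal sketch, not the demo's actual implementation: the logits are divided by the temperature before the softmax, so low temperatures concentrate probability on the most likely character and high temperatures flatten the distribution.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    # Scale logits by 1/temperature before the softmax:
    # T > 1 flattens the distribution (more creative),
    # T < 1 sharpens it around the top choice (more conservative).
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.1]                    # scores for three candidate characters
# At T = 0.1, sampling almost always returns index 0, the top-scoring character.
picks = [sample_with_temperature(logits, 0.1, rng) for _ in range(100)]
```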
01 Core Principles (Plain English)
You're watching a 100-episode TV series. By episode 80, you don't remember every line from episode 3, but you definitely remember the key plot point that "the protagonist's father is actually the villain" — because it's so important, your brain saved it.
A standard RNN can't do this. It's like short-term working memory — every time a new input is processed, old information gets diluted, and information from 100 steps ago has basically vanished by now. LSTM's design goal is to solve this problem: let the network decide for itself what's worth remembering long-term and what can be forgotten.
The Memory Pipeline: Three Gates
LSTM introduces an additional "memory channel" (cell state) on top of the standard RNN, along with three learnable "gates" that control information flow:
Forget gate: looks at the current input and the previous memory, then decides which parts of the cell state are no longer needed. For example, upon seeing "new topic begins," it clears the details of the previous topic. It outputs a coefficient between 0 (forget everything) and 1 (keep everything).
Input gate: decides which parts of the current input are worth storing in long-term memory. Not every word is important; only key information gets written to the cell state.
Output gate: the cell state stores a lot of information, but at the current moment we only need the task-relevant portion. The output gate decides which part of the memory to "read out" and pass to the next step or the final prediction.
The essence of the gating mechanism: each gate is a small neural network that outputs values between 0 and 1, acting as a "soft switch" for information. The gate parameters are learned through backpropagation; the network teaches itself when to remember, when to forget, and when to output.
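As a concrete sketch of the "soft switch" idea, here is a single gate in NumPy. The weights are random placeholders standing in for learned parameters, and the sizes (4 hidden units, 3 inputs) are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single LSTM gate: a linear layer over [h_prev, x] squashed into (0, 1).
# W and b are illustrative random values; a real network learns them
# by backpropagation.
rng = np.random.default_rng(42)
hidden, inputs = 4, 3
W = rng.standard_normal((hidden, hidden + inputs))
b = np.zeros(hidden)

h_prev = rng.standard_normal(hidden)        # previous hidden state
x = rng.standard_normal(inputs)             # current input
gate = sigmoid(W @ np.concatenate([h_prev, x]) + b)
# Every entry of `gate` lies strictly between 0 and 1: a soft switch
# that scales how much of each memory component passes through.
```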
Compared to Standard RNNs, Where Does LSTM Win?
Standard RNN
Its only memory is the hidden state, which gets partially overwritten by every new input. During backpropagation, gradients decay exponentially with distance; signals from 100 steps ago can barely influence weight updates. This is called the vanishing gradient problem.
LSTM
The cell state is a "highway": information can flow directly across many time steps, and gradients propagate backward along this additive path with little decay. This lets LSTM handle long-range dependencies spanning hundreds of steps.
What Tasks Is LSTM Good At?
Text generation: remember context to produce coherent sentences
Sentiment analysis: read the entire sentence before judging sentiment
Time-series forecasting: stock prices, weather, sensor data
Speech recognition: sequence modeling where syllables depend on each other
Building LSTM Step by Step
From time series data to sequence prediction, step by step.
Step 1, data: generate a sine wave, normalize it, and construct sliding-window input sequences xs paired with next-step target values ys.
Step 2, model: the LSTM(32) layer remembers sequence context, and Dense(1) outputs the prediction.
Step 3, prediction: after training, perform rolling predictions on the test set, feeding each prediction back as the input for the next step.
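The data-preparation step can be sketched in NumPy as follows; the window length of 20 and the series length of 400 are assumed hyperparameters, not taken from the original demo.

```python
import numpy as np

# Generate a sine wave and normalize it to [0, 1] so the LSTM
# sees a consistent input scale.
t = np.linspace(0, 8 * np.pi, 400)
series = np.sin(t)
series = (series - series.min()) / (series.max() - series.min())

# Sliding windows: xs[k] holds `window` consecutive values,
# ys[k] is the single value that immediately follows them.
window = 20
xs = np.stack([series[i:i + window] for i in range(len(series) - window)])
ys = series[window:]
```

The model and prediction steps would then fit a Keras-style Sequential model (an LSTM(32) layer followed by Dense(1)) on (xs, ys), and roll its one-step predictions forward on held-out data by appending each prediction to the input window.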
02 Code
03 Academic Explanation
LSTM (Long Short-Term Memory) was proposed by Hochreiter and Schmidhuber in 1997. It is a special type of recurrent neural network that introduces cell state and three gating mechanisms, solving the vanishing gradient problem of standard RNNs and enabling the learning of long-term dependencies spanning hundreds of time steps.
Why Do We Need LSTM?
When processing long sequences, gradients in standard RNNs decay exponentially as they propagate backward through time, so early information is hard to preserve. For a sequence of length T, the gradient reaching the first step is a product of per-step Jacobians:
∂h_T/∂h_1 = ∏_{t=2}^{T} ∂h_t/∂h_{t-1}
If the spectral norm of each step's Jacobian is less than 1, the product shrinks toward zero (vanishing gradient); if greater than 1, the gradient explodes. LSTM's additive update path for the cell state lets gradients flow across many time steps with little degradation.
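A quick numerical illustration of the difference between the two paths; the per-step factors 0.9 and 0.999 are illustrative stand-ins, not values derived from any real network:

```python
import numpy as np

# Multiplicative path (plain RNN): 100 factors, each slightly below 1.
factors = np.full(100, 0.9)
multiplicative = np.prod(factors)   # 0.9**100 is about 2.7e-5: vanished

# Additive cell-state path (LSTM): the gradient through
# C_t = f_t * C_{t-1} + ... is scaled only by the forget gate f_t,
# which the network can keep close to 1 for information it must retain.
forget = np.full(100, 0.999)
additive_like = np.prod(forget)     # about 0.90: the signal survives 100 steps
```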
The Three Gates of LSTM
Forget gate: decides what information to discard from the cell state:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Input gate: decides what new information to store in the cell state:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
with candidate values C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
Output gate: decides what information to output:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
Cell State Update
Forget old information, then write new information:
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
where ⊙ denotes element-wise multiplication (Hadamard product). This additive pathway is the key to letting gradients propagate over long distances.
Hidden State Output
h_t = o_t ⊙ tanh(C_t)
The hidden state h_t serves both as the output for the current time step and as the memory passed to the next time step.
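The gate and state equations above can be transcribed almost line by line into NumPy. This is an illustrative single-cell forward pass with random placeholder weights, not a trained model; the sizes (4 hidden units, 3 inputs, 10 steps) are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, params):
    """One LSTM time step, transcribing the gate equations directly."""
    W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o = params
    z = np.concatenate([h_prev, x])        # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)             # forget gate f_t
    i = sigmoid(W_i @ z + b_i)             # input gate i_t
    C_tilde = np.tanh(W_C @ z + b_C)       # candidate values C~_t
    o = sigmoid(W_o @ z + b_o)             # output gate o_t
    C = f * C_prev + i * C_tilde           # additive cell-state update
    h = o * np.tanh(C)                     # hidden state h_t
    return h, C

# Random placeholder parameters; a real model learns these by backpropagation.
rng = np.random.default_rng(0)
H, D = 4, 3                                # hidden size, input size (arbitrary)
params = []
for _ in range(4):                         # one (W, b) pair per gate/candidate
    params += [0.1 * rng.standard_normal((H, H + D)), np.zeros(H)]

h, C = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((10, D)):     # feed a 10-step random sequence
    h, C = lstm_step(x, h, C, params)
```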
LSTM Gate State Animation
Observe how the forget gate, input gate, and output gate states change as LSTM processes a sequence:
Summary
Core structure: a long-term memory cell state
Mechanism: selective forgetting and remembering via gates
Typical tasks: text and time-series prediction
Positioning: an improved variant of the RNN