LSTM: Long Short-Term Memory
Give neural networks "selective memory" — instead of remembering everything, intelligently decide what's worth keeping and what to forget
✦ See It in Action: Tang Poetry Continuation
Enter a few characters and the model continues writing character by character — purely based on patterns learned from 50,000 Tang poems, without any rule-based system.
Higher temperature = more creative; lower = more conservative
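The temperature knob can be made concrete with a few lines of code. This is a minimal sketch, not the demo's actual implementation: the logits are divided by the temperature before the softmax, so low temperatures concentrate probability on the most likely character and high temperatures flatten the distribution.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    # Scale logits by 1/temperature before the softmax:
    # T > 1 flattens the distribution (more creative),
    # T < 1 sharpens it around the top choice (more conservative).
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.1]                    # scores for three candidate characters
# At T = 0.1, sampling almost always returns index 0, the top-scoring character.
picks = [sample_with_temperature(logits, 0.1, rng) for _ in range(100)]
```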
01 Core Principles (Plain English)
You're watching a 100-episode TV series. By episode 80, you don't remember every line from episode 3, but you definitely remember the key plot point that "the protagonist's father is actually the villain" — because it's so important, your brain saved it.
A standard RNN can't do this. It's like short-term working memory — every time a new input is processed, old information gets diluted, and information from 100 steps ago has basically vanished by now. LSTM's design goal is to solve this problem: let the network decide for itself what's worth remembering long-term and what can be forgotten.
The Memory Pipeline: Three Gates
LSTM introduces an additional "memory channel" (cell state) on top of the standard RNN, along with three learnable "gates" that control information flow:
Forget gate: looks at the current input and the previous memory, then decides which parts of the cell state are no longer needed. For example, upon seeing "new topic begins," it clears the details of the previous topic. It outputs a coefficient between 0 (forget everything) and 1 (keep everything).
Input gate: decides which parts of the current input are worth storing in long-term memory. Not every word is important; only key information gets written to the cell state.
Output gate: the cell state stores a lot of information, but at the current moment we only need the task-relevant portion. The output gate decides which part of the memory to "read out" and pass to the next step or the final prediction.
The essence of the gating mechanism: each gate is a small neural network that outputs values between 0 and 1, acting as a "soft switch" for information. The gate parameters are learned through backpropagation; the network teaches itself when to remember, when to forget, and when to output.
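As a concrete sketch of the "soft switch" idea, here is a single gate in NumPy. The weights are random placeholders standing in for learned parameters, and the sizes (4 hidden units, 3 inputs) are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single LSTM gate: a linear layer over [h_prev, x] squashed into (0, 1).
# W and b are illustrative random values; a real network learns them
# by backpropagation.
rng = np.random.default_rng(42)
hidden, inputs = 4, 3
W = rng.standard_normal((hidden, hidden + inputs))
b = np.zeros(hidden)

h_prev = rng.standard_normal(hidden)        # previous hidden state
x = rng.standard_normal(inputs)             # current input
gate = sigmoid(W @ np.concatenate([h_prev, x]) + b)
# Every entry of `gate` lies strictly between 0 and 1: a soft switch
# that scales how much of each memory component passes through.
```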
Compared to Standard RNNs, Where Does LSTM Win?
Standard RNN
Its only memory is the hidden state, which gets partially overwritten by every new input. During backpropagation, gradients decay exponentially with distance; signals from 100 steps ago can barely influence weight updates. This is called the vanishing gradient problem.
LSTM
The cell state is a "highway": information can flow directly across many time steps, and gradients propagate backward along this additive path with little decay. This lets LSTM handle long-range dependencies spanning hundreds of steps.
What Tasks Is LSTM Good At?
Text generation: remember context to produce coherent sentences
Sentiment analysis: read the entire sentence before judging sentiment
Time-series forecasting: stock prices, weather, sensor data
Speech recognition: sequence modeling where syllables depend on each other
Building LSTM Step by Step
From time series data to sequence prediction, step by step.
Step 1, data: generate a sine wave, normalize it, and construct sliding-window input sequences xs paired with next-step target values ys.
Step 2, model: the LSTM(32) layer remembers sequence context, and Dense(1) outputs the prediction.
Step 3, prediction: after training, perform rolling predictions on the test set, feeding each prediction back as the input for the next step.
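The data-preparation step can be sketched in NumPy as follows; the window length of 20 and the series length of 400 are assumed hyperparameters, not taken from the original demo.

```python
import numpy as np

# Generate a sine wave and normalize it to [0, 1] so the LSTM
# sees a consistent input scale.
t = np.linspace(0, 8 * np.pi, 400)
series = np.sin(t)
series = (series - series.min()) / (series.max() - series.min())

# Sliding windows: xs[k] holds `window` consecutive values,
# ys[k] is the single value that immediately follows them.
window = 20
xs = np.stack([series[i:i + window] for i in range(len(series) - window)])
ys = series[window:]
```

The model and prediction steps would then fit a Keras-style Sequential model (an LSTM(32) layer followed by Dense(1)) on (xs, ys), and roll its one-step predictions forward on held-out data by appending each prediction to the input window.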
02 Code
03 Academic Explanation
LSTM (Long Short-Term Memory) was proposed by Hochreiter and Schmidhuber in 1997. It is a special type of recurrent neural network that introduces cell state and three gating mechanisms, solving the vanishing gradient problem of standard RNNs and enabling the learning of long-term dependencies spanning hundreds of time steps.
Why Do We Need LSTM?
When processing long sequences, gradients in standard RNNs decay exponentially as they propagate backward through time, so early information is hard to preserve. For a sequence of length T, the gradient reaching the first step is a product of per-step Jacobians:
∂h_T/∂h_1 = ∏_{t=2}^{T} ∂h_t/∂h_{t-1}
If the spectral norm of each step's Jacobian is less than 1, the product shrinks toward zero (vanishing gradient); if greater than 1, the gradient explodes. LSTM's additive update path for the cell state lets gradients flow across many time steps with little degradation.
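A quick numerical illustration of the difference between the two paths; the per-step factors 0.9 and 0.999 are illustrative stand-ins, not values derived from any real network:

```python
import numpy as np

# Multiplicative path (plain RNN): 100 factors, each slightly below 1.
factors = np.full(100, 0.9)
multiplicative = np.prod(factors)   # 0.9**100 is about 2.7e-5: vanished

# Additive cell-state path (LSTM): the gradient through
# C_t = f_t * C_{t-1} + ... is scaled only by the forget gate f_t,
# which the network can keep close to 1 for information it must retain.
forget = np.full(100, 0.999)
additive_like = np.prod(forget)     # about 0.90: the signal survives 100 steps
```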
The Three Gates of LSTM
Forget gate: decides what information to discard from the cell state:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Input gate: decides what new information to store in the cell state:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
with candidate values C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
Output gate: decides what information to output:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
Cell State Update
Forget old information, then write new information:
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
where ⊙ denotes element-wise multiplication (Hadamard product). This additive pathway is the key to letting gradients propagate over long distances.
Hidden State Output
h_t = o_t ⊙ tanh(C_t)
The hidden state h_t serves both as the output for the current time step and as the memory passed to the next time step.
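The gate and state equations above can be transcribed almost line by line into NumPy. This is an illustrative single-cell forward pass with random placeholder weights, not a trained model; the sizes (4 hidden units, 3 inputs, 10 steps) are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, params):
    """One LSTM time step, transcribing the gate equations directly."""
    W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o = params
    z = np.concatenate([h_prev, x])        # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)             # forget gate f_t
    i = sigmoid(W_i @ z + b_i)             # input gate i_t
    C_tilde = np.tanh(W_C @ z + b_C)       # candidate values C~_t
    o = sigmoid(W_o @ z + b_o)             # output gate o_t
    C = f * C_prev + i * C_tilde           # additive cell-state update
    h = o * np.tanh(C)                     # hidden state h_t
    return h, C

# Random placeholder parameters; a real model learns these by backpropagation.
rng = np.random.default_rng(0)
H, D = 4, 3                                # hidden size, input size (arbitrary)
params = []
for _ in range(4):                         # one (W, b) pair per gate/candidate
    params += [0.1 * rng.standard_normal((H, H + D)), np.zeros(H)]

h, C = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((10, D)):     # feed a 10-step random sequence
    h, C = lstm_step(x, h, C, params)
```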
LSTM Gate State Animation
Observe how the forget gate, input gate, and output gate states change as LSTM processes a sequence:
Summary
Core structure: a long-term memory cell state
Mechanism: selective forgetting and remembering via gates
Typical tasks: text and time-series prediction
Positioning: an improved variant of the RNN