
RNNs and LSTMs: Mastering Sequential Data

LDS Team
Let's Data Science

A stock price depends on yesterday's close. A sentence depends on the words that came before. Sequential data has a property that tabular data does not: order matters. Recurrent Neural Networks were the first architecture to take that property seriously, dominating machine translation and speech recognition for nearly two decades. Transformers have since claimed the spotlight, but RNNs remain the foundation you need before any of it makes sense. And recurrent architectures are making a comeback in 2026 through state-space models and xLSTM, proving the core ideas never went away.

We will follow one running example from start to finish: predicting the next value in a temperature time series collected from weather sensors. Each hour produces one reading, and the model must learn temporal patterns (daily cycles, seasonal trends, sudden cold fronts) to forecast what comes next.
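
Before any model sees this data, the hourly series has to be cut into supervised (window, next value) pairs. A minimal numpy sketch, using a synthetic sine-wave daily cycle as stand-in data (the real series would come from the sensors):

```python
import numpy as np

hours = np.arange(24 * 14)                         # two weeks of hourly readings
temps = 15 + 8 * np.sin(2 * np.pi * hours / 24)    # synthetic daily cycle, illustrative only

def make_windows(series, window=24):
    """Turn a 1-D series into (past 24 hours -> next hour) training pairs."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., None], y                         # add a feature axis: (N, 24, 1)

X, y = make_windows(temps)
print(X.shape, y.shape)   # (312, 24, 1) (312,)
```

Each row of `X` is one day of context and each entry of `y` is the temperature one hour later, which is exactly the shape the recurrent models below consume.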

The Vanilla RNN: Learning Through Recurrence

A recurrent neural network processes sequences one element at a time, maintaining a hidden state that carries information forward from previous time steps. Unlike a feedforward network that sees each input independently, an RNN passes its "memory" from step to step.

At each time step $t$, the vanilla RNN computes:

$$h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h)$$

$$\hat{y}_t = W_{hy} \cdot h_t + b_y$$

Where:

  • $h_t$ is the hidden state at time step $t$
  • $h_{t-1}$ is the hidden state from the previous time step
  • $x_t$ is the input at time step $t$ (a single temperature reading)
  • $W_{hh}$ is the hidden-to-hidden weight matrix (recurrent weights)
  • $W_{xh}$ is the input-to-hidden weight matrix
  • $W_{hy}$ is the hidden-to-output weight matrix
  • $b_h$ and $b_y$ are bias vectors
  • $\hat{y}_t$ is the predicted output (next temperature)

In Plain English: At each hour, the RNN takes two things: the current temperature reading and its memory of all previous readings. It blends them together through learned weights, squashes the result with tanh to keep values bounded, and produces both a prediction and an updated memory for the next step.

Figure: Vanilla RNN unrolled through three time steps, showing how the hidden state flows forward.

When you "unroll" an RNN across time, each step shares the same weights ($W_{hh}$, $W_{xh}$, $W_{hy}$). This weight sharing makes RNNs parameter-efficient for sequences of any length: a feedforward network would need separate weights for each position, while an RNN reuses one set everywhere.
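
Writing the recurrence out by hand makes the weight sharing explicit: one set of matrices is reused at every hour. A numpy sketch with small random weights, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inputs = 4, 1

# One shared set of weights, reused at every time step
W_hh = rng.normal(scale=0.5, size=(hidden, hidden))
W_xh = rng.normal(scale=0.5, size=(hidden, inputs))
b_h = np.zeros(hidden)
W_hy = rng.normal(scale=0.5, size=(1, hidden))
b_y = np.zeros(1)

def rnn_forward(xs):
    """Unroll h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h) over a sequence."""
    h = np.zeros(hidden)
    for x in xs:                       # same W_hh, W_xh applied at every step
        h = np.tanh(W_hh @ h + W_xh @ np.atleast_1d(x) + b_h)
    return W_hy @ h + b_y              # prediction from the final hidden state

temps = [12.1, 11.8, 11.5, 11.9]       # four hourly readings
print(rnn_forward(temps).shape)        # (1,)
```

No matter how many readings arrive, the parameter count never changes; only the loop runs longer.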

Here is a minimal PyTorch implementation:

```python
import torch
import torch.nn as nn

class VanillaRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        out, h_n = self.rnn(x)               # out: hidden states for all time steps
        prediction = self.fc(out[:, -1, :])  # predict from the last time step
        return prediction

model = VanillaRNN(input_size=1, hidden_size=64, output_size=1)
# For our temperature series: 1 feature in, predict 1 value out
```

This looks elegant. The problem is that it barely works for sequences longer than about 20 steps.

The Vanishing Gradient Problem

Training an RNN uses backpropagation through time (BPTT), which unrolls the network and computes gradients across all time steps. The gradient of the loss with respect to early hidden states involves a chain of matrix multiplications:

$$\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \cdot \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}$$

Each factor $\frac{\partial h_t}{\partial h_{t-1}}$ involves the weight matrix and the derivative of tanh. Since tanh derivatives are at most 1, these terms multiply together across all time steps. Two things can happen:

  • Vanishing gradients: When $\|W_{hh}\|$ and the activation derivatives are less than 1, the product shrinks exponentially. After 50 or 100 steps, gradients reaching the early time steps are essentially zero. The network cannot learn long-range dependencies.
  • Exploding gradients: When $\|W_{hh}\| > 1$, gradients grow exponentially. This is easier to fix with gradient clipping, but vanishing gradients have no simple remedy for vanilla RNNs.

Key Insight: Imagine whispering a message through a chain of 100 people. Each person slightly garbles the message. By the end, the original information is gone. That's vanishing gradients. The RNN "forgets" what happened 50 hours ago in our temperature series, even though that information (yesterday's temperature cycle) is exactly what it needs.

For our weather example, the model needs to remember that temperatures follow a 24-hour cycle. A vanilla RNN struggles with even this modest requirement because gradient signals from step 1 are nearly zero by step 24 during training.
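
You can watch the gradient signal collapse numerically. The sketch below repeatedly applies one BPTT factor, $W_{hh}^\top \, \mathrm{diag}(1 - h_t^2)$, to a gradient vector; the random hidden states and the weight scale are assumed values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
W_hh = rng.normal(size=(n, n)) * 0.1    # spectral norm comfortably below 1
grad = np.ones(n)                       # gradient arriving at the last time step

norms = [np.linalg.norm(grad)]
for t in range(50):
    h = np.tanh(rng.normal(size=n))              # a typical bounded hidden state
    grad = W_hh.T @ ((1 - h**2) * grad)          # one factor of the BPTT chain
    norms.append(np.linalg.norm(grad))

print(f"step 0: {norms[0]:.2f}, after 50 steps: {norms[-1]:.2e}")
```

After 50 multiplications the norm is many orders of magnitude smaller, which is exactly why the error signal from yesterday's readings never reaches today's weights.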

This problem was identified by Hochreiter (1991) and later formalized in Bengio et al. (1994). The solution came three years later.

LSTM Architecture: Gating the Flow of Information

The Long Short-Term Memory network, introduced by Hochreiter and Schmidhuber (1997), solves the vanishing gradient problem by introducing a dedicated cell state $C_t$ that runs through time with minimal interference, plus three gates that control what information enters, persists, and exits.

Figure: LSTM cell architecture showing the forget gate, input gate, output gate, and cell state flow.

The Forget Gate

The forget gate decides what to discard from the cell state:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

Where:

  • $f_t$ is a vector of values between 0 and 1 (one per cell state dimension)
  • $\sigma$ is the sigmoid function, outputting values in $(0, 1)$
  • $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and current input
  • $W_f$ and $b_f$ are the forget gate's learnable weights and bias

In Plain English: The forget gate looks at the current temperature and recent memory, then decides what to erase. If a cold front arrived, it might forget the previous warm-day patterns and reset that portion of the cell state.

The Input Gate and Candidate Memory

The input gate controls what new information to store:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

Where:

  • $i_t$ is the input gate activation (which values to update)
  • $\tilde{C}_t$ is the candidate memory (new information to potentially add)
  • $W_i$, $W_C$ are the learnable weight matrices for the input gate and candidate
  • $b_i$, $b_C$ are the corresponding bias vectors

In Plain English: The input gate works in two stages. First, it decides which dimensions of the cell state deserve an update (the sigmoid produces values near 0 for "skip" and near 1 for "write"). Second, it creates a candidate value using tanh. For our temperature series, after a sudden cold front arrives, the candidate might encode the new lower temperature baseline while the input gate opens wide on the dimensions that track recent trends.

The Cell State Update

The cell state combines forgetting old information and adding new:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

Where:

  • $\odot$ denotes element-wise multiplication (Hadamard product)
  • $f_t \odot C_{t-1}$ selectively retains old memory
  • $i_t \odot \tilde{C}_t$ selectively adds new information

In Plain English: Think of the cell state as a conveyor belt. The forget gate punches holes in it (dropping irrelevant past temperatures), and the input gate places new items on it (adding the current reading). This lets the LSTM maintain information across hundreds of steps because the cell state flows forward with minimal modification when the gates choose to keep it unchanged.

The Output Gate

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

$$h_t = o_t \odot \tanh(C_t)$$

Where:

  • $o_t$ controls which parts of the cell state to expose as output
  • $h_t$ is the new hidden state, a filtered version of the cell state
  • $W_o$ and $b_o$ are the output gate's learnable weights and bias

In Plain English: The cell state holds everything the LSTM knows, but not all of it is relevant right now. The output gate filters it. If the model is predicting the next hour's temperature, it might expose the dimensions encoding recent trend direction while keeping long-term seasonal information hidden inside the cell state for later use.
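
The four gate equations compose into a single cell update that is only a few lines of code. Below is a minimal numpy sketch of one LSTM step; the weight shapes and random initialization are illustrative, not a production implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, params):
    """One LSTM step following the four gate equations (numpy sketch)."""
    W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o = params
    z = np.concatenate([h_prev, x])          # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)               # forget gate: what to erase
    i = sigmoid(W_i @ z + b_i)               # input gate: which dims to write
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate memory
    C = f * C_prev + i * C_tilde             # cell state update
    o = sigmoid(W_o @ z + b_o)               # output gate: what to expose
    h = o * np.tanh(C)                       # filtered hidden state
    return h, C

rng = np.random.default_rng(0)
n_h, n_x = 8, 1
make = lambda: (rng.normal(scale=0.1, size=(n_h, n_h + n_x)), np.zeros(n_h))
params = [p for _ in range(4) for p in make()]

h, C = np.zeros(n_h), np.zeros(n_h)
for temp in [12.1, 11.8, 11.5]:              # three hourly readings
    h, C = lstm_step(np.array([temp]), h, C, params)
print(h.shape, C.shape)   # (8,) (8,)
```

Note how `C` is touched only by element-wise operations, never by a full matrix multiply; that is the conveyor belt.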

Why This Solves Vanishing Gradients

The cell state update equation is the key:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

When the forget gate outputs values near 1, the gradient through the cell state is simply $f_t$. No repeated matrix multiplications, no shrinking chain. Hochreiter and Schmidhuber called this the "constant error carousel."
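
A back-of-the-envelope comparison makes the difference concrete. With the forget gate at 0.95, the gradient factor across 100 steps stays usable, while a vanilla RNN chain with a per-step factor of 0.65 (an assumed typical value for tanh derivative times weight norm) collapses:

```python
T = 100
carousel = 0.95 ** T          # gradient factor through the LSTM cell state
rnn_chain = 0.65 ** T         # assumed tanh' x weight factor in a vanilla RNN

print(carousel, rnn_chain)    # carousel stays usable; the vanilla chain is ~1e-19
```

Sixteen orders of magnitude separate the two: the carousel still carries a trainable signal from 100 hours back, the vanilla chain does not.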

Pro Tip: When using nn.LSTM in PyTorch, always prefer the fused module over manual nn.LSTMCell loops. The fused version runs up to 100x faster because it uses cuDNN-optimized kernels. Only drop to LSTMCell when you need custom gating logic.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, input_size=1, hidden_size=128, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size, hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.2        # dropout between stacked layers
        )
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, (h_n, c_n) = self.lstm(x)
        return self.fc(out[:, -1, :])

model = LSTMForecaster()
# Parameters: ~200K for a 2-layer, 128-hidden LSTM with 1 input feature
```

For a deeper treatment of LSTMs applied specifically to forecasting, see Mastering LSTMs for Time Series.

GRU: The Streamlined Alternative

The Gated Recurrent Unit (Cho et al., 2014) matches LSTM performance with a simpler design: one state vector instead of two, and two gates instead of three.

Figure: Side-by-side comparison of GRU and LSTM cells showing architectural differences.

Reset Gate

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$

The reset gate determines how much of the previous hidden state to ignore when computing the candidate. When $r_t \approx 0$, the unit effectively resets its memory.

Update Gate

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$

Hidden State Update

$$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b)$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

Where:

  • $z_t$ acts as both the forget and input gate simultaneously
  • $(1 - z_t)$ controls how much old state to keep
  • $z_t$ controls how much new candidate to accept
  • $r_t \odot h_{t-1}$ is the reset-filtered previous state

In Plain English: The GRU uses one dial (the update gate $z_t$) to balance old and new information. At $z_t = 0$, the hidden state copies forward unchanged; at $z_t = 1$, it replaces itself entirely. For our temperature series, the GRU sets $z_t \approx 0$ during stable weather and $z_t \approx 1$ when a sudden drop signals a new weather system.
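
One GRU step is even shorter than the LSTM's. A numpy sketch of the three equations, with illustrative weight shapes and random initialization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wr, br, Wz, bz, W, b):
    """One GRU step following the equations above (numpy sketch)."""
    zx = np.concatenate([h_prev, x])                       # [h_{t-1}, x_t]
    r = sigmoid(Wr @ zx + br)                              # reset gate
    z = sigmoid(Wz @ zx + bz)                              # update gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x]) + b)
    return (1 - z) * h_prev + z * h_tilde                  # one dial blends old and new

rng = np.random.default_rng(0)
n_h, n_x = 8, 1
Wr, Wz, W = (rng.normal(scale=0.1, size=(n_h, n_h + n_x)) for _ in range(3))
br = bz = b = np.zeros(n_h)

h = np.zeros(n_h)
for temp in [12.1, 11.8, 11.5]:
    h = gru_step(np.array([temp]), h, Wr, br, Wz, bz, W, b)
print(h.shape)   # (8,)
```

Because the new state is a convex combination of old state and candidate, the hidden values stay bounded without a separate cell state.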

| Feature | LSTM | GRU |
| --- | --- | --- |
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| State vectors | 2 ($h_t$ and $C_t$) | 1 ($h_t$) |
| Parameters per unit | $4 \times (n_h^2 + n_h \cdot n_x + n_h)$ | $3 \times (n_h^2 + n_h \cdot n_x + n_h)$ |
| Training speed | Slower (more computation) | ~25% faster |
| Long-range memory | Slightly better (separate cell state) | Comparable on most tasks |
| Best for | Long sequences (>500 steps) | Smaller datasets, speed-critical |

Common Pitfall: Choosing between LSTM and GRU based on theory alone is a mistake. Hyperparameter tuning matters more than the architecture choice. Train both, compare on your validation set, and pick the winner.

Bidirectional RNNs and Sequence-to-Sequence Models

Bidirectional Processing

Standard RNNs only see past context. A bidirectional RNN runs two hidden states in opposite directions and concatenates the outputs:

$$\overrightarrow{h_t} = \text{RNN}(x_t, \overrightarrow{h_{t-1}}) \qquad \overleftarrow{h_t} = \text{RNN}(x_t, \overleftarrow{h_{t+1}})$$

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

Bidirectional models excel when you have the full sequence available at inference time (text classification, named entity recognition). They cannot be used for autoregressive generation or real-time streaming, since future inputs are not yet available. BERT famously used bidirectional context as a core design principle, and understanding bidirectional RNNs makes that architecture much easier to grasp.
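
The concatenation idea is easy to see with two hand-rolled passes over the same sequence. A numpy sketch with illustrative random weights (in practice you would set `bidirectional=True` on `nn.RNN`/`nn.LSTM` instead):

```python
import numpy as np

def run_rnn(xs, W_hh, W_xh):
    """Run a simple tanh RNN over xs and return the hidden state at each step."""
    h, states = np.zeros(W_hh.shape[0]), []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ np.atleast_1d(x))
        states.append(h)
    return states

rng = np.random.default_rng(0)
n = 4
Wf = (rng.normal(scale=0.3, size=(n, n)), rng.normal(scale=0.3, size=(n, 1)))
Wb = (rng.normal(scale=0.3, size=(n, n)), rng.normal(scale=0.3, size=(n, 1)))

xs = [0.1, 0.5, -0.2]
fwd = run_rnn(xs, *Wf)               # left to right
bwd = run_rnn(xs[::-1], *Wb)[::-1]   # right to left, re-aligned to original order
h = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(h[0].shape)   # (8,): past context + future context at step 0
```

Every position now carries both past and future context, which is why the backward pass requires the whole sequence up front.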

Sequence-to-Sequence Architecture

The encoder-decoder framework (Sutskever et al., 2014) maps variable-length inputs to variable-length outputs. The encoder LSTM compresses the input into a context vector (the final hidden state), and the decoder LSTM generates the output from that context.

Figure: Sequence-to-sequence encoder-decoder architecture for temperature forecasting.

Teacher forcing feeds ground-truth tokens to the decoder during training instead of its own predictions. This speeds convergence but creates exposure bias: during inference, the model uses its own (possibly wrong) predictions. Scheduled sampling mitigates this by gradually mixing in model predictions during training.

For our temperature example, a seq2seq model might encode 168 hours (one week) and decode the next 24 hours. The context vector must compress an entire week of weather into a fixed-size representation, a significant bottleneck that directly motivated the attention mechanism and transformers.
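
That encode-then-decode loop can be sketched in PyTorch. This is a hedged illustration, not a reference implementation: the class name, the last-value decoder seeding, and the layer sizes are all assumptions made for the example:

```python
import torch
import torch.nn as nn

class Seq2SeqForecaster(nn.Module):
    """Sketch: encode 168 hours of history, decode the next 24 autoregressively."""
    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(1, hidden, batch_first=True)
        self.decoder = nn.LSTM(1, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, x, horizon=24):
        _, state = self.encoder(x)            # context vector = final (h, c)
        step = x[:, -1:, :]                   # seed the decoder with the last observation
        preds = []
        for _ in range(horizon):              # autoregressive decoding, one hour at a time
            out, state = self.decoder(step, state)
            step = self.fc(out)               # next predicted temperature
            preds.append(step)
        return torch.cat(preds, dim=1)

model = Seq2SeqForecaster()
x = torch.randn(2, 168, 1)                    # a batch of two one-week histories
print(model(x).shape)                         # torch.Size([2, 24, 1])
```

During training you would replace `step` with the ground-truth value at each position to apply teacher forcing; at inference the loop above feeds back its own predictions, which is where exposure bias enters.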

When to Use RNNs vs. Transformers in 2026

Transformers dominate large language models and most NLP tasks, but writing off RNNs would be premature. Here is an honest comparison as of March 2026.

| Criterion | RNNs (LSTM/GRU) | Transformers |
| --- | --- | --- |
| Inference memory | $O(1)$ in sequence length | $O(n)$ KV-cache |
| Inference latency per token | Constant | Grows with context |
| Training parallelization | Sequential (slow) | Fully parallel (fast) |
| Long-range retrieval | Weak (fixed-size state) | Strong (direct attention) |
| Sequence length limit | Theoretically unlimited | Fixed context window |
| Edge device deployment | Excellent (tiny memory) | Challenging (KV-cache) |
| Parameter efficiency | High | Moderate |

Where RNNs still win

Streaming and real-time inference. An LSTM processes each new input in $O(1)$ memory regardless of how long the stream runs. A transformer's KV-cache grows linearly, eventually exceeding your memory budget. For audio streams, sensor feeds, and financial tick data on IoT hardware, this difference is decisive.

Resource-constrained edge devices. A 2-layer LSTM with 256 hidden units has roughly 800K parameters and runs comfortably on a microcontroller.

Short sequences with simple temporal patterns. For time series forecasting tasks with fewer than 200 time steps, LSTMs match or beat transformers while being much simpler to deploy and debug.

Where transformers win

In-context learning and retrieval. The ICLR 2025 paper "RNNs Are Not Transformers (Yet)" showed that RNNs fundamentally struggle with in-context retrieval. Transformers can directly attend to any position, while RNNs must compress everything into a fixed-size state.

Long-range recall. If you need to retrieve a specific fact from 10,000 tokens ago, a transformer does it directly. An RNN likely forgot it.

Scale. At billions of parameters, transformers scale more predictably. Their parallelizable architecture makes training on GPU clusters practical.

Key Insight: The real question is not "RNN or transformer?" but "does my task need recall over long contexts?" If yes, use a transformer or hybrid. If you need streaming inference with constant memory, RNNs remain the right choice.

The Recurrent Revival: SSMs and xLSTM

Two research directions in 2024 and 2025 brought recurrence back to the frontier.

State-Space Models and Mamba

Mamba (Gu and Dao, 2023) introduced selective state spaces: a recurrent architecture with input-dependent gating that achieves linear-time sequence processing. Unlike vanilla SSMs that use fixed dynamics, Mamba's selection mechanism lets the model choose which information to propagate or forget at each step, similar in spirit to LSTM gating but with a continuous-time formulation.

Mamba achieves up to 5x inference throughput over comparable transformers while matching their quality on language modeling. Hybrid architectures combining Mamba layers with sparse attention layers are showing strong results across multiple benchmarks as of early 2026.

xLSTM: Hochreiter's Return

In May 2024, Sepp Hochreiter (the original LSTM inventor) published xLSTM, extending the classic architecture with two key innovations:

  1. Exponential gating. Replace sigmoid gates with exponential activation, combined with normalization and stabilization techniques. This gives the gates much sharper on/off dynamics.
  2. Matrix memory (mLSTM). Instead of a vector cell state, mLSTM uses a matrix memory with a covariance update rule, making it fully parallelizable during training while retaining recurrent inference.

The companion sLSTM variant keeps scalar memory with enhanced mixing. xLSTM was a NeurIPS 2024 spotlight, and the xLSTM-7B model (March 2025, 2.3 trillion tokens) matches LLaMA-7B performance with faster inference and lower memory.

Pro Tip: Starting a new sequence modeling project in 2026? Need real-time streaming on edge hardware? Use LSTM/GRU. Need strong in-context retrieval over long documents? Use a transformer. Want both efficiency and range at moderate scale? Try Mamba or xLSTM.

Production Considerations

Training complexity. LSTM training is $O(T \cdot n_h^2)$ per sample, where $T$ is sequence length and $n_h$ is hidden size. The sequential dependency means you cannot parallelize across time steps.

Gradient clipping is non-negotiable. Always clip gradients when training RNNs. PyTorch's torch.nn.utils.clip_grad_norm_ with a max norm of 1.0 is a sensible default. Without it, exploding gradients will periodically destabilize training. For a deeper look at optimizer choices, see deep learning optimizers from SGD to AdamW.

Stacking layers carefully. Adding more LSTM layers improves capacity but demands dropout between layers (0.2 to 0.5). More than 3 stacked layers rarely helps.

Hidden size selection. For time series tasks, hidden sizes of 64 to 256 cover most use cases. Going larger increases overfitting risk without proportional accuracy gains.

Truncated BPTT. Longer sequences require more memory during backpropagation. Processing fixed-length chunks with detached hidden states passed between them is the standard workaround.

```python
import torch
import torch.nn as nn

# Truncated BPTT training loop sketch
model = nn.LSTM(input_size=1, hidden_size=128, num_layers=2, batch_first=True)
fc = nn.Linear(128, 1)
params = list(model.parameters()) + list(fc.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
max_grad_norm = 1.0

def train_step(x_chunk, y_chunk, hidden):
    """Process one chunk with truncated BPTT."""
    # Detach hidden state to prevent backprop into the previous chunk
    if hidden is not None:
        hidden = tuple(h.detach() for h in hidden)

    out, hidden = model(x_chunk, hidden)
    pred = fc(out[:, -1, :])
    loss = nn.functional.mse_loss(pred, y_chunk)

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_grad_norm)  # clip LSTM and head together
    optimizer.step()
    return loss.item(), hidden
```

Conclusion

Recurrent neural networks taught the field how to think about sequences. The vanilla RNN introduced shared weights across time, and its failure on long sequences led directly to LSTM and GRU gating. These are not just historical artifacts; the same principles of gating and controlled information flow appear in every modern sequence model, from transformers to state-space models.

The practical takeaway is straightforward. For streaming inference and edge deployment where memory must stay constant, LSTMs and GRUs remain the best tool in 2026. For tasks requiring strong recall over long contexts, transformers win. And for applications that need both efficiency and long-range modeling, the new generation of recurrent models like Mamba and xLSTM offer a compelling middle ground. If you are building neural networks from scratch, understanding how activation functions like sigmoid and tanh control information flow through gates is essential. And once you grasp recurrent architectures, the jump to how large language models actually work becomes far more natural.

The architectural debate will keep evolving. What will not change is the fundamental insight that sequences need memory, and how you manage that memory determines everything.

Interview Questions

Why can't a vanilla RNN learn long-range dependencies?

During BPTT, gradients are multiplied by the recurrent weight matrix at every time step. When these products are consistently less than 1, gradients shrink exponentially. After 50 or more steps, the signal from early inputs effectively vanishes and the network cannot learn that those inputs matter.

How does the LSTM cell state solve vanishing gradients?

The cell state update $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ avoids repeated matrix multiplications. When the forget gate outputs values near 1, gradients flow through nearly unchanged. This "constant error carousel" lets gradients propagate across hundreds of steps.

What distinguishes an LSTM from a GRU?

LSTMs have three gates and a separate cell state; GRUs have two gates and a single hidden state, training roughly 25% faster. In practice, performance differences are small, and the choice comes down to dataset size and compute budget. LSTMs have a slight edge on very long sequences.

What is teacher forcing, and what problem does it create?

Teacher forcing feeds ground-truth tokens to the decoder during training instead of its own predictions. This speeds convergence but creates exposure bias: at inference, the model must use its own (potentially wrong) outputs. Scheduled sampling mitigates this by gradually mixing in model predictions during training.

When should you use a bidirectional RNN?

Use bidirectional RNNs when the full sequence is available at inference time, such as text classification, NER, or sentiment analysis. Do not use them for autoregressive tasks where future inputs are unavailable.

Why choose an LSTM over a transformer for time series in 2026?

LSTMs process sequences with $O(1)$ memory per step, making them ideal for streaming inference and edge deployment. They have far fewer parameters, reducing overfitting on small datasets. For short sequences under 200 time steps, LSTMs match transformer accuracy with simpler deployment.

How do state-space models like Mamba relate to RNNs?

SSMs use continuous-time dynamics discretized for sequences. Mamba adds input-dependent selection, analogous to LSTM gating. Like RNNs, SSMs use constant memory at inference, but they can be parallelized during training via a convolutional view.

What is gradient clipping and why is it essential for RNN training?

Gradient clipping caps the gradient norm at a threshold. RNNs are prone to exploding gradients because the recurrent weight matrix multiplies at every step, and large eigenvalues cause exponential growth. A max norm of 1.0 via clip_grad_norm_ is standard practice.
