A stock price depends on yesterday's close. A sentence depends on the words that came before. Sequential data has a property that tabular data does not: order matters. Recurrent Neural Networks were the first architecture to take that property seriously, dominating machine translation and speech recognition for years before attention took over. Transformers have since claimed the spotlight, but RNNs remain the foundation you need before any of it makes sense. And recurrent architectures are making a comeback in 2026 through state-space models and xLSTM, proving the core ideas never went away.
We will follow one running example from start to finish: predicting the next value in a temperature time series collected from weather sensors. Each hour produces one reading, and the model must learn temporal patterns (daily cycles, seasonal trends, sudden cold fronts) to forecast what comes next.
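To make the running example concrete, here is a hypothetical synthetic stand-in for such a series (the constants and variable names are our own illustration, not real sensor data): a 24-hour cycle plus a slow seasonal drift and noise, sliced into sliding windows for supervised training.

```python
import numpy as np

# Synthetic hourly temperatures: daily cycle + seasonal trend + noise.
rng = np.random.default_rng(0)
hours = np.arange(24 * 365)                            # one year, hourly
daily = 10 * np.sin(2 * np.pi * hours / 24)            # day/night cycle
seasonal = 8 * np.sin(2 * np.pi * hours / (24 * 365))  # annual trend
temps = 15 + daily + seasonal + rng.normal(0, 1, hours.shape)

# Sliding windows: 48 past hours as input, the next hour as target.
window = 48
X = np.stack([temps[i:i + window] for i in range(len(temps) - window)])
y = temps[window:]
print(X.shape, y.shape)  # (8712, 48) (8712,)
```

Every model in this article consumes windows shaped like `X` and predicts scalars like `y`.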
The Vanilla RNN: Learning Through Recurrence
A recurrent neural network processes sequences one element at a time, maintaining a hidden state that carries information forward from previous time steps. Unlike a feedforward network that sees each input independently, an RNN passes its "memory" from step to step.
At each time step $t$, the vanilla RNN computes:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$\hat{y}_t = W_{hy} h_t + b_y$$

Where:
- $h_t$ is the hidden state at time step $t$
- $h_{t-1}$ is the hidden state from the previous time step
- $x_t$ is the input at time step $t$ (a single temperature reading)
- $W_{hh}$ is the hidden-to-hidden weight matrix (recurrent weights)
- $W_{xh}$ is the input-to-hidden weight matrix
- $W_{hy}$ is the hidden-to-output weight matrix
- $b_h$ and $b_y$ are bias vectors
- $\hat{y}_t$ is the predicted output (next temperature)
In Plain English: At each hour, the RNN takes two things: the current temperature reading and its memory of all previous readings. It blends them together through learned weights, squashes the result with tanh to keep values bounded, and produces both a prediction and an updated memory for the next step.
*Figure: Vanilla RNN unrolled through three time steps, showing how the hidden state flows forward.*
When you "unroll" an RNN across time, each step shares the same weights ($W_{hh}$, $W_{xh}$, $W_{hy}$). This weight sharing makes RNNs parameter-efficient for sequences of any length: a feedforward network would need separate weights for each position, while an RNN reuses one set everywhere.
Here is a minimal PyTorch implementation:
```python
import torch
import torch.nn as nn

class VanillaRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        out, h_n = self.rnn(x)               # out: all hidden states
        prediction = self.fc(out[:, -1, :])  # last time step
        return prediction

model = VanillaRNN(input_size=1, hidden_size=64, output_size=1)
# For our temperature series: 1 feature in, predict 1 value out
```
This looks elegant. The problem is that it barely works for sequences longer than about 20 steps.
The Vanishing Gradient Problem
Training an RNN uses backpropagation through time (BPTT), which unrolls the network and computes gradients across all time steps. The gradient of the loss with respect to early hidden states involves a chain of matrix multiplications:

$$\frac{\partial \mathcal{L}}{\partial h_1} = \frac{\partial \mathcal{L}}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}} = \frac{\partial \mathcal{L}}{\partial h_T} \prod_{t=2}^{T} W_{hh}^{\top} \, \mathrm{diag}\big(\tanh'(\cdot)\big)$$

Each factor involves the weight matrix $W_{hh}$ and the derivative of tanh. Since tanh derivatives are at most 1, these terms multiply together across all time steps. Two things can happen:
- Vanishing gradients: When $\lVert W_{hh} \rVert < 1$ and the activation derivatives are less than 1, the product shrinks exponentially. After 50 or 100 steps, gradients reaching the early time steps are essentially zero. The network cannot learn long-range dependencies.
- Exploding gradients: When $\lVert W_{hh} \rVert > 1$, gradients grow exponentially. This is easier to fix with gradient clipping, but vanishing gradients have no simple remedy for vanilla RNNs.
Key Insight: Imagine whispering a message through a chain of 100 people. Each person slightly garbles the message. By the end, the original information is gone. That's vanishing gradients. The RNN "forgets" what happened 50 hours ago in our temperature series, even though that information (yesterday's temperature cycle) is exactly what it needs.
For our weather example, the model needs to remember that temperatures follow a 24-hour cycle. A vanilla RNN struggles with even this modest requirement because gradient signals from step 1 are nearly zero by step 24 during training.
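You can observe this directly. The sketch below (our own illustration, not part of the weather training setup) backpropagates through a 100-step vanilla RNN with default initialization and compares the gradient magnitude reaching the first input against one near the end:

```python
import torch
import torch.nn as nn

# Measure how much gradient survives the trip back through 100 steps.
torch.manual_seed(0)
rnn = nn.RNN(input_size=1, hidden_size=32, batch_first=True)
x = torch.randn(1, 100, 1, requires_grad=True)
out, _ = rnn(x)
out[:, -1, :].sum().backward()  # loss depends only on the final state

grad_first = x.grad[0, 0].abs().item()   # gradient at step 1
grad_last = x.grad[0, -2].abs().item()   # gradient at step 99
print(f"step 1: {grad_first:.2e}, step 99: {grad_last:.2e}")
```

With typical initializations, the gradient at step 1 is many orders of magnitude smaller than at step 99, which is the vanishing gradient problem in miniature.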
This problem was identified by Hochreiter (1991) and later formalized in Bengio et al. (1994). The solution came three years later.
LSTM Architecture: Gating the Flow of Information
The Long Short-Term Memory network, introduced by Hochreiter and Schmidhuber (1997), solves the vanishing gradient problem by introducing a dedicated cell state that runs through time with minimal interference, plus three gates that control what information enters, persists, and exits.
*Figure: LSTM cell architecture showing the forget gate, input gate, output gate, and cell state flow.*
The Forget Gate
The forget gate decides what to discard from the cell state:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

Where:
- $f_t$ is a vector of values between 0 and 1 (one per cell state dimension)
- $\sigma$ is the sigmoid function, outputting values in $(0, 1)$
- $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and current input
- $W_f$ and $b_f$ are the forget gate's learnable weights and bias
In Plain English: The forget gate looks at the current temperature and recent memory, then decides what to erase. If a cold front arrived, it might forget the previous warm-day patterns and reset that portion of the cell state.
The Input Gate and Candidate Memory
The input gate controls what new information to store:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$

Where:
- $i_t$ is the input gate activation (which values to update)
- $\tilde{C}_t$ is the candidate memory (new information to potentially add)
- $W_i$, $W_C$ are the learnable weight matrices for the input gate and candidate
- $b_i$, $b_C$ are the corresponding bias vectors
In Plain English: The input gate works in two stages. First, it decides which dimensions of the cell state deserve an update (the sigmoid produces values near 0 for "skip" and near 1 for "write"). Second, it creates a candidate value using tanh. For our temperature series, after a sudden cold front arrives, the candidate might encode the new lower temperature baseline while the input gate opens wide on the dimensions that track recent trends.
The Cell State Update
The cell state combines forgetting old information and adding new:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

Where:
- $\odot$ denotes element-wise multiplication (Hadamard product)
- $f_t \odot C_{t-1}$ selectively retains old memory
- $i_t \odot \tilde{C}_t$ selectively adds new information
In Plain English: Think of the cell state as a conveyor belt. The forget gate punches holes in it (dropping irrelevant past temperatures), and the input gate places new items on it (adding the current reading). This lets the LSTM maintain information across hundreds of steps because the cell state flows forward with minimal modification when the gates choose to keep it unchanged.
The Output Gate

The output gate decides which parts of the cell state to expose as the hidden state:

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

Where:
- $o_t$ controls which parts of the cell state to expose as output
- $h_t$ is the new hidden state, a filtered version of the cell state
- $W_o$ and $b_o$ are the output gate's learnable weights and bias
In Plain English: The cell state holds everything the LSTM knows, but not all of it is relevant right now. The output gate filters it. If the model is predicting the next hour's temperature, it might expose the dimensions encoding recent trend direction while keeping long-term seasonal information hidden inside the cell state for later use.
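The four gate equations can be implemented directly. The sketch below is our own illustration with untrained random weights (sizes are arbitrary), using the same names as the text ($W_f$, $W_i$, $W_C$, $W_o$):

```python
import torch

# One LSTM step implementing the four equations literally.
torch.manual_seed(0)
n_h, n_x = 8, 1
W_f, W_i, W_C, W_o = (torch.randn(n_h, n_h + n_x) * 0.1 for _ in range(4))
b_f = b_i = b_C = b_o = torch.zeros(n_h)

def lstm_step(x_t, h_prev, C_prev):
    z = torch.cat([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = torch.sigmoid(W_f @ z + b_f)      # forget gate
    i_t = torch.sigmoid(W_i @ z + b_i)      # input gate
    C_tilde = torch.tanh(W_C @ z + b_C)     # candidate memory
    C_t = f_t * C_prev + i_t * C_tilde      # cell state update
    o_t = torch.sigmoid(W_o @ z + b_o)      # output gate
    h_t = o_t * torch.tanh(C_t)             # new hidden state
    return h_t, C_t

h, C = torch.zeros(n_h), torch.zeros(n_h)
for temp in [12.0, 11.5, 7.0]:              # three hourly readings
    h, C = lstm_step(torch.tensor([temp]), h, C)
print(h.shape, C.shape)  # torch.Size([8]) torch.Size([8])
```

Note that $h_t$ is always bounded in $(-1, 1)$ by the tanh and sigmoid, while the cell state $C_t$ can grow beyond that range and carry information forward.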
Why This Solves Vanishing Gradients
The cell state update equation is the key:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

When the forget gate outputs values near 1, the gradient through the cell state is simply $\frac{\partial C_t}{\partial C_{t-1}} = f_t \approx 1$. No repeated matrix multiplications, no shrinking chain. Hochreiter and Schmidhuber called this the "constant error carousel."
Pro Tip: When using nn.LSTM in PyTorch, always prefer the fused module over manual nn.LSTMCell loops. The fused version runs up to 100x faster because it uses cuDNN-optimized kernels. Only drop to LSTMCell when you need custom gating logic.
```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, input_size=1, hidden_size=128, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size, hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.2,  # dropout between stacked layers
        )
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, (h_n, c_n) = self.lstm(x)
        return self.fc(out[:, -1, :])

model = LSTMForecaster()
# Parameters: ~199K for a 2-layer, 128-hidden LSTM with 1 input feature
```
For a deeper treatment of LSTMs applied specifically to forecasting, see Mastering LSTMs for Time Series.
GRU: The Streamlined Alternative
The Gated Recurrent Unit (Cho et al., 2014) matches LSTM performance with a simpler design: one state vector instead of two, and two gates instead of three.
*Figure: Side-by-side comparison of GRU and LSTM cells showing the architectural differences.*
Reset Gate

$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$$

The reset gate determines how much of the previous hidden state to ignore when computing the candidate. When $r_t \approx 0$, the unit effectively resets its memory.

Update Gate

$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$$

Hidden State Update

$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

Where:
- $z_t$ acts as both the forget and input gate simultaneously
- $(1 - z_t) \odot h_{t-1}$ controls how much old state to keep
- $z_t \odot \tilde{h}_t$ controls how much new candidate to accept
- $r_t \odot h_{t-1}$ is the reset-filtered previous state

In Plain English: The GRU uses one dial (the update gate $z_t$) to balance old and new information. At $z_t = 0$, the hidden state copies forward unchanged; at $z_t = 1$, it replaces itself entirely. For our temperature series, the GRU sets $z_t \approx 0$ during stable weather and $z_t \approx 1$ when a sudden drop signals a new weather system.
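As with the LSTM, the GRU equations translate directly into code. This is our own illustrative sketch with untrained random weights (biases omitted for brevity):

```python
import torch

# One GRU step following the reset/update/candidate equations.
torch.manual_seed(0)
n_h, n_x = 8, 1
W_r, W_z, W_h = (torch.randn(n_h, n_h + n_x) * 0.1 for _ in range(3))

def gru_step(x_t, h_prev):
    z_in = torch.cat([h_prev, x_t])
    r_t = torch.sigmoid(W_r @ z_in)                             # reset gate
    z_t = torch.sigmoid(W_z @ z_in)                             # update gate
    h_tilde = torch.tanh(W_h @ torch.cat([r_t * h_prev, x_t]))  # candidate
    return (1 - z_t) * h_prev + z_t * h_tilde                   # blend

h = torch.zeros(n_h)
for temp in [12.0, 11.5, 7.0]:
    h = gru_step(torch.tensor([temp]), h)
print(h.shape)  # torch.Size([8])
```

The final blend line is the single dial described above: one vector `z_t` decides, per dimension, how much old state survives.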
| Feature | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| State vectors | 2 ($h_t$ and $C_t$) | 1 ($h_t$) |
| Parameters per unit | $4 \times (n_h^2 + n_h \cdot n_x + n_h)$ | $3 \times (n_h^2 + n_h \cdot n_x + n_h)$ |
| Training speed | Slower (more computation) | ~25% faster |
| Long-range memory | Slightly better (separate cell state) | Comparable on most tasks |
| Best for | Long sequences (>500 steps) | Smaller datasets, speed-critical |
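A quick sanity check of the parameter column (our own sketch): note that PyTorch's `nn.LSTM` and `nn.GRU` keep two bias vectors per gate (`bias_ih` and `bias_hh`), so the library count is $4 \times (n_h^2 + n_h n_x + 2 n_h)$ and $3 \times (n_h^2 + n_h n_x + 2 n_h)$ rather than the textbook form in the table.

```python
import torch.nn as nn

# Count parameters for one LSTM and one GRU layer with the same sizes.
n_x, n_h = 1, 128
count = lambda m: sum(p.numel() for p in m.parameters())
lstm_params = count(nn.LSTM(n_x, n_h))
gru_params = count(nn.GRU(n_x, n_h))
print(lstm_params, gru_params)  # 67072 50304
```

The 4:3 ratio between the two counts is exactly the gate-count difference from the table.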
Common Pitfall: Choosing between LSTM and GRU based on theory alone is a mistake. Hyperparameter tuning matters more than the architecture choice. Train both, compare on your validation set, and pick the winner.
Bidirectional RNNs and Sequence-to-Sequence Models
Bidirectional Processing
Standard RNNs only see past context. A bidirectional RNN runs two hidden states in opposite directions (a forward pass computing $\overrightarrow{h}_t$ from $x_1 \ldots x_t$ and a backward pass computing $\overleftarrow{h}_t$ from $x_T \ldots x_t$) and concatenates the outputs:

$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$$
Bidirectional models excel when you have the full sequence available at inference time (text classification, named entity recognition). They cannot be used for autoregressive generation or real-time streaming, since future inputs are not yet available. BERT famously used bidirectional context as a core design principle, and understanding bidirectional RNNs makes that architecture much easier to grasp.
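In PyTorch this is a single constructor flag. The sketch below shows how the concatenation doubles the output feature dimension:

```python
import torch
import torch.nn as nn

# Forward and backward hidden states are concatenated at each step,
# so the output feature size is 2 * hidden_size.
bilstm = nn.LSTM(input_size=1, hidden_size=64, batch_first=True,
                 bidirectional=True)
x = torch.randn(4, 24, 1)          # batch of 4 day-long sequences
out, (h_n, c_n) = bilstm(x)
print(out.shape)   # torch.Size([4, 24, 128])  -> 2 * hidden_size
print(h_n.shape)   # torch.Size([2, 4, 64])    -> one state per direction
```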
Sequence-to-Sequence Architecture
The encoder-decoder framework (Sutskever et al., 2014) maps variable-length inputs to variable-length outputs. The encoder LSTM compresses the input into a context vector (the final hidden state), and the decoder LSTM generates the output from that context.
*Figure: Sequence-to-sequence encoder-decoder architecture for temperature forecasting.*
Teacher forcing feeds ground-truth tokens to the decoder during training instead of its own predictions. This speeds convergence but creates exposure bias: during inference, the model uses its own (possibly wrong) predictions. Scheduled sampling mitigates this by gradually mixing in model predictions during training.
For our temperature example, a seq2seq model might encode 168 hours (one week) and decode the next 24 hours. The context vector must compress an entire week of weather into a fixed-size representation, a significant bottleneck that directly motivated the attention mechanism and transformers.
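A minimal sketch of that 168-in, 24-out setup with teacher forcing (module and variable names are our own illustration, not a reference implementation):

```python
import torch
import torch.nn as nn

class Seq2SeqForecaster(nn.Module):
    def __init__(self, hidden_size=128):
        super().__init__()
        self.encoder = nn.LSTM(1, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(1, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, src, tgt):
        # src: (batch, 168, 1) past week, tgt: (batch, 24, 1) ground truth
        _, state = self.encoder(src)   # context vector = final (h, c)
        # Teacher forcing: decoder sees the last observed value followed
        # by the ground-truth targets, shifted right by one step.
        dec_in = torch.cat([src[:, -1:, :], tgt[:, :-1, :]], dim=1)
        out, _ = self.decoder(dec_in, state)
        return self.fc(out)            # (batch, 24, 1)

model = Seq2SeqForecaster()
src, tgt = torch.randn(8, 168, 1), torch.randn(8, 24, 1)
pred = model(src, tgt)
print(pred.shape)  # torch.Size([8, 24, 1])
```

At inference the decoder would instead feed each of its own predictions back in, which is exactly where the exposure bias described above appears.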
When to Use RNNs vs. Transformers in 2026
Transformers dominate large language models and most NLP tasks, but writing off RNNs would be premature. Here is an honest comparison as of March 2026.
| Criterion | RNNs (LSTM/GRU) | Transformers |
|---|---|---|
| Inference memory | $O(1)$ in sequence length | $O(n)$ KV-cache |
| Inference latency per token | Constant | Grows with context |
| Training parallelization | Sequential (slow) | Fully parallel (fast) |
| Long-range retrieval | Weak (fixed-size state) | Strong (direct attention) |
| Sequence length limit | Theoretically unlimited | Fixed context window |
| Edge device deployment | Excellent (tiny memory) | Challenging (KV-cache) |
| Parameter efficiency | High | Moderate |
Where RNNs still win
Streaming and real-time inference. An LSTM processes each new input in $O(1)$ memory regardless of how long the stream runs. A transformer's KV-cache grows linearly, eventually exceeding your memory budget. For audio streams, sensor feeds, and financial tick data on IoT hardware, this difference is decisive.
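A streaming sketch of that pattern, keeping only a fixed-size (h, c) pair between calls (the callback name and sizes are hypothetical):

```python
import torch
import torch.nn as nn

# Process an unbounded sensor feed one reading at a time; the carried
# state never grows, no matter how long the stream runs.
cell = nn.LSTMCell(input_size=1, hidden_size=64)
head = nn.Linear(64, 1)
h = torch.zeros(1, 64)
c = torch.zeros(1, 64)

def on_new_reading(temp):
    """Called for each arriving reading; memory use stays constant."""
    global h, c
    with torch.no_grad():
        h, c = cell(torch.tensor([[temp]]), (h, c))
        return head(h).item()          # next-hour forecast

for t in [12.0, 11.5, 11.0]:
    forecast = on_new_reading(t)
print(type(forecast))  # <class 'float'>
```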
Resource-constrained edge devices. A 2-layer LSTM with 256 hidden units has roughly 800K parameters; quantized to 8 bits it fits in under a megabyte, comfortably within reach of modest edge hardware.
Short sequences with simple temporal patterns. For time series forecasting tasks with fewer than 200 time steps, LSTMs match or beat transformers while being much simpler to deploy and debug.
Where transformers win
In-context learning and retrieval. The ICLR 2025 paper "RNNs Are Not Transformers (Yet)" showed that RNNs fundamentally struggle with in-context retrieval. Transformers can directly attend to any position, while RNNs must compress everything into a fixed-size state.
Long-range recall. If you need to retrieve a specific fact from 10,000 tokens ago, a transformer does it directly. An RNN likely forgot it.
Scale. At billions of parameters, transformers scale more predictably. Their parallelizable architecture makes training on GPU clusters practical.
Key Insight: The real question is not "RNN or transformer?" but "does my task need recall over long contexts?" If yes, use a transformer or hybrid. If you need streaming inference with constant memory, RNNs remain the right choice.
The Recurrent Revival: SSMs and xLSTM
Two research directions in 2024 and 2025 brought recurrence back to the frontier.
State-Space Models and Mamba
Mamba (Gu and Dao, 2023) introduced selective state spaces: a recurrent architecture with input-dependent gating that achieves linear-time sequence processing. Unlike vanilla SSMs that use fixed dynamics, Mamba's selection mechanism lets the model choose which information to propagate or forget at each step, similar in spirit to LSTM gating but with a continuous-time formulation.
Mamba achieves up to 5x inference throughput over comparable transformers while matching their quality on language modeling. Hybrid architectures combining Mamba layers with sparse attention layers are showing strong results across multiple benchmarks as of early 2026.
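To make the idea concrete, here is a toy input-dependent diagonal recurrence (our own sketch of the selection idea only, not Mamba's actual discretization or hardware-aware scan):

```python
import torch

# Selective state-space toy: h_t = a(x_t) * h_{t-1} + b(x_t) * x_t,
# where the decay a and write strength b depend on the current input.
torch.manual_seed(0)
d = 16
W_a, W_b = torch.randn(d, 1) * 0.1, torch.randn(d, 1) * 0.1

h = torch.zeros(d)
for x_t in torch.randn(100, 1):
    a = torch.sigmoid(W_a @ x_t)   # input-dependent decay per channel
    b = torch.sigmoid(W_b @ x_t)   # input-dependent write strength
    h = a * h + b * x_t            # linear recurrence, O(1) state
print(h.shape)  # torch.Size([16])
```

Because the recurrence is linear in `h`, the whole sequence can also be computed with a parallel scan during training, which is what makes this family fast on GPUs while staying recurrent at inference.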
xLSTM: Hochreiter's Return
In May 2024, Sepp Hochreiter (the original LSTM inventor) published xLSTM, extending the classic architecture with two key innovations:
- Exponential gating. Replace sigmoid gates with exponential activation, combined with normalization and stabilization techniques. This gives the gates much sharper on/off dynamics.
- Matrix memory (mLSTM). Instead of a vector cell state, mLSTM uses a matrix memory with a covariance update rule, making it fully parallelizable during training while retaining recurrent inference.
The companion sLSTM variant keeps scalar memory with enhanced mixing. xLSTM was a NeurIPS 2024 spotlight, and the xLSTM-7B model (March 2025, 2.3 trillion tokens) matches LLaMA-7B performance with faster inference and lower memory.
Pro Tip: Starting a new sequence modeling project in 2026? Need real-time streaming on edge hardware? Use LSTM/GRU. Need strong in-context retrieval over long documents? Use a transformer. Want both efficiency and range at moderate scale? Try Mamba or xLSTM.
Production Considerations
Training complexity. LSTM training is $O(T \cdot h^2)$ per sample, where $T$ is sequence length and $h$ is hidden size. The sequential dependency means you cannot parallelize across time steps.
Gradient clipping is non-negotiable. Always clip gradients when training RNNs. PyTorch's torch.nn.utils.clip_grad_norm_ with a max norm of 1.0 is a sensible default. Without it, exploding gradients will periodically destabilize training. For a deeper look at optimizer choices, see deep learning optimizers from SGD to AdamW.
Stacking layers carefully. Adding more LSTM layers improves capacity but demands dropout between layers (0.2 to 0.5). More than 3 stacked layers rarely helps.
Hidden size selection. For time series tasks, hidden sizes of 64 to 256 cover most use cases. Going larger increases overfitting risk without proportional accuracy gains.
Truncated BPTT. Longer sequences require more memory during backpropagation. Processing fixed-length chunks with detached hidden states passed between them is the standard workaround.
```python
import torch
import torch.nn as nn

# Truncated BPTT training loop sketch
model = nn.LSTM(input_size=1, hidden_size=128, num_layers=2, batch_first=True)
fc = nn.Linear(128, 1)
params = list(model.parameters()) + list(fc.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
max_grad_norm = 1.0

def train_step(x_chunk, y_chunk, hidden):
    """Process one chunk with truncated BPTT."""
    # Detach hidden state to prevent backprop into previous chunks
    if hidden is not None:
        hidden = tuple(h.detach() for h in hidden)
    out, hidden = model(x_chunk, hidden)
    pred = fc(out[:, -1, :])
    loss = nn.functional.mse_loss(pred, y_chunk)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_grad_norm)  # clip LSTM and head
    optimizer.step()
    return loss.item(), hidden
```
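A hypothetical end-to-end driver in the same spirit (made self-contained here for clarity): slice a toy sine series into fixed-length chunks and thread the detached hidden state through them.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=1, hidden_size=32, batch_first=True)
fc = nn.Linear(32, 1)
params = list(model.parameters()) + list(fc.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

series = torch.sin(torch.arange(0, 500, dtype=torch.float32) * 0.26)
chunk_len, hidden, losses = 50, None, []
for start in range(0, len(series) - chunk_len - 1, chunk_len):
    x = series[start:start + chunk_len].view(1, chunk_len, 1)
    y = series[start + chunk_len].view(1, 1)
    if hidden is not None:
        hidden = tuple(h.detach() for h in hidden)  # cut the graph here
    out, hidden = model(x, hidden)
    loss = nn.functional.mse_loss(fc(out[:, -1, :]), y)
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, 1.0)
    opt.step()
    losses.append(loss.item())
print(len(losses))  # 9 chunks processed
```

Only the detach line distinguishes this from ordinary full-sequence BPTT: gradients flow within each 50-step chunk, while the hidden state still carries context across chunk boundaries.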
Conclusion
Recurrent neural networks taught the field how to think about sequences. The vanilla RNN introduced shared weights across time, and its failure on long sequences led directly to LSTM and GRU gating. These are not just historical artifacts; the same principles of gating and controlled information flow appear in every modern sequence model, from transformers to state-space models.
The practical takeaway is straightforward. For streaming inference and edge deployment where memory must stay constant, LSTMs and GRUs remain the best tool in 2026. For tasks requiring strong recall over long contexts, transformers win. And for applications that need both efficiency and long-range modeling, the new generation of recurrent models like Mamba and xLSTM offer a compelling middle ground. If you are building neural networks from scratch, understanding how activation functions like sigmoid and tanh control information flow through gates is essential. And once you grasp recurrent architectures, the jump to how large language models actually work becomes far more natural.
The architectural debate will keep evolving. What will not change is the fundamental insight that sequences need memory, and how you manage that memory determines everything.
Interview Questions
Why can't a vanilla RNN learn long-range dependencies?
During BPTT, gradients are multiplied by the recurrent weight matrix $W_{hh}$ (and the tanh derivative) at every time step. When the norms of these factors are consistently less than 1, gradients shrink exponentially. After 50 or more steps, the signal from early inputs effectively vanishes and the network cannot learn that those inputs matter.
How does the LSTM cell state solve vanishing gradients?
The cell state update $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ avoids repeated matrix multiplications. When the forget gate outputs values near 1, gradients flow through nearly unchanged. This "constant error carousel" lets gradients propagate across hundreds of steps.
What distinguishes an LSTM from a GRU?
LSTMs have three gates and a separate cell state; GRUs have two gates and a single hidden state, training roughly 25% faster. In practice, performance differences are small, and the choice comes down to dataset size and compute budget. LSTMs have a slight edge on very long sequences.
What is teacher forcing, and what problem does it create?
Teacher forcing feeds ground-truth tokens to the decoder during training instead of its own predictions. This speeds convergence but creates exposure bias: at inference, the model must use its own (potentially wrong) outputs. Scheduled sampling mitigates this by gradually mixing in model predictions during training.
When should you use a bidirectional RNN?
Use bidirectional RNNs when the full sequence is available at inference time, such as text classification, NER, or sentiment analysis. Do not use them for autoregressive tasks where future inputs are unavailable.
Why choose an LSTM over a transformer for time series in 2026?
LSTMs process sequences with $O(1)$ memory per step, making them ideal for streaming inference and edge deployment. They have far fewer parameters, reducing overfitting on small datasets. For short sequences under 200 time steps, LSTMs match transformer accuracy with simpler deployment.
How do state-space models like Mamba relate to RNNs?
SSMs use continuous-time dynamics discretized for sequences. Mamba adds input-dependent selection, analogous to LSTM gating. Like RNNs, SSMs use constant memory at inference, but they can be parallelized during training via a convolutional view.
What is gradient clipping and why is it essential for RNN training?
Gradient clipping caps the gradient norm at a threshold. RNNs are prone to exploding gradients because the recurrent weight matrix multiplies at every step, and large eigenvalues cause exponential growth. A max norm of 1.0 via clip_grad_norm_ is standard practice.