A 3x3 grid of numbers slides across an image, multiplying and summing at each position. That single operation, repeated millions of times, powers everything from your phone's face unlock to autonomous vehicles reading street signs at 60 mph. Convolutional neural networks (CNNs) remain the most deployed deep learning architecture in production, and understanding how convolutions actually work gives you the foundation to build, debug, and optimize any computer vision system.
We'll build a complete CNN for CIFAR-10 image classification as our running example, starting from the raw convolution operation and ending with a trained model that hits over 90% accuracy. Every concept connects back to this same classifier so you can see how the pieces fit together.
The Convolution Operation
A convolution in deep learning is a sliding dot product between a small learnable filter (called a kernel) and a region of the input. The kernel moves across the input image position by position, producing a single output value at each location. The collection of all those output values forms a feature map.
Here's the math. For a 2D input $X$ and a $k \times k$ kernel $K$, the output feature map value at position $(i, j)$ is:

$$Y_{i,j} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} X_{i+m,\,j+n} \cdot K_{m,n} + b$$

Where:
- $Y_{i,j}$ is the output feature map value at row $i$, column $j$
- $X_{i+m,\,j+n}$ is the input pixel at the offset position
- $K_{m,n}$ is the kernel weight at row $m$, column $n$
- $k$ is the kernel size (typically 3 or 5)
- $b$ is the bias term added after summation
In Plain English: The kernel acts like a magnifying glass scanning the image. At each position, it multiplies its weights by the pixels underneath, sums everything up, and writes one number to the output. A kernel trained to detect vertical edges will produce large values wherever vertical edges exist in the image, and near-zero values everywhere else.
Figure: Convolution kernel sliding over an input to produce a feature map
Consider a 3x3 edge-detection kernel applied to a 6x6 grayscale image. The kernel has 9 learnable parameters plus 1 bias. It slides across all valid positions, producing a 4x4 feature map. That's the entire operation. No magic, just multiply-accumulate at every position.
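The multiply-accumulate loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; `conv2d_valid` is a hypothetical helper name, and note that deep learning frameworks actually compute cross-correlation (no kernel flip), which is what this sketch does too:

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    """Naive 'valid' 2D convolution (cross-correlation): slide the kernel
    over every position where it fully overlaps the input, then
    multiply-accumulate into one output value per position."""
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel) + bias
    return out

# A Sobel-style vertical-edge kernel applied to a 6x6 image with a
# sharp vertical boundary between dark (0) and bright (1) columns.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

fmap = conv2d_valid(image, kernel)
print(fmap.shape)  # (4, 4) -- a 6x6 input and 3x3 kernel give a 4x4 map
```

The output columns that straddle the dark-to-bright boundary light up with large values, while uniform regions produce zeros, exactly the edge-detector behavior described above.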
Key Insight: The power of convolution comes from parameter sharing. The same 3x3 kernel is reused at every spatial position, so the network learns to detect the same feature regardless of where it appears in the image. A fully connected layer processing a 224x224 RGB image needs over 150,000 weights per output neuron. A 3x3 conv kernel needs just 9 weights (plus a bias), shared across every position.
Filters Build Hierarchical Representations
Each convolutional layer learns multiple filters, and each filter produces its own feature map. Early layers learn low-level patterns like edges and textures. Middle layers combine those into parts: eyes, wheels, corners. Deep layers assemble parts into full objects.
This hierarchy isn't designed by hand. It emerges naturally through backpropagation. When the network misclassifies a cat as a dog, gradients flow backward, adjusting every kernel so that cat-specific features get amplified. After thousands of iterations, the first layer's kernels often resemble Gabor filters (oriented edge detectors), which aligns with what neuroscience tells us about the primary visual cortex.
For our CIFAR-10 classifier, the first conv layer might learn 32 different 3x3 filters. That gives us 32 feature maps, each highlighting different patterns in the 32x32 input images. The second layer's filters then operate on those 32 feature maps, combining low-level edges into higher-level shapes.
Stride, Padding, and Dilation Control Output Size
Three hyperparameters control how the kernel traverses the input.
Stride determines how many pixels the kernel moves between positions. Stride 1 (the default) moves one pixel at a time. Stride 2 skips every other position, halving the spatial dimensions. Strided convolutions are often used instead of pooling in modern architectures like all-convolutional nets.
Padding adds zeros around the input border. "Valid" padding (no padding) shrinks the output by $k - 1$ pixels in each dimension. "Same" padding adds enough zeros to keep the output the same size as the input. For a 3x3 kernel, same padding adds 1 pixel on each side.
Dilation inserts gaps between kernel elements, expanding the receptive field without increasing parameters. A 3x3 kernel with dilation 2 covers a 5x5 area but still uses only 9 weights. DeepLab (Chen et al., 2017) popularized dilated convolutions for semantic segmentation.
The output dimension formula (including dilation):

$$O = \left\lfloor \frac{I + 2P - D(k - 1) - 1}{S} \right\rfloor + 1$$

Where:
- $O$ is the output spatial dimension (height or width)
- $I$ is the input spatial dimension
- $k$ is the kernel size
- $P$ is the padding amount
- $S$ is the stride
- $D$ is the dilation factor (1 for standard convolution)
In Plain English: For our CIFAR-10 images (32x32), a 3x3 kernel with padding 1, stride 1, and no dilation gives $\lfloor (32 + 2 - 2 - 1)/1 \rfloor + 1 = 32$, so the spatial size is preserved. Switch to stride 2 and the output drops to 16x16. With dilation 2, the effective kernel covers a 5x5 area, so you need padding 2 to maintain the 32x32 output. This formula tells you exactly what spatial size comes out of any conv layer, and you'll use it constantly when designing architectures.
| Setting | Kernel 3x3, Input 32x32 | Output Size |
|---|---|---|
| Stride 1, pad 0 | Valid convolution | 30x30 |
| Stride 1, pad 1 | Same convolution | 32x32 |
| Stride 2, pad 1 | Downsampling | 16x16 |
| Stride 1, pad 2, dilation 2 | Dilated convolution | 32x32 |
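The formula translates directly into code. Here's a small sketch (`conv_output_size` is a hypothetical helper name) that reproduces each row of the table:

```python
def conv_output_size(i, k, p=0, s=1, d=1):
    """Output spatial dimension of a conv layer:
    floor((I + 2P - D*(k - 1) - 1) / S) + 1."""
    return (i + 2 * p - d * (k - 1) - 1) // s + 1

# Reproduce the table rows for a 3x3 kernel on a 32x32 input.
print(conv_output_size(32, 3))            # 30 (valid convolution)
print(conv_output_size(32, 3, p=1))       # 32 (same convolution)
print(conv_output_size(32, 3, p=1, s=2))  # 16 (downsampling)
print(conv_output_size(32, 3, p=2, d=2))  # 32 (dilated convolution)
```

Keeping a helper like this around pays off when stacking layers: feed each layer's output size into the next call and you know the final feature map shape before writing any model code.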
Pooling Reduces Spatial Dimensions
Pooling layers downsample feature maps by summarizing local regions, reducing computation and providing translation invariance.
Max pooling takes the maximum value in each window. A 2x2 max pool with stride 2 halves both spatial dimensions, keeping the strongest activation in each region. This is still the most common choice.
Average pooling takes the mean. It's smoother than max pooling but can blur out strong activations. Less common in hidden layers, but critical at the end of modern architectures.
Global average pooling (GAP) collapses each feature map to a single number by averaging all spatial positions. Network in Network (Lin et al., 2013) introduced the idea, and GoogLeNet adopted it in 2014 to replace massive fully connected layers. A network with 512 feature maps of size 7x7 would need 25,088 inputs to a dense layer. GAP reduces that to 512 values, cutting parameters and overfitting risk.
Pro Tip: Global average pooling has become the standard in modern CNNs. It acts as a structural regularizer because there are no learnable parameters to overfit. ResNet, EfficientNet, and ConvNeXt all use GAP before the final classification layer.
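Both pooling variants reduce to simple array operations. Here's a dependency-light NumPy sketch (the function names are illustrative, not a library API) showing 2x2 max pooling and GAP side by side:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) feature map
    (H and W assumed even): keep the strongest activation per window."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def global_avg_pool(feature_maps):
    """GAP on a (C, H, W) stack: collapse each channel to its spatial mean."""
    return feature_maps.mean(axis=(1, 2))

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 0., 5., 6.],
                 [1., 2., 7., 8.]])
print(max_pool_2x2(fmap))  # each 2x2 window collapses to its maximum
# [[4. 2.]
#  [2. 8.]]

# 512 feature maps of 7x7 -> 512 values, as in the GoogLeNet example.
print(global_avg_pool(np.ones((512, 7, 7))).shape)  # (512,)
```

The `reshape` trick groups each 2x2 window onto its own axes so a single `max` call handles the whole map; frameworks do the same thing with optimized kernels.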
The Full CNN Pipeline
A CNN chains convolution blocks together with a classification head. Each block typically contains: convolution, batch normalization, activation, and optional pooling. Here's our complete CIFAR-10 classifier in PyTorch:
Figure: Full CNN architecture pipeline from input through convolution, pooling, and classification
```python
import torch
import torch.nn as nn

class CIFAR10Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Block 1: 3 -> 32 channels
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)  # 32x32 -> 16x16
        )
        # Block 2: 32 -> 64 channels
        self.block2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)  # 16x16 -> 8x8
        )
        # Block 3: 64 -> 128 channels
        self.block3 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1)  # 8x8 -> 1x1 (GAP)
        )
        self.classifier = nn.Linear(128, 10)

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = x.view(x.size(0), -1)  # Flatten
        return self.classifier(x)

model = CIFAR10Net()
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
# Total parameters: 289,194
```
This model has roughly 289K parameters. Compare that to a fully connected network on the same 32x32x3 input: with a first hidden layer of just 1,024 units, it would need over 3 million parameters in that layer alone. Convolutions make deep vision networks practical.
Common Pitfall: Forgetting to match channel dimensions between blocks is the most frequent CNN debugging headache. Block 1 outputs 32 channels, so Block 2's first conv must accept 32 channels as input. PyTorch will throw a dimension mismatch error at runtime, not at model definition time, which makes it harder to catch.
Batch Normalization Stabilizes Training
Batch normalization (Ioffe and Szegedy, 2015) normalizes each channel's activations across the mini-batch to have zero mean and unit variance, then applies learnable scale and shift parameters:

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$

Where:
- $x$ is the activation value for a single channel
- $\mu_B$ is the mean of that channel across the mini-batch
- $\sigma_B^2$ is the variance across the mini-batch
- $\epsilon$ is a small constant (e.g., $10^{-5}$) for numerical stability
- $\gamma$ is the learnable scale parameter
- $\beta$ is the learnable shift parameter
In Plain English: Without batch norm, the distribution of activations entering each layer shifts as the network trains, forcing later layers to constantly re-adapt. Batch norm pins each channel's distribution to a stable range, then lets $\gamma$ and $\beta$ learn whatever shift and scale actually helps. In our CIFAR-10 model, this means the second conv block receives consistently scaled inputs regardless of how the first block's weights changed during the last update.
Batch norm also acts as a light regularizer due to the noise from computing statistics over mini-batches. It lets you use higher learning rates and reduces sensitivity to weight initialization.
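The training-time forward pass of batch norm fits in a few lines. This NumPy sketch follows the equation above for `(N, C, H, W)` activations; it's an illustration of the math, not a drop-in for `nn.BatchNorm2d` (it omits the running statistics used at inference time):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch norm for (N, C, H, W) activations: normalize
    each channel over the batch and spatial axes, then scale and shift."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # per-channel mean
    var = x.var(axis=(0, 2, 3), keepdims=True)   # per-channel variance
    x_hat = (x - mu) / np.sqrt(var + eps)        # zero mean, unit variance
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

# Activations with an awkward distribution: mean 5, std 3.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(16, 32, 8, 8))

y = batch_norm(x, gamma=np.ones(32), beta=np.zeros(32))
print(y.mean(), y.std())  # ~0 and ~1: the shifted input is re-centered
```

With `gamma=1` and `beta=0` the layer is a pure normalizer; during training those two vectors learn per-channel scales and offsets.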
The Evolution of CNN Architectures
CNN architecture design has progressed dramatically since the late 1990s. Each generation solved a specific bottleneck that limited the previous one.
Figure: Timeline of CNN architecture evolution from LeNet to ConvNeXt with key innovations
| Architecture | Year | Depth | Parameters | Top-1 Accuracy | Key Innovation |
|---|---|---|---|---|---|
| LeNet-5 | 1998 | 5 | 60K | N/A (MNIST) | First practical CNN |
| AlexNet | 2012 | 8 | 61M | 63.3% | ReLU, dropout, GPU training |
| VGG-16 | 2014 | 16 | 138M | 74.4% | Small 3x3 filters throughout |
| GoogLeNet | 2014 | 22 | 4M | 74.8% | Inception modules, GAP |
| ResNet-50 | 2015 | 50 | 25.6M | 76.0% | Skip connections |
| EfficientNet-B0 | 2019 | N/A | 5.3M | 77.1% | Compound scaling (NAS) |
| ConvNeXt V2 | 2023 | N/A | 28.6M (Tiny) | 83.0% | GRN, masked autoencoders |
LeNet-5 (LeCun et al., 1998) proved that CNNs could recognize handwritten digits. Five layers, 60K parameters, and it worked.
AlexNet (Krizhevsky et al., 2012) won ImageNet by a massive margin, dropping the top-5 error from 26% to 15.3%. The key wasn't architectural complexity. It was using ReLU activations instead of sigmoid, training on GPUs, and applying dropout for regularization.
VGG (Simonyan and Zisserman, 2014) showed that stacking many 3x3 convolutions outperforms fewer large kernels. Two 3x3 layers have the same receptive field as one 5x5 layer but use fewer parameters and include an extra nonlinearity. The downside: 138M parameters made VGG expensive to deploy.
ResNet changed everything in 2015. Read the next section to understand why.
Skip Connections Made Deep Networks Possible
Before ResNet, networks deeper than about 20 layers actually performed worse than shallower ones. This wasn't overfitting. Training loss was higher too. The gradients either vanished or exploded as they traveled through dozens of layers, making optimization nearly impossible.
He et al. (2015) solved this with a simple idea: let each block learn a residual function $F(x)$ and add the input back to the output:

$$y = F(x) + x$$

Where:
- $y$ is the block output
- $F(x)$ is the learned residual (what the conv layers produce)
- $x$ is the identity shortcut (the input, passed through unchanged)
In Plain English: Instead of asking a block to learn the entire desired mapping from scratch, skip connections let it learn only the difference between input and desired output. If the optimal behavior is to pass information through unchanged (identity mapping), the block just needs to drive $F(x)$ toward zero, which is much easier than learning an identity function from scratch. For our CIFAR-10 classifier, adding skip connections lets us go deeper without degrading performance.
Figure: ResNet skip connection showing identity shortcut bypassing convolutional layers
This insight enabled networks with 50, 101, even 152 layers. ResNet-152 achieved 78.3% top-1 accuracy on ImageNet. Skip connections also create shorter gradient paths during backpropagation, directly addressing the vanishing gradient problem. Nearly every modern architecture, including transformers, uses some form of residual connection.
Here's a residual block in PyTorch:
```python
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += residual  # Skip connection
        return self.relu(out)
```
Key Insight: When input and output channels differ, the skip connection needs a 1x1 convolution to match dimensions. This "projection shortcut" adds a few parameters but preserves the gradient highway that makes deep training work.
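To see what the projection shortcut actually computes, here's a framework-free NumPy sketch. A 1x1 convolution is just a per-pixel matrix multiply over channels, and ResNet's downsampling shortcut pairs it with stride 2 (the function name and shapes here are illustrative, not a library API):

```python
import numpy as np

def projection_shortcut(x, weight, stride=2):
    """1x1 conv with stride, as used on the skip path when the main path
    changes shape. x: (C_in, H, W); weight: (C_out, C_in).
    Returns (C_out, H // stride, W // stride)."""
    x = x[:, ::stride, ::stride]           # spatial downsampling
    c_in, h, w = x.shape
    out = weight @ x.reshape(c_in, h * w)  # mix channels at every pixel
    return out.reshape(weight.shape[0], h, w)

x = np.random.randn(64, 8, 8)        # block input: 64 channels, 8x8
weight = np.random.randn(128, 64) * 0.1
print(projection_shortcut(x, weight).shape)  # (128, 4, 4)
```

The shortcut output now matches a main path that doubles the channels and halves the resolution, so the `out += residual` addition stays shape-compatible.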
Modern CNNs and the Transformer Debate
The 2020 arrival of Vision Transformers (ViTs) sparked a genuine question: are CNNs obsolete? The answer, as of March 2026, is definitively no.
ConvNeXt (Liu et al., 2022) showed that modernizing a standard ResNet with transformer-era design choices (larger kernels, LayerNorm instead of BatchNorm, fewer activation functions, GELU instead of ReLU) could match or beat Swin Transformer on ImageNet. ConvNeXt V2 (Woo et al., CVPR 2023) pushed further by co-designing the architecture with masked autoencoder pretraining and introducing Global Response Normalization (GRN). The result: 88.9% top-1 accuracy on ImageNet with the 650M-parameter Huge variant, matching the best transformers.
EfficientNetV2 (Tan and Le, 2021) combined neural architecture search with progressive training to achieve 85.7% top-1 accuracy on ImageNet while training 5 to 11 times faster than previous EfficientNet models.
When CNNs Beat Transformers
- Small datasets (under 50K images): CNNs' inductive biases (translation equivariance, local connectivity) act as built-in regularization. ViTs need large-scale pretraining or significant augmentation to match.
- Edge and mobile deployment: Convolution operations are heavily optimized in hardware (NVIDIA TensorRT, Apple Neural Engine, Qualcomm SNPE). A MobileNetV3 runs inference in 5ms on a smartphone. A comparable ViT takes 3 to 5 times longer.
- Real-time applications: Autonomous driving, video surveillance, and robotics demand sub-10ms latency. CNNs dominate these use cases.
- Limited compute budgets: Training a ViT-Large from scratch requires thousands of GPU hours. A ResNet-50 trains in hours on a single GPU.
When Transformers Win
- Massive datasets (ImageNet-21K, JFT-300M): Self-attention captures long-range dependencies that local convolutions miss.
- Multi-modal tasks: Vision-language models (CLIP, LLaVA) naturally extend the transformer architecture.
- Tasks requiring global context: Image generation (diffusion models), document understanding, and medical image analysis where relationships span the full image.
Pro Tip: The practical trend in March 2026 is hybrid architectures. Models like MobileViT and EfficientFormer combine convolutional stems for local feature extraction with transformer blocks for global reasoning. If you're starting a new vision project, consider ConvNeXt V2 as your CNN baseline and compare against a ViT of similar size. Choose based on your deployment constraints, not hype.
Understanding how CNNs work from the ground up also helps when studying how large language models work: self-attention in transformers provides global context from the first layer, sidestepping the local receptive field limitation of convolutions.
When to Use CNNs and When Not To
Use CNNs when:
- Deploying to edge devices, mobile phones, or embedded systems
- Working with small to medium datasets (under 1M images)
- Latency requirements are strict (real-time video, robotics)
- You need a well-understood, debuggable architecture
- Your task involves local spatial patterns (object detection, segmentation)
Avoid CNNs when:
- Your input isn't grid-structured (graphs, point clouds, text)
- You need global context across the entire input from early layers
- You're building multi-modal systems that combine vision with language
- You have access to massive pretraining data and compute (ViTs scale better)
This decision framework parallels the bias-variance tradeoff: CNNs carry stronger inductive bias (good with less data, but potentially limiting at scale), while transformers are lower bias but need more data to generalize well. Validate your choice with proper cross-validation, not a single train-test split.
Conclusion
Convolutions are deceptively simple. A small kernel, a sliding window, and parameter sharing across spatial positions. Yet this operation, stacked and composed, builds the hierarchical feature representations that power modern computer vision. From LeNet's 60K parameters recognizing digits in 1998 to ConvNeXt V2 hitting 88.9% on ImageNet with masked autoencoder pretraining, the core idea hasn't changed. What changed is how we compose, normalize, and connect these layers.
Skip connections from ResNet remain perhaps the single most important architectural innovation in deep learning. They enabled depth, and depth enabled the feature hierarchies that make CNNs so powerful. If you want to understand the optimization dynamics behind training these deep networks, explore our guide on deep learning optimizers from SGD to AdamW.
The CNN versus transformer debate has largely converged on a pragmatic answer: use what works for your constraints. CNNs still dominate edge deployment and real-time inference. Transformers excel at scale and with global context. Hybrid architectures are increasingly the default. Whether you're building a neural network from scratch or fine-tuning a pretrained ConvNeXt, the convolution operation you learned here is the foundation. Master it, and every vision architecture becomes readable.
For taking pretrained CNNs and adapting them to your own datasets with minimal training, see our guide on transfer learning.
Interview Questions
Q: What is the key advantage of using convolutions over fully connected layers for image data?
Convolutions exploit spatial structure through parameter sharing and local connectivity. A 3x3 kernel uses 9 parameters regardless of image size, while a fully connected layer connecting to a 224x224 image needs over 150,000 parameters per neuron. This makes CNNs parameter-efficient, faster to train, and naturally translation-equivariant.
Q: Why do most CNN architectures use 3x3 kernels instead of larger sizes?
Two stacked 3x3 convolutions cover the same receptive field as a single 5x5 but use fewer parameters (18 vs. 25) and include an extra nonlinearity. VGG-16 proved that deep stacks of 3x3 filters outperform shallower networks with larger kernels. The exception is first-layer kernels (AlexNet 11x11, ResNet 7x7) to quickly expand the receptive field on high-resolution inputs.
Q: How do skip connections solve the degradation problem in deep networks?
Without skip connections, networks deeper than about 20 layers show higher training error than shallower ones despite having more capacity. Skip connections provide an identity shortcut that lets gradients flow directly through the network. Each block only needs to learn the residual, which is easier to optimize. This enabled training networks with 100+ layers.
Q: What is the difference between max pooling and global average pooling?
Max pooling (typically 2x2, stride 2) downsamples by selecting the strongest activation per window, preserving prominent features between conv blocks. Global average pooling collapses each channel's entire spatial dimension to a single value, replacing fully connected layers before the classifier. GAP reduces overfitting because it has no learnable parameters.
Q: Your CNN achieves 99% training accuracy but 72% validation accuracy. How do you fix it?
This is severe overfitting. Solutions include adding dropout before the classifier, increasing data augmentation (random crops, flips, color jitter), applying weight decay, reducing model capacity, and using early stopping. Batch normalization also helps through implicit regularization from mini-batch statistics noise.
Q: How does the receptive field grow through a CNN?
Each 3x3 conv layer adds 2 pixels to the receptive field. Pooling layers multiply it. After three 3x3 conv layers with one 2x2 pool, a neuron "sees" a much larger input region than the 3x3 kernel suggests. This matters because receptive field size determines what level of abstraction a layer can capture.
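The growth rule can be sketched as a short calculator (an illustrative helper, assuming stride-1 convs unless stated): each layer adds $(k - 1) \times \text{jump}$ to the receptive field, where the jump (distance between adjacent output positions, measured in input pixels) multiplies by the layer's stride:

```python
def receptive_field(layers):
    """Receptive field of the final output of a layer stack.
    Each layer is (kernel_size, stride). RF grows by (k - 1) * jump;
    jump scales by each layer's stride."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Three 3x3 convs with a 2x2 max pool (stride 2) after the first two:
stack = [(3, 1), (3, 1), (2, 2), (3, 1)]
print(receptive_field(stack))  # 10 -- a 10x10 region, not just 3x3

# Two stacked 3x3 convs match one 5x5, as the VGG discussion noted:
print(receptive_field([(3, 1), (3, 1)]))  # 5
```

The second call confirms the 3x3-stacking argument from the interview question above: two 3x3 layers see the same 5x5 region as a single 5x5 kernel.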
Q: When would you choose a CNN over a Vision Transformer in production?
Choose a CNN for edge devices where inference latency is critical, small to medium datasets (under 1M images), or when hardware-accelerated convolutions provide a speed advantage. MobileNetV3 runs in under 5ms on a smartphone versus 15 to 25ms for a comparable ViT. Choose ViTs when you have large-scale pretraining data, need global context, or are building multi-modal systems.
Q: Where should batch normalization be placed relative to the activation function?
The original paper placed batch norm before the activation (Conv, BN, ReLU), and this remains the most common ordering. Batch norm normalizes each channel to zero mean and unit variance, then applies learnable scale and shift. This stabilizes training, allows higher learning rates, and adds light regularization through mini-batch noise.