Basics of Convolutional Neural Networks — Stride and Pooling

The convolutional layer in a convolutional neural network (CNN) systematically applies learned filters to an input and produces output feature maps. One related hyperparameter, the stride of the filter over the input image, controls how much the output feature map is downsampled, and it can be challenging to configure. Output feature maps are also sensitive to the exact location of features in the input. Downsampling the feature maps addresses this sensitivity; pooling layers do this by summarizing the presence of features in patches of the feature map.


The stride is how the convolution filter is moved over the input image. The filter moves across the image left to right, top to bottom, shifting by one pixel column for each horizontal movement and one pixel row for each vertical movement. By increasing the stride, we move the filter several pixels at a time, which reduces the number of possible filter positions and shrinks the output. The default stride in two dimensions is (1, 1) for the height and width movements, and it works well in most cases.
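The effect of stride on output size can be sketched with plain NumPy. The helper below is a hypothetical, simplified convolution (square input, square filter, no padding) written only to show that a larger stride yields a smaller output feature map:

```python
import numpy as np

# Simplified "valid" convolution: square input, square filter, no padding.
# Output side length follows (input_size - filter_size) // stride + 1.
def conv2d(image, kernel, stride=1):
    k = kernel.shape[0]
    out_size = (image.shape[0] - k) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Slide the filter by `stride` pixels per step.
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((2, 2))

print(conv2d(image, kernel, stride=1).shape)  # (5, 5)
print(conv2d(image, kernel, stride=2).shape)  # (3, 3)
```

Doubling the stride roughly halves each spatial dimension of the output, which is why stride is one way to downsample feature maps directly in the convolution itself.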


A pooling layer is a new layer added after a convolutional layer, specifically after a nonlinearity has been applied to the feature maps the convolutional layer outputs. Pooling involves selecting a pooling operation; two common functions used in the pooling operation are average pooling and max pooling.

Pooling layers typically follow convolutional layers throughout a CNN architecture rather than appearing only at the end. Each pooling layer substantially downsamples the output of the preceding convolutional layers. The idea behind this is that the previous convolutional layers find patterns such as edges or other basic shapes present in the pictures, and the pooling layer then summarizes those convolutions over a larger section. Average pooling and max pooling calculate the average and maximum value for each patch on the feature map, respectively. In practice, max pooling works better than average pooling, as we are typically looking to detect whether a feature is present in that region at all. Downsampling is also essential in order to keep model training times viable.
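Both pooling operations can be sketched in a few lines of NumPy. The function below is a minimal illustration, assuming non-overlapping square patches and an input whose height and width are divisible by the pool size:

```python
import numpy as np

# Pool over non-overlapping size x size patches of a 2D feature map.
def pool2d(fmap, size=2, mode="max"):
    h, w = fmap.shape
    # Reshape so axes 1 and 3 index positions inside each patch.
    patches = fmap.reshape(h // size, size, w // size, size)
    if mode == "max":
        return patches.max(axis=(1, 3))   # max pooling
    return patches.mean(axis=(1, 3))      # average pooling

fmap = np.array([[ 1.,  2.,  3.,  4.],
                 [ 5.,  6.,  7.,  8.],
                 [ 9., 10., 11., 12.],
                 [13., 14., 15., 16.]])

print(pool2d(fmap, mode="max"))   # [[ 6.  8.] [14. 16.]]
print(pool2d(fmap, mode="mean"))  # [[ 3.5  5.5] [11.5 13.5]]
```

Note how max pooling keeps the strongest activation in each patch (a "was the feature detected here?" summary), while average pooling smooths the patch into its mean, which is why max pooling is usually preferred for feature detection.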
