Padding and stride
The step size of the kernel is controlled by the stride parameter. A large stride along with a large kernel size can be useful when objects are large relative to the dimensions of the image. Note that strided convolutions can be used as an alternative way to downsample an image, and have been shown to work as well as or better than the usual conv + pooling layers on certain benchmark tasks [SDBR14]. Since a stride of s shrinks each spatial dimension of the output by roughly a factor of s, it also reduces the computation of the layer accordingly (by roughly s² for 2D inputs). For example, stride=2
divides the spatial size of the original image by 2:
conv = lambda s: nn.Conv2d(3, 1, stride=s, kernel_size=3)
fig, ax = plt.subplots(1, 4, figsize=(8, 2))
ax[0].imshow(X[0].permute(1, 2, 0))
ax[1].imshow(conv(1)(X)[0, 0].detach().numpy()); ax[1].set_title("s=1")
ax[2].imshow(conv(2)(X)[0, 0].detach().numpy()); ax[2].set_title("s=2")
ax[3].imshow(conv(3)(X)[0, 0].detach().numpy()); ax[3].set_title("s=3")
for i in range(4):
    ax[i].tick_params(axis="both", which="major", labelsize=8)
fig.tight_layout();
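The effect of stride on the output size can also be verified directly on tensor shapes. A minimal sketch, using a random tensor as a stand-in for the image batch X defined above:

```python
import torch
from torch import nn

# Random stand-in for the RGB image batch X used in this section.
X = torch.randn(1, 3, 32, 32)

# Stride s divides each spatial dimension by roughly s:
# s=1 -> (30, 30), s=2 -> (15, 15), s=3 -> (10, 10)
for s in [1, 2, 3]:
    out = nn.Conv2d(3, 1, kernel_size=3, stride=s)(X)
    print(s, tuple(out.shape[-2:]))
```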
Padding. Edge pixels of an input image are underrepresented since the kernel has to be kept within the input image. Moreover, information at the edges becomes lost as we stack more convolutional layers. The simplest fix is zero-padding the boundaries (more involved variants exist). Observe the effect this has on the boundaries:
pad = nn.ZeroPad2d(padding=3)
conv = nn.Conv2d(3, 1, kernel_size=3)
fig, ax = plt.subplots(2, 2, figsize=(4, 4))
vmin = min(conv(X).min(), conv(pad(X)).min())
vmax = max(conv(X).max(), conv(pad(X)).max())
ax[1, 0].imshow(pad(X)[0].permute(1, 2, 0).detach(), vmin=vmin, vmax=vmax); ax[1, 0].set_title("pad(X)")
ax[1, 1].imshow(conv(pad(X))[0, 0].detach(), vmin=vmin, vmax=vmax); ax[1, 1].set_title("conv(pad(X))")
ax[0, 0].imshow(X[0].permute(1, 2, 0).detach(), vmin=vmin, vmax=vmax); ax[0, 0].set_title("X")
ax[0, 1].imshow(conv(X)[0, 0].detach(), vmin=vmin, vmax=vmax); ax[0, 1].set_title("conv(X)")
for a in ax.flat:
    a.tick_params(axis="both", which="major", labelsize=8)
fig.tight_layout();
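As an aside, nn.Conv2d accepts a padding argument directly, so a separate ZeroPad2d layer is not required. A minimal sketch (with a random tensor standing in for X) showing that p = (k - 1)/2 preserves the spatial size:

```python
import torch
from torch import nn

X = torch.randn(1, 3, 28, 28)  # random stand-in for the image batch

# Padding passed directly to Conv2d; p = (k - 1)/2 = 1 for k = 3,
# so the 28x28 spatial size is preserved ("same" convolution).
conv = nn.Conv2d(3, 1, kernel_size=3, padding=1)
print(conv(X).shape)
```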
Remark. Padding and stride determine the output shape. A convolution layer with kernel size k and stride s, applied to an input of width w with symmetric padding p, results in an output width of ⌊(w + 2p - k)/s + 1⌋. In general, we want to pick stride and padding values so that the kernel can be placed evenly in the image with no input pixel dropped.

For s = 1, the kernel size should be odd so that it covers the entire input in a symmetric manner. A common choice is p = (k - 1)/2, which results in same-sized outputs[1] (same convolution). For s > 1, best practice is to choose a kernel size and the smallest p such that s divides w + 2p - k, so that the entire input image is symmetrically covered by the kernel.
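The output-width formula can be checked against nn.Conv2d itself. A minimal sketch using random inputs and a few hand-picked (k, s, p) settings:

```python
import math
import torch
from torch import nn

def out_width(w, k, s, p):
    # Output width of a convolution: floor((w + 2p - k)/s) + 1
    return math.floor((w + 2 * p - k) / s) + 1

w = 32
for k, s, p in [(3, 1, 1), (5, 2, 2), (7, 3, 2)]:
    conv = nn.Conv2d(3, 1, kernel_size=k, stride=s, padding=p)
    got = conv(torch.randn(1, 3, w, w)).shape[-1]
    assert got == out_width(w, k, s, p), (k, s, p)
print("formula matches nn.Conv2d")
```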