Convolutional Neural Networks Explained

Machines can now identify features and objects in images with remarkable accuracy, exceeding 99% on some benchmark tasks such as handwritten-digit recognition. We see this daily: smartphones recognizing faces in the camera, searching for particular photos with Google Images, and scanning text from barcodes or books. All of this is possible thanks to the convolutional neural network (CNN), a specific type of neural network also known as a convnet.

If you are a deep learning enthusiast, you've probably already heard about convolutional neural networks, and maybe you've even developed a few image classifiers yourself. Modern deep-learning frameworks like TensorFlow and PyTorch make it easy to teach machines about images. However, there are still questions: How does data pass through the artificial layers of a neural network? How can a computer learn from it? One way to better explain a convolutional neural network is to use PyTorch. So let's dive deep into CNNs by visualizing an image through every layer.

What Is a Convolutional Neural Network?

A convolutional neural network (CNN) is a special type of neural network that works exceptionally well on images. Proposed by Yann LeCun in 1998, convolutional neural networks can, for example, identify the digit present in a given input image.

Before getting started with convolutional neural networks, it's important to understand the workings of a neural network. Neural networks are loosely inspired by how the human brain solves complex problems and finds patterns in a given set of data. Over the past few years, neural networks have come to dominate many machine learning and computer vision tasks.

The basic model of a neural network consists of neurons organized in different layers. Every neural network has an input layer and an output layer, with hidden layers added in between depending on the complexity of the problem. As data is passed through these layers during training, the neurons learn to identify patterns. This representation of a neural network is called a model. Once the model is trained, we ask the network to make predictions on the test data. If you are new to neural networks, this article on deep learning with Python is a great place to start.

A CNN, on the other hand, is a special type of neural network that works exceptionally well on images. Proposed by Yann LeCun in 1998, convolutional neural networks can, for example, identify the digit present in a given input image. Other applications using CNNs include speech recognition, image segmentation and text processing. Before convolutional neural networks, multilayer perceptrons (MLPs) were used in building image classifiers.

Image classification refers to the task of extracting information classes from a multi-band raster image. Multilayer perceptrons take more time and space to find information in pictures, because every input feature needs to be connected to every neuron in the next layer. CNNs overtook MLPs by using a concept called local connectivity, which involves connecting each neuron to only a local region of the input volume. This reduces the number of parameters and allows different parts of the network to specialize in features like a texture or a repeating pattern. Getting confused? No worries. Let’s compare how the images are sent through multilayer perceptrons and convolutional neural networks for a better understanding.
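To make the parameter savings concrete, here is a back-of-the-envelope sketch. The layer sizes (512 fully connected neurons, sixteen 3x3 filters) are illustrative assumptions rather than values from a specific model. The convolutional count also relies on the fact that a convolutional layer reuses the same filter weights at every position of the image.

```python
# Compare parameter counts for one layer applied to a 28x28 grayscale image.
# The layer sizes are illustrative assumptions, not taken from a real model.

def dense_params(n_inputs, n_neurons):
    """Fully connected layer: one weight per (input, neuron) pair plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

def conv_params(in_channels, out_channels, kernel_size):
    """Convolutional layer: each filter has kernel_size x kernel_size weights
    per input channel plus one bias, shared across all image positions."""
    return out_channels * (in_channels * kernel_size * kernel_size + 1)

fc = dense_params(28 * 28, 512)   # every pixel connected to every neuron
conv = conv_params(1, 16, 3)      # sixteen 3x3 filters over one channel

print("fully connected:", fc)
print("convolutional:  ", conv)
```

The fully connected layer needs several hundred thousand weights, while the convolutional layer needs only a few hundred, because each of its neurons looks at a small local region.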

COMPARING MLPS AND CNNS

Consider the MNIST dataset: the input layer of a multilayer perceptron has 784 entries, because each input image is 28x28 = 784 pixels. The network should predict the digit in the given input image, which means the output can belong to any of ten classes, 0–9 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9). In the output layer, we return the class scores. If the given input is an image of the digit “3,” then the corresponding neuron “3” in the output layer has a higher class score than the other neurons. But how many hidden layers do we need to include, and how many neurons should there be in each one? Here is an example of a coded MLP:

[Image: Keras code snippet defining the MLP]

The above code snippet is implemented using a framework called Keras (ignore the syntax for now). It tells us there are 512 neurons in the first hidden layer, which are connected to the input layer of shape 784. The hidden layer is followed by a dropout layer, which helps reduce overfitting. The 0.2 indicates that, during training, each output of the first hidden layer has a 20% probability of being dropped (ignored). Next, we add the second hidden layer with the same number of neurons as the first (512), followed by another dropout layer. Finally, we end this set of layers with an output layer comprising 10 classes. The class with the highest value is the number predicted by the model.
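As a framework-free sketch of the architecture just described (not the Keras code itself), the forward pass can be written with NumPy. The layer sizes and the 0.2 dropout rate come from the description above; the ReLU activation, the softmax output, the random weights and the "inverted" dropout scaling are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the text: 784 inputs -> 512 -> 512 -> 10 class scores.
sizes = [784, 512, 512, 10]
# Untrained random weights, used here only to show shapes and data flow.
weights = [rng.normal(0.0, 0.01, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x, training=False, drop_rate=0.2):
    """Forward pass for one flattened image x of shape (784,)."""
    for i, (w, b) in enumerate(zip(weights, biases)):
        x = x @ w + b
        if i < len(weights) - 1:            # hidden layers only
            x = np.maximum(x, 0.0)          # ReLU (assumed activation)
            if training:
                # Dropout: zero each activation with probability drop_rate,
                # then rescale so the expected value stays the same.
                keep = rng.random(x.shape) >= drop_rate
                x = x * keep / (1.0 - drop_rate)
    # Softmax (assumed) turns the 10 raw scores into probabilities.
    e = np.exp(x - x.max())
    return e / e.sum()

image = rng.random(784)                     # stand-in for a flattened 28x28 digit
scores = forward(image)
predicted_digit = int(np.argmax(scores))    # class with the highest score
n_params = sum(w.size + b.size for w, b in zip(weights, biases))

print("class scores shape:", scores.shape)
print("predicted digit:", predicted_digit)
print("trainable parameters:", n_params)
```

Because the weights are random, the predicted digit is meaningless here; the point is the shape of the data at each step and the number of parameters that full connectivity requires.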

This is what the multilayer network looks like after all the layers are defined. One disadvantage of this multilayer perceptron is that its layers are fully connected, so the network takes more time and space to learn. MLPs also accept only vectors as inputs.

[Image: the multilayer perceptron with all layers defined]

Convolutional layers are sparsely connected rather than fully connected, and they accept matrices as inputs, which is an advantage over MLPs. The input features are connected only to local nodes. In an MLP, every node is responsible for gaining an understanding of the complete picture. In CNNs, we break the image down into regions (small local areas of pixels). Each hidden node has to report to the output layer, where the output layer combines the received data to find patterns. The image below shows how the layers are connected locally.

[Image: locally connected layers in a CNN]
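The difference in how the two network types see the same image can be sketched with NumPy. The 3x3 region size and the stand-in image are assumptions chosen for illustration.

```python
import numpy as np

# Stand-in for one 28x28 MNIST digit (the values are arbitrary).
image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)

# MLP view: the image is flattened into a vector, so the 2D
# neighborhood structure of the pixels is no longer explicit.
mlp_input = image.reshape(-1)

# CNN view: each hidden node looks at one small local region.
# A 3x3 region size is assumed here.
region_size = 3
regions = [
    image[r:r + region_size, c:c + region_size]
    for r in range(image.shape[0] - region_size + 1)
    for c in range(image.shape[1] - region_size + 1)
]

print("MLP input shape:", mlp_input.shape)
print("number of local regions:", len(regions))
print("shape of one region:", regions[0].shape)
```

Each region keeps neighboring pixels together, which is what lets a node respond to a local pattern such as a stroke or a corner.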

Before we can understand how CNNs find information in pictures, we need to understand how the features are extracted. Convolutional neural networks use different layers, and each layer captures features of the image. For example, consider a picture of a dog. Whenever the network needs to classify a dog, it should identify all the features (eyes, ears, tongue, legs, etc.), and these features are broken down and recognized in the local layers of the network using filters and kernels.
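A minimal sketch of what applying a filter means, assuming a toy 6x6 image and a hand-picked vertical-edge kernel. In a real CNN the kernel values are learned during training rather than chosen by hand, and the helper `apply_filter` below is a hypothetical name, not a library function. Like the convolution layers in deep-learning frameworks, it slides the kernel without flipping it.

```python
import numpy as np

def apply_filter(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Each output value depends only on one local region of the image.
            feature_map[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return feature_map

# Toy image: dark left half (0.0), bright right half (1.0).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked kernel that responds to dark-to-bright vertical edges.
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-2.0, 0.0, 2.0],
                          [-1.0, 0.0, 1.0]])

feature_map = apply_filter(image, vertical_edge)
print(feature_map)
```

The feature map is nonzero only where the kernel overlaps the boundary between the dark and bright halves, which is how a filter highlights one kind of feature and ignores flat regions.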

HOW DO COMPUTERS LOOK AT YOUR IMAGE?

Unlike human beings, who understand images by taking snapshots with the eye, computers use a set of pixel values from 0 to 255 to understand a picture. A computer looks at these pixel values and comprehends them. At first glance, it doesn’t know the objects or the colors; it just recognizes pixel values, which is all the image is for a computer.

Given the pixel values, the computer can tell whether the image is grayscale or color from the number of channels. Grayscale images have only one channel, since each pixel represents a single intensity. Zero indicates black and 255 means white; the shades of gray lie in between. Color images, on the other hand, have three channels: red, green and blue. These represent the intensities of three colors (a 3D matrix), and varying the three values together produces a huge range of colors! After figuring out the color properties, a computer recognizes the curves and contours of objects in an image.
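The channel layout can be seen directly in a small NumPy sketch. The tiny 2x3 arrays and the `describe` helper are made up for illustration; real images are simply larger arrays of the same kind.

```python
import numpy as np

# Grayscale: one value per pixel, 0 (black) to 255 (white).
gray = np.array([[0, 128, 255],
                 [64, 192, 32]], dtype=np.uint8)

# Color: three values per pixel (red, green, blue), stored as a 3D array.
color = np.zeros((2, 3, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]        # pure red
color[0, 1] = [255, 255, 0]      # red + green gives yellow
color[1, 2] = [255, 255, 255]    # all channels at maximum gives white

def describe(img):
    """Report the image type based on the number of array dimensions."""
    return "grayscale" if img.ndim == 2 else f"color with {img.shape[-1]} channels"

print(gray.shape, describe(gray))
print(color.shape, describe(color))
```

Note that torchvision's `ToTensor` transform, used in the code that follows, converts such 0 to 255 images into floating-point tensors scaled to the range 0 to 1, with the channel dimension first.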

This process can be explored in a convolutional neural network using PyTorch to load the dataset and apply filters to images. Below is the code snippet.

# Load the libraries
import torch
import numpy as np

from torchvision import datasets
import torchvision.transforms as transforms

# Set the parameters
num_workers = 0
batch_size = 20

# Converting the Images to tensors using Transforms
transform = transforms.ToTensor()

train_data = datasets.MNIST(root='data', train=True,
                                   download=True, transform=transform)
test_data = datasets.MNIST(root='data', train=False,
                                  download=True, transform=transform)

# Loading the Data
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
    num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, 
    num_workers=num_workers)

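import matplotlib.pyplot as plt
# In a Jupyter notebook, also run the magic command: %matplotlib inline

# Grab one batch of training images and labels
dataiter = iter(train_loader)
images, labels = next(dataiter)  # dataiter.next() fails on recent PyTorch versions
images = images.numpy()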

# Peeking into dataset: plot the 20 images of the batch with their labels
fig = plt.figure(figsize=(25, 4))
for idx in np.arange(20):
    # add_subplot expects integer grid sizes, so use floor division
    ax = fig.add_subplot(2, 20 // 2, idx + 1, xticks=[], yticks=[])
    ax.imshow(np.squeeze(images[idx]), cmap='gray')
    ax.set_title(str(labels[idx].item()))