Machines can now understand and identify features and objects in images with accuracy as high as 99% on some tasks. We see this daily: smartphones recognize faces in the camera, Google Images lets us search for particular photos, and apps scan text from barcodes or books. All of this is possible thanks to the convolutional neural network (CNN), a specific type of neural network also known as a convnet.
If you are a deep learning enthusiast, you've probably already heard about convolutional neural networks, and maybe you've even developed a few image classifiers yourself. Modern deep-learning frameworks like TensorFlow and PyTorch make it easy to teach machines about images. However, there are still questions: How does data pass through the artificial layers of a neural network? How can a computer learn from it? One way to better explain a convolutional neural network is to use PyTorch. So let's dive deep into CNNs by visualizing an image through every layer.
CONVOLUTIONAL NEURAL NETWORKS Explained
What Is a Convolutional Neural Network?
Before getting started with convolutional neural networks, it's important to understand how a neural network works. Neural networks imitate how the human brain solves complex problems and finds patterns in a given set of data. Over the past few years, neural networks have come to dominate many machine learning and computer vision algorithms.
The basic model of a neural network consists of neurons organized in different layers. Every neural network has an input layer and an output layer, with hidden layers added between them based on the complexity of the problem. As the data passes through these layers, the neurons learn and identify patterns. This representation of a neural network is called a model. Once the model is trained, we ask the network to make predictions based on the test data. If you are new to neural networks, this article on deep learning with Python is a great place to start.
A CNN, on the other hand, is a special type of neural network that works exceptionally well on images. Proposed by Yann LeCun in 1998, convolutional neural networks can identify the number present in a given input image. Other applications using CNNs include speech recognition, image segmentation and text processing. Before convolutional neural networks, multilayer perceptrons (MLPs) were used to build image classifiers.
Image classification refers to the task of extracting information classes from a multi-band raster image. Multilayer perceptrons take more time and space to find information in pictures, because every input feature needs to be connected to every neuron in the next layer. CNNs overtook MLPs by using a concept called local connectivity, which involves connecting each neuron to only a local region of the input volume. This reduces the number of parameters by allowing different parts of the network to specialize in high-level features like a texture or a repeating pattern. Getting confused? No worries. Let's compare how images are sent through multilayer perceptrons and convolutional neural networks for a better understanding.
COMPARING MLPS AND CNNS
Consider the MNIST dataset. The input layer of a multilayer perceptron takes 784 entries, because each input image is 28x28 = 784 pixels. The network should be able to predict the number in the given input image, which means the output belongs to one of ten classes, 0 to 9 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9). The output layer returns the class scores: if the given input is an image of the number "3," then the neuron corresponding to "3" has a higher class score than the other neurons. But how many hidden layers do we need to include, and how many neurons should there be in each one? Here is an example of a coded MLP:
The above code snippet is implemented using a framework called Keras (ignore the syntax for now). It tells us there are 512 neurons in the first hidden layer, which are connected to the input layer of shape 784. The hidden layer is followed by a dropout layer, which helps reduce overfitting. The 0.2 means that, during training, each output of the first hidden layer has a 20% probability of being ignored. Next comes a second hidden layer with the same number of neurons as the first (512), followed by another dropout layer. Finally, we end this set of layers with an output layer comprising 10 classes. The class with the highest value is the number predicted by the model.
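To get a feel for the size of the network described above, here is a minimal sketch in plain Python. It is not the Keras snippet itself. It only counts the trainable parameters implied by the layer widths in the text (784, 512, 512, 10) and picks the highest class score. The function names and the example scores are hypothetical, chosen for illustration.

```python
# Layer widths taken from the description: 784 inputs, two hidden layers of 512, 10 outputs.
# Dropout layers are left out because they add no trainable parameters.
LAYER_SIZES = [784, 512, 512, 10]


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes):
    """Total trainable parameters across consecutive fully connected layers."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


def predict_class(scores):
    """Index of the highest class score, i.e. the predicted digit."""
    return max(range(len(scores)), key=lambda i: scores[i])


total = mlp_params(LAYER_SIZES)
print(total)  # 669706

# Made-up class scores for one image (illustrative values only)
example_scores = [0.01, 0.02, 0.05, 0.80, 0.02, 0.02, 0.03, 0.02, 0.02, 0.01]
print(predict_class(example_scores))  # 3
```

Most of the roughly 670,000 parameters sit in the first two layers, which is where the "more time and space" cost of full connectivity comes from.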
This is what the multilayer network looks like after all the layers are defined. One disadvantage of this multilayer perceptron is that every layer is fully connected, so the network takes more time and space to learn. MLPs also accept only vectors as inputs.
Convolutional layers are sparsely connected rather than fully connected, and they accept matrices as inputs, which is an advantage over MLPs. The input features are connected to local nodes. In an MLP, every node is responsible for gaining an understanding of the complete picture. In a CNN, we break the image into regions (small local areas of pixels). Each hidden node reports to the output layer, and the output layer combines the received data to find patterns. The image below shows how the layers are connected locally.
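The vector-only input can be illustrated with a short NumPy sketch. The image here is a placeholder array, not real MNIST data, and the variable names are assumptions made for the example.

```python
import numpy as np

# A stand-in 28x28 grayscale image (placeholder values, not real MNIST pixels)
image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)

# An MLP input layer expects one flat vector per image
flat = image.reshape(-1)

print(image.shape)  # (28, 28)
print(flat.shape)   # (784,)

# Row-major flattening puts pixel (row, col) at index row * 28 + col,
# so vertically adjacent pixels end up 28 positions apart in the vector.
row, col = 5, 7
print(flat[row * 28 + col] == image[row, col])  # True
```

Flattening keeps every pixel value but discards the 2D layout, so the network has to relearn which inputs were neighbors in the original image.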
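As a rough sketch of what local connectivity buys, the plain-Python snippet below compares how many inputs a single node sees in each design. The 3x3 region size and the 16 filters are assumed values picked for illustration, not figures from the article. The parameter count also assumes the same small kernel is reused at every region (weight sharing), which is how convolutional layers are usually implemented.

```python
IMAGE_SIZE = 28   # MNIST images are 28x28
KERNEL_SIZE = 3   # assumed size of each local region
NUM_FILTERS = 16  # assumed number of filters in the layer


def local_regions(image_size, kernel_size, stride=1):
    """Number of local regions a kernel visits on a square image (no padding)."""
    per_axis = (image_size - kernel_size) // stride + 1
    return per_axis * per_axis


def conv_params(kernel_size, in_channels, out_channels):
    """One shared kernel per filter, plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


inputs_per_mlp_node = IMAGE_SIZE * IMAGE_SIZE     # every pixel feeds every node
inputs_per_cnn_node = KERNEL_SIZE * KERNEL_SIZE   # only one local region feeds a node

print(inputs_per_mlp_node)                      # 784
print(inputs_per_cnn_node)                      # 9
print(local_regions(IMAGE_SIZE, KERNEL_SIZE))   # 676
print(conv_params(KERNEL_SIZE, 1, NUM_FILTERS)) # 160
```

Under these assumptions, a convolutional layer covers all 676 regions of the image with 160 parameters, while a single fully connected node already needs 784 weights.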
Before we can understand how CNNs find information in the pictures, we need to understand how the features are extracted. Convolutional neural networks use different layers and each layer saves the features in the image. For example, consider a picture of a dog. Whenever the network needs to classify a dog, it should identify all the features — eyes, ears, tongue, legs, etc. — and these features are broken down and recognized in the local layers of the network using filters and kernels.
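To make the idea of a filter concrete, here is a small NumPy sketch that slides a 3x3 kernel over a toy image. The kernel is a simple hand-written vertical-edge detector and the 5x5 image is made up. In a real CNN the kernel values are learned during training. The `apply_filter` helper is a hypothetical name, and it computes what deep-learning frameworks call a convolution (technically a cross-correlation, since the kernel is not flipped).

```python
import numpy as np


def apply_filter(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and
    sum the elementwise products at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            region = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(region * kernel)
    return output


# Toy 5x5 image: dark on the left (0), bright on the right (1)
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=np.float32)

# Hand-written kernel that responds to dark-to-bright transitions from left to right
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=np.float32)

feature_map = apply_filter(image, vertical_edge)
print(feature_map)
# [[3. 3. 0.]
#  [3. 3. 0.]
#  [3. 3. 0.]]
```

The output is large wherever the region straddles the dark-to-bright boundary and zero where the region is uniform, which is the sense in which a filter "recognizes" a feature.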
HOW DO COMPUTERS LOOK AT YOUR IMAGE?
Unlike human beings, who understand images by taking snapshots with the eye, computers use a set of pixel values between 0 and 255 to understand a picture. A computer looks at these pixel values and comprehends them. At first glance, it doesn't know the objects or the colors. It just recognizes pixel values, which is all the image is for a computer.
After analyzing the pixel values, the computer works out whether the image is grayscale or color. It can tell the difference because a grayscale image has only one channel: each pixel represents the intensity of a single color. Zero indicates black, 255 indicates white, and the shades of gray lie in between. Color images, on the other hand, have three channels: red, green and blue. These hold the intensities of the three colors (a 3D matrix), and varying the values together produces a large range of colors. After figuring out the color properties, a computer recognizes the curves and contours of objects in an image.
This process can be explored in a convolutional neural network using PyTorch to load the dataset and apply filters to images. Below is the code snippet.
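The difference between the two kinds of image shows up directly in the array shapes. The NumPy sketch below uses tiny made-up images. The `describe` helper is a hypothetical name and assumes a height x width (x channels) layout.

```python
import numpy as np

# Grayscale: one intensity per pixel, 0 = black, 255 = white
gray = np.array([[0, 128, 255],
                 [64, 192, 32]], dtype=np.uint8)

# Color: three intensities per pixel (red, green, blue)
color = np.zeros((2, 3, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]      # pure red
color[0, 1] = [255, 255, 0]    # red + green = yellow
color[1, 2] = [255, 255, 255]  # all channels at maximum = white


def describe(image):
    """Report the image type from the array shape."""
    channels = 1 if image.ndim == 2 else image.shape[2]
    return "grayscale" if channels == 1 else f"color ({channels} channels)"


print(gray.shape, describe(gray))    # (2, 3) grayscale
print(color.shape, describe(color))  # (2, 3, 3) color (3 channels)

# Scaling 8-bit values to floats in [0, 1], similar in spirit to the
# scaling that torchvision's ToTensor applies to 8-bit images
scaled = gray.astype(np.float32) / 255.0
print(scaled.min(), scaled.max())    # 0.0 1.0
```

Neural networks are usually fed the scaled floating-point values rather than the raw 0 to 255 integers.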
(Find the code on GitHub here)
    # Load the libraries
    import torch
    import numpy as np
    import matplotlib.pyplot as plt
    from torchvision import datasets
    import torchvision.transforms as transforms

    # In a Jupyter notebook, uncomment the next line to render plots inline
    # %matplotlib inline

    # Set the parameters
    num_workers = 0   # number of subprocesses used for data loading
    batch_size = 20   # number of images per batch

    # Convert the images to tensors using transforms
    transform = transforms.ToTensor()
    train_data = datasets.MNIST(root='data', train=True, download=True, transform=transform)
    test_data = datasets.MNIST(root='data', train=False, download=True, transform=transform)

    # Load the data
    train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, num_workers=num_workers)
    test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, num_workers=num_workers)

    # Grab one batch of training images and convert it to a NumPy array
    dataiter = iter(train_loader)
    images, labels = next(dataiter)
    images = images.numpy()

    # Peek into the dataset: plot the 20 images of the batch with their labels
    fig = plt.figure(figsize=(25, 4))
    for idx in np.arange(20):
        ax = fig.add_subplot(2, 20 // 2, idx + 1, xticks=[], yticks=[])
        ax.imshow(np.squeeze(images[idx]), cmap='gray')
        ax.set_title(str(labels[idx].item()))

    # Needed when running as a script rather than in a notebook
    plt.show()