Image Classification With MobileNet

MobileNet is a mobile-first class of CNN that was open-sourced by Google and provides a starting point for training classifiers through a lightweight model.

Written by Abhijeet Pujara
Published on Jan. 26, 2023
A large digital, machine vision eye.
Image: Shutterstock / Built In
Brand Studio Logo

MobileNet is a light-weight computer vision model designed to be used in mobile applications. 

MobileNet Defined 

MobileNet is a computer vision model open-sourced by Google and designed for training classifiers. It uses depthwise convolutions to significantly reduce the number of parameters compared to other networks, resulting in a lightweight deep neural network. MobileNet is Tensorflow’s first mobile computer vision model.  

 This article covers five parts:

  1. What is MobileNet?
  2. MobileNet depthwise separable convolution explained.
  3. Difference between standard convolution and depthwise separable convolution.
  4. MobileNet with Python.
  5. Advantages of MobileNet.

 

What Is MobileNet?

MobileNet is TensorFlow’s first mobile computer vision model.

It uses depthwise separable convolutions to significantly reduce the number of parameters compared to other networks with regular convolutions and the same depth in the nets. This results in lightweight deep neural networks.

A depthwise separable convolution is made from two operations.

  1. Depthwise convolution.
  2. Pointwise convolution.

MobileNet is a class of convolutional neural network (CNN) that was open-sourced by Google, and therefore, provides an excellent starting point for training classifiers that are insanely small and insanely fast.

The speed and power consumption of the network is proportional to the number of multiply-accumulates (MACs) which is a measure of the number of fused multiplication and addition operations.

More on Machine Learning: NLP for Beginners: A Complete Guide

 

MobileNet Depthwise Separable Convolution Explained

This convolution originated from the idea that a filter’s depth and spatial dimension can be separated, thus, the name separable. Let’s take the example of the Sobel filter used in image processing to detect edges.

You can separate the height and width dimensions of these filters. Gx filter can be viewed as a matrix product of [1 2 1] transpose with [-1 0 1].

You’ll notice that the filter has disguised itself. It shows it had nine parameters, but it has six. This is possible because of the separation of its height and width dimensions.

The same idea applies to a separate depth dimension from horizontal (width*height), which gives us a depthwise separable convolution where we perform depthwise convolution. After that, we use a 1*1 filter to cover the depth dimension.

One thing to notice is how much the parameters are reduced by this convolution to output the same number of channels. To produce one channel, we’d need 3*3*3 parameters to perform depth-wise convolution and 1*3 parameters to perform further convolution in-depth dimension.

But if we’d need three output channels, we’d only need 31*3 depth filter, giving us a total of 36 ( = 27 +9) parameters, while for the same number of output channels in regular convolution, we’d need 33*3*3 filters giving us a total of 81 parameters.

Depthwise separable convolution is a depthwise convolution followed by a pointwise convolution, as follows:

  1. Depthwise convolution is the channel-wise DK×DK spatial convolution. Suppose we have five channels, we’d then have five DK×DK spatial convolutions.
  2. Pointwise convolution is the 1×1 convolution to change the dimension.

Depthwise is a map of a single convolution on each input channel separately. Therefore, its number of output channels is the same as the number of the input channels. Its computational cost is: Df² * M * Dk².

 

Pointwise Convolution.

Pointwise convolution is a convolution with a kernel size of 1x1 that simply combines the features created by the depthwise convolution. Its computational cost is: M * N * Df².

 

Difference Between Standard Convolution and Depthwise Separable Convolution

.The main difference between MobileNet architecture and a traditional CNN versus a single 3x3 convolution layer followed by the batch norm and a rectified linear unit (ReLU) is that MobileNet splits the convolution into a 3x3 depthwise convolution and a 1x1 pointwise convolution.

 

MobileNet With Python

Below is an example of MobileNet with Python. 

Using TensorFlow backend.
Using TensorFlow Backend. | Image: Abhijeet Pujara
mobile = keras.applications.mobilenet.MobileNet()
Image(filename=’click.jpg’, width=250,height=300)
An input image of an ice cream cone.
An input image of an ice cream cone. | Image: Abhijeet Pujara
preprocessed_image = prepare_image(‘click.jpg’) predictions = mobile.predict(preprocessed_image) results = imagenet_utils.decode_predictions(predictions) results

This creates an output:

Output from the input image of an ice cream cone.
Output from the input image of an ice cream cone. | Screenshot: Abhijeet Pujara
An introductory guide through the first MobileNet research paper. | Video: Rahul Deora

More on Machine Learning: Understanding Cosine Similarity and Its Applications

 

Advantages of MobileNet

MobileNets are a family of mobile-first computer vision models for TensorFlow, designed to effectively maximize accuracy while being mindful of the restricted resources for an on-device or embedded application.

MobileNets are small, low-latency, low-power models parameterized to meet the resource constraints of a variety of use-cases. They can be built upon for classification, detection, embeddings and segmentation.

Explore Job Matches.