A Guide to Image Captioning in Deep Learning

Image captioning is the process of using natural language processing and computer vision to generate captions from an image. Learn more about how it works. 

Published on Jan. 17, 2024

Image captioning is the process of generating a textual description of an image. It uses both natural language processing (NLP) and computer vision to generate the captions.

What Is Image Captioning?

Image captioning is the process of using natural language processing and computer vision to generate captions from an image.

The data set is in the form [image → captions]: it consists of input images and their corresponding output captions.
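As a concrete illustration, the data set can be represented as a mapping from each image to its list of reference captions. The file names and captions below are hypothetical examples, not taken from any real data set.

```python
# A minimal sketch of the [image -> captions] data set layout.
# The file names and captions are made-up examples for illustration.
dataset = {
    "dog_on_beach.jpg": [
        "a dog runs along the beach",
        "a brown dog playing in the sand near the ocean",
    ],
    "two_cyclists.jpg": [
        "two people ride bicycles down a city street",
    ],
}

for image_path, captions in dataset.items():
    print(image_path, "->", captions)
```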

 

How Image Captioning Works

Below is the encoder-decoder network topology that allows an image captioning model to process an image and generate an accurate caption.

1. Encoder

The convolutional neural network (CNN) can be thought of as an encoder. The input image is given to the CNN to extract its features. The last hidden state of the CNN is connected to the decoder.
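Here is a minimal sketch of such an encoder in PyTorch, assuming a pretrained ResNet-50 as the CNN (the article does not prescribe a specific architecture); the embedding size of 256 is an arbitrary choice.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNEncoder(nn.Module):
    """Encodes an image into a fixed-length feature vector for the decoder."""

    def __init__(self, embed_size: int = 256):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the final classification layer and keep the feature extractor.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        # Project the CNN features to the decoder's embedding size.
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                 # keep the pretrained backbone frozen
            features = self.backbone(images)  # (batch, 2048, 1, 1)
        features = features.flatten(1)        # (batch, 2048)
        return self.fc(features)              # (batch, embed_size)
```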


 

2. Decoder

The decoder is a recurrent neural network (RNN) that performs language modeling at the word level. Its first time step receives the encoded output from the encoder, followed by the START vector.
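A minimal PyTorch sketch of such a decoder is shown below, assuming an LSTM as the RNN and reusing the embed_size from the hypothetical encoder above; the image features are fed in as the input at the first time step.

```python
import torch
import torch.nn as nn

class RNNDecoder(nn.Module):
    """Word-level language model conditioned on the image features."""

    def __init__(self, vocab_size: int, embed_size: int = 256, hidden_size: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # The image features act as the input at the first time step, followed
        # by the embedded caption tokens (START, w1, w2, ...).
        embeddings = self.embed(captions)                            # (batch, T, embed)
        inputs = torch.cat([features.unsqueeze(1), embeddings], 1)   # (batch, T+1, embed)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                                       # (batch, T+1, vocab)
```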

 

3. Training

The output from the last hidden state of the CNN (encoder) is given to the first time step of the decoder. We set x1 = START vector, with the desired label y1 = first word in the sequence. Analogously, we set x2 = word vector of the first word and expect the network to predict the second word. Finally, on the last step, xT = last word and the target label yT = END token.

During training, the correct input is given to the decoder at every time step, even if the decoder made a mistake at an earlier step. This strategy is known as teacher forcing.
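Putting the pieces together, a single teacher-forced training step might look like the sketch below. It builds on the hypothetical CNNEncoder and RNNDecoder above; PAD_IDX and the token layout [START, w1, ..., wN, END] are assumptions, not something the article prescribes.

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed id of the padding token

def train_step(encoder, decoder, optimizer, images, captions):
    """One teacher-forced step: the ground-truth word is always the next input."""
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

    features = encoder(images)                      # (batch, embed_size)
    # Feed the ground-truth tokens [START, w1, ..., wN] as inputs and ask the
    # decoder to predict the shifted targets [w1, ..., wN, END].
    outputs = decoder(features, captions[:, :-1])   # (batch, T+1, vocab)
    outputs = outputs[:, 1:, :]                     # drop the image-feature step
    loss = criterion(outputs.reshape(-1, outputs.size(-1)),
                     captions[:, 1:].reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```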



 

4. Testing

The image representation is provided to the first time step of the decoder. Set x1 = START vector and compute the distribution over the first word y1. We sample a word from the distribution (or pick the argmax), set its embedding vector as x2 and repeat this process until the END token is generated.

During testing, the output of the decoder at time t is fed back and becomes the input to the decoder at time t+1.
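This feedback loop can be sketched as a greedy decoder built on the hypothetical encoder and decoder above; start_idx and end_idx are the assumed ids of the START and END tokens, and max_len caps the caption length.

```python
import torch

@torch.no_grad()
def greedy_caption(encoder, decoder, image, start_idx, end_idx, max_len=20):
    """Generate a caption: the word predicted at time t is the input at time t+1."""
    features = encoder(image.unsqueeze(0))            # (1, embed_size)
    # Step 0: feed the image representation to the first time step.
    _, states = decoder.lstm(features.unsqueeze(1), None)

    token = torch.tensor([[start_idx]])               # x1 = START
    caption = []
    for _ in range(max_len):
        inputs = decoder.embed(token)                 # (1, 1, embed_size)
        hidden, states = decoder.lstm(inputs, states)
        logits = decoder.fc(hidden.squeeze(1))        # distribution over the vocabulary
        token = logits.argmax(dim=-1, keepdim=True)   # pick the argmax word
        if token.item() == end_idx:                   # stop once END is generated
            break
        caption.append(token.item())
    return caption
```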
