Image captioning is the process of generating a textual description of an image. It uses both natural language processing (NLP) and computer vision to generate the captions.
What Is Image Captioning?
Image captioning is the process of using natural language processing and computer vision to generate captions from an image.
The data set takes the form [image → captions]: input images paired with their corresponding output captions.
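As a rough illustration of the [image → captions] form, the sketch below stores a toy data set as a dictionary and flattens it into (image, caption) training pairs. The file names, captions, and the `to_pairs` helper are hypothetical and only serve the example.

```python
# Hypothetical toy data set: each image file maps to one or more captions.
dataset = {
    "dog.jpg": ["a dog runs on grass", "a brown dog playing outside"],
    "cat.jpg": ["a cat sleeps on a sofa"],
}


def to_pairs(data):
    """Flatten {image: [captions]} into a list of (image, caption) pairs."""
    return [
        (image, caption)
        for image, captions in data.items()
        for caption in captions
    ]


pairs = to_pairs(dataset)
for image, caption in pairs:
    print(image, "->", caption)
```

An image with several captions simply contributes several training pairs.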
How Image Captioning Works
Image captioning uses an encoder-decoder network topology: the encoder processes the image and the decoder generates the caption. The components, along with the training and testing procedures, are described below.
1. Encoder
The convolutional neural network (CNN) can be thought of as an encoder. The input image is given to the CNN to extract features. The output of the CNN's last hidden layer is passed to the decoder.
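To make the idea of "extracting features" concrete, here is a deliberately tiny stand-in for a CNN encoder in plain Python: one convolution layer, a ReLU, and global average pooling, producing one feature per kernel. The function names, the 4x4 image, and the hand-picked kernels are all assumptions for illustration; a real encoder would be a deep, trained network.

```python
def conv2d_valid(image, kernel):
    """Slide a kernel over a 2-D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh)
                for b in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def encode(image, kernels):
    """Return one feature per kernel: convolution -> ReLU -> global average pooling."""
    features = []
    for kernel in kernels:
        feature_map = conv2d_valid(image, kernel)
        activated = [max(0.0, v) for row in feature_map for v in row]
        features.append(sum(activated) / len(activated))
    return features


# Toy 4x4 grayscale image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked (untrained) kernels, chosen only for this example.
kernels = [
    [[-1, 1], [-1, 1]],  # responds to intensity increasing left to right
    [[1, 1], [-1, -1]],  # responds to intensity decreasing top to bottom
]

features = encode(image, kernels)
print(features)
```

The resulting feature vector plays the role of the encoded image that is handed to the decoder.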
2. Decoder
The decoder is a recurrent neural network (RNN) that performs language modeling at the word level. The first time step receives the encoded output from the encoder along with the START vector.
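The first decoder step can be sketched with a vanilla RNN cell. This minimal example assumes the image features are used as the initial hidden state, which is one common way to feed the encoder output into the decoder and not necessarily the only one. The weights, the feature values, and the `rnn_step` helper are made up for illustration.

```python
import math


def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One vanilla RNN step: h_t = tanh(W_xh x + W_hh h_prev + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(total))
    return h


# Assumed values: a 2-D image encoding and a 2-D START embedding.
image_features = [0.5, -0.5]  # encoder output, used here as h_0
start_vector = [1.0, 0.0]     # embedding of the START token (x_1)

# Untrained, hand-picked weights for the example.
W_xh = [[0.1, 0.0], [0.0, 0.1]]
W_hh = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]

h1 = rnn_step(start_vector, image_features, W_xh, W_hh, b)
print(h1)
```

In a full decoder, `h1` would be projected onto the vocabulary to give a distribution over the first word.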
3. Training
The output from the last hidden layer of the CNN (encoder) is given to the first time step of the decoder. We set x1 = START vector and the desired label y1 = first word in the sequence. Analogously, we set x2 = word vector of the first word and expect the network to predict the second word. Finally, on the last step, xT = last word and the target label yT = END token.
During training, the correct input is given to the decoder at every time-step, even if the decoder made a mistake before.
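The input/target alignment described above, often called teacher forcing, can be sketched as follows. The token strings `<START>` and `<END>` and the `teacher_forcing_pairs` helper are assumptions for this example; the whitespace tokenizer is a simplification.

```python
START, END = "<START>", "<END>"


def teacher_forcing_pairs(caption):
    """Build decoder inputs and targets for one ground-truth caption.

    inputs  = [START, w1, ..., wN]
    targets = [w1, ..., wN, END]
    """
    words = caption.split()
    inputs = [START] + words
    targets = words + [END]
    return inputs, targets


inputs, targets = teacher_forcing_pairs("a dog runs")
for t, (x, y) in enumerate(zip(inputs, targets), start=1):
    print(f"t={t}: input={x!r} -> target={y!r}")
```

Because the inputs come from the ground-truth caption and not from the decoder's own predictions, an early mistake does not affect the inputs at later time steps.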
4. Testing
The image representation is provided to the first time step of the decoder. We set x1 = START vector and compute the distribution over the first word y1. We sample a word from the distribution (or pick the arg max), set its embedding vector as x2 and repeat this process until the END token is generated.
During testing, the output of the decoder at time t is fed back and becomes the input of the decoder at time t+1.
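The feedback loop at test time can be sketched with greedy (arg max) decoding. In this minimal example a hypothetical lookup table, `NEXT_WORD_PROBS`, stands in for the trained decoder; a real decoder would also condition on the image features and its hidden state. The `max_len` cap is an added safeguard, not something the description above requires.

```python
START, END = "<START>", "<END>"

# Hypothetical stand-in for a trained decoder: maps the previous word to a
# probability distribution over the next word.
NEXT_WORD_PROBS = {
    "<START>": {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.7, "cat": 0.3},
    "dog": {"runs": 0.6, "<END>": 0.4},
    "runs": {"<END>": 1.0},
    "cat": {"<END>": 1.0},
}


def greedy_decode(next_word_probs, max_len=20):
    """Generate a caption by feeding each predicted word back as the next input."""
    caption = []
    word = START
    for _ in range(max_len):
        probs = next_word_probs[word]
        word = max(probs, key=probs.get)  # arg max over the distribution
        if word == END:
            break
        caption.append(word)
    return caption


caption = greedy_decode(NEXT_WORD_PROBS)
print(" ".join(caption))
```

Replacing the arg max with a random draw from `probs` would give the sampling variant mentioned above.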