What Is Image Captioning?
Image captioning is the process of using natural language processing and computer vision to generate captions from an image.
The data set has the form [image → captions]: it consists of input images and their corresponding output captions.
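As a rough illustration of the [image → captions] form, the sketch below stores each image with its list of captions and flattens that into one training pair per caption. The file names, captions, and the helper `to_training_pairs` are hypothetical and only serve the example.

```python
# Hypothetical file names and captions, for illustration only.
dataset = {
    "images/dog.jpg": [
        "a dog runs across a field",
        "a brown dog playing outside",
    ],
    "images/kitchen.jpg": [
        "a kitchen with a wooden table",
    ],
}


def to_training_pairs(dataset):
    """Flatten {image: [captions]} into a list of (image, caption) pairs."""
    return [
        (image, caption)
        for image, captions in dataset.items()
        for caption in captions
    ]


pairs = to_training_pairs(dataset)
for image, caption in pairs:
    print(image, "->", caption)
```

An image with several reference captions simply contributes several pairs, so the same input image appears with different target captions.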
How Image Captioning Works
Below is the network topology for image captioning: an encoder processes the image, and a decoder generates the caption.
The convolutional neural network (CNN) can be thought of as an encoder. The input image is given to the CNN, which extracts its features. The last hidden state of the CNN is connected to the decoder.
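To make the encoder idea concrete, here is a toy sketch of feature extraction in plain Python: one convolution layer, a ReLU, and global average pooling, yielding one feature per filter. This is not a real CNN. The image, the filters, and the function names (`conv2d_valid`, `encode`) are assumptions chosen for illustration, and a practical encoder would stack many learned layers.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation,
    which is what deep learning libraries usually call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    return [[max(0.0, v) for v in row] for row in feature_map]


def global_avg_pool(feature_map):
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def encode(image, kernels):
    """Return one pooled feature per kernel: the vector handed to the decoder."""
    return [global_avg_pool(relu(conv2d_valid(image, k))) for k in kernels]


# Toy 3x3 grayscale image and two hand-picked 2x2 filters (assumed values).
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernels = [[[1, 0], [0, 1]], [[1, 0], [0, -1]]]
features = encode(image, kernels)
print(features)
```

The resulting vector plays the role of the encoder's final representation, which is what the decoder receives.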
The output from the last hidden state of the CNN (encoder) is given to the first time step of the decoder. We set x_1 = START vector and the desired label y_1 = first word in the sequence. Analogously, we set x_2 = word vector of the first word and expect the network to predict the second word. Finally, on the last step, x_T = last word and the target label y_T = END token.
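The input and target sequences described above are the same caption shifted by one position. A minimal sketch, assuming whitespace tokenization and the token spellings `<START>` and `<END>` (neither is specified in the text):

```python
START, END = "<START>", "<END>"  # assumed spellings of the special tokens


def make_decoder_pairs(caption):
    """Build decoder inputs x_1..x_T and targets y_1..y_T for one caption."""
    words = caption.split()
    inputs = [START] + words   # x_1 = START, x_2 = first word, ..., x_T = last word
    targets = words + [END]    # y_1 = first word, ..., y_T = END
    return inputs, targets


inputs, targets = make_decoder_pairs("a dog runs")
for t, (x, y) in enumerate(zip(inputs, targets), start=1):
    print(f"t={t}: x={x!r} -> y={y!r}")
```

In a real pipeline the words would be mapped to integer ids and then to embedding vectors. Strings are kept here so the shift is easy to see.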
During training, the correct input is given to the decoder at every time step, even if the decoder made a mistake before. This technique is commonly known as teacher forcing.
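The sketch below illustrates this training regime with a deliberately simplified stand-in for the decoder: a hypothetical lookup table `TOY_MODEL` that maps the previous word to a distribution over the next word, ignoring the image and any hidden state. The probabilities are made up. The point is only that the input at each step comes from the ground-truth caption, not from the model's previous prediction.

```python
import math

START, END = "<START>", "<END>"  # assumed spellings of the special tokens

# Hypothetical next-word distributions standing in for a decoder step.
TOY_MODEL = {
    START: {"a": 0.9, "the": 0.1},
    "a": {"cat": 0.6, "dog": 0.4},   # the model prefers "cat": a mistake here
    "dog": {"runs": 0.8, END: 0.2},
    "cat": {"sleeps": 0.9, END: 0.1},
    "runs": {END: 1.0},
}


def teacher_forced_loss(words):
    """Average negative log-likelihood with ground-truth inputs at every step."""
    inputs = [START] + words
    targets = words + [END]
    total, predictions = 0.0, []
    for x_t, y_t in zip(inputs, targets):
        dist = TOY_MODEL[x_t]                        # conditioned on the TRUE previous word
        predictions.append(max(dist, key=dist.get))  # what the model would have said
        total += -math.log(dist[y_t])                # penalty for the correct word
    return total / len(targets), predictions


loss, predictions = teacher_forced_loss(["a", "dog", "runs"])
print(predictions, round(loss, 4))
```

At the second step the toy model predicts "cat" instead of "dog", yet the third step still receives "dog" as input, so one early error does not derail the rest of the training sequence.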
At test time, the image representation is again provided to the first time step of the decoder. Set x_1 = START vector and compute the distribution over the first word y_1. We sample a word from the distribution (or pick the argmax), set its embedding vector as x_2, and repeat this process until the END token is generated.
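The generation loop just described can be sketched as follows. The decoder is again replaced by a hypothetical lookup table (`TOY_MODEL`, with made-up probabilities) that conditions only on the previous word. The `max_len` cap is an added safety assumption so the loop terminates even if the END token is never produced.

```python
import random

START, END = "<START>", "<END>"  # assumed spellings of the special tokens

# Hypothetical next-word distributions standing in for a decoder step.
TOY_MODEL = {
    START: {"a": 0.9, "the": 0.1},
    "a": {"cat": 0.6, "dog": 0.4},
    "the": {"dog": 1.0},
    "dog": {"runs": 0.8, END: 0.2},
    "cat": {"sleeps": 0.9, END: 0.1},
    "runs": {END: 1.0},
    "sleeps": {END: 1.0},
}


def toy_step(previous_word):
    return TOY_MODEL[previous_word]


def generate(step, max_len=20, sample=False, rng=None):
    """Feed each generated word back in as the next input until END appears."""
    rng = rng or random.Random(0)
    x_t, caption = START, []
    for _ in range(max_len):
        dist = step(x_t)
        if sample:
            words = list(dist)
            y_t = rng.choices(words, weights=[dist[w] for w in words])[0]
        else:
            y_t = max(dist, key=dist.get)  # greedy choice: pick the argmax
        if y_t == END:
            break
        caption.append(y_t)
        x_t = y_t  # the output at this step becomes the next input
    return caption


greedy = generate(toy_step)
sampled = generate(toy_step, sample=True)
print(" ".join(greedy))
print(" ".join(sampled))
```

Greedy decoding always returns the same caption for a given image, while sampling can produce different captions from run to run unless the random generator is seeded.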
During testing, the output of the decoder at time t is fed back and becomes the input of the decoder at time