How to Build a Neural Network With Keras Using the IMDB Dataset
Keras is one of the most popular deep learning libraries of the day and has made a big contribution to the commoditization of artificial intelligence. It is simple to use and can build powerful neural networks in just a few lines of code.
In this post, we'll walk through how to build a neural network with Keras that predicts the sentiment of user reviews by categorizing them into two categories: positive or negative. This is called sentiment analysis and we will do it with the famous IMDB review dataset. The model we'll build can also be applied to other machine learning problems with just a few changes.
Note that we will not go into the details of Keras or deep learning. This post is intended to provide a blueprint of a Keras neural network and to make you familiar with its implementation.
Table of Contents
- What Is Keras?
- What Is Sentiment Analysis?
- The IMDB Dataset
- Importing Dependencies and Getting the Data
- Exploring the Data
- Data Preparation
- Building and Training the Model
What is Keras?
Keras is an open source Python library for easily building neural networks. The library is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano and MXNet. Tensorflow and Theano are the most used numerical platforms in Python when building deep learning algorithms, but they can be quite complex and difficult to use. By comparison, Keras provides an easy and convenient way to build deep learning models.
Keras creator François Chollet developed the library to help people build neural networks as quickly and easily as possible, putting a focus on extensibility, modularity, minimalism and Python support. Keras can be used with GPUs and CPUs and it supports both Python 2 and 3.
Google Keras made a big contribution to the commoditization of deep learning and artificial intelligence since it has commoditized powerful, modern deep learning algorithms that were not only previously inaccessible, but unusable as well.
What is Sentiment Analysis?
Sentiment analysis aims to determine the attitude, or sentiment. For example, a speaker or writer with respect to a document, interaction, or event. It is a natural language processing problem in which text needs to be understood to predict the underlying intent.
The sentiment is mostly categorized into positive, negative and neutral categories. Through sentiment analysis we might want to predict, for example, a customer's opinion and attitude about a product based on a review they wrote. This technique is widely applied to things like reviews, surveys, documents and much more.
The imdb Dataset
The IMDB sentiment classification dataset consists of 50,000 movie reviews from IMDB users that are labeled as either positive (1) or negative (0). The reviews are preprocessed and each one is encoded as a sequence of word indexes in the form of integers. The words within the reviews are indexed by their overall frequency within the dataset. For example, the integer “2” encodes the second most frequent word in the data. The 50,000 reviews are split into 25,000 for training and 25,000 for testing.
The dataset was created by researchers at Stanford University and published in a 2011 paper, where they achieved 88.89% accuracy. It was also used within the “Bag of Words Meets Bags of Popcorn” Kaggle competition in 2011.
Importing Dependencies and getting the Data
We start by importing the required dependencies to preprocess our data and build our model.
%matplotlib inline import matplotlib import matplotlib.pyplot as plt import numpy as np from keras.utils import to_categorical from keras import models from keras import layers
(NOTE: You will need to use Numpy, version 1.16.2. Due to a recent change in the framework, Keras has some problems loading the IMDB dataset. Until a new Keras release fixes the issue, this specific version of Numpy will do the trick.)
Continue downloading the IMDB dataset, which is, fortunately, already built into Keras. Since we want to avoid a 50/50 train test split, we will immediately merge the data into data and targets after downloading so we can do an 80/20 split later on.
from keras.datasets import imdb (training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000) data = np.concatenate((training_data, testing_data), axis=0) targets = np.concatenate((training_targets, testing_targets), axis=0)
Exploring the Data
Now we can start exploring the dataset:
print("Categories:", np.unique(targets)) print("Number of unique words:", len(np.unique(np.hstack(data)))) Categories: [0 1] Number of unique words: 9998 length = [len(i) for i in data] print("Average Review length:", np.mean(length)) print("Standard Deviation:", round(np.std(length))) Average Review length: 234.75892 Standard Deviation: 173.0
You can see in the output above that the dataset is labeled into two categories, — 0 or 1, which represents the sentiment of the review. The whole dataset contains 9,998 unique words and the average review length is 234 words, with a standard deviation of 173 words.
Let's look at a single training example:
print("Label:", targets) Label: 1 print(data) [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
Above you can see the first review of the dataset, which is labeled as positive (1). The code below retrieves the dictionary mapping word indices back into the original words so that we can read them. It replaces every unknown word with a “#”. It does this by using the get_word_index() function.
index = imdb.get_word_index() reverse_index = dict([(value, key) for (key, value) in index.items()]) decoded = " ".join( [reverse_index.get(i - 3, "#") for i in data] ) print(decoded) # this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert # is an amazing actor and now the same being director # father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for # and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also # to the two little boy's that played the # of norman and paul they were just brilliant children are often left out of the # list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
Now it's time to prepare our data. We will vectorize every review and fill it with zeros so it contains exactly 10,000 numbers. That means we fill every review that is shorter than 10,000 with zeros. We need to do this because the biggest review is nearly that long and every input for our neural network needs to have the same size. We will also transform the targets into floats.
def vectorize(sequences, dimension = 10000): results = np.zeros((len(sequences), dimension)) for i, sequence in enumerate(sequences): results[i, sequence] = 1 return results data = vectorize(data) targets = np.array(targets).astype("float32")
Now we split our data into a training and a testing set. The training set will contain 40,000 reviews and the testing set 10,000.
test_x = data[:10000] test_y = targets[:10000] train_x = data[10000:] train_y = targets[10000:]
Building and Training the Model
Now we're ready to build our simple neural network. We'll start by defining the type of model we want to build. There are two types of models available in Keras: the sequential model and the model class used with functional API.
Next we simply add the input-, hidden- and output-layers. Between them, we are using dropout to prevent overfitting. Please note you should always use a dropout rate between 20% and 50%.
We use “dense” at every layer, which means the units are fully connected. Within the hidden-layers we use the relu function because this is always a good start and yields a satisfactory result most of the time. Feel free to experiment with other activation functions.
At the output-layer we use the sigmoid function, which maps the values between 0 and 1. Note that we set the input-shape to 10,000 at the input-layer because our reviews are 10,000 integers long. The input-layer takes 10,000 as input and outputs it with a shape of 50.
Lastly, we let Keras print a summary of the model we have just built.
# Input - Layer model.add(layers.Dense(50, activation = "relu", input_shape=(10000, ))) # Hidden - Layers model.add(layers.Dropout(0.3, noise_shape=None, seed=None)) model.add(layers.Dense(50, activation = "relu") model.add(layers.Dropout(0.2, noise_shape=None, seed=None)) model.add(layers.Dense(50, activation = "relu")) # Output- Layer model.add(layers.Dense(1, activation = "sigmoid"))model.summary() model.summary() _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_1 (Dense) (None, 50) 500050 _________________________________________________________________ dropout_1 (Dropout) (None, 50) 0 _________________________________________________________________ dense_2 (Dense) (None, 50) 2550 _________________________________________________________________ dropout_2 (Dropout) (None, 50) 0 _________________________________________________________________ dense_3 (Dense) (None, 50) 2550 _________________________________________________________________ dense_4 (Dense) (None, 1) 51 ================================================================= Total params: 505,201 Trainable params: 505,201 Non-trainable params: 0 _________________________________________________________________
Now we compile our model, which is nothing but configuring the model for training. We use the “adam” optimizer, an algorithm that changes the weights and biases during training. We also choose binary-crossentropy as loss (because we deal with binary classification) and accuracy as our evaluation metric.
model.compile( optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"] )
Now we're able to train our model. We'll do this with a batch_size of 500 and only for two epochs because I recognized that the model overfits if we train it longer.
The batch size defines the number of samples that will be propagated through the network and an epoch is an iteration over the entire training data. In general, a larger batch size results in faster training, but doesn't always converge as fast. A smaller batch size is slower in training but it can converge faster. This is definitely problem dependent and you'll need to try out a few different values. If you start with a problem for the first time, I recommend first using a batch-size of 32, which is the standard size.
results = model.fit( train_x, train_y, epochs= 2, batch_size = 500, validation_data = (test_x, test_y) ) Train on 40000 samples, validate on 10000 samples Epoch 1/2 40000/40000 [==============================] - 5s 129us/step - loss: 0.4051 - acc: 0.8212 - val_loss: 0.2635 - val_acc: 0.8945 Epoch 2/2 40000/40000 [==============================] - 4s 90us/step - loss: 0.2122 - acc: 0.9190 - val_loss: 0.2598 - val_acc: 0.8950
It is time to evaluate our model:
Awesome! With this simple model, we already beat the accuracy of the 2011 paper that I mentioned in the beginning. Feel free to experiment with the hyperparameters and the number of layers.
You can see the code for the whole model below:
import numpy as np from keras.utils import to_categorical from keras import models from keras import layers from keras.datasets import imdb (training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000) data = np.concatenate((training_data, testing_data), axis=0) targets = np.concatenate((training_targets, testing_targets), axis=0) def vectorize(sequences, dimension = 10000): results = np.zeros((len(sequences), dimension)) for i, sequence in enumerate(sequences): results[i, sequence] = 1 return results data = vectorize(data) targets = np.array(targets).astype("float32") test_x = data[:10000] test_y = targets[:10000] train_x = data[10000:] train_y = targets[10000:] model = models.Sequential() # Input - Layer model.add(layers.Dense(50, activation = "relu", input_shape=(10000, ))) # Hidden - Layers model.add(layers.Dropout(0.3, noise_shape=None, seed=None)) model.add(layers.Dense(50, activation = "relu")) model.add(layers.Dropout(0.2, noise_shape=None, seed=None)) model.add(layers.Dense(50, activation = "relu")) # Output- Layer model.add(layers.Dense(1, activation = "sigmoid")) model.summary() # compiling the model model.compile( optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"] ) results = model.fit( train_x, train_y, epochs= 2, batch_size = 500, validation_data = (test_x, test_y) ) print("Test-Accuracy:", np.mean(results.history["val_acc"]))
You can now use this model to also do binary sentiment analysis on other sources of text but you need to change them all to a length of 10,000 or you change the input-shape of the input layer. You can also apply this model to other related machine learning problems with only a few changes.
Niklas Donges is an entrepreneur, technical writer and AI expert. He worked on an AI team of SAP for 1.5 years, after which he founded Markov Solutions. The Berlin-based company specializes in artificial intelligence, machine learning and deep learning, offering customized AI-powered software solutions and consulting programs to various companies.