Recently, I discussed linear regression analysis in this space. In this piece, I’m going to look at logistic regression, which resembles linear regression but passes its output through an activation function and uses a different cost function. These two elements are what turn a linear model into a logistic one. In linear regression, the output is a continuously valued label, such as the heat index in Atlanta or the price of fuel. Many real-world problems, however, call for discrete outputs rather than continuous values.
This is where classification comes into the picture. Consider an example in which the output switches between true and false. Linear regression is a poor fit for this problem because the output is discrete. With logistic regression, though, we can sort the processed inputs into discrete classes by estimating probabilities. There’s a lot more in the box, though, and so, in this article, we’ll explore the details needed to understand logistic regression. We’ll also go over how to code a small logistic regression application using TensorFlow 2.0.
Introduction to Logistic Regression
Logistic regression uses probabilities to distinguish inputs and thereby puts them into separate bags of output classes. To better understand how this process works, let’s look at an example.
Consider a case where you want to sketch a relation between your basketball shot’s accuracy and the distance you shoot from. On the whole, it’s about predicting whether you make the basket or not. Let’s suppose you’re going to predict the answer using linear regression. The relation between the win (y) and distance (x) is given by a linear equation, y = mx + c. As a prerequisite, you played for a month, jotted down all the values for x and y, and now you insert the values into the equation. This completes the training phase.
Later, you want to estimate the probability of making the shot from a specific distance. You note the value x and pass it to the trained equation described above, which is now fixed: y = (trained_m)x + (trained_c). The equation returns a value for y (win). Discretizing y to predict the output, either win or lose, isn’t a great technique. Although it technically works, it isn’t a sound approach because y isn’t a probability: nothing constrains it to lie between 0 and 1.
What about using an activation function in the final stage to compel the output to fall into either the win class or the lose class? This indeed seems like a fix to our problem because it takes the concept of probability into consideration. This technique is what’s meant by logistic regression. Yet this isn’t the whole story, so let’s get a detailed overview of the fix.
What Is Wrong With Linear Regression for Classification?
Linear regression never deals with probabilistic values. It simply fits a line (or, with several features, a hyperplane) through the data points such that the error between the points and the hyperplane is minimized. There is no meaningful threshold at which you can distinguish one class from the other. This actually seems like a rote-learning approach.
Also, consider a case where you would want to do multi-class classification. To accomplish this with linear regression, the outputs need to be labeled with the respective class labels. Say the class labels are 1, 2, 3 and 4. These numbers impose an order and a spacing on the classes that may not exist, and when an input is multiplied by a positive weight, larger inputs push the output towards the higher class labels in a majority of the cases. Therefore, linear regression isn’t sufficient for solving classification problems.
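To make the problem concrete, here is a minimal sketch that fits a straight line to win/lose data with NumPy. The shot log is made up for illustration; the point is only that the fitted line produces values above 1 and below 0, which cannot be read as probabilities.

```python
import numpy as np

# Hypothetical shot log: distance in feet and outcome (1 = made, 0 = missed).
distance = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0])
made = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])

# Least-squares fit of y = m*x + c.
m, c = np.polyfit(distance, made, deg=1)

# Predictions for a very short and a very long shot.
y_close = m * 1.0 + c
y_far = m * 30.0 + c
print("1 ft:", y_close, "30 ft:", y_far)
```

The short shot gets a value greater than 1 and the long shot a negative value, so any threshold applied to y is arbitrary.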
Types of Logistic Regression
Logistic regression can be one of three types based on the output values:
- Binary Logistic Regression, in which the target variable has only two possible values, e.g., pass/fail or win/lose.
- Multinomial Logistic Regression, in which the target variable has three or more possible values that are not ordered, e.g., sweet/sour/bitter or cat/dog/fox.
- Ordinal Logistic Regression, in which the outputs are ordered in some way, e.g., bad/good/better/best or low/medium/high.
Theory and Interpretation
So far, we’ve explored an abstract view of logistic regression analysis. Now, let’s look into the math that actually molds logistic regression.
Sigmoid Activation
The activation function is what turns the model’s raw output into the desired form. In logistic regression, we use the logistic (sigmoid) activation. It maps any input value to an output between 0 and 1, meaning it squeezes the output into a limited range. This activation, in turn, is the probabilistic factor. It is given by the equation
$$ \text{logistic}(n) = \frac{1}{1 + e^{-n}} $$
where n is the algorithm’s prediction, i.e. y or mx + c. In this equation, logistic(n) is the probability estimate.
Mathematically, the output would be
$$ P(y = 1) = \frac{1}{1 + e^{-(mx + c)}} $$
Here, the linear output mx + c is substituted into the sigmoid activation function to produce a probability that lies between 0 and 1. P(y=1) is the probability of class 1: the closer it gets to 1, the more confident our model is that the output is in class 1.
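As a small illustration, the sigmoid 1 / (1 + e^(-n)) can be written in a few lines of plain Python. The values of trained_m and trained_c below are assumptions picked for the basketball example, not fitted parameters.

```python
import math

def logistic(n):
    """Sigmoid: maps a real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-n))

# Hypothetical trained parameters for the shot example (assumed values).
trained_m, trained_c = -0.5, 4.0

for x in (2.0, 8.0, 14.0):
    p = logistic(trained_m * x + trained_c)
    print(f"distance={x:>4}: P(y=1)={p:.3f}")
```

With these assumed parameters, the probability of making the shot falls as the distance grows and is exactly 0.5 where mx + c equals 0.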
Decision Boundary
A threshold can be set to 0.5, meaning the values that fall below 0.5 could be labeled as class A instances, and the values that fall above 0.5 could be labeled as class B instances. We call this threshold a decision boundary because it establishes and finalizes the decision by splitting the output values.
For instance, say you want to buy a piece of land that covers a specific area but can’t arrive at a reliable decision. To inform your decision, you procure the previous land buyers’ data with respect to that area, plot the numbers, and draw a decision boundary at 0.5 to differentiate between the two outcomes: buy or not buy. After conducting the analysis, if the output falls under 0.5, then the result is negative, indicating that it’s not profitable to buy the piece of land.
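The thresholding step can be sketched as follows. The function name classify and the choice to send a probability of exactly 0.5 to class B are assumptions made for this example, since the text leaves the tie case open.

```python
def classify(probability, threshold=0.5):
    """Return 1 (class B) at or above the threshold, otherwise 0 (class A)."""
    return 1 if probability >= threshold else 0

probabilities = [0.12, 0.49, 0.50, 0.87]
labels = [classify(p) for p in probabilities]
print(labels)  # [0, 0, 1, 1]
```

Raising or lowering the threshold trades one kind of mistake for the other, which is why 0.5 is a common default rather than a fixed rule.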
Cost Function
A single prediction pass won’t produce accurate results, though. We also need to propagate the error back to the weights to minimize the mistakes that pop up. To do so, the cost function/error function has to be formulated.
A cost function calculates the error between actual and predicted values. Unlike in linear regression, mean squared error (MSE) isn’t the right approach for several reasons. The most significant is the nonlinearity induced by the logistic function. Since the model no longer outputs a linear equation, using MSE creates many bumps in the cost function, and gradient descent can get stuck in one of them instead of reaching the optimal solution. Therefore, a simple equation has been introduced that neatly captures the classification flavor:
$$ \text{cost}(h, y) = \begin{cases} -\log(h) & \text{if } y = 1 \\ -\log(1 - h) & \text{if } y = 0 \end{cases} $$
In this equation, y denotes the actual output, and h denotes the predicted output. This cost function is called the cross-entropy or log loss function. The two cases are condensed into one as follows:
$$ \text{cost}(h, y) = -y \log(h) - (1 - y) \log(1 - h) $$
Here, the log keeps the curves smooth, which makes the gradient easy to compute. Each curve is either monotonically increasing or monotonically decreasing. To check that the cost function behaves sensibly, take the case where y = 1 and h = 1: log(1) = 0, meaning the cost/error is 0. Similarly, when y = 0 and h = 0, log(1 - 0) = 0, meaning the cost/error is 0. Thus, when the predicted outputs are as expected, the cost function is 0.
Another key point is the importance of y and (1-y). When y = 1, the second term in the equation disappears, and when y = 0, the first term disappears, enabling us to perform only the operation we need.
For m training examples, the individual costs are summed and divided by m to give the average cost.
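The combined cost, -y log(h) - (1 - y) log(1 - h) averaged over the examples, can be checked numerically with a short NumPy sketch. The function name log_loss and the clipping constant eps are choices made for this example; clipping is there only to avoid evaluating log(0).

```python
import numpy as np

def log_loss(y, h, eps=1e-12):
    """Mean binary cross-entropy between labels y and predicted probabilities h."""
    y = np.asarray(y, dtype=float)
    h = np.clip(np.asarray(h, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h)))

perfect = log_loss([1, 0], [1.0, 0.0])   # predictions match the labels
unsure = log_loss([1, 0], [0.5, 0.5])    # model has no preference
wrong = log_loss([1, 0], [0.1, 0.9])     # confident and wrong
print(perfect, unsure, wrong)
```

The cost is essentially zero for perfect predictions, equals log(2) when the model outputs 0.5 for everything, and grows quickly as the model becomes confidently wrong.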
Gradient Descent
To reduce the cost/error, the gradient descent algorithm is used. We compute the gradient by differentiating the cost function with respect to the weights. The final equation is:
$$ C = x \, (s(z) - y) $$
where C is the derivative of the cost function with respect to the weights of the network, x is the whole feature vector, s(z) is the predicted output and y is the actual output.
The computed gradient is averaged over the training examples, i.e. gradient /= N, where N is the total number of examples. The resulting value is multiplied by the learning rate and subtracted from the weights. In this way, the weights are updated to bring the predicted values closer to the actual values.
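The update rule can be sketched in NumPy as below. This is an illustrative batch version under stated assumptions: the toy data is made up, the first column of x is a constant 1 that plays the role of the bias, and the gradient is averaged over the number of examples.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(weights, x, y, learning_rate):
    """One batch gradient descent update for binary logistic regression."""
    n_examples = x.shape[0]
    predictions = sigmoid(x @ weights)     # s(z)
    gradient = x.T @ (predictions - y)     # x (s(z) - y), summed over examples
    gradient /= n_examples                 # average over the examples
    return weights - learning_rate * gradient

# Toy data (made up): a bias column of ones plus one centered feature.
x = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

weights = np.zeros(2)
for _ in range(200):
    weights = gradient_step(weights, x, y, learning_rate=0.1)

preds = sigmoid(x @ weights)
print(weights, preds.round(3))
```

After the updates, the two negative-feature examples receive probabilities below 0.5 and the two positive-feature examples receive probabilities above 0.5.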
Advantages and Disadvantages of Logistic Regression
Outputs from the logistic regression algorithm are easy to interpret since they return the probabilities or the class scores. You can also use these scores to rank examples rather than only to classify them. We can put this algorithm into action easily where the relationship between the features and the output is expected to be roughly linear and the problem to be linearly separable. One more advantage of this algorithm is that it’s fairly robust to noise, and it can be applied to many kinds of data. Whether we’re working with numbers, text or images, the model can do a reasonable classification job once the inputs are turned into numeric features.
On the other hand, there are a few disadvantages with logistic regression. In a few cases, this algorithm does not handle categorical (binary) variables well. It also suffers from multicollinearity, which occurs when one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy.
Building Logistic Regression Using TensorFlow 2.0
Step 1: Importing Necessary Modules
To get started with the program, we need to import all the necessary packages using the import statement in Python. Instead of typing the full package names every time we write the code, we can alias them with a shortcut using as. For example, aliasing numpy as np:
from __future__ import absolute_import, division, print_function
import tensorflow as tf
import numpy as np
Step 2: Loading and Preparing the MNIST Data Set
For the logistic regression model that we’re building, we will be using the MNIST data set. MNIST data is a collection of hand-written digits that contains 60,000 examples for training and 10,000 examples for testing. The digits have been size-normalized and centered in a fixed-size image (28x28 pixels) with values from 0 to 255. In the code below, each image will be converted to float32, normalized to [0, 1] and flattened to a 1-D array of 784 features (28*28).
To import the MNIST data set to our program, we use tensorflow.keras.datasets. Next, we load the training data set and testing data set into the variables (x_train, y_train) and (x_test, y_test) using the mnist.load_data() function. Since the data are images, we flatten the pixel values or features into a 1-D array of size 784 using the reshape method. We also normalize the pixel intensities to make sure their values are between 0 and 1 by dividing them by 255.
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Convert to float32.
x_train, x_test = np.array(x_train, np.float32), np.array(x_test, np.float32)
# Flatten images to 1-D vector of 784 features (28*28).
x_train, x_test = x_train.reshape([-1, 28 * 28]), x_test.reshape([-1, 28 * 28])
# Normalize images value from [0, 255] to [0, 1].
x_train, x_test = x_train / 255., x_test / 255.
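To see what these three operations do without downloading MNIST, the following sketch applies them to a small stand-in array. The random images are an assumption for illustration only; they are not real digits.

```python
import numpy as np

# Stand-in for a batch of MNIST images: 5 random 28x28 uint8 images.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Same steps as above: cast to float32, flatten to 784 features, scale to [0, 1].
flat = np.array(fake_images, np.float32).reshape([-1, 28 * 28]) / 255.

print(flat.shape, flat.dtype, flat.min(), flat.max())
```

Each image becomes one row of 784 float32 values between 0 and 1, which is the shape the model’s weight matrix expects.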
Step 3: Setting Up Hyperparameters and Data Set Parameters
In this step, we initialize the model parameters. num_classes denotes the number of outputs, which is 10, as we have digits from 0 to 9 in the data set. num_features defines the number of input features, and we store 784 since each image contains 784 pixels.
learning_rate defines the step size the model takes as it moves towards a minimum of the loss. training_steps defines the number of training steps, i.e. the number of batches the model is trained on, and batch_size denotes the number of samples in each batch. display_step sets how often progress is printed: the loss and accuracy are reported every display_step steps.
# MNIST dataset parameters.
num_classes = 10 # 0 to 9 digits
num_features = 784 # 28*28
# Training parameters.
learning_rate = 0.01
training_steps = 1000
batch_size = 256
display_step = 50
Step 4: Shuffling and Batching the Data
We need to shuffle and batch the data before we start the actual training to prevent the model from being biased by the order of the data. Shuffling makes the order of the examples more random and helps our model reach higher accuracy on the test data.
With the help of tf.data.Dataset.from_tensor_slices, we can slice the arrays into a data set of individual (image, label) examples. The shuffle(5000) call randomizes the order of the data set’s examples. Here, 5000 is the buffer_size argument: the data set fills a buffer with the first 5,000 examples and picks one of them at random. That leaves 4,999 examples in the buffer, so example 5,001 is added to take the free slot. The prefetch method makes the input pipeline more efficient by preparing upcoming batches in parallel with the downstream GPU operations.
# Use tf.data API to shuffle and batch data.
train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_data = train_data.repeat().shuffle(5000).batch(batch_size).prefetch(1)
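The buffer mechanism described above can be imitated in plain Python. This is a simplified model of buffer-based shuffling written for illustration, not TensorFlow’s implementation, and buffered_shuffle is a hypothetical helper name.

```python
import itertools
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items the way a fixed-size shuffle buffer would emit them."""
    rng = random.Random(seed)
    iterator = iter(items)
    # Fill the buffer with the first buffer_size items.
    buffer = list(itertools.islice(iterator, buffer_size))
    while buffer:
        index = rng.randrange(len(buffer))
        yield buffer[index]
        try:
            # Refill the freed slot with the next unseen item.
            buffer[index] = next(iterator)
        except StopIteration:
            # Input exhausted: shrink the buffer until it is empty.
            buffer.pop(index)

shuffled = list(buffered_shuffle(range(20), buffer_size=5))
print(shuffled)
```

Every item appears exactly once, but the first item emitted can only come from the first five. This is why a buffer that is small relative to the data set gives only a partial shuffle.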
Step 5: Initializing Weights and Biases
We now initialize the weight matrix and bias vector with ones and zeros, respectively, using tf.ones and tf.zeros. We use tf.Variable to define them, as we will be changing the values of the weights and biases during the course of training.
# Weight of shape [784, 10], the 28*28 image features, and a total number of classes.
W = tf.Variable(tf.ones([num_features, num_classes]), name="weight")
# Bias of shape [10], the total number of classes.
b = tf.Variable(tf.zeros([num_classes]), name="bias")
Step 6: Defining Logistic Regression and Cost Function
We define the logistic_regression function below, which converts the inputs into a probability distribution proportional to the exponentials of the inputs using the softmax function. The softmax function, which is implemented using the function tf.nn.softmax, also makes sure that the sum of all the outputs equals one.
In the next piece of code, we encode the labels as one-hot vectors using the function tf.one_hot. We also define and compute the cross-entropy function as the loss function, which is given as cross-entropy loss = -sum(y_true * log(y_pred)), using tf.reduce_mean and tf.reduce_sum, which are analogous to the NumPy functions np.mean and np.sum.
# Logistic regression (Wx + b).
def logistic_regression(x):
    # Apply softmax to normalize the logits to a probability distribution.
    return tf.nn.softmax(tf.matmul(x, W) + b)

# Cross-Entropy loss function.
def cross_entropy(y_pred, y_true):
    # Encode label to a one hot vector.
    y_true = tf.one_hot(y_true, depth=num_classes)
    # Clip prediction values to avoid log(0) error.
    y_pred = tf.clip_by_value(y_pred, 1e-9, 1.)
    # Sum over the classes (axis 1) for each example, then average over the batch.
    return tf.reduce_mean(-tf.reduce_sum(y_true * tf.math.log(y_pred), axis=1))
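For readers who want to check the arithmetic by hand, here is a NumPy sketch of softmax followed by a one-hot cross-entropy. The logits and labels are made-up values, and the helper names softmax and one_hot are chosen for this example.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def one_hot(labels, depth):
    """Turn integer labels into one-hot rows."""
    return np.eye(depth)[labels]

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 0.5]])
labels = np.array([0, 2])

probs = softmax(logits)
# Cross-entropy per example: sum over the classes (axis 1).
per_example = -np.sum(one_hot(labels, 3) * np.log(probs), axis=1)
loss = per_example.mean()
print(probs.sum(axis=1), per_example, loss)
```

Each row of probs sums to one, and the second example, whose logits are all equal, contributes a loss of log(3) because the model spreads its probability evenly over the three classes.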
Step 7: Defining Optimizers and Accuracy Metrics
Now we define a function to measure how often the model predicts correctly. When we compute the output, it gives us the probability that the given example belongs to each output class. We take the class with the highest probability as the prediction, which we find using the function tf.argmax, and count the prediction as correct when it matches the true label. We also choose stochastic gradient descent as the optimizer from the several optimizers present in TensorFlow. We do this using tf.optimizers.SGD, which takes the learning rate as its input. The learning rate sets the size of each update and therefore how quickly the model moves towards its minimum loss.
# Accuracy metric.
def accuracy(y_pred, y_true):
    # Predicted class is the index of the highest score in prediction vector (i.e. argmax).
    correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.cast(y_true, tf.int64))
    return tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Stochastic gradient descent optimizer.
optimizer = tf.optimizers.SGD(learning_rate)
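The accuracy computation is easy to verify on a tiny hand-made example. The NumPy sketch below mirrors the argmax-and-compare logic; the probabilities and labels are invented for illustration.

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.4, 0.5, 0.1],
                  [0.2, 0.2, 0.6]])
labels = np.array([0, 2, 0, 2])

predicted = probs.argmax(axis=1)       # index of the highest probability per row
acc = np.mean(predicted == labels)     # fraction of matching predictions
print(predicted, acc)                  # [0 2 1 2] 0.75
```

Three of the four predictions match their labels, so the accuracy is 0.75.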
Step 8: Optimization Process and Updating Weights and Biases
Now we define the run_optimization() function, where we update the weights of our model. We calculate the predictions by passing the inputs to logistic_regression(x) and find the loss by comparing the predicted values with the true labels in the data set. Next, we compute the gradients using tf.GradientTape and update the weights of the model with our stochastic gradient descent optimizer.
# Optimization process.
def run_optimization(x, y):
    # Wrap computation inside a GradientTape for automatic differentiation.
    with tf.GradientTape() as g:
        pred = logistic_regression(x)
        loss = cross_entropy(pred, y)
    # Compute gradients.
    gradients = g.gradient(loss, [W, b])
    # Update W and b following gradients.
    optimizer.apply_gradients(zip(gradients, [W, b]))
Step 9: The Training Loop
# Run training for the given number of steps.
for step, (batch_x, batch_y) in enumerate(train_data.take(training_steps), 1):
    # Run the optimization to update W and b values.
    run_optimization(batch_x, batch_y)

    if step % display_step == 0:
        pred = logistic_regression(batch_x)
        loss = cross_entropy(pred, batch_y)
        acc = accuracy(pred, batch_y)
        print("step: %i, loss: %f, accuracy: %f" % (step, loss, acc))
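To show what the optimizer and gradient tape do under the hood, here is a NumPy-only sketch of the same kind of loop with the gradient written out by hand. It is not the TensorFlow code above: the three-cluster data set is synthetic, the weights start at zero, and the sizes and step counts are assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a data set: three well-separated 2-D clusters.
centers = np.array([[-3.0, -3.0], [3.0, -3.0], [0.0, 3.0]])
labels = rng.integers(0, 3, size=300)
features = centers[labels] + 0.5 * rng.standard_normal((300, 2))

W = np.zeros((2, 3))
b = np.zeros(3)
learning_rate = 0.1

def predict(x, W, b):
    """Softmax of the linear scores x @ W + b."""
    logits = x @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(logits)
    return exps / exps.sum(axis=1, keepdims=True)

def loss_and_accuracy(x, y, W, b):
    probs = predict(x, W, b)
    loss = -np.mean(np.log(probs[np.arange(len(y)), y]))
    acc = np.mean(probs.argmax(axis=1) == y)
    return loss, acc

initial_loss, _ = loss_and_accuracy(features, labels, W, b)
for step in range(1, 201):
    # Draw a random mini-batch.
    batch = rng.choice(len(features), size=32, replace=False)
    x, y = features[batch], labels[batch]
    # Gradient of the cross-entropy w.r.t. the logits: softmax minus one-hot.
    error = predict(x, W, b)
    error[np.arange(len(y)), y] -= 1.0
    W -= learning_rate * (x.T @ error) / len(y)
    b -= learning_rate * error.mean(axis=0)
    if step % 50 == 0:
        loss, acc = loss_and_accuracy(features, labels, W, b)
        print("step: %i, loss: %f, accuracy: %f" % (step, loss, acc))

final_loss, final_acc = loss_and_accuracy(features, labels, W, b)
```

On this easy synthetic problem the loss drops well below its starting value of log(3) and the accuracy approaches 1. MNIST is harder, so the numbers printed by the TensorFlow loop will look different.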
Step 10: Testing Model Accuracy Using the Test Data
Finally, we check the model accuracy by passing the test data set through our model and computing the accuracy using the accuracy function that we defined earlier.
# Test model on the test set.
pred = logistic_regression(x_test)
print("Test Accuracy: %f" % accuracy(pred, y_test))
Applications of Logistic Regression
Logistic regression has proven useful in many industries, including marketing, medicine, finance and human resources, by providing solutions to complex business problems. Some practical applications include measuring customer behavior, predicting risk factors, estimating the profitability of a given product, making investment decisions, and assessing the likelihood of fraudulent actions.
In this article, we’ve reviewed logistic regression, which is one of the most popular machine learning algorithms. We’ve discussed in detail how logistic regression works and used TensorFlow 2.0 to implement it. There are many other Python frameworks you can experiment with to try to increase the model’s accuracy on your data sets.