How to Fix a CUDA Error: Device-Side Assert Triggered in PyTorch

A CUDA Error: Device-Side Assert Triggered can either be caused by an inconsistency between the number of labels and output units or an incorrect input for a loss function. Follow this guide to fix it. 

Written by Perez Ogayo
Published on Sep. 14, 2022
Image: Shutterstock / Built In

If you happen to run into this error — cuda runtime error (59): device-side assert triggered — you know how frustrating it can be. The most frustrating part for me was the lack of a clear, step-by-step solution to this problem. This could be due to the fact that PyTorch is still relatively new. 

Code view of the CUDA runtime error (59): device-side assert triggered. | Screenshot: Perez Ogayo

I first encountered this problem while working on the Stanford car data set during a hackathon for the Udacity PyTorch Challenge. It took me a while to fix, and it didn’t help that I was using Kaggle Kernels, which presented their own challenges with regard to the GPU.

What Is a CUDA Error: Device-Side Assert Triggered? 

A CUDA error: device-side assert triggered is an error that’s often caused when you either have inconsistency between the number of labels and output units or you input the loss function incorrectly. To solve it, you need to make sure your output units match the number of classes and that your output layer returns values in the range of the loss function (criterion) that you chose.

 

What Causes a CUDA Error: Device-Side Assert Triggered?

A CUDA device-side assert is usually triggered for one of two reasons:

  1. An inconsistency between the number of labels/classes and the number of output units.
  2. An incorrect input to the loss function.

Let’s unpack these reasons and their solutions below.

 

Reason 1: Inconsistency Between the Number of Labels and Output Units

When defining the final fully connected layer of my model, instead of putting 196 — the total number of classes for the Stanford car data set — as the number of output units, I put 195.

Entering the wrong output can trigger the error. | Image: Perez Ogayo

The error is usually reported on the line where you do the backpropagation. Your loss function compares the output from your model with the label of that observation in your data set. In case you’re unsure of the difference between labels and outputs, I define them as follows:

  • Label: These are the tags associated with an observation. When working on a classification problem, the label is the class. For example, in the classic dog vs. cat problem, the labels are cat and dog.
  • Output: This is the predicted “label” from your model. You give your model a set of features from an observation, and it gives out a prediction called “output” in the PyTorch ecosystem. Think of it as a predicted label.

In my case, some of the labels had a value of 195, which was beyond the range my model could output: with only 195 output units, the greatest possible class index is 194 (counting starts from zero). This triggered the error.
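To make this concrete, here is a minimal sketch (my own, with made-up sizes, not the article’s model) that reproduces the mismatch with a three-class model and a label of 3:

import torch
import torch.nn as nn

# Hypothetical 3-class model: valid class indices are 0, 1 and 2
model = nn.Sequential(nn.Linear(10, 3), nn.LogSoftmax(dim=1))
criterion = nn.NLLLoss()

features = torch.randn(4, 10)
labels = torch.tensor([0, 2, 1, 3])   # 3 is one past the largest valid index

output = model(features)
loss = criterion(output, labels)      # IndexError on the CPU; device-side assert on the GPU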


 

How Do You Fix This Error?

Make sure the number of output units matches the number of classes. In my case, that meant changing my classifier as follows:

classifier = nn.Sequential(nn.Linear(2048, 700),
                           nn.ReLU(),
                           nn.Dropout(p=0.2),
                           nn.Linear(700, 300),
                           nn.ReLU(),
                           nn.Dropout(p=0.2),
                           nn.Linear(300, 196),  # changed this from 195 to 196
                           nn.LogSoftmax(dim=1))
model.fc = classifier

This is how you get the number of classes in your data set programmatically:

# dataset.classes returns a list of the classes in your data set,
# usually numbered from 0 to number_of_classes - 1.
# Calling len() on that list returns the number of classes.
len(dataset.classes)
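You can then feed that count into the final layer instead of hard-coding it. A minimal sketch, assuming dataset is an ImageFolder-style object with a .classes attribute and classifier is the nn.Sequential defined above:

num_classes = len(dataset.classes)            # 196 for the Stanford car data set

# Replace the hard-coded final Linear layer (the one just before LogSoftmax)
# so the number of output units always matches the data.
classifier[-2] = nn.Linear(300, num_classes)
model.fc = classifier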

 

Reason 2: Wrong Input for the Loss Function

Loss functions have different ranges for the possible inputs that they can accept. If you choose an incompatible activation function for your output layer, it will trigger this error. For example, BCELoss() requires its input to be between zero and one. If the input (output from your model) is beyond the acceptable range for that particular loss function, the error will get triggered.

 

What Are Activation and Loss Functions?

Activation functions are the mathematical equations that determine the output of your neural network. The purpose of the activation function is to introduce non-linearity into the output of a model, thus making a model capable of learning and performing more complex tasks. In turn, they determine how accurate your network will be.

Loss functions are the equations that compute the error that is used to learn via backpropagation.

Mismatch between input and target. | Image: Perez Ogayo


 

How to Resolve a CUDA Error: Device-Side Assert Triggered in PyTorch

Make sure your output layer returns values in the range of the loss function (criterion) that you chose. This implies that you’re using the appropriate activation function (sigmoid, softmax, LogSoftmax) in your final output layer.
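As an illustration of what “matching” means here, the following sketch (my own, with made-up shapes) shows two equivalent pairings for multi-class classification; neither one will trip the assert as long as the labels stay within range:

import torch
import torch.nn as nn

logits = torch.randn(8, 196)            # raw model outputs, no activation applied
targets = torch.randint(0, 196, (8,))   # class indices 0..195

# Pairing A: LogSoftmax in the model + NLLLoss as the criterion (used in this article)
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_a = nn.NLLLoss()(log_probs, targets)

# Pairing B: no activation in the model; CrossEntropyLoss applies log-softmax internally
loss_b = nn.CrossEntropyLoss()(logits, targets)

print(torch.allclose(loss_a, loss_b))   # True: both pairings compute the same loss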

 

Example of Problematic Code

import torch
import torch.nn as nn

model = nn.Linear(2, 1).cuda()                  # assumes a GPU is available
input = torch.randn(128, 2).cuda()
output = model(input)                           # raw outputs, not restricted to [0, 1]
criterion = nn.BCELoss()                        # expects inputs between 0 and 1
target = torch.empty(128, 1).random_(2).cuda()
loss = criterion(output, target)                # triggers the device-side assert

The code above will trigger a CUDA runtime error 59 if you are using a GPU. You can fix it by passing your output through the sigmoid function or using BCEWithLogitsLoss().

 

Solution 1: Pass the Results Through the Sigmoid Function

model = nn.Linear(2, 1).cuda()  # the sigmoid can also be built into the model: nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
input = torch.randn(128, 2).cuda()
output = model(input)
criterion = nn.BCELoss()
target = torch.empty(128, 1).random_(2).cuda()
loss = criterion(torch.sigmoid(output), target)  # sigmoid squashes the outputs into [0, 1]

 

Solution 2: Use “BCEWithLogitsLoss()”

model = nn.Linear(2, 1).cuda()
input = torch.randn(128, 2).cuda()
output = model(input)
criterion = nn.BCEWithLogitsLoss()  # changed from BCELoss; applies the sigmoid internally and is more numerically stable
target = torch.empty(128, 1).random_(2).cuda()
loss = criterion(output, target)

 

Fixing CUDA Error: Device-Side Assert Triggered on Kaggle

Once that error is triggered, you cannot continue using your GPU. Even after changing the problematic line and reloading the entire kernel, you will still get the error presented in different forms. The form depends on which line is the first one to attempt to use the GPU. Look at the image below for an idea.

Calling the train function after fixing the error. | Image: Perez Ogayo

The reason this happens is that even though you may have fixed the bug in your code, once the runtime error 59 is triggered, you are blocked from using the GPU entirely during the same GPU session in which the error was triggered.

 

The Solution

Stop your current kernel session and start a new one.

To do this, follow the steps below:

  1. From your kernel, click the ‘K’ on the top left. It will automatically take you to your kernels.
  2. You will see a list of your kernels, each with an edit option and, for any that are currently running, a stop option.
  3. Click “Stop kernel.”
  4. Restart your kernel session fresh. Every variable should reset, and you should have a brand new GPU session.
How to stop running a Kaggle kernel. | Video: Perez Ogayo

 

CUDA Error: Device-Side Assert Triggered Tips

The error messages you get when running into this error may not be very descriptive. To make sure you get a complete and useful stack trace, set the environment variable CUDA_LAUNCH_BLOCKING=1 before any GPU work happens and run your code again.
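One way to do this from Python (a minimal sketch; you can equally set the variable in your shell before launching the script) is:

import os

# Must run before the first CUDA operation, so place it at the very top of your script or notebook.
# CUDA_LAUNCH_BLOCKING=1 forces GPU kernels to run synchronously, so the stack trace points at the
# line that actually failed instead of a later, unrelated line.
# Equivalent from the shell: CUDA_LAUNCH_BLOCKING=1 python your_script.py
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"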
