After AlexNet’s celebrated victory at the ILSVRC 2012 classification contest, the deep residual network (ResNet) was arguably the most groundbreaking work in the computer vision and deep learning community in the last few years. ResNet makes it possible to train networks with hundreds or even thousands of layers and still achieve compelling performance.

## What Is a Residual Network (ResNet)?

Thanks to ResNet’s powerful representational ability, the performance of many computer vision applications other than image classification has been boosted, including object detection and facial recognition.

Since ResNet blew people’s minds in 2015, many in the research community have dived into the secrets of its success, and several refinements have been made in the architecture. In the first part of this article, I am going to give a little bit of background knowledge for those who are unfamiliar with ResNet, and in the second half, I will review some of the papers I read recently regarding different variants and interpretations of the ResNet architecture.

## What Is ResNet?

According to the universal approximation theorem, given enough capacity, a feedforward network with a single hidden layer is sufficient to represent any function. However, the layer might be massive, and the network is prone to overfitting the data. Therefore, there is a common trend in the research community to make network architectures deeper.

Since AlexNet, state-of-the-art convolutional neural network (CNN) architectures have been going deeper and deeper. While AlexNet had only five convolutional layers, the VGG network and GoogleNet (also codenamed Inception_v1) had 19 and 22 layers, respectively.

However, you can’t simply stack layers together to increase network depth. Deep networks are hard to train because of the notorious vanishing gradient problem: as the gradient is backpropagated to earlier layers, repeated multiplication can make it vanishingly small. As a result, as the network goes deeper, its performance saturates or even starts to degrade rapidly.
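To see why repeated multiplication is a problem, here is a toy illustration (plain Python, not from any paper): if every layer contributes a local gradient slightly below one, the backpropagated product shrinks exponentially with depth.

```python
# Toy model of backpropagation through a deep stack: the gradient
# reaching an early layer is the product of the local gradients of
# every later layer it passes through.
def backprop_magnitude(local_grad: float, depth: int) -> float:
    g = 1.0
    for _ in range(depth):
        g *= local_grad  # chain rule: one factor per layer
    return g

print(backprop_magnitude(0.9, 5))    # shallow net: ~0.59
print(backprop_magnitude(0.9, 100))  # deep net: ~2.7e-5, effectively vanished
```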

Before ResNet, there had been several ways to deal with the vanishing gradient issue. For instance, GoogleNet adds an auxiliary loss in a middle layer for extra supervision, but none of those solutions seemed to really tackle the problem once and for all.

The core idea of ResNet is that it introduced a so-called “identity shortcut connection” that skips one or more layers.

The authors of the study on deep residual learning for image recognition argue that stacking layers shouldn’t degrade the network performance because we could simply stack identity mappings — a layer that doesn’t do anything — on top of the current network, and the resulting architecture would perform the same. This indicates that the deeper model should not produce a training error higher than its shallower counterparts. They hypothesize that letting the stacked layers fit a residual mapping is easier than letting them directly fit the desired underlying mapping. And the residual block above explicitly allows it to do precisely that.
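As a minimal sketch of this idea, with scalars standing in for feature maps and a placeholder `weight_layers` function instead of real conv/BN/ReLU stacks:

```python
# A residual block learns the residual F(x) = H(x) - x rather than the
# target mapping H(x) directly; the shortcut adds x back.
def residual_block(x, weight_layers):
    return weight_layers(x) + x  # F(x) + identity shortcut

# If the weight layers learn to output zero, the block reduces to an
# identity mapping, so stacking more blocks cannot shrink the set of
# functions the network can represent.
print(residual_block(3.0, lambda x: 0.0))  # 3.0
```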

As a matter of fact, ResNet was not the first to make use of shortcut connections. The authors of a study on highway networks also introduced gated shortcut connections. These parameterized gates control how much information is allowed to flow across the shortcut. A similar idea can be found in the long short-term memory (LSTM) cell, in which a parameterized forget gate controls how much information will flow to the next time step. Therefore, ResNet can be thought of as a special case of the highway network.

However, experiments show that the highway network performs no better than ResNet, which is unusual because the solution space of the highway network contains ResNet, so it should perform at least as well as ResNet. This suggests that it is more important to keep these “gradient highways” clear than to go for a larger solution space.

Following this intuition, the authors of the study on deep residual learning for image recognition refined the residual block and proposed, in a follow-up study on identity mappings in deep ResNets, a pre-activation variant of the residual block, in which gradients can flow unimpeded through the shortcut connections to any earlier layer. In fact, using the original residual block from the image recognition study, training a 1,202-layer ResNet resulted in worse performance than its 110-layer counterpart.

The authors of identity mappings in deep ResNets demonstrated with experiments that they can now train a 1001-layer deep ResNet to outperform its shallower counterparts. Because of its compelling results, ResNet quickly became one of the most popular architectures for various computer vision tasks.

## ResNet Architecture Variants and Interpretations

As ResNet has gained popularity in the research community, its architecture has been studied heavily. In this section, I will first introduce several new architectures based on ResNet, then introduce a paper that interprets ResNet as an ensemble of many smaller networks.

### ResNeXt

The authors in a study on aggregated residual transformations for deep neural networks proposed a variant of ResNet that is codenamed ResNeXt.

It is very similar to the inception module that the authors of the study on going deeper with convolutions came up with in 2015. Both follow the split-transform-merge paradigm, except in this variant, the outputs of the different paths are merged by adding them together, while in the 2015 study, they are depth-concatenated. Another difference is that in the study on going deeper with convolutions, each path is different from the others (1x1, 3x3 and 5x5 convolution), while in this architecture, all paths share the same topology.

The authors introduced a hyper-parameter called cardinality — the number of independent paths — to provide a new way of adjusting the model capacity. Experiments show that accuracy can be gained more efficiently by increasing the cardinality than by going deeper or wider. The authors state that compared to inception, this novel architecture is easier to adapt to new data sets and tasks, as it has a simple paradigm and only one hyper-parameter needs to be adjusted. Inception, however, has many hyper-parameters (like the kernel size of the convolutional layer of each path) to tune. This novel building block has three equivalent forms.

In practice, the “split-transform-merge” step is usually done via a pointwise grouped convolutional layer, which divides its input into groups of feature maps and performs a normal convolution within each group. The group outputs are depth-concatenated and then fed to a 1x1 convolutional layer.
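The pattern can be sketched with plain Python lists standing in for channels; the `cardinality` and `transforms` arguments and the slicing scheme below are illustrative, not the paper’s implementation:

```python
# Split-transform-merge with grouped "convolutions": split the input
# channels into `cardinality` groups, transform each group with its own
# function, then depth-concatenate the results.
def split_transform_merge(channels, cardinality, transforms):
    group_size = len(channels) // cardinality
    out = []
    for g in range(cardinality):
        group = channels[g * group_size:(g + 1) * group_size]
        out.extend(transforms[g](group))  # concatenate group outputs
    return out  # a trailing 1x1 conv would mix the groups in a real block

x = [1, 2, 3, 4, 5, 6, 7, 8]
doubled = split_transform_merge(x, 4, [lambda g: [2 * v for v in g]] * 4)
print(doubled)  # [2, 4, 6, 8, 10, 12, 14, 16]
```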

### Densely Connected CNN

Another team of researchers in 2016 proposed a novel architecture called DenseNet that further exploits the effects of shortcut connections. It connects all layers directly with each other. In this novel architecture, the input of each layer consists of the feature maps of all earlier layers, and its output is passed to each subsequent layer. The feature maps are aggregated with depth-concatenation.

Other than tackling the vanishing gradients problem, the authors of the study on densely connected convolutional networks argue that this architecture also encourages feature reuse, making the network highly parameter-efficient. One simple interpretation of this is that, in the studies on deep residual learning for image recognition and identity mappings in deep ResNets, the output of the identity mapping was added to the next block, which might impede information flow if the feature maps of two layers have very different distributions. Therefore, concatenating feature maps can preserve them all and increase the variance of the outputs, encouraging feature reuse.

Following this paradigm, we know that the *l_th* layer will have `k * (l-1) + k_0` input feature maps, where *k_0* is the number of channels in the input image. The authors used a hyper-parameter called growth rate (*k*) to prevent the network from growing too wide. They also used a 1x1 convolutional bottleneck layer to reduce the number of feature maps before the expensive 3x3 convolution.
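As a quick sanity check of that formula (the growth rate `k = 32` and RGB input `k_0 = 3` below are just example values):

```python
# Number of input feature maps seen by the l-th DenseNet layer: each
# earlier layer contributes k maps (the growth rate) on top of the
# k_0 channels of the input image.
def densenet_input_maps(l: int, k: int, k0: int) -> int:
    return k * (l - 1) + k0

print(densenet_input_maps(1, 32, 3))   # 3, the first layer sees only the input
print(densenet_input_maps(12, 32, 3))  # 355
```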

### Deep Network with Stochastic Depth

Although ResNet has proven powerful in many applications, one major drawback is that a deeper network usually requires weeks for training, making it practically infeasible in real-world applications. To tackle this issue, the researchers for a study on “Deep Networks with Stochastic Depth” introduced a counter-intuitive method of randomly dropping layers during training and using the full network in testing.

The authors used the residual block as their network’s building block. Therefore, during training, when a particular residual block is enabled, its input flows through both the identity shortcut and the weight layers; otherwise, the input only flows through the identity shortcut. During training, each layer has a “survival probability” and is randomly dropped. At test time, all blocks are kept active and re-calibrated according to their survival probabilities during training.

Formally, let *H_l* be the output of the *l_th* residual block, *f_l* be the mapping defined by the *l_th* block’s weight layers and *b_l* be a Bernoulli random variable that can only be one or zero (indicating whether a block is active). During training:

`H_l = ReLU(b_l * f_l(H_(l-1)) + H_(l-1))`

When `b_l = 1`, this block becomes a normal residual block. And when `b_l = 0`, the above formula becomes:

`H_l = ReLU(H_(l-1))`

Since we know that *H_(l-1)* is the output of a ReLU, which is already non-negative, the above equation reduces to an identity layer that only passes the input through to the next layer:

`H_l = H_(l-1)`

Let *p_l* be the survival probability of layer *l* during training. During test time, we have:

`H_l = ReLU(p_l * f_l(H_(l-1)) + H_(l-1))`
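The training- and test-time behavior described above can be sketched as follows, with scalars standing in for feature maps and a toy transform `f` (a simplification of the paper’s setup):

```python
import random

def relu(x):
    return max(0.0, x)

# Training: a Bernoulli gate b either keeps the block (b = 1) or
# drops it to a pure identity (b = 0).
def block_train(h_prev, f, p_survive):
    b = 1 if random.random() < p_survive else 0
    return relu(b * f(h_prev) + h_prev)

# Test: all blocks stay active, re-calibrated by the survival probability.
def block_test(h_prev, f, p_survive):
    return relu(p_survive * f(h_prev) + h_prev)

print(block_test(1.0, lambda h: 2.0 * h, 0.5))  # 2.0
```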

The authors applied a linear decay rule to the survival probability of each layer. They argue that since earlier layers extract low-level features that will be used by later ones, they should not be dropped too frequently. The resulting rule therefore becomes:

`p_l = 1 - (l / L) * (1 - p_L)`

Where *L* denotes the total number of blocks, and thus *p_L* is the survival probability of the last residual block, which is fixed to 0.5 throughout the experiments. Also note that in this setting, the input is treated as the first layer (*l = 0*) and thus is never dropped. The overall framework of stochastic depth training is demonstrated in the figure below.
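The linear decay rule is a one-liner; the block count below assumes the paper’s 110-layer ResNet, which has 54 residual blocks:

```python
# Survival probability of block l under the linear decay rule, with
# p_0 = 1 for the input and p_L fixed (0.5 in the paper's experiments).
def survival_prob(l: int, L: int, p_last: float = 0.5) -> float:
    return 1.0 - (l / L) * (1.0 - p_last)

L = 54  # residual blocks in a 110-layer ResNet
print(survival_prob(0, L))   # 1.0, the input is never dropped
print(survival_prob(27, L))  # 0.75
print(survival_prob(54, L))  # 0.5
```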

Similar to Dropout, training a deep network with stochastic depth can be viewed as training an ensemble of many smaller ResNets. The difference is that this method randomly drops an entire layer while Dropout only drops part of the hidden units in one layer during training.

Experiments show that training a 110-layer ResNet with stochastic depth results in better performance than training a constant-depth 110-layer ResNet, while also dramatically reducing the training time. This suggests that some of the layers (paths) in ResNet might be redundant.

## ResNet as an Ensemble of Smaller Networks

In the study on deep networks with stochastic depth, the researchers proposed a counter-intuitive way of training a very deep network: randomly dropping its layers during training and using the full network at test time. The researchers of the study “Residual Networks Behave Like Ensembles of Relatively Shallow Networks” had an even more counter-intuitive finding: we can actually drop some of the layers of a trained ResNet and still have comparable performance. This makes the ResNet architecture even more interesting, as the authors also tried dropping layers of a VGG network, which degraded its performance dramatically.

This study first provides an unraveled view of ResNet to make things clearer. After we unroll the network architecture, it is quite clear that a ResNet architecture with *i* residual blocks has 2^i different paths (because each residual block provides two independent paths).
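A tiny computation makes the scale of this concrete (54 blocks, as in a 110-layer ResNet):

```python
# Unraveled view: each residual block offers two routes, through the
# weight layers or through the shortcut, so i blocks yield 2**i paths.
def num_paths(num_blocks: int) -> int:
    return 2 ** num_blocks

print(num_paths(3))   # 8
print(num_paths(54))  # ~1.8e16 paths in a 110-layer ResNet
```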

Given that finding, it is quite clear why removing a couple of layers from a ResNet architecture doesn’t compromise its performance too much: the architecture has many independent effective paths, and the majority of them remain intact after we remove a couple of layers. The VGG network, on the contrary, has only one effective path, so removing a single layer compromises it, as shown in the study’s extensive experiments.

The authors also conducted experiments to show that the collection of paths in ResNet have ensemble-like behavior. They did so by deleting different numbers of layers at test time, and checked to see if the performance of the network smoothly correlated with the number of deleted layers. The results suggested that the network indeed behaves like an ensemble.

Finally, the authors looked into the characteristics of the paths in ResNet.

It is apparent that the distribution of all possible path lengths follows a binomial distribution. The majority of paths go through 19 to 35 residual blocks.
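This can be checked directly: with *n* residual blocks, a path of length *k* passes through the weight layers of exactly *k* blocks, so the fraction of such paths is `C(n, k) / 2^n`. A short computation, assuming the paper’s 54-block network:

```python
from math import comb

# Fraction of the 2**n paths that go through exactly k weight layers.
def path_length_fraction(k: int, n: int) -> float:
    return comb(n, k) / 2 ** n

n = 54
bulk = sum(path_length_fraction(k, n) for k in range(19, 36))
print(round(bulk, 2))  # the vast majority of paths have length 19..35
```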

The authors also conducted experiments to investigate the relationship between path length and the magnitude of the gradients flowing through it. To get the magnitude of gradients in the path of length *k*, the authors first fed a batch of data to the network and randomly sampled *k* residual blocks. When backpropagating the gradients, they propagated through the weight layer only for the sampled residual blocks. Their graphs show that the magnitude of gradients decreases rapidly as the path becomes longer.

We can now multiply the frequency of each path length by its expected magnitude of gradients to get a feel for how much paths of each length contribute to training. Surprisingly, most contributions come from paths of length nine to 18, yet they constitute only a tiny portion of the total paths. This is a very interesting finding, as it suggests that ResNet did not solve the vanishing gradients problem for very long paths; rather, it enables training very deep networks by shortening its effective paths.
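As a toy illustration of that multiplication (the exponential decay rate below is made up for illustration; the paper measures gradient magnitudes empirically):

```python
from math import comb

# Contribution of paths of length k: frequency of that length times an
# ASSUMED exponentially decaying gradient magnitude.
def contribution(k: int, n: int, decay: float = 0.5) -> float:
    frequency = comb(n, k) / 2 ** n   # fraction of paths of length k
    grad_magnitude = decay ** k       # assumption, not the paper's data
    return frequency * grad_magnitude

n = 54
peak = max(range(n + 1), key=lambda k: contribution(k, n))
print(peak)  # the peak sits well below the mean path length of 27
```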

In this article, I revisited the compelling ResNet architecture and briefly explained the intuitions behind its recent success. I hope it helps strengthen your understanding of this groundbreaking work.