After AlexNet’s celebrated victory at the ILSVRC 2012 classification contest, the deep residual network (ResNet) was arguably the most groundbreaking work in the computer vision and deep learning community in the last few years. ResNet makes it possible to train networks with hundreds or even thousands of layers and still achieve compelling performance.

## What Is a Residual Network (ResNet)?

Thanks to ResNet’s powerful representational ability, the performance of many computer vision applications other than image classification has been boosted, including object detection and facial recognition.

Since ResNet blew people’s minds in 2015, many in the research community have dived into the secrets of its success, and several refinements have been made in the architecture. In the first part of this article, I am going to give a little bit of background knowledge for those who are unfamiliar with ResNet, and in the second half, I will review some of the papers I read recently regarding different variants and interpretations of the ResNet architecture.

## What Is ResNet?

According to the universal approximation theorem, given enough capacity, a feedforward network with a single hidden layer is sufficient to represent any function. However, the layer might be massive, and the network is prone to overfitting the data. Therefore, there is a common trend in the research community to make network architectures deeper.

Since AlexNet, state-of-the-art convolutional neural network (CNN) architectures have been going deeper and deeper. While AlexNet had only five convolutional layers, the VGG network and GoogleNet (also codenamed Inception_v1) had 19 and 22 layers, respectively.

However, you can’t simply stack layers together to increase network depth. Deep networks are hard to train because of the notorious vanishing gradient problem: as the gradient is backpropagated to earlier layers, repeated multiplication can make it vanishingly small. As a result, as the network goes deeper, its performance saturates or even starts to degrade rapidly.
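To see why repeated multiplication is a problem, here is a toy illustration (plain Python, not from any paper): if every layer contributes a local gradient slightly below one, the backpropagated product shrinks exponentially with depth.

```python
# Toy model of backpropagation through a deep stack: the gradient
# reaching an early layer is the product of the local gradients of
# every later layer it passes through.
def backprop_magnitude(local_grad: float, depth: int) -> float:
    g = 1.0
    for _ in range(depth):
        g *= local_grad  # chain rule: one factor per layer
    return g

print(backprop_magnitude(0.9, 5))    # shallow net: ~0.59
print(backprop_magnitude(0.9, 100))  # deep net: ~2.7e-5, effectively vanished
```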

Before ResNet, there had been several ways to deal with the vanishing gradient issue. For instance, GoogleNet adds an auxiliary loss in a middle layer for extra supervision, but none of those solutions seemed to really tackle the problem once and for all.

The core idea of ResNet is that it introduced a so-called “identity shortcut connection” that skips one or more layers.

The authors of the study on deep residual learning for image recognition argue that stacking layers shouldn’t degrade the network performance because we could simply stack identity mappings — a layer that doesn’t do anything — on top of the current network, and the resulting architecture would perform the same. This indicates that the deeper model should not produce a training error higher than its shallower counterparts. They hypothesize that letting the stacked layers fit a residual mapping is easier than letting them directly fit the desired underlying mapping. And the residual block above explicitly allows it to do precisely that.
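As a minimal sketch of this idea, with scalars standing in for feature maps and a placeholder `weight_layers` function instead of real conv/BN/ReLU stacks:

```python
# A residual block learns the residual F(x) = H(x) - x rather than the
# target mapping H(x) directly; the shortcut adds x back.
def residual_block(x, weight_layers):
    return weight_layers(x) + x  # F(x) + identity shortcut

# If the weight layers learn to output zero, the block reduces to an
# identity mapping, so stacking more blocks cannot shrink the set of
# functions the network can represent.
print(residual_block(3.0, lambda x: 0.0))  # 3.0
```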

As a matter of fact, ResNet was not the first to make use of shortcut connections. The authors of a study on highway networks also introduced gated shortcut connections. These parameterized gates control how much information is allowed to flow across the shortcut. A similar idea can be found in the long short-term memory (LSTM) cell, in which a parameterized forget gate controls how much information will flow to the next time step. Therefore, ResNet can be thought of as a special case of the highway network.

However, experiments show that the highway network performs no better than ResNet, which is unusual because the solution space of the highway network contains ResNet, so it should perform at least as well as ResNet. This suggests that it is more important to keep these “gradient highways” clear than to go for a larger solution space.

Following this intuition, the authors of the study on deep residual learning for image recognition refined the residual block and proposed, in a follow-up study on identity mappings in deep ResNets, a pre-activation variant of the residual block, in which gradients can flow unimpeded through the shortcut connections to any earlier layer. In fact, using the original residual block from the image recognition study, training a 1,202-layer ResNet resulted in worse performance than its 110-layer counterpart.

The authors of identity mappings in deep ResNets demonstrated with experiments that they can now train a 1001-layer deep ResNet to outperform its shallower counterparts. Because of its compelling results, ResNet quickly became one of the most popular architectures for various computer vision tasks.

## ResNet Architecture Variants and Interpretations

As ResNet has gained popularity in the research community, its architecture has been studied heavily. In this section, I will first introduce several new architectures based on ResNet, then introduce a paper that interprets ResNet as an ensemble of many smaller networks.

### ResNeXt

The authors in a study on aggregated residual transformations for deep neural networks proposed a variant of ResNet that is codenamed ResNeXt.

It is very similar to the inception module that the authors of the study on going deeper with convolutions came up with in 2015. Both follow the split-transform-merge paradigm, except in this variant, the outputs of the different paths are merged by adding them together, while in the 2015 study, they are depth-concatenated. Another difference is that in the study on going deeper with convolutions, each path is different from the others (1x1, 3x3 and 5x5 convolution), while in this architecture, all paths share the same topology.

The authors introduced a hyper-parameter called cardinality — the number of independent paths — to provide a new way of adjusting the model capacity. Experiments show that accuracy can be gained more efficiently by increasing the cardinality than by going deeper or wider. The authors state that compared to inception, this novel architecture is easier to adapt to new data sets and tasks, as it has a simple paradigm and only one hyper-parameter needs to be adjusted. Inception, however, has many hyper-parameters (like the kernel size of the convolutional layer of each path) to tune. This novel building block has three equivalent forms.

In practice, the “split-transform-merge” step is usually done via a pointwise grouped convolutional layer, which divides its input into groups of feature maps and performs a normal convolution within each group. The group outputs are depth-concatenated and then fed to a 1x1 convolutional layer.
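The pattern can be sketched with plain Python lists standing in for channels; the `cardinality` and `transforms` arguments and the slicing scheme below are illustrative, not the paper’s implementation:

```python
# Split-transform-merge with grouped "convolutions": split the input
# channels into `cardinality` groups, transform each group with its own
# function, then depth-concatenate the results.
def split_transform_merge(channels, cardinality, transforms):
    group_size = len(channels) // cardinality
    out = []
    for g in range(cardinality):
        group = channels[g * group_size:(g + 1) * group_size]
        out.extend(transforms[g](group))  # concatenate group outputs
    return out  # a trailing 1x1 conv would mix the groups in a real block

x = [1, 2, 3, 4, 5, 6, 7, 8]
doubled = split_transform_merge(x, 4, [lambda g: [2 * v for v in g]] * 4)
print(doubled)  # [2, 4, 6, 8, 10, 12, 14, 16]
```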

### Densely Connected CNN

Another team of researchers in 2016 proposed a novel architecture called DenseNet that further exploits the effects of shortcut connections. It connects all layers directly with each other. In this novel architecture, the input of each layer consists of the feature maps of all earlier layers, and its output is passed to each subsequent layer. The feature maps are aggregated with depth-concatenation.

Other than tackling the vanishing gradients problem, the authors of the study on densely connected convolutional networks argue that this architecture also encourages feature reuse, making the network highly parameter-efficient. One simple interpretation of this is that, in the studies on deep residual learning for image recognition and identity mappings in deep ResNets, the output of the identity mapping was added to the next block, which might impede information flow if the feature maps of two layers have very different distributions. Therefore, concatenating feature maps can preserve them all and increase the variance of the outputs, encouraging feature reuse.

Following this paradigm, we know that the *l_th* layer will have `k * (l-1) + k_0` input feature maps, where *k_0* is the number of channels in the input image. The authors used a hyper-parameter called growth rate (*k*) to prevent the network from growing too wide. They also used a 1x1 convolutional bottleneck layer to reduce the number of feature maps before the expensive 3x3 convolution.
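As a quick sanity check of that formula (the growth rate `k = 32` and RGB input `k_0 = 3` below are just example values):

```python
# Number of input feature maps seen by the l-th DenseNet layer: each
# earlier layer contributes k maps (the growth rate) on top of the
# k_0 channels of the input image.
def densenet_input_maps(l: int, k: int, k0: int) -> int:
    return k * (l - 1) + k0

print(densenet_input_maps(1, 32, 3))   # 3, the first layer sees only the input
print(densenet_input_maps(12, 32, 3))  # 355
```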

### Deep Network with Stochastic Depth

Although ResNet has proven powerful in many applications, one major drawback is that a deeper network usually requires weeks for training, making it practically infeasible in real-world applications. To tackle this issue, the researchers for a study on “Deep Networks with Stochastic Depth” introduced a counter-intuitive method of randomly dropping layers during training and using the full network in testing.

The authors used the residual block as their network’s building block. Therefore, during training, when a particular residual block is enabled, its input flows through both the identity shortcut and the weight layers; otherwise, the input only flows through the identity shortcut. During training, each layer has a “survival probability” and is randomly dropped. At test time, all blocks are kept active and re-calibrated according to their survival probabilities during training.

Formally, let *H_l* be the output of the *l_th* residual block, *f_l* be the mapping defined by the *l_th* block’s weight layers and *b_l* be a Bernoulli random variable that can only be one or zero (indicating whether a block is active). During training:

`H_l = ReLU(b_l * f_l(H_(l-1)) + H_(l-1))`

When `b_l = 1`, this block becomes a normal residual block. And when `b_l = 0`, the above formula becomes:

`H_l = ReLU(H_(l-1))`

Since we know that *H_(l-1)* is the output of a ReLU, which is already non-negative, the above equation reduces to an identity layer that only passes the input through to the next layer:

`H_l = H_(l-1)`

Let *p_l* be the survival probability of layer *l* during training. During test time, we have:

`H_l = ReLU(p_l * f_l(H_(l-1)) + H_(l-1))`
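The training- and test-time behavior described above can be sketched as follows, with scalars standing in for feature maps and a toy transform `f` (a simplification of the paper’s setup):

```python
import random

def relu(x):
    return max(0.0, x)

# Training: a Bernoulli gate b either keeps the block (b = 1) or
# drops it to a pure identity (b = 0).
def block_train(h_prev, f, p_survive):
    b = 1 if random.random() < p_survive else 0
    return relu(b * f(h_prev) + h_prev)

# Test: all blocks stay active, re-calibrated by the survival probability.
def block_test(h_prev, f, p_survive):
    return relu(p_survive * f(h_prev) + h_prev)

print(block_test(1.0, lambda h: 2.0 * h, 0.5))  # 2.0
```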

The authors applied a linear decay rule to the survival probability of each layer. They argue that since earlier layers extract low-level features that will be used by later ones, they should not be dropped too frequently. The resulting rule therefore becomes:

`p_l = 1 - (l / L) * (1 - p_L)`

Where *L* denotes the total number of blocks, and thus *p_L* is the survival probability of the last residual block, which is fixed to 0.5 throughout the experiments. Also note that in this setting, the input is treated as the first layer (*l = 0*) and thus is never dropped. The overall framework of stochastic depth training is demonstrated in the figure below.
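The linear decay rule is a one-liner; the block count below assumes the paper’s 110-layer ResNet, which has 54 residual blocks:

```python
# Survival probability of block l under the linear decay rule, with
# p_0 = 1 for the input and p_L fixed (0.5 in the paper's experiments).
def survival_prob(l: int, L: int, p_last: float = 0.5) -> float:
    return 1.0 - (l / L) * (1.0 - p_last)

L = 54  # residual blocks in a 110-layer ResNet
print(survival_prob(0, L))   # 1.0, the input is never dropped
print(survival_prob(27, L))  # 0.75
print(survival_prob(54, L))  # 0.5
```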

Similar to Dropout, training a deep network with stochastic depth can be viewed as training an ensemble of many smaller ResNets. The difference is that this method randomly drops an entire layer while Dropout only drops part of the hidden units in one layer during training.

Experiments show that training a 110-layer ResNet with stochastic depth results in better performance than training a constant-depth 110-layer ResNet, while also dramatically reducing the training time. This suggests that some of the layers (paths) in ResNet might be redundant.

## ResNet as an Ensemble of Smaller Networks

In the study on deep networks with stochastic depth, the researchers proposed a counter-intuitive way of training a very deep network: randomly dropping its layers during training and using the full network at test time. The researchers of the study “Residual Networks Behave Like Ensembles of Relatively Shallow Networks” had an even more counter-intuitive finding: we can actually drop some of the layers of a trained ResNet and still have comparable performance. This makes the ResNet architecture even more interesting, as the authors also tried dropping layers of a VGG network, which degraded its performance dramatically.

This study first provides an unraveled view of ResNet to make things clearer. After we unroll the network architecture, it is quite clear that a ResNet architecture with *i* residual blocks has 2^i different paths (because each residual block provides two independent paths).
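A tiny computation makes the scale of this concrete (54 blocks, as in a 110-layer ResNet):

```python
# Unraveled view: each residual block offers two routes, through the
# weight layers or through the shortcut, so i blocks yield 2**i paths.
def num_paths(num_blocks: int) -> int:
    return 2 ** num_blocks

print(num_paths(3))   # 8
print(num_paths(54))  # ~1.8e16 paths in a 110-layer ResNet
```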

Given that finding, it is quite clear why removing a couple of layers from a ResNet architecture doesn’t compromise its performance too much: the architecture has many independent effective paths, and the majority of them remain intact after we remove a couple of layers. The VGG network, on the contrary, has only one effective path, so removing a single layer compromises it, as shown in the study’s extensive experiments.

The authors also conducted experiments to show that the collection of paths in ResNet have ensemble-like behavior. They did so by deleting different numbers of layers at test time, and checked to see if the performance of the network smoothly correlated with the number of deleted layers. The results suggested that the network indeed behaves like an ensemble.

Finally, the authors looked into the characteristics of the paths in ResNet.

It is apparent that the distribution of all possible path lengths follows a binomial distribution. The majority of paths go through 19 to 35 residual blocks.
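This can be checked directly: with *n* residual blocks, a path of length *k* passes through the weight layers of exactly *k* blocks, so the fraction of such paths is `C(n, k) / 2^n`. A short computation, assuming the paper’s 54-block network:

```python
from math import comb

# Fraction of the 2**n paths that go through exactly k weight layers.
def path_length_fraction(k: int, n: int) -> float:
    return comb(n, k) / 2 ** n

n = 54
bulk = sum(path_length_fraction(k, n) for k in range(19, 36))
print(round(bulk, 2))  # the vast majority of paths have length 19..35
```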

The authors also conducted experiments to investigate the relationship between path length and the magnitude of the gradients flowing through it. To get the magnitude of gradients in the path of length *k*, the authors first fed a batch of data to the network and randomly sampled *k* residual blocks. When backpropagating the gradients, they propagated through the weight layer only for the sampled residual blocks. Their graphs show that the magnitude of gradients decreases rapidly as the path becomes longer.

We can now multiply the frequency of each path length by its expected magnitude of gradients to get a feel for how much paths of each length contribute to training. Surprisingly, most contributions come from paths of length nine to 18, yet they constitute only a tiny portion of the total paths. This is a very interesting finding, as it suggests that ResNet did not solve the vanishing gradients problem for very long paths; rather, it enables training very deep networks by shortening its effective paths.
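As a toy illustration of that multiplication (the exponential decay rate below is made up for illustration; the paper measures gradient magnitudes empirically):

```python
from math import comb

# Contribution of paths of length k: frequency of that length times an
# ASSUMED exponentially decaying gradient magnitude.
def contribution(k: int, n: int, decay: float = 0.5) -> float:
    frequency = comb(n, k) / 2 ** n   # fraction of paths of length k
    grad_magnitude = decay ** k       # assumption, not the paper's data
    return frequency * grad_magnitude

n = 54
peak = max(range(n + 1), key=lambda k: contribution(k, n))
print(peak)  # the peak sits well below the mean path length of 27
```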

In this article, I revisited the compelling ResNet architecture and briefly explained the intuitions behind its recent success. I hope it helps strengthen your understanding of this groundbreaking work.