DeepLab

1. Introduction

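DeepLab V3 was published in 2017 as an arXiv preprint. Compared with DeepLab V2, in my view there are three main changes: 1) Multi-grid is introduced, 2) the ASPP structure is improved, and 3) the CRF post-processing step is removed.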

The two DeepLab V3 model structures

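The paper interleaves experiments on two models: the cascaded model and the ASPP model. The cascaded model does not use the ASPP module, and the ASPP model does not use the cascaded blocks. Note that although the paper proposes both structures, the authors report that the ASPP model is slightly better than the cascaded model, and most of the open-source implementations on GitHub use the ASPP model as well.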

Both our best cascaded model (in Tab. 4) and ASPP model (in Tab. 6) (in both cases without Dense CRF post-processing or MS-COCO pre-training) already outperform DeepLabv2.

cascaded model

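Most of the experiments in the paper revolve around the cascaded model. In the corresponding figure of the paper, the cascaded model is the one in panel (b). Block1, Block2, Block3, and Block4 are the stages of the original ResNet, except that in Block4 the stride of the 3×3 convolution in the first residual unit, and of the 1×1 convolution on its shortcut branch, is changed from 2 to 1 (so no further downsampling takes place), and every ordinary 3×3 convolution in the residual units is replaced with an atrous (dilated) convolution. Block5, Block6, and Block7 are additional stages whose structure is identical to Block4, i.e. each consists of three residual units.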
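To see why changing the stride from 2 to 1 and switching to atrous convolution keeps the resolution, the standard convolution output-size formula can be evaluated for both settings. This is only an illustrative sketch: the helper name `conv_out_size`, the 33×33 feature size, and the padding values are assumptions, not taken from the paper.

```python
def conv_out_size(n, kernel=3, stride=1, padding=0, dilation=1):
    """Output length along one spatial axis of a 2D convolution."""
    # A dilated kernel covers dilation * (kernel - 1) + 1 input positions.
    effective_kernel = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1


# Original ResNet Block4 entry: 3x3 convolution, stride 2, padding 1.
downsampled = conv_out_size(33, kernel=3, stride=2, padding=1)

# Modified Block4: stride 1, dilation 2, padding set equal to the dilation.
preserved = conv_out_size(33, kernel=3, stride=1, padding=2, dilation=2)

print(downsampled, preserved)  # 17 33
```

With the padding set equal to the dilation rate, a 3×3 atrous convolution with stride 1 leaves H and W unchanged for any rate.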

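The paper states that the cascaded model is trained with output_stride=16 (the downsampling ratio of the feature map relative to the input image) but evaluated with output_stride=8, presumably obtained by also removing the downsampling in Block3. With output_stride=16 the final feature map has a smaller H and W, so a larger batch_size fits in memory and training runs faster. The smaller feature map loses detail, however (the paper calls it "coarser"), which is why output_stride=8 is used for evaluation. If GPU memory and compute allow, output_stride can simply be set to 8 for training as well.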

Also note that training with output stride = 16 is several times faster than output stride = 8 since the intermediate feature maps are spatially four times smaller, but at a sacrifice of accuracy since output stride = 16 provides coarser feature maps.
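The "four times smaller" figure in the quote follows directly from the stride arithmetic. The snippet below is a small sketch; the helper name `feature_map_size` and the 512×512 input are assumptions chosen so that the division is exact.

```python
def feature_map_size(input_size, output_stride):
    """Side length of the final feature map, assuming input_size is divisible by output_stride."""
    return input_size // output_stride


size_os16 = feature_map_size(512, 16)
size_os8 = feature_map_size(512, 8)

# Halving the output stride doubles H and W, so the feature map has 4x as many positions.
area_ratio = (size_os8 ** 2) / (size_os16 ** 2)

print(size_os16, size_os8, area_ratio)  # 32 64 4.0
```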

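Also note that the rate annotated in the figure is not the dilation rate actually used by the atrous convolutions. The actual rate is the figure's rate multiplied by the Multi-Grid parameter. For example, with rate=2 in Block4 and Multi-Grid=(1, 2, 4), the actual rates are 2 × (1, 2, 4) = (2, 4, 8). Multi-Grid is discussed further below.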

The final atrous rate for the convolutional layer is equal to the multiplication of the unit rate and the corresponding rate. For example, when output stride = 16 and Multi Grid = (1, 2, 4), the three convolutions will have rates = 2 × (1, 2, 4) = (2, 4, 8) in block4, respectively.
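The rule quoted above is just an element-wise multiplication. A minimal sketch, where the function name `effective_rates` is made up for illustration:

```python
def effective_rates(unit_rate, multi_grid):
    """Multiply a block's unit rate by each Multi-Grid entry."""
    return tuple(unit_rate * g for g in multi_grid)


# Block4 at output stride 16 has unit rate 2 (the example from the paper).
block4_rates = effective_rates(2, (1, 2, 4))
print(block4_rates)  # (2, 4, 8)
```

The same helper gives (2, 4, 2) for Multi-Grid=(1, 2, 1) with the same unit rate.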

Although much of the paper is devoted to the cascaded model and its experiments, the ASPP model is the one used most in practice. Its structure is shown in the figure below.

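Note that, as with the cascaded model, the paper trains with output_stride=16 (the downsampling ratio of the feature map relative to the input image) and evaluates with output_stride=8, whereas the official PyTorch (torchvision) implementation of DeepLabV3 trains directly with output_stride=8.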

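Next, consider the ASPP structure in DeepLab V3. First recall the one in DeepLab V2: it consists of four parallel atrous convolutions, each branch using a different dilation rate (note that these atrous convolution layers are not followed by BatchNorm and do use a bias). The outputs of the four branches are then fused by element-wise addition.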
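The DeepLab V2 style of ASPP described above can be sketched in PyTorch as follows. This is a simplified illustration, not the original implementation: the class name `ASPPV2Sketch`, the channel counts, and the rates (6, 12, 18, 24) are assumptions, since this text does not list the V2 rates.

```python
import torch
from torch import nn


class ASPPV2Sketch(nn.Module):
    """Parallel atrous convolutions whose outputs are summed (no BatchNorm, bias enabled)."""

    def __init__(self, in_ch, out_ch, rates=(6, 12, 18, 24)):  # rates are assumed
        super().__init__()
        self.branches = nn.ModuleList([
            # padding == dilation keeps H and W unchanged for a 3x3 kernel
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r, bias=True)
            for r in rates
        ])

    def forward(self, x):
        # Fuse the branches by element-wise addition.
        return sum(branch(x) for branch in self.branches)


aspp_v2 = ASPPV2Sketch(in_ch=32, out_ch=21)
with torch.no_grad():
    out = aspp_v2(torch.randn(1, 32, 41, 41))
print(tuple(out.shape))  # (1, 21, 41, 41)
```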

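Now look at the ASPP structure in DeepLab V3. It has five branches: a 1×1 convolution, three 3×3 atrous convolutions, and a global average pooling layer (followed by a 1×1 convolution and then bilinear interpolation back to the H and W of the input feature map). According to the paper, the global pooling branch is there to add global context information. The outputs of the five branches are concatenated along the channel dimension, and a final 1×1 convolution fuses the information further.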

[Figure: DeepLab_V3_aspp, the ASPP structure in DeepLab V3]

The paper describes the ASPP structure in the following passage.

Specifically, we apply global average pooling on the last feature map of the model, feed the resulting image-level features to a 1×1 convolution with 256 filters (and batch normalization), and then bilinearly upsample the feature to the desired spatial dimension. In the end, our improved ASPP consists of (a) one 1×1 convolution and three 3×3 convolutions with rates = (6, 12, 18) when output stride = 16 (all with 256 filters and batch normalization), and (b) the image-level features, as shown in Fig. 5. Note that the rates are doubled when output stride = 8. The resulting features from all the branches are then concatenated and pass through another 1×1 convolution (also with 256 filters and batch normalization) before the final 1×1 convolution which generates the final logits.
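The five-branch structure described in the passage can be sketched in PyTorch roughly as follows. Treat it as a minimal sketch under assumptions rather than the authors' code: the names `ASPPSketch` and `conv_bn_relu` are made up, the ReLU after each BatchNorm is an assumption (the quote only mentions batch normalization), and the final 1×1 convolution that produces the logits is left out.

```python
import torch
from torch import nn
import torch.nn.functional as F


def conv_bn_relu(in_ch, out_ch, kernel_size, dilation=1):
    # padding == dilation keeps H and W unchanged for a 3x3 atrous convolution
    padding = 0 if kernel_size == 1 else dilation
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding,
                  dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),  # assumed activation
    )


class ASPPSketch(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        # (a) one 1x1 convolution and three 3x3 atrous convolutions
        self.conv_branches = nn.ModuleList(
            [conv_bn_relu(in_ch, out_ch, 1)]
            + [conv_bn_relu(in_ch, out_ch, 3, dilation=r) for r in rates]
        )
        # (b) image-level features: global average pooling + 1x1 convolution
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            conv_bn_relu(in_ch, out_ch, 1),
        )
        # 1x1 convolution applied to the concatenated branches
        self.project = conv_bn_relu(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = [branch(x) for branch in self.conv_branches]
        pooled = self.image_pool(x)
        # Bilinearly upsample the pooled features back to the input H and W.
        outs.append(F.interpolate(pooled, size=(h, w), mode="bilinear",
                                  align_corners=False))
        return self.project(torch.cat(outs, dim=1))


# eval() so BatchNorm uses running statistics on the 1x1 pooled feature.
aspp = ASPPSketch(in_ch=64).eval()
with torch.no_grad():
    y = aspp(torch.randn(2, 64, 33, 33))
print(tuple(y.shape))  # (2, 256, 33, 33)
```

For output stride 8, the same sketch would be built with `rates=(12, 24, 36)`, following the doubling rule in the quote.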

Multi-grid

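Earlier DeepLab models already used atrous convolution, but their dilation rates were chosen rather arbitrarily. For DeepLab V3 the authors ran experiments to find more reasonable settings. The corresponding table in the paper uses the cascaded model (with ResNet-101 as the backbone) to study how the number of cascaded blocks and their Multi-Grid parameters affect the results. As mentioned in the cascaded model discussion, the dilation rate actually used in a block is the rate from the figure multiplied by the Multi-Grid parameter. The experiments show that with three extra blocks (Block5, Block6, and Block7 added), Multi-Grid=(1, 2, 1) works best. Without any extra blocks (no Block5, Block6, or Block7), Multi-Grid=(1, 2, 4) works best. Since the ASPP model adds no extra blocks, its ablation experiments below use Multi-Grid=(1, 2, 4).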

Ablation experiments

cascaded model ablation experiments

The paper's ablation table for the cascaded model uses the following abbreviations:

MG stands for Multi-Grid; as noted above, MG(1, 2, 1) works best for the cascaded model.

OS stands for output_stride; as noted above, setting output_stride to 8 at evaluation time gives better results.

MS stands for multi-scale inputs, similar to DeepLab V2, except that DeepLab V3 uses more scales: scales={0.5, 0.75, 1.0, 1.25, 1.5, 1.75}.

Flip means additionally feeding in a horizontally flipped copy of the image.
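The MS and Flip options amount to test-time augmentation. A possible sketch in PyTorch is given below; the function name `multi_scale_flip_predict`, the stand-in 1×1 convolution used as `model`, and the choice to average softmax probabilities are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

SCALES = (0.5, 0.75, 1.0, 1.25, 1.5, 1.75)


@torch.no_grad()
def multi_scale_flip_predict(model, image, scales=SCALES, flip=True):
    """Average class probabilities over rescaled (and mirrored) copies of `image`.

    `model` is assumed to map an (N, 3, H, W) tensor to (N, C, h, w) logits.
    """
    h, w = image.shape[-2:]
    prob_sum, count = 0, 0
    for s in scales:
        size = (max(1, round(h * s)), max(1, round(w * s)))
        scaled = F.interpolate(image, size=size, mode="bilinear", align_corners=False)
        inputs = [(scaled, False)]
        if flip:
            inputs.append((torch.flip(scaled, dims=[-1]), True))
        for x, flipped in inputs:
            logits = model(x)
            if flipped:
                # Undo the horizontal flip so predictions align with the original image.
                logits = torch.flip(logits, dims=[-1])
            logits = F.interpolate(logits, size=(h, w), mode="bilinear",
                                   align_corners=False)
            prob_sum = prob_sum + logits.softmax(dim=1)
            count += 1
    return prob_sum / count


model = torch.nn.Conv2d(3, 21, kernel_size=1)  # stand-in for a segmentation network
image = torch.rand(1, 3, 32, 48)
probs = multi_scale_flip_predict(model, image)
print(tuple(probs.shape))  # (1, 21, 32, 48)
```

With six scales and flipping enabled, the model runs twelve times per image, which is why these options are used only at evaluation time.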

ASPP model ablation experiments

The paper's ablation table for the ASPP model uses the following abbreviations:

MG stands for Multi-Grid; as noted above, MG(1, 2, 4) works best for the ASPP model.

Image Pooling means adding the global average pooling branch to ASPP.

OS stands for output_stride; as noted above, setting output_stride to 8 gives better results.

MS stands for multi-scale inputs, similar to DeepLab V2, except that DeepLab V3 uses more scales: scales={0.5, 0.75, 1.0, 1.25, 1.5, 1.75}.

Flip means additionally feeding in a horizontally flipped copy of the image.

COCO means the model is pre-trained on the COCO dataset.

DeepLab V3 model structure in the official PyTorch implementation

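In the official PyTorch (torchvision) implementation of DeepLab V3: 1) Multi-Grid is not used; 2) there is an extra FCNHead auxiliary training branch, which is optional; 3) output_stride is set to 8 for both training and evaluation; 4) the three atrous convolution branches in ASPP use dilation rates 12, 24, and 36, i.e. the paper's (6, 12, 18) doubled for output stride 8.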

In the end, our improved ASPP consists of (a) one 1×1 convolution and three 3×3 convolutions with rates = (6, 12, 18) when output stride = 16 (all with 256 filters and batch normalization), and (b) the image-level features, as shown in Fig. 5. Note that the rates are doubled when output stride = 8.
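The relation between output stride and ASPP rates in this quote can be written down as a tiny helper. The function name `aspp_rates` is hypothetical and only the two output strides mentioned in the quote are handled.

```python
def aspp_rates(output_stride, base_rates=(6, 12, 18)):
    """ASPP dilation rates: base rates at output stride 16, doubled at output stride 8."""
    if output_stride == 16:
        return tuple(base_rates)
    if output_stride == 8:
        return tuple(2 * r for r in base_rates)
    raise ValueError("only output strides 8 and 16 are covered by the quoted rule")


print(aspp_rates(16))  # (6, 12, 18)
print(aspp_rates(8))   # (12, 24, 36)
```

Doubling the rates at output stride 8 keeps the field of view of each branch roughly the same relative to the input image, because the feature map is twice as large in each dimension.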

[Figure: DeepLab_V3_pytorch, the DeepLab V3 structure in the official PyTorch implementation]