DeepLab

1. Introduction

Paper: [original paper](https://arxiv.org/pdf/1606.00915.pdf)

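This paper was posted to arXiv in 2016 and later published in IEEE TPAMI. Comparing the architecture diagrams, DeepLab v2 differs from DeepLab v1 in two main ways: the backbone is swapped (VGG->ResNet, which by itself is worth roughly 3 points), and a new module, ASPP (Atrous Spatial Pyramid Pooling), is introduced. Otherwise the two are much the same.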

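Problems with applying DCNNs to semantic segmentation

As in the previous paper, the authors use the introduction to lay out the problems DCNNs run into on semantic segmentation:

Reduced resolution (mainly caused by downsampling layers with stride > 1)

Objects appearing at multiple scales

The invariance of DCNNs reduces localization accuracy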

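Solutions proposed in the paper

For the loss of resolution, the usual fix is to set the stride of the last few max-pooling layers to 1 (if downsampling is done by convolution, as in ResNet, set that stride to 1 as well) and then use dilated (atrous) convolution in the layers that follow.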

In order to overcome this hurdle and efficiently produce denser feature maps, we remove the downsampling operator from the last few max pooling layers of DCNNs and instead upsample the filters in subsequent convolutional layers, resulting in feature maps computed at a higher sampling rate. Filter upsampling amounts to inserting holes between nonzero filter taps.
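To make "inserting holes between nonzero filter taps" concrete, here is a minimal 1-D sketch in plain Python. The function name `atrous_conv1d` and the toy signal are illustrative assumptions, not code from the paper; the real networks apply the same idea to 2-D multi-channel feature maps.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' 1-D correlation whose kernel taps sit `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1  # extent covered by the dilated kernel
    return [
        sum(w * signal[start + k * rate] for k, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


signal = [0, 1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]

dense = atrous_conv1d(signal, kernel, rate=1)   # ordinary convolution, taps adjacent
atrous = atrous_conv1d(signal, kernel, rate=2)  # same 3 weights, one hole between taps

print(dense)   # [3, 6, 9, 12, 15, 18]
print(atrous)  # [6, 9, 12, 15]
```

With `rate=2` the same three weights cover five input samples, so the field of view grows without adding parameters and without striding over the input.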

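For the multi-scale problem, the most obvious approach is to rescale the image to several sizes, run each through the network, and fuse the results. This works, but the computational cost is too high. To address this, DeepLab V2 proposes the ASPP module (atrous spatial pyramid pooling).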

A standard way to deal with this is to present to the DCNN rescaled versions of the same image and then aggregate the feature or score maps. We show that this approach indeed increases the performance of our system, but comes at the cost of computing feature responses at all DCNN layers for multiple scaled versions of the input image. Instead, motivated by spatial pyramid pooling, we propose a computationally efficient scheme of resampling a given feature layer at multiple rates prior to convolution. This amounts to probing the original image with multiple filters that have complementary effective fields of view, thus capturing objects as well as useful image context at multiple scales. Rather than actually resampling features, we efficiently implement this mapping using multiple parallel atrous convolutional layers with different sampling rates; we call the proposed technique “atrous spatial pyramid pooling” (ASPP).

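For the loss of localization accuracy caused by DCNN invariance, the remedy is much the same as in DeepLab V1: a CRF. The paper uses the fully connected pairwise CRF of [22], chosen for its efficient computation. The gain from the CRF is smaller than it was in DeepLab v1, though: roughly 4 points in DeepLab V1, versus only a little over 1 point in DeepLab V2 according to Table 4.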

Our work explores an alternative approach which we show to be highly effective. In particular, we boost our model’s ability to capture fine details by employing a fully-connected Conditional Random Field (CRF) [22]. CRFs have been broadly used in semantic segmentation to combine class scores computed by multi-way classifiers with the low-level information captured by the local interactions of pixels and edges [23], [24] or superpixels [25]. Even though works of increased sophistication have been proposed to model the hierarchical dependency [26], [27], [28] and/or high-order dependencies of segments [29], [30], [31], [32], [33], we use the fully connected pairwise CRF proposed by [22] for its efficient computation, and ability to capture fine edge details while also catering for long range dependencies.
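The notes above only name the fully connected pairwise CRF without writing it out. As a reminder of what is being optimized, the energy has roughly the following form. The symbols follow my recollection of the paper's notation and should be checked against the original before being relied on.

$$
E(x) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j), \qquad \theta_i(x_i) = -\log P(x_i)
$$

$$
\theta_{ij}(x_i, x_j) = \mu(x_i, x_j)\left[ w_1 \exp\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2} - \frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2}\right) + w_2 \exp\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2}\right) \right]
$$

Here $P(x_i)$ is the label probability the DCNN assigns at pixel $i$, $p$ is the pixel position, $I$ is the pixel color, and $\mu(x_i, x_j) = 1$ when $x_i \neq x_j$ and $0$ otherwise. The first kernel encourages nearby pixels with similar color to share a label; the second only enforces spatial smoothness.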

Advantages of DeepLab V2

The same as those listed for DeepLab V1:

Faster

More accurate (state of the art at the time)

Simple model structure: still a cascade of a DCNN and a CRF

From a practical standpoint, the three main advantages of our DeepLab system are: (1) Speed: by virtue of atrous convolution, our dense DCNN operates at 8 FPS on an NVidia Titan X GPU, while Mean Field Inference for the fully-connected CRF requires 0.5 secs on a CPU. (2) Accuracy: we obtain state-of-art results on several challenging datasets, including the PASCAL VOC 2012 semantic segmentation benchmark [34], PASCAL-Context [35], PASCAL-Person-Part [36], and Cityscapes [37]. (3) Simplicity: our system is composed of a cascade of two very well-established modules, DCNNs and CRFs.

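ASPP (atrous spatial pyramid pooling)

ASPP feels like an upgraded version of LargeFOV from DeepLab V1, with multi-scale behavior added. The figure below is the schematic the original paper uses to introduce ASPP: four branches run in parallel on the feature map output by the backbone. The first layer of every branch is a dilated convolution, but each branch uses a different dilation rate, so each has a different receptive field, which is what lets the module handle objects at multiple scales.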

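The next figure shows the ASPP structure in more detail, using VGG as the example. The feature map output by Pool5 feeds four parallel branches; each branch passes through a 3×3 dilated convolution, a 1×1 convolution, and a final 1×1 convolution whose number of kernels equals num_classes. The four branch outputs are then fused by element-wise addition. With ResNet101 as the backbone, each branch consists of a single 3×3 dilated convolution whose number of kernels equals num_classes.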
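The "parallel branches, then add" pattern can be sketched on a 1-D, single-channel signal in plain Python. This is a toy under stated assumptions: the names `atrous_conv1d_same` and `aspp_1d` are made up for illustration, the per-branch 1×1 layers of the VGG variant are omitted, and the real module works on 2-D multi-channel feature maps. Zero padding equal to the rate keeps every branch output the same length so the outputs can be added.

```python
def atrous_conv1d_same(features, kernel, rate):
    """Odd-length atrous convolution, zero-padded so output length == input length."""
    half = (len(kernel) // 2) * rate
    padded = [0.0] * half + list(features) + [0.0] * half
    return [
        sum(w * padded[i + k * rate] for k, w in enumerate(kernel))
        for i in range(len(features))
    ]


def aspp_1d(features, branch_kernels, rates):
    """Apply one atrous branch per rate to the same input, then add the outputs."""
    branches = [
        atrous_conv1d_same(features, kernel, rate)
        for kernel, rate in zip(branch_kernels, rates)
    ]
    return [sum(values) for values in zip(*branches)]


# An impulse input makes each branch's sampling pattern visible in the output.
impulse = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
kernels = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
fused = aspp_1d(impulse, kernels, rates=(1, 2))
print(fused)  # [0.0, 0.0, 1.0, 1.0, 2.0, 1.0, 1.0, 0.0, 0.0]
```

The rate-1 branch responds at offsets -1, 0, +1 and the rate-2 branch at -2, 0, +2, so the summed output mixes a narrow and a wide view of the same input.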

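The paper gives two ASPP configurations: ASPP-S (branch dilation rates 2, 4, 8, 12) and ASPP-L (branch dilation rates 6, 12, 18, 24). The table below compares LargeFOV, ASPP-S, and ASPP-L. Looking only at the results before CRF, ASPP-L beats ASPP-S, which in turn beats LargeFOV.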
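To get a feel for how different the two configurations are, one can compute the extent each 3×3 branch covers on its input feature map. A k×k kernel with dilation rate r spans k + (k - 1)(r - 1) positions per side. The helper name below is an assumption for illustration.

```python
ASPP_RATES = {
    "ASPP-S": (2, 4, 8, 12),
    "ASPP-L": (6, 12, 18, 24),
}


def effective_kernel_size(kernel_size, rate):
    """Side length covered by a kernel with (rate - 1) holes between adjacent taps."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


extents = {
    name: [effective_kernel_size(3, rate) for rate in rates]
    for name, rates in ASPP_RATES.items()
}
print(extents)
# {'ASPP-S': [5, 9, 17, 25], 'ASPP-L': [13, 25, 37, 49]}
```

These extents are measured on the feature map the branch reads, not on the input image, so the corresponding image-space field of view is considerably larger.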

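DeepLab V2 network structure

Taking ResNet101 as the backbone, the figure below shows the network structure as drawn from the official source code (MSC, i.e. multi-scale input, is not considered here). In the original ResNet, Bottleneck1 of Layer3 downsamples (its 3×3 convolution has stride=2), but DeepLab v2 sets that stride to 1, so no downsampling takes place, and every 3×3 convolution in Layer3 becomes a dilated convolution with rate 2. Layer4 is treated the same way: downsampling is removed and every 3×3 convolution becomes a dilated convolution with rate 4. Finally, note the ASPP module: with ResNet101 as the backbone, each branch has only a single 3×3 dilated convolution, and its number of kernels equals num_classes.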
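The stride-to-dilation bookkeeping described above can be sketched as a small planning function. This is a stage-level illustration, not the official implementation: the name `convert_strides` is made up, and the target output stride of 8 is an assumption inferred from removing the two stride-2 steps in Layer3 and Layer4 of a network whose usual total stride is 32.

```python
def convert_strides(stage_strides, output_stride):
    """Return a (stride, dilation) pair per stage, capping total stride at output_stride."""
    current_stride = 1
    dilation = 1
    plan = []
    for stride in stage_strides:
        if current_stride >= output_stride:
            # Stride budget is used up: skip the downsampling and
            # multiply the dilation rate by the stride that was removed.
            dilation *= stride
            plan.append((1, dilation))
        else:
            current_stride *= stride
            plan.append((stride, 1))
    return plan


# Standard ResNet strides: stem conv, max pool, Layer1, Layer2, Layer3, Layer4
RESNET_STRIDES = [2, 2, 1, 2, 2, 2]
plan = convert_strides(RESNET_STRIDES, output_stride=8)
print(plan)  # [(2, 1), (2, 1), (1, 1), (2, 1), (1, 2), (1, 4)]
```

The last two entries reproduce the settings in the text: Layer3 runs with stride 1 and dilation 2, Layer4 with stride 1 and dilation 4.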

[Figure: DeepLab_v2 network structure]

Learning rate policy

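The learning rate policy used to train DeepLab V2 is called poly. It works better than the ordinary step policy (lowering the learning rate once every fixed number of steps); the paper reports a gain of up to 3.63 points. The poly schedule is:

$lr \times \left(1 - \frac{iter}{max\_iter}\right)^{power}$

where lr is the initial learning rate, iter is the current iteration (step), max_iter is the total number of training iterations, and power is an exponent that controls the shape of the decay (the paper uses power = 0.9).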
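The schedule is a one-liner in code. The sketch below is a direct transcription of the formula; the function name `poly_lr` and the sample values (base rate 0.001, 20000 iterations) are illustrative assumptions.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Learning rate at `iteration` under the poly policy."""
    return base_lr * (1 - iteration / max_iter) ** power


base_lr, max_iter = 0.001, 20000
schedule = [poly_lr(base_lr, it, max_iter) for it in (0, 5000, 10000, 15000, 20000)]
for it, lr in zip((0, 5000, 10000, 15000, 20000), schedule):
    print(f"iter {it:>5}: lr = {lr:.6f}")
```

Unlike step, the rate decays a little at every iteration and reaches exactly zero at max_iter; with power = 1 the decay is linear, and with power < 1 the rate stays higher for longer before dropping off.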

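Ablation experiments

The table below lists some of the ablation comparisons given in the original paper.

where

MSC means multi-scale input: the image is rescaled to three scales (0.5, 0.75, and 1.0), each is fed through the network to produce score maps, and the three score maps are fused by taking the maximum across them at each position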

Multi-scale inputs: We separately feed to the DCNN images at scale = {0.5, 0.75, 1}, fusing their score maps by taking the maximum response across scales for each position separately

COCO means the model was pretrained on the COCO dataset

Models pretrained on MS-COCO

Aug means data augmentation, here random scaling of the input images by a factor between 0.5 and 1.5.

Data augmentation by randomly scaling the input images (from 0.5 to 1.5) during training

LargeFOV is the structure already covered under DeepLab V1
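The MSC fusion step listed above (maximum response across scales at each position) can be sketched in plain Python. The sketch assumes the three score maps have already been resized to a common resolution and belong to a single class; the name `fuse_max` and the numbers are made up for illustration.

```python
def fuse_max(score_maps):
    """Element-wise maximum over score maps of identical shape (lists of rows)."""
    return [
        [max(values) for values in zip(*rows)]
        for rows in zip(*score_maps)
    ]


# Hypothetical 2x2 score maps for one class, one per input scale.
scores_050 = [[0.1, 0.9], [0.4, 0.2]]
scores_075 = [[0.3, 0.5], [0.6, 0.1]]
scores_100 = [[0.2, 0.7], [0.5, 0.8]]

fused = fuse_max([scores_050, scores_075, scores_100])
print(fused)  # [[0.3, 0.9], [0.6, 0.8]]
```

Each output position keeps the score from whichever scale responded most strongly there, so different regions of the image can effectively be decided at different scales.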