My reading order was a bit off: I read v3+ first and only then this paper. In hindsight, when entering a new area it is more efficient to start with the older papers, or with a survey.
Rethinking Atrous Convolution for Semantic Image Segmentation
Liang-Chieh Chen et al.
Google Inc.
1. Introduction
Applying deep learning to semantic segmentation faces two challenges. First, the strides in pooling and convolution layers reduce feature resolution: this lets the network learn increasingly abstract representations, but throws away spatial information. Atrous convolution lets us take a network pretrained on ImageNet, remove the downsampling in its last few layers, and correspondingly upsample the filter kernels (i.e., insert holes) instead. The atrous rate gives direct control over the resolution of the features the network computes, without learning any extra parameters.
The other challenge comes from the existence of objects at multiple scales. Figure 2 illustrates the four families of methods considered for this problem. First, image pyramids extract features from the input at several scales, so objects of different sizes become prominent on different feature maps. Second, encoder-decoder structures exploit the multi-scale features from the encoder and recover spatial resolution in the decoder. Third, extra modules are cascaded on top of the original network to capture longer-range information; DenseCRF has been used to encode pixel-level pairwise similarity, and extra convolutional layers can be stacked to gradually enlarge the context. Fourth, spatial pyramid pooling probes the input feature map with filters or pooling operations at multiple rates and effective fields-of-view, thereby capturing objects at several scales.
In this paper, we revisit the use of atrous convolution. The proposed modules consist of atrous convolutions with multiple rates plus batch normalization layers, and we experiment with laying the modules out both in cascade and in parallel (the parallel arrangement being Atrous Spatial Pyramid Pooling, ASPP).
2. Related Work
Image pyramid: Typically the same model, with shared weights, is applied to inputs at several scales. Features from the small-scale inputs encode long-range context, while the large-scale inputs preserve the details of small objects. A typical example is Farabet et al. [22], who transform the input image through a Laplacian pyramid, feed each scale to a DCNN, and merge the feature maps. [19, 69] process multi-scale inputs sequentially from coarse to fine, while [55, 12, 11] directly resize the input to several scales and fuse their features. The main drawback is that GPU memory limits prevent this scheme from scaling to deeper DCNNs, so it is usually applied only at the inference stage.
Encoder-decoder: The model consists of two parts: an encoder whose feature maps gradually shrink in spatial resolution, so that deeper outputs more easily capture longer-range information, and a decoder that gradually recovers object detail and spatial resolution. For example, [60, 64] learn deconvolutions to upsample low-resolution features. SegNet reuses the encoder's pooling indices and learns extra convolutional layers to densify the features; U-Net adds skip connections between corresponding encoder and decoder features; and [25] employs a Laplacian pyramid reconstruction network. More recently, RefineNet and [70, 68, 39] have demonstrated the effectiveness of encoder-decoder structures for semantic segmentation. This kind of structure is also used in object detection.
Context module: This family cascades extra modules onto the network to encode long-range context. One effective method incorporates DenseCRF (with its efficient high-dimensional filtering algorithm) into the network; [96, 55, 73] propose to jointly train the DenseCRF and the DCNN. [59, 90] apply several extra convolutional layers on top of the network's belief map (the final features, whose channel count equals the number of predicted classes) to capture context. More recently, [41] proposed learning a general, sparse high-dimensional convolution (bilateral convolution), and [82, 8] combine DCNNs with Gaussian conditional random fields for semantic segmentation.
Spatial pyramid pooling: This family uses spatial pyramid pooling to capture context at several ranges. ParseNet exploits image-level features for global context. DeepLabv2 proposes atrous spatial pyramid pooling (ASPP), i.e., parallel atrous convolution layers with different rates. Recently, the Pyramid Scene Parsing Network (PSPNet) performs spatial pooling at several grid scales with strong results. There are also methods [53, 6, 88] that use LSTMs [35] to aggregate global context. Spatial pyramid pooling has been applied to object detection as well.
In this work, we mainly exploit atrous convolution as the tool in both the context-module and the spatial-pyramid-pooling schemes. We duplicate the last ResNet block several times and cascade the copies, and we find batch normalization important.
Atrous convolution: this part surveys models that employ atrous convolution.
3. Methods
3.1 Atrous Convolution for Dense Feature Extraction
Recent DCNNs typically shrink each spatial dimension by a factor of 32. Deconvolution (transposed convolution) has been employed to recover spatial resolution. We instead advocate atrous convolution, which predates its use in DCNNs: it was originally developed for the undecimated wavelet transform.
For a two-dimensional signal, at each location $i$ of the output $y$, atrous convolution with filter $w$ is applied over the input feature map $x$ as

$$y[i] = \sum_k x[i + r \cdot k]\, w[k]$$

where the rate $r$ controls the stride at which we sample the input signal. See Figure 1.
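To make the sampling interpretation concrete, here is a minimal PyTorch check (tensor sizes are arbitrary, not from the paper) that a rate-$r$ atrous convolution equals an ordinary convolution whose kernel is upsampled by inserting $r-1$ zeros between the weights:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 16, 16)  # input feature map x
w = torch.randn(1, 1, 3, 3)    # 3x3 filter w
r = 2                          # atrous rate

# Atrous convolution: sample the input with stride r between kernel taps.
y_atrous = F.conv2d(x, w, dilation=r)

# Equivalent view: upsample the kernel by inserting r-1 zeros ("holes"),
# then run an ordinary convolution.
w_up = torch.zeros(1, 1, 2 * r + 1, 2 * r + 1)
w_up[:, :, ::r, ::r] = w
y_holes = F.conv2d(x, w_up)

print(torch.allclose(y_atrous, y_holes))  # True
```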
We denote by output_stride the ratio of the input image's spatial resolution to that of the final output. DCNNs built for classification usually have output_stride = 32. To double the resolution of the output features, the stride of the last pooling or convolutional layer that reduces resolution is set to 1, and all subsequent convolutional layers are replaced with atrous convolutions with r = 2. This extracts denser features without learning any extra parameters. See [11] for details.
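As one hedged way to reproduce this recipe, torchvision's ResNet exposes a `replace_stride_with_dilation` flag that performs exactly this stride-to-rate conversion on selected stages (the 224×224 input below is just for shape checking):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def features(model: nn.Module) -> nn.Module:
    # Drop the classification head (avgpool + fc), keep the conv trunk.
    return nn.Sequential(*list(model.children())[:-2])

x = torch.randn(1, 3, 224, 224)

# Standard classification backbone: output_stride = 32.
os32 = features(resnet50())
print(os32(x).shape)  # torch.Size([1, 2048, 7, 7])

# Last downsampling stride set to 1, subsequent convs dilated with r = 2:
# output_stride = 16, denser features, no extra parameters to learn.
os16 = features(resnet50(replace_stride_with_dilation=[False, False, True]))
print(os16(x).shape)  # torch.Size([1, 2048, 14, 14])
```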
3.2 Going Deeper with Atrous Convolution
We first explore cascading atrous convolutions: we duplicate several copies of the last ResNet block, denoted block4 in Figure 3, and chain them in a cascade. Each block contains three $3\times3$ convolutions and, as in the original ResNet, the last convolution of every block except the final one has stride 2. The motivation is that the introduced striding makes it easy to capture long-range information in the deeper blocks: as shown in Figure 3a, the information of the whole image is eventually summarized into a tiny feature map, which is harmful for semantic segmentation. Hence, as in Figure 3b, we apply atrous convolutions instead (see the rate sketch below); without them, the final output_stride would reach 256.
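A minimal sketch (my reconstruction, not the authors' code) of how the rates would be chosen in the cascaded variant: each copied block would nominally double output_stride, so converting its stride-2 convolution into an atrous convolution with rate nominal/desired output_stride keeps the resolution fixed:

```python
desired_os = 16
nominal_os = 16  # resolution already reached when entering block4

for name in ["block4", "block5", "block6", "block7"]:
    nominal_os *= 2  # the block's stride-2 conv would halve resolution
    rate = nominal_os // desired_os
    print(f"{name}: stride -> 1, atrous rate = {rate}")
# block4: rate 2, block5: rate 4, block6: rate 8, block7: rate 16
# Without atrous convolution, nominal_os would end at 256, as noted above.
```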
3.3 Atrous Spatial Pyramid Pooling
We revisit the ASPP proposed in [11], in which four parallel atrous convolutions with different rates are applied on top of the feature map. Unlike [11], we include batch normalization within ASPP.
ASPP with different rates effectively captures multi-scale information. However, we find that as the rate grows, the number of valid filter weights (weights applied to the valid feature region rather than to zero padding) shrinks; this effect is shown in Figure 4. In the extreme case where the rate approaches the feature-map size, the $3\times3$ filter degenerates to a $1\times1$ filter, since only the center weight remains effective.
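A small numeric check of this degeneration (a PyTorch sketch, sizes arbitrary): convolving an all-ones feature map with an all-ones $3\times3$ kernel makes each output value count how many of the 9 weights hit valid input rather than zero padding.

```python
import torch
import torch.nn.functional as F

ones = torch.ones(1, 1, 9, 9)      # 9x9 feature map of ones
kernel = torch.ones(1, 1, 3, 3)    # all-ones 3x3 filter

for rate in [1, 2, 4, 9]:
    # padding=rate keeps the output the same size as the input.
    out = F.conv2d(ones, kernel, dilation=rate, padding=rate)
    print(f"rate={rate}: valid weights at center = {int(out[0, 0, 4, 4])}, "
          f"max anywhere = {int(out.max())}")
# Once the rate reaches the feature-map size (rate=9 here), every output
# position sees only 1 valid weight: the 3x3 filter has become 1x1.
```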
To overcome this problem and incorporate global context into the model, we adopt image-level features similar to [58, 95]. Specifically, we apply global average pooling on the last feature map of the model, feed the result to a $1\times1$ convolution with 256 filters (and batch normalization), and then bilinearly upsample it to the desired spatial dimensions. In the end, the proposed ASPP consists of (a) one $1\times1$ convolution and three $3\times3$ atrous convolutions with rates (6, 12, 18) when output_stride = 16, and (b) the image-level features, as shown in Figure 5. The feature maps from all branches are concatenated and passed through another $1\times1$ convolution (again with 256 filters and batch normalization), followed by a final $1\times1$ convolution that produces the output logits.
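Putting the pieces together, here is a compact PyTorch sketch of the ASPP head described above (module names, the ReLU placement, and the `num_classes` argument are my assumptions, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch: int = 2048, out_ch: int = 256, num_classes: int = 21):
        super().__init__()
        def conv_bn(k, rate):
            pad = 0 if k == 1 else rate
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=rate, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # (a) one 1x1 conv and three 3x3 atrous convs with rates (6, 12, 18)
        #     for output_stride = 16.
        self.branches = nn.ModuleList(
            [conv_bn(1, 1)] + [conv_bn(3, r) for r in (6, 12, 18)])
        # (b) image-level features: global average pooling -> 1x1 conv + BN.
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # Fuse the 5 concatenated branches, then project to class logits.
        self.project = nn.Sequential(
            nn.Conv2d(5 * out_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(out_ch, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[2:],
                               mode="bilinear", align_corners=False)
        return self.classifier(self.project(torch.cat(feats + [pooled], dim=1)))

aspp = ASPP().eval()  # eval(): BN on the 1x1 pooled branch needs no batch stats
with torch.no_grad():
    logits = aspp(torch.randn(1, 2048, 33, 33))  # e.g. 513x513 crop at os=16
print(logits.shape)  # torch.Size([1, 21, 33, 33])
```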
4. Experimental Evaluation
For training hyperparameters and further details, please refer to the original paper.
References
[1] M. Abadi, A. Agarwal, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
[2] A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. In Eurographics, 2010.
[3] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561, 2015.
[4] A. Brandt. Multi-level adaptive solutions to boundary-value problems. Mathematics of Computation, 31(138):333–390, 1977.
[5] W. L. Briggs, V. E. Henson, and S. F. McCormick. A multigrid tutorial. SIAM, 2000.
[6] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene labeling with lstm recurrent neural networks. In CVPR, 2015.
[7] H. Caesar, J. Uijlings, and V. Ferrari. COCO-Stuff: Thing and stuff classes in context. arXiv:1612.03716, 2016.
[8] S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. arXiv:1603.08358, 2016.
[9] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In CVPR, 2016.
[10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
[11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv:1606.00915, 2016.
[12] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
[13] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv:1610.02357, 2016.
[14] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[15] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. arXiv:1412.1283, 2014.
[16] J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.
[17] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. arXiv:1605.06409, 2016.
[18] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. arXiv:1703.06211, 2017.
[19] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. arXiv:1411.4734, 2014.
[20] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2014.
[21] H. Fan, X. Mei, D. Prokhorov, and H. Ling. Multi-level contextual rnns with attention model for scene labeling. arXiv:1607.02537, 2016.
[22] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 2013.
[23] J. Fu, J. Liu, Y. Wang, and H. Lu. Stacked deconvolutional network for semantic segmentation. arXiv:1708.04943, 2017.
[24] R. Gadde, V. Jampani, and P. V. Gehler. Semantic video cnns through representation warping. In ICCV, 2017.
[25] G. Ghiasi and C. C. Fowlkes. Laplacian reconstruction and refinement for semantic segmentation. arXiv:1605.02264, 2016.
[26] A. Giusti, D. Ciresan, J. Masci, L. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks. In ICIP, 2013.
[27] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In ICCV, 2009.
[28] K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In ICCV, 2005.
[29] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[30] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[31] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[32] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
[33] X. He, R. S. Zemel, and M. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, 2004.
[34] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS, 2014.
[35] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[36] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space, pages 289–297. 1989.
[37] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017.
[38] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
[39] M. A. Islam, M. Rochan, N. D. Bruce, and Y. Wang. Gated feedback refinement network for dense image labeling. In CVPR, 2017.
[40] S. D. Jain, B. Xiong, and K. Grauman. Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In CVPR, 2017.
[41] V. Jampani, M. Kiefel, and P. V. Gehler. Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks. In CVPR, 2016.
[42] X. Jin, X. Li, H. Xiao, X. Shen, Z. Lin, J. Yang, Y. Chen, J. Dong, L. Liu, Z. Jie, J. Feng, and S. Yan. Video scene parsing with predictive feature learning. In ICCV, 2017.
[43] P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. IJCV, 82(3):302–324, 2009.
[44] S. Kong and C. Fowlkes. Recurrent scene parsing with perspective understanding in the loop. arXiv:1705.07238, 2017.
[45] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011.
[46] I. Krešo, S. Šegvić, and J. Krapac. Ladder-style densenets for semantic segmentation of large natural images. In ICCV CVRSUAD workshop, 2017.
[47] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[48] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr. Associative hierarchical crfs for object class image segmentation. In ICCV, 2009.
[49] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[50] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[51] X. Li, Z. Jie, W. Wang, C. Liu, J. Yang, X. Shen, Z. Lin, Q. Chen, S. Yan, and J. Feng. Foveanet: Perspective-aware urban scene parsing. arXiv:1708.02421, 2017.
[52] X. Li, Z. Liu, P. Luo, C. C. Loy, and X. Tang. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. arXiv:1704.01344, 2017.
[53] X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan. Semantic object parsing with local-global long short-term memory. arXiv:1511.04510, 2015.
[54] G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. arXiv:1611.06612, 2016.
[55] G. Lin, C. Shen, I. Reid, et al. Efficient piecewise training of deep structured models for semantic segmentation. arXiv:1504.01013, 2015.
[56] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. arXiv:1612.03144, 2016.
[57] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[58] W. Liu, A. Rabinovich, and A. C. Berg. Parsenet: Looking wider to see better. arXiv:1506.04579, 2015.
[59] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015.
[60] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[61] P. Luo, G. Wang, L. Lin, and X. Wang. Deep dual learning for semantic image segmentation. In ICCV, 2017.
[62] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015.
[63] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
[64] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
[65] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a dcnn for semantic image segmentation. In ICCV, 2015.
[66] G. Papandreou, I. Kokkinos, and P.-A. Savalle. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In CVPR, 2015.
[67] G. Papandreou and P. Maragos. Multigrid geometric active contour models. TIP, 16(1):229–240, 2007.
[68] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters – improve semantic segmentation by global convolutional network. arXiv:1703.02719, 2017.
[69] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014.
[70] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-resolution residual networks for semantic segmentation in street scenes. arXiv:1611.08323, 2016.
[71] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[72] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[73] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv:1503.02351, 2015.
[74] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013.
[75] F. Shen, R. Gan, S. Yan, and G. Zeng. Semantic segmentation via structured patch prediction, context crf and guidance crf. In CVPR, 2017.
[76] J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 2009.
[77] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851, 2016.
[78] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[79] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
[80] H. Sun, D. Xie, and S. Pu. Mixed context networks for semantic segmentation. arXiv:1610.05854, 2016.
[81] D. Terzopoulos. Image analysis using multigrid relaxation methods. TPAMI, (2):129–139, 1986.
[82] R. Vemulapalli, O. Tuzel, M.-Y. Liu, and R. Chellappa. Gaussian conditional random field network for semantic segmentation. In CVPR, 2016.
[83] G. Wang, P. Luo, L. Lin, and X. Wang. Learning object interactions and descriptions for semantic image segmentation. In CVPR, 2017.
[84] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell. Understanding convolution for semantic segmentation. arXiv:1702.08502, 2017.
[85] Z. Wu, C. Shen, and A. van den Hengel. Bridging category-level and instance-level semantic image segmentation. arXiv:1605.06885, 2016.
[86] Z. Wu, C. Shen, and A. van den Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. arXiv:1611.10080, 2016.
[87] F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille. Zoom better to see clearer: Human part segmentation with auto zoom net. arXiv:1511.06881, 2015.
[88] Z. Yan, H. Zhang, Y. Jia, T. Breuel, and Y. Yu. Combining the best of convolutional layers and recurrent layers: A hybrid network for semantic segmentation. arXiv:1603.04871, 2016.
[89] J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In CVPR, 2012.
[90] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[91] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv:1605.07146, 2016.
[92] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, 2011.
[93] R. Zhang, S. Tang, M. Lin, J. Li, and S. Yan. Global-residual and local-boundary refinement networks for rectifying scene parsing predictions. IJCAI, 2017.
[94] R. Zhang, S. Tang, Y. Zhang, J. Li, and S. Yan. Scale-adaptive convolutions for scene parsing. In ICCV, 2017.
[95] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. arXiv:1612.01105, 2016.
[96] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.
[97] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In CVPR, 2017.