ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun
Megvii Inc (Face++)


1. Introduction

ShuffleNet is designed for mobile devices with very limited computing power (10–150 MFLOPs). On an ARM-based device, it achieves roughly a 13× actual speedup over AlexNet while maintaining comparable accuracy.

Many existing works focus on pruning, compressing, or half-precision representation of a base network. We instead propose a highly efficient architecture.

We notice that state-of-the-art architectures such as Xception and ResNeXt become less efficient in extremely small networks because of the dense and costly $1\times1$ convolutions. We propose using group pointwise convolutions to reduce the computation cost of $1\times1$ convolutions. To overcome the side effect of group convolutions, we introduce a channel shuffle operation that helps information flow across feature channels. Based on these two techniques, we build a new architecture named ShuffleNet, which allows more feature-map channels under the same computation budget.

2. Related Work

Group Convolution. This concept was first introduced in AlexNet for distributing the model over two GPUs, and its effectiveness was later well demonstrated in ResNeXt [40]. Depthwise separable convolution, proposed in Xception [3], generalizes the idea of separable convolutions in the Inception series. Recently, MobileNet [12] adopted depthwise separable convolutions and achieved state-of-the-art results among lightweight models. Our work generalizes group convolution and depthwise separable convolution in a novel form.

3. Approach

3.1 Channel Shuffle for Group Convolutions

Modern CNNs usually consist of repeated building blocks. State-of-the-art networks such as Xception and ResNeXt introduce efficient depthwise separable convolutions and group convolutions. However, neither design fully accounts for the non-negligible cost of the $1\times1$ convolutions (called pointwise convolutions in [12]). For example, in ResNeXt only the $3\times3$ layers are equipped with group convolutions. As a result, pointwise convolutions occupy 93.4% of the multiplication-adds in each residual block of ResNeXt (with cardinality = 32 as suggested in [40]). In tiny networks, the expensive pointwise convolutions limit the number of channels under a given complexity budget, which significantly damages accuracy.
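
To make the 93.4% figure concrete, here is the arithmetic for a single ResNeXt bottleneck written out as a short Python check. The sizes $c=256$ and $m=128$ are our own assumption for illustration (matching a typical first-stage ResNeXt-50 block); only the ratio matters:

```python
# Multiplication-adds per spatial position in one ResNeXt residual block:
# dense 1x1 reduce (c -> m), grouped 3x3 (m -> m, g groups), dense 1x1 expand (m -> c)
c, m, g = 256, 128, 32            # assumed sizes; cardinality g = 32 as in [40]
pointwise = c * m + m * c         # the two dense pointwise convolutions
grouped_3x3 = 9 * m * m // g      # the 3x3 group convolution
print(pointwise / (pointwise + grouped_3x3))  # ~0.934, i.e. ~93.4%
```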

Figure 1: channel shuffle with two stacked group convolutions. (a) two stacked group convolutions with no cross-group information flow; (b) group convolutions taking inputs from different groups; (c) an equivalent implementation using channel shuffle.

A straightforward solution is to apply channel-sparse connections, e.g., group convolutions, on the $1\times1$ layers. Group convolution significantly reduces computation cost by ensuring that each convolution operates only on the corresponding input channel group. However, stacking multiple group convolutions has a side effect: outputs from a certain channel are derived from only a small fraction of the input channels. Figure 1a illustrates two stacked group convolution layers, where outputs from a certain group relate only to inputs within that group. This property blocks information flow between channel groups and weakens representation.

If we allow a group convolution to obtain input data from different groups (as shown in Figure 1b), the input and output channels become fully related. Specifically, for the feature map generated by the previous group layer, we can first divide the channels in each group into several subgroups, then feed each group in the next layer with different subgroups. This can be efficiently and elegantly implemented by a channel shuffle operation (Figure 1c): suppose a convolutional layer with $g$ groups whose output has $g\times n$ channels; we first reshape the output channel dimension into $(g, n)$, transpose it, and then flatten it back as the input of the next layer. Note that the operation still takes effect even when the two layers have different numbers of groups. Channel shuffle is also differentiable, so it can be embedded in networks trained end-to-end.
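
The reshape-transpose-flatten recipe maps directly onto a few tensor operations. A minimal PyTorch sketch (the function name channel_shuffle is ours; the operation itself is exactly as described above):

```python
import torch

def channel_shuffle(x: torch.Tensor, g: int) -> torch.Tensor:
    """Shuffle the channels of an NCHW tensor across g groups."""
    batch, channels, height, width = x.shape
    assert channels % g == 0, "channel count must be divisible by g"
    # reshape channels into (g, n), transpose to (n, g), flatten back
    x = x.view(batch, g, channels // g, height, width)
    x = x.transpose(1, 2).contiguous()
    return x.view(batch, channels, height, width)

# sanity check: with g = 2, channels [0,1,2,3,4,5] become [0,3,1,4,2,5]
x = torch.arange(6.0).view(1, 6, 1, 1)
print(channel_shuffle(x, 2).flatten().tolist())  # [0.0, 3.0, 1.0, 4.0, 2.0, 5.0]
```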

Figure 2: ShuffleNet units. (a) bottleneck unit with a $3\times3$ depthwise convolution; (b) ShuffleNet unit with pointwise group convolution and channel shuffle; (c) ShuffleNet unit with stride = 2.

3.2 ShuffleNet Unit

We start from the bottleneck unit in Figure 2a, a residual block. In its residual branch, for the $3\times3$ layer we apply a computationally economical $3\times3$ depthwise convolution in place of the original $3\times3$ convolution. Then we replace the first $1\times1$ layer with a pointwise group convolution followed by a channel shuffle operation, forming the ShuffleNet unit in Figure 2b. The purpose of the second pointwise group convolution is to recover the channel dimension to match the shortcut path. For simplicity, we do not apply an extra channel shuffle after the second pointwise group convolution. For the case where the unit is applied with stride, we make two modifications (Figure 2c): (i) add a $3\times3$ average pooling on the shortcut path; (ii) replace the element-wise addition with channel concatenation, which enlarges the channel dimension with little extra computation cost.
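
Reading this description literally gives the following PyTorch sketch of the unit (GConv → shuffle → $3\times3$ DWConv → GConv; the BN/ReLU placement follows the usual residual-block convention and is our assumption where the text is silent; channel_shuffle is the function sketched earlier):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShuffleUnit(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, g: int = 3, stride: int = 1):
        super().__init__()
        self.g, self.stride = g, stride
        mid = out_ch // 4                        # bottleneck channels = 1/4 output
        if stride == 2:                          # Fig. 2c: output is concatenated
            out_ch -= in_ch                      # with the avg-pooled shortcut
        self.gconv1 = nn.Conv2d(in_ch, mid, 1, groups=g, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.dw = nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                            groups=mid, bias=False)  # 3x3 depthwise
        self.bn2 = nn.BatchNorm2d(mid)
        self.gconv2 = nn.Conv2d(mid, out_ch, 1, groups=g, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.gconv1(x)))   # pointwise group conv
        out = channel_shuffle(out, self.g)       # channel shuffle
        out = self.bn2(self.dw(out))             # depthwise conv
        out = self.bn3(self.gconv2(out))         # pointwise group conv, no shuffle
        if self.stride == 2:                     # Fig. 2c: pool shortcut + concat
            return F.relu(torch.cat(
                [F.avg_pool2d(x, 3, stride=2, padding=1), out], dim=1))
        return F.relu(x + out)                   # Fig. 2b: identity add
```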

Under the same settings, our unit has lower complexity than ResNet and ResNeXt. Given the input size $c \times h \times w$ and bottleneck channels $m$, a ResNet unit requires $hw(2cm+9m^2)$ FLOPs and a ResNeXt unit requires $hw(2cm+9m^2/g)$ FLOPs, while our ShuffleNet unit requires only $hw(2cm/g+9m)$ FLOPs, where $g$ is the number of groups.
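
Plugging sample numbers into these formulas shows the gap directly (the sizes below are our own illustrative choice, not from the paper):

```python
# FLOPs of one unit at an assumed input of c=256 channels on a 28x28 map,
# bottleneck m=64, group number g=3
c, m, h, w, g = 256, 64, 28, 28, 3
resnet     = h * w * (2 * c * m + 9 * m * m)       # ~54.6 MFLOPs
resnext    = h * w * (2 * c * m + 9 * m * m // g)  # ~35.3 MFLOPs
shufflenet = h * w * (2 * c * m // g + 9 * m)      # ~9.0 MFLOPs
print(resnet / 1e6, resnext / 1e6, shufflenet / 1e6)
```

Under a fixed FLOPs budget, the ShuffleNet unit can therefore afford several times more channels, which is the source of the wider networks mentioned in the introduction.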

In addition, ShuffleNet applies depthwise convolution only on the bottleneck feature maps. Although depthwise convolution has very low theoretical complexity, we find it difficult to implement efficiently on low-power mobile devices, presumably because of its worse computation-to-memory-access ratio compared with other dense operations. This issue is also mentioned in [3].

Table 1: ShuffleNet architecture.

3.3 Network Architecture

The overall ShuffleNet architecture is presented in Table 1; it is mainly a stack of ShuffleNet units grouped into three stages. The first building block in each stage is applied with stride 2. Other hyperparameters stay the same within a stage, and the number of output channels is doubled for the next stage. Similar to [9], we set the number of bottleneck channels of each unit to 1/4 of its output channels. Our intent is to keep the architecture as simple as possible, although we found that further hyperparameter tuning can yield better results.
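
A minimal sketch of how these stages stack (ShuffleUnit is the class sketched above; the repeat counts and channel widths are placeholders standing in for the Table 1 entries):

```python
import torch.nn as nn

def make_stage(in_ch: int, out_ch: int, repeats: int, g: int = 3) -> nn.Sequential:
    # the first block of each stage downsamples with stride 2
    blocks = [ShuffleUnit(in_ch, out_ch, g, stride=2)]
    blocks += [ShuffleUnit(out_ch, out_ch, g, stride=1) for _ in range(repeats - 1)]
    return nn.Sequential(*blocks)

# placeholder widths/repeats; consult Table 1 for the real values per g
stages = nn.Sequential(
    make_stage(24, 240, 4, g=3),   # Stage 2
    make_stage(240, 480, 8, g=3),  # Stage 3: output channels doubled
    make_stage(480, 960, 4, g=3),  # Stage 4: doubled again
)
```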

In ShuffleNet units, the group number $g$ controls the connection sparsity of the pointwise convolutions. Table 1 explores different values of $g$, where we adjust the output channels so that the overall computation cost stays roughly constant (~140 MFLOPs). Clearly, under a given complexity constraint, a larger $g$ yields more output channels, which helps encode more information, though it may also lead to degradation of individual convolutional filters because of the limited corresponding input channels.

To customize the network to a desired complexity, we simply apply a scale factor $s$ to the number of channels. The networks in Table 1 are denoted ShuffleNet 1×; ShuffleNet s× scales the number of filters by $s$, so its overall complexity is roughly $s^2$ times that of ShuffleNet 1×.

4. Experiments

We evaluate our models on the ImageNet 2012 classification dataset. Our training settings and hyperparameters follow [40], with two exceptions: (i) we set the weight decay to 4e-5 instead of 1e-4 and use a linear-decay learning rate policy (decreased from 0.5 to 0); (ii) we use slightly less aggressive scale augmentation for data preprocessing. Similar modifications are also used in [12], because such small networks usually suffer from underfitting rather than overfitting. Training a model for $3\times 10^5$ iterations on 4 GPUs with a batch size of 1024 takes 1 to 2 days. For evaluation, we report single-crop results with a 224×224 center crop from a 256× input image.
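
The linear-decay policy in (i) is simple enough to state exactly; a minimal sketch (the endpoints 0.5 → 0 over $3\times10^5$ iterations are from the text, the code framing is ours):

```python
def linear_decay_lr(step: int, total_steps: int = 300_000,
                    base_lr: float = 0.5) -> float:
    """Learning rate decreased linearly from base_lr at step 0 to 0 at the end."""
    return base_lr * (1.0 - step / total_steps)
```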

4.1 Ablation Study

Table 2: classification results with different group numbers g.

4.1.1 Pointwise Group Convolutions

Results are shown in Table 2. Models with pointwise group convolutions (g > 1) clearly outperform their counterparts without them, and smaller models benefit more from group convolutions.

Table 2 also shows that for some models (e.g., ShuffleNet 0.5×), accuracy saturates or even drops when g becomes relatively large. As g increases, the number of input channels for each convolutional filter decreases, which may harm representation capacity. For smaller models, however, the gains from group convolutions are more consistent.

Table 3: ShuffleNet with and without channel shuffle.

4.1.2 Channel Shuffle vs. No Shuffle

The purpose of the shuffle operation is to enable cross-group information flow when multiple group convolution layers are stacked. Table 3 compares the performance of ShuffleNet structures with and without channel shuffle.

Table 4: comparison with other structure units.

4.2. Comparison with Other Structure Units

We replace the ShuffleNet units with other popular structures and adjust the channel numbers so that the overall complexity stays unchanged; results are shown in Table 4.

Table 5: comparison with MobileNet.

4.3. Comparison with MobileNets and Other Frameworks

Comparisons with MobileNet are shown in Table 5.

Table 6 compares ShuffleNet with several popular models.
Table 6

Table 7: object detection results on MS COCO.

4.4. Generalization Ability

To evaluate the generalization ability of ShuffleNet features learned on ImageNet, we test on the MS COCO object detection task, adopting Faster R-CNN as the detection framework. Results are shown in Table 7.

Table 8: actual inference speed on an ARM-based mobile device.

4.5. Actual Speedup Evaluation

We evaluate the actual inference speed of our models on an ARM-based mobile phone. Although models with larger group numbers g usually perform better, we found our current implementation less efficient for them; empirically, g = 3 gives a proper trade-off, so we adopt it. Results are shown in Table 8.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] H. Bagherinezhad, M. Rastegari, and A. Farhadi. LCNN: Lookup-based convolutional neural network. arXiv preprint arXiv:1611.06473, 2016.
[3] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[6] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[7] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[8] K. He and J. Sun. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5353–5360, 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[11] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[12] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[13] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
[14] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
[15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[16] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[18] J. Jin, A. Dundar, and E. Culurciello. Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474, 2014.
[19] K.-H. Kim, S. Hong, B. Roh, Y. Cheon, and M. Park. Pvanet: Deep but lightweight neural networks for real-time object detection. arXiv preprint arXiv:1608.08021, 2016.
[20] A. Krizhevsky. cuda-convnet: High-performance C++/CUDA implementation of convolutional neural networks, 2012.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[22] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553, 2014.
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[25] M. Mathieu, M. Henaff, and Y. LeCun. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851, 2013.
[26] P. Ramachandran, B. Zoph, and Q. V. Le. Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941, 2017.
[27] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
[28] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[31] D. Soudry, I. Hubara, and R. Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Advances in Neural Information Processing Systems, pages 963–971, 2014.
[32] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.
[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[35] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580, 2014.
[36] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[37] M. Wang, B. Liu, and H. Foroosh. Design of efficient convolutional layers using single intra-channel convolution, topological subdivisioning and spatial "bottleneck" structure. arXiv preprint arXiv:1608.04337, 2016.
[38] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
[39] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.
[40] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
[41] T. Zhang, G.-J. Qi, B. Xiao, and J. Wang. Interleaved group convolutions for deep neural networks. In International Conference on Computer Vision, 2017.
[42] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955, 2016.
[43] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun. Efficient and accurate approximations of nonlinear convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1984–1992, 2015.
[44] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. arXiv preprint arXiv:1702.03044, 2017.
[45] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
[46] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.