Learning Efficient Single-stage Pedestrian Detectors by Asymptotic Localization Fitting
Wei Liu
I have recently been looking for better pedestrian detection algorithms. This is an ECCV 2018 pedestrian detection paper, an improved version of SSD by the author of SSD, with open-source code and models. The authors argue that the key to Faster R-CNN's high accuracy is that it predicts the candidate boxes twice, not its time-consuming RoI-pooling operation; if that core idea can be transferred to SSD, high accuracy can be obtained while keeping SSD's speed advantage.
1. Introduction
We argue that the key to Faster R-CNN's accuracy is the two-step processing of the default anchors, i.e. the RPN followed by prediction on the RoIs, rather than the RoI-pooling module. Recent work has also shown that anchors can be processed in multiple steps in a simple way, without an RPN or RoI-pooling.
Another problem with SSD is the single IoU threshold used during training. A lower threshold (0.5) helps obtain more positive samples, but a single low threshold at training time produces many "close but not correct" false positives at test time. A higher threshold avoids this, but leaves too few positive samples for training. See Fig. 1.
We therefore propose a simple but effective module: Asymptotic Localization Fitting (ALF). Starting from SSD's default anchors, all anchors are improved step by step, convolutionally, pushing them closer to the ground-truth boxes. On top of this, a new pedestrian detection architecture, ALFNet, is built.
2. Related Work
This section reviews several two-stage detection algorithms and compares this paper with Cascade R-CNN and RefineDet.
3. Approach
3.1 Preliminary
Our method builds on the single-stage detection framework, which we briefly review here.
In single-stage detection, multiple feature maps at different resolutions are extracted by a backbone network (VGG, ResNet, etc.); they are defined as:

$$\Phi_n = f_n(\Phi_{n-1}), \quad n = 1, \dots, N, \quad \Phi_0 = I$$

where $I$ denotes the input image, $f_n$ is either an existing layer of the backbone or a newly added feature extraction layer, and $\Phi_n$ is the feature map generated at the $n$-th level. Detection over the multi-scale feature maps can then be written as:

$$Dets = \mathcal{F}\big(\{p_n(\Phi_n, B_n)\}_{n=1}^{N}\big)$$

where $B_n$ is the set of anchor boxes pre-defined at each cell of the $n$-th feature map, and $p_n$ is the convolutional predictor that turns the $n$-th feature map $\Phi_n$ into detection results. It typically has two components: $cls_n$, which predicts classification scores, and $regr_n$, which predicts scaling factors and offsets for the anchors of the $n$-th level to obtain the final bounding boxes. $\mathcal{F}$ gathers the boxes from all levels and outputs the final results.
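The single-stage formulation above can be sketched in a few lines of NumPy; the extractors $f_n$, predictors $p_n$ and merge function $\mathcal{F}$ below are hypothetical stand-ins for illustration, not the paper's implementation.

```python
import numpy as np

def detect_single_stage(image, extractors, predictors, anchors):
    """Sketch of the single-stage pipeline: run backbone stages to get
    feature maps Phi_1..Phi_N, apply each level's predictor p_n to that
    level's default anchors B_n, and merge all boxes (the role of F)."""
    feats = []
    phi = image                      # Phi_0 = I
    for f_n in extractors:           # Phi_n = f_n(Phi_{n-1})
        phi = f_n(phi)
        feats.append(phi)
    # p_n maps (Phi_n, B_n) -> (scores, boxes) for that level
    results = [p_n(phi_n, b_n)
               for p_n, phi_n, b_n in zip(predictors, feats, anchors)]
    scores = np.concatenate([r[0] for r in results])
    boxes = np.concatenate([r[1] for r in results])
    return scores, boxes
```

With trivial stand-in callables (e.g. identity-like extractors and predictors that pass anchors through), the function simply concatenates per-level outputs, which is all the merge step $\mathcal{F}$ does before NMS.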
Note that this formulation plays the role of FPN in Faster R-CNN, except that the RPN applies the predictor of its last layer to the anchors of all scales, written as:

$$proposals = \mathcal{F}\big(\{p_N(\Phi_n, B_n)\}_{n=1}^{N}\big)$$

In two-stage methods, these proposals are further processed by RoI-pooling and fed into another detection sub-network to obtain classification and regression results, which makes them more accurate but less efficient than single-stage methods.
3.2 Asymptotic Localization Fitting
From the analysis above, the reason single-stage methods are sub-optimal is that it is hard for a single predictor $p_n$ to perform perfectly on the default anchor boxes uniformly tiled over the feature map. We propose a natural solution: stack a series of predictors $p_n^t$ applied to coarse-to-fine anchors $\mathfrak{B}_n^t$, where $t$ indexes the step. The detection formulation can then be rewritten as:

$$Dets = \mathcal{F}\big(\{p_n^{T}(\Phi_n, \mathfrak{B}_n^{T-1})\}_{n=1}^{N}\big), \qquad \mathfrak{B}_n^{t} = regr_n^{t}(\Phi_n, \mathfrak{B}_n^{t-1})$$

where $T$ is the total number of steps and $\mathfrak{B}_n^0$ is the set of default anchor boxes tiled at the $n$-th level. In each step, the predictor $p_n^t$ operates on the regressed boxes $\mathfrak{B}_n^{t-1}$ rather than on the default anchors. In other words, the progressively improved anchors provide more positive samples, so later steps can be trained with higher IoU thresholds, which in turn yields more precise localization at test time. Another benefit is that the multiple classifiers, trained with different thresholds at different steps, can score each anchor in a "multi-expert" manner.
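The refinement loop can be sketched as follows; the per-step classifier and regressor callables are hypothetical placeholders, and the simple score averaging stands in for the "multi-expert" fusion.

```python
import numpy as np

def alf_refine(phi_n, anchors0, steps):
    """Sketch of the ALF idea: each step t applies a predictor to the
    boxes regressed by step t-1, so anchors drift toward the ground
    truth. `steps` is a list of (cls_t, regr_t) callables."""
    boxes = anchors0                      # B_n^0: default anchors
    all_scores = []
    for cls_t, regr_t in steps:           # t = 1..T
        all_scores.append(cls_t(phi_n, boxes))
        boxes = regr_t(phi_n, boxes)      # B_n^t from B_n^{t-1}
    # fuse the per-step classifiers' scores ("multi-expert" scoring)
    return np.mean(all_scores, axis=0), boxes
```

Because each `cls_t` sees the boxes produced by the previous step, later classifiers can be trained with stricter IoU thresholds without starving for positives.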
Fig. 2 gives an example of the ALF module's effectiveness. In (a), with an IoU threshold of 0.5, only 7 and 16 of the default anchors respectively are assigned as positives; this count rises step by step, and the mean IoU also improves.
3.3 Overall Framework
Architecture details are shown in Fig. 3. Taking ResNet-50 as the backbone, we branch off the feature maps of its last three stages (denoted $\Phi_3, \Phi_4, \Phi_5$, the yellow blocks in Fig. 3) and append one extra convolutional layer at the end, denoted $\Phi_6$, forming an auxiliary branch (green block). Detection is performed on {$\Phi_3, \Phi_4, \Phi_5, \Phi_6$}, which downsample the input image by 8, 16, 32 and 64, respectively. Their anchor widths are {(16,24),(32,48),(64,80),(128,160)}, with an aspect ratio of 0.41. We then attach Convolutional Predictor Blocks (CPBs) that perform the multi-step bounding-box classification and regression.
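Generating such an anchor layout can be sketched as below; the box convention `(cx, cy, w, h)` and the interpretation of 0.41 as width/height (typical for standing pedestrians, so height = width / 0.41) are my assumptions, not details spelled out in the post.

```python
import numpy as np

def make_anchors(fmap_h, fmap_w, stride, widths, ratio=0.41):
    """Tile anchors of the given widths at every cell of one feature
    map. Aspect ratio is width/height, so height = width / ratio.
    Returns boxes as (cx, cy, w, h), one row per anchor."""
    ys, xs = np.meshgrid(np.arange(fmap_h), np.arange(fmap_w),
                         indexing="ij")
    # cell centers in input-image coordinates
    centers = np.stack([(xs + 0.5) * stride, (ys + 0.5) * stride],
                       axis=-1).reshape(-1, 2)
    anchors = []
    for w in widths:
        h = w / ratio
        wh = np.full((len(centers), 2), [w, h])
        anchors.append(np.concatenate([centers, wh], axis=1))
    return np.concatenate(anchors, axis=0)
```

For example, the $\Phi_3$ level (stride 8) would use `widths=(16, 24)`, producing two anchors per feature-map cell.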
3.4 Training and Inference
Training. An anchor is assigned positive ($S_+$) if its IoU with some ground-truth box is above the threshold $u_h$, and negative ($S_-$) if it is below $u_l$. Anchors with IoU in $[u_l, u_h)$ are ignored during training. The thresholds assigned in the progressive steps are discussed in the experiments.
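The dual-threshold assignment can be sketched directly; the label encoding (1 positive, 0 negative, -1 ignored) is my convention for illustration.

```python
import numpy as np

def assign_labels(ious, u_l, u_h):
    """Label anchors by their best IoU with any ground-truth box:
    >= u_h -> positive (1), < u_l -> negative (0),
    in [u_l, u_h) -> ignored (-1), as described above."""
    best = ious.max(axis=1) if ious.ndim == 2 else ious
    labels = np.full(best.shape, -1, dtype=int)
    labels[best >= u_h] = 1
    labels[best < u_l] = 0
    return labels
```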
In each step, the CPB is optimized with a multi-task loss combining classification and localization, with a weighting factor $\lambda$ balancing the two terms:

$$L = l_{cls} + \lambda\, l_{loc}$$

The regression loss $l_{loc}$ is the same smooth L1 loss as in Faster R-CNN, and $l_{cls}$ is the cross-entropy loss for binary classification. Inspired by [26], we also apply a focal weight in the classification loss to handle the imbalance between positive and negative samples:

$$l_{cls} = -\sum_{i \in S_+} \alpha\,(1 - p_i)^{\gamma} \log p_i \;-\; \sum_{i \in S_-} (1 - \alpha)\, p_i^{\gamma} \log(1 - p_i)$$

where $p_i$ is the positive probability of sample $i$, and $\alpha, \gamma$ are focusing parameters, typically set to $\alpha = 0.25$ and $\gamma = 2$, which down-weight the loss contribution of easy samples.
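A minimal sketch of the focal-weighted cross-entropy, assuming per-sample probabilities and binary labels (this is the standard focal loss of [26], not the paper's exact training code):

```python
import numpy as np

def focal_weighted_ce(p, y, alpha=0.25, gamma=2.0):
    """Focal-weighted binary cross-entropy: easy samples (p_t near 1)
    are down-weighted by the (1 - p_t)^gamma factor; alpha balances
    the positive/negative classes."""
    p = np.clip(p, 1e-7, 1 - 1e-7)           # numerical stability
    pt = np.where(y == 1, p, 1 - p)          # prob of the true class
    at = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return -(at * (1 - pt) ** gamma * np.log(pt)).mean()
```

A well-classified positive ($p = 0.9$) contributes orders of magnitude less loss than a hard one ($p = 0.1$), which is exactly the imbalance-handling effect described above.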
Data augmentation includes random color distortion, horizontal flipping with probability 0.5, and randomly cropping a patch covering [0.3, 1] of the original image, which is then proportionally resized along its shorter side to a fixed size N (640 for CityPersons, 336 for Caltech).
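The crop-and-resize step can be sketched as below. This is a simplification under stated assumptions: it crops a square patch (the paper's pipeline preserves the original aspect ratio) and uses nearest-neighbour resizing for self-containedness.

```python
import numpy as np

def random_crop_resize(img, out_size, rng, min_frac=0.3):
    """Crop a random square patch whose side covers [min_frac, 1] of
    the image's shorter side, then resize it to out_size x out_size
    via nearest-neighbour indexing."""
    h, w = img.shape[:2]
    side = int(rng.uniform(min_frac, 1.0) * min(h, w))
    y0 = rng.integers(0, h - side + 1)
    x0 = rng.integers(0, w - side + 1)
    patch = img[y0:y0 + side, x0:x0 + side]
    idx = np.arange(out_size) * side // out_size
    return patch[idx][:, idx]                # nearest-neighbour resize
```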
Inference. ALFNet simply runs a forward pass. At each level we take the anchor boxes regressed by the final predictor, together with confidence scores fused from all predictors. We first discard boxes with confidence below 0.01, then merge the remaining boxes with NMS at a threshold of 0.5.
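The post-processing described above amounts to score filtering plus greedy NMS, which can be sketched as follows (boxes in `(x1, y1, x2, y2)` format, a convention assumed here).

```python
import numpy as np

def filter_and_nms(boxes, scores, score_thr=0.01, iou_thr=0.5):
    """Drop boxes below the score threshold, then greedy NMS:
    repeatedly keep the highest-scoring box and suppress all
    remaining boxes overlapping it with IoU > iou_thr."""
    keep_mask = scores > score_thr
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)
    kept = []
    while order.size:
        i = order[0]
        kept.append(i)
        rest = order[1:]
        # intersection of box i with every remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (area[i] + area[rest] - inter)
        order = rest[iou <= iou_thr]
    return boxes[kept], scores[kept]
```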
4. Experiment
4.1 Experiment Settings
The batch size is 10 and the optimizer is Adam. For CityPersons, the backbone is pre-trained on ImageNet and the extra layers are initialized with the Xavier method. Training runs for 240k iterations, with an initial learning rate of 0.0001, decreased by a factor of 10 after 160k iterations. For Caltech, training runs for 140k iterations with a learning rate of 0.00001.
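The CityPersons schedule above is a simple step schedule; a minimal sketch (the function name and its use as a per-iteration lookup are my own framing):

```python
def step_lr(iteration, base_lr=1e-4, drop_at=160_000, factor=10):
    """Step learning-rate schedule: constant base_lr, divided by
    `factor` once `drop_at` iterations are reached (the CityPersons
    settings described above)."""
    return base_lr / factor if iteration >= drop_at else base_lr
```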
4.2 Ablation Experiment
I won't go into the ablation details here; see the figures below for the results, which are quite good.
References
- Bell, S., Lawrence Zitnick, C., Bala, K., Girshick, R.: Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2874–2883 (2016)
- Benenson, R., Omran, M., Hosang, J., Schiele, B.: Ten years of pedestrian detection, what have we learned? In: European Conference on Computer Vision. pp. 613–627. Springer (2014)
- Brazil, G., Yin, X., Liu, X.: Illuminating pedestrians via simultaneous detection & segmentation. arXiv preprint arXiv:1706.08564 (2017)
- Cai, Z., Fan, Q., Feris, R.S., Vasconcelos, N.: A unified multi-scale deep convolutional neural network for fast object detection. In: European Conference on Computer Vision. pp. 354–370. Springer (2016)
- Cai, Z., Saberian, M., Vasconcelos, N.: Learning complexity-aware cascades for deep pedestrian detection. In: International Conference on Computer Vision. pp. 3361–3369 (2015)
- Cai, Z., Vasconcelos, N.: Cascade R-CNN: Delving into high quality object detection. arXiv preprint arXiv:1712.00726 (2017)
- Chollet, F.: Keras. Published on GitHub (https://github.com/fchollet/keras) (2015)
- Dai, J., Li, Y., He, K., Sun, J.: R-FCN: Object detection via region-based fully convolutional networks. In: Advances in Neural Information Processing Systems. pp. 379–387 (2016)
- Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Li, F.F.: ImageNet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. pp. 248–255. IEEE (2009)
- Dollár, P., Appel, R., Belongie, S., Perona, P.: Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(8), 1532–1545 (2014)
- Dollár, P., Tu, Z., Perona, P., Belongie, S.: Integral channel features (2009)
- Dollár, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(4), 743–761 (2012)
- Du, X., El-Khamy, M., Lee, J., Davis, L.: Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection. In: Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. pp. 953–961. IEEE (2017)
- Fu, C.Y., Liu, W., Ranga, A., Tyagi, A., Berg, A.C.: DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659 (2017)
- Gidaris, S., Komodakis, N.: Object detection via a multi-region and semantic segmentation-aware CNN model. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1134–1142 (2015)
- Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1440–1448 (2015)
- Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 580–587 (2014)
- He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
- Hosang, J., Omran, M., Benenson, R., Schiele, B.: Taking a deeper look at pedestrians. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4073–4082 (2015)
- Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
- Kong, T., Sun, F., Yao, A., Liu, H., Lu, M., Chen, Y.: RON: Reverse connection with objectness prior networks for object detection. arXiv preprint arXiv:1707.01691 (2017)
- Kong, T., Yao, A., Chen, Y., Sun, F.: HyperNet: Towards accurate region proposal generation and joint object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 845–853 (2016)
- Lee, H., Eum, S., Kwon, H.: ME R-CNN: Multi-expert region-based CNN for object detection. arXiv preprint arXiv:1704.01069 (2017)
- Li, J., Liang, X., Shen, S., Xu, T., Feng, J., Yan, S.: Scale-aware Fast R-CNN for pedestrian detection. IEEE Transactions on Multimedia (2017)
- Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. arXiv preprint arXiv:1612.03144 (2016)
- Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. arXiv preprint arXiv:1708.02002 (2017)
- Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.: SSD: Single shot multibox detector. In: European Conference on Computer Vision. pp. 21–37. Springer (2016)
- Mao, J., Xiao, T., Jiang, Y., Cao, Z.: What can help pedestrian detection? In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). vol. 1, p. 3 (2017)
- Nam, W., Dollár, P., Han, J.H.: Local decorrelation for improved pedestrian detection. In: Advances in Neural Information Processing Systems. pp. 424–432 (2014)
- Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 779–788 (2016)
- Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242 (2016)
- Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. pp. 91–99 (2015)
- Shen, Z., Liu, Z., Li, J., Jiang, Y.G., Chen, Y., Xue, X.: DSOD: Learning deeply supervised object detectors from scratch. In: The IEEE International Conference on Computer Vision (ICCV). vol. 3, p. 7 (2017)
- Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 761–769 (2016)
- Shrivastava, A., Gupta, A.: Contextual priming and feedback for Faster R-CNN. In: European Conference on Computer Vision. pp. 330–348. Springer (2016)
- Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
- Tian, Y., Luo, P., Wang, X., Tang, X.: Deep learning strong parts for pedestrian detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1904–1912 (2015)
- Tian, Y., Luo, P., Wang, X., Tang, X.: Pedestrian detection aided by deep learning semantic tasks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5079–5087 (2015)
- Wang, X., Shrivastava, A., Gupta, A.: A-Fast-RCNN: Hard positive generation via adversary for object detection. arXiv preprint arXiv:1704.03414 (2017)
- Wang, X., Xiao, T., Jiang, Y., Shao, S., Sun, J., Shen, C.: Repulsion loss: Detecting pedestrians in a crowd. arXiv preprint arXiv:1711.07752 (2017)
- Yang, B., Yan, J., Lei, Z., Li, S.Z.: Convolutional channel features. In: Computer Vision (ICCV), 2015 IEEE International Conference on. pp. 82–90. IEEE (2015)
- Zhang, L., Lin, L., Liang, X., He, K.: Is Faster R-CNN doing well for pedestrian detection? In: European Conference on Computer Vision. pp. 443–457. Springer (2016)
- Zhang, S., Benenson, R., Omran, M., Hosang, J., Schiele, B.: How far are we from solving pedestrian detection? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1259–1267 (2016)
- Zhang, S., Benenson, R., Schiele, B.: CityPersons: A diverse dataset for pedestrian detection. arXiv preprint arXiv:1702.05693 (2017)
- Zhang, S., Wen, L., Bian, X., Lei, Z., Li, S.Z.: Single-shot refinement neural network for object detection. arXiv preprint arXiv:1711.06897 (2017)