Pedestrian Detection: State of the Art

Learning Efficient Single-stage Pedestrian Detectors by Asymptotic Localization Fitting

Wei Liu

I have recently been looking for better pedestrian detection algorithms. This is an ECCV 2018 pedestrian detection paper presenting an improved version of SSD, with open-source code and models available. The authors argue that the key to Faster R-CNN's high accuracy lies in its two rounds of prediction on candidate boxes, not in the time-consuming RoI pooling operation; if this core idea can be transferred to SSD, high accuracy can be achieved while keeping SSD's speed advantage.

1. Introduction

The authors argue that the key to Faster R-CNN's accuracy is the two steps applied to the default anchors — the RPN followed by prediction on the RoIs — rather than the RoI pooling module. Recent work also shows that anchors can be processed in multiple steps in a simple way, without an RPN or RoI pooling.

Another problem with SSD is the single IoU threshold used during training. A lower threshold (0.5) helps collect more positive samples, but training with a single low threshold produces many "close but not correct" false positives at test time. A higher threshold avoids this, but leaves too few positive samples for training. See Figure 1.
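To make the threshold trade-off concrete, here is a small IoU computation; the boxes are made-up numbers for illustration, not from the paper:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

# One ground-truth pedestrian box and two anchors: one loose, one tight.
gt = np.array([10, 10, 50, 90])
anchors = np.array([[0, 10, 40, 90], [12, 8, 52, 88]])
ious = iou(gt, anchors)
print((ious >= 0.5).sum(), (ious >= 0.7).sum())  # the stricter threshold keeps fewer positives
```

Raising the matching threshold from 0.5 to 0.7 here halves the number of anchors that count as positives, which is exactly the sample-starvation problem described above.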

Figure 1

The authors therefore propose a simple but effective module: Asymptotic Localization Fitting (ALF). Starting from SSD's default anchors, all anchors are refined step by step with convolutional predictors, bringing them progressively closer to the ground-truth boxes. On this basis a new pedestrian detection architecture, ALFNet, is built.

2. Related Work

This section of the paper reviews several two-stage detectors and compares the proposed method with Cascade R-CNN and RefineDet, highlighting similarities and differences.

3. Approach

3.1 Preliminary

The method builds on the single-stage detection framework, which is briefly reviewed here.

In single-stage detection, multiple feature maps of different resolutions are extracted by a backbone network (VGG, ResNet, etc.); they are defined as:

$$\Phi_n = f_n(\Phi_{n-1}), \quad \Phi_0 = I, \quad n = 1, \dots, N$$

where $I$ is the input image, $f_n$ is either an existing layer of the backbone or a newly added feature extraction layer, and $\Phi_n$ is the feature map produced by the $n$-th layer. On these multi-scale feature maps, detection can be written as:

$$Dets = F\big(\{\,p_n(\Phi_n, B_n)\,\}_{n=1}^{N}\big)$$

where $B_n$ is the set of anchor boxes pre-defined on the cells of the $n$-th feature map, and $p_n$ is the convolutional predictor that converts the $n$-th feature map $\Phi_n$ into detections. It usually has two parts: $cls_n$, which predicts classification scores, and $regr_n$, which predicts the scales and offsets of the anchors on the $n$-th layer, yielding the final bounding boxes. $F$ gathers the boxes from all layers and outputs the final result.
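The per-level formulation above can be sketched in a few lines of NumPy; the predictor callables here are stand-ins for illustration, not the paper's actual predictor blocks:

```python
import numpy as np

def gather_detections(feature_maps, anchors, predictors):
    """Single-stage detection as code: each level's predictor p_n maps
    (Phi_n, B_n) to scores and boxes, and F simply concatenates the levels."""
    scores, boxes = [], []
    for phi_n, b_n, p_n in zip(feature_maps, anchors, predictors):
        cls_n, regr_n = p_n(phi_n, b_n)  # classification scores, regressed boxes
        scores.append(cls_n)
        boxes.append(regr_n)
    return np.concatenate(scores), np.concatenate(boxes)

# Dummy levels: a predictor that scores every anchor 0.5 and returns it unchanged.
dummy = lambda phi, b: (np.full(len(b), 0.5), b)
s, b = gather_detections([None, None],
                         [np.zeros((4, 4)), np.zeros((2, 4))],
                         [dummy, dummy])
print(s.shape, b.shape)  # 6 anchors gathered across the two levels
```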

Note that this detection formulation plays the role of the FPN in Faster R-CNN, whereas the RPN applies a single shared predictor to the anchors of all scales, which can be written as:

$$proposals = F\big(\{\,p(\Phi_n, B_n)\,\}_{n=1}^{N}\big)$$

In two-stage methods, these proposals are further processed by RoI pooling and fed into another detection subnetwork to obtain classification and regression results, which is why they are more accurate but less efficient than single-stage methods.

3.2 Asymptotic Localization Fitting

The analysis above suggests that single-stage methods are suboptimal because it is hard for a single predictor $p_n$ to perform perfectly on the default anchor boxes uniformly tiled over the feature maps. We propose a natural solution: stack a series of predictors $p_n^t$ applied to coarse-to-fine anchors $\mathfrak{B}_n^t$, where $t$ denotes the $t$-th step. Detection can then be rewritten as:

$$Dets = F\big(\{\,p_n^{T}(\Phi_n, \mathfrak{B}_n^{T-1})\,\}_{n=1}^{N}\big), \qquad \mathfrak{B}_n^{t} = regr_n^{t}(\Phi_n, \mathfrak{B}_n^{t-1})$$

where $T$ is the total number of steps and $\mathfrak{B}_n^0$ is the set of default anchor boxes tiled on the $n$-th layer. In each step, the predictor $p_n^t$ is trained on the regressed boxes $\mathfrak{B}_n^{t-1}$ rather than on the default anchors. In other words, the progressively refined anchors provide more positive samples, allowing later steps to be trained with higher IoU thresholds, which helps produce more precise localization at test time. Another benefit is that the multiple classifiers trained with different thresholds across the steps can score each anchor in a "multi-expert" manner.
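The stepwise fitting loop can be sketched as follows, using the standard Faster R-CNN box decoding; `alf_refine` and its per-step predictor callables are illustrative names under my own assumptions, not the released implementation:

```python
import numpy as np

def apply_deltas(anchors, deltas):
    """Standard box decoding: (dx, dy, dw, dh) applied to (x1, y1, x2, y2) anchors."""
    w = anchors[:, 2] - anchors[:, 0]
    h = anchors[:, 3] - anchors[:, 1]
    cx = anchors[:, 0] + 0.5 * w + deltas[:, 0] * w
    cy = anchors[:, 1] + 0.5 * h + deltas[:, 1] * h
    w = w * np.exp(deltas[:, 2])
    h = h * np.exp(deltas[:, 3])
    return np.stack([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], axis=1)

def alf_refine(feat, anchors, predictors, steps):
    """Each step t feeds the boxes regressed at step t-1 (not the default anchors)
    into the next predictor, progressively tightening the anchors toward the GT."""
    boxes = anchors
    scores = []
    for t in range(steps):
        cls_t, regr_t = predictors[t](feat, boxes)  # hypothetical per-step predictor
        boxes = apply_deltas(boxes, regr_t)
        scores.append(cls_t)  # one score per step -> "multi-expert" fusion later
    return boxes, scores
```

With zero deltas the anchors pass through unchanged; in training, each step's output boxes define the (progressively stricter) positive set for the next step.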

Figure 2

Figure 2 illustrates the effectiveness of the ALF module. In (a), with an IoU threshold of 0.5, only 7 and 16 default anchors respectively are assigned as positives. This number rises step by step, and the mean IoU improves as well.

Figure 3

3.3 Overall Framework

Architectural details are shown in Figure 3. Taking ResNet-50 as the backbone, branches are taken from the feature maps of its last three stages (denoted $\Phi_3, \Phi_4, \Phi_5$; the yellow blocks in Figure 3), and an extra convolutional layer is appended at the end, denoted $\Phi_6$, to form an auxiliary branch (green block). Detection is performed on {$\Phi_3, \Phi_4, \Phi_5, \Phi_6$}, which downsample the input image by 8, 16, 32, and 64, respectively. Their anchor widths are {(16, 24), (32, 48), (64, 80), (128, 160)}, with an aspect ratio of 0.41. Convolutional Predictor Blocks (CPBs) are then attached to perform the multi-step bounding-box classification and regression.
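A minimal sketch of how such anchors could be tiled: the paper specifies the widths and the 0.41 aspect ratio, while the centre-of-cell placement here is my assumption:

```python
import numpy as np

def make_anchors(fmap_h, fmap_w, stride, widths, ratio=0.41):
    """Tile anchors of the given widths at every cell of a stride-s feature map.
    Pedestrian boxes are tall: height = width / 0.41 (fixed aspect ratio)."""
    boxes = []
    for y in range(fmap_h):
        for x in range(fmap_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell centre in pixels
            for w in widths:
                h = w / ratio
                boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

# Level Phi_3: stride 8, anchor widths (16, 24), on a toy 4x4 feature map.
a = make_anchors(4, 4, 8, (16, 24))
print(a.shape)  # 4*4 cells x 2 widths = 32 anchors
```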

3.4 Training and Inference

Training An anchor is assigned positive ($S_+$) if its IoU with some ground-truth box exceeds a threshold $u_h$, and negative ($S_-$) if it falls below $u_l$. Anchors with IoU in $[u_l, u_h)$ are ignored during training. The thresholds assigned in the progressive steps are discussed in the experiments.
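The labeling rule can be written down directly; `label_anchors` is an illustrative name:

```python
import numpy as np

def label_anchors(iou_max, u_l, u_h):
    """Assign +1 (positive), 0 (negative), or -1 (ignored) per the two thresholds.
    iou_max[i] is anchor i's highest IoU with any ground-truth box."""
    labels = np.full(iou_max.shape, -1, dtype=int)  # [u_l, u_h) stays ignored
    labels[iou_max >= u_h] = 1
    labels[iou_max < u_l] = 0
    return labels

print(label_anchors(np.array([0.2, 0.45, 0.6]), u_l=0.3, u_h=0.5))  # [ 0 -1  1]
```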

In each step, the CPB is optimized with a multi-task loss:

$$L = \sum_{i \in S_+ \cup S_-} l_{cls}(p_i, y_i) + \lambda \sum_{i \in S_+} l_{loc}(t_i, t_i^*)$$

The regression loss $l_{loc}$ is the same smooth L1 loss as in Faster R-CNN, and $l_{cls}$ is the cross-entropy loss for binary classification. Inspired by [26], a focal weight is also added to the classification loss to handle the imbalance between positive and negative samples:

$$l_{cls} = -\sum_{i \in S_+} \alpha\,(1 - p_i)^{\gamma} \log p_i \;-\; \sum_{i \in S_-} (1 - \alpha)\, p_i^{\gamma} \log(1 - p_i)$$

where $p_i$ is the positive probability of sample $i$, and $\alpha, \gamma$ are the focusing parameters, usually set to $\alpha = 0.25, \gamma = 2$, which down-weights the loss contribution of easy samples.
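A NumPy sketch of the focal-weighted cross-entropy (the $\alpha$/$1-\alpha$ balancing of positives and negatives follows [26]; the function name is mine):

```python
import numpy as np

def focal_weight_ce(p, y, alpha=0.25, gamma=2.0):
    """Binary cross-entropy with the focal weight of [26]:
    easy examples (p_t close to 1) are down-weighted by (1 - p_t)^gamma."""
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# A well-classified positive contributes far less than a hard one.
easy = focal_weight_ce(np.array([0.9]), np.array([1]))
hard = focal_weight_ce(np.array([0.1]), np.array([1]))
print(easy < hard)
```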

Data augmentation includes random color distortion, horizontal flipping with probability 0.5, and randomly cropping a patch of [0.3, 1] of the original image, which is then rescaled, preserving the aspect ratio, so that its shorter side equals a fixed size N (640 for CityPersons, 336 for Caltech).
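The crop step might look like this sketch; the actual resizing (e.g. with `cv2.resize`) is omitted, and the sampling details are my assumptions:

```python
import numpy as np

def random_crop(img, out_short=640, rng=None):
    """Crop a random patch whose sides are [0.3, 1] of the original,
    and return the scale that would bring its short side to out_short."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = img.shape[:2]
    ratio = rng.uniform(0.3, 1.0)
    ch, cw = int(h * ratio), int(w * ratio)     # keep the original aspect ratio
    y0 = rng.integers(0, h - ch + 1)
    x0 = rng.integers(0, w - cw + 1)
    patch = img[y0:y0 + ch, x0:x0 + cw]
    scale = out_short / min(ch, cw)             # resize factor for the short side
    return patch, scale
```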

Inference ALFNet simply runs a forward pass. At each level we obtain the anchor boxes regressed by the final predictor, together with confidence scores fused from all predictors. Boxes with confidence below 0.01 are first filtered out, and the remaining boxes are merged with NMS at a threshold of 0.5.
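The filtering and merging described above corresponds to classic greedy NMS, sketched here with the stated thresholds (this is not the released code):

```python
import numpy as np

def nms(boxes, scores, conf_thr=0.01, iou_thr=0.5):
    """Drop boxes below conf_thr, then greedy NMS at iou_thr."""
    keep_mask = scores > conf_thr
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort()[::-1]          # process highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        ious = inter / (area_i + areas - inter)
        order = order[1:][ious <= iou_thr]  # suppress heavy overlaps with box i
    return boxes[keep], scores[keep]
```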

4. Experiment

4.1 Experiment Settings

The batch size is 10 and the optimizer is Adam. For CityPersons, the backbone is pre-trained on ImageNet and the extra layers are initialized with the Xavier method. Training runs for 240k iterations with an initial learning rate of 0.0001, decayed by 10 after 160k. For Caltech, training runs for 140k iterations with a learning rate of 0.00001.

4.2 Ablation Experiment

I will not go through these in detail; see the result tables and figures in the paper. The performance is quite good.

References

  1. Bell, S., Lawrence Zitnick, C., Bala, K., Girshick, R.: Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2874–2883 (2016)
  2. Benenson, R., Omran, M., Hosang, J., Schiele, B.: Ten years of pedestrian detection, what have we learned? In: European Conference on Computer Vision. pp. 613–627. Springer (2014)
  3. Brazil, G., Yin, X., Liu, X.: Illuminating pedestrians via simultaneous detection & segmentation. arXiv preprint arXiv:1706.08564 (2017)
  4. Cai, Z., Fan, Q., Feris, R.S., Vasconcelos, N.: A unified multi-scale deep convolutional neural network for fast object detection. In: European Conference on Computer Vision. pp. 354–370. Springer (2016)
  5. Cai, Z., Saberian, M., Vasconcelos, N.: Learning complexity-aware cascades for deep pedestrian detection. In: International Conference on Computer Vision. pp. 3361–3369 (2015)
  6. Cai, Z., Vasconcelos, N.: Cascade R-CNN: Delving into high quality object detection. arXiv preprint arXiv:1712.00726 (2017)
  7. Chollet, F.: Keras. Published on GitHub (https://github.com/fchollet/keras) (2015)
  8. Dai, J., Li, Y., He, K., Sun, J.: R-FCN: Object detection via region-based fully convolutional networks. In: Advances in Neural Information Processing Systems. pp. 379–387 (2016)
  9. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Li, F.F.: ImageNet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. pp. 248–255. IEEE (2009)
  10. Dollár, P., Appel, R., Belongie, S., Perona, P.: Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(8), 1532–1545 (2014)
  11. Dollár, P., Tu, Z., Perona, P., Belongie, S.: Integral channel features (2009)
  12. Dollár, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(4), 743–761 (2012)
  13. Du, X., El-Khamy, M., Lee, J., Davis, L.: Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection. In: Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. pp. 953–961. IEEE (2017)
  14. Fu, C.Y., Liu, W., Ranga, A., Tyagi, A., Berg, A.C.: DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659 (2017)
  15. Gidaris, S., Komodakis, N.: Object detection via a multi-region and semantic segmentation-aware CNN model. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1134–1142 (2015)
  16. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1440–1448 (2015)
  17. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 580–587 (2014)
  18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
  19. Hosang, J., Omran, M., Benenson, R., Schiele, B.: Taking a deeper look at pedestrians. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4073–4082 (2015)
  20. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
  21. Kong, T., Sun, F., Yao, A., Liu, H., Lu, M., Chen, Y.: RON: Reverse connection with objectness prior networks for object detection. arXiv preprint arXiv:1707.01691 (2017)
  22. Kong, T., Yao, A., Chen, Y., Sun, F.: HyperNet: Towards accurate region proposal generation and joint object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 845–853 (2016)
  23. Lee, H., Eum, S., Kwon, H.: ME R-CNN: Multi-expert region-based CNN for object detection. arXiv preprint arXiv:1704.01069 (2017)
  24. Li, J., Liang, X., Shen, S., Xu, T., Feng, J., Yan, S.: Scale-aware Fast R-CNN for pedestrian detection. IEEE Transactions on Multimedia (2017)
  25. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. arXiv preprint arXiv:1612.03144 (2016)
  26. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. arXiv preprint arXiv:1708.02002 (2017)
  27. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.: SSD: Single shot multibox detector. In: European Conference on Computer Vision. pp. 21–37. Springer (2016)
  28. Mao, J., Xiao, T., Jiang, Y., Cao, Z.: What can help pedestrian detection? In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). vol. 1, p. 3 (2017)
  29. Nam, W., Dollár, P., Han, J.H.: Local decorrelation for improved pedestrian detection. In: Advances in Neural Information Processing Systems. pp. 424–432 (2014)
  30. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 779–788 (2016)
  31. Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. arXiv preprint 1612 (2016)
  32. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. pp. 91–99 (2015)
  33. Shen, Z., Liu, Z., Li, J., Jiang, Y.G., Chen, Y., Xue, X.: DSOD: Learning deeply supervised object detectors from scratch. In: The IEEE International Conference on Computer Vision (ICCV). vol. 3, p. 7 (2017)
  34. Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 761–769 (2016)
  35. Shrivastava, A., Gupta, A.: Contextual priming and feedback for Faster R-CNN. In: European Conference on Computer Vision. pp. 330–348. Springer (2016)
  36. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  37. Tian, Y., Luo, P., Wang, X., Tang, X.: Deep learning strong parts for pedestrian detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1904–1912 (2015)
  38. Tian, Y., Luo, P., Wang, X., Tang, X.: Pedestrian detection aided by deep learning semantic tasks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5079–5087 (2015)
  39. Wang, X., Shrivastava, A., Gupta, A.: A-Fast-RCNN: Hard positive generation via adversary for object detection. arXiv preprint arXiv:1704.03414 (2017)
  40. Wang, X., Xiao, T., Jiang, Y., Shao, S., Sun, J., Shen, C.: Repulsion loss: Detecting pedestrians in a crowd. arXiv preprint arXiv:1711.07752 (2017)
  41. Yang, B., Yan, J., Lei, Z., Li, S.Z.: Convolutional channel features. In: Computer Vision (ICCV), 2015 IEEE International Conference on. pp. 82–90. IEEE (2015)
  42. Zhang, L., Lin, L., Liang, X., He, K.: Is Faster R-CNN doing well for pedestrian detection? In: European Conference on Computer Vision. pp. 443–457. Springer (2016)
  43. Zhang, S., Benenson, R., Omran, M., Hosang, J., Schiele, B.: How far are we from solving pedestrian detection? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1259–1267 (2016)
  44. Zhang, S., Benenson, R., Schiele, B.: CityPersons: A diverse dataset for pedestrian detection. arXiv preprint arXiv:1702.05693 (2017)
  45. Zhang, S., Wen, L., Bian, X., Lei, Z., Li, S.Z.: Single-shot refinement neural network for object detection. arXiv preprint arXiv:1711.06897 (2017)