MTCNN

A paper that another project team at my company is reproducing, a classic of face detection and alignment. After grinding through a full translation of the last paper, I've decided to switch to summary-style notes. This one gets no full translation; it was a very easy read, probably because the paper itself is highly condensed and free of fluff. I like papers like this!

MTCNN: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks

Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, Yu Qiao
IEEE Signal Processing Letters, 2016

I. Introduction

Face detection and alignment are the foundation of many face applications, such as face recognition and facial expression analysis. However, large visual variations of faces, such as occlusion, pose, and lighting, pose challenges for real-world applications.

The cascaded face detector proposed by Viola and Jones$^{[2]}$ uses Haar-like features and AdaBoost, achieving good performance with real-time efficiency, but it degrades in practical applications. Deformable part models (DPM) have also been used$^{[5,6,7]}$. Yang et al.$^{[11]}$ use a convolutional neural network for facial attribute recognition to further generate candidate face windows. Li et al.$^{[19]}$ use cascaded CNNs, but their method requires bounding box calibration and ignores the correlation between facial landmark localization and bounding box regression.

Face alignment methods fall into two categories: regression-based methods$^{[12,13,16]}$ and template fitting methods$^{[14,15,7]}$. Recently, Zhang et al.$^{[22]}$ proposed using facial attribute recognition as an auxiliary task to boost face alignment performance.

However, all previous face detection and alignment methods ignore the correlation between the two tasks. Some works have tried to solve them jointly, but limitations remain. Chen et al.$^{[18]}$ jointly perform alignment and detection with random forests using features of pixel value differences, but performance is limited by the handcrafted features. Zhang et al.$^{[20]}$ use a multi-task CNN to improve multi-view face detection, but detection recall is limited by the initial windows produced by a weak face detector.

Traditional hard sample mining is performed offline, which greatly increases manual work. An online version is needed so that mining adapts automatically to the current training state.

This paper proposes a new framework that integrates the two tasks with multi-task cascaded CNNs. The cascade has three stages. The first stage quickly produces candidate windows with a shallow CNN. A more complex CNN then rejects a large number of non-face windows. Finally, an even more powerful CNN refines the results again and outputs the coordinates of five facial landmarks.

II. Approach

A. Overall Framework

The pipeline of our method is shown in Fig. 1.
Fig. 1

The input image is first resized to different scales to build an image pyramid.
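The scale sequence can be sketched as follows. This is a minimal sketch: the minimum face size of 20 px, the P-Net input of 12 px, and the scale factor 0.709 are conventions from common MTCNN implementations, not values stated in this summary.

```python
def pyramid_scales(h, w, min_face=20, net_input=12, factor=0.709):
    """Compute resize factors for the image pyramid.

    A face of min_face pixels in the original image should map to
    net_input pixels (the smallest window P-Net can see) at the
    largest scale; each further scale shrinks by `factor`.
    """
    scale = net_input / min_face
    min_side = min(h, w) * scale
    scales = []
    while min_side >= net_input:  # stop once the image is smaller than P-Net's input
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales
```

Each returned factor is one level of the pyramid; P-Net is run fully convolutionally on every resized copy.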

Stage 1: We propose a fully convolutional network, called P-Net (Proposal Network), to obtain candidate face windows and their bounding box regression vectors. The candidates are then calibrated with the regression vectors. Finally, non-maximum suppression (NMS) merges highly overlapping candidates.
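The NMS step can be sketched in a few lines of numpy. This is the standard greedy IoU-based variant; the summary does not specify the exact implementation, and the 0.5 threshold here is an assumption.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS. boxes: (N, 4) as [x1, y1, x2, y2]; returns kept indices."""
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes overlapping the kept one
    return keep
```

The same routine is reused after stages 2 and 3, typically with different thresholds.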

Stage 2: All candidates are fed to another CNN, called R-Net (Refine Network), which rejects a large number of false candidates, performs calibration with bounding box regression, and applies NMS.

Stage 3: Similar to stage 2, but this stage identifies face regions with more supervision. The network outputs the positions of five facial landmarks.

B. CNN Architecture

Existing CNN designs are limited in the following ways: (1) some convolutional layers lack diversity in their kernels, which may hurt their discriminative power; (2) compared with multi-class object detection and classification, face detection is a challenging binary classification task, so fewer kernels per layer may suffice. We therefore reduce the number of kernels and replace $5\times5$ kernels with $3\times3$ kernels to cut computation, while increasing depth for better performance. As a result, we obtain higher accuracy with less runtime (training-phase results are shown in Table 1). The CNN architectures are shown in Fig. 2; we use PReLU$^{[30]}$ as the nonlinear activation.

Table 1

Fig. 2

C. Training

We train the CNN detectors with three tasks: face/non-face classification, bounding box regression, and facial landmark localization.

1) Face classification: The learning objective is formulated as a binary classification problem. For each sample $x_i$ we use the cross-entropy loss:
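In standard form, this cross-entropy loss (Eq. 1 in the paper) is:

$$ L_i^{det} = -\left( y_i^{det} \log(p_i) + (1 - y_i^{det}) \log(1 - p_i) \right) $$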

where $p_i$ is the probability produced by the network that sample $x_i$ is a face, and $y_i^{det} \in \{0, 1\}$ denotes the ground-truth label.

2) Bounding box regression: For each candidate window, we predict its offset to the nearest ground truth. The learning objective is formulated as a regression problem, and we use the Euclidean loss for each sample:
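Written out, this Euclidean loss (Eq. 2 in the paper) is:

$$ L_i^{box} = \left\| \hat{y}_i^{box} - y_i^{box} \right\|_2^2 $$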

where $\hat y_i^{box}$ is the regression target produced by the network and $y_i^{box}$ is the ground-truth coordinate. There are four values, namely the left-top corner, width, and height, so $y_i^{box} \in \mathbb R^4$.

3) Facial landmark localization: As with bounding box regression, facial landmark detection is formulated as a regression problem, and we minimize the Euclidean loss:
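This loss (Eq. 3 in the paper) has the same Euclidean form, with $y_i^{landmark} \in \mathbb{R}^{10}$ since each of the five landmarks contributes an $(x, y)$ pair:

$$ L_i^{landmark} = \left\| \hat{y}_i^{landmark} - y_i^{landmark} \right\|_2^2 $$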

There are five landmarks: the left and right eyes, the nose, and the left and right mouth corners.

4) Multi-source training: Since each CNN handles multiple tasks, training involves several kinds of images, e.g. faces, non-faces, and partially aligned faces, so some of the loss functions (Eq. 1-3) go unused. For a background sample, for instance, we only compute $L_i^{det}$ and set the other two losses to 0. This is implemented with a sample-type indicator. The overall training objective is:
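The overall objective (Eq. 4 in the paper) combines the three weighted losses over all samples:

$$ \min \sum_{i=1}^{N} \sum_{j \in \{det,\, box,\, landmark\}} \alpha_j \, \beta_i^j \, L_i^j $$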

where $N$ is the number of training samples and $\alpha_j$ denotes the task importance. We use $\alpha_{det} = 1, \alpha_{box} = 0.5, \alpha_{landmark} = 0.5$ in P-Net and R-Net, and $\alpha_{det} = 1, \alpha_{box} = 0.5, \alpha_{landmark} = 1$ in O-Net. $\beta_i^j \in \{0, 1\}$ is the sample-type indicator. The CNNs are trained with SGD.

5) Online hard sample mining: Unlike traditional hard example mining, which runs only after the original classifier has been trained, we perform it online so that it adapts automatically to the training process.

Specifically, in each mini-batch we sort the losses of all samples computed in the forward pass and take the top 70% as hard samples for backpropagation, discarding the remaining easy samples. Its effectiveness is shown in Section III.
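The selection step can be sketched as follows; a minimal numpy version, where `losses` is assumed to be the per-sample loss vector of one mini-batch.

```python
import numpy as np

def ohem_select(losses, keep_ratio=0.7):
    """Return indices of the hardest samples in a mini-batch.

    Sorts per-sample losses and keeps the top `keep_ratio` fraction
    for backpropagation; the easy remainder contributes no gradient.
    """
    n_keep = int(np.ceil(len(losses) * keep_ratio))
    hard_idx = np.argsort(losses)[::-1][:n_keep]  # descending by loss
    return np.sort(hard_idx)  # restore batch order for convenience
```

In training, only the gradients of the returned indices would be accumulated in the backward pass.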

III. Experiments

In this section we evaluate the effectiveness of online hard sample mining, and compare our face detection and alignment method against state-of-the-art methods on FDDB, WIDER FACE, and AFLW (Annotated Facial Landmarks in the Wild).

A. Training Data

Since we perform face detection and alignment jointly, we use four kinds of data annotation during training: (i) negatives: regions whose IoU with every ground-truth face is below 0.3; (ii) positives: IoU above 0.65 with a ground-truth face; (iii) part faces: $\mathrm{IoU} \in [0.4, 0.65]$ with a ground-truth face; (iv) landmark faces: faces labeled with five landmark positions. There is an ambiguous band between part faces and negatives, and face annotations vary across datasets. Negatives and positives are used for the classification task, positives and part faces for bounding box regression, and landmark faces for landmark localization. The overall training data ratio is 3:1:1:2 (negative / positive / part face / landmark face). The training set for each network is collected as follows:

1) P-Net: We randomly crop patches from WIDER FACE to collect positives, negatives, and part faces, and randomly crop landmark faces from CelebA.

2) R-Net: We use the first stage of our framework to detect faces in WIDER FACE to collect positives, negatives, and part faces, and detect landmark faces in CelebA.

3) O-Net: Similar to R-Net, but we use the first two stages of our framework to run detection and collect samples.
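The IoU-based labeling rule above can be sketched as follows. This is a minimal version: `label_crop` is a hypothetical helper name, and treating the ambiguous 0.3-0.4 band as ignored is an assumption.

```python
def iou(box, gt):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    a1 = (box[2] - box[0]) * (box[3] - box[1])
    a2 = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (a1 + a2 - inter)

def label_crop(crop, gt_faces):
    """Label a random crop by its best IoU against all ground-truth faces."""
    best = max((iou(crop, g) for g in gt_faces), default=0.0)
    if best < 0.3:
        return "negative"
    if best > 0.65:
        return "positive"
    if best >= 0.4:
        return "part"
    return "ignore"  # ambiguous 0.3-0.4 band: skipped here by assumption
```

Crops would then be accumulated until the 3:1:1:2 ratio is reached.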

B. The effectiveness of online hard sample mining

We train two P-Nets (with and without online hard sample mining) and compare their performance on FDDB (Fig. 3a). Online hard sample mining brings a 1.5% overall performance gain.

Fig. 3

C. The effectiveness of joint detection and alignment

We evaluate two O-Nets (with and without joint landmark regression training) on FDDB, using the same P-Net and R-Net. We also compare the bounding box regression performance of the two O-Nets (Fig. 3b).

D. Evaluation on face detection

We compare against state-of-the-art methods [1, 5, 6, 11, 18, 19, 26, 27, 28, 29] on FDDB. Fig. 4a-d shows that we lead by a large margin.
Fig. 4

E. Evaluation on face alignment

We compare against the following face alignment methods: RCPR [12], TSPM [7], Luxand face SDK [17], ESR [13], CDM [15], SDM [21], and TCDCN [22]. The mean error is computed as the distance between the estimated landmarks and the ground truth, normalized by the interocular distance. Fig. 5 shows that we lead by a wide margin. It also shows that our method is less accurate at mouth corner localization, perhaps because our training set has little expression variation, which strongly affects mouth corner positions.

Fig. 5

F. Runtime efficiency

See Table 2. Our code is unoptimized Matlab code.

Table 2

IV. Conclusion

References

[1] B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Aggregate channel features for multi-view face detection,” in IEEE International Joint Conference on Biometrics, 2014, pp. 1-8.
[2] P. Viola and M. J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[3] M. T. Pham, Y. Gao, V. D. D. Hoang, and T. J. Cham, “Fast polygonal integration and its application in extending haar-like features to improve object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 942-949.
[4] Q. Zhu, M. C. Yeh, K. T. Cheng, and S. Avidan, “Fast human detection using a cascade of histograms of oriented gradients,” in IEEE Computer Conference on Computer Vision and Pattern Recognition, 2006, pp. 1491-1498.
[5] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, “Face detection without bells and whistles,” in European Conference on Computer Vision, 2014, pp. 720-735.
[6] J. Yan, Z. Lei, L. Wen, and S. Li, “The fastest deformable part model for object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2497-2504.
[7] X. Zhu, and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2879-2886.
[8] M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof, “Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2011, pp. 2144-2151.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097-1105.
[10] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in Advances in Neural Information Processing Systems, 2014, pp. 1988-1996.
[11] S. Yang, P. Luo, C. C. Loy, and X. Tang, “From facial parts responses to face detection: A deep learning approach,” in IEEE International Conference on Computer Vision, 2015, pp. 3676-3684.
[12] X. P. Burgos-Artizzu, P. Perona, and P. Dollar, “Robust face landmark estimation under occlusion,” in IEEE International Conference on Computer Vision, 2013, pp. 1513-1520.
[13] X. Cao, Y. Wei, F. Wen, and J. Sun, “Face alignment by explicit shape regression,” International Journal of Computer Vision, vol. 107, no. 2, pp. 177-190, 2012.
[14] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681-685, 2001.
[15] X. Yu, J. Huang, S. Zhang, W. Yan, and D. Metaxas, “Pose-free facial landmark fitting via optimized part mixtures and cascaded deformable shape model,” in IEEE International Conference on Computer Vision, 2013, pp. 1944-1951.
[16] J. Zhang, S. Shan, M. Kan, and X. Chen, “Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment,” in European Conference on Computer Vision, 2014, pp. 1-16.
[17] Luxand Incorporated: Luxand face SDK, http://www.luxand.com/
[18] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun, “Joint cascade face detection and alignment,” in European Conference on Computer Vision, 2014, pp. 109-122.
[19] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A convolutional neural network cascade for face detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5325-5334.
[20] C. Zhang, and Z. Zhang, “Improving multiview face detection with multi-task deep convolutional neural networks,” IEEE Winter Conference on Applications of Computer Vision, 2014, pp. 1036-1041.
[21] X. Xiong, and F. Torre, “Supervised descent method and its applications to face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 532-539.
[22] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial landmark detection by deep multi-task learning,” in European Conference on Computer Vision, 2014, pp. 94-108.
[23] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in IEEE International Conference on Computer Vision, 2015, pp. 3730-3738.
[24] S. Yang, P. Luo, C. C. Loy, and X. Tang, “WIDER FACE: A Face Detection Benchmark”. arXiv preprint arXiv:1511.06523.
[25] V. Jain, and E. G. Learned-Miller, “FDDB: A benchmark for face detection in unconstrained settings,” Technical Report UMCS-2010-009, University of Massachusetts, Amherst, 2010.
[26] B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Convolutional channel features,” in IEEE International Conference on Computer Vision, 2015, pp. 82-90.
[27] R. Ranjan, V. M. Patel, and R. Chellappa, “A deep pyramid deformable part model for face detection,” in IEEE International Conference on Biometrics Theory, Applications and Systems, 2015, pp. 1-8.
[28] G. Ghiasi, and C. C. Fowlkes, “Occlusion Coherence: Detecting and Localizing Occluded Faces,” arXiv preprint arXiv:1506.08347.
[29] S. S. Farfade, M. J. Saberian, and L. J. Li, “Multi-view face detection using deep convolutional neural networks,” in ACM on International Conference on Multimedia Retrieval, 2015, pp. 643-650.
[30] K. He, X. Zhang, S. Ren, J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in IEEE International Conference on Computer Vision, 2015, pp. 1026-1034.