Aligned ReID, a 2017 paper from Face++; a colleague at my company is currently using it.
Two months into this field and 18 posts in, which feels okay. But the classic papers in the related areas are running out, and once I get to cutting-edge work, if I can't pick the valuable papers out of the flood, I'm done for. I need to practice getting a quick read on a paper from its abstract and introduction.
There is too much to learn: calculus, linear algebra, probability and statistics on the math side; I also know very little about traditional CV features and algorithms, and I haven't done much deep learning training either. All I can do is accumulate step by step.
AlignedReID: Surpassing Human-Level Performance in Person Re-Identification
Xuan Zhang, Hao Luo, Xing Fan, Weilai Xiang, Yixiao Sun, Qiqi Xiao, Wei Jiang, Chi Zhang, Jian Sun
Megvii, Inc.(Face++), Institute of Cyber-Systems and Control, Zhejiang University, 2017
1. Introduction
Person re-identification (Re-ID), recognizing a person at a different time or place, is a challenging computer vision task. Its applications are broad, from tracking a person across cameras to searching for them in a large gallery of images, from grouping photos in an album to visitor analysis in retail stores. As with many visual recognition problems, variations in pose, viewpoint, and illumination, as well as occlusion, make it difficult.
Traditional approaches focused on low-level features such as color, shape, and local descriptors$^{[9, 11]}$. With the renaissance of deep learning, CNNs have come to dominate this field$^{[24, 32, 6, 54, 16, 24]}$ in an end-to-end fashion, trained with a variety of losses such as contrastive loss$^{[32]}$, triplet loss$^{[18]}$, improved triplet loss$^{[6]}$, quadruplet loss$^{[3]}$, and triplet hard loss$^{[13]}$.
Many CNN-based methods learn a global feature and ignore the spatial structure of the person, which has the following drawbacks: 1) inaccurate person detection boxes affect feature learning, as in Figure 1(a-b); 2) pose variation and non-rigid body deformation make training difficult, as in Figure 1(c-d); 3) occluded body parts introduce irrelevant context into the learned feature, as in Figure 1(e-f); 4) it is hard to emphasize local differences in a global feature, especially when we need to distinguish two people with very similar appearance, as in Figure 1(g-h). To overcome these drawbacks, recent work has turned to part-based, local feature learning. Some studies$^{[33, 38, 43]}$ divide the whole body into a few fixed parts without considering the alignment between them, so they still suffer from drawbacks 1, 2, and 3. Other studies$^{[52, 37, 50]}$ use pose estimation results for alignment, which requires extra supervision and a (typically error-prone) pose estimation step.
This paper proposes a new approach called AlignedReID. It still learns a global feature, but performs automatic part alignment during training, without requiring extra supervision or explicit pose estimation. In the training phase, a global feature and local features are learned jointly. In the local branch, local parts are aligned by introducing a shortest path loss. At inference time, the local branch is discarded and only the global feature is extracted. We find that using the global feature alone performs almost as well as using both. In other words, in our joint learning framework, the global feature itself, aided by local feature learning, can address the drawbacks mentioned above very well. Moreover, the global-feature form makes our approach well suited to deployment in large-scale Re-ID systems, without the high cost of local feature matching.
We also adopt a mutual learning approach$^{[49]}$ in the training objective, allowing two models to learn better representations from each other. Combining AlignedReID with mutual learning, our system outperforms the state of the art by a large margin on Market1501, CUHK03, and CUHK-SYSU. To understand how humans perform on the Re-ID task, we measured the performance of ten professional annotators. We find that our system with re-ranking exceeds human accuracy. To our knowledge, this is the first report of machines surpassing humans on a Re-ID task.
2. Related Work
Metric Learning. Deep metric learning methods transform raw images into features and compute the distance between features as their similarity. Usually two images of the same person are defined as a positive pair. The triplet loss is driven by the margin between positive and negative pairs. Selecting suitable samples via hard example mining (HEM) has proven effective. Combining softmax with a metric loss to speed up convergence is also a popular approach.
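As a refresher, the standard triplet loss with margin $m$ pushes the anchor-negative distance above the anchor-positive distance, where $(a, p, n)$ is an anchor, positive, and negative sample and $[\cdot]_+ = \max(\cdot, 0)$:

$$L_{triplet} = \left[\, d(a, p) - d(a, n) + m \,\right]_+$$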
Feature Alignment. Aligning local features via pose estimation has recently been popular. For example, pose invariant embedding (PIE) aligns pedestrians to a standard pose to reduce the impact of pose variation. The Global-Local-Alignment Descriptor (GLAD) does not align pedestrians directly; instead it detects key pose points and extracts local features from the corresponding regions. SpindleNet$^{[50]}$ uses an RPN to generate several body regions and progressively merges the response maps of adjacent body regions at different stages. All of these methods need extra pose annotations and have to cope with errors introduced by pose estimation.
Mutual Learning. [49] proposes a deep mutual learning strategy in which a pool of students learn collaboratively and teach each other throughout the training process. DarkRank$^{[4]}$ introduces a new kind of knowledge for model compression and acceleration, cross-sample similarity, and achieves state-of-the-art performance. These methods apply mutual learning to classification; this paper adopts the idea for metric learning.
Re-ranking. After obtaining image features, most current methods compute similarity scores for ranking or retrieval using the L2 Euclidean distance. [35, 57, 1] use an additional re-ranking step to improve Re-ID accuracy. In particular, [57] proposes a re-ranking method with k-reciprocal encoding, which combines the original distance and the Jaccard distance.
3. Our Approach
3.1 AlignedReID
We generate a single global feature for an input image as the final output and use the L2 distance as the similarity metric. However, the global feature is learned jointly with local features in the training phase.
For each image, we use a CNN such as ResNet-50 to extract a feature map, i.e. the output of the last convolutional layer ($C \times H \times W$, $2048\times 7\times7$ in Figure 2). A global feature is obtained by applying global pooling directly on the feature map. For the local features, horizontal pooling, a global pooling in the horizontal direction, extracts a local feature for each row, and a $1\times1$ convolution reduces the channel dimension from $C$ to $c$. Each local feature (a $c$-dimensional vector) represents a horizontal part of the person image. As a result, a person image is represented by one global feature and $H$ local features.
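A minimal PyTorch sketch of these two heads; mean pooling and the local dimension `local_dim` (c = 128) are my assumptions, since the description above only fixes the $2048\times7\times7$ feature map and the $1\times1$ reduction:

```python
import torch.nn as nn
import torchvision

class AlignedFeatures(nn.Module):
    """Global + local feature heads on a ResNet-50 backbone."""
    def __init__(self, local_dim=128):  # local_dim (c) is an assumed value
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        # Keep everything up to the last conv block; drop avgpool and fc.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.reduce = nn.Conv2d(2048, local_dim, kernel_size=1)  # C -> c

    def forward(self, x):                         # x: (N, 3, 224, 224)
        fmap = self.backbone(x)                   # (N, 2048, 7, 7)
        global_feat = fmap.mean(dim=(2, 3))       # global pooling -> (N, 2048)
        stripes = fmap.mean(dim=3, keepdim=True)  # horizontal pooling -> (N, 2048, 7, 1)
        local_feat = self.reduce(stripes).squeeze(3)  # (N, c, H=7)
        return global_feat, local_feat
```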
The distance between two person images is the sum of their global and local distances. The global distance is the L2 distance between the global features. For the local distance, we dynamically match the local parts from top to bottom and find the alignment of local features with the minimum total distance. This is based on a simple assumption: for two images of the same person, the local feature of one body part in the first image is most similar to that of the corresponding body part in the second image.
Given the local features of two images, $F=\{f_1,\dots,f_H\}$ and $G=\{g_1,\dots,g_H\}$, we first normalize the distances with an element-wise transformation:

$$d_{i,j} = \frac{e^{\|f_i - g_j\|_2} - 1}{e^{\|f_i - g_j\|_2} + 1}, \quad i, j \in 1, \dots, H \tag{1}$$

where $d_{i,j}$ is the distance between the $i$-th part of the first image and the $j$-th part of the second image. These distances form a distance matrix $D$ whose $(i,j)$-th element is $d_{i,j}$. We define the local distance between the two images as the total distance of the shortest path from $(1,1)$ to $(H,H)$ in $D$, which can be solved by dynamic programming:

$$S_{i,j} = \begin{cases} d_{i,j} & i=1,\ j=1 \\ S_{i-1,j} + d_{i,j} & i \neq 1,\ j=1 \\ S_{i,j-1} + d_{i,j} & i=1,\ j \neq 1 \\ \min\left(S_{i-1,j},\ S_{i,j-1}\right) + d_{i,j} & i \neq 1,\ j \neq 1 \end{cases} \tag{2}$$

where $S_{i,j}$ is the total distance of the shortest path from $(1,1)$ to $(i,j)$ in the distance matrix $D$; $S_{H,H}$ is the local distance.
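A small NumPy sketch of this local distance (function and variable names are mine), assuming `F` and `G` are the $H$ local feature vectors of two images:

```python
import numpy as np

def local_distance(F, G):
    """Shortest-path local distance between two images' local features.

    F, G: arrays of shape (H, c), one c-dim local feature per horizontal stripe.
    """
    H = F.shape[0]
    # Element-wise normalized distances (Eq. 1): maps L2 distances into [0, 1).
    l2 = np.linalg.norm(F[:, None, :] - G[None, :, :], axis=2)  # (H, H)
    D = (np.exp(l2) - 1.0) / (np.exp(l2) + 1.0)

    # Dynamic programming for the shortest path from (0, 0) to (H-1, H-1) (Eq. 2).
    S = np.zeros((H, H))
    S[0, 0] = D[0, 0]
    for i in range(1, H):
        S[i, 0] = S[i - 1, 0] + D[i, 0]  # first column: only move down
        S[0, i] = S[0, i - 1] + D[0, i]  # first row: only move right
    for i in range(1, H):
        for j in range(1, H):
            S[i, j] = min(S[i - 1, j], S[i, j - 1]) + D[i, j]
    return S[H - 1, H - 1]
```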
As shown in Figure 3, images A and B are two samples of the same person. Alignments between corresponding body parts, such as P1 in A and P4 in B, are included in the shortest path. At the same time, alignments between non-corresponding parts, such as the two P1s of A and B, are also included. These irrelevant alignments are necessary to keep the order of the vertical alignment and to make the relevant alignments possible. An irrelevant alignment has a larger L2 distance, and its gradient in Equation 1 is close to 0, so it contributes little. The total distance of the shortest path, i.e. the local distance between the two images, is therefore determined mainly by the relevant alignments.
In the training phase, the global and local distances together define the similarity of two images, and we use the TriHard loss proposed in [13]: for each sample, according to the global distance, the least similar sample of the same identity and the most similar sample of a different identity are selected to form a triplet. There are two reasons for mining hard examples with the global distance only. First, the global distance is much faster to compute. Second, we observe no significant difference between mining with the two distances and mining with the global distance alone.
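A sketch of this batch-hard selection on a precomputed $(N, N)$ global distance matrix; the margin 0.5 matches the setting in Section 4.2, but the function itself is my own illustration:

```python
import torch

def trihard_global_mining(dist, labels, margin=0.5):
    """Batch-hard triplet (TriHard) loss on an (N, N) global distance matrix.

    For each anchor, mine the farthest same-identity sample and the
    closest different-identity sample, then apply the margin.
    """
    pos_mask = labels[:, None].eq(labels[None, :]).float()  # same identity
    d_ap = (dist * pos_mask).max(dim=1).values              # hardest positive
    d_an = (dist + pos_mask * 1e9).min(dim=1).values        # hardest negative
    return torch.relu(d_ap - d_an + margin).mean()
```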
In the inference phase, only the global feature is used to compute similarity, since this is nearly as good as using both. Two reasons may explain this counter-intuitive observation: 1) the jointly learned feature map is better than one trained with the global feature alone, because we exploit the structure of person images during training; 2) with the help of local feature matching, the global feature can pay more attention to the body of the person instead of overfitting the background.
3.2 Mutual Learning for Metric Learning
We apply mutual learning when training AlignedReID models. Distillation-based models transfer knowledge from a pre-trained large teacher network to a smaller student network, as in [4]. In this paper, we instead train a set of student models simultaneously, transferring knowledge between each other, as in [49]. [49] only applies a Kullback-Leibler (KL) divergence on the classification probabilities; we propose a new mutual learning loss for metric learning.
The training framework is shown in Figure 4. The overall loss function consists of a metric loss, a metric mutual loss, a classification loss, and a classification mutual loss. The metric loss is determined by both the global and local distances, while the metric mutual loss only involves the global distance. The classification mutual loss is the KL divergence, as in [49].
Given a batch of N images, each network extracts their global features and computes the global distances, giving an $N\times N$ batch distance matrix per network; $M_{ij}^{\theta_1}$ and $M_{ij}^{\theta_2}$ denote the $(i,j)$-th elements of the two matrices. The mutual learning loss is:

$$L_{mm} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\left(\left(M_{ij}^{\theta_1} - ZG\!\left(M_{ij}^{\theta_2}\right)\right)^2 + \left(ZG\!\left(M_{ij}^{\theta_1}\right) - M_{ij}^{\theta_2}\right)^2\right)$$

where $ZG(\cdot)$ denotes the zero gradient function, which treats its argument as a constant when computing gradients, blocking back-propagation through it during training.

With the zero gradient function applied, the gradients of the two networks become:

$$\frac{\partial L_{mm}}{\partial \theta_1} = \frac{2}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\left(M_{ij}^{\theta_1} - M_{ij}^{\theta_2}\right)\frac{\partial M_{ij}^{\theta_1}}{\partial \theta_1}, \qquad \frac{\partial L_{mm}}{\partial \theta_2} = \frac{2}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\left(M_{ij}^{\theta_2} - M_{ij}^{\theta_1}\right)\frac{\partial M_{ij}^{\theta_2}}{\partial \theta_2}$$
We find that this accelerates convergence and improves accuracy.
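In a framework like PyTorch, `detach()` plays the role of the zero gradient function; a minimal sketch of the loss above:

```python
def metric_mutual_loss(M1, M2):
    """Mutual metric loss between two networks' (N, N) distance matrices.

    detach() implements ZG: each term treats the peer network's distances
    as constants, so no gradient flows into the peer.
    """
    term_1 = ((M1 - M2.detach()) ** 2).mean()  # drives network 1 toward network 2
    term_2 = ((M1.detach() - M2) ** 2).mean()  # drives network 2 toward network 1
    return term_1 + term_2
```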
4. Experiments
4.1 Datasets
4.2 Implementation Details
We use ResNet-50 and ResNet50-Xception (ResNet-X), pre-trained on ImageNet, as base models. ResNet-X replaces each $3\times3$ convolution with an Xception cell$^{[7]}$, which contains a $3\times3$ channel-wise convolution layer and a $1\times1$ pointwise convolution layer. Each image is resized to $224\times224$ pixels. Data augmentation includes random horizontal flipping and cropping. The margin of the TriHard loss on the global and local distances is 0.5, the batch size is 160, with 4 images per identity. Each epoch contains 2000 mini-batches. We use the Adam optimizer with an initial learning rate of 0.001, decayed by a factor of 10 at epochs 80 and 160, until convergence.
For mutual learning, the weight of the classification mutual loss (KL) is 0.01 and the weight of the metric mutual loss is 0.001. We use the Adam optimizer with an initial learning rate of $3\times10^{-4}$, decayed to $10^{-4}$ and $10^{-5}$ at epochs 60 and 120.
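One way to express these two schedules in PyTorch (a sketch; a dummy module stands in for the real networks):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder for the actual network

# Baseline schedule: lr 1e-3, divided by 10 at epochs 80 and 160.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[80, 160], gamma=0.1)

# Mutual-learning schedule: 3e-4, then 1e-4 at epoch 60 and 1e-5 at epoch 120.
def mutual_lr(epoch):
    if epoch < 60:
        return 3e-4
    return 1e-4 if epoch < 120 else 1e-5

opt_m = torch.optim.Adam(model.parameters(), lr=1.0)  # LambdaLR scales this base lr
sched_m = torch.optim.lr_scheduler.LambdaLR(opt_m, lr_lambda=mutual_lr)
# sched.step() / sched_m.step() are called once per epoch.
```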
Re-ranking is an effective technique to improve Re-ID performance; we adopt the method and implementation details of [57].
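For context, the final distance in [57] combines the original distance $d$ and the Jaccard distance $d_J$ computed from k-reciprocal encodings, with $\lambda$ weighting the original distance:

$$d^{*}(p, g_i) = (1-\lambda)\, d_J(p, g_i) + \lambda\, d(p, g_i)$$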
4.3 Advantage of AlignedReID
Figure 5 shows some typical alignment results. In Figure 5(a), the detection box of the person on the right is inaccurate, causing a severe misalignment of the head. AlignedReID matches the first stripe of the left image to the first three stripes of the right image in the shortest path. Figure 5(b) is another hard case with an incomplete body: the left image does not contain the parts below the knee. In the alignment, the skirt parts of the two images are associated, while the legs in the right image contribute little to the shortest path. Figure 5(c) is an occlusion example in which the lower body is invisible. The alignment shows that the occluded parts contribute little to the shortest path, so training focuses on the other parts. Figure 5(d) shows two similar people; the different logos on their T-shirts lead to a large total shortest-path distance between the two images.
We then compare two similar networks: Baseline, without the local feature branch, and GL-Baseline, with a local feature branch but no alignment. In GL-Baseline the local loss is the sum of distances between spatially corresponding local features. All results are obtained with the same network and the same training settings; see Table 1. Compared with Baseline, GL-Baseline usually has lower accuracy, so an unaligned local branch does not help. Meanwhile, AlignedReID improves rank-1 accuracy by 3.1%~7.9% and mAP by 3.6%~10.1% on all datasets. A local feature branch with alignment helps the network focus on useful image regions and discriminate similar images with only subtle differences.
We find that also using the local distance in the inference phase improves rank-1 accuracy by roughly 0.3%~0.5%, but the extra cost is too high.
4.4 Analysis of Mutual Learning
The results are shown in Table 2.
4.5 Comparison with Other Methods
We compare with state-of-the-art methods in Tables 3 to 5. AlignedReID denotes our method with mutual learning; AlignedReID (RK) denotes our method with mutual learning plus re-ranking with k-reciprocal encoding.
5. Human Performance in Person ReID
To make the study feasible, for each query image the human subjects only had to identify the person within a much smaller gallery. The gallery used for the human evaluation is shown in Figure 6, and the results in Table 6.
Figure 7 shows some examples where the annotators chose incorrectly while our top-1 result is correct.
6. Conclusion
This paper demonstrates that implicit local feature alignment can substantially improve global feature learning. The result offers an important insight: end-to-end learning with a structural prior is far more powerful than a blind end-to-end approach.
Our method has not yet fully surpassed humans; Figure 8 shows some of our big mistakes that would rarely confuse a human.
References
[1] S. Bai, X. Bai, and Q. Tian. Scalable person reidentification on supervised smoothed manifold. arXiv preprint arXiv:1703.08359, 2017.
[2] I. B. Barbosa, M. Cristani, B. Caputo, A. Rognhaugen, and T. Theoharis. Looking beyond appearances: Synthetic training data for deep cnns in re-identification. arXiv preprint arXiv:1701.03153, 2017.
[3] W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. arXiv preprint arXiv:1704.01719, 2017.
[4] Y. Chen, N. Wang, and Z. Zhang. Darkrank: Accelerating deep metric learning via cross sample similarities transfer. arXiv preprint arXiv:1707.01220, 2017.
[5] Y.-C. Chen, X. Zhu, W.-S. Zheng, and J.-H. Lai. Person reidentification by camera correlation aware feature augmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[6] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1335–1344, 2016.
[7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
[8] H. Fan, L. Zheng, and Y. Yang. Unsupervised person reidentification: Clustering and fine-tuning. arXiv preprint arXiv:1705.10444, 2017.
[9] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2360–2367. IEEE, 2010.
[10] M. Geng, Y. Wang, T. Xiang, and Y. Tian. Deep transfer learning for person re-identification. arXiv preprint arXiv:1611.05244, 2016.
[11] O. Hamdoun, F. Moutarde, B. Stanciulescu, and B. Steux. Person re-identification in multi-camera system by signature based on interest point descriptors collected on short video sequences. In Distributed Smart Cameras, 2008. ICDSC 2008. Second ACM/IEEE International Conference on, pages 1–6. IEEE, 2008.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[13] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[14] W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 152–159, 2014.
[15] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2197–2206, 2015.
[16] Y. Lin, L. Zheng, Z. Zheng, Y. Wu, and Y. Yang. Improving person re-identification by attribute and identity learning. arXiv preprint arXiv:1703.07220, 2017.
[17] H. Liu, J. Feng, Z. Jie, K. Jayashree, B. Zhao, M. Qi, J. Jiang, and S. Yan. Neural person search machines. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[18] H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan. End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing, 2017.
[19] H. Liu, Z. Jie, K. Jayashree, M. Qi, J. Jiang, S. Yan, and J. Feng. Video-based person re-identification with accumulative motion context. arXiv preprint arXiv:1701.00193, 2017.
[20] X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, and X. Wang. Hydraplus-net: Attentive deep features for pedestrian analysis. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[21] Y. Liu, J. Yan, and W. Ouyang. Quality aware network for set to set recognition. arXiv preprint arXiv:1704.03373, 2017.
[22] X. Ma, X. Zhu, S. Gong, X. Xie, J. Hu, K.-M. Lam, and Y. Zhong. Person re-identification by unsupervised video matching. Pattern Recognition, 65:197–210, 2017.
[23] N. Martinel, A. Das, C. Micheloni, and A. K. RoyChowdhury. Temporal model adaptation for person reidentification. In European Conference on Computer Vision, pages 858–877. Springer, 2016.
[24] T. Matsukawa and E. Suzuki. Person re-identification using cnn features learned from combination of attributes. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pages 2428–2433. IEEE, 2016.
[25] N. McLaughlin, J. Martinez del Rincon, and P. Miller. Recurrent convolutional network for video-based person reidentification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1325– 1334, 2016.
[26] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian. Unsupervised cross-dataset transfer learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1306–1315, 2016.
[27] F. Radenović, G. Tolias, and O. Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In European Conference on Computer Vision, pages 3–20. Springer, 2016.
[28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[29] A. Schumann, S. Gong, and T. Schuchert. Deep learning prototype domains for person re-identification. arXiv preprint arXiv:1610.05047, 2016.
[30] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. MARS: A video benchmark for large-scale person re-identification. In European Conference on Computer Vision. Springer, 2016.
[31] Y. T. Tesfaye, E. Zemene, A. Prati, M. Pelillo, and M. Shah. Multi-target tracking in multiple non-overlapping cameras using constrained dominant sets. arXiv preprint arXiv:1706.06196, 2017.
[32] R. R. Varior, M. Haloi, and G. Wang. Gated siamese convolutional neural network architecture for human reidentification. In European Conference on Computer Vision, pages 791–808. Springer, 2016.
[33] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siamese long short-term memory architecture for human reidentification. In European Conference on Computer Vision, pages 135–153. Springer, 2016.
[34] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siamese long short-term memory architecture for human reidentification. In European Conference on Computer Vision, pages 135–153, 2016.
[35] J. Wang, S. Zhou, J. Wang, and Q. Hou. Deep ranking model by large adaptive margin learning for person reidentification. arXiv preprint arXiv:1707.00409, 2017.
[36] T. Wang, S. Gong, X. Zhu, and S. Wang. Person reidentification by discriminative selection in video ranking. IEEE transactions on pattern analysis and machine intelligence, 38(12):2501–2514, 2016.
[37] L. Wei, S. Zhang, H. Yao, W. Gao, and Q. Tian. Glad: Global-local-alignment descriptor for pedestrian retrieval. arXiv preprint arXiv:1709.04329, 2017.
[38] Q. Xiao, K. Cao, H. Chen, F. Peng, and C. Zhang. Cross domain knowledge transfer for person re-identification. arXiv preprint arXiv:1611.06026, 2016.
[39] Q. Xiao, H. Luo, and C. Zhang. Margin sample mining loss: A deep learning based method for person re-identification. arXiv preprint arXiv:1710.00478, 2017.
[40] T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1249– 1258, 2016.
[41] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. Endto-end deep learning for person search. arXiv preprint arXiv:1604.01850, 2016.
[42] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. Joint detection and identification feature learning for person search. In Proc. CVPR, 2017.
[43] H. Yao, S. Zhang, Y. Zhang, J. Li, and Q. Tian. Deep representation learning with part loss for person re-identification. arXiv preprint arXiv:1707.00798, 2017.
[44] J. You, A. Wu, X. Li, and W.-S. Zheng. Top-push videobased person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1345–1353, 2016.
[45] D. Zhang, W. Wu, H. Cheng, R. Zhang, Z. Dong, and Z. Cai. Image-to-video person re-identification with temporally memorized similarity learning. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[46] L. Zhang, T. Xiang, and S. Gong. Learning a discriminative null space for person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[47] L. Zhang, T. Xiang, and S. Gong. Learning a discriminative null space for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1239–1248, 2016.
[48] W. Zhang, S. Hu, and K. Liu. Learning compact appearance representation for video-based person re-identification. arXiv preprint arXiv:1702.06294, 2017.
[49] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. arXiv preprint arXiv:1706.00384, 2017.
[50] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang. Spindle net: Person re-identification with human body region guided feature decomposition and fusion. CVPR, 2017.
[51] R. Zhao, W. Oyang, and X. Wang. Person re-identification by saliency learning. IEEE transactions on pattern analysis and machine intelligence, 39(2):356–370, 2017.
[52] L. Zheng, Y. Huang, H. Lu, and Y. Yang. Pose invariant embedding for deep person re-identification. arXiv preprint arXiv:1701.07732, 2017.
[53] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In Computer Vision, IEEE International Conference, 2015.
[54] L. Zheng, Y. Yang, and A. G. Hauptmann. Person reidentification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
[55] Z. Zheng, L. Zheng, and Y. Yang. A discriminatively learned cnn embedding for person re-identification. arXiv preprint arXiv:1611.05666, 2016.
[56] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[57] Z. Zhong, L. Zheng, D. Cao, and S. Li. Re-ranking person re-identification with k-reciprocal encoding. arXiv preprint arXiv:1701.08398, 2017.
[58] Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan. See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.