NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID:765
Title:Learning a Metric Embedding for Face Recognition using the Multibatch Method

Reviewer 1

Summary

The paper develops a deep neural network for face recognition. It presents a unified network that first aligns faces and then generates an embedding for each face image, trained in a supervised fashion by enforcing a margin between samples from different classes while minimizing the distance between samples of the same class. According to the authors, the major contribution is the multi-batch sampler, which lets all possible pairs within a mini-batch participate in that mini-batch's loss. The authors also place an alignment network before extracting the face signature used for comparison. The architecture is Network-in-Network (NIN), and data collection is carried out by crawling the internet.
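As a sketch of the sampling scheme this summary describes, the following is a minimal NumPy illustration of an all-pairs margin loss over a mini-batch. This is an assumption-laden toy, not the authors' code: the margin value, embedding dimension, and exact loss form are illustrative choices.

```python
import numpy as np

def multibatch_pairwise_loss(embeddings, labels, margin=1.0):
    """Hinge-style loss over all ordered pairs in a mini-batch.

    Same-class pairs are pulled together, different-class pairs are
    pushed apart by at least `margin` (illustrative form only, not
    the paper's exact formulation).
    """
    k = embeddings.shape[0]
    # Squared Euclidean distance between every pair of embeddings.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist2 = (diff ** 2).sum(-1)                    # (k, k)
    same = labels[:, None] == labels[None, :]      # (k, k) bool
    off_diag = ~np.eye(k, dtype=bool)
    pos = dist2[same & off_diag]                   # pull together
    neg = np.maximum(0.0, margin - dist2[~same])   # push apart
    # Every one of the k*(k-1) pairs contributes, not just
    # pre-defined pairs as in standard pair sampling.
    return (pos.sum() + neg.sum()) / (k * (k - 1))

rng = np.random.default_rng(0)
emb = rng.normal(size=(16, 8))      # 16 samples, 8-dim embeddings
lab = np.repeat(np.arange(4), 4)    # 4 classes, 4 samples each
loss = multibatch_pairwise_loss(emb, lab)
```

The key point is that the O(k) forward passes are reused across O(k^2) loss terms, which is the "simple software change" discussed below.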

Qualitative Assessment

The major claim of the paper, i.e. the use of the multi-batch, is actually a very simple software change to any deep learning package such as Caffe: it fully utilizes a mini-batch by considering all possible image pairs instead of pre-defined image pairs. This contribution is not novel enough for NIPS. It is common knowledge that more "loss samples" per mini-batch will provide a smoother and better gradient for almost any task. The experiments are poorly designed, and the claims regarding accuracy are not substantially supported. It would have been better to show the individual effects of the network architecture, the multi-batch, and the warping net on the same dataset in an ablation study against standard published work such as [11, 15, 10], to tease out the performance gain from each component. Everything clubbed together makes it difficult to judge the impact of the claimed contributions. How about comparing with a triplet-loss network, as described in Google's FaceNet paper, both with linear pairing, i.e. k/3 <anchor, positive, negative> triplets per mini-batch, and with all possible triplets per mini-batch, i.e. C * (S*S - S) <anchor, positive, negative> triplets per mini-batch, where C is the number of classes and S the number of samples per class (in this paper, C = 16 and S = 16)? Google's FaceNet paper did not spell out this seemingly insignificant detail of sampling the triplets in the optimal fashion, and some follow-up works, very unfortunately, left the impression that the triplets were sampled linearly.
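To make the counting in this comment concrete, the arithmetic for the C = 16, S = 16 setting can be checked in a few lines (variable names are ours; the formulas are the ones quoted in the review):

```python
# Classes and samples per class, as stated in the review.
C, S = 16, 16
k = C * S                         # mini-batch size: 256 samples

linear_triplets = k // 3          # "linear pairing": one triplet per 3 samples
dense_triplets = C * (S * S - S)  # the review's C * (S*S - S) count
all_pairs = k * (k - 1) // 2      # unordered pairs available to a pair loss
```

So the same 256-image mini-batch yields roughly 85 triplets under linear pairing versus 3840 under the dense scheme the reviewer describes, which is the gap the comment is pointing at.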

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 2

Summary

This paper presents a training method, called Multibatch, for similarity learning. The Multibatch method estimates the full gradient using just a subset of images. The Multibatch estimator has a smaller variance than the standard gradient estimator, which makes optimization of the loss function by stochastic gradient descent significantly faster. The authors test the approach by training a deep CNN and evaluating on the Labeled Faces in the Wild benchmark, showing the advantage of Multibatch over using pairs.

Qualitative Assessment

The Multibatch method is well presented, well motivated, and clearly described. The formulation seems correct and the results are convincing. I think the idea has the potential to be applied to other problems beyond face recognition; it would be nice to see some experiments in other domains, which I think would strengthen the paper. I did not find a reference to Figure 1 in the introduction. It is confusing to see this figure at the beginning, because the introduction focuses on the loss function and does not mention the whole pipeline (including the alignment, etc.). A description of Figure 1 in the introduction would help the reader understand the final application (as far as I can see, Figure 1 is not mentioned until Section 4). Furthermore, I recommend that the authors include in this figure some details (as text inside the figure) regarding the network parameters, and increase the font size.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 3

Summary

The authors present a novel method for metric learning called multibatch. They apply the new method to the problem of face verification and observe significant improvements both in speed (training time) and in verification accuracy.

Qualitative Assessment

I liked the paper. It was well presented and excellent in every aspect. The fact that a near-state-of-the-art face verification network can be learnt in 12 hours is fairly impressive. I am wondering whether the authors are planning to make their implementation and data publicly available upon acceptance of the paper. Moreover, multibatch learning is applicable to a variety of problems and could be of significance for the community (e.g., [3, 4]).

A few comments:
- The experimental results could be even stronger with the inclusion of the triplet loss as another baseline.
- The joint optimization of face alignment and face verification in a single network was a nice touch. It would have been an interesting experiment to evaluate quantitatively how well the frontalization module performs on a dataset with landmark annotations, for which the optimal transform is known.
- The authors mention that existing face datasets are not sufficient for the problem of face verification. To my knowledge there are a few major datasets, such as CASIA [1] and Janus [2], which contain a large amount of annotated images that the authors could take advantage of.

Minor issues:
- l. 111: Eqn. 3 -> Eq. 3

[1] http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html
[2] Klare, Brendan F., et al. "Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A." 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015.
[3] Wohlhart, Paul, and Vincent Lepetit. "Learning descriptors for object recognition and 3D pose estimation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[4] Wan, Ji, et al. "Deep learning for content-based image retrieval: A comprehensive study." Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 4

Summary

Overall, this paper is about pairwise metric learning with an application to face recognition. The main idea: instead of inducing a gradient from each pair separately, why not induce a gradient from all samples in the minibatch at the same time? That way, we pay O(n) computation for a minibatch of size n but get gradient signal from O(n^2) label pairs in each minibatch. This is a really clever idea. Most of us are trained to think that the samples within each minibatch should be independent and should contribute independent terms to the loss function. But in reality, comparing all results within each minibatch to each other reduces gradient variance and thereby speeds up training.

Contributions:
- Their minibatch learning method, which can train a close-to-state-of-the-art face recognizer "overnight on a Titan X";
- A proof that their minibatch learning creates an unbiased gradient estimate that approximates the gradient over the entire training set with lower variance: O(1/n^2) rather than O(1/n) for standard minibatch training;
- An interesting proof showing that metric learning is "harder" than a comparable multiclass logistic regression labeling problem.
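The variance gap described in this summary can be illustrated with a toy construction of our own (not the paper's setting): a zero-mean pairwise statistic h(x, y) = x*y, for which averaging over all pairs in a batch achieves far lower variance than averaging over n/2 disjoint pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 64, 2000
xs = rng.normal(size=(trials, n))

# Disjoint pairs: n/2 independent terms x_0*x_1, x_2*x_3, ...
# This is the "standard minibatch" style estimator, variance O(1/n).
disjoint = (xs[:, 0::2] * xs[:, 1::2]).mean(axis=1)

# All ordered pairs i != j, via the identity
# sum_{i != j} x_i x_j = (sum x)^2 - sum x^2.
# For this degenerate statistic the variance falls as O(1/n^2).
s = xs.sum(axis=1)
all_pairs = (s ** 2 - (xs ** 2).sum(axis=1)) / (n * (n - 1))

print(disjoint.std(), all_pairs.std())  # all-pairs is markedly smaller
```

Both estimators are unbiased for the true value 0; only their spread differs, which is the multibatch argument in miniature.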

Qualitative Assessment

I really like this paper. It feels like everyone doing pairwise metric learning should have been doing this from the start. The paper itself is reasonably clear and easy to understand, and there are many contributions in this work. Most of my complaints are comparatively minor:
- I'm not convinced by the artificial examples in Figs. 2 and 3, which try to show that metric learning is harder than multiclass learning. I bet I could draw up toy figures that show the opposite. How often do the depicted configurations occur in actual cases? In 2D it is easy to box away separate points that should be together, but please keep in mind that these feature spaces are high dimensional, so there are exponentially many paths a point could take around obstacles to improve the loss. One experiment to show this could be to analyze how often bad gradients occur in a partially-trained feature representation, e.g. some sentence like "These examples actually occur in real-world cases: we found that for X% of labeled pairs, moving the points by epsilon toward their class centroids would increase the loss." This would show just how prevalent these obstacles are; I bet they are rare enough to make the proposed configurations unlikely.
- I also have a question about the dataset: will the authors release URLs to the images of the expanded set? Some researchers believe that enough data allows even inferior methods to achieve competitive performance.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 5

Summary

This paper proposes a multibatch method for deep neural networks, with which the authors claim to achieve faster convergence and better performance on the face verification task.

Qualitative Assessment

This paper proposes a multibatch method for deep neural networks, which achieves faster convergence and better performance on the face verification task. The motivation is good and clear. However, the experiments are insufficient and unconvincing, for the following reasons.
1) Since the authors use a self-collected dataset yet compare their results with [11][15][10], the experiments cannot demonstrate the effectiveness of the proposed method. At least one more experiment is needed: train the CNN on the VGG Face dataset (released before the NIPS 2016 submission deadline) and test on LFW.
2) The proposed loss function is similar to the contrastive loss and the triplet loss. It would be more convincing if the authors performed fair comparisons (same training data and networks, different loss functions) and provided some analysis.
3) This paper reports two results on LFW: 98.2% at 30 ms and 98.8% at 6.6 s. I notice that the latter is inferior to [11] in both speed and performance (as shown in Tab. 1). At this point, the contribution of this paper is very limited.
4) The convergence experiment (Figure 4) fails to show faster convergence (fewer iterations or less training time). More detailed experiments would be preferred.
5) Experiments are performed only on LFW. Additional experiments on other datasets (e.g. YouTube Faces, MegaFace, IJB-A) would greatly strengthen this paper.
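For reference, the two baselines this review asks to compare against can be written down in a few lines. These are the standard textbook forms with illustrative margin values, not the paper's hyper-parameters:

```python
def contrastive_loss(d, same, margin=1.0):
    """Contrastive loss for a single pair with embedding distance d;
    `same` is True for a same-identity pair (standard form)."""
    return d ** 2 if same else max(0.0, margin - d) ** 2

def triplet_loss(d_ap, d_an, margin=0.2):
    """Triplet loss: push the anchor-positive distance d_ap below the
    anchor-negative distance d_an by at least `margin`."""
    return max(0.0, d_ap ** 2 - d_an ** 2 + margin)

# A well-separated triplet incurs zero loss; a distant same-identity
# pair is penalized quadratically by the contrastive term.
easy = triplet_loss(0.1, 1.0)      # anchor far from negative -> 0
bad_pair = contrastive_loss(0.5, same=True)
```

A fair comparison in the sense of point 2) would swap only these functions while holding the training data, network, and optimizer fixed.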

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 6

Summary

The paper addresses an important engineering challenge: developing a metric learning protocol that is computationally cheap while providing near-state-of-the-art performance. The task the paper focuses on is face recognition. The paper proposes sampling a dense set of losses to backpropagate; these losses are pairwise among k samples from different classes. The method is named multibatch, and the paper shows that the variance of the gradient estimate decreases as the square of the number of samples, leading to faster convergence. The proposed architecture jointly aligns the faces and extracts a discriminative signature. On the theoretical side, the paper presents an interesting view of metric learning as a harder problem. The multibatch method is a practical way to get SGD to converge more quickly on the loss.

Qualitative Assessment

Major comments:
1) This reviewer's main concern is that although impressive performance is showcased on the very competitive LFW protocol, at least one or two more databases might be required to demonstrate that the protocol can scale up. Given that the practical contribution is primarily the multibatch training scheme, more empirical evidence might be required. This reviewer imagines the results would generalize to other datasets, but given the paper's strong focus on face recognition specifically, perhaps one or two more dataset results are needed. This is the only major reason for the score of 2 on technical quality.
2) As with most deep learning studies, this reviewer feels that the paper lacks a bit of intuition as to why a certain architecture worked. The final architecture employs a Network-in-Network model; however, given that the paper aims to minimize model and computational complexity, it is a bit confusing (perhaps I'm missing something) why the final model still seems more complex than a standard CNN. It would be informative to know which simpler models in fact did not work. I know the paper mentions a VGG model that performed marginally better. This need not be in much detail, but any discussion of that sort would give the study more completeness and structure.
Minor comments:
- There are errors in the paper, e.g. the Fig. 4 caption ("LFW"). A thorough proofread is suggested.
- Fig. 2 is hard to read and could be made bigger.
- The experimental section could be given more structure, with small headings for each aspect of the experiments; this would improve readability compared to a big block of text.
- Is the architecture in Table 2 shown in the order in which it is applied to the input? The exact ordering would be more informative than a simple listing for parameter-counting purposes.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)