NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Reviewer 1
In this paper, the authors proposed Fast Low-Rank Metric Learning (FLRML) and M-FLRML for large-scale and high-dimensional metric learning problem, by employing the low rank constraints and the stochastic learning methods to reduce the computational complexity of the method. I think this paper is well prepared. I have the following comments. (1) In my opinion, it is a common way to process the large-scale and high dimensional data by using low-rank constraints and stochastic learning strategy for metric learning. Besides, the main idea of FLRML is to replace Y by $BV^T$ and is very similar with anchor-based strategy [1]. Please explain the difference between them and point out the main contribution of this paper. (2) Is there existing any theoretical results of the process of using $BV^T$ to replace Y, which ensures the performance of accelerated low-rank metric learning? (3) For the stochastic metric learning methods, there are some recent methods, which are not summarized in the related works of this paper, such as [2] and [3]. Meanwhile, what the differences between FLRML and OPML[3] and the method in [2]? [1] Liu, Wei, Junfeng He, and Shih-Fu Chang. "Large graph construction for scalable semi-supervised learning." In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 679-686. 2010. [2] Qian, Qi, Rong Jin, Jinfeng Yi, Lijun Zhang, and Shenghuo Zhu. "Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD)." Machine Learning 99, no. 3 (2015): 353-372. [3] Li, Wenbin, Yang Gao, Lei Wang, Luping Zhou, Jing Huo, and Yinghuan Shi. "OPML: A one-pass closed-form solution for online metric learning." Pattern Recognition 75 (2018): 302-314.
Reviewer 2
Low-rank metric learning optimizes a metric matrix subject to low-rank constraints, preserving the intrinsic low-rank structure of the data. However, it still encounters scalability problem when handling large data. This work gives a new formulation that learns the low-rank cosine similarity metric by embedding the triplet constraints into a matrix to further reduce the complexity and the size of involved matrices. The idea of embedding the evaluation of loss functions into matrices is interesting. For Stiefel manifolds, rather than following the projection and retraction convention, it adopts the optimization algorithm proposed by Wen et al. (Ref. [3]). Generally, this paper is well-written with promising results. Here are some concerns: 1) The upper bound is set to 3000, which means the dimension $r$ is truncated regardless of the intrinsic value of very large and high-dimensional data. Is there any theoretical analysis or just an empirical value? 2) To make the gradient of $f(P)$ continuous, the smoothing function $\mu(x)$ is adopted for $max(0, x)$, what about the approximation loss between them? If not using such smoothing, how to optimize the problem in Eq. (6)? 3) The SVD pre-processing still needs expensive cost which cannot be avoided. It proposed mini-batch version which requires $O(D{n_I}^2 + Ddn_I d)$, i. e., calculating a descent direction from each mini-batch of data and updating the transformation matrix $L$ at a decreasing ratio. It would be nice to provide some theoretical analysis of this strategy. 4) Some parameter sensitivity examinations are expected, e.g., the rank constraint for $M$ is set to $d$=100, the number of triplet constraints, and the size of mini-batch. Minor issue: Line 93, the parameter "m" which is about the margin is undefined and some analysis is required here. ================ In the rebuttal, authors have well addressed most of the raised points. I vote for acceptance.
Reviewer 3
The paper is well written and structured. It tackles a relevant topic. The problem is well formulated, the method is exposed clearly and the simulation studies make a good case for the method.