Sun, Dec 8th through Sat, Dec 14th, 2019, at the Vancouver Convention Center
This paper considers the problem of Non-linear Inductive Matrix Completion (NIMC) in a deep learning formulation. In NIMC one is given a query set, an item set, and a few query-item relevance values, and the goal is to learn the query-item relevance function. The main contribution of the paper is to provide theoretical guarantees for using a one-hidden-layer network to estimate that function via an L2 loss. In particular, this can be thought of as two one-layer networks, one learning the embedding of the queries and the other the embedding of the items, while the relevance function is taken to be the inner product of the outputs of the two networks. The authors prove that for sigmoid and tanh activation functions the objective function is locally strongly convex around the global optimum and that stochastic gradient descent converges linearly if initialized sufficiently well. Interestingly, the authors show that even though ReLU is harder to analyze, local strong convexity is also possible if one suitably removes some redundancy from the objective function.

The paper is mostly well written (especially the first half), with a well-informed literature review, and features significant and novel results. Technical correctness was also checked to the best of the reviewer's ability. However, there are some weaknesses that the authors should address:

1. Even though the analysis contains sufficiently novel elements, it is quite similar in structure to some of the papers already appearing in the references. I believe it would be fruitful for the community if the authors acknowledged this and described very carefully what is standard methodology and what is novel. In that spirit, there is a lot of room for improvement in Section 3.3.

2. Along the same lines, I find the writing in Section 3.4 unsatisfactory. It is too cryptic, and it is not essential for a short paper: this type of initialization has already been discussed in references appearing in the paper, and a short mention that it can be applied should be enough. Hence I suggest reducing Section 3.4 to a short sentence.

3. Instead, more experiments are needed: (a) please show the convergence rates for sigmoid and tanh; (b) please compare results using "good" and random initialization; and (c) please show some more realistic experiments with higher values of d, k, n and real data.

4. Please clarify: removing the redundancy from the objective function requires some knowledge about the optimal solution, correct? If so, what is the applicability of this result?
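To make the two-network formulation discussed above concrete, here is a minimal numpy sketch of the query/item "two-tower" model with sigmoid activations; the names U, V, x, y, and phi follow the informal description in the review, and the dimensions are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relevance(U, V, x, y, phi=sigmoid):
    # Query embedding phi(Ux) and item embedding phi(Vy);
    # the predicted relevance is their inner product.
    return phi(U @ x) @ phi(V @ y)

rng = np.random.default_rng(0)
d, k = 8, 4                       # input dimension d, hidden width k (illustrative)
U = rng.standard_normal((k, d))   # query-side parameters
V = rng.standard_normal((k, d))   # item-side parameters
x = rng.standard_normal(d)        # query feature vector
y = rng.standard_normal(d)        # item feature vector

r = relevance(U, V, x, y)
# With sigmoid activations each embedding lies in (0,1)^k,
# so the relevance score is bounded in (0, k).
assert 0.0 < r < k
```

Each observed query-item pair contributes a squared-error term between this score and the given relevance value, which is the L2 objective the paper analyzes.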
In my opinion the paper is very well written. I particularly like that the authors explain their theoretical results in intuitive terms. The paper provides theoretical recovery guarantees for non-linear inductive matrix completion (NIMC) for several activation functions, such as sigmoid and ReLU. They show that this non-convex problem behaves as a convex one near the optimum, and give initialization conditions under which one "falls" near the optimum. As the authors explain, NIMC is tightly related to a one-hidden-layer neural network. Providing theoretical guarantees for this type of problem is indeed a fundamental question, and arguably one of the most important in our field these days. However, the real problem of interest lies in deep networks, as opposed to the single-layer case. In fact, single layers are rather easier to analyze, and have in fact been analyzed quite a bit (e.g., the paper's own references: Zhong et al., Recovery Guarantees for One-hidden-layer Neural Networks, ICML 2017; Li and Yuan, Convergence Analysis of Two-layer Neural Networks with ReLU Activation, NeurIPS 2017; and more). Hence, while the paper does provide new theory for NIMC, I believe the authors are overselling the importance/novelty of their contribution.
The paper considers recommendation systems, where each user/item is represented by a d-dimensional vector x/y. The goal is to predict the user-item relevance score, which is modeled as dot(Phi(Ux), Phi(Vy)). This can be considered a one-hidden-layer network with parameters U and V and activation function Phi. The paper analyzes the squared-loss objective of this model for sigmoid and ReLU activations: it studies the local geometry of the loss function and shows that, close to the optimum, the function is strongly convex.

- My main concern is the practical applicability of this approach. In Table 1, the RMSE results of NIMC and IMC are shown on the MovieLens datasets. These RMSEs are very large for these datasets; simple SVD-based methods are able to achieve an RMSE of about 0.95 on ml-100k. So the first question is: why do both NIMC and IMC perform so poorly on these two datasets?

- Another concern is the generalization capability of recommendation systems. It is well known in the literature that most methods suffer from overfitting, since the rating matrix is sparse, so most methods stop training long before getting close to the local/global solutions that would reconstruct the observed matrix almost perfectly. The paper shows that for a simple one-hidden-layer network, close to the optimum, the loss is strongly convex. But, considering these facts about generalization, do the authors think convexity around the optimal solution is helpful in understanding why neural-network-based methods perform better than linear ones?
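The squared-loss objective described above admits a simple closed-form gradient, which is what SGD in the paper's setting would follow. The sketch below is an illustration under assumed notation (sigmoid activation, a single observed score s; the helper names loss and grad_U are mine, not the paper's), with the analytic gradient checked against a finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(U, V, x, y, s):
    # Squared loss for one observed relevance value s:
    # (phi(Ux)^T phi(Vy) - s)^2 with phi = sigmoid.
    r = sigmoid(U @ x) @ sigmoid(V @ y)
    return (r - s) ** 2

def grad_U(U, V, x, y, s):
    # Gradient of the loss with respect to U, using phi' = phi (1 - phi):
    # d loss / dU = 2 (r - s) * (phi'(Ux) * phi(Vy)) x^T  (elementwise product, then outer product)
    a, b = sigmoid(U @ x), sigmoid(V @ y)
    r = a @ b
    return 2.0 * (r - s) * np.outer(a * (1 - a) * b, x)

rng = np.random.default_rng(1)
d, k = 6, 3                                   # illustrative dimensions
U, V = rng.standard_normal((k, d)), rng.standard_normal((k, d))
x, y, s = rng.standard_normal(d), rng.standard_normal(d), 0.5

# Verify the analytic gradient entry (0, 0) with a central finite difference.
G = grad_U(U, V, x, y, s)
eps = 1e-6
E = np.zeros_like(U)
E[0, 0] = eps
num = (loss(U + E, V, x, y, s) - loss(U - E, V, x, y, s)) / (2 * eps)
assert abs(G[0, 0] - num) < 1e-6
```

An SGD step would then update U (and, symmetrically, V) in the direction of the negative gradient averaged over a minibatch of observed query-item pairs.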