{"title": "Online Learning in The Manifold of Low-Rank Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 2128, "page_last": 2136, "abstract": "When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches for minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is hard to compute, and so is the projection operator that approximates it, we describe another second-order retraction that can be computed efficiently, with run time and memory complexity of O((n+m)k) for a rank-k matrix of dimension m x n, given rank one gradients. We use this algorithm, LORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. LORETA improves the mean average precision over a passive- aggressive approach in a factorized model, and also improves over a full model trained over pre-selected features using the same memory requirements. LORETA also showed consistent improvement over standard methods in a large (1600 classes) multi-label image classification task.", "full_text": "Online Learning in the Manifold of\n\nLow-Rank Matrices\n\nUri Shalit\u2217, Daphna Weinshall\nComputer Science Dept. 
and ICNC
The Hebrew University of Jerusalem
uri.shalit@mail.huji.ac.il, daphna@cs.huji.ac.il

Gal Chechik
Google Research and
The Gonda Brain Research Center, Bar Ilan University
gal@google.com

Abstract

When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches for minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is hard to compute, and so is the projection operator that approximates it, we describe another second-order retraction that can be computed efficiently, with run time and memory complexity of O((n + m)k) for a rank-k matrix of dimension m × n, given rank-one gradients. We use this algorithm, LORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. LORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained over pre-selected features using the same memory requirements. LORETA also showed consistent improvement over standard methods in a large (1600 classes) multi-label image classification task.

1 Introduction

Many learning problems involve models represented in matrix form.
These include metric learning, collaborative filtering, and multi-task learning where all tasks operate over the same set of features. In many of these models, a natural way to regularize the model is to limit the rank of the corresponding matrix. In metric learning, a low rank constraint allows learning a low dimensional representation of the data in a discriminative way. In multi-task problems, low rank constraints provide a way to tie together different tasks. In all cases, low-rank matrices can be represented in a factorized form that dramatically reduces the memory and run-time complexity of learning and inference with that model. Low-rank matrix models could therefore scale to handle substantially more features and classes than full rank dense matrices.

As with many other problems, the rank constraint is non-convex, and in the general case, minimizing a convex function subject to a rank constraint is NP-hard [1]¹. As a result, two main approaches have been commonly used. First, a matrix W ∈ R^{n×m} of rank k is sometimes represented as a product of two low dimension matrices, W = AB^T, A ∈ R^{n×k}, B ∈ R^{m×k}, and simple gradient descent techniques are applied to each of the product terms separately [3]. Second, projected gradient algorithms can be applied by repeatedly taking a gradient step and projecting back to the manifold of low-rank matrices. Unfortunately, computing the projection to that manifold becomes prohibitively costly for large matrices and cannot be computed after every gradient step.

∗also at the Gonda Brain Research Center, Bar Ilan University
¹Some special cases are solvable (notably, PCA), relying mainly on singular value decomposition [2] and semi-definite programming techniques. These methods scale poorly to large scale tasks.

Figure 1: A two step procedure for computing a retracted gradient.
The first step computes the Riemannian gradient ξ (the projection of the gradient onto the tangent space T_x M_k^{n,m}), yielding x_{t+1/2} = x_t + η_t ∇L(x_t). The second step computes the retraction onto the manifold, x_{t+1} = R_x(ξ_t).

In this paper we propose new algorithms for online learning on the manifold of low-rank matrices, which are based on an operation called retraction. Retractions are operators that map from a vector space that is tangent to the manifold, into the manifold. They include the projection operator as a special case, but also include other retractions that can be computed dramatically more efficiently. We use second order retractions to develop LORETA, an online algorithm for learning low rank matrices. It has a memory and run time complexity of O((n + m)k) when the gradients have rank one, a case which is relevant to numerous online learning problems as we show below.

We test Loreta in two different domains and learning tasks. First, we learn a bilinear similarity measure among pairs of text documents, where the number of features (text terms) representing each document could become very large. Loreta performed better than other techniques that operate on a factorized model, and also improved retrieval precision by 33% as compared with training a full rank model over the pre-selected most informative features, using a comparable memory footprint. Second, we applied Loreta to image multi-label ranking, a problem in which the number of classes could grow to millions. Loreta significantly improved over full rank models, using a fraction of the memory required. These two experiments suggest that low-rank optimization could become very useful for learning in high-dimensional problems.

This paper is organized as follows. We start with an introduction to optimization on manifolds, describing the notion of retractions.
We then derive our low-rank online learning algorithm, and test it in two applications: learning similarity of text documents, and multi-label ranking for images.

2 Optimization on Riemannian manifolds

The field of numerical optimization on smooth manifolds has advanced significantly in the past few years. We start with a short introduction to embedded manifolds, which are the focus of this paper. An embedded manifold is a smooth subset of an ambient space R^n. For instance, the set {x : ‖x‖_2 = 1, x ∈ R^n}, the unit sphere, is an (n − 1)-dimensional manifold embedded in the n-dimensional space R^n. Here we focus on the manifold of low-rank matrices, namely, the set of n × m matrices of rank k, where k < m, n. It is an (n + m)k − k^2 dimensional manifold embedded in R^{n×m}, which we denote M_k^{n,m}. Embedded manifolds inherit many properties from the ambient space, a fact which simplifies their analysis. For example, the Riemannian metric for embedded manifolds is simply the Euclidean metric restricted to the manifold.

Motivated by online learning, we focus here on developing a stochastic gradient descent procedure to minimize a loss function L over the manifold of low-rank matrices M_k^{n,m}:

  min_x L(x)   s.t.   x ∈ M_k^{n,m} .      (1)

To illustrate the challenge in this problem, consider a simple stochastic gradient descent algorithm (Fig. 1). At every step t of the algorithm, a gradient step update takes x_{t+1/2} outside of the manifold M and has to be mapped back onto the manifold. The most common mapping operation is the projection operation, which, given a point x_{t+1/2} outside the manifold, would find the closest point in M. Unfortunately, the projection operation is very expensive to compute for the manifold of low rank matrices, since it basically involves a singular value decomposition.
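For concreteness, the costly projection just described can be sketched in a few lines of numpy: the closest rank-k matrix is obtained by truncating the SVD (the Eckart-Young theorem). This is a minimal illustrative sketch, not the paper's algorithm; the function name is ours.

```python
import numpy as np

def project_rank_k(X, k):
    """Project X onto the manifold of rank-k matrices by keeping the k
    largest singular values (Eckart-Young). This is the expensive baseline:
    a full SVD of the n x m iterate is needed at every gradient step."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```

Note that the SVD cost is what makes repeating this after every online update prohibitive for large matrices, which motivates the cheaper retractions below.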
Here we describe a wider class of operations called retractions, that serve a similar purpose: they find a point on the manifold that is in the direction of the gradient. Importantly, we describe a specific retraction that can be computed efficiently. Its runtime complexity depends on 4 quantities: the model matrix dimensions m and n; its rank k; and the rank of the gradient matrix, r. The overall complexity is O((n + m)(k + r)^2), and O((n + m)k) for rank-one gradients, which are a very common case.

To explain how retractions are computed, we first describe the notion of a tangent space and the Riemannian gradient of a function on a manifold.

Riemannian gradient and the tangent space

Each point x in an embedded manifold M has a tangent space associated with it, denoted T_xM (see Fig. 1). The tangent space is a vector space of the same dimension as the manifold that can be identified in a natural way with a linear subspace of the ambient space. It is usually simple to compute the linear projection P_x of any point in the ambient space onto the tangent space T_xM. Given a manifold M and a differentiable function L : M → R, the Riemannian gradient ∇L(x) of L on M at a point x is a vector in the tangent space T_xM. A very useful property of embedded manifolds is the following: given a differentiable function f defined on the ambient space (and thus on the manifold), the Riemannian gradient of f at point x is simply the linear projection P_x of the ordinary gradient of f onto the tangent space T_xM. An important consequence follows in case the manifold represents the set of points obeying a certain constraint. In this case the Riemannian gradient of f is equivalent to the ordinary gradient of f minus the component which is normal to the constraint.
Indeed this normal component is exactly the component which is irrelevant when performing constrained optimization.

The Riemannian gradient allows us to compute x_{t+1/2} = x_t + η_t ∇L(x_t), for a given iterate point x_t and step size η_t. We now examine how x_{t+1/2} can be mapped back onto the manifold.

Retractions

Intuitively, retractions capture the notion of "going along a straight line" on the manifold. The mathematically ideal retraction is called the exponential mapping: it maps the tangent vector ξ ∈ T_xM to a point along a geodesic curve which goes through x in the direction of ξ. Unfortunately, for many manifolds (including the low-rank manifold considered here) calculating the geodesic curve is computationally expensive. A major insight from the field of Riemannian manifold optimization is that using the exponential mapping is unnecessary, since computationally cheaper retractions exist. Formally, for a point x in an embedded manifold M, a retraction is any function R_x : T_xM → M which satisfies the following two conditions [4]: (1) Centering: R_x(0) = x. (2) Local rigidity: the curve defined by γ_ξ(τ) = R_x(τξ) satisfies γ̇_ξ(0) = ξ. It can be shown that any such retraction approximates the exponential mapping to first order [4]. Second-order retractions, which approximate the exponential mapping to second order around x, have to satisfy the following stricter condition: P_x( d²R_x(τξ)/dτ² |_{τ=0} ) = 0 for all ξ ∈ T_xM, where P_x is the linear projection from the ambient space onto the tangent space T_xM. When viewed intrinsically, the curve R_x(τξ) defined by a second-order retraction has zero acceleration at point x, namely, its second order derivatives are all normal to the manifold.
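The two retraction conditions are easy to check numerically on a simple manifold. Here is a hedged numpy sketch for the unit sphere from Section 2, using the normalization map as the candidate retraction (a standard choice there) and a finite-difference check of local rigidity; the function name and test point are ours.

```python
import numpy as np

def sphere_retraction(x, xi):
    """Candidate retraction on the unit sphere: step in the tangent
    direction, then renormalize back onto the manifold."""
    y = x + xi
    return y / np.linalg.norm(y)

# Check both retraction conditions at a point x with a tangent vector xi
# (xi is orthogonal to x, so it lies in the tangent space T_x M).
x = np.array([1.0, 0.0, 0.0])
xi = np.array([0.0, 0.5, -0.5])          # x @ xi == 0
# (1) Centering: R_x(0) = x.
assert np.allclose(sphere_retraction(x, 0 * xi), x)
# (2) Local rigidity: d/dtau R_x(tau * xi) at tau = 0 equals xi
#     (central finite difference).
tau = 1e-6
deriv = (sphere_retraction(x, tau * xi) - sphere_retraction(x, -tau * xi)) / (2 * tau)
assert np.allclose(deriv, xi, atol=1e-6)
```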
The best known example of a second-order retraction onto embedded manifolds is the projection operation [5]. Importantly, projections are viewed here as one type of second order approximation to the exponential mapping, which can be replaced by any other second-order retraction when computing the projection is too costly.

Given the tangent space and a retraction, we can now define a Riemannian gradient descent step for the loss L at point x_t ∈ M:
(1) Gradient step: Compute x_{t+1/2} = x_t + ξ_t, with ξ_t = ∇L(x_t) = P_{x_t}(∇̃L(x_t)), where ∇̃L(x_t) is the ordinary gradient of L in the ambient space.
(2) Retraction step: Compute x_{t+1} = R_{x_t}(−η_t ξ_t), where η_t is the step size.
For a proper step size, this procedure can be proved to have local convergence for any retraction [4].

3 Online learning on the low rank manifold

Based on the retractions described above, we now present an online algorithm for learning low-rank matrices, by performing stochastic gradient descent on the manifold of low rank matrices. At every iteration the algorithm suffers a loss, and performs a Riemannian gradient step followed by a retraction to the manifold M_k^{n,m}. Section 3.1 discusses general online updates. Section 3.2 discusses the very common case where the online updates induce a gradient of rank r = 1.

In what follows, a lowercase x denotes an abstract point on the manifold, lowercase Greek letters like ξ denote an abstract tangent vector, and uppercase Roman letters like A denote concrete matrix representations as kept in memory (taking n × m float numbers to store). We intermix the two notations, as in ξ = AZ, when the meaning is clear from the context.
The set of n × k matrices of rank k is denoted R∗^{n×k}.

3.1 The general LORETA algorithm

We start with a Lemma that gives a representation of the tangent space T_xM, extending the constructions given in [6] to the general manifold of low-rank matrices. The proof is given in the supplemental material.

Lemma 1. Let x ∈ M_k^{n,m} have a (non-unique) factorization x = AB^T, where A ∈ R∗^{n×k}, B ∈ R∗^{m×k}. Let A⊥ ∈ R^{n×(n−k)} and B⊥ ∈ R^{m×(m−k)} be the orthogonal complements of A and B respectively, such that A⊥^T A = 0, B⊥^T B = 0, A⊥^T A⊥ = I_{n−k}, B⊥^T B⊥ = I_{m−k}. The tangent space to M_k^{n,m} at x is:

  T_xM = { [A A⊥] [ M N1^T ; N2 0 ] [B B⊥]^T : M ∈ R^{k×k}, N1 ∈ R^{(m−k)×k}, N2 ∈ R^{(n−k)×k} }   (2)

Let ξ ∈ T_xM_k^{n,m} be a tangent vector to x = AB^T. From the characterization above it follows that ξ can be decomposed in a unique manner into three orthogonal components: ξ = ξ_S + ξ_P^l + ξ_P^r, where ξ_S = A M B^T, ξ_P^l = A N1^T B⊥^T, and ξ_P^r = A⊥ N2 B^T. In online learning we are repeatedly given a rank-r gradient matrix Z, and want to compute a step on M_k^{n,m} in the direction of Z. As a first step we wish to find its projection P_x(Z) onto the tangent space. Specifically, we wish to find the three matrices M, N1 and N2 such that P_x(Z) = A M B^T + A N1^T B⊥^T + A⊥ N2 B^T. Since we assume A is of full column rank, its pseudo-inverse A† obeys A† = (A^T A)^{−1} A^T. The matrix projecting onto A's columns, denoted P_A, is exactly equal to A A†. We can similarly define P_{A⊥}, P_B and P_{B⊥}. A straightforward computation shows that for a given matrix Z, we have M = A† Z B†^T, N1 = B⊥^T Z^T A†^T, N2 = A⊥^T Z B†^T, yielding ξ_S = P_A Z P_B, ξ_P^l = P_A Z P_{B⊥}, ξ_P^r = P_{A⊥} Z P_B.

The following theorem defines the retraction that we use. The proof is given in the supplemental material.

Theorem 1. Let x ∈ M_k^{n,m}, x = AB^T, and x† = B†^T A† = B (B^T B)^{−1} (A^T A)^{−1} A^T (this holds since we assume A and B are of full column rank). Let ξ ∈ T_xM_k^{n,m}, ξ = ξ_S + ξ_P^l + ξ_P^r, as described above, and let

  w1 = x + (1/2) ξ_S + ξ_P^r − (1/8) ξ_S x† ξ_S − (1/2) ξ_P^r x† ξ_S ,
  w2 = x + (1/2) ξ_S + ξ_P^l − (1/8) ξ_S x† ξ_S − (1/2) ξ_S x† ξ_P^l .      (3)

The mapping R_x(ξ) = w1 x† w2 is a second order retraction from a neighborhood Θ_x ⊂ T_xM_k^{n,m} to M_k^{n,m}.

We now have the ingredients necessary for a Riemannian stochastic gradient descent algorithm.

Algorithm 1: Naive Riemannian stochastic gradient descent
Input: Matrices A ∈ R∗^{n×k}, B ∈ R∗^{m×k} s.t. x = AB^T. Matrices G1 ∈ R^{n×r}, G2 ∈ R^{m×r} s.t. G1 G2^T = −ηξ = −η∇̃L(x) ∈ R^{n×m}, where ∇̃L(x) is the gradient in the ambient space and η > 0 is the step size.
Output: Matrices Z1 ∈ R∗^{n×k}, Z2 ∈ R∗^{m×k} such that Z1 Z2^T = R_x(−ηξ).
Compute (matrix dimensions in brackets):
  A† = (A^T A)^{−1} A^T,  B† = (B^T B)^{−1} B^T          [k × n, k × m]
  A⊥, B⊥ = orthogonal complements of A, B                 [n × (n−k), m × (m−k)]
  M = A† G1 G2^T B†^T                                     [k × k]
  N1 = B⊥^T G2 G1^T A†^T,  N2 = A⊥^T G1 G2^T B†^T         [(m−k) × k, (n−k) × k]
  Z1 = A (I_k + (1/2)M − (1/8)M^2) + A⊥ N2 (I_k − (1/2)M)               [n × k]
  Z2 = B (I_k + (1/2)M^T − (1/8)(M^T)^2) + B⊥ N1 (I_k − (1/2)M^T)       [m × k]

Given a gradient in the ambient space ∇̃L(x), we can calculate the matrices M, N1 and N2 which allow us to represent its projection onto the tangent space, and furthermore allow us to calculate the retraction. The procedure is outlined in Algorithm 1, with some rearranging and term collection.

Algorithm 1 explicitly computes and stores the orthogonal complement matrices A⊥ and B⊥, which in the low rank case k ≪ m, n have size O(mn), as the original x. To improve the memory complexity, we use the fact that the matrices A⊥ and B⊥ always operate with their transpose. Since they are orthogonal, the matrix A⊥A⊥^T is a projection matrix, one which we denoted earlier by P_{A⊥}, and likewise for B⊥. Because of the orthogonal complementarity, these projection matrices are equal to I_n − P_A and I_m − P_B respectively.
We use this identity to reformulate the algorithm such that only matrices of size at most max(n, m) × k or max(n, m) × r are kept in memory. The runtime complexity of Algorithm 2 can be easily computed based on matrix multiplication complexity, and equals O((n + m)(k + r)^2).

Algorithm 2: General Riemannian stochastic gradient descent
Input and Output: As in Algorithm 1
Compute (matrix dimensions in brackets):
  A† = (A^T A)^{−1} A^T,  B† = (B^T B)^{−1} B^T          [k × n, k × m]
  Â = A† · G1,  B̂ = B† · G2                              [k × r, k × r]
  ProjAG = A · Â                                          [n × r]
  Q = B̂^T · Â                                             [r × r]
  A∆ = −(1/2) ProjAG + (3/8) ProjAG · Q + G1 − (1/2) G1 · Q        [n × r]
  Z1 = A + A∆ · B̂^T                                       [n × k]
  GBproj = (G2^T B) · B†                                  [r × m]
  B∆ = −(1/2) GBproj + (3/8) Q · GBproj + G2^T − (1/2) Q · G2^T    [r × m]
  Z2^T = B^T + Â · B∆                                     [k × m]

Algorithm 3, Loreta-1: Rank-one Riemannian stochastic gradient descent
Input: Matrices A ∈ R∗^{n×k}, B ∈ R∗^{m×k} s.t. x = AB^T. Matrices A† and B†, the pseudo-inverses of A and B respectively. Vectors G1 ∈ R^{n×1}, G2 ∈ R^{m×1} s.t. G1 G2^T = −ηξ = −η∇̃L(x) ∈ R^{n×m}, where ∇̃L(x) is the gradient in the ambient space and η > 0 is the step size.
Output: Matrices Z1 ∈ R∗^{n×k}, Z2 ∈ R∗^{m×k} s.t. Z1 Z2^T = R_x(−ηξ). Matrices Z1† and Z2†, the pseudo-inverses of Z1 and Z2 respectively.
Compute (matrix dimensions in brackets):
  Â = A† · G1,  B̂ = B† · G2                              [k × 1]
  ProjAG = A · Â                                          [n × 1]
  Q = B̂^T · Â                                             [1 × 1]
  A∆ = ProjAG (−1/2 + (3/8) Q) + G1 (1 − (1/2) Q)         [n × 1]
  Z1 = A + A∆ · B̂^T                                       [n × k]
  GBproj = (G2^T B) · B†                                  [1 × m]
  B∆ = GBproj (−1/2 + (3/8) Q) + G2^T (1 − (1/2) Q)       [1 × m]
  Z2^T = B^T + Â · B∆                                     [k × m]
  Z1† = rank one pseudoinverse update(A, A†, A∆, B̂)       [k × n]
  Z2† = rank one pseudoinverse update(B, B†, B∆, Â)       [k × m]

3.2 LORETA with rank-one gradients

In many learning problems, the gradient matrix required for a gradient step update has a rank of one. This is the case, for example, when the matrix model W acts as a bilinear form on two vectors p and q, and the loss is a linear function of p^T W q (as in [7, 8], and Sec. 5.1). In that case, the gradient is the rank-one, outer product matrix p q^T. As another example, consider the case of multitask learning, where the matrix model W operates on a vector input p, and the loss is the squared loss between the multiple predictions W p and the true labels q: ‖W p − q‖^2. The gradient of the loss is (W p − q) p^T, which is again a rank-one matrix. We now show how to reduce the complexity of each iteration to be linear in the model rank k when the rank of the gradient matrix r is one.

Given rank-one gradients (r = 1), the most computationally demanding step in Algorithm 2 is the computation of the pseudo-inverses of the matrices A and B, taking O(nk^2) and O(mk^2) operations. All other operations are O(max(n, m)k) at most.
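As a concrete instance of such rank-one gradients, the following hedged numpy sketch builds the two factors G1, G2 that Loreta-1 takes as input, for the ranking hinge loss of Sec. 5.1 with the bilinear similarity S_W(p, q) = p^T W q. The function name is ours; where the loss is positive, its gradient with respect to W is the outer product q (p2 − p1)^T, so the scaled step −η · grad factors exactly as G1 G2^T.

```python
import numpy as np

def loreta_rank_one_factors(W, q, p1, p2, eta):
    """Rank-one factors (G1, G2) with G1 G2^T = -eta * grad, for the
    ranking hinge loss l_W(q, p1, p2) = [1 - q^T W p1 + q^T W p2]_+.
    Where the loss is positive, grad = q (p2 - p1)^T (rank one), so
    G1 = q and G2 = eta * (p1 - p2). Returns None when the loss is zero
    (the gradient matrix is then 0 and no update is needed)."""
    loss = 1.0 - q @ W @ p1 + q @ W @ p2
    if loss <= 0.0:
        return None
    return q, eta * (p1 - p2)
```

Passing these vectors as the G1, G2 inputs of Algorithm 3 keeps every operation in the update at the stated O((n + m)k) cost.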
For r = 1 the outputs Z1 and Z2 become rank-one updates of the input matrices A and B. This enables us to keep the pseudo-inverses A† and B† from the previous round, and perform a rank-one update to them, following a procedure developed by [9]. This procedure is similar to the better known Sherman-Morrison formula for the inverse of a rank-one perturbed matrix, and its computational complexity for an n × k matrix is O(nk) operations. Using that procedure, we derive our final algorithm, Loreta-1, the rank-one Riemannian stochastic gradient descent. Its overall time and space complexity are both O((n + m)k) per gradient step. The memory requirement of Loreta-1 is about 4nk (assuming m = n), since it receives four input matrices of size nk (A, B, A†, B†), and assuming it can compute the four outputs (Z1, Z2, Z1†, Z2†) in-place while destroying previously computed terms.

4 Related work

A recent summary of many advances in the field of optimization on manifolds is given in [4]. More specific to the field of low rank matrix manifolds, some work has been done on the general problem of optimization with low rank positive semi-definite (PSD) matrices. These include [10] and [6]; the latter introduced the retraction for PSD matrices which we extended here to general low-rank matrices. The problem of minimizing a convex function over the set of low rank matrices was addressed by several authors, including [11], and [12], which also considers additional affine constraints and its connection to recent advances in compressed sensing. The main tools used in these works are the trace norm (sum of singular values) and semi-definite programming. See also [2].

More closely related to the current paper are the works by Kulis et al. [13] and Meka et al. [14].
The first deals with learning low rank PSD matrices, and uses the rank-preserving log-det divergence and clever factorization and optimization in order to derive an update rule with runtime complexity of O(nk^2) for an n × n matrix of rank k. The second uses online learning in order to find a minimal rank square matrix under approximate affine constraints. The algorithm does not directly allow a factorized representation, and depends crucially on an "oracle" component, which typically requires computing an SVD. Multi-class ranking with a large number of features was studied in [3].

5 Experiments

We tested Loreta-1 in two learning tasks: learning a similarity measure between pairs of text documents using the 20-newsgroups data collected by [15], and learning to rank image label annotations based on a multi-label annotated set, using the ImageNet dataset [16].²

5.1 Learning similarity on the 20 Newsgroups data set

In our first set of experiments, we looked at the problem of learning a similarity measure between pairs of text documents. Similarity learning is a well studied problem, closely related to metric learning (see [17] for a review). It has numerous applications in information retrieval, such as query by example, and finding related content on the web.

One approach to learning pairwise relations is to measure the similarity of two documents p, q ∈ R^n using a bilinear form S_W(p, q) = p^T W q, parametrized by a model W ∈ R^{n×n}. Such models can be learned using standard online methods [8], and were shown to achieve high precision. Unfortunately, since the number of parameters grows as n^2, storing the matrix W in memory is only feasible for limited feature dimensionality. To handle larger vocabularies, like those containing all textual terms found in a corpus, a common approach is to pre-select a subset of the features and train a model over the low dimensional data.
However, such preprocessing may remove crucial signals in the data, even if the features are selected in a discriminative way.

To overcome this difficulty, we used Loreta-1 to learn a rank-k parametrization of the model W, which can be factorized as W = AB^T, where A, B ∈ R^{n×k}. In each of our experiments, we selected a subset of n features and trained a rank-k model. We varied the number of features n and the rank of the matrix k so as to use a fixed amount of memory. For example, we used a rank-10 model with 50K features, and a rank-50 model with 10K features.

Similarity learning with Loreta-1. We use an online procedure similar to that in [7, 8]. At each round, three instances are sampled: a query document q, and two documents p1 and p2 such that p1 is known to be more similar to q than p2. We want the model to assign a higher similarity score to the pair (q, p1) than to the pair (q, p2), and hence use the online ranking hinge loss defined as l_W(q, p1, p2) = [1 − S_W(q, p1) + S_W(q, p2)]_+.

²Matlab code for Loreta-1 can be provided upon request.

Data preprocessing and feature selection. We used the 20 newsgroups data set (people.csail.mit.edu/jrennie/20Newsgroups), containing 20 classes with approximately 1000 documents each. We removed stop words but did not apply stemming. We selected features that conveyed high information about the identity of the class (over the training set) using the infogain criterion [18]. The selected features were normalized using tf-idf, and each document was then represented as a bag of words. Two documents were considered similar if they shared the same class label.

Experimental procedure and evaluation protocol. The 20 newsgroups site proposes a split of the data into train and test sets. We repeated the splitting 5 times, based on the sizes of the proposed splits (a train / test ratio of 65% / 35%). We evaluated the learned similarity measures using a ranking criterion.
We view every document q in the test set as a query, and rank the remaining test documents p by their similarity scores q^T W p. We then compute the precision (fraction of positives) at the top r ranked documents. We further compute the mean average precision (mAP), a widely used measure in the information retrieval community, which averages the precision over all values of r.

Comparisons. We compared Loreta with the following approaches. (1) A direct gradient descent (GD), similar to [3]. The model is represented as a product of two matrices Ŵ = AB^T. Stochastic gradient descent steps are computed over the factors A and B, for the same loss used by Loreta, l_W(q, p1, p2). The step size η was selected using cross validation. The GD steps are: A_new = A + η q (p1 − p2)^T B, and B_new = B + η (p1 − p2) q^T A. (2) Iterative Passive-Aggressive (PA). We found the above GD procedure to be very unstable, often causing the models to diverge. We therefore used a related online algorithm from the family of passive-aggressive algorithms [19]. We iteratively optimize over A given a fixed B, and vice versa. The optimization is a tradeoff between minimizing the loss l_W and limiting how much the models change at each iteration. The step sizes for updating A and B are computed to be η_A = max(l_W(q, p1, p2) / (‖q‖^2 ‖B^T(p1 − p2)‖^2), C) and η_B = max(l_W(q, p1, p2) / (‖p1 − p2‖^2 ‖A^T q‖^2), C). C is a predefined parameter controlling the maximum magnitude of the step size. This procedure is numerically more stable because of the normalization by the norms of the matrices multiplied by the gradient factors. (3) Naive Passive-Aggressive (PA v2). This method is similar to the iterative PA above, with the step size computed as with unfactored matrices: η = max(l_W(q, p1, p2) / (‖q‖^2 ‖p1 − p2‖^2), C). (4) Full rank similarity learning models. We compared with two online metric learning methods, LEGO [20] and OASIS [8].
Both algorithms learn a full (non-factorized) model, and were run with n = 1000, in order to be consistent with the memory constraint of Loreta-1. We have not compared with batch approaches such as [13].

Figure 2b shows the mean average precision obtained with the three measures. Loreta outperforms the PA approach across all ranks. More importantly, learning a low rank model of rank 30, using the best 16660 features, is significantly more precise than learning a much fuller model of rank 100 and 5000 features. The intuition is that Loreta can be viewed as adaptively learning a linear projection of the data into a low dimensional space, which is tailored to the pairwise similarity task.

5.2 Image multilabel ranking

Our second set of experiments tackled the problem of learning to rank labels for images taken from a large number of classes (L = 1661), with multiple labels per image.

In our approach, we learn a linear classifier over n features for each label c ∈ C = {1, . . . , L}, and stack all models together into a single matrix W ∈ R^{L×n}. At test time, given an image p ∈ R^n, the product W p provides a score for every label of that image p. Given a ground truth labeling, a good model would rank the true labels higher than the false ones. Each row of the matrix model can be thought of as a sub-model for the corresponding label. Imposing a low rank constraint on the model implies that these sub-models are linear combinations of a smaller number of latent models.

Online learning of label rankings with Loreta-1. At each iteration, an image p is sampled, and the scores for all its labels, W p, are computed using the current model W. These scores are compared with the ground truth labeling y = {y1, . . . , yr} ⊂ C. The learner suffers a multilabel multiclass hinge loss, as follows.
Let ȳ = argmax_{s∉y} (W p)_s be the negative label that obtained the highest score, where (W p)_s is the s-th component of the score vector W p. The loss is then L(W, p, y) = Σ_{i=1}^{r} [(W p)_ȳ − (W p)_{y_i} + 1]_+. We then used the subgradient G of this loss for Loreta: for the set of indices i_1, i_2, . . . , i_d ⊂ y that incurred a nonzero hinge loss, the i_j-th row of G is p, and the ȳ-th row of G is −d · p. The matrix G is rank-one, unless no loss was suffered, in which case it is 0.

[Figure 2 appears here: three panels of mAP curves — (a) 20 Newsgroups, mAP vs. training iterations for ranks 10–100; (b) 20 Newsgroups, mAP vs. matrix rank for LORETA, PA, PA v2, OASIS and LEGO; (c) ImageNet, mAP vs. matrix rank for Loreta-1 and Iterative PA (each with standard and random initialization), Matrix Perceptron, and Group MC Perceptron.]

Figure 2: (a) Mean average precision (mAP) over the 20 newsgroups test set, traced along Loreta learning for various ranks. Curve values are averages over 5 train-test splits. (b) mAP of different models with varying rank. For each rank, a different number of features was selected using an information gain criterion, such that the total memory requirement is kept fixed (number of features × rank is constant). 50000 features were used for rank = 10.
LEGO and OASIS were trained with the same memory (using 1000 features and rank = 1000). Error bars denote the standard error of the mean (s.e.m.) over 5 train-test splits. (c) ImageNet data: mAP as a function of the rank k. Curves are means over three train-test splits; error bars denote the s.e.m. All hyperparameters were selected using cross validation. Models were initialized either with k ones along the diagonal, or as a product of rank-k matrices with random normal entries (denoted rand. init.).

Data set and preprocessing. We used a subset of the ImageNet 2010 Challenge (www.imagenet.org/challenges/LSVRC/2010/) containing images labeled with respect to the WordNet hierarchy. Each image was manually labeled with a single class label (for a total of 1000 classes). We added labels to each image using the classes along the path to the root of the hierarchy (adding 676 classes in total). We discarded ancestor labels covering more than 10% of the images, leaving 1661 labels (5.3 labels per image on average). We used ImageNet's bag-of-words representation, based on vector-quantizing SIFT features with a vocabulary of 1000 words, followed by tf-idf normalization.

Experimental procedure and evaluation protocol. We split the data into 30 training and 20 test images per base-level label. The quality of the learned label ranking was evaluated using the mean average precision (mAP) criterion described above.

Comparisons. We compared the performance of Loreta on this task with three other approaches: (1) PA: iterative Passive-Aggressive, as described above; (2) Matrix Perceptron: full-rank conservative gradient descent; (3) Group Multi-Class Perceptron: a mixed (2,1)-norm online mirror descent algorithm [21]. Loreta and PA were run with a range of model ranks. For all methods, the step size (or the C parameter for PA) was chosen by 5-fold validation on the test set.

Figure
2c plots the mAP of Loreta and PA for different model ranks, showing on the right the mAP of the full-rank (rank 1000) gradient descent and (2,1)-norm algorithms. Loreta significantly improves over all other methods across all ranks.

6 Discussion

We presented Loreta, an algorithm that learns a low-rank matrix based on stochastic Riemannian gradient descent and efficient retraction to the manifold of low-rank matrices. Loreta achieves superior precision in a task of learning similarity in high-dimensional feature spaces, and in multi-label annotation, where it scales well with the number of classes.

Loreta yields a factorized representation of the low-rank matrix. For classification, it can be viewed as learning two matrix components: one that projects the high-dimensional data into a low dimension, and a second that learns to classify in that low dimension. It may become useful in the future for exploring high-dimensional data, or for extracting relations between large numbers of classes.

Acknowledgments

This work was supported by the Israel Science Foundation (ISF) and by the European Union under the DIRAC integrated project IST-027787.

References

[1] B.K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.

[2] M. Fazel, H. Hindi, and S. Boyd. Rank minimization and applications in system theory. In Proceedings of the 2004 American Control Conference, pages 3273–3278. IEEE, 2005.

[3] B. Bai, J. Weston, R. Collobert, and D. Grangier. Supervised semantic indexing. Advances in Information Retrieval, pages 761–765, 2009.

[4] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.

[5] P.-A. Absil and J. Malick. Projection-like retractions on matrix manifolds.
Technical Report UCL-INMA-2010.038, Department of Mathematical Engineering, Université catholique de Louvain, July 2010.

[6] B. Vandereycken and S. Vandewalle. A Riemannian optimization approach for computing low-rank solutions of Lyapunov equations. SIAM Journal on Matrix Analysis and Applications, 31:2553, 2010.

[7] D. Grangier and S. Bengio. A discriminative kernel-based model to rank images from text queries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:1371–1384, 2008.

[8] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 11:1109–1135, 2010.

[9] C.D. Meyer. Generalized inversion of modified matrices. SIAM Journal on Applied Mathematics, 24(3):315–323, 1973.

[10] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20:2327–2351, 2010.

[11] M. Fazel. Matrix rank minimization with applications. PhD thesis, Electrical Engineering Department, Stanford University, 2002.

[12] B. Recht, M. Fazel, and P.A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

[13] B. Kulis, M.A. Sustik, and I.S. Dhillon. Low-rank kernel learning with Bregman matrix divergences. Journal of Machine Learning Research, 10:341–376, 2009.

[14] R. Meka, P. Jain, C. Caramanis, and I.S. Dhillon. Rank minimization via online learning. In Proceedings of the 25th International Conference on Machine Learning, pages 656–663, 2008.

[15] K. Lang. Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning, pages 331–339, 1995.

[16] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei.
ImageNet: a large-scale hierarchical image database. In Proceedings of the 22nd IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[17] L. Yang. An overview of distance metric learning. Technical report, School of Computer Science, Carnegie Mellon University, 2007.

[18] Y. Yang and J.O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning, pages 412–420, 1997.

[19] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.

[20] P. Jain, B. Kulis, I.S. Dhillon, and K. Grauman. Online metric learning and fast similarity search. Advances in Neural Information Processing Systems, pages 761–768, 2008.

[21] S.M. Kakade, S. Shalev-Shwartz, and A. Tewari. Regularization techniques for learning with matrices. Preprint, 2010.