{"title": "Practical Large-Scale Optimization for Max-norm Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 1297, "page_last": 1305, "abstract": "The max-norm was proposed as a convex matrix regularizer by Srebro et al (2004) and was shown to be empirically superior to the trace-norm for collaborative filtering problems. Although the max-norm can be computed in polynomial time, there are currently no practical algorithms for solving large-scale optimization problems that incorporate the max-norm. The present work uses a factorization technique of Burer and Monteiro (2003) to devise scalable first-order algorithms for convex programs involving the max-norm. These algorithms are applied to solve huge collaborative filtering, graph cut, and clustering problems. Empirically, the new methods outperform mature techniques from all three areas.", "full_text": "Practical Large-Scale Optimization\n\nfor Max-Norm Regularization\n\nInstitute of Computational and Mathematical Engineering\n\nJason Lee\n\nStanford University\n\nemail: jl115@yahoo.com\n\nBenjamin Recht\n\nDepartment of Computer Sciences\nUniversity of Wisconsin-Madison\nemail: brecht@cs.wisc.edu\n\nRuslan Salakhutdinov\n\nBrain and Cognitive Sciences and CSAIL\n\nMassachusetts Institute of Technology\n\nemail: rsalakhu@mit.edu\n\nNathan Srebro\n\nToyota Technological Institute at Chicago\n\nemail: nati@ttic.edu\n\nJoel A. Tropp\n\nComputing and Mathematical Sciences\n\nCalifornia Institute of Technology\n\nemail: jtropp@acm.caltech.edu\n\nAbstract\n\nThe max-norm was proposed as a convex matrix regularizer in [1] and was shown\nto be empirically superior to the trace-norm for collaborative \ufb01ltering problems.\nAlthough the max-norm can be computed in polynomial time, there are currently\nno practical algorithms for solving large-scale optimization problems that incor-\nporate the max-norm. 
The present work uses a factorization technique of Burer\nand Monteiro [2] to devise scalable \ufb01rst-order algorithms for convex programs\ninvolving the max-norm. These algorithms are applied to solve huge collabora-\ntive \ufb01ltering, graph cut, and clustering problems. Empirically, the new methods\noutperform mature techniques from all three areas.\n\nIntroduction\n\n1\nA foundational concept in modern machine learning is to construct models for data by balancing\nthe complexity of the model against \ufb01delity to the measurements. In a wide variety of applications,\nsuch as collaborative \ufb01ltering, multi-task learning, multi-class learning and clustering of multivariate\nobservations, matrices offer a natural way to tabulate data. For such matrix models, the matrix rank\nprovides an intellectually appealing way to describe complexity. The intuition behind this approach\nholds that many types of data arise from a noisy superposition of a small number of simple (i.e.,\nrank-one) factors.\nUnfortunately, optimization problems involving rank constraints are computationally intractable ex-\ncept in a few basic cases. To address this challenge, researchers have searched for alternative com-\nplexity measures that can also promote low-rank models. A particular example of a low-rank reg-\nularizer that has received a huge amount of recent attention is the trace-norm, equal to the sum of\nthe matrix\u2019s singular values (See the comprehensive survey [3] and its bibliography). The trace-\nnorm promotes low-rank decompositions because it minimizes the (cid:96)1 norm of the vector of singular\nvalues, which encourages many zero singular values.\nAlthough the trace-norm is a very successful regularizer in many applications, it does not seem to be\nwidely known or appreciated that there are many other interesting norms that promote low rank. 
The paper [4] is one of the few articles in the machine learning literature that pursues this idea with any vigor. The current work focuses on another rank-promoting regularizer, sometimes called the max-norm, that has been proposed as an alternative to the rank for collaborative filtering problems [1, 5]. The max-norm can be defined via matrix factorizations:\n\n\|X\|_{\max} := \inf \{ \|U\|_{2,\infty} \|V\|_{2,\infty} : X = U V' \}   (1)\n\nwhere \|\cdot\|_{2,\infty} denotes the maximum \ell_2 row norm of a matrix:\n\n\|A\|_{2,\infty} := \max_j ( \sum_k A_{jk}^2 )^{1/2}.\n\nFor general matrices, the computation of the max-norm can be rephrased as a semidefinite program; see (4) below. When X is positive semidefinite, we may force U = V and then verify that \|X\|_{\max} = \max_j x_{jj}, which should explain the terminology.\nThe fundamental result in the metric theory of tensor products, due to Grothendieck, states that the max-norm is comparable with a nuclear norm (see Chapter 10 of [6]):\n\n\|X\|_{\max} \approx \inf \{ \|\sigma\|_1 : X = \sum_j \sigma_j u_j v_j' where \|u_j\|_\infty = 1 and \|v_j\|_\infty = 1 \}.\n\nThe factor of equivalence 1.676 \le \kappa_G \le 1.783 is called Grothendieck's constant. The trace-norm, on the other hand, is equal to\n\n\|X\|_{tr} := \inf \{ \|\sigma\|_1 : X = \sum_j \sigma_j u_j v_j' where \|u_j\|_2 = 1 and \|v_j\|_2 = 1 \}.\n\nThis perspective reveals that the max-norm promotes low-rank decompositions with factors in \ell_\infty, rather than the \ell_2 factors produced by the trace-norm! 
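To make the factored definition (1) concrete, here is a small NumPy sketch (the function names are ours, for illustration only) that evaluates the two-infinity norm and the upper bound on the max-norm induced by one particular factorization X = U V':

```python
import numpy as np

def two_inf_norm(A):
    """||A||_{2,inf}: the maximum Euclidean norm of a row of A."""
    return np.linalg.norm(A, axis=1).max()

def max_norm_bound(U, V):
    """Upper bound on ||U V'||_max induced by the factorization (U, V).

    The max-norm itself is the infimum of this quantity over all
    factorizations X = U V'."""
    return two_inf_norm(U) * two_inf_norm(V)
```

For a positive semidefinite X = U U', the bound induced by taking V = U is \|U\|_{2,\infty}^2 = \max_j x_{jj}, consistent with the identity noted above.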
Heuristically, we expect max-norm regular-\nization to be effective for uniformly bounded data, such as preferences.\nThe literature already contains theoretical and empirical evidence that the max-norm is superior to\nthe trace-norm for certain types of problems. Indeed, the max-norm offers better generalization\nerror bounds for collaborative \ufb01ltering [5], and it outperforms the trace-norm in small-scale experi-\nments [1]. The paper [7] provides further evidence that the max-norm serves better for collaborative\n\ufb01ltering with nonuniform sampling patterns.\nWe believe that the max-norm has not achieved the same prominence as the trace-norm because of an\napprehension that it is challenging to solve optimization problems involving a max-norm regularizer.\nThe goal of this paper is to refute this misconception.\nWe provide several algorithms that are effective for very large scale problems, and we demonstrate\nthe power of the max-norm regularizer using examples from a variety of applications. In particular,\nwe study convex programs of the form\n\nmin f(X) + \u00b5(cid:107)X(cid:107)max\n\n(2)\n\nwhere f is a smooth function and \u00b5 is a positive penalty parameter. Section 4 outlines a proximal-\npoint method, based on the work of Fukushima and Mine [8], for approaching (2). We also study\nthe bound-constrained problem\n\nmin f(X)\n\nsubject to\n\n(cid:107)X(cid:107)max \u2264 B.\n\n(3)\n\nOf course, (2) and (3) are equivalent for appropriate choices of \u00b5 and B, but we describe scenarios\nwhere there may be a preference for one versus the other. Section 3 provides a projected gradient\nmethod for (3), and Section 5 develops a stochastic implementation that is appropriate for decom-\nposable loss functions. These methods can be coded up in a few lines of numerical python or Matlab,\nand they scale to huge instances, even on a standard desktop machine. 
In Section 6, we apply these new algorithms to large-scale collaborative filtering problems, and we demonstrate performance superior to methods based on the trace-norm. We apply the algorithms to solve enormous instances of graph cut problems, and we establish that clustering based on these cuts outperforms spectral clustering on several data sets.\n\n2 The SDP and Factorization Approaches\nThe max-norm of an m \times n matrix X can be expressed as the solution to a semidefinite program:\n\n\|X\|_{\max} = \min t  subject to  [ W_1, X ; X', W_2 ] \succeq 0,  diag(W_1) \le t,  diag(W_2) \le t.   (4)\n\nUnfortunately, standard interior-point methods for this problem do not scale to matrices with more than a few hundred rows or columns. For large-scale problems, we use an alternative formulation suggested by (1) that explicitly works with a factorization of the decision variable X.\nWe employ an idea of Burer and Monteiro [2] that has far-reaching consequences. The positive semidefinite constraint in the SDP formulation above is trivially satisfied if we define L and R via\n\n[ W_1, X ; X', W_2 ] = [ L ; R ] [ L ; R ]'.\n\nBurer and Monteiro showed that as long as L and R have sufficiently many columns, then the global optimum of (4) is equal to that of\n\n\|X\|_{\max} = \min_{(L,R) : L R' = X} \max\{ \|L\|_{2,\infty}^2, \|R\|_{2,\infty}^2 \}.   (5)\n\nIn particular, we may assume that the number of columns is less than m + n. This formulation of the max-norm is nonconvex because it involves a constraint on the product L R', but Burer and Monteiro proved that each local minimum of the reformulated problem is also a global optimum [9]. 
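The following NumPy check (dimensions chosen arbitrarily, purely for illustration) makes the Burer-Monteiro observation concrete: stacking the factors produces a Gram matrix that is feasible for the SDP by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 5, 4, 3
L = rng.standard_normal((m, r))
R = rng.standard_normal((n, r))

S = np.vstack([L, R])          # stack the factors: [L; R]
G = S @ S.T                    # [[W1, X], [X', W2]] = [L; R][L; R]'

# The off-diagonal block of the Gram matrix is exactly X = L R'.
assert np.allclose(G[:m, m:], L @ R.T)
# Gram matrices are always positive semidefinite.
assert np.linalg.eigvalsh(G).min() >= -1e-10
# A feasible t: the largest diagonal entry, which is
# max{||L||^2_{2,inf}, ||R||^2_{2,inf}}.
t = np.diag(G).max()
assert np.isclose(t, max(np.linalg.norm(L, axis=1).max(),
                         np.linalg.norm(R, axis=1).max()) ** 2)
```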
If\nwe select L and R to have a very small number of columns, say r, then the number of real decision\nvariables in the optimization problems (2) and (3) is reduced from mn to r(m + n), a dramatic\nimprovement in the dimensionality of the problem. On the other hand, the new formulation is non-\nconvex with respect to L and R so it might not be ef\ufb01ciently solvable. In what follows, we present\nfast, \ufb01rst-order methods for solving (2) and (3) via this low-dimensional factored representation.\n3 Projected Gradient Method\nThe constrained formulation (3) admits a simple projected gradient algorithm. We replace X with\nthe product LR(cid:48) and use the factored form of the max-norm (5) to obtain\n2,\u221e ,(cid:107)R(cid:107)2\n(cid:21)(cid:19)\n\nThe projected gradient descent method \ufb01xes a step size \u03c4 and computes updates with the rule\n\n(cid:18)(cid:20) L \u2212 \u03c4\u2207f(LR)R\n\nminimize(L,R)f(LR(cid:48))\n\nsubject to max{(cid:107)L(cid:107)2\n\n2,\u221e} \u2264 B.\n\n(cid:20)L\n\n(cid:21)\n\n(6)\n\n\u2190 PB\n\nR\n\nR \u2212 \u03c4\u2207f(LR)(cid:48)L\n\n\u221a\n\n\u221a\n\n2,\u221e ,(cid:107)R(cid:107)2\n\nB so their norms equal\n\nB. Rows with norms less than\n\nwhere PB denotes the Euclidean projection onto the set {(L, R) : max((cid:107)L(cid:107)2\n2,\u221e) \u2264 B}.\n\u221a\nThis projection can be computed by re-scaling the rows of the current iterate whose norms exceed\nB are unchanged by the projection. The\nprojected gradient algorithm is elegant and simple, and it has an online implementation, described\nbelow. Moreover, using an Armijo line search rule to guarantee suf\ufb01cient decrease of the cost\nfunction, we can guarantee convergence to a stationary point of (3); see [10, Sec. 2.3].\n4 Proximal Point Method for Penalty Formulation\nSolving (2) is slightly more complicated than its constrained counterpart. 
We employ a classical\nproximal point method, proposed by Fukushima and Mine [8], which forms the algorithmic foun-\ndation of many popular \ufb01rst-order methods of for (cid:96)1-norm minimization [11, 12] and trace-norm\nminimization [13, 14]. The key idea is that our cost function is the sum of a smooth term plus a\nconvex term. At each iteration, we replace the smooth term by a linear approximation. The new\ncost function can then be minimized in closed form. Before describing the proximal point algorithm\nin detail, we \ufb01rst discuss how a simple max-norm problem (the Frobenius norm plus a max-norm\npenalty) admits an explicit formula for its unique optimal solution.\nConsider the simple regularization problem\n\nminimizeW (cid:107)W \u2212 V (cid:107)2\n\nF + \u03b2 (cid:107)W(cid:107)2\n\n2,\u221e\n\n(7)\n\n3\n\n\fAlgorithm 1 Compute W = squash(V , \u03b2)\nRequire: A d \u00d7 D matrix V , a positive scalar \u03b2.\nEnsure: A d \u00d7 D matrix W \u2208 arg minZ\n1: for k = 1 to d set nk \u2190 (cid:107)vk(cid:107)2\n2: sort {nk} in descending order. Let \u03c0 denote the sorting permutation such that n\u03c0(j) is the jth\n\nF + \u03b2 (cid:107)Z(cid:107)2\n\n(cid:107)Z \u2212 V (cid:107)2\n\n2,\u221e.\n\n3: for k = 1 to d set sk \u2190(cid:80)k\n\nlargest element in the sequence.\ni=1 n\u03c0(i).\nk+\u03b2}\n\n4: q \u2190 max{k : n\u03c0(k) \u2265 sk\n5: \u03b7 \u2190 sq\n6: for k = 1 to d, if k \u2264 q, set w\u03c0(k) \u2190 \u03b7v\u03c0(k)/(cid:107)v\u03c0(k)(cid:107)2. otherwise set w\u03c0(k) \u2190 v\u03c0(k)\n\nq+\u03b2\n\nwhere W and V are d \u00d7 D matrices. Just as with (cid:96)1-norm and trace-norm regularization, this\nproblem can be solved in closed form. 
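A direct NumPy sketch of this closed-form solution, following the squash pseudocode of Algorithm 1 above (our own translation, not the authors' code), might look like:

```python
import numpy as np

def squash(V, beta):
    """Closed-form minimizer of ||W - V||_F^2 + beta * ||W||_{2,inf}^2.

    Rows of V whose norm exceeds a critical value eta are rescaled to
    have norm exactly eta; all other rows are left unchanged."""
    norms = np.linalg.norm(V, axis=1)               # n_k = ||v_k||_2
    order = np.argsort(-norms)                      # sorting permutation pi
    sorted_norms = norms[order]
    if sorted_norms[0] == 0:                        # V = 0: nothing to squash
        return V.copy()
    cumsums = np.cumsum(sorted_norms)               # partial sums s_k
    k = np.arange(1, len(norms) + 1)
    # q = max{k : n_pi(k) >= s_k / (k + beta)}; k = 1 always qualifies.
    q = np.nonzero(sorted_norms >= cumsums / (k + beta))[0].max() + 1
    eta = cumsums[q - 1] / (q + beta)
    W = V.copy()
    clipped = order[:q]
    W[clipped] = eta * V[clipped] / norms[clipped, None]
    return W
```

The row-norm computation and the sort dominate the cost, in line with the flop counts discussed in the text.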
An ef\ufb01cient algorithm to solve (7) is given by Algorithm 1.\nWe call this procedure squash because the rows of V with large norm have their magnitude clipped\nat a critical value \u03b7 = \u03b7(V , \u03b2).\n\nProposition 4.1 squash(V , \u03b2) is an optimal solution of (7)\n\nThe proof of this proposition follows from an analysis of the KKT conditions for the regularized\nproblem. We include a full derivation in the appendix. Note that squash can be computed in\nO(d max{log(d), D}) \ufb02ops. Computing the row norms requires O(dD) \ufb02ops, and then the sort\nrequires O(d log d) \ufb02ops. Computing \u03b7 and q require O(d) operations. Constructing W then re-\nquires O(dD) operations.\nWith the squash function in hand, we can now describe our proximal-point algorithm. Replace\nthe decision variable X in (2) with LR(cid:48). With this substitution and the factored form of the max-\nnorm, (5), Problem (2) reduces to\n\nminimize(L,R)f(LR(cid:48)) + \u00b5 max{(cid:107)L(cid:107)2\n\n(8)\n\n2,\u221e ,(cid:107)R(cid:107)2\n\n2,\u221e} .\n\n(cid:21)\n.\n2,\u221e}. Also let \u02dcf(A) denote f(LR(cid:48)),\n\n(cid:20) L\n\nR\n\nFor ease of notation, de\ufb01ne A to be the matrix of factors stacked on top of one another A =\nWith this notation, we have (cid:107)A(cid:107)2\nand \u03d5(A) := \u02dcf(A) + \u00b5(cid:107)A(cid:107)2\n2,\u221e .\nUsing the squash algorithm, we can solve\n\n2,\u221e = max{(cid:107)L(cid:107)2\n\n2,\u221e ,(cid:107)R(cid:107)2\n\nminimize(cid:104)\u2207 \u02dcf(Ak), A(cid:105) + \u03c4\u22121\n\nk (cid:107)A \u2212 Ak(cid:107)2\n\nF + \u00b5(cid:107)A(cid:107)2\n\n2,\u221e\n\n(9)\n\nin closed form. To see this, complete the square and multiply by \u03c4k. Then (9) is equivalent to (7)\nwith the identi\ufb01cations W = A, V = Ak \u2212 \u03c4k\u2207 \u02dcf(Ak), \u03b2 = \u03c4k\u00b5. 
That is, the optimal solution\nof (9) is squash\n\nAk \u2212 \u03c4k\u2207 \u02dcf(Ak), \u03c4k\u00b5\n\n.\n\n(cid:16)\n\n(cid:17)\n\nWe can now directly apply the proximal-point algorithm of Fukushima and Mine, detailed in Algo-\nrithm 2. Step 2 is the standard linearized proximal-point method that is prevalent in convex algo-\nrithms like Mirror Descent and Nesterov\u2019s optimal method. The cost function \u02dcf is replaced with a\nquadratic approximation localized at the previous iterate Ak, and the resulting approximation (9)\ncan be solved in closed form. Step 3 is a backtracking line search that looks for a step that obeys\nan Armijo step rule. This linesearch guarantees that the algorithm produces a suf\ufb01ciently large de-\ncrease of the cost function at each iteration, but it may require several function evaluations to \ufb01nd l.\nThis algorithm is guaranteed to converge to a critical point of (8) as long as the step sizes are chosen\ncommensurate with the norm of the Hessian [8]. In particular, Nesterov has recently shown that if \u02dcf\nhas a Lipschitz-continuous gradient with Lipschitz constant L, then the algorithm will converge at a\nrate of 1/k where k is the iteration counter [15].\n\n4\n\n\fAn initial point A0 = (L0, R0) and a counter k set to 0.\n\nAlgorithm 2 A proximal-point method for max-norm regularization\nRequire: Algorithm parameters \u03b1 > 0, 1 > \u03b3 > 0, \u0001tol > 0. A sequence of positive numbers {\u03c4k}.\nEnsure: A critical point of (8).\n1: repeat\n2:\n3:\n\nSolve (9) to \ufb01nd \u02c6Ak. 
That is, \u02c6Ak \u2190 squash\nCompute the smallest nonnegative integer l such that\n\nAk \u2212 \u03c4k\u2207 \u02dcf(Ak), \u03c4k\u00b5\n\n(cid:16)\n\n(cid:17)\n\n.\n\n\u03d5(Ak + \u03b3l( \u02c6Ak \u2212 Ak)) \u2264 \u03d5(Ak) \u2212 \u03b1\u03b3l(cid:107)Ak \u2212 \u02c6Ak(cid:107)2\nF .\n\nset Ak+1 \u2190 (1 \u2212 \u03b3l)Ak + \u03b3l \u02c6Ak, k \u2190 k + 1.\n\n4:\n5: until (cid:107)Ak\u2212 \u02c6Ak(cid:107)2\n(cid:107)Ak(cid:107)2\n\nF\n\nF\n\n< \u0001tol\n\n5 Stochastic Gradient\n\nFor many problems, including matrix completion and max-cut problems, the cost function decom-\nposes over the individual entries in the matrix, so the function f(LR(cid:48)) takes the particularly simple\nform:\n\n(cid:96)(Yij, L(cid:48)\n\niRj)\n\n(10)\n\nf(L, R) = (cid:88)\n\ni,j\u2208S\n\nwhere (cid:96) is some \ufb01xed loss function, S is a set of row-column indices, Yij are some real numbers,\nand Li and Rj denote the ith row of L and jth row of R respectively. When dealing with very\nlarge datasets, S may consist of hundreds of millions of pairs, and there are algorithmic advantages\nto utilizing stochastic gradient methods that only query a very small subset of S at each iteration.\nIndeed, the above decomposition for f immediately suggests a stochastic gradient method: pick one\ntraining pair (i, j) at random at each iteration, take a step in the direction opposite the gradient of\niRj) and then either apply the projection PB described in Section 3 or the squash function\n(cid:96)(Yi,j, L(cid:48)\ndescribed in 4.\nThe projection PB is particularly easy to compute in the stochastic setting. Namely, if (cid:107)Li(cid:107)2 > B,\nwe project it back so that (cid:107)Li(cid:107) =\nB, otherwise we do not do anything (and similarly for Rj). We\nneed not look at any other rows of L and R. 
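For the squared loss used in our matrix-completion experiments, one such stochastic projected-gradient step can be sketched as follows (a sketch only; the function name and signature are ours):

```python
import numpy as np

def sgd_step(L, R, i, j, y, B, tau):
    """One stochastic step for the loss l(y, L_i . R_j) = (y - L_i . R_j)^2,
    followed by projection of the two touched rows onto {||row||^2 <= B}.
    No other rows of L or R are read or written."""
    resid = y - L[i] @ R[j]
    L_i_old = L[i].copy()          # R_j's gradient uses the pre-update L_i
    L[i] -= tau * (-2.0 * resid * R[j])
    R[j] -= tau * (-2.0 * resid * L_i_old)
    for row in (L[i], R[j]):       # project the two touched rows back
        nrm2 = row @ row
        if nrm2 > B:
            row *= np.sqrt(B / nrm2)
    return L, R
```

Because `L[i]` and `R[j]` are views, the rescaling updates the factor matrices in place, so each iteration costs only O(r) work.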
As we demonstrate in experimental results section, this\nsimple algorithm is computationally as ef\ufb01cient as optimization with the trace-norm.\nWe can also implement an ef\ufb01cient algorithm for stochastic gradient descent for problem (2). If we\nwanted to apply the squash algorithm to such a stochastic gradient step, only the norms correspond-\ning to Li and Rj would be modi\ufb01ed. Hence, in Algorithm 1, if the set of row norms of L and R\nis sorted from the previous iteration, we can implement a balanced-tree data structure that allows us\nto perform individual updates in amortized logarithmic time. We leave such an implementation to\nfuture work. In the experiments, however, we demonstrate that the proximal point method is still\nquite ef\ufb01cient and fast when dealing with stochastic gradient updates corresponding to medium-size\nbatches {(i, j)} selected from S, even if a full sort is performed at each squash operation.\n\n\u221a\n\n6 Numerical Experiments\n\nMatrix Completion. We tested our proximal point and projected gradient methods on the Net-\n\ufb02ix dataset, which is the largest publicly available collaborative \ufb01ltering dataset. The training set\ncontains 100,480,507 ratings from 480,189 anonymous users on 17,770 movie titles. Net\ufb02ix also\nprovides a quali\ufb01cation set, containing 1,408,395 ratings. The \u201cquali\ufb01cation set\u201d pairs were selected\nby Net\ufb02ix from the most recent ratings for a subset of the users. As a baseline, Net\ufb02ix provided the\ntest score of its own system trained on the same data, which is 0.9514. This dataset is interesting for\nseveral reasons. First, it is very large, and very sparse (98.8% sparse). Second, the dataset is very\nimbalanced, with highly nonuniform samples. 
It includes users with over 10,000 ratings as well as users who rated fewer than 5 movies.\n\nAlgorithm           | Training RMSE: f(X) | Training RMSE: \|X\|_{\max} | Training RMSE: f(X) + \mu \|X\|_{\max} | Qual: f(X)\nProximal Point      | 0.7676 | 2.5549 | 0.7689 | 0.9150\nProjected Gradient  | 0.7728 | 2.2500 | 0.7739 | 0.9138\nTrace-norm          |   -    |   -    |   -    | 0.9235\nWeighted Trace-norm |   -    |   -    |   -    | 0.9105\n\nFigure 1: Performance of regularization methods on the Netflix dataset.\n\nFor the Netflix dataset, we evaluate our algorithms based on the root mean squared error (RMSE) of their predictions. To this end, the objective we seek to minimize takes the following form:\n\nminimize_{L,R}  (1/|S|) \sum_{(i,j) \in S} (Y_{ij} - L_i' R_j)^2 + \mu \max\{ \|L\|_{2,\infty}^2, \|R\|_{2,\infty}^2 \}\n\nwhere S represents the set of observed user-movie pairs and the Y_{ij} denote the provided ratings. For all of our experiments, we learned a factorization L R' with k = 30 dimensions (factors). All ratings were normalized to be zero-mean by subtracting 3.6. To speed up learning, we subdivided the Netflix dataset into minibatches, each containing 100,000 user/movie/rating triplets. Both the proximal-point and projected gradient methods performed 40 epochs (passes through the training set), with the parameters {L, R} updated after each minibatch. For both algorithms we used a momentum of 0.9 and a step size of 0.005, which was decreased by a factor of 0.8 after each epoch. For the proximal-point method, \mu was set to 5 \times 10^{-4}, and for the projected gradient algorithm, B was set to 2.25. The running times of both algorithms on this large-scale Netflix dataset are comparable. 
On a 2.5 GHz Intel Xeon, our implementation of projected gradient takes 20.1 minutes per epoch, whereas the proximal-point method takes about 19.6 minutes.\nFigure 1 shows the predictive performance of both the proximal-point and projected gradient algorithms on the training and qualification sets. Observe that the proximal-point algorithm converges considerably faster than projected gradient, but both algorithms achieve a similar RMSE of 0.9150 (proximal point) and 0.9138 (projected gradient) on the qualification set. Figure 1, left panel, further shows that the max-norm based regularization significantly outperforms the corresponding trace-norm based regularization, which is widely used in many large-scale collaborative filtering applications. We also note that the differences between the max-norm and the weighted trace-norm [7] are rather small, with the weighted trace-norm slightly outperforming the max-norm.\n\nGset Max-Cut Experiments. In the MAX-CUT problem, we are given a graph G = (V, E), and we aim to solve the problem\n\nmaximize \sum_{(i,j) \in E} (1 - x_i x_j)  subject to  x_i^2 = 1 for all i \in V.\n\nThe heralded Goemans-Williamson relaxation [16] converts this problem into a constrained, symmetric max-norm problem:\n\nmaximize \sum_{(i,j) \in E} (1 - X_{ij})  subject to  \|X\|_{\max} \le 1,  X \succeq 0.\n\nIn our nonconvex formulation, this optimization becomes\n\nmaximize \sum_{(i,j) \in E} (1 - A_i' A_j)  subject to  \|A\|_{2,\infty}^2 \le 1.\n\nSince the decision variable is symmetric and positive semidefinite, we only need one factor A of size |V| \times r. In all of our experiments with MAX-CUT type problems, we fixed r = 20. 
We used a diminishing step size rule of \tau_k = \tau_0 / \sqrt{k}, where k is the iteration counter.\n\n[Figure 1, right panel: training and qualification RMSE versus the number of epochs for the proximal-point and projected gradient methods.]\n\nGraph |  |V|  |  |E|  | Primal Obj. | Time (.1%) | Iter. (.1%) | Time (1%) | Iter. (1%) | SDPLR Obj. | SDPLR Time\nG22   |  2000 | 19990 | 14128.5 |  0.6 |  150 | 0.4  | 100 | 14135.7  |  3\nG35   |  2000 | 11778 | 8007.4  |  0.5 |  200 | 0.3  | 100 | 8014.6   |  4\nG36   |  2000 | 11766 | 7998.3  |  0.5 |  200 | 0.3  | 100 | 8005.9   |  7\nG58   |  5000 | 29570 | 20116.6 |  2   |  300 | 0.7  | 100 | 20135.90 | 29\nG60   |  7000 | 17148 | 15207.0 |  2.1 |  400 | 0.29 |  50 | 15221.9  |  6\nG67   | 10000 | 20000 | 7736.4  | 21.4 | 2050 | 1.3  | 100 | 7744.1   | 15\nG70   | 10000 |  9999 | 9851.51 |  8.7 | 1700 | 0.5  | 100 | 9861.2   | 21\nG72   | 10000 | 20000 | 7800.4  | 13.8 | 2250 | 0.6  | 100 | 7808.2   | 15\nG77   | 14000 | 28000 | 11034.1 | 18.6 | 2150 | 0.9  | 100 | 11045.1  | 20\nG81   | 20000 | 40000 | 15639.6 | 28.4 | 2200 | 1.35 | 100 | 15655.2  | 33\n\nTable 1: Performance of projected gradient on Gset graphs. Columns show the number of vertices and edges, the primal objective within .1% of optimal, the running time and number of iterations to reach .1% of optimal, the running time and number of iterations to reach 1% of optimal, and the primal objective and running time of SDPLR. In our experiments, we set \tau_0 = 1.\n\n(a) Spectral Clustering  (b) Max-cut clustering\n\nFigure 2: Comparison of spectral clustering (left) with MAX-CUT clustering (right).\n\nWe tested our projected gradient algorithm on graphs drawn from the Gset, a collection of graphs designed for testing the efficacy of max-cut algorithms [17]. The results for a subset of these appear in Table 1, along with a comparison against a C implementation of Burer's SDPLR code, which has been optimized for the particular structure of the MAX-CUT problem [18]. 
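For reference, the projected gradient scheme for this factored formulation can be sketched in Python as follows (a sketch of the approach, not the Matlab implementation used in the experiments; it ascends the relaxed cut value and rounds with random hyperplanes, with parameter defaults chosen for illustration):

```python
import numpy as np

def maxcut_projgrad(Adj, r=20, tau0=1.0, iters=500, seed=0):
    """Projected gradient for the factored MAX-CUT relaxation:
    maximize sum_{(i,j) in E} (1 - A_i . A_j)  s.t.  ||A||^2_{2,inf} <= 1,
    with diminishing step size tau_k = tau0 / sqrt(k)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((Adj.shape[0], r))
    A /= np.linalg.norm(A, axis=1, keepdims=True)   # start on the boundary
    for k in range(1, iters + 1):
        # Ascent step: the gradient of sum(1 - A_i . A_j) w.r.t. A is -Adj @ A.
        A -= (tau0 / np.sqrt(k)) * (Adj @ A)
        norms = np.linalg.norm(A, axis=1)
        over = norms > 1.0
        A[over] /= norms[over, None]                # project offending rows
    return A

def round_cut(A, Adj, trials=100, seed=1):
    """Goemans-Williamson style randomized rounding; best of several trials."""
    rng = np.random.default_rng(seed)
    best_x, best_cut = None, -1.0
    for _ in range(trials):
        x = np.where(A @ rng.standard_normal(A.shape[1]) >= 0, 1.0, -1.0)
        cut = np.sum(Adj[x > 0][:, x < 0])          # edges crossing the cut
        if cut > best_cut:
            best_x, best_cut = x, cut
    return best_x, best_cut
```

On a 4-cycle, for example, the rounded solution recovers the bipartition that cuts all four edges.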
On the same modern\nhardware, a Matlab implementation of our projected gradient method can reach .1% of the optimal\nvalue faster than the optimized and compiled SDPLR code.\n\n(cid:16)\u2212||xi\u2212xj||2\n\n(cid:17)\n\n2\u03c32\ni\n\n2-class Clustering Experiments. For the 2-class clustering problem, we \ufb01rst build a K-nearest\nneighbor graph with K = 10 and weights wij de\ufb01ned as wij = max(si(j), sj(i)), with si(j) =\nexp\nand \u03c3i equal to the distance from xi to its Kth closest neighbor. We then choose\na scalar \u03b4 > 0 and de\ufb01ne an inverse similarity adjacency matrix Q by Qij = \u03b4\u2212Wij. The parameter\n\u03b4 controls the balancing of the clusters, a large value of \u03b4 forces the clusters to be of equal size. We\nsolve the MAX-CUT problem on the graph Q to \ufb01nd our cluster assignments.\nAs a synthetic example, we generated a \u201ctwo moons\u201d dataset consisting of two half-circles in R2\nwith the bottom half circle shifted to the right by 1/2 and shifted up by 1/2. The data is then\nembedded into RD and each embedded component is corrupted with Gaussian noise with variance\n.02 as done in [19]. The\n\u03c32. For the two moons experiments, we \ufb01x D = 100, n = 2000 and \u03c3 =\nparameters are set to \u03b4 = .01 and \u03c40 = 3/2; the algorithm was executed for 1500 iterations. For\nthe clustering experiments, we repeat the randomized rounding technique [16] for 100 trials, and we\nchoose the rounding with highest primal objective.\nWe compare our MAX-CUT clusterings with the spectral clustering method [20] and the Total Vari-\nation Graph Cut algorithm [19]. Figure 2 shows the clustering results for spectral clustering and\nmaxcut clustering. In all the trials, spectral clustering incorrectly clustered the two ends of both\nhalf-circles. For the clustering problems, the two measures of performance we consider are mis-\nclassi\ufb01cation error rate (number of misclassi\ufb01ed points divided by n) and cut cost. 
The cut cost is\ni\u2208V1,j\u2208V2 Wij. The MAX-CUT clustering obtained smaller misclassi\ufb01cation error in 98\n\nde\ufb01ned as(cid:80)\n\nof the 100 trials we performed and smaller cut cost in every trial.\nOn the MNIST database, we build the 10-NN graph described above on the digits 4 and 9, where\nwe set \u03b4 = .001 and r = 8. The NN-graph is of size 14, 000 and the MAX-CUT algorithm takes\n\n\u221a\n\n7\n\n\fTwo Moons\nMNIST 4 and 9\nMNIST 3 and 5\n\nError Rate\n\n0.053\n0.021\n0.016\n\nmax-cut\n\nCost\n311.9\n1025.5\n830.9\n\nTime\n13\n90\n53\n\nmin(|V1|,|V2|)\n\n|V1|+|V2|\n\n.495\n.493\n.476\n\nspectral\n\nError Rate\n\n0.171\n0.458\n0.092\n\nCost\n387.8\n1486.5\n2555.1\n\nTV\n\nError Rate\n\n0.082\nN/A\nN/A\n\nTable 2: Clustering results. Error rate, cut cost, and running time comparison for MAX-CUT, spectral, and\ntotal variation (TV) algorithms. The balance of the cut is computed as min(|V1|,|V2|)\n. The two moons results\n|V1|+|V2|\nare averaged over 100 trials.\n\napproximately 1 minute to run 1,000 iterations. The same procedure is repeated for the digits 3 and\n5. The results are shown in Table 2. Our MAX-CUT clustering algorithm again performs substantially\nbetter than the spectral method.\n7 Summary\nIn this paper we presented practical methods for solving very large scale optimization problems\ninvolving a max-norm constraint or regularizer. Using this approaches, we showed evidence that\nthe max-norm can often be superior to established techniques such as trace-norm regularization and\nspectral clustering, supplementing previous evidence on small-scale problems. We hope that the\nincreasing evidence of the utility of max-norm regularization, combined with the practical optimiza-\ntion techniques we present here, will reignite interest in using the max-norm for various machine\nlearning applications.\n\nAcknowledgements\nRS supported by NSERC, Shell, and NTT Communication Sciences Laboratory. 
JAT supported by\nONR award N00014-08-1-0883, DARPA award N66001-08-1-2065, and AFOSR award FA9550-\n09-1-0643. JL thanks TTI Chicago for hosting him while this work was completed.\n\nA Proof of the correctness of squash\n\nRewrite (7) as the constrained optimization\n\nminimizeW ,t\nsubject to\n\nPd\ni=1 (cid:107)wi \u2212 vi(cid:107)2 + \u03b2t\n\n(cid:107)wi(cid:107)2 \u2264 t\n\nfor 1 \u2264 i \u2264 d\n\nForming a Lagrangian with a vector of Lagrange multipliers p \u2265 0\n\nL(W , t, p) =\n\n(cid:107)wi \u2212 vi(cid:107)2 + \u03b2t +\n\ndX\n\ni=1\n\ndX\nvi, (b) p \u2265 0, (c)Pd\n\npi((cid:107)wi(cid:107)2 \u2212 t) ,\n\ni=1\n\nthe KKT conditions for this problem thus read (a) wi = 1\nfor 1 \u2264 i \u2264 d, (e) pi > 0 =\u21d2 (cid:107)wi(cid:107)2 = t, and (f) (cid:107)wi(cid:107)2 < t =\u21d2 pi = 0.\nWith our candidate W = squash(V , \u03b2), we need only \ufb01nd t and p to verify the optimality conditions. Let \u03c0\nbe as in Algorithm 1 and set t = \u03b72 and\n\n1+pi\n\ni=1 pi = \u03b2, (d) (cid:107)wi(cid:107)2 \u2264 t\n\n( (cid:107)vk(cid:107)\n\u03b7 \u2212 1 1 \u2264 \u03c0(k) \u2264 q\n\notherwise\n\npk =\n\n0\n\nThis de\ufb01nition of p immediately gives (a). For (b), note that by the de\ufb01nition of q, (cid:107)vk(cid:107) \u2265 \u03b7 for 1 \u2264 \u03c0(k) \u2264 q.\nThus, p \u2265 0. Moreover,\n\nP\n1\u2264\u03c0(k)\u2264q (cid:107)vk(cid:107)\n\npk =\n\n\u03b7\n\ndX\n\nk=1\n\n\u2212 q = q + \u03b2 \u2212 q = \u03b2 ,\n\nyielding (c). Also, by construction, (cid:107)wk(cid:107) = \u03b7 if \u03c0(k) \u2264 q verifying (e). Finally, again by the de\ufb01nition of q,\nwe have\n\n(cid:107)v\u03c0(q+1)(cid:107) <\n\n1\n\n\u03b2 + q + 1\n\n(cid:107)v\u03c0(k)(cid:107) =\n\n1\n\n(cid:107)v\u03c0(q+1)(cid:107) +\n\n\u03b2 + q + 1\n\n\u03b2 + q\n\n\u03b2 + q + 1\n\n\u03b7\n\nwhich implies (cid:107)v\u03c0(q+1)(cid:107) < \u03b7. 
Since (cid:107)vk(cid:107) \u2264 (cid:107)v\u03c0(q+1)(cid:107) for \u03c0(k) > q, this gives (d) and the slackness\ncondition (f).\n\nq+1X\n\nk=1\n\n8\n\n\fReferences\n[1] Nathan Srebro, Jason Rennie, and Tommi Jaakkola. Maximum margin matrix factorization. In Advances\n\nin Neural Information Processing Systems, 2004.\n\n[2] Samuel Burer and R. D. C. Monteiro. A nonlinear programming algorithm for solving semide\ufb01nite\n\nprograms via low-rank factorization. Mathematical Programming (Series B), 95:329\u2013357, 2003.\n\n[3] Benjamin Recht, Maryam Fazel, and Pablo Parrilo. Guaranteed minimum rank solutions of matrix\nSIAM Review, 2007. To appear. Preprint Available at\n\nequations via nuclear norm minimization.\nhttp://pages.cs.wisc.edu/\u02dcbrecht/publications.html.\n\n[4] Francis R. Bach, Julien Marial, and Jean Ponce. Convex sparse matrix factorizations. Preprint available\n\nat arxiv.org/abs/0812.1869, 2008.\n\n[5] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm.\n\nLearning Theory (COLT), 2005.\n\nIn 18th Annual Conference on\n\n[6] G. J. O. Jameson. Summing and Nuclear Norms in Banach Space Theory. Number 8 in London Mathe-\n\nmatical Society Student Texts. Cambridge University Press, Cambridge, UK, 1987.\n\n[7] Ruslan Salakhutdinov and Nathan Srebro. Collaborative \ufb01ltering in a non-uniform world: Learning with\n\nthe weighted trace norm. Preprint available at arxiv.org/abs/1002.2780, 2010.\n\n[8] Masao Fukushima and Hisashi Mine. A generalized proximal point algorithm for certain non-convex\n\nminimization problems. International Journal of Systems Science, 12(8):989\u20131000, 1981.\n\n[9] Samuel Burer and Changhui Choi. Computational enhancements in low-rank semide\ufb01nite programming.\n\nOptimization Methods and Software, 21(3):493\u2013512, 2006.\n\n[10] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scienti\ufb01c, Belmont, MA, 2nd edition, 1999.\n[11] T Hale, W Yin, and Y Zhang. 
A \ufb01xed-point continuation method for l 1-regularized minimization with\napplications to compressed sensing. Dept. Computat. Appl. Math., Rice Univ., Houston, TX, Tech. Rep.\nTR07-07, 2007.\n\n[12] Stephen J. Wright, Robert Nowak, and M\u00b4ario A. T. Figueiredo. Sparse reconstruction by separable ap-\nproximation. Journal version, to appear in IEEE Transactions on Signal Processing. Preprint available at\nhttp:http://www.optimization-online.org/DB_HTML/2007/10/1813.html, 2007.\n[13] Jian-Feng Cai, Emmanuel J. Cand`es, and Zuowei Shen. A singular value thresholding algorithm for\nmatrix completion. To appear in SIAM J. on Optimization. Preprint available at http://arxiv.org/\nabs/0810.3286, 2008.\n\n[14] Shiqian Ma, Donald Goldfarb, and Lifeng Chen. Fixed point and Bregman iterative methods for matrix\nrank minimization. Preprint available at http://www.optimization-online.org/DB_HTML/\n2008/11/2151.html, 2008.\n\n[15] Yurii Nesterov. Gradient methods for minimizing composite objective function. To appear. Preprint\nAvailable at http://www.optimization-online.org/DB_HTML/2007/09/1784.html,\nSeptember 2007.\n\n[16] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satis\ufb01-\n\nability problems using semide\ufb01nite programming. Journal of the ACM, 42:1115\u20131145, 1995.\n\n[17] The Gset is available for download at http://www.stanford.edu/\u02dcyyye/yyye/Gset/.\n[18] Samuel Burer. Sdplr. Software available at http://dollar.biz.uiowa.edu/\u02dcsburer/www/\n\ndoku.php?id=software#sdplr.\n\n[19] Arthur Szlam and Xavier Bresson. A total variation-based graph clustering algorithm for cheeger\nratio cuts. To appear in ICML 2010. Preprint available at ftp://ftp.math.ucla.edu/pub/\ncamreport/cam09-68.pdf, 2010.\n\n[20] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. 
IEEE Transactions on Pattern\n\nAnalysis and Machine Intelligence, 22(8):888\u2013905, 2000.\n\n9\n\n\f", "award": [], "sourceid": 678, "authors": [{"given_name": "Jason", "family_name": "Lee", "institution": null}, {"given_name": "Ben", "family_name": "Recht", "institution": null}, {"given_name": "Nathan", "family_name": "Srebro", "institution": null}, {"given_name": "Joel", "family_name": "Tropp", "institution": null}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": null}]}