{"title": "Spectral k-Support Norm Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 3644, "page_last": 3652, "abstract": "The $k$-support norm has successfully been applied to sparse vector prediction problems. We observe that it belongs to a wider class of norms, which we call the box-norms. Within this framework we derive an efficient algorithm to compute the proximity operator of the squared norm, improving upon the original method for the $k$-support norm. We extend the norms from the vector to the matrix setting and we introduce the spectral $k$-support norm. We study its properties and show that it is closely related to the multitask learning cluster norm. We apply the norms to real and synthetic matrix completion datasets. Our findings indicate that spectral $k$-support norm regularization gives state of the art performance, consistently improving over trace norm regularization and the matrix elastic net.", "full_text": "Spectral k-Support Norm Regularization\n\nAndrew M. McDonald, Massimiliano Pontil, Dimitris Stamos\n\n{a.mcdonald,m.pontil,d.stamos}@cs.ucl.ac.uk\n\nDepartment of Computer Science\n\nUniversity College London\n\nAbstract\n\nThe k-support norm has successfully been applied to sparse vector prediction\nproblems. We observe that it belongs to a wider class of norms, which we call the\nbox-norms. Within this framework we derive an ef\ufb01cient algorithm to compute\nthe proximity operator of the squared norm, improving upon the original method\nfor the k-support norm. We extend the norms from the vector to the matrix setting\nand we introduce the spectral k-support norm. We study its properties and show\nthat it is closely related to the multitask learning cluster norm. We apply the norms\nto real and synthetic matrix completion datasets. 
Our findings indicate that spectral k-support norm regularization gives state of the art performance, consistently improving over trace norm regularization and the matrix elastic net.

1 Introduction

In recent years there has been a great deal of interest in the problem of learning a low rank matrix from a set of linear measurements. A widely studied and successful instance of this problem arises in the context of matrix completion or collaborative filtering, in which we want to recover a low rank (or approximately low rank) matrix from a small sample of its entries, see e.g. [1, 2]. One prominent method to solve this problem is trace norm regularization: we look for a matrix which closely fits the observed entries and has a small trace norm (sum of singular values) [3, 4, 5]. Besides collaborative filtering, this problem has important applications ranging from multitask learning, to computer vision and natural language processing, to mention but a few.

In this paper, we propose new techniques to learn low rank matrices. These are inspired by the notion of the k-support norm [6], which was recently studied in the context of sparse vector prediction and shown to empirically outperform the Lasso [7] and Elastic Net [8] penalties. We note that this norm can naturally be extended to the matrix setting and its characteristic properties relating to the cardinality operator translate in a natural manner to matrices. Our approach is suggested by the observation that the k-support norm belongs to a broader class of norms, which makes it apparent that they can be extended to spectral matrix norms. Moreover, it provides a link between the spectral k-support norm and the cluster norm, a regularizer introduced in the context of multitask learning [9]. 
This result allows us to interpret the spectral k-support norm as a special case of the cluster norm, and it furthermore adds a new perspective on the cluster norm as a perturbation of the former.

The main contributions of this paper are threefold. First, we show that the k-support norm can be written as a parametrized infimum of quadratics, which we term the box-norms, and which are symmetric gauge functions. This allows us to extend the norms to orthogonally invariant matrix norms using a classical result by von Neumann [10]. Second, we show that the spectral box-norm is essentially equivalent to the cluster norm, which in turn can be interpreted as a perturbation of the spectral k-support norm, in the sense of the Moreau envelope [11]. Third, we use the infimum framework to compute the box-norm and the proximity operator of the squared norm in $O(d \log d)$ time. Apart from improving on the $O(d(k + \log d))$ algorithm in [6], this method allows one to use optimal first order optimization algorithms [12] with the cluster norm. Finally, we present numerical experiments which indicate that the spectral k-support norm shows a significant improvement in performance over regularization with the trace norm and the matrix elastic net, on four popular matrix completion benchmarks.

The paper is organized as follows. In Section 2 we recall the k-support norm, and define the box-norm. In Section 3 we study their properties, we introduce the corresponding spectral norms, and we observe the connection to the cluster norm. In Section 4 we compute the norm and we derive a fast method to compute the proximity operator. Finally, in Section 5 we report on our numerical experiments. The supplementary material contains derivations of the results in the body of the paper.

2 Preliminaries

In this section, we recall the k-support norm and we introduce the box-norm and its dual. 
The k-support norm $\|\cdot\|_{(k)}$ was introduced in [6] as the norm whose unit ball is the convex hull of the set of vectors of cardinality at most $k$ and $\ell_2$-norm no greater than one. The authors show that the k-support norm can be written as the infimal convolution [11]

$$\|w\|_{(k)} = \inf\Big\{ \sum_{g \in \mathcal{G}_k} \|v_g\|_2 : v_g \in \mathbb{R}^d,\ \mathrm{supp}(v_g) \subseteq g,\ \sum_{g \in \mathcal{G}_k} v_g = w \Big\}, \quad w \in \mathbb{R}^d, \qquad (1)$$

where $\mathcal{G}_k$ is the collection of all subsets of $\{1, \ldots, d\}$ containing at most $k$ elements, and for any $v \in \mathbb{R}^d$, the set $\mathrm{supp}(v) = \{i : v_i \neq 0\}$ denotes the support of $v$. When used as a regularizer, the norm encourages vectors $w$ to be a sum of a limited number of vectors with small support. The k-support norm is a special case of the group lasso with overlap [13], where the cardinality of the support sets is at most $k$. Despite the complicated form of the primal norm, the dual norm has a simple formulation, namely the $\ell_2$-norm of the $k$ largest components of the vector:

$$\|u\|_{*,(k)} = \sqrt{\sum_{i=1}^{k} \big(|u|^{\downarrow}_i\big)^2}, \quad u \in \mathbb{R}^d, \qquad (2)$$

where $|u|^{\downarrow}$ is the vector obtained from $u$ by reordering its components so that they are non-increasing in absolute value [6]. The k-support norm includes the $\ell_1$-norm and $\ell_2$-norm as special cases. This is clear from the dual norm since for $k = 1$ and $k = d$, it is equal to the $\ell_\infty$-norm and $\ell_2$-norm, respectively. We note that while definition (1) involves a combinatorial number of variables, [6] observed that the norm can be computed in $O(d \log d)$.

We now define the box-norm, and in the following section we will show that the k-support norm is a special case of this family.

Definition 2.1. Let $0 \leq a \leq b$ and $c \in [ad, bd]$, and let $\Theta = \{\theta \in \mathbb{R}^d : a \leq \theta_i \leq b,\ \sum_{i=1}^{d} \theta_i \leq c\}$. The box-norm is defined as

$$\|w\|_{\Theta} = \sqrt{\inf_{\theta \in \Theta} \sum_{i=1}^{d} \frac{w_i^2}{\theta_i}}, \quad w \in \mathbb{R}^d. \qquad (3)$$

This formulation will be fundamental in deriving the proximity operator in Section 4.1. 
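As a concrete illustration, the dual norm (2) is straightforward to evaluate numerically; the following is a minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def dual_k_support_norm(u, k):
    """l2-norm of the k largest components of u in absolute value, as in (2)."""
    s = np.sort(np.abs(u))[::-1]   # |u|^down: magnitudes sorted non-increasing
    return float(np.sqrt(np.sum(s[:k] ** 2)))
```

For $k = 1$ this returns $\max_i |u_i|$ (the dual of the $\ell_1$-norm) and for $k = d$ it returns $\|u\|_2$, matching the special cases noted above.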
Note that we may assume without loss of generality that $b = 1$, as by rescaling we obtain an equivalent norm; however we do not explicitly fix $b$ in the sequel.

Proposition 2.2. The norm (3) is well defined and the dual norm is $\|u\|_{*,\Theta} = \sqrt{\sup_{\theta \in \Theta} \sum_{i=1}^{d} \theta_i u_i^2}$.

The result holds true in the more general case that $\Theta$ is a bounded convex subset of the strictly positive orthant (for related results see [14, 15, 16, 17, 18, 19] and references therein). In this paper we limit ourselves to the box constraints above. In particular we note that the constraints are invariant with respect to permutation of the components of $\Theta$, and as we shall see this property is key to extend the norm to matrices.

3 Properties of the Norms

In this section, we study the properties of the vector norms, and we extend the norms to the matrix setting. We begin by deriving the dual box-norm.

Proposition 3.1. The dual box-norm is given by

$$\|u\|_{*,\Theta} = \sqrt{a\|u\|_2^2 + (b - a)\|u\|_{*,(k)}^2 + (b - a)(\rho - k)\big(|u|^{\downarrow}_{k+1}\big)^2}, \qquad (4)$$

where $\rho = \frac{c - da}{b - a}$ and $k$ is the largest integer not exceeding $\rho$.

We see from (4) that the dual norm decomposes into two $\ell_2$-norms plus a residual term, which vanishes if $\rho = k$; for the rest of this paper we assume this holds, which loses little generality. Note that setting $a = 0$, $b = 1$, and $c = k \in \{1, \ldots, d\}$, the dual box-norm (4) is the $\ell_2$-norm of the largest $k$ components of $u$, and we recover the dual k-support norm in equation (2). It follows that the k-support norm is a box-norm with parameters $a = 0$, $b = 1$, $c = k$.

The following infimal convolution interpretation of the box-norm provides a link between the box-norm and the k-support norm, and illustrates the effect of the parameters.

Proposition 3.2. If $0 < a \leq b$ and $c = (b - a)k + da$, for $k \in \{1, \ldots, d\}$, then

$$\|w\|_{\Theta} = \inf\Big\{ \sum_{g \in \mathcal{G}_k} \sqrt{\sum_{i \in g} \frac{v_{g,i}^2}{b} + \sum_{i \notin g} \frac{v_{g,i}^2}{a}} : v_g \in \mathbb{R}^d,\ \sum_{g \in \mathcal{G}_k} v_g = w \Big\}. \qquad (5)$$

Notice that if $b = 1$, then as $a$ tends to zero we obtain the expression of the k-support norm (1), recovering in particular the support constraints. If $a$ is small and positive, the support constraints are not imposed, however effectively most of the weight for each $v_g$ tends to be concentrated on $\mathrm{supp}(g)$. Hence, Proposition 3.2 suggests that the box-norm regularizer will encourage vectors $w$ whose dominant components are a subset of a union of a small number of groups $g \in \mathcal{G}_k$.

The previous results have characterized the k-support norm as a special case of the box-norm. Conversely, the box-norm can be seen as a perturbation of the k-support norm with a quadratic term.

Proposition 3.3. Let $\|\cdot\|_{\Theta}$ be the box-norm on $\mathbb{R}^d$ with parameters $0 < a < b$ and $c = k(b - a) + da$, for $k \in \{1, \ldots, d\}$, then

$$\|w\|_{\Theta}^2 = \min_{z \in \mathbb{R}^d} \Big\{ \frac{1}{a}\|w - z\|_2^2 + \frac{1}{b - a}\|z\|_{(k)}^2 \Big\}. \qquad (6)$$

Consider the regularization problem $\min_{w \in \mathbb{R}^d} \|Xw - y\|_2^2 + \lambda\|w\|_{\Theta}^2$, with data $X$ and response $y$. Using Proposition 3.3 and setting $w = u + z$, we see that this problem is equivalent to

$$\min_{u,z \in \mathbb{R}^d} \Big\{ \|X(u + z) - y\|_2^2 + \frac{\lambda}{a}\|u\|_2^2 + \frac{\lambda}{b - a}\|z\|_{(k)}^2 \Big\}.$$

Furthermore, if $(\hat{u}, \hat{z})$ solves this problem then $\hat{w} = \hat{u} + \hat{z}$ solves the original regularization problem. The solution $\hat{w}$ can therefore be interpreted as the superposition of a vector which has small $\ell_2$ norm and a vector which has small k-support norm, with the parameter $a$ regulating these two components. Specifically, as $a$ tends to zero, in order to prevent the objective from blowing up, $\hat{u}$ must also tend to zero and we recover k-support norm regularization. 
Similarly, as $a$ tends to $b$, $\hat{z}$ vanishes and we have a simple ridge regression problem.

3.1 The Spectral k-Support Norm and the Spectral Box-Norm

We now turn our focus to the matrix norms. For this purpose, we recall that a norm $\|\cdot\|$ on $\mathbb{R}^{d \times m}$ is called orthogonally invariant if $\|W\| = \|UWV\|$, for any orthogonal matrices $U \in \mathbb{R}^{d \times d}$ and $V \in \mathbb{R}^{m \times m}$. A classical result by von Neumann [10] establishes that a norm is orthogonally invariant if and only if it is of the form $\|W\| = g(\sigma(W))$, where $\sigma(W)$ is the vector formed by the singular values of $W$ in nonincreasing order, and $g$ is a symmetric gauge function, that is a norm which is invariant under permutations and sign changes of the vector components.

Lemma 3.4. If $\Theta$ is a convex bounded subset of the strictly positive orthant in $\mathbb{R}^d$ which is invariant under permutations, then $\|\cdot\|_{\Theta}$ is a symmetric gauge function.

In particular, this readily applies to both the k-support norm and the box-norm. We can therefore extend both norms to orthogonally invariant norms, which we term the spectral k-support norm and the spectral box-norm respectively, and which we write (with some abuse of notation) as $\|W\|_{(k)} = \|\sigma(W)\|_{(k)}$ and $\|W\|_{\Theta} = \|\sigma(W)\|_{\Theta}$. We note that since the k-support norm subsumes the $\ell_1$ and $\ell_2$-norms for $k = 1$ and $k = d$ respectively, the corresponding spectral k-support norms are equal to the trace and Frobenius norms respectively. We first characterize the unit ball of the spectral k-support norm.

Proposition 3.5. The unit ball of the spectral k-support norm is the convex hull of the set of matrices of rank at most $k$ and Frobenius norm no greater than one.

Referring to the unit ball characterization of the k-support norm, we note that the restriction on the cardinality of the vectors whose convex hull defines the unit ball naturally extends to a restriction on the rank operator in the matrix setting. 
Furthermore, as noted in [6], regularization using the k-support norm encourages vectors to be sparse, but less so than the $\ell_1$-norm. In matrix problems, as the extreme points of the unit ball have rank $k$, Proposition 3.5 suggests that the spectral k-support norm for $k > 1$ should encourage matrices to have low rank, but less so than the trace norm.

3.2 Cluster Norm

We end this section by briefly discussing the cluster norm, which was introduced in [9] as a convex relaxation of a multitask clustering problem. The norm is defined, for every $W \in \mathbb{R}^{d \times m}$, as

$$\|W\|_{\mathrm{cl}} = \sqrt{\inf_{S \in \mathcal{S}_m} \operatorname{tr}(S^{-1} W^{\top} W)}, \qquad (7)$$

where $\mathcal{S}_m = \{S \in \mathbb{R}^{m \times m},\ S \succ 0 : aI \preceq S \preceq bI,\ \operatorname{tr} S = c\}$, and $0 < a \leq b$. In [9] the authors state that the cluster norm of $W$ equals the box-norm of the vector formed by the singular values of $W$, where $c = (b - a)k + da$. Here we provide a proof of this result. Denote by $\lambda_i(\cdot)$ the eigenvalues of a matrix, which we write in nonincreasing order $\lambda_1(\cdot) \geq \lambda_2(\cdot) \geq \cdots \geq \lambda_d(\cdot)$. Note that if $\theta_i$ are the eigenvalues of $S$ then $\theta_i^{-1} = \lambda_{d-i+1}(S^{-1})$. We have that

$$\operatorname{tr}(S^{-1} W^{\top} W) = \operatorname{tr}(S^{-1} U \Sigma^2 U^{\top}) \geq \sum_{i=1}^{m} \lambda_{d-i+1}(S^{-1})\, \lambda_i(W^{\top} W) = \sum_{i=1}^{d} \frac{\sigma_i^2(W)}{\theta_i},$$

where we have used the inequality [20, Sec. H.1.h] for $S^{-1}, W^{\top} W \succeq 0$. Since this inequality is attained whenever $S = U \operatorname{Diag}(\theta) U^{\top}$, where $U$ are the eigenvectors of $W^{\top} W$, we see that $\|W\|_{\mathrm{cl}} = \|\sigma(W)\|_{\Theta}$, that is, the cluster norm coincides with the spectral box-norm. In particular, we see that the spectral k-support norm is a special case of the cluster norm, where we let $a$ tend to zero, $b = 1$ and $c = k$. Moreover, the methods to compute the norm and its proximity operator described in the following section can directly be applied to the cluster norm.

As in the case of the vector norm (Proposition 3.3), the spectral box-norm, or cluster norm, can be written as a perturbation of the spectral k-support norm with a quadratic term.

Proposition 3.6. 
Let $\|\cdot\|_{\Theta}$ be a matrix box-norm with parameters $a, b, c$ and let $k = \frac{c - da}{b - a}$. Then

$$\|W\|_{\Theta}^2 = \min_{Z} \frac{1}{a}\|W - Z\|_F^2 + \frac{1}{b - a}\|Z\|_{(k)}^2.$$

In other words, this result shows that the cluster norm can be seen as the Moreau envelope [11] of a spectral k-support norm.

4 Computing the Norms and their Proximity Operator

In this section, we compute the norm and the proximity operator of the squared norm by explicitly solving the optimization problem in (3). We begin with the vector norm.

Theorem 4.1. For every $w \in \mathbb{R}^d$ it holds that

$$\|w\|_{\Theta}^2 = \frac{1}{b}\|w_Q\|_2^2 + \frac{1}{p}\|w_I\|_1^2 + \frac{1}{a}\|w_L\|_2^2, \qquad (8)$$

where $w_Q = (|w|^{\downarrow}_1, \ldots, |w|^{\downarrow}_q)$, $w_I = (|w|^{\downarrow}_{q+1}, \ldots, |w|^{\downarrow}_{d-\ell})$, $w_L = (|w|^{\downarrow}_{d-\ell+1}, \ldots, |w|^{\downarrow}_d)$, and $q$ and $\ell$ are the unique integers in $\{0, \ldots, d\}$ that satisfy $q + \ell \leq d$ and

$$\frac{|w_q|}{b} \geq \frac{1}{p}\sum_{i=q+1}^{d-\ell} |w_i| > \frac{|w_{q+1}|}{b}, \qquad \frac{|w_{d-\ell}|}{a} > \frac{1}{p}\sum_{i=q+1}^{d-\ell} |w_i| \geq \frac{|w_{d-\ell+1}|}{a}, \qquad (9)$$

where $p = c - qb - \ell a$ and we have defined $|w_0| = \infty$ and $|w_{d+1}| = 0$.

Proof. (Sketch) We need to solve the optimization problem

$$\inf_{\theta}\Big\{ \sum_{i=1}^{d} \frac{w_i^2}{\theta_i} : a \leq \theta_i \leq b,\ \sum_{i=1}^{d} \theta_i \leq c \Big\}. \qquad (10)$$

We assume without loss of generality that the $w_i$ are ordered nonincreasing in absolute value, and it follows that at the optimum the $\theta_i$ are also ordered nonincreasing. We further assume that $w_i \neq 0$ for all $i$ and $c \leq db$, so the sum constraint will be tight at the optimum. The Lagrangian is given by

$$L(\theta, \alpha) = \sum_{i=1}^{d} \frac{w_i^2}{\theta_i} + \frac{1}{\alpha^2}\Big( \sum_{i=1}^{d} \theta_i - c \Big),$$

where $1/\alpha^2$ is a strictly positive multiplier to be chosen such that $S(\alpha) := \sum_{i=1}^{d} \theta_i(\alpha) = c$. 
We can then solve the original problem by minimizing the Lagrangian over the constraint $\theta \in [a, b]^d$. Due to the decoupling effect of the multiplier we can solve the simplified problem componentwise, obtaining the solution

$$\theta_i = \theta_i(\alpha) = \min(b, \max(a, \alpha|w_i|)), \qquad (11)$$

where $\alpha$ is such that $S(\alpha) = c$. The minimizer has the form $\theta = (b, \ldots, b, \theta_{q+1}, \ldots, \theta_{d-\ell}, a, \ldots, a)$, where $q, \ell$ are determined by the value of $\alpha$. From $S(\alpha) = c$ we get $\alpha = p / \big(\sum_{i=q+1}^{d-\ell} |w_i|\big)$. The value of the norm in (8) follows by substituting $\theta$ into the objective. Finally, by construction we have $\theta_q \geq b > \theta_{q+1}$ and $\theta_{d-\ell} > a \geq \theta_{d-\ell+1}$, which give rise to the conditions in (9).

Theorem 4.1 suggests two methods for computing the box-norm. First, we find $\alpha$ such that $S(\alpha) = c$; this value uniquely determines $\theta$ in (11), and the norm follows by substitution into (10). Alternatively, we identify $q$ and $\ell$ that jointly satisfy (9) and we compute the norm using (8). Taking advantage of the structure of $\theta$ in the former method leads to a computation time that is $O(d \log d)$.

Theorem 4.2. The computation of the box-norm can be completed in $O(d \log d)$ time.

The k-support norm is a special case of the box-norm, and as a direct corollary of Theorem 4.1 and Theorem 4.2, we recover [6, Proposition 2.1].

4.1 Proximity Operator

Proximal gradient methods can be used to solve optimization problems of the form $\min_w f(w) + \lambda g(w)$, where $f$ is a convex loss function with Lipschitz continuous gradient, $\lambda > 0$ is a regularization parameter, and $g$ is a convex function for which the proximity operator can be computed efficiently, see [12, 21, 22] and references therein. 
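The first method of Theorem 4.1, solving $S(\alpha) = c$ and substituting $\theta(\alpha)$ into (10), can be sketched as follows. For simplicity we locate $\alpha$ by bisection on the nondecreasing map $S$ rather than by the sorting scheme that yields the $O(d \log d)$ guarantee; the function name and tolerances are our own choices, and we assume $a > 0$:

```python
import numpy as np

def box_norm(w, a, b, c, iters=200):
    """Box-norm ||w||_Theta of Definition 2.1 via Theorem 4.1 (assumes a > 0):
    theta_i(alpha) = min(b, max(a, alpha*|w_i|)), with alpha solving S(alpha) = c."""
    w = np.abs(np.asarray(w, dtype=float))
    S = lambda alpha: np.minimum(b, np.maximum(a, alpha * w)).sum()
    lo, hi = 0.0, 1.0
    while S(hi) < c and hi < 1e16:   # bracket the root of the nondecreasing map S
        hi *= 2.0
    for _ in range(iters):           # bisection (the paper's scheme is O(d log d))
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if S(mid) < c else (lo, mid)
    theta = np.minimum(b, np.maximum(a, hi * w))
    return float(np.sqrt(np.sum(w ** 2 / theta)))
```

With $b = 1$ and $a$ close to zero, $c = k$ recovers (approximately) the k-support norm; in particular $c = 1$ gives roughly the $\ell_1$-norm and $c = d$ the $\ell_2$-norm.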
The proximity operator of g with parameter \u21e2 > 0\nis de\ufb01ned as\n\nprox\u21e2g(w) = argmin\u21e2 1\n\n2kx wk2 + \u21e2g(x) : x 2 Rd .\n\nWe now use the in\ufb01mum formulation of the box-norm to derive the proximity operator of the squared\nnorm.\n\n5\n\n\fAlgorithm 1 Computation of x = prox \nRequire: parameters a, b, c, .\n\n2 k\u00b7k2\n\n\u21e5\n\n(w).\n\n, b+\n\ni=1 =n a+\n\n1. Sort points\u21b5i 2d\n2. Identify points \u21b5i and \u21b5i+1 such that S(\u21b5i) \uf8ff c and S(\u21b5i+1) c by binary search;\n3. Find \u21b5\u21e4 between \u21b5i and \u21b5i+1 such that S(\u21b5\u21e4) = c by linear interpolation;\n4. Compute \u2713i(\u21b5\u21e4) for i = 1, . . . , d;\n5. Return xi = \u2713iwi\n\u2713i+ for i = 1, . . . , d.\n\nsuch that \u21b5i \uf8ff \u21b5i+1;\n\n|wj|od\n\n|wj|\n\nj=1\n\n2 k\u00b7k2\n\n\u21e5\n\nTheorem 4.3. The proximity operator of the square of the box-norm at point w 2 Rd with parameter\n2 is given by prox \n\n\u27131+ , . . . , \u2713dwd\n\n\u2713d+ ), where\n\n(w) = ( \u27131w1\n\n\n\n\u2713i = \u2713i(\u21b5) = min(b, max(a, \u21b5|wi| ))\n\n(12)\ni=1 \u2713i(\u21b5) = c. Furthermore, the computation of the proximity\n\nand \u21b5 is chosen such that S(\u21b5) :=Pd\noperator can be completed in O(d log d) time.\nThe proof follows a similar reasoning to the proof of Theorem 4.1. Algorithm 1 illustrates the\ncomputation of the proximity operator for the squared box-norm in O(d log d) time. This includes\nthe k-support as a special case, where we let a tend to zero, and set b = 1 and c = k, which\nimproves upon the complexity of the O(d(k + log d)) computation provided in [6], and we illustrate\nthe improvement empirically in Table 1.\n\n4.2 Proximity Operator for Orthogonally Invariant Norms\n\nThe computational considerations outlined above can be naturally extended to the matrix setting by\nusing von Neumann\u2019s trace inequality (see, e.g. [23]). 
Here we comment on the computation of the proximity operator, which is important for our numerical experiments in the following section. The proximity operator of an orthogonally invariant norm $\|\cdot\| = g(\sigma(\cdot))$ is given by

$$\operatorname{prox}_{\|\cdot\|}(W) = U \operatorname{diag}(\operatorname{prox}_g(\sigma(W))) V^{\top}, \quad W \in \mathbb{R}^{m \times d},$$

where $U$ and $V$ are the matrices formed by the left and right singular vectors of $W$ (see e.g. [24, Prop 3.1]). Using this result we can employ proximal gradient methods to solve matrix regularization problems using the squared spectral k-support norm and spectral box-norm.

5 Numerical Experiments

In this section, we report on the statistical performance of the spectral regularizers in matrix completion experiments. We also offer an interpretation of the role of the parameters in the box-norm and we empirically verify the improved performance of the proximity operator computation (see Table 1). We compare the trace norm (tr) [25], matrix elastic net (en) [26], spectral k-support norm (ks) and the spectral box-norm (box). The Frobenius norm, which is equal to the spectral k-support norm for $k = d$, performed considerably worse than the trace norm and we omit the results here. We report test error and standard deviation, matrix rank (r) and optimal parameter values for $k$ and $a$, which were determined by validation, as were the regularization parameters. When comparing performance, we used a t-test to determine statistical significance at a level of $p < 0.001$. For the optimization we used an accelerated proximal gradient method (FISTA), see e.g. [12, 21, 22], with the percentage change in objective as convergence criterion, with a tolerance of $10^{-5}$ for the simulated datasets and $10^{-3}$ for the real datasets. As is typical with spectral regularizers, we found that the spectrum of the learned matrix exhibited a rapid decay to zero. 
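The reduction of Section 4.2, from a matrix proximity operator to a vector one applied to the singular values, can be sketched generically; here vec_prox stands for any vector proximity map (for instance that of the squared box-norm), and the helper name is ours:

```python
import numpy as np

def prox_orthogonally_invariant(W, vec_prox):
    """prox of an orthogonally invariant norm g(sigma(.)): apply the vector
    prox of g to the singular values of W and recompose with U and V."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(vec_prox(s)) @ Vt
```

A quick sanity check: with vec_prox(s) = s / (1 + lam), the prox of the squared Frobenius norm, the result is simply W / (1 + lam).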
In order to explicitly impose a low rank on the solution we included a final step where we hard-threshold the singular values of the final matrix below a level determined by validation. We report on both sets of results below.

5.1 Simulated Data

Matrix Completion. We applied the norms to matrix completion on noisy observations of low rank matrices. Each $m \times m$ matrix was generated as $W = AB^{\top} + E$, where $A, B \in \mathbb{R}^{m \times r}$, $r \ll m$, and the entries of $A$, $B$ and $E$ are i.i.d. standard Gaussian. We set $m = 100$, $r \in \{5, 10\}$, and we sampled uniformly a percentage $\rho \in \{10\%, 20\%, 30\%\}$ of the entries for training, and used a fixed 10% for validation. The error was measured as $\|\text{true} - \text{predicted}\|_2 / \|\text{true}\|_2$ [5] and averaged over 100 trials. The results are summarized in Table 2. In the thresholding case, all methods recovered the rank of the true noiseless matrix. The spectral box-norm generated the lowest test errors in all regimes, with the spectral k-support norm a close second, in particular in the thresholding case. This suggests that the non zero parameter $a$ in the spectral box-norm counteracted the noise to some extent.

Table 1: Comparison of proximity operator algorithms for the k-support norm (time in seconds), $k = 0.05 d$. Algorithm 1 is the method in [6], Algorithm 2 is our method.

d       1,000   2,000   4,000   8,000   16,000  32,000
Alg. 1  0.0443  0.1567  0.5907  2.3065  9.0080  35.6199
Alg. 2  0.0011  0.0016  0.0026  0.0046  0.0101  0.0181

Role of Parameters. In the same setting we investigated the role of the parameters in the box-norm. As previously discussed, parameter $b$ can be set to 1 without loss of generality. Figure 1 shows the optimal value of $a$ chosen by validation for varying signal to noise ratios (SNR), keeping $k$ fixed. We see that for greater noise levels (smaller SNR), the optimal value for $a$ increases. While for $a > 0$ the recovered solutions are not sparse, as we show below this can still lead to improved performance in experiments, in particular in the presence of noise. Figure 2 shows the optimal value of $k$ chosen by validation for matrices with increasing rank, keeping $a$ fixed. We notice that as the rank of the matrix increases, the optimal $k$ value increases, which is expected since it is an upper bound on the sum of the singular values.

[Figure 1: Impact of signal to noise on a.] [Figure 2: Impact of matrix rank on k.]

Table 2: Matrix completion on simulated data sets, without (left block) and with (right block) thresholding.

Without thresholding:
dataset           norm  test error       r     k      a
rank 5,  rho=10%  tr    0.8184 (0.03)   20     -      -
                  en    0.8164 (0.03)   20     -      -
                  ks    0.8036 (0.03)   16    3.6     -
                  box   0.7805 (0.03)   87    2.9    1.7e-2
rank 5,  rho=20%  tr    0.4085 (0.03)   23     -      -
                  en    0.4081 (0.03)   23     -      -
                  ks    0.4031 (0.03)   21    3.1     -
                  box   0.3898 (0.03)  100    1.3    9e-3
rank 10, rho=20%  tr    0.6356 (0.03)   27     -      -
                  en    0.6359 (0.03)   27     -      -
                  ks    0.6284 (0.03)   24    4.4     -
                  box   0.6243 (0.03)   89    1.8    9e-3
rank 10, rho=30%  tr    0.3642 (0.02)   36     -      -
                  en    0.3638 (0.02)   36     -      -
                  ks    0.3579 (0.02)   33    5.0     -
                  box   0.3486 (0.02)  100    2.5    9e-3

With thresholding:
dataset           norm  test error       r     k      a
rank 5,  rho=10%  tr    0.7799 (0.04)    5     -      -
                  en    0.7794 (0.04)    5     -      -
                  ks    0.7728 (0.04)    5    4.23    -
                  box   0.7649 (0.04)    5    3.63   8.1e-3
rank 5,  rho=20%  tr    0.3449 (0.02)    5     -      -
                  en    0.3445 (0.02)    5     -      -
                  ks    0.3381 (0.02)    5    2.97    -
                  box   0.3380 (0.02)    5    3.28   1.9e-3
rank 10, rho=20%  tr    0.6084 (0.03)   10     -      -
                  en    0.6074 (0.03)   10     -      -
                  ks    0.6000 (0.03)   10    5.02    -
                  box   0.6000 (0.03)   10    5.22   1.9e-3
rank 10, rho=30%  tr    0.3086 (0.02)   10     -      -
                  en    0.3082 (0.02)   10     -      -
                  ks    0.3025 (0.02)   10    5.13    -
                  box   0.3025 (0.02)   10    5.16   3e-4

Table 3: Matrix completion on real data sets, without (left block) and with (right block) thresholding.

Without thresholding:
dataset                 norm  test error     r      k      a
MovieLens 100k          tr    0.2034        87      -      -
rho = 50%               en    0.2034        87      -      -
                        ks    0.2031       102    1.00     -
                        box   0.2035       943    1.00    1e-5
MovieLens 1M            tr    0.1821       325      -      -
rho = 50%               en    0.1821       319      -      -
                        ks    0.1820       317    1.00     -
                        box   0.1817      3576    1.09    3e-5
Jester 1 (20 per line)  tr    0.1787        98      -      -
                        en    0.1787        98      -      -
                        ks    0.1764        84    5.00     -
                        box   0.1766       100    4.00    1e-6
Jester 3 (8 per line)   tr    0.1988        49      -      -
                        en    0.1988        49      -      -
                        ks    0.1970        46    3.70     -
                        box   0.1973       100    5.91    1e-3

With thresholding:
dataset                 norm  test error     r      k      a
MovieLens 100k          tr    0.2017        13      -      -
rho = 50%               en    0.2017        13      -      -
                        ks    0.1990         9    1.87     -
                        box   0.1989        10    2.00    1e-5
MovieLens 1M            tr    0.1790        17      -      -
rho = 50%               en    0.1789        17      -      -
                        ks    0.1782        17    1.80     -
                        box   0.1777        19    2.00    1e-6
Jester 1 (20 per line)  tr    0.1752        11      -      -
                        en    0.1752        11      -      -
                        ks    0.1739        11    6.38     -
                        box   0.1726        11    6.40    2e-5
Jester 3 (8 per line)   tr    0.1959         3      -      -
                        en    0.1959         3      -      -
                        ks    0.1942         3    2.13     -
                        box   0.1940         3    4.00    8e-4

5.2 Real Data

Matrix Completion (MovieLens and Jester). In this section we report on matrix completion on real data sets. We observe a percentage of the (user, rating) entries of a matrix and the task is to predict the unobserved ratings, with the assumption that the true matrix has low rank. The datasets we considered were MovieLens 100k and MovieLens 1M (http://grouplens.org/datasets/movielens/), which consist of user ratings of movies, and Jester 1 and Jester 3 (http://goldberg.berkeley.edu/jester-data/), which consist of users and ratings of jokes (Jester 2 showed essentially identical performance to Jester 1). Following [4], for MovieLens we uniformly sampled $\rho = 50\%$ of the available entries for each user for training, and for Jester 1 and Jester 3 we sampled 20, respectively 8, ratings per user, and we used 10% for validation. 
The error was measured as normalized mean absolute error, $\|\text{true} - \text{predicted}\|_1 / (\#\text{observations} \cdot (r_{\max} - r_{\min}))$, where $r_{\min}$ and $r_{\max}$ are lower and upper bounds for the ratings [4]. The results are outlined in Table 3. In the thresholding case, the spectral box and k-support norms had the best performance. In the absence of thresholding, the spectral k-support norm showed slightly better performance. Comparing to the synthetic data sets, this suggests that in the absence of noise the parameter $a$ did not provide any benefit. We note that in the absence of thresholding our results for the trace norm on MovieLens 100k agreed with those in [3].

6 Conclusion

We showed that the k-support norm belongs to the family of box-norms and noted that these can be naturally extended from the vector to the matrix setting. We also provided a connection between the k-support norm and the cluster norm, which essentially coincides with the spectral box-norm. We further observed that the cluster norm is a perturbation of the spectral k-support norm, and we were able to compute the norm and its proximity operator. Our experiments indicate that the spectral box-norm and k-support norm consistently outperform the trace norm and the matrix elastic net on various matrix completion problems. With a single parameter to validate, compared to two for the spectral box-norm, our results suggest that the spectral k-support norm is a powerful alternative to the trace norm and the elastic net, which has the same number of parameters. In future work, we would like to study the application of the norms to clustering problems in multitask learning [9], in particular the impact of centering. It would also be valuable to derive statistical inequalities and Rademacher complexities for these norms.

Acknowledgements

We would like to thank Andreas Maurer, Charles Micchelli and especially Andreas Argyriou for useful discussions. 
Part of this work was supported by EPSRC Grant EP/H027203/1.

References

[1] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. Advances in Neural Information Processing Systems 17, 2005.
[2] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research, 10:803-826, 2009.
[3] M. Jaggi and M. Sulovsky. A simple algorithm for nuclear norm regularized problems. Proceedings of the 27th International Conference on Machine Learning, 2010.
[4] K.-C. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. SIAM Journal on Imaging Sciences, 4:573-596, 2011.
[5] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287-2322, 2010.
[6] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems 25, pages 1466-1474, 2012.
[7] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58:267-288, 1996.
[8] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301-320, 2005.
[9] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: a convex formulation. Advances in Neural Information Processing Systems (NIPS 21), 2009.
[10] J. von Neumann. Some matrix-inequalities and metrization of matric-space. Tomsk. Univ. Rev., Vol. I, 1937.
[11] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
[12] Y. Nesterov. Gradient methods for minimizing composite objective function. Center for Operations Research and Econometrics, 76, 2007.
[13] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. Proceedings of the 26th International Conference on Machine Learning, 2009.
[14] Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In ICANN 98, pages 201-206. Springer London, 1998.
[15] C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099-1125, 2005.
[16] M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. In Advances in Neural Information Processing Systems 21, 2007.
[17] C. A. Micchelli, J. M. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38:455-489, 2013.
[18] A. Maurer and M. Pontil. Structured sparsity and generalization. The Journal of Machine Learning Research, 13:671-690, 2012.
[19] G. Obozinski and F. Bach. Convex relaxation for combinatorial penalties. CoRR, 2012.
[20] A. W. Marshall and I. Olkin. Inequalities: Theory of Majorization and its Applications. Academic Press, 1979.
[21] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems. Springer, 2011.
[22] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[23] A. S. Lewis. The convex analysis of unitarily invariant matrix functions. Journal of Convex Analysis, 2:173-183, 1995.
[24] A. Argyriou, C. A. Micchelli, M. Pontil, L. Shen, and Y. Xu. Efficient first order methods for linear composite regularizers. CoRR, abs/1104.1436, 2011.
[25] J.-F. Cai, E. J. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956-1982, 2008.
[26] H. Li, N. Chen, and L. Li. Error analysis for matrix elastic-net regularization algorithms. IEEE Transactions on Neural Networks and Learning Systems, 23-5:737-748, 2012.
[27] W. Rudin. Functional Analysis. McGraw Hill, 1991.
[28] D. P. Bertsekas, A. Nedic, and A. E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.
[29] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.
", "award": [], "sourceid": 1917, "authors": [{"given_name": "Andrew", "family_name": "McDonald", "institution": "University College London"}, {"given_name": "Massimiliano", "family_name": "Pontil", "institution": "UCL"}, {"given_name": "Dimitris", "family_name": "Stamos", "institution": "UCL"}]}