{"title": "Convex Multi-view Subspace Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1673, "page_last": 1681, "abstract": "Subspace learning seeks a low dimensional representation of data that enables accurate reconstruction. However, in many applications, data is obtained from multiple sources rather than a single source (e.g. an object might be viewed by cameras at different angles, or a document might consist of text and images). The conditional independence of separate sources imposes constraints on their shared latent representation, which, if respected, can improve the quality of the learned low dimensional representation. In this paper, we present a convex formulation of multi-view subspace learning that enforces conditional independence while reducing dimensionality. For this formulation, we develop an efficient algorithm that recovers an optimal data reconstruction by exploiting an implicit convex regularizer, then recovers the corresponding latent representation and reconstruction model, jointly and optimally. Experiments illustrate that the proposed method produces high quality results.", "full_text": "Convex Multi-view Subspace Learning\n\nDepartment of Computing Science, University of Alberta, Edmonton AB T6G 2E8, Canada\n\nMartha White, Yaoliang Yu, Xinhua Zhang\u2217and Dale Schuurmans\n{whitem,yaoliang,xinhua2,dale}@cs.ualberta.ca\n\nAbstract\n\nSubspace learning seeks a low dimensional representation of data that enables\naccurate reconstruction. However, in many applications, data is obtained from\nmultiple sources rather than a single source (e.g. an object might be viewed by\ncameras at different angles, or a document might consist of text and images). The\nconditional independence of separate sources imposes constraints on their shared\nlatent representation, which, if respected, can improve the quality of a learned\nlow dimensional representation. 
In this paper, we present a convex formulation of multi-view subspace learning that enforces conditional independence while reducing dimensionality. For this formulation, we develop an efficient algorithm that recovers an optimal data reconstruction by exploiting an implicit convex regularizer, then recovers the corresponding latent representation and reconstruction model, jointly and optimally. Experiments illustrate that the proposed method produces high quality results.

1 Introduction

Dimensionality reduction is one of the most important forms of unsupervised learning, with roots dating to the origins of data analysis. Re-expressing high dimensional data in a low dimensional representation has been used to discover important latent information about individual data items, visualize entire data sets to uncover their global organization, and even improve subsequent clustering or supervised learning [1]. Modern data is increasingly complex, however, with descriptions of increasing size and heterogeneity. For example, multimedia data analysis considers data objects (e.g. documents or webpages) described by related text, image, video, and audio components. Multi-view learning focuses on the analysis of such multi-modal data by exploiting its implicit conditional independence structure. For example, given multiple camera views of a single object, the particular idiosyncrasies of each camera are generally independent, hence the images they capture will be conditionally independent given the scene. Similarly, the idiosyncrasies of text and images are generally conditionally independent given a topic. The goal of multi-view learning, therefore, is to use known conditional independence structure to improve the quality of learning results.

In this paper we focus on the problem of multi-view subspace learning: reducing dimensionality when data consists of multiple, conditionally independent sources. 
Classically, multi-view subspace learning has been achieved by an application of canonical correlation analysis (CCA) [2, 3]. In particular, many successes have been achieved in using CCA to recover meaningful latent representations in a multi-view setting [4-6]. Such work has been extended to probabilistic [7] and sparse formulations [8]. However, a key limitation of CCA-based approaches is that they only admit efficient global solutions when using the squared-error loss (i.e. Gaussian models), while extensions to robust models have had to settle for approximate solutions [9].

By contrast, in the single-view setting, recent work has developed new generalizations of subspace learning that can accommodate arbitrary convex losses [10-12]. These papers replace the hard bound on the dimension of the latent representation with a structured convex regularizer that still reduces rank, but in a relaxed manner that admits greater flexibility while retaining tractable formulations.

*Xinhua Zhang is now at the National ICT Australia (NICTA), Machine Learning Group.

Subspace learning can be achieved in this case by first recovering an optimal reduced-rank response matrix and then extracting the latent representation and reconstruction model. Such formulations have recently been extended to the multi-view case [13, 14]. Unfortunately, the multi-view formulation of subspace learning does not have an obvious convex form, and current work has resorted to local training methods based on alternating descent minimization (or approximating intractable integrals). Consequently, there is no guarantee of recovering a globally optimal subspace.

In this paper we provide a formulation of multi-view subspace learning that can be solved optimally and efficiently. We achieve this by adapting the new single-view training methods of [11, 12] to the multi-view case. 
After deriving a new formulation of multi-view subspace learning that allows a global solution, we also derive efficient new algorithms. The outcome is an efficient approach to multi-view subspace discovery that can produce high quality repeatable results.

Notation: We use I_k for the k x k identity matrix, A' for the transpose of matrix A, ||.||_2 for the Euclidean norm, ||X||_F = sqrt(tr(X'X)) for the Frobenius norm, and ||X||_tr = Σ_i σ_i(X) for the trace norm, where σ_i(X) is the i-th singular value of X.

2 Background

Assume one is given t paired observations {[x_j; y_j]} consisting of two views: an x-view and a y-view, of lengths n and m respectively. The goal of multi-view subspace learning is to infer, for each pair, a shared latent representation h_j of dimension k < min(n, m), such that the original data can be accurately modeled. We first consider a linear formulation. Given paired observations, the goal is to infer a set of latent representations h_j and reconstruction models A and B, such that A h_j ≈ x_j and B h_j ≈ y_j for all j. Let X denote the n x t matrix of x observations, Y the m x t matrix of y observations, and Z = [X; Y] the concatenated (n+m) x t data matrix. The problem can then be expressed as recovering an (n+m) x k matrix of model parameters C = [A; B] and a k x t matrix of latent representations H such that Z ≈ CH.

The key assumption of multi-view learning is that each of the two views, x_j and y_j, is conditionally independent given the shared latent representation h_j. Although multi-view data can always be concatenated and treated as a single view, if the conditional independence assumption holds, explicitly representing multiple views enables more accurate identification of the latent representation (as we will see). 
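As a quick illustration, the linear model above can be instantiated in a few lines; this is a toy sketch with arbitrary sizes of our own choosing, not an experiment from the paper:

```python
import numpy as np

# Toy instantiation of the linear multi-view model: a shared k-dimensional
# latent H generates both views, and stacking the views reproduces
# Z = [X; Y] = C H with C = [A; B].
rng = np.random.default_rng(42)
n, m, k, t = 6, 4, 2, 50
H = rng.standard_normal((k, t))      # shared latent representation
A = rng.standard_normal((n, k))      # x-view reconstruction model
B = rng.standard_normal((m, k))      # y-view reconstruction model
X, Y = A @ H, B @ H
Z = np.vstack([X, Y])                # concatenated (n+m) x t data matrix
C = np.vstack([A, B])                # concatenated (n+m) x k model
gap = np.linalg.norm(Z - C @ H)      # exactly zero in the noiseless case
```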
The classical formulation of multi-view subspace learning is given by canonical correlation analysis (CCA), which is typically expressed as the problem of projecting two views so that the correlation between them is maximized [2]. Assuming the data is centered (i.e. X1 = 0 and Y1 = 0), the sample covariances of X and Y are given by XX'/t and YY'/t respectively. CCA can then be expressed as an optimization over matrix variables

max_{U,V} tr(U'XY'V)  s.t.  U'XX'U = V'YY'V = I,  (1)

for U in R^{n x k}, V in R^{m x k} [3]. Although this classical formulation (1) does not make the shared latent representation explicit, CCA can be expressed by a generative model: given a latent representation h_j, the observations x_j = A h_j + ε_j and y_j = B h_j + ν_j are generated by a linear mapping plus independent zero-mean Gaussian noise, ε ~ N(0, Σ_x), ν ~ N(0, Σ_y) [7]. In fact, one can show that the classical CCA problem (1) is equivalent to the following multi-view subspace learning problem.

Proposition 1. Fix k, let Z̃ = [(XX')^{-1/2} X; (YY')^{-1/2} Y] and

(C, H) = argmin_{C,H} ||Z̃ - CH||_F^2,  (2)

where C = [A; B]. Then U = (XX')^{-1/2} A and V = (YY')^{-1/2} B provide an optimal solution to (1), implying that A'A = B'B = I is satisfied in the solution to (2).

(The proof is given in Appendix A.) From Proposition 1, one can see how formulation (2) respects the conditional independence of the separate views: given a latent representation h_j, the reconstruction losses on the two views, x_j and y_j, cannot influence each other, since the reconstruction models A and B are individually constrained. By contrast, in single-view subspace learning (i.e. principal components analysis) A and B are concatenated in the larger variable C, where C as a whole is constrained but A and B are not. A and B must then compete against each other to acquire magnitude to explain their respective "views" given h_j (i.e. conditional independence is not enforced). Such sharing can be detrimental if the two views really are conditionally independent given h_j.

Despite its elegance, a key limitation of CCA is its restriction to squared loss under a particular normalization. Recently, subspace learning algorithms have been greatly generalized in the single-view case by relaxing the rank(H) = k constraint while imposing a structured regularizer that is a convex relaxation of rank [10-12]. Such a relaxation allows one to incorporate arbitrary convex losses, including robust losses [10], while maintaining tractability.

As mentioned, these relaxations of single-view subspace learning have only recently been proposed for the multi-view setting [13, 14]. An extension of these proposals can be achieved by reformulating (2) to first incorporate an arbitrary loss function L that is convex in its first argument (for examples, see [15]), then relaxing the rank constraint by replacing it with a rank-reducing regularizer on H. In particular, we consider the following training problem that extends [14]:

min_{A,B,H} L([A; B] H; Z) + α ||H||_{2,1},  s.t. [A_{:,i}; B_{:,i}] in C for all i,  (3)

where C := {[a; b] : ||a||_2 <= β, ||b||_2 <= γ} and ||H||_{2,1} = Σ_i ||H_{i,:}||_2 is a matrix block norm. 
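Proposition 1 can be sanity-checked numerically. The sketch below is our own construction (variable names are not from the paper): it whitens each view, factors the stacked matrix Z̃ with a rank-k SVD, and verifies that rescaled blocks of the left factor satisfy the CCA constraints of (1):

```python
import numpy as np

# Numerical check of Proposition 1: whiten each view, stack into Z_tilde,
# take a rank-k SVD, and verify U'XX'U = V'YY'V = I for the induced U, V.
rng = np.random.default_rng(0)
n, m, k, t = 5, 4, 2, 200
H_true = rng.standard_normal((k, t))
X = rng.standard_normal((n, k)) @ H_true + 0.01 * rng.standard_normal((n, t))
Y = rng.standard_normal((m, k)) @ H_true + 0.01 * rng.standard_normal((m, t))
X -= X.mean(axis=1, keepdims=True)   # center so that X1 = 0
Y -= Y.mean(axis=1, keepdims=True)   # center so that Y1 = 0

def inv_sqrt(M):
    """Inverse square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

Wx, Wy = inv_sqrt(X @ X.T), inv_sqrt(Y @ Y.T)
Z_tilde = np.vstack([Wx @ X, Wy @ Y])            # whitened, stacked views

P, s, Qt = np.linalg.svd(Z_tilde, full_matrices=False)
# Top-k left singular vectors are [u_i; v_i]/sqrt(2), so rescale the blocks.
A, B = np.sqrt(2) * P[:n, :k], np.sqrt(2) * P[n:, :k]
U, V = Wx @ A, Wy @ B                            # candidate CCA solution

err_u = np.linalg.norm(U.T @ (X @ X.T) @ U - np.eye(k))
err_v = np.linalg.norm(V.T @ (Y @ Y.T) @ V - np.eye(k))
```

Both residuals come out at numerical precision, consistent with A'A = B'B = I at the optimum of (2).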
The significance of using the (2,1)-block norm as a regularizer is that it encourages rows of H to become sparse, hence reducing the dimensionality of the learned representation [16]. C must be constrained, however, otherwise ||H||_{2,1} can be pushed arbitrarily close to zero simply by re-scaling H/s and Cs (s > 0) while preserving the same loss. Unfortunately, (3) is not jointly convex in A, B and H. Thus, the algorithmic approaches proposed by [13, 14] have been restricted to alternating block coordinate descent between the components A, B and H, which cannot guarantee a global solution. Our main result is to show that (3) can in fact be solved globally and efficiently for A, B and H, improving on the previous local solutions [13, 14].

3 Reformulation

Our first main contribution is to derive an equivalent but tractable reformulation of (3), followed by an efficient optimization algorithm. Note that (3) can in principle be tackled by a boosting strategy; however, one would have to formulate a difficult weak learning oracle that considers both views simultaneously [17]. Instead, we find that a direct matrix factorization approach of the form developed in [11, 12] is more effective.

To derive our tractable reformulation, we first introduce the change of variable Ẑ = CH, which allows us to rewrite (3) equivalently as

min_Ẑ L(Ẑ; Z) + α min_{C: C_{:,i} in C} min_{H: CH = Ẑ} ||H||_{2,1}.  (4)

A key step in the derivation is the following characterization of the inner minimization in (4).

Proposition 2. min_{C: C_{:,i} in C} min_{H: CH = Ẑ} ||H||_{2,1} defines a norm |||.|||_* (on Ẑ) whose dual norm is |||Γ||| := max_{c in C, ||h||_2 <= 1} c'Γh.

Proof. Let λ_i = ||H_{i,:}||_2 be the Euclidean norm of the i-th row of H. Then H_{i,:} = λ_i H̃_{i,:}, where H̃_{i,:} has unit length (if λ_i = 0, take H̃_{i,:} to be any unit vector). Therefore

min_{C: C_{:,i} in C} min_{H: CH = Ẑ} ||H||_{2,1} = min_{C, λ: C_{:,i} in C, λ_i >= 0, Ẑ = Σ_i λ_i C_{:,i} H̃_{i,:}} Σ_i λ_i = min_{t >= 0: Ẑ in tK} t,  (5)

where K is the convex hull of the set G := {ch' : c in C, ||h||_2 = 1}. In other words, we seek a rank-one decomposition of Ẑ using only elements from G. Since the set K is convex and symmetric, (5) is known as a gauge function and defines a norm on Ẑ (see e.g. [18, Proposition V.3.2.1]). This norm has a dual given by

|||Γ||| := max_{Z in K} tr(Γ'Z) = max_{c in C, ||h||_2 <= 1} c'Γh,  (6)

where the last equality follows because maximizing any linear function over the convex hull K of a set G achieves the same value as maximizing over the set G itself. □

Applying Proposition 2 to problem (3) leads to a simpler formulation of the optimization problem.

Lemma 3. (3) = min_Ẑ L(Ẑ; Z) + α max_{ρ >= 0} ||D_ρ^{-1} Ẑ||_tr, where D_ρ = diag( sqrt(β² + γ²ρ) I_n, sqrt(γ² + β²/ρ) I_m ).

Proof. The lemma is proved by first deriving an explicit form of the norm |||.||| in (6), then deriving its dual norm. The details are given in Appendix B. □

Unfortunately, the inner maximization problem in Lemma 3 is not concave in ρ. However, it is possible to re-parameterize D_ρ to achieve a tractable formulation as follows. 
First, define a matrix

E_η := D_{β²(1-η)/(γ²η)} = diag( β/sqrt(η) I_n, γ/sqrt(1-η) I_m ),  such that  D_ρ = E_{β²/(γ²ρ + β²)}.

Note that max_{ρ >= 0} ||D_ρ^{-1} Ẑ||_tr = max_{0 <= η <= 1} ||E_η^{-1} Ẑ||_tr, with ρ >= 0 corresponding to 0 <= η <= 1. The following lemma proves that this re-parameterization yields an efficient computational approach.

Lemma 4. h(η) := ||E_η^{-1} Ẑ||_tr is concave in η over [0, 1].

Proof. Expand h(η) into

h(η) = || [ sqrt(η)/β Ẑ^X ; sqrt(1-η)/γ Ẑ^Y ] ||_tr = tr( ( (η/β²)(Ẑ^X)'Ẑ^X + ((1-η)/γ²)(Ẑ^Y)'Ẑ^Y )^{1/2} ),

where tr(sqrt(.)) means summing the square roots of the eigenvalues (i.e. a spectral function). By [19], if a spectral function f is concave on [0, ∞), then tr(f(M)) must be concave on positive semidefinite matrices. The result follows since (η/β²)(Ẑ^X)'Ẑ^X + ((1-η)/γ²)(Ẑ^Y)'Ẑ^Y is positive semidefinite for η in [0, 1] and f = sqrt(.) is concave on [0, ∞). □

From Lemmas 3 and 4 we achieve the first main result.

Theorem 5.

(3) = min_Ẑ L(Ẑ; Z) + α max_{0 <= η <= 1} ||E_η^{-1} Ẑ||_tr = max_{0 <= η <= 1} min_Ẑ L(Ẑ; Z) + α ||E_η^{-1} Ẑ||_tr.  (7)

Hence (3) is equivalent to a concave-convex maxi-min problem with no local maxima nor minima.

Thus we have achieved a new formulation for multi-view subspace learning that respects the conditional independence of the separate views (see the discussion in Section 2) while allowing a globally solvable formulation. To the best of our knowledge, this has not previously been achieved in the literature.

4 Efficient Training Procedure

This new formulation for multi-view subspace learning also allows for an efficient algorithmic approach. Before conducting an experimental comparison to other methods, we first develop an efficient implementation. To do so we introduce a further transformation Q̂ = E_η^{-1} Ẑ in (7), which leads to an equivalent but computationally more convenient formulation of (3):

(3) = max_{0 <= η <= 1} min_Q̂ L(E_η Q̂; Z) + α ||Q̂||_tr.  (8)

Denote g(η) := min_Q̂ L(E_η Q̂; Z) + α ||Q̂||_tr. The transformation does not affect the concavity of the problem with respect to η established in Lemma 4; therefore, (8) remains tractable. The training procedure then consists of two stages: first, solve (8) to recover η and Q̂, which allows Ẑ = E_η Q̂ to be computed; then, recover the optimal factors H and C (i.e. A and B) from Ẑ.

Recovering an optimal Ẑ: The key to efficiently recovering Ẑ is to observe that (8) has a convenient form. The concave outer maximization is defined over a scalar variable η, hence a simple line search can be used to solve the problem, normally requiring at most a dozen evaluations to achieve a small tolerance. 
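To make the two-stage procedure concrete, here is a minimal sketch under squared loss; a crude grid search stands in for the line search, and proximal gradient with singular-value soft-thresholding stands in for the specialized trace-norm solvers of [20-22]. All function names are our own:

```python
import numpy as np

def svt(M, tau):
    """Singular-value soft-thresholding: the prox operator of tau*||.||_tr."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def inner_solve(Z, n, beta, gamma, eta, alpha, iters=300):
    """g(eta): min_Q 0.5*||E_eta Q - Z||_F^2 + alpha*||Q||_tr by proximal gradient."""
    e = np.concatenate([np.full(n, beta / np.sqrt(eta)),
                        np.full(Z.shape[0] - n, gamma / np.sqrt(1.0 - eta))])
    step = 1.0 / np.max(e) ** 2        # 1 / Lipschitz constant of the smooth part
    Q = np.zeros_like(Z)
    for _ in range(iters):
        grad = e[:, None] * (e[:, None] * Q - Z)
        Q = svt(Q - step * grad, step * alpha)
    obj = 0.5 * np.linalg.norm(e[:, None] * Q - Z) ** 2 \
          + alpha * np.linalg.norm(Q, 'nuc')
    return Q, obj, e

def msl_squared_loss(Z, n, beta=1.0, gamma=1.0, alpha=0.5):
    """Outer concave maximization over eta, here by a crude grid search."""
    best = None
    for eta in np.linspace(0.05, 0.95, 19):
        Q, obj, e = inner_solve(Z, n, beta, gamma, eta, alpha)
        if best is None or obj > best[0]:
            best = (obj, eta, e[:, None] * Q)   # Z_hat = E_eta Q
    return best[2], best[1]

rng = np.random.default_rng(1)
Z = rng.standard_normal((6, 10))
Z_hat, eta_star = msl_squared_loss(Z, n=3)
```

Because g(η) is concave (Lemma 4), a proper line search would replace the grid above in a real implementation.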
Crucially, the inner minimization in Q̂ is a standard trace-norm-regularized loss minimization problem, which has been extensively studied in the matrix completion literature [20-22]. By exploiting these algorithms, g(η) and its subgradient can both be computed efficiently.

Recovering C and H from Ẑ: Once Ẑ is obtained, we need to recover a C and H that satisfy

CH = Ẑ,  ||H||_{2,1} = |||Ẑ|||_*,  and C_{:,i} in C for all i.  (9)

We exploit recent sparse approximation methods [23, 24] to solve this problem. First, note from (5) that |||Ẑ|||_* = min_{C, λ: C_{:,i} in C, λ_i >= 0, Ẑ = Σ_i λ_i C_{:,i} H̃_{i,:}} Σ_i λ_i, where ||H̃_{i,:}||_2 <= 1. Since we already have |||Ẑ|||_* = ||E_η^{-1} Ẑ||_tr from the first stage, we can rescale the problem so that |||Ẑ|||_* = 1 without loss of generality. In such a case, Ẑ = Σ_i λ_i C_{:,i} H̃_{i,:} where λ >= 0 and Σ_i λ_i = 1 (we restore the proper scale to H̃ afterward). So now, Ẑ lies in the convex hull of the set G := {ch' : c in C, ||h||_2 <= 1}, and any expansion of Ẑ as a convex combination of the elements in G is a valid recovery. From this connection, we can now exploit the recent greedy algorithms developed in [23, 24] to solve the recovery problem. In particular, the recovery just needs to solve

min_{K in conv G} f(K),  where  f(K) := ||Ẑ - K||_F^2,  (10)

where conv denotes the convex hull. Note that the optimal value of (10) is 0. The greedy (boosting) algorithm provided by [23, 24] produces a factorization of Ẑ into C and H and proceeds as follows:

1. Weak learning step: greedily pick G_t = c_t h_t' in argmin_{G in G} <∇f(K_{t-1}), G>. Note that this step can be computed efficiently with a form of power method iteration (see Appendix C.2).

2. "Totally corrective" step: μ^{(t)} = argmin_{μ >= 0, Σ_i μ_i = 1} f( Σ_{i=1}^t μ_i G_i ), then K_t = Σ_{i=1}^t μ_i^{(t)} G_i.

This procedure will find a K_t satisfying ||Ẑ - K_t||_F^2 < ε within O(1/ε) iterations [23, 24].

Acceleration: In practice, this procedure can be considerably accelerated via more refined analysis. Recall that Ẑ is penalized by the dual of the norm in (6). Given Ẑ, it is not hard to recover its dual variable Γ by exploiting the dual norm relationship: Γ = argmax_{Γ: |||Γ||| <= 1} tr(Γ'Ẑ). Then given Γ, the following theorem guarantees that many bases in C can be eliminated from the recovery problem (9).

Theorem 6. (C, H) satisfying Ẑ = CH is optimal iff ||Γ'C_{:,i}|| = 1 and H_{i,:} = ||H_{i,:}||_2 C_{:,i}'Γ for all i.

Theorem 6 prunes many elements from G, and the weak learning step only needs to consider a proper subset. Interestingly, this constrained search can be solved with no increase in the computational complexity. The accelerated boosting generates c_t in the weak learning step, giving the recovery C = [c_1, ..., c_k] and H = diag(μ) C'Γ. The rank, k, is implicitly determined by termination of the boosting algorithm. 
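The "totally corrective" step is simply a least-squares problem over the probability simplex. Below is a minimal sketch of our own construction, using projected gradient with a standard sort-based simplex projection rather than the solvers of [23, 24]:

```python
import numpy as np

# Totally corrective step: given fixed rank-one atoms G_1..G_t, solve
#   min_{mu >= 0, sum(mu) = 1}  f(mu) = ||Z_hat - sum_i mu_i G_i||_F^2
# by projected gradient descent onto the probability simplex.

def proj_simplex(v):
    """Euclidean projection of v onto {mu : mu >= 0, sum(mu) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def totally_corrective(Z_hat, atoms, iters=3000):
    G = np.stack([A.ravel() for A in atoms], axis=1)   # columns = vectorized atoms
    z = Z_hat.ravel()
    step = 0.5 / np.linalg.norm(G, 2) ** 2             # 1 / Lipschitz constant
    mu = np.full(G.shape[1], 1.0 / G.shape[1])
    for _ in range(iters):
        grad = 2.0 * G.T @ (G @ mu - z)
        mu = proj_simplex(mu - step * grad)
    return mu

# Demo: Z_hat is a known convex combination of four random atoms.
rng = np.random.default_rng(2)
atoms = [rng.standard_normal((3, 3)) for _ in range(4)]
mu_true = np.array([0.4, 0.3, 0.2, 0.1])
Z_hat = sum(w * G for w, G in zip(mu_true, atoms))
mu = totally_corrective(Z_hat, atoms)
resid = np.linalg.norm(sum(w * G for w, G in zip(mu, atoms)) - Z_hat)
```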
The detailed algorithm and proof of Theorem 6 are given in Appendix C.

5 Comparisons

Below we compare the proposed global learning method, Multi-view Subspace Learning (MSL), against a few benchmark competitors.

Local Multi-view Subspace Learning (LSL): An obvious competitor is to solve (3) by alternating descent over the variables: optimize H with A and B fixed, optimize A with B and H fixed, etc. This is the computational strategy employed by [13, 14]. Since A and B are both constrained and H is regularized by the (2,1)-block norm, which is not smooth, we optimized them using the proximal gradient method [25].

Single-view Subspace Learning (SSL): Single-view learning can be cast as a relaxation of (3), where the columns of C = [A; B] are normalized as a whole, rather than individually for A and B:

min_{H, C: ||C_{:,i}||_2 <= sqrt(β²+γ²)} L(CH; Z) + α ||H||_{2,1} = min_{Ĥ, Ĉ: ||Ĉ_{:,i}||_2 <= 1} L(ĈĤ; Z) + α (β²+γ²)^{-1/2} ||Ĥ||_{2,1}  (11)

= min_Ẑ L(Ẑ; Z) + α (β²+γ²)^{-1/2} ||Ẑ||_tr.  (12)

Equation (12) matches the formulation given in [10]. The equality in (11) is by the change of variable C = sqrt(β²+γ²) Ĉ and Ĥ = sqrt(β²+γ²) H. Equation (12) can be established from the basic results of [11, 12] (or by specializing Proposition 2 to the case where C is the unit Euclidean ball). To solve (12), we used a variant of the boosting algorithm [22] when α is large, due to its effectiveness when the solution has low rank. When α is small, we switch to the alternating direction augmented Lagrangian method (ADAL) [26], which does not enforce low rank at all iterations. This hybrid choice of solver is also applied to the optimization of Q̂ in (8) for MSL. Once an optimal Ẑ is achieved, the corresponding C and H can be recovered by an SVD: for Ẑ = UΣV', set C = (β²+γ²)^{1/2} U and H = (β²+γ²)^{-1/2} ΣV', which satisfies CH = Ẑ and ||H||_{2,1} = (β²+γ²)^{-1/2} ||Ẑ||_tr, and so is an optimal solution to (11).

6 Experimental results

Datasets: We provide experimental results on two datasets: a synthetic dataset and a face-image dataset. The synthetic dataset is generated as follows. First, we randomly generate a k x t_tr matrix H_tr for training, a k x t_te matrix H_te for testing, and two basis matrices, A (n x k) and B (m x k), by (iid) sampling from a zero-mean unit-variance Gaussian distribution. The columns of A and B are then normalized to unit Euclidean norm. Then we set

X_tr = A H_tr,  Y_tr = B H_tr,  X_te = A H_te,  Y_te = B H_te.

Next, we add noise to these matrices to obtain X̃_tr, Ỹ_tr, X̃_te, Ỹ_te. Following [10], we use sparse non-Gaussian noise: 5% of the matrix entries were selected randomly and replaced with a value drawn uniformly from [-M, M], where M is 5 times the maximal absolute entry of the matrices.

The second dataset is based on the Extended Yale Face Database B [27]. It contains grey-level face images of 28 human subjects, each with 9 poses and 64 lighting conditions. To construct the dataset, we set the x-view to a fixed lighting (+000E+00) and the y-view to a different fixed lighting (+000E+20). We obtain a pair of views by randomly drawing a subject and a pose (under the two fixed lightings). The underlying assumption is that each lighting has its own set of bases (A and B) and each (person, pose) pair has the same latent representation for the two lighting conditions. All images are down-sampled to 100 x 100, meaning n = m = 10^4. We kept one view (x-view) clean and added pixel errors to the second view (y-view). 
We randomly set 5% of the pixel values to 1, mimicking noise encountered in practice, e.g. occlusions and loss of pixel information from image transfer. The goal is to enable appropriate reconstruction of a noisy image using other views.

Model specification: Due to the sparse noise model, we used the L_{1,1} loss for L:

L([A; B] H; Z) = ||AH - X||_{1,1} + ||BH - Y||_{1,1} =: L_1(AH, X) + L_2(BH, Y).  (13)

For computational reasons, we worked with a smoothed version of the L_{1,1} loss [26].

6.1 Comparing optimization quality

We first compare the optimization performance of MSL (global solver) versus LSL (local solver). Figure 1(a) indicates that MSL consistently obtains a lower objective value, sometimes by a large margin: more than two times lower for α = 10^{-4} and 10^{-3}. Interestingly, as α increases, the difference shrinks. This result suggests that more local minima occur in the higher-rank case (a large α increases regularization and decreases the rank of the solution). In Section 6.2, we will see that the lower optimization quality of LSL, and the fact that SSL optimizes a less constrained objective, both lead to significantly worse denoising performance.

Second, we compare the runtimes of the three algorithms. Figure 1(b) presents runtimes for LSL and MSL for an increasing number of samples. Again, the runtime of LSL is significantly worse for smaller α, as much as 4000x slower; as α increases, the runtimes become similar. This result is likely due to the fact that for small α, the MSL inner optimization is much faster via the ADAL solver (the slowest part of the optimization), whereas LSL still has to slowly iterate over the three variables. 
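One common way to smooth the L_{1,1} loss is the elementwise Huber function; this is an illustrative choice of ours, and the exact smoothing used in [26] may differ:

```python
import numpy as np

# Huber-smoothed L_{1,1} loss: quadratic for small residuals (|r| <= delta),
# linear for large ones, so the gradient is continuous everywhere.
def smoothed_l11(R, delta=0.1):
    A = np.abs(R)
    return np.where(A <= delta, R ** 2 / (2 * delta), A - delta / 2).sum()
```

As delta -> 0 this recovers the exact L_{1,1} loss, while any delta > 0 makes the objective differentiable and hence amenable to gradient-based solvers.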
They both appear to scale similarly with respect to the number of samples.

Figure 1: Comparison between LSL and MSL on synthetic datasets with changing α, n = m = 20 and 10 repeats. (a) Objectives for LSL:MSL — LSL often gets stuck in local minima, with a significantly higher objective than MSL. (b) Runtimes for LSL:MSL — for small α, LSL is significantly slower than MSL; they scale similarly with the number of samples. (c) Runtimes of SSL and MSL for training and recovery with α = 10^{-3} and growing sample size, n = m = 20. MSL-R stands for the recovery algorithm. The recovery time for SSL is almost 0, so it is not included.

Figure 2: Signal-to-noise ratio of denoising algorithms on synthetic data using recovered models on hold-out views, with n = m = 10. In (a), we used t_L = 100 pairs of views for training A and B and tested on 100 hold-out pairs, with 30 repeated random draws of training and test data. In (b), we used t_L = 300. Parameters were set to optimize the respective methods.

Figure 3: MSL versus SSL error in synthesizing the y-view, over 30 random runs, with n = m = 200, t_L = 20 and t_test = 80. In (a), LSL error is generally above the diagonal line, indicating higher error than MSL. In (b), SSL error is considerably higher than MSL.

For SSL versus MSL, we expect SSL to be faster than MSL because it is a more straightforward optimization: in MSL, each inner optimization of (8) over Q̂ (with a fixed η) has the same form as the SSL objective. Figure 1(c), however, illustrates that this difference is not substantial for increasing sample size. Interestingly, the recovery runtime seems independent of dataset size, and is instead likely proportional to the rank of the data. The trend is similar for increasing features: for t_L = 200, at n = 200 MSL requires ~20 seconds, and at n = 1000 it requires ~60 seconds.

6.2 Comparing denoising quality

Next we compare the denoising capabilities of the algorithms on the synthetic and face image datasets. There are two denoising approaches. The simplest is to run the algorithm on the noisy Ỹ_te, giving the reconstructed Ŷ_te as the denoised image. Another approach is to recover the models, A and B, in a training phase. Given a new set of instances, X̃_te = {x̃_i}_{i=1}^s and Ỹ_te = {ỹ_i}_{i=1}^s, noise in X̃_te and Ỹ_te can be removed using A and B, without re-training. This approach requires first recovering the latent representation, Ĥ_te = (h_1, ..., h_s), for X̃_te and Ỹ_te. We use a batch approach for inference:

Ĥ_te = argmin_H L_1(AH, X̃_te) + L_2(BH, Ỹ_te) + α ||H||_{2,1}.  (14)

The x-views and y-views are then reconstructed using X̂_te = A Ĥ_te and Ŷ_te = B Ĥ_te. We compared these reconstructions with the clean data, X_te and Y_te, in terms of the signal-to-noise ratio:

SNR(X̂_te, Ŷ_te) = ( ||X_te||_F² + ||Y_te||_F² ) / ( ||X_te - X̂_te||_F² + ||Y_te - Ŷ_te||_F² ).  (15)

We present the recovery approach on synthetic data and the direct reconstruction approach on the face dataset. We cross-validated over α in {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 0.5, 1} according to the highest signal-to-noise ratio on the training data. 
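The evaluation criterion (15) translates directly to code (a straightforward transcription with our own function name):

```python
import numpy as np

# Signal-to-noise ratio of a pair of reconstructions against the clean views,
# as in equation (15): total signal energy over total reconstruction error.
def snr(X, Y, X_hat, Y_hat):
    signal = np.linalg.norm(X) ** 2 + np.linalg.norm(Y) ** 2
    noise = np.linalg.norm(X - X_hat) ** 2 + np.linalg.norm(Y - Y_hat) ** 2
    return signal / noise
```

Note the ratio is undefined for a perfect reconstruction (zero denominator); in practice reconstructions of noisy data never achieve this exactly.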
We set γ = β = 1 because the data is in the [0, 1] interval.

Figure 4: Reconstruction of a noisy image with 5% or 10% noise. LSL performs only slightly worse than MSL for larger noise values: a larger regularization parameter is needed for more noise, resulting in fewer local minima (as discussed in Figure 1). Conversely, SSL performs slightly worse than MSL for 5% noise, but as the noise increases, the advantages of the MSL objective are apparent.

6.2.1 Using Recovered Models for Denoising

Figure 2 presents the signal-to-noise ratio for recovery on synthetic data. MSL produced the highest signal-to-noise ratio. The performance of LSL is inferior to MSL, but still better than SSL, corroborating the importance of modelling the data as two views.

6.2.2 Image Denoising

In Figure 4, we can see that MSL outperforms both SSL and LSL on the face image dataset for two noise levels: 5% and 10%. Interestingly, in addition to having on average a 10x higher SNR than SSL for these results, MSL also had significantly different objective values. SSL had larger reconstruction error on the clean x-view (10x higher), lower reconstruction error on the noisy y-view (3x lower) and a higher representation norm (3x higher).
Likely, the noisy y-view skewed the representation, due to the joint rather than separate constraint as in the MSL objective.

6.3 Comparing synthesis of views

In image synthesis, the latent representation is computed from only one view: Ĥte = argmin_H L_1(AH, X̃te) + α‖H‖_{2,1}. The y-view is then synthesized: Ŷte = BĤte. Figure 3 shows the synthesis error, ‖Ŷte − Yte‖²_F, of MSL, LSL, and SSL over 30 random runs: MSL generally incurs less error than LSL, and SSL incurs much higher error because it is not modelling the conditional independence between views.

7 Conclusion

We provided a convex reformulation of multi-view subspace learning that enables global learning, as opposed to previous local formulations. We also developed a new training procedure which reconstructs the data optimally and discovers the latent representations efficiently. Experimental results on synthetic data and image data confirm the effectiveness of our method, which consistently outperformed other approaches in denoising quality. For future work, we are investigating extensions to semi-supervised settings, such as global methods for co-training and co-regularization. It should also be possible to extend our approach to more than two views and incorporate kernels.

Acknowledgements

We thank the reviewers for their helpful comments, in particular, an anonymous reviewer whose suggestions greatly improved the presentation. Research supported by AICML and NSERC.

[Figure 4 panels: Clean; Noisy (5%); SSL; LSL; MSL; Noisy (10%); SSL; LSL; MSL.]

References

[1] J. Lee and M. Verleysen. Nonlinear Dimensionality Reduction. Springer, 2010.
[2] D. Hardoon, S. Szedmak, and J. Shawe-Taylor.
Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16:2639–2664, 2004.
[3] T. De Bie, N. Cristianini, and R. Rosipal. Eigenproblems in pattern recognition. In Handbook of Geometric Computing, pages 129–170, 2005.
[4] P. Dhillon, D. Foster, and L. Ungar. Multi-view learning of word embeddings via CCA. In NIPS, 2011.
[5] C. Lampert and O. Krömer. Weakly-paired maximum covariance analysis for multimodal dimensionality reduction and transfer learning. In ECCV, 2010.
[6] L. Sigal, R. Memisevic, and D. Fleet. Shared kernel information embedding for discriminative inference. In CVPR, 2009.
[7] F. Bach and M. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, 2006.
[8] C. Archambeau and F. Bach. Sparse probabilistic projections. In NIPS, 2008.
[9] J. Viinikanoja, A. Klami, and S. Kaski. Variational Bayesian mixture of robust CCA. In ECML, 2010.
[10] E. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(1):1–37, 2011.
[11] X. Zhang, Y. Yu, M. White, R. Huang, and D. Schuurmans. Convex sparse coding, subspace learning, and semi-supervised extensions. In AAAI, 2011.
[12] F. Bach, J. Mairal, and J. Ponce. Convex sparse matrix factorizations. arXiv:0812.1869v1, 2008.
[13] N. Quadrianto and C. Lampert. Learning multi-view neighborhood preserving projections. In ICML, 2011.
[14] Y. Jia, M. Salzmann, and T. Darrell. Factorized latent spaces with structured sparsity. In NIPS, pages 982–990, 2010.
[15] M. White and D. Schuurmans. Generalized optimal reverse prediction. In AISTATS, 2012.
[16] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
[17] D. Bradley and A. Bagnell. Convex coding.
In UAI, 2009.
[18] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms, I and II, volumes 305 and 306. Springer-Verlag, 1993.
[19] D. Petz. A survey of trace inequalities. In Functional Analysis and Operator Theory, pages 287–298. Banach Center, 2004.
[20] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 128:321–353, 2011.
[21] J. Cai, E. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
[22] X. Zhang, Y. Yu, and D. Schuurmans. Accelerated training for matrix-norm regularization: A boosting approach. In NIPS, 2012.
[23] A. Tewari, P. Ravikumar, and I. S. Dhillon. Greedy algorithms for structurally constrained high dimensional problems. In NIPS, 2011.
[24] X. Yuan and S. Yan. Forward basis selection for sparse approximation over dictionary. In AISTATS, 2012.
[25] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[26] D. Goldfarb, S. Ma, and K. Scheinberg. Fast alternating linearization methods for minimizing the sum of two convex functions. Mathematical Programming, to appear.
[27] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE TPAMI, 23:643–660, 2001.

Supplementary Material

A Proof of Proposition 1

To show that (1) and (2) have equivalent solutions we exploit some developments from [28]. Let N = (XX′)^{−1/2} and M = (YY′)^{−1/2}.

First consider (1).
Its solution can be characterized by the maximal solutions to the generalized eigenvalue problem [3]:

$$\begin{bmatrix} 0 & XY' \\ YX' & 0 \end{bmatrix}\begin{bmatrix} a \\ b \end{bmatrix} = \lambda \begin{bmatrix} XX' & 0 \\ 0 & YY' \end{bmatrix}\begin{bmatrix} a \\ b \end{bmatrix},$$

which, under the change of variables u = N a and v = M b and then shifting the eigenvalues by 1, is equivalent to

$$\tilde{Z}\tilde{Z}'\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} I & N X Y' M \\ M Y X' N & I \end{bmatrix}\begin{bmatrix} u \\ v \end{bmatrix} = (\lambda + 1)\begin{bmatrix} u \\ v \end{bmatrix}.$$

By setting [A; B] to the top k eigenvectors of Z̃Z̃′, one can show that U = N A and V = M B provides an optimal solution to (1) [3].

By comparison, for (2), an optimal H is given by H = C†Z̃, where C† denotes the pseudo-inverse. Hence

$$\min_{C,H} \|\tilde{Z} - CH\|_F^2 = \min_{C} \|(I - CC^{\dagger})\tilde{Z}\|_F^2 = \operatorname{tr}(\tilde{Z}\tilde{Z}') - \max_{C: C'C = I} \operatorname{tr}(C'\tilde{Z}\tilde{Z}'C).$$

Here again the solution is given by the top k eigenvectors of Z̃Z̃′ [29].¹

B Proof for Lemma 3

First, observe that

$$(3) = \min_{C: C_{:,i} \in \mathcal{C}} \min_{H} L(CH; Z) + \alpha\|H\|_{2,1} = \min_{\hat{Z}} L(\hat{Z}; Z) + \alpha \min_{C: C_{:,i} \in \mathcal{C}}\ \min_{H: CH = \hat{Z}} \|H\|_{2,1} = \min_{\hat{Z}} L(\hat{Z}; Z) + \alpha|||\hat{Z}|||_*,$$

where the last step follows from Proposition 2. It only remains to show |||Ẑ|||_* =
max_{ρ≥0} ‖D_ρ^{−1}Ẑ‖_tr, which was established in [11]. We reproduce the proof in [11] for the convenience of the reader.

We will use two diagonal matrices, I^X = diag([1_n; 0_m]) and I^Y = diag([0_n; 1_m]), such that I^X + I^Y = I_{n+m}. Similarly, for c ∈ R^{n+m}, we use c^X (respectively c^Y) to denote c_{1:n} (respectively c_{n+1:n+m}).

The first stage is to prove that the dual norm is characterized by

$$|||\Gamma||| = \min_{\rho \geq 0} \|D_{\rho}\Gamma\|_{sp}, \qquad (16)$$

where the spectral norm ‖X‖_sp = σ_max(X) is the dual of the trace norm, ‖X‖_tr. To this end, recall that

$$|||\Gamma||| = \max_{c \in \mathcal{C},\, \|h\|_2 \leq 1} c'\Gamma h = \max_{c \in \mathcal{C}} \|c'\Gamma\|_2 = \max_{\{c:\, \|c^X\|_2 = \beta,\ \|c^Y\|_2 = \gamma\}} \|c'\Gamma\|_2,$$

giving

$$|||\Gamma|||^2 = \max_{\{c:\, \|c^X\|_2 = \beta,\ \|c^Y\|_2 = \gamma\}} c'\Gamma\Gamma'c = \max_{\{\Phi:\, \Phi \succeq 0,\ \operatorname{tr}(\Phi I^X) \leq \beta^2,\ \operatorname{tr}(\Phi I^Y) \leq \gamma^2\}} \operatorname{tr}(\Phi\Gamma\Gamma'), \qquad (17)$$

using the fact that when maximizing a convex function, one of the extreme points in the constraint set {Φ : Φ ⪰ 0, tr(ΦI^X) ≤ β², tr(ΦI^Y) ≤ γ²} must be optimal. Furthermore, since the extreme points have rank at most one in this case [31], the rank constraint rank(Φ) = 1 can be dropped.

Next, form the Lagrangian L(Φ; λ, ν, Λ) = tr(ΦΓΓ′) + tr(ΦΛ) + λ(β² − tr(ΦI^X)) + ν(γ² − tr(ΦI^Y)), where λ ≥ 0, ν ≥ 0 and Λ ⪰ 0.

¹ [30] gave a similar but not equivalent formulation to (2), due to the lack of normalization.
Note that the primal variable Φ can be eliminated by formulating the equilibrium condition ∂L/∂Φ = ΓΓ′ + Λ − λI^X − νI^Y = 0, which implies ΓΓ′ − λI^X − νI^Y ⪯ 0. Therefore, we achieve the equivalent dual formulation

$$(17) = \min_{\{\lambda,\nu:\ \lambda \geq 0,\ \nu \geq 0,\ \lambda I^X + \nu I^Y \succeq \Gamma\Gamma'\}} \beta^2\lambda + \gamma^2\nu. \qquad (18)$$

Now observe that for λ ≥ 0 and ν ≥ 0, the relation ΓΓ′ ⪯ λI^X + νI^Y holds if and only if D_{ν/λ}ΓΓ′D_{ν/λ} ⪯ D_{ν/λ}(λI^X + νI^Y)D_{ν/λ} = (β²λ + γ²ν)I_{n+m}, hence

$$(18) = \min_{\{\lambda,\nu:\ \lambda \geq 0,\ \nu \geq 0,\ \|D_{\nu/\lambda}\Gamma\|_{sp}^2 \leq \beta^2\lambda + \gamma^2\nu\}} \beta^2\lambda + \gamma^2\nu. \qquad (19)$$

The third constraint must be met with equality at the optimum due to continuity, for otherwise we would be able to further decrease the objective, a contradiction to optimality. Note that a standard compactness argument would establish the existence of minimizers. So

$$(19) = \min_{\lambda \geq 0,\ \nu \geq 0} \|D_{\nu/\lambda}\Gamma\|_{sp}^2 = \min_{\rho \geq 0} \|D_{\rho}\Gamma\|_{sp}^2.$$

Finally, for the second stage, we characterize the target norm by observing that

$$|||\hat{Z}|||_* = \max_{\Gamma:\, |||\Gamma||| \leq 1} \operatorname{tr}(\Gamma'\hat{Z}) = \max_{\rho \geq 0}\ \max_{\Gamma:\, \|D_{\rho}\Gamma\|_{sp} \leq 1} \operatorname{tr}(\Gamma'\hat{Z}) \qquad (20)$$
$$= \max_{\rho \geq 0}\ \max_{\tilde{\Gamma}:\, \|\tilde{\Gamma}\|_{sp} \leq 1} \operatorname{tr}(\tilde{\Gamma}'D_{\rho}^{-1}\hat{Z}) = \max_{\rho \geq 0} \|D_{\rho}^{-1}\hat{Z}\|_{tr}, \qquad (21)$$

where (20) uses (16), and (21) exploits the conjugacy of the spectral and trace norms.
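The conjugacy step, that maximizing tr(Γ̃′M) over ‖Γ̃‖_sp ≤ 1 yields ‖M‖_tr, attained at Γ̃ = UV′ from the SVD M = UΣV′, can also be checked numerically (an illustrative sanity check, not part of the proof; M is a random stand-in for D_ρ^{−1}Ẑ):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))        # stand-in for the matrix being dualized

U, s, Vt = np.linalg.svd(M, full_matrices=False)
Gamma_tilde = U @ Vt                   # candidate maximizer with spectral norm 1

# tr(Gamma_tilde' M) equals the trace norm (sum of singular values),
# and Gamma_tilde is feasible for the spectral-norm ball.
assert np.isclose(np.trace(Gamma_tilde.T @ M), s.sum())
assert np.isclose(np.linalg.norm(Gamma_tilde, 2), 1.0)
```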
The lemma follows. □

C Proof for Theorem 6 and Details of Recovery

Once an optimal reconstruction Ẑ is obtained, we need to recover the optimal factors C and H that satisfy

$$CH = \hat{Z}, \qquad \|H\|_{2,1} = |||\hat{Z}|||_*, \qquad \text{and } C_{:,i} \in \mathcal{C} \text{ for all } i. \qquad (22)$$

Note that by Proposition 2 and Lemma 3, the recovery problem (22) can be re-expressed as

$$\min_{\{C,H:\ C_{:,i} \in \mathcal{C}\ \forall i,\ CH = \hat{Z}\}} \|H\|_{2,1} = \max_{\{\Gamma:\, |||\Gamma||| \leq 1\}} \operatorname{tr}(\Gamma'\hat{Z}). \qquad (23)$$

Our strategy will be to first recover the optimal dual solution Γ given Ẑ, then use Γ to recover H and C.

First, to recover Γ one can simply trace back from (21) to (20). Let UΣV′ be the SVD of D_ρ^{−1}Ẑ. Then Γ̃ = UV′ and Γ = D_ρ^{−1}UV′ automatically satisfies |||Γ||| = 1 while achieving the optimal trace in (23), because tr(Γ̃′D_ρ^{−1}Ẑ) = tr(Σ) = ‖D_ρ^{−1}Ẑ‖_tr.

Given such an optimal Γ, we are then able to characterize an optimal solution (C, H). Introduce the set

$$\mathcal{C}(\Gamma) := \operatorname*{argmax}_{c \in \mathcal{C}} \|\Gamma'c\| = \left\{ c = \begin{bmatrix} a \\ b \end{bmatrix} : \|a\| = \beta,\ \|b\| = \gamma,\ \|\Gamma'c\| = 1 \right\}. \qquad (24)$$

Theorem 6. For a dual optimal Γ, (C, H) solves the recovery problem (22) if and only if C_{:,i} ∈ C(Γ) and H_{i,:} = ‖H_{i,:}‖₂ C′_{:,i}Γ, such that CH = Ẑ.

Proof.
By (23), if Ẑ = CH, then

$$|||\hat{Z}|||_* = \operatorname{tr}(\Gamma'\hat{Z}) = \operatorname{tr}(\Gamma'CH) = \sum_i H_{i,:}\Gamma'C_{:,i}. \qquad (25)$$

Note that ∀C_{:,i} ∈ C, ‖Γ′C_{:,i}‖₂ ≤ 1 since |||Γ||| ≤ 1, and H_{i,:}Γ′C_{:,i} ≤ ‖H_{i,:}‖₂‖Γ′C_{:,i}‖₂ ≤ ‖H_{i,:}‖₂. If (C, H) is optimal, then (25) = Σ_i ‖H_{i,:}‖₂, hence implying ‖Γ′C_{:,i}‖₂ = 1 and H_{i,:} = ‖H_{i,:}‖₂C′_{:,i}Γ. On the other hand, if ‖Γ′C_{:,i}‖₂ = 1 and H_{i,:} = ‖H_{i,:}‖₂C′_{:,i}Γ, then we have |||Ẑ|||_* = Σ_i ‖H_{i,:}‖₂, implying the optimality of (C, H). □

Therefore, given Γ, the recovery problem (22) has been reduced to finding a vector μ and matrix C such that μ ≥ 0, C_{:,i} ∈ C(Γ) for all i, and C diag(μ)C′Γ = Ẑ.

Next we demonstrate how to incrementally recover μ and C. Denote the range of C diag(μ)C′ by the set

$$\mathcal{S} := \left\{ \textstyle\sum_i \mu_i c_i c_i' : c_i \in \mathcal{C}(\Gamma),\ \mu \geq 0 \right\}.$$

Note that S is the conic hull of (possibly infinitely many) rank-one matrices {cc′ : c ∈ C(Γ)}. However, by Carathéodory's theorem [32, §17], any matrix K ∈ S can be written as the conic combination of finitely many rank-one matrices of the form {cc′ : c ∈ C(Γ)}. Therefore, conceptually, the recovery problem has been reduced to finding a sparse set of non-negative weights, μ, over the set of feasible basis vectors, c ∈ C(Γ).

To find these weights, we use a totally corrective "boosting" procedure [22] that is guaranteed to converge to a feasible solution.
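Given a fixed set of candidate basis vectors c_i, fitting the non-negative weights μ so that C diag(μ)C′Γ matches Ẑ is a non-negative least-squares problem, since the residual is linear in μ. A small sketch with random stand-ins for C, Γ and Ẑ (the variable names and the use of `scipy.optimize.nnls` are illustrative, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
d, k, t = 6, 3, 5
C = rng.standard_normal((d, k))       # columns play the role of basis vectors c_i
Gamma = rng.standard_normal((d, t))
mu_true = np.array([0.5, 0.0, 2.0])   # non-negative ground-truth weights
Z_hat = C @ np.diag(mu_true) @ C.T @ Gamma

# Each weight mu_i multiplies the matrix c_i c_i' Gamma; vectorize these
# matrices as the columns of a design matrix and solve for mu >= 0.
A = np.stack([np.outer(C[:, i], C[:, i]) @ Gamma for i in range(k)],
             axis=-1).reshape(d * t, k)
mu, residual = nnls(A, Z_hat.reshape(-1))
print(mu, residual)   # recovers mu_true with (near-)zero residual
```

In the boosting procedure below, each corrective update is exactly such a non-negative fit over the basis vectors selected so far.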
Consider the following objective function for boosting:

$$f(K) = \|K\Gamma - \hat{Z}\|_F^2, \quad \text{where } K \in \mathcal{S}.$$

Note that f is clearly a convex function in K with a Lipschitz continuous gradient. Theorem 6 implies that an optimal solution of (22) corresponds precisely to those K ∈ S such that f(K) = 0. The idea behind totally corrective boosting [22] is to find a minimizer of f (hence an optimal solution of (22)) incrementally. After initializing K₀ = 0, we iterate between two steps:

1. Weak learning step: find

$$c_t \in \operatorname*{argmin}_{c \in \mathcal{C}(\Gamma)} \langle \nabla f(K_{t-1}), cc' \rangle = \operatorname*{argmax}_{c \in \mathcal{C}(\Gamma)} c'Qc, \qquad (26)$$

where Q = −∇f(K_{t−1}) = 2(Ẑ − K_{t−1}Γ)Γ′.

2. "Totally corrective" step:

$$\mu^{(t)} = \operatorname*{argmin}_{\mu:\, \mu_i \geq 0} f\!\left(\textstyle\sum_{i=1}^t \mu_i c_i c_i'\right), \qquad K_t = \textstyle\sum_{i=1}^t \mu_i^{(t)} c_i c_i'. \qquad (27)$$

Three key facts can be established about this boosting procedure: (i) each weak learning step can be solved efficiently; (ii) each totally corrective weight update can be solved efficiently; and (iii) f(K_t) ↘ 0, hence a feasible solution can be arbitrarily well approximated. (iii) has been proved in [22], while (ii) is immediate because (27) is a standard quadratic program. Only (i) deserves some explanation. We show in the next subsection that C(Γ), defined in (24), can be much simplified, and consequently we give in the last subsection an efficient algorithm for the oracle problem (26) (the idea is similar to the one inherent in the proof of Lemma 3).

C.1 Simplification of C(Γ)

Since C(Γ) is the set of optimal solutions to

$$\max_{c \in \mathcal{C}} \|\Gamma'c\|, \qquad (28)$$

our idea is to first obtain an optimal solution to its dual problem, and then use it to recover the optimal c via the KKT conditions.
In fact, its dual problem has been stated in (18). Once we obtain the optimal ρ in (21) by solving (8), it is straightforward to backtrack and recover the optimal λ and ν in (18). Then by the KKT conditions [32, §28], c is an optimal solution to (28) if and only if

$$\|c^X\| = \beta, \quad \|c^Y\| = \gamma, \qquad (29)$$
$$\langle R, cc' \rangle = 0, \quad \text{where } R = \lambda I^X + \nu I^Y - \Gamma\Gamma' \succeq 0. \qquad (30)$$

Since (30) holds iff c is in the null space of R, we find an orthonormal basis {n₁, . . . , n_k} for this null space. Assume

$$c = N\alpha, \quad \text{where } N = [n_1, \ldots, n_k] = \begin{bmatrix} N^X \\ N^Y \end{bmatrix}, \quad \alpha = (\alpha_1, \ldots, \alpha_k)'. \qquad (31)$$

By (29), we have

$$0 = \gamma^2\|c^X\|^2 - \beta^2\|c^Y\|^2 = \alpha'\left(\gamma^2 (N^X)'N^X - \beta^2 (N^Y)'N^Y\right)\alpha. \qquad (32)$$

The idea is to go through some linear transformations for simplification. Perform the eigen-decomposition UΣU′ = γ²(N^X)′N^X − β²(N^Y)′N^Y, where Σ = diag(σ₁, . . . , σ_k) and U ∈ R^{k×k} is orthonormal. Let v = U′α. Then by (31),

$$c = NUv, \qquad (33)$$

and (32) is satisfied if and only if

$$v'\Sigma v = \sum_i \sigma_i v_i^2 = 0. \qquad (34)$$

Finally, (29) implies

$$\beta^2 + \gamma^2 = \|c\|^2 = v'U'N'NUv = v'v. \qquad (35)$$

In summary, by (33) we have

$$\mathcal{C}(\Gamma) = \{NUv : v \text{ satisfies (34) and (35)}\} = \left\{ NUv : v'\Sigma v = 0,\ \|v\|^2 = \beta^2 + \gamma^2 \right\}. \qquad (36)$$

C.2 Solving the weak oracle problem (26)

The weak oracle needs to solve

$$\max_{c \in \mathcal{C}(\Gamma)} c'Qc,$$

where Q = −∇f(K_{t−1}) = 2(Ẑ − K_{t−1}Γ)Γ′.
By (36), this optimization is equivalent to

$$\max_{v:\ v'\Sigma v = 0,\ \|v\|^2 = \beta^2 + \gamma^2} v'Tv,$$

where T = U′N′QNU. Using the same technique as in the proof of Lemma 3, we have

$$\max_{v:\ v'v = 1,\ v'\Sigma v = 0} v'Tv \;=\; (\text{let } H = vv')\ \max_{H \succeq 0,\ \operatorname{tr}(H) = 1,\ \operatorname{tr}(\Sigma H) = 0} \operatorname{tr}(TH)$$
$$=\; (\text{Lagrange dual})\ \min_{\tau,\omega:\ \tau\Sigma + \omega I - T \succeq 0} \omega \;=\; \min_{\tau \in \mathbb{R}} \lambda_{\max}(T - \tau\Sigma),$$

where λ_max stands for the maximum eigenvalue. Since λ_max is a convex function over real symmetric matrices, the last line-search problem is convex in τ, hence can be solved globally and efficiently. Given the optimal τ and the optimal objective value ω, the optimal v can be recovered using a similar trick as in Appendix C.1. Let the null space of ωI + τΣ − T be spanned by N̂ = {n̂₁, . . . , n̂_s}. Then find any α̂ ∈ R^s such that v := N̂α̂ satisfies ‖v‖² = β² + γ² and v′Σv = 0.

Auxiliary References

[28] L. Sun, S. Ji, and J. Ye. Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis. IEEE TPAMI, 33(1):194–200, 2011.
[29] M. Overton and R. Womersley. Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Mathematical Programming, 62:321–357, 1993.
[30] B. Long, P. Yu, and Z. Zhang. A general model for multiple view unsupervised learning. In ICDM, 2008.
[31] G. Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research, 23(2):339–358, 1998.
[32] R. Rockafellar. Convex Analysis. Princeton U.
Press, 1970.