{"title": "Convex Two-Layer Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 2985, "page_last": 2993, "abstract": "Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization---creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics.", "full_text": "Convex Two-Layer Modeling\n\n\u00a8Ozlem Aslan\nDale Schuurmans\nDepartment of Computing Science, University of Alberta\n\nHao Cheng\n\nEdmonton, AB T6G 2E8, Canada\n\n{ozlem,hcheng2,dale}@cs.ualberta.ca\n\nXinhua Zhang\n\nMachine Learning Research Group\nNational ICT Australia and ANU\nxinhua.zhang@anu.edu.au\n\nAbstract\n\nLatent variable prediction models, such as multi-layer networks, impose auxil-\niary latent variables between inputs and outputs to allow automatic inference of\nimplicit features useful for prediction. Unfortunately, such models are dif\ufb01cult\nto train because inference over latent variables must be performed concurrently\nwith parameter optimization\u2014creating a highly non-convex problem.\nInstead\nof proposing another local training method, we develop a convex relaxation of\nhidden-layer conditional models that admits global training. 
Our approach ex-\ntends current convex modeling approaches to handle two nested nonlinearities\nseparated by a non-trivial adaptive latent layer. The resulting methods are able\nto acquire two-layer models that cannot be represented by any single-layer model\nover the same features, while improving training quality over local heuristics.\n\n1\n\nIntroduction\n\nDeep learning has recently been enjoying a resurgence [1, 2] due to the discovery that stage-wise\npre-training can signi\ufb01cantly improve the results of classical training methods [3\u20135]. The advan-\ntage of latent variable models is that they allow abstract \u201csemantic\u201d features of observed data to be\nrepresented, which can enhance the ability to capture predictive relationships between observed vari-\nables. In this way, latent variable models can greatly simplify the description of otherwise complex\nrelationships between observed variates. For example, in unsupervised (i.e., \u201cgenerative\u201d) settings,\nlatent variable models have been used to express feature discovery problems such as dimensionality\nreduction [6], clustering [7], sparse coding [8], and independent components analysis [9]. More\nrecently, such latent variable models have been used to discover abstract features of visual data\ninvariant to low level transformations [1, 2, 4]. These learned representations not only facilitate\nunderstanding, they can enhance subsequent learning.\nOur primary focus in this paper, however, is on conditional modeling. In a supervised (i.e. \u201ccondi-\ntional\u201d) setting, latent variable models are used to discover intervening feature representations that\nallow more accurate reconstruction of outputs from inputs. One advantage in the supervised case\nis that output information can be used to better identify relevant features to be inferred. 
However,\nlatent variables also cause dif\ufb01culty in this case because they impose nested nonlinearities between\nthe input and output variables. Some important examples of conditional latent learning approaches\ninclude those that seek an intervening lower dimensional representation [10] latent clustering [11],\nsparse feature representation [8] or invariant latent representation [1, 3, 4, 12] between inputs and\noutputs. Despite their growing success, the dif\ufb01culty of training a latent variable model remains\nclear: since the model parameters have to be trained concurrently with inference over latent vari-\nables, the convexity of the training problem is usually destroyed. Only highly restricted models can\nbe trained to optimality, and current deep learning strategies provide no guarantees about solution\nquality. This remains true even when restricting attention to a single stage of stage-wise pre-training:\nsimple models such as the two-layer auto-encoder or restricted Boltzmann machine (RBM) still pose\nintractable training problems, even within a single stage (in fact, simply computing the gradient of\nthe RBM objective is currently believed to be intractable [13]).\n\n1\n\n\fMeanwhile, a growing body of research has investigated reformulations of latent variable learn-\ning that are able to yield tractable global training methods in special cases. Even though global\ntraining formulations are not a universally accepted goal of deep learning research [14], there are\nseveral useful methodologies that have been been applied successfully to other latent variable mod-\nels: boosting strategies [15\u201317], semide\ufb01nite relaxations [18\u201320], matrix factorization [21\u201323], and\nmoment based estimators (i.e. \u201cspectral methods\u201d) [24, 25]. 
Unfortunately, none of these approaches\nhas yet been able to accommodate a non-trivial hidden layer between an input and output layer while\nretaining the representational capacity of an auto-encoder or RBM (e.g. boosting strategies embed\nan intractable subproblem in these cases [15\u201317]). Some recent work has been able to capture re-\nstricted forms of latent structure in a conditional model\u2014namely, a single latent cluster variable\n[18\u201320]\u2014but this remains a rather limited approach.\nIn this paper we demonstrate that more general latent variable structures can be accommodated\nwithin a tractable convex framework. In particular, we show how two-layer latent conditional models\nwith a single latent layer can be expressed equivalently in terms of a latent feature kernel. This\nreformulation allows a rich set of latent feature representations to be captured, while allowing useful\nconvex relaxations in terms of a semide\ufb01nite optimization. Unlike [26], the latent kernel in this\nmodel is explicitly learned (nonparametrically). To cope with scaling issues we further develop\nan ef\ufb01cient algorithmic approach for the proposed relaxation. Importantly, the resulting method\npreserves suf\ufb01cient problem structure to recover prediction models that cannot be represented by any\none-layer architecture over the same input features, while improving the quality of local training.\n2 Two-Layer Conditional Modeling\nWe address the problem of training a two-layer latent\nconditional model in the form of Figure 1; i.e., where\nthere is a single layer of h latent variables, , between\na layer of n input variables, x, and m output variables,\ny. The goal is to predict an output vector y given an\ninput vector x. Here, a prediction model consists of\nthe composition of two nonlinear conditional models,\nf1(W x) ; and f2(V ) ; \u02c6y, parameterized by the\nmatrices W 2 Rh\u21e5n and V 2 Rm\u21e5h. 
Once the param-\neters W and V have been speci\ufb01ed, this architecture\nde\ufb01nes a point predictor that can determine \u02c6y from x\nby \ufb01rst computing an intermediate representation .\nTo learn the model parameters, we assume we are given\nt training pairs {(xj, yj)}t\nj=1, stacked in two matrices\nX = (x1, ..., xt) 2 Rn\u21e5t and Y = (y1, ..., yt) 2\nRm\u21e5t, but the corresponding set of latent variable val-\nues = ( 1, ..., t) 2 Rh\u21e5t remains unobserved.\nTo formulate the training problem, we will consider two losses, L1 and L2, that relate the input\nto the latent layer, and the latent to the output layer respectively. For example, one can think of\nlosses as negative log-likelihoods in a conditional model that generates each successive layer given\nits predecessor; i.e., L1(W x, ) = log pW (|x) and L2(V , y) = log pV (y|). (However,\na loss based formulation is more \ufb02exible, since every negative log-likelihood is a loss but not vice\nversa.) Similarly to RBMs and probabilistic networks (PFNs) [27] (but unlike auto-encoders and\nclassical feed-forward networks), we will not assume is a deterministic output of the \ufb01rst layer;\ninstead we will consider to be a variable whose value is the subject of inference during training.\nGiven such a set-up many training principles become possible. For simplicity, we consider a Viterbi\nbased training principle where the parameters W and V are optimized with respect to an optimal\nimputation of the latent values . 
To do so, define the first and second layer training objectives as

F1(W, Φ) = L1(W X, Φ) + (α/2) ||W||²_F,   and   F2(Φ, V) = L2(V Φ, Y) + (β/2) ||V||²_F,   (1)

[Figure 1: Latent conditional model f1(W x) → φ, f2(V φ) → ŷ, where φ_j is a latent variable, x_j is an observed input vector, y_j is an observed output vector, W are first layer parameters, and V are second layer parameters.]

where we assume the losses are convex in their first arguments. Here it is typical to assume that the losses decompose columnwise; that is, L1(Φ̂, Φ) = Σ_{j=1}^t L1(φ̂_j, φ_j) and L2(Ẑ, Y) = Σ_{j=1}^t L2(ẑ_j, y_j), where φ̂_j is the jth column of Φ̂ and ẑ_j is the jth column of Ẑ respectively. This follows for example if the training pairs (x_j, y_j) are assumed i.i.d., but such a restriction is not necessary. Note that we have also introduced Euclidean regularization over the parameters (i.e. negative log-priors under a Gaussian), which will provide a useful representer theorem [28] we exploit later. These two objectives can be combined to obtain the following joint training problem:

min_{W,V} min_Φ  F1(W, Φ) + γ F2(Φ, V),   (2)

where γ > 0 is a trade-off parameter that balances the first versus second layer discrepancy. Unfortunately (2) is not jointly convex in the unknowns W, V and Φ.
A key modeling question concerns the structure of the latent representation Φ. As noted, the extensive literature on latent variable modeling has proposed a variety of forms for latent structure. Here, we follow work on deep learning and sparse coding and assume that the latent variables are boolean, φ ∈ {0, 1}^{h×1}; an assumption that is also often made in auto-encoders [13], PFNs [27], and RBMs [5]. 
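As a concrete sketch, the joint problem (2) can be written down directly. The snippet below uses squared losses as illustrative stand-ins for L1 and L2 (the large-margin losses adopted in Section 2.1 slot in the same way); all dimensions and parameter values are made up for illustration.

```python
import numpy as np

def joint_objective(W, V, Phi, X, Y, alpha, beta, gamma):
    # F1(W, Phi) = L1(W X, Phi) + (alpha/2) ||W||_F^2
    # F2(Phi, V) = L2(V Phi, Y) + (beta/2)  ||V||_F^2
    # squared losses stand in here for the large-margin losses of Section 2.1
    F1 = 0.5 * np.linalg.norm(W @ X - Phi) ** 2 + 0.5 * alpha * np.linalg.norm(W) ** 2
    F2 = 0.5 * np.linalg.norm(V @ Phi - Y) ** 2 + 0.5 * beta * np.linalg.norm(V) ** 2
    return F1 + gamma * F2   # the objective of (2); not jointly convex in (W, V, Phi)

rng = np.random.default_rng(0)
n, h, m, t = 3, 4, 2, 5
X = rng.standard_normal((n, t))
Y = rng.standard_normal((m, t))
Phi = rng.integers(0, 2, size=(h, t)).astype(float)   # boolean latent matrix
W = rng.standard_normal((h, n))
V = rng.standard_normal((m, h))
obj = joint_objective(W, V, Phi, X, Y, alpha=1.0, beta=1.0, gamma=1.0)
```

Note the non-convexity arises purely from the coupling of Φ with W and V; each factor alone enters convexly.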
A boolean representation can capture structures that range from a single latent clus-\ntering [11, 19, 20], by imposing the assumption that 01 = 1, to a general sparse code, by imposing\nthe assumption that 01 = k for some small k [1, 4, 13].1 Observe that, in the latter case, one\ncan control the complexity of the latent representation by imposing a constraint on the number of\n\u201cactive\u201d variables k rather than directly controlling the latent dimensionality h.\n\n2.1 Multi-Layer Perceptrons and Large-Margin Losses\n\nTo complete a speci\ufb01cation of the two-layer model in Figure 1 and the associated training problem\n(2), we need to commit to speci\ufb01c forms for the transfer functions f1 and f2 and the losses in (1). For\nsimplicity, we will adopt a large-margin approach over two-layer perceptrons. Although it has been\ntraditional in deep learning research to focus on exponential family conditional models (e.g. as in\nauto-encoders, PFNs and RBMs), these are not the only possibility; a large-margin approach offers\nadditional sparsity and algorithmic simpli\ufb01cations that will clarify the development below. Despite\nits simplicity, such an approach will still be suf\ufb01cient to prove our main point.\nFirst, consider the second layer model. We will conduct our primary evaluations on multi-\nclass classi\ufb01cation problems, where output vectors y encode target classes by indicator vectors\ny 2{ 0, 1}m\u21e51 such that y01 = 1. Although it is common to adopt a softmax transfer for f2 in\nsuch a case, it is also useful to consider a perceptron model de\ufb01ned by f2(\u02c6z) = indmax(\u02c6z) such that\nindmax(\u02c6z) = 1i (vector of all 0s except a 1 in the ith position) where \u02c6zi \u02c6zl for all l. 
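Given these transfers, the composed point predictor of Figure 1 can be sketched in a few lines; this is a minimal illustration assuming a componentwise step transfer for the first layer (the multi-label perceptron described in this section) and indmax for the second, with made-up parameter values.

```python
import numpy as np

def step(a):
    # componentwise step transfer for the first layer (multi-label perceptron)
    return (a > 0).astype(float)

def indmax(z):
    # indmax transfer for the second layer: one-hot indicator of the max response
    e = np.zeros_like(z)
    e[np.argmax(z)] = 1.0
    return e

def predict(W, V, x):
    # compose the two nonlinear conditionals: x -> phi -> y_hat
    phi = step(W @ x)       # latent boolean representation
    return indmax(V @ phi)  # predicted class indicator

# tiny worked example (all values illustrative)
W = np.array([[1.0, 0.0], [0.0, -1.0]])   # h = 2 latent units, n = 2 inputs
V = np.array([[2.0, 0.0], [0.0, 3.0]])    # m = 2 output classes
y_hat = predict(W, V, np.array([1.0, 1.0]))
```

Here W x = (1, −1), so φ = (1, 0) and the predictor outputs the indicator of class 1.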
Therefore,\nfor multi-class classi\ufb01cation, we will simply adopt the standard large-margin multi-class loss [29]:\n(3)\nIntuitively, if yc = 1 is the correct label, this loss encourages the response \u02c6zc = y0\u02c6z on the correct\nlabel to be a margin greater than the response \u02c6zi on any other label i 6= c.\nSecond, consider the \ufb01rst layer model. Although the loss (3) has proved to be highly successful for\nmulti-class classi\ufb01cation problems, it is not suitable for the \ufb01rst layer because it assumes there is\nonly a single target component active in any latent vector ; i.e. 01 = 1. Although some work\nhas considered learning a latent clustering in a two-layer architecture [11, 18\u201320], such an approach\nis not able to capture the latent sparse code of a classical PFN or RBM in a reasonable way: using\nclustering to simulate a multi-dimensional sparse code causes exponential blow-up in the number of\nlatent classes required. Therefore, we instead adopt a multi-label perceptron model for the \ufb01rst layer,\nde\ufb01ned by the transfer function f1( \u02c6 ) = step( \u02c6 ) applied componentwise to the response vector \u02c6 ;\ni.e. step( \u02c6 i) = 1 if \u02c6 i > 0, 0 otherwise. Here again, instead of using a traditional negative log-\nlikelihood loss, we will adopt a simple large-margin loss for multi-label classi\ufb01cation that naturally\naccommodates multiple binary latent classi\ufb01cations in parallel. Although several loss formulations\nexist for multi-label classi\ufb01cation [30, 31], we adopt the following:\n\nL2(\u02c6z, y) = max(1 y + \u02c6z 1y0\u02c6z).\n\nL1( \u02c6 , ) = max(1 + \u02c6 01 10 \u02c6 ) \u2318 max(1 )/(01) + \u02c6 10 \u02c6 /(01). 
(4)\n\nIntuitively, this loss encourages the average response on the active labels, 0 \u02c6 /(01), to exceed the\nresponse \u02c6 i on any inactive label i, i = 0, by some margin, while also encouraging the response on\nany active label to match the average of the active responses. Despite their simplicity, large-margin\nmulti-label losses have proved to be highly successful in practice [30, 31]. Therefore, the overall\narchitecture we investigate embeds two nonlinear conditionals around a non-trivial latent layer.\n\n1 Throughout this paper we let 1 denote the vector of all 1s with length determined by context.\n\n3\n\n\f3 Equivalent Reformulation\n\nThe main contribution of this paper is to show that the training problem (2) has a convex relaxation\nthat preserves suf\ufb01cient structure to transcend one-layer models. To demonstrate this relaxation, we\n\ufb01rst need to establish the key observation that problem (2) can be re-expressed in terms of a kernel\nmatrix between latent representation vectors. Importantly, this reformulation allows the problem to\nbe re-expressed in terms of an optimization objective that is jointly convex in all participating vari-\nables. We establish this key intermediate result in this section in three steps: \ufb01rst, by re-expressing\nthe latent representation in terms of a latent kernel; second, by reformulating the second layer ob-\njective; and third, by reformulating the \ufb01rst layer objective by exploiting large-margin formulation\noutlined in Section 2.1. Below let K = X0X denote the kernel matrix over the input data, let Im(N )\ndenote the row space of N, and let and \u2020 denote Moore-Penrose pseudo-inverse.\nFirst, simply de\ufb01ne N = 0. Next, re-express the second layer objective F2 in (1) by the following.\nLemma 1. For any \ufb01xed , letting N = 0, it follows that\n\nmin\nV\n\nF2(, V ) =\n\nmin\n\nB2Im(N )\n\nL2(B, Y ) + \n\n2 tr(BN\u2020B0).\n\n(5)\n\nProof. 
The result follows from the following sequence of equivalence preserving transformations:\n\nmin\nV\n\nL2(V , Y ) + \n\n2kV k2\n\nF = min\nA\nmin\n\n=\n\nB2Im(N )\n\nL2(AN, Y ) + \n\n2 tr(AN A0)\n\nL2(B, Y ) + \n\n2 tr(BN\u2020B0),\n\n(6)\n\n(7)\n\nwhere, starting with the de\ufb01nition of F2 in (1), the \ufb01rst equality in (6) follows from the representer\ntheorem applied to kV k2\nF , which implies that the optimal V must be in the form of V = A0 for\nsome A 2 Rm\u21e5t [28]; and \ufb01nally, (7) follows by the change of variable B = AN.\nNote that Lemma 1 holds for any loss L2. In fact, the result follows solely from the structure of the\nregularizer. However, we require L2 to be convex in its \ufb01rst argument to ensure a convex problem\nbelow. Convexity is indeed satis\ufb01ed by the choice (3). Moreover, the term tr(BN\u2020B0) is jointly\nconvex in N and B since it is a perspective function [32], hence the objective in (5) is jointly convex.\nNext, we reformulate the \ufb01rst layer objective F1 in (1). Since this transformation exploits speci\ufb01c\nstructure in the \ufb01rst layer loss, we present the result in two parts: \ufb01rst, by showing how the de-\nsired outcome follows from a general assumption on L1, then demonstrating that this assumption\nis satis\ufb01ed by the speci\ufb01c large-margin multi-label loss de\ufb01ned in (4). To establish this result we\nwill exploit the following augmented forms for the data and variables: let \u02dc= [ , kI], \u02dcN = \u02dc0 \u02dc,\n\u02dc = [ \u02c6 , 0], \u02dcX = [X, 0], \u02dcK = \u02dcX0 \u02dcX, and \u02dct = t + h.\nLemma 2. For any L1 if there exists a function \u02dcL1 such that L1( \u02c6 , ) = \u02dcL1( \u02dc0 \u02dc , \u02dc0 \u02dc) for all\n\u02c6 2 Rh\u21e5t and 2{ 0, 1}h\u21e5t, such that 01 = 1k, it then follows that\n\nmin\nW\n\nF1(W, ) =\n\nmin\n\nD2Im( \u02dcN )\n\n\u02dcL1(D \u02dcK, \u02dcN ) + \u21b5\n\n2 tr(D0 \u02dcN\u2020D \u02dcK).\n\n(8)\n\n(9)\n\n(10)\n\n(11)\n\nProof. 
Similar to above, consider the sequence of equivalence preserving transformations:\n\nL1(W X, ) + \u21b5\n\nmin\nW\n\n2 kWk2\n\nF = min\nW\n= min\n\n\u02dcL1( \u02dc0W \u02dcX, \u02dc0 \u02dc) + \u21b5\n\u02dcL1( \u02dc0 \u02dcC \u02dcX0 \u02dcX, \u02dc0 \u02dc) + \n\n2 kWk2\n\nF\n\n2 tr( \u02dcXC0 \u02dc0 \u02dcC \u02dcX0)\n\n=\n\nC\nmin\n\nD2Im( \u02dcN )\n\n\u02dcL1(D \u02dcK, \u02dcN ) + \u21b5\n\n2 tr(D0 \u02dcN\u2020D \u02dcK),\n\nwhere, starting with the de\ufb01nition of F1 in (1), the \ufb01rst equality (9) simply follows from the as-\nsumption. The second equality (10) follows from the representer theorem applied to kWk2\nF , which\nimplies that the optimal W must be in the form of W = \u02dcC \u02dcX0 for some C 2 R\u02dct\u21e5\u02dct (using the fact\nthat \u02dc has full rank h) [28]. Finally, (11) follows by the change of variable D = \u02dcN C.\n\n4\n\n\fObserve that the term tr(D0 \u02dcN\u2020D \u02dcK) is again jointly convex in \u02dcN and D (also a perspective func-\ntion), while it is easy to verify that \u02dcL1(D \u02dcK, \u02dcN ) as de\ufb01ned in Lemma 3 below is also jointly convex\nin \u02dcN and D [32]; therefore the objective in (8) is jointly convex.\nNext, we show that the assumption of Lemma 2 is satis\ufb01ed by the speci\ufb01c large-margin multi-label\nformulation in Section 2.1; that is, assume L1 is given by the large-margin multi-label loss (4):\n\n\u02c6 j\n\nL1( \u02c6 , ) = Pj max1 j + \u02c6 j0j1 10j\n\n= \u2327110 + \u02c6 diag(01) 1 diag(0 \u02c6 )0, such that \u2327 (\u21e5) :=Pj max(\u2713j), (12)\nwhere we use \u02c6 j, j and \u2713j to denote the jth columns of \u02c6 , and \u21e5 respectively.\nLemma 3. 
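The change of variable used in Lemmas 1 and 2 is easy to verify numerically. The sketch below checks the second-layer case (Lemma 1), using a squared loss for L2 so that both the parametric and kernelized problems have closed forms; all names, dimensions, and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
h, m, t, beta = 4, 3, 6, 0.7
Phi = rng.integers(0, 2, size=(h, t)).astype(float)   # boolean latent matrix
Y = rng.standard_normal((m, t))
N = Phi.T @ Phi                                       # latent kernel N = Phi' Phi

# primal: min_V  1/2 ||V Phi - Y||_F^2 + (beta/2) ||V||_F^2   (squared loss for L2)
V = Y @ Phi.T @ np.linalg.inv(Phi @ Phi.T + beta * np.eye(h))
primal = 0.5 * np.linalg.norm(V @ Phi - Y) ** 2 + 0.5 * beta * np.linalg.norm(V) ** 2

# kernelized: min_{B in Im(N)}  1/2 ||B - Y||_F^2 + (beta/2) tr(B N^dagger B')
# parameterize B = A N; then tr(B N^dagger B') = tr(A N A') since N N^dagger N = N
A = Y @ np.linalg.inv(N + beta * np.eye(t))
B = A @ N
kernel = 0.5 * np.linalg.norm(B - Y) ** 2 + 0.5 * beta * np.trace(A @ N @ A.T)
```

As Lemma 1 asserts, the two optimal objective values coincide, even though the kernelized form never references Φ directly, only N.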
For the multi-label loss L1 de\ufb01ned in (4), and for any \ufb01xed 2{ 0, 1}h\u21e5t where\n01 = 1k, the de\ufb01nition \u02dcL1( \u02dc0 \u02dc , \u02dc0 \u02dc) := \u2327 ( \u02dc0 \u02dc \u02dc0 \u02dc/k) + t tr( \u02dc0 \u02dc ) using the augmentation\nabove satis\ufb01es the property that L1( \u02c6 , ) = \u02dcL1( \u02dc0 \u02dc , \u02dc0 \u02dc) for any \u02c6 2 Rh\u21e5t.\nProof. Since 01 = 1k we obtain a simpli\ufb01cation of L1:\n\n(13)\nIt only remains is to establish that \u2327 (k \u02c6 ) = \u2327 ( \u02dc0 \u02dc \u02dc0 \u02dc/k). To do so, consider the sequence\nof equivalence preserving transformations:\n(14)\n\nL1( \u02c6 , ) = \u2327110 + k \u02c6 1 diag(0 \u02c6 )0 = \u2327 (k \u02c6 ) + t tr( \u02dc0 \u02dc ).\n\u2327 (k \u02c6 ) =\n=\n\ntr\u21e40(k \u02dc \u02dc)\nk tr\u23260 \u02dc0(k \u02dc \u02dc) = \u2327 ( \u02dc0 \u02dc \u02dc0 \u02dc/k),\n\nmax\n\u21e42Rh\u21e5\u02dct\n+ :\u21e401=1\nmax\n+ :\u232601=1\n\nwhere the equalities in (14) and (15) follow from the de\ufb01nition of \u2327 and the fact that linear maxi-\nmizations over the simplex obtain their solutions at the vertices. To establish the equality between\n+ there must exist an \u2326 2 R\u02dct\u21e5\u02dct\n(14) and (15), since \u02dc embeds the submatrix kI, for any \u21e4 2 Rh\u21e5\u02dct\n+ sat-\nisfying \u21e4= \u02dc\u2326/k. Furthermore, these matrices satisfy \u21e401 = 1 iff \u23260 \u02dc01/k = 1 iff \u232601 = 1.\nTherefore, the result (8) holds for the \ufb01rst layer loss (4), using \u02dcL1 de\ufb01ned in Lemma 3.\n(The\nsame result can be established for other loss functions, such as the multi-class large-margin loss.)\nCombining these lemmas yields the desired result of this section.\nTheorem 1. 
For any second layer loss, and any first layer loss that satisfies the assumption of Lemma 2 (for example the large-margin multi-label loss (4)), the following equivalence holds:

(2)  =  min_{Ñ : ∃Φ ∈ {0,1}^{t×h} s.t. Φ1 = 1k, Ñ = Φ̃′Φ̃}  min_{D∈Im(Ñ)}  min_{B∈Im(Ñ)}  L̃1(DK̃, Ñ) + (α/2) tr(D′Ñ†DK̃) + γ L2(B, Y) + (γβ/2) tr(BÑ†B′).   (16)

(Theorem 1 follows immediately from Lemmas 1 and 2.) Note that no relaxation has occurred thus far: the objective value of (16) matches that of (2). Not only has this reformulation resulted in (2) being entirely expressed in terms of the latent kernel matrix Ñ, the objective in (16) is jointly convex in all participating unknowns, Ñ, B and D. Unfortunately, the constraints in (16) are not convex.

4 Convex Relaxation

We first relax the problem by dropping the augmentation Φ ↦ Φ̃ and working with the t × t variable N = Φ′Φ. Without the augmentation, Lemma 3 becomes a lower bound (i.e. (14) ≥ (15)), hence a relaxation. To then achieve a convex form we further relax the constraints in (16). To do so, consider

N0 = {N : ∃Φ ∈ {0,1}^{t×h} such that Φ1 = 1k and N = ΦΦ′},   (17)
N1 = {N : N ∈ {0, ..., k}^{t×t}, N ⪰ 0, diag(N) = 1k, rank(N) ≤ h},   (18)
N2 = {N : N ≥ 0, N ⪰ 0, diag(N) = 1k},   (19)

where it is clear from the definitions that N0 ⊆ N1 ⊆ N2. (Here we use N ⪰ 0 to also encode N′ = N.) Note that the set N0 corresponds to the original set of constraints from (16). The set N1 simplifies the characterization of this constraint set on the resulting kernel matrices N = ΦΦ′. However, neither N0 nor N1 is convex. Therefore, we need to adopt the further relaxed set N2, which is convex. (Note that N_ij ≤ k is already implied by N ⪰ 0 and N_ii = k in N2.) Since dropping the rank constraint eliminates the constraints B ∈ Im(N) and D ∈ Im(N) in (16) when N ≻ 0 [32], we obtain the following relaxed problem, which is jointly convex in N, B and D:

min_{N∈N2}  min_{B∈R^{t×t}}  min_{D∈R^{t×t}}  L̃1(DK, N) + (α/2) tr(D′N†DK) + γ L2(B, Y) + (γβ/2) tr(BN†B′).   (20)

5 Efficient Training Approach

Unfortunately, nonlinear semidefinite optimization problems in the form (20) are generally thought to be too expensive in practice despite their polynomial theoretical complexity [33, 34]. Therefore, we develop an effective training algorithm that exploits problem structure to bypass the main computational bottlenecks. The key challenge is that N2 contains both semidefinite and affine constraints, and the pseudo-inverse N† makes optimization over N difficult even for fixed B and D.
To mitigate these difficulties we first treat (20) as the reduced problem, min_{N∈N2} F(N), where F is an implicit objective achieved by minimizing out B and D. Note that F is still convex in N by the joint convexity of (20).

Algorithm 1: ADMM to optimize F(N) for N ∈ N2.
1  Initialize: M_0 = I, Λ_0 = 0.
2  for T = 1, 2, ... do
3      N_T ← argmin_{N ⪰ 0} L(N, M_{T−1}, Λ_{T−1}), by using the boosting Algorithm 2.
4      M_T ← argmin_{M ≥ 0, M_ii = k} L(N_T, M, Λ_{T−1}), which has an efficient closed form solution.
5      Λ_T ← Λ_{T−1} + (1/µ)(M_T − N_T); i.e. update the multipliers.
6  return N_T.

Algorithm 2: Boosting algorithm to optimize G(N) for N ⪰ 0.
1  Initialize: N_0 ← 0, H_0 ← [] (empty).
2  for T = 1, 2, ... do
3      Find the smallest arithmetic eigenvalue of ∇G(N_{T−1}), and its eigenvector h_T.
4      Conic search by LBFGS: (a_T, b_T) ← argmin_{a ≥ 0, b ≥ 0} G(a N_{T−1} + b h_T h_T′).
5      Local search by LBFGS: H_T ← local min_H G(HH′), initialized by H = (√a H_{T−1}, √b h_T).
6      Set N_T ← H_T H_T′; break if stopping criterion met.
7  return N_T.
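The greedy rank-one scheme of Algorithm 2 can be illustrated on a toy objective. In this sketch a coarse grid search stands in for the LBFGS conic search and the local search step is omitted, so it is only an illustration of the eigenvector-boosting idea, not the paper's implementation.

```python
import numpy as np

def boost_psd(G, grad, dim, iters=8):
    # greedy rank-one boosting over the PSD cone, in the spirit of Algorithm 2;
    # a coarse grid search replaces the LBFGS conic search (illustration only)
    N = np.zeros((dim, dim))
    grid = np.linspace(0.0, 3.0, 31)
    for _ in range(iters):
        w, U = np.linalg.eigh(grad(N))
        h = U[:, 0]                        # eigenvector of the smallest eigenvalue
        H = np.outer(h, h)
        _, a, b = min((G(a * N + b * H), a, b) for a in grid for b in grid)
        N = a * N + b * H                  # rank grows by at most one per step
    return N

# toy objective: G(N) = 1/2 ||N - T||_F^2; over the PSD cone its optimum is the
# eigenvalue-clipped projection of T, here diag(2, 1, 0)
T = np.diag([2.0, 1.0, -3.0])
G = lambda N: 0.5 * np.linalg.norm(N - T) ** 2
grad = lambda N: N - T
N_star = boost_psd(G, grad, dim=3)
```

The iterates stay explicitly low rank, which is what lets the full method avoid forming N† directly.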
To cope with the constraints on N we adopt the alternating direction method of multipliers (ADMM) [35] as the main outer optimization procedure; see Algorithm 1. This approach allows one to divide N2 into two groups, N ⪰ 0 and {N_ij ≥ 0, N_ii = k}, yielding the augmented Lagrangian

L(N, M, Λ) = F(N) + δ(N ⪰ 0) + δ(M_ij ≥ 0, M_ii = k) − ⟨Λ, N − M⟩ + (1/(2µ)) ||N − M||²_F,   (21)

where µ > 0 is a small constant, and δ denotes an indicator such that δ(·) = 0 if its argument is true, and ∞ otherwise. In this procedure, Steps 4 and 5 cost O(t²) time, whereas the main bottleneck is Step 3, which involves minimizing G_T(N) := L(N, M_{T−1}, Λ_{T−1}) over N ⪰ 0 for fixed M_{T−1} and Λ_{T−1}.

Boosting for Optimizing over the Positive Semidefinite Cone. To solve the problem in Step 3 we develop an efficient boosting procedure based on [36] that retains low rank iterates N_T while avoiding the need to determine N† when computing G(N) and ∇G(N); see Algorithm 2. The key idea is a simple change of variable. For example, consider the first layer objective and let G1(N) = min_D L̃1(DK, N) + (α/2) tr(D′N†DK). By defining D = NC, we obtain G1(N) = min_C L̃1(NCK, N) + (α/2) tr(C′NCK), which no longer involves N† but remains convex in C; this problem can be solved efficiently after a slight smoothing of the objective [37] (e.g. by LBFGS). Moreover, the gradient ∇G1(N) can be readily computed given the optimal C. Applying the same technique to the second layer yields an efficient procedure for evaluating G(N) and ∇G(N). Finally, note that many of the matrix-vector multiplications in this procedure can be further accelerated by exploiting the low rank factorization of N maintained by the boosting algorithm; see the Appendix for details.

[Figure 2: Synthetic experiments: three artificial data sets, (a) "Xor" (2 × 400), (b) "Boxes" (2 × 320), and (c) "Interval" (2 × 200), that cannot be meaningfully classified by a one-layer model that does not use a nonlinear kernel. Panel (d) gives percentage test set error:]

        XOR        BOXES      INTER
TJB2    49.8±0.7   45.7±0.6   49.3±1.3
TSS1    50.2±1.2   35.7±1.3   42.6±3.9
SVM1    50.3±1.1   31.4±0.5   50.0±0.0
LOC2    4.2±0.9    11.4±0.6   50.0±0.0
CVX2    0.2±0.1    10.1±0.4   20.0±2.4

Additional Relaxation. One can further reduce computation cost by adopting additional relaxations of (20). For example, by dropping N ≥ 0 and relaxing diag(N) = 1k to diag(N) ≤ 1k, the objective can be written as min_{N⪰0, max_i N_ii ≤ k} F(N). Since max_i N_ii is convex in N, it is well known that there must exist a constant c1 > 0 such that the optimal N is also an optimal solution to min_{N⪰0} F(N) + c1 (max_i N_ii)². 
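The ADMM splitting behind Algorithm 1 can be illustrated on a toy instance, using F(N) = (1/2)||N − T||²_F as a stand-in for the implicit objective F, so that the N-step has a closed form and no boosting subsolver is needed; values and tolerances are illustrative.

```python
import numpy as np

def proj_psd(S):
    # projection onto the PSD cone by clipping negative eigenvalues
    w, U = np.linalg.eigh(S)
    return (U * np.maximum(w, 0.0)) @ U.T

def admm_n2(T, k, mu=1.0, iters=200):
    # ADMM sketch in the spirit of Algorithm 1, with the toy objective
    # F(N) = 1/2 ||N - T||_F^2 standing in for the implicit objective
    t = T.shape[0]
    M = k * np.eye(t)
    Lam = np.zeros((t, t))
    N = np.eye(t)
    for _ in range(iters):
        # N-step: min over the PSD cone of F(N) - <Lam, N> + 1/(2 mu) ||N - M||_F^2;
        # for this quadratic F the minimizer is an eigenvalue-clipped projection
        N = proj_psd((T + Lam + M / mu) / (1.0 + 1.0 / mu))
        # M-step: separable clamp onto {M >= 0, M_ii = k} (closed form)
        M = np.maximum(N - mu * Lam, 0.0)
        np.fill_diagonal(M, k)
        # multiplier update
        Lam = Lam + (M - N) / mu
    return N

T = np.array([[2.0, -1.0], [-1.0, 2.0]])
N_hat = admm_n2(T, k=2.0)   # the negative off-diagonal is forced to 0 by N2
```

The splitting lets each block constraint be handled by a cheap exact update, with the coupling enforced only through the multipliers.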
While max_i N_ii is not smooth, one can further smooth it with a softmax, and instead solve min_{N⪰0} F(N) + c1 (log Σ_i exp(c2 N_ii))² for some large c2. This formulation avoids the need for ADMM entirely and can be directly solved by Algorithm 2.

6 Experimental Evaluation

To investigate the effectiveness of the proposed relaxation scheme for training a two-layer conditional model, we conducted a number of experiments to compare learning quality against baseline methods. Note that, given an optimal solution N, B and D to (20), an approximate solution to the original problem (2) can be recovered heuristically by first rounding N to obtain Φ, then recovering W and V, as shown in Lemmas 1 and 2. However, since our primary objective is to determine whether any convex relaxation of a two-layer model can even compete with one-layer or locally trained two-layer models (rather than to evaluate heuristic rounding schemes), we consider a transductive evaluation that does not require any further modification of N, B and D. In such a set-up, training data is divided into a labeled and an unlabeled portion, where the method receives X = [X_l, X_u] and Y_l, and at test time the resulting predictions Ŷ_u are evaluated against the held-out labels Y_u.

Methods. We compared the proposed convex relaxation scheme (CVX2) against the following methods: simple alternating minimization of the same two-layer model (2) (LOC2), a one-layer linear SVM trained on the labeled data (SVM1), the transductive one-layer SVM methods of [38] (TSJ1) and [39] (TSS1), and the transductive latent clustering method of [18, 19] (TJB2), which is also a two-layer model. Linear input kernels were used for all methods (standard in most deep learning models) to control the comparison between one-layer and two-layer models. Our experiments were conducted with the following common protocol: first, the data was split into separate training and test sets. 
Then the parameters of each procedure were optimized by three-fold cross validation on the training set. Once the optimal parameters were selected, they were fixed and used on the test set. For transductive procedures, the same three training sets from the first phase were used, but then combined with ten new test sets drawn from the disjoint test data (hence 30 overall) for the final evaluation. At no point were test examples used to select any parameters for any of the methods. We considered different proportions of labeled/unlabeled data; namely, 100/100 and 200/200.

Synthetic Experiments. We initially ran a proof of concept experiment on three binary labeled artificial data sets depicted in Figure 2 (data set sizes given as n × t) with 100/100 labeled/unlabeled training points. Here the goal was simply to determine whether the relaxed two-layer training method could preserve sufficient structure to overcome the limits of a one-layer architecture. Clearly, none of the data sets in Figure 2 are adequately modeled by a one-layer architecture (that does not cheat and use a nonlinear kernel). 
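A stand-in for the "Xor" construction illustrates why such data defeats any one-layer linear model on the raw features: both classes share the same (zero) mean, so no linear separator can beat chance. (The generator below is hypothetical; the paper's exact dataset construction is not specified here.)

```python
import numpy as np

# hypothetical XOR-style generator: the label is the sign of the product of the
# two raw input features, so the data occupies opposite quadrant pairs
rng = np.random.default_rng(0)
t = 400
X = rng.uniform(-2.0, 2.0, size=(2, t))
y = np.sign(X[0] * X[1])

mu_pos = X[:, y > 0].mean(axis=1)   # mean of the positive class (near origin)
mu_neg = X[:, y < 0].mean(axis=1)   # mean of the negative class (near origin)
```

A two-layer model can carve the quadrants apart with its latent layer, which is exactly the structure the synthetic tests probe.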
The results are shown in the table of Figure 2(d).

        MNIST       USPS        Letter      COIL        CIFAR       G241N
TJB2    19.3 ±1.2   53.2 ±2.9   20.4 ±2.1   30.6 ±0.8   26.3 ±0.8   29.2 ±2.1
LOC2    19.3 ±1.0   13.9 ±1.1   10.4 ±0.6   18.0 ±0.5   31.8 ±0.9   41.6 ±0.9
SVM1    16.2 ±0.7   11.6 ±0.5    6.2 ±0.4   16.9 ±0.6   27.6 ±0.9   27.1 ±0.9
TSS1    13.7 ±0.8   11.1 ±0.5    5.9 ±0.5   17.5 ±0.6   26.7 ±0.7   25.1 ±0.8
TSJ1    14.6 ±0.7   12.1 ±0.4    5.6 ±0.5   17.2 ±0.6   24.4 ±0.7   26.6 ±0.8
CVX2     9.2 ±0.6    9.2 ±0.5    5.1 ±0.5   13.8 ±0.6   26.5 ±0.8   25.2 ±1.0

Table 1: Mean test misclassification error % (± stdev) for 100/100 labeled/unlabeled.

        MNIST       USPS        Letter      COIL        CIFAR       G241N
TJB2    13.7 ±0.6   46.6 ±1.0   14.0 ±2.6   45.0 ±0.8   22.4 ±0.5   30.4 ±1.9
LOC2    16.3 ±0.6    9.7 ±0.5    8.5 ±0.6   12.8 ±0.6   28.2 ±0.9   40.4 ±0.7
SVM1    11.2 ±0.4   10.7 ±0.4    5.0 ±0.3   15.6 ±0.5   25.5 ±0.6   22.9 ±0.5
TSS1    11.4 ±0.5   11.3 ±0.4    4.4 ±0.3   14.9 ±0.4   23.7 ±0.5   24.0 ±0.6
TSJ1    12.3 ±0.5   11.8 ±0.4    4.8 ±0.3   13.5 ±0.4   23.9 ±0.5   22.2 ±0.6
CVX2     8.8 ±0.4    6.6 ±0.4    3.8 ±0.3    8.2 ±0.4   22.8 ±0.6   20.3 ±0.5

Table 2: Mean test misclassification error % (± stdev) for 200/200 labeled/unlabeled.

As expected, the one-layer models SVM1 and TSS1 were unable to capture any useful classification structure in these problems. (TSJ1 behaves similarly to TSS1.) The results obtained by CVX2, on the other hand, are encouraging: in these data sets, CVX2 is easily able to capture latent nonlinearities while outperforming the locally trained LOC2. Although LOC2 is effective in the first two cases, it exhibits weaker test accuracy there and fails on the third data set.
The two-layer method TJB2 exhibited convergence difficulties on these problems that prevented reasonable results.

Experiments on “Real” Data Sets. Next, we conducted experiments on real data sets to determine whether the advantages observed in controlled synthetic settings translate into useful results in a more realistic scenario. For these experiments we used a collection of binary labeled data sets: USPS, COIL and G241N from [40], Letter from [41], MNIST, and CIFAR-100 from [42]. (See Appendix B in the supplement for further details.) The results are shown in Tables 1 and 2 for the labeled/unlabeled proportions 100/100 and 200/200 respectively.

The relaxed two-layer method CVX2 again demonstrates effective results, although some data sets caused difficulty for all methods. The data sets can be divided into two groups, (MNIST, USPS, COIL) versus (Letter, CIFAR, G241N). In the first group, two-layer modeling demonstrates a clear advantage: CVX2 outperforms SVM1 by a significant margin. Note that this advantage must be due to two-layer versus one-layer modeling, since the transductive SVM methods TSS1 and TSJ1 demonstrate no advantage over SVM1. For the second group, the effectiveness of SVM1 shows that only minor gains are possible via transductive or two-layer extensions, although some gains are still realized. The locally trained two-layer model LOC2 performed quite poorly in all cases. Unfortunately, the convex latent clustering method TJB2 was also not competitive on any of these data sets. Overall, CVX2 shows useful promise as a two-layer modeling approach.

7 Conclusion

We have introduced a new convex approach to two-layer conditional modeling by reformulating the problem in terms of a latent kernel over intermediate feature representations. The proposed model can accommodate latent feature representations that go well beyond a latent clustering, extending current convex approaches.
A semidefinite relaxation of the latent kernel allows a reasonable implementation that is able to demonstrate advantages over single-layer models and local training methods. From a deep learning perspective, this work demonstrates that trainable latent layers can be expressed in terms of reproducing kernel Hilbert spaces, and that large margin methods can be usefully applied to multi-layer prediction architectures. Important directions for future work include: replacing the step and indmax transfers with more traditional sigmoid and softmax transfers, while also replacing the margin losses with more traditional Bregman divergences; refining the relaxation to allow more control over the structure of the latent representations; and investigating the utility of convex methods for stage-wise training within multi-layer architectures.

References

[1] Q. Le, M. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
[2] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In NIPS, 2012.
[3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2:1–127, 2009.
[4] G. Hinton. Learning multiple layers of representations. Trends in Cognitive Sciences, 11:428–434, 2007.
[5] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 2006.
[6] N. Lawrence. Probabilistic non-linear principal component analysis. JMLR, 6:1783–1816, 2005.
[7] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 6:1705–1749, 2005.
[8] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15:3736–3745, 2006.
[9] P. Comon. Independent component analysis, a new concept?
Signal Processing, 36(3):287–314, 1994.
[10] M. Carreira-Perpiñán and Z. Lu. Dimensionality reduction by unsupervised regression. In CVPR, 2010.
[11] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Allerton Conference, 1999.
[12] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML, 2011.
[13] K. Swersky, M. Ranzato, D. Buchman, B. Marlin, and N. de Freitas. On autoencoders and score matching for energy based models. In ICML, 2011.
[14] Y. LeCun. Who is afraid of non-convex loss functions? http://videolectures.net/eml07_lecun_wia, 2007.
[15] Y. Bengio, N. Le Roux, P. Vincent, and O. Delalleau. Convex neural networks. In NIPS, 2005.
[16] S. Nowozin and G. Bakir. A decoupled approach to exemplar-based unsupervised learning. In ICML, 2008.
[17] D. Bradley and J. Bagnell. Convex coding. In UAI, 2009.
[18] A. Joulin and F. Bach. A convex relaxation for weakly supervised classifiers. In ICML, 2012.
[19] A. Joulin, F. Bach, and J. Ponce. Efficient optimization for discriminative latent class models. In NIPS, 2010.
[20] Y. Guo and D. Schuurmans. Convex relaxations of latent variable training. In NIPS, 2007.
[21] A. Goldberg, X. Zhu, B. Recht, J. Xu, and R. Nowak. Transduction with matrix completion: Three birds with one stone. In NIPS, 2010.
[22] E. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? arXiv:0912.3599, 2009.
[23] X. Zhang, Y. Yu, and D. Schuurmans. Accelerated training for matrix-norm regularization: A boosting approach. In NIPS, 2012.
[24] A. Anandkumar, D. Hsu, and S. Kakade. A method of moments for mixture models and hidden Markov models. In COLT, 2012.
[25] D. Hsu and S. Kakade.
Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In ITCS, 2013.
[26] Y. Cho and L. Saul. Large margin classification in infinite neural networks. Neural Computation, 22, 2010.
[27] R. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113, 1992.
[28] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. JMAA, 33:82–95, 1971.
[29] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, pages 265–292, 2001.
[30] J. Fuernkranz, E. Huellermeier, E. Mencia, and K. Brinker. Multilabel classification via calibrated label ranking. Machine Learning, 73(2):133–153, 2008.
[31] Y. Guo and D. Schuurmans. Adaptive large margin training for multilabel classification. In AAAI, 2011.
[32] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3), 2008.
[33] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. 1994.
[34] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[35] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–123, 2010.
[36] S. Laue. A hybrid algorithm for convex semidefinite optimization. In ICML, 2012.
[37] O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19(5):1155–1178, 2007.
[38] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, 1999.
[39] V. Sindhwani and S. Keerthi. Large scale semi-supervised linear SVMs.
In SIGIR, 2006.
[40] http://olivier.chapelle.cc/ssl-book/benchmarks.html
[41] http://archive.ics.uci.edu/ml/datasets
[42] http://www.cs.toronto.edu/~kriz/cifar.html