{"title": "Convex Deep Learning via Normalized Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 3275, "page_last": 3283, "abstract": "Deep learning has been a long standing pursuit in machine learning, which until recently was hampered by unreliable training methods before the discovery of improved heuristics for embedded layer training. A complementary research strategy is to develop alternative modeling architectures that admit efficient training methods while expanding the range of representable structures toward deep models. In this paper, we develop a new architecture for nested nonlinearities that allows arbitrarily deep compositions to be trained to global optimality. The approach admits both parametric and nonparametric forms through the use of normalized kernels to represent each latent layer. The outcome is a fully convex formulation that is able to capture compositions of trainable nonlinear layers to arbitrary depth.", "full_text": "Convex Deep Learning via Normalized Kernels\n\n\u00a8Ozlem Aslan\n\nDept of Computing Science\nUniversity of Alberta, Canada\nozlem@cs.ualberta.ca\n\nXinhua Zhang\n\nxizhang@nicta.com.au\n\nMachine Learning Group\n\nNICTA and ANU\n\nDale Schuurmans\n\nDept of Computing Science\nUniversity of Alberta, Canada\ndale@cs.ualberta.ca\n\nAbstract\n\nDeep learning has been a long standing pursuit in machine learning, which until\nrecently was hampered by unreliable training methods before the discovery of im-\nproved heuristics for embedded layer training. A complementary research strategy\nis to develop alternative modeling architectures that admit ef\ufb01cient training meth-\nods while expanding the range of representable structures toward deep models. In\nthis paper, we develop a new architecture for nested nonlinearities that allows arbi-\ntrarily deep compositions to be trained to global optimality. 
The approach admits\nboth parametric and nonparametric forms through the use of normalized kernels\nto represent each latent layer. The outcome is a fully convex formulation that is\nable to capture compositions of trainable nonlinear layers to arbitrary depth.\n\nIntroduction\n\n1\nDeep learning has recently achieved signi\ufb01cant advances in several areas of perceptual computing,\nincluding speech recognition [1], image analysis and object detection [2, 3], and natural language\nprocessing [4]. The automated acquisition of representations is motivated by the observation that\nappropriate features make any learning problem easy, whereas poor features hamper learning. Given\nthe practical signi\ufb01cance of feature engineering, automated methods for feature discovery offer an\nimportant tool for applied machine learning. Ideally, automatically acquired features capture simple\nbut salient aspects of the input distribution, upon which subsequent feature discovery can compose\nincreasingly abstract and invariant aspects [5]; an intuition that appears to be well supported by\nrecent empirical evidence [6].\nUnfortunately, deep architectures are notoriously dif\ufb01cult to train and, until recently, required sig-\nni\ufb01cant experience to manage appropriately [7, 8]. Beyond well known problems like local minima\n[9], deep training landscapes also exhibit plateaus [10] that arise from credit assignment problems in\nbackpropagation. An intuitive understanding of the optimization landscape and careful initialization\nboth appear to be essential aspects of obtaining successful training [11]. Nevertheless, the devel-\nopment of recent training heuristics has improved the quality of feature discovery at lower levels\nin deep architectures. These advances began with the idea of bottom-up, stage-wise unsupervised\ntraining of latent layers [12, 13] (\u201cpre-training\u201d), and progressed to more recent ideas like dropout\n[14]. 
Despite the resulting empirical success, however, such advances occur in the context of a\nproblem that is known to be NP-hard in the worst case (even to approximate) [15], hence there is no\nguarantee that worst case versus \u201ctypical\u201d behavior will not show up in any particular problem.\nGiven the recent success of deep learning, it is no surprise that there has been growing interest in\ngaining a deeper theoretical understanding. One key motivation of recent theoretical work has been\nto ground deep learning on a well understood computational foundation. For example, [16] demon-\nstrates that polynomial time (high probability) identi\ufb01cation of an optimal deep architecture can be\nachieved by restricting weights to bounded random variates and considering hard-threshold genera-\ntive gates. Other recent work [17] considers a sum-product formulation [18], where guarantees can\nbe made about the ef\ufb01cient recovery of an approximately optimal polynomial basis. Although these\n\n1\n\n\ftreatments do not cover the speci\ufb01c models that have been responsible for state of the art results,\nthey do provide insight into the computational structure of deep learning.\nThe focus of this paper is on kernel-based approaches to deep learning, which offer a potentially\neasier path to achieving a simple computational understanding. Kernels [19] have had a signi\ufb01cant\nimpact in machine learning, partly because they offer \ufb02exible modeling capability without sacri-\n\ufb01cing convexity in common training scenarios [20]. Given the convexity of the resulting training\nformulations, suboptimal local minima and plateaus are eliminated while reliable computational\nprocedures are widely available. A common misconception about kernel methods is that they are\ninherently \u201cshallow\u201d [5], but depth is an aspect of how such methods are used and not an intrinsic\nproperty. 
For example, [21] demonstrates how nested compositions of kernels can be incorporated\nin a convex training formulation, which can be interpreted as learning over a (\ufb01xed) composition of\nhidden layers with in\ufb01nite features. Other work has formulated adaptive learning of nested kernels,\nalbeit by sacri\ufb01cing convexity [22]. More recently, [23, 24] has considered learning kernel repre-\nsentations of latent clusters, achieving convex formulations under some relaxations. Finally, [25]\ndemonstrated that an adaptive hidden layer could be expressed as the problem of learning a latent\nkernel between given input and output kernels within a jointly convex formulation. Although these\nworks show clearly how latent kernel learning can be formulated, convex models have remained\nrestricted to a single adaptive layer, with no clear paths suggested for a multi-layer extension.\nIn this paper, we develop a convex formulation of multi-layer learning that allows multiple latent\nkernels to be connected through nonlinear conditional losses. In particular, each pair of succes-\nsive layers is connected by a prediction loss that is jointly convex in the adjacent kernels, while\nexpressing a non-trivial, non-linear mapping between layers that supports multi-factor latent rep-\nresentations. The resulting formulation signi\ufb01cantly extends previous convex models, which have\nonly been able to train a single adaptive kernel while maintaining a convex training objective. Addi-\ntional algorithmic development yields an approach with improved scaling properties over previous\napproaches, although not yet at the level of current deep learning methods. 
We believe the result is the first fully convex training formulation of a deep learning architecture with adaptive hidden layers, which demonstrates some useful potential in empirical investigations.

2 Background

To begin, consider a multi-layer conditional model where the input xi is an n-dimensional feature vector and the output yi ∈ {0, 1}m is a multi-label target vector over m labels. For concreteness, consider a three-layer model (Figure 1). Here, the output of the first hidden layer is determined by multiplying the input, xi, with a weight matrix W ∈ Rh×n and passing the result through a nonlinear transfer σ1, yielding φi = σ1(Wxi). Then, the output of the second layer is determined by multiplying the first layer output, φi, with a second weight matrix U ∈ Rh′×h and passing the result through a nonlinear transfer σ2, yielding θi = σ2(Uφi), etc. The final output is then determined via ŷi = σ3(Vθi), for V ∈ Rm×h′. For simplicity, we will set h′ = h.

The goal of training is to find the weight matrices, W, U and V, that minimize a training objective defined over the training data (with regularization). In particular, we assume the availability of t training examples {(xi, yi)}, i = 1, . . . , t, and denote the feature matrix X := (x1, . . . , xt) ∈ Rn×t and the label matrix Y := (y1, . . . , yt) ∈ Rm×t respectively. One of the key challenges for training arises from the fact that the latent variables Φ := (φ1, . . . , φt) and Θ := (θ1, . . . , θt) are unobserved.

To introduce our main development, we begin with a reconstruction of [25], which proposed a convex formulation of a simpler two-layer model.
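The feed-forward computation just described can be sketched concretely as follows. This is a minimal illustration only: the sigmoid transfer and the specific dimensions are assumptions for the sketch, since the development below deliberately leaves σ1, σ2, σ3 abstract.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Dimensions: n input features, h hidden units per latent layer (h' = h),
# m output labels, t examples stored as columns.
n, h, m, t = 5, 4, 3, 10
rng = np.random.default_rng(0)
X = rng.standard_normal((n, t))

W = rng.standard_normal((h, n))   # first-layer weights
U = rng.standard_normal((h, h))   # second-layer weights
V = rng.standard_normal((m, h))   # output-layer weights

Phi = sigmoid(W @ X)        # Φ = σ1(WX), first latent layer
Theta = sigmoid(U @ Phi)    # Θ = σ2(UΦ), second latent layer
Y_hat = sigmoid(V @ Theta)  # Ŷ = σ3(VΘ), predicted outputs
```

Training must recover W, U, V without ever observing Φ or Θ, which is the difficulty the rest of the paper addresses.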
Although the techniques proposed in that work are intrinsically restricted to two layers, we will eventually show how this barrier can be surpassed through the introduction of a new tool: normalized output kernels. However, we first need to provide a more general treatment of the three main obstacles to obtaining a convex training formulation for multi-layer architectures like Figure 1.

[Figure 1: Multi-layer conditional models.]

2.1 First Obstacle: Nonlinear Transfers

The first key obstacle arises from the presence of the transfer functions, σi, which provide the essential nonlinearity of the model. In classical examples, such as auto-encoders and feed-forward neural networks, an explicit form for σi is prescribed, e.g. a step or sigmoid function. Unfortunately, the imposition of a nonlinear transfer in any deterministic model imposes highly non-convex constraints of the form φi = σ1(Wxi). This problem is alleviated in nondeterministic models like probabilistic networks (PFN) [26] and restricted Boltzmann machines (RBMs) [12], where the nonlinear relationship between the output (e.g. φi) and the linear pre-image (e.g. Wxi) is only softly enforced via a nonlinear loss L that measures their discrepancy (see Figure 1). Such an approach was adopted by [25], where the values of the hidden layer responses (e.g. φi) were treated as independent variables whose values are to be optimized in conjunction with the weights. In the present case, if one similarly optimizes rather than marginalizes over hidden layer values, Φ and Θ (i.e. Viterbi style training), a generalized training objective for a multi-layer architecture (Figure 1) can be expressed:

min_{W,U,V,Φ,Θ}  L1(WX, Φ) + ½‖W‖² + L2(UΦ, Θ) + ½‖U‖² + L3(VΘ, Y) + ½‖V‖².  (1)

The nonlinear loss L1 bridges the nonlinearity introduced by σ1, and L2 bridges the nonlinearity introduced by σ2, etc. Importantly, these losses, albeit nonlinear, can be chosen to be convex in their first argument; for example, as in standard models like PFNs and RBMs (implicitly). In addition to these exponential family models, which have traditionally been the focus of deep learning research, continuous latent variable models have also been considered, e.g. the rectified linear model [27] and the exponential family harmonium. In this paper, like [25], we will use large-margin losses, which offer additional sparsity and simplifications.

Unfortunately, even though the overall objective (1) is convex in the weight matrices (W, U, V) given (Φ, Θ), it is not jointly convex in all participating variables, due to the interaction between the latent variables (Φ, Θ) and the weight matrices (W, U, V).

2.2 Second Obstacle: Bilinear Interaction

Therefore, the second key obstacle arises from the bilinear interaction between the latent variables and weight matrices in (1). To overcome this obstacle, consider a single connecting layer, which consists of an input matrix (e.g. Φ), an output matrix (e.g. Θ) and an associated weight matrix (e.g. U):

min_U  L(UΦ, Θ) + ½‖U‖².  (2)

By the representer theorem, it follows that the optimal U can be expressed as U = AΦ′ for some A ∈ Rm×t. Denote the linear response Z = UΦ = AΦ′Φ = AK, where K = Φ′Φ is the input kernel matrix.
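With this substitution the regularizer ½‖U‖² can be rewritten purely in terms of (Z, K), since tr(UU′) = tr(ZK†Z′) when U has the representer form. A numerical sketch of this identity (dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
h, t, m = 4, 10, 3
Phi = rng.standard_normal((h, t))   # latent-layer outputs Φ (h features, t examples)
A = rng.standard_normal((m, t))     # dual coefficients from the representer theorem

K = Phi.T @ Phi                     # input kernel K = Φ'Φ (rank at most h)
U = A @ Phi.T                       # primal weights U = AΦ'
Z = U @ Phi                         # linear responses Z = UΦ = AK

# tr(UU') = tr(AKA') = tr(A K K† K A') = tr(Z K† Z'), using K K† K = K.
lhs = np.trace(U @ U.T)
rhs = np.trace(Z @ np.linalg.pinv(K) @ Z.T)
```

The equality of `lhs` and `rhs` is what lets the objective be reparameterized jointly in (Z, K) below.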
Then tr(UU′) = tr(AKA′) = tr(AKK†KA′) = tr(ZK†Z′), where K† is the Moore-Penrose pseudo-inverse (recall KK†K = K and K†KK† = K†), therefore

(2) = min_Z  L(Z, Θ) + ½ tr(ZK†Z′).  (3)

This is essentially the value regularization framework [28]. Importantly, the objective in (3) is jointly convex in Z and K, since tr(ZK†Z′) is a perspective function [29]. Therefore, although the single layer model is not jointly convex in the input features Φ and model parameters U, it is convex in the equivalent reparameterization (K, Z) given Θ. This is the technique used by [25] for the output layer. Finally, note that Z satisfies the constraint Z ∈ Rm×nΦ := {UΦ : U ∈ Rm×n}, which we will write as Z ∈ RΦ for convenience. Clearly it is equivalent to Z ∈ RK.

2.3 Third Obstacle: Joint Input-Output Optimization

The third key obstacle is that each of the latent variables, Φ and Θ, simultaneously serve as the inputs and output targets for successive layers. Therefore, it is necessary to reformulate the connecting problem (2) so that it is jointly convex in all three components, U, Φ and Θ; and unfortunately (3) is not convex in Θ. Although this appears to be an insurmountable obstacle in general, [25] propose an exact reformulation in the case when Θ is boolean valued (consistent with the probabilistic assumptions underlying a PFN or RBM) by assuming the loss function satisfies an additional postulate.

Postulate 1.
L(Z, Θ) can be rewritten as Lu(Θ′Z, Θ′Θ) for Lu jointly convex in both arguments.

Intuitively, this assumption allows the loss to be parameterized in terms of the propensity matrix Θ′Z and the unnormalized output kernel Θ′Θ (hence the superscript of Lu). That is, the (i, j)-th component of Θ′Z stands for the linear response value of example j with respect to the label of example i. The j-th column therefore encodes the propensity of example j to all other examples. This reparameterization is critical because it bypasses the linear response value, and relies solely on the relationship between pairs of examples. The work [25] proposes a particular multi-label prediction loss that satisfies Postulate 1 for boolean target vectors θi; we propose an alternative below.

(The terms ‖W‖², ‖U‖² and ‖V‖² in (1) are regularizers, where the norm is the Frobenius norm. For clarity we have omitted the regularization parameters, relative weightings between different layers, and offset weights from the model. These components are obviously important in practice, however they play no key role in the technical development and removing them greatly simplifies the expressions.)

Using Postulate 1 and again letting Z = UΦ, one can then rewrite the objective in (2) as Lu(Θ′UΦ, Θ′Θ) + ½‖U‖². Now if we denote N := Θ′Θ and S := Θ′Z = Θ′UΦ (hence S ∈ Θ′RΦ = NRK), the formulation can be reduced to the following (see Appendix A):

(2) = min_S  Lu(S, N) + ½ tr(K†S′N†S).  (4)

Therefore, Postulate 1 allows (2) to be re-expressed in a form where the objective is jointly convex in the propensity matrix S and output kernel N. Given that N is a discrete but positive semidefinite matrix, a final relaxation is required to achieve a convex training problem.

Postulate 2. The domain of N = Θ′Θ can be relaxed to a convex set preserving sufficient structure.

Below we will introduce an improved scheme for such relaxation. Although these developments support a convex formulation of two-layer model training [25], they appear insufficient for deeper models. For example, by applying (3) and (4) to the three-layer model of Figure 1, one obtains

Lu1(S1, N1) + ½ tr(K†S1′N1†S1) + Lu2(S2, N2) + ½ tr(N1†S2′N2†S2) + L3(Z3, Y) + ½ tr(Z3N2†Z3′),

where N1 = Φ′Φ and N2 = Θ′Θ are two latent kernels imposed between the input and output. Unfortunately, this objective is not jointly convex in all variables, since tr(N1†S2′N2†S2) is not jointly convex in (N1, S2, N2), hence the approach of [25] cannot extend beyond a single hidden layer.

3 Multi-layer Convex Modeling via Normalized Kernels

Although obtaining a convex formulation for general multi-layer models appears to be a significant challenge, progress can be made by considering an alternative approach. The failure of the previous development in [25] can be traced back to (2), which eventually causes the coupled, non-convex regularization to occur between connected latent kernels. A natural response therefore is to reconsider the original regularization scheme, keeping in mind that the representer theorem must still be supported.
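As a numerical aside, the regularizer identity underlying the reduction to (4), namely that ½‖U‖² becomes ½ tr(K†S′N†S), can be checked for any U of the form ΘCΦ′ (the span assumed here for the optimal solution; the dimensions below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
h, n, t = 3, 4, 9
Theta = rng.standard_normal((h, t))   # stand-in latent outputs Θ
Phi = rng.standard_normal((n, t))     # stand-in latent inputs Φ
C = rng.standard_normal((t, t))

U = Theta @ C @ Phi.T                 # U restricted to the spans of Θ and Φ
K = Phi.T @ Phi                       # input kernel K = Φ'Φ
N = Theta.T @ Theta                   # unnormalized output kernel N = Θ'Θ
S = Theta.T @ U @ Phi                 # propensity matrix S = Θ'UΦ = NCK

pinv = np.linalg.pinv
lhs = np.trace(U @ U.T)                          # ‖U‖²
rhs = np.trace(pinv(K) @ S.T @ pinv(N) @ S)      # tr(K† S' N† S)
```

The two traces agree because NN†N = N and KK†K = K, which is exactly why the pseudo-inverses appear in (4).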
One such regularization scheme has been investigated in the clustering literature [30, 31], which suggests a reformulation of the connecting model (2) using value regularization [28]:

min_U  L(UΦ, Θ) + ½‖Θ′U‖².  (5)

Here ‖Θ′U‖² replaces ‖U‖² from (2). The significance of this reformulation is that it still admits the representer theorem, which implies that the optimal U must be of the form U = (ΘΘ′)†AΦ′ for some A ∈ Rm×n. Now, since Θ generally has full row rank (i.e. there are more examples than labels), one may execute a change of variables A = ΘB. Such a substitution leads to the regularizer ‖Θ′(ΘΘ′)†ΘBΦ′‖², which can be expressed in terms of the normalized output kernel [30]:

M := Θ′(ΘΘ′)†Θ.  (6)

The term (ΘΘ′)† essentially normalizes the spectrum of the kernel Θ′Θ, and it is obvious that all eigenvalues of M are either 0 or 1, i.e. M² = M [30]. The regularizer can be finally written as

‖MBΦ′‖² = tr(MBKB′M) = tr(MBKK†KB′M) = tr(SK†S′), where S := MBK.  (7)

It is easy to show S = Θ′Z = Θ′UΦ, which is exactly the propensity matrix. As before, to achieve a convex training formulation, additional structure must be postulated on the loss function, but now allowing convenient expression in terms of normalized latent kernels.

Postulate 3. The loss L(Z, Θ) can be written as Ln(Θ′Z, Θ′(ΘΘ′)†Θ) where Ln is jointly convex in both arguments.
Here we write Ln to emphasize the use of normalized kernels. Under Postulate 3, an alternative convex objective can be achieved for a local connecting model:

min_S  Ln(S, M) + ½ tr(SK†S′), where S ∈ MRK.  (8)

Crucially, this objective is now jointly convex in S, M and K; in comparison to (4), the normalization has removed the output kernel from the regularizer. The feasible region {(S, M, K) : M ⪰ 0, K ⪰ 0, S ∈ MRK} is also convex (see Appendix B). Applying (8) to the first two layers and (3) to the output layer, a fully convex objective for a multi-layer model (e.g., as in Figure 1) is obtained:

Ln1(S1, M1) + ½ tr(S1K†S1′) + Ln2(S2, M2) + ½ tr(S2M1†S2′) + L3(Z3, Y) + ½ tr(Z3M2†Z3′),  (9)

where S1 ∈ M1RK, S2 ∈ M2RM1, and Z3 ∈ RM2. (Clearly the first layer can still use (4) with an unnormalized output kernel N1, since its input X is observed.) All that remains is to design a convex relaxation of the domain of M (for Postulate 2) and to design the loss Ln (for Postulate 3).

3.1 Convex Relaxation of the Domain of Output Kernels M

Clearly from its definition (6), M has a non-convex domain in general. Ideally one should design convex relaxations for each domain of Θ. However, M exhibits some nice properties for any Θ:

M ⪰ 0, M ⪯ I, tr(M) = tr((ΘΘ′)†(ΘΘ′)) = rank(ΘΘ′) = rank(Θ).  (10)

Here I is the identity matrix, and we also use M ⪰ 0 to encode M′ = M. Therefore, tr(M) provides a convenient proxy for controlling the rank of the latent representation, i.e. the number of hidden nodes in a layer.
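The properties in (10) are easy to confirm numerically. The sketch below uses a multiclass indicator Θ (an illustrative assumption; the specific class assignment is arbitrary but chosen so every class appears):

```python
import numpy as np

t = 8
classes = np.array([0, 1, 2, 0, 1, 2, 0, 1])          # class label of each example
Theta = np.zeros((3, t))
Theta[classes, np.arange(t)] = 1.0                     # boolean indicator matrix Θ

M = Theta.T @ np.linalg.pinv(Theta @ Theta.T) @ Theta  # M = Θ'(ΘΘ')†Θ
```

Idempotence (M² = M) means every eigenvalue of M is 0 or 1, tr(M) equals rank(Θ), and in the multiclass case M is entrywise nonnegative with M1 = 1.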
Given a specified number of hidden nodes h, we may enforce tr(M) = h. The main relaxation introduced here is replacing the eigenvalue constraint λi(M) ∈ {0, 1} (implied by M² = M) with 0 ≤ λi(M) ≤ 1. Such a relaxation retains sufficient structure to allow, e.g., a 2-approximation of optimal clustering to be preserved even by only imposing spectral constraints [30]. Experimental results below further demonstrate that nesting preserves sufficient structure, even with relaxation, to capture relationships that cannot be recovered by shallower architectures.

More refined constraints can be included to better account for the domain of Θ. For example, if Θ expresses target values for a multiclass classification (i.e. Θij ∈ {0, 1}, Θ′1 = 1, where 1 is a vector of all ones), we further have Mij ≥ 0 and M1 = 1. If Θ corresponds to multilabel classification where each example belongs to exactly k (out of the h) labels (i.e. Θ ∈ {0, 1}h×t, Θ′1 = k1), then M can have negative elements, but the spectral constraint M1 = 1 still holds (see proof in Appendix C). So we will choose the domains for M1 and M2 in (9) to consist of the spectral constraints:

M := {0 ⪯ M ⪯ I : M1 = 1, tr(M) = h}.  (11)

3.2 A Jointly Convex Multi-label Loss for Normalized Kernels

An important challenge is to design an appropriate nonlinear loss to connect each layer of the model. Rather than conditional log-likelihood in a generative model, [25] introduced the idea of using a large-margin, multi-label loss between a linear response, z, and a boolean target vector, y ∈ {0, 1}h:

L̃(z, y) = max(1 − y + kz − 1(y′z)),  (12)

where 1 denotes the vector of all 1s and the max is taken over the components of the vector argument. Intuitively this encourages the responses on the active labels, y′z, to exceed k times the response of any inactive label, kzi, by a margin, where the implicit nonlinear transfer is a step function. Remarkably, this loss can be shown to satisfy Postulate 1 [25].

This loss can be easily adapted to the normalized case as follows. We first generalize the notion of margin to consider a “normalized label” (YY′)†y:

L(z, y) = max(1 − (YY′)†y + kz − 1(y′z)).

To obtain some intuition, consider the multiclass case where k = 1. In this case, YY′ is a diagonal matrix whose (i, i)-th element is the number of examples in class i. Dividing by this number allows the margin requirement to be weakened for popular labels, while more focus is shifted to less represented labels. For a given set of t paired input/output examples (Z, Y), the sum of the losses can then be compactly expressed as L(Z, Y) = Σj L(zj, yj) = τ(kZ − (YY′)†Y) + t − tr(Y′Z), where τ(Γ) := Σj maxi Γij. This loss can be shown to satisfy Postulate 3:

Ln(S, M) = τ(S − (1/k)M) + t − tr(S), where S = Y′Z and M = Y′(YY′)†Y.  (13)

This loss can be naturally interpreted using the remark following Postulate 1. It encourages that the propensity of example j with respect to itself, Sjj, should be higher than its propensity with respect to other examples, Sij, by a margin that is defined through the normalized kernel M. However, note this loss does not correspond to a linear transfer between layers, even in terms of the propensity matrix S or normalized output kernel M. As in all large margin methods, the initial loss (12) is a convex upper bound for an underlying discrete loss defined with respect to a step transfer.

4 Efficient Optimization

Efficient optimization for the multi-layer model (9) is challenging, largely due to the matrix pseudo-inverse.
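The agreement between the response-space form of the loss and its kernel-space form (13) can be checked numerically. The sketch below uses the multiclass case k = 1 with an illustrative class assignment; the equality relies on every class being represented in Y:

```python
import numpy as np

rng = np.random.default_rng(3)
m, t, k = 3, 6, 1                         # multiclass: exactly k = 1 active label
classes = np.array([0, 1, 2, 0, 1, 2])    # every class appears at least once
Y = np.zeros((m, t))
Y[classes, np.arange(t)] = 1.0
Z = rng.standard_normal((m, t))           # arbitrary linear responses

tau = lambda G: G.max(axis=0).sum()       # τ(Γ) = Σ_j max_i Γ_ij

# Response-space form: τ(kZ − (YY')†Y) + t − tr(Y'Z).
loss_z = tau(k * Z - np.linalg.pinv(Y @ Y.T) @ Y) + t - np.trace(Y.T @ Z)

# Kernel-space form (13), using only S = Y'Z and M = Y'(YY')†Y.
S = Y.T @ Z
M = Y.T @ np.linalg.pinv(Y @ Y.T) @ Y
loss_k = tau(S - M / k) + t - np.trace(S)
```

Only S and M enter the second form, which is what allows the loss to connect latent layers through their kernels alone.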
Fortunately, the constraints on M are all spectral, which makes it easier to apply conditional gradient (CG) methods [32]. This is much more convenient than the models based on unnormalized kernels [25], where the presence of both spectral and non-spectral constraints necessitated expensive algorithms such as the alternating direction method of multipliers [33].

(A simple derivation of (13) extends [25]: τ(kZ − (YY′)†Y) = max over Λ ∈ R+^{m×t} with Λ′1 = 1 of tr(Λ′(kZ − (YY′)†Y)) = max over Ω ∈ R+^{t×t} with Ω′1 = 1 of (1/k) tr(Ω′Y′(kZ − (YY′)†Y)) = τ(Y′Z − (1/k)M). Here the second equality follows because for any Λ ∈ R+^{m×t} satisfying Λ′1 = 1, there must be an Ω ∈ R+^{t×t} satisfying Ω′1 = 1 and Λ = YΩ/k.)

Algorithm 1: Conditional gradient algorithm to optimize f(M1, M2) for M1, M2 ∈ M.
1: Initialize M̃1 and M̃2 with some random matrices.
2: while s = 1, 2, . . . do
3:   Compute the gradients G1 = ∂f(M̃1, M̃2)/∂M1 and G2 = ∂f(M̃1, M̃2)/∂M2.
4:   Compute the new bases M1^s and M2^s by invoking oracle (15) with G1 and G2 respectively.
5:   Totally corrective update: min over α ∈ Δs, β ∈ Δs of f(Σ_{i=1}^s αi M1^i, Σ_{i=1}^s βi M2^i).
6:   Set M̃1 = Σ_{i=1}^s αi M1^i and M̃2 = Σ_{i=1}^s βi M2^i; break if stopping criterion is met.
7: return (M̃1, M̃2).

Denote the objective in (9) as g(M1, M2, S1, S2, Z3). The idea behind our approach is to optimize

f(M1, M2) := min over S1 ∈ M1RK, S2 ∈ M2RM1, Z3 ∈ RM2 of g(M1, M2, S1, S2, Z3)  (14)

by CG; see Algorithm 1 for details. We next demonstrate how each step can be executed efficiently.

Oracle problem in Step 4. This requires solving, given a gradient G (which is real symmetric),

max over M ∈ M of tr(−GM) ⇔ max over 0 ⪯ M1 ⪯ I, tr(M1) = h − 1 of tr(−G(HM1H + (1/t)11′)), where H = I − (1/t)11′.  (15)

Here we used Lemma 1 of [31]. By [34, Theorem 3.4], max over 0 ⪯ M1 ⪯ I, tr(M1) = h − 1 of tr(−HGHM1) = Σ_{i=1}^{h−1} λi, where λ1 ≥ λ2 ≥ . . . are the leading eigenvalues of −HGH. The maximum is attained at M1 = Σ_{i=1}^{h−1} vi vi′, where vi is the eigenvector corresponding to λi. The optimal solution to argmax over M ∈ M of tr(−GM) can then be recovered by Σ_{i=1}^{h−1} vi vi′ + (1/t)11′, which has low rank for small h.

Totally corrective update in Step 5. This is the most computationally intensive step of CG:

min over α ∈ Δs, β ∈ Δs of f(Σ_{i=1}^s αi M1^i, Σ_{i=1}^s βi M2^i),  (16)

where Δs stands for the s-dimensional probability simplex (entries sum to 1). If one can solve (16) efficiently (which also provides the optimal S1, S2, Z3 in (14) for the optimal α and β), then the gradient of f can also be obtained easily by Danskin's theorem (for Step 3 of Algorithm 1). However, the totally corrective update is expensive because, given α and β, each evaluation of the objective f itself requires an optimization over S1, S2, and Z3. Such a nested optimization can be prohibitive.

A key idea is to show that this totally corrective update can be accomplished with considerably improved efficiency through the use of block coordinate descent [35]. Taking into account the structure of the solution to the oracle, we denote

M1(α) := Σi αi M1^i = V1 D(α) V1′, and M2(β) := Σi βi M2^i = V2 D(β) V2′,  (17)

where D(α) = diag([α1 1′h, α2 1′h, . . .]′) and D(β) = diag([β1 1′h, β2 1′h, . . .]′). Denote

P(α, β, S1, S2, Z3) := g(M1(α), M2(β), S1, S2, Z3).  (18)

Clearly S1 ∈ M1(α)RK iff S1 = V1A1K for some A1, S2 ∈ M2(β)RM1(α) iff S2 = V2A2M1(α) for some A2, and Z3 ∈ RM2(β) iff Z3 = A3M2(β) for some A3. So (16) is equivalent to

min over α ∈ Δs, β ∈ Δs, A1, A2, A3 of P(α, β, V1A1K, V2A2M1(α), A3M2(β))  (19)
= Ln1(V1A1K, M1(α)) + ½ tr(V1A1KA1′V1′)  (20)
+ Ln2(V2A2M1(α), M2(β)) + ½ tr(V2A2M1(α)A2′V2′)  (21)
+ L3(A3M2(β), Y) + ½ tr(A3M2(β)A3′).  (22)

Thus we have eliminated all matrix pseudo-inverses. However, it is still expensive because the size of Ai depends on t. To simplify further, assume X′, V1 and V2 all have full column rank. (This assumption is valid provided the features in X are linearly independent, since the bases (eigenvectors) accumulated through all iterations so far are also independent. The only exception is the eigenvector (1/√t)1; but since α and β lie on a simplex, it always contributes a constant (1/t)11′ to M1(α) and M2(β).) Denote B1 = A1X′ (note K = X′X), B2 = A2V1, B3 = A3V2. Noting (17), the objective becomes

R(α, β, B1, B2, B3) := Ln1(V1B1X, V1D(α)V1′) + ½ tr(V1B1B1′V1′)  (23)
+ Ln2(V2B2D(α)V1′, V2D(β)V2′) + ½ tr(V2B2D(α)B2′V2′)  (24)
+ L3(B3D(β)V2′, Y) + ½ tr(B3D(β)B3′).  (25)

This problem is much easier to solve, since the size of Bi depends only on the number of input features, the number of nodes in the two latent layers, and the number of output labels. Due to the greedy nature of CG, the number of latent nodes is generally low.
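Returning briefly to the oracle step: (15) amounts to a partial eigendecomposition plus a rank-one centering term. The sketch below (a random symmetric stand-in for the gradient G; it assumes the leading h−1 eigenvalues of −HGH are positive, which holds generically) builds the maximizer and can be checked for feasibility in M:

```python
import numpy as np

rng = np.random.default_rng(4)
t, h = 12, 4                           # t examples, h hidden nodes (tr(M) = h)
A = rng.standard_normal((t, t))
G = (A + A.T) / 2                      # symmetric "gradient" matrix
H = np.eye(t) - np.ones((t, t)) / t    # centering matrix H = I - (1/t)11'

# Leading h-1 eigenvectors of -HGH give the centered part of the maximizer.
w, V = np.linalg.eigh(-H @ G @ H)      # eigenvalues in ascending order
Vh = V[:, -(h - 1):]                   # top h-1 eigenvectors
M = Vh @ Vh.T + np.ones((t, t)) / t    # oracle solution: rank h for small h
```

The solution satisfies 0 ⪯ M ⪯ I, M1 = 1 and tr(M) = h, and its objective value matches the eigenvalue-sum expression from [34, Theorem 3.4].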
So we can optimize R by block coordinate descent (BCD), i.e. alternating between:

1. Fix (α, β), and solve (B1, B2, B3) (unconstrained smooth optimization, e.g. by LBFGS).
2. Fix (B1, B2, B3), and solve (α, β) (e.g. by LBFGS with projection to the simplex).

BCD is guaranteed to converge to a critical point when Ln1, Ln2 and L3 are smooth. In practice, these losses can be made smooth by, e.g., approximating the max in (13) by a softmax. It is crucial to note that although each of the two steps is convex, R is not jointly convex in its variables. So in general, this alternating scheme can only produce a stationary point of R. Interestingly, we further show that any stationary point must provide a globally optimal solution to P in (18).

Theorem 1. Suppose (α, β, B1, B2, B3) is a stationary point of R with αi > 0 and βi > 0. Assume X′, V1 and V2 all have full column rank. Then it must be a globally optimal solution to R, and this (α, β) must be an optimal solution to the totally corrective update (16).

See the proof in Appendix D. It is noteworthy that the conditions αi > 0 and βi > 0 are trivial to meet, because CG is guaranteed to converge to the optimum if αi ≥ 1/s and βi ≥ 1/s at each step s.

5 Empirical Investigation

To investigate the potential of deep versus shallow convex training methods, and global versus local training methods, we implemented the approach outlined above for a three-layer model along with comparison methods. Below we use CVX3 and CVX2 to refer respectively to three and two-layer versions of the proposed model.
For comparison, SVM1 refers to a one-layer SVM; TS1a [37] and TS1b [38] refer to one-layer transductive SVMs; NET2 refers to a standard two-layer sigmoid neural network with hidden layer size chosen by cross-validation; and LOC3 refers to the proposed three-layer model with the exact (unrelaxed) objective trained by local optimization. In these evaluations, we followed a transductive set-up similar to that of [25]: a given set of data (X, Y) is divided into separate training and test sets, (XL, YL) and XU, where labels are only included for the training set. The training loss is then computed only on the training data, but the learned kernel matrices span the union of the data. For testing, the kernel responses on test data are used to predict output labels.

5.1 Synthetic Experiments

Our first goal was to compare the effective modeling capacity of a three versus a two-layer architecture given the convex formulations developed above. In particular, since the training formulation involves a convex relaxation of the normalized kernel domain M in (11), it is important to determine whether the representational advantages of a three versus a two-layer architecture are maintained. We conducted two sets of experiments designed to separate one-layer from two-layer or deeper models, and two-layer from three-layer or deeper models. Although separating two from one-layer models is straightforward, separating three from two-layer models is a subtler question. Here we considered two synthetic settings defined by basic functions over boolean features:

Parity:          y = x1 ⊕ x2 ⊕ . . . ⊕ xn,    (26)
Inner Product:   y = (x1 ∧ x_{m+1}) ⊕ (x2 ∧ x_{m+2}) ⊕ . . . ⊕ (xm ∧ xn), where m = n/2.    (27)

It is well known that Parity is easily computable by a two-layer linear-gate architecture but cannot be approximated by any one-layer linear-gate architecture on the same feature space [39].
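For concreteness, labeled data for the two targets in (26) and (27) can be generated as in the following sketch. This is illustrative only, not the authors' code; the zero-mean Gaussian feature noise with variance 0.3 follows the generation protocol described in this section.

```python
import numpy as np

def parity_labels(X):
    """y = x1 XOR x2 XOR ... XOR xn, i.e. the sum of bits mod 2."""
    return X.sum(axis=1) % 2

def inner_product_labels(X):
    """y = (x1 AND x_{m+1}) XOR ... XOR (xm AND xn), with m = n/2."""
    m = X.shape[1] // 2
    return (X[:, :m] & X[:, m:]).sum(axis=1) % 2

def make_dataset(label_fn, n_examples, n_features, noise_var=0.3, seed=None):
    """Draw boolean features uniformly at random, label them with label_fn,
    then corrupt the features with zero-mean Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_examples, n_features))
    y = label_fn(X)
    X_noisy = X + rng.normal(0.0, np.sqrt(noise_var), size=X.shape)
    return X_noisy, y

# e.g. 200 training examples over n = 8 boolean features, as in Section 5.1
X_train, y_train = make_dataset(inner_product_labels, 200, 8, seed=0)
```

Labels are computed on the clean bits before noise is added, so the learning problem is to recover the boolean structure from noisy real-valued features.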
The IP problem is motivated by a fundamental result in the circuit complexity literature: any small-weights threshold circuit of depth 2 requires size exp(Ω(n)) to compute (27) [39, 40].

5 Technically, for BCD to converge to a critical point, each block optimization needs to have a unique optimal solution. To ensure uniqueness, we used a method equivalent to the proximal method in Proposition 7 of [36].

(a) Synthetic results: Parity data (scatter plot of CVX2 versus CVX3 test error).

(b) Real results: Test error % (± stdev), 100/100 labeled/unlabeled.

         CIFAR      MNIST      USPS       COIL       Letter
TS1a     30.7±4.2   16.3±1.5   12.7±1.2   16.0±2.0    5.7±2.0
TS1b     26.0±6.5   16.0±2.0   11.0±1.7   20.0±3.6    5.0±1.0
SVM1     33.3±1.9   18.3±0.5   12.7±0.2   16.3±0.7    7.0±0.3
NET2     30.7±1.7   15.3±1.7   12.7±0.4   15.3±1.4    5.3±0.5
CVX2     27.7±5.5   12.7±3.2    9.7±3.1   14.0±3.6    5.7±2.9
LOC3     36.0±1.7   22.0±1.7   12.3±1.1   17.7±2.2   11.3±0.2
CVX3     23.3±0.5   13.0±0.3    9.0±0.9    9.0±0.3    5.7±0.2

(c) Synthetic results: IP data (scatter plot of CVX2 versus CVX3 test error).

(d) Real results: Test error % (± stdev), 200/200 labeled/unlabeled.

         CIFAR      MNIST      USPS       COIL       Letter
TS1a     32.0±2.6   10.7±3.1   10.3±0.6   13.7±4.0    3.8±0.3
TS1b     26.0±3.3   10.0±3.5   11.0±1.3   18.9±2.6    4.0±0.5
SVM1     32.3±1.6   12.3±1.4   10.3±0.1   14.7±1.3    4.8±0.5
NET2     30.7±0.5   11.3±1.3   11.2±0.5   14.5±0.6    4.3±0.1
CVX2     23.3±3.5    8.2±0.6    7.0±1.3    8.7±3.3    4.5±0.9
LOC3     28.2±2.3   12.7±0.6    8.0±0.1   12.3±0.9    7.3±1.1
CVX3     19.2±0.9    6.8±0.4    6.2±0.7    7.7±1.1    3.0±0.2

Figure 2: Experimental results (synthetic data: larger dots mean repetitions fall on the same point).

To generate data from these models, we set the
number of input features to n = 8 (instead of n = 2 as in [25]), then generated 200 examples for training and 100 examples for testing; for each example, the features x_i were drawn from {0, 1} with equal probability. Each x_i was then corrupted independently by Gaussian noise with zero mean and variance 0.3. The experiments were repeated 100 times, and the resulting test errors of the two models are plotted in Figure 2. Figure 2(c) clearly shows that CVX3 is able to capture the structure of the IP problem much more effectively than CVX2, as the theory suggests for such architectures. In almost every repetition, CVX3 yields a lower (often much lower) test error than CVX2. Even on the Parity problem (Figure 2(a)), CVX3 generally produces lower error, although its advantage is not as significant. This is also consistent with theoretical analysis [39, 40], which shows that IP is harder to model than Parity.

5.2 Experiments on Real Data

We also conducted an empirical investigation on some real data sets. Here we tried to replicate the results of [25] on similar data sets: USPS and COIL from [41], Letter from [42], MNIST, and CIFAR-100 from [43]. Similar to [23], we performed an optimistic model selection for each method on an initial sample of t training and t test examples; then, with the parameters fixed, the experiments were repeated 5 times on independently drawn sets of t training and t test examples from the remaining data. The results in Table 2(b) and Table 2(d) show that CVX3 is able to systematically reduce the test error of CVX2.
This suggests that the advantage of deeper modeling does indeed arise from enhanced representation ability, and not merely from an enhanced ability to escape local minima or traverse plateaus, since neither exists in these cases.

6 Conclusion

We have presented a new formulation of multi-layer training that can accommodate an arbitrary number of nonlinear layers while maintaining a jointly convex training objective. Accurate learning of additional layers, when required, appears to demonstrate a marked advantage over shallower architectures, even when those models can be trained to optimality. Aside from further improvements in algorithmic efficiency, an interesting direction for future investigation is to capture unsupervised "stage-wise" training principles via auxiliary autoencoder objectives within a convex framework, rather than treating input reconstruction as a mere heuristic training device.

References

[1] G. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. ASLP, 20(1):30–42, 2012.
[2] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS. 2012.
[3] Q. Le, M. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In ICML. 2012.
[4] R. Socher, C. Lin, A. Ng, and C. Manning. Parsing natural scenes and natural language with recursive neural networks. In ICML. 2011.
[5] Y. Bengio. Learning deep architectures for AI. Found. Trends in Machine Learning, 2:1–127, 2009.
[6] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE PAMI, 35(8):1798–1828, 2013.
[7] G. Tesauro. Temporal difference learning and TD-Gammon.
CACM, 38(3), 1995.
[8] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1:541–551, 1989.
[9] M. Gori and A. Tesi. On the problem of local minima in backpropagation. IEEE PAMI, 14:76–86, 1992.
[10] D. Erhan, Y. Bengio, A. Courville, P. Manzagol, and P. Vincent. Why does unsupervised pre-training help deep learning? JMLR, 11:625–660, 2010.
[11] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML. 2013.
[12] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7), 2006.
[13] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11(3):3371–3408, 2010.
[14] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors, 2012. ArXiv:1207.0580.
[15] K. Hoeffgen, H. Simon, and K. Van Horn. Robust trainability of single neurons. JCSS, 52:114–125, 1995.
[16] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Bounds for learning deep representations. In ICML. 2014.
[17] R. Livni, S. Shalev-Shwartz, and O. Shamir. An algorithm for training polynomial networks, 2014. ArXiv:1304.7045v2.
[18] R. Gens and P. Domingos. Discriminative learning of sum-product networks. In NIPS. 2012.
[19] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. JMAA, 33:82–95, 1971.
[20] B. Schoelkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
[21] Y. Cho and L. Saul. Large margin classification in infinite neural networks. Neural Comput., 22, 2010.
[22] J. Zhuang, I. Tsang, and S. Hoi. Two-layer multiple kernel learning. In AISTATS. 2011.
[23] A. Joulin and F.
Bach. A convex relaxation for weakly supervised classifiers. In ICML. 2012.
[24] A. Joulin, F. Bach, and J. Ponce. Multi-class cosegmentation. In CVPR. 2012.
[25] O. Aslan, H. Cheng, D. Schuurmans, and X. Zhang. Convex two-layer modeling. In NIPS. 2013.
[26] R. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113, 1992.
[27] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML. 2010.
[28] R. Rifkin and R. Lippert. Value regularization and Fenchel duality. JMLR, 8:441–479, 2007.
[29] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73, 2008.
[30] J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM J. on Optimization, 18:186–205, 2007.
[31] H. Cheng, X. Zhang, and D. Schuurmans. Convex relaxations of Bregman clustering. In UAI. 2013.
[32] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML. 2013.
[33] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends in Machine Learning, 3(1):1–123, 2010.
[34] M. Overton and R. Womersley. Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Mathematical Programming, 62:321–357, 1993.
[35] F. Dinuzzo, C. S. Ong, P. Gehler, and G. Pillonetto. Learning output kernels with block coordinate descent. In ICML. 2011.
[36] L. Grippo and M. Sciandrone. On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Operations Research Letters, 26:127–136, 2000.
[37] V. Sindhwani and S. Keerthi. Large scale semi-supervised linear SVMs. In SIGIR. 2006.
[38] T. Joachims.
Transductive inference for text classification using support vector machines. In ICML. 1999.
[39] A. Hajnal, W. Maass, P. Pudlák, M. Szegedy, and G. Turán. Threshold circuits of bounded depth. J. of Computer & System Sciences, 46(2):129–154, 1993.
[40] A. A. Razborov. On small depth threshold circuits. In Algorithm Theory (SWAT 92). 1992.
[41] http://olivier.chapelle.cc/ssl-book/benchmarks.html
[42] http://archive.ics.uci.edu/ml/datasets
[43] http://www.cs.toronto.edu/~kriz/cifar.html