{"title": "Bayesian Nonlinear Support Vector Machines and Discriminative Factor Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 1754, "page_last": 1762, "abstract": "A new Bayesian formulation is developed for nonlinear support vector machines (SVMs), based on a Gaussian process and with the SVM hinge loss expressed as a scaled mixture of normals. We then integrate the Bayesian SVM into a factor model, in which feature learning and nonlinear classifier design are performed jointly; almost all previous work on such discriminative feature learning has assumed a linear classifier. Inference is performed with expectation conditional maximization (ECM) and Markov Chain Monte Carlo (MCMC). An extensive set of experiments demonstrate the utility of using a nonlinear Bayesian SVM within discriminative feature learning and factor modeling, from the standpoints of accuracy and interpretability", "full_text": "Bayesian Nonlinear Support Vector Machines and\n\nDiscriminative Factor Modeling\n\nRicardo Henao, Xin Yuan and Lawrence Carin\nDepartment of Electrical and Computer Engineering\n\nDuke University, Durham, NC 27708\n\n{r.henao,xin.yuan,lcarin}@duke.edu\n\nAbstract\n\nA new Bayesian formulation is developed for nonlinear support vector machines\n(SVMs), based on a Gaussian process and with the SVM hinge loss expressed as\na scaled mixture of normals. We then integrate the Bayesian SVM into a factor\nmodel, in which feature learning and nonlinear classi\ufb01er design are performed\njointly; almost all previous work on such discriminative feature learning has as-\nsumed a linear classi\ufb01er.\nInference is performed with expectation conditional\nmaximization (ECM) and Markov Chain Monte Carlo (MCMC). 
An extensive set of experiments demonstrates the utility of using a nonlinear Bayesian SVM within discriminative feature learning and factor modeling, from the standpoints of accuracy and interpretability.

1 Introduction

There has been significant interest recently in developing discriminative feature-learning models, in which the labels are utilized within a max-margin classifier. For example, such models have been employed in the context of topic modeling [1], where features are the proportions of topics associated with a given document. Such topic models may be viewed as a stochastic matrix factorization of a matrix of counts. The max-margin idea has also been extended to factorization of more general matrices, in the context of collaborative prediction [2, 3]. These studies have demonstrated that the use of the max-margin idea, which is closely related to support vector machines (SVMs) [4], often yields better results than designing discriminative feature-learning models via a probit or logit link. This is particularly true for high-dimensional data (e.g., a corpus characterized by a large dictionary of words), as in that case the features extracted from the high-dimensional data may significantly outweigh the importance of the small number of labels in the likelihood. Margin-based classifiers appear to be attractive in mitigating this challenge [1].

Joint matrix factorization, feature learning and classifier design are well aligned with hierarchical models. The Bayesian formalism is well suited to such models, and much of the aforementioned research has been constituted in a Bayesian setting. An important aspect of this prior work utilizes the recent recognition that the SVM loss function may be expressed as a location-scale mixture of normals [5]. This is attractive for joint feature learning and classifier design, which is leveraged in this paper.
However, the Bayesian SVM setup developed in [5] assumed a linear classifier decision function, which is limiting for sophisticated data, for which a nonlinear classifier is more effective.

The first contribution of this paper concerns the extension of the work in [5] for consideration of a kernel-based, nonlinear SVM, and to place this within a Bayesian scaled-mixture-of-normals construction, via a Gaussian process (GP) prior. The second contribution is a generalized formulation of this mixture model, for both the linear and nonlinear SVM, which is important within the context of Markov chain Monte Carlo (MCMC) inference, yielding improved mixing. This new construction generalizes the form of the SVM loss function.

The manner in which we employ a GP in this paper is distinct from previous work [6, 7, 8], in that we explicitly impose a max-margin-based SVM cost function. In previous GP-based classifier design, all data contributed to the learned classification function, while here a relatively small set of support vectors play a dominant role. This identification of support vectors is of interest when the number of training samples is large (simplifying subsequent prediction). The key reason to invoke a Bayesian form of the SVM [5], instead of applying the widely studied optimization-based SVM [4], is that the former may be readily integrated into sophisticated hierarchical models. As an example of that, we here consider discriminative factor modeling, in which the factor scores are employed within a nonlinear SVM. We demonstrate the advantage of this in our experiments, with nonlinear discriminative factor modeling for high-dimensional gene-expression data.

We present MCMC and expectation conditional maximization inference for the model. Conditional conjugacy of the hierarchical model yields simple and efficient computations.
Hence, while the nonlinear SVM is significantly more flexible than its linear counterpart, computations are only modestly more complicated. Details on the computational approaches, insights on the characteristics of the model, and demonstration on real data constitute a third contribution of this paper.

2 Mixture Representation for SVMs

Previous model for linear SVM   Assume N observations {x_n, y_n}_{n=1}^N, where x_n ∈ R^d is a feature vector and y_n ∈ {−1, 1} is its label. The support vector machine (SVM) seeks to find a classification function f(x) by solving a regularized learning problem

  argmin_{f(x)} { γ Σ_{n=1}^N max(1 − y_n f(x_n), 0) + R(f(x)) } ,   (1)

where max(1 − y_n f(x_n), 0) is the hinge loss, R(f(x)) is a regularization term that controls the complexity of f(x), and γ is a tuning parameter controlling the tradeoff between error penalization and the complexity of the classification function. The decision boundary is defined as {x : f(x) = 0} and sign(f(x)) is the decision rule, classifying x as either −1 or 1 [4].

Recently, [5] showed that for the linear classifier f(x) = β^⊤x, minimizing (1) is equivalent to estimating the mode of the pseudo-posterior of β

  p(β|X, y, γ) ∝ Π_{n=1}^N L(y_n|x_n, β, γ) p(β|·) ,   (2)

where y = [y_1 . . . y_N]^⊤, X = [x_1 . . . x_N], L(y_n|x_n, β, γ) is the pseudo-likelihood function, and p(β|·) is the prior distribution for the vector of coefficients β. Choosing β to maximize the log of (2) corresponds to (1), where the prior is associated with R(f(x)). In [5] it was shown that L(y_n|x_n, β, γ) admits a location-scale mixture of normals representation by introducing latent variables λ_n, such that

  L(y_n|x_n, β, γ) = e^{−2γ max(1 − y_n β^⊤ x_n, 0)} = ∫_0^∞ (√γ / √(2πλ_n)) exp( −(1 + λ_n − y_n β^⊤ x_n)² / (2 γ^{−1} λ_n) ) dλ_n .   (3)

Expression (2) is termed a pseudo-posterior because its likelihood term is unnormalized with respect to y_n. Note that an improper flat prior is imposed on λ_n.

The original formulation of [5] has the tuning parameter γ as part of the prior distribution of β, while here in (3) it is included instead in the likelihood. This is done because (i) it puts λ_n and the regularization term γ together, and (ii) it allows more freedom in the choice of the prior for β. Additionally, it has an interesting interpretation, in that the SVM loss function behaves like a global-local shrinkage distribution [9]. Specifically, γ^{−1} corresponds to a "global" scaling of the variance, and λ_n represents the "local" scaling for component n. The {λ_n} define the relative variances for each of the N data, and γ^{−1} provides a global scaling.

One of the benefits of a Bayesian formulation for SVMs is that we can flexibly specify the behavior of β while being able to adaptively regularize it by specifying a prior p(γ) as well. For instance, [5] gave three examples of prior distributions for β: Gaussian, Laplace, and spike-slab.

We can extend the results of [5] to a slightly more general loss function, by imposing a proper prior for the latent variables λ_n. In particular, by specifying λ_n ~ Exp(γ_0) and letting u_n = 1 − y_n β^⊤ x_n,

  L(y_n|x_n, β, γ) = ∫_0^∞ (γ_0 √γ / √(2πλ_n)) e^{−(γ/2)(u_n + λ_n)²/λ_n} e^{−γ_0 λ_n} dλ_n = (γ_0/c) e^{−γ(c|u_n| + u_n)} ,   (4)

where c = √(1 + 2γ_0 γ^{−1}) > 1.
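The equality in (4) can also be checked numerically, independent of the proof: integrate the normal-exponential mixture over λ_n and compare against the closed form. The sketch below is our own illustrative Python (function names are not from the paper), using a simple midpoint rule:

```python
import math

def skewed_laplace_closed_form(u, gamma, gamma0):
    """Closed form of Eq. (4): (gamma0/c) * exp(-gamma*(c*|u| + u))."""
    c = math.sqrt(1.0 + 2.0 * gamma0 / gamma)
    return (gamma0 / c) * math.exp(-gamma * (c * abs(u) + u))

def skewed_laplace_by_quadrature(u, gamma, gamma0, lam_max=20.0, n=200_000):
    """Midpoint-rule integral of N(u | -lam, lam/gamma) * Exp(lam | gamma0) over lam."""
    total, dl = 0.0, lam_max / n
    for i in range(n):
        lam = (i + 0.5) * dl  # midpoints avoid the lam = 0 endpoint
        normal = math.sqrt(gamma / (2.0 * math.pi * lam)) \
            * math.exp(-gamma * (u + lam) ** 2 / (2.0 * lam))
        total += normal * gamma0 * math.exp(-gamma0 * lam) * dl
    return total
```

For example, with γ = 2, γ_0 = 1 and u_n = ±0.5 the two expressions agree to several decimal places; letting γ_0 → 0 recovers the unnormalized hinge pseudo-likelihood of (3), up to the γ_0 factor.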
The proof relies (see Supplementary Material) on the identity ∫_0^∞ a (2πλ)^{−1/2} exp{−(a²λ + b²λ^{−1})/2} dλ = e^{−|ab|} [10]. From (4) we see that as γ_0 → 0 we recover (3), by noting that 2 max(u_n, 0) = |u_n| + u_n. In general we may use the prior λ_n ~ Ga(a_λ, γ_0), with a_λ = 1 for the exponential distribution. In the next section we discuss other choices for a_λ. This means that the proposed likelihood is no longer equivalent to the hinge loss but to a more general loss, termed below a skewed Laplace distribution.

Skewed Laplace distribution   We can write the likelihood function in (4) in terms of u_n as

  L(u_n|γ, γ_0) = ∫_0^∞ N(u_n| −λ_n, γ^{−1}λ_n) Exp(λ_n|γ_0) dλ_n = (γ_0/c) e^{−γ(c+1)u_n} if u_n ≥ 0 , and (γ_0/c) e^{−γ(c−1)|u_n|} if u_n < 0 ,   (5)

which corresponds to a Laplace distribution with negative skewness, denoted sLa(u_n|γ, γ_0). Unlike the density derived from the hinge loss (γ_0 → 0), this density is properly normalized, thus it corresponds to a valid probability density function. For the special case γ_0 = 0, the integral diverges, hence the normalization constant does not exist, which stems from exp(−2γ max(u_n, 0)) being constant for −∞ < u_n < 0.

From (5) we see that sLa(u_n|γ, γ_0) can be represented either as a mixture of normals or a mixture of exponentials. Other properties of the distribution, such as its moments, can be obtained using the results for general asymmetric Laplace distributions in [11].
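The normalization claim for (5) can be confirmed directly: integrating the two exponential branches gives (γ_0/c)[1/(γ(c+1)) + 1/(γ(c−1))], which simplifies to 1 since c² − 1 = 2γ_0/γ. A minimal sketch (our own function names):

```python
import math

def sla_density(u, gamma, gamma0):
    """Two-branch skewed Laplace density of Eq. (5)."""
    c = math.sqrt(1.0 + 2.0 * gamma0 / gamma)
    if u >= 0:
        return (gamma0 / c) * math.exp(-gamma * (c + 1.0) * u)
    return (gamma0 / c) * math.exp(-gamma * (c - 1.0) * abs(u))

def sla_total_mass(gamma, gamma0):
    """Exact integral of the two branches; algebra gives exactly 1."""
    c = math.sqrt(1.0 + 2.0 * gamma0 / gamma)
    return (gamma0 / c) * (1.0 / (gamma * (c + 1.0)) + 1.0 / (gamma * (c - 1.0)))
```

Note the density decays at rate γ(c − 1) on the negative side and γ(c + 1) on the positive side, which is the negative skewness discussed next.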
Examining (5) we can gain some intuition about the behavior of the likelihood function for the classification problem: (i) When y_n β^⊤ x_n = 1, λ_n = 0 and x_n lies on the margin boundary. (ii) When y_n β^⊤ x_n > 1, x_n is correctly classified, outside the margin, and |1 − y_n β^⊤ x_n| is exponential with rate γ(c − 1). (iii) x_n is correctly classified but lies inside the margin when 0 < y_n β^⊤ x_n < 1, and x_n is misclassified when y_n β^⊤ x_n < 0; in both cases, 1 − y_n β^⊤ x_n is exponential with rate γ(c + 1). (iv) Finally, if y_n β^⊤ x_n = 0, x_n lies on the decision boundary.

Since c + 1 > c − 1 for every c > 1, the distribution for case (ii) decays slower than the distribution for case (iii). Alternatively, in terms of the loss function, observations satisfying (iii) get penalized more than those satisfying (ii). In the limiting case γ_0 → 0 we have c → 1, and case (ii) is not penalized at all, recovering the behavior of the hinge loss. In the SVM literature, an observation x_n is called a support vector if it satisfies cases (i) or (iii). In the latter case, λ_n is the distance from y_n β^⊤ x_n to the margin boundary [4]. The key thing that the Exp(γ_0) prior imposes on λ_n, relative to the flat prior on λ_n ∈ [0, ∞), is that it constrains λ_n not to be too large (discouraging y_n β^⊤ x_n ≫ 1 for correct classifications, which is even more relevant for nonlinear SVMs); we discuss this further below.

Extension to nonlinear SVM   We now assume that the decision function f(x) is drawn from a zero-mean Gaussian process GP(0, k(x, ·, θ)), with kernel parameters θ. Evaluated at the N points at which we have data, f ~ N(0, K), where K is an N × N covariance matrix with entries k_ij = k(x_i, x_j, θ) for i, j ∈ {1, . . . , N} [7]; f = [f_1 . . .
f_N]^⊤ ∈ R^N corresponds to the continuous f(x) evaluated at {x_n}_{n=1}^N. Together with (5), for u_n = 1 − y_n f_n, where f_n = f(x_n), the full prior specification for the nonlinear SVM is

  f ~ N(0, K) ,  λ_n ~ Exp(γ_0) ,  γ ~ Ga(a_0, b_0) .   (6)

It is straightforward to prove that the equality in (5) holds for f_n in place of β^⊤ x_n, as in (6).

For nonlinear SVMs as above, being able to set γ_0 > 0 is particularly beneficial. It prevents f_n from being arbitrarily large (hence preventing 1 − y_n f_n ≪ 0). This implies that isolated observations far away from the decision boundary (even when correctly classified during learning) tend to be support vectors in a nonlinear SVM, yielding more conservative learned nonlinear decision boundaries. Figure 1 shows examples of log N(1 − y_n f_n; −λ_n, γ^{−1}λ_n) Exp(λ_n; γ_0) for γ = 100 and γ_0 ∈ {0.01, 100}. The vertical lines denote the margin boundary (y_n f_n = 1) and the decision boundary (y_n f_n = 0). We see that when γ_0 is small, the density has a very pronounced negative skewness (as in the hinge loss of the original SVM), whereas when γ_0 is large, the density tends to have a more symmetric shape.

3 Inference

We wish to compute the posterior p(f, λ, γ|y, X), where λ = [λ_1 . . . λ_N]^⊤.
We describe and have implemented three inference procedures: Markov chain Monte Carlo (MCMC), a point estimate via expectation-conditional maximization (ECM), and a GP approximation for fast inference.

[Figure 1]

Figure 1: Examples of log N(1 − y_n f_n; −λ_n, γ^{−1}λ_n) Exp(λ_n; γ_0) for γ = 100 and γ_0 = 0.01 (left) and γ_0 = 100 (right). The vertical lines denote the margin boundary (y_n f_n = 1) and the decision boundary (y_n f_n = 0).

MCMC   Inference is implemented by repeatedly sampling from the conditional posterior of parameters in (6). Conditional conjugacy allows us to express the following distributions in closed form:

  f|y, λ, γ ~ N(m, S) ,  m = γ S Y Λ^{−1}(1 + λ) ,  S = γ^{−1} K (K + γ^{−1}Λ)^{−1} Λ ,
  λ_n^{−1}|f_n, y_n, γ ~ IG( √(1 + 2γ_0 γ^{−1}) / |1 − y_n f_n| , γ + 2γ_0 ) ,
  γ|y, f, λ ~ Ga( a_0 + N/2 , b_0 + (1/2) ε^⊤ Λ^{−1} ε ) ,   (7)

where Λ = diag(λ), Y = diag(y), ε = 1 + λ − Yf, and IG(μ, γ) is the inverse Gaussian distribution with parameters μ and γ [10].

In MCMC, γ_0 plays a crucial role, because it controls the prior variance of the latent variables λ_n, thus greatly improving mixing, particularly that of γ. We also verified empirically that for small values of γ_0, γ is consistently underestimated.
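The conditionals in (7) map directly onto a Gibbs sweep. The following is our own illustrative NumPy sketch (not the authors' Matlab implementation), using the standard Michael-Schucany-Haas transform to draw from the inverse Gaussian:

```python
import numpy as np

def sample_inverse_gaussian(mu, shape, rng):
    """One IG(mu, shape) draw via the Michael-Schucany-Haas transform."""
    nu = rng.standard_normal() ** 2
    x = mu + mu * mu * nu / (2.0 * shape) \
        - (mu / (2.0 * shape)) * np.sqrt(4.0 * mu * shape * nu + (mu * nu) ** 2)
    return x if rng.random() <= mu / (mu + x) else mu * mu / x

def gibbs_sweep(K, y, f, lam, gamma, gamma0=0.1, a0=1.0, b0=1.0, rng=None):
    """One sweep over the closed-form conditionals in Eq. (7)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    N = len(y)
    # lambda_n^{-1} ~ IG(sqrt(1 + 2*gamma0/gamma)/|1 - y_n f_n|, gamma + 2*gamma0)
    for n in range(N):
        u = 1.0 - y[n] * f[n]
        mu = np.sqrt(1.0 + 2.0 * gamma0 / gamma) / max(abs(u), 1e-8)
        lam[n] = 1.0 / sample_inverse_gaussian(mu, gamma + 2.0 * gamma0, rng)
    # gamma ~ Ga(a0 + N/2, b0 + eps^T Lam^{-1} eps / 2), with eps = 1 + lam - Y f
    eps = 1.0 + lam - y * f
    gamma = rng.gamma(a0 + 0.5 * N, 1.0 / (b0 + 0.5 * np.sum(eps ** 2 / lam)))
    # f ~ N(m, S), with S = gamma^{-1} K (K + gamma^{-1} Lam)^{-1} Lam
    S = K @ np.linalg.solve(K + np.diag(lam) / gamma, np.diag(lam)) / gamma
    m = gamma * S @ (y * (1.0 + lam) / lam)
    f = rng.multivariate_normal(m, 0.5 * (S + S.T) + 1e-9 * np.eye(N))
    return f, lam, gamma
```

The small jitter and symmetrization on S are numerical safeguards; γ_0 defaults to 0.1 as fixed in the text.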
In practice we fix γ_0 = 0.1; however, a conjugate (gamma) prior exists, and sampling from its conditional posterior is straightforward if desired.

The parameters θ of the covariance function in the GP require Metropolis-Hastings type algorithms, as in most cases no closed form for their conditional posterior is available. However, the problem is relatively well studied. We have found that slice sampling methods [12], in particular the surrogate data sampler of [13], work well in practice, and are employed here.

For the case of SVMs, MCMC is naturally important as a way of quantifying the uncertainty of the parameters of the model. Further, it allows us to use the hierarchy in (6) as a building block in more sophisticated models, or to bring more flexibility to f through specialized prior specifications. As an example of this, Section 5 describes a specification for a nonlinear discriminative factor model.

ECM   The expectation-conditional maximization algorithm is a generalization of the expectation-maximization (EM) algorithm. It can be used when there are multiple parameters that need to be estimated [14]. From (6) we identify f and γ as the parameters to be estimated, and λ_n as the latent variables. The Q function in EM-style algorithms is the complete-data log-posterior, where expectations are taken w.r.t. the posterior distribution evaluated at the current value of the parameter of interest.
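Before the updates are derived formally, it helps to see their shape in code. The sketch below is our own NumPy rendering of the resulting ECM iteration (with γ_0 = 0, as used later for ECM); for ⟨λ_n⟩ it uses the inverse-Gaussian identity E[1/X] = 1/μ + 1/shape for X ~ IG(μ, shape):

```python
import numpy as np

def ecm_iteration(K, y, f, gamma, gamma0, a0, b0):
    """One ECM iteration: E-step expectations, then conditional maximization."""
    # E-step: <lambda_n^{-1}> as in Eq. (8); <lambda_n> via E[1/X] = 1/mu + 1/shape
    u = 1.0 - y * f
    s = np.sqrt(1.0 + 2.0 * gamma0 / gamma)
    inv_lam = s / np.maximum(np.abs(u), 1e-10)           # <lambda_n^{-1}>
    lam = 1.0 / inv_lam + 1.0 / (gamma + 2.0 * gamma0)   # <lambda_n>
    # CM-step for f: f = K (K + gamma^{-1} <Lam>)^{-1} Y (1 + <lam>)
    N = len(y)
    f_new = K @ np.linalg.solve(K + np.diag(lam) / gamma, y * (1.0 + lam))
    # CM-step for gamma (mode of the gamma conditional under the expectations)
    u_new = 1.0 - y * f_new
    denom = b0 + 0.5 * np.sum(inv_lam * u_new ** 2 + 2.0 * u_new + lam)
    gamma_new = (a0 - 1.0 + 0.5 * N) / denom
    return f_new, gamma_new
```

Iterating this on a small symmetric toy problem keeps sign(f) matched to the labels; the denominator is always positive since ⟨λ_n^{−1}⟩u² + ⟨λ_n⟩ + 2u ≥ 2|u| + 2u ≥ 0.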
From (7) we see that λ_n appears in the conditional posterior p(f|y, K, λ, γ) in first-order terms, thus we can write

  ⟨λ_n^{−1}⟩ = E[λ_n^{−1}|y_n, f_n^{(i)}, γ^{(i)}] = √(1 + 2γ_0 (γ^{(i)})^{−1}) |u_n^{(i)}|^{−1} ,   (8)

where f_n^{(i)} and γ^{(i)} are the estimates of f_n and γ at the i-th iteration, and u_n^{(i)} = 1 − y_n f_n^{(i)}. From (7) and (8) we can obtain the EM updates: f^{(i+1)} = K(K + (γ^{(i)})^{−1}⟨Λ⟩)^{−1} Y(1 + ⟨λ⟩) and

  γ^{(i+1)} = ( a_0 − 1 + N/2 ) ( b_0 + (1/2) Σ_{n=1}^N [ ⟨λ_n^{−1}⟩ (u_n^{(i+1)})² + 2u_n^{(i+1)} + ⟨λ_n⟩ ] )^{−1} .

In the ECM setting, learning the parameters of the covariance function is not as straightforward as in MCMC. However, we can borrow from the GP literature [7] and use the fact that we can marginalize f while conditioning on λ and γ:

  Z(y, X, λ, γ, θ) = N( Y(1 + λ), K + γ^{−1}Λ ) .   (9)

Note that K is a function of X and θ. Estimation of θ is done by maximizing log Z(y, X, λ, γ, θ). For this we need only compute the partial derivatives of (9) w.r.t. θ, and then use a gradient-based optimizer. This is commonly known as Type II maximum likelihood (ML-II) [7]. In practice we alternate between EM updates for {f, γ} and θ updates for a pre-specified number of iterations (typically the model converges after 20 iterations).

Speeding up inference   Perhaps one of the best-known shortcomings of GPs is that their cubic complexity is prohibitive for large-scale problems. However, there is an extensive literature on approximations for fast GP models [15]. Here we use the Fully Independent Training Conditional (FITC) approximation [16], as it offers an attractive balance between complexity and performance [15]. The basic idea behind FITC is to assume that f is generated i.i.d.
from pseudo-inputs {v_m}_{m=1}^M via f_m ∈ R^M such that f_m ~ N(0, K_mm), where K_mm is an M × M covariance matrix. Specifically, from (5) we have

  p(u|f_m) = Π_{n=1}^N p(u_n|f_m) = N( K_nm K_mm^{−1} f_m , diag(K − Q_nn) + γ^{−1}Λ ) ,

where u = 1 − Yf, K_mn is the cross-covariance matrix between {v_m}_{m=1}^M and {x_n}_{n=1}^N, and Q_nn = K_nm K_mm^{−1} K_mn. If we marginalize out f_m, then

  Z(y, X, λ, γ, θ) = N( Y(1 + λ), Q_nn + diag(K − Q_nn) + γ^{−1}Λ ) .   (10)

Note that if we drop the diag(·) term in (10), due to the i.i.d. assumption for f, we recover the full GP marginal from (9). Similar to the ML-II approach previously described, for a fixed M we can maximize log Z(y, X, λ, γ, θ) w.r.t. θ and {v_m}_{m=1}^M using a gradient-based optimizer, but with the added benefit of having decreased the computational cost from O(N³) to O(N M²) [16].

Predictions   Making predictions under the model in (6), with conditional posterior distributions in (7), can be achieved using standard results for the multivariate normal distribution. The predictive distribution of f_⋆ for a new observation x_⋆ given the dataset {X, y} can be written as

  f_⋆|x_⋆, X, y ~ N( k_⋆^⊤ Σ Y(1 + λ) , k_{⋆⋆} − k_⋆^⊤ Σ k_⋆ ) ,   (11)

where Σ = (K + γ^{−1}Λ)^{−1}, k_{⋆⋆} = k(x_⋆, x_⋆, θ) and k_⋆ = [k(x_⋆, x_1, θ) . . . k(x_⋆, x_N, θ)]^⊤. Furthermore, we can directly use the probit link Φ(f_⋆) to compute

  p(y_⋆ = 1|x_⋆, X, y) = ∫ Φ(f_⋆) p(f_⋆|x_⋆, X, y) df_⋆ = Φ( k_⋆^⊤ Σ Y(1 + λ) (1 + k_{⋆⋆} − k_⋆^⊤ Σ k_⋆)^{−1/2} ) ,

which follows from [7].
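Equation (11) and the class-probability expression reduce to a few lines of linear algebra, via the standard Gaussian-probit integral ∫Φ(f) N(f|μ, σ²) df = Φ(μ/√(1 + σ²)). A minimal sketch (our own names; K, λ and γ are assumed already inferred):

```python
import numpy as np
from math import erf, sqrt

def predict(K, k_star, k_ss, y, lam, gamma):
    """Predictive mean/variance of Eq. (11) and the probit class probability.

    k_star : vector [k(x*, x_1), ..., k(x*, x_N)];  k_ss : scalar k(x*, x*).
    """
    A = K + np.diag(lam) / gamma            # K + gamma^{-1} Lambda
    alpha = np.linalg.solve(A, y * (1.0 + lam))
    mean = float(k_star @ alpha)
    var = float(k_ss - k_star @ np.linalg.solve(A, k_star))
    # p(y* = 1 | x*) = Phi(mean / sqrt(1 + var))
    prob = 0.5 * (1.0 + erf(mean / sqrt(2.0 * (1.0 + var))))
    return mean, var, prob
```

The vector alpha here is exactly the α = (K + γ^{−1}Λ)^{−1} Y(1 + λ) discussed next, so the mean is the representer-theorem expansion over the training points.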
Computing the class-membership probability is not possible in standard SVMs, because in such optimization-based methods one does not obtain the variance of the predictive distribution; this variance is an attractive component of the Bayesian construction.

The mean of the predictive distribution (11) is tightly related to the predictor in standard SVMs, in the sense that both are manifestations of the representer theorem. In particular,

  E[f_⋆|x_⋆, X, y] = Σ_{n=1}^N α_n k(x_⋆, x_n, θ) ,   (12)

where α = (K + γ^{−1}Λ)^{−1} Y(1 + λ). From the expectations of λ_n and f conditioned on γ and γ_0 it is possible to show that α is a vector with elements γ(1 − c) ≤ α_n ≤ γ(1 + c), where c = √(1 + 2γ_0 γ^{−1}). We differentiate three types of elements in α as follows:

  α_n = y_n γ(1 + c) ,  if y_n f_n < 1 ;
  α_n = α_n^0 ,         if y_n f_n = 1 (λ_n = 0) ;
  α_n = y_n γ(1 − c) ,  if y_n f_n > 1 ,   (13)

with α^0 = K_{0,0}^{−1}( y_0 − γ(1 + c) K_{0,a} y_a − γ(1 − c) K_{0,b} y_b ), where α_n^0 is an element of α^0, and 0, a and b are the subsets of {1, . . . , N} for which λ_n = 0, y_n f_n < 1 and y_n f_n > 1, respectively. This implies that α, and so the prediction rule in (12), depends on data for which λ_n > 0 only through γ and γ_0. Note also that we do not need the values of λ, but only whether or not they are different from zero. When γ_0 → 0, then c → 1 and α becomes a sparse vector bounded above by 2γ. This result for standard SVMs can be found independently from the Karush-Kuhn-Tucker conditions of its objective function [4].

For ECM and variational Bayes EM inference (the latter discussed below in Section 5), we set γ_0 = 0 and therefore α is sparse, with α_n = 0 when y_n f_n > 1, as in traditional SVMs.
This property of the proposed use of GPs within the Bayesian SVM formulation is a significant advantage relative to traditional classifier design based directly on GPs, for which we do not have such sparsity in general. For MCMC inference, we find the sampler mixes better when γ_0 ≠ 0. Details on the derivations of (13) and the concavity of the problem may be found in the Supplementary Material.

4 Related Work

A key contribution of this paper concerns extension of the linear Bayesian SVM developed in [5] to a nonlinear Bayesian SVM. This has been implemented by replacing the linear f(x) = β^⊤x considered in [5] with an f(x) drawn from a GP. The most relevant previous work is that for which a classifier is directly implemented via a GP, without an explicit connection to the margin associated with the SVM [7]. Specifically, GP-based classifiers have been developed by [17]. In [7] the f is drawn from a GP, as in (6), but f is used directly with a probit or logit link function, to estimate class-membership probability. Previous GP-based classifiers did not use f within a margin-based classifier as in (6), implemented here via p(u_n) = N(−λ_n, γ^{−1}λ_n), where u_n = 1 − y_n f_n. It has been shown empirically that nonlinear SVMs and GP classifiers often perform similarly [8]. However, for the latter, inference can be challenging due to the non-conjugacy of the multivariate normal distribution to the link function. Common inference strategies employ iterative approximate inference schemes, such as the Laplace approximation [17] or expectation propagation (EP) [18]. The model we propose here is locally fully conjugate (except for the GP kernel parameters) and inference can be easily implemented using EM-style algorithms, or via MCMC.
Moreover, the prediction rule of the GP classifier, which has a form almost identical to (12), is generally not sparse and therefore lacks the interpretation that may be provided by the relatively few support vectors.

5 Discriminative Factor Models

Combinations of factor models and linear classifiers have been widely used in many applications, such as gene expression, proteomics and image analysis, as a way to perform classification and feature selection simultaneously [19, 20]. One of the most common modeling approaches can be written as

  x_n = A w_n + ε_n ,  ε_n ~ N(0, ψ^{−1}I) ,  L(y_n|β, w_n, ·) ,

where A is a d × K matrix of factor loadings, w_n ∈ R^K is a vector of factor scores, ε_n is observation noise (and/or model residual), β is a vector of K linear-classifier coefficients, and L(·) is, for instance but not limited to, the linear SVM likelihood in (5) (a logit or probit link may also be used). One of many possible prior specifications for the above model is

  a_k ~ N(0, Φ_k) ,  w_n ~ N(0, I) ,  ψ ~ Ga(a_ψ, b_ψ) ,  β ~ N(0, G) ,

where a_k is a column of A, Φ_k = diag(φ_1k, . . . , φ_dk), φ_ik ~ Exp(ν), G = diag(g_1, . . . , g_K), and each element of A is distributed a_ik ~ Laplace(ν) after marginalizing out {φ_ik} [10]. Shrinkage in A is typically a requirement when N ≪ d or when its columns, a_k, need to be interpreted. For simplicity, we can set G = I; however, a shrinkage prior for the elements g_k of G might be useful in some applications, as a mechanism for factor-score selection. Although the described model usually works well in practice, it assumes that there is a linear mapping from R^d to R^K, with K ≪ d, in which the classes {−1, 1} are linearly separable. We can relax this assumption by imposing the hierarchical model in (6) in place of β.
This implies that the matrix K from (6) now has entries k_ij = k(w_i, w_j, θ). Inference using MCMC is straightforward except for the conditional posterior of the factor scores. This model is related to latent-variable GP models (GP-LVM) [21], in that we infer the latent {w_i} that reside within a GP kernel. However, here the {w_i} are also factor scores in a factor model, and the GP is used within the context of a Bayesian SVM classifier; neither of the latter two has been considered previously.

For the nonlinear Bayesian SVM classifier we no longer have a closed form for the conditional of w_n, due to the covariance function of the GP prior. Thus, we require a Metropolis-Hastings type algorithm. Here we use elliptical slice sampling [22]. Specifically, we sample w_n from

  p(w_n|A, W_{\n}, ψ, y, λ, γ, θ) ∝ p(w_n|x_n, A, ψ) Z(y, w_n, W_{\n}, λ, γ, θ) ,   (14)

where p(w_n|x_n, A, ψ) = N(S_N ψ A^⊤ x_n, S_N), W = [w_1 . . . w_N], W_{\n} is the matrix W without column n, S_N^{−1} = ψ A^⊤ A + I, and we have marginalized out f as in (9), with W in place of X. The elliptical slice sampler proposes samples from p(w_n|x_n, A, ψ) while biasing them towards more likely configurations of λ. Provided that λ ultimately controls the predictive distribution of the classifier in (11), samples of w_n will at the same time attempt to fit the data and to improve classification performance. From (14), note that we sample one column of W at a time, while keeping the others fixed. Details of the elliptical slice sampler are found in [22]. In applications in which sampling from (14) is time-prohibitive, we can instead use a variational Bayes EM (VB-EM) approach. In the E-step, we approximate the posterior of A, {Φ_k}, ψ, f, λ and γ by a factorized distribution q(A) Π_k q(Φ_k) q(ψ) q(f) q(λ) q(γ), and in the M-step we optimize W and θ using L-BFGS [23].
Details of the implementation can be found in the Supplementary Material.

6 Experiments

In all experiments we set the covariance function to (i) either the squared exponential (SE), which has the form k(x_i, x_j, θ) = exp(−‖x_i − x_j‖² / θ²), where θ² is known as the characteristic length scale; or (ii) the automatic relevance determination (ARD) SE, in which each dimension of x has its own length scale [7]. All code used in the experiments was written in Matlab and executed on a 2.8GHz workstation with 4Gb RAM.

Benchmark data   We first compare the performance of the proposed Bayesian hierarchy for the nonlinear SVM (BSVM) against EP-based GP classification (GPC) and an optimization-based SVM, on six well-known benchmark datasets. In particular, we use the same data and settings as [8], specifically 10-fold cross-validation and the SE covariance function. The parameters of the SVM, {γ, θ}, are obtained by grid search using an internal 5-fold cross-validation. GPC uses ML-II and a modified SE function k(x_i, x_j, θ) = θ_1² exp(−‖x_i − x_j‖² / θ_2²), where θ_1 acts as a regularization trade-off similar to γ in our formulation [7]. For our model we set 200 as the maximum number of iterations of the ECM algorithm and run ML-II every 20 iterations. Table 1 shows mean errors for the methods under consideration. We see that all three perform similarly, as one might expect, thus error bars are not shown; however, BSVM slightly outperforms the others in 4 out of 6 datasets. Of the three methods, the SVM is clearly faster than the others.

Table 1: Benchmark data results. Mean % error from 10-fold cross-validation.

  Data set      N     d    BSVM   SVM    GPC
  Ionosphere    351   34   5.98   5.71   7.41
  Sonar         208   60   11.06  11.54  12.50
  Wisconsin     683   9    2.93   3.07   2.64
  Crabs         200   7    1.5    2.0    2.5
  Pima          768   8    21.88  24.22  22.01
  USPS 3 vs 5   1540  256  1.49   1.56   1.69

GP classification and our model essentially scale cubically with N; however, ours is relatively faster, mainly due to overhead computations needed by the EP algorithm. More specifically, running times for the largest dataset (USPS 3 vs 5) were approximately 1000, 1200 and 5000 seconds for SVM, BSVM and GPC, respectively.

In order to test the approximation introduced in Section 3 (to accelerate GP inference) we use the traditional split of USPS, 7291 samples for model fitting and the remaining 2007 for testing, on two different tasks: 3 vs. 5 and 4 vs. non-4. Table 2 shows mean error rates and standard deviations for FITC versions of BSVM and GPC, for M = 100 pseudo-inputs and 10 repetitions. We see that FITC-BSVM slightly outperforms FITC-GPC in both tasks, while being relatively faster. As baselines, full BSVM and GPC on the 3 vs. 5 task perform roughly similarly, at 2.46% error. We also verified (results not shown) that increasing M consistently decreases error rates for both FITC-BSVM and FITC-GPC.

Table 2: FITC results (mean % error) and runtime (seconds) for USPS data.

            3 vs. 5 (N = 767)             4 vs. non-4 (N = 7291)
            FITC-BSVM     FITC-GPC        FITC-BSVM     FITC-GPC
  Error     2.44 ± 0.17   2.59 ± 0.17     3.49 ± 0.29   3.69 ± 0.26
  Time      46            102             116           604

USPS data   We applied the model proposed in Section 5 to the well-known 3 vs. 5 subset of the USPS handwritten digits dataset, consisting of 1540 gray-scale 16 × 16 images, rescaled within [−1, 1]. We use the resampled version, that is, 767 images for model fitting and the remaining 773 for testing. As baselines, we also perform inference as a two-step procedure, first fitting the factor model (FM), followed by a linear (L) or a nonlinear (N) SVM classifier.
We also consider jointly learning the factor model but with a linear SVM (LDFM), and a two-step procedure consisting of LDFM followed by a nonlinear SVM. Our proposed nonlinear discriminative factor model is denoted NDFM. VB-EM versions of LDFM and NDFM are denoted VLDFM and VNDFM, respectively. MCMC details for the linear SVM part can be found in [5]. For inference, we set K = 10, use an SE covariance function, and run the sampler for 1200 iterations, from which we discard the first 600 and keep every 10th sample for posterior summaries. We observed good mixing in general, regardless of random initialization, and results remained very similar across different Markov chains.

Table 3 shows classification results for the eight classifiers considered; we see that the nonlinear classifiers perform substantially better than their linear counterparts. In addition, the proposed nonlinear joint model (NDFM) is the best of all. The nonlinear classifier is powerful enough to perform well in both two-step procedures. We found that VNDFM does not perform as well as NDFM because the data likelihood dominates the label likelihood in the updates for the factor scores, which is not surprising considering the marked size difference between the two. On the positive side, the runtime for VNDFM is approximately two orders of magnitude smaller than that of NDFM.
We also tried a joint nonlinear model with a probit link, as in GP classification, and found its classification performance (a mean error rate of 3.10%) to be slightly worse than that of NDFM. In addition, we found that using ARD SE covariance functions to automatically select features of A, and using larger values of K, did not substantially change the results.

Table 3: Mean % error with standard deviations and runtime (seconds) for USPS and gene expression data.

USPS (test set)
        FM+L          FM+N          LDFM          VLDFM         LDFM+N        VLDFM+N       NDFM          VNDFM
Error   6.21 ± 0.32   3.36 ± 0.26   5.95 ± 0.31   5.56 ± 0.18   3.62 ± 0.26   3.62 ± 0.19   2.72 ± 0.13   3.23 ± 0.16
Time    44            840           120           60            920           160           20000         210

Gene expression (10-fold cross-validation)
Error   22.70 ± 0.92  19.52 ± 1.02  22.70 ± 0.92  22.31 ± 0.78  20.31 ± 0.88  19.52 ± 0.88  18.33 ± 0.84  18.33 ± 0.84
Time    105           136           126           25            158           57            1100          103

Gene expression data  The dataset, originally introduced in [24], consists of gene expression measurements from primary breast tumor samples, for a study focused on finding expression patterns potentially related to mutations of the p53 gene. The original data were normalized using RMA and filtered to exclude genes showing trivial variation. The final dataset consists of 251 samples and 2995 normalized gene expression values. The labeling variable indicates whether or not a sample exhibits the mutation. We use the same baselines and inference settings as in our previous experiment, but validation is done by 10-fold cross-validation.
In preliminary results we found that factor score selection improves results; hence, for the linear classifier (L) we used an exponential prior for the variances of β, g_k ∼ Exp(ρ), and for the nonlinear case (N) we set an ARD SE covariance function. Table 3 summarizes the results: the nonlinear variants outperform their linear counterparts, and our joint model performs slightly better than the others. Additionally, the joint nonlinear model with GP and probit link yielded an error rate of 19.52%.

As a way of quantifying whether the features (factor loadings) produced by FM, LDFM and NDFM are meaningful from a biological point of view, we performed Gene Ontology (GO) searches for the gene lists encoded by each column of A. In order to quantify the strength of the association between GO annotations and our gene lists, we obtained Bonferroni-corrected p-values [25]. We thresholded the elements of matrix A such that |a_ik| > 0.1. Using the 10 lists from each model, we found that FM, LDFM and NDFM produced 5, 5 and 8 factors, respectively, significantly associated with GO terms relevant to breast cancer. The GO terms are: fatty acid metabolism, induction of programmed cell death (apoptosis), anti-apoptosis, regulation of cell cycle, positive regulation of cell cycle, cell cycle, and Wnt signaling pathway. The strongest associations in all models are, unsurprisingly, apoptosis and positive regulation of cell cycle; however, only NDFM produced a significant association with anti-apoptosis, which we believe is responsible for the edge in performance of NDFM in Table 3.

7 Conclusion

We have introduced a fully Bayesian version of nonlinear SVMs, extending previous work that was restricted to linear SVMs [5]. Almost all existing joint feature-learning and classifier-design models have assumed linear classifiers [2, 3, 26].
We have demonstrated in our experiments that there is a substantial performance improvement from the nonlinear classifier. In addition, we have extended the Bayesian equivalent of the hinge loss to a more general loss function, for both linear and nonlinear classifiers. We have demonstrated that this approach enhances modeling flexibility and yields improved MCMC mixing. The Bayesian setup allows one to directly compute class membership probabilities. We showed how to use the nonlinear SVM as a module in a larger model, and presented compelling results to highlight its potential. Point-estimate inference using ECM is conceptually simpler and easier to implement than MCMC or GP classification, although MCMC is attractive for integrating the factor model and classifier (for example). We showed how FITC- and VB-EM-based approximations can be used in conjunction with the nonlinear SVM classifier and discriminative factor modeling, respectively, as a way to scale inference in a principled manner.

Acknowledgments

The research reported here was funded in part by ARO, DARPA, DOE, NGA and ONR.

References

[1] J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: maximum margin supervised topic models for regression and classification. ICML, pages 1257–1264, 2009.

[2] M. Xu, J. Zhu, and B. Zhang. Fast max-margin matrix factorization with data augmentation. ICML, pages 978–986, 2013.

[3] M. Xu, J. Zhu, and B. Zhang. Nonparametric max-margin matrix factorization for collaborative prediction. NIPS 25, pages 64–72, 2012.

[4] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[5] N. G. Polson and S. L. Scott. Data augmentation for support vector machines. Bayesian Analysis, 6(1):1–23, 2011.

[6] M. Opper and O. Winther. Gaussian processes for classification: Mean-field algorithms.
Neural Computation, 12(11):2655–2684, 2000.

[7] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[8] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. JMLR, 6:1679–1704, 2005.

[9] N. G. Polson and J. G. Scott. Shrink globally, act locally: sparse Bayesian regularization and prediction. Bayesian Statistics, 9:501–538, 2010.

[10] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. JRSSB, 36(1):99–102, 1974.

[11] T. J. Kozubowski and K. Podgorski. A class of asymmetric distributions. Actuarial Research Clearing House, 1:113–134, 1999.

[12] R. M. Neal. Slice sampling. AOS, 31(3):705–741, 2003.

[13] I. Murray and R. P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. NIPS 23, pages 1723–1731, 2010.

[14] X.-L. Meng and D. B. Rubin. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80(2):267–278, 1993.

[15] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. JMLR, 6:1939–1959, 2005.

[16] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. NIPS 18, pages 1257–1264, 2006.

[17] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. PAMI, 20(12):1342–1351, 1998.

[18] T. P. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001.

[19] C. M. Carvalho, J. Chang, J. E. Lucas, J. R. Nevins, Q. Wang, and M. West. High-dimensional sparse factor modeling: Applications in gene expression genomics. JASA, 103(484):1438–1456, 2008.

[20] M. Zhou, H. Chen, J. Paisley, L. Ren, G. Sapiro, and L. Carin. Non-parametric Bayesian dictionary learning for sparse image representations.
NIPS 22, pages 2295–2303, 2009.

[21] N. D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. NIPS 16, 2003.

[22] I. Murray, R. P. Adams, and D. J. C. MacKay. Elliptical slice sampling. AISTATS, pages 541–548, 2010.

[23] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming B, pages 503–528, 1989.

[24] L. D. Miller, J. Smeds, J. George, V. B. Vega, L. Vergara, A. Ploner, Y. Pawitan, P. Hall, S. Klaar, E. T. Liu, et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. PNAS, 102(38):13550–13555, 2005.

[25] J. T. Chang and J. R. Nevins. GATHER: a systems approach to interpreting genomic signatures. Bioinformatics, 22(23):2926–2933, 2006.

[26] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. NIPS 21, pages 1033–1040, 2009.