{"title": "Free energy score space", "book": "Advances in Neural Information Processing Systems", "page_first": 1428, "page_last": 1436, "abstract": "Score functions induced by generative models extract fixed-dimension feature vectors from different-length data observations by subsuming the process of data generation, projecting them in highly informative spaces called score spaces. In this way, standard discriminative classifiers are proved to achieve higher performances than a solely generative or discriminative approach. In this paper, we present a novel score space that exploits the free energy associated to a generative model through a score function. This function aims at capturing both the uncertainty of the model learning and ``local compliance of data observations with respect to the generative process. Theoretical justifications and convincing comparative classification results on various generative models prove the goodness of the proposed strategy.", "full_text": "Free energy score-space\n\nAlessandro Perina1,3, Marco Cristani1,2, Umberto Castellani1\n\n{alessandro.perina, marco.cristani, umberto.castellani, vittorio.murino}@univr.it\n\nVittorio Murino1,2 and Nebojsa Jojic3\n\njojic@microsoft.com\n\n1 Department of Computer Science, University of Verona, Italy\n\n2 IIT, Italian Institute of Technology, Genova, Italy\n\n3 Microsoft Research, Redmond, WA\n\nAbstract\n\nA score function induced by a generative model of the data can provide a fea-\nture vector of a \ufb01xed dimension for each data sample. Data samples themselves\nmay be of differing lengths (e.g., speech segments, or other sequence data), but\nas a score function is based on the properties of the data generation process, it\nproduces a \ufb01xed-length vector in a highly informative space, typically referred to\nas a \u201cscore space\u201d. 
Discriminative classifiers have been shown to achieve higher performance in appropriately chosen score spaces than is achievable by either the corresponding generative likelihood-based classifiers or discriminative classifiers using standard feature extractors. In this paper, we present a novel score space that exploits the free energy associated with a generative model. The resulting free energy score space (FESS) takes into account latent structure of the data at various levels, and can be trivially shown to lead to classification performance that at least matches the performance of the free energy classifier based on the same generative model and the same factorization of the posterior. We also show that in several typical vision and computational biology applications the classifiers optimized in FESS outperform the corresponding pure generative approaches, as well as a number of previous approaches to combining discriminative and generative models.

1 Introduction
The complementary nature of discriminative and generative approaches to machine learning [20] has motivated much research on the ways in which these can be combined [5, 12, 15, 18, 9, 24, 27]. One recipe for such integration uses “generative score-spaces.” Using the notation of [24], such spaces can be built from data by considering, for each observed sequence x = (x_1, . . . , x_k, . . . , x_K) of observations x_k ∈ ℝ^d, k = 1, . . . , K, a family of generative models P = {P(x|θ_i)} parameterized by θ_i.

The observed sequence x is mapped to the fixed-length score vector

φ^f_F̂(x) = F̂ f({P(x|θ_i)}),   (1)

where f is a function of the set of probability densities under the different models, and F̂ is some operator applied to it.
For instance, in the case of the Fisher score [9], f is the log likelihood, and the operator F̂ produces the first-order derivatives with respect to the parameters, whereas in [24] other derivatives are also included. Another example is the TOP kernel [27], for which the function f is the posterior log-odds and F̂ is again the gradient operator.

In these cases, the generative score-space approaches help to distill the relationship between a model parameter θ_i and the particular data sample. After the mapping, a score-space metric must be defined in order to employ discriminative approaches.

A number of nice properties for these mappings, and especially for the Fisher score, can be derived under the assumption that the test data indeed follows the generative model used for the score computation. However, the generative score spaces build upon the choice of one (or a few) out of many possible generative models, as well as on parameters fit to a limited amount of data. In practice, these models can therefore suffer from improper parametrization of the probability density function, local minima, over-fitting and under-training problems. Consider, for instance, the situation where the assumed model over high-dimensional data is a mixture of n diagonal Gaussians with a given small and fixed variance, and a uniform prior over the components. The only free parameters are therefore the Gaussian centers, and let us assume that the training data is best captured with these centers all lying on (or close to) a hypersphere with a radius sufficiently larger than the Gaussians' deviation.
An especially surprising and inconvenient outlier in this case would be a test data point that falls close to the center of the hypersphere, as the derivatives of its log likelihood with respect to these parameters (the Gaussian centers) evaluated at the estimate could be very low when the number of components n in the mixture is large, because the derivatives are scaled by the uniform posterior 1/n. This makes such a test point insufficiently distinguishable from the test points that actually satisfy the model perfectly by falling directly into one of the Gaussian centers. If the model parameters are extended to include the prior distribution over mixture components, then derivatives with respect to these parameters would help disambiguate these points.

In this paper, we propose a novel score space which focuses on how well the data point fits different parts of the generative model, rather than on derivatives with respect to the model parameters. We start with the variational free energy as an upper bound on the negative log-likelihood of the data, as this affords us two advantages. First of all, the variational free energy can be computed for an arbitrary structure of the posterior distribution, allowing us to deal with generative models with many latent variables and complex structure without compromising tractability, as was previously done for inference in generative models. Second, a variational approximation of the posterior typically provides an additive decomposition of the free energy, providing many terms that can be used as features.
These terms/features are divided into two categories: the “entropy set” of terms that express uncertainty in the posterior distribution, and the “cross-entropy set” describing the quality of the fit of the data to different parts of the model according to the posterior distribution.

We find the resulting score space to be highly informative for discriminative learning. In particular, we tested our approach on three computational biology problems (promoter recognition, exon/intron classification, and homology detection), as well as on vision problems (scene/object recognition). The results compare favorably with the state of the art from the recent literature.

The rest of the paper is organized as follows. The next section describes the proposed framework in more detail. In Sec. 3, we show that the proposed generative score space leads to better classification performance than the related generative counterpart. Some simple extensions are described in Sec. 4, and used in the experiments in Sec. 5.

2 FESS: Free Energy Score Space
A generative model defines the distribution P(h, x|θ) = ∏_{t=1}^T P(h^(t), x^(t)|θ) over a set of observations x = {x^(t)}_{t=1}^T, each with associated hidden variables h^(t), for a given set of model parameters θ shared across all observations. In addition, to model the posterior distribution P(h|x), we also define a family of distributions Q from which we need to select a variational distribution Q(h) that best fits the model and the data. Assuming i.i.d. data, the family Q can be simplified to include only distributions of the form Q(h) = ∏_{t=1}^T q(h^(t)).
The free energy [19, 11] is a function of the data, the parameters of the posterior Q(h), and the parameters of the model P, defined as

F_Q = KL(Q, P(h|x, θ)) − log P(x|θ) = Σ_h Q(h) log [Q(h) / P(h, x|θ)]   (2)

The free energy upper-bounds the negative log likelihood, F_Q ≥ −log P(x), and equality is attained only if Q is expressive enough to capture the true posterior distribution, as the free energy is minimized when Q(h) = P(h|x). Constraining Q to belong to a simplified family of distributions Q, however, provides computational advantages for dealing with intractable models P. Examples of distribution families used for approximation are the fully-factorized mean field form [13], or the structured variational approximation [7], where some dependencies among the hidden variables are kept.

Minimization of F_Q as a proxy for the negative log likelihood is usually achieved by alternating optimization with respect to Q and θ, a special case of which, when Q is fully expressive, is the EM algorithm. Different choices of Q provide different types of compromise between accuracy and computational complexity. For some models, accurate inference of some of the latent variables may require excessive computation, even though the results of the inference can be correctly reinterpreted by studying the posterior Q from a simpler family and observing the symmetries of the model, or by reparametrizing the model (see for example [1]). In what follows, we will develop a technique that uses the parts of the free energy to infer the mapping of the data to a class variable with increased accuracy despite possible imperfections of the data fit, whether this imperfection is due to approximations and errors in the model or in the posterior.

Having obtained an estimate of parameters θ̂ that fit the given i.i.d.
data, we can rearrange the free energy (Eq. 2) as

F_Q = Σ_t F^t_Q, and
F^t_Q = Σ_{h^(t)} q(h^(t)|θ̂) · log q(h^(t)|θ̂) − Σ_{h^(t)} q(h^(t)|θ̂) · log P(h^(t), x^(t)|θ̂)   (3)

The second term in the equation above is the cross-entropy term, and it quantifies how well the data point fits the model, assuming that the hidden variables follow the estimated posterior distribution. This posterior distribution is fit to minimize the free energy; the first term in (3) is the entropy and quantifies the uncertainty in this fit.

If Q and P factorize, then each of these two terms further breaks into a sum of individual terms, each quantifying an aspect of the fit of the data point with respect to a different part of the model. For example, if the generative model is described by a Bayesian network, the joint distribution can be written as P(v^(t)) = ∏_n P(v^(t)_n|PA_n), where v^(t) = {x^(t), h^(t)} denotes the set of all variables (hidden or visible) and PA_n are the parents of the n-th of these variables, v^(t)_n. The cross-entropy term in the equation above further decomposes into

Σ_{[v^(t)_1]} q(v^(t)_1 ∪ PA_1|θ̂) · log P(v^(t)_1|PA_1, θ̂) + · · · + Σ_{[v^(t)_N]} q(v^(t)_N ∪ PA_N|θ̂) · log P(v^(t)_N|PA_N, θ̂)   (4)

For each discrete hidden variable v^(t)_n, the appropriate terms above can be further broken down into individual terms in the summation over the D_n possible configurations of the variable, e.g.,

q(v^(t)_n = 1 ∪ PA_n|θ̂) · log P(v^(t)_n = 1|PA_n, θ̂) + · · · + q(v^(t)_n = D_n ∪ PA_n|θ̂) · log P(v^(t)_n = D_n|PA_n, θ̂)   (5)

In a similar fashion, the entropy term can also be decomposed further into a sum of terms as
dictated by the factorization of the family Q. Therefore, the free energy for a single sample t can be expressed as the sum

F^t_Q = Σ_i f^t_{i,θ̂}   (6)

where all the free energy pieces f^t_{i,θ̂} derive from the finest decomposition (5) or (4).

The terms f^t_{i,θ̂} describe how the data point fits the possible configurations of the hidden variables in different parts of the model. Such information can be encapsulated in a score space that we call the free energy score space, or simply FESS.

For example, in the case of a binary classification problem, given the generative models for the two classes, we can define as F_(Q,θ̂)(x^(t)) the mapping of x^(t) to a vector of scores f with respect to a particular model with its estimated parameters and a particular choice of the posterior family Q, for each of the classes, and then concatenate the scores. Therefore, using the notation from [24], the free energy score operator φ^FESS_F̂(x^(t)) is defined as

φ^FESS_F̂ : x^(t) → [F_(Q1,θ̂1)(x^(t)); F_(Q2,θ̂2)(x^(t))], where F_(Qc,θ̂c) = [. . . , f^t_{i,θ̂c}, . . .]^T, c = 1, 2   (7)

If the posterior families are fully expressive, then the MAP estimate based on the generative models for the two classes can be obtained from this mapping by simply summing the appropriate terms to obtain the log likelihood difference, as the free energy then equals the negative log likelihood.

However, the mapping also allows the parts of the model fit to play uneven roles in classification after an additional step of discriminative training. In this case the data points do not have to fit either model well in order to be correctly classified.
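To make the construction concrete, the following is a minimal sketch of ours (not the authors' code) of the finest-granularity FESS features for a toy mixture of isotropic Gaussians with a uniform prior, where the exact posterior is tractable: each model contributes its entropy term plus one cross-entropy term per component, and the two class models' vectors are concatenated as in (7).

```python
import numpy as np

def free_energy_terms(x, means, var=1.0):
    """FESS features for one sample under a toy mixture of isotropic
    Gaussians with a uniform prior: the entropy term plus one
    cross-entropy term per component (the finest decomposition)."""
    d = x.shape[0]
    # log P(x, h=j) for each component j (uniform prior 1/n)
    log_joint = (-0.5 * np.sum((x - means) ** 2, axis=1) / var
                 - 0.5 * d * np.log(2.0 * np.pi * var)
                 - np.log(len(means)))
    # exact posterior q(h=j) = P(h=j | x)
    q = np.exp(log_joint - log_joint.max())
    q /= q.sum()
    entropy_term = np.sum(q * np.log(q + 1e-12))   # sum_h q(h) log q(h)
    cross_terms = -q * log_joint                   # -q(h) log P(h, x)
    # summing all entries recovers the free energy, here -log P(x)
    return np.concatenate(([entropy_term], cross_terms))

def fess_vector(x, means_a, means_b):
    """Concatenate the score vectors of the two class models, as in (7)."""
    return np.concatenate([free_energy_terms(x, means_a),
                           free_energy_terms(x, means_b)])
```

Because the posterior here is exact, the entries of each model's block sum to −log P(x); a discriminative classifier is then free to reweight the individual terms.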
Furthermore, even in the extreme case where one model provides a higher likelihood than the other for the data from both classes (e.g., because the models are not nested, and likelihoods cannot be directly compared), the mapping may still provide an abstraction from which another step of discriminative training can benefit. The additional step of training a discriminative model allows for mining the similarities among the data points in terms of the path through different hidden variables that has to be followed in their generation. These similarities may be informative even if the generative process is imperfect.

Obviously, (7) can be generalized to include multiple models (or the use of a single model) and/or multiple posterior approximations, for either two-class or multi-class classification problems.

3 Free energy score space classification dominates the MAP classification
We use here the terminology introduced in [27], under which FESS would be considered a model-dependent feature extractor, as different generative models lead to different feature vectors [25]. The family of feature extractors φ_F̂ : X → ℝ^d maps the input data x ∈ X into a space of fixed dimension derived from a plug-in estimate λ, in our case the generative model with parameters θ̂ from which the features are extracted.

Given some observations x and the corresponding class labels y ∈ {−1, +1} following the joint probability P(x, y|θ*), a generative model can be trained to provide an estimate θ̂ ≠ θ*, where θ* are the true parameters. As most kernels (e.g.
Fisher and TOP) are commonly used in combination with linear classifiers such as linear SVMs, [27] proposes as a starting point for evaluating the performance of a feature extractor the classification error of a linear classifier w^T · φ_F̂(x) + b in the feature space ℝ^d, where w ∈ ℝ^d and b ∈ ℝ. Assuming that w and b are chosen by an optimal learning algorithm on a sufficiently large training dataset, and that the test set follows the same distribution with parameter θ*, the classification error R(φ_F̂) can be shown to tend to

R(φ_F̂) = min_{w,b} E_{x,y} Φ[−y(w^T · φ_F̂(x) + b)]   (8)

where Φ[a] is an indicator function which is 1 when a > 0, and 0 otherwise, and E_{x,y} denotes the expectation with respect to the true distribution P(x, y|θ*).

The Fisher kernel (FK) classifier can perform at least as well as its plug-in estimate if the parameters of a linear classifier are properly determined [9, 27],

R(φ^FK_F̂) ≤ E_{x,y} Φ[−y(P(y = +1|x, θ̂) − 1/2)] = R(λ)   (9)

where λ represents the generative model used as the plug-in estimate.

This property also trivially holds for our method, where φ_F̂(x^(t)) = φ^FESS_F̂(x^(t)), because the free energy can be expressed as a linear combination of the elements of φ.

In fact, the minimum free energy test (and the maximum likelihood rule when Q is fully expressive) can be defined on φ derived from the generative models with parameters θ̂_{+1} for one class and θ̂_{−1} for the other as

ŷ = argmin_y F^t_(Q,θ̂_y) = Φ[1^T F_(Q,θ̂_{+1})(x^(t)) − 1^T F_(Q,θ̂_{−1})(x^(t))]   (10)

The extension to a multiclass classification is straightforward.
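The minimum free energy test is itself one particular fixed linear classifier on the FESS vector, a point worth seeing in code. A toy sketch of ours, assuming the two models' free-energy terms are stacked as in (7):

```python
import numpy as np

def free_energy_test(phi, m1):
    """MAP-style decision from a FESS vector `phi` whose first m1
    entries are the free-energy terms under the class +1 model and
    whose remaining entries are those under the class -1 model.
    Equivalent to a linear classifier with weights +/-1 and bias 0."""
    m2 = phi.shape[0] - m1
    w_g = np.concatenate([np.ones(m1), -np.ones(m2)])
    # w_g . phi = F_{+1} - F_{-1}; the lower free energy wins
    return +1 if w_g @ phi < 0 else -1
```

Discriminatively training w instead of fixing it to ±1 can only lower the asymptotic error relative to this fixed choice, which is the substance of the dominance argument in this section.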
When the family Q is expressive enough to capture the true posterior distribution, the free energy reduces to the negative log likelihood, and the free energy test reduces to ML classification. In other cases, likelihood computation is intractable, and the free energy test is used instead of the likelihood ratio test. It is straightforward to prove that a kernel classifier that works in FESS is asymptotically at least as good as the MAP labelling based on the generative models for the two classes, since generative classification is a special case of our framework.

Lemma 3.1 For φ^FESS_F̂(x^(t)) derived as above, with its first M1 elements being the components of the free energy for one model and the remaining M2 for the second, a linear classifier employing φ^FESS_F̂ will, asymptotically (with enough data), provide classification error which is at least as low as R_Q(λ) achieved using the free energy test above:

R(φ^FESS_F̂) ≤ E_{x,y} Φ[−y(P(y = +1|x, θ̂) − 1/2)] = R_Q(λ)

Proof

R(φ^FESS_F̂) = min_{w,b} E_{x,y} Φ[−y(w^T · φ^FESS_F̂(x) + b)] ≤ E_{x,y} Φ[−y(w^T · φ^FESS_F̂(x) + b)] ∀ w, b
R(φ^FESS_F̂) ≤ E_{x,y} Φ[−y(w_g^T · φ^FESS_F̂(x) + b_g)] for w_g = [+1, . . . , +1 (M1 times), −1, . . . , −1 (M2 times)]^T, b_g = 0
R(φ^FESS_F̂) ≤ R_Q(λ)   (11)

Furthermore, when the family Q is expressive enough to capture the true posterior
distribution, the free energy test is equivalent to maximum likelihood (ML) classification, R_Q(λ) = R(λ). The dominance of the Fisher and TOP kernels [9, 27] over their plug-in holds for FESS too, and the same plug-in (the likelihood under a generative model) may be used when this is tractable. However, if the computation of the likelihood (and of the kernels derived from it) is intractable, then the free energy test, as well as the kernel methods based on FESS that outperform this test, can both still be used with an appropriate family of variational distributions Q.

4 Controlling the length of the feature vector
In some generative models, especially sequence models, the number of hidden variables may change from one data point to the next. In speech processing, for instance, hidden Markov models (HMMs) [23] may have to model utterances x^(t)_1, . . . , x^(t)_{K(t)} of different sequence lengths K(t). As each element in the sequence has an associated hidden variable, the hidden state sequences s^(t)_1, . . . , s^(t)_{K(t)} are also of variable lengths. The parameters θ of this model include the prior state distribution π, the state transition probability matrix A = {a_ij}, and the emission probabilities B = {b_iv}. Exact inference is tractable in HMMs, so we can use the exact posterior (EX) distribution to formulate the free energy; the free energy minimization is then equivalent to the usual Baum-Welch training algorithm [17], and F_EX = −log P(x). The free energy of each sample x^(t) is

F^t_EX = Σ_{[s]} q(s^(t)_1) log q(s^(t)_1) + Σ_{[s]} Σ_{k=1}^{K(t)−1} q(s^(t)_k, s^(t)_{k+1}) log q(s^(t)_k, s^(t)_{k+1}) − Σ_{[s]} q(s^(t)_1) log π_{s^(t)_1} − Σ_{[s]} Σ_{k=1}^{K(t)−1} q(s^(t)_k, s^(t)_{k+1}) log a_{s^(t)_k, s^(t)_{k+1}} − Σ_{[s]} Σ_{k=1}^{K(t)} q(s^(t)_k) log b_{s^(t)_k, x^(t)_k}   (12)

Depending on how this is broken into terms f_i, we could get feature vectors whose dimension depends on the length of the sample, K(t). To solve this problem, we first note that a standard approach to dealing with utterances of different lengths is to normalize the likelihood by the sequence length, and this approach is also used for defining other score spaces. If, before the application of the score operator, we simply evaluate the sums over k in the free energy and divide each by K(t), we obtain a fixed number of terms independent of the sequence length. This results in a length-normalized score space, nFESS, where the granularity of the decomposition of the free energy is dramatically reduced.

Figure 1: A) SVM error rates for nFESS and probability product kernels [10] using Markov models (we report only their best result) and hidden Markov models as plug-ins. T represents the parameters used in the kernel of [10], and K is the order of the Markov chain. The results are arranged along the x axis by the regularization constant used in SVM training. B) Comparison with results obtained using FK and TK score spaces. C) Comparison of the five homology detection methods in Experiment 3.
Y axis represents the total number of families for which a given method exceeds a median RFP score on the X axis.

In general, even for fixed-length data points and arbitrary generative models, we do not need to create large feature vectors corresponding to the finest level of granularity described in (5), or for that matter the slightly coarser level of granularity in (4). Some of the terms in these equations can be grouped and summed up to yield shorter feature vectors, if this is warranted by the application. The longer the feature vector, the finer the level of detail with which the generative process for the data sample is represented, but the more data is needed for training the discriminative classifier. Domain knowledge can often be used to reduce the complexity of the representation by summing appropriate terms without sacrificing the amount of useful information packed in the feature vectors. Such control of the feature vector length does not negate the previously discussed advantages of classification in the free energy score space compared with the straightforward application of free energy, likelihood, or, in the case of sequence models, length-normalized likelihood tests.

5 Experiments
We evaluated our approach on four standard datasets and compared its performance with the classification results provided by the datasets' creators, with those estimated using the plug-in estimate λ, and with those obtained using the Fisher (FK) and TOP (TK) kernels [9, 27] derived from the plug-ins. Support vector machines (SVMs) with an RBF kernel were used as discriminative classifiers in all the score spaces, as this technique was previously identified as the most potent for dealing with variable-length sequences [25].
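The overall experimental pipeline is simple once score vectors are in hand. A hedged sketch of ours (the paper does not specify an implementation), assuming per-sample score vectors have already been extracted and using scikit-learn's SVC as a stand-in for the SVM classifier:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_score_space_classifier(scores_train, y_train):
    """Fit an RBF-kernel SVM on fixed-length score vectors
    (FESS, FK, or TK features), one row per sample.
    The regularization constant C=10.0 is an arbitrary placeholder;
    the paper selects it (and model complexity) by cross-validation."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
    clf.fit(scores_train, y_train)
    return clf
```

The fitted classifier's `decision_function` values can then serve as ranking scores, as done for the median-RFP evaluation in Experiment 3 below.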
As plug-ins, or generative models/likelihoods λ, for the three score spaces we compare across experiments, we used hidden Markov models (HMMs) [23] in Experiments 1-3 and latent Dirichlet allocation (LDA) [4] in Experiment 4. For each experiment, comparisons are based on the same validation procedure used in the original papers that introduced the datasets. For both FK and FESS, in each experiment we trained a single generative model (HMM or LDA, depending on the experiment). For all HMM models, the length normalization with associated summation over the sequence, as described in the previous section, was used in the construction of the free energy score space. The model complexity, e.g., the number of states of the HMM, was chosen by cross-validation on the training set.

Experiment 1: E. coli promoter gene sequences. The first analyzed dataset consists of the E. coli promoter gene sequences (DNA) with associated imperfect domain theory [26]. The standard task on this dataset is to recognize promoters in strings of nucleotides (A, G, T, or C). A promoter is a genetic region which facilitates the transcription of a gene located nearby. The input features are 57 sequential DNA nucleotides.
Results, obtained using leave-one-out (LOO) validation, are reported in Table 1 and illustrate that FESS represents the fixed-size genetic sequences well, leading to superior performance over the other score spaces as well as over the plug-in λ_HMM.

E. coli accuracy:  λ_HMM 67.34% | FESS 94.33% | nFESS 85.80% | FK 79.20% | TK 85.30%

Table 1: Promoter classification results.

Experiment 2: Introns/Exons classification in the HS3D data set. The HS3D data set [10] contains labelled intron and exon sequences of nucleotides. The task here is to distinguish between the two types of gene sequences, which can both vary in length (from dozens of nucleotides to tens of thousands of nucleotides). For the sake of comparison, we adopted the same experimental setting as [10]. In Fig. 1-A (top right), we report the results obtained in [10] (overall error rate, OER, 7.5%) and the results obtained using the HMM model (λ_HMM, OER 27.59%), together with the results obtained by our method (OER 6.12%). In Fig. 1-B (bottom right), we also compare our method with the FK (OER 10.06%) and TK (OER 12.82%) kernels.

Experiment 3: Homology detection in SCOP 1.53. We tested the ability of FESS to classify protein domains into superfamilies in the Structural Classification of Proteins (SCOP), version 1.53. The sequences in the database were selected from the Astral database, based on an E-value threshold of 10^−25 for removing similar sequences. In the end, 4352 distinct sequences were grouped into families and superfamilies.
For each family, the protein domains within the family are considered positive test examples, and the protein domains outside the family, but within the same superfamily, are taken as positive training examples. The data set yields 54 families containing at least 10 family members (positive test) and 5 superfamily members outside of the family (positive train), for a total of 54 one-vs-all problems. The experimental setup is similar to that used in [8], except for one important difference: in the current experiments, the positive training sets do not include additional protein sequences extracted from a large, unlabelled database. Therefore, the recognition tasks performed here are more difficult than those in [8]. In order to measure the quality of the ranking, we used the median RFP score [8], which is the fraction of negative test sequences that score as high as or better than the median-scoring positive sequence. We used SVM decision values as scores. We find that FESS outperforms task-specific algorithms (PSI-BLAST [2] and SAM [14]) as well as the Fisher score (FK, [8]) with statistical significance, with p-values of 5.1e-9, 8.3e-7, and 1.1e-5, respectively. There is no statistical difference between our FESS results and those based on FPS [3]. In particular, the poor performance of [8] is explained by the under-training of the HMMs [6]. The FESS representation proved to be much less sensitive to these training problems. We repeated the test using two different choices of Q: the approximate mean field factorization and the exact posterior (FESS-MF and FESS-EX, respectively, in Fig. 1-C). Interestingly, the performance was also robust with respect to these choices.

Experiment 4: Scene/object recognition. Our final set of experiments used the data from the Graz dataset, as well as the dataset proposed in [21]. In both tests, we used latent Dirichlet allocation (LDA) [4] as the generative model.
The free energy for LDA is derived in [4]. To serve as words in the model, we extracted SIFT features from 16x16-pixel windows computed over a grid with a spacing of 8 pixels. These features were mapped to 175 codewords (W = 175). We varied the number of topics to explore the effectiveness of the different techniques.
The Graz dataset has two object classes, bikes (373 images) and persons (460 images), in addition to a background class (270 images)4. The range of scales and poses at which the exemplars are presented is highly diverse: e.g., a "person" image may show a pedestrian at a certain distance, a side view of a complete body, or just a closeup of a head. We performed two-class detection (object vs. background) using an experimental setup consistent with [16, 22]. We generated ROC curves by thresholding the raw SVM output, and report here the ROC equal error rate averaged over ten runs. The results are shown in Table 2. The standard deviation of the classification rate is quite high, as the images in the database have very different complexities and the performance of any single run is highly dependent on the composition of the training set.
We also tested our approach on the scene recognition task using the datasets of [21], composed of two datasets (Natural and Artificial scenes), each with 4 different classes. The results are reported in Table 2, where for the first time we employ Fisher-LDA in a vision application. Although this new technique outperforms the state of the art, once again FESS outperforms both this result and the other state-of-the-art discriminative methods [21, 16].

Graz dataset     FESS - Z=15    FESS - Z=30    FESS - Z=45    [16]           [22]
Bikes            86.1% (1.8)    86.5% (2.0)    89.1% (2.3)    86.3% (2.5)    86.5%
People           83.1% (3.1)    82.9% (2.8)    84.4% (2.0)    82.3% (3.1)    80.8%

Scenes dataset   λLDA           FESS           FK             [21]           [16]
Natural          63.93%         95.21%         90.10%         89.00%         84.51%
Artificial       67.21%         94.38%         90.32%         89.00%         89.43%

Table 2: Classification rates for the object/scene recognition tasks; the standard deviation is shown in brackets. Our approach tends to be robust to the choice of the number Z of topics, so in the scene recognition experiments we report only the results for Z=40.

1 www.sci.unisannio.it/docenti/rampone
2 http://scop.mrc-lmb.cam.ac.uk/scop/
3 http://www.emt.tugraz.at/~pinz/data/GRAZ_02/
4 The car class is ignored, as in [16].

6 Conclusions
In this paper, we present a novel generative score space, FESS, which exploits variational free energy terms as features. The additive free energy terms arise naturally as a consequence of the factorization of the model P and of the posterior Q. We show that the use of these terms as features in discriminative classification leads to more robust results than the use of the Fisher scores, which are based on the derivatives of the log likelihood of the data with respect to the model parameters. As was previously observed, we find that the Fisher score space suffers from the so-called "wrap-around" problem, where very different data points may map to the same derivative; an example of this was discussed in the introduction. The free energy terms, on the other hand, quantify the data fit in different parts of the model, and seem to be informative even when the model is imperfect. This indicates that the re-scaling of these terms, which the subsequent discriminative training provides, leads to improved modelling of the data in some way. Scaling a term in the free energy composition, e.g., the term Σ_h q(h) log p(x|h), by a constant w is equivalent to raising the appropriate conditional distribution to the power w.
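This equivalence is just the identity w log p = log p^w applied inside the expectation, and can be checked numerically. Below is a toy sketch; the distributions q and p and the weight w are made up for illustration:

```python
import math

# Toy posterior q(h) over three hidden states and likelihoods p(x|h).
q = [0.5, 0.3, 0.2]
p = [0.10, 0.40, 0.25]
w = 0.7  # scaling weight learned by the discriminative stage

# w * sum_h q(h) log p(x|h): the free energy term scaled by w.
lhs = w * sum(qh * math.log(ph) for qh, ph in zip(q, p))
# sum_h q(h) log p(x|h)^w: the same term with the conditional raised to w.
rhs = sum(qh * math.log(ph ** w) for qh, ph in zip(q, p))

print(abs(lhs - rhs) < 1e-12)  # -> True
```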
This is indeed reminiscent of some previous approaches to correcting\ngenerative modelling problems. In speech applications, for example, it is a standard practice to raise\nthe observation likelihood in HMMs to a power less than 1, before inference is performed on the\ntest sample, as the acoustic signal would otherwise overwhelm the hidden process modelling the\nlanguage constraints [28]. This problem arises from the approximations in the acoustic model. For\ninstance, a high-dimensional acoustic observation is often modelled as following a diagonal Gaus-\nsian distribution, thus assuming independent noise in the elements of the signal, even though the\ntrue acoustics of speech is far more constrained. This results in over-accounting for the variation in\nthe observed acoustic signal, and to correct for this in practice, the log probability of the observation\ngiven the hidden variable is scaled down. The technique described here proposes a way to automati-\ncally infer the best scaling, but it also goes a step further in allowing for such corrections at all levels\nof the model hierarchy, and even for speci\ufb01c con\ufb01gurations of hidden variables. Furthermore, the\nuse of kernel methods provides for nonlinear corrections, as well. This extremely simple technique\nwas shown here to work remarkably well, outperforming previous score space approaches as well\nas the state of the art in multiple applications.\nIt is possible to extend the ideas here to other types of model/data energy. For example, the free\nenergy approximated in different ways is used in [1] to construct various inference algorithms for\na single scene parsing task. It may also be effective, for example, to use the terms in the Bethe\nfree energy linked to different belief propagation messages to construct the feature vectors. Finally,\nalthough we \ufb01nd that FESS outperforms the previously studied score spaces that depend on the\nderivatives, i.e. 
where ˆF is a derivative with respect to θ, the use of this derivative in (7) is, of course, possible. This allows for the construction of kernels similar to FK and TK, but derived from intractable generative models, as we show in Experiment 4 (FK in Table 2) on latent Dirichlet allocation.

Acknowledgements
We acknowledge financial support from the FET programme within the EU FP7, under the SIMBAD project (contract 213250).

References
[1] B. Frey and N. Jojic. A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(9):1392–1416, 2005.
[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J Mol Biol, 215(3):403–410, October 1990.
[3] T. L. Bailey and W. N. Grundy. Classifying proteins by family using the product of correlated p-values. In Proceedings of the Third Annual International Conference on Computational Molecular Biology, pages 10–14. ACM, 1999.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.
[5] G. Bouchard and B. Triggs. The tradeoff between generative and discriminative classifiers. In IASC International Symposium on Computational Statistics, pages 721–728, Prague, August 2004.
[6] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. The MIT Press, 2004.
[7] Z. Ghahramani. On structured variational approximations. Technical Report CRG-TR-97-1, 1997.
[8] T. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher kernel method to detect remote protein homologies. 7th Intell. Sys. Mol. Biol., pages 149–158, 1999.
[9] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. NIPS, 1998.
[10] T. Jebara, R. Kondor, and A. Howard. Probability product kernels.
Journal of Machine Learning Research, 5:819–844, 2004.
[11] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[12] S. Kapadia. Discriminative Training of Hidden Markov Models. PhD thesis, University of Cambridge, 1998.
[13] H. Kappen and W. Wiegerinck. Mean field theory for graphical models, 2001.
[14] K. Karplus, C. Barrett, and R. Hughey. Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14:846–856, 1999.
[15] J. A. Lasserre, C. M. Bishop, and T. P. Minka. Principled hybrids of generative and discriminative models. In CVPR, pages 87–94, 2006.
[16] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR, 2:2169–2178, 2006.
[17] D. MacKay. Ensemble learning for hidden Markov models. Unpublished manuscript, Department of Physics, University of Cambridge, 1997.
[18] A. McCallum, C. Pal, G. Druck, and X. Wang. Multi-conditional learning: Generative/discriminative training for clustering and classification. In Proceedings of the 21st National Conference on Artificial Intelligence, pages 433–439, 2006.
[19] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. MIT Press, 1999.
[20] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, NIPS, Cambridge, MA, 2002. MIT Press.
[21] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145–175, 2001.
[22] A. Opelt, M. Fussenegger, A. Pinz, and P. Auer. Weak hypotheses and boosting for generic object detection and recognition.
In ECCV, volume 2, pages 71–84, 2004.
[23] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[24] N. Smith and M. Gales. Speech recognition using SVMs. In NIPS, pages 1197–1204. MIT Press, 2002.
[25] N. Smith and M. Gales. Using SVMs to classify variable length speech patterns. Technical Report CUED/F-INFENG/TR.412, University of Cambridge, UK, 2002.
[26] G. G. Towell, J. W. Shavlik, and M. O. Noordewier. Refinement of approximate domain theories by knowledge-based neural networks. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 861–866, 1990.
[27] K. Tsuda, M. Kawanabe, G. Rätsch, S. Sonnenburg, and K.-R. Müller. A new discriminative kernel from probabilistic models. Neural Computation, 14(10):2397–2414, 2002.
[28] L. Deng and D. O'Shaughnessy. Speech Processing: A Dynamic and Optimization-Oriented Approach. Marcel Dekker Inc., June 2003.