{"title": "Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions", "book": "Advances in Neural Information Processing Systems", "page_first": 969, "page_last": 976, "abstract": null, "full_text": "Learning Discriminative Feature Transforms\n\nto Low Dimensions in Low Dimensions\n\nMotorola Labs, 7700 South River Parkway, MD ML28, Tempe AZ 85284, USA\n\nKari Torkkola\n\nKari.Torkkola@motorola.com http://members.home.net/torkkola\n\nAbstract\n\nThe marriage of Renyi entropy with Parzen density estimation has been\nshown to be a viable tool in learning discriminative feature transforms.\nHowever, it suffers from computational complexity proportional to the\nsquare of the number of samples in the training data. This sets a practical\nlimit to using large databases. We suggest immediate divorce of the two\nmethods and remarriage of Renyi entropy with a semi-parametric density\nestimation method, such as a Gaussian Mixture Models (GMM). This al-\nlows all of the computation to take place in the low dimensional target\nspace, and it reduces computational complexity proportional to square\nof the number of components in the mixtures. Furthermore, a conve-\nnient extension to Hidden Markov Models as commonly used in speech\nrecognition becomes possible.\n\n1 Introduction\nFeature selection or feature transforms are important aspects of any pattern recognition sys-\ntem. Optimal feature selection coupled with a particular classi\ufb01er can be done by actually\ntraining and evaluating the classi\ufb01er using all combinations of available features. Obvi-\nously this wrapper strategy does not allow learning feature transforms, because all possible\ntransforms cannot be enumerated. Both feature selection and feature transforms can be\nlearned by evaluating some criterion that re\ufb02ects the \u201cimportance\u201d of a feature or a number\nof features jointly. This is called the \ufb01lter con\ufb01guration in feature selection. 
An optimal criterion for this purpose would naturally reflect the Bayes error rate. Approximations can be used, based for example on the Bhattacharyya bound or on an interclass divergence criterion. These are usually accompanied by a parametric estimate, such as a Gaussian, of the densities at hand [6, 12]. Classical Linear Discriminant Analysis (LDA) assumes all classes to be Gaussian with a single shared covariance matrix [5]. Heteroscedastic Discriminant Analysis (HDA) extends this by allowing each of the classes to have its own covariance [9].

Maximizing a particular criterion, the joint mutual information (MI) between the features and the class labels [1, 17, 16, 13], can be shown to minimize a lower bound on the classification error [3, 10, 15]. However, MI according to the popular definition of Shannon can be computationally expensive. Evaluation of the joint MI of a number of variables is plausible through histograms, but only for a few variables [17]. As a remedy, Principe et al. showed in [4, 11, 10] that using Renyi's entropy instead of Shannon's, combined with Parzen density estimation, leads to expressions of mutual information with computational complexity O(N^2), where N is the number of samples in the training set. This method can be formulated to express the mutual information between continuous variables and discrete class labels in order to learn dimension-reducing feature transforms, both linear [15] and non-linear [14], for pattern recognition. One must note that, regarding finding the extrema, both definitions of entropy are equivalent (see [7], pages 118 and 406, and [8], page 325).

This formulation of MI evaluates the effect of each sample on every other sample in the transformed space through the Parzen density estimation kernel. This effect can also be called the “information force”. 
Thus large databases are hard to use due to the O(N^2) complexity.

To remedy this problem, and also to alleviate the difficulties of Parzen density estimation in high-dimensional spaces, we present a formulation combining the mutual information criterion based on Renyi entropy with a semi-parametric density estimation method using Gaussian Mixture Models (GMMs). In essence, Parzen density estimation is replaced by GMMs. In order to evaluate the MI, it suffices to evaluate the mutual interactions between the mixture components of the GMMs, instead of having to evaluate the interactions between all pairs of samples. We take an approach that maps an output-space GMM back to the input space and again to the output space through the adaptive feature transform. This allows all of the computation to take place in the target low-dimensional space. Computational complexity is reduced to be proportional to the square of the number of components in the mixtures.

This paper is structured as follows. An introduction is given to the maximum mutual information (MMI) formulation for discriminative feature transforms using Renyi entropy and Parzen density estimation. We discuss different strategies to reduce its computational complexity, and we present a formulation based on GMMs. Empirical results are presented using a few well known databases, and we conclude by discussing a connection to Hidden Markov Models.

2 MMI for Discriminative Feature Transforms

Given a set of training data {x_i, c_i}, where the x_i are samples of a continuous-valued random variable X, and the class labels c_i are samples of a discrete-valued random variable C, the objective is to find a transformation y = g(w, x) (or rather its parameters w) to a lower dimension, such that the mutual information (MI) I between the transformed data Y and the class labels C is maximized. The procedure is depicted in Fig. 1. To this end, we need to express I as a function of the data set, I({y_i, c_i}), in a differentiable form. Once that is done, we can perform gradient ascent on I as follows:

    w_{t+1} = w_t + eta (dI/dw) = w_t + eta sum_i (dI/dy_i)(dy_i/dw).   (1)

To derive an expression for the MI using a non-parametric density estimation method, we apply Renyi's quadratic entropy instead of Shannon's entropy, as described in [10, 15], because of its computational advantages. Estimating the density p(y) of Y as a sum of spherical Gaussians, each centered at a sample y_i, the expression of Renyi's quadratic entropy of Y is

    H(Y) = -log INT p(y)^2 dy = -log (1/N^2) sum_i sum_j G(y_i - y_j, 2 sigma^2 I).   (2)

Above, use is made of the fact that the convolution of two Gaussians is a Gaussian. 
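As an illustration of (2), Renyi's quadratic entropy of a transformed sample set can be evaluated directly from pairwise kernel interactions. The following sketch is our own illustration, not code from the paper; the function names are ours:

```python
import numpy as np

def spherical_gaussian(d2, var, dim):
    # value of an isotropic Gaussian with variance `var` at squared distance d2
    return np.exp(-d2 / (2.0 * var)) / ((2.0 * np.pi * var) ** (dim / 2.0))

def renyi_quadratic_entropy(Y, sigma=1.0):
    # H(Y) = -log (1/N^2) sum_ij G(y_i - y_j, 2*sigma^2*I), cf. Eq. (2)
    N, m = Y.shape
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)  # all pairwise squared distances
    potential = spherical_gaussian(d2, 2.0 * sigma ** 2, m).sum() / N ** 2
    return -np.log(potential)
```

Note the N^2 pairwise interactions inside the double sum; this is exactly the cost the rest of the paper sets out to remove.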
Thus Renyi's quadratic entropy can be computed as a sum of local interactions, as defined by the kernel, over all pairs of samples.

In order to use this convenient property, a measure of mutual information making use of quadratic functions of the densities is desirable. Between a discrete variable C and a continuous variable Y, such a measure has been derived in [10, 15] as follows:

    I(C, Y) = sum_c INT p(c, y)^2 dy + sum_c p(c)^2 INT p(y)^2 dy - 2 sum_c p(c) INT p(c, y) p(y) dy.   (3)

We use N_c for the number of samples in class c, y_i for the i-th sample regardless of its class, and y_{cj} for the same sample, but emphasizing that it belongs to class c, with index j within the class. Expressing the densities as their Parzen estimates with kernel width sigma results in

    I(C, Y) = (1/N^2) sum_c sum_j sum_k G(y_{cj} - y_{ck}, 2 sigma^2 I)
            + (1/N^2) sum_c (N_c/N)^2 sum_i sum_j G(y_i - y_j, 2 sigma^2 I)
            - (2/N^2) sum_c (N_c/N) sum_j sum_i G(y_{cj} - y_i, 2 sigma^2 I).   (4)

The mutual information I(C, Y) can now be interpreted as an information potential induced by samples of data in different classes. It is now straightforward to derive the partial derivatives dI/dy_i, which can accordingly be interpreted as an information force that other samples exert on sample y_i. The three components of the sum give rise to the following three components of the information force: 1) samples within the same class attract each other, 2) all samples regardless of class attract each other, and 3) samples of different classes repel each other. This force, coupled with the latter factor dy_i/dw inside the sum in (1), tends to change the transform in such a way that the samples in the transformed space move in the direction of the information force, and thus increase the MI criterion I(C, Y). 
See [15] for details.

Figure 1: Learning feature transforms by maximizing the mutual information between class labels and transformed features.

Each term in (4) consists of a double sum of Gaussians evaluated using the pairwise distances between the samples. The first component consists of a sum of these interactions within each class, the second of all interactions regardless of class, and the third of a sum of the interactions of each class against all other samples. The bulk of the computation consists of evaluating these N^2 Gaussians and forming the sums of those. The information force, the gradient of I, makes use of the same Gaussians, in addition to pairwise differences of the samples [15]. For large N, this O(N^2) complexity is a problem. Thus, the rest of the paper explores possibilities for reducing the computation to make the method applicable to large databases.

3 How to Reduce Computation?

In essence, we are trying to learn a transform that minimizes the class density overlap in the output space while trying to drive each class into a singularity. Since a kernel density estimate results in a sum of kernels over samples, a divergence measure between the densities necessarily requires O(N^2) operations. The only alternatives to reduce this complexity are either to reduce N, or to form simpler density estimates.

Two straightforward ways to achieve the former are clustering and random sampling. In this case clustering needs to be performed in the high-dimensional input space, which may be difficult and computationally expensive itself. A transform is then learned to find a representation that discriminates the cluster centers or the random samples belonging to different classes. 
Details of the densities may be lost, more so with random sampling, but at least this might bring the problem down to a computable level.

The latter alternative can be accomplished by a GMM, for example. A GMM is learned in the low-dimensional output space for each class, and now, instead of comparing samples against each other, it suffices to compare samples against the components of the GMMs. However, as the parameters of the transform are being learned iteratively, the transformed samples y_i will change at each iteration, and the GMMs need to be estimated again. There is no guarantee that the change to the transform and to the y_i is so small that a simple re-estimation based on the previous GMMs would suffice. However, this depends on the optimization method used.

A further step in reducing the computation is to compare the GMMs of different classes in the output space against each other, instead of comparing the actual samples. In addition to the inconvenience of re-estimation, we now lack the notion of “mapping”. Nothing is being transformed by g from the input space to the output space such that we could change the transform in order to increase the MI criterion. Although it would now be possible to evaluate the effect of each sample on each mixture component, and the effect of each component on the MI, due to the double summing, we will pursue the mapping strategy outlined in the following section.

4 Two GMM Mapping Strategies

IO-mapping. If a GMM is available in the high-dimensional input space, its components can be directly mapped into the output space by the transform. Let us call this case the IO-mapping. We consider from now on only linear transforms, y = Wx. 
Writing the density of class c as a GMM with M_c mixture components, with w_cj as their mixture weights, we get

    p(x | c) = sum_j w_cj G(x - mu_cj, Sigma_cj).   (5)

The transformed density in the low-dimensional output space is then simply

    p(y | c) = sum_j w_cj G(y - W mu_cj, W Sigma_cj W^T).   (6)

Now, the mutual information in the output space between the class labels and the densities given by the transformed GMMs can be expressed as a function of W, and it will be possible to evaluate its gradient dI/dW to insert into (1). A great advantage of this strategy is that once the input-space GMMs have been created (by the EM-algorithm, for example), the actual training data need not be touched at all during optimization! This is thus a very viable approach if the GMMs are already available in the high-dimensional input space (see Section 7), or if it is not too expensive or impossible to estimate them using the EM-algorithm. However, this might not be the case.

OIO-mapping. An alternative is to construct a GMM model for the training data in the low-dimensional output space. Since getting there requires a transform, the GMM is constructed after having transformed the data using, for example, a random or an informed guess as the transform. The density estimated from the samples in the output space for class c is

    p(y | c) = sum_j w_cj G(y - m_cj, S_cj).   (7)

Once the output-space GMM is constructed, the same samples are used to construct a GMM in the input space using the exact same assignments of samples to mixture components as the output-space GMMs have. 
Running the EM-algorithm in the input space is now unnecessary, since we know which samples belong to which mixture components. A similar strategy has been used to learn GMMs in high-dimensional spaces [2]. Let us now use the notation of Eq. (5) to denote this density also in the input space. As a result, we have GMMs in both spaces and a transform mapping between the two. The transform can be learned as in the IO-mapping, by using the equalities m_cj = W mu_cj and S_cj = W Sigma_cj W^T. This case will be called the OIO-mapping. The biggest advantage is that we now avoid operating in the high-dimensional input space at all, not even the one time at the beginning of the procedure.

5 Learning the Transform through Mapped GMMs

We now present the derivation of the adaptation equations for a linear transform; they apply to either mapping. The first step is to express the MI as a function of the GMM that is constructed in the output space. This GMM is a function of the transform matrix W through the mapping of the input-space GMM to the output-space GMM. The second step is to compute its gradient dI/dW and to make use of it in the first half of Equation (1).

5.1 Information Potential as a Function of GMMs

Let us denote the three terms in (3) as V_IN, V_ALL, and V_BTW, so that I = V_IN + V_ALL - 2 V_BTW. The GMM in the output space for each class is already expressed in (7). We need the following equalities: p(c) = N_c/N denotes the class prior, and the density of all data in the output space is the mixture p(y) = sum_c p(c) p(y | c) = sum_c sum_j p(c) w_cj G(y - m_cj, S_cj). 
Now we can write V_IN, V_ALL, and V_BTW in a convenient form. Using again the fact that the convolution of two Gaussians is a Gaussian, for example

    V_IN = sum_c INT p(c, y)^2 dy = sum_c p(c)^2 sum_i sum_j w_ci w_cj G(m_ci - m_cj, S_ci + S_cj).   (8)

To compact the notation, we change the indexing and make the substitutions alpha_r = p(c) w_cj, m_r = m_cj, and S_r = S_cj, where r = 1, ..., K runs over the mixture components of all classes and K is the total number of mixture components. Then

    V_ALL = (sum_c p(c)^2) sum_r sum_s alpha_r alpha_s G(m_r - m_s, S_r + S_s),
    V_BTW = sum_c p(c)^2 sum_j w_cj sum_s alpha_s G(m_cj - m_s, S_cj + S_s).   (9)

5.2 Gradient of the Information Potential

As each Gaussian mixture component in the output space is now a function of the corresponding input-space component and the transform matrix W, with m_r = W mu_r and S_r = W Sigma_r W^T, it is straightforward (albeit tedious) to write the gradient dI/dW, since each of the three terms in I is composed of different sums of the componentwise convolutions. Each G(m_r - m_s, S_r + S_s) expresses the convolution of two mixture components in the output space. As we also have those components in the high-dimensional input space, the gradient expresses how this convolution in the output space changes as W, the transform that maps the mixture components to the output space, is being changed. The mutual information measure is defined in terms of these convolutions, and maximizing it tends to find a W that (crudely stated) minimizes these convolutions between classes and maximizes them within classes. The desired gradient of the Gaussian with respect to the transform matrix is as follows. Writing d = W(mu_r - mu_s) and S = W(Sigma_r + Sigma_s)W^T,

    d/dW G(d, S) = G(d, S) [ -S^{-1} d (mu_r - mu_s)^T + (S^{-1} d d^T S^{-1} - S^{-1}) W (Sigma_r + Sigma_s) ].   (10)

The total gradient dI/dW can now be obtained simply by replacing each G(m_r - m_s, S_r + S_s) in (8) and (9) by the above gradient.   (11)

In evaluating I, the bulk of the computation is in evaluating the K^2 componentwise convolutions. 
Computational complexity is now O(K^2). In addition, the gradient (10) requires pairwise sums and differences of the mixture parameters in the input space, but these need only be computed once.

6 Empirical Results

The first step in evaluating this approach is to compare its performance to the computationally more expensive MMI feature transforms that use Parzen density estimation. To this end, we repeated the pattern recognition experiments of [15] using exactly the same LVQ-classifier. These experiments were done using five publicly available databases that are very different in terms of the amount of data, the dimension of the data, and the number of training instances. For details of the data sets, please see [15]. The OIO-mapping was used with 3-5 diagonal Gaussians per class to learn a dimension-reducing linear transform. Gradient ascent was used for optimization1. Results are presented in Tables 1 - 5. The last column denotes the original dimensionality of the data set.

As a figure of the overall performance, the average over all five databases and all reduced dimensions, which ranged from one up to the original dimension minus one, was 69.6% for PCA, 77.8% for the MMI-Parzen combination, and 77.0% for the MMI-GMM combination (30 tests altogether). For LDA this figure cannot be calculated, since some databases had a small number of classes N_c, and LDA can only produce N_c - 1 features. The results are very satisfactory, since the best we could hope for is performance equal to the MMI-Parzen combination. Thus a very significant reduction in computation caused only a minor drop in performance with this classifier.

7 Discussion

We have presented a method to learn discriminative feature transforms using Maximum Mutual Information as the criterion. 
Formulating MI using Renyi entropy, with Gaussian Mixture Models as a semi-parametric density estimation method, allows all of the computation to take place in the low-dimensional transform space. Compared to the previous formulation using Parzen density estimation, large databases now become a possibility.

A convenient extension to Hidden Markov Models (HMMs), as commonly used in speech recognition, also becomes possible. Given an HMM-based speech recognition system, the state discrimination can be enhanced by learning a linear transform from some high-dimensional collection of features to a convenient dimension. Existing HMMs can be converted to these high-dimensional features using so-called single-pass retraining (compute all probabilities using the current features, but do the re-estimation using the high-dimensional set of features). Now a state-discriminative transform to a lower dimension can be learned using the method presented in this paper. Another round of single-pass retraining then converts the existing HMMs to the new discriminative features.

A further advantage of the method in speech recognition is that the state separation in the transformed output space is measured in terms of the separability of the data represented as Gaussian mixtures, not in terms of the data itself (actual samples). This should be advantageous regarding recognition accuracies, since HMMs have the same exact structure.

1Example video clips can be viewed at http://members.home.net/torkkola/mmi.

Table 1: Accuracy on the Phoneme test data set using LVQ classifier.

Output dimension    1     2     3     4     6     9     20
PCA                7.6   70.0  76.8  81.1  84.2  87.3  90.0
LDA                5.1   66.0  74.7  80.2  82.8  86.0  -
MMI-Parzen         15.5  68.5  75.2  80.2  82.6  85.3  -
MMI-GMM            21.4  70.4  76.8  80.2  82.6  87.7  -

Table 2: Accuracy on the Landsat test data set using LVQ classifier.

Output dimension    1     2     3     4     9     15    36
PCA                41.2  81.5  85.8  87.8  89.4  90.3  90.4
LDA                42.5  75.7  86.2  87.2  88.8  90.0  -
MMI-Parzen         65.1  82.0  86.4  86.2  87.6  89.5  -
MMI-GMM            65.0  80.4  86.1  88.3  87.4  89.1  -

Table 3: Accuracy on the Letter test data set using LVQ classifier.

Output dimension    1     2     3     4     6     8     16
PCA                4.5   16.0  36.0  53.2  75.2  82.5  92.4
LDA                13.4  38.0  53.1  68.1  80.3  86.3  -
MMI-Parzen         16.4  50.3  62.8  70.9  82.4  88.6  -
MMI-GMM            15.7  42.4  48.3  68.5  80.9  86.6  -

Table 4: Accuracy on the Pipeline data set using LVQ classifier.

Output dimension    1     2     3     4     5     7     12
PCA                41.5  88.0  87.8  89.7  96.4  97.2  99.0
LDA                98.4  98.8  -     -     -     -     -
MMI-Parzen         99.4  99.1  98.9  99.2  98.9  99.0  -
MMI-GMM            91.3  98.8  99.1  98.9  99.1  98.7  -

Table 5: Accuracy on the Pima data set using LVQ classifier.

Output dimension    1     2     3     4     5     6     8
PCA                64.4  73.0  75.2  74.1  75.6  74.7  74.7
LDA                65.8  -     -     -     -     -     -
MMI-Parzen         72.0  77.5  78.7  78.5  78.3  78.3  -
MMI-GMM            73.9  79.7  79.4  77.9  76.7  77.5  -

References

[1] R. Battiti. Using mutual information for selecting features in supervised neural net learning. Neural Networks, 5(4):537-550, July 1994.

[2] Sanjoy Dasgupta. Experiments with random projection. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 143-151, Stanford, CA, June 30 - July 3, 2000.

[3] R.M. Fano. 
Transmission of Information: A Statistical Theory of Communications. Wiley, New York, 1961.

[4] J.W. Fisher III and J.C. Principe. A methodology for information theoretic feature extraction. In Proc. of the IEEE World Congress on Computational Intelligence, pages 1712-1716, Anchorage, Alaska, May 4-9, 1998.

[5] K. Fukunaga. Introduction to Statistical Pattern Recognition (2nd edition). Academic Press, New York, 1990.

[6] Xuan Guorong, Chai Peiqi, and Wu Minhui. Bhattacharyya distance feature selection. In Proceedings of the 13th International Conference on Pattern Recognition, volume 2, pages 195-199. IEEE, 25-29 Aug. 1996.

[7] J.N. Kapur. Measures of Information and Their Applications. Wiley, New Delhi, India, 1994.

[8] J.N. Kapur and H.K. Kesavan. Entropy Optimization Principles with Applications. Academic Press, San Diego, London, 1992.

[9] Nagendra Kumar and Andreas G. Andreou. Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Communication, 26(4):283-297, 1998.

[10] J.C. Principe, J.W. Fisher III, and D. Xu. Information theoretic learning. In Simon Haykin, editor, Unsupervised Adaptive Filtering. Wiley, New York, NY, 2000.

[11] J.C. Principe, D. Xu, and J.W. Fisher III. Pose estimation in SAR using an information-theoretic criterion. In Proc. SPIE98, 1998.

[12] George Saon and Mukund Padmanabhan. Minimum Bayes error feature selection for continuous speech recognition. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 800-806. MIT Press, 2001.

[13] Janne Sinkkonen and Samuel Kaski. Clustering based on conditional distributions in an auxiliary space. Neural Computation, 14:217-239, 2002.

[14] Kari Torkkola. 
Nonlinear feature transforms using maximum mutual information. In Proceedings of the IJCNN, pages 2756-2761, Washington, DC, USA, July 15-19, 2001.

[15] Kari Torkkola and William Campbell. Mutual information in learning feature transformations. In Proceedings of the 17th International Conference on Machine Learning, pages 1015-1022, Stanford, CA, USA, June 29 - July 2, 2000.

[16] N. Vlassis, Y. Motomura, and B. Krose. Supervised dimension reduction of intrinsically low-dimensional data. Neural Computation, 14(1), January 2002.

[17] H. Yang and J. Moody. Feature selection based on joint mutual information. In Proceedings of the International ICSC Symposium on Advances in Intelligent Data Analysis, Rochester, New York, June 22-25, 1999.
", "award": [], "sourceid": 2028, "authors": [{"given_name": "Kari", "family_name": "Torkkola", "institution": null}]}