{"title": "Multiplicative Updates for Classification by Mixture Models", "book": "Advances in Neural Information Processing Systems", "page_first": 897, "page_last": 904, "abstract": null, "full_text": "Multiplicative Updates for Classi\ufb01cation\n\nby Mixture Models\n\n Department of Computer and Information Science\n\nLawrence K. Saul and Daniel D. Lee\u0001\n\u0001 Department of Electrical Engineering\n\nUniversity of Pennsylvania, Philadelphia, PA 19104\n\nAbstract\n\nWe investigate a learning algorithm for the classi\ufb01cation of nonnegative data by\nmixture models. Multiplicative update rules are derived that directly optimize\nthe performance of these models as classi\ufb01ers. The update rules have a simple\nclosed form and an intuitive appeal. Our algorithm retains the main virtues of\nthe Expectation-Maximization (EM) algorithm\u2014its guarantee of monotonic im-\nprovement, and its absence of tuning parameters\u2014with the added advantage of\noptimizing a discriminative objective function. The algorithm reduces as a spe-\ncial case to the method of generalized iterative scaling for log-linear models. The\nlearning rate of the algorithm is controlled by the sparseness of the training data.\nWe use the method of nonnegative matrix factorization (NMF) to discover sparse\ndistributed representations of the data. This form of feature selection greatly\naccelerates learning and makes the algorithm practical on large problems. Ex-\nperiments show that discriminatively trained mixture models lead to much better\nclassi\ufb01cation than comparably sized models trained by EM.\n\n1\n\nIntroduction\n\nMixture models[11] have been widely applied to problems in classi\ufb01cation. In these prob-\n\nlabeled examples. 
Mixture models are typically used to parameterize class-conditional dis-\n\nlems, one must learn a decision rule mapping feature vectors (\u0002\n\u0003\u0007\u000b\ntributions, \u0005\u0007\u0006\t\b\n\u0002\n\u000f\u0018\u0017\n\u000e\u0010\u000f\u0012\u0011\u0014\u0013\u0016\u0015\n\u0005\u0007\u0006\t\b\n\u0002\n\n\u0003 ) to class labels (\u0004 ) given\n\f , from Bayes rule.\n\u0004\r\f , and then to compute posterior probabilities, \u0005\u0007\u0006\r\b\n\f , summed over training examples (indexed by \u0019 ). A virtue of this algo-\n\nParameter estimation in these models is handled by an Expectation-Maximization (EM)\nalgorithm[3], a learning procedure that monotonically increases the joint log likelihood,\n\nrithm is that it does not require the setting of learning rates or other tuning parameters.\n\n\u000f\u0012\u0011\u0014\u0013\u0016\u0015\n\nA weakness of the above approach is that the model parameters are optimized by maxi-\nmum likelihood estimation, as opposed to a discriminative criterion more closely related\nto classi\ufb01cation error[14]. In this paper, we derive multiplicative update rules for the pa-\nrameters of mixture models that directly maximize the discriminative objective function,\n\n\u0005\u0007\u0006\t\b\u001a\u0004\n\n\f . This objective function measures the conditional log likelihood that\n\nthe training examples are correctly classi\ufb01ed. Our update rules retain the main virtues of\nthe EM algorithm\u2014its guarantee of monotonic improvement, and its absence of tuning\nparameters\u2014with the added advantage of optimizing a discriminative cost function. They\nalso have a simple closed form and appealing intuition. 
The proof of convergence combines ideas from the EM algorithm [3] and methods for generalized and improved iterative scaling [2, 4].

The approach in this paper is limited to the classification of nonnegative data, since from the constraint of nonnegativity emerges an especially simple learning algorithm. This limitation, though, is not too severe. An abundance of interesting data occurs naturally in this form: for example, the pixel intensities of images, the power spectra of speech, and the word-document counts of text. Real-valued data can also be coerced into this form by addition or exponentiation. Thus we believe the algorithm has broad applicability.

2 Mixture models as generative models

Mixture models are typically used as generative models to parameterize probability distributions over feature vectors \(x\). Different mixture models are used to model different classes of data. The parameterized distributions take the form:

\[ P(x|y) = \sum_c W_{yc}\, f_{yc}(x), \tag{1} \]

where the rows of the nonnegative weight matrix \(W\) are constrained to sum to unity, \(\sum_c W_{yc} = 1\), and the basis functions \(f_{yc}(x)\) are properly normalized distributions, such that \(\int dx\, f_{yc}(x) = 1\) for all \(y\) and \(c\). The model can be interpreted as the latent variable model,

\[ P(x|y) = \sum_c P(c|y)\, P(x|c, y), \tag{2} \]

where the discrete latent variable \(c\) indicates which mixture component is used to generate the observed variable \(x\). In this setting, one identifies \(W_{yc} = P(c|y)\) and \(f_{yc}(x) = P(x|c, y)\). The basis functions, usually chosen from the exponential family, define "bumps" of high probability in the feature space. A popular choice is the multivariate Gaussian distribution:

\[ f_{yc}(x) = \frac{1}{|2\pi \Sigma_{yc}|^{1/2}} \exp\!\left[ -\tfrac{1}{2}\, (x - \mu_{yc})^\top \Sigma_{yc}^{-1} (x - \mu_{yc}) \right], \tag{3} \]

with means \(\mu_{yc}\) and covariance matrices \(\Sigma_{yc}\). Gaussian distributions are extremely versatile, but not always the most appropriate. For sparse nonnegative data, a more natural choice is the exponential distribution:

\[ f_{yc}(x) = \prod_d \lambda_{ycd}\, e^{-\lambda_{ycd}\, x_d}, \tag{4} \]

with positive rate parameters \(\lambda_{ycd}\). Here, the value of \(d\) indexes the elements of \(x\). The parameters of these basis functions must be estimated from data.

Generative models can be viewed as a prototype method for classification, with the parameters of each mixture component defining a particular basin of attraction in the feature space. Intuitively, patterns are labeled by the most similar prototype, chosen from among all possible classes. Formally, unlabeled examples are classified by computing posterior probabilities from Bayes' rule,

\[ P(y|x) = \frac{P(x|y)\, P(y)}{\sum_{y'} P(x|y')\, P(y')}, \tag{5} \]

where \(P(y)\) denotes the prior probability of each class. Examples are classified by the label with the highest posterior probability.

An Expectation-Maximization (EM) algorithm can be used to estimate the parameters of mixture models. The EM algorithm optimizes the joint log likelihood,

\[ \mathcal{L}_{\mathrm{EM}} = \sum_n \log\!\left[ P(y_n)\, P(x_n | y_n) \right], \tag{6} \]

summed over training examples. If basis functions are not shared across different classes, then the parameter estimation for \(P(x|y)\) can be done independently for each class label \(y\). This has the tremendous advantage of decomposing the original learning problem into several smaller problems.
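As a concrete illustration of eqs. (1)-(5), a Bayes-rule classifier with class-conditional exponential mixtures can be sketched as follows (a minimal sketch, not the authors' code; the class priors, rate parameters, and test points are invented for the example):

```python
import numpy as np

def exp_mixture_likelihood(X, W, lam):
    """P(x|y) = sum_c W[c] * f_c(x), with exponential basis functions
    f_c(x) = prod_d lam[c,d] * exp(-lam[c,d] * x_d) as in eq. (4)."""
    log_f = np.log(lam).sum(axis=1)[None, :] - X @ lam.T   # (n, C) log densities
    return np.exp(log_f) @ W                               # (n,) mixture likelihoods

def classify(X, priors, Ws, lams):
    """Posterior P(y|x) by Bayes' rule (eq. 5); label = argmax over y."""
    like = np.stack([exp_mixture_likelihood(X, Ws[y], lams[y])
                     for y in range(len(priors))], axis=1)
    joint = like * np.asarray(priors)[None, :]             # P(x|y) P(y)
    post = joint / joint.sum(axis=1, keepdims=True)
    return post, post.argmax(axis=1)

# Toy setup: two classes, two mixture components each, 3-dim nonnegative features.
priors = [0.5, 0.5]
Ws = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]          # rows sum to unity
lams = [np.array([[4.0, 4.0, 4.0], [5.0, 5.0, 5.0]]),      # class 0: small x typical
        np.array([[0.2, 0.2, 0.2], [0.3, 0.3, 0.3]])]      # class 1: large x typical
X = np.array([[0.1, 0.1, 0.1], [8.0, 9.0, 7.0]])
post, labels = classify(X, priors, Ws, lams)
print(labels)  # -> [0 1]
```

The densities are evaluated in the log domain to avoid underflow; the exponential rates make small feature vectors likely under class 0 and large ones likely under class 1.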
Moreover, for many types of basis functions, the EM updates have a simple closed form and are guaranteed to improve the joint log likelihood at each iteration. These properties account for the widespread use of mixture models as generative models.

3 Mixture models as discriminative models

Mixture models can also be viewed as purely discriminative models. In this view, their purpose is simply to provide a particular way of parameterizing the posterior distribution, \(P(y|x)\). In this paper, we study posterior distributions of the form:

\[ P(y|x) = \frac{\sum_c W_{yc}\, f_{yc}(x)}{\sum_{y'c'} W_{y'c'}\, f_{y'c'}(x)}. \tag{7} \]

The right hand side of this equation defines a valid posterior distribution provided that the mixture weights \(W_{yc}\) and basis functions \(f_{yc}(x)\) are nonnegative. Note that for this interpretation, the mixture weights and basis functions do not need to satisfy the more stringent normalization constraints of generative models. We will deliberately exploit this freedom, an idea that distinguishes our approach from previous work on discriminatively trained mixture models [6] and hidden Markov models [5, 12]. In particular, the unnormalized basis functions we use are able to parameterize "saddles" and "valleys" in the feature space, as well as the "bumps" of normalized basis functions. This makes them more expressive than their generative counterparts: examples can not only be attracted to prototypes, but also repelled by opposites.

The posterior distributions in eq. (7) must be further specified by parameterizing the basis functions \(f_{yc}(x)\) as a function of \(x\). We study basis functions of the form

\[ f_{yc}(x) = e^{\theta_{yc} \cdot \Phi(x)}, \tag{8} \]

where \(\theta_{yc}\) denotes a real-valued vector and \(\Phi(x)\) denotes a nonnegative and possibly "expanded" representation [14] of the original feature vector. The exponential form in eq. (8) allows us to recover certain generative models as a special case. For example, consider the multivariate Gaussian distribution in eq. (3). By defining the "quadratically expanded" feature vector:

\[ \Phi(x) = \left( 1,\; x_1, \ldots, x_D,\; x_1 x_1,\; x_1 x_2, \ldots,\; x_D x_D \right), \tag{9} \]

we can equate the basis functions in eqs. (3) and (8) by choosing the parameter vectors \(\theta_{yc}\) to act on \(\Phi(x)\) in the same way that the means \(\mu_{yc}\) and covariance matrices \(\Sigma_{yc}\) act on \(x\). The exponential distributions in eq. (4) can be recovered in a similar way. Such generative models provide a cheap way to initialize discriminative models for further training.

4 Learning algorithm

Our learning algorithm directly optimizes the performance of the models in eq. (7) as classifiers.
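To make eqs. (8)-(9) concrete, the sketch below (invented parameter values, not code from the paper) checks numerically that a parameter vector \(\theta\) acting on the quadratically expanded \(\Phi(x)\) reproduces the exponent of a multivariate Gaussian:

```python
import numpy as np

def phi(x):
    """Quadratically 'expanded' representation (eq. 9): constant, linear,
    and pairwise product terms; nonnegative whenever x is nonnegative."""
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

def gaussian_theta(mu, cov):
    """Choose theta so that theta . phi(x) = -0.5 (x-mu)^T cov^{-1} (x-mu)."""
    A = np.linalg.inv(cov)
    return np.concatenate(([-0.5 * mu @ A @ mu], A @ mu, (-0.5 * A).ravel()))

mu = np.array([1.0, 2.0])
cov = np.array([[2.0, 0.5], [0.5, 1.0]])
theta = gaussian_theta(mu, cov)
x = np.array([0.3, 0.7])                      # nonnegative feature vector
lhs = theta @ phi(x)                          # basis-function exponent of eq. (8)
rhs = -0.5 * (x - mu) @ np.linalg.inv(cov) @ (x - mu)
print(np.isclose(lhs, rhs))  # -> True
```

The Gaussian's normalization constant can be absorbed into the mixture weight, since the discriminative parameterization does not require normalized basis functions.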
The objective function we use for discriminative training is the conditional log likelihood,

\[ \mathcal{L} = \sum_n \log P(y_n | x_n), \tag{10} \]

summed over training examples. Let \(Y\) denote the binary matrix whose \((n, y)\) element \(Y_{ny}\) denotes whether the \(n\)th training example belongs to the \(y\)th class. Then we can write the objective function as the difference of two terms, \(\mathcal{L} = \mathcal{L}_+ - \mathcal{L}_-\), where:

\[ \mathcal{L}_+ = \sum_n \log\!\left[ \sum_{yc} Y_{ny}\, W_{yc}\, e^{\theta_{yc} \cdot \Phi_n} \right], \tag{11} \]

\[ \mathcal{L}_- = \sum_n \log\!\left[ \sum_{yc} W_{yc}\, e^{\theta_{yc} \cdot \Phi_n} \right], \tag{12} \]

with \(\Phi_n = \Phi(x_n)\). The competition between these terms gives rise to a scenario of contrastive learning. It is the subtracted term, \(\mathcal{L}_-\), which distinguishes the conditional log likelihood optimized by discriminative training from the joint log likelihood optimized by EM.

Our learning algorithm works by alternately updating the mixture weights and the basis function parameters. Here we simply present the update rules for these parameters; a derivation and proof of convergence are given in the appendix. It is easiest to write the basis function updates in terms of the nonnegative parameters \(e^{\theta_{ycd}}\). The updates then take the simple multiplicative form:

\[ W_{yc} \longleftarrow W_{yc} \left( \frac{\partial \mathcal{L}_+ / \partial W_{yc}}{\partial \mathcal{L}_- / \partial W_{yc}} \right), \tag{13} \]

\[ e^{\theta_{ycd}} \longleftarrow e^{\theta_{ycd}} \left( \frac{\partial \mathcal{L}_+ / \partial \theta_{ycd}}{\partial \mathcal{L}_- / \partial \theta_{ycd}} \right)^{1/\Phi}, \quad \text{where } \Phi = \max_n \sum_d \Phi_{nd}. \tag{14} \]

It is straightforward to compute the gradients in these ratios and show that they are always nonnegative. (This is a consequence of the nonnegativity constraint on the feature vectors: \(\Phi_{nd} \ge 0\) for all examples \(n\) and feature components \(d\).) Thus, the nonnegativity constraints on the mixture weights and basis functions are enforced by these multiplicative updates.

The learning rate is controlled by the ratios of these gradients and, additionally for the basis function updates, by the exponent \(1/\Phi\), which measures the sparseness of the training data. The value of \(\Phi\) is the maximum sum of features that occurs in the training data. Thus, sparse feature vectors lead to faster learning, a crucial point to which we will return shortly.

The updates have a simple intuition [9] based on balancing opposing terms in the gradient of the conditional log likelihood. In particular, note that the fixed points of this update rule occur at stationary points of the conditional log likelihood, that is, where \(\partial \mathcal{L}_+ / \partial \theta = \partial \mathcal{L}_- / \partial \theta\), or equivalently, where \(\partial \mathcal{L} / \partial \theta = 0\).

It is worth comparing these multiplicative updates to others in the literature. Jebara and Pentland [6] derived similar updates for mixture weights, but without emphasizing the special form of eq. (13). Others have investigated multiplicative updates by the method of exponentiated gradients (EG) [7]. Our updates do not have the same form as EG updates: in particular, note that the gradients in eqs. (13–14) are not exponentiated.
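The updates of eqs. (13)-(14) are easy to implement. Below is a minimal sketch (toy data and variable names are invented; this is not the authors' code) that alternately applies the two updates and checks that the conditional log likelihood never decreases:

```python
import numpy as np

def scores(Phi, W, Theta):
    """S[n,y,c] = W[y,c] * exp(theta_{yc} . Phi_n), the unnormalized terms of eq. (7)."""
    return W[None] * np.exp(np.einsum('nd,ycd->nyc', Phi, Theta))

def cll(Phi, y, W, Theta):
    """Conditional log likelihood (eq. 10): L = L+ - L-."""
    S = scores(Phi, W, Theta)
    return (np.log(S[np.arange(len(y)), y].sum(axis=1)).sum()
            - np.log(S.sum(axis=(1, 2))).sum())

def step(Phi, y, W, Theta):
    """One pass of the multiplicative updates, eqs. (13)-(14), applied alternately."""
    n, Y = len(y), W.shape[0]
    Yb = np.zeros((n, Y)); Yb[np.arange(n), y] = 1.0       # binary label matrix
    phimax = Phi.sum(axis=1).max()                         # maximum sum of features
    # eq. (13): multiply W by the ratio of the gradients of L+ and L-
    S = scores(Phi, W, Theta)
    alpha = S[np.arange(n), y]; alpha /= alpha.sum(axis=1, keepdims=True)
    beta = S / S.sum(axis=(1, 2), keepdims=True)
    W = W * np.einsum('ny,nc->yc', Yb, alpha) / beta.sum(axis=0)
    # eq. (14): multiply e^Theta by the gradient ratio raised to 1/phimax
    S = scores(Phi, W, Theta)
    alpha = S[np.arange(n), y]; alpha /= alpha.sum(axis=1, keepdims=True)
    beta = S / S.sum(axis=(1, 2), keepdims=True)
    num = np.einsum('ny,nc,nd->ycd', Yb, alpha, Phi)       # dL+/dTheta
    den = np.einsum('nyc,nd->ycd', beta, Phi)              # dL-/dTheta
    return W, Theta + np.log(num / den) / phimax

rng = np.random.default_rng(0)
Phi = np.vstack([rng.uniform(0.1, 0.5, (20, 3)),           # nonnegative features
                 rng.uniform(0.5, 1.0, (20, 3))])
y = np.array([0] * 20 + [1] * 20)
W = np.full((2, 2), 0.5)
Theta = rng.normal(0.0, 0.1, (2, 2, 3))
hist = [cll(Phi, y, W, Theta)]
for _ in range(50):
    W, Theta = step(Phi, y, W, Theta)
    hist.append(cll(Phi, y, W, Theta))
print(all(b >= a - 1e-9 for a, b in zip(hist, hist[1:])))  # monotone improvement
```

Note that no learning rate is set anywhere: the step size is determined entirely by the gradient ratios and the sparseness exponent.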
If we use one basis function per class and an identity matrix for the mixture weights, then the updates reduce to the method of generalized iterative scaling [2] for logistic or multinomial regression (also known as maximum entropy modeling). More generally, though, our multiplicative updates can be used to train much more powerful classifiers based on mixture models.

5 Feature selection

[Figure 1. Left: nonnegative basis vectors for handwritten digits discovered by NMF. Right: sparse feature vector for a handwritten "2". The basis vectors are ordered by their contribution to this image.]

As previously mentioned, the learning rate for the basis function parameters is controlled by the sparseness of the training data. If this data is not intrinsically sparse, then the multiplicative updates in eqs. (13–14) can be impractically slow (just as the method of iterative scaling).
In this case, it is important to discover sparse distributed representations of the data that encode the same information. On large problems, such representations can accelerate learning by several orders of magnitude.

The search for sparse distributed representations can be viewed as a form of feature selection. We have observed that suitably sparse representations can be discovered by the method of nonnegative matrix factorization (NMF) [8]. Let the raw nonnegative (and possibly nonsparse) data be represented by the \(D \times N\) matrix \(V\), where \(D\) is its raw dimensionality and \(N\) is the number of training examples. Algorithms for NMF yield a factorization \(V \approx BH\), where \(B\) is a \(D \times r\) nonnegative matrix and \(H\) is an \(r \times N\) nonnegative matrix. In this factorization, the columns of \(B\) are interpreted as basis vectors, and the columns of \(H\) as coefficients (or new feature vectors). These coefficients are typically very sparse, because the nonnegative basis vectors can only be added in a constructive way to approximate the original data.

The effectiveness of NMF is best illustrated by example. We used the method to discover sparse distributed representations of the MNIST data set of handwritten digits [10]. The data set has 60000 training and 10000 test examples that were deslanted and cropped to form \(20 \times 20\) grayscale pixel images. The raw training data was therefore represented by a \(400 \times 60000\) matrix. The left plot of Fig. 1 shows the \(r = 80\) basis vectors discovered by NMF, each plotted as a \(20 \times 20\) image. Most of these basis vectors resemble strokes, only a fraction of which are needed to reconstruct any particular image in the training set. For example, only about twenty basis vectors make an appreciable contribution to the handwritten "2" shown in the right plot of Fig. 1. The method of NMF thus succeeds in discovering a highly sparse representation of the original images.

6 Results

Models were evaluated on the problem of recognizing handwritten digits from the MNIST data set. From the grayscale pixel images, we generated two sets of feature vectors: one by NMF, with nonnegative features and dimensionality \(r = 80\); the other, by principal components analysis (PCA), with real-valued features and dimensionality \(r = 40\). These reduced dimensionality feature vectors were used for both training and testing.

Baseline mixture models for classification were trained by EM algorithms. Gaussian mixture models with diagonal covariance matrices were trained on the PCA features, while exponential mixture models (as in eq. (4)) were trained on the NMF features. The mixture models were trained for up to 64 iterations of EM, which was sufficient to ensure a high degree of convergence. Seven baseline classifiers were trained on each feature set, with different numbers of mixture components per digit (\(m = 1, 2, 4, 8, 16, 32, 64\)). The error rates of these models, indicated by EM-PCA40 and EM-NMF80, are shown in Table 1. Half as many PCA features were used as NMF features so as to equalize the number of fitted parameters in different basis functions.

Mixture models on the NMF features were also trained discriminatively by the multiplicative updates in eqs. (13–14). Models with varying numbers of mixture components per digit (\(m = 1, 2, 4, 8, 16\)) were trained by 1000 iterations of these updates.
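The NMF preprocessing used here can be sketched with the multiplicative factorization updates of Lee and Seung [8, 9] (a toy illustration with invented matrix sizes, not the 60000-image MNIST computation):

```python
import numpy as np

def nmf(V, r, iters=200, eps=1e-12, seed=0):
    """Factor nonnegative V (D x N) as B @ H with B (D x r) and H (r x N),
    using the multiplicative updates for squared reconstruction error [9]."""
    rng = np.random.default_rng(seed)
    D, N = V.shape
    B = rng.uniform(0.1, 1.0, (D, r))
    H = rng.uniform(0.1, 1.0, (r, N))
    for _ in range(iters):
        H *= (B.T @ V) / (B.T @ B @ H + eps)   # eps guards against division by zero
        B *= (V @ H.T) / (B @ H @ H.T + eps)
    return B, H

# Toy "images": purely additive combinations of 4 nonnegative parts.
rng = np.random.default_rng(1)
V = rng.uniform(0, 1, (16, 4)) @ rng.uniform(0, 1, (4, 100))
err0 = np.linalg.norm(V - np.full(V.shape, V.mean())) / np.linalg.norm(V)
B, H = nmf(V, r=4)
err = np.linalg.norm(V - B @ H) / np.linalg.norm(V)
print((B >= 0).all() and (H >= 0).all(), err < err0)
```

The columns of `H` play the role of the new feature vectors; because the parts combine only additively, most coefficients for any one example tend toward zero.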
Again, this was sufficient to ensure a high degree of convergence; there was no effort at early stopping. The models were initialized by setting the mixture weights \(W_{yc} = 1\) and by initializing the parameter vectors \(\theta_{yc}\) from randomly selected feature vectors. The results of these experiments, indicated by DT-NMF80, are also shown in Table 1. The results show that the discriminatively trained models classify much better than comparably sized models trained by EM. The ability to learn more compact classifiers appears to be the major advantage of discriminative training. A slight disadvantage is that the resulting classifiers are more susceptible to overtraining.

         EM-PCA40        EM-NMF80        DT-NMF80
  m     e_tr   e_te     e_tr   e_te     e_tr   e_te
  1     10.1   10.2     14.7   15.7      5.5    5.8
  2      8.3    8.5     10.7   12.3      4.0    4.4
  4      6.4    6.8      8.2    9.3      2.8    3.5
  8      5.1    5.3      7.0    7.8      1.7    3.2
 16      4.0    4.4      5.7    6.2      1.0    3.4
 32      3.1    3.6      5.0    5.1
 64      1.9    3.1      3.9    4.2

Table 1: Classification error rates (%) on the training set (e_tr) and the test set (e_te) for mixture models with different numbers of mixture components per digit (m). Models in the same row have roughly the same number of fitted parameters.

It is instructive to compare our results to other benchmarks on this data set [10]. Without making use of prior knowledge, better error rates on the test set have been obtained by support vector machines, k-nearest neighbor classifiers, and fully connected multilayer neural networks. These results, however, either required storing large numbers of training examples or training significantly larger models. For example, the nearest neighbor and support vector classifiers required storing tens of thousands of training examples (or support vectors), while the neural network had over 120,000 weights. By contrast, the \(m = 8\) discriminatively trained mixture model has less than 6500 iteratively adjusted parameters, and most of its memory footprint is devoted to preprocessing by NMF.

We conclude by describing the problems best suited to the mixture models in this paper. These are problems with many classes, large amounts of data, and little prior knowledge of symmetries or invariances. Support vector machines and nearest neighbor algorithms do not scale well to this regime, and it remains tedious to train large neural networks with unspecified learning rates. By contrast, the compactness of our models and the simplicity of their learning algorithm make them especially attractive.

A Proof of convergence

In this appendix, we show that the multiplicative updates from section 4 lead to monotonic improvement in the conditional log likelihood. This guarantee of convergence (to a stationary point) is proved by computing a lower bound on the conditional log likelihood for updated estimates of the mixture weights and basis function parameters. We indicate these updated estimates by \(\tilde{W}_{yc}\) and \(\tilde{\theta}_{yc}\), and we indicate the resulting values of the conditional log likelihood and its component terms by \(\tilde{\mathcal{L}}\), \(\tilde{\mathcal{L}}_+\), and \(\tilde{\mathcal{L}}_-\).
The proof of convergence rests on three simple inequalities applied to \(\tilde{\mathcal{L}}_+\) and \(\tilde{\mathcal{L}}_-\).

The first term in the conditional log likelihood can be lower bounded by Jensen's inequality. The same bound is used here as in the derivation of the EM algorithm [3, 13] for maximum likelihood estimation:

\[ \tilde{\mathcal{L}}_+ \;\ge\; \sum_n \sum_c \alpha_{nc} \log\!\left[ \frac{\tilde{W}_{y_n c}\, e^{\tilde{\theta}_{y_n c} \cdot \Phi_n}}{\alpha_{nc}} \right]. \tag{15} \]

The right hand side of this inequality introduces an auxiliary probability distribution \(\alpha_{nc}\) for each example in the training set. The bound holds for arbitrary distributions, provided they are properly normalized: \(\sum_c \alpha_{nc} = 1\) for all \(n\).

The second term in the conditional log likelihood occurs with a minus sign, so for this term we require an upper bound. The same bounds can be used here as in derivations of iterative scaling [1, 2, 4, 13]. Note that the logarithm function is upper bounded by \(\log z \le z - 1\). We can therefore write:

\[ \tilde{\mathcal{L}}_- \;\le\; \mathcal{L}_- + \sum_n \left[ \frac{\sum_{yc} \tilde{W}_{yc}\, e^{\tilde{\theta}_{yc} \cdot \Phi_n}}{\sum_{yc} W_{yc}\, e^{\theta_{yc} \cdot \Phi_n}} - 1 \right]. \tag{16} \]

To further bound the right hand side of eq. (16), we make the following observation: though the exponentials \(e^{\tilde{\theta}_{yc} \cdot \Phi_n}\) are convex functions of the parameter vector \(\tilde{\theta}_{yc}\), they are concave functions of the "warped" parameters with elements \(e^{\Phi \tilde{\theta}_{ycd}}\), where \(\Phi\) is defined by eq. (14). (The validity of this observation hinges on the nonnegativity of the feature vectors.) It follows that for any example in the training set, the exponential \(e^{\tilde{\theta}_{yc} \cdot \Phi_n}\) is upper bounded by its linearized expansion around \(e^{\Phi \theta_{ycd}}\):

\[ e^{\tilde{\theta}_{yc} \cdot \Phi_n} \;\le\; e^{\theta_{yc} \cdot \Phi_n} \left[ 1 + \sum_d \frac{\Phi_{nd}}{\Phi} \left( e^{\Phi (\tilde{\theta}_{ycd} - \theta_{ycd})} - 1 \right) \right]. \tag{17} \]

The coefficient of each term in the sum arises from the derivative of \(e^{\theta_{yc} \cdot \Phi_n}\) with respect to the warped variable \(e^{\Phi \theta_{ycd}}\), computed by the chain rule. Tighter bounds than eq. (17) are possible, but at the expense of more complicated update rules.

Combining the above inequalities with a judicious choice for the auxiliary parameters, we obtain a proof of convergence for the multiplicative updates in eqs. (13–14). Let:

\[ \alpha_{nc} = \frac{W_{y_n c}\, e^{\theta_{y_n c} \cdot \Phi_n}}{\sum_{c'} W_{y_n c'}\, e^{\theta_{y_n c'} \cdot \Phi_n}}, \tag{18} \]

\[ \beta_{nyc} = \frac{W_{yc}\, e^{\theta_{yc} \cdot \Phi_n}}{\sum_{y'c'} W_{y'c'}\, e^{\theta_{y'c'} \cdot \Phi_n}}. \tag{19} \]

Eq. (18) sets the auxiliary parameters \(\alpha_{nc}\) to the posterior probabilities over mixture components given the true class labels, while eq. (19) defines an analogous distribution \(\beta_{nyc}\) for the opposing term in the conditional log likelihood. (This will prove to be a useful notation.) Combining these definitions with eqs. (15–17) and rearranging terms, we obtain the following inequality:

\[ \tilde{\mathcal{L}} - \mathcal{L} \;\ge\; \sum_{nc} \alpha_{nc} \log\!\left[ \frac{\tilde{W}_{y_n c}\, e^{\tilde{\theta}_{y_n c} \cdot \Phi_n}}{W_{y_n c}\, e^{\theta_{y_n c} \cdot \Phi_n}} \right] \;-\; \sum_{nyc} \beta_{nyc} \left[ \frac{\tilde{W}_{yc}}{W_{yc}} \left( 1 + \sum_d \frac{\Phi_{nd}}{\Phi} \left( e^{\Phi (\tilde{\theta}_{ycd} - \theta_{ycd})} - 1 \right) \right) - 1 \right]. \tag{20} \]

Both sides of the inequality vanish (yielding an equality) if \(\tilde{W}_{yc} = W_{yc}\) and \(\tilde{\theta}_{yc} = \theta_{yc}\). We derive the update rules by maximizing the right hand side of this inequality. Maximizing the right hand side with respect to \(\tilde{W}_{yc}\) while holding the basis function parameters fixed yields the update, eq. (13). Likewise, maximizing the right hand side with respect to \(\tilde{\theta}_{yc}\) while holding the mixture weights fixed yields the update, eq. (14). Since these choices lead to positive values on the right hand side of the inequality (except at fixed points), it follows that the multiplicative updates in eqs. (13–14) lead to monotonic improvement in the conditional log likelihood.

References

[1] M. Collins, R. Schapire, and Y. Singer (2000). Logistic regression, adaBoost, and Bregman distances. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory.

[2] J. N. Darroch and D. Ratcliff (1972). Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics 43:1470–1480.

[3] A. P. Dempster, N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. B 39:1–37.

[4] S. Della Pietra, V. Della Pietra, and J. Lafferty (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4):380–393.

[5] P. S. Gopalakrishnan, D. Kanevsky, A. Nádas, and D. Nahamoo (1991).
An inequality for rational functions with applications to some statistical estimation problems. IEEE Transactions on Information Theory 37:107–113.

[6] T. Jebara and A. Pentland (1998). Maximum conditional likelihood via bound maximization and the CEM algorithm. In M. Kearns, S. Solla, and D. Cohn (eds.). Advances in Neural Information Processing Systems 11, 494–500. MIT Press: Cambridge, MA.

[7] J. Kivinen and M. Warmuth (1997). Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation 132:1–64.

[8] D. D. Lee and H. S. Seung (1999). Learning the parts of objects with nonnegative matrix factorization. Nature 401:788–791.

[9] D. D. Lee and H. S. Seung (2000). Algorithms for nonnegative matrix factorization. In T. Leen, T. Dietterich, and V. Tresp (eds.). Advances in Neural Information Processing Systems 13. MIT Press: Cambridge, MA.

[10] Y. LeCun, L. Jackel, L. Bottou, A. Brunot, C. Cortes, J. Denker, H. Drucker, I. Guyon, U. Muller, E. Sackinger, P. Simard, and V. Vapnik (1995). A comparison of learning algorithms for handwritten digit recognition. In F. Fogelman and P. Gallinari (eds.). International Conference on Artificial Neural Networks, 1995, Paris: 53–60.

[11] G. McLachlan and K. Basford (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker.

[12] Y. Normandin (1991). Hidden Markov Models, Maximum Mutual Information Estimation and the Speech Recognition Problem. Ph.D. Thesis, McGill University, Montreal.

[13] J. A. O'Sullivan (1998). Alternating minimization algorithms: from Blahut-Arimoto to Expectation-Maximization. In A. Vardy (ed.). Codes, Curves, and Signals: Common Threads in Communications. Kluwer: Norwell, MA.

[14] V. Vapnik (1999). The Nature of Statistical Learning Theory.
Springer Verlag.

", "award": [], "sourceid": 2085, "authors": [{"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}]}