{"title": "Quantizing Density Estimators", "book": "Advances in Neural Information Processing Systems", "page_first": 825, "page_last": 832, "abstract": null, "full_text": "Quantizing Density Estimators

Peter Meinicke
Neuroinformatics Group, University of Bielefeld, Bielefeld, Germany
pmeinick@techfak.uni-bielefeld.de

Helge Ritter
Neuroinformatics Group, University of Bielefeld, Bielefeld, Germany
helge@techfak.uni-bielefeld.de

Abstract

We suggest a nonparametric framework for unsupervised learning of projection models in terms of density estimation on quantized sample spaces. The objective is not to reconstruct the data optimally; instead, the quantizer is chosen to optimally reconstruct the density of the data. For the resulting quantizing density estimator (QDE) we present a general method for parameter estimation and model selection. We show how projection sets which correspond to traditional unsupervised methods like vector quantization or PCA appear in the new framework. For a principal component quantizer we present results on synthetic and real-world data, which show that the QDE can improve the generalization of the kernel density estimator although its estimate is based on significantly lower-dimensional projection indices of the data.

1 Introduction

Unsupervised learning is essentially concerned with finding alternative representations for unlabeled data. These representations usually reflect important properties of the underlying distribution and typically exploit some redundancy in the data. Many unsupervised methods therefore aim at a complexity-reduced representation of the data; the most common approaches are vector quantization (VQ) and principal component analysis (PCA).
Both approaches can be viewed as specific kinds of quantization, which is a basic mechanism of complexity reduction.

The objective of our approach to unsupervised learning is to achieve a quantization of the data space which allows for an optimal reconstruction of the underlying density from a finite sample. We thus consider unsupervised learning as density estimation on a quantized sample space, and the resulting estimator will be referred to as the quantizing density estimator (QDE). The construction of a QDE first requires specifying a suitable class of parametrized quantization functions, and then selecting from this class a function with good generalization properties. While the first point is common to unsupervised learning, the latter is addressed in a density estimation framework where we tackle the model selection problem in a data-driven and nonparametric way.

It is often overlooked that modern Bayesian approaches to unsupervised learning and model selection are almost always based on strong assumptions about the data distribution. Unfortunately these assumptions usually cannot be inferred from human knowledge about the data domain, and therefore the model building process is often driven by computational considerations. Although our approach can be interpreted in terms of a generative model of the data, in contrast to most other generative models (see [10] for an overview) it is nonparametric, since no specific assumptions about the functional form of the data distribution have to be made. In this respect our approach compares well with other quantization methods, like principal curves and surfaces [4, 13, 6], which only make rather general assumptions about the underlying distribution.
The QDE approach can utilize these methods as specific quantization techniques, and shows a practical way to further automatize the construction of unsupervised learning machines.

2 Quantization by Density Estimation

We will now explain how the QDE may be derived from a generalization of the kernel density estimator (KDE), one of the most popular methods for nonparametric density estimation [12, 11]. If we construct a kernel density estimator on the basis of a quantized sample, we have the following estimator:

  p̂(x) = (1/N) Σ_{i=1}^N K_h(x − q(x_i; θ)),    (1)

where K_h(·) denotes the kernel function with bandwidth h, {x_1, ..., x_N} ⊂ R^d is a sample from the target distribution, and q(·; θ): R^d → R^d is a given quantization or projection function with parameter vector θ, which maps a point of the sample space to a parametrized subset:

  q(x; θ) = f(r*(x); θ),  r*(x) = argmin_{r ∈ R} ||x − f(r; θ)||.    (2)

Thereby the projection index r*(·) associates a data point with its nearest neighbour in the projection set

  F_θ = { f(r; θ) : r ∈ R ⊂ R^q },    (3)

where R is the set of all possible projection indices, which are realizations of the deterministic latent variable r.
For a fixed non-zero kernel bandwidth the parameters of the quantization function may be determined by nonparametric maximum likelihood (ML) estimation, as will be introduced in the next section.

For an intuitive motivation of the QDE, one may ask from a data compression perspective whether it is necessary to store all the sample data {x_1, ..., x_N} for the realization of the kernel density estimator, or whether it is possible to first reduce the data by some suitable quantization method and then construct the estimator from the more parsimonious, complexity-reduced data set. Clearly, we would prefer a quantizer which does not decrease the performance of the estimator on unseen and unquantized data.

To get an idea of how to select a suitable quantization function, let us consider an example from a 1D data space. In one dimension a natural projection set can be specified by a set of m quantization levels on the real line, i.e. {c_1, ..., c_m} ⊂ R. For a fixed kernel bandwidth, we can now perform maximum likelihood estimation of the level coordinates. In that way we obtain a maximum likelihood estimator of the form

  p̂(y) = (1/N) Σ_{j=1}^m n_j K_h(y − c_j),  q(y) = c_{j*(y)},  j*(y) = argmin_j |y − c_j|,    (4)

with n_j counting the number of data points which are quantized to level c_j. In this case, the question remains how to choose the number of quantization levels.

From a different starting point, the authors in [3] proposed the same functional form of a nonparametric ML density estimator with respect to Gaussian kernels of equal width centered on variable positions. As with the traditional Gaussian KDE (kernel centers fixed at the data points), for consistency of the estimator the bandwidth has to be decreased as the sample size increases. In [3] the authors reported that for a fixed non-zero bandwidth, ML estimation of the kernel centers always resulted in a smaller number of actually distinct centers, i.e. several kernels coincided to maximize the likelihood. Therefore the resulting estimator had the form of (4), where m corresponds to the number of distinct centers, with n_j counting the number of kernels coinciding at c_j. The optimum number of effective quantization levels for a given bandwidth therefore arises as an automatic byproduct of ML estimation.

Finally one has to choose an appropriate kernel width, which implicitly determines the complexity of the quantizer. The bandwidth selection problem has been studied in the domain of kernel density estimation for some time and many approaches have been proposed (see e.g. [5] for an overview), among which the cross-validation methods are most common. In the next section we will adopt the method of likelihood cross-validation to find a practical answer to the bandwidth selection problem.

3 General Learning Scheme

By applying the method of sieves as proposed in [3], for a fixed non-zero bandwidth we can estimate the parameters of the quantization function via maximization of the log-likelihood L(θ) = Σ_{i=1}^N log p̂(x_i) with respect to θ. For consistency of the resulting density estimator the bandwidth has to be decreased as the sample size increases, since asymptotically the estimator must converge to a mixture of delta functions centered on the data points. Thus, for decreasing bandwidth, the quantization function of the QDE must converge to the identity function, i.e.
the QDE must converge to the kernel density estimator.

For a fixed bandwidth, maximization of the likelihood can be achieved by applying the EM algorithm [2], which provides a convenient optimization scheme, especially for Gaussian kernels. Viewing (1) as an N-component mixture with equal weights 1/N, the EM scheme requires iterating the following two steps for t = 0, 1, ..., starting from a suitable initial parameter vector θ^(0):

E-step:

  w_{ik}^(t) = K_h(x_i − q(x_k; θ^(t))) / Σ_{k'=1}^N K_h(x_i − q(x_{k'}; θ^(t))),    (5)

M-step:

  θ^(t+1) = argmax_θ Σ_{i=1}^N Σ_{k=1}^N w_{ik}^(t) log K_h(x_i − q(x_k; θ)).    (6)

Thereby w_{ik} denotes the posterior probability that data point i has been "generated" by mixture component k, with density K_h(· − q(x_k; θ)). For further insight one may realize that the M-step requires solving a constrained optimization problem by searching for

  θ^(t+1) = argmax_θ Σ_{i=1}^N Σ_{k=1}^N w_{ik}^(t) log K_h(x_i − f(r_k; θ))    (7)

subject to

  r_k = argmin_{r ∈ R} ||x_k − f(r; θ)||,  k = 1, ..., N.    (8)

In general this optimization problem can only be solved by iterative techniques.
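For the one-dimensional quantizer of section 2 with Gaussian kernels and freely movable levels, the M-step has a closed-form solution, so the scheme above becomes very simple. The following is a minimal sketch under these assumptions (function names and initialization are our own, not the authors' implementation):

```python
import numpy as np

def qde_em_1d(x, h, n_iter=50):
    """EM for the 1D quantizing KDE: Gaussian kernels of fixed bandwidth h
    whose centers (quantization levels) are fitted by ML. Centers are
    initialized at the data points; under EM several of them typically
    coincide, so fewer than N effective levels remain."""
    x = np.asarray(x, dtype=float)
    c = x.copy()                                  # one center per data point
    for _ in range(n_iter):
        # E-step: posterior that x_i was "generated" by center c_k
        d2 = (x[:, None] - c[None, :]) ** 2
        logw = -0.5 * d2 / h**2
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: closed form for Gaussian kernels and free 1D levels
        c = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    return c

def qde_density(y, centers, h):
    """Evaluate the resulting estimator (4): equal-weight Gaussian kernels
    at the (possibly coinciding) ML centers."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    k = np.exp(-0.5 * ((y[:, None] - centers[None, :]) / h) ** 2)
    return k.sum(axis=1) / (len(centers) * h * np.sqrt(2 * np.pi))
```

Running this with a large bandwidth collapses all levels onto the sample mean; with smaller bandwidths several levels coincide, so the number of effective levels emerges as a byproduct of ML estimation, as described above.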
Therefore it may be convenient not to maximize but only to increase the log-likelihood at the M-step, which then corresponds to an application of the generalized EM algorithm. Without (8), unconstrained maximization according to (7) yields another class of interesting learning schemes which, for reasons of space, will not be considered in this paper.

For Gaussian kernels and a Euclidean metric for the projection, in the limiting case of a vanishing bandwidth EM optimization of the QDE parameters corresponds to minimization of the following error or risk:

  E(θ) = Σ_{i=1}^N min_{r ∈ R} ||x_i − f(r; θ)||².    (9)

Minimization of such error functions corresponds to traditional approaches to unsupervised learning of projection models, which can therefore be viewed as a special case of QDE-based learning.

3.1 Bandwidth Selection

It is easy to see that the kernel bandwidth cannot be determined by ML estimation, since maximization of the likelihood would drive the bandwidth towards zero. For selection of the kernel bandwidth, we therefore apply the method of likelihood cross-validation (see e.g. [12]), which can be realized by a slight extension of the above EM scheme. With the leave-one-out QDE

  p̂_{−i}(x) = (1/(N−1)) Σ_{k ≠ i} K_h(x − q(x_k; θ)),

the idea is to maximize Σ_{i=1}^N log p̂_{−i}(x_i) with respect to the kernel bandwidth.
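Likelihood cross-validation can be sketched for the simplest case of an identity quantizer (i.e. a plain 1D Gaussian KDE); the grid search and function names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def loo_log_likelihood(x, h):
    """Leave-one-out log-likelihood of a 1D Gaussian KDE:
    CV(h) = sum_i log p_{-i}(x_i)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d2 = (x[:, None] - x[None, :]) ** 2
    k = np.exp(-0.5 * d2 / h**2) / (h * np.sqrt(2 * np.pi))
    np.fill_diagonal(k, 0.0)               # leave each point out of its own estimate
    p = k.sum(axis=1) / (n - 1)
    return np.log(p).sum()

def select_bandwidth(x, grid):
    """Pick the bandwidth maximizing the leave-one-out score on a grid."""
    scores = [loo_log_likelihood(x, h) for h in grid]
    return grid[int(np.argmax(scores))]
```

For a non-trivial quantizer the same score is computed with the quantized points q(x_k; θ) in place of x_k, which is the slight extension of the EM scheme mentioned above.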
For a Gaussian kernel with bandwidth h, an appropriate EM scheme requires the following M-step update rule:

  (h^(t+1))² = (1/(dN)) Σ_{i=1}^N Σ_{k ≠ i} w̃_{ik}^(t) ||x_i − q(x_k; θ)||².    (10)

The posterior probabilities w̃_{ik} are easily derived from a leave-one-out version of (5). In an overall optimization scheme one may now alternate the estimation of θ and h, or alternatively one may estimate both by likelihood cross-validation.

4 Projection Sets in Multidimensions

By the specification of a certain class of quantization functions we can incorporate domain knowledge into the density estimation process, in order to improve generalization. The idea is to reduce the variance of the density estimator by reducing the variation of the quantized training set. The price is an increase of the bias, which requires a careful selection of the set of admissible quantization functions. The QDE then offers the chance to find a better bias/variance trade-off than with the "non-quantizing" KDE.

We will now show how to utilize existing methods for unsupervised learning within the current density estimation framework. Because many unsupervised methods can be stated in terms of finding optimal projection sets, it is straightforward to apply the corresponding types of quantization functions within the current framework.
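The nearest-neighbour projection of equations (2)-(3), combined with the estimator (1), can be sketched for a finite candidate projection set (a simplifying assumption made for illustration; in general the projection set may be continuous and its elements are themselves estimated):

```python
import numpy as np

def quantize(X, F):
    """Project each data point onto its nearest element of a finite
    projection set F (rows of F are candidate points f(r)), i.e. the
    nearest-neighbour projection index of equation (2)."""
    X, F = np.asarray(X, float), np.asarray(F, float)
    d2 = ((X[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)                # projection indices r*(x_i)
    return F[idx], idx

def qde(x, Xq, h):
    """QDE of equation (1): a Gaussian KDE built on the quantized sample Xq."""
    x, Xq = np.atleast_2d(x), np.asarray(Xq, float)
    d2 = ((x[:, None, :] - Xq[None, :, :]) ** 2).sum(axis=-1)
    d = Xq.shape[1]
    norm = (2 * np.pi * h**2) ** (d / 2)
    return np.exp(-0.5 * d2 / h**2).sum(axis=1) / (len(Xq) * norm)
```

The specific parametrizations below differ only in how the candidate set F (or its continuous counterpart) is constructed.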
Thus in the following we shall consider specific parametrizations of the general projection set (3) which correspond to traditional unsupervised learning methods.

4.1 Vector Quantization

Vector quantization (VQ) is a standard technique among unsupervised methods, and it is easily incorporated into the current density estimation framework by straightforward generalization of the one-dimensional quantizer in section 2 to the multi-dimensional case. Again, with a fixed kernel bandwidth, ML estimation yields a certain number of m distinct ("effective") quantization levels, similar to maximum entropy clustering [9, 1].

The projection set of a vector quantizer can be parametrized according to a general basis function representation [7]:

  f(r; θ) = W φ(r),  r ∈ {1, ..., m},    (11)

with the d × m parameter matrix W = (c_1, ..., c_m) containing the quantization levels as columns, and an m-dimensional vector of basis functions φ(·) containing discrete delta functions, i.e. φ_j(r) = δ_{jr} for component j.

The QDE on the basis of a vector quantizer can be expected to generalize well if some cluster structure is present within the data. In multi-dimensional spaces the data are often concentrated in certain regions, which allows for a sparse representation by some reference vectors well-positioned in those regions. An alternative approach has been proposed in [14], where the application of the support vector formalism to density estimation results in a sparse representation of the data distribution.

4.2 Principal Component Analysis

A linear affine parametrization of the projection set yields candidate functions of the form

  f(r; θ) = W r + μ,  r ∈ R^q,    (12)

with W ∈ R^{d×q}. The PCA approach reflects our knowledge that in most high-dimensional data spaces the data are concentrated around some manifold of lower dimensionality.

To exploit this structure, PCA divides the sample space into two subspaces which are quantized in different ways: within the "inner" subspace spanned by the directions of the projection manifold we have no quantization at all; within the orthogonal "outer" subspace the data are quantized to a single level.

With a Gaussian kernel with fixed bandwidth h, the constrained optimization problem at the M-step takes a convenient form which facilitates further analysis of the learning algorithm. From (7) and (8) it follows that one has to maximize the following objective function:

  J(μ, W) = const − (1/(2h²)) Σ_{i=1}^N Σ_{k=1}^N w_{ik} ||x_i − μ − W W^T (x_k − μ)||²,    (13)

where the matrix W has orthonormal columns which span the subspace of the projection manifold.
From the consideration of the corresponding stationarity conditions one finds that the sample mean x̄ = (1/N) Σ_{i=1}^N x_i is an estimator of the shift vector μ. Maximization of (13) with respect to W then requires maximizing the following trace:

  tr(W^T M W),  M = 2A − B,    (14)

with the symmetric matrices

  A = (1/2) Σ_{i=1}^N Σ_{k=1}^N w_{ik} [ (x_i − x̄)(x_k − x̄)^T + (x_k − x̄)(x_i − x̄)^T ],    (15)

  B = Σ_{k=1}^N ( Σ_{i=1}^N w_{ik} ) (x_k − x̄)(x_k − x̄)^T.    (16)

Thus (14) is maximized if W contains all eigenvectors of M associated with positive eigenvalues, i.e. with λ_1 ≥ ... ≥ λ_d being the eigenvalues of M we have the optimal subspace dimensionality q̂ = #{ j : λ_j > 0 }, which complements a recent result about parametric dimensionality estimation with respect to a q-factor model with isotropic Gaussian noise [8]. For the QDE, the two limiting cases of zero and infinite bandwidth are of particular interest.
With the positive definite sample covariance matrix S = (1/N) Σ_{i=1}^N (x_i − x̄)(x_i − x̄)^T one can show

  lim_{h→0} M = N S,  lim_{h→∞} M = −N S.    (17)

Thus for sufficiently large bandwidth M becomes negative definite, which implies a zero subspace dimensionality estimate q̂ = 0, i.e. all data are quantized to the sample mean. For sufficiently small bandwidth M becomes positive definite, implying q̂ = d, i.e. no quantization takes place.

4.3 Independent Component Analysis

The PCA method provides a rather coarse quantization scheme, since it only decides between one-level and no quantization for each subspace dimension. A natural refinement would therefore be to allow for a certain number of effective quantization levels for each component. Such an approach may be viewed as a nonparametric variant of independent component analysis (ICA). The idea is to quantize each coordinate axis separately, which yields a multi-dimensional quantization grid according to

  f(r; θ) = μ + Σ_{j=1}^d ( φ(r_j)^T c_j ) w_j,  r_j ∈ {1, ..., m_j},    (18)

where the vector c_j contains the m_j quantization levels of the j-th coordinate axis with direction w_j, and φ(·) is as in (11). Further, it makes sense to normalize the direction vectors according to ||w_j|| = 1.
There are strong similarities with a parametric ICA model which has been suggested in [10], where the source densities are mixtures of delta functions and the additive noise is isotropic Gaussian.

Other unsupervised learning methods which correspond to different projection sets, like principal curves or multilayer perceptrons (see [7] for an overview), can be incorporated into the QDE framework as well and will be considered elsewhere.

5 Experiments

In the following experiments we investigated the PCA-based QDE with Gaussian kernel and compared its generalization performance with that of the "non-quantizing" KDE. All parameters, including the bandwidth of the KDE, were estimated by likelihood cross-validation. In the first experiment we sampled 100 points from a stretched and rotated uniform distribution with support on a rectangle. In this case the QDE extracted a one-dimensional "unquantized" subspace. Generalization performance was measured by the average log-likelihood on an independent 1000-point test set. With an automatically selected 1D subspace, the PCA-QDE improved the test performance of the KDE and could thus successfully exploit the elongated structure of the distribution. The estimated density functions are depicted in figure 1, where grey-values are proportional to the estimated density on a regular grid. From the images one can see that the QDE better captures the global structure of the distribution, while the KDE is more sensitive to local variations in the data.

In a second experiment we trained PCA-QDEs on 64-dimensional real-world data (8×8 images) which had been derived from the MNIST database of handwritten digits (http://www.research.att.com/~yann/ocr/mnist/).
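The comparison just described can be sketched as follows: both estimators are scored by average held-out log-likelihood, with the QDE built on the training sample projected onto a principal subspace. This is a zero-bandwidth simplification of the full PCA-QDE (which reweights by EM posteriors); names and parameters are illustrative:

```python
import numpy as np

def gauss_kde_loglik(train, test, h):
    """Average held-out log-likelihood of a Gaussian KDE."""
    d = train.shape[1]
    d2 = ((test[:, None, :] - train[None, :, :]) ** 2).sum(-1)
    k = np.exp(-0.5 * d2 / h**2) / (2 * np.pi * h**2) ** (d / 2)
    return np.log(k.mean(axis=1)).mean()

def pca_qde_loglik(train, test, h, q):
    """Average held-out log-likelihood of a (zero-bandwidth) PCA-QDE:
    the KDE is built on the training points projected onto the top-q
    principal subspace."""
    mu = train.mean(axis=0)
    S = np.cov(train.T, bias=True)
    evals, evecs = np.linalg.eigh(S)
    W = evecs[:, ::-1][:, :q]
    train_q = mu + (train - mu) @ W @ W.T   # quantized training sample
    return gauss_kde_loglik(train_q, test, h)
```

In the paper's setup, q and h are not fixed by hand but selected by the likelihood cross-validation scheme of section 3.1.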
For each digit class, a training set and a separate test set were used to compare the PCA-QDE with the KDE, with results shown in table 1. Again the PCA-QDE improved the generalization performance of the KDE, although the QDE decided to remove about 40 "redundant" dimensions per digit class.

Figure 1: Left: stretched uniform distribution in 2D with white points indicating the 100 data points used for estimation; middle: estimated density using the PCA-QDE; right: kernel density estimate.

Table 1: Results on 64-dimensional digit data for different digit classes '0'...'9' (first row); second row: difference between average log-likelihoods of (PCA-)QDE and KDE on the test set; third row: optimal subspace dimensionality of the QDE.

Digit:              0     1     2     3     4     5     6     7     8     9
L_QDE − L_KDE:   1.87  0.66  1.02  1.38  1.58  1.54  1.44  0.64  1.53  1.33
dim:               22    29    26    24    24    25    24    27    21    25

6 Conclusion

The QDE offers a nonparametric approach to unsupervised learning of quantization functions which can be viewed as a generalization of the kernel density estimator. While the KDE is directly constructed from the given data set, the QDE first creates a quantized representation of the data.
Unlike traditional quantization methods, which minimize the associated reconstruction error of the data points, the QDE adjusts the quantizer to optimize an estimate of the data density. This feature allows for a convenient model selection procedure, since the complexity of the quantizer can be controlled by the kernel bandwidth, which in turn can be selected in a data-driven way. For a practical realization we outlined EM schemes for parameter estimation and bandwidth selection. As an illustration, we discussed examples with different projection sets which correspond to VQ, PCA and ICA methods. We presented experiments which demonstrate that the bias imposed by the quantization can lead to improved generalization as compared to the "non-quantizing" KDE. This suggests that QDEs offer a promising approach to unsupervised learning that allows one to control bias without the usually rather strong distributional assumptions of the Bayesian approach.

Acknowledgement

This work was funded by the Deutsche Forschungsgemeinschaft within the project SFB 360.

References

[1] J. M. Buhmann and N. Tishby. Empirical risk approximation: A statistical learning theory of data clustering. In C. M. Bishop, editor, Neural Networks and Machine Learning, pages 57-68. Springer, Berlin Heidelberg New York, 1998.

[2] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39:1-38, 1977.

[3] Stuart Geman and Chii-Ruey Hwang. Nonparametric maximum likelihood estimation by the method of sieves. The Annals of Statistics, 10(2):401-414, 1982.

[4] T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84:502-516, 1989.

[5] M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation.
Journal of the American Statistical Association, 91(433):401-407, 1996.

[6] B. Kégl, A. Krzyzak, T. Linder, and K. Zeger. Learning and design of principal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(3):281-297, 2000.

[7] Peter Meinicke. Unsupervised Learning in a Generalized Regression Framework. PhD thesis, Universitaet Bielefeld, 2000. http://archiv.ub.uni-bielefeld.de/disshabi/2000/0033/.

[8] Peter Meinicke and Helge Ritter. Resolution-based complexity control for Gaussian mixture models. Neural Computation, 13(2):453-475, 2001.

[9] K. Rose, E. Gurewitz, and G. C. Fox. Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65(8):945-948, 1990.

[10] Sam Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305-345, 1999.

[11] D. W. Scott. Multivariate Density Estimation. Wiley, 1992.

[12] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London and New York, 1986.

[13] Alex J. Smola, Robert C. Williamson, Sebastian Mika, and Bernhard Schölkopf. Regularized principal manifolds. In Proc. 4th European Conference on Computational Learning Theory, volume 1572, pages 214-229. Springer-Verlag, 1999.

[14] Vladimir N. Vapnik and Sayan Mukherjee. Support vector method for multivariate density estimation. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 659-665. The MIT Press, 2000.
", "award": [], "sourceid": 2007, "authors": [{"given_name": "Peter", "family_name": "Meinicke", "institution": null}, {"given_name": "Helge", "family_name": "Ritter", "institution": null}]}