{"title": "Probabilistic Visualisation of High-Dimensional Binary Data", "book": "Advances in Neural Information Processing Systems", "page_first": 592, "page_last": 598, "abstract": null, "full_text": "Probabilistic Visualisation of \nHigh-dimensional Binary Data \n\nMichael E. Tipping \nMicrosoft Research, \n\nSt George House, 1 Guildhall Street, \n\nCambridge CB2 3NH, U.K. \nmtipping@microsoit.com \n\nAbstract \n\nWe present a probabilistic latent-variable framework for data visu(cid:173)\nalisation, a key feature of which is its applicability to binary and \ncategorical data types for which few established methods exist. A \nvariational approximation to the likelihood is exploited to derive a \nfast algorithm for determining the model parameters. Illustrations \nof application to real and synthetic binary data sets are given. \n\n1 \n\nIntroduction \n\nVisualisation is a powerful tool in the exploratory analysis of multivariate data. The \nrendering of high-dimensional data in two dimensions, while generally implying loss \nof information, often reveals interesting structure to the human eye. Standard \ndimensionality-reduction methods from multivariate analysis, notably the principal \ncomponent projection, are often utilised for this purpose, while techniques such \nas 'projection pursuit ' have been tailored specifically to this end. With the cur(cid:173)\nrent trend for larger databases and the need for effective 'data mining' methods, \nvisualisation is becoming increasingly topical, and recent novel developments in(cid:173)\nclude nonlinear topographic methods (Lowe and Tipping 1997; Bishop, Svensen, \nand Williams 1998) and hierarchical combinations of linear models (Bishop and \nTipping 1998). 
However, a disadvantageous aspect of many proposed techniques is their applicability only to continuous variables; very few such methods have been proposed specifically for the visualisation of discrete binary data types, which are commonplace in real-world datasets. \n\nWe approach this difficulty by proposing a probabilistic framework for the visualisation of arbitrary data types, based on an underlying latent variable density model. This leads to an algorithm which permits the visualisation of structure within data, while also defining a generative observation probability model. A further, and intuitively pleasing, result is that the specialisation of the model to continuous variables recovers principal component analysis. Continuous, binary and categorical data types may thus be combined and visualised together within this framework, but for reasons of space, we concentrate on binary types alone in this paper. \n\nIn the next section we outline the proposed latent variable approach, and in Section 3 consider the difficulties involved in estimating the parameters in this model, giving an efficient variational scheme to this end in Section 4. In Section 5 we illustrate the application of the model and consider the accuracy of the variational approximation. \n\n2 Latent Variable Models for Visualisation \n\nIn an ideal visualisation model, we would wish all of the dependencies between variables to be evident in the visualisation space, while the information that we lose in the dimensionality-reduction process should represent \"noise\" that is independent for each variable. This principle is captured by the following probability density model for a dataset comprising d-dimensional observation vectors t = (t_1, t_2, ... 
, t_d): \n\np(t) = ∫ { ∏_{i=1}^{d} P(t_i | x, θ) } p(x) dx,    (1) \n\nwhere x is a two-dimensional latent variable vector, the distribution of which must be specified a priori, and θ are the model parameters. Now, for a given value of x (or location in the visualisation space), the observations are independent under the model. (In general, of course, the model and conditional independence assumption will only hold approximately.) However, the unconditional observation model p(t) does not, in general, factorise and so can still capture dependencies between the d variables, given the constraint implied by the use of just two underlying latent variables. So, having estimated the parameters θ, data could be visualised by 'inverting' the generative model using Bayes' rule: p(x|t) = p(t|x)p(x)/p(t). Each data point then induces a distribution in the latent space, which, for the purposes of visualisation, we might summarise with the conditional mean value ⟨x|t⟩. \n\nThat this form of model can be appropriate for visualisation was demonstrated by Bishop and Tipping (1998), who showed that if the latent variables are defined to be independent and Gaussian, x ~ N(0, I), and the conditional observation model is also Gaussian, t_i|x ~ N(w_i^T x + μ_i, σ²), then maximum-likelihood estimation of the model parameters {w_i, μ_i, σ²} leads to a model where the posterior mean ⟨x|t⟩ is equivalent to a probabilistic principal component projection. \n\nA visualisation method for binary variables now follows naturally. Retaining the Gaussian latent distribution x ~ N(0, I), we specify an appropriate conditional distribution for P(t_i | x, θ). 
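In the Gaussian special case just described, the visualisation step is available in closed form: ⟨x|t⟩ = (W^T W + σ²I)^{-1} W^T (t − μ). A minimal numpy sketch of this projection, using illustrative parameters W, μ and σ² (random placeholders, not values estimated from any dataset):

```python
import numpy as np

# Sketch of the posterior-mean projection <x|t> in the Gaussian case,
# where it reduces to a probabilistic principal component projection.
rng = np.random.default_rng(0)
d, q = 16, 2                      # data and latent dimensionality
W = rng.normal(size=(d, q))       # loading matrix; row i holds w_i
mu = rng.normal(size=d)           # per-variable means mu_i
sigma2 = 0.1                      # noise variance (illustrative)

def posterior_mean(t):
    """<x|t> = (W^T W + sigma2 I)^{-1} W^T (t - mu)."""
    M = W.T @ W + sigma2 * np.eye(q)
    return np.linalg.solve(M, W.T @ (t - mu))

t = W @ np.array([1.0, -0.5]) + mu    # a noiseless point from the model
x_vis = posterior_mean(t)             # its 2-D coordinate for plotting
```

Each data point is thus mapped to a single two-dimensional coordinate, which is exactly the summary used for the plots in Section 5.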
Given that principal component analysis corresponds to a linear model for continuous data types, we adopt the appropriate generalised linear model in the binary case: \n\nP(t_i | x) = σ(A_i)^{t_i} {1 - σ(A_i)}^{1 - t_i},    (2) \n\nwhere σ(A) = {1 + exp(-A)}^{-1} and A_i = w_i^T x + b_i, with parameters w_i and b_i. \n\n3 Maximum-likelihood Parameter Estimation \n\nThe proposed model for binary data already exists in the literature under various guises, most historically as a latent trait model (Bartholomew 1987), although it is not utilised for data visualisation. While in the case of probabilistic principal component analysis ML parameter estimates can be obtained in closed form, a disadvantageous feature of the binary model is that, with P(t_i|x) defined by (2), the integral of (1) is analytically intractable and P(t) cannot be computed directly. Fitting a latent trait model thus necessitates a numerical integration, and recent papers have considered both Gauss-Hermite (Moustaki 1996) and Monte-Carlo sampling approximations (MacKay 1995; Sammel, Ryan, and Legler 1997). \n\nIn this latter case, the log-likelihood for a dataset of N observation vectors {t_1, ..., t_N} would be approximated by \n\nℒ ≈ Σ_{n=1}^{N} ln { (1/L) Σ_{l=1}^{L} ∏_{i=1}^{d} P(t_{in} | x_l, w_i, b_i) },    (3) \n\nwhere x_l, l = 1 ... L, are samples from the two-dimensional latent distribution. To obtain parameter estimates we may utilise an expectation-maximisation (EM) approach by noting that (3) is equivalent in form to an L-component latent class model (Bartholomew 1987) where the component probabilities are mutually constrained through (2). Applying standard methodology leads to an E-step which requires computation of N × L posterior 'responsibilities' P(x_l|t_n), and a logistic regression M-step which is unfortunately iterative, although it can be performed relatively efficiently by an iteratively re-weighted least-squares algorithm. 
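The Monte-Carlo approximation (3) itself is straightforward to compute. A hedged numpy sketch follows; the parameters W and b and the binary data T are random placeholders, and L = 500 matches the sample count used in the experiments reported later:

```python
import numpy as np

# Illustrative model parameters and data (random placeholders).
rng = np.random.default_rng(1)
N, d = 50, 16
W = rng.normal(size=(d, 2))            # row i holds w_i
b = rng.normal(size=d)                 # biases b_i
T = rng.integers(0, 2, size=(N, d))    # binary observations t_n

def log_likelihood_mc(T, W, b, L=500, seed=2):
    """Monte-Carlo estimate (3): samples x_l ~ N(0, I) replace integral (1)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(L, 2))        # x_l, l = 1..L
    A = x @ W.T + b                    # (L, d) activations w_i^T x_l + b_i
    logp1 = -np.log1p(np.exp(-A))      # log sigma(A)       = log P(t_i=1|x_l)
    logp0 = -np.log1p(np.exp(A))       # log(1 - sigma(A))  = log P(t_i=0|x_l)
    # log prod_i P(t_in | x_l), for every pair (n, l)
    LP = T @ logp1.T + (1 - T) @ logp0.T          # (N, L)
    # ln { (1/L) sum_l exp(LP) }, with a max-shift for numerical stability
    mx = LP.max(axis=1, keepdims=True)
    return float(np.sum(np.log(np.exp(LP - mx).mean(axis=1)) + mx.ravel()))

ll = log_likelihood_mc(T, W, b)        # approximate log-likelihood (3)
```

The max-shift plays the role of a log-sum-exp, since the per-point products of Bernoulli probabilities underflow quickly as d grows.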
Because of these difficulties in implementation, in the next section we describe a variational approximation to the likelihood which can be maximised more efficiently. \n\n4 A Variational Approximation to the Likelihood \n\nJaakkola and Jordan (1997) introduced a variational approximation for the predictive likelihood in a Bayesian logistic regression model and also briefly considered the \"dual\" problem, which is closely related to the proposed visualisation model. In this approach, the integral in (1) is approximated by: \n\nP̃(t) = ∫ { ∏_{i=1}^{d} P̃(t_i | x, ξ_i) } p(x) dx,    (4) \n\nwhere \n\nP̃(t_i | x, ξ_i) = σ(ξ_i) exp{ (A_i - ξ_i)/2 + λ(ξ_i)(A_i² - ξ_i²) },    (5) \n\nwith A_i = (2t_i - 1)(w_i^T x + b_i) and λ(ξ_i) = {0.5 - σ(ξ_i)}/(2ξ_i). The parameters ξ_i are the 'variational' parameters, and this approximation has the property that P̃(t_i|x, ξ_i) ≤ P(t_i|x), with equality at ξ_i = A_i, and thus it follows that P̃(t) ≤ P(t). Now, because the exponential in (5) is quadratic in x, the integral in (4), and so also the likelihood, can be computed in closed form. This suggests an alternative algorithm for finding parameter estimates in which we iteratively maximise the variational approximation to the likelihood. Each iteration of this algorithm is guaranteed to increase a lower bound on, but will not necessarily maximise, the true likelihood. Nevertheless, we would hope that it will be a close approximation, the accuracy of which is investigated later. At each step in the algorithm, then, we: \n\n1. Obtain the sufficient statistics for the approximated posterior distribution of latent variables given each observation, p̃(x_n|t_n, ξ_n). \n\n2. Optimise the variational parameters ξ_{in} in order to make the approximation P̃(t_n) as close as possible to P(t_n) for all t_n. \n\n3. Update the model parameters w_i and b_i to increase P̃(t). 
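The three steps above can be collected into a sketch of the full fitting loop. It uses the bias-extended update formulae derived in the next section; the initialisation, iteration count and two-dimensional latent space are assumptions of this illustration rather than prescriptions of the paper:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lam(xi):
    # lambda(xi) = {0.5 - sigma(xi)}/(2 xi); negative for all xi > 0
    return (0.5 - sigmoid(xi)) / (2.0 * xi)

def fit_binary_latent_trait(T, n_iter=20, seed=0):
    """Variational fit of the binary latent trait model (sketch)."""
    N, d = T.shape
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.normal(size=(d, 2))     # rows are the w_i (small random init)
    b = np.zeros(d)
    Xi = np.ones((N, d))                  # variational parameters xi_in
    M = np.zeros((N, 2))                  # posterior means mu_n
    S = np.zeros((N, 2, 2))               # second moments <x_n x_n^T>
    for _ in range(n_iter):
        # Steps 1-2: posterior statistics and xi updates, per observation
        for n in range(N):
            L = lam(Xi[n])                                          # (d,)
            C = np.linalg.inv(np.eye(2) - 2.0 * (W * L[:, None]).T @ W)
            m = C @ (W.T @ (T[n] - 0.5 + 2.0 * L * b))
            Sn = C + np.outer(m, m)
            M[n], S[n] = m, Sn
            # xi_in^2 = <(w_i^T x_n + b_i)^2> under the posterior
            xi2 = np.einsum('ij,jk,ik->i', W, Sn, W) + 2.0 * b * (W @ m) + b**2
            Xi[n] = np.sqrt(np.maximum(xi2, 1e-12))
        # Step 3: update w_i and b_i jointly, with the bias appended
        Xhat = np.hstack([M, np.ones((N, 1))])     # <x_hat_n> = (mu_n, 1)
        XX = np.zeros((N, 3, 3))                   # <x_hat_n x_hat_n^T>
        XX[:, :2, :2] = S
        XX[:, :2, 2] = M
        XX[:, 2, :2] = M
        XX[:, 2, 2] = 1.0
        for i in range(d):
            G = np.tensordot(2.0 * lam(Xi[:, i]), XX, axes=(0, 0))  # 3x3
            h = (T[:, i] - 0.5) @ Xhat                              # (3,)
            w_tilde = np.linalg.solve(-G, h)      # -G is positive definite
            W[i], b[i] = w_tilde[:2], w_tilde[2]
    return W, b, M

# Usage on random placeholder data; M holds the 2-D coordinates to plot.
T = (np.random.default_rng(3).random((30, 8)) < 0.5).astype(int)
W, b, M = fit_binary_latent_trait(T, n_iter=5)
```

Since λ(ξ) < 0, both the matrix inverted in the E-step and −G in the M-step are positive definite, so every solve is well posed.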
\n\nJaakkola and Jordan (1997) give formulae for the above computations, but these do not include provision for the 'biases' b_i, and so the necessary expressions are re-derived below. Note that although we have introduced N × d additional variational parameters, it is no longer necessary to sample from p(x) and compute responsibilities, and no iterative logistic regression step is needed. \n\nComputing the Latent Posterior Statistics. From Bayes' rule, the posterior approximation p̃(x_n|t_n, ξ_n) is Gaussian with covariance and mean given by \n\nC_n = [ I - 2 Σ_{i=1}^{d} λ(ξ_{in}) w_i w_i^T ]^{-1},    (6) \n\nμ_n = C_n Σ_{i=1}^{d} { t_{in} - 1/2 + 2λ(ξ_{in}) b_i } w_i.    (7) \n\nOptimising the Variational Parameters. Because P̃(t) ≤ P(t), the variational approximation can be optimised by maximising P̃(t_n) with respect to each ξ_{in}. We use the EM methodology to obtain updates \n\nξ_{in}² = w_i^T ⟨x_n x_n^T⟩ w_i + 2 b_i w_i^T ⟨x_n⟩ + b_i²,    (8) \n\nwhere the angle brackets ⟨·⟩ denote expectations with respect to p̃(x_n|t_n, ξ_n^old) and where, from (6) and (7) earlier, the necessary posterior statistics are given by: \n\n⟨x_n⟩ = μ_n,    (9) \n⟨x_n x_n^T⟩ = C_n + μ_n μ_n^T.    (10) \n\nSince (6) and (7) depend on the variational parameters, C_n and μ_n are computed followed by the update for each ξ_{in} from (8). Iteration of this two-stage process is guaranteed to improve monotonically the approximation of P̃(t_n), and typically only two iterations are necessary for convergence. \n\nOptimising the Model Parameters. We again use EM to increase the variational likelihood approximation with respect to w_i and b_i. Defining \n\nw̃_i = (w_i^T, b_i)^T,  x̂ = (x^T, 1)^T,    (11) \n\nleads to updates for both w_i and b_i given by: \n\nw̃_i = - [ Σ_{n=1}^{N} 2λ(ξ_{in}) ⟨x̂_n x̂_n^T⟩ ]^{-1} [ Σ_{n=1}^{N} (t_{in} - 1/2) ⟨x̂_n⟩ ],    (12) \n\nwhere \n\n⟨x̂_n⟩ = (μ_n^T, 1)^T  and  ⟨x̂_n x̂_n^T⟩ = [ ⟨x_n x_n^T⟩, μ_n ; μ_n^T, 1 ]. \n\n5 Visualisation \n\nSynthetic clustered data. We firstly give an example of visualisation of artificially-generated data to illustrate the operation and features of the method. 
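A short sketch of how such a synthetic benchmark can be generated follows; the prototype count, bit length, copies per cluster and flip probabilities mirror the description of the datasets below, while the random seed is an arbitrary choice:

```python
import numpy as np

# Generator for clustered binary data: a few random prototype bit vectors,
# each replicated with independent bit-flip "noise".
def make_clustered_binary(n_protos=3, n_bits=16, n_per=200, p_flip=0.05,
                          seed=0):
    rng = np.random.default_rng(seed)
    protos = rng.integers(0, 2, size=(n_protos, n_bits))  # bits set w.p. 0.5
    data = np.repeat(protos, n_per, axis=0)               # n_per copies each
    flips = rng.random(data.shape) < p_flip               # independent flips
    return np.where(flips, 1 - data, data), protos

T, protos = make_clustered_binary()          # 600 x 16 binary dataset
T_noisy, _ = make_clustered_binary(p_flip=0.15, seed=1)   # higher-noise set
```

Raising `p_flip` from 0.05 to 0.15 produces the second, noisier dataset in which the clusters overlap.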
\nBinary data was synthesised by first generating three random 16-bit prototype vectors, where each bit was set with probability 0.5. Next, a 600-point dataset was generated by taking 200 examples of each prototype and inverting each bit with probability 0.05. We generated a second dataset in the same manner, but where the probability of bit inversion was 0.15, simulating more \"noise\" about each prototype. The final values of μ_n from (7) for each data point are plotted in Figure 1. In the left plot, for the low-noise dataset, the three clusters are clear, as are the prototype vectors. On the right, the bit-noise is sufficiently high that clusters now overlap to a degree and the prototypes are no longer evident. However, we can elucidate further information from the model by drawing lines representing P(t_i|x) = 0.5, or w_i^T x + b_i = 0, which may be considered to be 'decision boundaries' for each bit. These offer more convincing evidence of the presence of three clusters. \n\nFigure 1: Visualisation of two synthetic clustered datasets. The three clusters have been denoted by separate glyphs, the size of which reflects the number of examples whose posterior means are located at that point in the latent space. In the right plot, lines corresponding to P(t_i|x) = 0.5 have been drawn. \n\nHandwritten digit data. On the left of Figure 2, a visualisation is given of 1000 examples derived from 16 × 16 images of handwritten digit '2's. 
There is visual evidence of the natural variability of writing styles in the plot, as the posterior latent means in Figure 2 describe an approximate 'horseshoe' structure. On the right of the figure we examine the nature of this by plotting gray-scale images of the vectors P(t|x_j), where x_j are four numbered samples in the visualisation space. These images illustrate the expected value of each bit given the latent-space location, and demonstrate that the location is indeed indicative of the style of the digit, notably the presence of a loop. \n\nFigure 2: Left: visualisation of 256-dimensional digit '2' data. Right: gray-scale images of the conditional probability of each bit at the latent space locations marked. \n\nAccuracy of the variational approximation. To investigate the accuracy of the approximation, the sampling algorithm of Section 3 for likelihood maximisation was implemented and applied to the above two datasets. The evolution of error (negative log-likelihood per data-point) was plotted against time for both algorithms, using identical initialisations. The 'true' error for the variational approach was estimated using the same 500-point Monte-Carlo sample. Typical results are shown in Figure 3, and the final running time and error (using a sensible stopping criterion) are given for both datasets in Table 1. \n\nFigure 3: Error vs. time for the synthetic data (left) and the digit '2' data (right). \n\nFor these two example datasets, the variational algorithm converges considerably more quickly than in the sampling case, and the difference in final error is relatively small, particularly so for the larger-dimensionality dataset. The approximation of the posterior distributions p(x_n|t_n) is the key factor in the accuracy of the algorithm. In Figure 4, contours of the posterior distribution in the latent space induced by a typical data point are shown for both algorithms and datasets. This approximation is more accurate as dimensionality increases (a phenomenon observed with other datasets too), as the true posterior becomes more Gaussian in form. \n\n6 Conclusions \n\nWe have outlined a variational approximation for parameter estimation in a probabilistic visualisation model and, although we have only considered its application to binary variables here, the extension to mixtures of arbitrary data types is readily implemented. 
For the two comparisons shown (and others not illustrated here), the approximation appears acceptably accurate, and particularly so for data of higher dimensionality. The algorithm is considerably faster than a sampling approach, which would permit incorporation of multiple models in a more complex hierarchical architecture, of a sort that has been effectively implemented for visualisation of continuous variables (Bishop and Tipping 1998). \n\n                 Synthetic-16         Digit-256 \n                 Time (s)   Error     Time (s)   Error \nVariational      7.8        5.14      25.6       30.23 \nSampling         331.1      4.93      1204.5     30.19 \n\nTable 1: Comparison of final error and running time for the two algorithms. \n\nFigure 4: True and approximated posteriors for a single example from the synthetic data set (top) and the digit '2' data (bottom). \n\n7 References \n\nBartholomew, D. J. (1987). Latent Variable Models and Factor Analysis. London: Charles Griffin & Co. Ltd. \n\nBishop, C. M., M. Svensen, and C. K. I. Williams (1998). GTM: the Generative Topographic Mapping. Neural Computation 10(1), 215-234. \n\nBishop, C. M. and M. E. Tipping (1998). A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 281-293. \n\nJaakkola, T. S. and M. I. Jordan (1997). Bayesian logistic regression: a variational approach. In D. Madigan and P. Smyth (Eds.), Proceedings of the 1997 Conference on Artificial Intelligence and Statistics, Ft Lauderdale, FL. \n\nLowe, D. and M. E. Tipping (1997). 
Neuroscale: novel topographic feature extraction with radial basis function networks. In M. Mozer, M. Jordan, and T. Petsche (Eds.), Advances in Neural Information Processing Systems 9, pp. 543-549. Cambridge, Mass: MIT Press. \n\nMacKay, D. J. C. (1995). Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research, Section A 354(1), 73-80. \n\nMoustaki, I. (1996). A latent trait and a latent class model for mixed observed variables. British Journal of Mathematical and Statistical Psychology 49, 313-334. \n\nSammel, M. D., L. M. Ryan, and J. M. Legler (1997). Latent variable models for mixed discrete and continuous outcomes. Journal of the Royal Statistical Society, Series B 59, 667-678. \n", "award": [], "sourceid": 1561, "authors": [{"given_name": "Michael", "family_name": "Tipping", "institution": null}]}