{"title": "Product Analysis: Learning to Model Observations as Products of Hidden Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 729, "page_last": 735, "abstract": null, "full_text": "Product Analysis:\n\nLearning to model observations as products of hidden variables\n\nBrendan J. Frey1, Anitha Kannan1, Nebojsa Jojic2\n\n1 Machine Learning Group, University of Toronto, www.psi.toronto.edu\n\n2 Vision Technology Group, Microsoft Research\n\nAbstract\n\nFactor analysis and principal components analysis can be used to model linear relationships between observed variables and linearly map high-dimensional data to a lower-dimensional hidden space. In factor analysis, the observations are modeled as a linear combination of normally distributed hidden variables. We describe a nonlinear generalization of factor analysis, called \"product analysis\", that models the observed variables as a linear combination of products of normally distributed hidden variables. Just as factor analysis can be viewed as unsupervised linear regression on unobserved, normally distributed hidden variables, product analysis can be viewed as unsupervised linear regression on products of unobserved, normally distributed hidden variables. The mapping between the data and the hidden space is nonlinear, so we use an approximate variational technique for inference and learning. Since product analysis is a generalization of factor analysis, the product analysis model can always attain a data likelihood at least as high as that of factor analysis. We give results on pattern recognition and illumination-invariant image clustering.\n\n1 Introduction\n\nContinuous-valued latent representations of observed feature vectors can be useful for pattern classification via Bayes rule, summarizing data sets, and producing low-dimensional representations of data for later processing. 
\n\nLinear techniques, including principal components analysis (Jolliffe 1986), factor analysis (Rubin and Thayer 1982) and probabilistic principal components analysis (Tipping and Bishop 1999), model the input as a linear combination of hidden variables, plus sensor noise. The noise models are quite different in all 3 cases (see Tipping and Bishop (1999) for a discussion). For example, whereas factor analysis can account for different noise variances in the coordinates of the input, principal components analysis assumes that the noise variances are the same in different input coordinates. Also, whereas factor analysis accounts for the sensor noise when estimating the combination weights, principal components analysis does not.\n\nOften, the input coordinates are not linearly related, but instead the input vector is the result of a nonlinear generative process. In particular, data often can be accurately described as the product of unknown random variables. Examples include the combination of \"style\" and \"content\" (Tenenbaum and Freeman 1997), and the combination of a scalar light intensity and a reflectance image.\n\nWe introduce a generalization of factor analysis, called \"product analysis\", that performs maximum likelihood estimation to model the input as a linear combination of products of hidden variables. Although exact EM is not tractable because the hidden variables are nonlinearly related to the input, the form of the product analysis model makes it well-suited to a variational inference technique and a variational EM algorithm. 
\n\nOther approaches to learning nonlinear representations include principal surface analysis (Hastie 1984) and nonlinear autoencoders (Baldi and Hornik 1989; Diamantaras and Kung 1996), which minimize a reconstruction error when the data is mapped to the latent space and back; mixtures of linear models (Kambhatla and Leen 1994; Ghahramani and Hinton 1997; Tipping and Bishop 1999), which approximate nonlinear relationships using piece-wise linear patches; density networks (MacKay 1995), which use Markov chain Monte Carlo methods to learn potentially very complex density functions; generative topographic maps (Bishop, Svens\u00e9n and Williams 1998), which use a finite set of fixed samples in the latent space for efficient inference and learning; and kernel principal components analysis (Sch\u00f6lkopf, Smola and M\u00fcller 1998), which finds principal directions in nonlinear functions of the input.\n\nOur goal in developing product analysis is to introduce a technique that\n\n\u2022 produces a density estimator of the data\n\u2022 separates sensor noise from the latent structure\n\u2022 learns a smooth, nonlinear map from the input to the latent space\n\u2022 works for high-dimensional data and high-dimensional latent spaces\n\u2022 is particularly well-suited to products of latent variables\n\u2022 is computationally efficient\n\nWhile none of the other approaches described above directly addresses all of these goals, product analysis does.\n\n2 Factor analysis model\n\nOf the three linear techniques described above, factor analysis has the simplest description as a generative model of the data. The input vector $x$ is modeled using a vector of hidden variables $z$. The hidden variables are independent and normally distributed with zero mean and unit variance:\n\n$$p(z) = \\mathcal{N}(z; 0, I).$$ 
(1)\n\nThe input is modeled as a linear combination of the hidden variables, plus independent Gaussian noise:\n\n$$p(x \\mid z) = \\mathcal{N}(x; Az, \\Psi).$$ (2)\n\nThe model parameters are the factor loading matrix $A$ and the diagonal matrix of sensor noise variances, $\\Psi$.\n\nFactor analysis (cf. Rubin and Thayer 1982) is the procedure for estimating $A$ and $\\Psi$ using a training set. The marginal distribution over the input is $p(x) = \\mathcal{N}(x; 0, AA^T + \\Psi)$, so factor analysis can be viewed as estimating a low-rank parameterization of the covariance matrix of the data.\n\n3 Product analysis model\n\nIn the \"product analyzer\", the input vector $x$ is modeled using a vector of hidden variables $z$, which are independent and normally distributed with zero mean and unit variance:\n\n$$p(z) = \\mathcal{N}(z; 0, I).$$ (3)\n\nIn factor analysis, the input is modeled as a linear combination of the hidden variables. In product analysis, the input is modeled as a linear combination of monomials in the hidden variables. The power of variable $z_k$ in monomial $i$ is $s_{ik}$. So, the $i$th monomial is\n\n$$f_i(z) = \\prod_k z_k^{s_{ik}}.$$ (4)\n\nDenoting the vector of $f_i(z)$'s by $f(z)$, the density of the input given $z$ is\n\n$$p(x \\mid z) = \\mathcal{N}(x; A f(z), \\Psi).$$ (5)\n\nThe model parameters are $A$ and the diagonal covariance matrix $\\Psi$. Here, we learn $A$, keeping the distribution over $z$ fixed. Alternatively, if $A$ is known a priori, we can learn the distribution over $z$, keeping $A$ fixed.\n\nThe matrix $S = \\{s_{ik}\\}$ can be specified beforehand, estimated from the data using cross-validation, or averaged over in a Bayesian fashion. When $S = I$, $f(z) = z$ and the product analyzer simplifies to the factor analyzer. If, for some $i$, $s_{ik} = 0$ for all $k$, then $f_i(z) = 1$ and this monomial will account for a constant offset in the input. 
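\n\nAs an illustrative sketch (the particular choice of $S$ here is hypothetical and is not one used in the experiments), consider two hidden variables and four monomials with powers\n\n$$S = \\begin{pmatrix} 1 & 0 \\\\ 0 & 1 \\\\ 1 & 1 \\\\ 0 & 0 \\end{pmatrix}, \\qquad f(z) = [z_1, \\; z_2, \\; z_1 z_2, \\; 1]^T.$$\n\nWriting $a_i$ for the $i$th column of $A$, the model mean is $A f(z) = a_1 z_1 + a_2 z_2 + a_3 z_1 z_2 + a_4$. Under $p(z) = \\mathcal{N}(z; 0, I)$ the four monomials are uncorrelated, with $E[f(z)] = [0, 0, 0, 1]^T$ and $E[z_1^2] = E[z_2^2] = E[z_1^2 z_2^2] = 1$, so\n\n$$E[x] = a_4, \\qquad \\mathrm{Cov}[x] = a_1 a_1^T + a_2 a_2^T + a_3 a_3^T + \\Psi.$$\n\nThe first two moments therefore resemble those of a factor analyzer with three factors and an offset, but the product term $z_1 z_2$ makes $p(x)$ non-Gaussian, which a factor analyzer cannot represent.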
\n\n4 Product analysis\n\nExact EM in the product analyzer is intractable, since the sufficient statistics require averaging over the posterior $p(z \\mid x)$, for which we do not have a tractable expression. Instead, we use a variational approximation (Jordan et al. 1998), where for each training case, the posterior $p(z \\mid x)$ is approximated by a factorized Gaussian distribution $q(z)$ and the parameters of $q(z)$ are adjusted to make the approximation accurate. Then, the approximation $q(z)$ is used to compute the sufficient statistics for each training case in a generalized EM algorithm (Neal and Hinton 1993).\n\nThe q-distribution is specified by the variational parameters $\\eta$ and $\\Phi$:\n\n$$q(z) = \\mathcal{N}(z; \\eta, \\Phi),$$ (6)\n\nwhere $\\Phi$ is a diagonal covariance matrix with diagonal elements $\\phi_k$.\n\n$q$ is optimized by minimizing the relative entropy (Kullback-Leibler divergence),\n\n$$K = \\int_z q(z) \\ln \\frac{q(z)}{p(z \\mid x)}.$$ (7)\n\nIn fact, minimizing this relative entropy is equivalent to maximizing the following lower bound on the log-probability of the observation:\n\n$$B = \\int_z q(z) \\ln \\frac{p(x, z)}{q(z)} \\le \\ln p(x).$$ (8)\n\nPulling $\\ln p(x)$ out of the integral, the bound can be expressed as\n\n$$B = \\ln p(x) - \\int_z q(z) \\ln \\frac{q(z)}{p(z \\mid x)} = \\ln p(x) - K.$$ (9)\n\nSince $\\ln p(x)$ does not depend on the variational parameters, maximizing $B$ is equivalent to minimizing $K$. Note that since $K \\ge 0$, $B \\le \\ln p(x)$. Using Lagrange multipliers, it is easy to show that the bound is maximized when $q(z) = p(z \\mid x)$, in which case $K = 0$ and $B = \\ln p(x)$.\n\nSubstituting the expressions for $p(z)$, $p(x \\mid z)$ and $q(z)$ into (8), and using the fact that $f(z)^T A^T \\Psi^{-1} A f(z) = \\mathrm{tr}(f(z)^T A^T \\Psi^{-1} A f(z)) = \\mathrm{tr}(A^T \\Psi^{-1} A f(z) f(z)^T)$, we have\n\n$$B = \\frac{1}{2} \\Big( \\ln|2\\pi e \\Phi| - \\ln|2\\pi \\Psi| - \\ln|2\\pi I| - \\eta^T \\eta - \\mathrm{tr}(\\Phi) - x^T \\Psi^{-1} x + 2 x^T \\Psi^{-1} A \\, E[f(z)] - \\mathrm{tr}(A^T \\Psi^{-1} A \\, E[f(z) f(z)^T]) \\Big),$$ (10)\n\nwhere $E[\\cdot]$ denotes an expectation with respect to $q(z)$. 
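\n\nAs a sketch of where these terms come from, the bound in (8) can be split using $p(x, z) = p(x \\mid z) p(z)$ into the entropy of $q$ plus two expected log-densities,\n\n$$B = -E[\\ln q(z)] + E[\\ln p(z)] + E[\\ln p(x \\mid z)],$$\n\nwith\n\n$$-E[\\ln q(z)] = \\tfrac{1}{2} \\ln|2\\pi e \\Phi|,$$\n\n$$E[\\ln p(z)] = -\\tfrac{1}{2} \\ln|2\\pi I| - \\tfrac{1}{2} (\\eta^T \\eta + \\mathrm{tr}(\\Phi)),$$\n\n$$E[\\ln p(x \\mid z)] = -\\tfrac{1}{2} \\ln|2\\pi \\Psi| - \\tfrac{1}{2} E[(x - A f(z))^T \\Psi^{-1} (x - A f(z))].$$\n\nThe second of these uses $E[z^T z] = \\eta^T \\eta + \\mathrm{tr}(\\Phi)$ for $q(z) = \\mathcal{N}(z; \\eta, \\Phi)$, and expanding the quadratic form in the third gives the terms involving $E[f(z)]$ and $E[f(z) f(z)^T]$.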
\nThe expectations are simplified as follows:\n\n$$E[f_i(z)] = E\\Big[\\prod_k z_k^{s_{ik}}\\Big] = \\prod_k E[z_k^{s_{ik}}] = \\prod_k m_{s_{ik}}(\\eta_k, \\phi_k),$$\n\n$$E[f_i(z) f_j(z)] = E\\Big[\\prod_k z_k^{s_{ik} + s_{jk}}\\Big] = \\prod_k E[z_k^{s_{ik} + s_{jk}}] = \\prod_k m_{s_{ik} + s_{jk}}(\\eta_k, \\phi_k),$$ (11)\n\nwhere $m_n(\\eta, \\phi)$ is the $n$th moment under a Gaussian with mean $\\eta$ and variance $\\phi$. Closed forms for the $m_n(\\eta, \\phi)$ are found by evaluating derivatives of the Gaussian moment generating function at zero:\n\n$$m_n(\\eta, \\phi) = \\frac{d^n}{dt^n} \\exp(\\eta t + \\phi t^2 / 2) \\Big|_{t=0}.$$ (12)\n\nAfter substituting the closed forms for the moments, $B$ is a polynomial in the $\\eta_k$'s and the $\\phi_k$'s. For each training case, $B$ is maximized with respect to the $\\eta_k$'s and the $\\phi_k$'s using, e.g., conjugate gradients. The model parameters $A$ and $\\Psi$ that maximize the sum of the bounds for the training cases can be computed directly, since $\\Psi$ does not affect the solution for $A$, $B$ is quadratic in $A$, and the optimal $\\Psi$ can be written in terms of $A$ and the variational parameters.\n\nIf the power of each latent variable is restricted to be 0 or 1 in each monomial, $s_{ik} \\in \\{0, 1\\}$, the above expressions simplify to\n\n$$E[f_i(z)] = \\prod_k \\eta_k^{s_{ik}}, \\qquad E[f_i(z) f_j(z)] = \\prod_k \\big( \\eta_k^{s_{ik} + s_{jk}} + s_{ik} s_{jk} \\phi_k \\big).$$\n\nIn this case, we can directly maximize $B$ with respect to each $\\eta_k$ in turn, since $B$ is quadratic in each $\\eta_k$.\n\n5 Experimental results\n\n5.1 Classification results on the Wisconsin breast cancer database: We obtained results on using product analysis for classification of malignant and benign cancer using the breast cancer database provided by Dr. Wolberg from the Univ. of Wisconsin. Each observation in the database is characterized by nine cytological features, namely, clump thickness, uniformity of cell size and shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli and mitoses. Each feature is assigned an integer between 1 and 10.\n\nFigure 1: a) Data from training set. Mean images learned using b) product analysis, c) mixture of Gaussians. 
\n\nIn their earlier work (Wolberg and Mangasarian 1990), the authors used linear programming for classification. The objective was to find a hyperplane that separates the classes of malignant and benign cancer. In the absence of a separating plane, the average sum of misclassifications of each class is minimized.\n\nOur approach is to learn one density model for the benign feature vectors and a second density model for the malignant feature vectors and then use Bayes rule to classify an input vector. With separate models, classification involves assigning the observation to the class with the largest posterior probability, as given by\n\n$$P(\\mathrm{class} \\mid x) = \\frac{P(x \\mid \\mathrm{class}) P(\\mathrm{class})}{P(x \\mid \\mathrm{benign}) P(\\mathrm{benign}) + P(x \\mid \\mathrm{malignant}) P(\\mathrm{malignant})}.$$\n\nTo compare with the result reported in (Wolberg and Mangasarian 1990), a 4.1% error rate on 369 instances, we used the same set for our learning scheme and found that product analysis produced 4% misclassification.\n\nIn addition, to compare the recognition rate of product analysis with the recognition rate of factor analysis, we divided the data set into 3 sets for training, validation and testing. The parameters of the model are learned using the training set, and tested on the validation set. This is repeated 20 times, remembering the parameters that provided the best classification rate on the validation set. Finally, the parameters that provided the best performance on the validation set are used to classify the test set, only once. Since the data is limited, we perform this experiment on 4 different random breakups of the data into training, validation and test sets. For the product analysis model, we chose 3 hidden variables without optimization, but for factor analysis, we chose the optimum number of factors. 
The average error rate on the 4 breakups was 5% using product analysis and 5.59% using factor analysis.\n\nFigure 2: Images generated from the learned mixture of product analyzers.\n\nFigure 3: First row: Observation. Second row: corresponding image normalized for translation and lighting after the lighting and transformation invariant model is learned.\n\n5.2 Mixture of lighting invariant appearance models: Often, objects are imaged under different illuminants. To learn an appearance model, we want to automatically remove the lighting effects and infer lighting-normalized images.\n\nSince ambient light intensity and reflectances of patches on the object multiply to produce a lighting-affected image, we can model lighting-invariance using a product analyzer, $P(x, z) = P(x \\mid z) P(z)$, where $x$ is the vector of pixel intensities of the observation, $z_1$ is the random variable describing the light intensity, and the remaining $z_i$ are the pixel intensities in the lighting-normalized image. We learn the distribution over $z$, where $f(z) = [z_1 z_2, z_1 z_3, \\ldots, z_1 z_{N+1}]^T$ and $A$ is the identity. By inferring $z_1$, we can remove its effect on the observation. The mixture of product analyzers has joint distribution $\\pi_c P(x \\mid z) P(z)$, where $\\pi_c$ is the probability of class $c$. It can be used to infer various kinds of images (e.g. faces of different people) under different lighting conditions.\n\nWe trained this model on images with 2 different poses of the same person (Fig. 1a). The variation in the images is governed by change in pose, light, and background clutter. Fig. 1b and Fig. 1c compare the components learned using a mixture of product analyzers and a mixture of Gaussians. Due to limited variation in the pose and large variation in lighting, the mixture of Gaussians is unable to extract the mean images. However, the mixture of product analyzers is able to capture the distributions well (Fig. 3). 
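\n\nAs a sketch of what the variational expectations look like for the lighting model above, $f(z) = [z_1 z_2, \\ldots, z_1 z_{N+1}]^T$ (assuming the factorized Gaussian $q(z)$ of Section 4, with mean $\\eta_k$ and variance $\\phi_k$ for each $z_k$), the Gaussian moments $m_1 = \\eta_k$ and $m_2 = \\eta_k^2 + \\phi_k$ give, for pixels $i \\ne j$,\n\n$$E[f_i(z)] = \\eta_1 \\eta_{i+1}, \\qquad E[f_i(z)^2] = (\\eta_1^2 + \\phi_1)(\\eta_{i+1}^2 + \\phi_{i+1}), \\qquad E[f_i(z) f_j(z)] = (\\eta_1^2 + \\phi_1) \\eta_{i+1} \\eta_{j+1}.$$\n\nIn this reading, $\\eta_1$ is the inferred light intensity for an image and $\\eta_2, \\ldots, \\eta_{N+1}$ form the inferred lighting-normalized image, so that the expected reconstruction of pixel $i$ is the product $\\eta_1 \\eta_{i+1}$.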
\n\n5.3 Transformation and lighting invariant appearance models: Geometric transformations like shift and shearing can occur when scenes are imaged. Transformation invariant mixtures of Gaussians and factor analyzers (Frey and Jojic 2002; Jojic et al. 2001) enable inferring transformation-neutral images. Here, we add lighting-invariance to this framework, enabling clustering based on interesting features such as pose, without concern for transformation and lighting effects.\n\n6 Conclusions\n\nWe introduced a density model that explains observations as products of hidden variables and we presented a variational technique for inference and learning in this model. On the Wisconsin breast cancer data, we found that product analysis outperforms factor analysis when used with Bayes rule for pattern classification. We also found that product analysis was able to separate the two hidden causes, lighting and image noise, in noisy images with varying illumination and varying pose.\n\nReferences\n\nBaldi, P. and Hornik, K. 1989. Neural networks and principal components analysis: Learning from examples without local minima. Neural Networks, 2:53-58.\n\nBishop, C. M., Svens\u00e9n, M., and Williams, C. K. I. 1998. GTM: The generative topographic mapping. Neural Computation, 10(1):215-235.\n\nDiamantaras, K. I. and Kung, S. Y. 1996. Principal Component Neural Networks. Wiley, New York NY.\n\nFrey, B. J. and Jojic, N. 2002. Transformation invariant clustering and linear component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. To appear. Available at http://www.cs.utoronto.ca/~frey.\n\nGhahramani, Z. and Hinton, G. E. 1997. The EM algorithm for mixtures of factor analyzers. University of Toronto Technical Report CRG-TR-96-1.\n\nHastie, T. 1984. Principal Curves and Surfaces. Stanford University, Stanford CA. Doctoral dissertation.\n\nJojic, N.
, Simard, P., Frey, B. J., and Heckerman, D. 2001. Separating appearance from deformation. To appear in Proceedings of the IEEE International Conference on Computer Vision.\n\nJolliffe, I. T. 1986. Principal Component Analysis. Springer-Verlag, New York NY.\n\nJordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. 1998. An introduction to variational methods for graphical models. In Jordan, M. I., editor, Learning in Graphical Models. Kluwer Academic Publishers, Norwell MA.\n\nKambhatla, N. and Leen, T. K. 1994. Fast non-linear dimension reduction. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 152-159. Morgan Kaufmann, San Francisco CA.\n\nMacKay, D. J. C. 1995. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research, 354:73-80.\n\nNeal, R. M. and Hinton, G. E. 1993. A new view of the EM algorithm that justifies incremental and other variants. Unpublished manuscript available over the internet by ftp at ftp://ftp.cs.utoronto.ca/pub/radford/em.ps.Z.\n\nRubin, D. and Thayer, D. 1982. EM algorithms for ML factor analysis. Psychometrika, 47(1):69-76.\n\nSch\u00f6lkopf, B., Smola, A., and M\u00fcller, K.-R. 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319.\n\nTenenbaum, J. B. and Freeman, W. T. 1997. Separating style and content. In Mozer, M. C., Jordan, M. I., and Petsche, T., editors, Advances in Neural Information Processing Systems 9. MIT Press, Cambridge MA.\n\nTipping, M. E. and Bishop, C. M. 1999. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443-482.\n\nWolberg, W. H. and Mangasarian, O. L. 1990. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. In Proceedings of the National Academy of Sciences. 
\n\n\f", "award": [], "sourceid": 1973, "authors": [{"given_name": "Brendan", "family_name": "Frey", "institution": null}, {"given_name": "Anitha", "family_name": "Kannan", "institution": null}, {"given_name": "Nebojsa", "family_name": "Jojic", "institution": null}]}*