{"title": "Fast, Large-Scale Transformation-Invariant Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 721, "page_last": 727, "abstract": null, "full_text": "Fast, large-scale transformation-invariant\nclustering\nBrendan J. Frey\n\nMachine Learning Group\nUniversity of Toronto\nwww.psi.toronto.edu/#frey\n\nNebojsa Jojic\n\nVision Technology Group\nMicrosoft Research\nwww.ifp.uiuc.edu/#jojic\nAbstract\n\nIn previous work on ``transformed mixtures of Gaussians'' and\n``transformed hidden Markov models'', we showed how the EM al-\ngorithm in a discrete latent variable model can be used to jointly\nnormalize data (e.g., center images, pitch-normalize spectrograms)\nand learn a mixture model of the normalized data. The only input\nto the algorithm is the data, a list of possible transformations, and\nthe number of clusters to find. The main criticism of this work\nwas that the exhaustive computation of the posterior probabili-\nties over transformations would make scaling up to large feature\nvectors and large sets of transformations intractable. Here, we de-\nscribe how a tremendous speed-up is acheived through the use of\na variational technique for decoupling transformations, and a fast\nFourier transform method for computing posterior probabilities.\nFor NN images, learning C clusters under N rotations, N scales,\n\nN x-translations and N y-translations takes only (C + 2 log N)N\n\n2\n\nscalar operations per iteration. In contrast, the original algorithm\ntakes CN\n\n6\n\noperations to account for these transformations. We\ngive results on learning a 4-component mixture model from a video\nsequence with frames of size 320240. The model accounts for 360\nrotations and 76,800 translations. 
Each iteration of EM takes only 10 seconds per frame in MATLAB, which is over 5 million times faster than the original algorithm.\n\n1 Introduction\n\nThe task of clustering raw data such as video frames and speech spectrograms is often obfuscated by the presence of random, but well-understood transformations in the data. Examples of these transformations include object motion and camera motion in video sequences and pitch modulation in spectrograms.\nThe machine learning community has proposed a variety of sophisticated techniques for pattern analysis and pattern classification, but these techniques have mostly assumed the data is already normalized (e.g., the patterns are centered in the images) or nearly normalized. Linear approximations to the transformation manifold have been used to significantly improve the performance of feedforward discriminative classifiers such as nearest neighbors and multilayer perceptrons (Simard, LeCun and Denker 1993). Linear generative models (factor analyzers, mixtures of factor analyzers) have also been modified using linear approximations to the transformation manifold to build in some degree of transformation invariance (Hinton, Dayan and Revow 1997). A multi-resolution approach can be used to extend the usefulness of linear approximations (Vasconcelos and Lippman 1998), but this approach is susceptible to local minima (e.g., a pie may be confused for a face at low resolution).\nFor significant levels of transformation, linear approximations are far from exact and better results can be obtained by explicitly considering transformed versions of the input. This approach has been used to design ``convolutional neural networks'' that are invariant to translations of parts of the input (LeCun et al. 1998).\nIn previous work on ``transformed mixtures of Gaussians'' (Frey and Jojic 2001) and ``transformed hidden Markov models'' (Jojic et al. 
2000), we showed how the EM algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., center video frames, pitch-normalize spectrograms) and learn a mixture model of the normalized data. We found ``that the algorithm is reasonably fast (it learns in minutes or hours) and very effective at transformation-invariant density modeling.'' Those results were for 44×28 images, but realistic applications such as home video summarization require near-real-time processing of medium-quality video at resolutions near 320×240. In this paper, we show how a variational technique and a fast Fourier method for computing posterior probabilities can be used to achieve this goal.\n\n2 Background\n\nIn (Frey and Jojic 2001), we introduced a single discrete variable that enumerates a discrete set of possible transformations that can occur in the input. Here, we break the transformation into a sequence of transformations. T_k is the random variable for the transformation matrix at step k. So, if 𝒯_k is the set of possible transformation matrices corresponding to the type of transformation at step k (e.g., image rotation), then T_k ∈ 𝒯_k.\nThe generative model is shown in Fig. 1a and consists of picking a class c, drawing a vector of image pixel intensities z_0 from a Gaussian, picking the first transformation matrix T_1 from 𝒯_1, applying this transformation to z_0 and adding Gaussian noise to obtain z_1, and repeating this process until the last transformation matrix T_K is drawn from 𝒯_K and is applied to z_{K-1} to obtain the observed data z_K. The joint distribution is\n\np(c, z_0, T_1, z_1, ..., T_K, z_K) = p(c) p(z_0 | c) ∏_{k=1}^{K} p(T_k) p(z_k | z_{k-1}, T_k). (1)\n\nThe probability of class c ∈ {1, ... 
, C} is parameterized by p(c) = π_c and the untransformed latent image has conditional density\n\np(z_0 | c) = N(z_0; μ_c, Φ_c), (2)\n\nwhere N() is the normal distribution, μ_c is the mean image for class c and Φ_c is the diagonal noise covariance matrix for class c. Notice that the noise modeled by Φ_c gets transformed, so Φ_c can model noise sources that depend on the transformations, such as background clutter and object deformations in images.\n\nFigure 1: (a) The Bayesian network for a generative model that draws an image z_0 from class c, applies a randomly drawn transformation matrix T_1 of type 1 (e.g., image rotation) to obtain z_1, and so on, until a randomly drawn transformation matrix T_K of type K (e.g., image translation) is applied to obtain the observed image z_K. (b) The Bayesian network for a factorized variational approximation to the posterior distribution, given z_K. (c) When an image is measured on a discrete, radial 2-D grid, a scale and rotation correspond to a shift in the radial and angular coordinates.\n\nThe probability of transformation matrix T_k at step k is p(T_k) = ρ_{k,T_k}. (In our experiments, we often fix this to be uniform.) At each step, we assume a small amount of noise with diagonal covariance matrix Ψ is added to the image, so\n\np(z_k | z_{k-1}, T_k) = N(z_k; T_k z_{k-1}, Ψ). (3)\n\nT_k operates on z_{k-1} to produce a transformed image. In fact, T_k can be viewed as a permutation matrix that rearranges the pixels in z_{k-1}. Usually, we assume Ψ = ψI and in our experiments we often set ψ to a constant, small value, such as 0.01.\nIn (Frey and Jojic 2001), an exact EM algorithm for learning this model is described. 
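Before turning to learning, the generative process just defined can be sketched in a few lines of numpy, with circular shifts standing in for general transformations and a single transformation step (K = 1). All parameter values below are invented for illustration; because a shift only permutes pixels, `np.roll` plays the role of the permutation matrix T_k:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative, made-up parameters: C = 2 classes of 8x8 images.
C, N = 2, 8
pi = np.array([0.5, 0.5])            # class priors pi_c
mu = rng.normal(size=(C, N, N))      # class mean images mu_c
phi = np.full((C, N, N), 0.1)        # diagonal variances Phi_c
psi = 0.01                           # per-step noise variance psi

def sample_image():
    c = rng.choice(C, p=pi)                                 # pick a class c
    z0 = mu[c] + np.sqrt(phi[c]) * rng.normal(size=(N, N))  # z0 ~ N(mu_c, Phi_c)
    T = tuple(int(t) for t in rng.integers(0, N, size=2))   # T1 uniform over shifts
    z1 = np.roll(z0, T, axis=(0, 1))                        # apply the pixel permutation
    z1 = z1 + np.sqrt(psi) * rng.normal(size=(N, N))        # add noise, as in eq. (3)
    return c, z0, T, z1
```

The learning problem is the inverse of this sketch: given only samples z1, recover the class means and the posterior over classes and shifts.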
The sufficient statistics for π_c, μ_c and Φ_c are computed by averaging the derivatives of ln(π_c N(z_0; μ_c, Φ_c)) over the posterior distribution,\n\np(c, z_0 | z_K) = Σ_{T_1} ... Σ_{T_K} p(z_0 | c, T_1, ..., T_K, z_K) p(c, T_1, ..., T_K | z_K). (4)\n\nSince z_0, ..., z_K are jointly Gaussian given c and T_1, ..., T_K, p(z_0 | c, T_1, ..., T_K, z_K) is Gaussian and its mean and covariance are computed using linear algebra. Also, p(c, T_1, ..., T_K | z_K) is computed using linear algebra.\nThe problem with this direct approach is that the number of scalar operations in (4) is very large for large feature vectors and large sets of transformations. For N×N images, learning C clusters under N rotations, N scales, N x-translations and N y-translations leads to N⁴ terms in the summation. Since there are N² pixels, each term is computed using N² scalar operations. So, each iteration of EM takes CN⁶ scalar operations per training case. For 10 classes and images of size 256×256, the direct approach takes 2.8×10¹⁵ scalar operations per image for each iteration of EM.\nWe now describe how a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities, can reduce the above number to (C + 2 log N)N² scalar operations. For 10 classes and images of size 256×256, the new method takes 2,752,512 scalar operations per image for each iteration of EM.\n\n3 Factorized variational technique\n\nTo simplify the computation of the required posterior in (4), we use a variational approximation (Jordan et al. 1998). As shown in Fig. 1b, our variational approximation is a completely factorized approximation to the true posterior:\n\np(c, z_0, T_1, z_1, ..., T_K | z_K) ≈ q(c, z_0, T_1, z_1, ... 
, T_K)\n= q(c) q(z_0) [∏_{k=1}^{K-1} q(T_k) q(z_k)] q(T_K). (5)\n\nThe q-distributions are parameterized and these variational parameters are varied to make the approximation a good one. p(c, z_0 | z_K) ≈ q(c)q(z_0), so the sufficient statistics can be readily determined from q(c) and q(z_0). The variational parameters are q(c) = γ_c, q(T_k) = ξ_{k,T_k}, and q(z_k) = N(z_k; η_k, Ω_k).\nThe generalized EM algorithm (Neal and Hinton 1998) maximizes a lower bound on the log-likelihood of the observed image z_K:\n\nB = Σ ∫ q(c, z_0, T_1, z_1, ..., T_K) ln [ p(c, z_0, T_1, z_1, ..., T_K, z_K) / q(c, z_0, T_1, z_1, ..., T_K) ] ≤ ln p(z_K). (6)\n\nIn the E step, the variational parameters are adjusted to maximize B and in the M step, the model parameters are adjusted to maximize B.\nAssuming constant noise, Ψ = ψI, the derivatives of B with respect to the variational parameters produce the following E-step updates:\n\nΩ_0 ← (Σ_c γ_c Φ_c⁻¹ + ψ⁻¹ I)⁻¹,  η_0 ← Ω_0 (Σ_c γ_c Φ_c⁻¹ μ_c + ψ⁻¹ Σ_{T_1} ξ_{1,T_1} T_1⁻¹ η_1), (7)\n\nγ_c ∝ π_c exp( -(1/2) tr(Ω_0 Φ_c⁻¹) - (1/2)(η_0 - μ_c)ᵀ Φ_c⁻¹ (η_0 - μ_c) ),\n\nΩ_k ← (ψ/2) I,  η_k ← (1/2)( Σ_{T_k} ξ_{k,T_k} T_k η_{k-1} + Σ_{T_{k+1}} ξ_{k+1,T_{k+1}} T_{k+1}⁻¹ η_{k+1} ), (8)\n\nξ_{k,T_k} ∝ ρ_{k,T_k} exp( -(1/2) tr(Ω_k Ψ⁻¹) - (1/2) ψ⁻¹ (η_k - T_k η_{k-1})ᵀ (η_k - T_k η_{k-1}) ). (9)\n\nEach time the γ_c's are updated, they should be normalized, and similarly for the ξ_{k,T_k}'s. 
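To make update (9) concrete, the sketch below computes the posterior over a single step of x-y shifts by brute force, with a uniform prior ρ so that the posterior is a softmax of the scaled distances. The tr(Ω_k Ψ⁻¹) term is constant in T_k and cancels after normalization, so it is omitted; all sizes and values are toy choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
N, psi = 8, 0.01

eta_prev = rng.normal(size=(N, N))        # eta_{k-1}: mean of q(z_{k-1})
true_shift = (3, 5)
# eta_k: eta_{k-1} shifted by true_shift, plus a little noise
eta_k = np.roll(eta_prev, true_shift, axis=(0, 1)) + 0.01 * rng.normal(size=(N, N))

# xi(T) propto exp( -||eta_k - T eta_{k-1}||^2 / (2 psi) ), uniform prior rho(T)
log_xi = np.empty((N, N))
for dy in range(N):
    for dx in range(N):
        d = np.sum((eta_k - np.roll(eta_prev, (dy, dx), axis=(0, 1))) ** 2)
        log_xi[dy, dx] = -d / (2 * psi)
xi = np.exp(log_xi - log_xi.max())        # subtract max for numerical stability
xi /= xi.sum()                            # normalize, as the text requires

best = np.unravel_index(np.argmax(xi), xi.shape)
```

This double loop is exactly the O(N⁴) cost the paper is worried about; Section 4 replaces it with FFTs.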
One or more iterations of the above updates are applied for each training case and the variational parameters are stored for use in the M-step, and as the initial conditions for the next E-step.\nThe derivatives of B with respect to the model parameters produce the following M-step updates:\n\nπ_c ← ⟨γ_c⟩,  μ_c ← ⟨γ_c η_0⟩ / ⟨γ_c⟩,  Φ_c ← ⟨γ_c (Ω_0 + diag((η_0 - μ_c)(η_0 - μ_c)ᵀ))⟩ / ⟨γ_c⟩, (10)\n\nwhere ⟨⟩ indicates an average over the training set.\nThis factorized variational inference technique is quite greedy, since at each step, the method approximates the posterior with one Gaussian. So, the method works best for a small number of steps (2 in our experiments).\n\n4 Inference using fast Fourier transforms\n\nThe M-step updates described above take very few computations, but the E-step updates can be computationally burdensome. The dominant culprits are the computation of distances of the form\n\nd_T = (g - Th)ᵀ (g - Th) (11)\n\nin (9), for all possible transformations T, and the computation of sums of the form\n\nΣ_T ξ_T T h (12)\n\nin (7) and (8).\nSince the variational approximation is more accurate when the transformations are broken into fewer steps, it is a good idea to pack as many transformations into each step as possible. In our experiments, x-y translations are applied in one step, and rotations are applied in another step. However, the number of possible x-y translations in a 320×240 image is 76,800. So, 76,800 d_T's must be computed and the computation of each d_T uses a vector norm of size 76,800.\nIt turns out that if the data is defined on a coordinate system where the effect of a transformation is a shift, the above quantities can be computed very quickly using fast Fourier transforms (FFTs). For images measured on rectangular grids, an x-y translation corresponds to a shift in the coordinate system. For images measured on a radial grid, such as the one shown in Fig. 
1c, a scale and rotation correspond to a shift in the coordinate system (Wolberg and Zokai 2000).\nWhen updating the variational parameters, it is straightforward to convert them to the appropriate coordinate system, apply the FFT method and convert them back.\nWe now use a very different notation to describe the FFT method. The image is measured on a discrete grid and x is the x-y coordinate of a pixel in the image (x is a 2-vector). The images g and h in (11) and (12) are written as functions of x: g(x), h(x). In this representation, T is an integer 2-vector, corresponding to a shift in x. So, (11) becomes\n\nd(T) = Σ_x (g(x) - h(x + T))² = Σ_x (g(x)² - 2 g(x) h(x + T) + h(x + T)²) (13)\n\nand (12) becomes\n\nΣ_T ξ(T) h(x + T). (14)\n\nThe common form is the correlation\n\nf(T) = Σ_x g(x) h(x + T). (15)\n\nFor an N×N grid, computing the correlation directly for all T takes N⁴ scalar operations. The FFT can be used to compute the correlation in N² log N time. The FFTs G(ω) and H(ω) of g and h are computed in N² log N time. Then, the FFT F(ω) of f is computed in N² time as follows:\n\nF(ω) = G(ω)* H(ω), (16)\n\nwhere ``*'' denotes complex conjugate. Then the inverse FFT f(T) of F(ω) is computed in N² log N time.\nUsing this method, the posterior and sufficient statistics for all N² shifts in an N×N grid can be computed in N² log N time. Using this method along with the variational technique, C classes, N scales, N rotations, N x-translations and N y-translations can be accounted for using (C + 2 log N)N² scalar operations.\n\n5 Results\n\nIn order to compare our new learning algorithm with the previously published result, we repeated the experiment on clustering head poses in 200 frames of size 44×28. 
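The FFT computation of Section 4 that makes these comparisons possible is easy to verify numerically. The minimal numpy sketch below (toy sizes, circular shifts) computes the correlation (15) for every shift T at once via (16), derives all the distances d(T) of (13) from it, and checks one shift against direct evaluation:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 16
g = rng.normal(size=(N, N))
h = rng.normal(size=(N, N))

# Correlation f(T) = sum_x g(x) h(x + T) for every circular shift T at once:
# F(w) = conj(G(w)) * H(w), then inverse FFT -- O(N^2 log N) instead of O(N^4).
f = np.real(np.fft.ifft2(np.conj(np.fft.fft2(g)) * np.fft.fft2(h)))

# d(T) = sum g^2 - 2 f(T) + sum h^2: under a circular shift the squared terms
# do not depend on T, so one pair of FFTs yields all N^2 distances.
d = np.sum(g ** 2) - 2 * f + np.sum(h ** 2)

# Direct check at one shift T = (2, 3): h(x + T) is h rolled by -T.
T = (2, 3)
hT = np.roll(h, (-T[0], -T[1]), axis=(0, 1))
```

Here f[T] matches the direct sum Σ_x g(x)h(x+T) up to floating-point error, which is the identity the E-step exploits.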
We achieved essentially the same result, but in only 10 seconds as opposed to the 40 minutes that the original algorithm needed to complete the task. Both algorithms were implemented in MATLAB. It should be noted that the original algorithm tested only for 9 vertical and 9 horizontal shifts (81 combinations), while the new algorithm dealt with all 1,232 possible discrete shifts. This makes the new algorithm 600 times faster on low resolution data. The speed-up is even more drastic at higher resolutions, and when rotations and scales are added, since the complexity of the original algorithm is CN⁶, where C is the number of classes and N² is the number of pixels.\nThe speed-up promised in the abstract is based on our computations, but obviously we were not able to run the original algorithm on full 320×240 resolution data.\nTo illustrate that the fast variational technique presented here can be efficiently used to learn data means in the presence of scale change, significant rotations and translations in the data, we captured 10 seconds of video at 320×240 resolution and trained a two-stage transformation-invariant model in which the first stage modeled rotations and scales as shifts in the log-polar coordinate system and the second stage modeled all possible shifts as described above. In Fig. 2 we show the results of training an ordinary Gaussian model, a shift-invariant model and finally the scale, rotation and shift invariant model on the sequence. We also show three frames from the sequence stabilized using the variational inference.\n\n6 Conclusions\n\nWe described how a tremendous speed-up in training a transformation-invariant generative model can be achieved through the use of a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities. 
For N×N images, learning C clusters under N rotations, N scales, N x-translations and N y-translations takes only (C + 2 log N)N² scalar operations per iteration. In contrast, the original algorithm takes CN⁶ operations to account for these transformations. In this way we were able to reduce the computation to only seconds per frame for images of 320×240 resolution using a simple MATLAB implementation.\nThis opens the door for generative models of pixel intensities in video to be efficiently used for transformation-invariant video summary and search. As opposed to most techniques used in computer vision today, the generative modeling approach provides a likelihood model useful for search or retrieval, automatic clustering of the data, and extensibility through adding new hidden variables.\nThe model described here could potentially be useful for other high-dimensional data, such as audio.\n\nReferences\n\nDempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38.\nFrey, B. J. and Jojic, N. 2001. Transformation invariant clustering and dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence. To appear. Available at http://www.cs.utoronto.ca/~frey.\n\nFigure 2: Learning a rotation, scale and translation invariant model on 320×240 video.\n\nHinton, G. E., Dayan, P., and Revow, M. 1997. Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8(1):65-74.\nJojic, N., Petrovic, N., Frey, B. J., and Huang, T. S. 2000. Transformed hidden Markov models: Estimating mixture models of images and inferring spatial transformations in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.\nJordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. 1998. 
An introduction to variational methods for graphical models. In Jordan, M. I., editor, Learning in Graphical Models. Kluwer Academic Publishers, Norwell, MA.\nLeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324.\nNeal, R. M. and Hinton, G. E. 1998. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M. I., editor, Learning in Graphical Models, pages 355-368. Kluwer Academic Publishers, Norwell, MA.\nSimard, P. Y., LeCun, Y., and Denker, J. 1993. Efficient pattern recognition using a new transformation distance. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann, San Mateo, CA.\nVasconcelos, N. and Lippman, A. 1998. Multiresolution tangent distance for affine-invariant classification. In Jordan, M. I., Kearns, M. J., and Solla, S. A., editors, Advances in Neural Information Processing Systems 10. MIT Press, Cambridge, MA.\nWolberg, G. and Zokai, S. 2000. Robust image registration using log-polar transform. In Proceedings of the IEEE International Conference on Image Processing, Vancouver, Canada.\n", "award": [], "sourceid": 1962, "authors": [{"given_name": "Brendan", "family_name": "Frey", "institution": null}, {"given_name": "Nebojsa", "family_name": "Jojic", "institution": null}]}