{"title": "JPMAX: Learning to Recognize Moving Objects as a Model-fitting Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 933, "page_last": 940, "abstract": null, "full_text": "JPMAX: Learning to Recognize Moving \n\nObjects as a Model-fitting Problem \n\nSuzanna Becker \n\nDepartment of Psychology, McMaster University \n\nHamilton, Onto L8S 4K1 \n\nAbstract \n\nUnsupervised learning procedures have been successful at low-level \nfeature extraction and preprocessing of raw sensor data. So far, \nhowever, they have had limited success in learning higher-order \nrepresentations, e.g., of objects in visual images. A promising ap(cid:173)\nproach is to maximize some measure of agreement between the \noutputs of two groups of units which receive inputs physically sep(cid:173)\narated in space, time or modality, as in (Becker and Hinton, 1992; \nBecker, 1993; de Sa, 1993). Using the same approach, a much sim(cid:173)\npler learning procedure is proposed here which discovers features \nin a single-layer network consisting of several populations of units, \nand can be applied to multi-layer networks trained one layer at \na time. When trained with this algorithm on image sequences of \nmoving geometric objects a two-layer network can learn to perform \naccurate position-invariant object classification. \n\n1 LEARNING COHERENT CLASSIFICATIONS \n\nA powerful constraint in sensory data is coherence over time, in space, and across \ndifferent sensory modalities. An unsupervised learning procedure which can capital(cid:173)\nize on these constraints may be able to explain much of perceptual self-organization \nin the mammalian brain. The problem is to derive an appropriate cost function for \nunsupervised learning which will capture coherence constraints in sensory signals; \nwe would also like it to be applicable to multi-layer nets to train hidden as well \nas output layers. 
Our ultimate goal is for the network to discover natural object classes based on these coherence assumptions.

1.1 PREVIOUS WORK

Successive images in continuous visual input are usually views of the same object; thus, although the image pixels may change considerably from frame to frame, the image usually can be described by a small set of consistent object descriptors, or lower-level feature descriptors. We refer to this type of continuity as temporal coherence. This sort of structure is ubiquitous in sensory signals, from vision as well as other senses, and can be used by a neural network to derive temporally coherent classifications. This idea has been used, for example, in temporal versions of the Hebbian learning rule to associate items over time (Weinshall, Edelman and Bülthoff, 1990; Földiák, 1991). To capitalize on temporal coherence for higher-order feature extraction and classification, we need a more powerful learning principle.

A promising approach is to maximize some measure of agreement between the outputs of two groups of units which receive inputs physically separated in space, time or modality, as in (Becker and Hinton, 1992; Becker, 1993; de Sa, 1993). This forces the units to extract features which are coherent across the different input sources. Becker and Hinton's (1992) Imax algorithm maximizes the mutual information between the outputs of two modules, y_a and y_b, connected to different parts of the input, a and b. Becker (1993) extended this idea to the problem of classifying temporally varying patterns by applying the discrete case of the mutual information cost function to the outputs of a single module at successive time steps, y_a(t) and y_a(t + 1). However, the success of this method relied upon the back-propagation of derivatives to train the hidden layer, and it was found to be extremely susceptible to local optima. 
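The mutual-information objective behind Imax can be illustrated with a small numerical sketch. The function below is my own illustration (names and base-e units are not from the paper): it computes the mutual information of a discrete joint distribution over two groups' classifications, which is zero when the classifications are independent and maximal under one-to-one, one-of-n agreement.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution.

    joint[i, j] is the probability that group a outputs class i and
    group b outputs class j; marginals are obtained by summing rows
    and columns.
    """
    pa = joint.sum(axis=1, keepdims=True)      # marginal over classes of a
    pb = joint.sum(axis=0, keepdims=True)      # marginal over classes of b
    nz = joint > 0                             # skip zero cells (0 log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])))

# Perfect one-to-one agreement (a diagonal joint) maximizes the measure;
# a uniform joint (independent outputs) gives zero.
diagonal = np.eye(4) / 4.0
uniform = np.full((4, 4), 1.0 / 16.0)
```

For m = 4 classes, the diagonal joint attains log 4 nats, the maximum possible, which is why these cost functions drive the two groups toward perfect one-of-n agreement.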
de Sa's method (1993) is closely related, and minimizes the probability of disagreement between output classifications, y_a(t) and y_b(t), produced by two modules having different inputs, e.g., from different sensory modalities. The success of this method hinges upon bootstrapping the first layer by initializing the weights to randomly selected training patterns, so this method too is susceptible to the problem of local optima. If we had a more flexible cost function that could be applied to a multi-layer network, first to each hidden layer in turn, and finally to the output layer for classification, so that the two layers could discover genuinely different structure, we might be able to overcome the problem of getting trapped in local optima, yielding a more powerful and efficient learning procedure.

We can analyze the optimal solutions for both de Sa's and Becker's cost functions (see Figure 1 a) and see that both cost functions are maximized by having perfect one-to-one agreement between the two groups of units over all cases, using a one-of-n encoding, i.e., having only a single output unit on for each case. A major limitation of these methods is that they strive for perfect classifications by the units. While this is desirable at the top layer of a network, it is an unsuitable goal for training intermediate layers to detect low-level features. For example, features like oriented edges would not be perfect predictors across spatially or temporally nearby image patches in images of translating and rotating objects. Instead, we might expect that an oriented edge at one location would predict a small range of similar orientations at nearby locations. So we would prefer a cost function whose optimal solution was more like those shown in Figure 1 b) or c). This would allow a feature i in group a to agree with any of several nearby features, e.g., i - 1, i, or i + 1 in group b. 
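The relaxed agreement described here, where feature i in group a may agree with any of features i - 1, i, or i + 1 in group b, corresponds to a banded joint distribution like Figure 1 b). A minimal sketch of constructing such a band as a normalized prior (the function name and the uniform weighting within the band are my own assumptions, not the paper's):

```python
import numpy as np

def banded_prior(m, band=1):
    """Prior joint distribution over m x m classification pairs in which
    class i in group a may agree with classes i-band .. i+band in group b,
    with zero density elsewhere (cf. Figure 1 b)."""
    p = np.zeros((m, m))
    for i in range(m):
        lo, hi = max(0, i - band), min(m, i + band + 1)
        p[i, lo:hi] = 1.0                  # uniform density inside the band
    return p / p.sum()                     # normalize to a joint distribution

prior = banded_prior(5)
```

Widening `band` interpolates between the strict diagonal target of the earlier methods and progressively looser notions of agreement between the two groups.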
\n\n\fJPMAX \n\n935 \n\na) \n\n\u2022 \n\u2022 \u2022 \n=== \n.!iii \n.. \n=== ;;; \n\n...... , \n\nII \n, \n\n\u2022 == -= = \n\u2022 \n\n\u2022 \u2022 , \u2022 \u2022 \u2022 \n\nII' ill!! \nII \nI' \nI' \n\nb) \n\u2022\u2022 11 \n\u2022\u2022\u2022\u2022 \n= -.1. \nI!!!!! \u2022\u2022 \u2022 \n\u2022 \u2022 11, \u2022 \u2022 \u2022 \n\"'-= .i. \u00b7 1' \u2022\u2022\u2022 \n\u2022\u2022 \u2022 \u2022\u2022 \u2022 \n\u2022 J[., \u2022 \n\u2022 \n\u2022\u2022 \n\nii \u2022\u2022\u2022 II \n\nI \n\nc) \n\nn \n\n\u2022 \u2022 11 \n= \u2022 \u2022 \u2022 \u2022 \n= I!!!!! \u2022 I!!IJ!!' \n\u2022\u2022\u2022 \n\nII' \u2022 \u2022 \n\n:== \n\n;;; \n\nII' \n\n.. \n.. \n\u2022 \u2022 II .'. ;;; \n\u2022\u2022 .. \n== .i. Ilii \n\u2022 \n\nI \u2022 \u2022 II \n. \n\nFigure 1: Three possible joint distributions for the probability that the i th and j th \nunits in two sets of m classification units are both on. White is high density, black \nis low density. The optimal joint distribution for Becker's and de Sa's algorithms \nis a matrix with all its density either in the diagonal as in a), or any subset of the \ndiagonal entries for de Sa's method, or a permutation of the diagonal matrix for \nBecker's algorithm. Alternative distributions are shown in b) and c). \n\n1.2 THE JPMAX ALGORITHM \n\nOne way to achieve an arbitrary configuration of agreement over time between two \ngroups of units (as in Figure 1 b) or c\u00bb \nis to treat the desired configuration as a \nprior joint probability distribution over their outputs. We can obtain the actual \ndistribution by observing the temporal correlations between pairs of units' outputs \nin the two groups over an ensemble of patterns. We can then optimize the actual \ndistribution to fit the prior. We now derive two different cost functions which \nachieve this result. Interestingly, they result in very similar learning rules. 
\nSuppose we have two groups of m units as shown in Figure 2 a), receiving inputs, x\"7t \nand xi\" from the same or nearby parts of the input image. Let Ca(t) and Cb(t) be \nthe classifications of the two input patches produced by the network at time step t; \nthe outputs of the two groups of units, y\"7t(t) and yi,(t), represent these classification \nprobabilities: \n\nYai(t) \n\n-\n\nP(Ca(t) = i) = Lj eneto;(t) \n\nenetoi(t) \n\nYbi(t) = P(Cb(t) = i) = Lj enetb;(t) \n\nenetbi (t) \n\n(1) \n\n(the usual \"soft max\" output function) where netai(t) and netbj(t) are the weighted \nnet inputs to units. We could now observe the expected joint probability distribu(cid:173)\ntion qij = E [Yai(t)Ybj(t + 1)]t = E (P(Ca(t) = i, Cb(t + 1) = j)]t by computing the \ntemporal covariances between the classification probabilities, averaged over the en(cid:173)\nsemble of training patterns; this joint probability is an m2-valued random variable. \nGiven the above statistics, one possible cost function we could minimize is the \n-log probability of the observed temporal covariance between the two sets of units' \noutputs under some prior distribution (e.g. Figure 1 b) or c\u00bb. If we knew the \nactual frequency counts for each (joint) classification k = kll ,\u00b7\u00b7 ., kIm, k21 , ... ,kmm, \n\n\f936 \n\nSuzanna Becker \n\nb) \n\nAt \n\nb'~-#--#----------\"'-\"\"\"\"-\"''''''''''''''''''''':iIo. \n\n