{"title": "Learning to categorize objects using temporal coherence", "book": "Advances in Neural Information Processing Systems", "page_first": 361, "page_last": 368, "abstract": null, "full_text": "Learning to categorize objects using \n\ntemporal coherence \n\nSuzanna Becker\u00b7 \n\nThe Rotman Research Institute \n\nBaycrest Center \n3560 Bathurst St. \n\nToronto, Ontario, M6A 2E1 \n\nAbstract \n\nThe invariance of an objects' identity as it transformed over time \nprovides a powerful cue for perceptual learning. We present an un(cid:173)\nsupervised learning procedure which maximizes the mutual infor(cid:173)\nmation between the representations adopted by a feed-forward net(cid:173)\nwork at consecutive time steps. We demonstrate that the network \ncan learn, entirely unsupervised, to classify an ensemble of several \npatterns by observing pattern trajectories, even though there are \nabrupt transitions from one object to another between trajecto(cid:173)\nries. The same learning procedure should be widely applicable to \na variety of perceptual learning tasks. \n\n1 \n\nINTRODUCTION \n\nA promising approach to understanding human perception is to try to model its \ndevelopmental stages. There is ample evidence that much of perception is learned. \nEven some very low level perceptual abilities such as stereopsis (Held, Birch and \nGwiazda, 1980; Birch, Gwiazda and Held, 1982) are not present at birth, and appear \nto be learned. Once rudimentary feature detection abilities have been established, \nthe infant can learn to segment the sensory input, and eventually classify it into \nfamiliar patterns. These earliest stages of learning seem to be inherently unsuper-\n\n\u2022 Address as of July 1993: Department of Psychology, McMaster University, 1280 Main \n\nStreet West, Hamilton Ontario, Canada, L8S 4K1 \n\n361 \n\n\f362 \n\nBecker \n\nvised (or \"self-supervised\"). Gradually, the infant learns to detect regularities in \nthe world. 
One kind of structure that is ubiquitous in sensory information is spatio-temporal coherence. For example, in speech signals, speaker characteristics such as the fundamental frequency are relatively constant over time. At shorter time scales, individual words are typically composed of long intervals having relatively constant spectral characteristics, corresponding to vowels, with short intervening bursts and rapid transitions corresponding to consonants. The consonants also change across time in very regular ways. This temporal coherence at various scales makes speech predictable, to a certain degree. As one moves about in the world, the visual field flows by in characteristic patterns of expansion, dilation and translation. Since most objects in the visual world move slowly, if at all, the visual scene changes slowly over time, exhibiting the same temporal coherence as other sensory sources. Independently moving rigid objects are invariant with respect to shape, texture and many other features, up to very high-level properties such as the object's identity. Even under nonlinear shape distortions, images like clouds drifting across the sky are perceived to have coherent features, in spite of undergoing highly non-rigid transformations. Thus, temporal coherence of the sensory input may provide important cues for segmenting signals in space and time, and for object localization and identification. \n\n2 PREVIOUS WORK \n\nA common approach to training neural networks to perform transformation-invariant object recognition is to build in hard constraints which enforce invariance with respect to the transformations of interest. For example, equality constraints among feature-detecting kernels have been used to enforce translation-invariance (Fukushima, 1988; Le Cun et al., 1990). 
Various other higher-order constraints have been used to enforce viewpoint-invariance (Hinton and Lang, 1985; Zemel, Hinton and Mozer, 1990) and invariance with respect to arbitrary group transformations (Giles and Maxwell, 1987). While in the case of translation-invariance it is straightforward to hard-wire the appropriate constraints, more general linear transformation-invariance requires rather cumbersome machinery, and for arbitrary non-linear transformations the approach is difficult if not impossible. \n\nIn contrast to the above approaches, Foldiak's model of complex cell development results in translation-invariant orientation detectors without the imposition of any hard constraints (Foldiak, 1991). Further, his method is unsupervised. He proposed a modified Hebbian learning rule, in which each weight change depends on the unit's output history: \n\n\\Delta w_{ij}(t) = \\alpha \\, \\bar{y}_i(t) \\, (x_j(t) - w_{ij}(t)) \n\nwhere x_j(t) is the activity of the jth presynaptic unit at the tth time step, and \\bar{y}_i(t) is a temporally low-pass filtered trace of the postsynaptic activity of the ith unit. Whereas a standard Hebb rule encourages a unit to detect correlations between its inputs, this rule encourages a unit to produce outputs which are correlated over time. A single unit can therefore learn to group patterns which have zero overlap. Foldiak demonstrated this by presenting trajectories of moving lines, with line orientation held constant within each trajectory, to a network whose input features were local orientation detectors. Units became tuned to particular orientations, independent of location. \n\nWhile Foldiak's work is of interest as a model of cell development in early visual cortex, there are several reasons why it cannot be applied directly to the more general problem of transformation-invariant object recognition. 
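The trace rule described above can be sketched in a few lines. The following is a minimal illustration, not Foldiak's implementation: the learning rate alpha, the trace decay delta, and the linear activation are all illustrative assumptions.

```python
import numpy as np

# Sketch of a trace-based Hebbian update in the spirit of Foldiak (1991):
# dw_ij = alpha * ybar_i * (x_j - w_ij), where ybar_i is a low-pass
# filtered trace of the unit's output rather than its instantaneous value.
# alpha, delta, and the linear activation are illustrative assumptions.

def trace_rule_step(w, x, ybar, alpha=0.1, delta=0.2):
    # Current output of the unit (linear activation for simplicity).
    y = float(w @ x)
    # Low-pass filtered trace of past outputs.
    ybar = (1.0 - delta) * ybar + delta * y
    # Move weights toward inputs that were active while the trace was high,
    # so temporally adjacent patterns come to share a detector.
    w = w + alpha * ybar * (x - w)
    return w, ybar
```

Because the trace decays slowly, patterns presented in succession reinforce the same weight vector even when they have no pixel overlap, which is what lets a single unit group the members of one trajectory.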
One reason that Foldiak's learning rule worked well on the line trajectory problem is that the input representation (oriented line features) made the problem linearly separable: there was no overlap between input features present in successive trajectories, hence it was easy to categorize lines of the same orientation. Generally, in more difficult pattern classification problems (such as digit or speech recognition) the optimal input features cannot be preselected but must be learned, and there is considerable overlap between the component features of different pattern classes. Hence, a multi-layer network is required, and it must be able to optimally select features so as to improve its classification performance. The question of interest here is whether it is possible to train such a network entirely unsupervised. As mentioned above, the temporal coherence of the sensory input may be an important cue for solving this problem in biological systems. \n\n3 TEMPORAL-COHERENCE BASED LEARNING \n\nOne way to capture the constraint of temporal coherence in a learning procedure is to build it into the objective function. For example, we could try to build representations that are relatively predictable, at least over short time scales. We also need a constraint which captures the notion of high information content; for example, we could require that the network be unpredictable over long time scales. A measure which satisfies both criteria is the mutual information between the classifications produced by the network at successive time steps. 
If the network produces classification C(t) at time t and classification C(t+1) at time t+1, the mutual information between the two successive classifications, averaged over the entire sequence of patterns, is given by \n\nI(C_t; C_{t+1}) = H(C_t) + H(C_{t+1}) - H(C_t, C_{t+1}) = -\\sum_i \\langle p_i^t \\rangle \\log \\langle p_i^t \\rangle - \\sum_j \\langle p_j^{t+1} \\rangle \\log \\langle p_j^{t+1} \\rangle + \\sum_{ij} \\langle p_i^t p_j^{t+1} \\rangle \\log \\langle p_i^t p_j^{t+1} \\rangle \n\nwhere the angle brackets denote time-averaged quantities. \n\nA set of n output units can be forced to represent a probability distribution over n classes, C \\in \\{c_1, ..., c_n\\}, by adopting states whose probabilities sum to one. This can be done, for example, by using the \"soft max\" activation function suggested by Bridle (1990): \n\np_i^t = \\frac{e^{x_i(t)}}{\\sum_{j=1}^n e^{x_j(t)}} = P(C(t) = c_i) \n\nwhere x_i(t) is the total weighted summed input to the ith unit, and p_i^t, the output of the ith unit, stands for the probability of the ith class, P(C(t) = c_i). \n\nOnce we know the probability of the network assigning each pattern to each class, we can compute the mutual information between the classifications produced by the network at neighboring time steps, C(t) and C(t+1). This requires sampling, over the entire training set, the average probability of each class, as well as the joint probabilities of each possible pair of classifications being produced at successive time steps. The learning involves adjusting the weights in the network so as to maximize the mutual information between the representations produced by the network at adjacent time steps. In the experiments reported here, a gradient ascent procedure was used with the method of conjugate gradients. \n\nOne problem with maximizing the information measure described above is that for a fixed amount of entropy in the classifications, H(C_t), the network can always improve the mutual information by decreasing the joint entropy, H(C_t, C_{t+1}). 
In order to achieve low joint entropy, the network must try to assign class probabilities with high certainty, i.e., produce output values near zero or one. Thus the network can always improve its current solution by simply making the weights very large. Unfortunately, this often occurs during learning. To discourage the network from getting stuck in such locally optimal (but very poor) solutions, we introduce a constant \\lambda to weight the importance of the joint entropy term in the objective function, so as to maximize the following: \n\nI_\\lambda = H(C_t) + H(C_{t+1}) - \\lambda H(C_t, C_{t+1}) \n\nIn the simulations reported here, we used a value of 0.5 for \\lambda. This effectively prevents the network from concentrating all its effort on reducing the joint entropy, and forces it to learn more gradually, resulting in more globally optimal solutions. \n\nWe have tested this learning procedure on a simple signal classification problem. The pattern set consisted of trajectories of random intensity patterns, drawn from six classes, shown in Figure 1. Members of the same class consisted of translated versions of the same pattern, shifted one to five pixels with wrap-around. A trajectory consisted of a block of ten randomly selected patterns from the same class. Between trajectories, the pattern class changed randomly. The network had six input units, twenty hidden units, and six output units. The hidden units used the logistic nonlinearity, 1/(1 + e^{-x}), and the output units used the softmax activation function. The hidden units had biases but the outputs did not.[1] After training the network on 1200 patterns (20 trajectories of 10 examples of each of the six patterns) for 300 conjugate gradient iterations, the output units always became reasonably specific to particular pattern classes, as shown for a typical run in Figure 2a). The general pattern is that each output unit responds maximally to one or two pattern classes, although some of the units have mixed responses. 
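The weighted mutual-information objective can be estimated directly from a sequence of softmax outputs. The sketch below is our own illustration, not the paper's code: the array shapes, function names, and the clipping constant used to keep the logarithms finite are all assumptions.

```python
import numpy as np

# Minimal sketch of the temporally weighted mutual-information objective:
# H(C_t) + H(C_{t+1}) - lam * H(C_t, C_{t+1}), estimated from `probs`,
# a (T, n) array of softmax outputs over T time steps. The clipping
# constant and function names are illustrative assumptions.

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p))

def temporal_mi_objective(probs, lam=0.5):
    p_t = probs[:-1].mean(axis=0)    # time-averaged <p_i^t>
    p_t1 = probs[1:].mean(axis=0)    # time-averaged <p_j^{t+1}>
    # <p_i^t p_j^{t+1}>: joint class probabilities at successive steps.
    joint = np.einsum('ti,tj->ij', probs[:-1], probs[1:]) / (len(probs) - 1)
    return entropy(p_t) + entropy(p_t1) - lam * entropy(joint)
```

A perfectly predictable sequence (e.g., deterministically alternating classes) scores higher under this objective than a maximally uncertain one, which is the behavior the learning procedure rewards.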
\n\nThis classification problem is extremely difficult for an unsupervised learning procedure, as there is considerable overlap between patterns in different classes, and essentially no overlap between patterns in the same class. It is therefore easy to see why a single unit might end up capturing a few patterns from one class and a few from another. We can create an easier subproblem by only training the network on half the patterns in each class. In this case, the network always learns to separate the six pattern classes either perfectly, or nearly so, as shown in Figure 2b). \n\n[1] Removing biases from the outputs helps prevent the network from getting trapped in local maxima during learning. \n\nFigure 1: The set of 6 random patterns used to create pattern trajectories. Each pattern was created by randomly setting the intensities of the 6 pixels, and normalizing the intensity profile to have zero mean. \n\n4 DISCUSSION \n\nBecker and Hinton (1992) showed that a network could learn to extract a continuous parameter of visual scenes which is coherent across space, by maximizing the mutual information between the outputs of two network modules that receive input from spatially adjacent parts of the input. Here, we have shown how the same idea can be applied to the temporal domain, to perform a discrete classification of the input assuming temporal coherence. We could also apply the same algorithm to the problem of unsupervised multi-sensory integration, by forming classifications which are coherent across different sensory modalities, as well as across time. 
\n\nOne advantage of the approach presented here over unsupervised learning procedures such as competitive learning is that units must co-operate to try to find a globally optimal solution. There is therefore incentive for each unit to try to improve the temporal predictability of all of the output units' classifications over time, including its own; this discourages any one unit from trying to model all of the patterns. Additionally, because we have a well-defined objective function for the learning, the procedure can be applied to multi-layer networks which discover features specifically tuned to the classification problem. \n\nFigure 2: The probability of each output unit responding for each of the six classes of patterns, averaged over 1200 cases. In a) the pattern trajectories contained six shifted examples of each class, while in b) there were three examples of each class. \n\nHowever, there are a few drawbacks to using this learning procedure. 
One is that if any lower-order temporally coherent structure exists, the network will invariably discover it. So, for example, if the pattern classes differ in their average intensity, the network can easily learn to separate them simply by detecting the average intensity of the inputs and ignoring all other information. Similarly, if the spatial location of pattern features varies slowly and predictably over time, the network tends to learn a spatial map rather than solving the higher-order problem of pattern classification. On the other hand, this suggests that a sequential approach to modelling temporally coherent structure may be possible: an initial processing stage could try to model low-order temporal structure such as local spatial correlations, a second processing stage could model the remaining structure in the output of the first over a larger spatio-temporal extent, and so on. \n\nA second drawback is the space complexity of the algorithm: for a network with n output units, one must store n^2 joint probability statistics and n individual probabilities.[2] The storage complexity can be reduced from n^2 + n to just two statistics per output unit by optimizing a more constrained objective function in which each output unit assumes a maximum entropy distribution for the other n - 1 units. It then need only consider the average probability of its own output, and the joint probability of its output at successive time steps. In this case, the mutual information can be approximated by a sum of n terms: \n\n\\sum_i H(C_{i,t}) + H(C_{i,t+1}) - H(C_{i,t}, C_{i,t+1}) \n\nwhere H(C_{i,t}) = -\\langle p_i^t \\rangle \\log \\langle p_i^t \\rangle - (1 - \\langle p_i^t \\rangle) \\log \\frac{1 - \\langle p_i^t \\rangle}{n-1} is the entropy of the ith output unit under the maximum entropy assumption for the other output units, and the other constrained entropies are computed similarly. 
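The reduced per-unit statistics can be sketched as follows. How the residual probability mass is spread uniformly over the other n - 1 classes in the joint term is our reading of 'computed similarly', so treat that part, and all the function names, as assumptions rather than the paper's definitions.

```python
import numpy as np

# Sketch of the O(n) approximation: each unit keeps only its own average
# probability and the joint probability of its successive outputs, assuming
# a maximum-entropy distribution over the other n - 1 classes. The joint
# entropy construction below is our assumption, not spelled out in the text.

def unit_entropy(p_avg, n):
    # H(C_i) = -<p> log <p> - (1 - <p>) log((1 - <p>) / (n - 1))
    p = np.clip(p_avg, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log((1 - p) / (n - 1))

def unit_joint_entropy(q, a, b, n):
    # States: (i,i) with prob q, (i,other) and (other,i) spread over n-1
    # classes each, (other,other) spread over (n-1)^2 class pairs.
    terms = np.array([q, a - q, b - q, 1 - a - b + q])
    mult = np.array([1, n - 1, n - 1, (n - 1) ** 2])
    t = np.clip(terms, 1e-12, 1.0)
    return float(-np.sum(t * np.log(t / mult)))

def approx_mi(p_t, p_t1, joint, n):
    # Sum over units of H(C_i,t) + H(C_i,t+1) - H(C_i,t, C_i,t+1).
    return sum(unit_entropy(a, n) + unit_entropy(b, n)
               - unit_joint_entropy(q, a, b, n)
               for a, b, q in zip(p_t, p_t1, joint))
```

Under this approximation a unit whose successive outputs are independent contributes zero, while temporally correlated outputs contribute positively, mirroring the full objective at a fraction of the storage cost.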
\n\nA final drawback of the learning procedure presented here, as discussed earlier, is its tendency to become trapped in local optima with very large weights. We dealt with this by introducing a constant parameter, \\lambda, to dampen the importance of the joint entropy term. A more principled way to deal with the problem of local optima is to use stochastic rather than deterministic output units, resulting in a stochastic gradient ascent learning procedure (although this would increase the simulation time considerably). Another way of obtaining more globally optimal solutions might be to consider the predictability of classifications over longer time scales rather than just at pairwise time steps, as was done in Foldiak's model (1991). The network could thus maximize the mutual information between its current response and a weighted average of its responses over the last few time steps. \n\n[2] Note, however, that the complexity (both in time and space) of the computation of these statistics is negligible relative to that of the gradient calculations, assuming there are many more weights than the squared number of output units in the network. \n\n5 CONCLUSIONS \n\nThe invariance of an object's identity over time, with respect to transformations it may undergo as it and/or the observer move, provides a powerful cue for perceptual learning. We have demonstrated that a network can learn, entirely unsupervised, to build translation-invariant object detectors based on the assumption of temporal coherence in the input. This procedure should be widely applicable to a variety of perceptual learning tasks, such as identifying phonemes in speech, segmenting objects in images of trajectories, and classifying textures in tactile input. \n\nAcknowledgments \n\nI thank Geoff Hinton for many fruitful discussions that led to the ideas presented in this paper. \n\nReferences \n\nBecker, S. and Hinton, G. E. (1992). 
A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161-163. \n\nBirch, E. E., Gwiazda, J., and Held, R. (1982). Stereoacuity development for crossed and uncrossed disparities in human infants. Vision Research, 22:507-513. \n\nBridle, J. S. (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Touretzky, D. S., editor, Neural Information Processing Systems, Vol. 2, pages 211-217, San Mateo, CA. Morgan Kaufmann. \n\nFoldiak, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3(2):194-200. \n\nFukushima, K. (1988). Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1:119-130. \n\nGiles, C. L. and Maxwell, T. (1987). Learning, invariance, and generalization in high-order neural networks. Applied Optics, 26(23):4972-4978. \n\nHeld, R., Birch, E. E., and Gwiazda, J. (1980). Stereoacuity of human infants. Proceedings of the National Academy of Sciences USA, 77(9):5572-5574. \n\nHinton, G. E. and Lang, K. (1985). Shape recognition and illusory conjunctions. In IJCAI 9, Los Angeles. \n\nLe Cun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L. (1990). Handwritten digit recognition with a back-propagation network. In Touretzky, D., editor, Advances in Neural Information Processing Systems, pages 396-404, Denver 1989. Morgan Kaufmann, San Mateo. \n\nZemel, R. S., Hinton, G. E., and Mozer, M. C. (1990). TRAFFIC: object recognition using hierarchical reference frame transformations. In Advances in Neural Information Processing Systems 2, pages 266-273. Morgan Kaufmann Publishers.", "award": [], "sourceid": 596, "authors": [{"given_name": "Suzanna", "family_name": "Becker", "institution": null}]}