Maximilian Riesenhuber, Peter Dayan
We present a connectionist method for representing images that explicitly addresses their hierarchical nature. It blends data from neuroscience about whole-object viewpoint-sensitive cells in inferotemporal cortex8 and attentional basis-field modulation in V4,3 with ideas about hierarchical descriptions based on microfeatures.5,11 The resulting model makes critical use of bottom-up and top-down pathways for analysis and synthesis.6 We illustrate the model with a simple example of representing information about faces.
1 Hierarchical Models
Images of objects constitute an important paradigm case of a representational hierarchy, in which 'wholes', such as faces, consist of 'parts', such as eyes, noses and mouths. The representation and manipulation of part-whole hierarchical information in fixed hardware is a heavy millstone around connectionist necks, and has consequently been the inspiration for many interesting proposals, such as Pollack's RAAM.11
We turned to the primate visual system for clues. Anterior inferotemporal cortex (IT) appears to construct representations of visually presented objects. Mouths and faces are both objects, and so require fully elaborated representations, presumably at the level of anterior IT, probably using different (or possibly partially overlapping) sets of cells. The natural way to represent the part-whole relationship between mouths and faces is to have a neuronal hierarchy, with connections bottom-up from the mouth units to the face units so that information about the mouth can be used to help recognize or analyze the image of a face, and connections top-down from the face units to the mouth units expressing the generative or synthetic knowledge that if there is a face in a scene, then there is (usually) a mouth too. There is little empirical support for or against such a neuronal hierarchy, but it seems extremely unlikely, since arranging for one with the correct set of levels for all classes of objects appears infeasible.

We thank Larry Abbott, Geoff Hinton, Bruno Olshausen, Tomaso Poggio, Alex Pouget, Emilio Salinas and Pawan Sinha for discussions and comments.
There is recent evidence that activities of cells in intermediate areas in the visual processing hierarchy (such as V4) are influenced by the locus of visual attention.3 This suggests an alternative strategy for representing part-whole information, in which there is an interaction, subject to attentional control, between top-down generative and bottom-up recognition processing. In one version of our example, activating units in IT that represent a particular face leads, through the top-down generative model, to a pattern of activity in lower areas that is closely related to the pattern of activity that would be seen when the entire face is viewed. This activation in the lower areas in turn provides bottom-up input to the recognition system. In the bottom-up direction, the attentional signal controls which aspects of that activation are actually processed, for example, specifying that only the activity reflecting the lower part of the face should be recognized. In this case, the mouth units in IT can then recognize this restricted pattern of activity as being a particular sort of mouth. Therefore, we have provided a way by which the visual system can represent the part-whole relationship between faces and mouths.
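This interaction can be sketched in a few lines of code. The sketch below is purely illustrative, not the paper's implementation: the weight matrices, layer sizes, and the choice of a binary spatial mask over the lower half of the units are all assumptions made for concreteness. It shows the three steps just described: top-down synthesis from a face unit in IT to a lower-area activity pattern, attentional gating of that pattern, and bottom-up recognition of the gated pattern by mouth units.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a "lower area" (e.g. V4) activity vector,
# viewpoint-selective face units in IT, and mouth units in IT.
N_LOWER = 16
N_FACES = 3
N_MOUTHS = 3

# Top-down generative weights: each face unit synthesizes a full
# pattern of lower-area activity (illustrative random weights).
G_face = rng.normal(size=(N_LOWER, N_FACES))

# Bottom-up recognition weights from the lower area to mouth units.
R_mouth = rng.normal(size=(N_MOUTHS, N_LOWER))

def synthesize(face_activity):
    """Top-down pass: IT face units -> lower-area activity."""
    return G_face @ face_activity

def attend_lower_half(lower_activity):
    """Attentional gating: pass only the units standing in for the
    'lower part of the face' (here, the second half of the vector)."""
    mask = np.zeros(N_LOWER)
    mask[N_LOWER // 2:] = 1.0
    return mask * lower_activity

def recognize_mouth(lower_activity):
    """Bottom-up pass: gated lower-area activity -> mouth units."""
    return R_mouth @ attend_lower_half(lower_activity)

face = np.array([1.0, 0.0, 0.0])         # activate one face unit in IT
lower = synthesize(face)                 # top-down synthesis
mouth_response = recognize_mouth(lower)  # attention-restricted analysis
print(mouth_response.shape)              # one activity per mouth unit
```

The key design point is that the mouth units never see the whole synthesized face: the attentional mask multiplies the lower-area activity before the bottom-up weights apply, so the part-whole relation is carried by the gating rather than by dedicated part-to-whole wiring.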
This describes just one of many possibilities. For instance, attentional control could instead be mainly active during the top-down phase, creating in V1 (or indeed in intermediate areas) just the activity corresponding to the lower portion of the face in the first place. Also, the focus of attention need not be so ineluctably spatial.
The overall scheme is based on a hierarchical top-down synthesis and bottom-up analysis model for visual processing, as in the Helmholtz machine6 (note that "hierarchy" here refers to a processing hierarchy rather than the part-whole hierarchy discussed above), with a synthetic model forming the effective map:
'object' ⊗ 'attentional eye-position' → 'image'    (1)
(shown in cartoon form in figure 1) where 'image' stands in for the (probabilities over the) activities of units at various levels in the system that would be caused by seeing the aspect of the 'object' selected by placing the focus and scale of attention appropriately. We use this generative model during synthesis in the way described above to traverse the hierarchical description of any particular image. We use the statistical inverse of the synthetic model as the way of analyzing images to determine what objects they depict. This inversion process is clearly also sensitive to the attentional eye-position - it actually determines not only the nature of the object in the scene, but also the way that it is depicted (i.e., its instantiation parameters) as reflected in the attentional eye-position.
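The map (1) and its statistical inverse can be made concrete with a toy sketch. Everything here is an assumption for illustration: objects are stored 8x8 templates, the attentional eye-position is reduced to a translation (ty, tx) selecting a fixed-size window, and the inverse is computed by exhaustive search rather than by a learned recognition model. The point is only the structure of the computation: synthesis maps (object, eye-position) to an image, and analysis recovers both the object identity and its instantiation parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: each 'object' is a stored 8x8 template; the
# attentional eye-position e selects a WIN x WIN window at offset (ty, tx).
TEMPLATES = rng.normal(size=(4, 8, 8))   # 4 objects
WIN = 4                                  # attended window size

def synthesize(obj, e):
    """Generative map (1): ('object', eye-position) -> 'image'."""
    ty, tx = e
    return TEMPLATES[obj][ty:ty + WIN, tx:tx + WIN]

def analyze(image):
    """Statistical inverse of the synthetic model, by exhaustive
    search: find the object and eye-position that best explain the
    image (minimum squared error)."""
    best = None
    for obj in range(TEMPLATES.shape[0]):
        for ty in range(8 - WIN + 1):
            for tx in range(8 - WIN + 1):
                err = np.sum((synthesize(obj, (ty, tx)) - image) ** 2)
                if best is None or err < best[0]:
                    best = (err, obj, (ty, tx))
    return best[1], best[2]

obj_true, e_true = 2, (1, 3)
img = synthesize(obj_true, e_true)       # top-down synthesis
obj_hat, e_hat = analyze(img)            # bottom-up analysis (inversion)
print(obj_hat, e_hat)
```

Note that analysis returns not just the object label but also the eye-position, mirroring the claim in the text that inversion determines both the nature of the object and the way it is depicted.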
In particular, the bottom-up analysis model exists in the connections leading to the 2D viewpoint-selective image cells in IT reported by Logothetis et al.,8 which form population codes for all the represented images (mouths, noses, etc). The top-down synthesis model exists in the connections leading in the reverse direction. In generalizations of our scheme, it may, of course, not be necessary to generate an image all the way down in V1.
The map (1) specifies a top-down computational task very like the bottom-up one addressed using a multiplicatively controlled synaptic matrix in the shifter model
Neural Models for Part-Whole Hierarchies
attentional eye position e = (s, t_y, t_x)