Part of Advances in Neural Information Processing Systems 13 (NIPS 2000)
Shimon Edelman, Nathan Intrator
We describe a unified framework for the understanding of structure representation in primate vision. A model derived from this framework is shown to be effectively systematic in that it has the ability to interpret and associate together objects that are related through a rearrangement of common "middle-scale" parts, represented as image fragments. The model addresses the same concerns as previous work on compositional representation through the use of what+where receptive fields and attentional gain modulation. It does not require prior exposure to the individual parts, and avoids the need for abstract symbolic binding.
1 The problem of structure representation
The focus of theoretical discussion in visual object processing has recently started to shift from problems of recognition and categorization to the representation of object structure. Although view- or appearance-based solutions for these problems proved effective on a variety of object classes, the "holistic" nature of this approach - the lack of explicit representation of relational structure - limits its appeal as a general framework for visual representation.
The main challenges in the processing of structure are productivity and systematicity, two traits commonly attributed to human cognition. A visual system is productive if it is open-ended, that is, if it can deal effectively with a potentially infinite set of objects. A visual representation is systematic if a well-defined change in the spatial configuration of the object (e.g., swapping top and bottom parts) causes a principled change in the representation (e.g., the interchange of the representations of top and bottom parts [3, 2]). A solution commonly offered to the twin problems of productivity and systematicity is compositional representation, in which symbols standing for generic parts drawn from a small repertoire are bound together by categorical, symbolically coded relations.
2 The Chorus of Fragments
In visual representation, the need for symbolic binding may be alleviated by using location in the visual field in lieu of the abstract frame that encodes object structure. Intuitively, the constituents of the object are then bound to each other by virtue of residing in their proper places in the visual field; this can be thought of as a pegboard, whose spatial structure supports the arrangement of parts suspended from its pegs. This scheme exhibits shallow compositionality, which can be enhanced by allowing the "pegboard" mechanism to operate at different spatial scales, yielding effective systematicity across levels of resolution. Coarse coding the constituents (e.g., representing each object fragment in terms of its similarities to some basis shapes) will render the scheme productive. We call this approach to the representation of structure the Chorus of Fragments (CoF).
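The pegboard idea and the coarse coding of fragments can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: fragments are toy feature vectors, similarity is a Gaussian of Euclidean distance, and the names `similarity`, `coarse_code`, and `pegboard` are hypothetical.

```python
import math

def similarity(a, b, sigma=1.0):
    """Gaussian similarity between two equal-length feature vectors (assumed measure)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def coarse_code(fragment, basis_shapes, sigma=1.0):
    """Describe a fragment by its similarities to a small set of basis shapes."""
    return [similarity(fragment, b, sigma) for b in basis_shapes]

def pegboard(fragments_by_location, basis_shapes):
    """Bind each coarse-coded fragment to the location it occupies."""
    return {loc: coarse_code(frag, basis_shapes)
            for loc, frag in fragments_by_location.items()}

# Toy basis shapes and a toy two-part object (hypothetical values).
basis = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
obj = {"top": [0.9, 0.1], "bottom": [0.2, 0.8]}
swapped = {"top": obj["bottom"], "bottom": obj["top"]}

rep = pegboard(obj, basis)
rep_swapped = pegboard(swapped, basis)
```

In this sketch, swapping the top and bottom fragments simply interchanges the codes stored at the two locations, which is the kind of principled change that systematicity calls for. Because the fragments are coded by graded similarity rather than by identity, a fragment never seen before still receives a code.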
2.1 Neurobiological building blocks
What+where cells. The representation of spatially anchored object fragments postulated by the CoF model can be supported by what+where neurons, each tuned both to a certain shape class and to a certain range of locations in the visual field. Such cells have been found in the monkey in areas V4 and posterior IT, and in the prefrontal cortex.
Attentional gain fields. To decouple the representation of object structure from its location in the visual field, one needs a version of the what+where mechanism in which the response of the cell depends not merely on the location of the stimulus with respect to fixation (as in classical receptive fields), but also on its location with respect to the focus of attention. Indeed, modulatory effects of object-centered attention on classical RF structure (gain fields) have been found in area V4.
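One way to picture an attention-anchored gain field is as a multiplicative spatial weighting whose center is defined relative to the focus of attention rather than to fixation. The sketch below assumes a Gaussian gain and a scalar shape drive; the function names, parameters, and numeric values are hypothetical and are not taken from the model or from the physiological data.

```python
import math

def gaussian(x, y, cx, cy, sigma):
    """Isotropic 2-D Gaussian centered at (cx, cy)."""
    return math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))

def gain_modulated_response(shape_drive, stim_xy, attn_xy,
                            preferred_offset, sigma=1.0):
    """Response of a what+where unit whose gain is anchored to the attention focus.

    shape_drive      -- how well the stimulus matches the unit's shape tuning
    stim_xy          -- stimulus position in retinal coordinates
    attn_xy          -- focus of attention in retinal coordinates
    preferred_offset -- stimulus position, relative to the focus, that the unit prefers
    """
    ox = stim_xy[0] - attn_xy[0]
    oy = stim_xy[1] - attn_xy[1]
    return shape_drive * gaussian(ox, oy, preferred_offset[0],
                                  preferred_offset[1], sigma)

above = (0.0, 1.0)  # unit prefers a fragment one unit above the attended point
r1 = gain_modulated_response(1.0, (5.0, 6.0), (5.0, 5.0), above)
r2 = gain_modulated_response(1.0, (8.0, 4.0), (8.0, 3.0), above)  # object and attention shifted together
r3 = gain_modulated_response(1.0, (5.0, 4.0), (5.0, 5.0), above)  # fragment below the attended point
```

Here `r1` and `r2` are equal because the stimulus keeps the same position relative to the focus of attention, while `r3` is smaller because the fragment sits on the wrong side of the focus. This is the sense in which such a unit codes structure in object-centered coordinates.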
Our implementation of the CoF model involves what+where cells with attention-modulated gain fields, and is aimed at productive and systematic treatment of composite shapes in object-centered coordinates. It operates directly on gray-level images, pre-processed by a model of the primary visual cortex, with complex-cell responses modified to use a previously suggested MAX operation. In the model, one what+where unit is assigned to the top and one to the bottom fragment of the visual field, each extracted by an appropriately configured Gaussian gain profile (Figure 2, left). Each unit is trained (1) to discriminate among five objects, (2) to tolerate translation within the hemifield, and (3) to provide an estimate of the reliability of its output, through an autoassociation mechanism attempting to reconstruct the stimulus image [11, 12]. Within each hemifield, the five outputs of a unit can provide a coarse coding of novel objects belonging to the familiar category, in a manner useful for translation-tolerant recognition. The reliability estimate carries information about category, allowing outputs for objects from other categories to be squelched. Most importantly, due to the spatial localization of the unit's receptive field, the system can distinguish between different configurations of the same shapes, while noting the fragment-wise similarities.
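The fragment extraction and squelching steps can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the gain profile varies only along the vertical axis, the image is a small synthetic array, and the reliability value is passed in as a plain number instead of being computed by an autoassociator. All function names, sizes, and the threshold are hypothetical.

```python
import math

def gaussian_gain_profile(height, width, center_row, sigma):
    """Gain map that depends only on the row (a simplifying assumption)."""
    return [[math.exp(-((r - center_row) ** 2) / (2.0 * sigma ** 2))
             for _ in range(width)]
            for r in range(height)]

def extract_fragment(image, gain):
    """Weight each pixel by the gain at its location."""
    return [[p * g for p, g in zip(img_row, gain_row)]
            for img_row, gain_row in zip(image, gain)]

def total(fragment):
    """Summed intensity of a gain-weighted fragment."""
    return sum(sum(row) for row in fragment)

def squelch(outputs, reliability, threshold=0.5):
    """Zero a unit's outputs when its reliability estimate is below threshold."""
    return list(outputs) if reliability >= threshold else [0.0] * len(outputs)

# Synthetic 8x4 image with content only in the upper half.
H, W = 8, 4
image = [[1.0] * W if r < H // 2 else [0.0] * W for r in range(H)]

top_gain = gaussian_gain_profile(H, W, center_row=1.5, sigma=1.5)
bottom_gain = gaussian_gain_profile(H, W, center_row=5.5, sigma=1.5)

top_energy = total(extract_fragment(image, top_gain))
bottom_energy = total(extract_fragment(image, bottom_gain))

# Five outputs per unit, as in the text; the numbers themselves are made up.
kept = squelch([0.1, 0.7, 0.2, 0.0, 0.0], reliability=0.9)
dropped = squelch([0.1, 0.7, 0.2, 0.0, 0.0], reliability=0.2)
```

With content confined to the upper half of the image, the top profile passes far more signal than the bottom one, so each unit effectively sees only its own fragment. The `squelch` step shows how a low reliability estimate could silence a unit's five outputs for an object from an unfamiliar category.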
We assume that during learning the system performs multiple fixations of the target object, effectively providing the what+where units with a basis for spanning the