Shape Context: A New Descriptor for Shape Matching and Object Recognition

Serge Belongie, Jitendra Malik and Jan Puzicha
Department of Electrical Engineering and Computer Sciences
University of California at Berkeley
Berkeley, CA 94720, USA
{sjb,malik,puzicha}@cs.berkeley.edu

In: Advances in Neural Information Processing Systems, pp. 831-837.

Abstract

We develop an approach to object recognition based on matching shapes and using a resulting measure of similarity in a nearest-neighbor classifier. The key algorithmic problem here is that of finding pointwise correspondences between an image shape and a stored prototype shape. We introduce a new shape descriptor, the shape context, which makes this possible using a simple and robust algorithm. The shape context at a point captures the distribution over relative positions of the other shape points and thus summarizes global shape in a rich, local descriptor. We demonstrate that shape contexts greatly simplify recovery of correspondences between the points of two given shapes. Once shapes are aligned, shape contexts are used to define a robust score for measuring shape similarity. We have used this score in a nearest-neighbor classifier for recognition of handwritten digits as well as 3D objects, using exactly the same distance function. On the benchmark MNIST dataset of handwritten digits, this yields an error rate of 0.63%, outperforming other published techniques.

1 Introduction

The last decade has seen increased application of statistical pattern recognition techniques to the problem of object recognition from images.
Typically, an image block with n pixels is regarded as an n-dimensional feature vector formed by concatenating the brightness values of the pixels. Given this representation, a number of different strategies have been tried, e.g. nearest-neighbor techniques after extracting principal components [15, 13], convolutional neural networks [12], and support vector machines [14, 5]. Impressive performance has been demonstrated on datasets such as digits and faces.

A vector of pixel brightness values is a somewhat unsatisfactory representation of an object. Basic invariances, e.g. to translation, scale and small amounts of rotation, must be obtained by suitable pre-processing or by the use of enormous amounts of training data [12]. Instead, we will try to extract "shape", which by definition is required to be invariant under a group of transformations. The problem then becomes that of operationalizing a definition of shape. The literature in computer vision and pattern recognition is full of definitions of shape descriptors and distance measures, ranging from moments and Fourier descriptors to the Hausdorff distance and the medial axis transform. (For a recent overview, see [16].) Most of these approaches suffer from one of two difficulties: (1) mapping the shape to a small number of numbers, e.g. moments, loses information, which inevitably means sacrificing discriminative power; (2) descriptors restricted to silhouettes and closed curves are of limited applicability. Shape is a much more general concept.

Fundamentally, shape is about relative positional information. This has motivated approaches such as [1], which find key points or landmarks and recognize objects using the spatial arrangements of point sets.
However, not all objects have distinguished key points (think of a circle, for instance), and using key points alone sacrifices the shape information available in the smooth portions of object contours.

Our approach therefore uses a general representation of shape: a set of points sampled from the contours on the object. Each point is associated with a novel descriptor, the shape context, which describes the coarse arrangement of the rest of the shape with respect to the point. This descriptor will be different for different points on a single shape S; however, corresponding (homologous) points on similar shapes S and S' will tend to have similar shape contexts. Correspondences between the point sets of S and S' can be found by solving a bipartite weighted graph matching problem with edge weights C_ij defined by the similarity of the shape contexts of points i and j. Given correspondences, we can effectively calculate the similarity between the shapes S and S'. This similarity measure is then employed in a nearest-neighbor classifier for object recognition.

The core of our work is the concept of shape contexts and its use for solving the correspondence problem between two shapes. It can be compared to an alternative framework for matching point sets due to Gold, Rangarajan and collaborators (e.g. [7, 6]). They propose an iterative optimization algorithm to jointly determine point correspondences and underlying image transformations. The cost measure is the Euclidean distance between the first point set and a transformed version of the second point set. This formulation leads to a difficult non-convex optimization problem, which is solved using deterministic annealing. Another related approach is elastic graph matching [11], which also leads to a difficult stochastic optimization problem.
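The correspondence step above, minimum-cost bipartite matching with costs C_ij derived from descriptor similarity, can be sketched as follows. This is our own illustration, not the paper's code: we assume the chi-squared histogram distance as the cost, and we use a brute-force search over permutations as a toy stand-in for the polynomial-time assignment algorithms one would use in practice (it is only feasible for very small point sets).

```python
import itertools

import numpy as np


def chi2_cost(h1, h2, eps=1e-10):
    """Chi-squared distance between two histograms (an assumed cost; the
    eps terms guard against division by zero for empty bins)."""
    p = h1 / (h1.sum() + eps)
    q = h2 / (h2.sum() + eps)
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))


def match_points(H1, H2):
    """One-to-one assignment between two equal-sized sets of descriptors
    (rows of H1 and H2) minimizing total cost. Brute force over all
    permutations -- a toy substitute for bipartite graph matching."""
    n = len(H1)
    C = np.array([[chi2_cost(H1[i], H2[j]) for j in range(n)] for i in range(n)])
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(C[i, perm[i]] for i in range(n))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost
```

For real shapes with around a hundred sample points, the brute-force loop would be replaced by a standard assignment solver (e.g. the Hungarian method) operating on the same cost matrix C.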
\n\n2 Matching with Shape Contexts \n\nIn our approach, a shape is represented by a discrete set of points sampled from the \ninternal or external contours on the shape. These can be obtained as locations of \nedge pixels as found by an edge detector, giving us a set P = {PI, ... ,Pn}, Pi E lR?, \nof n points. They need not, and typically will not, correspond to key-points such \nas maxima of curvature or inflection points. We prefer to sample the shape with \nroughly uniform spacing, though this is also not critical. Fig. 1(a,b) shows sample \npoints for two shapes. For each point Pi on the first shape, we want to find the \n\"best\" matching point qj on the second shape. This is a correspondence problem \nsimilar to that in stereopsis. Experience there suggests that matching is easier if \none uses a rich local descriptor instead of just the brightness at a single pixel or \nedge location. Rich descriptors reduce the ambiguity in matching. \n\nIn this paper, we propose a descriptor, the shape context, that could play such a role \nin shape matching. Consider the set of vectors originating from a point to all other \nsample points on a shape. These vectors express the configuration of the entire \nshape relative to the reference point. Obviously, this set of n - 1 vectors is a rich \n\n\f...... \n\n. . . \n. \n. \n.. . . \n. \n. \n: ..... .. + ... . \n\nIj) \n\n: \n. . .. \n\n..... \n\n, . \n. .. \n\n(a) \n\n(b) \n\n(c) \n\n(d) \n\n(e) \n\nFigure 1: Shape context computation and matching. (a, b) Sampled edge points of two \nshapes. (c) Diagram of log-polar histogram bins used in computing the shape contexts. We \nuse 5 bins for log rand 12 bins for (). (d-f) Example shape contexts for reference samples \nmarked by 0,0,