{"title": "Similarity by Composition", "book": "Advances in Neural Information Processing Systems", "page_first": 177, "page_last": 184, "abstract": null, "full_text": "Similarity by Composition

Oren Boiman    Michal Irani
Dept. of Computer Science and Applied Math
The Weizmann Institute of Science
76100 Rehovot, Israel

Abstract

We propose a new approach for measuring similarity between two signals, which is applicable to many machine learning tasks and to many signal types. We say that a signal S1 is "similar" to a signal S2 if it is "easy" to compose S1 from few large contiguous chunks of S2. Obviously, if we use small enough pieces, then any signal can be composed of any other. Therefore, the larger those pieces are, the more similar S1 is to S2. This induces a local similarity score at every point in the signal, based on the size of its supported surrounding region. These local scores can in turn be accumulated in a principled information-theoretic way into a global similarity score of the entire S1 relative to S2. "Similarity by Composition" can be applied between pairs of signals, between groups of signals, and also between different portions of the same signal. It can therefore be employed in a wide variety of machine learning problems (clustering, classification, retrieval, segmentation, attention, saliency, labelling, etc.), and can be applied to a wide range of signal types (images, video, audio, biological data, etc.). We show a few such examples.

1 Introduction

A good measure of similarity between signals is necessary in many machine learning problems. However, the notion of "similarity" between signals can be quite complex. For example, observing Fig. 1, one would probably agree that Image-B is more "similar" to Image-A than Image-C is. But why...?
The con\ufb01gurations appearing in image-B are different than the ones observed in Image-A.\nWhat is it that makes those two images more similar than Image-C? Commonly used similarity\nmeasures would not be able to detect this type of similarity. For example, standard global similarity\nmeasures (e.g., Mutual Information [12], Correlation, SSD, etc.) require prior alignment or prior\nknowledge of dense correspondences between signals, and are therefore not applicable here. Dis-\ntance measures that are based on comparing empirical distributions of local features, such as \u201cbags\nof features\u201d (e.g., [11]), will not suf\ufb01ce either, since all three images contain similar types of local\nfeatures (and therefore Image-C will also be determined similar to Image-A).\n\nIn this paper we present a new notion of similarity between signals, and demonstrate its applicability\nto several machine learning problems and to several signal types. Observing the right side of Fig. 1,\nit is evident that Image-B can be composed relatively easily from few large chunks of Image-A (see\ncolor-coded regions). Obviously, if we use small enough pieces, then any signal can be composed\nof any other (including Image-C from Image-A). We would like to employ this idea to indicate high\nsimilarity of Image-B to Image-A, and lower similarity of Image-C to Image-A. In other words,\nregions in one signal (the \u201cquery\u201d signal) which can be composed using large contiguous chunks\nof data from the other signal (the \u201creference\u201d signal) are considered to have high local similarity.\nOn the other hand, regions in the query signal which can be composed only by using small frag-\nmented pieces are considered locally dissimilar. This induces a similarity score at every point in\nthe signal based on the size of its largest surrounding region which can be found in the other signal\n(allowing for some distortions). 
This approach provides the ability to generalize and infer about new configurations in the query signal that were never observed in the reference signal, while preserving structural information.

Figure 1: Inference by Composition – basic concept. Left: What makes Image-B look more similar to Image-A than Image-C does? (None of the ballet configurations in Image-B appear in Image-A!) Right: Image-B (the "query" signal) can be composed using few large contiguous chunks from Image-A (the "reference" signal), whereas it is more difficult to compose Image-C this way. The large shared regions between B and A (indicated by colors) provide strong evidence for their similarity.

For instance, even though the two ballet configurations observed in Image-B (the "query" signal) were never observed in Image-A (the "reference" signal), they can be inferred from Image-A via composition (see Fig. 1), whereas the configurations in Image-C are much harder to compose.

Note that the shared regions between similar signals are typically irregularly shaped, and therefore cannot be restricted to a predefined, regularly shaped partitioning of the signal. The shapes of those regions are data-dependent, and cannot be predefined. Our notion of signal composition is "geometric" and data-driven. In that sense it is very different from standard decomposition methods (e.g., PCA, ICA, wavelets, etc.), which seek a linear decomposition of the signal rather than a geometric one. Other attempts to retain the benefits of local similarity while maintaining global structural information have recently been proposed [8].
These have been shown to improve upon simple "bags of features", but are restricted to a preselected partitioning of the image into rectangular sub-regions.

In our previous work [5] we presented an approach for detecting irregularities in images/video as regions that cannot be composed from large pieces of data from other images/video. That approach was restricted to detecting local irregularities. In this paper we extend it into a general, principled theory of "Similarity by Composition", from which we derive local and global similarity and dissimilarity measures between signals. We further show that this framework extends to a wider range of machine learning problems and to a wider variety of signals (1D, 2D, 3D, etc.). More formally, we present a statistical (generative) model for composing one signal from another. Using this model we derive information-theoretic measures for local and global similarities induced by shared regions. The local similarities of shared regions ("local evidence scores") are accumulated into a global similarity score ("global evidence score") of the entire query signal relative to the reference signal. We further prove upper and lower bounds on the global evidence score, which are computationally tractable. We present both a theoretical and an algorithmic framework to compute, accumulate, and weight those gathered "pieces of evidence".

Similarity-by-Composition is not restricted to pairs of signals. It can also be applied to compute the similarity of a signal to a group of signals (i.e., compose a query signal from pieces extracted from multiple reference signals). Similarly, it can be applied to measure similarity between two different groups of signals. Thus, Similarity-by-Composition is suitable for detection, retrieval, classification, and clustering.
Moreover, it can also be used for measuring similarity or dissimilarity between different portions of the same signal. Intra-signal dissimilarities can be used for detecting irregularities or saliency, while intra-signal similarities can be used as affinity measures for sophisticated intra-signal clustering and segmentation.

The importance of large shared regions between signals has long been recognized by biologists for determining similarities between DNA sequences, amino acid chains, etc., and tools for finding large repetitions in biological data have been developed (e.g., "BLAST" [1]). In principle, the results of such tools can be fed into our theoretical framework, to obtain similarity scores between biological data sequences in a principled, information-theoretic way.

The rest of the paper is organized as follows: In Sec. 2 we derive information-theoretic measures for the local and global "evidence" (similarity) induced by shared regions. Sec. 3 describes an algorithmic framework for computing those measures. Sec. 4 demonstrates the applicability of the derived local and global similarity measures to various machine learning tasks and several types of signals.

2 Similarity by Composition – Theoretical Framework

We derive principled information-theoretic measures of local and global similarity between a "query" Q (one or more signals) and a "reference" ref (one or more signals). Large shared regions between Q and ref provide high statistical evidence for their similarity. In this section we show how to quantify this statistical evidence. We first formulate the notion of "local evidence" for local regions within Q (Sec. 2.1). We then show how these pieces of local evidence can be integrated to provide "global evidence" for the entire query Q (Sec. 2.2).

2.1 Local Evidence

Let R ⊆ Q be a connected region within Q. Assume that a similar region exists in ref.
We would like to quantify the statistical significance of this region co-occurrence, and show that it increases with the size of R. To do so, we compare the likelihood that R was generated by ref against the likelihood that it was generated by some random process.

More formally, we denote by Href the hypothesis that R was "generated" by ref, and by H0 the hypothesis that R was generated by a random process, or by any other application-dependent PDF (referred to as the "null hypothesis").

Href assumes the following model for the "generation" of R: a region was taken from somewhere in ref, was globally transformed by some global transformation T, possibly followed by small local distortions, and then put into Q to generate R. T can account for shifts, scaling, rotations, etc. In the simplest case (shifts only), T is the corresponding location in ref.

We can compute the likelihood ratio:

LR(R) = P(R|Href) / P(R|H0) = [ Σ_T P(R|T,Href) P(T|Href) ] / P(R|H0)    (1)

where P(T|Href) is the prior probability of the global transformation T (shifts, scaling, rotations), and P(R|T,Href) is the likelihood that R was generated from ref at that location, scale, etc. (up to some local distortions, which are also modelled by P(R|T,Href) – see algorithmic details in Sec. 3). If there are multiple corresponding regions in ref (i.e., multiple Ts), all of them contribute to the estimation of LR(R). We define the Local Evidence Score of R to be the log-likelihood ratio:

LES(R|Href) = log2(LR(R)).

LES is referred to as a "local evidence score" because the higher LES is, the smaller the probability that R was generated at random (H0). In fact, P( LES(R|Href) > l | H0 ) < 2^(-l), i.e., the probability of getting a score LES(R) > l for a randomly generated region R is smaller than 2^(-l) (this is due to LES being a log-likelihood ratio [3]).
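As an illustrative sketch (our own, not the authors' implementation), the score of Eq. (1) can be evaluated directly once the candidate transformations and their likelihoods are given; all function and parameter names below are ours:

```python
import math

def local_evidence_score(lik_per_T, prior_T, lik_null):
    """LES(R|Href) = log2( sum_T P(R|T,Href) * P(T|Href) / P(R|H0) ), as in Eq. (1).

    lik_per_T: P(R|T,Href) for each candidate transformation T
    prior_T:   P(T|Href) for each candidate transformation T
    lik_null:  P(R|H0), the likelihood of R under the null hypothesis
    """
    p_ref = sum(l * p for l, p in zip(lik_per_T, prior_T))  # P(R|Href)
    return math.log2(p_ref / lik_null)
```

For example, a region with a single matching shift that is 1000 times more likely under Href than under H0 scores roughly 10 bits, i.e., a chance of at most about 0.1% of arising at random.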
High LES therefore provides higher statistical evidence that R was generated from ref.

Note that the larger the region R ⊆ Q is, the higher its evidence score LES(R|Href) (and therefore it also provides higher statistical evidence for the hypothesis that Q was composed from ref). For example, assume for simplicity that R has a single identical copy in ref, and that T is restricted to shifts with uniform probability (i.e., P(T|Href) = const); then P(R|Href) is constant, regardless of the size of R. On the other hand, P(R|H0) decreases exponentially with the size of R. Therefore the likelihood ratio of R increases, and so does its evidence score LES.

LES can also be interpreted as the number of bits saved by describing the region R using ref instead of describing it using H0: Recall that the optimal average code length of a random variable y with probability function P(y) is length(y) = −log(P(y)). Therefore we can write the evidence score as LES(R|Href) = length(R|H0) − length(R|Href). Thus, larger regions provide higher savings (in bits) in the description length of R.

A region R induces "average savings per point" for every point q ∈ R, namely LES(R|Href)/|R| (where |R| is the number of points in R). However, a point q ∈ R may also be contained in other regions generated by ref, each with its own local evidence score. We can therefore define the maximal possible savings per point (which we will refer to, in short, as PES = "Point Evidence Score"):

PES(q|Href) = max_{R ⊆ Q s.t. q ∈ R} LES(R|Href) / |R|    (2)

For any point q ∈ Q we define R[q] to be the region which provides this maximal score for q. Fig. 1 shows such maximal regions found in Image-B (the query Q) given Image-A (the reference ref). In practice, many points share the same maximal region. Computing an approximation of LES(R|Href), PES(q|Href), and R[q] can be done efficiently (see Sec. 3).

2.2 Global Evidence

We now proceed to accumulate multiple local pieces of evidence. Let R1, ..., Rk ⊆ Q be k disjoint regions in Q, which have been generated independently from the examples in ref. Let R0 = Q \ ∪_{i=1}^k Ri denote the remainder of Q. Namely, S = {R0, R1, ..., Rk} is a segmentation/division of Q. Assuming that the remainder R0 was generated i.i.d. by the null hypothesis H0, we can derive a global evidence score for the hypothesis that Q was generated from ref via the segmentation S (for simplicity of notation we use the symbol Href also to denote the global hypothesis):

GES(Q|Href,S) = log [ P(Q|Href,S) / P(Q|H0) ] = log [ P(R0|H0) ∏_{i=1}^k P(Ri|Href) / ∏_{i=0}^k P(Ri|H0) ] = Σ_{i=1}^k LES(Ri|Href)

Namely, the global evidence induced by S is the accumulated sum of the local evidence provided by the individual segments of S. The statistical significance of such accumulated evidence is expressed by: P( GES(Q|Href,S) > l | H0 ) = P( Σ_{i=1}^k LES(Ri|Href) > l | H0 ) < 2^(-l).

Consequently, we can accumulate the local evidence of non-overlapping regions within Q which have similar regions in ref, to obtain global evidence for the hypothesis that Q was generated from ref. Thus, for example, if we found 5 regions within Q with similar copies in ref, each with probability less than 10% of having been generated at random, then the probability that Q was generated at random is less than (10%)^5 = 0.001% (and this despite the unfavorable assumption that the rest of Q was generated at random).

So far the segmentation S was assumed to be given, and we estimated GES(Q|Href,S).
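As a numeric sanity check of the accumulation rule above (our own illustrative sketch, with names of our choosing), the five-region example works out as follows:

```python
import math

def global_evidence_score(les_scores):
    # GES(Q|Href,S): accumulated sum of the local evidence scores of the segments
    return sum(les_scores)

def null_probability_bound(ges):
    # P( GES > l | H0 ) < 2^(-l): bound on the chance that H0 produced this score
    return 2.0 ** (-ges)

# Five regions, each with < 10% probability under H0, i.e. LES > log2(10) bits each:
les = [math.log2(10)] * 5
print(null_probability_bound(global_evidence_score(les)))  # approx. 1e-05, i.e. 0.001%
```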
In order to obtain the global evidence score of Q, we marginalize over all possible segmentations S of Q:

GES(Q|Href) = log [ P(Q|Href) / P(Q|H0) ] = log Σ_S P(S|Href) P(Q|Href,S) / P(Q|H0)    (3)

Namely, the likelihood P(S|Href) of a segmentation S can be interpreted as a weight for the likelihood-ratio score of Q induced by S. Thus, we would like P(S|Href) to reflect the complexity of the segmentation S (e.g., its description length).

From a practical point of view, in most cases it would be intractable to compute GES(Q|Href), as Eq. (3) involves a summation over all possible segmentations of the query Q. However, we can derive upper and lower bounds on GES(Q|Href) which are easy to compute:

Claim 1 (upper and lower bounds on GES):

max_S { log P(S|Href) + Σ_{Ri ∈ S} LES(Ri|Href) } ≤ GES(Q|Href) ≤ Σ_{q ∈ Q} PES(q|Href)    (4)

Proof: see the appendix at www.wisdom.weizmann.ac.il/~vision/Composition.html.

Practically, this claim implies that we do not need to scan all possible segmentations. The lower bound (left-hand side of Eq. (4)) is achieved by the segmentation of Q with the best accumulated evidence score, Σ_{Ri ∈ S} LES(Ri|Href) = GES(Q|Href,S), penalized by the length of the segmentation description, log P(S|Href) = −length(S). Obviously, every segmentation provides such a lower (albeit less tight) bound on the total evidence score. Thus, if we find large enough contiguous regions in Q with supporting regions in ref (i.e., high enough local evidence scores), and define R0 to be the remainder of Q, then S = {R0, R1, ..., Rk} can provide a reasonable lower bound on GES(Q|Href). As for the upper bound on GES(Q|Href), it can be computed by summing the maximal point-wise evidence scores PES(q|Href) (see Eq. (2)) over all the points in Q (right-hand side of Eq. (4)).
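The two bounds of Eq. (4) are then straightforward to evaluate once the local scores are available; a minimal sketch (function names are ours, and the inputs are assumed to be precomputed):

```python
def ges_lower_bound(segment_les, log_prior_S):
    """Left-hand side of Eq. (4) for one segmentation S: the accumulated local
    evidence, penalized by the description length log P(S|Href) = -length(S)."""
    return log_prior_S + sum(segment_les)

def ges_upper_bound(pes_per_point):
    """Right-hand side of Eq. (4): sum of the maximal point-wise evidence
    scores PES(q|Href) over all points q in Q."""
    return sum(pes_per_point)
```

Any segmentation yields a valid (if loose) lower bound, so the maximization over S can be replaced in practice by any reasonably good segmentation.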
Note that the upper bound is computed by finding the maximal evidence regions that pass through every point in the query, regardless of the region complexity length(R). Both bounds can be estimated quite efficiently (see Sec. 3).

3 Algorithmic Framework

The local and global evidence scores presented in Sec. 2 provide new local and global similarity measures for signal data, which can be used for various learning and inference problems (see Sec. 4). In this section we briefly describe the algorithmic framework used for computing PES, LES, and GES to obtain the local and global compositional similarity measures.

Assume we are given a large region R ⊂ Q and would like to estimate its evidence score LES(R|Href). We would like to find regions similar to R in ref that would provide large local evidence for R. However, (i) we cannot expect R to appear as-is, and would therefore like to allow for global and local deformations of R, and (ii) we would like to perform this search efficiently. Both requirements can be achieved by breaking R into many small (partially overlapping) data patches, each with its own patch descriptor. This information is maintained via a geometric "ensemble" of local patch descriptors. The search for a similar ensemble in ref is done using efficient inference on a star graphical model, while allowing for a small local displacement of each local patch [5]. For example, in images these would be small spatial patches around each pixel contained in the larger image region R, and the displacements would be small shifts in x and y. In video data the region R would be a space-time volumetric region, and it would be broken into many small overlapping space-time volumetric patches. The local displacements would be in x, y, and t (time). In audio these patches would be short time-frame windows, etc.
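For concreteness, breaking a 2-D region into a geometric ensemble of small overlapping patches can be sketched as follows (an illustration under our own simplifying assumptions; the descriptors and the star-model inference of [5] are omitted):

```python
def extract_patch_ensemble(region, patch_size, stride=1):
    """Break a 2-D region (a list of equal-length rows) into small, partially
    overlapping patches, keeping each patch together with its position, so the
    ensemble retains the relative geometric arrangement of the patches."""
    h, w = len(region), len(region[0])
    ensemble = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patch = [row[x:x + patch_size] for row in region[y:y + patch_size]]
            ensemble.append(((y, x), patch))
    return ensemble
```

In video the same idea applies with an extra time axis (3-D patches); in audio, with short time-frame windows.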
In general, for any n-dimensional signal representation, the region R would be a large n-dimensional region within the signal, and the patches would be small n-dimensional overlapping regions within R. The local patch descriptors are signal- and application-dependent, but can be very simple (for example, in images we used a SIFT-like [9] patch descriptor computed in each image patch; see more details in Sec. 4). It is the simultaneous matching of all these simple local patch descriptors, together with their relative positions, that provides the strong overall evidence score for the entire region R. The likelihood of R, given a global transformation T (e.g., a location in ref) and local patch displacements Δli for each patch i in R (i = 1, 2, ..., |R|), is captured by the following expression:

P(R | T, {Δli}, Href) = (1/Z) ∏_i e^(−|Δdi|² / (2σ1²)) e^(−|Δli|² / (2σ2²))

where {Δdi} are the descriptor distortions of the individual patches, and Z is a normalization factor. To estimate P(R|T,Href) we marginalize over all possible local displacements {Δli} within a predefined limited radius. In order to compute LES(R|Href) in Eq. (1), we need to marginalize over all possible global transformations T. In our current implementation we used only global shifts, and assumed a uniform distribution over all shifts, i.e., P(T|Href) = 1/|ref|. However, the algorithm can accommodate more complex global transformations. To compute P(R|Href), we used our inference algorithm described in [5], modified to compute likelihood (sum-product) instead of MAP (max-product). In a nutshell, the algorithm picks a few patches in R (e.g., 2-3) and exhaustively searches ref for those patches. These patches restrict the possible locations of R in ref, i.e., the possible candidate transformations T for estimating P(R|T,Href). The search for each new patch is restricted to locations induced by the current list of candidate transformations T. Each new patch further reduces this list of candidate positions of R in ref. This computation of P(R|Href) is efficient: O(|ref|) + O(|R|) ≈ O(|ref|), i.e., approximately linear in the size of ref.

Figure 2: Detection of defects in grapefruit images. Using the single image (a) as a "reference" of good-quality grapefruits, we can detect defects (irregularities) in an image (b) of different grapefruits in different arrangements. Detected defects are highlighted in red (c).

Figure 3: Detecting defects in fabric images (no prior examples). The left sides of (a) and (b) show fabrics with defects; the right sides show the detected defects in red (points with small intra-image evidence LES). Irregularities are measured relative to other parts of each image.

In practice, we are not given a specific region R ⊂ Q in advance. For each point q ∈ Q we want to estimate its maximal region R[q] and its corresponding evidence score LES(R[q]|Href) (Sec. 2.1). In order to perform this step efficiently, we start with a small region surrounding q, break it into patches, and search for that region alone in ref (using the same efficient search method described above). Locations in ref where good initial matches are found are treated as candidates, and are gradually "grown" to their maximal possible matching regions (allowing for local distortions in patch position and descriptor, as before). The evidence score LES of each such maximally grown region is computed. Using all these maximally grown regions we approximate PES(q|Href) and R[q] (for all q ∈ Q). In practice, a region found maximal for one point is likely to be the maximal region for many other points in Q.
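The per-patch Gaussian penalties in the likelihood P(R|T,{Δli},Href) above can be sketched as follows (our own illustrative code; σ1 and σ2 weight the descriptor distortions and the patch displacements, and the normalization constant is left as an input):

```python
def ensemble_log_likelihood(desc_distortions, displacements, sigma1, sigma2, log_Z=0.0):
    """log P(R | T, {dl_i}, Href): each patch i contributes a Gaussian penalty
    on its descriptor distortion |dd_i| and on its local displacement |dl_i|."""
    ll = -log_Z
    for dd, dl in zip(desc_distortions, displacements):
        ll += -(dd ** 2) / (2 * sigma1 ** 2) - (dl ** 2) / (2 * sigma2 ** 2)
    return ll
```

A perfectly matching, undisplaced ensemble incurs no penalty; marginalizing over {Δli} and over T, as described in the text, then gives P(R|Href).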
Thus the number of different maximal regions in Q will tend to be significantly smaller than the number of points in Q.

Having computed PES(q|Href) ∀q ∈ Q, it is straightforward to obtain an upper bound on GES(Q|Href) (right-hand side of Eq. (4)). In principle, in order to obtain a lower bound on GES(Q|Href) we need to perform an optimization over all possible segmentations S of Q. However, any good segmentation can be used to provide a reasonable (although less tight) lower bound. Having extracted a list of maximal regions R1, ..., Rk, we can use them to induce a reasonable (although not optimal) segmentation using the following heuristic: we choose the first segment to be the maximal region with the largest evidence score, R̃1 = argmax_{Ri} LES(Ri|Href). The second segment is chosen to be the largest of all the remaining regions after their overlap with R̃1 has been removed, and so on. This process yields a segmentation of Q: S = {R̃1, ..., R̃l} (l ≤ k). Re-evaluating the evidence scores LES(R̃i|Href) of these regions, we obtain a reasonable lower bound on GES(Q|Href) using the left-hand side of Eq. (4). Evaluating the lower bound also requires estimating log P(S|Href) = −length(S|Href); this is done by summing the description lengths of the boundaries of the individual regions within S. For more details see the appendix at www.wisdom.weizmann.ac.il/~vision/Composition.html.

4 Applications and Results

The global similarity measure GES(Q|Href) can be applied between individual signals and/or between groups of signals (by setting Q and ref accordingly). As such, it can be employed in machine learning tasks like retrieval, classification, recognition, and clustering. The local similarity measure LES(R|Href) can be used for local inference problems, such as local classification, saliency, segmentation, etc.
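The greedy selection of segments described above can be sketched as follows (a simplified reading on our part: candidates are ranked by their original scores, whereas the text re-evaluates the evidence of the trimmed regions):

```python
def greedy_segmentation(regions):
    """Greedily pick the candidate region with the largest evidence score, remove
    its points from all remaining candidates, and repeat until none are left.

    regions: list of (points, les_score) pairs, where points is a set of
    point identifiers and les_score its local evidence score."""
    segments = []
    remaining = [(set(points), les) for points, les in regions]
    while remaining:
        remaining.sort(key=lambda r: r[1], reverse=True)
        points, _ = remaining.pop(0)
        if points:  # skip candidates fully swallowed by earlier segments
            segments.append(points)
            remaining = [(p - points, l) for p, l in remaining]
    return segments
```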
For example, the local similarity measure can also be applied between different portions of the same signal (e.g., by setting Q to be one part of the signal and ref to be the rest of it). Such intra-signal evidence can be used for inference tasks like segmentation, while the absence of intra-signal evidence (local dissimilarity) can be used for detecting saliency/irregularities. In this section we demonstrate the applicability of our measures to several of these problems, and apply them to three different types of signals: audio, images, and video. For additional results, as well as video sequences, see www.wisdom.weizmann.ac.il/~vision/Composition.html

Figure 4: Image Saliency and Segmentation. (a) Input image. (b) Detected salient points, i.e., points with low intra-image evidence scores LES (when measured relative to the rest of the image). (c) Image segmentation: results of clustering all the non-salient points into 4 clusters using normalized cuts. Each maximal region R[q] provides high evidence (translated into high affinity scores) that all the points within it should be grouped together (see text for more details).

1. Detection of Saliency/Irregularities (in Images): Using our statistical framework, we define a point q ∈ Q to be irregular if its best local evidence score LES(R[q]|Href) is below some threshold. Irregularities can be inferred either relative to a database of examples, or relative to the signal itself. In Fig. 2 we show an example of applying this approach to detecting defects in fruit. Using a single image as a "reference" of good-quality grapefruits (Fig. 2.a, used as ref), we can detect defects (irregularities) in an image of different grapefruits in different arrangements (Fig. 2.b, used as the query Q). The algorithm tried to compose Q from as large as possible pieces of ref.
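The irregularity criterion above reduces to a simple threshold test; a minimal sketch (names are ours):

```python
def detect_irregular_points(point_scores, threshold):
    """Flag the points q whose best local evidence score LES(R[q]|Href) falls
    below the threshold, i.e. points not covered by any large supported region."""
    return {q for q, les in point_scores.items() if les < threshold}
```

The same test serves both inter-signal irregularity detection (scores computed against ref) and intra-signal saliency (scores computed against the rest of Q).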
Points in Q with low LES (i.e., points whose maximal regions were small) were determined to be irregular. These are highlighted in red in Fig. 2.c, and correspond to defects in the fruit.

Alternatively, local saliency within a query signal Q can also be measured relative to other portions of Q, e.g., by trying to compose each region in Q using pieces from the rest of Q. For each point q ∈ Q we compute its intra-signal evidence score LES(R[q]) relative to the other (non-neighboring) parts of the image. Points with low intra-signal evidence are detected as salient. Examples of using intra-signal saliency to detect defects in fabric can be found in Fig. 3. Another example of the same algorithm, applied to a completely different scenario (a ballet scene), can be found in Fig. 4.b. We used a SIFT-like [9] patch descriptor, computed densely for all local patches in the image. Points with low gradients (e.g., the floor) were excluded from the inference.

2. Signal Segmentation (Images): For each point q ∈ Q we compute its maximal evidence region R[q]. This can be done either relative to a different reference signal, or relative to Q itself (as in the case of saliency). Every maximal region provides evidence that all the points within it should be clustered/segmented together. Therefore, the value LES(R[q]|Href) is added to all entries (i, j) of an affinity matrix, ∀qi, qj ∈ R[q]. Spectral clustering can then be applied to the affinity matrix. Thus, large regions which also appear in ref (in the case of a single image, other regions in Q) are likely to be clustered together in Q. This way we foster the generation of segments based on high evidential co-occurrence in the examples, rather than on low-level similarity as in [10]. An example of using this algorithm for image segmentation is shown in Fig. 4.c.
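The evidence-based affinity accumulation described above can be sketched as follows (our illustration; the resulting affinities would then be fed to a spectral clustering routine):

```python
from collections import defaultdict
from itertools import combinations

def build_affinity(maximal_regions):
    """Each maximal region R[q] adds its evidence score LES(R[q]|Href) to the
    affinity entries (i, j) of every pair of points qi, qj it contains.

    maximal_regions: list of (points, les_score) pairs."""
    affinity = defaultdict(float)
    for points, les in maximal_regions:
        for qi, qj in combinations(sorted(points), 2):
            affinity[(qi, qj)] += les
            affinity[(qj, qi)] += les  # keep the affinity symmetric
    return affinity
```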
Note that we have not explicitly used low-level similarity between neighboring points, as is customary in most image segmentation algorithms; such additional information would further improve the segmentation results.

3. Signal Classification (Video – Action Classification): We used the action video database of [4], which contains different types of actions ("run", "walk", "jumping-jack", "jump-forward-on-two-legs", "jump-in-place-on-two-legs", "gallop-sideways", "wave-hand(s)", "bend") performed by nine different people (81 video sequences altogether). We used a leave-one-out procedure for action classification. The number of correct classifications was 79/81 = 97.5%. These sequences contain a single person in the field of view (e.g., see Fig. 5.a). Our method, however, can handle much more complex scenarios. To illustrate its capabilities, we added a few more sequences (e.g., see Fig. 5.b and 5.c), where several people appear simultaneously in the field of view, with partial occlusions, some differences in scale, and more complex backgrounds. These complex sequences were all correctly classified (increasing the classification rate to 98%).

Figure 5: Action Classification in Video. (a) A sample 'walk' sequence from the action database of [4]. (b),(c) Other, more complex sequences with several walking people in the field of view. Despite partial occlusions, differences in scale, and complex backgrounds, these sequences were all classified correctly as 'walk' sequences. For video sequences see www.wisdom.weizmann.ac.il/~vision/Composition.html

In our implementation, 3D space-time video regions were broken into small spatio-temporal video patches (7 × 7 × 4).
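A sketch of such a normalized temporal-derivative patch descriptor, reduced for illustration to two consecutive frames of a patch (the actual patches here are 7 × 7 × 4, with three temporal derivative planes):

```python
import math

def temporal_descriptor(frame0, frame1):
    """Absolute temporal derivatives (frame differences) at every pixel of the
    patch, normalized to unit length; static patches yield an all-zero vector."""
    d = [abs(b - a) for row0, row1 in zip(frame0, frame1)
                    for a, b in zip(row0, row1)]
    norm = math.sqrt(sum(v * v for v in d))
    return [v / norm for v in d] if norm > 0 else d
```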
The descriptor for each patch was a vector containing the absolute values of the temporal derivatives at all pixels of the patch, normalized to unit length. Since stationary backgrounds have zero temporal derivatives, our method is not sensitive to the background, nor does it require foreground/background separation.

Image patches and fragments have been employed in the task of class-based object recognition (e.g., [7, 2, 6]), where a sparse set of informative fragments is learned for a large class of objects (the training set). These approaches are useful for recognition, but are not applicable to non-class-based inference problems (such as similarity between pairs of signals with no prior data, clustering, etc.).

4. Signal Retrieval (Audio – Speaker Recognition): We used a database of 31 speakers (male and female). All the speakers repeated three times a five-word sentence (2-3 seconds long) in a foreign language, recorded over a phone line. Different repetitions by the same person varied slightly from one another. Altogether the database contained 93 samples of the sentence. Such short speech signals are likely to pose a problem for learning-based (e.g., HMM, GMM) recognition systems. We applied our global measure GES to retrieve the closest database elements. The highest GES recognized the right speaker in 90 out of 93 cases (i.e., 97% correct recognition). Moreover, the second-best GES was correct in 82 out of 93 cases (88%). We used standard mel-frequency cepstrum frame descriptors for time-frames of 25 msec, with 50% overlap.

Acknowledgments
Thanks to Y. Caspi, A. Rav-Acha, B. Nadler and R. Basri for their helpful remarks. This work was supported by the Israeli Science Foundation (Grant 281/06) and by the Alberto Moscona Fund. The research was conducted at the Moross Laboratory for Vision & Motor Control at the Weizmann Inst.

References

[1] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman.
Basic local alignment search tool. J. Mol. Biol., 215:403–410, 1990.
[2] E. Bart and S. Ullman. Class-based matching of object parts. In VideoRegister04, page 173, 2004.
[3] A. Birnbaum. On the foundations of statistical inference. J. Amer. Statist. Assoc., 1962.
[4] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV05.
[5] O. Boiman and M. Irani. Detecting irregularities in images and in video. In ICCV05, pages I: 462–469.
[6] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. IJCV, 61, 2005.
[7] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR03.
[8] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR06.
[9] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[10] J. Shi and J. Malik. Normalized cuts and image segmentation. PAMI, 22(8):888–905, August 2000.
[11] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering objects and their localization in images. In ICCV05, pages I: 370–377.
[12] P. Viola and W. Wells, III. Alignment by maximization of mutual information. In ICCV95, pages 16–23.
", "award": [], "sourceid": 3127, "authors": [{"given_name": "Oren", "family_name": "Boiman", "institution": null}, {"given_name": "Michal", "family_name": "Irani", "institution": null}]}