{"title": "Self-Tuning Spectral Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1601, "page_last": 1608, "abstract": null, "full_text": "Self-Tuning Spectral Clustering\n\nLihi Zelnik-Manor\n\nDepartment of Electrical Engineering\n\nCalifornia Institute of Technology\n\nPasadena, CA 91125, USA\n\nlihi@vision.caltech.edu\n\nPietro Perona\n\nDepartment of Electrical Engineering\n\nCalifornia Institute of Technology\n\nPasadena, CA 91125, USA\n\nperona@vision.caltech.edu\n\nhttp://www.vision.caltech.edu/lihi/Demos/SelfTuningClustering.html\n\nAbstract\n\nWe study a number of open issues in spectral clustering: (i) Selecting the\nappropriate scale of analysis, (ii) Handling multi-scale data, (iii) Cluster-\ning with irregular background clutter, and, (iv) Finding automatically the\nnumber of groups. We \ufb01rst propose that a \u2018local\u2019 scale should be used to\ncompute the af\ufb01nity between each pair of points. This local scaling leads\nto better clustering especially when the data includes multiple scales and\nwhen the clusters are placed within a cluttered background. We further\nsuggest exploiting the structure of the eigenvectors to infer automatically\nthe number of groups. This leads to a new algorithm in which the \ufb01nal\nrandomly initialized k-means stage is eliminated.\n\n1 Introduction\nClustering is one of the building blocks of modern data analysis. Two commonly used\nmethods are K-means and learning a mixture-model using EM. These methods, which are\nbased on estimating explicit models of the data, provide high quality results when the data\nis organized according to the assumed models. However, when it is arranged in more com-\nplex and unknown shapes, these methods tend to fail. An alternative clustering approach,\nwhich was shown to handle such structured data is spectral clustering. 
It does not require estimating an explicit model of the data distribution; rather, it performs a spectral analysis of the matrix of point-to-point similarities. A first set of papers suggested the method based on a set of heuristics (e.g., [8, 9]). A second generation provided a level of theoretical analysis and suggested improved algorithms (e.g., [6, 10, 5, 4, 3]).

There are still open issues: (i) selection of the appropriate scale in which the data is to be analyzed, (ii) clustering data that is distributed according to different scales, (iii) clustering with irregular background clutter, and (iv) estimating automatically the number of groups. We show here that it is possible to address these issues and propose ideas to tune the parameters automatically according to the data.

1.1 Notation and the Ng-Jordan-Weiss (NJW) Algorithm

The analysis and approaches suggested in this paper build on observations presented in [5]. For completeness of the text we first briefly review their algorithm. Given a set of n points S = {s_1, . . . , s_n} in R^l, cluster them into C clusters as follows:

1. Form the affinity matrix A ∈ R^{n×n} defined by A_{ij} = \exp(-d^2(s_i, s_j)/\sigma^2) for i ≠ j and A_{ii} = 0, where d(s_i, s_j) is some distance function, often just the Euclidean distance between the vectors s_i and s_j. σ is a scale parameter which is further discussed in Section 2.

2. Define D to be a diagonal matrix with D_{ii} = \sum_{j=1}^{n} A_{ij} and construct the normalized affinity matrix L = D^{-1/2} A D^{-1/2}.

3. Manually select a desired number of groups C.

4. Find x_1, . . . , x_C, the C largest eigenvectors of L, and form the matrix X = [x_1, . . . , x_C] ∈ R^{n×C}.

5. Re-normalize the rows of X to have unit length, yielding Y ∈ R^{n×C} such that Y_{ij} = X_{ij} / (\sum_j X_{ij}^2)^{1/2}.

6. Treat each row of Y as a point in R^C and cluster via k-means.

7. Assign the original point s_i to cluster c if and only if the corresponding row i of the matrix Y was assigned to cluster c.

[Figure: seven clustering panels, labeled σ = 0.041235, σ = 0.054409, σ = 0.035897, σ = 0.03125, σ = 0.015625, σ = 0.35355, σ = 1]

Figure 1: Spectral clustering without local scaling (using the NJW algorithm). Top row: when the data incorporates multiple scales standard spectral clustering fails. Note that the optimal σ for each example (displayed on each figure) turned out to be different. Bottom row: clustering results for the top-left point-set with different values of σ. This highlights the high impact σ has on the clustering quality. In all the examples, the number of groups was set manually. The data points were normalized to occupy the [-1, 1]^2 space.

In Section 2 we analyze the effect of σ on the clustering and suggest a method for setting it automatically. We show that this allows handling multi-scale data and background clutter. In Section 3 we suggest a scheme for finding automatically the number of groups C. Our new spectral clustering algorithm is summarized in Section 4. We conclude with a discussion in Section 5.

2 Local Scaling

As was suggested by [6], the scaling parameter is some measure of when two points are considered similar. This provides an intuitive way for selecting possible values for σ. The selection of σ is commonly done manually. Ng et al. [5] suggested selecting σ automatically by running their clustering algorithm repeatedly for a number of values of σ and selecting the one which provides the least distorted clusters of the rows of Y. This increases significantly the computation time. Additionally, the range of values to be tested still has to be set manually.
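Steps 1-7 of the NJW algorithm map directly onto a few lines of NumPy/SciPy. The sketch below is our own illustration, not the authors' code; the function name njw_cluster and the k-means settings are our choices:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.cluster.vq import kmeans2

def njw_cluster(S, C, sigma):
    # Step 1: affinity matrix with a single global scale sigma.
    d2 = cdist(S, S, 'sqeuclidean')
    A = np.exp(-d2 / sigma**2)
    np.fill_diagonal(A, 0.0)                    # A_ii = 0
    # Step 2: normalized affinity L = D^{-1/2} A D^{-1/2}.
    Dinv = 1.0 / np.sqrt(A.sum(axis=1))
    L = Dinv[:, None] * A * Dinv[None, :]
    # Step 4: the C largest eigenvectors (eigh returns ascending order).
    _, V = np.linalg.eigh(L)
    X = V[:, -C:]
    # Step 5: re-normalize the rows of X to unit length.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Steps 6-7: k-means on the rows of Y; the label of row i is the label of s_i.
    _, labels = kmeans2(Y, C, minit='++', seed=0)
    return labels
```

With well-separated groups and a sensible sigma, the labels recover the grouping; Section 2 below replaces the global sigma with per-point local scales.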
Moreover, when the input data includes clusters with different local statistics there may not be a single value of σ that works well for all the data. Figure 1 illustrates the high impact σ has on clustering. When the data contains multiple scales, even using the optimal σ fails to provide good clustering (see examples at the right of the top row).

Figure 2: The effect of local scaling. (a) Input data points. A tight cluster resides within a background cluster. (b) The affinity between each point and its surrounding neighbors is indicated by the thickness of the line connecting them. The affinities across clusters are larger than the affinities within the background cluster. (c) The corresponding visualization of affinities after local scaling. The affinities across clusters are now significantly lower than the affinities within any single cluster.

Introducing Local Scaling: Instead of selecting a single scaling parameter σ we propose to calculate a local scaling parameter σ_i for each data point s_i. The distance from s_i to s_j as 'seen' by s_i is d(s_i, s_j)/\sigma_i, while the converse is d(s_j, s_i)/\sigma_j. Therefore the square distance d^2 of the earlier papers may be generalized as d(s_i, s_j) d(s_j, s_i)/(\sigma_i \sigma_j) = d^2(s_i, s_j)/(\sigma_i \sigma_j). The affinity between a pair of points can thus be written as:

\hat{A}_{ij} = \exp\left( -\frac{d^2(s_i, s_j)}{\sigma_i \sigma_j} \right) \quad (1)

Using a specific scaling parameter for each point allows self-tuning of the point-to-point distances according to the local statistics of the neighborhoods surrounding points i and j. The selection of the local scale σ_i can be done by studying the local statistics of the neighborhood of point s_i. A simple choice, which is used for the experiments in this paper, is:

\sigma_i = d(s_i, s_K) \quad (2)

where s_K is the K'th neighbor of point s_i.
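Eqs. (1) and (2) amount to only a few lines; here is a minimal NumPy sketch (the helper name local_scale_affinity is ours, not from the paper):

```python
import numpy as np
from scipy.spatial.distance import cdist

def local_scale_affinity(S, K=7):
    d = cdist(S, S)                          # pairwise distances d(s_i, s_j)
    # Eq. (2): sigma_i is the distance to the K'th neighbor
    # (row-sorted; column 0 is the point itself at distance 0).
    sigma = np.sort(d, axis=1)[:, K]
    # Eq. (1): A_ij = exp(-d^2(s_i, s_j) / (sigma_i * sigma_j)).
    A = np.exp(-d**2 / np.outer(sigma, sigma))
    np.fill_diagonal(A, 0.0)                 # A_ii = 0
    return A
```

The default K = 7 follows the value used for all experiments in the paper.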
The selection of K is independent of scale and is a function of the data dimension of the embedding space. Nevertheless, in all our experiments (both on synthetic data and on images) we used a single value of K = 7, which gave good results even for high-dimensional data (the experiments with high-dimensional data were left out due to lack of space). (Recall Eq. (2): \sigma_i = d(s_i, s_K).)

Figure 2 provides a visualization of the effect of the suggested local scaling. Since the data resides in multiple scales (one cluster is tight and the other is sparse) the standard approach to estimating affinities fails to capture the data structure (see Figure 2.b). Local scaling automatically finds the two scales and results in high affinities within clusters and low affinities across clusters (see Figure 2.c). This is the information required for separation.

We tested the power of local scaling by clustering the data set of Figure 1, plus four additional examples. We modified the Ng-Jordan-Weiss algorithm reviewed in Section 1.1, substituting the locally scaled affinity matrix \hat{A} (of Eq. (1)) for A. Results are shown in Figure 3. In spite of the multiple scales and the various types of structure, the groups now match the intuitive solution.

3 Estimating the Number of Clusters

Having defined a scheme to set the scale parameter automatically we are left with one more free parameter: the number of clusters. This parameter is usually set manually and

Figure 3: Our clustering results, using the algorithm summarized in Section 4. The number of groups was found automatically.

[Figure: four plots of the first 10 eigenvalues; y-axis 0.95-1, x-axis 2-10]

Figure 4: Eigenvalues.
The first 10 eigenvalues of L corresponding to the top row data sets of Figure 3.

not much research has been done as to how one might set it automatically. In this section we suggest an approach to discovering the number of clusters. The suggested scheme turns out to lead to a new spectral clustering algorithm.

3.1 The Intuitive Solution: Analyzing the Eigenvalues

One possible approach to discovering the number of groups is to analyze the eigenvalues of the affinity matrix. The analysis given in [5] shows that the first (highest magnitude) eigenvalue of L (see Section 1.1) will be a repeated eigenvalue of magnitude 1 with multiplicity equal to the number of groups C. This implies one could estimate C by counting the number of eigenvalues equaling 1.

Examining the eigenvalues of our locally scaled matrix, corresponding to clean data sets, indeed shows that the multiplicity of eigenvalue 1 equals the number of groups. However, if the groups are not clearly separated, once noise is introduced the values start to deviate from 1, and the criterion of choice becomes tricky. An alternative approach would be to search for a drop in the magnitude of the eigenvalues (this was pursued to some extent by Polito and Perona in [7]). This approach, however, lacks a theoretical justification. The eigenvalues of L are the union of the eigenvalues of the sub-matrices corresponding to each cluster. This implies the eigenvalues depend on the structure of the individual clusters, and thus no assumptions can be placed on their values. In particular, the gap between the C'th eigenvalue and the next one can be either small or large. Figure 4 shows the first 10 eigenvalues corresponding to the top row examples of Figure 3.
It highlights the different patterns of distribution of eigenvalues for different data sets.

3.2 A Better Approach: Analyzing the Eigenvectors

We thus suggest an alternative approach which relies on the structure of the eigenvectors. After sorting L according to clusters, in the "ideal" case (i.e., when L is strictly block diagonal with blocks L^{(c)}, c = 1, . . . , C), its eigenvalues and eigenvectors are the union of the eigenvalues and eigenvectors of its blocks padded appropriately with zeros (see [6, 5]). As long as the eigenvalues of the blocks are different, each eigenvector will have non-zero values only in entries corresponding to a single block/cluster:

\hat{X} = \begin{bmatrix} x^{(1)} & \vec{0} & \vec{0} \\ \vec{0} & \ddots & \vec{0} \\ \vec{0} & \vec{0} & x^{(C)} \end{bmatrix}_{n \times C}

where x^{(c)} is an eigenvector of the sub-matrix L^{(c)} corresponding to cluster c. However, as was shown above, the eigenvalue 1 is bound to be a repeated eigenvalue with multiplicity equal to the number of groups C. Thus, the eigensolver could just as easily have picked any other set of orthogonal vectors spanning the same subspace as \hat{X}'s columns. That is, \hat{X} could have been replaced by X = \hat{X} R for any orthogonal matrix R ∈ R^{C×C}.

This, however, implies that even if the eigensolver provided us the rotated set of vectors, we are still guaranteed that there exists a rotation \hat{R} such that each row in the matrix X \hat{R} has a single non-zero entry. Since the eigenvectors of L are the union of the eigenvectors of its individual blocks (padded with zeros), taking more than the first C eigenvectors will result in more than one non-zero entry in some of the rows. Taking fewer eigenvectors, we do not have a full basis spanning the subspace; thus, depending on the initial X, there might or might not exist such a rotation.
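This block-structure argument is easy to check numerically. A small sketch (two ideal, fully disconnected blocks; the toy matrices are our own):

```python
import numpy as np

# Ideal case: L is strictly block diagonal with two blocks,
# each having leading eigenvalue 1.
B1 = np.full((3, 3), 1 / 3.0)
B2 = np.full((4, 4), 1 / 4.0)
L = np.block([[B1, np.zeros((3, 4))],
              [np.zeros((4, 3)), B2]])

w, V = np.linalg.eigh(L)        # eigenvalues in ascending order
X = V[:, -2:]                   # eigenvectors of the repeated eigenvalue 1

# The eigensolver returns SOME orthogonal basis of the 2-D eigenspace,
# i.e., X = X_hat R for an unknown rotation R. Rows belonging to the
# same cluster are therefore identical, even though individual rows
# need not have a single non-zero entry.
assert np.allclose(w[-2:], 1.0)
assert np.allclose(X[0], X[1]) and np.allclose(X[0], X[2])
assert np.allclose(X[3], X[4]) and np.allclose(X[3], X[6])
```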
Note that these observations are independent of the difference in magnitude between the eigenvalues.

We use these observations to predict the number of groups. For each possible group number C we recover the rotation which best aligns X's columns with the canonical coordinate system. Let Z ∈ R^{n×C} be the matrix obtained after rotating the eigenvector matrix X, i.e., Z = XR, and denote M_i = \max_j Z_{ij}. We wish to recover the rotation R for which in every row in Z there will be at most one non-zero entry. We thus define a cost function:

J = \sum_{i=1}^{n} \sum_{j=1}^{C} \frac{Z_{ij}^2}{M_i^2} \quad (3)

Minimizing this cost function over all possible rotations will provide the best alignment with the canonical coordinate system. This is done using the gradient descent scheme described in Appendix A. The number of groups is taken as the one providing the minimal cost (if several group numbers yield practically the same minimal cost, the largest of those is selected).

The search over the group number can be performed incrementally, saving computation time. We start by aligning the top two eigenvectors (as well as possible). Then, at each step of the search (up to the maximal group number), we add a single eigenvector to the already rotated ones. This can be viewed as taking the alignment result of the previous group number as an initialization to the current one. The alignment of this new set of eigenvectors is extremely fast (typically a few iterations) since the initialization is good. The overall run time of this incremental procedure is just slightly longer than aligning all the eigenvectors in a non-incremental way.

Using this scheme to estimate the number of groups on the data set of Figure 3 provided a correct result for all but one (for the right-most dataset at the bottom row we predicted 2 clusters instead of 3).
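Evaluating the cost of Eq. (3) for a candidate Z is a one-liner. In the sketch below (the helper name alignment_cost is ours) we take the row maximum over |Z_ij| to be robust to sign flips, a slight deviation from the text's M_i = max_j Z_ij:

```python
import numpy as np

def alignment_cost(Z):
    # Eq. (3): J = sum_ij Z_ij^2 / M_i^2. Each row contributes at least 1,
    # so J >= n, with equality iff every row has a single non-zero entry.
    M = np.abs(Z).max(axis=1, keepdims=True)
    return np.sum((Z / M) ** 2)
```

For an n x C matrix, a perfectly aligned Z therefore gives J = n regardless of C, which is what makes costs comparable across candidate group numbers.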
Corresponding plots of the alignment quality for different group numbers are shown in Figure 5.

Yu and Shi [11] suggested rotating normalized eigenvectors to obtain an optimal segmentation. Their method iterates between non-maximum suppression (i.e., setting the maximal entry in each row of Z to 1 and all other entries to 0) and using SVD to recover the rotation which best aligns the columns of X with those of Z. In our experiments we noticed that this iterative method can easily get stuck in local minima and thus does not reliably find the optimal alignment and the group number. Another related approach is that suggested by Kannan et al. [3], who assigned points to clusters according to the maximal entry in the corresponding row of the eigenvector matrix. This works well when there are no repeated eigenvalues, as then the eigenvectors corresponding to different clusters are not intermixed.

[Figure: four alignment-cost plots; y-axis 0-0.2 and 0-0.08, x-axis 2-10]

Figure 5: Selecting Group Number. The alignment cost (of Eq. (3)) for varying group numbers corresponding to the top row data sets of Figure 3. The selected group number, marked by a red circle, corresponds to the largest group number providing minimal cost (costs up to 0.01% apart were considered the same value).

Kannan et al.
used a non-normalized affinity matrix, thus were not certain to obtain a repeated eigenvalue; however, this could easily happen and then the clustering would fail.

4 A New Algorithm

Our proposed method for estimating the number of groups automatically has two desirable by-products: (i) After aligning with the canonical coordinate system, one can use non-maximum suppression on the rows of Z, thus eliminating the final iterative k-means process, which often requires around 100 iterations and depends highly on its initialization. (ii) Since the final clustering can be conducted by non-maximum suppression, we obtain clustering results for all the inspected group numbers at a tiny additional cost. When the data is highly noisy, one can still employ k-means, or better, EM, to cluster the rows of Z. However, since the data is now aligned with the canonical coordinate scheme, we can obtain by non-maximum suppression an excellent initialization, so very few iterations suffice. We summarize our suggested algorithm:

Algorithm: Given a set of points S = {s_1, . . . , s_n} in R^l that we want to cluster:

1. Compute the local scale σ_i for each point s_i ∈ S using Eq. (2).

2. Form the locally scaled affinity matrix \hat{A} ∈ R^{n×n} where \hat{A}_{ij} is defined according to Eq. (1) for i ≠ j and \hat{A}_{ii} = 0.

3. Define D to be a diagonal matrix with D_{ii} = \sum_{j=1}^{n} \hat{A}_{ij} and construct the normalized affinity matrix L = D^{-1/2} \hat{A} D^{-1/2}.

4. Find x_1, . . . , x_C, the C largest eigenvectors of L, and form the matrix X = [x_1, . . . , x_C] ∈ R^{n×C}, where C is the largest possible group number.

5. Recover the rotation R which best aligns X's columns with the canonical coordinate system using the incremental gradient descent scheme (see also Appendix A).

6. Grade the cost of the alignment for each group number, up to C, according to Eq. (3).

7.
Set the final group number C_best to be the largest group number with minimal alignment cost.

8. Take the alignment result Z of the top C_best eigenvectors and assign the original point s_i to cluster c if and only if \max_j(Z_{ij}^2) = Z_{ic}^2.

9. If the data is highly noisy, use the previous step's result to initialize k-means, or EM, clustering on the rows of Z.

We tested the quality of this algorithm on real data. Figure 6 shows intensity-based image segmentation results. The number of groups and the corresponding segmentation were obtained automatically. In this case the same quality of results was obtained using non-scaled affinities; however, this required manual setting of both σ (different values for different images) and the number of groups, whereas our result required no parameter settings.

Figure 6: Automatic image segmentation. Fully automatic intensity-based image segmentation results using our algorithm.

More experiments and results on real data sets can be found on our web-page http://www.vision.caltech.edu/lihi/Demos/SelfTuningClustering.html

5 Discussion & Conclusions

Spectral clustering practitioners know that selecting good parameters to tune the clustering process is an art requiring skill and patience. Automating spectral clustering was the main motivation for this study. The key ideas we introduced are three: (a) using a local scale, rather than a global one, (b) estimating the scale from the data, and (c) rotating the eigenvectors to create the maximally sparse representation. We proposed an automated spectral clustering algorithm based on these ideas: it computes automatically the scale and the number of groups, and it can handle multi-scale data which are problematic for previous approaches.

Some of the choices we made in our implementation were motivated by simplicity and are perfectible.
For instance, the local scale σ might be better estimated by a method which relies on more informative local statistics. Another example: the cost function in Eq. (3) is reasonable, but by no means the only possibility (e.g., the sum of the entropies of the rows Z_i might be used instead).

Acknowledgments: Finally, we wish to thank Yair Weiss for providing us his code for spectral clustering. This research was supported by the MURI award number SA3318 and by the Center of Neuromorphic Systems Engineering award number EEC-9402726.

References

[1] G. H. Golub and C. F. Van Loan, “Matrix Computations”, Johns Hopkins University Press, 1991, Second Edition.

[2] V. K. Goyal and M. Vetterli, “Block Transform by Stochastic Gradient Descent”, IEEE Digital Signal Processing Workshop, Bryce Canyon, UT, Aug. 1998.

[3] R. Kannan, S. Vempala and A. Vetta, “On Spectral Clustering – Good, Bad and Spectral”, In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 2000.

[4] M. Meila and J. Shi, “Learning Segmentation by Random Walks”, In Advances in Neural Information Processing Systems 13, 2001.

[5] A. Ng, M. Jordan and Y. Weiss, “On spectral clustering: Analysis and an algorithm”, In Advances in Neural Information Processing Systems 14, 2001.

[6] P. Perona and W. T. Freeman, “A Factorization Approach to Grouping”, Proceedings of the 5th European Conference on Computer Vision, Volume I, pp. 655-670, 1998.

[7] M. Polito and P. Perona, “Grouping and dimensionality reduction by locally linear embedding”, Advances in Neural Information Processing Systems 14, 2002.

[8] G. L. Scott and H. C. Longuet-Higgins, “Feature grouping by 'relocalisation' of eigenvectors of the proximity matrix”, In Proc. British Machine Vision Conference, Oxford, UK, pages 103-108, 1990.

[9] J. Shi and J.
Malik, “Normalized Cuts and Image Segmentation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905, August 2000.

[10] Y. Weiss, “Segmentation Using Eigenvectors: A Unifying View”, International Conference on Computer Vision, pp. 975-982, September 1999.

[11] S. X. Yu and J. Shi, “Multiclass Spectral Clustering”, International Conference on Computer Vision, Nice, France, pp. 11-17, October 2003.

A Recovering the Aligning Rotation

To find the best alignment for a set of eigenvectors we adopt a gradient descent scheme similar to that suggested in [2]. There, Givens rotations were used to recover a rotation which diagonalizes a symmetric matrix by minimizing a cost function which measures the diagonality of the matrix. Similarly, here, we define a cost function which measures the alignment quality of a set of vectors and prove that the gradient descent, using Givens rotations, converges.

The cost function we wish to minimize is that of Eq. (3). Let m_i = j such that Z_{ij} = Z_{i m_i} = M_i. Note that the indices m_i of the maximal entries of the rows of X might be different than those of the optimal Z. A simple non-maximum suppression on the rows of X can provide a wrong result. Using the gradient descent scheme allows the cost corresponding to part of the rows to increase, as long as the overall cost is reduced, thus enabling changing the indices m_i.

Similar to [2], we wish to represent the rotation matrix R in terms of the smallest possible number of parameters. Let \tilde{G}_{i,j,\theta} denote a Givens rotation [1] of θ radians (counterclockwise) in the (i, j) coordinate plane. It is sufficient to consider Givens rotations with i < j, thus we can use a convenient index re-mapping G_{k,\theta} = \tilde{G}_{i,j,\theta}, where (i, j) is the k'th entry of a lexicographical list of (i, j) ∈ {1, 2, . . . , C}^2 pairs with i < j.
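This parameterization can be sketched numerically as follows. For brevity the illustration (function names ours) uses central finite-difference gradients of Eq. (3) rather than the closed-form derivatives derived next, so it is a sanity check of the idea, not the paper's scheme:

```python
import itertools
import numpy as np

def givens(C, i, j, theta):
    # Givens rotation of theta radians (counterclockwise) in the (i, j) plane.
    G = np.eye(C)
    G[i, i] = G[j, j] = np.cos(theta)
    G[i, j] = -np.sin(theta)
    G[j, i] = np.sin(theta)
    return G

def rotated_cost(X, angles, pairs):
    C = X.shape[1]
    R = np.eye(C)
    for (i, j), t in zip(pairs, angles):
        R = R @ givens(C, i, j, t)
    Z = X @ R
    M = np.abs(Z).max(axis=1, keepdims=True)
    return np.sum((Z / M) ** 2)                       # Eq. (3)

def align(X, steps=300, alpha=0.02, eps=1e-4):
    # Gradient descent over the K = C(C-1)/2 Givens angles Theta.
    C = X.shape[1]
    pairs = list(itertools.combinations(range(C), 2)) # lexicographical, i < j
    angles = np.zeros(len(pairs))
    for _ in range(steps):
        grad = np.empty_like(angles)
        for k in range(len(pairs)):                   # finite-difference gradient
            up, dn = angles.copy(), angles.copy()
            up[k] += eps
            dn[k] -= eps
            grad[k] = (rotated_cost(X, up, pairs)
                       - rotated_cost(X, dn, pairs)) / (2 * eps)
        angles -= alpha * grad                        # Theta <- Theta - alpha grad J
    return rotated_cost(X, angles, pairs), angles
```

Starting from rotated ideal indicator vectors, the descent drives the cost back down to its minimum n (here, the number of rows of X).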
Hence, finding the aligning rotation amounts to minimizing the cost function J over Θ ∈ [-π/2, π/2)^K. The update rule for Θ is Θ_{k+1} = Θ_k - α ∇J|_{Θ=Θ_k}, where α ∈ R^+ is the step size.

We next compute the gradient of J and bounds on α for stability. For convenience we further adopt the notation convention of [2]. Let U_{(a,b)} = G_{a,\theta_a} G_{a+1,\theta_{a+1}} \cdots G_{b,\theta_b}, where U_{(a,b)} = I if b < a, U_k = U_{(k,k)}, and V_k = \partial U_k / \partial \theta_k. Define A^{(k)}, 1 ≤ k ≤ K, element-wise by A^{(k)}_{ij} = \partial Z_{ij} / \partial \theta_k. Since Z = XR (with R = U_{(1,K)}) we obtain A^{(k)} = X U_{(1,k-1)} V_k U_{(k+1,K)}.

We can now compute ∇J element-wise:

\frac{\partial J}{\partial \theta_k} = \sum_{i=1}^{n} \sum_{j=1}^{C} \frac{\partial}{\partial \theta_k} \frac{Z_{ij}^2}{M_i^2} = 2 \sum_{i=1}^{n} \sum_{j=1}^{C} \left( \frac{Z_{ij}}{M_i^2} A^{(k)}_{ij} - \frac{Z_{ij}^2}{M_i^3} \frac{\partial M_i}{\partial \theta_k} \right)

Due to lack of space we cannot describe in full detail the complete convergence proof. We thus refer the reader to [2], where it is shown that convergence is obtained when the values 1 - \alpha F_{kl} lie in the unit circle, where F_{kl} = \frac{\partial^2 J}{\partial \theta_l \partial \theta_k}\big|_{\Theta=0}. Note that at Θ = 0 we have Z_{ij} = 0 for j ≠ m_i, Z_{i m_i} = M_i, and \frac{\partial M_i}{\partial \theta_k} = \frac{\partial Z_{i m_i}}{\partial \theta_k} = A^{(k)}_{i m_i} (i.e., near Θ = 0 the maximal entry of each row does not change its index). Deriving thus gives

\frac{\partial^2 J}{\partial \theta_l \partial \theta_k}\Big|_{\Theta=0} = 2 \sum_{i=1}^{n} \frac{1}{M_i^2} \sum_{j \neq m_i} A^{(k)}_{ij} A^{(l)}_{ij}\Big|_{\Theta=0}

Further substituting in the values for A^{(k)}_{ij}|_{\Theta=0} yields:

F_{kl} = \frac{\partial^2 J}{\partial \theta_l \partial \theta_k}\Big|_{\Theta=0} = \begin{cases} 2 \, \#\{i \text{ s.t. } m_i = i_k \text{ or } m_i = j_k\} & \text{if } k = l \\ 0 & \text{otherwise} \end{cases}

where (i_k, j_k) is the pair (i, j) corresponding to the index k in the index re-mapping discussed above. Hence, by setting α small enough we get that 1 - \alpha F_{kl} lie in the unit circle and convergence is guaranteed.", "award": [], "sourceid": 2619, "authors": [{"given_name": "Lihi", "family_name": "Zelnik-manor", "institution": null}, {"given_name": "Pietro", "family_name": "Perona", "institution": null}]}