{"title": "Convex Optimization Procedure for Clustering: Theoretical Revisit", "book": "Advances in Neural Information Processing Systems", "page_first": 1619, "page_last": 1627, "abstract": "In this paper, we present theoretical analysis of SON~--~a convex optimization procedure for clustering using a sum-of-norms (SON) regularization recently proposed in \\cite{ICML2011Hocking_419,SON, Lindsten650707, pelckmans2005convex}. In particular, we show if the samples are drawn from two cubes, each being one cluster, then SON can provably identify the cluster membership provided that the distance between the two cubes is larger than a threshold which (linearly) depends on the size of the cube and the ratio of numbers of samples in each cluster. To the best of our knowledge, this paper is the first to provide a rigorous analysis to understand why and when SON works. We believe this may provide important insights to develop novel convex optimization based algorithms for clustering.", "full_text": "Convex Optimization Procedure for Clustering: Theoretical Revisit

Changbo Zhu
Department of Electrical and Computer Engineering
Department of Mathematics
National University of Singapore
elezhuc@nus.edu.sg

Huan Xu
Department of Mechanical Engineering
National University of Singapore
mpexuh@nus.edu.sg

Chenlei Leng
Department of Statistics
University of Warwick
c.leng@warwick.ac.uk

Shuicheng Yan
Department of Electrical and Computer Engineering
National University of Singapore
eleyans@nus.edu.sg

Abstract

In this paper, we present theoretical analysis of SON – a convex optimization procedure for clustering using a sum-of-norms (SON) regularization recently proposed in [8, 10, 11, 17]. 
In particular, we show that if the samples are drawn from two cubes, each being one cluster, then SON can provably identify the cluster membership provided that the distance between the two cubes is larger than a threshold which (linearly) depends on the size of the cube and the ratio of the numbers of samples in each cluster. To the best of our knowledge, this paper is the first to provide a rigorous analysis to understand why and when SON works. We believe this may provide important insights to develop novel convex optimization based algorithms for clustering.

1 Introduction

Clustering is an important problem in unsupervised learning that deals with grouping observations (data points) appropriately based on their similarities or distances [20]. Many clustering algorithms have been proposed in the literature, including K-means, spectral clustering, Gaussian mixture models and hierarchical clustering, to handle a wide range of cluster shapes. However, much research has pointed out that these methods all suffer from instabilities [3, 20, 16, 15, 13, 19]. Taking K-means as an example, its formulation is NP-hard, and the typical way to solve it is Lloyd's method, which requires randomly initializing the clusters. However, different initializations may lead to significantly different final clustering results.

1.1 A Convex Optimization Procedure for Clustering

Recently, Lindsten et al. [10, 11], Hocking et al. [8] and Pelckmans et al. [17] proposed the following convex optimization procedure for clustering, which is termed SON by Lindsten et al. [11] (also called Clusterpath by Hocking et al. 
[8]),

\hat{X} = \arg\min_{X \in \mathbb{R}^{n \times p}} \|A - X\|_F^2 + \alpha \sum_{i<j} \|X_{i\cdot} - X_{j\cdot}\|_2,   (1)

where α > 0 is a tuning parameter. The sum-of-norms penalty encourages rows of the solution \hat{X} to coincide, and samples i and j are assigned to the same cluster whenever \hat{X}_{i·} = \hat{X}_{j·}.

The main contribution of this paper is to provide theoretical analysis of SON, in particular to derive sufficient conditions under which SON successfully recovers the clustering membership. We show that if there are two clusters, each of which is a cube, then SON succeeds provided that the distance between the cubes is larger than a threshold value that depends on the cube size and the ratio of the numbers of samples drawn in each cluster. Thus, the intuitive argument about why SON works is made rigorous and mathematically solid. To the best of our knowledge, this is the first attempt to theoretically quantify why and when SON succeeds.

Related Work: we briefly review related work on SON. Hocking et al. [8] proposed SON, arguing that it can be seen as a generalization of hierarchical clustering, and presented via numerical simulations several situations in which SON works while K-means and average-linkage hierarchical clustering fail. They also developed an R package called "clusterpath" which can be used to solve Problem (1). Independently, Lindsten et al. [10, 11] derived SON as a convex relaxation of K-means clustering. On the algorithmic side, Chi et al. [6] developed two methods to solve Problem (1), namely the Alternating Direction Method of Multipliers (ADMM) and the alternating minimization algorithm (AMA). Marchetti et al. [14] generalized SON to the high-dimensional and noisy cases. Yet, in all these works, no attempt has been made to study rigorously why and when SON succeeds.

Notation: in this paper, matrices are denoted by upper case boldface letters (e.g. A, B), sets are denoted by blackboard bold characters (e.g. R, I, C) and operators are denoted by Fraktur characters (e.g. D, M). 
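To make Problem (1) concrete, the following is a minimal numpy sketch of the ADMM approach that Chi et al. [6] use for this problem. It is my own simplified illustration, not the authors' implementation: the name son_admm, the penalty parameter rho, and the iteration count are assumptions chosen for readability.

```python
import numpy as np

def son_admm(A, alpha, rho=1.0, n_iter=2000):
    """Minimize ||A - X||_F^2 + alpha * sum_{i<j} ||X_i - X_j||_2 by ADMM.

    Splitting variables V_l stand in for the row differences X_i - X_j;
    U_l are the scaled dual variables. Sketch only, not tuned for speed.
    """
    n, p = A.shape
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    V = [A[i] - A[j] for (i, j) in pairs]
    U = [np.zeros(p) for _ in pairs]
    # The X-update solves (2I + rho*L) X = 2A + rho*C, where L is the
    # Laplacian of the complete graph on n nodes: L = n*I - 11^T.
    L = n * np.eye(n) - np.ones((n, n))
    lhs = 2 * np.eye(n) + rho * L
    X = A.copy()
    for _ in range(n_iter):
        C = np.zeros((n, p))
        for l, (i, j) in enumerate(pairs):
            d = V[l] - U[l]
            C[i] += d
            C[j] -= d
        X = np.linalg.solve(lhs, 2 * A + rho * C)
        for l, (i, j) in enumerate(pairs):
            z = X[i] - X[j] + U[l]
            nz = np.linalg.norm(z)
            # V-update: block soft-thresholding, the prox of (alpha/rho)*||.||_2
            V[l] = max(0.0, 1.0 - alpha / (rho * nz)) * z if nz > 0 else np.zeros(p)
            U[l] = U[l] + X[i] - X[j] - V[l]
    return X
```

On two well-separated groups of points with α in the range suggested by Theorem 1 below, the rows of the returned X fuse within each group while the groups stay apart, which is exactly the membership-recovery behaviour the paper analyzes.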
Given a matrix A, we use A_{i·} to denote its ith row and A_{·j} to denote its jth column. Its (i, j)th entry is denoted by A_{i,j}. Two norms are used: ‖·‖_F denotes the Frobenius norm and ‖·‖₂ the l2 norm of a vector. The space spanned by the rows of A is denoted by Row(A). Moreover, given a matrix A of dimension n × p and a function f : R^p → R^q, we use the notation f(A) to denote the matrix whose ith row is f(A_{i·}).

2 Main Result

In this section we present our main theoretical result – a provable guarantee that SON succeeds in identifying cluster membership.

2.1 Preliminaries

We first define some operators that will be frequently used in the remainder of the paper.

Definition 2. Given any two matrices E of dimension n1 × p and F of dimension n2 × p, define the difference operator D1 on E, D2 on the two matrices E, F, and D on the matrix constructed by concatenating E and F vertically as

D_1(E) = \begin{pmatrix} E_{1\cdot} - E_{2\cdot} \\ E_{1\cdot} - E_{3\cdot} \\ \vdots \\ E_{1\cdot} - E_{n_1\cdot} \\ E_{2\cdot} - E_{3\cdot} \\ \vdots \\ E_{2\cdot} - E_{n_1\cdot} \\ \vdots \\ E_{(n_1-1)\cdot} - E_{n_1\cdot} \end{pmatrix}, \quad
D_2(E, F) = \begin{pmatrix} E_{1\cdot} - F_{1\cdot} \\ E_{1\cdot} - F_{2\cdot} \\ \vdots \\ E_{1\cdot} - F_{n_2\cdot} \\ E_{2\cdot} - F_{1\cdot} \\ \vdots \\ E_{2\cdot} - F_{n_2\cdot} \\ \vdots \\ E_{n_1\cdot} - F_{n_2\cdot} \end{pmatrix}, \quad \text{and} \quad
D\left( \begin{pmatrix} E \\ F \end{pmatrix} \right) = \begin{pmatrix} D_1(E) \\ D_1(F) \\ D_2(E, F) \end{pmatrix}.

In words, the operator D1 calculates 
the difference between every two rows of a matrix and lists the results in the order indicated in the definition. Similarly, given two matrices E and F, the operator D2(E, F) calculates the difference of any two rows between E and F, one from E and the other from F. We also define the following average operator, which calculates the mean of the row vectors.

Definition 3. Given any matrix E of dimension n × p, define the average operator on E as

M(E) = \frac{1}{n} \sum_{i=1}^{n} E_{i\cdot}.

Definition 4. A matrix E is called column centered if M(E) = 0.

2.2 Theoretical Guarantees

Our main result essentially says that when there are two clusters, each of which is a cube, and they are reasonably separated from each other, then SON successfully recovers the cluster membership. We now make this formal. For i = 1, 2, suppose C_i ⊆ R^p is a cube with center (μ_{i1}, μ_{i2}, ..., μ_{ip}) and edge lengths s_i = 2(σ_{i1}, σ_{i2}, ..., σ_{ip}), i.e.,

C_i = [μ_{i1} − σ_{i1}, μ_{i1} + σ_{i1}] × ··· × [μ_{ip} − σ_{ip}, μ_{ip} + σ_{ip}].

Definition 5. The distance d_{1,2} between cubes C1 and C2 is

d_{1,2} ≜ inf{ ‖x − y‖₂ | x ∈ C1, y ∈ C2 }.

Definition 6. The weighted size w_{1,2} with respect to C1, C2, n1 and n2 is defined as

w_{1,2} = \max\left\{ \left( \frac{2n_2(n_1 - 1)}{n_1^2} + 1 \right) \|s_1\|_2, \; \left( \frac{2n_1(n_2 - 1)}{n_2^2} + 1 \right) \|s_2\|_2 \right\}.

Theorem 1. Given a column centered data matrix A of dimension n × p, where each row is arbitrarily picked from either cube C1 or cube C2 and in total n_i rows are chosen from C_i for i = 1, 2, if w_{1,2} < d_{1,2}, then by choosing the parameter α ∈ R such that w_{1,2} < (n/2)α < d_{1,2}, we have the following:

1. 
SON can correctly determine the cluster membership of A;

2. Rearrange the rows of A such that

A = \begin{pmatrix} A^1 \\ A^2 \end{pmatrix} \quad \text{and} \quad A^i = \begin{pmatrix} A^i_{1\cdot} \\ A^i_{2\cdot} \\ \vdots \\ A^i_{n_i\cdot} \end{pmatrix},   (2)

where for i = 1, 2 and j = 1, 2, ..., n_i, A^i_{j·} = (A^i_{j,1}, A^i_{j,2}, ..., A^i_{j,p}) ∈ C_i. Then, the optimal solution \hat{X} of Problem (1) is given by

\hat{X}_{i\cdot} = \begin{cases} \dfrac{n_2}{n_1 + n_2} \left( 1 - \dfrac{n\alpha}{2\|M(D_2(A^1, A^2))\|_2} \right) M(D_2(A^1, A^2)), & \text{if } A_{i\cdot} \in C_1; \\[2ex] -\dfrac{n_1}{n_1 + n_2} \left( 1 - \dfrac{n\alpha}{2\|M(D_2(A^1, A^2))\|_2} \right) M(D_2(A^1, A^2)), & \text{if } A_{i\cdot} \in C_2. \end{cases}

The theorem essentially states that we need d_{1,2} to be large and w_{1,2} to be small for correct determination of the cluster membership of A. This is indeed intuitive. Notice that d_{1,2} is the distance between the cubes and w_{1,2} is a constant that depends on the size of the cubes as well as the ratio between the numbers of samples in each cube. Obviously, if the cubes are too close to each other, i.e., d_{1,2} is small, or if the sizes of the clusters are too big compared to their distance, it is difficult to determine the cluster membership correctly. Moreover, when n1 ≪ n2 or n1 ≫ n2, w_{1,2} is large, and the theorem states that it is difficult to determine the cluster membership. This is also to be expected, since in this case one cluster will be overwhelmed by the other, and hence determining where the data points are chosen from becomes problematic.

The assumption in Theorem 1 that the data matrix A is column centered can be easily relaxed, using the following proposition, which states that the result of SON is invariant to any isometry operation.

Definition 7. 
An isometry of R^n is a function f : R^n → R^n that preserves the distance between vectors, i.e.,

‖f(u) − f(w)‖₂ = ‖u − w‖₂, ∀ u, w ∈ R^n.

Proposition 1. (Isometry Invariance) Given a data matrix A of dimension n × p where each row is chosen from some cluster C_i, i = 1, 2, ..., c, and f(·) an isometry of R^p, we have

\hat{X} = \arg\min_{X \in \mathbb{R}^{n \times p}} \|A - X\|_F^2 + \alpha \sum_{i<j} \|X_{i\cdot} - X_{j\cdot}\|_2
\iff f(\hat{X}) = \arg\min_{X \in \mathbb{R}^{n \times p}} \|f(A) - X\|_F^2 + \alpha \sum_{i<j} \|X_{i\cdot} - X_{j\cdot}\|_2.
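The operators of Definitions 2–3 and the closed-form solution in Theorem 1 are easy to check numerically. Below is a small numpy sketch for that purpose; it is my own illustration, not the authors' code, and the names D1, D2, M, son_solution are chosen to mirror the paper's notation.

```python
import numpy as np

def D1(E):
    """Stack E_i - E_j for all i < j, in the row order of Definition 2."""
    n = E.shape[0]
    return np.array([E[i] - E[j] for i in range(n) for j in range(i + 1, n)])

def D2(E, F):
    """Stack E_i - F_j for every row of E paired with every row of F."""
    return np.array([e - f for e in E for f in F])

def M(E):
    """Average operator of Definition 3: the mean of the rows."""
    return E.mean(axis=0)

def son_solution(A1, A2, alpha):
    """Closed-form optimal X-hat of Problem (1) as given by Theorem 1.

    Assumes the stacked matrix (A1; A2) is column centered and that alpha
    lies in the admissible range w_{1,2} < (n/2)*alpha < d_{1,2}.
    """
    n1, n2 = A1.shape[0], A2.shape[0]
    n = n1 + n2
    m = M(D2(A1, A2))                      # equals M(A1) - M(A2)
    shrink = 1 - n * alpha / (2 * np.linalg.norm(m))
    x1 = (n2 / n) * shrink * m             # common row for all C1 samples
    x2 = -(n1 / n) * shrink * m            # common row for all C2 samples
    return np.vstack([np.tile(x1, (n1, 1)), np.tile(x2, (n2, 1))])
```

Since M(D2(A1, A2)) = M(A1) − M(A2), the solution maps every sample in a cluster to a single point on the segment joining the (shrunken) cluster centroids, and the returned matrix is itself column centered, consistent with the theorem's assumption on A.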