{"title": "Scalable Laplacian K-modes", "book": "Advances in Neural Information Processing Systems", "page_first": 10041, "page_last": 10051, "abstract": "We advocate Laplacian K-modes for joint clustering and density mode finding, and propose a concave-convex relaxation of the problem, which yields a parallel algorithm that scales up to large datasets and high dimensions. We optimize a tight bound (auxiliary function) of our relaxation, which, at each iteration, amounts to computing an independent update for each cluster-assignment variable, with guaranteed convergence. Therefore, our bound optimizer can be trivially distributed for large-scale data sets. Furthermore, we show that the density modes can be obtained as byproducts of the assignment variables via simple maximum-value operations whose additional computational cost is linear in the number of data points. Our formulation does not need storing a full affinity matrix and computing its eigenvalue decomposition, neither does it perform expensive projection steps and Lagrangian-dual inner iterates for the simplex constraints of each point. Furthermore, unlike mean-shift, our density-mode estimation does not require inner-loop gradient-ascent iterates. It has a complexity independent of feature-space dimension, yields modes that are valid data points in the input set and is applicable to discrete domains as well as arbitrary kernels. 
We report comprehensive experiments over various data sets, which show that our algorithm yields very competitive performances in terms of optimization quality (i.e., the value of the discrete-variable objective at convergence) and clustering accuracy.", "full_text": "Scalable Laplacian K-modes\n\nImtiaz Masud Ziko \u2217\n\n\u00c9TS Montreal\n\nEric Granger\n\u00c9TS Montreal\n\nIsmail Ben Ayed\n\u00c9TS Montreal\n\nAbstract\n\nWe advocate Laplacian K-modes for joint clustering and density mode finding, and propose a concave-convex relaxation of the problem, which yields a parallel algorithm that scales up to large datasets and high dimensions. We optimize a tight bound (auxiliary function) of our relaxation, which, at each iteration, amounts to computing an independent update for each cluster-assignment variable, with guaranteed convergence. Therefore, our bound optimizer can be trivially distributed for large-scale data sets. Furthermore, we show that the density modes can be obtained as byproducts of the assignment variables via simple maximum-value operations whose additional computational cost is linear in the number of data points. Our formulation does not need storing a full affinity matrix and computing its eigenvalue decomposition, neither does it perform expensive projection steps and Lagrangian-dual inner iterates for the simplex constraints of each point. Furthermore, unlike mean-shift, our density-mode estimation does not require inner-loop gradient-ascent iterates. It has a complexity independent of feature-space dimension, yields modes that are valid data points in the input set and is applicable to discrete domains as well as arbitrary kernels. 
We report comprehensive experiments over various data sets, which show that our algorithm yields very competitive performances in terms of optimization quality (i.e., the value of the discrete-variable objective at convergence) and clustering accuracy.\n\n1 Introduction\n\nWe advocate Laplacian K-modes for joint clustering and density mode finding, and propose a concave-convex relaxation of the problem, which yields a parallel algorithm that scales up to large data sets and high dimensions. Introduced initially in the work of Wang and Carreira-Perpi\u00f1\u00e1n [33], the model solves the following constrained optimization problem for L clusters and data points X = {x_p \u2208 R^D, p = 1, . . . , N}:\n\nmin_Z E(Z) := - \sum_{p=1}^{N} \sum_{l=1}^{L} z_{p,l} k(x_p, m_l) + (\lambda/2) \sum_{p,q} k(x_p, x_q) ||z_p - z_q||^2\ns.t. m_l = arg max_{y \u2208 X} \sum_p z_{p,l} k(x_p, y), 1^t z_p = 1, z_p \u2208 {0, 1}^L \u2200p (1)\n\nwhere, for each point p, z_p = [z_{p,1}, . . . , z_{p,L}]^t denotes a binary assignment vector, which is constrained to be within the L-dimensional simplex: z_{p,l} = 1 if p belongs to cluster l and z_{p,l} = 0 otherwise. Z is the N \u00d7 L matrix whose rows are given by the z_p's. k(x_p, x_q) are pairwise affinities, which can be either learned or evaluated in an unsupervised way via a kernel function.\nModel (1) integrates several powerful and well-known ideas in clustering. First, it identifies density modes [18, 8], as in popular mean-shift.\n\n\u2217Corresponding author email: imtiaz-masud.ziko.1@etsmtl.ca\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nPrototype m_l is a cluster mode and, therefore, a valid data
This is important for manifold-structured, high-dimensional inputs such as\nimages, where simple parametric prototypes such as the means, as in K-means, may not be good\nrepresentatives of the data; see Fig. 1. Second, the pairwise term in E is the well-known graph\nLaplacian regularizer, which can be equivalently written as \u03bbtr(ZtLZ), with L the Laplacian matrix\ncorresponding to af\ufb01nity matrix K = [k(xp, xq)]. Laplacian regularization encourages nearby\ndata points to have similar latent representations (e.g., assignments) and is widely used in spectral\nclustering [32, 26] as well as in semi-supervised and/or representation learning [2]. Therefore, the\nmodel can handle non-convex (or manifold-structured) clusters, unlike standard prototype-based\nclustering techniques such as K-means. Finally, the explicit cluster assignments yield straightforward\nout-of-sample extensions, unlike spectral clustering [3].\nOptimization problem (1) is challenging due to the simplex/integer constraints and the non-linear/non-\ndifferentiable dependence of modes ml on assignment variables. In fact, it is well known that\noptimizing the pairwise Laplacian term over discrete variables is NP-hard [30], and it is common\nto relax the integer constraint. For instance, [33] replaces the integer constraint with a probability-\nsimplex constraint, which results in a convex relaxation of the Laplacian term. Unfortunately, such a\ndirect convex relaxation requires solving for N \u00d7 L variables all together. Furthermore, it requires\nadditional projections onto the L-dimensional simplex, with a quadratic complexity with respect to L.\nTherefore, as we will see in our experiments, the relaxation in [33] does not scale up for large-scale\nproblems (i.e., when N and L are large). Spectral relaxation [32, 27] widely dominates optimization\nof the Laplacian term subject to balancing constraints in the context of graph clustering2. 
It can\nbe expressed in the form of a generalized Rayleigh quotient, which yields an exact closed-form\nsolution in terms of the L largest eigenvectors of the af\ufb01nity matrix. It is well-known that spectral\nrelaxation has high computational and memory load for large N as one has to store the N \u00d7 N af\ufb01nity\nmatrix and compute explicitly its eigenvalue decomposition, which has a complexity that is cubic\nwith respect to N for a straightforward implementation and, to our knowledge, super-quadratic for\nfast implementations [30]. In fact, investigating the scalability of spectral relaxation for large-scale\nproblems is an active research subject [26, 30, 31]. For instance, the studies in [26, 30] investigated\ndeep learning approaches to spectral clustering, so as to ease the scalability issues for large data sets,\nand the authors of [31] examined the variational Nystr\u00f6m method for large-scale spectral problems,\namong many other efforts on the subject. In general, computational scalability is attracting signi\ufb01cant\nresearch interest with the overwhelming widespread of interesting large-scale problems [11]. Such\nissues are being actively investigated even for the basic K-means algorithm [11, 22].\nThe K-modes term in (1) is closely related to kernel density based algorithms for mode estimation\nand clustering, for instance, the very popular mean-shift [8]. The value of ml globally optimizing this\nterm for a given \ufb01xed cluster l is, clearly, the mode of the kernel density of feature points within the\ncluster [29]. Therefore, the K-mode term, as in [6, 25], can be viewed as an energy-based formulation\nof mean-shift algorithms with a \ufb01xed number of clusters [29]. Optimizing the K-modes over discrete\nvariable is NP-hard [33], as is the case of other prototype-based models for clustering3. 
One way to\ntackle the problem is to alternate optimization over assignment variables and updates of the modes,\nwith the latter performed as inner-loop mean-shift iterates, as in [6, 25]. Mean-shift moves an initial\nrandom feature point towards the closest mode via gradient ascent iterates, maximizing at convergence\nthe density of feature points. While such a gradient-ascent approach has been very popular for low-\ndimensional distributions over continuous domains, e.g., image segmentation [8], its use is generally\navoided in the context of high-dimensional feature spaces [7]. Mean-shift iterates compute expensive\nsummations over feature points, with a complexity that depends on the dimension of the feature space.\nFurthermore, the method is not applicable to discrete domains [7] (as it requires gradient-ascent\nsteps), and its convergence is guaranteed only when the kernels satisfy certain conditions; see [8].\nFinally, the modes obtained at gradient-ascent convergence are not necessarily valid data points in\nthe input set.\nWe optimize a tight bound (auxiliary function) of our concave-convex relaxation for discrete problem\n(1). The bound is the sum of independent functions, each corresponding to a data point p. This yields\na scalable algorithm for large N, which computes independent updates for assignment variables zp,\nwhile guaranteeing convergence to a minimum of the relaxation. Therefore, our bound optimizer can\nbe trivially distributed for large-scale data sets. Furthermore, we show that the density modes can\n\n2Note that spectral relaxation is not directly applicable to the objective in (1) because of the presence of the\n\nK-mode term.\n\n3In fact, even the basic K-means problem is NP-hard.\n\n2\n\n\fbe obtained as byproducts of assignment variables zp via simple maximum-value operations whose\nadditional computational cost is linear in N. 
Our formulation does not need storing a full affinity matrix and computing its eigenvalue decomposition, neither does it perform expensive projection steps and Lagrangian-dual inner iterates for the simplex constraints of each point. Furthermore, unlike mean-shift, our density-mode estimation does not require inner-loop gradient-ascent iterates. It has a complexity independent of feature-space dimension, yields modes that are valid data points in the input set and is applicable to discrete domains and arbitrary kernels. We report comprehensive experiments over various data sets, which show that our algorithm yields very competitive performances in terms of optimization quality (i.e., the value of the discrete-variable objective at convergence)4 and clustering accuracy, while being scalable to large-scale and high-dimensional problems.\n\n2 Concave-convex relaxation\n\nWe propose the following concave-convex relaxation of the objective in (1):\n\nmin_{z_p \u2208 \u2207_L} R(Z) := \sum_{p=1}^{N} z_p^t log(z_p) - \sum_{p=1}^{N} \sum_{l=1}^{L} z_{p,l} k(x_p, m_l) - \lambda \sum_{p,q} k(x_p, x_q) z_p^t z_q (2)\n\nwhere \u2207_L denotes the L-dimensional probability simplex \u2207_L = {y \u2208 [0, 1]^L | 1^t y = 1}. It is easy to check that, at the vertices of the simplex, our relaxation in (2) is equivalent to the initial discrete objective in (1). Notice that, for binary assignment variables z_p \u2208 {0, 1}^L, the first term in (2) vanishes and the last term is equivalent to Laplacian regularization, up to an additive constant:\n\ntr(Z^t L Z) = \sum_p d_p z_p^t z_p - \sum_{p,q} k(x_p, x_q) z_p^t z_q = \sum_p d_p - \sum_{p,q} k(x_p, x_q) z_p^t z_q, (3)\n\nwhere the last equality is valid only for binary (integer) variables and d_p = \sum_q k(x_p, x_q). 
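The identity above can be checked numerically on toy data (a sketch of our own, not code from the paper; the data, kernel and sizes below are arbitrary):

```python
import numpy as np

# Toy check of identity (3): with L = D - K, tr(Z^t L Z) equals
# sum_p d_p z_p^t z_p - sum_{p,q} k(x_p,x_q) z_p^t z_q for any Z, and the
# first term reduces to the constant sum_p d_p for one-hot assignments.
rng = np.random.default_rng(0)
N, n_clusters = 6, 2
X = rng.normal(size=(N, 3))
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))      # Gaussian affinities
d = K.sum(axis=1)                                       # degrees d_p
Lap = np.diag(d) - K                                    # graph Laplacian
Z = np.eye(n_clusters)[rng.integers(0, n_clusters, N)]  # one-hot assignments

lhs = np.trace(Z.T @ Lap @ Z)
rhs = d.sum() - np.einsum('pq,pl,ql->', K, Z, Z)        # sum_pq k_pq z_p^t z_q
assert np.isclose(lhs, rhs)
```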
When we\nreplace the integer constraints zp \u2208 {0, 1} by zp \u2208 [0, 1], our relaxation becomes different from direct\nconvex relaxations of the Laplacian [33], which optimizes tr(ZtLZ) subject to probabilistic simplex\nconstraints. In fact, unlike tr(ZtLZ), which is a convex function5, our relaxation of the Laplacian\nterm is concave for positive semi-de\ufb01nite (psd) kernels k. As we will see later, concavity yields\na scalable (parallel) algorithm for large N, which computes independent updates for assignment\nvariables zp. Our updates can be trivially distributed, and do not require storing a full N \u00d7 N af\ufb01nity\nmatrix. These are important computational and memory advantages over direct convex relaxations\nof the Laplacian [33], which require solving for N \u00d7 L variables all together as well as expensive\nsimplex projections, and over common spectral relaxations [32], which require storing a full af\ufb01nity\nmatrix and computing its eigenvalue decomposition. Furthermore, the \ufb01rst term we introduced in (2)\nis a convex negative-entropy barrier function, which completely avoids expensive projection steps\nand Lagrangian-dual inner iterations for the simplex constraints of each point. First, the entropy\nbarrier restricts the domain of each zp to non-negative values, which avoids extra dual variables\nfor constraints zp \u2265 0. Second, the presence of such a barrier function yields closed-form updates\nfor the dual variables of constraints 1tzp = 1. In fact, entropy-like barriers are commonly used in\nBregman-proximal optimization [35], and have well-known computational and memory advantages\nwhen dealing with the challenging simplex constraints [35]. Surprisingly, to our knowledge, they\nare not common in the context of clustering. 
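As a minimal sketch of our own (generic linear term a, not the paper's notation), the entropy-barrier subproblem min_{z in the simplex} z^t log(z) - z^t a is solved in closed form by a softmax, which is exactly why no projection step or dual inner loop is needed:

```python
import numpy as np

# Closed-form simplex solution under a negative-entropy barrier:
# setting the KKT conditions log(z) + 1 - a + nu*1 = 0 to zero gives
# z proportional to exp(a), normalized to sum to one.
def simplex_update(a):
    e = np.exp(a - a.max())   # shift by max for numerical stability
    return e / e.sum()

z = simplex_update(np.array([0.2, 1.5, -0.3]))
assert np.isclose(z.sum(), 1.0) and (z > 0).all()
```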
In machine learning, such entropic barriers appear frequently in the context of conditional random fields (CRFs) [14, 15], but are not motivated from an optimization perspective; they result from standard probabilistic and mean-field approximations of CRFs [14].\n\n3 Bound optimization\n\nIn this section, we derive an iterative bound optimization algorithm that computes independent (parallel) updates of assignment variables z_p (z-updates) at each iteration, and provably converges to a minimum of relaxation (2). As we will see in our experiments, our bound optimizer yields consistently lower values of function E at convergence than the proximal algorithm in [33], while being highly scalable to large-scale and high-dimensional problems. We also show that the density modes can be obtained as byproducts of the z-updates via simple maximum-value operations whose additional computational cost is linear in N. Instead of minimizing directly our relaxation R, we iterate the minimization of an auxiliary function, i.e., an upper bound of R, which is tight at the current solution and easier to optimize.\n\nDefinition 1 A^i(Z) is an auxiliary function of R(Z) at current solution Z^i if it satisfies:\n\nR(Z) \u2264 A^i(Z), \u2200Z (4a)\nR(Z^i) = A^i(Z^i) (4b)\n\n4We obtained consistently lower values of function E at convergence than the convex-relaxation proximal algorithm in [33].\n5For relaxed variables, tr(Z^t L Z) is a convex function because the Laplacian is always positive semi-definite.\n\nIn (4), i denotes the iteration counter. In general, bound optimizers update the current solution Z^i to the optimum of the auxiliary function: Z^{i+1} = arg min_Z A^i(Z). This guarantees that the original objective function does not increase at each iteration: R(Z^{i+1}) \u2264 A^i(Z^{i+1}) \u2264 A^i(Z^i) = R(Z^i). Bound optimizers can be very effective as they transform difficult problems into easier ones [37]. Examples of well-known bound optimizers include the concave-convex procedure (CCCP) [36], expectation maximization (EM) algorithms and submodular-supermodular procedures (SSP) [21], among others. Furthermore, bound optimizers are not restricted to differentiable functions6, neither do they depend on optimization parameters such as step sizes.\n\nProposition 1 Given current solution Z^i = [z^i_{p,l}] at iteration i, and the corresponding modes m^i_l = arg max_{y \u2208 X} \sum_p z^i_{p,l} k(x_p, y), we have the following auxiliary function (up to an additive constant) for the concave-convex relaxation in (2) and psd7 affinity matrix K:\n\nA^i(Z) = \sum_{p=1}^{N} z_p^t (log(z_p) - a^i_p - \lambda b^i_p) (5)\n\nwhere a^i_p and b^i_p are the following L-dimensional vectors:\n\na^i_p = [a^i_{p,1}, . . . , a^i_{p,L}]^t, with a^i_{p,l} = k(x_p, m^i_l) (6a)\nb^i_p = [b^i_{p,1}, . . . , b^i_{p,L}]^t, with b^i_{p,l} = \sum_q k(x_p, x_q) z^i_{q,l} (6b)\n\nProof 1 See supplemental material.\n\nNotice that the bound in Eq. (5) is the sum of independent functions, each corresponding to a point p. Therefore, both the bound and simplex constraints z_p \u2208 \u2207_L are separable over assignment variables z_p. 
We can minimize the auxiliary function by minimizing independently each term in the sum over z_p, subject to the simplex constraint, while guaranteeing convergence to a local minimum of (2):\n\nmin_{z_p \u2208 \u2207_L} z_p^t (log(z_p) - a^i_p - \lambda b^i_p), \u2200p (7)\n\nNote that, for each p, negative entropy z_p^t log z_p restricts z_p to be non-negative, which removes the need for handling explicitly constraints z_p \u2265 0. This term is convex and, therefore, the problem in (7) is convex: The objective is convex (sum of linear and convex functions) and constraint z_p \u2208 \u2207_L is affine. Therefore, one can minimize this constrained convex problem for each p by solving the Karush-Kuhn-Tucker (KKT) conditions8. The KKT conditions yield a closed-form solution for both primal variables z_p and the dual variables (Lagrange multipliers) corresponding to simplex constraints 1^t z_p = 1.\n\n6Our objective is not differentiable with respect to the modes as each of these is defined as the maximum of a function of the assignment variables.\n7We can consider K to be psd without loss of generality. When K is not psd, we can use a diagonal shift for the affinity matrix, i.e., we replace K by \u02dcK = K + \u03b4 I_N. Clearly, \u02dcK is psd for sufficiently large \u03b4. For integer variables, this change does not alter the structure of the minimum of discrete function E.\n8Note that strong duality holds since the objectives are convex and the simplex constraints are affine. This means that the solutions of the KKT conditions minimize the auxiliary function.\n\n
Each closed-form update, which globally optimizes (7) and is within the simplex, is given by:\n\nz^{i+1}_p = exp(a^i_p + \lambda b^i_p) / 1^t exp(a^i_p + \lambda b^i_p), \u2200p (8)\n\nAlgorithm 1: SLK algorithm\nInput: X, initial centers {m^0_l}_{l=1}^L\nOutput: Z and {m_l}_{l=1}^L\n1 {m_l}_{l=1}^L \u2190 {m^0_l}_{l=1}^L\n2 repeat\n3   i \u2190 1 // iteration index\n4   {m^i_l}_{l=1}^L \u2190 {m_l}_{l=1}^L\n    // z-updates\n5   foreach x_p do\n6     Compute a^i_p from (6a)\n7     z^i_p = exp{a^i_p} / 1^t exp{a^i_p} // Initialize\n8   end\n9   repeat\n10    Compute z^{i+1}_p using (6b) and (8)\n11    i \u2190 i + 1\n12  until convergence\n13  Z = [z^{i+1}_{p,l}]\n    // Mode-updates\n14  if SLK-MS then\n15    update m_l using (9) until convergence\n16  if SLK-BO then\n17    m_l \u2190 arg max_{x_p} [z^{i+1}_{p,l}]\n18 until convergence\n19 return Z, {m_l}_{l=1}^L\n\nThe pseudo-code for our Scalable Laplacian K-modes (SLK) method is provided in Algorithm 1. The complexity of each inner iteration in z-updates is O(N\u03c1L), with \u03c1 the neighborhood size for the affinity matrix. Typically, we use sparse matrices (\u03c1 << N). Note that the complexity becomes O(N^2 L) in the case of dense matrices in which all the affinities are non-zero. However, the update of each z_p can be done independently, which enables parallel implementations.\nOur SLK algorithm alternates the following two steps until convergence (i.e. 
until the modes {m_l}_{l=1}^L do not change): (i) z-updates: update cluster assignments using expression (8) with the modes fixed and (ii) Mode-updates: update the modes {m_l}_{l=1}^L with the assignment variable Z fixed; see the next section for further details on mode estimation.\n\n3.1 Mode updates\n\nTo update the modes, we have two options: modes via mean-shift or as byproducts of the z-updates.\nModes via mean-shift: This amounts to updating each mode m_l by running inner-loop mean-shift iterations until convergence, using the current assignment variables:\n\nm^{i+1}_l = \sum_p z_{p,l} k(x_p, m^i_l) x_p / \sum_p z_{p,l} k(x_p, m^i_l) (9)\n\nModes as byproducts of the z-updates: We also propose an efficient alternative to mean-shift. Observe the following: For each point p, b^i_{p,l} = \sum_q k(x_p, x_q) z^i_{q,l} is proportional to the kernel density estimate (KDE) of the distribution of features within current cluster l at point p.\n\nFigure 1: Examples of mode images obtained with our SLK-BO, mean images and the corresponding 3-nearest-neighbor images within each cluster. We used LabelMe (Alexnet) dataset.\n\nIn fact, the KDE at a feature point y is:\n\nP^i_l(y) = \sum_q k(y, x_q) z^i_{q,l} / \sum_q z^i_{q,l}.\n\nTherefore, b^i_{p,l} \u221d P^i_l(x_p). As a result, for a given point p within the cluster, the higher b^i_{p,l}, the higher the KDE of the cluster at that point. Notice also that a^i_{p,l} = k(x_p, m^i_l) measures a proximity between point x_p and the mode obtained at the previous iteration. 
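This observation can be checked on toy data (a sketch of our own; affinities, assignments and sizes below are arbitrary):

```python
import numpy as np

# b = K @ Z gives b_{p,l} = sum_q k(x_p,x_q) z_{q,l} (Eq. 6b). Dividing column l
# by the cluster mass sum_q z_{q,l} yields the cluster-conditional KDE evaluated
# at every data point; this positive per-cluster scaling leaves the argmax over
# points unchanged, which is what the byproduct mode update exploits.
rng = np.random.default_rng(1)
N, n_clusters = 8, 2
X = rng.normal(size=(N, 2))
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)  # Gaussian affinities
Z = rng.dirichlet(np.ones(n_clusters), size=N)            # soft simplex assignments

B = K @ Z
kde = B / Z.sum(axis=0)                                   # P_l(x_p) for all p, l
assert (B.argmax(axis=0) == kde.argmax(axis=0)).all()
```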
Therefore, given the current assignment z^i_p, the modes can be obtained as a proximal optimization, which seeks a high-density data point that does not deviate significantly from the mode obtained at the previous iteration:\n\nmax_{y \u2208 X} [ k(y, m^i_l) + \sum_p z_{p,l} k(x_p, y) ], (10)\n\nwhere the first term measures proximity and the second measures density. Now observe that the z-updates we obtained in Eq. (8) take the form of softmax functions. Therefore, they can be used as soft approximations of the hard max operation in Eq. (10):\n\nm^{i+1}_l = x_p, with p = arg max_q [z_{q,l}]^i (11)\n\nThis yields modes as byproducts of the z-updates, with a computational cost that is linear in N. We refer to the two different versions of our algorithm as SLK-MS, which updates the modes via mean-shift, and SLK-BO, which updates the modes as byproducts of the z-updates.\n\nFigure 2: Discrete-variable objective (1): Comparison of the objectives obtained at convergence for SLK-MS (ours) and LK [33] on (a) MNIST (small) and (b) LabelMe (Alexnet). The objectives at convergence are plotted versus different values of parameter \u03bb.\n\n4 Experiments\n\nWe report comprehensive evaluations of the proposed algorithm9 as well as comparisons to the following related baseline methods: Laplacian K-modes (LK) [33], K-means, NCUT [27], K-modes [25, 5], Kernel K-means (KK-means) [9, 29] and Spectralnet [26]. 
Our algorithm is evaluated in terms of performance and optimization quality in various clustering datasets.\n\nTable 1: Datasets used in the experiments.\nDatasets | Samples (N) | Dimensions (D) | Clusters (L) | Imbalance\nMNIST (small) | 2,000 | 784 | 10 | 1\nMNIST (code) | 70,000 | 10 | 10 | \u223c1\nMNIST | 70,000 | 784 | 10 | \u223c1\nMNIST (GAN) | 70,000 | 256 | 10 | \u223c1\nShuttle | 58,000 | 9 | 7 | 4,558\nLabelMe (Alexnet) | 2,688 | 4,096 | 8 | 1\nLabelMe (GIST) | 2,688 | 44,604 | 8 | 1\nYTF | 10,036 | 9,075 | 40 | 13\nReuters (code) | 685,071 | 10 | 4 | \u223c5\n\n4.1 Datasets and evaluation metrics\n\nWe used image datasets, except Shuttle and Reuters. The overall summary of the datasets is given in Table 1. For each dataset, imbalance is defined as the ratio of the size of the biggest cluster to the size of the smallest one. We use three versions of MNIST [17]. MNIST contains all the 70,000 images, whereas MNIST (small) includes only 2,000 images obtained by randomly sampling 200 images per class. We used small datasets in order to compare to LK [33], which does not scale up for large datasets. For MNIST (GAN), we train the GAN from [12] on 60,000 training images and extract the 256-dimensional features from the discriminator network for the 70,000 images. The publicly available autoencoder in [13] is used to extract 10-dimensional features as in [26] for MNIST (code) and Reuters (code). LabelMe [23] consists of 2,688 images divided into 8 categories. We used the pre-trained AlexNet [16] and extracted the 4096-dimensional features from the fully-connected layer. To show the performances on high-dimensional data, we extract 44604-dimensional GIST features [23] for the LabelMe dataset. Youtube Faces (YTF) [34] consists of videos of faces with 40 different subjects.\n\n9Code is available at: https://github.com/imtiazziko/SLK\n\nTable 2: Clustering results as NMI/ACC in the upper half and average elapsed time in seconds (s). 
(*) We report the results of Spectralnet with Euclidean-distance affinity for MNIST (code) and Reuters (code) from [26].\n\nNMI/ACC:\nAlgorithm | MNIST | MNIST (code) | MNIST (GAN) | LabelMe (Alexnet) | LabelMe (GIST) | YTF | Shuttle | Reuters\nK-means | 0.53/0.55 | 0.66/0.74 | 0.68/0.75 | 0.81/0.90 | 0.57/0.69 | 0.77/0.58 | 0.22/0.41 | 0.48/0.73\nK-modes | 0.56/0.60 | 0.67/0.75 | 0.69/0.80 | 0.81/0.91 | 0.58/0.68 | 0.79/0.62 | 0.33/0.47 | 0.48/0.72\nNCUT | 0.74/0.61 | 0.84/0.81 | 0.77/0.67 | 0.81/0.91 | 0.58/0.61 | 0.74/0.54 | 0.47/0.46 | -\nKK-means | 0.53/0.55 | 0.67/0.80 | 0.69/0.68 | 0.81/0.90 | 0.57/0.63 | 0.71/0.50 | 0.26/0.40 | -\nLK | - | - | - | 0.81/0.91 | 0.59/0.61 | 0.77/0.59 | - | -\nSpectralnet* | - | 0.81/0.80 | - | - | - | - | - | 0.46/0.65\nSLK-MS | 0.80/0.79 | 0.88/0.95 | 0.86/0.94 | 0.83/0.91 | 0.61/0.72 | 0.82/0.65 | 0.45/0.70 | 0.43/0.74\nSLK-BO | 0.77/0.80 | 0.89/0.95 | 0.86/0.94 | 0.83/0.91 | 0.61/0.72 | 0.80/0.64 | 0.51/0.71 | 0.43/0.74\n\nAverage elapsed time (s):\nAlgorithm | MNIST | MNIST (code) | MNIST (GAN) | LabelMe (Alexnet) | LabelMe (GIST) | YTF | Shuttle | Reuters\nK-means | 119.9s | 16.8s | 51.6s | 11.2s | 132.1s | 210.1s | 1.8s | 36.1s\nK-modes | 90.2s | 20.2s | 20.3s | 7.4s | 12.4s | 61.0s | 0.5s | 51.6s\nNCUT | 26.4s | 28.2s | 9.3s | 7.4s | 10.4s | 19.0s | 27.4s | -\nKK-means | 2580.8s | 1967.9s | 2427.9s | 4.6s | 17.2s | 40.2s | 1177.6s | -\nLK | - | - | - | 33.4s | 180.9s | 409.0s | - | -\nSpectralnet* | - | 3600.0s | - | - | - | - | - | 9000.0s\nSLK-MS | 101.2s | 82.4s | 37.3s | 4.7s | 37.0s | 83.3s | 3.8s | 12.5s\nSLK-BO | 14.2s | 23.1s | 10.3s | 1.8s | 7.1s | 12.4s | 1.3s | 53.1s\n\nTo evaluate the clustering performance, we use two well-adopted measures: Normalized Mutual Information (NMI) [28] and Clustering Accuracy (ACC) [26, 10]. The optimal mapping of clustering assignments to the true labels is determined using the Kuhn-Munkres algorithm [20].\n\n4.2 Implementation details\n\nWe built kNN affinities as follows: k(x_p, x_q) = 1 if x_q \u2208 N^\u03c1_p and k(x_p, x_q) = 0 otherwise, where N^\u03c1_p is the set of the \u03c1 nearest neighbors of data point x_p. This yields a sparse affinity matrix, which is efficient in terms of memory and computations. 
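A minimal sketch of this affinity construction (our own illustration with a hypothetical helper name, using exact neighbors on toy data; the paper uses FLANN's approximate KD-tree search for the large datasets):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Sparse kNN affinity: k(x_p, x_q) = 1 if x_q is among the rho nearest
# Euclidean neighbors of x_p (self-matches excluded), and 0 otherwise.
def knn_affinity(X, rho=5):
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)  # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                # exclude each point itself
    nbrs = np.argsort(d2, axis=1)[:, :rho]      # rho nearest neighbors per point
    rows = np.repeat(np.arange(len(X)), rho)
    return csr_matrix((np.ones(rows.size), (rows, nbrs.ravel())),
                      shape=(len(X), len(X)))

K = knn_affinity(np.random.default_rng(2).normal(size=(50, 4)), rho=5)
assert K.nnz == 50 * 5   # exactly rho non-zeros per row
```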
In all of the datasets, we fixed \u03c1 = 5. For the large datasets such as MNIST, Shuttle and Reuters, we used the Flann library [19] with the KD-tree algorithm, which finds approximate nearest neighbors. For the other smaller datasets, we used an efficient implementation of exact nearest-neighbor computations. We used the Euclidean distance for finding the nearest neighbors. We used the same sparse K for the pairwise-affinity algorithms we compared with, i.e., NCUT, KK-means, Laplacian K-modes. Furthermore, for each of these baseline methods, we evaluated the default setting of affinity construction with tuned \u03c3, and report the best result found. Mode estimation is based on the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2\u03c3^2)), with \u03c3^2 estimated as: \u03c3^2 = (1 / (N\u03c1)) \sum_{x_p \u2208 X} \sum_{x_q \u2208 N^\u03c1_p} ||x_p - x_q||^2. Initial centers {m^0_l}_{l=1}^L are based on K-means++ seeds [1]. We choose the best initial seed and regularization parameter \u03bb empirically based on the accuracy over a validation set (10% of the total data). \u03bb is determined by tuning in a small range from 1 to 4. In SLK-BO, we take the starting mode m_l for each cluster from the initial assignments by simply following the mode definition in (1). In Algorithm 1, all assignment variables z_p are updated in parallel. We run the publicly released codes for K-means [24], NCUT [27], Laplacian K-modes [4], Kernel K-means10 and Spectralnet [26].\n\n4.3 Clustering results\n\nTable 2 reports the clustering results, showing that, in most of the cases, our algorithms SLK-MS and SLK-BO yielded the best NMI and ACC values. For MNIST with the raw intensities as features, the proposed SLK achieved almost 80% NMI and ACC. With better learned features for MNIST (code) and MNIST (GAN), the accuracy (ACC) increases up to 95%. 
For the MNIST (code) and Reuters (code) datasets, we used the same features and Euclidean distance based affinity as Spectralnet [26], and obtained better NMI/ACC performances. The Shuttle dataset is quite imbalanced and, therefore, all the baseline clustering methods fail to achieve high accuracy. Notice that, in regard to ACC for the Shuttle dataset, we outperformed all the methods by a large margin.\n\n10https://gist.github.com/mblondel/6230787\n\nTable 3: Discrete-variable objectives at convergence for LK [33] and SLK-MS (ours).\nDatasets | LK [33] | SLK-MS (ours)\nMNIST (small) | 273.25 | 67.09\nLabelMe (Alexnet) | -1.51384 \u00d7 10^3 | -1.80777 \u00d7 10^3\nLabelMe (GIST) | -1.95490 \u00d7 10^3 | -2.02410 \u00d7 10^3\nYTF | -1.00032 \u00d7 10^4 | -1.00035 \u00d7 10^4\n\nOne advantage of our SLK-BO over standard prototype-based models is that the modes are valid data points in the input set. This is important for manifold-structured, high-dimensional inputs such as images, where simple parametric prototypes such as the means, as in K-means, may not be good representatives of the data; see Fig. 1.\n\n4.4 Comparison in terms of optimization quality\n\nTo assess the optimization quality of our optimizer, we computed the values of discrete-variable objective E in model (1) at convergence for our concave-convex relaxation (SLK-MS) as well as for the convex relaxation in [33] (LK). We compare the discrete-variable objectives for different values of \u03bb. For a fair comparison, we use the same initialization, \u03c3, k(x_p, x_q), \u03bb and mean-shift modes for both methods. As shown in the plots in Figure 2, our relaxation consistently obtained lower values of discrete-variable objective E at convergence than the convex relaxation in [33]. Also, Table 3 reports the discrete-variable objectives at convergence for LK [33] and SLK-MS (ours). These experiments suggest that our relaxation in Eq. 
(2) is tighter than the convex relaxation in [33]. In fact, Eq. (3) also suggests that our relaxation of the Laplacian term is tighter than a direct convex relaxation (the expression in the middle in Eq. (3)), as the variables in the term \sum_p d_p z_p^t z_p are not relaxed in our case.\n\n4.5 Running Time\n\nThe running times are given at the bottom half of Table 2. All the experiments (our methods and the baselines) were conducted on a machine with Xeon E5-2620 CPU and a Titan X Pascal GPU. We restrict the multiprocessing to at most 5 processes. We run each algorithm over 10 trials and report the average running time. For high-dimensional datasets, such as LabelMe (GIST) and YTF, our method is much faster than the other methods we compared to. It is also interesting to see that, for high dimensions, SLK-BO is faster than SLK-MS, which uses mean-shift for mode estimation.\n\n5 Conclusion\n\nWe presented Scalable Laplacian K-modes (SLK), a method for joint clustering and density mode estimation, which scales up to high-dimensional and large-scale problems. We formulated a concave-convex relaxation of the discrete-variable objective, and solved the relaxation with an iterative bound optimization. Our solver results in independent updates for cluster-assignment variables, with guaranteed convergence, thereby enabling distributed implementations for large-scale data sets. Furthermore, we showed that the density modes can be estimated directly from the assignment variables using simple maximum-value operations, with an additional computational cost that is linear in the number of data points. Our solution removes the need for storing a full affinity matrix and computing its eigenvalue decomposition. Unlike the convex relaxation in [33], it does not require expensive projection steps and Lagrangian-dual inner iterates for the simplex constraints of each point. 
Furthermore, unlike mean-shift, our density-mode estimation does not require inner-loop gradient-ascent iterates. It has a complexity independent of the feature-space dimension, yields modes that are valid data points in the input set, and is applicable to discrete domains as well as arbitrary kernels. We showed competitive performances of the proposed solution in terms of optimization quality and accuracy. It will be interesting to investigate joint feature learning and SLK clustering.

References

[1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

[2] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

[3] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In Neural Information Processing Systems (NIPS), pages 177–184, 2004.

[4] M. Á. Carreira-Perpiñán. Gaussian mean-shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):767–776, 2007.

[5] M. Á. Carreira-Perpiñán. A review of mean-shift algorithms for clustering. arXiv preprint arXiv:1503.00687, 2015.

[6] M. Á. Carreira-Perpiñán and W. Wang. The k-modes algorithm for clustering. arXiv preprint arXiv:1304.6478, 2013.

[7] C. Chen, H. Liu, D. Metaxas, and T. Zhao. Mode estimation for high dimensional discrete tree graphical models. In Neural Information Processing Systems (NIPS), pages 1323–1331, 2014.

[8] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.

[9] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 551–556, 2004.

[10] K. Ghasedi Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In International Conference on Computer Vision (ICCV), pages 5747–5756, 2017.

[11] Y. Gong, M. Pawlowski, F. Yang, L. Brandy, L. Bourdev, and R. Fergus. Web scale photo hash clustering on a single machine. In Computer Vision and Pattern Recognition (CVPR), pages 19–27, 2015.

[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.

[13] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1965–1972, 2017.

[14] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Neural Information Processing Systems (NIPS), pages 109–117, 2011.

[15] P. Krähenbühl and V. Koltun. Parameter learning and convergent inference for dense random fields. In International Conference on Machine Learning (ICML), pages 513–521, 2013.

[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.

[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.
Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[18] J. Li, S. Ray, and B. G. Lindsay. A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research, 8:1687–1723, 2007.

[19] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2227–2240, 2014.

[20] J. Munkres. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32–38, 1957.

[21] M. Narasimhan and J. Bilmes. A submodular-supermodular procedure with applications to discriminative structure learning. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 404–412, 2005.

[22] J. Newling and F. Fleuret. Nested mini-batch k-means. In Neural Information Processing Systems (NIPS), pages 1352–1360, 2016.

[23] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.

[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[25] M. B. Salah, I. B. Ayed, J. Yuan, and H. Zhang. Convex-relaxed kernel mapping for image segmentation. IEEE Transactions on Image Processing, 23(3):1143–1153, 2014.

[26] U. Shaham, K. Stanton, H. Li, R. Basri, B. Nadler, and Y. Kluger. SpectralNet: Spectral clustering using deep neural networks. In International Conference on Learning Representations (ICLR), 2018.

[27] J. Shi and J. Malik.
Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[28] A. Strehl and J. Ghosh. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3(12):583–617, 2002.

[29] M. Tang, D. Marin, I. B. Ayed, and Y. Boykov. Kernel cuts: Kernel and spectral clustering meet regularization. International Journal of Computer Vision, in press:1–35, 2018.

[30] F. Tian, B. Gao, Q. Cui, E. Chen, and T.-Y. Liu. Learning deep representations for graph clustering. In AAAI Conference on Artificial Intelligence, pages 1293–1299, 2014.

[31] M. Vladymyrov and M. Carreira-Perpiñán. The variational Nyström method for large-scale spectral problems. In International Conference on Machine Learning (ICML), pages 211–220, 2016.

[32] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[33] W. Wang and M. Á. Carreira-Perpiñán. The Laplacian k-modes algorithm for clustering. arXiv preprint arXiv:1406.3895, 2014.

[34] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Computer Vision and Pattern Recognition (CVPR), pages 529–534, 2011.

[35] J. Yuan, K. Yin, Y. Bai, X. Feng, and X. Tai. Bregman-proximal augmented Lagrangian approach to multiphase image segmentation. In Scale Space and Variational Methods in Computer Vision (SSVM), pages 524–534, 2017.

[36] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). In Neural Information Processing Systems (NIPS), pages 1033–1040, 2001.

[37] Z. Zhang, J. T. Kwok, and D.-Y. Yeung.
Surrogate maximization/minimization algorithms and extensions. Machine Learning, 69:1–33, 2007.