{"title": "Statistical and Computational Trade-Offs in Kernel K-Means", "book": "Advances in Neural Information Processing Systems", "page_first": 9357, "page_last": 9367, "abstract": "We investigate the efficiency of k-means in terms of both statistical and computational requirements.\nMore precisely, we study a Nystr\\\"om approach to kernel k-means. We analyze the statistical properties of the proposed method and show that it achieves the same accuracy of exact kernel k-means with only a fraction of computations.\nIndeed, we prove under basic assumptions that sampling $\\sqrt{n}$ Nystr\\\"om landmarks allows to greatly reduce computational costs without incurring in any loss of accuracy. To the best of our knowledge this is the first result showing in this kind for unsupervised learning.", "full_text": "Statistical and Computational Trade-Offs\n\nin Kernel K-Means\n\nDaniele Calandriello\nLCSL \u2013 IIT & MIT,\n\nGenoa, Italy\n\nLorenzo Rosasco\nUniversity of Genoa,\nLCSL \u2013 IIT & MIT\n\nAbstract\n\nWe investigate the ef\ufb01ciency of k-means in terms of both statistical and compu-\ntational requirements. More precisely, we study a Nystr\u00f6m approach to kernel\nk-means. We analyze the statistical properties of the proposed method and show\n\u221a\nthat it achieves the same accuracy of exact kernel k-means with only a fraction\nof computations. Indeed, we prove under basic assumptions that sampling\nn\nNystr\u00f6m landmarks allows to greatly reduce computational costs without incurring\nin any loss of accuracy. To the best of our knowledge this is the \ufb01rst result of this\nkind for unsupervised learning.\n\n1\n\nIntroduction\n\nModern applications require machine learning algorithms to be accurate as well as computationally\nef\ufb01cient, since data-sets are increasing in size and dimensions. Understanding the interplay and\ntrade-offs between statistical and computational requirements is then crucial [31, 30]. 
In this paper, we consider this question in the context of clustering, studying a popular nonparametric approach, namely kernel k-means [33]. K-means is arguably one of the most common approaches to clustering, and it produces clusters with piece-wise linear boundaries. Its kernel version allows one to consider nonlinear boundaries, greatly improving the flexibility of the approach. Its statistical properties have been studied [15, 24, 10], and from a computational point of view it requires manipulating an empirical kernel matrix. As for other kernel methods, this latter operation becomes unfeasible for large scale problems, and deriving approximate computations is the subject of a large body of recent work, see for example [34, 16, 29, 35, 25] and references therein.

In this paper we are interested in quantifying the statistical effect of computational approximations. Arguably, one could expect the latter to induce some loss of accuracy. In fact, we prove that, perhaps surprisingly, there are favorable regimes where it is possible to maintain optimal statistical accuracy while significantly reducing computational costs. While a similar phenomenon has recently been shown in supervised learning [31, 30, 12], we are not aware of similar results for other learning tasks. Our approach is based on a Nyström method for kernel k-means, which samples a subset of training set points (landmarks) to approximate the kernel matrix [3, 34, 13, 14, 35, 25]. While there is a vast literature on the properties of Nyström methods for kernel approximation [25, 3], experience from supervised learning shows that better results can be derived by focusing on the task of interest, see the discussion in [7]. The properties of Nyström approximations for k-means have recently been studied in [35, 25].
There, however, the authors focus only on the computational aspect of the problem, and provide fast methods that achieve an empirical cost only a multiplicative factor larger than the optimal one.

Our analysis aims to combine both statistical and computational results. Towards this end we derive a novel additive bound on the empirical cost that can be used to bound the true object of interest: the expected cost. This result can be combined with probabilistic results to show that optimal statistical accuracy can be obtained considering only $O(\sqrt{n})$ Nyström landmark points, where $n$ is the number of training points. Moreover, we show similar bounds not only for the optimal solution, which is hard to compute in general, but also for approximate solutions that can be computed efficiently using k-means++. From a computational point of view this leads to massive improvements, reducing the memory complexity from $O(n^2)$ to $O(n\sqrt{n})$. Experimental results on large scale data-sets confirm and illustrate our findings.

The rest of the paper is organized as follows. We first review kernel k-means, and introduce our approximate kernel k-means approach based on Nyström embeddings. We then prove our statistical and computational guarantees and empirically validate them. Finally, we present some limits of our analysis, and open questions.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Background

Notation. Given an input space $\mathcal{X}$, a sampling distribution $\mu$, and $n$ samples $\{x_i\}_{i=1}^n$ drawn i.i.d. from $\mu$, we denote with $\mu_n(A) = (1/n)\sum_{i=1}^n \mathbb{I}\{x_i \in A\}$ the empirical distribution.
Once the data has been sampled, we use the feature map $\varphi(\cdot): \mathcal{X} \to \mathcal{H}$ to map $\mathcal{X}$ into a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ [32], which we assume separable, such that for any $x \in \mathcal{X}$ we have $\phi = \varphi(x)$. Intuitively, in the rest of the paper the reader can assume that $\phi \in \mathbb{R}^D$ with $D \gg n$ or even infinite. Using the kernel trick [2] we also know that $\phi^T\phi' = K(x, x')$, where $K$ is the kernel function associated with $\mathcal{H}$ and $\phi^T\phi' = \langle \phi, \phi' \rangle_{\mathcal{H}}$ is a short-hand for the inner product in $\mathcal{H}$. With a slight abuse of notation we will also define the norm $\|\phi\|^2 = \phi^T\phi$, and assume that $\|\varphi(x)\|^2 \le \kappa^2$ for any $x \in \mathcal{X}$. Using $\phi_i = \varphi(x_i)$, we denote with $D = \{\phi_i\}_{i=1}^n$ the input dataset. We also represent the dataset as the map $\Phi_n = [\phi_1, \dots, \phi_n]: \mathbb{R}^n \to \mathcal{H}$ with $\phi_i$ as its $i$-th column. We denote with $K_n = \Phi_n^T\Phi_n$ the empirical kernel matrix with entries $[K_n]_{i,j} = k_{i,j}$. Finally, given $\Phi_n$ we denote as $\Pi_n = \Phi_n\Phi_n^T(\Phi_n\Phi_n^T)^+$ the orthogonal projection matrix on the span $\mathcal{H}_n = \mathrm{Im}(\Phi_n)$ of the dataset.

K-means objective. Given our dataset, we are interested in partitioning it into $k$ disjoint clusters, each characterized by its centroid $c_j$. The Voronoi cell associated with a centroid $c_j$ is defined as the set $\mathcal{C}_j := \{i : j = \arg\min_{s=1,\dots,k} \|\phi_i - c_s\|^2\}$; in other words, a point $\phi_i \in D$ belongs to the $j$-th cluster if $c_j$ is its closest centroid. Let $C = [c_1, \dots, c_k]$ be a collection of $k$ centroids from $\mathcal{H}$. We can now formalize the criterion we use to measure clustering quality.

Definition 1.
The empirical and expected squared norm criteria are defined as
$$W(C, \mu_n) := \frac{1}{n}\sum_{i=1}^n \min_{j=1,\dots,k} \|\phi_i - c_j\|^2, \qquad W(C, \mu) := \mathbb{E}_{\phi \sim \mu}\Big[\min_{j=1,\dots,k} \|\phi - c_j\|^2\Big].$$
The empirical risk minimizer (ERM) is defined as $C_n := \arg\min_{C \in \mathcal{H}^k} W(C, \mu_n)$.

The sub-script $n$ in $C_n$ indicates that it minimizes $W(C, \mu_n)$ for the $n$ samples in $D$. Biau et al. [10] give a bound on the excess risk of the empirical risk minimizer.

Proposition 1 ([10]). The excess risk $\mathcal{E}(C_n)$ of the empirical risk minimizer $C_n$ satisfies
$$\mathcal{E}(C_n) := \mathbb{E}_{D \sim \mu}[W(C_n, \mu)] - W^*(\mu) \le O\big(k/\sqrt{n}\big),$$
where $W^*(\mu) := \inf_{C \in \mathcal{H}^k} W(C, \mu)$ is the optimal clustering risk.

From a theoretical perspective, this result is only $\sqrt{k}$ times larger than a corresponding $O(\sqrt{k/n})$ lower bound [18], and therefore shows that the ERM $C_n$ achieves an excess risk optimal in $n$. From a computational perspective, Definition 1 cannot be directly used to compute $C_n$, since the points $\phi_i$ in $\mathcal{H}$ cannot be directly represented. Nonetheless, due to properties of the squared norm criterion, each $c_j \in C_n$ must be the mean of all $\phi_i$ associated with that center, i.e., $C_n$ belongs to $\mathcal{H}_n$. Therefore, it can be explicitly represented as a sum of the $\phi_i$ points included in the $j$-th cluster, i.e., all the points in the $j$-th Voronoi cell $\mathcal{C}_j$. Let $V$ be the space of all possible disjoint partitions $[\mathcal{C}_1, \dots, \mathcal{C}_k]$. We can use this fact, together with the kernel trick, to reformulate the objective $W(\cdot, \mu_n)$.

Proposition 2 ([17]).
We can rewrite the objective as
$$\min_{C \in \mathcal{H}^k} W(C, \mu_n) = \min_{V} \frac{1}{n}\sum_{j=1}^k \sum_{i \in \mathcal{C}_j} \Big\|\phi_i - \frac{1}{|\mathcal{C}_j|}\sum_{s \in \mathcal{C}_j}\phi_s\Big\|^2,$$
with
$$\Big\|\phi_i - \frac{1}{|\mathcal{C}_j|}\sum_{s \in \mathcal{C}_j}\phi_s\Big\|^2 = k_{i,i} - \frac{2}{|\mathcal{C}_j|}\sum_{s \in \mathcal{C}_j} k_{i,s} + \frac{1}{|\mathcal{C}_j|^2}\sum_{s \in \mathcal{C}_j}\sum_{s' \in \mathcal{C}_j} k_{s,s'}.$$

While the combinatorial search over $V$ can now be explicitly computed and optimized using the kernel matrix $K_n$, it still remains highly inefficient to do so. In particular, simply constructing and storing $K_n$ takes $O(n^2)$ time and space and does not scale to large datasets.

3 Algorithm

A simple approach to reduce the computational cost is to use approximate embeddings, which replace the map $\varphi(\cdot)$ and points $\phi_i = \varphi(x_i) \in \mathcal{H}$ with a finite-dimensional approximation $\tilde\phi_i = \tilde\varphi(x_i) \in \mathbb{R}^m$.

Nyström kernel k-means. Given a dataset $D$, we denote with $I = \{\phi_j\}_{j=1}^m$ a dictionary (i.e., subset) of $m$ points $\phi_j$ from $D$, and with $\Phi_m: \mathbb{R}^m \to \mathcal{H}$ the map with these points as columns. These points act as landmarks [36], inducing a smaller space $\mathcal{H}_m = \mathrm{Im}(\Phi_m)$ spanned by the dictionary. As we will see in the next section, $I$ should be chosen so that $\mathcal{H}_m$ is close to the whole span $\mathcal{H}_n = \mathrm{Im}(\Phi_n)$ of the dataset. Let $K_{m,m} \in \mathbb{R}^{m \times m}$ be the empirical kernel matrix between all points in $I$, and denote with
$$\Pi_m = \Phi_m\Phi_m^T(\Phi_m\Phi_m^T)^+ = \Phi_m K_{m,m}^+ \Phi_m^T \qquad (1)$$
the orthogonal projection on $\mathcal{H}_m$. Then we can define an approximate ERM over $\mathcal{H}_m$ as
$$C_{n,m} = \arg\min_{C \in \mathcal{H}_m^k} \frac{1}{n}\sum_{i=1}^n \min_{j=1,\dots,k} \|\phi_i - c_j\|^2 = \arg\min_{C \in \mathcal{H}_m^k} \frac{1}{n}\sum_{i=1}^n \min_{j=1,\dots,k} \|\Pi_m(\phi_i - c_j)\|^2, \qquad (2)$$
since any component outside of $\mathcal{H}_m$ is just a constant in the minimization. Note that the centroids $C_{n,m}$ are still points in $\mathcal{H}_m \subset \mathcal{H}$, and we cannot directly compute them. Instead, we can use the eigen-decomposition $K_{m,m} = U\Lambda U^T$ to rewrite $\Pi_m = \Phi_m U\Lambda^{-1/2}\Lambda^{-1/2}U^T\Phi_m^T$. Defining now $\tilde\varphi(\cdot) = \Lambda^{-1/2}U^T\Phi_m^T\varphi(\cdot)$, we have a finite-rank embedding into $\mathbb{R}^m$. Substituting in Eq. (2),
$$\|\Pi_m(\phi_i - c_j)\|^2 = \|\tilde\phi_i - \Lambda^{-1/2}U^T\Phi_m^T c_j\|^2,$$
where $\tilde\phi_i := \Lambda^{-1/2}U^T\Phi_m^T\phi_i$ are the embedded points. Replacing $\tilde c_j := \Lambda^{-1/2}U^T\Phi_m^T c_j$ and searching over $\tilde C \in \mathbb{R}^{m \times k}$ instead of searching over $C \in \mathcal{H}_m^k$, we obtain (similarly to Proposition 2)
$$\tilde C_{n,m} = \arg\min_{\tilde C \in \mathbb{R}^{m \times k}} \frac{1}{n}\sum_{i=1}^n \min_{j=1,\dots,k} \|\tilde\phi_i - \tilde c_j\|^2 = \min_V \frac{1}{n}\sum_{j=1}^k \sum_{i \in \mathcal{C}_j} \Big\|\tilde\phi_i - \frac{1}{|\mathcal{C}_j|}\sum_{s \in \mathcal{C}_j}\tilde\phi_s\Big\|^2, \qquad (3)$$
where we do not need to resort to kernel tricks, but can use the $m$-dimensional embeddings $\tilde\phi_i$ to explicitly compute the centroid $\frac{1}{|\mathcal{C}_j|}\sum_{s \in \mathcal{C}_j}\tilde\phi_s$. Eq. (3) can now be solved in multiple ways.
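The embedding step above can be sketched in a few lines of NumPy (a minimal illustration: the Gaussian kernel, the synthetic data, and the uniform choice of $\sqrt{n}$ landmarks are assumptions for the example, not part of the derivation):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def nystrom_embed(X, landmarks, sigma=1.0, eps=1e-10):
    # K_{m,m}: kernel matrix between landmarks; eigen-decompose K_{m,m} = U Lambda U^T.
    K_mm = rbf_kernel(landmarks, landmarks, sigma)
    lam, U = np.linalg.eigh(K_mm)
    # Drop negligible eigenvalues (pseudo-inverse of Lambda^{1/2}).
    keep = lam > eps
    lam, U = lam[keep], U[:, keep]
    # Embedded points: phi_tilde_i = Lambda^{-1/2} U^T K_{m,i}.
    K_mn = rbf_kernel(landmarks, X, sigma)
    return (np.diag(lam ** -0.5) @ U.T @ K_mn).T  # shape (n, m_eff)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
m = int(np.sqrt(len(X)))  # sqrt(n) landmarks, sampled uniformly
landmarks = X[rng.choice(len(X), size=m, replace=False)]
emb = nystrom_embed(X, landmarks)

# Sanity check: for landmark points the embedding reproduces the kernel
# exactly, since they lie in the span H_m.
emb_lm = nystrom_embed(landmarks, landmarks)
assert np.allclose(emb_lm @ emb_lm.T, rbf_kernel(landmarks, landmarks), atol=1e-6)
```

Any Euclidean k-means solver can then be run directly on `emb`, which is the content of the next paragraph.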
The most straightforward approach is to run a parametric k-means algorithm to compute $\tilde C_{n,m}$, and then invert the relationship $\tilde c_j = \Lambda^{-1/2}U^T\Phi_m^T c_j$ to bring the solution back to $\mathcal{H}_m$, i.e., $C_{n,m} = (\Phi_m^T)^+ U\Lambda^{1/2}\tilde C_{n,m} = \Phi_m U\Lambda^{-1/2}\tilde C_{n,m}$. This can be done in $O(nm)$ space and $O(nmkt + nm^2)$ time using $t$ steps of Lloyd's algorithm [22] for $k$ clusters. More in detail, computing the embeddings $\tilde\phi_i$ is a one-off cost taking $nm^2$ time. Once the $m$-rank Nyström embeddings $\tilde\phi_i$ are computed, they can be stored and manipulated in $nm$ time and space, with an $n/m$ improvement over the $n^2$ time and space required to construct $K_n$.

3.1 Uniform sampling for dictionary construction

Due to its derivation, the computational cost of Algorithm 1 depends on the size $m$ of the dictionary $I$. Therefore, for computational reasons we would prefer to select as small a dictionary as possible. As a conflicting goal, we also wish to optimize $W(\cdot, \mu_n)$ well, which requires a $\tilde\varphi(\cdot)$ and $I$ rich enough to approximate $W(\cdot, \mu_n)$ well. Let $\Pi^{\perp}_m$ be the projection orthogonal to $\mathcal{H}_m$. Then, when $c_i \in \mathcal{H}_m$,
$$\|\phi_i - c_i\|^2 = \|(\Pi_m + \Pi^{\perp}_m)(\phi_i - c_i)\|^2 = \|\Pi_m(\phi_i - c_i)\|^2 + \|\Pi^{\perp}_m\phi_i\|^2.$$
We will now introduce the concept of a $\gamma$-preserving dictionary $I$ to control the quantity $\|\Pi^{\perp}_m\phi_i\|^2$.

Algorithm 1 Nyström Kernel K-Means
Input: dataset $D = \{\phi_i\}_{i=1}^n$, dictionary $I = \{\phi_j\}_{j=1}^m$ with points from $D$, number of clusters $k$
  compute the kernel matrix $K_{m,m} = \Phi_m^T\Phi_m$ between all points in $I$
  compute eigenvectors $U$ and eigenvalues $\Lambda$ of $K_{m,m}$
  for each point $\phi_i$, compute the embedding $\tilde\phi_i = \Lambda^{-1/2}U^T\Phi_m^T\phi_i = \Lambda^{-1/2}U^T K_{m,i} \in \mathbb{R}^m$
  compute the optimal centroids $\tilde C_{n,m} \in \mathbb{R}^{m \times k}$ on the embedded dataset $\tilde D = \{\tilde\phi_i\}_{i=1}^n$
  compute the explicit representation of the centroids $C_{n,m} = \Phi_m U\Lambda^{-1/2}\tilde C_{n,m}$

Definition 2. We define the subspace $\mathcal{H}_m$ and dictionary $I$ as $\gamma$-preserving w.r.t. the space $\mathcal{H}_n$ if
$$\Pi^{\perp}_m = \Pi_n - \Pi_m \preceq \frac{\gamma}{1-\varepsilon}\,(\Phi_n\Phi_n^T + \gamma\Pi_n)^{-1}. \qquad (4)$$

Notice that the inverse $(\Phi_n\Phi_n^T + \gamma\Pi_n)^{-1}$ on the right-hand side of the inequality is crucial to control the error $\|\Pi^{\perp}_m\phi_i\|^2 \lesssim \gamma\,\phi_i^T(\Phi_n\Phi_n^T + \gamma\Pi_n)^{-1}\phi_i$. In particular, since $\phi_i \in \mathrm{Im}(\Phi_n)$, in the worst case the error is bounded as $\phi_i^T(\Phi_n\Phi_n^T + \gamma\Pi_n)^{-1}\phi_i \le \phi_i^T(\phi_i\phi_i^T)^+\phi_i \le 1$. Conversely, since $\lambda_{\max}(\Phi_n\Phi_n^T) \le \kappa^2 n$, in the best case the error can be reduced up to $1/n \le \phi_i^T\phi_i/\lambda_{\max}(\Phi_n\Phi_n^T) \le \phi_i^T(\Phi_n\Phi_n^T + \gamma\Pi_n)^{-1}\phi_i$. Note that the directions associated with the larger eigenvalues are the ones that occur most frequently in the data. As a consequence, Definition 2 guarantees that the overall error across the whole dataset remains small. In particular, we can control the residual $\Pi^{\perp}_m\Phi_n$ after the projection as $\|\Pi^{\perp}_m\Phi_n\|^2 \le \gamma\|\Phi_n^T(\Phi_n\Phi_n^T + \gamma\Pi_n)^{-1}\Phi_n\| \le \gamma$.

To construct $\gamma$-preserving dictionaries we focus on a uniform random sampling approach [7]. Uniform sampling is historically the first [36], and usually the simplest, approach used to construct $I$. Leveraging results from the literature [7, 14, 25], we can show that uniformly sampling $\tilde O(n/\gamma)$ landmarks generates a $\gamma$-preserving dictionary with high probability.

Lemma 1. For a given $\gamma$, construct $I$ by uniformly sampling $m \ge \frac{12\kappa^2 n}{\gamma}\frac{\log(n/\delta)}{\varepsilon^2}$ landmarks from $D$. Then w.p. at least $1-\delta$ the dictionary $I$ is $\gamma$-preserving.

Musco and Musco [25] obtain a similar result, but instead of considering the operator $\Pi_n$ they focus on the finite-dimensional eigenvectors of $K_n$. Moreover, their $\Pi_n \preceq \Pi_m + \frac{\varepsilon\gamma}{1-\varepsilon}(\Phi_n\Phi_n^T)^+$ bound is weaker and would not be sufficient to satisfy our definition of $\gamma$-accuracy. A result equivalent to Lemma 1 was obtained by Alaoui and Mahoney [3], but they also focus only on the finite-dimensional eigenvectors of $K_n$, and did not investigate the implications for $\mathcal{H}$.

Proof sketch of Lemma 1. It is well known [7, 14] that uniformly sampling $O(\frac{n}{\gamma}\varepsilon^{-2}\log(n/\delta))$ points with replacement is sufficient to obtain, w.p. $1-\delta$, the following guarantee on $\Phi_m$:
$$(1-\varepsilon)\Phi_n\Phi_n^T - \varepsilon\gamma\Pi_n \preceq \frac{n}{m}\Phi_m\Phi_m^T \preceq (1+\varepsilon)\Phi_n\Phi_n^T + \varepsilon\gamma\Pi_n,$$
which implies
$$\Big(\frac{n}{m}\Phi_m\Phi_m^T + \gamma\Pi_n\Big)^{-1} \preceq \big((1-\varepsilon)\Phi_n\Phi_n^T - \varepsilon\gamma\Pi_n + \gamma\Pi_n\big)^{-1} = \frac{1}{1-\varepsilon}\big(\Phi_n\Phi_n^T + \gamma\Pi_n\big)^{-1}.$$
We can now rewrite $\Pi_n$ as
$$\Pi_n = \Big(\frac{n}{m}\Phi_m\Phi_m^T + \gamma\Pi_n\Big)\Big(\frac{n}{m}\Phi_m\Phi_m^T + \gamma\Pi_n\Big)^{-1} = \frac{n}{m}\Phi_m\Phi_m^T\Big(\frac{n}{m}\Phi_m\Phi_m^T + \gamma\Pi_n\Big)^{-1} + \gamma\Big(\frac{n}{m}\Phi_m\Phi_m^T + \gamma\Pi_n\Big)^{-1}$$
$$\preceq \Pi_m + \gamma\Big(\frac{n}{m}\Phi_m\Phi_m^T + \gamma\Pi_n\Big)^{-1} \preceq \Pi_m + \frac{\gamma}{1-\varepsilon}\big(\Phi_n\Phi_n^T + \gamma\Pi_n\big)^{-1},$$
which is exactly the condition of Definition 2.

In other words, using uniform sampling we can reduce the size of the search space $\mathcal{H}_m$ by a $1/\gamma$ factor (from $n$ to $m \simeq n/\gamma$) in exchange for a $\gamma$ additive error, resulting in a computation/approximation trade-off that is linear in $\gamma$.

4 Theoretical analysis

Exploiting the error bound for $\gamma$-preserving dictionaries, we are now ready for the main result of this paper: showing that we can improve the computational aspect of kernel k-means using Nyström embeddings, while maintaining optimal generalization guarantees.

Theorem 1.
Given a $\gamma$-preserving dictionary,
$$\mathcal{E}(C_{n,m}) = W(C_{n,m}, \mu) - W(C_n, \mu) \le O\Big(k\Big(\frac{1}{\sqrt{n}} + \frac{\gamma}{n}\Big)\Big).$$

From a statistical point of view, Theorem 1 shows that if $I$ is $\gamma$-preserving, the ERM in $\mathcal{H}_m$ achieves the same excess risk as the exact ERM from $\mathcal{H}_n$ up to an additional $\gamma/n$ error. Therefore, choosing $\gamma = \sqrt{n}$, the solution $C_{n,m}$ achieves the $O(k/\sqrt{n}) + O(k\sqrt{n}/n) \le O(k/\sqrt{n})$ generalization [10].

From a computational point of view, Lemma 1 shows that we can construct a $\sqrt{n}$-preserving dictionary simply by sampling $\tilde O(\sqrt{n})$ points uniformly (here $\tilde O$ hides logarithmic dependencies on $n$ and $m$), which greatly reduces the embedding size from $n$ to $\sqrt{n}$, and the total required space from $n^2$ to $\tilde O(n\sqrt{n})$. Time-wise, the bottleneck becomes the construction of the embeddings $\tilde\phi_i$, which takes $nm^2 \le \tilde O(n^2)$ time, while each iteration of Lloyd's algorithm requires only $nm \le \tilde O(n\sqrt{n})$ time. In the full generality of our setting this is practically optimal, since computing a $\sqrt{n}$-preserving dictionary is in general as hard as matrix multiplication [26, 9], which requires $\Omega(n^2)$ time. In other words, unlike the case of space complexity, there is no free lunch for time complexity, which in the worst case must scale as $n^2$, similarly to the exact case. Nonetheless, embedding the points is an embarrassingly parallel problem that can be easily distributed, while in practice it is usually the execution of Lloyd's algorithm that dominates the runtime.

Finally, when the dataset satisfies certain regularity conditions, the size of $I$ can be improved, which reduces both embedding and clustering runtime. Denote with $d^n_{\text{eff}}(\gamma) = \mathrm{Tr}\big(K_n(K_n + \gamma I_n)^{-1}\big)$ the so-called effective dimension [3] of $K_n$. Since $\mathrm{Tr}\big(K_n(K_n + \gamma I_n)^{-1}\big) \le \mathrm{Tr}\big(K_n (K_n)^+\big)$, we have that $d^n_{\text{eff}}(\gamma) \le r := \mathrm{Rank}(K_n)$, and therefore $d^n_{\text{eff}}(\gamma)$ can be seen as a soft version of the rank. When $d^n_{\text{eff}}(\gamma) \ll \sqrt{n}$, it is possible to construct a $\gamma$-preserving dictionary with only $d^n_{\text{eff}}(\gamma)$ landmarks in $\tilde O(n\, d^n_{\text{eff}}(\gamma)^2)$ time using specialized algorithms [14] (see Section 6). In this case, the embedding step would require only $\tilde O(n\, d^n_{\text{eff}}(\gamma)^2) \ll \tilde O(n^2)$, improving both time and space complexity.

Moreover, to the best of our knowledge, this is the first example of an unsupervised non-parametric problem where it is always possible (i.e., without assumptions on $\mu$) to preserve the optimal $O(1/\sqrt{n})$ risk rate while reducing the search from the whole space $\mathcal{H}$ to a smaller subspace $\mathcal{H}_m$.

Proof sketch of Theorem 1. We can separate the difference $W(C_{n,m}, \mu) - W(C_n, \mu)$ into a component that depends on how close $\mu$ is to $\mu_n$, bounded using Proposition 1, and a component $W(C_{n,m}, \mu_n) - W(C_n, \mu_n)$ that depends on the distance between $\mathcal{H}_n$ and $\mathcal{H}_m$.

Lemma 2. Given a $\gamma$-preserving dictionary,
$$W(C_{n,m}, \mu_n) - W(C_n, \mu_n) \le \frac{\min(k, d^n_{\text{eff}}(\gamma))}{1-\varepsilon}\,\frac{\gamma}{n}.$$

To show this, we can rewrite the objective as (see [17])
$$W(C_{n,m}, \mu_n) \le \frac{1}{n}\|\Phi_n - \Pi_m\Phi_n S_n\|_F^2 = \frac{1}{n}\mathrm{Tr}\big(\Phi_n^T\Phi_n - S_n\Phi_n^T\Pi_m\Phi_n S_n\big),$$
where $S_n \in \mathbb{R}^{n \times n}$ is a $k$-rank projection matrix associated with the exact clustering $C_n$.
Then, using Definition 2, we have $\Pi_m \succeq \Pi_n - \frac{\gamma}{1-\varepsilon}(\Phi_n\Phi_n^T + \gamma\Pi_n)^{-1}$, and we obtain an additive error bound
$$\frac{1}{n}\mathrm{Tr}\big(\Phi_n^T\Phi_n - S_n\Phi_n^T\Pi_m\Phi_n S_n\big) \le \frac{1}{n}\mathrm{Tr}\Big(\Phi_n^T\Phi_n - S_n\Phi_n^T\Phi_n S_n + \frac{\gamma}{1-\varepsilon}S_n\Phi_n^T(\Phi_n\Phi_n^T + \gamma\Pi_n)^{-1}\Phi_n S_n\Big)$$
$$= W(C_n, \mu_n) + \frac{\gamma}{(1-\varepsilon)n}\mathrm{Tr}\big(S_n\Phi_n^T(\Phi_n\Phi_n^T + \gamma\Pi_n)^{-1}\Phi_n S_n\big).$$
Since $\|\Phi_n^T(\Phi_n\Phi_n^T + \gamma\Pi_n)^{-1}\Phi_n\| \le 1$, $S_n$ is a projection matrix, and $\mathrm{Tr}(S_n) = k$, we have
$$\frac{\gamma}{(1-\varepsilon)n}\mathrm{Tr}\big(S_n\Phi_n^T(\Phi_n\Phi_n^T + \gamma\Pi_n)^{-1}\Phi_n S_n\big) \le \frac{\gamma}{(1-\varepsilon)n}\mathrm{Tr}(S_n S_n) = \frac{\gamma k}{(1-\varepsilon)n}.$$
Conversely, focusing on the matrix $\Phi_n^T(\Phi_n\Phi_n^T + \gamma\Pi_n)^{-1}\Phi_n \preceq \Pi_n$, we have
$$\frac{\gamma}{(1-\varepsilon)n}\mathrm{Tr}\big(S_n\Phi_n^T(\Phi_n\Phi_n^T + \gamma\Pi_n)^{-1}\Phi_n S_n\big) \le \frac{\gamma}{(1-\varepsilon)n}\mathrm{Tr}\big(\Phi_n^T(\Phi_n\Phi_n^T + \gamma\Pi_n)^{-1}\Phi_n\big) \le \frac{\gamma\, d^n_{\text{eff}}(\gamma)}{(1-\varepsilon)n}.$$
Since both bounds hold simultaneously, we can simply take the minimum to conclude the proof.

We now compare the theorem with previous work. Many approximate kernel k-means methods have been proposed over the years, and they can be roughly split into two groups.

Low-rank decomposition based methods try to directly simplify the optimization problem from Proposition 2, replacing the kernel matrix $K_n$ with an approximation $\tilde K_n$ that can be stored and manipulated more efficiently.
Among these methods we can mention partial decompositions [8], Nyström approximations based on uniform [36], k-means++ [27], or ridge leverage score (RLS) sampling [35, 25, 14], and random-feature approximations [6]. None of these optimization based methods focus on the underlying excess risk problem, and their analysis cannot easily be integrated into existing results, as the approximate minimum found has no clear interpretation as a statistical ERM.

Other works take the same embedding approach that we do, and directly replace the exact $\varphi(\cdot)$ with an approximate $\tilde\varphi(\cdot)$, such as Nyström embeddings [36], Gaussian projections [10], and again random-feature approximations [29]. Note that these approaches also result in an approximate $\tilde K_n$ that can be manipulated efficiently, but they are simpler to analyze theoretically. Unfortunately, no existing embedding based method can guarantee at the same time optimal excess risk rates and a reduction in the size of $\mathcal{H}_m$, and therefore a reduction in computational cost.

To the best of our knowledge, the only other result providing excess risk guarantees for approximate kernel k-means is Biau et al. [10], where the authors consider the excess risk of the ERM when the approximate $\mathcal{H}_m$ is obtained using Gaussian projections. Biau et al. [10] note that the feature map $\varphi(x) = \sum_{s=1}^{D}\psi_s(x)$ can be expressed using an expansion of basis functions $\psi_s(x)$, with $D$ very large or infinite. Given a matrix $P \in \mathbb{R}^{m \times D}$ where each entry is a standard Gaussian r.v., [10] propose the following $m$-dimensional approximate feature map: $\tilde\varphi(x) = P[\psi_1(x), \dots, \psi_D(x)] \in \mathbb{R}^m$. Using the Johnson–Lindenstrauss (JL) lemma [19], they show that if $m \ge \log(n)/\nu^2$, then a multiplicative error bound of the form $W(C_{n,m}, \mu_n) \le (1+\nu)W(C_n, \mu_n)$ holds.
Reformulating their bound, we obtain that $W(C_{n,m}, \mu_n) - W(C_n, \mu_n) \le \nu W(C_n, \mu_n) \le \nu\kappa^2$ and $\mathcal{E}(C_{n,m}) \le O(k/\sqrt{n} + \nu)$. Note that to obtain a bound comparable to Theorem 1, if we treat $k$ as a constant, we need to take $\nu = \gamma/n$, which results in $m \ge (n/\gamma)^2$. This is always worse than our $\tilde O(n/\gamma)$ result for uniform Nyström embeddings. In particular, in the $1/\sqrt{n}$ risk rate setting, Gaussian projections would require $\nu = 1/\sqrt{n}$, resulting in $m \ge n\log(n)$ random features, which would bring no improvement over computing $K_n$. Moreover, when $D$ is infinite, as is usually the case in the non-parametric setting, the JL projection is not explicitly computable in general, and Biau et al. [10] must assume the existence of a computational oracle capable of constructing $\tilde\varphi(\cdot)$. Finally, note that, under the hood, traditional embedding methods such as those based on the JL lemma usually provide only bounds of the form $\Pi_n - \Pi_m \preceq \gamma\Pi_n$, and an error $\|\Pi^{\perp}_m\phi_i\|^2 \le \gamma\|\phi_i\|^2$ (see the discussion of Definition 2). Therefore the error can be large along multiple directions, and the overall error $\|\Pi^{\perp}_m\Phi_n\|^2$ across the dictionary can be as large as $n\gamma$ rather than $\gamma$.

Recent work in RLS sampling has also focused on bounding the distance $W(C_{n,m}, \mu_n) - W(C_n, \mu_n)$ between empirical errors. Wang et al. [35] and Musco and Musco [25] provide multiplicative error bounds of the form $W(C_{n,m}, \mu_n) \le (1+\nu)W(C_n, \mu_n)$ for uniform and RLS sampling. Nonetheless, they only focus on the empirical risk, and do not investigate the interaction between approximation and generalization, i.e., between statistics and computation.
Moreover, as we already remarked for [10], to achieve the $1/\sqrt{n}$ excess risk rate using a multiplicative error bound we would require an unreasonably small $\nu$, resulting in a large $m$ that brings no computational improvement over the exact solution.

Finally, note that when [31] showed that a favourable trade-off was possible for kernel ridge regression (KRR), they strongly leveraged the fact that KRR is a $\gamma$-regularized problem: all eigenvalues and eigenvectors of the covariance matrix $\Phi_n\Phi_n^T$ smaller than the $\gamma$ regularization do not significantly influence the solution. Here we show the same for kernel k-means, a problem without regularization. This hints at a deeper geometric motivation which might be at the root of both problems, and potentially similar approaches could be leveraged in other domains.

4.1 Further results: beyond ERM

So far we provided guarantees for $C_{n,m}$, that is, the ERM in $\mathcal{H}_m$. Although $\mathcal{H}_m$ is much smaller than $\mathcal{H}_n$, solving the optimization problem to find the ERM is still NP-hard in general [4]. Nonetheless, Lloyd's algorithm [22], when coupled with a careful k-means++ seeding, can return a good approximate solution $C^{++}_{n,m}$.

Proposition 3 ([5]). For any dataset, $\mathbb{E}_A[W(C^{++}_{n,m}, \mu_n)] \le 8(\log(k) + 2)\,W(C_{n,m}, \mu_n)$, where $A$ is the randomness deriving from the k-means++ initialization.

Note that, similarly to [35, 25], this is a multiplicative error bound on the empirical risk, and as discussed we cannot leverage Lemma 2 to bound the excess risk $\mathcal{E}(C^{++}_{n,m})$. Nonetheless, we can still leverage Lemma 2 to bound the expected risk $W(C^{++}_{n,m}, \mu)$, albeit with an extra error term that scales with the optimal clustering risk $W^*(\mu)$ (see Proposition 1).

Theorem 2.
Given a $\gamma$-preserving dictionary,
$$\mathbb{E}_{D \sim \mu}\big[\mathbb{E}_A[W(C^{++}_{n,m}, \mu)]\big] \le O\Big(\log(k)\Big(\frac{k}{\sqrt{n}} + k\frac{\gamma}{n} + W^*(\mu)\Big)\Big).$$

From a statistical perspective, we can once again set $\gamma = \sqrt{n}$ to obtain a $O(k/\sqrt{n})$ rate for the first part of the bound. Conversely, the optimal clustering risk $W^*(\mu)$ is a $\mu$-dependent quantity that cannot in general be bounded in $n$, and captures how well our model, i.e., the choice of $\mathcal{H}$ and of the criterion $W(\cdot, \mu)$, matches reality.

From a computational perspective, we can now bound the computational cost of finding $C^{++}_{n,m}$. Each iteration of Lloyd's algorithm will take only $\tilde O(n\sqrt{n}k)$ time. Moreover, when k-means++ initialization is used, the expected number of iterations required for Lloyd's algorithm to converge is only logarithmic [1]. Therefore, ignoring the time required to embed the points, we can find a solution in $\tilde O(n\sqrt{n}k)$ time and space instead of the $\tilde O(n^2 k)$ cost required by the exact method, with a strong $O(\sqrt{n})$ improvement.

Finally, if the data distribution satisfies some regularity assumptions, the following result follows [15].

Corollary 1. If we denote by $X_\mu$ the support of the distribution $\mu$ and assume $\varphi(X_\mu)$ to be a $d$-dimensional manifold, then $W^*(\mu) \le dk^{-2/d}$, and given a $\sqrt{n}$-preserving dictionary the expected cost satisfies $\mathbb{E}_{D\sim\mu}\big[\mathbb{E}_A[W(C^{++}_{n,m}, \mu)]\big] \le O\big(\log(k)\big(\frac{k}{\sqrt{n}} + dk^{-2/d}\big)\big)$.

5 Experiments

We now evaluate experimentally the claims of Theorem 1, namely that sampling $\tilde O(n/\gamma)$ landmarks increases the excess risk by an extra $\gamma/n$ factor, and that $m = \sqrt{n}$ is sufficient to recover the optimal rate.
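The evaluation pipeline described next can be sketched with standard scikit-learn components. This is a minimal illustration on synthetic blobs standing in for MNIST; the dataset, kernel bandwidth, and cluster count are assumptions for the example, not the parameters of the actual experiments:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.kernel_approximation import Nystroem
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for MNIST: n training points plus a held-out test set.
X, _ = make_blobs(n_samples=2000, centers=10, n_features=20, random_state=0)
X_train, X_test = X[:1600], X[1600:]

n = len(X_train)
m = int(np.sqrt(n))  # sqrt(n) uniformly sampled Nystrom landmarks

# Nystrom embedding (uniform landmark sampling), then k-means++-seeded
# mini-batch k-means on the m-dimensional embeddings.
embed = Nystroem(kernel="rbf", gamma=0.1, n_components=m, random_state=0)
E_train = embed.fit_transform(X_train)
km = MiniBatchKMeans(n_clusters=10, init="k-means++", n_init=3, random_state=0)
km.fit(E_train)

# Empirical cost W(C, mu_n) on train, and its held-out proxy W(C, mu_test);
# score() returns the negative sum of squared distances to closest centroid.
W_train = -km.score(E_train) / len(E_train)
W_test = -km.score(embed.transform(X_test)) / len(X_test)
print(f"W(train) = {W_train:.4f}, W(test) = {W_test:.4f}")
```

Sweeping `n_components` and plotting `W_test` against $m$ reproduces the kind of cost-versus-dictionary-size curves shown in the figures below.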
We use the Nystroem and MiniBatchKMeans classes from the sklearn Python library [28] to implement kernel k-means with Nyström embedding (Algorithm 1), and we compute the solution C++_{n,m}.

For our experiments we follow the same approach as Wang et al. [35], and test our algorithm on two variants of the MNIST digit dataset. In particular, MNIST60K [20] is the original MNIST dataset, containing pictures each with d = 784 pixels. We divide each pixel by 255, bringing each feature into the [0, 1] interval. We split the dataset in two parts: n = 60000 samples are used to compute the C++_{n,m} centroids, and we leave out 10000 unseen samples to compute W(C++_{n,m}, μ_test) as a proxy for W(C++_{n,m}, μ). To test the scalability of our approach we also consider the MNIST8M dataset from the infinite MNIST project [23], constructed using non-trivial transformations and corruptions of the original MNIST60K images. Here we compute C++_{n,m} using n = 8000000 images, and compute W(C++_{n,m}, μ_test) on 100000 unseen images. As in Wang et al. [35] we use a Gaussian kernel with bandwidth $\sigma = \sqrt{(1/n^2)\sum_{i,j} \|x_i - x_j\|^2}$.

Figure 1: Results for MNIST60K. Figure 2: Results for MNIST8M.

MNIST60K: these experiments are small enough to run in less than a minute on a single laptop with 4 cores and 8GB of RAM. The results are reported in Fig. 1. On the left we report in blue W(C++_{n,m}, μ_test), where the shaded region is a 95% confidence interval for the mean over 10 runs. As predicted, the expected cost decreases as the size of H_m increases, and plateaus once we achieve 1/m ≃ 1/√n, in line with the statistical error. Note that the normalized mutual information (NMI) between the true [0–9] digit classes y and the computed cluster assignments y_{n,m} also plateaus around 1/√n.
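The embedding step of Algorithm 1 is implemented in our experiments with sklearn's Nystroem class; for illustration only, the same construction can be sketched directly in numpy (uniform landmarks, embedding $z_i = K_{mm}^{-1/2} k(L, x_i)$):

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def nystrom_embed(X, m, sigma, rng):
    # Sample m landmarks uniformly without replacement, then map every point
    # to an m-dimensional feature vector so that Z @ Z.T approximates K_n.
    L = X[rng.choice(X.shape[0], size=m, replace=False)]
    K_mm = gaussian_kernel(L, L, sigma)
    K_nm = gaussian_kernel(X, L, sigma)
    w, V = np.linalg.eigh(K_mm)                               # K_mm is PSD
    inv_sqrt = (V / np.sqrt(np.clip(w, 1e-12, None))) @ V.T   # K_mm^{-1/2}
    return K_nm @ inv_sqrt
```

Any Euclidean k-means solver (e.g. MiniBatchKMeans) run on the rows of Z then produces C++_{n,m}; with m = n the factorization Z Z^T recovers K_n exactly.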
The NMI plateau is not predicted by the theory, but it strengthens the intuition that beyond a certain capacity expanding H_m is computationally wasteful.

MNIST8M: to test the scalability of our approach, we run the same experiment on millions of points. Note that we carry out our MNIST8M experiment on a single 36-core machine with 128GB of RAM, much less than the setup of [35], where at minimum a cluster of 8 such nodes is used. The behaviour of W(C++_{n,m}, μ_test) and NMI is similar to MNIST60K, with the increase in dataset size allowing for stronger concentration and smaller confidence intervals. Finally, note that around m = 400 uniformly sampled landmarks are sufficient to achieve NMI(y_{n,m}, y) = 0.405, matching the 0.406 NMI reported by [35] for a larger m = 1600, although smaller than the 0.423 NMI they report for m = 1600 when using a slower, PCA-based method to compute the embeddings and RLS sampling to select the landmarks. Nonetheless, computing C++_{n,m} takes less than 6 minutes on a single machine, while their best solution required more than 1.5hr on a cluster of 32 machines.

6 Open questions and conclusions

Combining Lemma 1 and Lemma 2, we know that using uniform sampling we can linearly trade off a 1/γ decrease in subspace size m against a γ/n increase in excess risk. While this is sufficient to maintain the O(1/√n) rate, it is easy to see that the same would not hold for a O(1/n) rate, since we would need to uniformly sample n/1 = n landmarks, losing all computational improvements.

To achieve a better trade-off we must go beyond uniform sampling and use different probabilities for each sample, to capture their uniqueness and contribution to the approximation error.
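As an aside, the NMI score used in Section 5 to compare cluster assignments with the true digit classes can be computed directly from two label vectors. The sketch below normalizes mutual information by the mean of the two label entropies; normalization conventions vary across libraries, so this is illustrative rather than a re-implementation of sklearn's scorer:

```python
import numpy as np

def nmi(a, b):
    # Normalized mutual information: MI(a, b) / ((H(a) + H(b)) / 2).
    a, b = np.asarray(a), np.asarray(b)
    n = a.shape[0]
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    P = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(P, (ai, bi), 1.0 / n)            # joint label distribution
    pa, pb = P.sum(axis=1, keepdims=True), P.sum(axis=0, keepdims=True)
    nz = P > 0
    mi = np.sum(P[nz] * np.log(P[nz] / (pa @ pb)[nz]))
    ha = -np.sum(pa[pa > 0] * np.log(pa[pa > 0]))
    hb = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
    return mi / ((ha + hb) / 2) if ha + hb > 0 else 1.0
```

NMI is 1 for identical partitions (up to relabeling) and 0 for an uninformative constant assignment.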
The \u03b3-ridge leverage score (RLS) of point i \u2208 [n] is de\ufb01ned as\ni Kn(Kn + \u03b3In)\u22121ei.\n\nn + \u03b3\u03a0n)\u22121\u03c6i = eT\n\n\u03c4i(\u03b3) = \u03c6T\n\ni (\u03a6n\u03a6T\n\n(5)\n\n1\u2212\u03b5 \u03c6T\n\ni (\u03a6n\u03a6T\n\nm\u03c6i(cid:107)2 \u2264 \u03b3\n\u22a5\n\ni=1 \u03c4i(\u03b3) is the empirical effective dimension of the dataset.\n\nn + \u03b3\u03a0n)\u22121\u03c6i. It is easy to see that, up to a factor \u03b3\n\nThe sum of the RLSs dn\nRidge leverage scores are closely connected to the residual (cid:107)\u03a0\nm\u03c6i(cid:107)2 after the projection \u03a0m\n\u22a5\ndiscussed in De\ufb01nition 2. In particular, using Lemma 2 we have that the residual can be bounded as\n(cid:107)\u03a0\n1\u2212\u03b5, high-RLS points\nare also high-residual points. Therefore it is not surprising that sampling according to RLSs quickly\nselects any high-residual points and covers Hn, generating a \u03b3-preserving dictionary.\nLemma 3. [11] For a given \u03b3, construct I by sampling m \u2265 12\u03ba2dn\nfrom D proportionally to their RLS. Then w.p. at least 1 \u2212 \u03b4 the dictionary I is \u03b3-preserving.\nNote there exist datasets where the RLSs are uniform,and therefore in the worst case the two sampling\napproaches coincide. Nonetheless, when the data is more structured m (cid:39) dn\neff(\u03b3) can be much smaller\nthan the n/\u03b3 dictionary size required by uniform sampling.\nFinally, note that computing RLSs exactly also requires constructing Kn and O(n2) time and space,\nbut in recent years a number of fast approximate RLSs sampling methods [14] have emerged that can\neff(\u03b3)2) time. 
Using this result, it is trivial to sharpen the computational aspects of Theorem 1 in special cases. In particular, we can generate a √n-preserving dictionary with only d^n_eff(√n) elements instead of the √n required by uniform sampling. Using concentration arguments [31] we also know that w.h.p. the empirical effective dimension satisfies d^n_eff(γ) ≤ 3 d^μ_eff(γ), where the expected effective dimension d^μ_eff(γ) is a μ-dependent quantity that captures the interaction between μ and the RKHS H.

Definition 4. Given the expected covariance operator Ψ := E_{x∼μ}[φ(x)φ(x)^T], the expected effective dimension is defined as $d^{\mu}_{\mathrm{eff}}(\gamma) = \mathbb{E}_{x\sim\mu}\left[\phi(x)^{\mathsf{T}} (\Psi + \gamma \Pi)^{-1} \phi(x)\right]$. Moreover, for some constant c that depends only on φ(·) and μ, d^μ_eff(γ) ≤ c(n/γ)^η with 0 < η ≤ 1.

Note that η = 1 just gives us the d^μ_eff(γ) ≤ O(n/γ) worst-case upper bound that we saw for d^n_eff(γ), and it is always satisfied when the kernel function is bounded. If instead we have a faster spectral decay, η can be much smaller.
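The effect of spectral decay on the effective dimension is easy to see numerically. The toy computation below (our own illustration, not an experiment from the paper) compares a flat spectrum against a fast-decaying one of equal trace, in the γ = √n regime:

```python
import numpy as np

def effective_dimension(lam, gamma):
    # d_eff(gamma) = sum_i lam_i / (lam_i + gamma) for a given spectrum lam.
    lam = np.asarray(lam, dtype=float)
    return float(np.sum(lam / (lam + gamma)))

n = 10_000
gamma = np.sqrt(n)
flat = np.ones(n)                      # no decay (worst case), trace n
poly = np.arange(1.0, n + 1.0) ** -2.0
poly *= n / poly.sum()                 # polynomial decay, rescaled to trace n
d_flat = effective_dimension(flat, gamma)
d_poly = effective_dimension(poly, gamma)
```

Here d_flat ≈ n/γ = √n, while the decaying spectrum yields a d_eff almost an order of magnitude smaller, so an RLS-sampled dictionary can be correspondingly smaller than a uniform one.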
For example, if the eigenvalues of Ψ decay polynomially as λ_i = i^{−η}, then d^μ_eff(γ) ≤ c(n/γ)^η, and in our case γ = √n gives d^μ_eff(√n) ≤ c n^{η/2}.

We can now better characterize the gap between statistics and computation: using RLS sampling we can improve the computational aspect of Theorem 1 from √n to d^μ_eff(γ), but the risk rate remains O(k/√n) due to the O(k/√n) component coming from Proposition 1.

Assume for a second we could generalize, with additional assumptions, Proposition 1 to a faster O(1/n) rate. Then applying Lemma 2 with γ = 1 we would obtain a risk E(C_{n,m}) ≤ O(k/n) + O(k/n). Here we see how the regularity condition on d^μ_eff(1) becomes crucial. In particular, if η = 1, then we have d^μ_eff(1) ∼ n and no gain. If instead η < 1, we obtain d^μ_eff(1) ≤ n^η. Adaptive rates of this kind were shown to be possible in supervised learning [31], but they seem to still be out of reach for approximate kernel k-means.

One possible approach to fill this gap is to look at fast O(1/n) excess risk rates for kernel k-means.

Proposition 4 ([21], informal). Assume that k ≥ 2, and that μ satisfies a margin condition with radius r_0. If C_n is an empirical risk minimizer, then, with probability larger than 1 − e^{−δ},

$$\mathcal{E}(C_n) \leq \widetilde{O}\left(\frac{1}{r_0}\,\frac{(k + \log(|M|))\log(1/\delta)}{n}\right),$$

where |M| is the cardinality of the set of all optimal (up to a relabeling) clusterings.

For more details on the margin assumption, we refer the reader to the original paper [21]. Intuitively, the margin condition asks that every labeling (Voronoi grouping) associated with an optimal clustering is reflected by a large separation in μ. This margin condition also acts as a counterpart of the usual margin conditions for supervised learning, where μ must have lower density around the neighborhood of the critical area {x : μ(Y = 1 | X = x) = 1/2}.
Unfortunately, it is not easy to integrate Proposition 4 in our analysis, as it is not clear how the margin condition translates from H_n to H_m.

Acknowledgments. This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and the Italian Institute of Technology. We gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla K40 GPU used for this research. L. R. acknowledges the support of the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826. A. R. acknowledges the support of the European Research Council (grant SEQUOIA 724063).

References

[1] Nir Ailon, Ragesh Jaiswal, and Claire Monteleoni. Streaming k-means approximation. In Advances in Neural Information Processing Systems, pages 10–18, 2009.

[2] M. A. Aizerman, E. A. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

[3] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel methods with statistical guarantees. In Neural Information Processing Systems, 2015.

[4] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.

[5] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

[6] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh.
Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In Proceedings of the 34th International Conference on Machine Learning, 2017.

[7] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, 2013.

[8] Francis R. Bach and Michael I. Jordan. Predictive low-rank decomposition for kernel methods. In Proceedings of the 22nd International Conference on Machine Learning, pages 33–40. ACM, 2005.

[9] Arturs Backurs, Piotr Indyk, and Ludwig Schmidt. On the fine-grained complexity of empirical risk minimization: Kernel methods and neural networks. In Advances in Neural Information Processing Systems, 2017.

[10] Gérard Biau, Luc Devroye, and Gábor Lugosi. On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory, 54(2):781–790, 2008.

[11] Daniele Calandriello. Efficient Sequential Learning in Structured and Constrained Environments. PhD thesis, 2017.

[12] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Efficient second-order online kernel learning with adaptive embedding. In Advances in Neural Information Processing Systems, pages 6140–6150, 2017.

[13] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Second-order kernel online convex optimization with adaptive sketching. In International Conference on Machine Learning, 2017.

[14] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Distributed adaptive sampling for kernel matrix approximation. In AISTATS, 2017.

[15] Guillermo Canas, Tomaso Poggio, and Lorenzo Rosasco. Learning manifolds with k-means and k-flats. In Advances in Neural Information Processing Systems, pages 2465–2473, 2012.

[16] Radha Chitta, Rong Jin, Timothy C. Havens, and Anil K. Jain. Approximate kernel k-means: Solution to large scale kernel clustering.
In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 895–903. ACM, 2011.

[17] Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. A unified view of kernel k-means, spectral clustering and graph cuts. Technical Report No. UTCS TR-04-25, University of Texas at Austin, 2004.

[18] Siegfried Graf and Harald Luschgy. Foundations of Quantization for Probability Distributions. Lecture Notes in Mathematics. Springer-Verlag, Berlin Heidelberg, 2000. ISBN 978-3-540-67394-1.

[19] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 1984.

[20] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

[21] Clément Levrard. Nonasymptotic bounds for vector quantization in Hilbert spaces. The Annals of Statistics, 43(2):592–619, 2015.

[22] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[23] Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training invariant support vector machines using selective sampling. In Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA, 2007.

[24] Andreas Maurer and Massimiliano Pontil. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839–5846, 2010.

[25] Cameron Musco and Christopher Musco. Recursive sampling for the Nyström method. In NIPS, 2017.

[26] Cameron Musco and David Woodruff. Is input sparsity time possible for kernel low-rank approximation? In Advances in Neural Information Processing Systems 30, 2017.

[27] Dino Oglic and Thomas Gärtner. Nyström method with kernel k-means++ samples as landmarks. Journal of Machine Learning Research, 2017.

[28] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B.
Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[29] Ali Rahimi and Ben Recht. Random features for large-scale kernel machines. In Neural Information Processing Systems, 2007.

[30] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3218–3228, 2017.

[31] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657–1665, 2015.

[32] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[33] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

[34] Joel A. Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Fixed-rank approximation of a positive-semidefinite matrix from streaming data. In Advances in Neural Information Processing Systems, pages 1225–1234, 2017.