{"title": "Supervising Unsupervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4991, "page_last": 5001, "abstract": "We introduce a framework to transfer knowledge acquired from a repository of (heterogeneous) supervised datasets to new unsupervised datasets. Our perspective avoids the subjectivity inherent in unsupervised learning by reducing it to supervised learning, and provides a principled way to evaluate unsupervised algorithms. We demonstrate the versatility of our framework via rigorous agnostic bounds on a variety of unsupervised problems. In the context of clustering, our approach helps choose the number of clusters and the clustering algorithm, remove the outliers, and provably circumvent Kleinberg's impossibility result. Experiments across hundreds of problems demonstrate improvements in performance on unsupervised data with simple algorithms despite the fact our problems come from heterogeneous domains. Additionally, our framework lets us leverage deep networks to learn common features across many small datasets, and perform zero shot learning.", "full_text": "Supervising Unsupervised Learning\n\nVikas K. Garg\nCSAIL, MIT\n\nAdam Kalai\n\nMicrosoft Research\n\nvgarg@csail.mit.edu\n\nnoreply@microsoft.com\n\nAbstract\n\nWe introduce a framework to transfer knowledge acquired from a repository of\n(heterogeneous) supervised datasets to new unsupervised datasets. Our perspective\navoids the subjectivity inherent in unsupervised learning by reducing it to super-\nvised learning, and provides a principled way to evaluate unsupervised algorithms.\nWe demonstrate the versatility of our framework via rigorous agnostic bounds on a\nvariety of unsupervised problems. In the context of clustering, our approach helps\nchoose the number of clusters and the clustering algorithm, remove the outliers,\nand provably circumvent Kleinberg\u2019s impossibility result. 
Experiments across hun-\ndreds of problems demonstrate improvements in performance on unsupervised data\nwith simple algorithms despite the fact our problems come from heterogeneous\ndomains. Additionally, our framework lets us leverage deep networks to learn\ncommon features across many small datasets, and perform zero shot learning.\n\n1\n\nIntroduction\n\nUnsupervised Learning (UL) is an elusive branch of Machine Learning (ML), including problems\nsuch as clustering and manifold learning, that seeks to identify structure among unlabeled data. UL is\nnotoriously hard to evaluate and inherently unde\ufb01nable. To illustrate this point, we consider clustering\nthe points on the line in Figure 1. One can easily justify 2, 3, or 4 clusters. As Kleinberg argues [1], it\nis impossible to give an axiomatically consistent de\ufb01nition of the \u201cright\u201d clustering. However, now\nsuppose that one can access a bank of prior clustering problems, drawn from the same distribution as\nthe current problem at hand, but for which ground-truth labels are available. In this example, evidence\nmay favor two clusters since the unlabeled data closely resembles two of the three 1-dimensional\nclustering problems, and all the clusterings share the common property of roughly equal size clusters.\nGiven suf\ufb01ciently many problems in high dimensions, one can learn to extract features of the data\ncommon across problems to improve clustering.\nWe model UL problems as representative samples from a meta-distribution, and offer a solution\nusing an annotated collection of prior datasets. Speci\ufb01cally, we propose a meta-unsupervised-\nlearning (MUL) framework that, by considering a distribution over unsupervised problems, reduces\nUL to Supervised Learning (SL). 
Going beyond transfer learning, semi-supervised learning, and domain adaptation [2, 3, 4, 5, 6, 7, 8], where problems have the same dimensionality or at least the same type of data (text, images, etc.), our framework can be used to improve UL performance for problems of different representations and from different domains. While the apparent distinctions between the domains might make our view seem outlandish, they are merely an artifact of human perception or representation. Fundamentally, we may encode any kind of data as a program or a bit stream. Depending on considerations such as the amount of available data, one can choose to have a distribution over only the programs from a single domain such as images, and consider all other programs as a set of measure 0; or alternatively, as we do, consider a distribution over all programs.\nEmpirically, we train meta-algorithms on the repository of classification problems from openml.org that has a variety of datasets spanning domains such as NLP, computer vision, and bioinformatics. Ignoring labels, each dataset can be viewed as a UL problem. We take a data-driven approach to quantitatively evaluate and design new UL algorithms, and merge classic ideas like UL and domain adaptation in a simple way. This enables us to make UL problems, such as clustering, well-defined.\n\nFigure 1: In isolation (left), there is no basis to choose a clustering: even points on a line could be clustered into 2-4 clusters. A repository of clustering problems with ground-truth labels (right) can inform the choice amongst clusterings or even offer common features for richer data.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nOur model requires a loss function measuring the quality of a solution with respect to some conceptual ground truth. 
Note that we require the ground truth labels for a training repository but not test data.\nWe assume access to a repository of datasets annotated with ground truth and drawn from a meta-\ndistribution \u00b5 over problems, and that the given data X was drawn from this same distribution\n(though without labels). From this collection, one could learn which clustering algorithm works best,\nor, even better, which algorithm works best for which type of data. The same meta-approach could\nhelp in selecting how many clusters to have, which outliers to remove, etc.\nOur theoretical model is a meta-application of Agnostic Learning [9] where we treat each entire\nlabeled problem analogous to a training example in a supervised learning task. We show how one\ncan provably learn to perform UL as well as the best algorithm in certain classes of algorithms. In\nindependent work, Balcan et al [10] design near-optimal algorithms for NP-hard problems such as\nmeta-clustering. Our work also relates to supervised learning questions of meta-learning, sometimes\nreferred to as auto-ml [e.g. 11, 12], learning to learn [e.g. 13], Bayesian optimization [e.g. 14] and\nlifelong learning [e.g. 15, 16]. In the case of supervised learning, where accuracy is easy to evaluate,\nmeta-learning enables algorithms to achieve accuracy more quickly with less data.\nOur contributions. We make several fundamental contributions. First, we show how to adapt\nknowledge acquired from a repository of small datasets that come from different domains, to new\nunsupervised datasets. Our framework removes the subjectivity due to rules of thumb or educated\nguesses, and provides an objective evaluation methodology for UL. We introduce algorithms for\nvarious problems including choosing a clustering algorithm and the number of clusters, learning\ncommon features, and removing outliers in a principled way. 
Next, we add a fresh perspective to the debate triggered by Kleinberg on what characterizes the right clustering [1, 17, 18], by introducing the meta-scale-invariance property that learns a scale across clustering problems and provably makes good clustering possible. Finally, we use deep learning to automate learning of features across small problems from different domains and of very different natures. We show that these seemingly unrelated problems can be leveraged to gain improvements in the average performance across previously unseen UL datasets. In this way, we effectively unite many heterogeneous \u201csmall data\u201d into sufficient \u201cbig data\u201d to benefit from a single neural network. In principle, our approach may be combined with compression methods for deep nets [19, 20, 21] to learn under resource-constrained settings [22], where maintaining a separate deep net for each dataset is simply impracticable.\n\n2 Setting\n\nWe first define agnostic learning in general, and then define meta-learning as a special case where the examples are problems themselves. A learning task consists of a universe X, labels Y and a bounded loss function \u2113 : Y \u00d7 Y \u2192 [0, 1], where \u2113(y, z) is the loss of predicting z when the true label is y. A learner L : (X \u00d7 Y)^* \u2192 Y^X takes a training set T = {(x_1, y_1), . . . , (x_n, y_n)} consisting of a finite number of iid samples from a distribution \u00b5 over X \u00d7 Y and outputs a classifier L(T) \u2208 Y^X, where Y^X is the set of functions from X to Y. The loss of a classifier c \u2208 Y^X is \u2113_\u00b5(c) = E_{(x,y)\u223c\u00b5}[\u2113(y, c(x))], and the expected loss of L is \u2113_\u00b5(L) = E_{T\u223c\u00b5^n}[\u2113_\u00b5(L(T))]. Learning is with respect to a concept class C \u2286 Y^X.\nDefinition 1 (Agnostic learning of C). 
For countable sets X, Y (see footnote 1) and \u2113 : Y \u00d7 Y \u2192 [0, 1], learner L agnostically learns C \u2286 Y^X if there exists a polynomial p such that for any distribution \u00b5 over X \u00d7 Y and for any n \u2265 p(1/\u03b5, 1/\u03b4),\n\nPr_{T\u223c\u00b5^n}[ \u2113_\u00b5(L(T)) \u2264 min_{c\u2208C} \u2113_\u00b5(c) + \u03b5 ] \u2265 1 \u2212 \u03b4.\n\nFurther, L and the classifier L(T) must run in time polynomial in the length of their inputs.\n\nPAC learning refers to the special case when \u00b5 is additionally assumed to satisfy min_{c\u2208C} \u2113_\u00b5(c) = 0. MUL, which is the focus of this paper, simply refers to the case where \u00b5 is a meta-distribution over datasets X \u2208 X and ground truth labelings Y \u2208 Y. We use capital letters to represent datasets as opposed to individual examples. A meta-classifier c is a UL algorithm that takes an entire dataset X as input and produces an output Z \u2208 Y, such as a clustering. As mentioned, true labels need only be observed for the training datasets; we may never observe the true labels of any problem encountered after deployment (see footnote 2). For a finite set S, let \u03a0(S) denote the set of clusterings or disjoint partitions of S into two or more sets, e.g., \u03a0({1, 2, 3}) includes {{1}, {2, 3}}, i.e., the partition into clusters {1} and {2, 3}. For a clustering C, denote by \u222aC = \u222a_{S\u2208C} S the set of points clustered. Given two clusterings Y, Z \u2208 \u03a0(S), the Rand Index RI(Y, Z) measures the fraction of pairs of points on which they agree. Adjusted Rand Index (ARI) is a refined measure that attempts to correct RI by accounting for chance agreement [23]. We denote by ARI(Y, Z) the adjusted Rand index between two clusterings Y and Z. 
We abuse notation and also write ARI(Y, Z) when Y is a vector of class labels, by converting it to a clustering with one cluster for each class label. We define the loss to be the fraction of pairs of points on which the clusterings disagree, assuming they are on the same set of points. If, for any reason, the clusterings are not on the same set, the loss is defined to be 1, i.e.,\n\n\u2113(Y, Z) = 1 \u2212 RI(Y, Z) if \u222aY = \u222aZ, and 1 otherwise. (1)\n\nIn Euclidean clustering, the points are Euclidean, so each dataset X \u2282 R^d for some d \u2265 1. In meta-Euclidean-clustering, we instead aim to learn a clustering algorithm from several different training clustering problems (of potentially different dimensionalities d). Note that (Adjusted) Rand Index measures clustering quality with respect to an extrinsic ground truth. In many cases, such a ground truth is unavailable, and an intrinsic metric is useful. Such is the case, e.g., when choosing the number of clusters. We can compare and select from different clusterings of size k = 2, 3, . . . using the standard method of Silhouette score [24], which is defined for a Euclidean clustering as\n\nsil(C) = (1 / |\u222aC|) \u2211_{x\u2208\u222aC} (b(x) \u2212 a(x)) / max{a(x), b(x)}, (2)\n\nwhere a(x) denotes the average distance between point x and other points in its own cluster and b(x) denotes the average distance between x and points in the closest alternative cluster.\n\n3 Meta-unsupervised problems\n\nThe simplest approach to MUL is Empirical Risk Minimization (ERM), namely choosing from a family U any unsupervised algorithm with lowest empirical error on training set T, which we write as ERM_U(T). The following lemma implies a logarithmic dependence on |U| and helps us solve several interesting MUL problems.\nLemma 1. 
For any finite family U of UL algorithms, any distribution \u00b5 over problems (X, Y) \u2208 X \u00d7 Y, and any n \u2265 1, \u03b4 > 0,\n\nPr_{T\u223c\u00b5^n}[ \u2113_\u00b5(ERM_U(T)) \u2264 min_{U\u2208U} \u2113_\u00b5(U) + \u221a((2/n) log(|U|/\u03b4)) ] \u2265 1 \u2212 \u03b4,\n\nwhere ERM_U(T) \u2208 arg min_{U\u2208U} \u2211_{(X,Y)\u2208T} \u2113(Y, U(X)) is any empirical loss minimizer over U \u2208 U.\n\nFootnote 1: For simplicity of presentation, we assume that these sets are countable, but with appropriate measure theoretic assumptions the analysis in this paper can be extended to the infinite case.\n\nFootnote 2: This differs from, say, online learning, where it is assumed that for each example, the ground truth is revealed once prediction is made.\n\nProof. Fix U_0 \u2208 arg min_{U\u2208U} \u2113_\u00b5(U). Let\n\n\u03b5 = 2 \u221a((log(1/\u03b4) + log |U|) / (2n)).\n\nInvoking the Chernoff bound, we have\n\nPr_{T\u223c\u00b5^n}[ (1/n) \u2211_{(X,Y)\u2208T} \u2113(Y, U_0(X)) \u2265 \u2113_\u00b5(U_0) + \u03b5/2 ] \u2264 e^{\u22122n(\u03b5/2)^2}.\n\nDefine S = {U \u2208 U | \u2113_\u00b5(U) \u2265 \u2113_\u00b5(U_0) + \u03b5}. Applying the Chernoff bound, we have for each U \u2208 S\n\nPr_{T\u223c\u00b5^n}[ (1/n) \u2211_{(X,Y)\u2208T} \u2113(Y, U(X)) \u2264 \u2113_\u00b5(U) \u2212 \u03b5/2 ] \u2264 e^{\u22122n(\u03b5/2)^2}.\n\nIn order for \u2113_\u00b5(ERM_U) \u2265 min_{U\u2208U} \u2113_\u00b5(U) + \u03b5 to happen, either some U \u2208 S must have empirical error at most \u2113_\u00b5(U) \u2212 \u03b5/2 or the empirical error of U_0 must be at least \u2113_\u00b5(U_0) + \u03b5/2. By the union bound, this happens with probability at most |U| e^{\u22122n(\u03b5/2)^2} = \u03b4.\n\nSelecting the clustering algorithm/number of clusters. 
Instead of the ad hoc parameter selection heuristics currently used in UL, MUL provides a principled data-driven alternative. Suppose one has m candidate clustering algorithms, or parameter settings C_1(X), . . . , C_m(X) for each data set X. These may be derived from m different clustering algorithms, or alternatively, they could represent the meta-k problem, i.e., how many clusters to choose from a single clustering algorithm where parameter k \u2208 {2, . . . , m + 1} determines the number of clusters. In this section, we show that choosing the right algorithm is essentially a multi-class classification problem given any set of problem meta-data features and cluster-specific features. Trivially, Lemma 1 implies that with O(log m) training problem sets one can select the C_j that performs best across problems. For meta-k, however, this would mean choosing the same number of clusters to use across all problems, analogous to choosing the best single class for multi-class classification. To learn to choose the best C_j on a problem-by-problem basis, suppose we have problem features \u03c6(X) \u2208 \u03a6 such as number of dimensions, number of points, domain (text/vision/etc.), and cluster features \u03b3(C_j(X)) \u2208 \u0393 that might include number of clusters, mean distance to cluster center, and Silhouette score (eq. 2). Suppose we also have a family F of functions f : \u03a6 \u00d7 \u0393^m \u2192 {1, 2, . . . , m}, which selects the clustering based on features (any multi-class classification family may be used for this purpose):\n\narg min_{f\u2208F} \u2211_i \u2113(Y_i, C_{f(\u03c6(X_i), \u03b3(C_1(X_i)), ..., \u03b3(C_m(X_i)))}(X_i)).\n\nThe above ERM_F is a reduction from the problem of selecting C_j from X to the problem of multi-class classification based on features \u03c6(X) and \u03b3(C_1(X)), . . . , \u03b3(C_m(X)) and loss as defined in eq. (1). 
As long as F can be parametrized by a fixed number of b-bit numbers, the ERM approach of choosing the \u201cbest\u201d f will be statistically efficient. If ERM_F cannot be computed exactly within the time constraints, an approximate minimizer may be used.\nFitting the threshold in single-linkage clustering. To illustrate a concrete efficient algorithm, consider choosing the threshold parameter of a single linkage clustering algorithm. Fix the set of possible vertices V. Take X to consist of undirected weighted graphs X = (V, E, W) with vertices V \u2286 V, edges E \u2286 {{u, v} | u, v \u2208 V} and non-negative weights W : E \u2192 R_+. The loss on clusterings Y = \u03a0(V) is again as defined in Eq. (1). Note that Euclidean data could be transformed into the complete graph, e.g., with W({x, x'}) = \u2016x \u2212 x'\u2016. Single-linkage clustering with parameter r \u2265 0, C_r(V, E, W), partitions the data such that u, v \u2208 V are in the same cluster if and only if there is a path from u to v such that the weight on each edge in the path is at most r. For generalization bounds, we simply assume that numbers are represented with a constant number of bits. The loss is defined as in (1). We have the following result.\nTheorem 1. The class {C_r | r > 0} of single-linkage algorithms with threshold r, where numbers are represented using b bits, can be agnostically learned. In particular, a quasilinear time algorithm achieves error \u2264 min_r \u2113_\u00b5(C_r) + \u221a(2(b + log(1/\u03b4))/n), with prob. \u2265 1 \u2212 \u03b4 over n training problems.\n\nProof. For generalization, we assume that numbers are represented using at most b bits. By Lemma 1, we see that with n training graphs and |{C_r}| \u2264 2^b, we have that with probability \u2265 1 \u2212 \u03b4, the error of ERM is within \u221a(2(b + log(1/\u03b4))/n) of min_r \u2113_\u00b5(C_r). 
It remains to show how one can find the best single-linkage parameter in quasilinear time. It is trivial to see that one can find the best cutoff for r in polynomial time: for each weight r in the set of edge weights across all graphs, compute the mean loss of C_r across the training set. Since C_r runs in polynomial time, loss can be computed in polynomial time, and the number of different possible cutoffs is bounded by the number of edge weights, which is polynomial in the input size. Thus the entire procedure takes polynomial time.\nFor a quasilinear time algorithm (in the input size |T| = \u0398(\u2211_i |V_i|^2)), we run Kruskal\u2019s algorithm on the union graph of all the graphs in the training set (i.e., the number of nodes and edges are the sum of the number of nodes and edges in the training graphs, respectively). As Kruskal\u2019s algorithm adds each new edge to its forest (in order of non-decreasing edge weight), effectively two clusters in some training graph (V_i, E_i, W_i) have been merged. The change in loss of the resulting clustering can be computed from the loss of the previous clustering in time proportional to the product of the sizes of the two clusters that are being merged, since these are the only entities on which the clusterings, and thus the losses, differ. Na\u00efvely, this may seem to take O(\u2211_i |V_i|^3). However, note that each pair of nodes begins separately and is updated, exactly once during the course of the algorithm, to be in the same cluster. 
Hence, the total number of updates is O(\u2211_i |V_i|^2). Since Kruskal\u2019s algorithm is quasilinear time itself, we deduce that the entire algorithm is quasilinear.\nFor correctness, it is easy to see that as the algorithm runs, C_r has been computed for each possible r at the step just preceding when Kruskal adds the first edge whose weight is greater than r.\n\nOutlier removal. For simplicity, we consider learning a single hyperparameter pertaining to the fraction of examples, furthest from the mean, to remove. In particular, suppose training problems are classification instances, i.e., X_i \u2208 R^{d_i\u00d7m_i} and Y_i \u2208 {1, 2, . . . , k_i}^{m_i}. To be concrete, suppose one is using algorithm C which is, say, K-means clustering. Choosing the parameter \u03b8 which is the fraction of outliers to ignore during fitting, one might define C_\u03b8 with parameter \u03b8 \u2208 [0, 1) on data x_1, . . . , x_n \u2208 R^d as follows: (a) compute the data mean \u00b5 = (1/n) \u2211_i x_i, (b) set aside as outliers the \u03b8 fraction of examples where x_i is furthest from \u00b5 in Euclidean distance, (c) cluster the data with outliers removed using C, and (d) assign each outlier to the nearest cluster center. We can trivially choose the best \u03b8 so as to optimize performance. With a single b-bit parameter \u03b8, Lemma 1 implies that this choice of \u03b8 will give a loss within \u221a(2(b + log(1/\u03b4))/n) of the optimal \u03b8, with probability at least 1 \u2212 \u03b4 over the sample of datasets. The number of \u03b8\u2019s that need to be considered is at most the total number of inputs across problems, so the algorithm runs in polynomial time.\nThe meta-outlier-removal procedure can be extended directly to remove a different proportion of outliers from each test dataset. 
Conceptually, one may view the procedure as assigning weights to points in the test set of size z, where each of the r outlier points (according to the learned threshold \u03b8) is assigned weight 0, and all the other points share an equal weight 1/(z \u2212 r). One could relax this hard assignment to instead have a distribution that re-adjusts the weights to assign a low but non-zero mass on the r outlier points. Then, these weights act as a prior in, e.g., the penalized and weighted K-means algorithm [25] that groups the points into clusters and a set of noisy or outlier points.\nProblem recycling. For this model, suppose that each problem belongs to a set of common problem categories, e.g., digit recognition, sentiment analysis, image classification amongst the thousands of classes of ImageNet [26], etc. The idea is that one can recycle the solution to one version of the problem in a later incarnation. For instance, suppose that one trained a digit recognizer on a previous problem. For a new problem, the input may be encoded differently (e.g., different image size, different pixel ordering, different color representation), but there is a transformation T that maps this problem into the same latent space as the previous problem so that the prior solution can be re-used. In particular, for each problem category i = 1, 2, . . . , N, there is a latent problem space \u039b_i and a solver S_i : \u039b_i \u2192 Y_i. Each problem X, Y of this category can be transformed to T(X) \u2208 \u039b_i with low solution loss \u2113(Y, S_i(T(X))). In addition to the solvers, one also requires a meta-classifier M : X \u2192 {1, 2, . . . , N} that, for a problem X, identifies which solver i = M(X) to use. Finally, one has transformers T_i : M^{-1}(i) \u2192 \u039b_i that map any X such that M(X) = i into latent space \u039b_i. The output of the meta-classifier is simply S_{M(X)}(T_{M(X)}(X)). 
Lemma 1 implies that if one can optimize over meta-classifiers and the parameters of the meta-classifier are represented by D b-bit numbers, then one achieves loss within \u03b5 of the best meta-classifier with m = O(Db/\u03b5^2) problems.\n\n4 The possibility of meta-clustering\n\nIn this section, we point out how the framing of meta-clustering circumvents Kleinberg\u2019s impossibility theorem [1] for clustering. To review, [1] considers clustering finite sets of points X endowed with symmetric distance functions d \u2208 D(X), where the set of valid distance functions is:\n\nD(X) = {d : X \u00d7 X \u2192 R | \u2200x, x' \u2208 X, d(x, x') = d(x', x) \u2265 0, d(x, x') = 0 iff x = x'}. (3)\n\nA clustering algorithm A takes a distance function d \u2208 D(X) and returns a partition A(d) \u2208 \u03a0(X). Kleinberg defines an axiomatic framework with the following three desirable properties, and proves no clustering algorithm A can satisfy all of these properties. (Scale-Invariance) For any distance function d and any \u03b1 > 0, A(d) = A(\u03b1 \u00b7 d), where \u03b1 \u00b7 d is the distance function d scaled by \u03b1. That is, the clustering should not change if the problem is scaled by a constant factor. (Richness) For any finite X and clustering C \u2208 \u03a0(X), there exists d \u2208 D(X) such that A(d) = C. Richness implies that for any partition there is an arrangement of points where that partition is the correct clustering. (Consistency) Let d, d' \u2208 D(X) such that A(d) = C, and for all x, x' \u2208 X, if x, x' are in the same cluster in C then d'(x, x') \u2264 d(x, x') while if x, x' are in different clusters in C then d'(x, x') \u2265 d(x, x'). The axiom demands A(d') = A(d). 
That is, clusters should not change if the points within any cluster are pulled closer, and those in different clusters are pushed farther apart.\nFor intuition, consider clustering two points where there is a single distance. Should they be in a single cluster or two clusters? By richness, there must be some distances \u03b4_1, \u03b4_2 > 0 such that if d(x_1, x_2) = \u03b4_1 then they are in the same cluster while if d(x_1, x_2) = \u03b4_2 they are in different clusters. This, however, violates scale-invariance, since the problems are at a scale \u03b1 = \u03b4_2/\u03b4_1 of each other.\nWe show a natural meta-version of the axioms is satisfied by a simple meta-single-linkage clustering algorithm. The main insight is that prior problems can be used to define a scale in the meta-clustering framework. Suppose we define the clustering problem with respect to a non-empty training set of clustering problems. So a meta-clustering algorithm M(d_1, C_1, . . . , d_t, C_t) = A takes t \u2265 1 training clustering problems with their ground-truth clusterings (on corresponding sets X_i, i.e., d_i \u2208 D(X_i) and C_i \u2208 \u03a0(X_i)) and outputs a clustering algorithm A. We can use these training clusterings to establish a scale. In particular, we will show a meta-clustering algorithm whose output A always satisfies richness and consistency, and which satisfies the following variant of scale-invariance.\nMeta-scale-invariance (MSI). Fix any distance functions d_1, d_2, . . . , d_t and ground truth clusterings C_1, . . . , C_t on sets X_1, . . . , X_t. For any \u03b1 > 0, and any distance function d, if M(d_1, C_1, . . . , d_t, C_t) = A and M(\u03b1 \u00b7 d_1, C_1, . . . , \u03b1 \u00b7 d_t, C_t) = A', then A(d) = A'(\u03b1 \u00b7 d).\nTheorem 2. There exists a meta-clustering algorithm that satisfies meta-scale-invariance and whose output always satisfies richness and consistency.\n\nProof. 
There are a number of such clustering algorithms, but for simplicity we create one based\non single-linkage clustering. Single-linkage clustering satis\ufb01es richness and consistency (see [1],\nTheorem 2.2). The question is how to choose its single-linkage parameter. With meta-clustering, the\nscale can be established using training data. One can choose it to be the minimum distance between\nany two points in different clusters across all training problems. It is easy to see that if one scales\nthe training problems and d by the same factor \u03b1, the clusterings remain unchanged, and hence the\nmeta-clustering algorithm satis\ufb01es meta-scale-invariance.\n\n5 Experiments\n\nWe conducted several experiments to substantiate the ef\ufb01cacy of the proposed framework un-\nder various unsupervised settings. We downloaded all classi\ufb01cation datasets from OpenML\n(http://www.openml.org) that had at most 10,000 instances, 500 features, 10 classes, and no missing\ndata to obtain a corpus of 339 datasets. We now describe in detail the results of our experiments.\nSelecting the number of clusters. For the purposes of this section, we \ufb01x the clustering algorithm\nto be K-means and compare two approaches to choosing the number of clusters k, from k = 2 to 10.\nMore generally, one could vary k on, for instance, a logarithmic scale, or a combination of different\nscales. First, we consider a standard heuristic for the baseline choice of k: for each cluster size k\nand each dataset, we generate 10 clusterings from different random starts for K-means and take one\nwith best Silhouette score among the 10. 
Then, over the 9 different values of k, we choose the one with greatest Silhouette score so that the resulting clustering is the one of greatest Silhouette score among all 90.\n\n[Figure 2 panels: (left) Selecting the number of clusters, Average ARI versus training fraction; (right) Best-k Root Mean Square Error, Average RMSE versus training fraction. Curves in both panels: Silhouette, Meta.]\n\nFigure 2: (Left) Average ARI scores of the meta-algorithm and the baseline for choosing the number of clusters, versus the fraction of problems used for training. We observe that the best k predicted by the meta approach registered a much higher ARI than that by Silhouette score maximizer. (Right) The meta-algorithm achieved a much lower root-mean-square error than the Silhouette maximizer.\n\nIn our meta-k approach, the meta-algorithm outputs k\u0302 as a function of Silhouette score and k by outputting the k\u0302 with greatest estimated ARI. We evaluate on the same 90 clusterings for the 339 datasets as the baseline. To estimate ARI in this experiment, we used simple least-squares linear regression. In particular, for each k \u2208 {2, . . . , 10}, we fit ARI as a linear function of Silhouette scores using all the data from the meta-training set in the partition pertaining to k: each dataset in the meta-training set provided 10 target values, corresponding to different runs where the number of clusters was fixed to k. As is standard, we define the best-fit k_i^* for dataset i to be the one that yielded maximum ARI score across the different runs, which is often different from k_i, the number of clusters in the ground truth (i.e., the number of class labels). 
We held out a fraction of the problems for\ntest and used the remaining for training. We evaluated two quantities of interest: the ARI and the\nroot-mean-square error (RMSE) between \u02c6k and k\u2217. The meta-algorithm performed better than the\nbaseline on both quantities (Fig. 2) (we performed 1000 splits to compute the con\ufb01dence intervals).\nSelecting the clustering algorithm. We consider the question which of given clustering algorithms\nto use to cluster a given data set. We illustrate the main ideas with k = 2 clusters. We run each\nof the algorithms on the repository and see which algorithm has the lowest average error. Error is\ncalculated with respect to the ground truth labels by ARI (see Section 2). We compare algorithms on\nthe 250 binary classi\ufb01cation datasets with at most 2000 instances. The baselines are chosen to be\n\ufb01ve clustering algorithms from scikit-learn [27]: K-Means, Spectral, Agglomerative Single Linkage,\nComplete Linkage, and Ward, together with a second version of each in which each attribute is\nnormalized to have zero mean and unit variance. Each algorithm is run with the default scikit-learn\nparameters. We implement the algorithm selection approach of Section 3, learning to choose a\ndifferent algorithm for each problem based on problem and cluster-speci\ufb01c features. Given clustering\n\u03a0 of X \u2208 Rd\u00d7m, the feature vector \u03a6(X, \u03a0) consists of the dimensionality, number of examples,\nminimum and maximum eigenvalues of covariance matrix, and silhouette score of the clustering \u03a0:\n\n\u03a6(X, \u03a0) = (d, m, \u03c3min(\u03a3(X)), \u03c3max(\u03a3(X)), sil(\u03a0)) ,\n\nwhere \u03a3(X) denotes the covariance matrix of X, and \u03c3min(M ) and \u03c3max(M ) denote the minimum\nand maximum eigenvalues, respectively, of matrix M. 
Instead of choosing the clustering with best\nSilhouette score, which is a standard approach, the meta-clustering algorithm effectively learns terms\nthat can correct for over- or under-estimates, e.g., learning for which problems the Silhouette heuristic\ntends to produce too many clusters. To choose which of the ten clustering algorithms to use on each problem,\nwe fit ten estimators of accuracy by ARI based on these features. That is, for each clustering algorithm\nC_j, we fit ARI(Y_i, C_j(X_i)) from features \u03a6(X_i, C_j(X_i)) \u2208 R^5 over problems (X_i, Y_i) using \u03bd-SVR\nregression, with default parameters as implemented by scikit-learn. Call this estimator a\u0302_j(X, C_j(X)).\nTo cluster a new dataset X \u2208 R^{d\u00d7m}, the meta-algorithm then chooses C_j(X) for the j with greatest\naccuracy estimate a\u0302_j(X, C_j(X)). The 250 problems were divided into train and test sets of varying\nsizes. The results, shown in Figure 3, demonstrate two interesting features. First, one can see that the\ndifferent baseline clustering algorithms had very different average performances, suggesting that a\nprincipled approach like ours to select algorithms can make a difference. Further, Figure 3 shows that\nthe meta-algorithm, given sufficiently many training problems, is able to outperform, on average, all\nthe baseline algorithms despite the fact the problems come from multiple heterogeneous domains.\n\n[Figure 3 panels: \u201cPerformance of different algorithms\u201d (Adjusted Rand Index (ARI) versus training fraction; series: Meta, KMeans, KMeans-N, Ward, Ward-N, Average, Average-N, Complete, Complete-N, Spectral, Spectral-N) and \u201cPerformance with outlier removal\u201d (average Adjusted Rand Index versus training fraction; series: 0% to 5% of instances removed).]\n\nFigure 3: (Left) ARI scores of different clustering (k=2) algorithms on OpenML binary classification\nproblems. The meta algorithm (95% confidence intervals are shown) is compared with standard\nclustering baselines on both the original data as well as the normalized data (denoted by \u201c-N\u201d). The\nmeta-algorithm, given sufficient training problems, is able to outperform the best baseline algorithms\nby over 5%. Note that ARI is a strong measure of performance since it accounts for chance, and may\neven be negative. Moreover, the best possible value of ARI, i.e., that obtained by running the optimal algorithm\nfor each dataset, turned out to be about 0.16, which makes this improvement impressive. (Right)\nOutlier removal results are even better. Removing 1% of the instances as outliers improves the ARI\nscore considerably. Interestingly, going beyond 1% decreases the performance to that without outlier\nremoval. Our algorithm naturally figures out the right proportion (1%) in a principled way.\n\nRemoving outliers. We also experimented to see if removing outliers improved average performance\non the same 339 classification problems. Our objective was to choose a single best fraction to\nremove from all the meta-test sets. For each data set X, we removed a p \u2208 {0, 0.01, 0.02, . . . , 0.05}\nfraction of examples with the highest Euclidean norm in X as outliers, and likewise for each meta-test\nset in the partition. We first clustered the data without outliers, and obtained the corresponding\nSilhouette scores. We then put back the outliers by assigning them to their nearest cluster center, and\ncomputed the ARI score thereof. 
Then, following an identical procedure to the meta-k algorithm of\nSection 5, we fitted regression models for ARI corresponding to complete data using the silhouette\nscores on pruned data, and measured the effect of outlier removal in terms of the true average ARI\n(corresponding to the best predicted ARI) over the entire data. Again, we report the results averaged\nover 10 independent train/test partitions. As Fig. 3 shows, by treating 1% of the instances in each\ndataset as outliers, we achieved remarkable improvement in ARI scores relative to clustering with all\nthe data as in Section 5. As the fraction of data deemed outlier was increased beyond 1%, however,\nperformance degraded. Clearly, we can learn what fraction to remove based on data, and improve the\nperformance considerably even with such a simple algorithm.\nDeep learning binary similarity function. In this section, we consider a new unsupervised problem\nof learning a binary similarity function (BSF) that predicts whether two examples from a given\nproblem should belong to the same cluster (i.e., have the same class label). Formally, a problem is\nspecified by a set X of data and meta-features \u03c6. The goal is to learn a classifier f(x, x\u2032, \u03c6) \u2208 {0, 1}\nthat takes two examples x, x\u2032 \u2208 X and the corresponding problem meta-features \u03c6, and predicts 1 if\nthe input pair would belong to the same cluster (or have the same class labels). In our experiments,\nwe take Euclidean data X \u2286 R^d (each problem may have different dimensionality d), and the\nmeta-features \u03c6 = \u03a3(X) consist of the covariance matrix of the unlabeled data. We restricted our\nexperiments to the 146 datasets with at most 1000 examples and 10 features, and formed disjoint\nmeta-training and meta-test sets by randomly sampling pairs of examples from each dataset. Each\nresulting feature vector comprised 55 covariance features (a symmetric 10 \u00d7 10 covariance matrix has 55 distinct entries) and 20 features from its pair (10 per example). 
We\nassigned a label 1 to pairs formed by combining examples belonging to the same class, and 0 to those\nresulting from different classes. Each dataset in the first category was used to sample data pairs\nfor both the meta-training and the meta-internal test (meta-IT) datasets, while the second category\ndid not contribute any training data and was used exclusively to generate the meta-external test\n(meta-ET) dataset. Our procedure ensured that the meta-training and the meta-IT data were disjoint,\nand resulted in 10 separate (meta-training, meta-IT, meta-ET) triplets. Thus, we\neffectively turn small data into big data by combining examples from several small problems. The\ndetails of our sampling procedure, network architecture, and training are given in the Supplementary.\n\n[Figure 4: \u201cAverage binary similarity prediction accuracy\u201d (average accuracy on the internal test (IT) and external test (ET) sets; series: Meta, Majority).]\n\nFigure 4: Mean accuracy and standard deviation on meta-IT and meta-ET data. Comparison between\nthe fraction of pairs correctly predicted by the meta algorithm and the majority rule. Recall that\nmeta-ET, unlike meta-IT, was generated from a partition that did not contribute any training data.\nNonetheless, the meta approach dramatically improved upon the majority rule even on meta-ET.\n\nWe tested our trained models on meta-IT and meta-ET data. We computed the predicted same class\nprobability for each feature vector and the vector obtained by swapping its pair. We predicted the\ninstances in the corresponding pair to be in the same cluster only if the average of these probabilities\nexceeded 0.5. We compared the meta approach to a hypothetical majority rule that had prescience\nabout the class distribution. 
As the name suggests, the majority rule predicted all pairs to have\nthe majority label, i.e., on a problem-by-problem basis we determined whether 1 (same class) or 0\n(different class) was more accurate and gave the baseline the advantage of this knowledge for each\nproblem, even though it normally would not be available at classification time. This information\nabout the distribution of the labels was not accessible to our meta-algorithm. Fig. 4 shows the average\nfraction of similarity pairs correctly identified relative to the corresponding pairwise ground truth\nrelations on the two test sets across 10 independent (meta-training, meta-IT, meta-ET) collections.\nClearly, the meta approach greatly outperforms the majority rule on meta-IT, illustrating the benefits\nof the meta approach in a multi-task transductive setting. More interesting still is the significant\nimprovement exhibited by the meta method on meta-ET, despite having its category precluded from\ncontributing any data for training. The result clearly demonstrates the potential benefits of leveraging\narchived supervised data for informed decision making in unsupervised settings.\n\nConclusion\n\nUnsupervised settings are hard to define, evaluate, and work with due to the lack of supervision. We\nremove the subjectivity inherent in them by reducing UL to supervised learning in a meta-setting.\nThis helps us provide theoretically sound and practically efficient algorithms for questions like which\nclustering algorithm and how many clusters to choose for a particular task, how to fix the threshold\nin single linkage algorithms, and what fraction of data to discard as outliers. We also introduce the\nmeta-scale-invariance (MSI) property, a natural alternative to scale invariance, and show how to\ndesign a single-linkage clustering algorithm that satisfies MSI, richness and consistency. 
Thus, we\navoid Kleinberg\u2019s impossibility result and achieve provably good meta-clustering.\nFinally, we automate learning of features across diverse data, and show how several small datasets\nmay be effectively combined into big data that can be used for zero shot learning with neural nets.\nThis is especially important for two reasons: it (a) alleviates the primary limitation of deep nets that\nthey are not suitable for extremely small datasets, and (b) goes beyond transfer learning with a few\nhomogeneous datasets and shows how knowledge may be transferred from numerous heterogeneous\ndomains to completely new datasets that contribute no data during training.\n\nAcknowledgments\n\nWe thank Lester Mackey for suggesting the title of this paper.\n\nReferences\n\n[1] Jon Kleinberg. An impossibility theorem for clustering. In Advances in neural information processing systems (NIPS), pages 463\u2013470, 2003.\n\n[2] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22:1345\u20131359, 2010.\n\n[3] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.\n\n[4] N. Siddharth, Brooks Paige, Jan-Willem van de Meent, Alban Desmaison, Noah D. Goodman, Pushmeet Kohli, Frank Wood, and Philip H.S. Torr. Learning disentangled representations with semi-supervised deep generative models. In NIPS, 2017.\n\n[5] Marcus Rohrbach, Sandra Ebert, and Bernt Schiele. Transfer learning in a transductive setting. In NIPS, 2013.\n\n[6] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.\n\n[7] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Unsupervised domain adaptation with residual transfer networks. In NIPS, 2016.\n\n[8] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. 
Domain separation networks. In NIPS, 2016.\n\n[9] Michael J Kearns, Robert E Schapire, and Linda M Sellie. Toward efficient agnostic learning. Machine Learning, 17(2-3):115\u2013141, 1994.\n\n[10] Maria-Florina Balcan, Vaishnavh Nagarajan, Ellen Vitercik, and Colin White. Learning-theoretic foundations of algorithm configuration for combinatorial partitioning problems. In Conference on Learning Theory, pages 213\u2013274, 2017.\n\n[11] Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Auto-weka: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 847\u2013855. ACM, 2013.\n\n[12] Nicolo Fusi, Rishit Sheth, and Huseyn Melih Elibol. Probabilistic matrix factorization for automated machine learning. 2018.\n\n[13] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.\n\n[14] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In NIPS, pages 2951\u20132959, 2012.\n\n[15] Sebastian Thrun and Tom M Mitchell. Lifelong robot learning. In The biology and technology of intelligent autonomous agents, pages 165\u2013196. Springer, 1995.\n\n[16] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Efficient representations for lifelong learning and autoencoding. In Workshop on Computational Learning Theory (COLT), 2015.\n\n[17] Reza Bosagh Zadeh and Shai Ben-David. A uniqueness theorem for clustering. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence (UAI), pages 639\u2013646. AUAI Press, 2009.\n\n[18] Margareta Ackerman and Shai Ben-David. Measures of clustering quality: A working set of axioms for clustering. In NIPS, 2008.\n\n[19] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. 
Compressing neural networks with the hashing trick. In ICML, 2015.\n\n[20] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, 2016.\n\n[21] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In ICCV, 2017.\n\n[22] Chirag Gupta, Arun Sai Suggala, Ankit Goyal, Harsha Vardhan Simhadri, Bhargavi Paranjape, Ashish Kumar, Saurabh Goyal, Raghavendra Udupa, Manik Varma, and Prateek Jain. ProtoNN: Compressed and accurate kNN for resource-scarce devices. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1331\u20131340, 2017.\n\n[23] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2(1):193\u2013218, 1985.\n\n[24] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53\u201365, 1987.\n\n[25] G. C. Tseng. Penalized and weighted k-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics, 23:2247\u20132255, 2007.\n\n[26] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211\u2013252, 2015.\n\n[27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (JMLR), 12:2825\u20132830, 2011.\n\n[28] Matthew D Zeiler. Adadelta: an adaptive learning rate method. 
arXiv preprint arXiv:1212.5701, 2012.\n", "award": [], "sourceid": 2413, "authors": [{"given_name": "Vikas", "family_name": "Garg", "institution": "MIT"}, {"given_name": "Adam", "family_name": "Kalai", "institution": "Microsoft Research"}]}