{"title": "Learning Multiple Models via Regularized Weighting", "book": "Advances in Neural Information Processing Systems", "page_first": 1977, "page_last": 1985, "abstract": "We consider the general problem of Multiple Model Learning (MML) from data, from the statistical and algorithmic perspectives; this problem includes clustering, multiple regression and subspace clustering as special cases. A common approach to solving new MML problems is to generalize Lloyd's algorithm for clustering (or Expectation-Maximization for soft clustering). However this approach is unfortunately sensitive to outliers and large noise: a single exceptional point may take over one of the models. We propose a different general formulation that seeks for each model a distribution over data points; the weights are regularized to be sufficiently spread out. This enhances robustness by making assumptions on class balance. We further provide generalization bounds and explain how the new iterations may be computed efficiently. We demonstrate the robustness benefits of our approach with some experimental results and prove for the important case of clustering that our approach has a non-trivial breakdown point, i.e., is guaranteed to be robust to a fixed percentage of adversarial unbounded outliers.", "full_text": "Learning Multiple Models via Regularized Weighting

Daniel Vainsencher
Department of Electrical Engineering
Technion, Haifa, Israel
danielv@tx.technion.ac.il

Shie Mannor
Department of Electrical Engineering
Technion, Haifa, Israel
shie@ee.technion.ac.il

Huan Xu
Mechanical Engineering Department
National University of Singapore, Singapore
mpexuh@nus.edu.sg

Abstract

We consider the general problem of Multiple Model Learning (MML) from data, from the statistical and algorithmic perspectives; this problem includes clustering, multiple regression and subspace clustering as special cases.
A common approach to solving new MML problems is to generalize Lloyd’s algorithm for clustering (or Expectation-Maximization for soft clustering). However, this approach is unfortunately sensitive to outliers and large noise: a single exceptional point may take over one of the models.

We propose a different general formulation that seeks for each model a distribution over data points; the weights are regularized to be sufficiently spread out. This enhances robustness by making assumptions on class balance. We further provide generalization bounds and explain how the new iterations may be computed efficiently. We demonstrate the robustness benefits of our approach with some experimental results and prove for the important case of clustering that our approach has a non-trivial breakdown point, i.e., is guaranteed to be robust to a fixed percentage of adversarial unbounded outliers.

1 Introduction

The standard approach to learning models from data assumes that the data were generated by a certain model, and the goal of learning is to recover this generative model. For example, in linear regression, an unknown linear functional, which we want to recover, is believed to have generated covariate-response pairs. Similarly, in principal component analysis, a random variable in some unknown low-dimensional subspace generated the observed data, and the goal is to recover this low-dimensional subspace. Yet, in practice, it is common to encounter data that were generated by a mixture of several models rather than a single one, and the goal is to learn a number of models such that any given data point can be explained by at least one of the learned models. It is also common for the data to contain outliers: data points that are not well explained by any of the models to be learned, possibly inserted by external processes.

We briefly explain our approach (presented in detail in the next section).
At its center is the problem of assigning data points to models, with the main consideration that every model be consistent with many of the data points. Thus we seek for each model a distribution of weights over the data points, and encourage even weights by regularizing these distributions (hence our approach is called Regularized Weighting, abbreviated RW). A data point that is inconsistent with all available models will receive lower weight and may sometimes be ignored entirely. The value of ignoring difficult points is illustrated by contrast with the common approach, which we consider next.

Arguably the most widely applied approach for multiple model learning is the minimum loss approach, also known as Lloyd’s algorithm [1] in clustering, where the goal is to find a set of models and associate each data point with one model (in so-called “soft” variations, one or more models), such that the sum of losses over data points is minimal. Notice that in this approach, every data point must be explained by some model. This leaves the minimum loss approach vulnerable to outliers and corruptions: if one data point goes to infinity, so must at least one model.

Our remedy is to relax the requirement that each data point must be explained. Indeed, as we show later, the RW formulation is provably robust in the case of clustering, in the sense of having a non-zero breakdown point [2]. Moreover, we also establish other desirable properties, both computational and statistical, of the proposed method. Our main contributions are:

1. A new formulation of the sub-task of associating data points to models as a convex optimization problem for setting weights. This problem favors broadly based models, and may ignore difficult data points entirely. We formalize such properties of optimal solutions through analysis of a strongly dual problem. The remaining results are characteristics of this approach.

2.
Outlier robustness. We show that the breakdown point of the proposed method is bounded away from zero for the clustering case. The breakdown point is a concept from robust statistics: it is the fraction of adversarial outliers that an algorithm can sustain without having its output arbitrarily changed.

3. Robustness to fat-tailed noise. We show, empirically on synthetic and real-world datasets, that our formulation is more resistant to fat-tailed additive noise.

4. Generalization. Ignoring some of the data may, in general, lead to overfitting. We show that when the parameter α (defined in Section 2) is appropriately set, this essentially does not occur. We prove this through uniform convergence bounds resilient to the lack of efficient algorithms to find near-optimal solutions in multiple model learning.

5. Computational complexity. Like almost every method that tackles the multiple model learning problem, we use alternating optimization of the models and the association (weights), i.e., we iteratively optimize one of them while fixing the other. Our formulation for optimizing the association requires solving a quadratic problem in kn variables, where k is the number of models and n is the number of points. Compared to O(kn) steps for some formulations, this seems expensive. We show how to take advantage of the special problem structure and repetition in the alternating optimization subproblems to reduce this cost.

1.1 Relation to previous work

Learning multiple models is by no means a new problem. Indeed, special examples of multi-model learning have been studied, including k-means clustering [3, 4, 5] (and many other variants thereof), Gaussian mixture models (and extensions) [6, 7] and the subspace segmentation problem [8, 9, 10]; see Section 2 for details. Fewer studies attempt to cross problem-type boundaries.
A general treatment of the sample complexity of problems that can be interpreted as learning a code book (which encompasses some types of multiple model learning) is [11]. Slightly closer to our approach is [12], whose formulation generalizes a common approach to different model types and permits problem-specific regularization, giving both generalization results and algorithmic iteration complexity results. A probabilistic and generic algorithmic approach to learning multiple models is Expectation Maximization [13].

Algorithms for dealing with outliers and multiple models together have been proposed in the context of clustering [14]. Reference [15] provides an example of an algorithm for outlier resistance in learning a single subspace, and partly inspires the current work. In contrast, we abstract almost completely over the class of models, allowing both algorithms and analysis to be easily reused to address new classes.

2 Formulation

In this section we show how multi-model learning problems can be formed from a simple estimation problem (where we seek to explain weighted data points by a single model) by imposing a particular joint loss. We contrast the joint loss proposed here with a common one through the weights assigned by each and their effects on robustness.

We refer throughout to n data points from X by (x_i)_{i=1}^n = X ∈ X^n, which we seek to explain by k models from M denoted (m_j)_{j=1}^k = M ∈ M^k. A data set may be weighted by a set of k distributions (w_j)_{j=1}^k = W ∈ (△_n)^k, where △_n ⊂ R^n is the simplex.

Definition 1. A base weighted learning problem is a tuple (X, M, ℓ, A), where ℓ : X × M → R₊ is a non-negative convex function, which we call a base loss function, and A : △_n × X^n → M defines an efficient algorithm for choosing a model.
Given the weight w and data X, A obtains low weighted empirical loss Σ_{i=1}^n w_i ℓ(x_i, m) (the weighted empirical loss need not be minimal, allowing for regularization which we do not discuss further).

We will often denote the losses of a model m over X as a vector l = (ℓ(x_i, m))_{i=1}^n. In the context of a set of models M, we similarly associate the loss vector l_j and the weight vector w_j with the model m_j; this allows us to use the terse notation w_j^⊤ l_j for the weighted loss of model j.

Given a base weighted learning problem, one may pose a multi-model learning problem.

Example 1. The multi-model learning problem covers many examples; here we list a few:

• In k-means clustering, the goal is to partition the training samples into k subsets, where each subset of samples is “close” to their mean. In our terminology, this is a multi-model learning problem where the base learning problem is (R^d, R^d, (x, m) ↦ ‖x − m‖₂², A), where A finds the weighted mean of the data. The weights allow us to compute each cluster center according to the relevant subset of points.

• In subspace clustering, also known as subspace segmentation, the objective is to group the training samples into subsets, such that each subset can be well approximated by a low-dimensional affine subspace. This is a multi-model learning problem where the corresponding single-model learning problem is PCA.

• Regression clustering [16] extends the standard linear regression problem in that the training samples cannot be explained by one linear function.
Instead, multiple linear functions are sought, so that the training samples can be split into groups, and each group can be approximated by one linear function.

• The Gaussian Mixture Model considers the case where data points are generated by a mixture of a finite number of Gaussian distributions, and seeks to estimate the mean and variance of each of these distributions, and simultaneously to group the data points according to the distribution that generated each. This is a multi-model learning problem where the respective single-model learning problem is estimating the mean and variance of a distribution.

The most common way to tackle the multiple model learning problem is the minimum loss approach, i.e., to minimize the following joint loss:

L(X, M) = (1/n) Σ_{x∈X} min_{m∈M} ℓ(x, m).    (2.1)

In terms of weighted base learning problems, each model gives equal weight to all points for which it is the best (lowest loss) model. For example, when M = X = R^n with ℓ(x, m) = ‖x − m‖₂², the squared Euclidean distance loss yields k-means clustering. In this context, alternating between choosing for each x its loss-minimizing model, and adjusting each model to minimize the squared Euclidean loss, yields Lloyd’s algorithm (and its generalizations for other problems).

The minimum loss approach requires that every point is assigned to a model; this can potentially cause problems in the presence of outliers. For example, consider the clustering case where the data contain a single outlier point x_i. Let x_i tend to infinity; there will always be some m_j that is closest to x_i, and is therefore (at equilibrium) the average of x_i and some other data points. Then m_j will tend to infinity also. We call this phenomenon mode I of sensitivity to outliers; it is common also to such simple estimators as the mean.
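This sensitivity is easy to reproduce numerically. The following sketch (our illustration; the paper itself contains no code) runs plain Lloyd iterations in one dimension on ten well-behaved points plus a single far outlier; one of the two learned centers ends up following the outlier away from all the other points:

```python
def lloyd_1d(points, k=2, iters=20):
    """Plain Lloyd iterations in 1D: the minimum loss approach of Eq. (2.1)."""
    centers = [min(points), max(points)]  # simple deterministic initialization
    for _ in range(iters):
        # associate each point with its nearest center (min-loss assignment)
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # re-fit each center to the mean of its assigned points
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

# ten inliers at 0..9 and one extreme outlier
data = [float(i) for i in range(10)] + [1000.0]
print(lloyd_1d(data))  # prints [4.5, 1000.0]: one center is taken over by the outlier
```

Under the Regularized Weighting loss introduced below, such a point can instead receive (near) zero weight, leaving every model supported on the bulk of the data.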
Mode II of sensitivity is more particular: as m_j follows x_i to infinity, it stops being the closest to any other points, until the model is associated only to the outlier and thus matches it perfectly. Thus under Eq. (2.1) outliers tend to take over models. Mode II of sensitivity is not clustering specific, and Fig. 2.1 provides an example in multiple regression. Neither mode is avoided by spreading a point’s weight among models as in mixture models [6].

To overcome both modes of sensitivity, we propose a different joint loss, in which the hard constraint is only that for each model we produce a distribution over data points. A penalty term discourages the concentration of a model on few points and thus mode II sensitivity. Deweighting difficult points helps mitigate mode I. For clustering this robustness is formalized in Theorem 2.

[Figure: robust and Lloyd’s association methods, quadratic regression. Legend: Data; Minimum loss 0.20 correct on 34 points; Minimum loss 0.20 correct on 4 points; Robust joint loss 0.20 correct on 29 points; Robust joint loss 0.20 correct on 37 points.]

Figure 2.1: Data is a mixture of two quadratics, with positive fat-tailed noise. Under a minimum loss approach an off-the-chart high-noise point suffices to prevent the top broken line from being close to many other data points. Our approach is free to better model the bulk of data. We used a robust (mean absolute deviation) criterion to choose among the results of multiple restarts for each model.

Definition 2. Let u ∈ △_n be the uniform distribution. Given k weight vectors, we denote their average v(W) = k^{-1} Σ_{j=1}^k w_j, and just v when W is clear from context. The Regularized Weighting multiple model learning loss is a function L_α : X^n × M^k × (△_n)^k → R defined as

L_α(X, M, W) = α ‖u − v(W)‖₂² + k^{-1} Σ_{j=1}^k l_j^⊤ w_j,    (2.2)

which in particular defines the weight setting subproblem:

L_α(X, M) = min_{W ∈ (△_n)^k} L_α(X, M, W).    (2.3)

As its name suggests, our formulation regularizes distributions of weight over data points; specifically, the w_j are controlled by forcing their average v to be close to the uniform distribution u. Our goal is for each model to represent many data points, so weights should not be concentrated. We avoid this by penalizing squared Euclidean distance from uniformity, which emphasizes points receiving weight much higher than the natural n^{-1}, and essentially ignores small variations around n^{-1}. The effect is later formalized in Lemma 1, but to illustrate we next calculate the penalties for two stylized cases. This will also produce the first of several hints about the appropriate range of values for α.

In the following examples, we will consider a set of γnk^{-1} data points, recalling that nk^{-1} is the natural number of points per model. To avoid letting a few high-loss outliers skew our models (mode I of sensitivity), we prefer instead to give them zero weight. Take γ ≪ k/2; then the cost of ignoring some γnk^{-1} points in all models is at most αn^{-1} · 2γk^{-1} ≪ αn^{-1}. In contrast, basing a model on very few points (mode II of sensitivity) should be avoided. If the jth model is fit to only γnk^{-1} points for γ ≪ 1, the penalty from those points will be at least (approximately) αn^{-1} · γ^{-1}k^{-1}. We can make the first situation cheap and the second expensive (per model) in comparison to the empirical weighted loss term by choosing

αn^{-1} ≈ k^{-1} Σ_{j=1}^k w_j^⊤ l_j.    (2.4)

[Figure: clustering in 1D with varied class sizes, α/n = 0.335. Legend: model 1 weighs 21 points; model 2 weighs 39 points. Vertical axis: weight assigned by model, scaled w_{j,i} · n/k; horizontal axis: location.]

Figure 2.2: For each location (horizontal) of a data point, the vertical locations of the corresponding markers give the weights assigned by each model. The left cluster is half as populated as the right, and thus must give weights about twice as large. Within each model, weights are affine in the loss (see Section 2.1), causing the concave parabolas. The gap allowed between the maximal weights of different models allows a point from the right cluster to be adopted by the left model, lowering the overall penalty at a cost to the weighted losses.

On the flip side, highly unbalanced classes in the data can be challenging to our approach. Consider the case where a model has low loss for fewer than n/(2k) points: spreading its weight only over them can incur very high costs due to the regularization term, which might be lowered by including some higher-loss points that are indeed better explained by another model (see Figure 2.2 for an illustration).
This challenge might be solved by explicitly and separately estimating the relative frequencies of the classes, and penalizing deviations from the estimates rather than from equal frequencies, as is done in mixture models [6]; this is left for future study.

2.1 Two properties of Regularized Weighting

Two properties of our formulation result from an analysis (in Appendix A, for lack of space) of a dual problem of the weight setting problem (2.3). These provide the basis for later theory by relating v, the losses, and α. The first illustrates the uniform control of v:

Lemma 1. Let all losses be in [0, B]; then in an optimal solution to (2.3), we have

‖v − u‖_∞ ≤ B/(2α).

This strengthens the conclusion of (2.4): if outliers are present and αn^{-1} > 2B, where B bounds losses on all points including outliers, weights will be almost uniform (enabling mode I of sensitivity). On the positive side, this lemma plays an important role in the generalization and iteration complexity results presented in the sequel. A more detailed view of v_i for individual points is provided by the second property.

By P_C we denote the orthogonal projection mapping into a convex set C.

Lemma 2. For an optimal solution to (2.3), there exists t ∈ R^k such that:

v = P_{△_n}(u − min_j (l_j − t_j)/(2α)),

where min_j should be read as operating element-wise, and in particular w_{j,i} > 0 implies that j minimizes the ith element.

This establishes that the average weight (when positive) is affine in the loss; the concave parabolas visible in Figure 2.2 are an example. We also learn that the role of α in solutions is to determine the coefficient of the affine relation. Distinct t_j allow for different densities of points around different models.
One observation from this lemma is that if a particular model j gives weight to some point i, then every point x_{i′} with lower loss ℓ(x_{i′}, m_j) under that model will receive at least that much weight. This property plays a key role in the proof of robustness to outliers in clustering.

2.2 An alternating optimization algorithm

The RW multiple model learning loss, like other MML losses, is not convex. However, the weight setting problem (2.3) is convex when we fix the models, and an efficient procedure A is assumed for solving a weighted base learning problem for a model, supporting an alternating optimization approach, as in Algorithm 1; see Section 5 for further discussion.

Data: X
Result: The model-set M
M ← initialModels(X);
repeat
    M′ ← M;
    W ← arg min_{W′} L_α(X, M, W′);
    m_j ← A(w_j, X)    (∀j ∈ [k]);
until L(X, M′) − L(X, M) < ε;

Algorithm 1: Alternating optimization for Regularized Weighting

3 Breakdown point in clustering

Our formulation allows a few difficult outliers to be ignored if the right models are found; does this happen in practice? Figure 2.1 provides a positive example in regression clustering, and a more substantial empirical evaluation on subspace clustering is in Appendix B. In the particular case of clustering with the squared Euclidean loss, robustness benefits can be proved.

We use “breakdown point” – the standard robustness measure in the literature of robust statistics [2]; see also [17, 18] and many others – to quantify the robustness property of the proposed formulation.
The breakdown point of an estimator is the smallest fraction of bad observations that can cause the estimator to take arbitrarily aberrant values, i.e., the smallest fraction of outliers needed to completely break an estimator.

For the case of clustering with the squared Euclidean distance base loss, the min-loss approach corresponds to k-means clustering, which is not robust in this sense; its breakdown point is 0. The non-robustness of k-means has led to the development of many formulations of robust clustering; see the review by [14]. In contrast, we show that our joint loss yields an estimator that has a non-zero breakdown point, and is hence robust.

In general, a squared loss clustering formulation that assigns equal weight to different data points cannot be robust – as one data point tends to infinity, so must at least one model. This applies to our model if α is allowed to tend to infinity. On the other hand, if α is too low, it becomes possible for each model to assign all of its weight to a single point, which may well be an outlier tending to infinity. Thus, it is well expected that the robustness result below requires α to belong to a data-dependent range.

Theorem 2. Let X = M be a Euclidean space in which we perform clustering with the squared Euclidean loss ℓ(x_i, m_j) = ‖m_j − x_i‖₂² and k centers. Denote by R the radius of any ball containing the inliers, and by η < k^{-2}/22 the proportion of outliers allowed to be outside the ball.
Denote also by r a radius such that there exists M′ = {m′_1, . . . , m′_k} such that each inlier is within a distance r of some model m′_j and each m′_j approximates (i.e., is within a distance r of) at least n/(2k) inliers; this always holds for some r ≤ R.

For any α ∈ n[r², 13R²], let (M, W) be minimizers of L_α(X, M, W). Then we have ‖m_j − x_i‖₂ ≤ 6R for every model m_j and inlier x_i.

Theorem 2 shows that when the number of outliers is not too high, the learned model, regardless of the magnitude of the outliers, is close to the inliers and hence cannot be arbitrarily bad. In particular, the theorem implies a non-zero breakdown point for any α > nr²; taking too high an α merely forces a larger but still finite R. If the inliers are amenable to balanced clustering so that r ≪ R, the regime of non-zero breakdown is extended to smaller α.

The proof follows three steps. First, due to the regularization term, for any model, the total weight on the few outliers is at most 1/3. Second, an optimal model must thus be at least twice as close to the weighted average of its inliers as it is to the weighted average of its outliers. This step depends critically on the squared Euclidean loss being used. Lastly, this gap in distances cannot be large in absolute terms, due to Lemma 2; an outlier that is much farther from the model than the inliers must receive weight zero. For the proof see Appendix C of the supplementary material.

4 Regularized Weighting formulation sample complexity

An important consideration in learning algorithms is controlling overfitting, in which a model is found that is appropriate for some data, rather than for the source that generates the data.
The current formulation seems to be particularly vulnerable, since it allows data to be ignored, in contrast to most generalization bounds, which assume equal weight is given to all data.

Our loss L_α(X, M) differs from common losses in allowing data points to be differently weighted. Thus, to obtain the sample complexity of our formulation we need to bound the difference that a single sample can make to the loss. For a common empirical average loss this is bounded by Bn^{-1}, where B is the maximal value of the non-negative loss on a single data point; in our case it is bounded by B‖v‖_∞, because if X and X′ differ only in the ith element, then:

|L_α(X′, M, W) − L_α(X, M, W)| = |k^{-1} Σ_{j=1}^k w_{j,i}(l_{j,i} − l′_{j,i})| ≤ Bk^{-1} Σ_{j=1}^k w_{j,i} ≤ Bv_i.

Whenever W is optimal with respect to either X or X′, Lemma 1 provides the necessary bound on ‖v‖_∞. Along with covering numbers as defined next and standard arguments (found in the supplementary material), this bound on differences provides us with the desired generalization result.

Definition 3 (Covering numbers for multiple models). We shall endow M^k with the metric

d_∞(M, M′) = max_{j∈[k]} ‖ℓ(·, m_j) − ℓ(·, m′_j)‖_∞

and define its covering number N_ε(M^k) as the minimal cardinality of a set M^k_ε such that M^k ⊆ ∪_{M ∈ M^k_ε} B(M, ε).

The bound depends on an upper bound on the base losses, denoted B; this should be viewed as fixing a scale for the losses and is standard where losses are not naturally bounded (e.g., classical bounds on SVM kernel regression [19] use bounded kernels).
Thus, we have the following generalization result, whose proof can be found in Appendix D of the supplementary material.

Theorem 3. Let the base losses be bounded in the interval [0, B], let M^k have covering numbers N_ε(M^k) ≤ (C/ε)^{dk}, and let γ = nB/(2α). Then with probability at least 1 − exp{dk log(2C/τ) − 2nτ²/(B²(1 + γ)²)} we have:

∀M ∈ M^k,  |L_α(X, M) − E_{X′∼D^n} L_α(X′, M)| ≤ 3τ.

5 The weight assignment optimization step

As is typical in multi-model learning, simultaneously optimizing the models and the association of the data (in our formulation, the weights) is computationally hard [20]; thus Algorithm 1 alternates between optimizing the weights with the models fixed, and optimizing the models with the weights fixed. We therefore show how to efficiently solve a sequence of weight setting problems, minimizing L_α(X, M_i, W) over W, where the M_i typically converge.

We propose to solve each instance of weight setting using gradient methods, and in particular FISTA [21]. This has two advantages compared to interior point methods. First, the memory use of gradient methods depends only linearly on the dimension, which is O(kn) in problem (2.3), allowing scaling to large data sets. Second, gradient methods have “warm start” properties: the number of iterations required is proportional to the distance between the initial and optimal solutions, which is useful both due to bounds on ‖v − u‖_∞ and when the M_i converge.

Theorem 4. Given data and models (X, M) there exists an algorithm that finds a weight matrix W such that L_α(X, M, W) − L_α(X, M) ≤ ε using O(√(kα/ε)) iterations, each costing O(kn) time and memory.
If α ≥ Bn/4, then O(k√(αn^{-1}/ε)) iterations suffice.

The first bound might suggest that the typical setting α ∝ n requires the number of iterations to increase with the number of points n; the second bound shows this is not always necessary.

This result can be realized by applying the algorithm FISTA, with starting point w_j = u, and with 2αk^{-2} as a bound on the Lipschitz constant of the gradient. For the first bound we estimate the distance from u by the radius of the product of k simplices; for the second we use Lemma 1 in Appendix E.

6 Conclusion

In this paper, we proposed and analyzed, from a general perspective, a new formulation for learning multiple models that together explain much of the data well. It is based on associating with each model a regularized weight distribution over the data it explains well. A main advantage of the new formulation is its robustness to fat-tailed noise and outliers: we demonstrated this empirically for regression clustering and subspace clustering tasks, and proved that for the important case of clustering, the proposed method has a non-trivial breakdown point, in sharp contrast to standard methods such as k-means. We further provided generalization bounds and explained an optimization procedure to solve the formulation at scale.

Our main motivation comes from the fast-growing attention to analyzing data using multiple models, under the names of k-means clustering, subspace segmentation, and Gaussian mixture models, to list a few. While all these learning schemes share common properties, they are largely studied separately, partly because these problems come from different sub-fields of machine learning. We believe general methods with desirable properties such as generalization and robustness will supply ready tools for new applications using other model types.

Acknowledgments

H.
Xu is partially supported by the Ministry of Education of Singapore through AcRF Tier Two grant R-265-000-443-112 and NUS startup grant R-265-000-384-133. This research was funded (in part) by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).

References

[1] S. Lloyd. Least squares quantization in PCM. Information Theory, IEEE Transactions on, 28(2):129–137, 1982.

[2] P. J. Huber. Robust Statistics. John Wiley & Sons, New York, 1981.

[3] J.A. Hartigan and M.A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.

[4] R. Ostrovsky, Y. Rabani, L.J. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on, pages 165–176. IEEE, 2006.

[5] P. Hansen, E. Ngai, B.K. Cheung, and N. Mladenovic. Analysis of global k-means, an incremental heuristic for minimum sum-of-squares clustering. Journal of Classification, 22(2):287–310, 2005.

[6] G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1998.

[7] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In FOCS 2010: Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, pages 103–112. IEEE Computer Society, 2010.

[8] G. Chen and M. Maggioni. Multiscale geometric and spectral analysis of plane arrangements. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2825–2832. IEEE, 2011.

[9] Yaoliang Yu and Dale Schuurmans. Rank/norm regularization with closed-form solutions: Application to subspace clustering. In Fabio Gagliardi Cozman and Avi Pfeffer, editors, UAI, pages 778–785. AUAI Press, 2011.

[10] M. Soltanolkotabi and E.J. Candès.
A geometric analysis of subspace clustering with outliers. arXiv preprint arXiv:1112.4258, 2011.

[11] A. Maurer and M. Pontil. k-dimensional coding schemes in Hilbert spaces. Information Theory, IEEE Transactions on, 56(11):5839–5846, 2010.

[12] A.J. Smola, S. Mika, B. Schölkopf, and R.C. Williamson. Regularized principal manifolds. The Journal of Machine Learning Research, 1:179–209, 2001.

[13] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B, 39(1):1–38, 1977.

[14] R.N. Davé and R. Krishnapuram. Robust clustering methods: a unified view. Fuzzy Systems, IEEE Transactions on, 5(2):270–293, 1997.

[15] Huan Xu, Constantine Caramanis, and Shie Mannor. Outlier-robust PCA: The high-dimensional case. IEEE Transactions on Information Theory, 59(1):546–572, 2013.

[16] B. Zhang. Regression clustering. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, pages 451–458. IEEE, 2003.

[17] P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. John Wiley & Sons, New York, 1987.

[18] R. A. Maronna, R. D. Martin, and V. J. Yohai. Robust Statistics: Theory and Methods. John Wiley & Sons, New York, 2006.

[19] Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.

[20] M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is NP-hard. WALCOM: Algorithms and Computation, pages 274–285, 2009.

[21] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[22] Roberto Tron and René Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In CVPR. IEEE Computer Society, 2007.

[23] J. Duchi, S.
Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279, 2008.", "award": [], "sourceid": 998, "authors": [{"given_name": "Daniel", "family_name": "Vainsencher", "institution": "Technion"}, {"given_name": "Shie", "family_name": "Mannor", "institution": "Technion"}, {"given_name": "Huan", "family_name": "Xu", "institution": "NUS"}]}