{"title": "Local Supervised Learning through Space Partitioning", "book": "Advances in Neural Information Processing Systems", "page_first": 91, "page_last": 99, "abstract": "We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.", "full_text": "Local Supervised Learning through Space\n\nPartitioning\n\nDept. of Electrical and Computer Engineering\n\nDept. of Electrical and Computer Engineering\n\nVenkatesh Saligrama\n\nBoston University\nBoston, MA 02116\n\nJoseph Wang\n\nBoston University\nBoston, MA 02116\njoewang@bu.edu\n\nsrv@bu.edu\n\nAbstract\n\nWe develop a novel approach for supervised learning based on adaptively parti-\ntioning the feature space into different regions and learning local region-speci\ufb01c\nclassi\ufb01ers. We formulate an empirical risk minimization problem that incorpo-\nrates both partitioning and classi\ufb01cation in to a single global objective. 
We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.

1 Introduction

We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region classifiers. Fig. 1 (left) presents one possible architecture of our scheme (others are also possible). Here each example passes through a cascade of reject classifiers (g_j's). Each reject classifier, g_j, makes a binary decision, and the observation is either classified by the associated region classifier, f_j, or passed to the next reject classifier. Each reject classifier, g_j, thus partitions the feature space into regions. The region classifier f_j operates only on examples within the local region that is consistent with the reject classifier partitions.

We incorporate both feature space partitioning (reject classifiers) and region-specific classifiers into a single global empirical risk/loss function. We then optimize this global objective by means of coordinate descent, namely, by optimizing over one classifier at a time. 
In this context we show that each step of the coordinate descent can be reformulated as a supervised learning problem that seeks to optimize a 0/1 empirical loss function. This result is somewhat surprising in the context of partitioning and has broader implications. First, we can now solve feature space partitioning through empirical risk minimization (ERM), and so powerful existing methods including boosting, decision trees and kernel methods can be used in conjunction with our approach for training flexible partitioning classifiers.

Second, because data is usually locally "well-behaved," simpler region classifiers, such as linear classifiers, often suffice for controlling local empirical error. Furthermore, since complex partition boundaries can be approximated by piecewise linear functions, feature spaces can be partitioned to an arbitrary degree of precision using linear boundaries (reject classifiers). Thus the combination of piecewise linear partitions and linear region classifiers has the ability to adapt to complex data sets, leading to low training error.

Figure 1: Left: Architecture of our system. Reject classifiers, g_j(x), partition space and region classifiers, f_j(x), are applied locally within the partitioned regions. Right: Comparison of our approach (upper panel) against AdaBoost and a decision tree (lower panel) on the banana dataset [1]. We use linear perceptrons and logistic regression for training the partitioning and region classifiers. Our scheme splits the data into 3 regions and does not overtrain, unlike AdaBoost.

Yet we can prevent overfitting/overtraining by optimizing the number of linear partitions and linear region classifiers, since the VC dimension of such a structure is reasonably small. 
In addition, this also ensures significant robustness to label noise. Fig. 1 (right) demonstrates the substantial benefits of our approach on the banana dataset [1] over competing methods such as boosting and decision trees, both of which evidently overtrain.

Limiting reject and region classifiers to linear methods has computational advantages as well. Since the datasets are locally well-behaved we can locally train with linear discriminant analysis (LDA), logistic regression and variants of perceptrons. These methods are computationally efficient in that they scale linearly with data size and data dimension. So we can train on large high-dimensional datasets, with possible applications to online scenarios.

Our approach naturally applies to multi-class datasets. Indeed, we present some evidence showing that the partitioning step can adaptively cluster the dataset into groups, letting region classifiers operate on simpler problems. Additionally, linear methods such as LDA, logistic regression, and the perceptron naturally extend to multi-class problems, leading to computationally efficient and statistically meaningful results, as evidenced on challenging datasets with performance improvements over state of the art techniques.

1.1 Related Work

Our approach fits within the general framework of combining simple classifiers for learning complex structures. Boosting algorithms [2] learn complex decision boundaries characterized as a weighted linear combination of weak classifiers. In contrast our method takes unions and intersections of simpler decision regions to learn more complex decision boundaries. In this context our approach is closely related to decision trees. Decision trees are built by greedily partitioning the feature space [3]. 
One main difference is that decision trees typically attempt to greedily minimize some loss or heuristic, such as region purity or entropy, at each split/partition of the feature space. In contrast our method attempts to minimize global classification loss. Also, decision trees typically split/partition on a single feature/component, resulting in unions of rectangularly shaped decision regions; in contrast we allow arbitrary partitions, leading to complex decision regions.

Our work is loosely related to so-called coding techniques that have been used in multi-class classification [4, 5]. In these methods a multiclass problem is decomposed into several binary problems using a code matrix, and the predicted outcomes of these binary problems are fused to obtain multiclass labels. Jointly optimizing for the code matrix and binary classification is known to be NP-hard [6], and iterative techniques have been proposed [7, 8]. There is some evidence (see Sec. 3) that suggests that our space partitioning classifier groups/clusters multiple classes into different regions; nevertheless our formulation is different in that we do not explicitly code classes into different regions and our method does not require fusion of intermediate outcomes.

Despite all these similarities, at a fundamental level our work can also be thought of as a method complementary to existing supervised learning algorithms. This is because we show that space partitioning itself can be reformulated as a supervised learning problem. Consequently, any existing method, including boosting and decision trees, could be used as a method of choice for learning space partitioning and region-specific decision functions.

We use simple linear classifiers for partitioning and region classifiers in many of our experiments. Using piecewise combinations of simple functions to model a complex global boundary is a well studied problem. 
Mixture Discriminant Analysis (MDA), proposed by Hastie et al. [9], models each class as a mixture of Gaussians, with linear discriminant analysis used to build classifiers between the estimated Gaussian distributions. MDA relies upon the structure of the data, assuming that the true distribution is well approximated by a mixture of Gaussians. Local Linear Discriminant Analysis (LLDA), proposed by Kim et al. [10], clusters the data and performs LDA within each cluster. Both of these approaches partition the data and then attempt to classify locally. Partitioning of the data is independent of the performance of the local classifiers, and is instead based upon the spatial structure of the data. In contrast, our proposed approach partitions the data based on the performance of the classifiers in each region. A recently proposed alternative approach is to build a global classifier ignoring clusters of errors, and to build separate classifiers in each error cluster region [11]. This approach greedily approximates a piecewise linear classifier; however, it fails to take into account the performance of the classifiers in the error cluster regions. While piecewise linear techniques have been proposed in the past [12, 13], we are unaware of techniques that learn piecewise linear classifiers by minimizing a global ERM objective, allow any discriminative approach to be used for partitioning and local classification, and extend to multiclass learning problems.

2 Learning Space Partitioning Classifiers

The goal of supervised classification is to learn a function, f(x), that maps features, x \in X, to a discrete label, y \in {1, 2, . . . , c}, based on training data, (x_i, y_i), i = 1, 2, . . . , n. 
The empirical risk/loss of classifier f is:

R(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(x_i) \neq y_i\}}

Our goal is empirical risk minimization (ERM), namely, to minimize R(f) over all classifiers, f(\cdot), belonging to some class F. It is well known that the complexity of the family F dictates generalization errors. If F is too simple, it often leads to large bias errors; if the family F is too rich, it often leads to large variance errors. With this perspective we consider a family of classifiers (see Fig. 1) that adaptively partitions data into regions and fits simple classifiers within each region. We predict the output for a test sample, x, based on the output of the trained simple classifier associated with the region x belongs to. The complexity of our family of classifiers depends on the number of local regions, the complexity of the simple classifiers in each region, and the complexity of the partitioning. In the sequel we formulate space partitioning and region-classification into a single objective and show that space partitioning is equivalent to solving a binary classification problem with 0/1 empirical loss.

2.1 Binary Space Partitioning as Supervised Learning

In this section we consider learning binary space partitions for ease of exposition. The function g(\cdot) partitions the space by mapping features, x \in X, to a binary label, z \in {0, 1}. Region classifiers f_0(x), f_1(x) operate on the respective regions generated by g(x) (see Fig. 1). The empirical risk/loss associated with the binary space partitioned classifiers is given by:

R(g, f_0, f_1) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{g(x_i)=0\}} \mathbf{1}_{\{f_0(x_i) \neq y_i\}} + \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{g(x_i)=1\}} \mathbf{1}_{\{f_1(x_i) \neq y_i\}}   (1)

Our goal is to minimize the empirical error jointly over the family of functions g(\cdot) \in G and f_i(\cdot) \in F. 
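To make Eq. 1 concrete, the partitioned empirical risk can be computed directly from the three classifiers' predictions. A minimal sketch in Python (the function name and the toy constant classifiers are our own illustration, not code from the paper):

```python
import numpy as np

def partitioned_risk(g, f0, f1, X, y):
    """Empirical risk of Eq. 1: examples with g(x) = 0 are scored by
    region classifier f0, those with g(x) = 1 by f1."""
    region = g(X)                               # partition labels in {0, 1}
    pred = np.where(region == 0, f0(X), f1(X))  # route each example
    return np.mean(pred != y)

# Toy check: partition on the sign of the first feature, with a
# constant classifier in each region.
X = np.array([[-2.0, 1.0], [-1.0, 0.5], [1.0, -1.0], [2.0, 0.0]])
y = np.array([0, 0, 1, 1])
g = lambda X: (X[:, 0] > 0).astype(int)
f0 = lambda X: np.zeros(len(X), dtype=int)
f1 = lambda X: np.ones(len(X), dtype=int)
print(partitioned_risk(g, f0, f1, X, y))  # -> 0.0
```

Swapping f0 and f1 in this toy example routes every point to the wrong local rule and drives the risk to 1, which illustrates the joint dependence of Eq. 1 on the partition g and the region classifiers.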
From the above equation, when the partitioning function g(\cdot) is fixed, it is clear how one can view the choice of classifiers f_0(\cdot) and f_1(\cdot) as ERM problems. In contrast, even when f_0, f_1 are fixed, it is unclear how to view minimization over g \in G as an ERM. To this end let \ell_i^0, \ell_i^1 indicate whether or not classifiers f_0, f_1 make an error on example x_i, and let S denote the set of instances where the classifier f_0 makes errors, namely,

\ell_i^0 = \mathbf{1}_{\{f_0(x_i) \neq y_i\}}, \quad \ell_i^1 = \mathbf{1}_{\{f_1(x_i) \neq y_i\}}, \quad S = \{i \mid \ell_i^0 = 1\}   (2)

We can then rewrite Eq. 1 as follows:

R(g, f_0, f_1) = \frac{1}{n} \sum_{i=1}^{n} \ell_i^0 \mathbf{1}_{\{g(x_i)=0\}} + \frac{1}{n} \sum_{i=1}^{n} \ell_i^1 \mathbf{1}_{\{g(x_i)=1\}}
= \frac{1}{n} \sum_{i \in S} \mathbf{1}_{\{g(x_i)=0\}} + \frac{1}{n} \sum_{i \in S} \ell_i^1 \mathbf{1}_{\{g(x_i)=1\}} + \frac{1}{n} \sum_{i \notin S} \ell_i^1 \mathbf{1}_{\{g(x_i)=1\}}
= \frac{1}{n} \sum_{i \in S} \mathbf{1}_{\{g(x_i)=0\}} + \frac{1}{n} \sum_{i \in S} \ell_i^1 \left(1 - \mathbf{1}_{\{g(x_i)=0\}}\right) + \frac{1}{n} \sum_{i \notin S} \ell_i^1 \mathbf{1}_{\{g(x_i)=1\}}
= \frac{1}{n} \sum_{i \in S} \left(1 - \ell_i^1\right) \mathbf{1}_{\{g(x_i)=0\}} + \underbrace{\frac{1}{n} \sum_{i \in S} \ell_i^1}_{\text{indep. of } g} + \frac{1}{n} \sum_{i \notin S} \ell_i^1 \mathbf{1}_{\{g(x_i)=1\}}

Note that for optimizing g \in G for fixed f_0, f_1, the second term above is constant. Furthermore, by consequence of Eq. 2 we see that the first and third terms can be further simplified as follows:

\frac{1}{n} \sum_{i \in S} \left(1 - \ell_i^1\right) \mathbf{1}_{\{g(x_i)=0\}} = \frac{1}{n} \sum_{i \in S} \left(1 - \ell_i^1\right) \mathbf{1}_{\{g(x_i) \neq \ell_i^0\}}; \quad \frac{1}{n} \sum_{i \notin S} \ell_i^1 \mathbf{1}_{\{g(x_i)=1\}} = \frac{1}{n} \sum_{i \notin S} \ell_i^1 \mathbf{1}_{\{g(x_i) \neq \ell_i^0\}}

Putting all this together we have the following lemma:
Lemma 2.1. For fixed f_0, f_1, the problem of choosing the best binary space partition, g(\cdot), in Eq. 
1 is equivalent to choosing a binary classifier g that optimizes the following 0/1 (since w_i \in {0, 1}) empirical loss function:

\tilde{R}(g) = \frac{1}{n} \sum_{i=1}^{n} w_i \mathbf{1}_{\{g(x_i) \neq \ell_i^0\}}, \quad \text{where } w_i = \begin{cases} 1, & \ell_i^0 \neq \ell_i^1 \\ 0, & \text{otherwise} \end{cases}

The composite classifier F(x) based on the reject and region classifiers can be written compactly as F(x) = f_{g(x)}(x). We observe several aspects of our proposed scheme:
(1) Binary partitioning is a binary classification problem on the training set (x_i, \ell_i^0), i = 1, 2, . . . , n.
(2) The 0/1 weight w_i is non-zero if and only if the classifiers disagree on x_i, i.e., f_0(x_i) \neq f_1(x_i).
(3) The partitioning error is zero on a training example x_i with weight w_i = 1 if we choose g(x_i) = 0 on examples where f_0(x_i) = y_i. In contrast, if f_0(x_i) \neq y_i the partitioning error can be reduced by choosing g(x_i) = 1, thus rejecting the example from consideration by f_0.

2.2 Surrogate Loss Functions, Algorithms and Convergence

An important implication of Lemma 2.1 is that we can now use powerful learning techniques such as decision trees, boosting and SVMs for learning space partitioning classifiers. Our method is a coordinate descent scheme which optimizes over a single variable at a time. Each step is an ERM, and so any learning method can be used at each step.
Convergence Issues: It is well known that indicator losses are hard to minimize, even when the class of classifiers, F, is nicely parameterized. Many schemes are based on minimizing surrogate losses. These surrogate losses are upper bounds on indicator losses and usually attempt to obtain large margins. Our coordinate descent scheme in this context is equivalent to describing surrogates for each step and minimizing these surrogates. This means that our scheme may not converge, let alone converge to a global minimum, even when the surrogates at each step are nice and convex. 
This is because even though each surrogate upper bounds the indicator loss at each step, when put together they do not upper bound the global objective of Eq. 1. Consequently, we need a global surrogate to ensure that the solution does converge. Loss functions are most conveniently thought of in terms of margins. For notational convenience, in this section we consider the case where the partition classifier, g, maps to labels \ell \in \{-1, 1\}, where a label of -1 and 1 indicates classification by f_0 and f_1, respectively. We seek functions \phi(z) that satisfy \mathbf{1}_{\{z \leq 0\}} \leq \phi(z). Many such surrogates can be constructed using sigmoids, exponentials, etc. Consider the classification function g(x) = \mathrm{sign}(h(x)). The empirical error can be upper bounded using \mathbf{1}_{\{\ell h(x) \leq 0\}} \leq \phi(\ell h(x)): in particular, the indicator of routing to f_0 satisfies \mathbf{1}_{\{g(x) = -1\}} \leq \phi(h(x)) and that of routing to f_1 satisfies \mathbf{1}_{\{g(x) = 1\}} \leq \phi(-h(x)). We then form a global surrogate for the empirical loss function. Approximating the indicator functions of the empirical risk/loss in Eq. 1 with surrogate functions, the global surrogate is given by:

\hat{R}(g, f_0, f_1) = \frac{1}{n} \sum_{i=1}^{n} \phi(h(x_i)) \, \phi(y_i f_0(x_i)) + \frac{1}{n} \sum_{i=1}^{n} \phi(-h(x_i)) \, \phi(y_i f_1(x_i)),   (3)

which is an upper bound on Eq. 1. Optimizing the partitioning function g(\cdot) can be posed as a supervised learning problem, resulting in the following lemma (see Supplementary for a proof):

Lemma 2.2. For fixed f_0, f_1, the problem of choosing the best binary space partition, g(\cdot), in Eq. 3 is equivalent to choosing a binary classifier h that optimizes a surrogate function \phi(\cdot) (here the training set is duplicated, so that x_{n+i} = x_i):

\hat{R}(g) = \frac{1}{2n} \sum_{i=1}^{2n} w_i \phi(h(x_i) r_i), \quad r_i = \begin{cases} 1, & i < n + 1 \\ -1, & \text{otherwise} \end{cases}, \quad w_i = \begin{cases} \phi(f_0(x_i) y_i), & i < n + 1 \\ \phi(f_1(x_i) y_i), & \text{otherwise} \end{cases}

Theorem 2.3. 
For any continuous surrogate \phi(\cdot), performing alternating minimization on the classifiers f_0, f_1, and g converges to a local minimum of Eq. 3, with a loss upper-bounding the empirical loss defined by Eq. 1.

Proof. This follows directly, as this is coordinate descent on a smooth cost function.

2.3 Multi-Region Partitioning

Lemma 2.1 can also be used to reduce multi-region space partitioning to supervised learning. We can obtain this reduction in one of several ways. One approach is to use pairwise comparisons, training classifiers to decide between pairs of regions. Unfortunately, the number of different reject classifiers then scales quadratically with the number of regions, so we instead employ a greedy partitioning scheme using a cascade classifier.

Fig. 1 illustrates a recursively learnt three-region space partitioning classifier. In general the regions are defined by a cascade of binary reject classifiers, g_k(x), k \in \{1, 2, . . . , r - 1\}, where r is the number of classification regions. Region classifiers, f_k(x), k \in \{1, 2, . . . , r\}, map observations in the associated region to labels. At stage k, if g_k(x) = 0, an observation is classified by the region classifier, f_k(x); otherwise the observation is passed to the next stage of the cascade. At the last reject classifier in the cascade, if g_{r-1}(x) = 1, the observation is passed to the final region classifier, f_r(x). 
This ensures that only r - 1 reject classifiers have to be trained for r regions.
Now define, for an arbitrary instance (x, y) and fixed {g_j}, {f_j}, the 0/1 loss function at each stage k:

L_k(x, y) = \begin{cases} \mathbf{1}_{\{g_k(x)=0\}} \mathbf{1}_{\{f_k(x) \neq y\}} + \mathbf{1}_{\{g_k(x)=1\}} L_{k+1}(x, y), & \text{if } k < r \\ \mathbf{1}_{\{f_k(x) \neq y\}}, & \text{if } k = r \end{cases}   (4)

We observe that L_k(x, y) \in {0, 1} and is equal to zero if the example is classified correctly at the current or a future stage, and one otherwise. Consequently, the aggregate 0/1 empirical risk/loss is the average loss over all training points at stage 1, namely,

R(g_1, g_2, . . . , g_{r-1}, f_1, f_2, . . . , f_r) = \frac{1}{n} \sum_{i=1}^{n} L_1(x_i, y_i)   (5)

In the expression above we have made the dependence on reject classifiers and region classifiers explicit. We minimize Eq. 5 over all g_j, f_j by means of coordinate descent, namely, to optimize g_k we hold f_j, \forall j, and g_j, j \neq k, fixed. Based on the expressions derived above, the coordinate descent steps for g_k and f_k reduce respectively to:

g_k(\cdot) = \operatorname*{argmin}_{g \in G} \frac{1}{n} \sum_{i=1}^{n} C_k(x_i) L_k(x_i, y_i), \qquad f_k(\cdot) = \operatorname*{argmin}_{f \in F} \frac{1}{n} \sum_{i=1}^{n} C_k(x_i) \mathbf{1}_{\{f(x_i) \neq y_i\}} \mathbf{1}_{\{g_k(x_i)=0\}}   (6)

where C_j(x) = \mathbf{1}_{\{\bigwedge_{i=1}^{j-1} \{g_i(x)=1\}\}} denotes whether or not an example makes it to the j-th stage. The optimization problem for f_k(\cdot) is exactly the standard 0/1 empirical loss minimization over training data that survived up to stage k. On the other hand, the optimization problem for g_k is exactly in the form where Lemma 2.1 applies. Consequently, we can also reduce this problem to a supervised learning problem:

g_k(\cdot) = \operatorname*{argmin}_{g \in G} \frac{1}{n} \sum_{i=1}^{n} w_i \mathbf{1}_{\{g(x_i) \neq \ell_i\}},   (7)

where

\ell_i = \begin{cases} 0, & \text{if } f_k(x_i) = y_i \\ 1, & \text{if } f_k(x_i) \neq y_i \end{cases}, \qquad w_i = \begin{cases} 1, & \ell_i \neq L_{k+1}(x_i, y_i) \text{ and } C_k(x_i) \neq 0 \\ 0, & \text{otherwise} \end{cases}

The composite classifier F(x) based on the reject and region classifiers can be written compactly as follows:

F(x) = f_s(x), \quad s = \min\left( \{j \mid g_j(x) = 0\} \cup \{r\} \right)   (8)

Observe that if the kth region classifier correctly classifies the example x_i, i.e., f_k(x_i) = y_i, then this encourages g_k(x_i) = 0. This is because g_k(x_i) = 1 would induce an increased cost in terms of increasing L_{k+1}(x_i, y_i). Similarly, if the kth region classifier incorrectly classifies, namely, f_k(x_i) \neq y_i, the optimization would prefer g_k(x_i) = 1. Also note that if the kth region classifier as well as all subsequent stages are incorrect on an example, then the weight on that example is zero. This is not surprising, since reject/no-reject then does not impact the global cost. We can deal with minimizing indicator losses and the resulting convergence issues by deriving a global surrogate as we did in Sec. 2.2. Pseudo-code for the proposed scheme is described in Algorithm 1.

Algorithm 1 Space Partitioning Classifier
  Input: Training data {(x_i, y_i)}_{i=1}^n, number of classification regions r
  Output: Composite classifier F(\cdot)
  Initialize: Assign points randomly to r regions
  while F not converged do
    for j = 1, 2, . . . , r do
      Train region classifier f_j(x) to optimize 0/1 empirical loss of Eq. (6).
    end for
    for k = r - 1, r - 2, . . . , 2, 1 do
      Train reject classifier g_k(x) to optimize 0/1 empirical loss of Eq. (7).
    end for
  end while

2.4 Local Linear Classification

Linear classification is a natural method for learning local decision boundaries, with the global decision regions approximated by piecewise linear functions. In local linear classification, local classifiers, f_1, f_2, . . . , f_r, and reject classifiers, g_1, g_2, . . . , g_{r-1}, are optimized over the set of linear functions. 
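For concreteness, the composite prediction rule of Eq. 8 is just a walk down the cascade. A minimal sketch (the function name and the toy three-region example are ours, not from the paper):

```python
def cascade_predict(x, rejects, regions):
    """Eq. 8: s = min({j : g_j(x) = 0} U {r}). The first reject
    classifier that outputs 0 routes x to its region classifier;
    otherwise x falls through to the final region classifier f_r."""
    for g, f in zip(rejects, regions):
        if g(x) == 0:
            return f(x)
    return regions[-1](x)

# Toy 3-region cascade on scalars: x < 0 -> 'a', 0 <= x < 1 -> 'b', x >= 1 -> 'c'.
rejects = [lambda x: 0 if x < 0 else 1,
           lambda x: 0 if x < 1 else 1]
regions = [lambda x: 'a', lambda x: 'b', lambda x: 'c']
print([cascade_predict(v, rejects, regions) for v in (-0.5, 0.5, 2.0)])  # -> ['a', 'b', 'c']
```

Note that `zip` pairs g_1 with f_1 through g_{r-1} with f_{r-1}, leaving f_r as the fall-through case, matching the r - 1 reject classifiers of the cascade.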
Local linear rules can effectively trade off bias and variance error. Bias error (empirical error) can be made arbitrarily small by approximating the decision boundary by many local linear classifiers. Variance error (classifier complexity) can be made small by restricting the number of local linear classifiers used to construct the global classifier. This idea is based on the relatively small VC-dimension of a binary local linear classifier, namely,

Theorem 2.4. The VC-dimension of the class composed (Eq. 8) of r - 1 linear classifiers g_j and r linear classifiers f_j in a d-dimensional space is bounded by 2(2r - 1) log(e(2r - 1))(d + 1).

Figure 2: Local LDA classification regions for XOR data; the black line is the reject classifier boundary.

The VC-dimension of local linear classifiers grows linearly with dimension and nearly linearly with respect to the number of regions. This is seen from Fig. 1. In practice, few regions are necessary to achieve low training error, as highly non-linear decision boundaries can be approximated well locally with linear boundaries. For example, consider the 2-D XOR data shown in Fig. 2. Learning the local linear classifier with 2 regions using LDA produces a classifier with small empirical error. In fact our empirical observation can be translated to a theorem (see Supplementary for details):

Theorem 2.5. Consider an idealized XOR, namely, samples are concentrated into four equal clusters at coordinates (-1, 1), (1, 1), (1, -1), (-1, -1) in a 2D space. Then with high probability (where the probability is with respect to the initial sampling of the reject region) a two-region composite classifier trained locally using LDA converges to zero training error.

In general, training linear classifiers on the indicator loss is impractical. Optimization of the non-convex problem is difficult and usually leads to non-unique optimal solutions. 
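The XOR setting of Theorem 2.5 is easy to check numerically. The sketch below fixes the partition by the sign of the first coordinate (a hand-picked g, rather than the paper's alternating optimization from a random initialization) and fits a textbook binary LDA rule in each region; all names and constants are ours:

```python
import numpy as np

def fit_lda(X, y):
    """Binary LDA with a shared covariance estimate; returns a predictor."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Xc = np.vstack([X[y == 0] - mu0, X[y == 1] - mu1])
    cov = Xc.T @ Xc / (len(X) - 2) + 1e-6 * np.eye(X.shape[1])  # regularized
    w = np.linalg.solve(cov, mu1 - mu0)
    b = -0.5 * (mu0 + mu1) @ w + np.log(np.mean(y == 1) / np.mean(y == 0))
    return lambda Z: (Z @ w + b > 0).astype(int)

rng = np.random.default_rng(0)
centers = np.array([[-1, -1], [1, 1], [-1, 1], [1, -1]])
labels = np.array([1, 1, 0, 0])                  # XOR: y = 1 iff x1 * x2 > 0
X = np.repeat(centers, 50, axis=0) + 0.1 * rng.standard_normal((200, 2))
y = np.repeat(labels, 50)

left = X[:, 0] < 0                               # hand-fixed partition g
f0, f1 = fit_lda(X[left], y[left]), fit_lda(X[~left], y[~left])
pred = np.where(left, f0(X), f1(X))
print(np.mean(pred != y))                        # -> 0.0 (zero training error)
```

Within each half-plane the two clusters are linearly separable along the second coordinate, so each local LDA rule classifies its region perfectly, even though no single linear classifier can fit XOR.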
Although margin-based methods such as SVMs can be used, we primarily use relatively simple schemes such as LDA, logistic regression, and the average voted perceptron in our experiments. We use each of these schemes for learning both reject and region classifiers. These schemes enjoy significant computational advantages over other schemes.

Computational Costs of LDA, Logistic Regression and Perceptron: Each LDA classifier is trained in O(nd^2) computations, where n is the number of training observations and d is the dimension of the training data. As a result, the total computational cost per iteration of the local linear classifier with LDA scales linearly with respect to the number of training samples, requiring O(nd^2 r) computations per iteration, where r is the number of classification regions. Similarly, the computational cost of training a single linear classifier by logistic regression scales as O(ncd^2) for a fixed number of iterations, with the local linear classifier training time scaling as O(rncd^2) computations per iteration, where c is the number of classes. A linear variant of the voted perceptron was implemented by taking the average of the weights generated by the unnormalized voted perceptron [15]. Training each perceptron for a fixed number of epochs is extremely efficient, requiring only O(ndc) computations. Therefore, training the local linear perceptron scales linearly with data size and dimension, with O(ndcr) computations per iteration.

3 Experimental Results

Multiclass Classification: Experiments on six datasets from the UCI repository [16] were performed using the benchmark training and test splits associated with each data set, as shown in Table 1. Confidence intervals are not possible with these results, as the predefined training and test splits were used. 
Although confidence intervals cannot be computed from multiple training/test splits, test set error bounds [17] show that with test data sets of these sizes, the difference between true error and empirical error is small with high probability. The six datasets tested were: Isolet (d=617, c=26, n=6238, T=1559), Landsat (d=36, c=7, n=4435, T=2000), Letter (d=16, c=26, n=16000, T=4000), Optdigit (d=64, c=10, n=3823, T=1797), Pendigit (d=16, c=10, n=7494, T=3498), and Shuttle (d=9, c=7, n=43500, T=14500), where d is the dimension, c the number of classes, n the training data size and T the number of test samples.

Local linear classifiers were trained with LDA, logistic regression, and the perceptron (mean of weights) used to learn local surrogates for the rejection and local classification problems. The classifiers were initialized with 5 classification regions (r = 5), with the trained classifiers often reducing to fewer classification regions due to empty rejection regions. The algorithm terminated when the rejection outputs, g_k(x), and classification labels, F(x), remained consistent on the training data for two iterations. Each classifier was randomly initialized 15 times, and the classifier with the minimum training error was chosen. Results were compared with Mixture Discriminant Analysis (MDA) [9] and classification trees trained using the Gini diversity index (GDI) [3].

Figure 3: Histogram of classes over test data for the Optdigit dataset in the different partitions generated by our approach using the linear voted perceptron.
These classification algorithms were chosen for comparison as both train global classifiers modeled as simple local classifiers, and both are computationally efficient.

For comparison to globally complex classification techniques, previously reported state of the art boosting results of Saberian and Vasconcelos [18] and Zhu et al. [19] are listed. Although the multiclass boosted classifiers were terminated early, we consider the comparison appropriate, as early termination limits the complexity of the classifiers. The improved performance of local linear learning of comparable complexity justifies approximating these boundaries by piecewise linear functions. Comparison with kernelized SVM was omitted, as SVM is rarely applied to multiclass learning on large datasets. Training each binary kernelized classifier is computationally intensive, and on weakly learnable data, boosting also allows for modeling of complex boundaries with arbitrarily small empirical error.

Table 1: Multiclass learning algorithm test errors on six UCI datasets using benchmark training and test sets. Bold indicates the best test error among the listed algorithms. One vs All AdaBoost is trained using decision stumps as weak learners. 
AdaBoost-SAMME and GD-MCBoost are trained using depth-2 decision trees as weak learners.

Algorithm                 | Isolet  | Landsat | Letter  | Optdigit | Pendigit | Shuttle
One vs All AdaBoost [2]   | 11.10%  | 16.10%  | 37.37%  | 12.24%   | 11.29%   | 0.11%
GDI Tree [3]              | 20.59%  | 14.45%  | 14.37%  | 14.58%   | 8.78%    | 0.04%
MDA [9]                   | 35.98%  | 36.45%  | 22.73%  | 9.79%    | 7.75%    | 9.59%
AdaBoost-SAMME [19]       | 39.00%  | 20.20%  | 44.35%  | 22.47%   | 16.18%   | 0.30%
GD-MCBoost [18]           | 15.72%  | 13.35%  | 40.35%  | 7.68%    | 7.06%    | 0.27%
Local Classifiers:
  LDA                     | 5.58%   | 13.95%  | 24.45%  | 5.78%    | 6.60%    | 2.67%
  Logistic Regression     | 19.95%  | 14.00%  | 13.08%  | 7.74%    | 4.75%    | 1.19%
  Perceptron              | 5.71%   | 20.15%  | 20.40%  | 4.23%    | 4.32%    | 0.32%

In 4 of the 6 datasets, local linear classification produced the lowest classification error on the test data, with test errors within 0.6% of the minimal-error methods for the remaining two datasets. There is also evidence suggesting that our scheme partitions multiclass problems into simpler subproblems. We plotted the histogram of class labels for the Optdigit dataset across the different regions using local perceptrons (Fig. 3). The histogram is not uniform across regions, implying that the reject classifiers partition easily distinguishable classes. We may interpret our approach as implicitly learning data-dependent codes for multiclass problems. This can be contrasted with many state of the art boosting techniques, such as [18], which attempt to optimize both the codewords for each class as well as the binary classification problems defining the codewords.

Figure 4: Test error for different values of label noise. Left: Wisconsin Breast Cancer data; Middle: Vertebrae data; Right: Wine data.

Robustness to Label Noise: Local linear classification trained using LDA, logistic regression, and the averaged voted perceptron was tested in the presence of random label noise. 
A randomly selected fraction of the training observations was given incorrect labels, and classifiers were trained as described for the multiclass experiments. Three datasets were chosen from the UCI repository [16]: Wisconsin Breast Cancer data, Vertebrae data, and Wine data. A training set of 100 randomly selected observations was used, with the remainder of the data used as test. For each label noise fraction, 100 randomly drawn training and test sets were used, and the average test error is shown in Fig. 4.

For comparison, results are shown for classification trees trained according to Gini's diversity index (GDI) [3], AdaBoost trained with stumps [2], and support vector machines trained on Gaussian radial basis function kernels. Local linear classification, notably when trained using LDA, is extremely robust to label noise. In comparison, boosting and classification trees show sensitivity to label noise, with the test error increasing at a faster rate than LDA-trained local linear classification on both the Wisconsin Breast Cancer data and Vertebrae data.

Acknowledgments

This research was partially supported by NSF Grant 0932114.

References

[1] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Technical Report NC-TR-1998-021, Department of Computer Science, Royal Holloway, University of London, Egham, UK, August 1998. Submitted to Machine Learning.

[2] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

[3] Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.

[4] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.

[5] Erin L.
Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, September 2001.

[6] Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 35-46, 2000.

[7] Venkatesan Guruswami and Amit Sahai. Multiclass learning, boosting, and error-correcting codes. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT '99, pages 145-155, New York, NY, USA, 1999. ACM.

[8] Yijun Sun, Sinisa Todorovic, Jian Li, and Dapeng Wu. Unifying the error-correcting and output-code AdaBoost within the margin framework. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 872-879, New York, NY, USA, 2005. ACM.

[9] Trevor Hastie and Robert Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B, 58:155-176, 1996.

[10] Tae-Kyun Kim and Josef Kittler. Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:318-327, 2005.

[11] Ofer Dekel and Ohad Shamir. There's a hole in my data space: Piecewise predictors for heterogeneous learning problems. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 15, 2012.

[12] Juan Dai, Shuicheng Yan, Xiaoou Tang, and James T. Kwok. Locally adaptive classification piloted by uncertainty. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 225-232, New York, NY, USA, 2006. ACM.

[13] Marc Toussaint and Sethu Vijayakumar.
Learning discontinuities with products-of-sigmoids for switching between local models. In Proceedings of the 22nd International Conference on Machine Learning, pages 904-911. ACM Press, 2005.

[14] Eduardo D. Sontag. VC dimension of neural networks. In Neural Networks and Machine Learning, pages 69-95. Springer, 1998.

[15] Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37:277-296, 1999.

[16] A. Frank and A. Asuncion. UCI machine learning repository, 2010.

[17] J. Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6(1):273, 2006.

[18] Mohammad J. Saberian and Nuno Vasconcelos. Multiclass boosting: Theory and algorithms. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2124-2132. 2011.

[19] Ji Zhu, Hui Zou, Saharon Rosset, and Trevor Hastie. Multi-class AdaBoost, 2009.