{"title": "Partition-wise Linear Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3527, "page_last": 3535, "abstract": "Region-specific linear models are widely used in practical applications because of their non-linear but highly interpretable model representations. One of the key challenges in their use is non-convexity in simultaneous optimization of regions and region-specific models. This paper proposes novel convex region-specific linear models, which we refer to as partition-wise linear models. Our key ideas are 1) assigning linear models not to regions but to partitions (region-specifiers) and representing region-specific linear models by linear combinations of partition-specific models, and 2) optimizing regions via partition selection from a large number of given partition candidates by means of convex structured regularizations. In addition to providing initialization-free globally-optimal solutions, our convex formulation makes it possible to derive a generalization bound and to use such advanced optimization techniques as proximal methods and decomposition of the proximal maps for sparsity-inducing regularizations. Experimental results demonstrate that our partition-wise linear models perform better than or are at least competitive with state-of-the-art region-specific or locally linear models.", "full_text": "Partition-wise Linear Models

Hidekazu Oiwa*
Graduate School of Information Science and Technology
The University of Tokyo
hidekazu.oiwa@gmail.com

Ryohei Fujimaki
NEC Laboratories America
rfujimaki@nec-labs.com

Abstract

Region-specific linear models are widely used in practical applications because of their non-linear but highly interpretable model representations. One of the key challenges in their use is non-convexity in simultaneous optimization of regions and region-specific models.
This paper proposes novel convex region-specific linear models, which we refer to as partition-wise linear models. Our key ideas are 1) assigning linear models not to regions but to partitions (region-specifiers) and representing region-specific linear models by linear combinations of partition-specific models, and 2) optimizing regions via partition selection from a large number of given partition candidates by means of convex structured regularizations. In addition to providing initialization-free globally-optimal solutions, our convex formulation makes it possible to derive a generalization bound and to use such advanced optimization techniques as proximal methods and decomposition of the proximal maps for sparsity-inducing regularizations. Experimental results demonstrate that our partition-wise linear models perform better than or are at least competitive with state-of-the-art region-specific or locally linear models.

1 Introduction
Among pre-processing methods, data partitioning is one of the most fundamental. In it, an input space is divided into several sub-spaces (regions), and a simple model is assigned to each region. In addition to better predictive performance resulting from the non-linear nature that arises from multiple partitions, the regional structure provides a better understanding of data (i.e., interpretability).
Region-specific linear models learn both partitioning structures and predictors in each region. Such models vary, from traditional decision/regression trees [1] to more advanced models [2, 3, 4], depending on their region-specifiers (how they characterize regions), their region-specific prediction models, and the objective functions to be optimized. One important challenge that remains in learning these models is the non-convexity that arises from the inter-dependency of optimizing regions and prediction models in individual regions.
Most previous work suffers from disadvantages arising from non-convexity, including initialization-dependency (bad local minima) and lack of generalization error analysis.
We propose convex region-specific linear models, which are referred to as partition-wise linear models. Our models have two distinguishing characteristics that help avoid the non-convexity problem.

Partition-wise Modeling We propose partition-wise linear models as a novel class of region-specific linear models. Our models divide an input space by means of a small set of partitions1. Each partition possesses one weight vector, and this weight vector is applied only to one side of the divided space. It is trained to represent the local relationship between input vectors and output values. Region-specific predictors are constructed by linear combinations of these weight vectors. Our partition-wise parameterization enables us to construct convex objective functions.
Convex Optimization via Sparse Partition Selection We optimize regions by selecting effective partitions from a large number of given candidates, using convex sparsity-inducing structured regularizations. In other words, we trade continuous region optimization for convexity. We allow partitions to be located only at given discrete candidate positions, and are thereby able to derive convex optimization problems.

*The work reported here was conducted when the first author was a visiting researcher at NEC Laboratories America.
1In our paper, a region is a sub-space in an input space. Multiple regions do not intersect each other, and, in their entirety, they cover the whole input space. A partition is an indicator function that divides an input space into two parts.
We have developed an ef\ufb01cient algorithm to solve structured-sparse optimization\nproblems, and in it we adopt a proximal method [5, 6] and the decomposition of proximal maps [7].\nAs a reliable partition-wise linear model, we have developed a global and local residual model that\ncombines one global linear model and a set of partition-wise linear ones. Further, our theoretical\nanalysis gives a generalization bound for this model to evaluate the risk of over-\ufb01tting. Our general-\nization bound analysis indicates that we can increase the number of partition candidates by less than\nan exponential order with respect to the sample size, which is large enough to achieve good pre-\ndictive performance in practice. Experimental results have demonstrated that our models perform\nbetter than or are at least competitive with state-of-the-art region-speci\ufb01c or locally linear models.\n\n1.1 Related Work\nRegion-speci\ufb01c linear models and locally linear models are the most closely related models to our\nown. The former category, to which our models belong, assumes one predictor in a speci\ufb01c region\nand has an advantage in clear model interpretability, while the latter assigns one predictor to every\nsingle datum and has an advantage in higher model \ufb02exibility.\nInterpretable models are able to\nindicate clearly where and how the relationships between inputs and outputs change.\nWell-known precursors to region-speci\ufb01c linear models are decision/regression trees [1], which use\nrule-based region-speci\ufb01ers and constant-valued predictors. 
Another traditional framework is a hier-\narchical mixture of experts [8], which is a probabilistic tree-based region-speci\ufb01c model framework.\nRecently, Local Supervised Learning through Space Partitioning (LSL-SP) has been proposed [3].\nLSL-SP utilizes a linear-chain of linear region-speci\ufb01ers as well as region-speci\ufb01c linear predictors.\nThe highly important advantage of LSL-SP is the upper bound of generalization error analysis via\nthe VC dimension. Additionally, a Cost-Sensitive Tree of Classi\ufb01ers (CSTC) algorithm has also\nbeen developed [4]. It utilizes a tree-based linear localizer and linear predictors. This algorithm\u2019s\nuniqueness among other region-speci\ufb01c linear models is in its taking \u201cfeature utilization cost\u201d into\naccount for test time speed-up. Although the developers\u2019 formulation with sparsity-inducing struc-\ntured regularizations is, in a way, related to ours, their model representations and, more importantly,\ntheir motivation (test time speed-up) is different from ours.\nFast Local Kernel Support Vector Machines (FaLK-SVMs) represent state-of-the-art locally linear\nmodels. FaLK-SVMs produce test-point-speci\ufb01c weight vectors by learning local predictive models\nfrom the neighborhoods of individual test points [9]. It aims to reduce prediction time cost by pre-\nprocessing for nearest-neighbor calculations and local model sharing, at the cost of initialization-\nindependency. Another advanced locally linear model is that of Locally Linear Support Vector\nMachines (LLSVMs) [10]. LLSVMs assign linear SVMs to multiple anchor points produced by\nmanifold learning [11, 12] and construct test-point-speci\ufb01c linear predictors according to the weights\nof anchor points with respect to individual test points. When the manifold learning procedure is\ninitialization-independent, LLSVMs become initial-value-independent because of the convexity of\nthe optimization problem. 
Similarly, clustered SVMs (CSVMs) [13] assume given data clusters and learn multiple SVMs for the individual clusters simultaneously. Although CSVMs are convex and a generalization bound analysis has been provided, they cannot optimize regions (clusters).
Jose et al. have proposed Local Deep Kernel Learning (LDKL) [2], which adopts an intermediate approach with respect to region-specific and locally linear models. LDKL is a tree-based local kernel classifier in which the kernel defines regions and can be seen as performing region-specification. One main difference from common region-specific linear models is that LDKL changes kernel combination weights for individual test points, and the predictors are locally determined in every single region. Its aim is to speed up kernel SVMs' prediction while maintaining their non-linear ability.
Table 1 summarizes the above-described state-of-the-art models in contrast with ours from a number of significant perspectives. Our proposed model uniquely exhibits three properties: joint optimization of regions and region-specific predictors, initialization-independent optimization, and a meaningful generalization bound.

Table 1: Comparison of region-specific and locally linear models.

                           | Ours        | LSL-SP | CSTC   | LDKL   | FaLK-SVM     | LLSVM
Region Optimization        | √           | √      | √      | √      |              |
Initialization-independent | √           |        |        |        | √            | √
Generalization Bound       | √           | √      |        |        |              |
Region Specifiers          | (Sec. 2.2)  | Linear | Linear | Linear | Non-Regional | Non-Regional

1.2 Notations
Scalars and vectors are denoted by lower-case x. Matrices are denoted by upper-case X. The n-th training sample and label are denoted by x_n ∈ R^D and y_n, respectively.
2 Partition-wise Linear Models
This section explains partition-wise linear models under the assumption that effective partitioning is already fixed.
We discuss how to optimize partitions and region-specific linear models in Section 3.

2.1 Framework
Figure 1 illustrates the concept of partition-wise linear models. Suppose we have P partitions (red dashed lines), which essentially specify 2^P regions. Partition-wise linear models are defined as follows. First, we assign a linear weight vector a_p to the p-th partition. This partition has an activeness function, f_p, which indicates whether the attached weight vector a_p is applied to individual data points or not. For example, in Figure 1, we set the weight vector a_1 to be applied to the right-hand side of partition p_1. In this case, the corresponding activeness function is defined as f_1(x) = 1 when x is in the right-hand side of p_1. Second, region-specific predictors (squared regions surrounded by partitions in Figure 1) are defined by linear combinations of the active partition-wise weight vectors, which are themselves linear models.

[Figure 1: Concept of Partition-wise Linear Models]

Let us formally define the partition-wise linear models. We have a set of given activeness functions, f_1, ..., f_P, which is denoted in vector form as f(·) = (f_1(·), ..., f_P(·))^T. The p-th element f_p(x) ∈ {0, 1} indicates whether the attached weight vector a_p is applied to x or not. The activeness function f(·) can represent at most 2^P regions, and f(x) specifies to which region x belongs. A linear model of an individual region is then represented as \sum_{p=1}^P f_p(·) a_p. It is worth noting that partition-wise linear models use P linear weight vectors to represent 2^P regions and thereby restrict the number of parameters.
The overall predictor g(·) can be denoted as follows:

g(x) = \sum_p f_p(x) \sum_d a_{dp} x_d.   (1)

Let us define A as A = (a_1, ..., a_P). The partition-wise linear model (1) simply acts as a linear model w.r.t.
A while it captures the non-linear nature of data (individual regions use different linear models). Such non-linearity originates from the activeness functions f_p, which are fundamentally important components of our models.
By introducing a convex loss function ℓ(·,·) (e.g., squared loss for regression, squared hinge or logistic loss for classification), we can represent an objective function of the partition-wise linear models as a convex loss minimization problem as follows:

min_A \sum_n ℓ(y_n, g(x_n)) = min_A \sum_n ℓ(y_n, \sum_p f_p(x_n) \sum_d a_{dp} x_{nd}).   (2)

Here we give a convex formulation of region-specific linear models under the assumption that a set of partitions is given. In Section 3, we propose a convex optimization algorithm for partitions and regions as a partition selection problem, using sparsity-inducing structured regularizations.

2.2 Partition Activeness Functions
A partition activeness function f_p divides the input space into two regions, and a set of activeness functions defines the entire region-structure. Although in principle any function is applicable as a partition activeness function, we prefer as simple a region representation as possible because of our practical motivation for utilizing region-specific linear models (i.e., interpretability is a priority). This paper restricts them to being parallel to the coordinates, e.g., f_p(x) = 1 (x_i > 2.5) and f_p(x) = 0 (otherwise) with respect to the i-th coordinate.
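As a concrete illustration, the axis-parallel activeness functions and the predictor (1) can be sketched in a few lines of NumPy (a minimal sketch of our own; the function names and toy weights below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def make_activeness(feature_index, threshold):
    """Axis-parallel partition activeness function:
    f_p(x) = 1 if x[feature_index] > threshold, else 0."""
    return lambda x: 1.0 if x[feature_index] > threshold else 0.0

def predict(x, A, activeness):
    """Partition-wise linear predictor g(x) = sum_p f_p(x) * (a_p . x).
    A is a (P, D) matrix whose rows are the weight vectors a_p;
    activeness is a list of the P activeness functions f_p."""
    f = np.array([fp(x) for fp in activeness])  # binary activeness vector
    return float(f @ (A @ x))

# Two partitions over a 2-D input space (toy weights).
activeness = [make_activeness(0, 0.0), make_activeness(1, 0.0)]
A = np.array([[1.0, 0.0],    # a_1, applied where x_1 > 0
              [0.0, -1.0]])  # a_2, applied where x_2 > 0
g = predict(np.array([0.5, 0.5]), A, activeness)  # both partitions active here
```

Prepending a row a_0 together with an always-active f_0(x) = 1 yields the global and local residual model described in Section 2.3.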
Although this "rule-representation" is simpler than others [2, 3], which use dense linear hyperplanes as region-specifiers, our empirical evaluation (Section 5) indicates that our models perform competitively with or even better than those others by appropriately optimizing the simple region-specifiers (partition activeness functions).
2.3 Global and Local Residual Model
As a special instance of partition-wise linear models, we here propose a model which we refer to as a global and local residual model. It employs a global linear weight vector a_0 in addition to the partition-wise linear weights. The predictor model (1) can be rewritten as:

g(x) = a_0^T x + \sum_p f_p(x) \sum_d a_{dp} x_d.   (3)

The global weight vector is active for all data. The integration of the global weight vector enables the model to determine how features affect outputs not only locally but also globally. Let us consider a new partition activeness function f_0(x) that always returns 1 regardless of x. Then, by setting f(·) = (f_0(·), f_1(·), ..., f_P(·))^T and A = (a_0, a_1, ..., a_P), the global and local residual model can be represented using the same notation as in Section 2.1. Although a_0 and a_p have no fundamental difference here, they differ in terms of how we regularize them (Section 3.1).
3 Convex Optimization of Regions and Predictors
In Section 2, we presented a convex formulation of partition-wise linear models in (2) under the assumption that a set of partition activeness functions was given. This section relaxes this assumption and proposes a convex partition optimization algorithm.
3.1 Region Optimization as Sparse Partition Selection
Let us assume that we have been given P + 1 partition activeness functions, f_0, f_1, ..., f_P, and their attached linear weight vectors, a_0, a_1, . . .
, a_P, where f_0 and a_0 are the global activeness function and weight vector, respectively. We formulate the region optimization problem here as partition selection, setting most of the a_p to zero, since a_p = 0 corresponds to the situation in which the p-th partition does not exist.
Formally, we formulate our optimization problem with respect to regions and weight vectors by introducing two types of sparsity-inducing constraints into (2) as follows:

min_A \sum_n ℓ(y_n, g(x_n))  s.t.  \sum_{p \in \{1,...,P\}} 1_{\{a_p \neq 0\}} \leq \mu_P,  \|a_p\|_0 \leq \mu_0 \ \forall p.   (4)

The former constraint restricts the number of effective partitions to at most μ_P. Note that we do not enforce this sparse partition constraint on the global model a_0, so as to be able to determine local trends as residuals from a global trend. The latter constraint restricts the number of effective features of a_p to at most μ_0. We add this constraint because 1) it is natural to assume that only a small number of features are locally effective in practical applications, and 2) a sparser model is typically preferred for our purposes because of its better interpretability.
3.2 Convex Optimization via Decomposition of Proximal Maps
3.2.1 The Tightest Convex Envelope
The constraints in (4) are non-convex, and it is very hard to find the global optimum due to the indicator functions and L0 penalties. This makes optimization over a non-convex region a very complicated task, and we therefore apply a convex relaxation. One standard approach to convex relaxation would be a combination of group L1 (for the first constraint) and L1 (for the second constraint) penalties.
Here, however, we consider the tightest convex relaxation of (4) as follows:

min_A \sum_n ℓ(y_n, g(x_n))  s.t.  \sum_{p=1}^P \|a_p\|_\infty \leq \mu_P,  \sum_{d=1}^D |a_{dp}| \leq \mu_0 \ \forall p.   (5)

The tightness of (5) is shown in the full version [14]. Through such a convex envelope of the constraints, the feasible region becomes convex. Therefore, we can reformulate (5) as the following equivalent problem:

min_A \sum_n ℓ(y_n, g(x_n)) + \Omega(A)  where  \Omega(A) = \lambda_P \sum_{p=1}^P \|a_p\|_\infty + \lambda_0 \sum_{p=0}^P \sum_{d=1}^D |a_{dp}|,   (6)

where λ_P and λ_0 are regularization weights corresponding to μ_P and μ_0, respectively. We derive an efficient optimization algorithm using a proximal method and the decomposition of proximal maps.
3.2.2 Proximal Method and FISTA
The proximal method is a standard efficient tool for solving convex optimization problems with non-differentiable regularizers. It iteratively applies gradient steps and proximal steps to update parameters. This achieves O(1/t) convergence [5] under Lipschitz-continuity of the loss gradient, or even O(1/t^2) convergence if an acceleration technique, such as the fast iterative shrinkage thresholding algorithm (FISTA) [6, 15], is incorporated.
Let us define A^{(t)} as the weight matrix at the t-th iteration. In the gradient step, the weight vectors are updated to decrease the empirical loss through the first-order approximation of the loss functions as:

A^{(t+1/2)} = A^{(t)} - \eta^{(t)} \sum_n \partial_{A^{(t)}} ℓ(y_n, g(x_n)),   (7)

where η^{(t)} is a step size and \partial_{A^{(t)}} ℓ(·,·) is the gradient of the loss functions evaluated at A^{(t)}.
In the proximal step, we apply regularization to the current solution A^{(t+1/2)} as follows:

A^{(t+1)} = M_0(A^{(t+1/2)})  where  M_0(B) = \arg\min_A \frac{1}{2} \|A - B\|_F^2 + \eta^{(t)} \Omega(A),   (8)

where ‖·‖_F is the Frobenius norm. Furthermore, we employed FISTA [6] to achieve the faster convergence rate for this weakly convex problem and adopted a backtracking rule [6] to avoid the difficulty of calculating appropriate step widths beforehand. Through empirical evaluations as well as theoretical considerations, we have confirmed that it significantly improves convergence in learning partition-wise linear models. Details are given in the full version [14].
3.2.3 Decomposition of the Proximal Map
The computational cost of the proximal method depends strongly on the efficiency of solving the proximal step (8). A number of approaches have been developed for improving efficiency, including the minimum-norm-point approach [16] and the networkflow approach [17, 18]. Their computational efficiencies depend strongly on feature and partition size2, however, which makes them inappropriate for our formulation because of potentially large feature and partition sizes.
Alternatively, this paper employs the decomposition of proximal maps [7]. The key idea here is to decompose the proximal step into a sequence of sub-problems that are easily solvable.
We first introduce two easily-solvable proximal maps as follows:

M_1(B) = \arg\min_A \frac{1}{2} \|A - B\|_F^2 + \eta^{(t)} \lambda_P \sum_{p=1}^P \|a_p\|_\infty,   (9)

M_2(B) = \arg\min_A \frac{1}{2} \|A - B\|_F^2 + \eta^{(t)} \lambda_0 \sum_{p=0}^P \sum_{d=1}^D |a_{dp}|.   (10)

The theorem below guarantees that the decomposition of the proximal map (8) can be performed. The proof is provided in the full version.

Theorem 1 The original problem (8) can be decomposed into a sequence of two easily solvable proximal map problems as follows:

A^{(t+1)} = M_0(A^{(t+1/2)}) = M_2(M_1(A^{(t+1/2)})).   (11)

2For example, the fastest algorithm for the networkflow approach has O(M(B+1) log(M^2/(B+1))) time complexity, where B is the number of breakpoints determined by the structure of the graph (B ≤ D(P+1) = O(DP)) and M is the number of nodes, that is, P + D(P+1) = O(DP) [17]. Therefore, the worst-case computational complexity is O(D^2 P^2 log DP).

The first proximal map (9) is the proximal operator with respect to the L1,∞-regularization. This problem can be decomposed into group-wise sub-problems. Each proximal operator with respect to each group can be computed through a projection onto an L1-norm ball (derived from the Moreau decomposition [16]), that is, a_p = b_p - \arg\min_c \|c - b_p\|_2 s.t. \|c\|_1 \leq \eta^{(t)} \lambda_P. This projection problem can be efficiently solved [19].
The second proximal map (10) is a well-known proximal operator with respect to L1-regularization. This problem can be decomposed into element-wise sub-problems, and its solution is given in closed form by a_{dp} = \mathrm{sgn}(b_{dp}) \max(0, |b_{dp}| - \eta^{(t)} \lambda_0).
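For concreteness, the two proximal maps and their composition can be sketched as follows (a NumPy sketch of our own, not the authors' implementation; it assumes the weights are stored as a (P+1, D) matrix whose first row is the global weight vector, and uses the standard sorting-based L1-ball projection):

```python
import numpy as np

def project_l1_ball(v, z):
    """Euclidean projection of vector v onto the L1 ball of radius z
    (standard sorting-based algorithm)."""
    if np.abs(v).sum() <= z:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]            # sorted magnitudes, descending
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > cssv - z)[0][-1]
    theta = (cssv[rho] - z) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_linf(b, tau):
    """Proximal operator of tau * ||.||_inf via the Moreau decomposition:
    prox(b) = b - (projection of b onto the L1 ball of radius tau)."""
    return b - project_l1_ball(b, tau)

def soft_threshold(B, tau):
    """Element-wise proximal operator of tau * ||.||_1 (the map M2)."""
    return np.sign(B) * np.maximum(np.abs(B) - tau, 0.0)

def proximal_step(B, eta, lam_P, lam_0):
    """Decomposed proximal map M0 = M2 . M1 for the regularizer (6).
    B is a (P+1, D) matrix whose row 0 holds the global weights a_0,
    which the group-sparsity map M1 leaves untouched."""
    A = B.copy()
    for p in range(1, A.shape[0]):          # M1: group-wise L_inf prox
        A[p] = prox_linf(A[p], eta * lam_P)
    return soft_threshold(A, eta * lam_0)   # M2: element-wise shrinkage
```

Iterating the gradient step (7) followed by `proximal_step` drives entire rows a_p (inactive partitions) and individual entries a_dp (inactive features) to exactly zero.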
These two sub-problems can be easily solved; we can therefore easily obtain the solution of the original proximal map (8).

The computational complexity of partition-wise linear models is O(NP + \hat{P}D + PD \log D), where \hat{P} is the number of active partitions. The procedure for deriving the computational complexity, the implementation that speeds up the optimization through warm starts, and a summary of the iterative update procedure are given in the full version.
4 Generalization Bound Analysis
This section presents the derivation of a generalization error bound for partition-wise linear models and discusses how far we can increase the number of partition candidates P with respect to the number of samples N. Our bound analysis is related to that of [20], which gives bounds for general overlapping group Lasso cases, while ours is specifically designed for partition-wise linear models.
Let us first derive an empirical Rademacher complexity [21] for a feasible weight space conditioned on the value of the regularization term in (6). We can derive the Rademacher complexity for our model using the Lemma below. Its proof is shown in the full version, and this result is used to analyze the expected loss bound.

Lemma 1 If Ω(A) ≤ 1 is satisfied and if almost surely ‖x‖_∞ ≤ 1 with respect to x ∈ X, the empirical Rademacher complexity for partition-wise linear models can be bounded as:

ℜ_A(X) \leq \frac{2^{3/2}}{\sqrt{N}} \left( 2 + \sqrt{\ln(P + D(P+1))} \right).   (12)

The next theorem shows the generalization bound of the global and local residual model. This bound is straightforwardly derived from Lemma 1 and the discussion of [21]. In [21], it has been shown that a uniform bound on the estimation error can be obtained through the upper bound of the Rademacher complexity derived in Lemma 1.
By using the uniform bound, the generalization bound of the global and local residual model defined in formula (6) can be derived.

Theorem 2 Let us define the set of weights that satisfies Ω_group(A) ≤ 1 as A, where Ω_group(A) is as defined in Section 2.5 of [20]. Let each datum (x_n, y_n) be i.i.d. sampled from a specific data distribution D, and let us assume the loss function ℓ(·,·) to be an L-Lipschitz function with respect to a norm ‖·‖ and its range to be within [0, 1]. Then, for any constant δ ∈ (0, 1) and any A ∈ A, the following inequality holds with probability at least 1 − δ:

E_{(x,y) \sim D}[ℓ(y, g(x))] \leq \frac{1}{N} \sum_{n=1}^N ℓ(y_n, g(x_n)) + ℜ_A(X) + \sqrt{\frac{\ln(1/\delta)}{2N}}.   (13)

This theorem implies how far we can increase the number of partition candidates. The third term on the right-hand side is obviously small if N is large. The second term converges to zero as N → ∞ if the value of P is smaller than o(e^N), which is sufficiently large in practice. In summary, we can expect to handle a sufficient number of partition candidates for learning with little risk of over-fitting.
5 Experiments
We conducted two types of experiments: 1) evaluation of how partition-wise linear models perform, on the basis of a simple synthetic dataset, and 2) comparisons with state-of-the-art region-specific and locally linear models on the basis of standard classification and regression benchmark datasets.
5.1 Demonstration using Synthetic Dataset
We generated a synthetic binary classification dataset as follows. The x_n were uniformly sampled from a 20-dimensional input space in which each dimension had values between [−1, 1].
The target variables were determined using the XOR rule over the first and second features (the other 18 features were added as noise for prediction purposes); i.e., if the signs of the first feature value and the second feature value are the same, y = 1; otherwise y = −1. This is well known as a case in which linear models do not work. For example, L1-regularized logistic regression produced nearly random outputs, where the error rate was 0.421.
We generated one partition for each feature except for the first feature. Each partition became active if the corresponding feature value was greater than 0.0. Therefore, the number of candidate partitions was 19. We used the logistic regression function for the loss function. Hyper-parameters3 were set as λ_0 = 0.01 and λ_P = 0.001. The algorithm was run for 1,000 iterations.

[Figure 2: How the global and local residual model classifies XOR data. The red line indicates the effective partition; green lines indicate local predictors; red circles indicate samples with y = −1; blue circles indicate samples with y = 1. This model classified the XOR data precisely.]

Figure 2 illustrates the results produced by the global and local residual model. The left-hand figure illustrates the learned effective partition (red line) to which the weight vector a_1 = (10.96, 0.0, ···) was assigned. This weight a_1 was applied only to the region above the red line. By combining a_1 and the global weight a_0, we obtained the piece-wise linear representation shown in the right-hand figure.
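This behavior can be verified directly; the sketch below (our own, using the non-zero weight components reported in Figure 2, with all other components zero) checks that the combination of the global weight a_0 and the partition-wise weight a_1 separates the XOR data exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 20
X = rng.uniform(-1.0, 1.0, size=(N, D))
# XOR rule on the first two features: y = 1 iff their signs agree.
y = np.where(np.sign(X[:, 0]) == np.sign(X[:, 1]), 1, -1)

# Weights as reported in Figure 2 (all other components are zero).
a0 = np.zeros(D); a0[0] = -4.37     # global weight vector
a1 = np.zeros(D); a1[0] = 10.96     # partition-wise weight vector
f1 = (X[:, 1] > 0.0).astype(float)  # partition active where x_2 > 0

# Global and local residual predictor: g(x) = a0.x + f1(x) * (a1.x)
g = X @ a0 + f1 * (X @ a1)
accuracy = float(np.mean(np.sign(g) == y))  # 1.0: XOR separated exactly
```

Above the partition (x_2 > 0), the effective slope on x_1 is −4.37 + 10.96 = 6.59, so the prediction follows the sign of x_1; below it, the slope is −4.37, so the sign flips, which is exactly the XOR rule.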
While it is still difficult for existing piece-wise linear methods to capture global structures4, our convex formulation makes it possible for the global and local residual model to easily capture the global XOR structure.
5.2 Comparisons using Benchmark Datasets
We next used benchmark datasets to compare our models with other state-of-the-art region-specific ones. In these experiments, we simply generated partition candidates (activeness functions) as follows. For continuous-valued features, we calculated all 5-quantiles for each feature and generated partitions at each quantile point. Partitions became active if a feature value was greater than the corresponding quantile value. For binary categorical features, we generated two partitions, one of which became active when the feature value was 1 (yes), while the other became active only when the feature value was 0 (no).
We utilized several standard benchmark datasets from UCI datasets (skin, winequality, census income, twitter, a1a, internet ad, energy heat, energy cool, communities), libsvm datasets (a1a, breast cancer), and LIACC datasets (abalone, kinematics, puma8NH, bank8FM). Table 2 summarizes the specifications for each dataset.
5.2.1 Classification
For classification, we compared the global and local residual model (Global/Local) with L1 logistic regression (Linear), LSL-SP with linear discrimination analysis5, LDKL supported by L2-regularized hinge loss6, FaLK-SVM with linear kernels7, and C-SVM with RBF kernel8. Note that C-SVM is neither a region-specific nor a locally linear classification model; it is, rather, non-linear. We compared it with ours as a reference with respect to a common non-linear classification model.

Table 2: Classification and regression datasets. N is the size of the data. D is the number of dimensions. P is the number of partitions.
CL/RG denotes the type of dataset (CL: Classification / RG: Regression).

Dataset        | N       | D     | P     | CL/RG
skin           | 245,057 | 3     | 12    | CL
winequality    | 6,497   | 11    | 44    | CL
census income  | 45,222  | 105   | 99    | CL
twitter        | 140,707 | 11    | 44    | CL
a1a            | 1,605   | 113   | 452   | CL
breast-cancer  | 683     | 10    | 40    | CL
internet ad    | 2,359   | 1,558 | 1,559 | CL
energy heat    | 768     | 8     | 32    | RG
energy cool    | 768     | 8     | 32    | RG
abalone        | 4,177   | 10    | 40    | RG
kinematics     | 8,192   | 8     | 32    | RG
puma8NH        | 8,192   | 8     | 32    | RG
bank8FM        | 8,192   | 8     | 32    | RG
communities    | 1,994   | 101   | 404   | RG

3We conducted several experiments on other hyper-parameter settings and confirmed that variations in hyper-parameter settings did not significantly affect results.
4For example, a decision tree cannot be used to find a "true" XOR structure, since marginal distributions on the first and second features cannot discriminate between the positive and negative classes.
5The source code is provided by the author of [3].
6https://research.microsoft.com/en-us/um/people/manik/code/LDKL/download.html
7http://disi.unitn.it/~segata/FaLKM-lib/
8We used a libsvm package. http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Table 3: Classification results: error rate (standard deviation).
The best performance figure for each dataset is denoted in bold typeface, and the second best is denoted in bold italic.

Dataset       | Linear         | LSL-SP         | LDKL            | FaLK-SVM       | Global/Local   | RBF-SVM
skin          | 8.900 (0.174)  | 12.481 (8.729) | 1.858 (1.012)   | 0.040 (0.016)  | 0.249 (0.048)  | 0.229 (0.029)
winequality   | 33.667 (1.988) | 30.878 (1.783) | 36.795 (3.198)  | 28.706 (1.298) | 23.713 (1.202) | 23.898 (1.744)
census income | 43.972 (0.404) | 35.405 (1.179) | 47.229 (2.053)  | --             | 35.697 (0.453) | 45.843 (0.772)
twitter       | 6.964 (0.164)  | 8.370 (0.245)  | 15.557 (11.393) | 4.135 (0.149)  | 4.231 (0.090)  | 9.109 (0.160)
a1a           | 16.563 (2.916) | 20.438 (2.717) | 17.063 (1.855)  | 18.125 (1.398) | 16.250 (2.219) | 16.500 (1.346)
breast-cancer | 35.000 (4.402) | 3.677 (2.110)  | 35.000 (4.402)  | 3.362 (0.997)  | 3.529 (1.883)  | 33.824 (4.313)
internet ad   | 7.319 (1.302)  | 6.383 (1.118)  | 13.064 (3.601)  | --             | 2.638 (1.003)  | 3.447 (0.772)

Table 4: Regression results: root mean squared loss (standard deviation). The best performance figure for each dataset is denoted in bold typeface, and the second best is denoted in bold italic.

Dataset     | Linear        | Global/Local  | RegTree       | RBF-SVR
energy heat | 0.480 (0.047) | 0.101 (0.014) | 0.050 (0.005) | 0.219 (0.017)
energy cool | 0.501 (0.044) | 0.175 (0.018) | 0.200 (0.018) | 0.221 (0.026)
abalone     | 0.687 (0.024) | 0.659 (0.023) | 0.727 (0.028) | 0.713 (0.025)
kinematics  | 0.766 (0.019) | 0.634 (0.022) | 0.732 (0.031) | 0.347 (0.010)
puma8NH     | 0.793 (0.023) | 0.601 (0.017) | 0.612 (0.024) | 0.571 (0.020)
bank8FM     | 0.255 (0.012) | 0.218 (0.009) | 0.254 (0.008) | 0.202 (0.007)
communities | 0.586 (0.049) | 0.578 (0.040) | 0.653 (0.060) | 0.618 (0.053)

For our models, we used logistic functions as the loss functions. The maximum number of iterations was set to 1000, and the algorithm stopped early when the gap in the empirical loss from the previous iteration stayed below 10^-9 for 10 consecutive iterations.
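As a concrete illustration, the partition-candidate (activeness-function) generation described at the beginning of this section can be sketched as below. This is our own minimal Python sketch, not the authors' implementation; in particular, the function name and the reading of "all 5-quantiles" as the four interior quantile points (20%, 40%, 60%, 80%) are assumptions.

```python
import numpy as np

def partition_candidates(X, feature_types):
    """Build activeness functions (partition candidates) from a data matrix.

    For a continuous feature, a threshold is placed at each interior
    5-quantile point (20%, 40%, 60%, 80%); a partition is active when the
    feature value exceeds the threshold.  For a binary feature, two
    complementary indicator partitions are generated (active when the
    value is 1, and active when the value is 0).
    """
    candidates = []  # list of functions mapping X -> {0, 1} activeness vector
    for j, kind in enumerate(feature_types):
        if kind == "continuous":
            for q in (0.2, 0.4, 0.6, 0.8):
                t = np.quantile(X[:, j], q)
                # bind j and t via default arguments so each closure keeps
                # its own threshold
                candidates.append(lambda Z, j=j, t=t: (Z[:, j] > t).astype(int))
        else:  # binary categorical feature
            candidates.append(lambda Z, j=j: (Z[:, j] == 1).astype(int))
            candidates.append(lambda Z, j=j: (Z[:, j] == 0).astype(int))
    return candidates
```

Under this reading, a dataset with D continuous features yields P = 4D candidates plus two per binary feature, which matches most rows of Table 2 (e.g., kinematics: D = 8, P = 32).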
Hyperparameters9 were optimized through 10-fold cross validation. We fixed the number of regions to 10 in LSL-SP, the tree depth to 3 in LDKL, and the neighborhood size to 100 in FaLK-SVM.

Table 3 summarizes the classification errors. We observed that: 1) Global/Local consistently performed well and achieved the best error rates on four of the seven datasets. 2) LSL-SP performed well for census income and breast-cancer but did significantly worse than Linear for skin, twitter, and a1a. Similarly, LDKL performed worse than Linear for census income, twitter, a1a, and internet ad. This arose partly from overfitting and partly from bad local minima. Particularly noteworthy is that the standard deviations for LDKL were much larger than those for the other methods, and the initialization issue would seem to be significant in practice. 3) FaLK-SVM performed well in most cases, but its computational cost was significantly higher than that of the others, and it was unable to obtain results for census income and internet ad (we stopped the algorithm after 24 hours of running).

5.2.2 Regression

For regression, we compared Global/Local with Linear, a regression tree10 built by CART (RegTree) [1], and epsilon-SVR with an RBF kernel11. Target variables were standardized so that their mean was 0 and their variance was 1. Performance was evaluated using the root mean squared loss on the test data. The tree depth of RegTree and ε in RBF-SVR were determined by means of 10-fold cross validation. Other experimental settings were the same as those used in the classification tasks.

Table 4 summarizes the RMSE values. As in the classification tasks, Global/Local consistently performed well.
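The regression evaluation protocol above (targets standardized to zero mean and unit variance, root mean squared loss on held-out data) amounts to the following sketch. The function name and the use of training-set statistics for standardization are our assumptions, not details given in the paper.

```python
import numpy as np

def standardized_rmse(y_train, y_test, y_pred):
    """RMSE on test targets after standardizing with training-set statistics.

    Standardizing targets to zero mean and unit variance makes RMSE values
    comparable across datasets with different target scales.
    """
    mu, sigma = y_train.mean(), y_train.std()
    z_test = (y_test - mu) / sigma   # standardized ground truth
    z_pred = (y_pred - mu) / sigma   # predictions mapped to the same scale
    return np.sqrt(np.mean((z_test - z_pred) ** 2))
```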
For kinematics, RBF-SVR performed much better than Global/Local, but Global/Local was better than Linear and RegTree on most of the other datasets.

6 Conclusion

We have proposed a novel convex formulation of region-specific linear models, which we refer to as partition-wise linear models. Our approach simultaneously optimizes regions and predictors using sparsity-inducing structured penalties. To solve the optimization problem efficiently, we have derived an algorithm based on the decomposition of proximal maps. Thanks to its convexity, our method is free from initialization dependency, and a generalization error bound can be derived. Empirical results demonstrate the superiority of partition-wise linear models over other region-specific and locally linear models.

Acknowledgments

The majority of the work was done during the internship of the first author at the NEC central research laboratories.

9λ1, λ2, p in Global/Local; λ1 in Linear; λW, λθ, λθ′, σ in LDKL; C in FaLK-SVM; and C, γ in RBF-SVM.
10We used a scikit-learn package. http://scikit-learn.org/
11We used a libsvm package.

References

[1] Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.
[2] Cijo Jose, Prasoon Goyal, Parv Aggrwal, and Manik Varma. Local deep kernel learning for efficient non-linear SVM prediction. In ICML, pages 486–494, 2013.
[3] Joseph Wang and Venkatesh Saligrama. Local supervised learning through space partitioning. In NIPS, pages 91–99, 2012.
[4] Zhixiang Xu, Matt Kusner, Minmin Chen, and Kilian Q. Weinberger. Cost-sensitive tree of classifiers. In ICML, pages 133–141, 2013.
[5] Paul Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization.
Mathematical Programming, 125(2):263–295, 2010.
[6] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[7] Yaoliang Yu. On decomposing the proximal map. In NIPS, pages 91–99, 2013.
[8] Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.
[9] Nicola Segata and Enrico Blanzieri. Fast and scalable local kernel machines. Journal of Machine Learning Research, 11:1883–1926, 2010.
[10] Lubor Ladicky and Philip H. S. Torr. Locally linear support vector machines. In ICML, pages 985–992, 2011.
[11] Kai Yu, Tong Zhang, and Yihong Gong. Nonlinear learning using local coordinate coding. In NIPS, pages 2223–2231, 2009.
[12] Ziming Zhang, Lubor Ladicky, Philip H. S. Torr, and Amir Saffari. Learning anchor planes for classification. In NIPS, pages 1611–1619, 2011.
[13] Quanquan Gu and Jiawei Han. Clustered support vector machines. In AISTATS, pages 307–315, 2013.
[14] Hidekazu Oiwa and Ryohei Fujimaki. Partition-wise linear models. arXiv preprint arXiv:1410.8675, 2014.
[15] Yurii Nesterov. Gradient methods for minimizing composite objective function. CORE discussion papers, Catholic University of Louvain, 2007.
[16] Francis R. Bach. Structured sparsity-inducing norms through submodular functions. In NIPS, pages 118–126, 2010.
[17] Giorgio Gallo, Michael D. Grigoriadis, and Robert E. Tarjan. A fast parametric maximum flow algorithm and applications. SIAM Journal on Computing, 18(1):30–55, 1989.
[18] Kiyohito Nagano and Yoshinobu Kawahara. Structured convex optimization under submodular constraints. In UAI, 2013.
[19] John Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting.
Journal of Machine Learning Research, 10:2899–2934, 2009.
[20] Andreas Maurer and Massimiliano Pontil. Structured sparsity and generalization. Journal of Machine Learning Research, 13:671–690, 2012.
[21] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.