{"title": "Multivariate Dyadic Regression Trees for Sparse Learning Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1441, "page_last": 1449, "abstract": "We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve nearly optimal rates of convergence (in a minimax sense) for the class of $(\\alpha, C)$-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristic. The superior performance of MDRTs is demonstrated on both synthetic and real datasets.", "full_text": "Multivariate Dyadic Regression Trees for Sparse Learning Problems\n\nHan Liu and Xi Chen\n\nSchool of Computer Science, Carnegie Mellon University\n\nPittsburgh, PA 15213\n\nAbstract\n\nWe propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve nearly optimal rates of convergence (in a minimax sense) for the class of (\u03b1, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. 
To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristic. The superior performance of MDRTs is demonstrated on both synthetic and real datasets.\n\n1 Introduction\n\nMany application problems need to simultaneously predict several quantities using a common set of variables, e.g. predicting multi-channel signals within a time frame, predicting concentrations of several chemical constituents using the mass spectra of a sample, or predicting expression levels of many genes using a common set of phenotype variables. These problems can be naturally formulated in terms of multivariate regression.\n\nIn particular, let $\\{(x^1, y^1), \\ldots, (x^n, y^n)\\}$ be n independent and identically distributed pairs of data with $x^i \\in X \\subset \\mathbb{R}^d$ and $y^i \\in Y \\subset \\mathbb{R}^p$ for i = 1, . . . , n. Moreover, we denote the jth dimension of y by $y_j = (y_j^1, \\ldots, y_j^n)^T$ and the kth dimension of x by $x_k = (x_k^1, \\ldots, x_k^n)^T$. Without loss of generality, we assume $X = [0, 1]^d$ and that the true model on $y_j$ is\n\n$y_j^i = f_j(x^i) + \\epsilon_j^i, \\quad i = 1, \\ldots, n,$ (1)\n\nwhere $f_j : \\mathbb{R}^d \\to \\mathbb{R}$ is a smooth function. In the sequel, let $f = (f_1, \\ldots, f_p)$, where $f : \\mathbb{R}^d \\to \\mathbb{R}^p$ is a p-valued smooth function. The vector form of (1) then becomes $y^i = f(x^i) + \\epsilon^i$, i = 1, . . . , n. We also assume that the noise terms $\\{\\epsilon_j^i\\}_{i,j}$ are independently distributed and bounded almost surely. This is a general setting of nonparametric multivariate regression. From minimax theory, we know that estimating f in high dimensions is very challenging. For example, when $f_1, \\ldots, f_p$ lie in a d-dimensional Sobolev ball with order \u03b1 and radius C, the best convergence rate for the minimax risk is $p \\cdot n^{-2\\alpha/(2\\alpha+d)}$. For a fixed \u03b1, such a rate can be very slow when d becomes large.\n\nHowever, in many real world applications, the true regression function f may depend only on a small set of variables. In other words, the problem is jointly sparse:\n\n$f(x) = f(x_S) = (f_1(x_S), \\ldots, f_p(x_S)),$\n\nwhere $x_S = (x_k : k \\in S)$ and $S \\subset \\{1, \\ldots, d\\}$ is a subset of covariates with size $r = |S| \\ll d$. If S is given, the minimax lower bound can be improved to $p \\cdot n^{-2\\alpha/(2\\alpha+r)}$, which is the best rate that can be expected. For sparse learning problems, our task is to develop an estimator that adaptively achieves this faster rate of convergence without knowing S in advance.\n\n
Previous research on these problems can be roughly divided into three categories: (i) parametric linear models, (ii) nonparametric additive models, and (iii) nonparametric tree models. The methods in the first category assume that the true models are linear and use some block-norm regularization to induce jointly sparse solutions [16, 11, 13, 5]. If the linear model assumptions are correct, accurate estimates can be obtained. However, given the increasing complexity of modern applications, conclusions inferred under these restrictive linear model assumptions can be misleading. Recently, significant progress has been made on inferring nonparametric additive models with joint sparsity constraints [7, 10]. For additive models, each $f_j(x)$ is assumed to have an additive form: $f_j(x) = \\sum_{k=1}^{d} f_{jk}(x_k)$. Although they are more flexible than linear models, the additivity assumptions might still be too stringent for real world applications.\n\nA family of more flexible nonparametric methods is based on tree models. One of the most popular tree methods is the classification and regression tree (CART) [2]. It first grows a full tree by orthogonally splitting the axes at locally optimal splitting points, then prunes back the full tree to form a subtree. Theoretically, CART is hard to analyze unless strong assumptions are enforced [8]. 
In contrast to CART, dyadic decision trees (DDTs) are restricted to axis-orthogonal dyadic splits, i.e. each dimension can only be split at its midpoint. For a broad range of classification problems, [15] showed that DDTs using a special penalty can attain a nearly optimal rate of convergence in a minimax sense. [1] proposed a dynamic programming algorithm for constructing DDTs when the penalty term has an additive form, i.e. the penalty of the tree can be written as the sum of penalties on all terminal nodes. Though intensively studied for classification problems, the dyadic decision tree idea has not drawn much attention in regression settings. One of the closest results we are aware of is [4], in which a single-response dyadic regression procedure is considered for non-sparse learning problems. Another interesting tree model, \u201cBayesian Additive Regression Trees (BART)\u201d, which is essentially a \u201csum-of-trees\u201d model, was proposed under a Bayesian framework [6]. Most existing work adopts the number of terminal nodes as the penalty. Such a penalty cannot lead to sparse models, since a tree with a small number of terminal nodes might still involve too many variables.\n\nTo obtain sparse models, we propose a new nonparametric method based on multivariate dyadic regression trees (MDRTs). Similar to DDTs, MDRTs are constructed using penalized empirical risk minimization. The novelty of MDRTs is a sparsity-inducing term in the penalty, which explicitly induces sparse solutions. Our contributions are two-fold: (i) Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rate of convergence for the class of (\u03b1, C)-smooth functions. 
(ii) Empirically, to avoid a computationally prohibitive exhaustive search in high dimensions, we propose a two-stage greedy algorithm and its randomized version, which achieve good performance in both function estimation and variable selection. Note that our theory and algorithm can be straightforwardly adapted to the univariate sparse regression problem, which is a special case of the multivariate one. To the best of our knowledge, this is the first time such a sparsity-inducing penalty has been applied to tree models for solving sparse regression problems.\n\nThe rest of this paper is organized as follows. Section 2 presents MDRTs in detail. Section 3 studies the statistical properties of MDRTs. Section 4 presents the algorithms that approximately compute the MDRT solutions. Section 5 reports empirical results of MDRTs and their comparison with CARTs. Conclusions are given in Section 6.\n\n2 Multivariate Dyadic Regression Trees\n\nWe adopt the notation of [15]. An MDRT T is a multivariate regression tree that recursively divides the input space X by means of axis-orthogonal dyadic splits. The nodes of T are associated with hyperrectangles (cells) in $X = [0, 1]^d$. The root node corresponds to X itself. If a node is associated with the cell $B = \\prod_{j=1}^{d} [a_j, b_j]$, then after being dyadically split on dimension k, its two children are associated with the subcells $B^{k,1}$ and $B^{k,2}$:\n\n$B^{k,1} = \\{ x^i \\in B \\mid x_k^i \\le (a_k + b_k)/2 \\}$ and $B^{k,2} = B \\setminus B^{k,1}$.\n\nThe set of terminal nodes of an MDRT T is denoted by term(T). Let $B_t$ be the cell in X induced by a terminal node t; the partition induced by term(T) can then be denoted by $\\pi(T) = \\{ B_t \\mid t \\in \\mathrm{term}(T) \\}$.\n\nFor each terminal node t, we can fit a multivariate m-th order polynomial regression on the data points falling in $B_t$. Instead of using all covariates, such a polynomial regression is only fitted on a set of active variables, which is denoted by A(t). 
For each node $b \\in T$ (not necessarily a terminal node), A(b) can be an arbitrary subset of {1, . . . , d} satisfying two rules:\n\n1. If a node is dyadically split perpendicular to the axis k, k must belong to the active sets of its two children.\n2. For any node b, let par(b) be its parent node; then $A(\\mathrm{par}(b)) \\subset A(b)$.\n\nFor an MDRT T, we define $\\mathcal{F}_T^m$ to be the class of p-valued measurable m-th order polynomials corresponding to \u03c0(T). Furthermore, for a dyadic integer $N = 2^L$, let $\\mathcal{T}_N$ be the collection of all MDRTs such that no terminal cell has a side length smaller than $2^{-L}$.\n\nGiven integers M and N, let $\\mathcal{F}^{M,N}$ be defined as\n\n$\\mathcal{F}^{M,N} = \\bigcup_{0 \\le m \\le M} \\bigcup_{T \\in \\mathcal{T}_N} \\mathcal{F}_T^m.$\n\nThe final MDRT estimator with respect to $\\mathcal{F}^{M,N}$, denoted by $\\hat{f}^{M,N}$, can then be defined as\n\n$\\hat{f}^{M,N} = \\arg\\min_{f \\in \\mathcal{F}^{M,N}} \\frac{1}{n} \\sum_{i=1}^{n} \\| y^i - f(x^i) \\|_2^2 + \\mathrm{pen}(f).$ (2)\n\nTo define pen(f) in detail for $f \\in \\mathcal{F}^{M,N}$, let T and m be the MDRT and the order of polynomials corresponding to f. Then pen(f) takes the following form:\n\n$\\mathrm{pen}(f) = \\lambda \\cdot \\frac{p}{n} \\left( \\log n \\, (r_T + 1)^m (N_T + 1)^{r_T} + |\\pi(T)| \\log d \\right),$ (3)\n\nwhere \u03bb > 0 is a regularization parameter, $r_T = | \\cup_{t \\in \\mathrm{term}(T)} A(t) |$ corresponds to the number of relevant dimensions, and\n\n$N_T = \\min \\{ s \\in \\{1, 2, \\ldots, N\\} \\mid T \\in \\mathcal{T}_s \\}.$\n\nThere are two terms within the parentheses in (3). The latter one, which penalizes the number of terminal nodes |\u03c0(T)|, has been commonly adopted in the existing tree literature. The former one is novel. Intuitively, it penalizes non-sparse models since the number of relevant dimensions $r_T$ appears in the exponent. In the next section, we will show that this sparsity-inducing term is derived by bounding the VC-dimension of the underlying subgraph of regression functions. 
Thus it has a very intuitive interpretation.\n\n3 Statistical Properties\n\nIn this section, we present theoretical properties of the MDRT estimator. Our main technical result is Theorem 1, which provides the nearly optimal rate of the MDRT estimator.\n\nTo evaluate the algorithm performance, we use the $L_2$-risk with respect to the Lebesgue measure \u00b5(\u00b7), which is defined as $R(\\hat{f}, f) = E \\sum_{j=1}^{p} \\int_X |\\hat{f}_j(x) - f_j(x)|^2 \\, d\\mu(x)$, where $\\hat{f}$ is the function estimate constructed from n observed samples. Note that all the constants appearing in this section are generic constants, i.e. their values can change from one line to another in the analysis.\n\nLet $N_0 = \\{0, 1, \\ldots\\}$ be the set of natural numbers. We first define the class of (\u03b1, C)-smooth functions.\n\nDefinition 1 ((\u03b1, C)-smoothness) Let \u03b1 = q + \u03b2 for some $q \\in N_0$, 0 < \u03b2 \u2264 1, and let C > 0. A function $g : \\mathbb{R}^d \\to \\mathbb{R}$ is called (\u03b1, C)-smooth if for every $\\alpha = (\\alpha_1, \\ldots, \\alpha_d)$, $\\alpha_i \\in N_0$, $\\sum_{j=1}^{d} \\alpha_j = q$, the partial derivative $\\frac{\\partial^q g}{\\partial x_1^{\\alpha_1} \\cdots \\partial x_d^{\\alpha_d}}$ exists and satisfies, for all $x, z \\in \\mathbb{R}^d$,\n\n$\\left| \\frac{\\partial^q g(x)}{\\partial x_1^{\\alpha_1} \\cdots \\partial x_d^{\\alpha_d}} - \\frac{\\partial^q g(z)}{\\partial x_1^{\\alpha_1} \\cdots \\partial x_d^{\\alpha_d}} \\right| \\le C \\cdot \\| x - z \\|_2^{\\beta}.$\n\nIn the following, we denote the class of (\u03b1, C)-smooth functions by D(\u03b1, C).\n\n
Assumption 1 We assume $f_1, \\ldots, f_p \\in D(\\alpha, C)$ for some \u03b1, C > 0, and for all $j \\in \\{1, \\ldots, p\\}$, $f_j(x) = f_j(x_S)$ with $r = |S| \\ll d$.\n\nTheorem 3.2 of [9] shows that the lower minimax rate of convergence for the class D(\u03b1, C) is exactly the same as that for a d-dimensional Sobolev ball with order \u03b1 and radius C.\n\nProposition 1\n\n$\\liminf_{n \\to \\infty} \\frac{1}{p} \\cdot n^{2\\alpha/(2\\alpha+d)} \\inf_{\\hat{f}} \\sup_{f_1, \\ldots, f_p \\in D(\\alpha, C)} R(\\hat{f}, f) > 0.$\n\nThe proof of this proposition can be found in [9].\n\nTherefore, the lower minimax rate of convergence is $p \\cdot n^{-2\\alpha/(2\\alpha+d)}$. Similarly, if the problem is jointly sparse with the index set S and $r = |S| \\ll d$, the best rate of convergence can be improved to $p \\cdot n^{-2\\alpha/(2\\alpha+r)}$ when S is given.\n\nThe following is another technical assumption needed for the main theorem.\n\nAssumption 2 Let 1 \u2264 \u03b3 < \u221e. We assume that\n\n$\\max_{1 \\le j \\le p} \\sup_x |f_j(x)| \\le \\gamma$ and $\\max_{1 \\le i \\le n} \\| y^i \\|_\\infty \\le \\gamma$ a.s.\n\nThis condition is mild. Indeed, we can even allow \u03b3 to increase with the sample size n at a certain rate; this will not affect the final result. For example, when $\\{\\epsilon_j^i\\}_{i,j}$ are i.i.d. Gaussian random variables, this assumption easily holds with $\\gamma = O(\\sqrt{\\log n})$, which only contributes a logarithmic term to the final rate of convergence.\n\nThe next assumption specifies the scaling of the relevant dimension r and the ambient dimension d with respect to the sample size n.\n\nAssumption 3 r = O(1) and $d = O(\\exp(n^{\\xi}))$ for some 0 < \u03be < 1.\n\nHere, r = O(1) is crucial, since even if r increases at a logarithmic rate with respect to n, i.e. r = O(log n), it is hopeless to get any consistent estimator for the class D(\u03b1, C) since $n^{-(1/\\log n)} = 1/e$. 
On the other hand, the ambient dimension d can increase exponentially fast with the sample size, which is a realistic scaling for high dimensional settings.\n\nThe following is the main theorem.\n\nTheorem 1 Under Assumptions 1 to 3, there exists a positive number \u03bb that only depends on \u03b1, \u03b3 and r, such that\n\n$\\mathrm{pen}(f) = \\lambda \\cdot \\frac{p}{n} \\left( (\\log n)(r_T + 1)^m (N_T + 1)^{r_T} + |\\pi(T)| \\log d \\right).$ (4)\n\nFor large enough M, N, the solution $\\hat{f}^{M,N}$ obtained from (2) satisfies\n\n$R(\\hat{f}^{M,N}, f) \\le c \\cdot p \\cdot \\left( \\frac{\\log n + \\log d}{n} \\right)^{2\\alpha/(2\\alpha+r)},$ (5)\n\nwhere c is some generic constant.\n\nRemark 1 In view of Proposition 1, the rate of convergence obtained in (5) is nearly optimal up to a logarithmic term.\n\nRemark 2 Since the estimator defined in (2) does not need to know the smoothness \u03b1 and the sparsity level r in advance, MDRTs are simultaneously adaptive to the unknown smoothness and sparsity level.\n\nProof of Theorem 1: To find an upper bound of $R(\\hat{f}^{M,N}, f)$, we need to analyze and control the approximation and estimation errors separately. 
Our analysis closely follows the least squares regression analysis in [9] and a specific coding scheme of trees in [15].\n\nWithout loss of generality, we always assume that $\\hat{f}^{M,N}$ obtained from (2) satisfies $\\max_{1 \\le j \\le p} \\sup_x |\\hat{f}_j^{M,N}(x)| \\le \\gamma$. If this is not true, we can always truncate $\\hat{f}^{M,N}$ at the level \u03b3 and obtain the desired result in Theorem 1.\n\nLet $\\mathcal{S}_T^m$ be the class of scalar-valued measurable m-th order polynomials corresponding to \u03c0(T), and let $\\mathcal{G}_T^m$ be the class of all subgraphs of functions of $\\mathcal{S}_T^m$, i.e.\n\n$\\mathcal{G}_T^m = \\{ (z, t) \\in \\mathbb{R}^d \\times \\mathbb{R} ;\\ t \\le g(z) ;\\ g \\in \\mathcal{S}_T^m \\}.$\n\nLet $V_{\\mathcal{G}_T^m}$ be the VC-dimension of $\\mathcal{G}_T^m$. We have the following lemma:\n\nLemma 1 Let $r_T$ and $N_T$ be defined as in (3). Then\n\n$V_{\\mathcal{G}_T^m} \\le (r_T + 1)^m \\cdot (N_T + 1)^{r_T}.$ (6)\n\nSketch of Proof: From Theorem 9.5 of [9], we only need to show that the dimension of $\\mathcal{G}_T^m$ is upper bounded by the R.H.S. of (6). By the definition of $r_T$ and $N_T$, the result follows from a straightforward combinatorial analysis. \u25a1\n\nThe next lemma provides an upper bound of the approximation error for the class D(\u03b1, C).\n\nLemma 2 Let $f = (f_1, \\ldots, f_p)$ be the true regression function. There exists a set of piecewise polynomials $h_1, \\ldots, h_p \\in \\cup_{T \\in \\mathcal{T}_K} \\mathcal{S}_T^m$ such that\n\n
$\\forall j \\in \\{1, \\ldots, p\\}, \\quad \\sup_{x \\in X} |f_j(x) - h_j(x)| \\le c K^{-\\alpha},$\n\nwhere K \u2264 N and c is a generic constant that depends on r.\n\nSketch of Proof: This is a standard approximation result using multivariate piecewise polynomials. The main idea is based on a multivariate Taylor expansion of the function $f_j$ at a given point $x_0$; Definition 1 is then used to bound the remainder terms. For the sake of brevity, we omit the technical details. \u25a1\n\nThe next lemma is crucial: it provides an oracle inequality that bounds the risk using an approximation term and an estimation term. Its analysis follows from a simple adaptation of Theorem 12.1 on page 227 of [9].\n\nFirst, we define $\\tilde{R}(g, f) = \\sum_{j=1}^{p} \\int_X |g_j(x) - f_j(x)|^2 \\, d\\mu(x)$.\n\nLemma 3 [9] Choose\n\n$\\mathrm{pen}(f) \\ge 5136 \\cdot p \\cdot \\frac{\\gamma^4}{n} \\left( \\log(120 e \\gamma^4 n) \\, V_{\\mathcal{G}_T^m} + \\frac{[[T]] \\log 2}{2} \\right)$ (7)\n\nfor some prefix code [[T]] > 0 satisfying $\\sum_{T \\in \\mathcal{T}_N} 2^{-[[T]]} \\le 1$. Then, we have\n\n$R(\\hat{f}^{M,N}, f) \\le 12840 \\cdot p \\cdot \\frac{\\gamma^4}{n} + 2 \\inf_{T \\in \\mathcal{T}_N} \\inf_{g \\in \\mathcal{F}^{M,N}} \\{ p \\cdot \\mathrm{pen}(g) + \\tilde{R}(g, f) \\}.$ (8)\n\nOne appropriate prefix code [[T]] for each MDRT T is proposed in [15], which specifies that $[[T]] = 3|\\pi(T)| - 1 + (|\\pi(T)| - 1) \\log d / \\log 2$. A simpler upper bound for [[T]] is $[[T]] \\le (3 + \\log d / \\log 2) |\\pi(T)|$.\n\nRemark 3 The constants derived in Lemma 3 are pessimistic because of their very large numerical values. This may result in selecting oversimplified tree structures. In practice, we always use cross-validation to choose the tuning parameters.\n\nTo prove Theorem 1, first, using Assumption 1 and Lemma 2, we know that for any K \u2264 N, there must exist generic constants $c_1, c_2, c_3$ and a function $f'$ that is conformal with an MDRT $T' \\in \\mathcal{T}_K$, satisfying $f'(x) = f'(x_S)$ and $|\\pi(T')| \\le (K + 1)^r$, such that\n\n$\\tilde{R}(f', f) \\le c_1 \\cdot p \\cdot K^{-2\\alpha},$ (9)\n\nand\n\n$\\mathrm{pen}(f') \\le c_2 \\frac{(\\log n)(r + 1)^M (K + 1)^r}{n} + c_3 \\frac{\\log d \\, (K + 1)^r}{n}.$ (10)\n\nThe desired result then follows by plugging (9) and (10) into (8) and balancing these three terms.\n\n4 Computational Algorithm\n\nExhaustive search of $\\hat{f}^{M,N}$ in the MDRT space has similar complexity as that of DDTs and could be computationally very expensive. 
To make MDRTs scalable for high dimensional massive datasets, using similar ideas as CARTs, we propose a two-stage procedure: (1) we grow a full tree in a greedy manner; (2) we prune back the full tree to form the final tree. Before going into the details of the algorithm, we first introduce some necessary notation.\n\nGiven an MDRT T, denote the corresponding multivariate m-th order polynomial fit on \u03c0(T) by $\\hat{f}_T^m = \\{ \\hat{f}_t^m \\}_{t \\in \\pi(T)}$, where $\\hat{f}_t^m$ is the m-th order polynomial regression fit on the partition $B_t$. For each $x^i$ falling in $B_t$, let $\\hat{f}_t^m(x^i, A(t))$ be the predicted function value for $x^i$. We denote the local squared error (LSE) on node t by $\\hat{R}^m(t, A(t))$:\n\n$\\hat{R}^m(t, A(t)) = \\frac{1}{n} \\sum_{x^i \\in B_t} \\| y^i - \\hat{f}_t^m(x^i, A(t)) \\|_2^2.$\n\nIt is worthwhile noting that $\\hat{R}^m(t, A(t))$ is calculated as the average with respect to the total sample size n, instead of the number of data points contained in $B_t$. The total MSE of the tree, $\\hat{R}(T)$, can then be computed by the following equation:\n\n$\\hat{R}(T) = \\sum_{t \\in \\mathrm{term}(T)} \\hat{R}^m(t, A(t)).$\n\nThe total cost of T, which is defined as the right hand side of (2), can then be written as:\n\n$\\hat{C}(T) = \\hat{R}(T) + \\mathrm{pen}(\\hat{f}_T^m).$ (11)\n\nOur goal is to find the tree structure, with the polynomial regression on each terminal node, that minimizes the total cost.\n\nThe first stage is tree growing, in which a terminal node t is first selected in each step. We then perform one of two actions $a_1$ and $a_2$:\n\n$a_1$: adding another dimension $k \\notin A(t)$ to A(t), and refitting the regression model on all data points falling in $B_t$;\n$a_2$: dyadically splitting t perpendicular to the dimension $k \\in A(t)$.\n\nIn each tree growing step, we need to decide which action to perform. For action $a_1$, we denote the drop in LSE as:\n\n$\\Delta \\hat{R}_1^m(t, k) = \\hat{R}^m(t, A(t)) - \\hat{R}^m(t, A(t) \\cup \\{k\\}).$ (12)\n\n
For action $a_2$, let $sl(t^{(k)})$ be the side length of $B_t$ on dimension $k \\in A(t)$. If $sl(t^{(k)}) > 2^{-L}$, the dimension k of $B_t$ can then be dyadically split. In this case, let $t_L^{(k)}$ and $t_R^{(k)}$ be the left and right child of node t. The drop in LSE takes the following form:\n\n$\\Delta \\hat{R}_2^m(t, k) = \\hat{R}^m(t, A(t)) - \\hat{R}^m(t_L^{(k)}, A(t)) - \\hat{R}^m(t_R^{(k)}, A(t)).$ (13)\n\nFor each terminal node t, we greedily perform the action $a^*$ on the dimension $k^*$, which are determined by\n\n$(a^*, k^*) = \\arg\\max_{a \\in \\{1, 2\\},\\, k \\in \\{1, \\ldots, d\\}} \\Delta \\hat{R}_a^m(t, k).$ (14)\n\nIn high dimensional settings, the above greedy procedure may not lead to the optimal tree, since successive locally optimal splits cannot guarantee the global optimum. Once an irrelevant dimension has been added or split, the greedy procedure can never fix the mistake. To make the algorithm more robust, we propose a randomized scheme. Instead of greedily performing the action on the dimension that leads to the maximum drop in LSE, we randomly choose which action to perform according to a multinomial distribution. In particular, we normalize $\\Delta \\hat{R}$ such that:\n\n$\\sum_{a=1}^{2} \\sum_{k} \\Delta \\hat{R}_a^m(t, k) = 1.$ (15)\n\nA sample $(a^*, k^*)$ is then drawn from multinomial$(1, \\Delta \\hat{R})$, and the action $a^*$ is performed on the dimension $k^*$. In general, when the randomized scheme is adopted, we need to repeat our algorithm many times to pick the best tree.\n\nThe second stage is cost complexity pruning. In each step, we either merge a pair of terminal nodes or remove a variable from the active set of a terminal node, such that the resulting tree has a smaller cost. We repeat this process until the tree becomes a single root node with an empty active set. The tree with the minimum cost in this process is returned as the final tree. 
The pseudocode for the growing stage and the cost complexity pruning stage is presented in the Appendix. Moreover, to avoid a cell with too few data points, we pre-define a quantity $n_{\\max}$. Let n(t) be the number of data points falling into $B_t$; if $n(t) \\le n_{\\max}$, $B_t$ will no longer be split. It is worthwhile noting that we ignore those actions that lead to \u2206R = 0. In addition, whenever we perform the m-th order polynomial regression on the active set of a node, we need to make sure it is not rank deficient.\n\n5 Experimental Results\n\nIn this section, we present numerical results for MDRTs applied to both synthetic and real datasets. We compare five methods: (1) Greedy MDRT with M = 1 (MDRT(G, M=1)); (2) Randomized MDRT with M = 1 (MDRT(R, M=1)); (3) Greedy MDRT with M = 0 (MDRT(G, M=0)); (4) Randomized MDRT with M = 0 (MDRT(R, M=0)); (5) CART. For the randomized scheme, we run 50 random trials and pick the minimum cost tree.\n\nAs for CART, we adopt the MATLAB package from [12], which fits a piecewise constant on each terminal node with the cost complexity criterion $\\hat{C}(T) = \\hat{R}(T) + \\rho \\frac{p}{n} |\\pi(T)|$, where \u03c1 is the tuning parameter playing the same role as \u03bb in (3).\n\nSynthetic Data: For the synthetic data experiment, we consider the high dimensional compound symmetry covariance structure of the design matrix with n = 200 and d = 100. Each dimension $x_j$ is generated according to\n\n$x_j = \\frac{W_j + tU}{1 + t}, \\quad j = 1, \\ldots, d,$\n\nwhere $W_1, \\ldots, W_d$ and U are i.i.d. samples from Uniform(0,1). Therefore the correlation between $x_j$ and $x_k$ is $t^2/(1 + t^2)$ for $j \\neq k$.\n\nWe study three models as shown below: the first one is linear; the second one is nonlinear but additive; the third one is nonlinear with three-way interactions. All these models only involve four relevant variables. 
The noise terms, denoted as \u03f5, are independently drawn from a standard normal distribution.\n\nModel 1: $y_1^i = 2x_1^i + 3x_2^i + 4x_3^i + 5x_4^i + \\epsilon_1^i$, $y_2^i = 5x_1^i + 4x_2^i + 3x_3^i + 2x_4^i + \\epsilon_2^i$\nModel 2: $y_1^i = \\exp(x_1^i) + (x_2^i)^2 + 3x_3^i + 2x_4^i + \\epsilon_1^i$, $y_2^i = (x_1^i)^2 + 2x_2^i + \\exp(x_3^i) + 3x_4^i + \\epsilon_2^i$\nModel 3: $y_1^i = \\exp(2x_1^i x_2^i + x_3^i) + x_4^i + \\epsilon_1^i$, $y_2^i = \\sin(x_1^i x_2^i) + (x_3^i)^2 + 2x_4^i + \\epsilon_2^i$\n\nWe compare the performance of the different methods using two criteria: (i) variable selection and (ii) function estimation. For each model, we generate 100 designs and an equal-sized validation set per design. As for the remaining experimental settings, we set $n_{\\max} = 5$ and L = 6. By varying the values of \u03bb or \u03c1 from large to small, we obtain a full regularization path. The tree with the minimum MSE on the validation set is then picked as the best tree. For criterion (i), if the variables involved in the best tree are exactly the first four variables, the variable selection task for this design is deemed successful. The numerical results are presented in Table 1. For each method, the three quantities reported in order are the number of successes out of 100 designs, and the mean and standard deviation of the MSE on the validation set. Note that we omit the prefix \u201cMDRT\u201d from the method names in Table 1 due to space limitations.\n\nFrom Table 1, the performance of MDRT with M = 1 is clearly better than that of the other methods in both variable selection and estimation. For linear models, MDRT with M = 1 always selects the correct variables, even for large t. For variable selection, MDRT with M = 0 performs better than CART due to its sparsity-inducing penalty. In contrast, CART is more flexible in the sense that its splits are not necessarily dyadic. As a consequence, they are comparable in function estimation. 
Moreover, the performance of the randomized scheme is slightly better than that of its deterministic version in variable selection. Another observation is that, as t becomes larger, although the variable selection performance decreases for all methods, the estimation performance becomes slightly better. This might seem counter-intuitive at first sight. In fact, as t increases, all methods tend to select more variables. Due to the high correlations, even the irrelevant variables are helpful in predicting the responses. This is an expected effect.\n\nReal Data: In this subsection, we compare these methods on three real datasets. The first dataset is the Chemometrics data (Chem for short), which has been extensively studied in [3]. The data are from a simulation of a low density tubular polyethylene reactor with n = 56, d = 22 and p = 6. Following the same procedure as in [3], we log-transformed the responses because they are skewed. The second dataset is Boston Housing (footnote 1) with n = 506, d = 10 and p = 1. We add 10 irrelevant variables randomly drawn from Uniform(0,1) to evaluate the variable selection performance. The third one, Space ga (footnote 2), is an election dataset with spatial coordinates on 3107 US counties. 
Our task is to predict the x, y coordinates of each county given 5 variables regarding voting information. For Space ga, we normalize the responses to [0, 1]. Similarly, we add another 15 irrelevant variables randomly drawn from Uniform(0,1). For all these datasets, we scale the input variables into a unit cube.\n\nFootnote 1: Available from the UCI Machine Learning Database Repository: http://archive.ics.uci.edu/ml\nFootnote 2: Available from StatLib: http://lib.stat.cmu.edu/datasets/\n\nTable 1: Comparison of Variable Selection and Function Estimation on Synthetic Datasets\n\nModel 1 | R, M=1 | G, M=1 | R, M=0 | G, M=0 | CART\nt = 0 | 100; 2.03 (0.14) | 100; 2.08 (0.15) | 100; 5.84 (0.51) | 97; 5.74 (0.54) | 52; 6.17 (0.55)\nt = 0.5 | 100; 2.05 (0.14) | 100; 2.06 (0.15) | 76; 5.42 (0.53) | 68; 5.36 (0.60) | 29; 5.48 (0.51)\nt = 1 | 100; 2.05 (0.13) | 100; 2.05 (0.16) | 19; 5.40 (0.60) | 20; 5.56 (0.69) | 3; 5.30 (0.58)\n\nModel 2 | R, M=1 | G, M=1 | R, M=0 | G, M=0 | CART\nt = 0 | 100; 2.07 (0.13) | 100; 2.06 (0.15) | 39; 3.21 (0.26) | 31; 3.22 (0.28) | 25; 3.52 (0.31)\nt = 0.5 | 96; 2.05 (0.15) | 93; 2.09 (0.17) | 17; 3.10 (0.25) | 11; 3.15 (0.26) | 5; 3.20 (0.27)\nt = 1 | 76; 2.09 (0.14) | 68; 2.21 (0.19) | 2; 3.17 (0.30) | 2; 3.16 (0.26) | 1; 3.16 (0.27)\n\nModel 3 | R, M=1 | G, M=1 | R, M=0 | G, M=0 | CART\nt = 0 | 98; 2.68 (0.31) | 95; 2.67 (0.47) | 75; 3.90 (0.47) | 63; 4.03 (0.54) | 29; 4.35 (0.73)\nt = 0.5 | 84; 2.56 (0.21) | 86; 2.52 (0.25) | 32; 3.63 (0.47) | 32; 3.60 (0.40) | 15; 3.69 (0.38)\nt = 1 | 65; 2.51 (0.26) | 50; 2.62 (0.23) | 3; 3.75 (0.45) | 4; 3.88 (0.51) | 2; 3.66 (0.38)\n\nFor evaluation purposes, each dataset is randomly split such that half of the data are used for training and the other half for testing. We run a 5-fold cross-validation on the training set to pick the best tuning parameters $\\lambda^*$ and $\\rho^*$. We then train MDRTs and CART on the entire training data using $\\lambda^*$ and $\\rho^*$. We repeat this process 20 times and report the mean and standard deviation of the testing MSE in Table 2. 
$n_{\\max}$ is set to 5 for the first dataset and 20 for the latter two. For all datasets, we set L = 6. Moreover, for the randomized scheme, we run 50 random trials and pick the minimum cost tree.\n\nTable 2: Testing MSE on Real Datasets\n\nDataset | R, M=1 | G, M=1 | R, M=0 | G, M=0 | CART\nChem | 0.15 (0.09) | 0.18 (0.12) | 0.38 (0.18) | 0.52 (0.06) | 0.40 (0.09)\nHousing | 20.18 (2.94) | 21.60 (2.83) | 24.67 (2.05) | 29.46 (1.95) | 25.91 (3.05)\nSpace ga | 0.054 (7.8e-4) | 0.055 (8.0e-4) | 0.068 (7.2e-4) | 0.068 (9.2e-4) | 0.064 (8.3e-4)\n\nFrom Table 2, we see that MDRT with M = 1 has the best estimation performance. Moreover, the randomized scheme does improve the performance compared to its deterministic counterpart. In particular, such an improvement is quite significant when M = 0. The performance of MDRT(G, M=0) is always worse than that of CART, since CART can have more flexible splits. However, with the randomized scheme, the performance of MDRT(R, M=0) is comparable to that of CART.\n\nAs for variable selection on the Housing data, in all 20 runs, MDRT(G, M=1) and MDRT(R, M=1) never select the artificially added variables. However, for the other three methods, nearly 10 out of 20 runs involve at least one extraneous variable. In particular, we compare our results with those reported in [14], which finds that there are 4 irrelevant variables (indus, age, dis, tax) in the Housing data. Our experiments confirm this result, since in 15 out of the 20 trials, MDRT(G, M=1) and MDRT(R, M=1) select none of these four variables. 
Similarly, for the Space ga data, MDRT(G, M=1) and MDRT(R, M=1) involve the artificially added variables in only 2 and 1 runs, respectively.\n\n6 Conclusions\n\nWe propose a novel sparse learning method based on multivariate dyadic regression trees (MDRTs). Our approach adopts a new sparsity-inducing penalty that simultaneously conducts function estimation and variable selection. Some theoretical analysis and practical algorithms have been developed. To the best of our knowledge, this is the first time that such a penalty has been introduced in the tree literature for high dimensional sparse learning problems.\n\nReferences\n\n[1] G. Blanchard, C. Sch\u00e4fer, Y. Rozenholc, and K.-R. M\u00fcller. Optimal dyadic decision trees. Machine Learning Journal, 66(2-3):209\u2013241, 2007.\n[2] Leo Breiman, Jerome Friedman, Charles J. Stone, and R.A. Olshen. Classification and regression trees. Wadsworth Publishing Co Inc, 1984.\n[3] Leo Breiman and Jerome H. Friedman. Predicting multivariate responses in multiple linear regression. J. Roy. Statist. Soc. B, 59:3, 1997.\n[4] R. Castro, R. Willett, and R. Nowak. Fast rates in regression via active learning. NIPS, 2005.\n[5] Xi Chen, Weike Pan, James T. Kwok, and Jamie G. Carbonell. Accelerated gradient method for multi-task sparse learning problem. In ICDM, 2009.\n[6] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bart: Bayesian additive regression trees. Technical report, Department of Mathematics and Statistics, Acadia University, Canada, 2006.\n[7] Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19:1\u201367, 1991.\n[8] S. Gey and E. Nedelec. Model selection for cart regression trees. IEEE Tran. on Info. Theory, 51(2):658\u2013670, 2005.\n[9] L\u00e1szl\u00f3 Gy\u00f6rfi, Michael Kohler, Adam Krzy\u017cak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, 2002.\n
[10] Han Liu, John Lafferty, and Larry Wasserman. Nonparametric regression and classification with joint sparsity constraints. In NIPS. MIT Press, 2008.\n[11] Han Liu and Jian Zhang. On the estimation consistency of the group lasso and its applications. AISTATS, pages 376\u2013383, 2009.\n[12] Wendy L. Martinez and Angel R. Martinez. Computational Statistics Handbook with MATLAB. Chapman & Hall CRC, 2 edition, 2008.\n[13] G. Obozinski, M. J. Wainwright, and M. I. Jordan. High-dimensional union support recovery in multivariate regression. In NIPS. MIT Press, 2009.\n[14] Pradeep Ravikumar, Han Liu, John Lafferty, and Larry Wasserman. Spam: Sparse additive models. In NIPS. MIT Press, 2007.\n[15] C. Scott and R.D. Nowak. Minimax-optimal classification with dyadic decision trees. IEEE Tran. on Info. Theory, 52(4):1335\u20131353, 2006.\n[16] B.A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 27:349\u2013363, 2005.", "award": [], "sourceid": 268, "authors": [{"given_name": "Han", "family_name": "Liu", "institution": null}, {"given_name": "Xi", "family_name": "Chen", "institution": null}]}