{"title": "Maximum Margin Interval Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 4947, "page_last": 4956, "abstract": "Learning a regression function using censored or interval-valued output data is an important problem in fields such as genomics and medicine. The goal is to learn a real-valued prediction function, and the training output labels indicate an interval of possible values. Whereas most existing algorithms for this task are linear models, in this paper we investigate learning nonlinear tree models. We propose to learn a tree by minimizing a margin-based discriminative objective function, and we provide a dynamic programming algorithm for computing the optimal solution in log-linear time. We show empirically that this algorithm achieves state-of-the-art speed and prediction accuracy in a benchmark of several data sets.", "full_text": "Maximum Margin Interval Trees\n\nAlexandre Drouin\n\nD\u00e9partement d\u2019informatique et de g\u00e9nie logiciel\n\nUniversit\u00e9 Laval, Qu\u00e9bec, Canada\nalexandre.drouin.8@ulaval.ca\n\nToby Dylan Hocking\nMcGill Genome Center\n\nMcGill University, Montr\u00e9al, Canada\n\ntoby.hocking@r-project.org\n\nFran\u00e7ois Laviolette\n\nD\u00e9partement d\u2019informatique et de g\u00e9nie logiciel\n\nUniversit\u00e9 Laval, Qu\u00e9bec, Canada\n\nfrancois.laviolette@ift.ulaval.ca\n\nAbstract\n\nLearning a regression function using censored or interval-valued output data is an\nimportant problem in \ufb01elds such as genomics and medicine. The goal is to learn a\nreal-valued prediction function, and the training output labels indicate an interval of\npossible values. Whereas most existing algorithms for this task are linear models,\nin this paper we investigate learning nonlinear tree models. We propose to learn\na tree by minimizing a margin-based discriminative objective function, and we\nprovide a dynamic programming algorithm for computing the optimal solution in\nlog-linear time. 
We show empirically that this algorithm achieves state-of-the-art speed and prediction accuracy in a benchmark of several data sets.

1 Introduction

In the typical supervised regression setting, we are given a set of learning examples, each associated with a real-valued output. The goal is to learn a predictor that accurately estimates the outputs, given new examples. This fundamental problem has been extensively studied and has given rise to algorithms such as Support Vector Regression (Basak et al., 2007). A similar, but far less studied, problem is that of interval regression, where each learning example is associated with an interval (y_i, ȳ_i), indicating a range of acceptable output values, and the expected predictions are real numbers. Interval-valued outputs arise naturally in fields such as computational biology and survival analysis. In the latter setting, one is interested in predicting the time until some adverse event, such as death, occurs. The available information is often limited, giving rise to outputs that are said to be either un-censored (−∞ < y_i = ȳ_i < ∞), left-censored (−∞ = y_i < ȳ_i < ∞), right-censored (−∞ < y_i < ȳ_i = ∞), or interval-censored (−∞ < y_i < ȳ_i < ∞) (Klein and Moeschberger, 2005). For instance, right-censored data occurs when all that is known is that an individual is still alive after a period of time. Another recent example is from the field of genomics, where interval regression was used to learn a penalty function for changepoint detection in DNA copy number and ChIP-seq data (Rigaill et al., 2013). Despite the ubiquity of this type of problem, there are surprisingly few existing algorithms that have been designed to learn from such outputs, and most are linear models.

Decision tree algorithms were proposed in the 1980s with the pioneering work of Breiman et al. (1984) and Quinlan (1986). Such algorithms rely on a simple framework, where trees are grown by recursive partitioning of leaves, each time maximizing some task-specific criterion. Advantages of these algorithms include the ability to learn non-linear models from both numerical and categorical data of various scales, and a relatively low training time complexity. In this work, we extend the work of Breiman et al. (1984) to learning non-linear interval regression tree models.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1.1 Contributions and organization

Our first contribution is Section 3, in which we propose a new decision tree algorithm for interval regression. We propose to partition leaves using a margin-based hinge loss, which yields a sequence of convex optimization problems. Our second contribution is Section 4, in which we propose a dynamic programming algorithm that computes the optimal solution to all of these problems in log-linear time. In Section 5 we show that our algorithm achieves state-of-the-art prediction accuracy in several real and simulated data sets. In Section 6 we discuss the significance of our contributions and propose possible future research directions. An implementation is available at https://git.io/mmit.

2 Related work

The bulk of related work comes from the field of survival analysis. Linear models for censored outputs have been extensively studied under the name accelerated failure time (AFT) models (Wei, 1992). Recently, L1-regularized variants have been proposed to learn from high-dimensional data (Cai et al., 2009; Huang et al., 2005). 
Nonlinear models for censored data have also been studied, including decision trees (Segal, 1988; Molinaro et al., 2004), Random Forests (Hothorn et al., 2006) and Support Vector Machines (Pölsterl et al., 2016). However, most of these algorithms are limited to the case of right-censored and un-censored data. In contrast, in the interval regression setting, the data are either left, right or interval-censored. To the best of our knowledge, the only existing nonlinear model for this setting is the recently proposed Transformation Tree of Hothorn and Zeileis (2017). Another related method, which shares great similarity with ours, is the L1-regularized linear model of Rigaill et al. (2013). Like our proposed algorithm, their method optimizes a convex loss function with a margin hyperparameter. Nevertheless, one key limitation of their algorithm is that it is limited to modeling linear patterns, whereas our regression tree algorithm is not.

3 Problem

3.1 Learning from interval outputs

Let S def= {(x1, y1), ..., (xn, yn)} ∼ Dⁿ be a data set of n learning examples, where xi ∈ Rᵖ is a feature vector, yi def= (y_i, ȳ_i), with y_i, ȳ_i ∈ R and y_i < ȳ_i, gives the lower and upper limits of a target interval, and D is an unknown data-generating distribution. In the interval regression setting, a predicted value is only considered erroneous if it is outside of the target interval.

Formally, let ℓ : R → R be a function and define φℓ(x) def= ℓ[(x)+] as its corresponding hinge loss, where (x)+ is the positive part function, i.e. (x)+ = x if x > 0 and (x)+ = 0 otherwise. In this work, we will consider two possible hinge loss functions: the linear one, where ℓ(x) = x, and the squared one where ℓ(x) = x². 
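As a concrete sketch of these definitions (hypothetical helper names; the authors' own implementation is at https://git.io/mmit), the hinge loss φℓ(x) = ℓ[(x)+] for the linear and squared cases can be written as:

```python
def positive_part(x):
    # (x)_+ = x if x > 0 and 0 otherwise
    return x if x > 0 else 0.0

def linear_hinge(x):
    # phi_l(x) = l[(x)_+] with l(x) = x
    return positive_part(x)

def squared_hinge(x):
    # phi_l(x) = l[(x)_+] with l(x) = x^2
    return positive_part(x) ** 2
```

Both functions are zero for non-positive arguments, which is what makes a prediction inside the target interval (beyond the margin) cost nothing.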
Our goal is to find a function h : Rᵖ → R that minimizes the expected error on data drawn from D:

minimize_h E_{(xi, yi) ∼ D} [ φℓ(−h(xi) + y_i) + φℓ(h(xi) − ȳ_i) ] .

Notice that, if ℓ(x) = x², this is a generalization of the mean squared error to interval outputs. Moreover, this can be seen as a surrogate to a zero-one loss that measures if a predicted value lies within the target interval (Rigaill et al., 2013).

3.2 Maximum margin interval trees

We will seek an interval regression tree model T : Rᵖ → R that minimizes the total hinge loss on data set S:

C(T) def= Σ_{(xi, yi) ∈ S} [ φℓ(−T(xi) + y_i + ε) + φℓ(T(xi) − ȳ_i + ε) ] ,   (1)

where ε ≥ 0 is a hyperparameter introduced to improve regularity (see supplementary material for details).

Figure 1: An example partition of leaf τ0 into leaves τ1 and τ2.

A decision tree is an arrangement of nodes and leaves. The leaves are responsible for making predictions, whereas the nodes guide the examples to the leaves based on the outcome of some boolean-valued rules (Breiman et al., 1984). Let T̃ denote the set of leaves in a decision tree T. Each leaf τ ∈ T̃ is associated with a set of examples Sτ ⊆ S, for which it is responsible for making predictions. 
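The total cost of Equation (1) can be sketched as follows (a minimal illustration, not the authors' implementation; `tree_cost` and its argument names are hypothetical):

```python
def tree_cost(predictions, intervals, margin, squared=False):
    """Total hinge loss C(T) of Equation (1): a prediction mu pays a cost
    when it is within `margin` of an interval limit, or outside the interval."""
    def phi(x):  # hinge loss phi_l
        p = max(x, 0.0)
        return p * p if squared else p
    return sum(phi(-mu + lo + margin) + phi(mu - hi + margin)
               for mu, (lo, hi) in zip(predictions, intervals))
```

With `margin=0`, a prediction strictly inside its target interval contributes nothing to the cost; a positive margin pushes predictions away from the interval limits.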
The sets Sτ obey the following properties: S = ∪_{τ ∈ T̃} Sτ and Sτ ∩ Sτ' ≠ ∅ ⇔ τ = τ'. Hence, the contribution of a leaf τ to the total loss of the tree C(T), given that it predicts µ ∈ R, is

Cτ(µ) def= Σ_{(xi, yi) ∈ Sτ} [ φℓ(−µ + y_i + ε) + φℓ(µ − ȳ_i + ε) ]   (2)

and the optimal predicted value for the leaf is obtained by minimizing this function over all µ ∈ R.

As in the CART algorithm (Breiman et al., 1984), our tree growing algorithm relies on recursive partitioning of the leaves. That is, at any step of the tree growing algorithm, we obtain a new tree T' from T by selecting a leaf τ0 ∈ T̃ and dividing it into two leaves τ1, τ2 ∈ T̃', s.t. Sτ0 = Sτ1 ∪ Sτ2 and τ0 ∉ T̃'. This partitioning results from applying a boolean-valued rule r : Rᵖ → B to each example (xi, yi) ∈ Sτ0 and sending it to τ1 if r(xi) = True and to τ2 otherwise. The rules that we consider are threshold functions on the value of a single feature, i.e., r(xi) def= " xij ≤ δ ". This is illustrated in Figure 1. 
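Before introducing the dynamic program of Section 4, the split search of Equation (5) can be sketched naively as follows. The function names are hypothetical; for the linear hinge loss, restricting the candidate predicted values to the hinge breakpoints is valid because a convex piecewise-linear function attains its minimum at a breakpoint:

```python
def leaf_cost(intervals, margin):
    # Minimum over mu of the leaf cost C_tau(mu), for the linear hinge loss.
    def cost(mu):
        return sum(max(-mu + lo + margin, 0.0) + max(mu - hi + margin, 0.0)
                   for lo, hi in intervals)
    # Breakpoints of the two hinges: mu = lo + margin and mu = hi - margin.
    candidates = ([lo + margin for lo, _ in intervals] +
                  [hi - margin for _, hi in intervals])
    return min(cost(mu) for mu in candidates) if intervals else 0.0

def best_split(X, intervals, margin):
    # Naive O(n^2 p) search for the rule "x_ij <= delta" minimizing the sum
    # of the two leaf costs; the paper's dynamic program computes the same
    # quantity in log-linear time per feature.
    n, p = len(X), len(X[0])
    best = (None, None, float("inf"))
    for j in range(p):
        # The largest observed value is skipped: it would give an empty leaf.
        for delta in sorted({X[i][j] for i in range(n)})[:-1]:
            left = [intervals[i] for i in range(n) if X[i][j] <= delta]
            right = [intervals[i] for i in range(n) if X[i][j] > delta]
            c = leaf_cost(left, margin) + leaf_cost(right, margin)
            if c < best[2]:
                best = (j, delta, c)
    return best
```

For example, with two clusters of intervals separated along a single feature, the search recovers the separating threshold with zero total cost.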
According to Equation (2), for any such rule, we have that the total hinge losses for the examples that are sent to τ1 and τ2 are

Cτ1(µ) = ←Cτ0(µ|j, δ) def= Σ_{(xi, yi) ∈ Sτ0 : xij ≤ δ} [ φℓ(−µ + y_i + ε) + φℓ(µ − ȳ_i + ε) ]   (3)

Cτ2(µ) = →Cτ0(µ|j, δ) def= Σ_{(xi, yi) ∈ Sτ0 : xij > δ} [ φℓ(−µ + y_i + ε) + φℓ(µ − ȳ_i + ε) ] .   (4)

The best rule is the one that leads to the smallest total cost C(T'). This rule, as well as the optimal predicted values for τ1 and τ2, are obtained by solving the following optimization problem:

argmin_{j, δ, µ1, µ2} [ ←Cτ0(µ1|j, δ) + →Cτ0(µ2|j, δ) ] .   (5)

In the next section we propose a dynamic programming algorithm for this task.

4 Algorithm

First note that, for a given j, δ, the optimization separates into two convex minimization sub-problems, which each amount to minimizing a sum of convex loss functions:

min_{j, δ, µ1, µ2} [ ←Cτ(µ1|j, δ) + →Cτ(µ2|j, δ) ] = min_{j, δ} [ min_{µ1} ←Cτ(µ1|j, δ) + min_{µ2} →Cτ(µ2|j, δ) ] .   (6)

[Figure 1 plots feature value (xij) against interval limits, showing leaf τ0 and the resulting leaves τ1 (xij ≤ δ) and τ2 (xij > δ), with the upper limits (ȳ_i), lower limits (y_i), threshold (δ), predicted values (µ0, µ1, µ2), and margin (ε).]

We will show that if there exists an efficient dynamic program Ω which, given any set of hinge loss functions 
defined over µ, computes their sum and returns the minimum value, along with a minimizing value of µ, then the minimization problem of Equation (6) can be solved efficiently.

Observe that, although there is a continuum of possible values for δ, we can limit the search to the values of feature j that are observed in the data (i.e., δ ∈ {xij ; i = 1, ..., n}), since all other values do not lead to different configurations of Sτ1 and Sτ2. Thus, there are at most nj ≤ n unique thresholds to consider for each feature. Let these thresholds be δj,1 < ... < δj,nj. Now, consider Φj,k as the set that contains all the losses φℓ(−µ + y_i + ε) and φℓ(µ − ȳ_i + ε) for which we have (xi, yi) ∈ Sτ0 and xij = δj,k. Since we now only consider a finite number of δ-values, it follows from Equation (3) that one can obtain ←Cτ(µ1|j, δj,k) from ←Cτ(µ1|j, δj,k−1) by adding all the losses in Φj,k. Similarly, one can also obtain →Cτ(µ1|j, δj,k) from →Cτ(µ1|j, δj,k−1) by removing all the losses in Φj,k (see Equation (4)). This, in turn, implies that min_µ ←Cτ(µ|j, δj,k) = Ω(Φj,1 ∪ ... ∪ Φj,k) and min_µ →Cτ(µ|j, δj,k) = Ω(Φj,k+1 ∪ ... ∪ Φj,nj). Hence, the cost associated with a split on each threshold δj,k is given by:

δj,1 :      Ω(Φj,1)                    + Ω(Φj,2 ∪ ··· ∪ Φj,nj)
...
δj,i :      Ω(Φj,1 ∪ ··· ∪ Φj,i)       + Ω(Φj,i+1 ∪ ··· ∪ Φj,nj)
...
δj,nj−1 :   Ω(Φj,1 ∪ ··· ∪ Φj,nj−1)    + Ω(Φj,nj)                    (7)

and the best threshold is the one with the smallest cost. Note that, in contrast with the other thresholds, δj,nj needs not be considered, since it leads to an empty leaf. Note also that, since Ω is a dynamic program, one can efficiently compute Equation (7) by using Ω twice, from the top down for the first column and from the bottom up for the second. Below, we propose such an algorithm.

4.1 Definitions

A general expression for the hinge losses φℓ(−µ + y_i + ε) and φℓ(µ − ȳ_i + ε) is φℓ(si(µ − yi) + ε), where si = −1 or 1 respectively. Now, choose any convex function ℓ : R → R and let

Pt(µ) def= Σ_{i=1}^{t} φℓ(si(µ − yi) + ε)   (8)

be a sum of t hinge loss functions. In this notation, Ω(Φj,1 ∪ ... ∪ Φj,i) = min_µ Pt(µ), where t = |Φj,1 ∪ ... ∪ Φj,i|.

Observation 1. Each of the t hinge loss functions has a breakpoint at yi − si ε, where it transitions from a zero function to a non-zero one if si = 1 and the converse if si = −1.

For the sake of simplicity, we will now consider the case where these breakpoints are all different; the generalization is straightforward, but would needlessly complexify the presentation (see the supplementary material for details). Now, note that Pt(µ) is a convex piecewise function that can be uniquely represented as:

Pt(µ) = pt,1(µ) if µ ∈ (−∞, bt,1] ; ... ; pt,i(µ) if µ ∈ (bt,i−1, bt,i] ; ... ; pt,t+1(µ) if µ ∈ (bt,t, ∞)   (9)

where we will call pt,i the ith piece of Pt and bt,i the ith breakpoint of Pt (see Figure 2 for an example). Observe that each piece pt,i is the sum of all the functions that are non-zero on the interval (bt,i−1, bt,i]. It therefore follows from Observation 1 that

pt,i(µ) = Σ_{j=1}^{t} ℓ[sj(µ − yj) + ε] I[(sj = −1 ∧ bt,i−1 < yj + ε) ∨ (sj = 1 ∧ yj − ε < bt,i)]   (10)

where I[·] is the (Boolean) indicator function, i.e., I[True] = 1 and 0 otherwise.

Figure 2: First two steps of the dynamic programming algorithm for the data y1 = 4, s1 = 1, y2 = 1, s2 = −1 and margin ε = 1, using the linear hinge loss (ℓ(x) = x). Left: The algorithm begins by creating a first breakpoint at b1,1 = y1 − ε = 3, with corresponding function f1,1(µ) = µ − 3. At this time, we have j1 = 2 and thus b1,j1 = ∞. Note that the cost p1,1 before the first breakpoint is not yet stored by the algorithm. Middle: The optimization step is to move the pointer to the minimum (J1 = j1 − 1) and update the cost function, M1(µ) = p1,2(µ) − f1,1(µ). Right: The algorithm adds the second breakpoint at b2,1 = y2 + ε = 2 with f2,1(µ) = µ − 2. The cost at the pointer is not affected by the new data point, so the pointer does not move.

Lemma 1. For any i ∈ {1, ..., t}, we have that pt,i+1(µ) = pt,i(µ) + ft,i(µ), where ft,i(µ) = sk ℓ[sk(µ − yk) + ε] for some k ∈ {1, ..., t} such that yk − sk ε = bt,i.

Proof. The proof relies on Equation (10) and is detailed in the supplementary material.

4.2 Minimizing a sum of hinge losses by dynamic programming

Our algorithm works by recursively adding a hinge loss to the total function Pt(µ), each time keeping track of the minima. To achieve this, we use a pointer Jt, which points to the rightmost piece of Pt(µ) that contains a minimum. Since Pt(µ) is a convex function of µ, we know that this minimum is global. In the algorithm, we refer to the segment pt,Jt as Mt, and the essence of the dynamic programming update is moving Jt to its correct position after a new hinge loss is added to the sum.

At any time step t, let Bt = {(bt,1, ft,1), ..., (bt,t, ft,t) | bt,1 < ... < bt,t} be the current set of breakpoints (bt,i) together with their corresponding difference functions (ft,i). Moreover, assume the convention bt,0 = −∞ and bt,t+1 = ∞, which are defined, but not stored in Bt.

The initialization (t = 0) is

B0 = {}, J0 = 1, M0(µ) = 0 .   (11)

Now, at any time step t > 0, start by inserting the new breakpoint and difference function. Hence,

Bt = Bt−1 ∪ {(yt − st ε, st ℓ[st(µ − yt) + ε])} .   (12)

Recall that, by definition, the set Bt remains sorted after the insertion. Let jt ∈ {1, ..., t + 1} be the updated value for the previous minimum pointer (Jt−1) after adding the tth hinge loss (i.e., the index of bt−1,Jt−1 in the sorted set of breakpoints at time t). It is obtained by adding 1 if the new breakpoint is before Jt−1 and 0 otherwise. In other words,

jt = Jt−1 + I[yt − st ε < bt−1,Jt−1] .   (13)

If there is no minimum of Pt(µ) in piece pt,jt, we must move the pointer from jt to its final position Jt ∈ {1, ..., t + 1}, where Jt is the index of the rightmost function piece that contains a minimum:

Jt = max_{i ∈ {1,...,t+1}} i, s.t. (bt,i−1, bt,i] ∩ {x ∈ R | Pt(x) = min_µ Pt(µ)} ≠ ∅ .   (14)

See Figure 2 for an example. The minimum after optimization is in piece Mt, which is obtained by adding or subtracting a series of difference functions ft,i. Hence, applying Lemma 1 multiple times, we obtain:

Mt(µ) def= pt,Jt(µ) = pt,jt(µ) + { 0 if jt = Jt ; Σ_{i=jt}^{Jt−1} ft,i(µ) if jt < Jt ; −Σ_{i=Jt}^{jt−1} ft,i(µ) if Jt < jt }   (15)

Figure 3: Empirical evaluation of the expected O(n(m + log n)) time complexity for n data points and m pointer moves per data point. Left: max and average number of pointer moves m over all real and simulated data sets we considered (median line and shaded quartiles over all features, margin parameters, and data sets of a given size). We also observed m = O(1) pointer moves on average for both the linear and squared hinge loss. Right: timings in seconds are consistent with the expected O(n log n) time complexity.

Then, the optimization problem can be solved using min_µ Pt(µ) = min_{µ ∈ (bt,Jt−1, bt,Jt]} Mt(µ). The proof of this statement is available in the supplementary material, along with a detailed pseudocode and implementation details.

4.3 Complexity analysis

The ℓ functions that we consider are ℓ(x) = x and ℓ(x) = x². Notice that any such function can be encoded by three coefficients a, b, c ∈ R. Therefore, summing two functions amounts to summing their respective coefficients and takes time O(1). 
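This coefficient encoding can be sketched as follows (an illustration of the bookkeeping only, with hypothetical names; the actual pseudocode is in the supplementary material):

```python
class Quadratic:
    """A function f(mu) = a*mu^2 + b*mu + c. The pieces of P_t are of this
    form for both l(x) = x and l(x) = x^2, so each piece can be stored as
    three coefficients and two pieces can be summed in O(1) time."""
    def __init__(self, a=0.0, b=0.0, c=0.0):
        self.a, self.b, self.c = a, b, c

    def __add__(self, other):
        # Summing two functions = summing their coefficients.
        return Quadratic(self.a + other.a, self.b + other.b, self.c + other.c)

    def __call__(self, mu):
        return self.a * mu * mu + self.b * mu + self.c

    def argmin(self):
        # Vertex of the parabola when a > 0; a piece with a == 0 is linear
        # and its minimum lies at one of its breakpoints instead.
        return -self.b / (2 * self.a) if self.a > 0 else None
```

This is why updating Mt in Equation (15) costs O(1) per difference function added or removed.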
The set of breakpoints Bt can be stored using any data structure that allows sorted insertions in logarithmic time (e.g., a binary search tree).

Assume that we have n hinge losses. Inserting a new breakpoint at Equation (12) takes O(log n) time. Updating the jt pointer at Equation (13) takes O(1). In contrast, the complexity of finding the new pointer position Jt and updating Mt at Equations (14) and (15) varies depending on the nature of ℓ. For the case where ℓ(x) = x, we are guaranteed that Jt is at distance at most one from jt. This is demonstrated in Theorem 2 of the supplementary material. Since we can sum two functions in O(1) time, we have that the worst-case time complexity of the linear hinge loss algorithm is O(n log n). However, for the case where ℓ(x) = x², the worst case could involve going through the n breakpoints. Hence, the worst-case time complexity of the squared hinge loss algorithm is O(n²). Nevertheless, in Section 5.1, we show that, when tested on a variety of real-world data sets, the algorithm achieved a time complexity of O(n log n) in this case also.

Finally, the space complexity of this algorithm is O(n), since a list of n breakpoints (bt,i) and difference functions (ft,i) must be stored, along with the coefficients (a, b, c ∈ R) of Mt. Moreover, it follows from Lemma 1 that the function pieces pt,i need not be stored, since they can be recovered using the bt,i and ft,i.

5 Results

5.1 Empirical evaluation of time complexity

We performed two experiments to evaluate the expected O(n(m + log n)) time complexity for n interval limits and m pointer moves per limit. 
First, we ran our algorithm (MMIT) with both squared and linear hinge loss solvers on a variety of real-world data sets of varying sizes (Rigaill et al., 2013; Lichman, 2013), and recorded the number of pointer moves. We plot the average and max pointer moves over a wide range of margin parameters, and all possible feature orderings (Figure 3, left). In agreement with our theoretical result (supplementary material, Theorem 2), we observed a maximum of one move per interval limit for the linear hinge loss. On average, we observed that the number of moves does not increase with data set size, even for the squared hinge loss. These results suggest that the number of pointer moves per limit is generally constant, m = O(1), so we expect an overall time complexity of O(n log n) in practice, even for the squared hinge loss. Second, we used the limits of the target intervals in the neuroblastoma changepoint data set (see Section 5.3) to simulate data sets from n = 10³ to n = 10⁷ limits. We recorded the time required to run the solvers (Figure 3, right), and observed timings which are consistent with the expected O(n log n) complexity.

Figure 4: Predictions of MMIT (linear hinge loss) and the L1-regularized linear model of Rigaill et al. (2013) (L1-Linear) for simulated data sets. [Three panels, one per pattern f(x) = sin(x), f(x) = |x| and f(x) = x/5, each showing the lower and upper interval limits against the signal feature x.]

5.2 MMIT recovers a good approximation in simulations with nonlinear patterns

We demonstrate one key limitation of the margin-based interval regression algorithm of Rigaill et al. 
(2013) (L1-Linear): it is limited to modeling linear patterns. To achieve this, we created three simulated data sets, each containing 200 examples and 20 features. Each data set was generated in such a way that the target intervals followed a specific pattern f : R → R according to a single feature, which we call the signal feature. The width of the intervals and a small random shift around the true value of f were determined randomly. The details of the data generation protocol are available in the supplementary material. MMIT (linear hinge loss) and L1-Linear were trained on each data set, using cross-validation to choose the hyperparameter values. The resulting data sets and the predictions of each algorithm are illustrated in Figure 4. As expected, L1-Linear fails to fit the non-linear patterns, but achieves a near perfect fit for the linear pattern. In contrast, MMIT learns stepwise approximations of the true functions, which results from each leaf predicting a constant value. Notice the fluctuations in the models of both algorithms, which result from using irrelevant features.

5.3 Empirical evaluation of prediction accuracy

In this section, we compare the accuracy of predictions made by MMIT and other learning algorithms on real and simulated data sets.

Evaluation protocol To evaluate the accuracy of the algorithms, we performed 5-fold cross-validation and computed the mean squared error (MSE) with respect to the intervals in each of the five testing sets (Figure 5). For a data set S = {(xi, yi)}, i = 1, ..., n, with xi ∈ Rᵖ and yi ∈ R², and for a model h : Rᵖ → R, the MSE is given by

MSE(h, S) = (1/n) Σ_{i=1}^{n} ( [h(xi) − y_i] I[h(xi) < y_i] + [h(xi) − ȳ_i] I[h(xi) > ȳ_i] )² .   (16)

Figure 5: MMIT testing set mean squared error is comparable to, or better than, other interval regression algorithms in seven real and simulated data sets. 
Five-fold cross-validation was used to compute 5 test error values (points) for each model in each of the data sets (panel titles indicate data set source, name, number of observations = n, number of features = p, proportion of intervals with finite limits and proportion of all interval limits that are upper limits).

At each step of the cross-validation, another cross-validation (nested within the former) was used to select the hyperparameters of each algorithm based on the training data. The hyperparameters selected for MMIT are available in the supplementary material.

Algorithms The linear and squared hinge loss variants of Maximum Margin Interval Trees (MMIT-L and MMIT-S) were compared to two state-of-the-art interval regression algorithms: the margin-based L1-regularized linear model of Rigaill et al. (2013) (L1-Linear) and the Transformation Trees of Hothorn and Zeileis (2017) (TransfoTree). Moreover, two baseline methods were included in the comparison. To provide an upper bound for prediction error, we computed the trivial model that ignores all features and just learns a constant function h(x) = µ that minimizes the MSE on the training data (Constant). To demonstrate the importance of using a loss function designed for interval regression, we also considered the CART algorithm (Breiman et al., 1984). Specifically, CART was used to fit a regular regression tree on a transformed training set, where each interval regression example (x, [y, ȳ]) was replaced by two real-valued regression examples with features x and labels y + ε and ȳ − ε. This algorithm, which we call Interval-CART, uses a margin hyperparameter and minimizes a squared loss with respect to the interval limits. 
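For reference, the evaluation metric of Equation (16) can be sketched as follows (hypothetical function name):

```python
def interval_mse(predictions, intervals):
    # Mean squared error of Equation (16): zero inside [lo, hi],
    # squared distance to the nearest violated limit otherwise.
    total = 0.0
    for mu, (lo, hi) in zip(predictions, intervals):
        if mu < lo:
            total += (mu - lo) ** 2
        elif mu > hi:
            total += (mu - hi) ** 2
    return total / len(predictions)
```

Note that, unlike the training objective of Equation (1), this metric uses no margin: any prediction inside the target interval is counted as a perfect prediction.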
However, in contrast with MMIT, it does not take the structure of the interval regression problem into account, i.e., it ignores the fact that no cost should be incurred for values predicted inside the target intervals.

Results in changepoint data sets The problem in the first two data sets is to learn a penalty function for changepoint detection in DNA copy number and ChIP-seq data (Hocking et al., 2013; Rigaill et al., 2013), two significant interval regression problems from the field of genomics. For the neuroblastoma data set, all methods, except the constant model, perform comparably. Interval-CART achieves the lowest error for one fold, but L1-Linear is the overall best performing method. For the histone data set, the margin-based models clearly outperform the non-margin-based models: Constant and TransfoTree. MMIT-S achieves the lowest error on one of the folds. Moreover, MMIT-S tends to outperform MMIT-L, suggesting that a squared loss is better suited for this task. Interestingly, MMIT-S outperforms Interval-CART, which also uses a squared loss, supporting the importance of using a loss function adapted to the interval regression problem.

Results in UCI data sets The next two data sets are regression problems taken from the UCI repository (Lichman, 2013). For the sake of our comparison, the real-valued outputs in these data sets were transformed into censored intervals, using a protocol that we detail in the supplementary material. For the difficult triazines data set, all methods struggle to surpass the Constant model. Nevertheless, some achieve lower errors for one fold. For the servo data set, the margin-based tree models (MMIT-S, MMIT-L, and Interval-CART) perform comparably and outperform the other models. 
This highlights the importance of developing non-linear models for interval regression and suggests a positive effect of the margin hyperparameter on accuracy.

[Figure 5 panels: changepoint neuroblastoma (n=3418, p=117, 70% finite, 16% up); changepoint histone (n=935, p=26, 47% finite, 32% up); UCI triazines (n=186, p=60, 93% finite, 50% up); UCI servo (n=167, p=19, 92% finite, 50% up); simulated linear (n=200, p=20, 80% finite, 49% up); simulated sin (n=200, p=20, 85% finite, 49% up); simulated abs (n=200, p=20, 77% finite, 49% up). Each panel shows log10(mean squared test error) in 5-fold CV, one point per fold, for Constant, L1-Linear, TransfoTree, Interval-CART, MMIT-L and MMIT-S.]

Results in simulated data sets The last three data sets are the simulated data sets discussed in the previous section. As expected, the L1-Linear model tends to outperform the others on the linear data set. However, surprisingly, on a few folds, the MMIT-L and Interval-CART models were able to achieve low test errors. For the non-linear data sets (sin and abs), MMIT-S, MMIT-L and Interval-CART clearly outperform the TransfoTree, L1-Linear and Constant models. Observe that the TransfoTree algorithm achieves results comparable to those of L1-Linear which, in Section 5.2, has been shown to learn a roughly constant model in these situations. Hence, although these data sets are simulated, they highlight situations where this non-linear interval regression algorithm fails to yield accurate models, but where MMITs do not.

Results for more data sets are available in the supplementary material.

6 Discussion and conclusions

We proposed a new margin-based decision tree algorithm for the interval regression problem. 
We showed that it could be trained by solving a sequence of convex sub-problems, for which we proposed a new dynamic programming algorithm. We showed empirically that this algorithm's time complexity is log-linear in the number of intervals in the data set. Hence, like classical regression trees (Breiman et al., 1984), our tree growing algorithm's time complexity is linear in the number of features and log-linear in the number of examples. Moreover, we studied the prediction accuracy in several real and simulated data sets, showing that our algorithm is competitive with other linear and nonlinear models for interval regression.

This initial work on Maximum Margin Interval Trees opens a variety of research directions, which we will explore in future work. We will investigate learning ensembles of MMITs, such as random forests. We also plan to extend the method to learning trees with non-constant leaves. This will increase the smoothness of the models, which, as observed in Figure 4, tend to have a stepwise nature. Moreover, we plan to study the average time complexity of the dynamic programming algorithm. Assuming a certain regularity in the data generating distribution, we should be able to bound the number of pointer moves and justify the time complexity that we observed empirically. In addition, we will study the conditions in which the proposed MMIT algorithm is expected to surpass methods that do not exploit the structure of the target intervals, such as the proposed Interval-CART method. Intuitively, one weakness of Interval-CART is that it does not properly model left- and right-censored intervals, for which it favors predictions that are near the finite limits. Finally, we plan to extend the dynamic programming algorithm to data with un-censored outputs.
This will make Maximum Margin Interval Trees applicable to survival analysis problems, where they should rank among the state of the art.

Reproducibility

• Implementation: https://git.io/mmit
• Experimental code: https://git.io/mmit-paper
• Data: https://git.io/mmit-data

The versions of the software used in this work are also provided in the supplementary material.

Acknowledgements

We are grateful to Ulysse Côté-Allard, Mathieu Blanchette, Pascal Germain, Sébastien Giguère, Gaël Letarte, Mario Marchand, and Pier-Luc Plante for their insightful comments and suggestions. This work was supported by the Natural Sciences and Engineering Research Council of Canada, through an Alexander Graham Bell Canada Graduate Scholarship Doctoral Award awarded to AD and a Discovery Grant awarded to FL (#262067).

References

Basak, D., Pal, S., and Patranabis, D. C. (2007). Support vector regression. Neural Information Processing-Letters and Reviews, 11(10), 203–224.

Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. (1984). Classification and regression trees. CRC Press.

Cai, T., Huang, J., and Tian, L. (2009). Regularized estimation for the accelerated failure time model. Biometrics, 65, 394–404.

Hocking, T. D., Schleiermacher, G., Janoueix-Lerosey, I., Boeva, V., Cappo, J., Delattre, O., Bach, F., and Vert, J.-P. (2013). Learning smoothing models of copy number profiles using breakpoint annotations. BMC Bioinformatics, 14(1), 164.

Hothorn, T. and Zeileis, A. (2017). Transformation Forests. arXiv:1701.02110.

Hothorn, T., Bühlmann, P., Dudoit, S., Molinaro, A., and Van Der Laan, M. J. (2006). Survival ensembles. Biostatistics, 7(3), 355–373.

Huang, J., Ma, S., and Xie, H. (2005). Regularized estimation in the accelerated failure time model with high dimensional covariates.
Technical Report 349, University of Iowa Department of Statistics and Actuarial Science.

Klein, J. P. and Moeschberger, M. L. (2005). Survival analysis: techniques for censored and truncated data. Springer Science & Business Media.

Lichman, M. (2013). UCI machine learning repository.

Molinaro, A. M., Dudoit, S., and van der Laan, M. J. (2004). Tree-based multivariate regression and density estimation with right-censored data. Journal of Multivariate Analysis, 90, 154–177.

Pölsterl, S., Navab, N., and Katouzian, A. (2016). An Efficient Training Algorithm for Kernel Survival Support Vector Machines. arXiv:1611.07054.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.

Rigaill, G., Hocking, T., Vert, J.-P., and Bach, F. (2013). Learning sparse penalties for change-point detection using max margin interval regression. In Proc. 30th ICML, pages 172–180.

Segal, M. R. (1988). Regression trees for censored data. Biometrics, pages 35–47.

Wei, L. (1992). The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Statistics in Medicine, 11(14–15), 1871–9.