{"title": "Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 927, "page_last": 935, "abstract": "Generalized Linear Models (GLMs) and Single Index Models (SIMs) provide powerful generalizations of linear regression, where the target variable is assumed to be a (possibly unknown) 1-dimensional function of a linear predictor. In general, these problems entail non-convex estimation procedures, and, in practice, iterative local search heuristics are often used. Kalai and Sastry (2009) provided the first provably efficient method, the \\emph{Isotron} algorithm, for learning SIMs and GLMs, under the assumption that the data is in fact generated under a GLM and under certain monotonicity and Lipschitz (bounded slope) constraints. The Isotron algorithm interleaves steps of perceptron-like updates with isotonic regression (fitting a one-dimensional non-decreasing function). However, to obtain provable performance, the method requires a fresh sample every iteration. In this paper, we provide algorithms for learning GLMs and SIMs, which are both computationally and statistically efficient. We modify the isotonic regression step in Isotron to fit a Lipschitz monotonic function, and also provide an efficient $O(n \\log(n))$ algorithm for this step, improving upon the previous $O(n^2)$ algorithm. We provide a brief empirical study, demonstrating the feasibility of our algorithms in practice.", "full_text": "Ef\ufb01cient Learning of Generalized Linear and Single\n\nIndex Models with Isotonic Regression\n\nSham M. 
Kakade\n\nMicrosoft Research and Wharton, U Penn\n\nskakade@microsoft.com\n\nVarun Kanade\n\nSEAS, Harvard University\n\nvkanade@fas.harvard.edu\n\nAdam Tauman Kalai\nMicrosoft Research\n\nadum@microsoft.com\n\nOhad Shamir\n\nMicrosoft Research\n\nohadsh@microsoft.com\n\nAbstract\n\nGeneralized Linear Models (GLMs) and Single Index Models (SIMs) provide\npowerful generalizations of linear regression, where the target variable is assumed\nto be a (possibly unknown) 1-dimensional function of a linear predictor. In gen-\neral, these problems entail non-convex estimation procedures, and, in practice,\niterative local search heuristics are often used. Kalai and Sastry (2009) provided\nthe \ufb01rst provably ef\ufb01cient method, the Isotron algorithm, for learning SIMs and\nGLMs, under the assumption that the data is in fact generated under a GLM and\nunder certain monotonicity and Lipschitz (bounded slope) constraints. The Isotron\nalgorithm interleaves steps of perceptron-like updates with isotonic regression (\ufb01t-\nting a one-dimensional non-decreasing function). However, to obtain provable\nperformance, the method requires a fresh sample every iteration. In this paper, we\nprovide algorithms for learning GLMs and SIMs, which are both computationally\nand statistically ef\ufb01cient. We modify the isotonic regression step in Isotron to \ufb01t\na Lipschitz monotonic function, and also provide an ef\ufb01cient O(n log(n)) algo-\nrithm for this step, improving upon the previous O(n2) algorithm. We provide a\nbrief empirical study, demonstrating the feasibility of our algorithms in practice.\n\n1\n\nIntroduction\n\nThe oft used linear regression paradigm models a dependent variable Y as a linear function of a\nvector-valued independent variable X. 
Namely, for some vector w, we assume that E[Y|X] = w · X.
Generalized linear models (GLMs) provide a flexible extension of linear regression, by assuming that the dependent variable Y is of the form E[Y|X] = u(w · X); u is referred to as the inverse link function or transfer function (see [1] for a review). Generalized linear models include commonly used regression techniques such as logistic regression, where u(z) = 1/(1 + e^(-z)) is the logistic function. The class of perceptrons also falls in this category, where u is a simple piecewise linear ramp (flat, then linearly increasing, then flat), with the slope of the middle piece being the inverse of the margin.
In the case of linear regression, the least-squares method is a highly efficient procedure for parameter estimation. Unfortunately, in the case of GLMs, even in the setting when u is known, the problem of fitting a model that minimizes squared error is typically not convex. We are not aware of any classical estimation procedure for GLMs which is both computationally and statistically efficient, and with provable guarantees. The standard procedure is iteratively reweighted least squares, based on Newton-Raphson (see [1]).
The case when both u and w are unknown (sometimes referred to as Single Index Models (SIMs)) involves the more challenging (and practically relevant) question of jointly estimating u and w, where u may come from a large non-parametric family such as all monotonic functions. There are two questions here: 1) What statistical rate is achievable for simultaneous estimation of u and w? 2) Is there a computationally efficient algorithm for this joint estimation? With regards to the former, under mild Lipschitz-continuity restrictions on u, it is possible to characterize the effectiveness of an (appropriately constrained) joint empirical risk minimization procedure. 
This suggests that, from a\npurely statistical viewpoint, it may be worthwhile to attempt jointly optimizing u and w on empirical\ndata.\nHowever, the issue of computationally ef\ufb01ciently estimating both u and w (and still achieving a\ngood statistical rate) is more delicate, and is the focus of this work. We note that this is not a trivial\nproblem: in general, the joint estimation problem is highly non-convex, and despite a signi\ufb01cant\nbody of literature on the problem, existing methods are usually based on heuristics, which are not\nguaranteed to converge to a global optimum (see for instance [2, 3, 4, 5, 6]).\nThe Isotron algorithm of Kalai and Sastry [7] provides the \ufb01rst provably ef\ufb01cient method for learning\nGLMs and SIMs, under the common assumption that u is monotonic and Lipschitz, and assuming\nthat the data corresponds to the model.1 The sample and computational complexity of this algo-\nrithm is polynomial, and the sample complexity does not explicitly depend on the dimension. The\nalgorithm is a variant of the \u201cgradient-like\u201d perceptron algorithm, where apart from the perceptron-\nlike updates, an isotonic regression procedure is performed on the linear predictions using the Pool\nAdjacent Violators (PAV) algorithm, on every iteration.\nWhile the Isotron algorithm is appealing due to its ease of implementation (it has no parameters\nother than the number of iterations to run) and theoretical guarantees (it works for any u, w), there\nis one principal drawback. It is a batch algorithm, but the analysis given requires the algorithm to\nbe run on fresh samples each batch. 
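As background, the isotonic regression step mentioned above admits a standard linear-time implementation. The sketch below is a generic textbook Pool Adjacent Violators pass (our own minimal version, not code from the paper): it maintains a stack of blocks, merging a block with its left neighbor whenever the running means violate monotonicity.

```python
def pav(y):
    """Pool Adjacent Violators: least-squares non-decreasing fit to y
    (unit weights). Runs in linear time via block merging."""
    blocks = []  # each block is [sum, count, mean]
    for v in y:
        blocks.append([v, 1, v])
        # merge backwards while the monotonicity constraint is violated
        while len(blocks) > 1 and blocks[-2][2] > blocks[-1][2]:
            total, count, _ = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = blocks[-1][0] / blocks[-1][1]
    fit = []
    for total, count, mean in blocks:
        fit.extend([mean] * count)
    return fit
```

For example, `pav([3, 1, 2])` pools the first two points and returns `[2, 2, 2]`; an already non-decreasing input is returned unchanged. Note that PAV imposes only monotonicity, not the Lipschitz (bounded-slope) constraint discussed below.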
In fact, as we show in experiments, this is not just an artifact of\nthe analysis \u2013 if the algorithm loops over the same data in each update step, it really does over\ufb01t in\nvery high dimensions (such as when the number of dimensions exceeds the number of examples).\nOur Contributions: We show that the over\ufb01tting problem in Isotron stems from the fact that al-\nthough it uses a slope (Lipschitz) condition as an assumption in the analysis, it does not constrain\nthe output hypothesis to be of this form. To address this issue, we introduce the SLISOTRON algo-\nrithm (pronounced slice-o-tron, combining slope and Isotron). The algorithm replaces the isotonic\nregression step of the Isotron by \ufb01nding the best non-decreasing function with a bounded Lipschitz\nparameter - this constraint plays here a similar role as the margin in classi\ufb01cation algorithms. We\nalso note SLISOTRON (like Isotron) has a signi\ufb01cant advantage over standard regression techniques,\nsince it does not require knowing the transfer function. Our two main contributions are:\n1. We show that the new algorithm, like Isotron, has theoretical guarantees, and signi\ufb01cant new\nanalysis is required for this step.\n2. We provide an ef\ufb01cient O(n log(n)) time algorithm for \ufb01nding the best non-decreasing function\nwith a bounded Lipschitz parameter, improving on the previous O(n2) algorithm [10]. This makes\nSLISOTRON practical even on large datasets.\nWe begin with a simple perceptron-like algorithm for \ufb01tting GLMs, with a known transfer function\nu which is monotone and Lipschitz. Somewhat surprisingly, prior to this work (and Isotron [7])\na computationally ef\ufb01cient procedure that guarantees to learn GLMs was not known. Section 4\ncontains the more challenging SLISOTRON algorithm and also the ef\ufb01cient O(n log(n)) algorithm\nfor Lipschitz isotonic regression. 
We conclude with a brief empirical analysis.

2 Setting

We assume the data (x, y) are sampled i.i.d. from a distribution supported on B_d × [0, 1], where B_d = {x ∈ R^d : ‖x‖ ≤ 1} is the unit ball in d-dimensional Euclidean space. Our algorithms and analysis also apply to the case where B_d is the unit ball in some high (or infinite)-dimensional kernel feature space. We assume there is a fixed vector w, such that ‖w‖ ≤ W, and a non-decreasing 1-Lipschitz function u : R → [0, 1], such that E[y|x] = u(w · x) for all x.

[Footnote 1] In the more challenging agnostic setting, the data is not required to be distributed according to a true u and w, but it is required to find the best u, w which minimize the empirical squared error. Similar to observations of Kalai et al. [8], it is straightforward to show that this problem is likely to be computationally intractable in the agnostic setting. In particular, it is at least as hard as the problem of "learning parity with noise," whose hardness has been used as the basis for designing multiple cryptographic systems. Shalev-Shwartz et al. [9] present a kernel-based algorithm for learning certain types of GLMs and SIMs in the agnostic setting. However, their worst-case guarantees are exponential in the norm of w (or equivalently the Lipschitz parameter).

Algorithm 1 GLM-TRON
Input: data ⟨(x_i, y_i)⟩_{i=1..m} ∈ R^d × [0, 1], u : R → [0, 1], held-out data ⟨(x_{m+j}, y_{m+j})⟩_{j=1..s}
  w_1 := 0
  for t = 1, 2, . . . do
    h_t(x) := u(w_t · x)
    w_{t+1} := w_t + (1/m) Σ_{i=1..m} (y_i − u(w_t · x_i)) x_i
  end for
Output: arg min over h_t of Σ_{j=1..s} (h_t(x_{m+j}) − y_{m+j})²
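As a concrete illustration, the GLM-TRON update can be sketched in a few lines. This is a minimal, dependency-free sketch rather than the authors' code; the function name `glmtron` and the toy logistic data below are our own setup, and for simplicity the best iterate is selected on the training error rather than on a held-out set as in the pseudo-code.

```python
import math
import random

def glmtron(X, y, u, T=200):
    """Perceptron-like GLM-TRON update:
    w_{t+1} = w_t + (1/m) * sum_i (y_i - u(w_t . x_i)) x_i.
    Returns the iterate with the smallest training squared error."""
    m, d = len(X), len(X[0])
    w = [0.0] * d
    best_w, best_err = list(w), float("inf")
    for _ in range(T):
        preds = [u(sum(wj * xj for wj, xj in zip(w, x))) for x in X]
        err = sum((p - yi) ** 2 for p, yi in zip(preds, y)) / m
        if err < best_err:
            best_err, best_w = err, list(w)
        # averaged perceptron-like update
        for xi, yi, pi in zip(X, y, preds):
            scale = (yi - pi) / m
            for j in range(d):
                w[j] += scale * xi[j]
    return best_w, best_err

# Toy check on noiseless data generated from a logistic GLM.
logistic = lambda z: 1.0 / (1.0 + math.exp(-z))
rng = random.Random(0)
w_true = [1.0, -0.5]
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [logistic(sum(wt * xj for wt, xj in zip(w_true, x))) for x in X]
w_hat, train_err = glmtron(X, y, logistic)
```

On this noiseless toy data the iterates contract toward w_true and the training squared error drops far below that of the trivial constant predictor.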
The restriction that u is 1-Lipschitz is without loss of generality, since the norm of w is arbitrary (an equivalent restriction is that ‖w‖ = 1 and that u is W-Lipschitz for an arbitrary W).
Our focus is on approximating the regression function well, as measured by the squared loss. For a real-valued function h : B_d → [0, 1], define

    err(h) = E_{(x,y)}[(h(x) − y)²]
    ε(h) = err(h) − err(E[y|x]) = E_{(x,y)}[(h(x) − u(w · x))²]

err(h) measures the error of h, and ε(h) measures the excess error of h compared to the Bayes-optimal predictor x ↦ u(w · x). Our goal is to find h such that ε(h) (equivalently, err(h)) is as small as possible.
In addition, we define the empirical counterparts êrr(h), ε̂(h), based on a sample (x_1, y_1), . . . , (x_m, y_m), to be

    êrr(h) = (1/m) Σ_{i=1..m} (h(x_i) − y_i)²;    ε̂(h) = (1/m) Σ_{i=1..m} (h(x_i) − u(w · x_i))².

Note that ε̂ is the standard fixed-design error (as this error conditions on the observed x's).
Our algorithms work by iteratively constructing hypotheses h_t of the form h_t(x) = u_t(w_t · x), where u_t is a non-decreasing, 1-Lipschitz function, and w_t is a linear predictor. The algorithmic analysis provides conditions under which ε̂(h_t) is small, and using statistical arguments, one can guarantee that ε(h_t) would be small as well.

3 The GLM-TRON algorithm

We begin with the simpler case, where the transfer function u is assumed to be known (e.g. a sigmoid), and the problem is estimating w properly. We present a simple, parameter-free, perceptron-like algorithm, GLM-TRON (Alg. 1), which efficiently finds a close-to-optimal predictor. We note
We note\nthat the algorithm works for arbitrary non-decreasing, Lipschitz functions u, and thus covers most\ngeneralized linear models. We refer the reader to the pseudo-code in Algorithm 1 for some of the\nnotation used in this section.\nTo analyze the performance of the algorithm, we show that if we run the algorithm for suf\ufb01ciently\nmany iterations, one of the predictors ht obtained must be nearly-optimal, compared to the Bayes-\noptimal predictor.\nTheorem 1. Suppose (x1, y1), . . . , (xm, ym) are drawn independently from a distribution supported\non Bd \u00d7 [0, 1], such that E[y|x] = u(w \u00b7 x), where (cid:107)w(cid:107) \u2264 W , and u : R \u2192 [0, 1] is a known non-\ndecreasing 1-Lipschitz function. Then for any \u03b4 \u2208 (0, 1), the following holds with probability at\nhypothesis ht(x) = u(wt \u00b7 x) satis\ufb01es\n\nleast 1 \u2212 \u03b4: there exists some iteration t < O(W(cid:112)m/ log(1/\u03b4)) of GLM-TRON such that the\n\n(cid:32)(cid:114)\n\n(cid:33)\n\nW 2 log(m/\u03b4)\n\nm\n\n.\n\nmax{\u02c6\u03b5(ht), \u03b5(ht)} \u2264 O\n\n3\n\n\fj=1\n\n1\nm\n\ni=1\n\nend for\n\nAlgorithm 2 SLISOTRON\nInput: data (cid:104)(xi, yi)(cid:105)m\nw1 := 0;\nfor t = 1, 2, . . . do\n\nut := LIR ((wt \u00b7 x1, y1), . . . , (wt \u00b7 xm, ym)) // Fit 1-d function along wt\nwt+1 := wt +\n\n(yi \u2212 ut(wt \u00b7 xi))xi\n\ni=1 \u2208 Rd \u00d7 [0, 1], held-out data (cid:104)(xm+j, ym+j)(cid:105)s\nm(cid:88)\nOutput: arg minht(cid:80)s\nup to a constant, we can easily \ufb01nd an appropriate ht by picking the one that has least (cid:99)err(ht) on a\ntance(cid:13)(cid:13)wt+1 \u2212 w(cid:13)(cid:13)2 is substantially smaller than (cid:107)wt \u2212 w(cid:107)2. 
Since the squared distance is bounded\nbelow by 0, and(cid:13)(cid:13)w0 \u2212 w(cid:13)(cid:13)2 \u2264 W 2, there is an iteration (arrived at within reasonable time) such that\n\nheld-out set.\nThe main idea of the proof is showing that at each iteration, if \u02c6\u03b5(ht) is not small, then the squared dis-\n\nIn particular, the theorem implies that some ht has small enough \u03b5(ht). Since \u03b5(ht) equals err(ht)\n\nj=1(ht(xm+j) \u2212 ym+j)2\n\nthe hypothesis ht at that iteration is highly accurate. Although the algorithm minimizes empirical\nsquared error, we can bound the true error using a uniform convergence argument. The complete\nproofs are provided in the full version of the paper ([11] Appendix A).\n\n4 The SLISOTRON algorithm\n\nIn this section, we present SLISOTRON (Alg. 2), which is applicable to the harder setting where\nthe transfer function u is unknown, except for it being non-decreasing and 1-Lipschitz. SLISOTRON\ndoes have one parameter, the Lipschitz constant; however, in theory we show that this can simply be\nset to 1. The main difference between SLISOTRON and GLM-TRON is that now the transfer function\nmust also be learned, and the algorithm keeps track of a transfer function ut which changes from\niteration to iteration. The algorithm is inspired by the Isotron algorithm [7], with the main difference\nbeing that at each iteration, instead of applying the PAV procedure to \ufb01t an arbitrary monotonic\nfunction along the direction wt, we use a different procedure, (Lipschitz Isotonic Regression) LIR,\nto \ufb01t a Lipschitz monotonic function, ut, along wt. This key difference allows for an analysis that\ndoes not require a fresh sample each iteration. We also provide an ef\ufb01cient O(m log(m)) time\nalgorithm for LIR (see Section 4.1), making SLISOTRON an extremely ef\ufb01cient algorithm.\nWe now turn to the formal theorem about our algorithm. The formal guarantees parallel those of\nthe GLM-TRON algorithm. 
However, the rates achieved are somewhat worse, due to the additional difficulty of simultaneously estimating both u and w.

Theorem 2. Suppose (x_1, y_1), . . . , (x_m, y_m) are drawn independently from a distribution supported on B_d × [0, 1], such that E[y|x] = u(w · x), where ‖w‖ ≤ W, and u : R → [0, 1] is an unknown non-decreasing 1-Lipschitz function. Then the following two bounds hold:

1. (Dimension-dependent) With probability at least 1 − δ, there exists some iteration t < O( (W m / (d log(W m/δ)))^{1/3} ) of SLISOTRON such that

    max{ε̂(h_t), ε(h_t)} ≤ O( ( d W² log(W m/δ) / m )^{1/3} ).

2. (Dimension-independent) With probability at least 1 − δ, there exists some iteration t < O( (W m / log(m/δ))^{1/4} ) of SLISOTRON such that

    max{ε̂(h_t), ε(h_t)} ≤ O( ( W² log(m/δ) / m )^{1/4} ).

As in the case of Thm. 1, one can easily find h_t which satisfies the theorem's conditions, by running the SLISOTRON algorithm for sufficiently many iterations, and choosing the hypothesis h_t which minimizes êrr(h_t) on a held-out set. The algorithm minimizes empirical error, and generalization bounds are obtained using a uniform convergence argument. The proofs are somewhat involved and appear in the full paper ([11] Appendix B).

4.1 Lipschitz isotonic regression

The SLISOTRON algorithm (Alg. 2) performs Lipschitz Isotonic Regression (LIR) at each iteration. The goal is to find the non-decreasing 1-Lipschitz function that best fits (in least squared error) the data in one dimension. Let (z_1, y_1), . . . , (z_m, y_m) be such that z_i ∈ R, y_i ∈ [0, 1] and z_1 ≤ z_2 ≤ · · · ≤ z_m. 
The Lipschitz Isotonic Regression (LIR) problem is defined as the following quadratic program:

    Minimize w.r.t. ŷ_i:  (1/2) Σ_{i=1..m} (y_i − ŷ_i)²                                  (1)
    subject to:
        ŷ_i ≤ ŷ_{i+1}                      1 ≤ i ≤ m − 1   (Monotonicity)                 (2)
        ŷ_{i+1} − ŷ_i ≤ z_{i+1} − z_i      1 ≤ i ≤ m − 1   (Lipschitz)                    (3)

Once the values ŷ_i are obtained at the data points, the actual function can be constructed by interpolating linearly between the data points. Prior to this work, the best known algorithm for this problem was due to Yeganova and Wilbur [10] and required O(m²) time for m points. In this work, we present an algorithm that performs the task in O(m log(m)) time. The actual algorithm is fairly complex and relies on designing a clever data structure. We provide a high-level view here; the details are provided in the full version ([11] Appendix D).
Algorithm Sketch: We define functions G_i(·), where G_i(s) is the minimum squared loss that can be attained if ŷ_i is fixed to be s, and ŷ_{i+1}, . . . , ŷ_m are then chosen to be the best fit 1-Lipschitz non-decreasing function to the points (z_i, y_i), . . . , (z_m, y_m). Formally, for i = 1, . . . , m, define the functions

    G_i(s) = min over ŷ_{i+1},...,ŷ_m of  (1/2)(s − y_i)² + (1/2) Σ_{j=i+1..m} (ŷ_j − y_j)²     (4)

subject to the constraints (where s = ŷ_i),

        ŷ_j ≤ ŷ_{j+1}                      i ≤ j ≤ m − 1   (Monotonicity)
        ŷ_{j+1} − ŷ_j ≤ z_{j+1} − z_j      i ≤ j ≤ m − 1   (Lipschitz)

Furthermore, define s*_i = arg min_s G_i(s). The functions G_i are piecewise quadratic, differentiable everywhere and strictly convex, a fact we prove in the full paper [11]. Thus, G_i is minimized at s*_i, and it is strictly increasing on both sides of s*_i. Note that G_m(s) = (1/2)(s − y_m)², which is piecewise quadratic, differentiable everywhere and strictly convex. Let δ_i = z_{i+1} − z_i. The remaining G_i obey the following recursive relation:

    G_{i−1}(s) = (1/2)(s − y_{i−1})² + { G_i(s + δ_{i−1})   if s ≤ s*_i − δ_{i−1}
                                         G_i(s*_i)          if s*_i − δ_{i−1} < s ≤ s*_i          (5)
                                         G_i(s)             if s*_i < s }

As intuition for the above relation, note that G_{i−1}(s) is obtained by fixing ŷ_{i−1} = s and then choosing ŷ_i as close to s*_i as possible (since G_i is strictly increasing on both sides of s*_i) without violating either the monotonicity or Lipschitz constraints.
The above argument can be immediately translated into an algorithm, if the values s*_i are known. Since s*_1 minimizes G_1(s), which is the same as the objective of (1), start with ŷ_1 = s*_1, and then successively choose values for ŷ_i to be as close to s*_i as possible without violating the Lipschitz or monotonicity constraints. This will produce an assignment for the ŷ_i which achieves loss equal to G_1(s*_1) and hence is optimal.

Figure 1: (a) Finding the zero of G′_i. (b) Update step to transform the representation of G′_i to G′_{i−1}.

The harder part of the algorithm is finding the values s*_i. 
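The forward pass just described, which constructs the optimal ŷ once the minimizers s*_i are available, can be sketched directly. This is our own illustrative helper (the name `lir_forward` is hypothetical, and the s*_i are assumed precomputed; obtaining them efficiently is exactly the harder part addressed next):

```python
def lir_forward(z, s_star):
    """Construct the Lipschitz-isotonic fit from the minimizers s*_i:
    y_hat_1 := s*_1; each later y_hat_i is taken as close to s*_i as the
    monotonicity (y_hat_i >= y_hat_{i-1}) and 1-Lipschitz
    (y_hat_i <= y_hat_{i-1} + z_i - z_{i-1}) constraints allow."""
    y_hat = [s_star[0]]
    for i in range(1, len(z)):
        lo = y_hat[-1]                       # monotonicity lower bound
        hi = y_hat[-1] + (z[i] - z[i - 1])   # Lipschitz upper bound
        y_hat.append(min(max(s_star[i], lo), hi))
    return y_hat
```

For instance, with z = [0, 1, 2] and s* = [0.5, 0.2, 0.9], the second value is clamped up to 0.5 by monotonicity, giving the fit [0.5, 0.5, 0.9].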
Notice that the G′_i are all piecewise linear, continuous and strictly increasing, and obey a similar recursive relation (with G′_m(s) = s − y_m):

    G′_{i−1}(s) = (s − y_{i−1}) + { G′_i(s + δ_{i−1})   if s ≤ s*_i − δ_{i−1}
                                    0                    if s*_i − δ_{i−1} < s ≤ s*_i          (6)
                                    G′_i(s)              if s*_i < s }

The algorithm then finds the s*_i by finding zeros of the G′_i, starting from i = m, where G′_m(s) = s − y_m and s*_m = y_m.
We design a special data structure, called notable red-black trees, for representing piecewise linear, continuous, strictly increasing functions. We initialize such a tree T to represent G′_m(s) = s − y_m. Assuming that at some time it represents G′_i, we need to support two operations:
1. Find the zero of G′_i to get s*_i. Such an operation can be done efficiently, in O(log(m)) time, using a tree-like structure (Fig. 1(a)).
2. Update T to represent G′_{i−1}. This operation is more complicated, but using the relation (6), we do the following: split the interval containing s*_i, and move the left half of the piecewise linear function G′_i by δ_{i−1} (Fig. 1(b)), adding the constant zero function in between. Finally, we add the linear function s − y_{i−1} to every interval, to get G′_{i−1}, which is again piecewise linear, continuous and strictly increasing.
To perform the operations in step (2) above, we cannot naïvely apply the transformations shift-by(δ_{i−1}) and add(s − y_{i−1}) to every node in the tree, as it may take O(m) operations. Instead, we simply leave a note (hence the name notable red-black trees) that such a transformation should be applied before the function is evaluated at that node or at any of its descendants. To prevent a large number of such notes accumulating at any given node, we show that these notes satisfy certain commutative and additive relations, thus requiring us to keep track of no more than 2 notes at any given node. This lazy evaluation of notes allows us to perform all of the above operations in O(log(m)) time. The details of the construction are provided in the full paper ([11] Appendix D).

5 Experiments

In this section, we present an empirical study of the SLISOTRON and GLM-TRON algorithms. We perform two evaluations using synthetic data. The first one compares SLISOTRON and Isotron [7] and illustrates the importance of imposing a Lipschitz constraint. The second one demonstrates the advantage of using SLISOTRON over standard regression techniques, in the sense that SLISOTRON can learn any monotonic Lipschitz function. We also report results of an evaluation of SLISOTRON, GLM-TRON and several competing approaches on 5 UCI [12] datasets. All errors are reported in terms of average root mean squared error (RMSE) using 10-fold cross validation, along with the standard deviation.

5.1 Synthetic Experiments

Although the theoretical guarantees for Isotron are under the assumption that we get a fresh sample each round, one may still attempt to run Isotron on the same sample each iteration and evaluate the empirical performance. Then, the main difference between SLISOTRON and Isotron is that while SLISOTRON fits the best Lipschitz monotonic function using LIR each iteration, Isotron merely finds the best monotonic fit using PAV. This difference is analogous to finding a large-margin classifier vs. just a consistent one. We believe this difference will be particularly relevant when the data is sparse and lies in a high-dimensional space.
Our first synthetic dataset is the following: the dataset is of size m = 1500 in d = 500 dimensions. The first co-ordinate of each point is chosen uniformly at random from {−1, 0, 1}. The remaining co-ordinates are all 0, except that for each data point one of the remaining co-ordinates is randomly set to 1. The true direction is w = (1, 0, . . . , 0) and the transfer function is u(z) = (1 + z)/2. Both SLISOTRON and Isotron put weight on the first co-ordinate (the true direction). However, Isotron overfits the data using the remaining (irrelevant) co-ordinates, which SLISOTRON is prevented from doing because of the Lipschitz constraint. Figure 2(a) shows the transfer functions as predicted by the two algorithms, and the table below the plot shows the average RMSE using 10-fold cross validation. The Δ column shows the average difference between the RMSE values of the two algorithms across the folds.

(a) Synthetic Experiment 1: SLISOTRON 0.289 ± 0.014, Isotron 0.334 ± 0.026, Δ 0.045 ± 0.018
(b) Synthetic Experiment 2: SLISOTRON 0.058 ± 0.003, Logistic 0.073 ± 0.006, Δ 0.015 ± 0.004

Figure 2: (a) The figure shows the transfer functions as predicted by SLISOTRON and Isotron. The table shows the average RMSE using 10-fold cross validation. The Δ column shows the average difference between the RMSE values of the two algorithms across the folds. (b) The figure shows the transfer function as predicted by SLISOTRON. The table shows the average RMSE using 10-fold cross validation for SLISOTRON and Logistic Regression. The Δ column shows the average difference between the RMSE values of the two algorithms across folds.

A principal advantage of SLISOTRON over standard regression techniques is that it is not necessary to know the transfer function in advance. The second synthetic experiment is designed as a sanity check to verify this claim. The dataset is of size m = 1000 in d = 4 dimensions. 
We chose a random direction as the "true" w and used a piecewise linear function as the "true" u. We then added random noise (σ = 0.1) to the y values. We compared SLISOTRON to Logistic Regression on this dataset. SLISOTRON correctly recovers the true function (up to some scaling). Fig. 2(b) shows the actual transfer function as predicted by SLISOTRON, which is essentially the function we used. The table below the figure shows the performance comparison between SLISOTRON and logistic regression.

5.2 Real World Datasets

We now turn to describe the results of experiments performed on the following 5 UCI datasets: communities, concrete, housing, parkinsons, and wine-quality. We compared the performance of SLISOTRON (Sl-Iso) and GLM-TRON with logistic transfer function (GLM-t) against Isotron (Iso), as well as standard logistic regression (Log-R), linear regression (Lin-R) and a simple heuristic algorithm (SIM) for single index models, along the lines of standard iterative maximum-likelihood procedures for these types of problems (e.g., [13]). The SIM algorithm works by iteratively fixing the direction w and finding the best transfer function u, and then fixing u and optimizing w via gradient descent. For each of the algorithms we performed 10-fold cross validation, using 1 fold each time as the test set, and we report averaged results across the folds.
Table 1 shows average RMSE values of all the algorithms across 10 folds. The first column shows the mean Y value (with standard deviation) of the dataset for comparison. Table 2 shows the average difference between RMSE values of SLISOTRON and the other algorithms across the folds. Negative values indicate that the algorithm performed better than SLISOTRON. 
The results suggest that the performance of SLISOTRON (and even Isotron) is comparable to other regression techniques and in many cases also slightly better. The performance of GLM-TRON is similar to standard implementations of logistic regression on these datasets. This suggests that these algorithms should work well in practice, while providing non-trivial theoretical guarantees.
It is also illustrative to see how the transfer functions found by SLISOTRON and Isotron compare. In Figure 3, we plot the transfer functions for concrete and communities. We see that the fits found by SLISOTRON tend to be smoother because of the Lipschitz constraint. We also observe that concrete is the only dataset where SLISOTRON performs noticeably better than logistic regression, and the transfer function is indeed somewhat far from the logistic function.

Table 1: Average RMSE values using 10-fold cross validation. The Ȳ column shows the mean Y value and standard deviation.

dataset      | Ȳ            | Sl-Iso       | GLM-t        | Iso          | Lin-R        | Log-R        | SIM
communities  | 0.24 ± 0.23  | 0.13 ± 0.01  | 0.14 ± 0.01  | 0.14 ± 0.01  | 0.14 ± 0.01  | 0.14 ± 0.01  | 0.14 ± 0.01
concrete     | 35.8 ± 16.7  | 9.9 ± 0.9    | 10.5 ± 1.0   | 9.9 ± 0.8    | 10.4 ± 1.1   | 10.4 ± 1.0   | 9.9 ± 0.9
housing      | 22.5 ± 9.2   | 4.65 ± 1.00  | 4.85 ± 0.95  | 4.68 ± 0.98  | 4.81 ± 0.99  | 4.70 ± 0.98  | 4.63 ± 0.78
parkinsons   | 29 ± 10.7    | 10.1 ± 0.2   | 10.3 ± 0.2   | 10.1 ± 0.2   | 10.2 ± 0.2   | 10.2 ± 0.2   | 10.3 ± 0.2
winequality  | 5.9 ± 0.9    | 0.78 ± 0.04  | 0.79 ± 0.04  | 0.78 ± 0.04  | 0.75 ± 0.04  | 0.75 ± 0.04  | 0.78 ± 0.03

Table 2: Performance comparison of SLISOTRON with the other algorithms. 
The values reported are the average difference between RMSE values of the algorithm and SLISOTRON across the folds. Negative values indicate better performance than SLISOTRON.

dataset      | GLM-t        | Iso          | Lin-R        | Log-R        | SIM
communities  | 0.00 ± 0.00  | 0.00 ± 0.00  | 0.00 ± 0.00  | 0.00 ± 0.00  | 0.00 ± 0.00
concrete     | 0.56 ± 0.35  | 0.04 ± 0.17  | 0.52 ± 0.35  | 0.55 ± 0.32  | -0.03 ± 0.26
housing      | 0.20 ± 0.48  | 0.03 ± 0.55  | 0.16 ± 0.49  | 0.05 ± 0.43  | -0.02 ± 0.53
parkinsons   | 0.19 ± 0.09  | 0.01 ± 0.03  | 0.11 ± 0.07  | 0.09 ± 0.07  | 0.21 ± 0.20
winequality  | 0.01 ± 0.01  | 0.00 ± 0.00  | -0.03 ± 0.02 | -0.03 ± 0.02 | 0.01 ± 0.01

(a) concrete    (b) communities

Figure 3: The transfer function u as predicted by SLISOTRON (blue) and Isotron (red) for the concrete and communities datasets. The domain of both functions was normalized to [−1, 1].

References
[1] P. McCullagh and J. A. Nelder. Generalized Linear Models (2nd ed.). Chapman and Hall, 1989.
[2] W. Härdle, P. Hall, and H. Ichimura. Optimal smoothing in single-index models. Annals of Statistics, 21(1):157–178, 1993.
[3] J. Horowitz and W. Härdle. Direct semiparametric estimation of single-index models with discrete covariates, 1994.
[4] M. Hristache, A. Juditsky, and V. Spokoiny. Direct estimation of the index coefficients in a single-index model. Technical Report 3433, INRIA, May 1998.
[5] P. Naik and C. Tsai. Isotonic single-index model for high-dimensional database marketing. Computational Statistics and Data Analysis, 47:775–790, 2004.
[6] P. Ravikumar, M. Wainwright, and B. Yu. 
Single index convex experts: Efficient estimation via adapted Bregman losses. Snowbird Workshop, 2008.
[7] A. T. Kalai and R. Sastry. The Isotron algorithm: High-dimensional isotonic regression. In COLT '09, 2009.
[8] A. T. Kalai, A. R. Klivans, Y. Mansour, and R. A. Servedio. Agnostically learning halfspaces. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, FOCS '05, pages 11–20, Washington, DC, USA, 2005. IEEE Computer Society.
[9] S. Shalev-Shwartz, O. Shamir, and K. Sridharan. Learning kernel-based halfspaces with the zero-one loss. In COLT, 2010.
[10] L. Yeganova and W. J. Wilbur. Isotonic regression under Lipschitz constraint. Journal of Optimization Theory and Applications, 141(2):429–443, 2009.
[11] S. M. Kakade, A. T. Kalai, V. Kanade, and O. Shamir. Efficient learning of generalized linear and single index models with isotonic regression. arxiv.org/abs/1104.2018.
[12] UCI Machine Learning Repository, University of California, Irvine: http://archive.ics.uci.edu/ml/.
[13] S. Cosslett. Distribution-free maximum-likelihood estimator of the binary choice model. Econometrica, 51(3), May 1983.
", "award": [], "sourceid": 589, "authors": [{"given_name": "Sham", "family_name": "Kakade", "institution": null}, {"given_name": "Varun", "family_name": "Kanade", "institution": null}, {"given_name": "Ohad", "family_name": "Shamir", "institution": null}, {"given_name": "Adam", "family_name": "Kalai", "institution": null}]}