{"title": "Fast, Provable Algorithms for Isotonic Regression in all L_p-norms", "book": "Advances in Neural Information Processing Systems", "page_first": 2719, "page_last": 2727, "abstract": "Given a directed acyclic graph $G,$ and a set of values $y$ on the vertices, the Isotonic Regression of $y$ is a vector $x$ that respects the partial order described by $G,$ and minimizes $\|x-y\|,$ for a specified norm. This paper gives improved algorithms for computing the Isotonic Regression for all weighted $\ell_{p}$-norms with rigorous performance guarantees. Our algorithms are quite practical, and their variants can be implemented to run fast in practice.", "full_text": "Fast, Provable Algorithms for Isotonic Regression in\n\nall ℓp-norms ∗\n\nRasmus Kyng\n\nDept. of Computer Science\n\nYale University\n\nrasmus.kyng@yale.edu\n\nAnup Rao†\n\nSchool of Computer Science\n\nGeorgia Tech\n\narao89@gatech.edu\n\nSushant Sachdeva\n\nDept. of Computer Science\n\nYale University\n\nsachdeva@cs.yale.edu\n\nAbstract\n\nGiven a directed acyclic graph G, and a set of values y on the vertices, the Isotonic\nRegression of y is a vector x that respects the partial order described by G, and\nminimizes ‖x − y‖, for a specified norm. This paper gives improved algorithms\nfor computing the Isotonic Regression for all weighted ℓp-norms with rigorous\nperformance guarantees. Our algorithms are quite practical, and variants of them\ncan be implemented to run fast in practice.\n\n1\n\nIntroduction\n\nA directed acyclic graph (DAG) G(V, E) defines a partial order on V where u precedes v if there\nis a directed path from u to v. We say that a vector x ∈ RV is isotonic (with respect to G) if it is a\nweakly order-preserving mapping of V into R. Let IG denote the set of all x that are isotonic with\nrespect to G. 
It is immediate that IG can be equivalently defined as follows:\n\nIG = {x ∈ RV | xu ≤ xv for all (u, v) ∈ E}.\n\n(1)\nGiven a DAG G, and a norm ‖·‖ on RV , the Isotonic Regression of observations y ∈ RV is given\nby the x ∈ IG that minimizes ‖x − y‖.\nSuch monotonic relationships are fairly common in data. They allow one to impose only weak\nassumptions on the data, e.g. the typical height of a young girl is an increasing function of her\nage and of the heights of her parents, rather than a more constrained parametric model.\nIsotonic Regression is an important shape-constrained nonparametric regression method that has\nbeen studied since the 1950s [1, 2, 3]. It has applications in diverse fields such as Operations Research [4, 5] and Signal Processing [6]. In Statistics, it has several applications (e.g. [7, 8]), and the\nstatistical properties of Isotonic Regression under the ℓ2-norm have been well studied, particularly\nover linear orderings (see [9] and references therein). More recently, Isotonic Regression has found\nseveral applications in Learning [10, 11, 12, 13, 14]. It was used by Kalai and Sastry [10] to provably\nlearn Generalized Linear Models and Single Index Models; and by Zadrozny and Elkan [13], and\nNarasimhan and Agarwal [14] towards constructing binary Class Probability Estimation models.\nThe most common norms of interest are weighted ℓp-norms, defined as\n\n‖z‖w,p = (Σ_{v∈V} w_v^p · |z_v|^p)^{1/p} for p ∈ [1,∞), and ‖z‖w,∞ = max_{v∈V} w_v · |z_v| for p = ∞,\n\nwhere w_v > 0 is the weight of a vertex v ∈ V. In this paper, we focus on algorithms for Isotonic\nRegression under weighted ℓp-norms. 
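As a concrete illustration of these two definitions, the weighted ℓp-norm and membership in IG can be computed directly. This is our own minimal sketch, not code from the paper; graphs are assumed to be given as a list of directed edges over vertices 0..n−1.

```python
import math

def weighted_lp_norm(z, w, p):
    """Weighted l_p norm ||z||_{w,p}; p may be any float in [1, inf) or math.inf."""
    if p == math.inf:
        return max(wv * abs(zv) for zv, wv in zip(z, w))
    # (sum_v w_v^p |z_v|^p)^{1/p}, written as (w_v |z_v|)^p
    return sum((wv * abs(zv)) ** p for zv, wv in zip(z, w)) ** (1.0 / p)

def is_isotonic(x, edges):
    """Membership in I_G: x respects every directed edge (u, v)."""
    return all(x[u] <= x[v] for u, v in edges)
```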
Such algorithms have been applied to large data-sets from\nMicroarrays [15], and from the web [16, 17].\n\n∗Code from this work is available at https://github.com/sachdevasushant/Isotonic\n†Part of this work was done when this author was a graduate student at Yale University.\n\n1\n\n\fGiven a DAG G, and observations y ∈ RV , our regression problem can be expressed as the following\nconvex program:\n\nmin ‖x − y‖w,p such that xu ≤ xv for all (u, v) ∈ E.\n\n(2)\n\n1.1 Our Results\nLet |V| = n, and |E| = m. We'll assume that G is connected, and hence m ≥ n − 1.\nℓp-norms, p < ∞. We give a unified, optimization-based framework for algorithms that provably\nsolve the Isotonic Regression problem for p ∈ [1,∞). The following is an informal statement of our\nmain theorem (Theorem 3.1) in this regard (assuming the w_v are bounded by poly(n)).\nTheorem 1.1 (Informal). There is an algorithm that, given a DAG G, observations y, and δ > 0,\nruns in time O(m^{1.5} log^2 n log(n/δ)), and computes an isotonic xALG ∈ IG such that\n\n‖xALG − y‖^p_{w,p} ≤ min_{x∈IG} ‖x − y‖^p_{w,p} + δ.\n\nThe previous best time bounds were O(nm log(n^2/m)) for p ∈ (1,∞) [18] and O(nm + n^2 log n) for\np = 1 [19].\nℓ∞-norms. For ℓ∞-norms, unlike ℓp-norms for p ∈ (1,∞), the Isotonic Regression problem need\nnot have a unique solution. There are several specific solutions that have been studied in the literature\n(see [20] for a detailed discussion). In this paper, we show that some of them (MAX, MIN, and AVG,\nto be precise) can be computed in time linear in the size of G.\nTheorem 1.2. There is an algorithm that, given a DAG G(V, E), a set of observations y ∈ RV , and\nweights w, runs in expected time O(m), and computes an isotonic xINF ∈ IG such that\n\n‖xINF − y‖w,∞ = min_{x∈IG} ‖x − y‖w,∞.\n\nOur algorithm achieves the best possible running time. This was not known even for linear or tree\norders. The previous best running time was O(m log n) [20].\nStrict Isotonic Regression. We also give improved algorithms for Strict Isotonic Regression. Given\nobservations y, and weights w, its Strict Isotonic Regression xSTRICT is defined to be the limit of x̂p\nas p goes to ∞, where x̂p is the Isotonic Regression for y under the norm ‖·‖w,p. It is immediate\nthat xSTRICT is an ℓ∞ Isotonic Regression for y. In addition, it is unique and satisfies several desirable\nproperties (see [21]).\nTheorem 1.3. There is an algorithm that, given a DAG G(V, E), a set of observations y ∈ RV , and\nweights w, runs in expected time O(mn), and computes xSTRICT, the Strict Isotonic Regression of y.\n\nThe previous best running time was O(min(mn, n^ω) + n^2 log n) [21].\n\n1.2 Detailed Comparison to Previous Results\nℓp-norms, p < ∞. There has been a lot of work on fast algorithms for special graph families, mostly\nfor p = 1, 2 (see [22] for references). For some cases where G is very simple, e.g. a directed path\n(corresponding to linear orders), or a rooted, directed tree (corresponding to tree orders), several\nworks give algorithms with running times of O(n) or O(n log n) (see [22] for references).\nTheorem 1.1 not only improves on the previously best known algorithms for general DAGs, but also\non several algorithms for special graph families (see Table 1). 
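For the linear orders just mentioned, the classical Pool Adjacent Violators algorithm solves ℓ2 Isotonic Regression on a directed path in O(n) time. A minimal unweighted sketch of that classical method, as our own illustration (it is unrelated to the paper's IPM approach):

```python
def pava(y):
    """Pool Adjacent Violators: l_2 isotonic regression on a linear order.

    Returns the nondecreasing x minimizing sum_i (x_i - y_i)^2, in O(n) time.
    """
    # Each block stores [sum of values, count]; adjacent violating blocks merge,
    # and every element of a block receives the block mean.
    blocks = []
    for v in y:
        blocks.append([v, 1])
        # Merge while the last block's mean is below its predecessor's mean
        # (compared via cross-multiplication to avoid division).
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    x = []
    for total, count in blocks:
        x.extend([total / count] * count)
    return x
```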
One such setting is where V is a point\nset in d dimensions, and (u, v) ∈ E whenever ui ≤ vi for all i ∈ [d]. This setting has applications\nto data analysis, as in the example given earlier, and has been studied extensively (see [23] for\nreferences). For this case, it was proved by Stout (see Prop. 2, [23]) that these partial orders can be\nembedded in a DAG with O(n log^{d−1} n) vertices and edges, and that this DAG can be computed in\ntime linear in its size. The bounds then follow by combining this result with our theorem above.\nWe obtain improved running times for all ℓp-norms for DAGs with m = o(n^2/log^6 n), and for\nd-dimensional point sets for d ≥ 3. For d = 2, Stout [19] gives an O(n log^2 n) time algorithm.\n\n2\n\n\fTable 1: Comparison to previous best results for ℓp-norms, p ≠ ∞\n\n                          Previous best ℓ1       Previous best ℓp, 1 < p < ∞   This paper ℓp, p < ∞\nd-dim vertex set, d ≥ 3   n^2 log^d n [19]       n^2 log^{d+1} n [19]          n^{1.5} log^{1.5(d+1)} n\narbitrary DAG             nm + n^2 log n [15]    nm log(n^2/m) [18]            m^{1.5} log^3 n\n\nFor the sake of brevity, we have ignored the O(·) notation implicit in the bounds, and o(log n) terms. The results\nare reported assuming an error parameter δ = n^{−Ω(1)}, and that the w_v are bounded by poly(n).\n\nℓ∞-norms. For weighted ℓ∞-norms on arbitrary DAGs, the previous best result was O(m log n +\nn log^2 n), due to Kaufman and Tamir [24]. A manuscript by Stout [20] improves it to O(m log n).\nThese algorithms are based on parametric search, and are impractical. Our algorithm is simple,\nachieves the best possible running time, and only requires random sampling and topological sort.\nIn a parallel independent work, Stout [25] gives O(n)-time algorithms for linear orders, trees, and\nd-grids, and an O(n log^{d−1} n) algorithm for point sets in d dimensions. Theorem 1.2 implies the\nlinear-time algorithms immediately. 
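The d-dimensional dominance order above can be materialized naively as a DAG; a small sketch of ours, which uses O(n^2 d) work and up to O(n^2) edges, in contrast to Stout's O(n log^{d−1} n) embedding used by the paper:

```python
def dominance_edges(points):
    """Naive dominance DAG for a d-dimensional point set.

    Adds an edge (i, j) whenever point i is coordinate-wise <= point j,
    i.e. i precedes j in the dominance partial order.
    """
    n = len(points)
    edges = []
    for i in range(n):
        for j in range(n):
            if i != j and all(a <= b for a, b in zip(points[i], points[j])):
                edges.append((i, j))
    return edges
```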
The result for d-dimensional point sets follows after embedding\nthe point sets into DAGs of size O(n logd\u22121 n), as for (cid:96)p-norms.\nStrict Isotonic Regression. Strict Isotonic regression was introduced and studied in [21]. It also\ngave the only previous algorithm for computing it, that runs in time O(min(mn, n\u03c9) + n2 log n).\nTheorem 1.3 is an improvement when m = o(n log n).\n\n1.3 Overview of the Techniques and Contribution\n(cid:96)p-norms, p < \u221e.\nIt is immediate that Isotonic Regression, as formulated in Equation (2), is\na convex programming problem. For weighted (cid:96)p-norms with p < \u221e, applying generic convex-\nprogramming algorithms such as Interior Point methods to this formulation leads to algorithms that\nare quite slow.\nWe obtain faster algorithms for Isotonic Regression by replacing the computationally intensive com-\nponent of Interior Point methods, solving systems of linear equations, with approximate solves. This\napproach has been used to design fast algorithms for generalized \ufb02ow problems [26, 27, 28].\nWe present a complete proof of an Interior Point method for a large class of convex programs that\nonly requires approximate solves. Daitch and Spielman [26] had proved such a result for linear\nprograms. We extend this to (cid:96)p-objectives, and provide an improved analysis that only requires\nlinear solvers with a constant factor relative error bound, whereas the method from Daitch and\nSpielman required polynomially small error bounds.\nThe linear systems in [27, 28] are Symmetric Diagonally Dominant (SDD) matrices. The seminal\nwork of Spielman and Teng [29] gives near-linear time approximate solvers for such systems, and\nlater research has improved these solvers further [30, 31]. Daitch and Spielman [26] extended these\nsolvers to M-matrices (generalizations of SDD). The systems we need to solve are neither SDD,\nnor M-matrices. 
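For reference, a matrix is Symmetric Diagonally Dominant (SDD) when it is symmetric and each diagonal entry is at least the sum of the absolute values of the off-diagonal entries in its row. A naive O(n^2) check, purely as our own illustration of the definition:

```python
def is_sdd(A):
    """True iff A (a list of rows) is symmetric and diagonally dominant:
    A[i][i] >= sum_{j != i} |A[i][j]| for every row i."""
    n = len(A)
    if any(A[i][j] != A[j][i] for i in range(n) for j in range(n)):
        return False
    return all(A[i][i] >= sum(abs(A[i][j]) for j in range(n) if j != i)
               for i in range(n))
```

Graph Laplacians, the prototypical inputs to the fast solvers cited above, pass this test.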
We develop fast solvers for this new class of matrices using fast SDD solvers. We\nstress that standard techniques for approximate inverse computation, e.g. Conjugate Gradient, are\nnot sufficient for approximately solving our systems in near-linear time. These methods have at least\na square-root dependence on the condition number, which inevitably becomes huge in IPMs.\nℓ∞-norms and Strict Isotonic Regression. Algorithms for ℓ∞-norms and Strict Isotonic Regression are based on techniques presented in a recent paper of Kyng et al. [32]. We reduce ℓ∞-norm\nIsotonic Regression to the following problem, referred to as Lipschitz learning on directed graphs\nin [32] (see Section 4 for details): We have a directed graph H, with edge lengths given by len.\nGiven x ∈ R^{V(H)}, for every (u, v) ∈ E(H), define grad+_G[x](u, v) = max{(x(u) − x(v))/len(u, v), 0}. Now,\ngiven y that assigns real values to a subset of V(H), the goal is to determine x ∈ R^{V(H)} that agrees\nwith y and minimizes max_{(u,v)∈E(H)} grad+_G[x](u, v).\n\n3\n\n\fThe above problem is solved in O(m + n log n) time for general directed graphs in [32]. We give\na simple linear-time reduction to the above problem with the additional property that H is a DAG.\nFor DAGs, their algorithm can be implemented to run in O(m + n) time.\nIt is proved in [21] that computing the Strict Isotonic Regression is equivalent to computing the\nisotonic vector that minimizes the error under the lexicographic ordering (see Section 4). Under the\nsame reduction as in the ℓ∞-case, we show that this is equivalent to minimizing grad+ under the\nlexicographic ordering. 
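The directed gradient grad+ in the Lipschitz learning problem can be evaluated edge by edge; a small sketch of the definition (our own), including the convention for zero-length edges:

```python
import math

def max_grad_plus(x, edges, length):
    """max over directed edges (u, v) of grad+[x](u, v).

    grad+ is (x[u] - x[v]) / len(u, v), clamped below at 0; on a zero-length
    edge it is 0 unless x[u] > x[v], in which case it is +infinity.
    """
    worst = 0.0
    for u, v in edges:
        diff = x[u] - x[v]
        if length[(u, v)] == 0:
            g = math.inf if diff > 0 else 0.0
        else:
            g = max(diff / length[(u, v)], 0.0)
        worst = max(worst, g)
    return worst
```

With all lengths zero, the returned value is finite exactly when x is isotonic, matching the remark in Section 4.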
It is proved in [32] that the lex-minimizer can be computed with essentially n\ncalls to ℓ∞-minimization, immediately implying our result.\n\n1.4 Further Applications\n\nThe IPM framework that we introduce to design our algorithm for Isotonic Regression (IR), and\nthe associated results, are very general, and can be applied as-is to other problems. As a concrete\napplication, the algorithm of Kakade et al. [12] for provably learning Generalized Linear Models\nand Single Index Models learns 1-Lipschitz monotone functions on linear orders in O(n^2) time\n(procedure LPAV). The structure of the associated convex program resembles IR. Our IPM results\nand solvers immediately imply an n^{1.5} time algorithm (up to log factors).\nImproved algorithms for IR (or for learning Lipschitz functions) on d-dimensional point sets could\nbe applied towards learning d-dimensional multi-index models where the link function is nondecreasing\nw.r.t. the natural ordering on d variables, extending [10, 12]. They could also be applied towards\nconstructing Class Probability Estimation (CPE) models from multiple classifiers, by finding a mapping from multiple classifier scores to a probabilistic estimate, extending [13, 14].\nOrganization. We report experimental results in Section 2. An outline of the algorithms and analysis for ℓp-norms, p < ∞, is presented in Section 3. In Section 4, we define the Lipschitz regression\nproblem on DAGs, and give the reduction from ℓ∞-norm Isotonic Regression. We defer a detailed\ndescription of the algorithms, and most proofs, to the accompanying supplementary material.\n\n2 Experiments\n\nAn important advantage of our algorithms is that they can be implemented quite efficiently. Our\nalgorithms are based on what is known as a short-step method (see Chapter 11, [33]), which leads to\nan O(√m) bound on the number of iterations. Each iteration corresponds to one linear solve in the\nHessian matrix. 
A variant, known as the long-step method (see [33]), typically requires far fewer\niterations, about log m, even though the only provable bound known is O(m).\nFor the important special case of ℓ2-Isotonic Regression, we have implemented our algorithm in Matlab, with the long-step barrier method,\ncombined with our approximate solver for the linear systems involved. A number of heuristics\nrecommended in [33] that greatly improve the running time in practice have also been incorporated. Despite the changes, our implementation is theoretically correct and also outputs\nan upper bound on the error by giving a feasible point to the dual program. Our implementation is available at https://github.com/sachdevasushant/Isotonic.\nIn the figure, we plot average running times\n(with error bars denoting standard deviation) for ℓ2-Isotonic Regression on DAGs, where the underlying graphs are 2-d grid graphs and random regular graphs (of constant degree). The edges for\n2-d grid graphs are all oriented towards one of the corners. For random regular graphs, the edges\nare oriented according to a random permutation. The vector of initial observations y is chosen to be\na random permutation of 1 to n obeying the partial order, perturbed by adding i.i.d. Gaussian noise\nto each coordinate. For each graph size, and two different noise levels (standard deviation for the\nnoise on each coordinate being 1 or 10), the experiment is repeated multiple times. The relative error\nin the objective was ascertained to be less than 1%.\n\n4\n\n[Figure: running time in practice; time in seconds vs. number of vertices (up to 10^5), for Grid and RandReg graphs at noise levels 1 and 10.]\n\f3 Algorithms for ℓp-norms, p < ∞\nWithout loss of generality, we assume y ∈ [0, 1]^n. 
Given p ∈ [1,∞), let p-ISO denote the following\nℓp-norm Isotonic Regression problem, and OPT_p-ISO denote its optimum:\n\nmin_{x∈IG} ‖x − y‖^p_{w,p}.\n\n(3)\n\nLet w^p denote the entry-wise pth power of w. We assume the minimum entry of w^p is 1, and the\nmaximum entry is w^p_max ≤ exp(n). We also assume the additive error parameter δ is lower bounded\nby exp(−n), and that p ≤ exp(n). We use the Õ notation to hide poly log log n factors.\nTheorem 3.1. Given a DAG G(V, E), a set of observations y ∈ [0, 1]V , weights w, and an error\nparameter δ > 0, the algorithm ISOTONICIPM runs in time Õ(m^{1.5} log^2 n log(npw^p_max/δ)), and\nwith probability at least 1 − 1/n, outputs a vector xALG ∈ IG with\n\n‖xALG − y‖^p_{w,p} ≤ OPT_p-ISO + δ.\n\nThe algorithm ISOTONICIPM is obtained by an appropriate instantiation of a general Interior Point\nMethod (IPM) framework which we call APPROXIPM.\nTo state the general IPM result, we need to introduce two important concepts. These concepts are\ndefined formally in Supplementary Material Section A.1. The first concept is self-concordant barrier\nfunctions; we denote the class of these functions by SCB. A self-concordant barrier function f is\na special convex function defined on some convex domain set S. The function approaches infinity\nat the boundary of S. We associate with each f a complexity parameter θ(f) which measures how\nwell-behaved f is. The second important concept is the symmetry of a point z w.r.t. S: a non-negative scalar quantity sym(z, S). A large symmetry value guarantees that the point is not too close\nto the boundary of the set. For our algorithms to work, we need a starting point whose symmetry is\nnot too small. 
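As a toy illustration of a barrier function at work (entirely our own sketch; APPROXIPM itself is far more delicate), here is log-barrier path following for minimizing c·x over an interval, with one Newton step per increase of the path parameter:

```python
def barrier_minimize(c, lo, hi, gamma0=1.0, growth=1.05, iters=400):
    """Minimize c*x over [lo, hi] by following the log-barrier central path.

    We repeatedly take one Newton step on
        f_gamma(x) = -log(x - lo) - log(hi - x) + gamma * c * x,
    then increase gamma, as in short-step path-following IPMs.
    """
    x = (lo + hi) / 2.0          # analytic center: a well-centered start
    gamma = gamma0
    for _ in range(iters):
        g = -1.0 / (x - lo) + 1.0 / (hi - x) + gamma * c    # gradient
        h = 1.0 / (x - lo) ** 2 + 1.0 / (hi - x) ** 2       # Hessian (> 0)
        x -= g / h
        # Guard: a centered Newton step stays interior, but clamp to be safe.
        x = min(max(x, lo + 1e-12), hi - 1e-12)
        gamma *= growth
    return x
```

As gamma grows, the iterate tracks the central path toward the constrained minimizer of c·x.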
We later show that such a starting point can be constructed for the p-ISO problem.\nAPPROXIPM is a primal path-following IPM: Given a vector c, a domain D and a barrier function\nf ∈ SCB for D, we seek to compute min_{x∈D} ⟨c, x⟩. To find a minimizer, we consider a function\nf_{c,γ}(x) = f(x) + γ⟨c, x⟩, and attempt to minimize f_{c,γ} for changing values of γ by alternately\nupdating x and γ. As x approaches the boundary of D, the f(x) term grows to infinity, and with\nsome care, we can use this to ensure we never move to a point x outside the feasible domain D. As\nwe increase γ, the objective term ⟨c, x⟩ contributes more to f_{c,γ}. Eventually, for large enough γ, the\nobjective value ⟨c, x⟩ of the current point x will be close to the optimum of the program.\nTo stay near the optimum x for each new value of γ, we use a second-order method (Newton steps)\nto update x when γ is changed. This means that we minimize a local quadratic approximation to our\nobjective. This requires solving a linear system Hz = g, where g and H are the gradient and Hessian\nof f at x respectively. Solving this system to find z is the most computationally intensive aspect of\nthe algorithm. Crucially, we ensure that crude approximate solutions to the linear system suffice,\nallowing the algorithm to use fast approximate solvers for this step. APPROXIPM is described in\ndetail in Supplementary Material Section A.5, and in this section we prove the following theorem.\nTheorem 3.2. Given a convex bounded domain D ⊆ IR^n and vector c ∈ IR^n, consider the program\n\nmin_{x∈D} ⟨c, x⟩.\n\n(4)\n\nLet OPT denote the optimum of the program. Let f ∈ SCB be a self-concordant barrier function\nfor D. Given an initial point x0 ∈ D, a value upper bound K ≥ sup{⟨c, x⟩ : x ∈ D}, a symmetry\nlower bound s ≤ sym(x0, D), and an error parameter 0 < ε < 1, the algorithm APPROXIPM runs\nfor\n\nT_apx = O(√θ(f) · log(θ(f)/(ε·s)))\n\niterations and returns a point x_apx which satisfies (⟨c, x_apx⟩ − OPT)/(K − OPT) ≤ ε.\nThe algorithm requires O(T_apx) multiplications of vectors by a matrix M(x) satisfying 9/10 · H(x)^{−1} ⪯ M(x) ⪯ 11/10 · H(x)^{−1}, where H(x) is the Hessian of f at various points x ∈ D\nspecified by the algorithm.\n\n5\n\n\fWe now reformulate the p-ISO program to state a version which can be solved using the APPROXIPM framework. Consider points (x, t) ∈ IR^n × IR^n, and define the set\n\nDG = {(x, t) : for all v ∈ V, |x(v) − y(v)|^p − t(v) ≤ 0}.\n\nTo ensure boundedness, as required by APPROXIPM, we add the constraint ⟨w^p, t⟩ ≤ K.\nDefinition 3.3. We define the domain DK = (IG × IR^n) ∩ DG ∩ {(x, t) : ⟨w^p, t⟩ ≤ K}.\nThe domain DK is convex, and allows us to reformulate program (3) with a linear objective:\n\nmin_{x,t} ⟨w^p, t⟩ such that (x, t) ∈ DK.\n\n(5)\n\nOur next lemma determines a choice of K which suffices to ensure that programs (3) and (5) have\nthe same optimum. The lemma is proven in Supplementary Material Section A.4.\nLemma 3.4. For all K ≥ 3nw^p_max, DK is non-empty and bounded, and the optimum of program (5)\nis OPT_p-ISO.\n\nThe following result shows that for program (5) we can compute a good starting point for the path-following IPM efficiently. 
The algorithm GOODSTART computes a starting point in linear time by\nrunning a topological sort on the vertices of the DAG G and assigning values to x according to the\nvertex order of the sort. Combined with an appropriate choice of t, this suffices to give a starting\npoint with good symmetry. The algorithm GOODSTART is specified in more detail in Supplementary\nMaterial Section A.4, together with a proof of the following lemma.\nLemma 3.5. The algorithm GOODSTART runs in time O(m) and returns an initial point (x0, t0)\nthat is feasible and, for K = 3nw^p_max, satisfies sym((x0, t0), DK) ≥ 1/(18n^2 p w^p_max).\n\nCombining standard results on self-concordant barrier functions with a barrier for p-norms developed by Hertog et al. [34], we can show the following properties of a function FK whose exact\ndefinition is given in Supplementary Material Section A.2.\nCorollary 3.6. The function FK is a self-concordant barrier for DK and it has complexity parameter θ(FK) = O(m). Its gradient gFK is computable in O(m) time, and an implicit representation\nof the Hessian HFK can be computed in O(m) time as well.\nThe key reason we can use APPROXIPM to give a fast algorithm for Isotonic Regression is that we\ndevelop an efficient solver for linear equations in the Hessian of FK. The algorithm HESSIANSOLVE\nsolves linear systems in Hessian matrices of the barrier function FK. The Hessian is composed of\na structured main component plus a rank-one matrix. We develop a solver for the main component\nby doing a change of variables to simplify its structure, and then factoring the matrix by a blockwise LDL^T-decomposition. We can solve straightforwardly in the L and L^T factors, and we show that\nthe D factor consists of blocks that are either diagonal or SDD, so we can solve in this factor\napproximately using a nearly-linear time SDD solver. 
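The topological-sort idea behind GOODSTART can be sketched as follows (our own illustration; the actual algorithm, including the choice of t, is in the supplementary material): assigning evenly spaced values in topological order always yields a feasible isotonic point on a DAG.

```python
from collections import deque

def topo_start(n, edges):
    """Feasible isotonic starting point via topological sort (Kahn's algorithm).

    Vertices get evenly spaced values in [0, 1] in topological order, so
    every edge (u, v) satisfies x[u] < x[v].
    """
    succ = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(i for i in range(n) if indeg[i] == 0)
    x = [0.0] * n
    rank = 0
    while queue:
        u = queue.popleft()
        x[u] = rank / max(n - 1, 1)
        rank += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    assert rank == n, "graph must be a DAG"
    return x
```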
The algorithm HESSIANSOLVE is given in\nfull in Supplementary Material Section A.3, along with a proof of the following result.\nTheorem 3.7. For any instance of program (5) given by some (G, y), at any point z ∈ DK, for any\nvector a, HESSIANSOLVE((G, y), z, µ, a) returns a vector b = Ma for a symmetric linear operator\nM satisfying 9/10 · HFK(z)^{−1} ⪯ M ⪯ 11/10 · HFK(z)^{−1}. The algorithm fails with probability < µ.\nHESSIANSOLVE runs in time Õ(m log n log(1/µ)).\n\nThese are the ingredients we need to prove our main result on solving p-ISO. The algorithm ISOTONICIPM is simply APPROXIPM instantiated to solve program (5), with an appropriate choice of\nparameters. We state ISOTONICIPM informally as Algorithm 1 below. ISOTONICIPM is given in\nfull as Algorithm 6 in Supplementary Material Section A.5.\n\nProof of Theorem 3.1: ISOTONICIPM uses the symmetry lower bound s = 1/(18n^2 p w^p_max), the value\nupper bound K = 3nw^p_max, and the error parameter ε = δ/K when calling APPROXIPM. By Corollary 3.6, the barrier function FK used by ISOTONICIPM has complexity parameter θ(FK) ≤ O(m).\nBy Lemma 3.5, the starting point (x0, t0) computed by GOODSTART and used by ISOTONICIPM is\nfeasible and has symmetry sym((x0, t0), DK) ≥ 1/(18n^2 p w^p_max).\nBy Theorem 3.2, the point (xapx, tapx) output by ISOTONICIPM satisfies (⟨w^p, tapx⟩ − OPT)/(K − OPT) ≤ ε, where\nOPT is the optimum of program (5), and K = 3nw^p_max is the value used by ISOTONICIPM for the\n\n6\n\n\fconstraint ⟨w^p, t⟩ ≤ K, which is an upper bound on the supremum of objective values of feasible\npoints of program (5). By Lemma 3.4, OPT = OPT_p-ISO. Hence, ‖y − xapx‖^p_{w,p} ≤ ⟨w^p, tapx⟩ ≤\nOPT + εK = OPT_p-ISO + δ.\nAgain, by Theorem 3.2, the number of calls to HESSIANSOLVE by ISOTONICIPM is bounded by\n\nO(T_apx) ≤ O(√θ(FK) · log(θ(FK)/(ε·s))) ≤ O(√m log(npw^p_max/δ)).\n\nEach call to HESSIANSOLVE fails with probability < n^{−3}. Thus, by a union bound, the probability\nthat some call to HESSIANSOLVE fails is upper bounded by O(√m log(npw^p_max/δ))/n^3 = O(1/n).\nThe algorithm uses O(√m log(npw^p_max/δ)) calls to HESSIANSOLVE that each take time Õ(m log^2 n),\nas µ = n^{−3}. Thus the total running time is Õ(m^{1.5} log^2 n log(npw^p_max/δ)). □\n\nAlgorithm 1: Sketch of Algorithm ISOTONICIPM\n\n1. Pick a starting point (x, t) using the GOODSTART algorithm\n2. for r = 1, 2\n3.     if r = 1 then γ ← −1; ρ ← 1; c = − gradient of f at (x, t)\n4.     else γ ← 1; ρ ← 1/poly(n); c = (0, w^p)\n5.     for i ← 1, . . . , C1 m^{0.5} log m :\n6.         ρ ← ρ · (1 + γ C2 m^{−0.5})\n7.         Let H, g be the Hessian and gradient of f_{c,ρ} at x\n8.         Call HESSIANSOLVE to compute z ≈ H^{−1} g\n9.         Update x ← x − z\n10. Return x.\n\n4 Algorithms for ℓ∞ and Strict Isotonic Regression\n\nWe now reduce ℓ∞ Isotonic Regression and Strict Isotonic Regression to the Lipschitz Learning\nproblem, as defined in [32]. Let G = (V, E, len) be any DAG with non-negative edge lengths\nlen : E → R≥0, and y : V → R ∪ {∗} a partial labeling. We think of a partial labeling as a function\nthat assigns real values to a subset of the vertex set V. We call such a pair (G, y) a partially-labeled\nDAG. 
For a complete labeling x : V → R, define the gradient on an edge (u, v) ∈ E due to x\nto be grad+_G[x](u, v) = max{(x(u) − x(v))/len(u, v), 0}. If len(u, v) = 0, then grad+_G[x](u, v) = 0 unless\nx(u) > x(v), in which case it is defined as +∞. Given a partially-labelled DAG (G, y), we say that\na complete assignment x is an inf-minimizer if it extends y, and for all other complete assignments\nx′ that extend y we have\n\nmax_{(u,v)∈E} grad+_G[x](u, v) ≤ max_{(u,v)∈E} grad+_G[x′](u, v).\n\nNote that when len = 0, then max_{(u,v)∈E} grad+_G[x](u, v) < ∞ if and only if x is isotonic on G.\nSuppose we are interested in Isotonic Regression on a DAG G(V, E) under ‖·‖w,∞. To reduce this\nproblem to that of finding an inf-minimizer, we add some auxiliary nodes and edges to G. Let VL, VR\nbe two copies of V. That is, for every vertex u ∈ V, add a vertex uL to VL and a vertex uR to VR. Let\nEL = {(uL, u)}u∈V and ER = {(u, uR)}u∈V. We then let len′(uL, u) = 1/w(u) and len′(u, uR) =\n1/w(u). All other edge lengths are set to 0. Finally, let G′ = (V ∪ VL ∪ VR, E ∪ EL ∪ ER, len′).\nThe partial assignment y′ takes real values only on the vertices in VL ∪ VR. For all u ∈ V,\ny′(uL) := y(u), y′(uR) := y(u) and y′(u) := ∗. (G′, y′) is our partially-labeled DAG. Observe\nthat G′ has n′ = 3n vertices and m′ = m + 2n edges.\nLemma 4.1. Given a DAG G(V, E), a set of observations y ∈ RV , and weights w, construct G′\nand y′ as above. Let x be an inf-minimizer for the partially-labeled DAG (G′, y′). Then, x|V is the\nIsotonic Regression of y with respect to G under the norm ‖·‖w,∞.\nProof. 
We note that since the vertices corresponding to V in (G′, y′) are connected to each other by\nzero-length edges, max_{(u,v)∈E} grad+_G[x](u, v) < ∞ iff x is isotonic on those edges. Since G is a\nDAG, we know that there are isotonic labelings on G. When x is isotonic on the vertices corresponding\nto V, the gradient is zero on all the edges between vertices in V. Also, note that every vertex\n\n7\n\n\fx corresponding to V in G′ is attached to two auxiliary nodes xL ∈ VL, xR ∈ VR. We also have\ny′(xL) = y′(xR) = y(x). Thus, for any x that extends y′ and is isotonic on G′, the only non-zero\nentries in grad+ correspond to edges in ER and EL, and thus\n\nmax_{(u,v)∈E′} grad+_{G′}[x](u, v) = max_{u∈V} wu · |y(u) − x(u)| = ‖x − y‖w,∞.\n\nAlgorithm COMPINFMIN from [32] is proved to compute the inf-minimizer, and is claimed to work\nfor directed graphs (Section 5, [32]). We exploit the fact that Dijkstra's algorithm in COMPINFMIN\ncan be implemented in O(m) time on DAGs using a topological sorting of the vertices, giving a\nlinear-time algorithm for computing the inf-minimizer. Combining it with the reduction given by\nthe lemma above, and observing that the size of G′ is O(m + n), we obtain Theorem 1.2. A complete\ndescription of the modified COMPINFMIN is given in Section B.2. We remark that the solution to\nthe ℓ∞-Isotonic Regression that we obtain has been referred to as the AVG ℓ∞ Isotonic Regression\nin the literature [20]. It is easy to modify the algorithm to compute the MAX and MIN ℓ∞ Isotonic\nRegressions. Details are given in Section B.\nFor Strict Isotonic Regression, we define the lexicographic ordering. Given r ∈ Rm, let πr denote\na permutation that sorts r in non-increasing order by absolute value, i.e., ∀i ∈ [m − 1], |r(πr(i))| ≥\n|r(πr(i + 1))|. 
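The sorting permutation πr, and the comparison of two vectors by their sorted absolute values that it induces, can be sketched as follows (a plain illustration of the definition, not the paper's algorithm):

```python
def lex_leq(r, s):
    """r <=_lex s: compare sorted absolute values lexicographically.

    Sort |r| and |s| in non-increasing order and compare entrywise; the first
    strict difference decides, and ties throughout mean both directions hold.
    """
    a = sorted((abs(v) for v in r), reverse=True)
    b = sorted((abs(v) for v in s), reverse=True)
    for ai, bi in zip(a, b):
        if ai != bi:
            return ai < bi
    return True  # all sorted absolute values equal
```

Note that distinct vectors with the same sorted absolute values compare as equal under this relation.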
Given two vectors r, s ∈ Rm, we write r ⪯lex s to indicate that r is smaller than s in\nthe lexicographic ordering on sorted absolute values, i.e.,\n\n∃j ∈ [m], |r(πr(j))| < |s(πs(j))| and ∀i ∈ [j − 1], |r(πr(i))| = |s(πs(i))|,\n\nor ∀i ∈ [m], |r(πr(i))| = |s(πs(i))|.\n\nNote that it is possible that r ⪯lex s and s ⪯lex r while r ≠ s. It is a total relation: for every r and s,\nat least one of r ⪯lex s or s ⪯lex r is true.\nGiven a partially-labelled DAG (G, y), we say that a complete assignment x is a lex-minimizer if it\nextends y and for all other complete assignments x′ that extend y we have grad+_G[x] ⪯lex grad+_G[x′].\nStout [21] proves that computing the Strict Isotonic Regression is equivalent to finding an isotonic\nx that minimizes zu = wu · (xu − yu) in the lexicographic ordering. With the same reduction as\nabove, it is immediate that this is equivalent to minimizing grad+_{G′} in the lex-ordering.\nLemma 4.2. Given a DAG G(V, E), a set of observations y ∈ RV , and weights w, construct G′\nand y′ as above. Let x be the lex-minimizer for the partially-labeled DAG (G′, y′). Then, x|V is the\nStrict Isotonic Regression of y with respect to G with weights w.\n\nAs for inf-minimization, we give a modification of the algorithm COMPLEXMIN from [32] that\ncomputes the lex-minimizer in O(mn) time. The algorithm is described in Section B.2. Combining\nthis algorithm with the reduction from Lemma 4.2, we can compute the Strict Isotonic Regression\nin O(m′n′) = O(mn) time, thus proving Theorem 1.3.\nAcknowledgements. We thank Sabyasachi Chatterjee for introducing the problem to us, and Daniel\nSpielman for his advice and comments. We would also like to thank Quentin Stout and anonymous\nreviewers for their suggestions. 
This research was partially supported by AFOSR Award FA9550-12-1-0175, NSF grant CCF-1111257, and a Simons Investigator Award to Daniel Spielman.

References

[1] M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, 26(4):641-647, 1955.

[2] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical inference under order restrictions: the theory and application of Isotonic Regression. Wiley, New York, 1972.

[3] F. Gebhardt. An algorithm for monotone regression with one or more independent variables. Biometrika, 57(2):263-271, 1970.

[4] W. L. Maxwell and J. A. Muckstadt. Establishing consistent and realistic reorder intervals in production-distribution systems. Operations Research, 33(6):1316-1341, 1985.

[5] R. Roundy. A 98%-effective lot-sizing rule for a multi-product, multi-stage production/inventory system. Mathematics of Operations Research, 11(4):699-727, 1986.

[6] S. T. Acton and A. C. Bovik. Nonlinear image estimation using piecewise and local image models. IEEE Transactions on Image Processing, 7(7):979-991, July 1998.

[7] C. I. C. Lee. The Min-Max algorithm and Isotonic Regression. The Annals of Statistics, 11(2):467-477, 1983.

[8] R. L. Dykstra and T. Robertson. An algorithm for Isotonic Regression for two or more independent variables. The Annals of Statistics, 10(3):708-716, 1982.

[9] S. Chatterjee, A. Guntuboyina, and B. Sen. On risk bounds in isotonic and other shape restricted regression problems. The Annals of Statistics, to appear.

[10] A. T. Kalai and R. Sastry. The Isotron algorithm: High-dimensional Isotonic Regression. In COLT, 2009.

[11] T. Moon, A. Smola, Y. Chang, and Z. Zheng. IntervalRank: Isotonic Regression with listwise and pairwise constraints. In WSDM, pages 151-160.
ACM, 2010.

[12] S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with Isotonic Regression. In NIPS, 2011.

[13] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, pages 694-699, 2002.

[14] H. Narasimhan and S. Agarwal. On the relationship between binary classification, bipartite ranking, and binary class probability estimation. In NIPS, 2013.

[15] S. Angelov, B. Harb, S. Kannan, and L. Wang. Weighted Isotonic Regression under the $\ell_1$ norm. In SODA, 2006.

[16] K. Punera and J. Ghosh. Enhanced hierarchical classification via Isotonic smoothing. In WWW, 2008.

[17] Z. Zheng, H. Zha, and G. Sun. Query-level learning to rank using Isotonic Regression. In Allerton Conference on Communication, Control, and Computing, 2008.

[18] D. S. Hochbaum and M. Queyranne. Minimizing a convex cost closure set. SIAM Journal on Discrete Mathematics, 16(2):192-207, 2003.

[19] Q. F. Stout. Isotonic Regression via partitioning. Algorithmica, 66(1):93-112, 2013.

[20] Q. F. Stout. Weighted $\ell_\infty$ Isotonic Regression. Manuscript, 2011.

[21] Q. F. Stout. Strict $\ell_\infty$ Isotonic Regression. Journal of Optimization Theory and Applications, 152(1):121-135, 2012.

[22] Q. F. Stout. Fastest Isotonic Regression algorithms. http://web.eecs.umich.edu/~qstout/IsoRegAlg_140812.pdf.

[23] Q. F. Stout. Isotonic Regression for multiple independent variables. Algorithmica, 71(2):450-470, 2015.

[24] Y. Kaufman and A. Tamir. Locating service centers with precedence constraints. Discrete Applied Mathematics, 47(3):251-261, 1993.

[25] Q. F. Stout. $\ell_\infty$ Isotonic Regression for linear, multidimensional, and tree orders. CoRR, 2015.

[26] S. I. Daitch and D. A. Spielman.
Faster approximate lossy generalized flow via interior point algorithms. In STOC '08, pages 451-460. ACM, 2008.

[27] A. Madry. Navigating central path with electrical flows. In FOCS, 2013.

[28] Y. T. Lee and A. Sidford. Path finding methods for linear programming: Solving linear programs in $\tilde{O}(\sqrt{\mathrm{rank}})$ iterations and faster algorithms for maximum flow. In FOCS, 2014.

[29] D. A. Spielman and S. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC '04, pages 81-90. ACM, 2004.

[30] I. Koutis, G. L. Miller, and R. Peng. A nearly-$m \log n$ time solver for SDD linear systems. In FOCS '11, pages 590-598. IEEE Computer Society, Washington, DC, USA, 2011.

[31] M. B. Cohen, R. Kyng, G. L. Miller, J. W. Pachocki, R. Peng, A. B. Rao, and S. C. Xu. Solving SDD linear systems in nearly $m \log^{1/2} n$ time. In STOC '14, 2014.

[32] R. Kyng, A. Rao, S. Sachdeva, and D. A. Spielman. Algorithms for Lipschitz learning on graphs. In Proceedings of COLT 2015, pages 1190-1223, 2015.

[33] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[34] D. den Hertog, F. Jarre, C. Roos, and T. Terlaky. A sufficient condition for self-concordance. Mathematical Programming, 69(1):75-88, July 1995.

[35] J. Renegar. A Mathematical View of Interior-Point Methods in Convex Optimization. SIAM, 2001.

[36] A. Nemirovski. Lecture notes: Interior point polynomial time methods in convex programming, 2004.

[37] E. J. McShane. Extension of range of functions. Bulletin of the American Mathematical Society, 40(12):837-842, 1934.

[38] H. Whitney. Analytic extensions of differentiable functions defined in closed sets. Transactions of the American Mathematical Society, 36(1):
63-89, 1934.