{"title": "An Efficient Pruning Algorithm for Robust Isotonic Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 219, "page_last": 229, "abstract": "We study a generalization of the classic isotonic regression problem  where we allow separable nonconvex objective functions, focusing on the case of estimators used in robust regression. A simple dynamic programming approach allows us to solve this problem to within \u03b5-accuracy (of the global minimum) in time linear in 1/\u03b5 and the dimension. We can combine techniques from the convex case with branch-and-bound ideas to form a new algorithm for this problem that naturally exploits the shape of the objective function. Our algorithm achieves the best bounds for both the general nonconvex and convex case (linear in log (1/\u03b5)), while performing much faster in practice than a straightforward dynamic programming approach, especially as the desired accuracy increases.", "full_text": "An Efficient Pruning Algorithm for Robust Isotonic Regression\n\nCong Han Lim \u2217\n\nSchool of Industrial Systems and Engineering\n\nGeorgia Tech\n\nAtlanta, GA 30332\n\nclim31@gatech.edu\n\nAbstract\n\nWe study a generalization of the classic isotonic regression problem where we allow separable nonconvex objective functions, focusing on the case where the functions are estimators used in robust regression. One can solve this problem to within \u03b5-accuracy (of the global minimum) in O(n/\u03b5) time using a simple dynamic program, and the complexity of this approach is independent of the underlying functions. We introduce an algorithm that combines techniques from the convex case with branch-and-bound ideas and is able to exploit the shape of the functions. Our algorithm achieves the best known bounds for both the convex case (O(n log(1/\u03b5))) and the general nonconvex case. 
Experiments show that this algorithm can perform much faster than the dynamic programming approach on robust estimators, especially as the desired accuracy increases.\n\n1 Introduction\n\nIn this paper we study the following optimization problem with monotonicity constraints:\n\nmin_{x \u2208 [0,1]^n} \u2211_{i \u2208 [n]} f_i(x_i) where x_i \u2264 x_{i+1} for i \u2208 [n \u2212 1]    (1)\n\nwhere the functions f1, f2, . . . , fn : [0, 1] \u2192 R may be nonconvex and the notation [n] denotes the set {1, 2, . . . , n}. Our goal is to develop an algorithm that achieves an objective \u03b5-close to the global optimal value for any \u03b5 > 0 with a complexity that scales along with the properties of f. In particular, we present an algorithm that simultaneously achieves the best known bounds when fi are convex and also for general fi, while scaling much better in practice than the straightforward approach when considering f used in robust estimation such as Huber Loss, Tukey\u2019s biweight function, and MCP.\nProblem (1) is a generalization of the classic isotonic regression problem (Brunk, 1955; Ayer et al., 1955). The goal there is to find the best isotonic fit in terms of Euclidean distance to a given set of points y1, y2, . . . , yn. This corresponds to setting each fi(xi) to \u2016xi \u2212 yi\u2016_2^2. 
Besides having applications in domains where such a monotonicity assumption is reasonable, isotonic regression also appears as a key step in other statistical and optimization problems such as learning generalized linear and single index models (Kalai and Sastry, 2009), submodular optimization (Bach, 2013), sparse recovery (Bogdan et al., 2013; Zeng and Figueiredo, 2014), and ranking problems (Gunasekar et al., 2016).\nThere are several reasons to go beyond Euclidean distance and to consider more general fi functions. For example, using the appropriate Bregman divergence can lead to better regret bounds for certain online learning problems over the convex hull of all rankings (Yasutake et al., 2011; Suehiro et al., 2012), and allowing general fi functions has applications in computer vision (Hochbaum, 2001; Kolmogorov et al., 2016). In this paper we will focus on the use of quasiconvex distance functions, which are much more robust to outliers (Bach, 2018)\u00b2. Figure 1 illustrates this in more detail.\n\n\u2217Work done while at Wisconsin Institute for Discovery, University of Wisconsin-Madison.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: Isotonic regression in the presence of outliers. 
The left image shows the value of the Euclidean distance and Tukey\u2019s biweight function (a canonical function for robust estimation) from x = \u22121 to 1, the middle image demonstrates isotonic regression on a simple linear and noiseless example, and the right image shows how outliers can adversely affect isotonic regression under Euclidean distance.\n\nFor general fi functions we cannot solve Problem (1) exactly (without some strong additional assumptions), and instead we focus on the problem\n\nmin_{x \u2208 G_k^n} \u2211_{i \u2208 [n]} f_i(x_i) where x_i \u2264 x_{i+1} for i \u2208 [n \u2212 1]    (2)\n\nwhere instead of allowing the xi values to lie anywhere in the interval [0, 1], we restrict them to Gk := {0, 1/k, 2/k, . . . , 1}, an equally spaced grid of k + 1 points. This discretized version of the problem will give a feasible solution to the original problem that is close to optimal. The relation between the granularity of the grid and approximation quality for any optimization problem over a bounded domain can be described in terms of the Lipschitz constants of the objective function, and for this particular problem has been described in Bach (2015, 2018): if the functions fi are Lipschitz continuous with constant L, then to obtain a precision of \u03b5 in terms of the objective value, it suffices to choose k \u2265 2nL/\u03b5. One can achieve better bounds using higher-order Lipschitz constants. The main approach for solving Problem (2) for general nonconvex functions is dynamic programming (see for example Felzenszwalb and Huttenlocher (2012)), which runs in O(nk) time. 
When all the fi are convex, one can instead use the faster O(n log k) scaling algorithm by Ahuja and Orlin (2001).\nOur main contribution is an algorithm that also achieves O(nk) in the general case and O(n log k) in the convex case by exploiting the following key fact: the dynamic programming method always runs in time linear in the total number of candidate values, summed over all xi. Thus, our goal is to limit the range of values by using properties of the fi functions. This is done by combining ideas from branch-and-bound and the scaling algorithm by Ahuja and Orlin (2001) with the dynamic programming approach. When restricted to convex fi functions, our algorithm is very similar to the scaling algorithm.\nOur algorithm works by solving the problem over increasingly finer domains, choosing not to include points that will not get us closer to the global optimum. We use two ways to exclude points, the first of which uses lower bounds over intervals for each fi, and the second requires us to be able to compute a linear underestimate of fi over an interval efficiently. This information is readily available for a variety of quasiconvex distance functions, and we provide an example of how to compute this for Tukey\u2019s biweight function. In practice, this leads to an algorithm that can require far fewer function evaluations to achieve the same accuracy as dynamic programming, which in turn translates into a faster running time even considering the additional work needed to process each point.\nThe paper is organized as follows. For the rest of the introduction, we will survey other methods for isotonic regression for specific classes of sets of fi functions and also mention related problems. Section 2 describes the standard dynamic programming approach. In Section 3, we describe our main pruning algorithm and the key pruning rules for reducing the set of xi values that we need to consider. 
Section 4 demonstrates the performance of the algorithm on a series of experiments. The longer version of this paper (provided as supplementary material) includes proofs for the linear underestimation rule and also briefly discusses a heuristic variant of our main algorithm.\n\n2 Our focus in this paper is on developing algorithms for global optimization. For more on robust estimators, we refer the reader to textbooks by Huber (2004) and Hampel et al. (2011).\n\nExisting methods for isotonic regression. We will first discuss the main methods for exactly solving Problem (1) and the classes of functions the methods can handle. For convex fi functions, the pool-adjacent-violators (PAV) algorithm (Ayer et al., 1955; Brunk, 1955) has been the de facto method for solving the problem. The algorithm was originally developed for the Euclidean distance case but in fact works for any set of convex fi, provided that one can exactly solve intermediate subproblems of the form arg min_{z \u2208 [0,1]} \u2211_{i \u2208 S} f_i(z) (Best et al., 2000) over subsets S of [n]. PAV requires solving up to n such subproblems, and the total cost of solving can be just O(n) for a wide range of functions, including for many Bregman divergences (Lim and Wright, 2016).\nThere are algorithms for nonconvex functions that are piecewise convex. Let q denote the total number of pieces over all the fi functions. In the case where the overall functions are convex, piecewise-linear and piecewise-quadratic functions can be handled in O(q log log n) and O(q log n) time respectively (Kolmogorov et al., 2016; Hochbaum and Lu, 2017), while in the nonconvex case it is O(nq).\nIn some cases, we cannot solve the problem exactly and instead deal with the discretized problem (2). For example, this is the case when our knowledge of the functions fi can only be obtained through function evaluation queries (i.e. xi \u2192 fi(xi)). 
In the convex case, PAV can be used to obtain an algorithm with O(n^2 log k) time to solve the problem over a grid of k points, but a clever recursive algorithm by Ahuja and Orlin (2001) takes only O(n log k). A general approach that works for arbitrary functions is dynamic programming, which has a complexity of O(nk).\nBach (2015) recently proposed a framework for optimizing continuous submodular functions that can be applied to optimizing such functions under monotonicity constraints. This includes separable nonconvex functions as a special case. Although the method is a lot more versatile, when specialized to our setting it results in an algorithm with a complexity of O(n^2 k^2). This and dynamic programming are the main known methods for general nonconvex functions.\n\nRelated Problems. There have been many extensions and variants of the classic isotonic regression problem, and we will briefly describe two of them. One common extension is to use a partial ordering instead of a full ordering. This significantly increases the difficulty of the problem, and this problem can be solved by recursively solving network flow problems. For a detailed survey of this area, which considers different types of partial orderings and \u2113_p functions, we refer the reader to Stout (2014).\nOne can also replace the ordering constraints with the pairwise terms \u2211_{i \u2208 [n\u22121]} g_i(x_{i+1} \u2212 x_i) where g_i : R \u2192 R \u222a {\u221e}. By choosing gi appropriately, we recover many known variants of isotonic regression, including nearly-isotonic regression (Tibshirani et al., 2011), smoothed isotonic regression (Sysoev and Burdakov, 2016; Burdakov and Sysoev, 2017), and a variety of problems from computer vision. The most general recent work (involving piecewise linear functions) is by Hochbaum and Lu (2017). 
We note that the works by Bach (2015, 2018) also apply in many of these settings.\n\n2 Dynamic Programming\n\nWe now provide a DP reformulation of Problem (2). Let A_n^{G_k}(x_n) := f_n(x_n). For any i \u2208 [n \u2212 1], we can define the following functions:\n\nA_i^{G_k}(x_i) := f_i(x_i) + C_i^{G_k}(x_i),    (aggregate)\nC_i^{G_k}(x_i) := min_{x_{i+1} \u2208 G_k} A_{i+1}^{G_k}(x_{i+1}) where x_i \u2264 x_{i+1}.    (min-convolution)\n\nThe A_i^{G_k} functions aggregate the accumulated information from the indices i + 1, i + 2, . . . , n with the information at the current index i, where the C_i^{G_k} functions represent the minimum-convolution of the A_{i+1}^{G_k} function with the indicator function g where g(z) = 0 if z \u2264 0, and g = \u221e otherwise. With this notation, the problem min_{x_1 \u2208 G_k} A_1^{G_k}(x_1) has the same objective and x1 value as Problem (2).\nWe can use the above recursion to solve the problem, which we formally describe in Algorithm 1. This dynamic programming algorithm can be viewed as an application of the Viterbi algorithm. The algorithm does a backward pass, building up all the A_i^{G_k}, C_i^{G_k} values from i = n to i = 1. Once A_1^{G_k} has been computed, we know the minimizer x1. We then work our way forwards, each time picking an xi that minimizes A_i^{G_k} on the grid Gk subject to the condition that xi \u2265 xi\u22121. The total running time of this algorithm is O(nk), on the order of the number of points in the grid.\n\nAlgorithm 1 Dynamic Program for fixed grid G_k\n\ninput: Functions {f_i}, Parameter k\nA_n^{G_k}(z) \u2190 f_n(z) for z \u2208 G_k\nfor i = n \u2212 1, . . . , 1 do    \u25b7 Backwards Pass\n    C_i^{G_k}(1) \u2190 A_{i+1}^{G_k}(1)\n    A_i^{G_k}(1) \u2190 f_i(1) + C_i^{G_k}(1)\n    for z = (k\u22121)/k, (k\u22122)/k, . . . , 0 do\n        C_i^{G_k}(z) \u2190 min(A_{i+1}^{G_k}(z), C_i^{G_k}(z + 1/k))\n        A_i^{G_k}(z) \u2190 f_i(z) + C_i^{G_k}(z)\nx_0 \u2190 0\nfor i = 1, 2, . . . , n do    \u25b7 Forward Pass\n    x_i \u2190 arg min_{z \u2208 G_k, z \u2265 x_{i\u22121}} A_i^{G_k}(z)\nreturn (x_1, x_2, . . . , x_n)\n\nThe main drawback of the dynamic programming approach is that it requires us to pick the desired accuracy a priori by choosing an appropriate k value, and the overall running time is then O(nk), regardless of the properties of the fi functions.\n\n3 A Pruning Algorithm for Robust Isotonic Regression\n\nInstead of solving the full discretized problem (2) directly, we can work over a much smaller set of points. Let x^{G_k} denote an optimal solution to the problem, and for each i \u2208 [n] let Si \u2286 Gk denote a set of points such that x_i^{G_k} \u2208 S_i. Then\n\nmin_{x \u2208 S_1 \u00d7 . . . \u00d7 S_n} \u2211_{i \u2208 [n]} f_i(x_i) where x_i \u2264 x_{i+1} for i \u2208 [n \u2212 1],\n\nhas the same solution x^{G_k} and it is easy to modify the DP algorithm to work for this problem. All that is needed is to perform the following replacements:\n\n\u2022 z = . . . and z \u2208 . . . with the appropriate series of points in Si,\n\u2022 C_i^{G_k}(z_max) \u2190 min_{z \u2265 z_max} A_{i+1}^{G_k}(z) for z_max = arg max(S_i), and\n\u2022 C_i^{G_k}(z) \u2190 min(A_{i+1}^{G_k}(z\u2032)) where z\u2032 \u2265 z.\n\nThe values of the C_i^{G_k}, A_i^{G_k} functions are the same for both problem formulations on x^{G_k}.\nThe modified operations can be performed efficiently by maintaining the appropriate minimum values, and this results in an algorithm with a complexity of just O(|S1| + . . . + |Sn|). Our goal is thus to restrict the size of the Si sets. We do this by starting from a coarse set of intervals Ii for each index i that initially contains just [0, 1]. This contains all points in Gk. We repeatedly subdivide each interval into two and keep only the intervals that may contain certain better solutions, which in turn reduces the number of points in Gk that are contained in some interval.\nFrom here on we assume that k is a power of 2. 
Algorithm 2 describes the basic framework which we build on throughout this section.\n\nAlgorithm 2 Algorithmic Framework for Faster Robust Isotonic Regression\n\ninput: Functions {f_i}, Parameter k\nk\u2032 \u2190 1\nI_i \u2190 {[0, 1]} for i \u2208 [n]\nwhile k\u2032 < k do\n    {I_i^{G_2k\u2032}} \u2190 Refine {I_i^{G_k\u2032}} using {f_i}\n    k\u2032 \u2190 2k\u2032\nx \u2190 run modified DP on endpoints of {I_i^{G_k}}\nreturn x\n\nAt the end of each round of the loop, we want each coordinate of x^{G_k} to lie in some interval from the corresponding set Ii. This ensures that we find the optimal point in the final grid Gk. We also want I_i^{G_k\u2032} to consist only of intervals of width 1/k\u2032 with endpoints contained in G_k\u2032. This ensures that the overall number of points processed over all iterations is at most O(nk), and by bounding the number of intervals in each Ii in each iteration we can achieve significantly better performance. In particular, the scaling algorithm for convex functions by Ahuja and Orlin (2001) can be seen as a particular realization of this framework where the refinement process keeps the size of each Ii to exactly one.\nIn the rest of this section, we will describe two efficient rules for refining the sets of intervals {Ii} and analyze the complexity of the overall algorithm. 
The first rule uses lower and upper bounds (akin to standard branch-and-bound), while the second requires one to be able to efficiently construct good linear underestimators of the fi functions within intervals.\n\n3.1 Pruning via lower/upper bounds\n\nThis pruning rule constructs lower bounds over the current active intervals, then uses upper bounds (that can be obtained via the aforementioned DP) to decide which intervals can be removed from consideration in subsequent iterations of the algorithm.\nWe again modify the dynamic program, this time to compute lower bounds over intervals. Let A_n^{LB,G_k}(a) := min_{x_n \u2208 [a, a+1/2k]} f_n(x_n) and recursively define the following:\n\nA_i^{LB,G_k}(a) := min_{x_i \u2208 [a, a+1/2k]} f_i(x_i) + C_i^{LB,G_k}(a),    (aggregate for lower bound)\nC_i^{LB,G_k}(a) := min_{a\u2032 \u2208 G_k} A_{i+1}^{LB,G_k}(a\u2032) where a \u2264 a\u2032.    (min-convolution for lower bound)\n\nIt is straightforward to see that A_i^{LB,G_k}(a) is a lower bound for \u2211_{j=i}^{n} f_j(x_j) when x_i is contained in the interval [a, a + 1/2k]. This dynamic program can be computed in O(|I1| + . . . + |In|) time using the same ideas as before, provided that terms of the form min_{x_i \u2208 [a,b]} f_i(x_i) can be efficiently calculated.\nAs for which intervals to keep, we remove an interval [a, b] from Ii if there is another interval in Ii which can be used in place of [a, b] and the upper bound from using the other interval is smaller than the lower bound corresponding to [a, b]. This concept is formalized in Algorithm 3.\n\nAlgorithm 3 Pruning I via Lower/Upper Bounds\n\ninput: Interval Sets {I_i^{G_k\u2032}}, functions {f_i}, Parameter k\u2032\nCompute {A_i^{G_k\u2032}} and {A_i^{LB,G_k\u2032}} using {f_i}\nZ \u2190 0\nfor i = 1, . . . , n do\n    z \u2190 first element in Z sequence\n    z\u2032 \u2190 next element (1 if there are none)\n    J \u2190 \u2205\n    while z \u2260 1 do\n        u \u2190 min{A_i^{G_k\u2032}(x_i) | x_i \u2208 G_k\u2032 \u2229 [z, z\u2032]}\n        J \u2190 J \u222a {[a, b] \u2208 I_i^{G_k\u2032} | A_i^{LB,G_k\u2032}(a) \u2264 u, [a, b] \u2286 [z, z\u2032]}\n        z \u2190 z\u2032\n        z\u2032 \u2190 next element in sequence Z (1 if there are none)\n    I_i^{G_k\u2032} \u2190 J\n    Z \u2190 all endpoints in J\nreturn {I_i^{G_k\u2032}}\n\nWe can show that this procedure does not remove certain solutions, including the optimal solutions to Problems (1) and (2). Definition 3.1 and Proposition 3.2 describe this more precisely.\nDefinition 3.1. Given a nondecreasing vector x \u2208 R^n, x is S-improvable for some S \u2286 [0, 1] if there is a different nondecreasing vector y \u2208 R^n such that \u2211_{i \u2208 [n]} f_i(y_i) < \u2211_{i \u2208 [n]} f_i(x_i) and if y_i \u2209 S it must be the case that y_i = x_i.\nNote that the optimal solution x^{G_k} is not G_k\u2032-improvable for any k\u2032 that is a factor of k.\nProposition 3.2. Let x\u2217 be a nondecreasing vector which is not G_k\u2032-improvable. Suppose x\u2217 is in \u220f_{i \u2208 [n]} (\u22c3{[a, b] \u2208 I_i^{G_k\u2032}}). This remains true after applying Algorithm 3 to the sets {I_i^{G_k\u2032}}.\n\n3.2 Pruning via linear underestimators\n\nWe now describe a rule that uses linear underestimators on intervals in Ii. In the convex case, one can think of this as using subgradient information. This is what the scaling algorithm of Ahuja and Orlin (2001) uses to obtain a complexity of O(n log k). We will rely on the following assumption.\nAssumption 3.3. Given a, b, c \u2208 [0, 1] where a < b < c, we can compute in constant time g_i^L, g_i^R \u2208 R such that f_i(b) + g_i^L \u00b7 (a \u2212 b) \u2264 f_i(z) for a \u2264 z < b and f_i(b) + g_i^R \u00b7 (c \u2212 b) \u2264 f_i(z) for b < z \u2264 c.\nThis pruning rule works with any g_i^L, g_i^R that satisfy the condition, but the tighter the underestimator, the better our algorithm will perform. In particular, it is ideal to minimize g_i^L and maximize g_i^R. For convex functions, the best possible g_i^L, g_i^R is a subgradient of the function.\nSuppose we have the interval [u, v] \u2208 I_i^{G_k\u2032} for i \u2208 {s, s + 1, . . . , t}. Our goal is to decide for each i if we should include the intervals [u, (u+v)/2] and [(u+v)/2, v] in I_i^{G_2k\u2032}. We can do this by taking into account linear underestimators for fi in each of these two intervals and also by considering which xi may lie outside of [u, v]. Algorithm 4 describes how this can be done.\n\nAlgorithm 4 Pruning Subroutine\n\ninput: {f_i}, {s, s + 1, . . . , t}, a, b, c \u2208 [0, 1] where a < b < c, indices l, r\nCompute g_i^L, g_i^R (from Assumption 3.3) for i \u2208 [n]\nS_t^L \u2190 g_t^L\nS_i^L \u2190 g_i^L + max(S_{i+1}^L, 0) for i \u2208 {s, s + 1, . . . , t \u2212 1}\nI^L \u2190 {i | i \u2264 l, S_i^L > 0} \u222a {i | l + 1 \u2264 i < k, k is first index after l where S_k^L \u2264 0}\nS_s^R \u2190 g_s^R\nS_i^R \u2190 g_i^R + min(S_{i\u22121}^R, 0) for i \u2208 {s + 1, . . . , t}\nI^R \u2190 {i | i \u2265 r, S_i^R \u2264 0} \u222a {i | k > i \u2265 r + 1, k is last index before r where S_k^R > 0}\nfor i = s, s + 1, . . . , t do\n    I_i \u2190 \u2205\n    if i \u2208 I^L, add [a, b] to I_i\n    if i \u2208 I^R or I_i is empty, add [b, c] to I_i\nreturn {I_i}\n\nTheorem 3.4. Consider Algorithm 4 and its inputs. Suppose that there is some nondecreasing vector x\u2217 \u2208 [0, 1]^n such that x\u2217 is not {b}-improvable. Let s, t denote the indices where x\u2217_s and x\u2217_t are the first and last terms of x\u2217 contained in [a, c] respectively. Suppose l \u2265 s \u2212 1 and r \u2264 t + 1. For any i \u2208 {s, s + 1, . . . , t}, the term x\u2217_i is contained in one of the intervals in Ii returned by the algorithm.\nWe use Algorithm 4 as part of a larger procedure over the entire collection of interval sets I1, . . . , In. This procedure is detailed in Algorithm 5, and refines the set of intervals by splitting each interval into two and running Algorithm 4 on the pair of adjacent intervals.\nProposition 3.5. Suppose the intervals used as inputs to Algorithm 5 are {I_i^{G_k\u2032}} (i.e. all the endpoints are in G_k\u2032). Let x\u2217 \u2208 [0, 1]^n be a nondecreasing vector that is not G_2k\u2032-improvable and is contained in \u220f_{i \u2208 [n]} (\u22c3{[a, b] \u2208 I_i^{G_k\u2032}}). Then, x\u2217 is contained in \u220f_{i \u2208 [n]} (\u22c3{[a, b] \u2208 I_i^{G_2k\u2032}}) where {I_i^{G_2k\u2032}} are the intervals returned by the algorithm.\n\nAlgorithm 5 Main Algorithm for Refining via Linear Underestimators\n\ninput: Interval Sets {I_i^{G_k\u2032}}, functions {f_i}\nI\u2032_i \u2190 \u2205 for i \u2208 [n]\nfor [u, v] \u2208 \u22c3_i I_i^{G_k\u2032} do\n    for each contiguous block of indices s, s + 1, . . . , t in {i | I_i contains [u, v]} do\n        l \u2190 max{i | \u2203 an interval to the left of [u, v] contained in I_i^{G_k\u2032}}\n        r \u2190 min{i | \u2203 an interval to the right of [u, v] contained in I_i^{G_k\u2032}}\n        Update {I\u2032_i} with Alg. 4 with inputs {f_i}, {s, . . . , t}, (a, b, c) = (u, (u+v)/2, v), indices l, r\nreturn {I\u2032_i}\n\n3.3 Computing Lower Bounds and Linear Underestimators for Quasiconvex Estimators\n\nFor quasiconvex functions, we can compute the lower bound over an interval [a, b] by just evaluating the function on the endpoints a and b (and by knowing what the minimizer and minimum value are).\nIt is straightforward to compute good linear underestimators for many quasiconvex distance functions used in robust statistics. We will discuss how this can be done for the Tukey biweight function, and similar steps can be taken for other popular functions such as the Huber Loss, SCAD, and MCP.\n\nExample: Tukey\u2019s biweight function and how to efficiently compute good m values. Tukey\u2019s biweight function is a classic function used in robust regression. The function is zero at the origin and the derivative is x(1 \u2212 (x/c)^2)^2 for |x| < c and 0 otherwise for some fixed c.\n\nFigure 2: Tukey\u2019s biweight function with c = 1. In the plot of the derivative, we mark the region in which the function is convex in red (\u22121/\u221a5 \u2264 x \u2264 1/\u221a5), while in the other regions at the sides the function is concave.\n\nWe will describe how to choose gL and gR for x < 0, and by symmetry we can use similar methods for x > 0. We obtain gL from connecting f (x) to the largest value of the function. If x is in the convex region, we can simply set gR to the gradient. We now add a line with slope \u2212L (where L is the largest gradient of the function) to the transition point between the concave and convex regions, and for x in the concave region we obtain gR by connecting f (x) to this line.\n\n3.4 Putting it all together\n\nAfter stating the pruning and refinement rules for our nonconvex distance functions, we can formally describe in detail the full process in Algorithm 6. The worst-case running time is O(nk), since the number of points and intervals processed is on that order and the complexity of the subroutines is linear in those numbers. On the other hand, when the functions fi are convex, the bound improves to O(n log k).\nTheorem 3.6. 
Algorithm 6 solves Problem (2) in O(nk) time in general, and O(n log k) time for convex functions if we use subgradient information.\n\nThere are two things to note about Algorithm 6. First, it only presents one possible combination of the pruning rules. Another combination would be to not apply the lower/upper bound pruning rule at every iteration. We stick to this particular description in our experiments and theorems for simplicity. Second, we only require the linear underestimator rule for the O(n log k) convex bound, since that suffices to ensure that sets Si have at most a few points.\n\nAlgorithm 6 A Pruning Algorithm for Robust Isotonic Regression\n\ninput: Functions {f_i}, Parameter k\nk\u2032 \u2190 1\nS_i \u2190 {0, 1} for i \u2208 [n]\nI_i \u2190 {[0, 1]} for i \u2208 [n]\nwhile k\u2032 < k do\n    {I_i^{G_2k\u2032}} \u2190 Algorithm 5 to refine and prune {I_i^{G_k\u2032}}\n    {I_i^{G_2k\u2032}} \u2190 Algorithm 3 to prune {I_i^{G_2k\u2032}}\n    k\u2032 \u2190 2k\u2032\nx \u2190 run modified DP on endpoints of {I_i^{G_k}}\nreturn x\n\n4 Empirical Observations\n\nWe evaluate the efficiency of the DP approach and our algorithm on a simple isotonic regression task. We adopt an experiment setup similar to the one used by Bach (2018). We generated a series of n equally spaced points y1, . . . , yn from 0.2 to 0.8 and added Gaussian random noise with standard deviation of 0.03. We then randomly flipped between 5% and 50% of the points around 0.5, and these points act as the outliers. Our goal now is to test the computational effort required to solve Problem (2), where f is Tukey\u2019s biweight function with c = 0.3. We set n to 1000 and varied k from 2^7 = 128 to 2^16 = 65536.\n\nFigure 3: The yi points (pluses) and results of using Euclidean distance (blue, dashed) vs. 
Tukey\u2019s biweight function (orange, solid).\n\nWe used two metrics to evaluate the computational efficiency. The first measure we use is the total number of points in all Si across all iterations, an implementation-independent measure. The second is the wall-clock time taken. The algorithms were implemented in Python 3.6.7, and the experiments were run on an Intel 7th generation core i5-7200U dual-core processor with 8GB of RAM.\nThe results are summarized in Figure 4, where the results are averaged over 10 independent trials. In the first figure on the left, we see how the error decreases with an increase in k, reflecting the fact that k on the order of 1/\u03b5 is needed to achieve an error of \u03b5 in the objective.\nIn the second and third figures, we compare the performance of the dynamic program against our method, with different percentages of points flipped/corrupted. Instead of presenting one DP line per percentage, we simply use a single line since the number of points evaluated is always the same and the variation in the timing across all runs is significantly less than 5 percent for all values of k. The fact that our method performs differently for different levels of corruption indicates that the performance of our method varies with the difficulty of the problem, a key design goal.\nThe difference between the second and third figures for our method is approximately a constant factor, indicating that the computational effort for each point is roughly the same. We also see that our method takes significantly more effort per point. 
Nonetheless, our method is significantly faster than the DP across all tested levels of corruption, and the difference gets more significant as we increase k.\nTo more closely investigate how the difficulty of the problem can affect the running time performance, we compare how the speedup is affected by the percent of flipped/corrupted points in Figure 5 at k = 2^16. For low levels of noise, the speedup is extremely high. There is a rapid decrease in performance between 20 and 30 percent, and at higher levels of noise the performance begins to stabilize again at about 9-10\u00d7.\n\nFigure 4: Summary of empirical results. The graph on the left shows how increasing the granularity of the grid decreases the error, and the next two graphs compare the performance of DP against our method (under different percent of flipped/corrupted points) in terms of points processed and running time.\n\nFigure 5: Speedup as a function of the number of points that were flipped/corrupted at k = 2^16.\n\nIn addition to the above experiments, we also ran preliminary experiments varying the value of n. As predicted by the theory, the complexity of both methods scales roughly linearly with n. Tests on a range of quasiconvex robust estimators show similar results.\n\n5 Conclusions and Future Work\n\nWe propose a pruning algorithm that builds upon the standard DP algorithm for solving the separable nonconvex isotonic regression problem (1) to arbitrary accuracy (relative to the global optimal value). On the theoretical front, we demonstrate that the pruning rules developed retain the correct points and intervals required to reach the global optimal value, and in the convex case our algorithm becomes a variant of the O(n log k) scaling algorithm. 
In terms of empirical performance, our initial synthetic\nexperiments show that our algorithm scales significantly better as the desired accuracy increases.\nBesides developing more pruning rules that can work on a larger range of nonconvex fi functions,\nthere are two main directions for extensions to this work, mirroring the line of developments for\nthe classic isotonic regression problem. The first is to go beyond monotonicity constraints and instead\nconsider chain functions gi(xi \u2212 xi+1) that link together adjacent indices. A particularly interesting\ncase is the one where gi incorporates an \u21132-penalty in addition to the monotonicity constraints\nin order to promote smoothness. The second is to go from the full ordering we consider here to\ngeneral partial orders. Dynamic programming approaches fail in that setting and we would require a\nsignificantly different approach. It may be possible to adapt the general submodular-based approach\ndeveloped by Bach (2018), which works in both of the above-mentioned extensions.\n\nAcknowledgements\n\nThe author would like to thank Alberto Del Pia and Silvia Di Gregorio for initial discussions that\nled to this work. The author was partially supported by NSF Award CMMI-1634597 and NSF Award\nIIS-1447449 at UW-Madison. Part of the work was completed while visiting the Simons Institute for\nthe Theory of Computing (partially supported by the DIMACS/Simons Collaboration on Bridging\nContinuous and Discrete Optimization through NSF Award CCF-1740425).\n\nReferences\n\nAhuja, R. K. and Orlin, J. B. (2001). A Fast Scaling Algorithm for Minimizing Separable Convex Functions Subject to Chain Constraints. Operations Research, 49(5):784\u2013789.\n\nAyer, M., Brunk, H. D., Ewing, G. M., Reid, W. T., and Silverman, E. (1955). An Empirical Distribution Function for Sampling with Incomplete Information. The Annals of Mathematical Statistics, 26(4):641\u2013647.\n\nBach, F. (2013). 
Learning with submodular functions: A convex optimization perspective. Foundations and Trends\u00ae in Machine Learning, 6(2-3):145\u2013373.\n\nBach, F. (2015). Submodular Functions: from Discrete to Continuous Domains. arXiv:1511.00394 [cs, math].\n\nBach, F. (2018). Efficient algorithms for non-convex isotonic regression through submodular optimization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 1\u201310. Curran Associates, Inc.\n\nBest, M. J., Chakravarti, N., and Ubhaya, V. A. (2000). Minimizing Separable Convex Functions Subject to Simple Chain Constraints. SIAM Journal on Optimization, 10(3):658\u2013672.\n\nBogdan, M., van den Berg, E., Su, W., and Candes, E. (2013). Statistical estimation and testing via the sorted L1 norm. arXiv:1310.1969.\n\nBrunk, H. D. (1955). Maximum Likelihood Estimates of Monotone Parameters. The Annals of Mathematical Statistics, 26(4):607\u2013616.\n\nBurdakov, O. and Sysoev, O. (2017). A Dual Active-Set Algorithm for Regularized Monotonic Regression. Journal of Optimization Theory and Applications, 172(3):929\u2013949.\n\nFelzenszwalb, P. F. and Huttenlocher, D. P. (2012). Distance Transforms of Sampled Functions. Theory of Computing, 8:415\u2013428.\n\nGunasekar, S., Koyejo, O. O., and Ghosh, J. (2016). Preference Completion from Partial Rankings. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems 29, pages 1370\u20131378. Curran Associates, Inc.\n\nHampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (2011). Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons.\n\nHochbaum, D. and Lu, C. (2017). A Faster Algorithm Solving a Generalization of Isotonic Median Regression and a Class of Fused Lasso Problems. 
SIAM Journal on Optimization, 27(4):2563\u20132596.\n\nHochbaum, D. S. (2001). An Efficient Algorithm for Image Segmentation, Markov Random Fields and Related Problems. J. ACM, 48(4):686\u2013701.\n\nHuber, P. (2004). Robust Statistics. Wiley Series in Probability and Statistics - Applied Probability and Statistics Section Series. Wiley.\n\nKalai, A. and Sastry, R. (2009). The Isotron Algorithm: High-Dimensional Isotonic Regression. In Conference on Learning Theory.\n\nKolmogorov, V., Pock, T., and Rolinek, M. (2016). Total Variation on a Tree. SIAM Journal on Imaging Sciences, 9(2):605\u2013636.\n\nLim, C. H. and Wright, S. J. (2016). Efficient Bregman projections onto the permutahedron and related polytopes. In Gretton, A. and Robert, C. C., editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016), pages 1205\u20131213.\n\nStout, Q. F. (2014). Fastest isotonic regression algorithms.\n\nSuehiro, D., Hatano, K., Kijima, S., Takimoto, E., and Nagano, K. (2012). Online prediction under submodular constraints. In Bshouty, N., Stoltz, G., Vayatis, N., and Zeugmann, T., editors, Algorithmic Learning Theory, volume 7568 of Lecture Notes in Computer Science, pages 260\u2013274. Springer Berlin Heidelberg.\n\nSysoev, O. and Burdakov, O. (2016). A Smoothed Monotonic Regression via L2 Regularization. Link\u00f6ping University Electronic Press.\n\nTibshirani, R. J., Hoefling, H., and Tibshirani, R. (2011). Nearly-Isotonic Regression. Technometrics, 53(1):54\u201361.\n\nYasutake, S., Hatano, K., Kijima, S., Takimoto, E., and Takeda, M. (2011). Online linear optimization over permutations. In Asano, T., Nakano, S.-i., Okamoto, Y., and Watanabe, O., editors, Algorithms and Computation, volume 7074 of Lecture Notes in Computer Science, pages 534\u2013543. Springer Berlin Heidelberg.\n\nZeng, X. and Figueiredo, M. A. T. (2014). 
The Ordered Weighted \u21131 Norm: Atomic Formulation, Projections, and Algorithms. arXiv:1409.4271.\n", "award": [], "sourceid": 177, "authors": [{"given_name": "Cong Han", "family_name": "Lim", "institution": "Georgia Tech"}]}