{"title": "Random Permutation Online Isotonic Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 4180, "page_last": 4189, "abstract": "We revisit isotonic regression on linear orders, the problem of fitting monotonic functions to best explain the data, in an online setting. It was previously shown that online isotonic regression is unlearnable in a fully adversarial model, which lead to its study in the fixed design model. Here, we instead develop the more practical random permutation model. We show that the regret is bounded above by the excess leave-one-out loss for which we develop efficient algorithms and matching lower bounds. We also analyze the class of simple and popular forward algorithms and recommend where to look for algorithms for online isotonic regression on partial orders.", "full_text": "Random Permutation Online Isotonic Regression\n\nWojciech Kot\u0142owski\n\nPozna\u00b4n University of Technology\n\nPoland\n\nWouter M. Koolen\n\nCentrum Wiskunde & Informatica\n\nAmsterdam, The Netherlands\n\nwkotlowski@cs.put.poznan.pl\n\nwmkoolen@cwi.nl\n\nAlan Malek\n\nMIT\n\nCambridge, MA\namalek@mit.edu\n\nAbstract\n\nWe revisit isotonic regression on linear orders, the problem of \ufb01tting monotonic\nfunctions to best explain the data, in an online setting. It was previously shown that\nonline isotonic regression is unlearnable in a fully adversarial model, which lead to\nits study in the \ufb01xed design model. Here, we instead develop the more practical\nrandom permutation model. We show that the regret is bounded above by the\nexcess leave-one-out loss for which we develop ef\ufb01cient algorithms and matching\nlower bounds. We also analyze the class of simple and popular forward algorithms\nand recommend where to look for algorithms for online isotonic regression on\npartial orders.\n\nIntroduction\n\n1\nA function f : R \u2192 R is called isotonic (non-decreasing) if x \u2264 y implies f (x) \u2264 f (y). 
Isotonic functions model monotonic relationships between input and output variables, like those between drug dose and response [25] or lymph node condition and survival time [24]. The problem of isotonic regression is to find the isotonic function that best explains a given data set or population distribution. The isotonic regression problem has been extensively studied in statistics [1, 24], which resulted in efficient optimization algorithms for fitting isotonic functions to the data [7, 16] and sharp convergence rates of estimation under various model assumptions [26, 29].

In online learning problems, the data arrive sequentially, and the learner is tasked with predicting each subsequent data point as it arrives [6]. In online isotonic regression, the natural goal is to predict the incoming data points as well as the best isotonic function in hindsight. Specifically, for time steps $t = 1, \ldots, T$, the learner observes an instance $x_t \in \mathbb{R}$ and makes a prediction $\hat{y}_t$ of the true label $y_t$, which is assumed to lie in $[0, 1]$. There is no restriction that the labels or predictions be isotonic. We evaluate a prediction $\hat{y}_t$ by its squared loss $(\hat{y}_t - y_t)^2$. The quality of an algorithm is measured by its regret, $\sum_{t=1}^T (\hat{y}_t - y_t)^2 - L^*_T$, where $L^*_T$ is the loss of the best isotonic function on the entire data sequence.

Isotonic regression is nonparametric: the number of parameters grows linearly with the number of data points. It is thus natural to ask whether there are efficient, provably low-regret algorithms for online isotonic regression. As of yet, the picture is still very incomplete in the online setting. The first online results were obtained in the recent paper [14], which considered linearly ordered domains in the adversarial fixed design model, i.e. a model in which all the inputs $x_1, \ldots, x_T$ are given to the learner before the start of prediction. The authors show that, due to the nonparametric nature of the problem, many textbook online learning algorithms fail to learn at all (including Online Gradient Descent, Follow the Leader and Exponential Weights), in the sense that their worst-case regret grows linearly with the number of data points. They prove an $\Omega(T^{1/3})$ worst-case regret lower bound, and develop a matching algorithm that achieves the optimal $\tilde{O}(T^{1/3})$ regret. Unfortunately, the fixed design assumption is often unrealistic. This leads us to our main question: Can we design methods for online isotonic regression that are practical (do not hinge on fixed design)?

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Our contributions. Our long-term goal is to design practical and efficient methods for online isotonic regression, and in this work we move beyond the fixed design model and study algorithms that do not depend on future instances. Unfortunately, the completely adversarial design model (in which the instances are selected by an adaptive adversary) is impossibly hard: every learner can suffer linear regret in this model [14]. So in order to drop the fixed design assumption, we need to constrain the adversary in some other way. In this paper we consider the natural random permutation model, in which all $T$ instances and labels are chosen adversarially before the game begins but then are presented to the learner in a random order. This model corresponds with the intuition that the data gathering process (which fixes the order) is independent of the underlying data generation mechanism (which fixes instances and labels).
We will show that learning is possible in the random permutation model (in fact we present a reduction showing that it is not harder than adversarial fixed design) by proving an $\tilde{O}(T^{1/3})$ upper bound on regret for an online-to-batch conversion of the optimal fixed design algorithm from [14] (Section 3).

Our main tool for analyzing the random permutation model is the leave-one-out loss, drawing interesting connections with cross-validation and calibration. The leave-one-out loss on a set of $t$ labeled instances is the error of the learner predicting the $i$-th label after seeing all remaining $t - 1$ labels, averaged uniformly over $i = 1, \ldots, t$. We begin by proving a general correspondence between regret and leave-one-out loss for the random permutation model in Section 2.1, which allows us to use excess leave-one-out loss as a proxy for regret. We then describe a version of online-to-batch conversion that relates the fixed design model with the random permutation model, resulting in an algorithm that attains the optimal $\tilde{O}(T^{1/3})$ regret.

Section 4 then turns to the computationally efficient and natural class of forward algorithms that use an offline optimization oracle to form their prediction. This class contains most common online isotonic regression algorithms. We then show an $O(T^{1/2})$ upper bound on the regret for the entire class, which improves to $O(T^{1/3})$ for the well-specified case where the data are in fact generated from an isotonic function plus i.i.d. noise (the most common model in the statistics literature).

While forward algorithms match the lower bound for the well-specified case, there is a factor $T^{1/6}$ gap in the random permutation case. Section 4.6 proposes a new algorithm that calls a weighted offline oracle with a large weight on the current instance. This algorithm can be efficiently computed via [16].
We prove necessary bounds on the weight.

Related work. Offline isotonic regression has been extensively studied in statistics starting from work by [1, 4]. Applications range across statistics, biology, medicine, psychology, etc. [24, 15, 25, 22, 17]. In statistics, isotonic regression is studied in generative models [26, 3, 29]. In machine learning, isotonic regression is used for calibrating class probability estimates [28, 21, 18, 20, 27], ROC analysis [8], training Generalized Linear Models and Single Index Models [12, 11], data cleaning [13], and ranking [19]. Fast algorithms for partial ordering are developed in [16]. In the online setting, [5] bound the minimax regret for monotone predictors under logarithmic loss, and [23, 10] study online nonparametric regression in general. Efficient algorithms and worst-case regret bounds for fixed design online isotonic regression are studied in [14]. Finally, the relation between regret and leave-one-out loss was pioneered by [9] for linear regression.

2 Problem Setup

Given a finite set of instances $\{x_1, \ldots, x_t\} \subset \mathbb{R}$, a function $f : \{x_1, \ldots, x_t\} \to [0, 1]$ is isotonic (non-decreasing) if $x_i \le x_j$ implies $f(x_i) \le f(x_j)$ for all $i, j \in \{1, \ldots, t\}$. Given a set of labeled instances $D = \{(x_1, y_1), \ldots, (x_t, y_t)\} \subset \mathbb{R} \times [0, 1]$, let $L^*(D)$ denote the total squared loss of the best isotonic function on $D$,

\[ L^*(D) := \min_{\text{isotonic } f} \sum_{i=1}^t (y_i - f(x_i))^2. \]

This convex optimization problem can be solved by the celebrated Pool Adjacent Violators Algorithm (PAVA) in time linear in $t$ [1, 7]. The optimal solution, called the isotonic regression function, is piecewise constant and its value on any of its level sets equals the average of labels within that set [24].

Online isotonic regression in the random permutation model is defined as follows. At the beginning of the game, the adversary chooses data instances $x_1 < \ldots < x_T$ (footnote 1) and labels $y_1, \ldots, y_T$. A permutation $\sigma = (\sigma_1, \ldots, \sigma_T)$ of $\{1, \ldots, T\}$ is then drawn uniformly at random and used to determine the order in which the data will be revealed. In round $t$, the instance $x_{\sigma_t}$ is revealed to the learner, who then predicts $\hat{y}_{\sigma_t}$. Next, the learner observes the true label $y_{\sigma_t}$ and incurs the squared loss $(\hat{y}_{\sigma_t} - y_{\sigma_t})^2$.

For a fixed permutation $\sigma$, we use the shorthand notation $L^*_t = L^*(\{(x_{\sigma_1}, y_{\sigma_1}), \ldots, (x_{\sigma_t}, y_{\sigma_t})\})$ to denote the optimal isotonic regression loss of the first $t$ labeled instances ($L^*_t$ will clearly depend on $\sigma$, except for the case $t = T$). The goal of the learner is to minimize the expected regret,

\[ R_T := \mathbb{E}_\sigma\Big[ \sum_{t=1}^T (y_{\sigma_t} - \hat{y}_{\sigma_t})^2 \Big] - L^*_T = \sum_{t=1}^T r_t, \]

where we have decomposed the regret into its per-round increase,

\[ r_t := \mathbb{E}_\sigma\big[ (y_{\sigma_t} - \hat{y}_{\sigma_t})^2 - L^*_t + L^*_{t-1} \big], \tag{1} \]

with $L^*_0 := 0$. To simplify the analysis, let us assume that the prediction strategy does not depend on the order in which the past data were revealed (which is true for all algorithms considered in this paper). Fix $t$ and define $D = \{(x_{\sigma_1}, y_{\sigma_1}), \ldots, (x_{\sigma_t}, y_{\sigma_t})\}$ to be the set of the first $t$ labeled instances. Furthermore, let $D_{-i} = D \setminus \{(x_{\sigma_i}, y_{\sigma_i})\}$ denote the set $D$ with the instance from round $i$ removed.
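Throughout the analysis, the offline quantities $L^*(D)$ and $L^*(D_{-i})$ are PAVA solves. As a concrete aid (a minimal sketch of our own, not code from the paper), the pooling idea can be written as:

```python
def pava(y, w=None):
    """Isotonic regression of labels y (listed in increasing order of x)
    under weighted squared loss, via Pool Adjacent Violators:
    repeatedly merge adjacent blocks whose means violate monotonicity."""
    if w is None:
        w = [1.0] * len(y)
    blocks = []  # each block: [weighted mean, total weight, point count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, n1 + n2])
    # each pooled block predicts its mean at every point it covers
    return [m for m, _, n in blocks for _ in range(n)]

def isotonic_loss(y):
    """L*(D): squared loss of the best isotonic fit on labels y."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, pava(y)))
```

For example, `pava([1, 0])` pools the violating pair into `[0.5, 0.5]`, so `isotonic_loss([1, 0])` is 0.5.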
Using this notation, the expression under the expectation in (1) can be written as $(y_{\sigma_t} - \hat{y}_{\sigma_t}(D_{-t}))^2 - L^*(D) + L^*(D_{-t})$, where we made the dependence of $\hat{y}_{\sigma_t}$ on $D_{-t}$ explicit (and used the fact that it only depends on the set of instances, not on their order). By symmetry of the expectation over permutations with respect to the indices, we have

\[ \mathbb{E}_\sigma\big[ (y_{\sigma_t} - \hat{y}_{\sigma_t}(D_{-t}))^2 \big] = \mathbb{E}_\sigma\big[ (y_{\sigma_i} - \hat{y}_{\sigma_i}(D_{-i}))^2 \big] \quad \text{and} \quad \mathbb{E}_\sigma\big[ L^*(D_{-t}) \big] = \mathbb{E}_\sigma\big[ L^*(D_{-i}) \big], \]

for all $i = 1, \ldots, t$. Thus, (1) can as well be rewritten as:

\[ r_t = \mathbb{E}_\sigma\Big[ \frac{1}{t} \sum_{i=1}^t \Big( (y_{\sigma_i} - \hat{y}_{\sigma_i}(D_{-i}))^2 + L^*(D_{-i}) \Big) - L^*(D) \Big]. \]

Let us denote the expression inside the expectation by $r_t(D)$ to stress its dependence on the set of instances $D$, but not on the order in which they were revealed. If we can show that $r_t(D) \le B_t$ holds for all $t$, then its expectation has the same bound, so $R_T \le \sum_{t=1}^T B_t$.

2.1 Excess Leave-One-Out Loss and Regret

Our main tool for analyzing the random permutation model is the leave-one-out loss. In the leave-one-out model, there is no sequential structure. The adversary picks a data set $D = \{(x_1, y_1), \ldots, (x_t, y_t)\}$ with $x_1 < \ldots < x_t$. An index $i$ is sampled uniformly at random, the learner is given $D_{-i}$, the entire data set except $(x_i, y_i)$, and predicts $\hat{y}_i$ (as a function of $D_{-i}$) on the instance $x_i$.
We call the difference between the expected loss of the learner and $L^*(D)$ the expected excess leave-one-out loss:

\[ \ell\mathrm{oo}_t(D) := \frac{1}{t} \Big( \sum_{i=1}^t \big( y_i - \hat{y}_i(D_{-i}) \big)^2 - L^*(D) \Big). \tag{2} \]

The random permutation model has the important property that a bound on the excess leave-one-out loss of a prediction algorithm translates into a regret bound. A similar result has been shown by [9] for expected loss in the i.i.d. setting.

Lemma 2.1. $r_t(D) \le \ell\mathrm{oo}_t(D)$ for any $t$ and any data set $D = \{(x_1, y_1), \ldots, (x_t, y_t)\}$.

Proof. As $x_1 < \ldots < x_t$, let $(y^*_1, \ldots, y^*_t) = \mathrm{argmin}_{f_1 \le \ldots \le f_t} \sum_{i=1}^t (y_i - f_i)^2$ be the isotonic regression function on $D$. From the definition of $L^*$, we can see that $L^*(D) \ge L^*(D_{-i}) + (y_i - y^*_i)^2$. Thus, the regret increase $r_t(D)$ is bounded by

\[ r_t(D) = \sum_{i=1}^t \frac{(y_i - \hat{y}_i)^2 + L^*(D_{-i})}{t} - L^*(D) \le \sum_{i=1}^t \frac{(y_i - \hat{y}_i)^2 - (y_i - y^*_i)^2}{t} = \ell\mathrm{oo}_t(D). \]

However, we note that lower bounds for $\ell\mathrm{oo}_t(D)$ do not imply lower bounds on regret.

Footnote 1: We assume all points $x_t$ are distinct, as it will significantly simplify the presentation. All results in this paper are also valid for the case $x_1 \le \ldots \le x_T$.

In what follows, our strategy will be to derive bounds $\ell\mathrm{oo}_t(D) \le B_t$ for various algorithms, from which the regret bound $R_T \le \sum_{t=1}^T B_t$ can be immediately obtained. From now on, we abbreviate $\ell\mathrm{oo}_t(D)$ to $\ell\mathrm{oo}_t$ (as $D$ is clear from the context); we will also consistently assume $x_1 < \ldots < x_t$.

2.2 Noise free case

As a warm-up, we analyze the noise-free case (when the labels themselves are isotonic) and demonstrate that analyzing $\ell\mathrm{oo}_t$ easily results in an optimal bound for this setting.

Proposition 2.2. Assume that the labels satisfy $y_1 \le y_2 \le \ldots \le y_t$. The prediction $\hat{y}_i$ that is the linear interpolation between adjacent labels, $\hat{y}_i = \frac{1}{2}(y_{i-1} + y_{i+1})$ (with the conventions $y_0 = 0$ and $y_{t+1} = 1$), has

\[ \ell\mathrm{oo}_t \le \frac{1}{2t}, \quad \text{and thus} \quad R_T \le \frac{1}{2} \log(T+1). \]

Proof. For $\delta_i := y_i - y_{i-1}$, it is easy to check that $\ell\mathrm{oo}_t = \frac{1}{4t} \sum_{i=1}^t (\delta_{i+1} - \delta_i)^2$ because the $L^*(D)$ term is zero. This expression is a convex function of $\delta_1, \ldots, \delta_{t+1}$. Note that $\delta_i \ge 0$ for each $i = 1, \ldots, t+1$, and $\sum_{i=1}^{t+1} \delta_i = 1$. Since the maximum of a convex function is at the boundary of the feasible region, the maximizer is given by $\delta_i = 1$ for some $i \in \{1, \ldots, t+1\}$, and $\delta_j = 0$ for all $j \in \{1, \ldots, t+1\}$, $j \ne i$. This implies that $\ell\mathrm{oo}_t \le (2t)^{-1}$.

2.3 General Lower Bound

In [14], a general lower bound was derived showing that the regret of any online isotonic regression procedure is at least $\Omega(T^{1/3})$ for the adversarial setup (when labels and the index order were chosen adversarially). This lower bound applies regardless of the order of outcomes, and hence it is also a lower bound for the random permutation model. This bound translates into $\ell\mathrm{oo}_t = \Omega(t^{-2/3})$.

3 Online-to-batch for fixed design

Here, we describe an online-to-batch conversion that relates the adversarial fixed design model with the random permutation model considered in this paper. In the fixed design model with time horizon $T_{\mathrm{fd}}$, the learner is given the points $x_1, \ldots, x_{T_{\mathrm{fd}}}$ in advance (which is not the case in the random permutation model), but the adversary chooses the order $\sigma$ in which the labels are revealed (as opposed to $\sigma$ being drawn at random). We can think of an algorithm for fixed design as a prediction function

\[ \hat{y}^{\mathrm{fd}}\big( x_{\sigma_t} \,\big|\, y_{\sigma_1}, \ldots, y_{\sigma_{t-1}}, \{x_1, \ldots, x_{T_{\mathrm{fd}}}\} \big), \]

for any order $\sigma$, any set $\{x_1, \ldots, x_{T_{\mathrm{fd}}}\}$ (and hence any time horizon $T_{\mathrm{fd}}$), and any time $t$. This notation is quite heavy, but makes it explicit that the learner, while predicting at point $x_{\sigma_t}$, knows the previously revealed labels and the whole set of instances.

In the random permutation model, at trial $t$, the learner only knows the previously revealed $t - 1$ labeled instances and predicts on the new instance. Without loss of generality, denote the past instances by $D_{-i} = \{(x_1, y_1), \ldots, (x_{i-1}, y_{i-1}), (x_{i+1}, y_{i+1}), \ldots, (x_t, y_t)\}$, and the new instance by $x_i$, for some $i \in \{1, \ldots, t\}$. Given an algorithm for fixed design $\hat{y}^{\mathrm{fd}}$, we construct a prediction $\hat{y}_t = \hat{y}_t(D_{-i}, x_i)$ of the algorithm in the random permutation model. The reduction goes through an online-to-batch conversion. Specifically, at trial $t$, given past labeled instances $D_{-i}$ and a new point $x_i$, the learner plays the expectation of the prediction of the fixed design algorithm with time horizon $T_{\mathrm{fd}} = t$ and points $\{x_1, \ldots, x_t\}$, under a uniformly random time from the past $j \in \{1, \ldots, t\}$ and a random permutation $\sigma$ on $\{1, \ldots, t\}$ with $\sigma_t = i$, i.e.

\[ \hat{y}_t = \mathbb{E}_{\{\sigma : \sigma_t = i\}}\Big[ \frac{1}{t} \sum_{j=1}^t \hat{y}^{\mathrm{fd}}\big( x_i \,\big|\, y_{\sigma_1}, \ldots, y_{\sigma_{j-1}}, \{x_1, \ldots, x_t\} \big) \Big]. \tag{3} \]

(Choosing the prediction as an expectation is elegant but inefficient. However, the proof indicates that we might as well sample a single $j$ and a single random permutation $\sigma$ to form the prediction, and the reduction would also work in expectation.)

Note that this is a valid construction, as the right hand side only depends on $D_{-i}$ and $x_i$, which are known to the learner in the random permutation model at round $t$. We prove (in Appendix A) that the excess leave-one-out loss of $\hat{y}$ at trial $t$ is upper bounded by the expected regret (over all permutations) of $\hat{y}^{\mathrm{fd}}$ in trials $1, \ldots, t$ divided by $t$:

Theorem 3.1. Let $D = \{(x_1, y_1), \ldots, (x_t, y_t)\}$ be a set of $t$ labeled instances. Fix any algorithm $\hat{y}^{\mathrm{fd}}$ for online adversarial isotonic regression with fixed design, and let $\mathrm{Reg}_t(\hat{y}^{\mathrm{fd}} \mid \sigma)$ denote its regret on $D$ when the labels are revealed in order $\sigma$. The random permutation learner $\hat{y}$ from (3) ensures

\[ \ell\mathrm{oo}_t(D) \le \frac{1}{t}\, \mathbb{E}_\sigma\big[ \mathrm{Reg}_t(\hat{y}^{\mathrm{fd}} \mid \sigma) \big]. \]

This construction allows immediate transport of the $\tilde{O}(T_{\mathrm{fd}}^{1/3})$ fixed design regret result from [14].

Theorem 3.2. There is an algorithm for the random-permutation model with excess leave-one-out loss $\ell\mathrm{oo}_t = \tilde{O}(t^{-2/3})$ and hence expected regret $R_T \le \sum_t \tilde{O}(t^{-2/3}) = \tilde{O}(T^{1/3})$.

4 Forward Algorithms

For clarity of presentation, we use vector notation in this section: $y = (y_1, \ldots, y_t)$ is the label vector, $y^* = (y^*_1, \ldots, y^*_t)$ is the isotonic regression function, and $y_{-i} = (y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_t)$ is $y$ with the $i$-th label removed. Moreover, keeping in mind that $x_1 < \ldots < x_t$, we can drop the $x_i$'s entirely from the notation and refer to an instance $x_i$ simply by its index $i$.

Given labels $y_{-i}$ and some index $i$ to predict on, we want a good prediction for $y_i$. Follow the Leader (FL) algorithms, which predict using the best isotonic function on the data seen so far, are not directly applicable to online isotonic regression: the best isotonic function is only defined at the observed data instances and can be arbitrary (up to the monotonicity constraint) otherwise.
Instead, we analyze a simple and natural class of algorithms which we dub forward algorithms (the name highlights the resemblance to the Forward algorithm introduced by [2] for exponential family models). We define a forward algorithm, or FA, to be any algorithm that estimates a label $y'_i \in [0, 1]$ (possibly dependent on $i$ and $y_{-i}$) and plays with the FL strategy on the sequence of past data including the new instance with the estimated label, i.e. performs offline isotonic regression on $y'$,

\[ \hat{y} = \mathrm{argmin}_{f_1 \le \ldots \le f_t} \Big\{ \sum_{j=1}^t (y'_j - f_j)^2 \Big\}, \quad \text{where } y' = (y_1, \ldots, y_{i-1}, y'_i, y_{i+1}, \ldots, y_t). \]

Then, FA predicts with $\hat{y}_i$, the value at index $i$ of the offline function of the augmented data. Note that if the estimate turned out to be correct ($y'_i = y_i$), the forward algorithm would suffer no additional loss for that round.

Forward algorithms are practically important: we will show that many popular algorithms can be cast as FA with a particular estimate. FAs automatically inherit any computational advances for offline isotonic regression; in particular, they scale efficiently to partially ordered data [16]. To the best of our knowledge, we are the first to give bounds on the performance of these algorithms in the online setting.

Alternative formulation. We can describe a FA using a minimax representation of the isotonic regression [see, e.g., 24]: the optimal isotonic regression $y^*$ satisfies

\[ y^*_i = \min_{r \ge i} \max_{\ell \le i} \bar{y}_{\ell,r} = \max_{\ell \le i} \min_{r \ge i} \bar{y}_{\ell,r}, \tag{4} \]

where $\bar{y}_{\ell,r} = \frac{\sum_{j=\ell}^r y_j}{r - \ell + 1}$. The "saddle point" $(\ell_i, r_i)$ for which $y^*_i = \bar{y}_{\ell_i, r_i}$ specifies the boundaries of the level set $\{j : y^*_j = y^*_i\}$ of the isotonic regression function that contains $i$.

It follows from (4) that isotonic regression is monotonic with respect to labels: for any two label sequences $y$ and $z$ such that $y_i \le z_i$ for all $i$, we have $y^*_i \le z^*_i$ for all $i$. Thus, if we denote the predictions for label estimates $y'_i = 0$ and $y'_i = 1$ by $\hat{y}^0_i$ and $\hat{y}^1_i$, respectively, the monotonicity implies that any FA has $\hat{y}^0_i \le \hat{y}_i \le \hat{y}^1_i$. Conversely, using the continuity of the isotonic regression $y^*$ as a function of $y$ (which follows from (4)), we can show that for any prediction $\hat{y}_i$ with $\hat{y}^0_i \le \hat{y}_i \le \hat{y}^1_i$, there exists an estimate $y'_i \in [0, 1]$ that could generate this prediction. Hence, we can equivalently interpret FA as an algorithm which in each trial predicts with some $\hat{y}_i$ in the range $[\hat{y}^0_i, \hat{y}^1_i]$.

4.1 Instances

With the above equivalence between forward algorithms and algorithms that predict in $[\hat{y}^0_i, \hat{y}^1_i]$, we can show that many of the well-known isotonic regression algorithms are forward algorithms, and thereby add weight to our next section, where we prove regret bounds for the entire class.

Isotonic regression with interpolation (IR-Int) [28]. Given $y_{-i}$ and index $i$, the algorithm first computes $f^*$, the isotonic regression of $y_{-i}$, and then predicts with $\hat{y}^{\mathrm{int}}_i = \frac{1}{2}(f^*_{i-1} + f^*_{i+1})$, where we use $f^*_0 = 0$ and $f^*_{t+1} = 1$. To see that this is a FA, note that if we use estimate $y'_i = \hat{y}^{\mathrm{int}}_i$, the isotonic regression of $y' = (y_1, \ldots, y_{i-1}, y'_i, y_{i+1}, \ldots, y_t)$ is $\hat{y} = (f^*_1, \ldots, f^*_{i-1}, y'_i, f^*_{i+1}, \ldots, f^*_t)$. This is because: i) $\hat{y}$ is isotonic by construction; ii) $f^*$ has the smallest squared error loss for $y_{-i}$ among isotonic functions; and iii) the loss of $\hat{y}$ on point $y'_i$ is zero, and the loss of $\hat{y}$ on all other points is equal to the loss of $f^*$.

Direct combination of $\hat{y}^0_i$ and $\hat{y}^1_i$. It is clear from Section 4 that any algorithm that predicts $\hat{y}_i = \lambda_i \hat{y}^0_i + (1 - \lambda_i) \hat{y}^1_i$ for some $\lambda_i \in [0, 1]$ is a FA. The weight $\lambda_i$ can be set to a constant (e.g., $\lambda_i = 1/2$), or can be chosen depending on $\hat{y}^0_i$ and $\hat{y}^1_i$. Such algorithms were considered by [27]:

\[ \text{Brier-IVAP:} \quad \hat{y}^{\mathrm{Brier}}_i = \frac{1 + (\hat{y}^0_i)^2 - (1 - \hat{y}^1_i)^2}{2}, \qquad \text{log-IVAP:} \quad \hat{y}^{\mathrm{log}}_i = \frac{\hat{y}^1_i}{\hat{y}^1_i + 1 - \hat{y}^0_i}. \]

It is straightforward to show that both algorithms satisfy $\hat{y}^0_i \le \hat{y}_i \le \hat{y}^1_i$ and are thus instances of FA.

Last-step minimax (LSM). LSM plays the minimax strategy with one round remaining,

\[ \hat{y}_i = \mathrm{argmin}_{\hat{y} \in [0,1]} \max_{y_i \in [0,1]} \big\{ (\hat{y} - y_i)^2 - L^*(y) \big\}, \]

where $L^*(y)$ is the isotonic regression loss on $y$. Define $L^*_b = L^*(y_1, \ldots, y_{i-1}, b, y_{i+1}, \ldots, y_t)$ for $b \in \{0, 1\}$, i.e. $L^*_b$ is the loss of the isotonic regression function with label estimate $y'_i = b$. In Appendix B we show that $\hat{y}_i = \frac{1 + L^*_0 - L^*_1}{2}$ and that it is also an instance of FA.

4.2 Bounding the leave-one-out loss

We now give an $O(\sqrt{\log t / t})$ bound on the leave-one-out loss for forward algorithms. Interestingly, the bound holds no matter what label estimate the algorithm uses.
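The Section 4.1 instances are easy to sketch on top of any offline isotonic oracle. In the following sketch, the `pava` helper is our own stand-in for that oracle (not the paper's implementation), and the function names are ours:

```python
def pava(y):
    """Offline isotonic regression of y under squared loss
    (pool adjacent violators; y is listed in increasing order of x)."""
    blocks = []  # each block: [mean, point count]
    for v in y:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(n1 * m1 + n2 * m2) / (n1 + n2), n1 + n2])
    return [m for m, n in blocks for _ in range(n)]

def forward(y_rest, i, estimate):
    """Generic forward algorithm: plug the label estimate y'_i at
    position i (0-based, in x-order), run offline isotonic regression
    on the augmented sequence, and predict its value at i."""
    y_aug = y_rest[:i] + [estimate] + y_rest[i:]
    return pava(y_aug)[i]

def brier_ivap(y_rest, i):
    """Brier-IVAP: (1 + (y0)^2 - (1 - y1)^2) / 2."""
    y0, y1 = forward(y_rest, i, 0.0), forward(y_rest, i, 1.0)
    return (1 + y0 ** 2 - (1 - y1) ** 2) / 2

def log_ivap(y_rest, i):
    """log-IVAP: y1 / (y1 + 1 - y0)."""
    y0, y1 = forward(y_rest, i, 0.0), forward(y_rest, i, 1.0)
    return y1 / (y1 + 1 - y0)
```

Both combinations stay inside $[\hat{y}^0_i, \hat{y}^1_i]$, so they are forward algorithms, and the bound of this section applies to them regardless of the estimate used.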
The proof relies on the stability of isotonic regression with respect to a change of a single label. While the bound looks suboptimal in light of Section 2.3, we will argue in Section 4.5 that the bound is actually tight (up to a logarithmic factor) for one FA, and experimentally verify that all other mentioned forward algorithms also have a tight lower bound of that form for the same sequence of outcomes.

We will bound $\ell\mathrm{oo}_t$ by defining $\delta_i = \hat{y}_i - y^*_i$ and using the following simple inequality:

\[ \ell\mathrm{oo}_t = \frac{1}{t} \sum_{i=1}^t \Big( (\hat{y}_i - y_i)^2 - (y^*_i - y_i)^2 \Big) = \frac{1}{t} \sum_{i=1}^t (\hat{y}_i - y^*_i)(\hat{y}_i + y^*_i - 2y_i) \le \frac{2}{t} \sum_{i=1}^t |\delta_i|. \]

Theorem 4.1. Any forward algorithm has $\ell\mathrm{oo}_t = O\big(\sqrt{\log t / t}\big)$.

Proof. Fix some forward algorithm. For any $i$, let $\{j : y^*_j = y^*_i\} = \{\ell_i, \ldots, r_i\}$, for some $\ell_i \le i \le r_i$, be the level set of the isotonic regression at level $y^*_i$. We need the stronger version of the minimax representation, shown in Appendix C:

\[ y^*_i = \min_{r \ge i} \bar{y}_{\ell_i, r} = \max_{\ell \le i} \bar{y}_{\ell, r_i}. \tag{5} \]

We partition the points $\{1, \ldots, t\}$ into $K$ consecutive segments: $S_k = \big\{ i : y^*_i \in \big[\tfrac{k-1}{K}, \tfrac{k}{K}\big) \big\}$ for $k = 1, \ldots, K-1$, and $S_K = \big\{ i : y^*_i \ge \tfrac{K-1}{K} \big\}$. Due to the monotonicity of $y^*$, the $S_k$ are subsets of the form $\{\ell_k, \ldots, r_k\}$ (where we use $r_k = \ell_k - 1$ if $S_k$ is empty).
From the definition, every level set of $y^*$ is contained in $S_k$ for some $k$, and each $\ell_k$ ($r_k$) is a left-end (right-end) of some level set.

Now, choose some index $i$, and let $S_k$ be such that $i \in S_k$. Let $y'_i$ be the estimate of the FA, and let $y' = (y_1, \ldots, y_{i-1}, y'_i, y_{i+1}, \ldots, y_t)$. The minimax representation (4) and the definition of FA imply

\[
\hat{y}_i = \max_{\ell \le i} \min_{r \ge i} \bar{y}'_{\ell,r}
\ge \min_{r \ge i} \bar{y}'_{\ell_k,r}
= \min_{r \ge i} \Big\{ \bar{y}_{\ell_k,r} + \frac{y'_i - y_i}{r - \ell_k + 1} \Big\}
\ge \min_{r \ge i} \bar{y}_{\ell_k,r} + \min_{r \ge i} \frac{y'_i - y_i}{r - \ell_k + 1}
\ge \min_{r \ge \ell_k} \bar{y}_{\ell_k,r} - \frac{1}{i - \ell_k + 1}
\overset{\text{by (5)}}{\ge} y^*_{\ell_k} - \frac{1}{i - \ell_k + 1}
\ge y^*_i - \frac{1}{K} - \frac{1}{i - \ell_k + 1}.
\]

A symmetric argument gives $\hat{y}_i \le y^*_i + \frac{1}{K} + \frac{1}{r_k - i + 1}$. Hence, we can bound $|\delta_i| = |\hat{y}_i - y^*_i| \le \frac{1}{K} + \max\big\{ \frac{1}{i - \ell_k + 1}, \frac{1}{r_k - i + 1} \big\}$. Summing over $i \in S_k$ yields

\[ \sum_{i \in S_k} |\delta_i| \le \frac{|S_k|}{K} + 2\big(1 + \log |S_k|\big), \]

which allows the bound

\[ \ell\mathrm{oo}_t \le \frac{2}{t} \sum_{i=1}^t |\delta_i| \le \frac{2}{K} + 4\,\frac{K}{t}\,(1 + \log t). \]

The theorem follows from setting $K = \Theta\big(\sqrt{t / \log t}\big)$.

4.3 Forward algorithms for the well-specified case

While the $\ell\mathrm{oo}_t$ upper bound of the previous section yields a regret bound $R_T \le \sum_t O(\sqrt{\log t / t}) = \tilde{O}(T^{1/2})$ that is a factor $O(T^{1/6})$ away from the lower bound in Section 2.3, there are two pieces of good news.
First, forward algorithms do get the optimal rate in the well-specified setting, popular in the classical statistics literature, where the labels are generated i.i.d. such that $\mathbb{E}[y_i] = \mu_i$ with isotonic $\mu_1 \le \ldots \le \mu_t$ (footnote 4). Second, there is an $\Omega(t^{-1/2})$ lower bound for forward algorithms, as proven in the next section. Together, these results imply that the random permutation model is indeed harder than the well-specified case: forward algorithms are sufficient for the latter but not the former.

Theorem 4.2. For data generated from the well-specified setting (monotonic means with i.i.d. noise), any FA has $\ell\mathrm{oo}_t = \tilde{O}(t^{-2/3})$, which translates to a $\tilde{O}(T^{1/3})$ bound on the regret.

The proof is given in Appendix D. Curiously, the proof makes use of the existence of the seemingly unrelated optimal algorithm with $\tilde{O}(t^{-2/3})$ excess leave-one-out loss from Theorem 3.2.

4.4 Entropic loss

We now abandon the squared loss for a moment and analyze how a FA performs when the loss function is the entropic loss, defined as $-y \log \hat{y} - (1 - y) \log(1 - \hat{y})$ for $y \in [0, 1]$. Entropic loss (precisely: its binary-label version known as log-loss) is extensively used in the isotonic regression context for maximum likelihood estimation [14] or for probability calibration [28, 21, 27]. A surprising fact in isotonic regression is that minimizing entropic loss (footnote 5) leads to exactly the same optimal solution as in the squared loss case, the isotonic regression function $y^*$ [24].

Not every FA is appropriate for entropic loss, as recklessly choosing the label estimate might result in an infinite loss in just a single trial (as noted by [27]). Indeed, consider a sequence of outcomes with $y_1 = 0$ and $y_i = 1$ for $i > 1$. While predicting on index $i = 1$, choosing $y'_1 = 1$ results in $\hat{y}_1 = 1$, for which the entropic loss is infinite (as $y_1 = 0$).
Does there exists a FA which achieves a meaningful\nbound on (cid:96)oot in the entropic loss setup?\nWe answer this question in the af\ufb01rmative, showing that the log-IVAP predictor FA gets the same\nexcess-leave-one-out loss bound as given in Theorem 4.1. As the reduction from the regret to leave-\none-out loss (Lemma 2.1) does not use any properties of the loss function, this immediately implies a\nbound on the expected regret. Interestingly, the proof (given in Appendix G) uses as an intermediate\nstep the bound on |\u03b4i| for the worst possible forward algorithm which always produces the estimate\nbeing the opposite of the actual label.\nTheorem 4.3. The log-IVAP algorithm has (cid:96)oot = O\n\n1 = 1 results in(cid:98)y1 = 1, for\n\n(cid:16)(cid:113) log t\n\nfor the entropic loss.\n\n(cid:17)\n\nt\n\n4The \u2126(T 1/3) regret lower bound in [14] uses a mixture of well-speci\ufb01ed distributions and still applies.\n5In fact, this statement applies to any Bregman divergence [24].\n\n7\n\n\f4.5 Lower bound\nThe last result of this section is that FA can be made to have (cid:96)oot = \u2126(t\u2212 1\n2 ). We show this by means\nof a counterexample. Assume t = n2 for some integer n > 0 and let the labels be binary, yi \u2208 {0, 1}.\n\u221a\nWe split the set {1, . . . , t} into n consecutive segments, each of size n =\nt. The proportion of ones\nn, but within each segment all ones precede all zeros. For\n(yi = 1) in the k-th segment is equal to k\ninstance, when t = 25, the label sequence is:\n11000\n\n11100\n\n10000\n\n,\n\n(cid:124)(cid:123)(cid:122)(cid:125)\n\n11110\n\n4/5\n\n(cid:124)(cid:123)(cid:122)(cid:125)\n\n11111\n\n5/5\n\n(cid:124)(cid:123)(cid:122)(cid:125)\n\n1/5\n\n(cid:124)(cid:123)(cid:122)(cid:125)\n\n2/5\n\n(cid:124)(cid:123)(cid:122)(cid:125)\n\n3/5\n\nOne can use the minimax formulation (4) to verify that the segments will correspond to the level sets\nof the isotonic regression and that y\u2217\nn for any i in the k-th segment. 
This sequence is hard:\nLemma 4.4. The IR-Int algorithm run on the sequence described above has (cid:96)oot = \u2126(t\u2212 1\n2 ).\nWe prove the lower bound for IR-Int, since the presentation (in Appendix E) is clearest. Empirical\nsimulations showing that the other forward algorithms also suffer this regret are in Appendix F.\n\ni = k\n\n4.6 Towards optimal forward algorithms\n\nAn attractive feature of forward algorithms is that they generalize to partial orders, for which ef\ufb01cient\nof\ufb02ine optimization algorithms exist. However, in Section 4 we saw that FAs only give a \u02dcO(t\u2212 1\n2 )\nrate, while in Section 3 we saw that \u02dcO(t\u2212 2\n3 ) is possible (with an algorithm that is not known to scale\nto partial orders). Is there any hope of an algorithm that both generalizes and has the optimal rate?\nIn this section, we propose the Heavy-\u03b3 algorithm, a slight modi\ufb01cation of the forward algorithm that\ni = \u03b3 \u2208 [0, 1] with weight c (with unit weight on all other points), then plays\nplugs in label estimate y(cid:48)\nthe value of the isotonic regression function. Implementation is straightforward for of\ufb02ine isotonic\nregression algorithms that permit the speci\ufb01cation of weights (such as [16]). Otherwise, we might\nsimulate such weighting by plugging in c copies of the estimated label \u03b3 at location xi.\nWhat label estimate \u03b3 and weight c should we use? We show that the choice of \u03b3 is not very sensitive,\nbut it is crucial to tune the weight to c = \u0398(t 1\n3 ). Lemmas H.1 and H.2 show that higher and lower c\nare necessarily sub-optimal for (cid:96)oot. This leaves only one choice for c, for which we believe\nConjecture 4.5. Heavy-\u03b3 with weight c = \u0398(t 1\n\n3 ) has (cid:96)oot = \u02dcO(t\u2212 2\n3 ).\n\nWe cannot yet prove this conjecture, although numerical experiments strongly suggest it. We do not\nbelieve that picking a constant label \u03b3 is special. 
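To make the description above concrete, here is a minimal sketch of Heavy-γ in the leave-one-out setting (all other labels revealed), using a weighted pool-adjacent-violators routine as the offline oracle. The function names, and exposing c as a parameter, are our own illustration, not the paper's implementation:

```python
def weighted_isotonic(y, w):
    """Weighted isotonic regression by pool-adjacent-violators:
    each block keeps [weighted label sum, total weight, point count]."""
    blocks = []
    for v, wt in zip(y, w):
        blocks.append([v * wt, wt, 1])
        while len(blocks) > 1 and \
                blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]:
            s, c, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
            blocks[-1][2] += n
    fit = []
    for s, c, n in blocks:
        fit.extend([s / c] * n)
    return fit


def heavy_gamma(labels, i, gamma=0.5, c=None):
    """Heavy-gamma prediction for index i, given the remaining labels:
    plug in the estimate gamma at position i with weight c (unit weight
    elsewhere), run weighted isotonic regression, and play the fitted
    value at i.  The conjectured tuning is c = Theta(t^{1/3})."""
    t = len(labels)
    if c is None:
        c = t ** (1.0 / 3.0)
    y = list(labels)
    y[i] = gamma  # the true label at i is unseen; substitute the estimate
    w = [1.0] * t
    w[i] = c
    return weighted_isotonic(y, w)[i]


print(heavy_gamma([0, 0, 1, 1], i=1))  # -> 0.5 (the estimate fits monotonically here)
```

When the estimate γ already fits between its isotonic neighbours, extra weight changes nothing; the weight c only matters when γ conflicts with the surrounding labels and must be averaged into a level set.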
For example, we might alternatively predict with the average of the predictions of Heavy-1 and Heavy-0. Yet not every label estimate works. In particular, if we estimate the label by the prediction of IR-Int (see Section 4.1 and the discussion below it) and plug that in with any weight c ≥ 0, then the isotonic regression function will still have that same label estimate as its value. This means that the Ω(t^{-1/2}) lower bound of Section 4.5 applies.

5 Conclusion

We revisit the problem of online isotonic regression and argue that we need a new perspective to design practical algorithms. We study the random permutation model as a novel way to bypass the stringent fixed design requirement of previous work. Our main tool in the design and analysis of algorithms is the leave-one-out loss, which bounds the expected regret from above. We start by observing that the adversary from the adversarial fixed design setting also provides a lower bound here. We then show that this lower bound can be matched by applying online-to-batch conversion to the optimal algorithm for fixed design. Next, we provide an online analysis of the natural, popular and practical class of forward algorithms, which are defined in terms of an offline optimization oracle. We show that forward algorithms achieve a decent regret rate in all cases, and match the optimal rate in special cases. We conclude by sketching the class of practical Heavy algorithms and conjecture that a specific parameter setting might guarantee the correct regret rate.

Open problem The next major challenge is the design and analysis of efficient algorithms for online isotonic regression on arbitrary partial orders. Heavy-γ is our current best candidate.
We pose deciding if it in fact guarantees Õ(T^{1/3}) regret on linear orders as an open problem.

Acknowledgments

Wojciech Kotłowski acknowledges support from the Polish National Science Centre (grant no. 2016/22/E/ST6/00299). Wouter Koolen acknowledges support from the Netherlands Organization for Scientific Research (NWO) under Veni grant 639.021.439. This work was done in part while Koolen was visiting the Simons Institute for the Theory of Computing.

References

[1] M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 26(4):641–647, 1955.

[2] K. Azoury and M. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.

[3] Lucien Birgé and Pascal Massart. Rates of convergence for minimum contrast estimators. Probability Theory and Related Fields, 97:113–150, 1993.

[4] H. D. Brunk. Maximum likelihood estimates of monotone parameters. Annals of Mathematical Statistics, 26(4):607–616, 1955.

[5] Nicolò Cesa-Bianchi and Gábor Lugosi. Worst-case bounds for the logarithmic loss of predictors. Machine Learning, 43(3):247–264, 2001.

[6] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[7] Jan de Leeuw, Kurt Hornik, and Patrick Mair. Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software, 32:1–24, 2009.

[8] Tom Fawcett and Alexandru Niculescu-Mizil. PAV and the ROC convex hull. Machine Learning, 68(1):97–106, 2007.

[9] Jürgen Forster and Manfred K. Warmuth. Relative expected instantaneous loss bounds.
Journal of Computer and System Sciences, 64(1):76–102, 2002.

[10] Pierre Gaillard and Sébastien Gerchinovitz. A chaining algorithm for online nonparametric regression. In Conference on Learning Theory (COLT), pages 764–796, 2015.

[11] Sham M. Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Neural Information Processing Systems (NIPS), pages 927–935, 2011.

[12] Adam Tauman Kalai and Ravi Sastry. The Isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.

[13] Wojciech Kotłowski and Roman Słowiński. Rule learning with monotonicity constraints. In International Conference on Machine Learning (ICML), pages 537–544, 2009.

[14] Wojciech Kotłowski, Wouter M. Koolen, and Alan Malek. Online isotonic regression. In Vitaly Feldman and Alexander Rakhlin, editors, Proceedings of the 29th Annual Conference on Learning Theory (COLT), pages 1165–1189, June 2016.

[15] J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.

[16] Rasmus Kyng, Anup Rao, and Sushant Sachdeva. Fast, provable algorithms for isotonic regression in all ℓp-norms. In Neural Information Processing Systems (NIPS), 2015.

[17] Ronny Luss, Saharon Rosset, and Moni Shahar. Efficient regularized isotonic regression with application to gene–gene interaction search. Annals of Applied Statistics, 6(1):253–283, 2012.

[18] Aditya Krishna Menon, Xiaoqian Jiang, Shankar Vembu, Charles Elkan, and Lucila Ohno-Machado. Predicting accurate probabilities with a ranking loss. In International Conference on Machine Learning (ICML), 2012.

[19] T. Moon, A. Smola, Y. Chang, and Z. Zheng. IntervalRank: Isotonic regression with listwise and pairwise constraints. In WSDM, pages 151–160.
ACM, 2010.

[20] Harikrishna Narasimhan and Shivani Agarwal. On the relationship between binary classification, bipartite ranking, and binary class probability estimation. In Neural Information Processing Systems (NIPS), pages 2913–2921, 2013.

[21] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In ICML, volume 119, pages 625–632. ACM, 2005.

[22] G. Obozinski, C. E. Grant, G. R. G. Lanckriet, M. I. Jordan, and W. W. Noble. Consistent probabilistic outputs for protein function prediction. Genome Biology, 2008.

[23] Alexander Rakhlin and Karthik Sridharan. Online nonparametric regression. In Conference on Learning Theory (COLT), pages 1232–1264, 2014.

[24] T. Robertson, F. T. Wright, and R. L. Dykstra. Order Restricted Statistical Inference. John Wiley & Sons, 1998.

[25] Mario Stylianou and Nancy Flournoy. Dose finding using the biased coin up-and-down design and isotonic regression. Biometrics, 58(1):171–177, 2002.

[26] Sara Van de Geer. Estimating a regression function. Annals of Statistics, 18:907–924, 1990.

[27] Vladimir Vovk, Ivan Petej, and Valentina Fedorova. Large-scale probabilistic predictors with and without guarantees of validity. In Neural Information Processing Systems (NIPS), pages 892–900, 2015.

[28] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 694–699, 2002.

[29] Cun-Hui Zhang. Risk bounds in isotonic regression.
The Annals of Statistics, 30(2):528–555, 2002.