{"title": "Designing smoothing functions for improved worst-case competitive ratio in online optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3287, "page_last": 3295, "abstract": "Online optimization covers problems such as online resource allocation, online bipartite matching, adwords (a central problem in e-commerce and advertising), and adwords with separable concave returns. We analyze the worst case competitive ratio of two primal-dual algorithms for a class of online convex (conic) optimization problems that contains the previous examples as special cases defined on the positive orthant. We derive a sufficient condition on the objective function that guarantees a constant worst case competitive ratio (greater than or equal to $\\frac{1}{2}$) for monotone objective functions. We provide new examples of online problems on the positive orthant % and the positive semidefinite cone that satisfy the sufficient condition. We show how smoothing can improve the competitive ratio of these algorithms, and in particular for separable functions, we show that the optimal smoothing can be derived by solving a convex optimization problem. This result allows us to directly optimize the competitive ratio bound over a class of smoothing functions, and hence design effective smoothing customized for a given cost function.", "full_text": "Designing smoothing functions for improved\n\nworst-case competitive ratio in online optimization\n\nReza Eghbali\n\nDepartment of Electrical Engineering\n\nUniversity of Washington\n\nSeattle, WA 98195\neghbali@uw.edu\n\nMaryam Fazel\n\nDepartment of Electrical Engineering\n\nUniversity of Washington\n\nSeattle, WA 98195\nmfazel@uw.edu\n\nAbstract\n\nOnline optimization covers problems such as online resource allocation, online\nbipartite matching, adwords (a central problem in e-commerce and advertising),\nand adwords with separable concave returns. 
We analyze the worst-case competitive ratio of two primal-dual algorithms for a class of online convex (conic) optimization problems that contains the previous examples as special cases defined on the positive orthant. We derive a sufficient condition on the objective function that guarantees a constant worst-case competitive ratio (greater than or equal to 1/2) for monotone objective functions. We provide new examples of online problems on the positive orthant that satisfy the sufficient condition. We show how smoothing can improve the competitive ratio of these algorithms, and in particular for separable functions, we show that the optimal smoothing can be derived by solving a convex optimization problem. This result allows us to directly optimize the competitive ratio bound over a class of smoothing functions, and hence design effective smoothing customized for a given cost function.

1 Introduction

Given a proper convex cone K ⊂ R^n, let ψ : K ↦ R be an upper semi-continuous concave function. Consider the optimization problem

    maximize    ψ( Σ_{t=1}^m A_t x_t )
    subject to  x_t ∈ F_t,  ∀t ∈ [m],        (1)

where for all t ∈ [m] := {1, 2, ..., m}, x_t ∈ R^l are the optimization variables and F_t are compact convex constraint sets. We assume A_t ∈ R^{n×l} maps F_t to K; for example, when K = R^n_+ and F_t ⊂ R^l_+, this assumption is satisfied if A_t has nonnegative entries. We consider problem (1) in the online setting, where it can be viewed as a sequential game between a player (online algorithm) and an adversary. At each step t, the adversary reveals A_t, F_t and the algorithm chooses x̂_t ∈ F_t. The performance of the algorithm is measured by its competitive ratio, i.e., the ratio of the objective value at x̂_1, ..., x̂_m to the offline optimum.
Problem (1) covers (convex relaxations of) various online combinatorial problems including online bipartite matching [14], the "adwords" problem [16], and the secretary problem [15]. More generally, it covers online linear programming (LP) [6], online packing/covering with convex cost [3, 4, 7], and a generalization of adwords [8]. In this paper, we study the case where ∂ψ(u) ⊂ K* for all u ∈ K, i.e., ψ is monotone with respect to the cone K.

The competitive performance of online algorithms has been studied mainly under the worst-case model (e.g., in [16]) or stochastic models (e.g., in [15]). In the worst-case model one is interested in lower bounds on the competitive ratio that hold for any (A_1, F_1), ..., (A_m, F_m). In stochastic models, the adversary chooses a probability distribution from a family of distributions to generate (A_1, F_1), ..., (A_m, F_m), and the competitive ratio is calculated using the expected value of the algorithm's objective value. Online bipartite matching and its generalization, the adwords problem, are the two main problems that have been studied under the worst-case model. The greedy algorithm achieves a competitive ratio of 1/2, while the optimal algorithm achieves a competitive ratio of 1 − 1/e (as the bid-to-budget ratio goes to zero) [16, 5, 14, 13]. A more general version of adwords in which each agent (advertiser) has a concave cost has been studied in [8].

The majority of algorithms proposed for the problems mentioned above rely on a primal-dual framework [5, 6, 3, 8, 4]. The differentiating point among the algorithms is the method of updating the dual variable at each step, since once the dual variable is updated the primal variable can be assigned using a simple complementarity condition. A simple and efficient method of updating the dual variable is through a first order online learning step.
For example, the algorithm stated in [9] for online linear programming uses mirror descent with entropy regularization (the multiplicative weights update algorithm) once written in the primal-dual language. Recently, the work in [9] was independently extended to the random permutation model in [12, 2, 11]. In [2], the authors provide a competitive difference bound for online convex optimization under the random permutation model as a function of the regret bound for the online learning algorithm applied to the dual.

In this paper, we consider two versions of the greedy algorithm for problem (1): a sequential update and a simultaneous update algorithm. The simultaneous update algorithm, Algorithm 2, provides a direct saddle-point representation of what has been described informally in the literature as "continuous updates" of primal and dual variables. This saddle-point representation allows us to generalize this type of update to non-smooth functions. In Section 2, we bound the competitive ratios of the two algorithms. A sufficient condition on the objective function that guarantees a non-trivial worst-case competitive ratio is introduced. We show that the competitive ratio is at least 1/2 for a monotone non-decreasing objective function. Examples that satisfy the sufficient condition (on the positive orthant and the positive semidefinite cone) are given. In Section 3, we derive optimal algorithms as variants of the greedy algorithm applied to a smoothed version of ψ. For example, Nesterov smoothing provides an optimal algorithm for the adwords problem. The main contribution of this paper is to show how one can derive the optimal smoothing function (or, from the dual point of view, the optimal regularization function) for separable ψ on the positive orthant by solving a convex optimization problem. This gives an implementable algorithm that achieves the optimal competitive ratio derived in [8].
We also show how this convex optimization problem can be modified for the design of a smoothing function specifically for the sequential algorithm. In contrast, [8] only considers continuous updates. The algorithms considered in this paper and their general analysis are the same as those we considered in [10]. In [10], the focus is on non-monotone functions and online problems on the positive semidefinite cone; the focus of this paper, however, is on monotone functions on the positive orthant. In [10], we only considered Nesterov smoothing and only derived competitive ratio bounds for the simultaneous algorithm.

Notation. Given a function ψ : R^n ↦ R, ψ* denotes the concave conjugate of ψ, defined as ψ*(y) = inf_u ⟨y, u⟩ − ψ(u) for all y ∈ R^n. For a concave function ψ, ∂ψ(u) denotes the set of supergradients of ψ at u, i.e., the set of all y ∈ R^n such that ψ(u′) ≤ ⟨y, u′ − u⟩ + ψ(u) for all u′ ∈ R^n. The set ∂ψ is related to the concave conjugate ψ* as follows: for an upper semi-continuous concave function ψ, we have ∂ψ(u) = argmin_y ⟨y, u⟩ − ψ*(y). A differentiable function ψ has a Lipschitz continuous gradient with respect to ‖·‖ with continuity parameter 1/μ > 0 if for all u, u′ ∈ R^n, ‖∇ψ(u′) − ∇ψ(u)‖_* ≤ (1/μ)‖u − u′‖, where ‖·‖_* is the dual norm to ‖·‖. The dual cone K* of a cone K ⊂ R^n is defined as K* = {y | ⟨y, u⟩ ≥ 0 ∀u ∈ K}. Two examples of self-dual cones are the positive orthant R^n_+ and the cone of n × n positive semidefinite matrices S^n_+.
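The concave conjugate can be explored numerically. The sketch below (an illustration, not part of the paper) approximates ψ*(y) = inf_u ⟨y, u⟩ − ψ(u) on a grid for the scalar adwords-style cost ψ(u) = min(u, 1), for which Section 2 notes that ψ*(y) = y − 1 coordinate-wise:

```python
import numpy as np

def concave_conjugate(psi_vals, y, u_grid):
    """Grid approximation of psi*(y) = inf_u <y, u> - psi(u)."""
    return float(np.min(y * u_grid - psi_vals))

# Scalar adwords-style cost psi(u) = min(u, 1).
u_grid = np.linspace(0.0, 10.0, 100001)
psi_vals = np.minimum(u_grid, 1.0)

# For y in [0, 1] the infimum is attained at u = 1, giving psi*(y) = y - 1.
conj = {y: concave_conjugate(psi_vals, y, u_grid) for y in (0.0, 0.3, 0.7, 1.0)}
```

Since the grid contains u = 1, the approximation is exact here for y ∈ [0, 1].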
A proper cone (pointed convex cone with nonempty interior) K induces a partial ordering on R^n, denoted by ≤_K and defined as x ≤_K y ⇔ y − x ∈ K.

1.1 Two primal-dual algorithms

The (Fenchel) dual problem for problem (1) is given by

    minimize  Σ_{t=1}^m σ_t(A_t^T y) − ψ*(y),        (2)

where the optimization variable is y ∈ R^n, and σ_t denotes the support function of the set F_t, defined as σ_t(z) = sup_{x∈F_t} ⟨x, z⟩. A pair (x*, y*) ∈ (F_1 × ... × F_m) × K* is an optimal primal-dual pair if and only if

    x*_t ∈ argmax_{x∈F_t} ⟨x, A_t^T y*⟩  ∀t ∈ [m],        y* ∈ ∂ψ( Σ_{t=1}^m A_t x*_t ).

Based on these optimality conditions, we consider two algorithms. Algorithm 1 updates the primal and dual variables sequentially, by maintaining a dual variable ŷ_t and using it to assign x̂_t ∈ argmax_{x∈F_t} ⟨x, A_t^T ŷ_t⟩. The algorithm then updates the dual variable based on the second optimality condition. By the assignment rule, we have A_t x̂_t ∈ ∂σ_t(ŷ_t), and the dual variable update can be viewed as ŷ_{t+1} ∈ argmin_y ⟨ Σ_{s=1}^t A_s x̂_s, y ⟩ − ψ*(y).
Therefore, the dual update is the same as the update in dual averaging [18] or the Follow The Regularized Leader (FTRL) algorithm [20, 19, 1] with regularization −ψ*(y).

Algorithm 1 Sequential Update
    Initialize ŷ_1 ∈ ∂ψ(0)
    for t ← 1 to m do
        Receive A_t, F_t
        x̂_t ∈ argmax_{x∈F_t} ⟨x, A_t^T ŷ_t⟩
        ŷ_{t+1} ∈ ∂ψ( Σ_{s=1}^t A_s x̂_s )
    end for

Algorithm 2 updates the primal and dual variables simultaneously, ensuring that

    x̃_t ∈ argmax_{x∈F_t} ⟨x, A_t^T ỹ_t⟩,        ỹ_t ∈ ∂ψ( Σ_{s=1}^t A_s x̃_s ).

This algorithm is inherently more complicated than Algorithm 1, since finding x̃_t involves solving a saddle-point problem. This can be solved by a first order method such as the mirror descent algorithm for saddle-point problems. In contrast, the primal and dual updates in Algorithm 1 solve two separate maximization and minimization problems.¹

Algorithm 2 Simultaneous Update
    for t ← 1 to m do
        Receive A_t, F_t
        (ỹ_t, x̃_t) ∈ argmin_y max_{x∈F_t} ⟨y, A_t x + Σ_{s=1}^{t−1} A_s x̃_s⟩ − ψ*(y)
    end for

2 Competitive ratio bounds and examples for ψ

In this section, we derive bounds on the competitive ratios of Algorithms 1 and 2 by bounding their respective duality gaps. We begin by stating a sufficient condition on ψ that leads to non-trivial competitive ratios, and we assume this condition holds in the rest of the paper. Roughly, one can interpret this assumption as having "diminishing returns" with respect to the ordering induced by a cone.
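Before turning to examples, the sequential scheme can be made concrete. Below is a minimal sketch (our illustration, not code from the paper) of Algorithm 1 specialized to the basic adwords objective ψ(u) = Σ_i min(u_i, 1), with F_t the simplex and A_t = diag(b_t); the instance is the classic hard input on which greedy attains exactly its 1/2 guarantee:

```python
import numpy as np

def sequential_adwords(bids):
    """Sketch of Algorithm 1 for psi(u) = sum_i min(u_i, 1), F_t the
    simplex, A_t = diag(b_t).  The supergradient used for the dual
    update is y_i = 1 while budget i is unspent and y_i = 0 once it is
    exhausted; ties go to the lowest index, as np.argmax does."""
    n = len(bids[0])
    u = np.zeros(n)                        # accumulated spend, sum_s A_s x_s
    y = np.ones(n)                         # y_1 in the superdifferential at 0
    for b in bids:
        i = int(np.argmax(np.asarray(b) * y))   # x_t: a vertex of the simplex
        u[i] += b[i]
        y = (u < 1.0).astype(float)        # dual update: y_{t+1} in dpsi(u)
    return float(np.minimum(u, 1.0).sum())

# Classic hard instance: both advertisers bid 1 for query 1, and only
# advertiser 0 bids for query 2.  Greedy spends budget 0 on query 1 and
# then wastes query 2, while the offline optimum serves both queries.
value = sequential_adwords([(1.0, 1.0), (1.0, 0.0)])
opt = 2.0   # by hand: query 1 -> advertiser 1, query 2 -> advertiser 0
```

On this instance the algorithm collects value 1 against an offline optimum of 2, i.e., exactly the ratio 1/2 discussed in the introduction.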
Examples of functions that satisfy this assumption will appear later in this section.

Assumption 1. Whenever u ≥_K v, there exists y ∈ ∂ψ(u) that satisfies y ≤_{K*} z for all z ∈ ∂ψ(v).

When ψ is differentiable, Assumption 1 simplifies to u ≥_K v ⇒ ∇ψ(u) ≤_{K*} ∇ψ(v). That is, the gradient, as a map from R^n (equipped with ≤_K) to R^n (equipped with ≤_{K*}), is order-reversing. When ψ is twice differentiable, Assumption 1 is equivalent to ⟨w, ∇²ψ(u)v⟩ ≤ 0 for all u, v, w ∈ K. For example, this is equivalent to the Hessian being element-wise non-positive when K = R^n_+.

Define ỹ_{m+1} to be the minimum element of ∂ψ( Σ_{t=1}^m A_t x̃_t ) with respect to the ordering ≤_{K*} (such an element exists in the superdifferential by Assumption 1). Let P_seq = ψ( Σ_{t=1}^m A_t x̂_t ) and P_sim = ψ( Σ_{t=1}^m A_t x̃_t ) denote the primal objective values for the primal solutions produced by Algorithms 1 and 2, and let D_seq = Σ_{t=1}^m σ_t(A_t^T ŷ_t) − ψ*(ŷ_{m+1}) and D_sim = Σ_{t=1}^m σ_t(A_t^T ỹ_t) − ψ*(ỹ_{m+1}) denote the corresponding dual objective values. The next lemma provides a lower bound on the duality gaps of both algorithms.

¹Also, if the original problem is a convex relaxation of an integer program, meaning that each F_t = conv ℱ_t where ℱ_t ⊂ Z^l, then x̂_t can always be chosen to be integral, while integrality may not hold for the solution of the second algorithm.

Lemma 1. The duality gaps for the two algorithms can be lower bounded as

    P_sim − D_sim ≥ ψ*(ỹ_{m+1}) + ψ(0),        P_seq − D_seq ≥ ψ*(ŷ_{m+1}) + ψ(0) + Σ_{t=1}^m ⟨A_t x̂_t, ŷ_{t+1} − ŷ_t⟩.        (3)

Furthermore, if ψ has a Lipschitz
continuous gradient with parameter 1/μ with respect to ‖·‖, then

    P_seq − D_seq ≥ ψ*(ŷ_{m+1}) + ψ(0) − (1/(2μ)) Σ_{t=1}^m ‖A_t x̂_t‖².        (4)

Note that the right-hand side of (3) is exactly the regret bound of the FTRL algorithm (with a negative sign) [19]. The proof is given in the appendix. To simplify the notation in the rest of the paper, we assume ψ(0) = 0 by replacing ψ(u) with ψ(u) − ψ(0). To quantify the competitive ratio of the algorithms, we define α_ψ as

    α_ψ = sup{ c | ψ*(y) ≥ cψ(u), y ∈ ∂ψ(u), u ∈ K }.

Since ψ*(y) + ψ(u) = ⟨y, u⟩ for all y ∈ ∂ψ(u), α_ψ is equivalently

    α_ψ = sup{ c | ⟨y, u⟩ ≥ (c + 1)ψ(u), y ∈ ∂ψ(u), u ∈ K }.        (5)

Note that −1 ≤ α_ψ ≤ 0, since for any u ∈ K and y ∈ ∂ψ(u), by concavity of ψ and the fact that y ∈ K*, we have 0 ≤ ⟨y, u⟩ ≤ ψ(u) − ψ(0). If ψ is a linear function then α_ψ = 0, while if 0 ∈ ∂ψ(u) for some u ∈ K, then α_ψ = −1.

The next theorem provides lower bounds on the competitive ratio of the two algorithms.

Theorem 1. If Assumption 1 holds, we have

    P_sim ≥ (1/(1 − α_ψ)) D⋆,        P_seq ≥ (1/(1 − α_ψ)) ( D⋆ + Σ_{t=1}^m ⟨A_t x̂_t, ŷ_{t+1} − ŷ_t⟩ ),

where D⋆ is the dual optimal objective.
If ψ has a Lipschitz continuous gradient with parameter 1/μ with respect to ‖·‖, then

    P_seq ≥ (1/(1 − α_ψ)) ( D⋆ − (1/(2μ)) Σ_{t=1}^m ‖A_t x̂_t‖² ).

Proof: Consider the simultaneous update algorithm. We have Σ_{s=1}^t A_s x̃_s ≤_K Σ_{s=1}^m A_s x̃_s for all t, since A_s F_s ⊂ K for all s. Since ỹ_t ∈ ∂ψ( Σ_{s=1}^t A_s x̃_s ) and ỹ_{m+1} was picked to be the minimum element of ∂ψ( Σ_{s=1}^m A_s x̃_s ) with respect to ≤_{K*}, by Assumption 1 we have ỹ_t ≥_{K*} ỹ_{m+1}. Since A_t x ∈ K for all x ∈ F_t, we get ⟨A_t x, ỹ_t⟩ ≥ ⟨A_t x, ỹ_{m+1}⟩; therefore, σ_t(A_t^T ỹ_t) ≥ σ_t(A_t^T ỹ_{m+1}). Thus

    D_sim = Σ_{t=1}^m σ_t(A_t^T ỹ_t) − ψ*(ỹ_{m+1}) ≥ Σ_{t=1}^m σ_t(A_t^T ỹ_{m+1}) − ψ*(ỹ_{m+1}) ≥ D⋆.        (6)

Now Lemma 1 and the definition of α_ψ give the desired result. The proof for Algorithm 1 follows similar steps. □

We now consider examples of ψ that satisfy Assumption 1 and derive lower bounds on α_ψ for those examples.

Examples on the positive orthant. Let K = R^n_+ and note that K* = K. To simplify the notation we use ≤ instead of ≤_{R^n_+}. Assumption 1 is satisfied for a twice differentiable function if and only if the Hessian is element-wise non-positive over R^n_+.
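As a scalar illustration of (5) (ours, not an example from the paper): for ψ(u) = u^p on R_+ with p ∈ (0, 1), the gradient satisfies ⟨ψ′(u), u⟩ = pψ(u), so α_ψ = p − 1 and Theorem 1 gives a competitive ratio of at least 1/(2 − p). A numerical check for p = 1/2:

```python
import numpy as np

# alpha_psi = inf_u <grad psi(u), u> / psi(u) - 1   (scalar version of (5))
u = np.linspace(1e-6, 100.0, 200001)

psi = np.sqrt(u)                       # psi(u) = u^p with p = 1/2
grad = 0.5 / np.sqrt(u)
alpha = float(np.min(grad * u / psi)) - 1.0   # ratio is p everywhere, so alpha = p - 1

ratio_bound = 1.0 / (1.0 - alpha)      # Theorem 1 bound: 1/(2 - p) = 2/3 here
```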
If ψ is separable, i.e., ψ(u) = Σ_{i=1}^n ψ_i(u_i), Assumption 1 is satisfied, since by concavity of each ψ_i we have ∂ψ_i(u_i) ≥ ∂ψ_i(v_i) when u_i ≤ v_i. In the basic adwords problem, for all t, F_t = {x ∈ R^l_+ | 1^T x ≤ 1}, A_t is a diagonal matrix with non-negative entries, and

    ψ(u) = Σ_{i=1}^n u_i − Σ_{i=1}^n (u_i − 1)_+,        (7)

where (·)_+ = max{·, 0}. In this problem, ψ*(y) = 1^T(y − 1). Since 0 ∈ ∂ψ(1), we have α_ψ = −1 by (5); therefore, the competitive ratio of Algorithm 2 is 1/2. Let r = max_{t,i,j} A_{t,i,j}; then we have Σ_{t=1}^m ⟨A_t x̂_t, ŷ_t − ŷ_{t+1}⟩ ≤ nr. Therefore, the competitive ratio of Algorithm 1 goes to 1/2 as r (the bid-to-budget ratio) goes to zero. In adwords with concave returns, studied in [8], A_t is diagonal for all t and ψ is separable.²

For any p ≥ 1 let B_p be the l_p-norm ball. We can rewrite the penalty function −Σ_{i=1}^n (u_i − 1)_+ in the adwords objective using the distance from B_∞: we have Σ_{i=1}^n (u_i − 1)_+ = d_1(u, B_∞), where d_1(·, C) is the l_1-norm distance from the set C. For p ∈ [1, ∞), the function −d_1(u, B_p), although not separable, satisfies Assumption 1. The proof is given in the supplementary materials.

Examples on the positive semidefinite cone. Let K = S^n_+ and note that K* = K. Two examples that satisfy Assumption 1 are ψ(U) = log det(U + A_0) and ψ(U) = tr U^p with p ∈ (0, 1). We refer the reader to [10] for examples of online problems whose objective involves log det, and for the competitive ratio analysis of the simultaneous algorithm for these problems.

3 Smoothing of ψ for improved competitive ratio

The technique of "smoothing" a (potentially non-smooth) objective function, or equivalently adding a strongly convex regularization term to its conjugate function, has been used in several areas. In convex optimization, a general version of this is due to Nesterov [17], and has led to faster convergence rates of first order methods for non-smooth problems.
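Returning briefly to the semidefinite example above: for ψ(U) = log det(U + A_0) we have ∇ψ(U) = (U + A_0)^{-1}, so Assumption 1 amounts to matrix inversion being order-reversing in the semidefinite order. A quick numerical check (our illustration; A_0 = I is an assumed choice):

```python
import numpy as np

# Assumption 1 for psi(U) = log det(U + I):
# U >= V in the PSD order should imply inv(U + I) <= inv(V + I).
V = np.array([[2.0, 1.0], [1.0, 3.0]])   # positive semidefinite
W = np.array([[1.0, 1.0], [1.0, 1.0]])   # positive semidefinite increment
U = V + W                                 # so U >= V in the PSD order

diff = np.linalg.inv(V + np.eye(2)) - np.linalg.inv(U + np.eye(2))
eigs = np.linalg.eigvalsh(diff)           # should all be nonnegative
```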
In this section, we study how replacing ψ with an appropriately smoothed function ψ_S helps improve the performance of the two algorithms discussed in Section 1.1, and show that it provides the optimal competitive ratio for two of the problems mentioned in Section 2, adwords and online LP. We then show how to maximize the competitive ratio of both algorithms for a separable ψ and compute the optimal smoothing by solving a convex optimization problem. This allows us to design the most effective smoothing customized for a given ψ: we maximize the bound on the competitive ratio over a class of smooth functions (see Section 3.2 for details).

Let ψ_S denote an upper semi-continuous concave function (a smoothed version of ψ), and suppose ψ_S satisfies Assumption 1. The algorithms we consider in this section are the same as Algorithms 1 and 2, but with ψ_S replacing ψ. Note that the competitive ratio is computed with respect to the original problem; that is, the offline primal and dual optimal values are still the same P⋆ and D⋆ as before. From Lemma 1, we have D_sim ≤ ψ_S( Σ_{t=1}^m A_t x̃_t ) − ψ*(ỹ_{m+1}) and D_seq ≤ ψ_S( Σ_{t=1}^m A_t x̂_t ) − ψ*(ŷ_{m+1}) − Σ_{t=1}^m ⟨A_t x̂_t, ŷ_{t+1} − ŷ_t⟩. To simplify the notation, assume ψ_S(0) = 0 as before. Define

    α_{ψ,ψ_S} = sup{ c | ψ*(y) ≥ ψ_S(u) + (c − 1)ψ(u), y ∈ ∂ψ_S(u), u ∈ K }.

Then the conclusion of Theorem 1 for Algorithms 1 and 2 applied to the smoothed function holds with α_ψ replaced by α_{ψ,ψ_S}.

3.1 Nesterov Smoothing

We first consider Nesterov smoothing [17], and apply it to examples on the non-negative orthant.
Given a proper upper semi-continuous concave function φ : R^n ↦ R ∪ {−∞}, let

    ψ_S = (ψ* + φ*)*.

Note that ψ_S is the supremal convolution of ψ and φ. If ψ and φ are separable, then ψ_S satisfies Assumption 1 for K = R^n_+. Here we provide an example of Nesterov smoothing for functions on the non-negative orthant.

Adwords: The optimal competitive ratio for the adwords problem is 1 − e^{−1}. This is achieved by smoothing ψ with φ*(y) = Σ_{i=1}^n (y_i − e/(e−1)) log(e − (e−1)y_i) − 2y_i, which gives

    ψ_{S,i}(u_i) − ψ_{S,i}(0) = (e u_i − exp(u_i) + 1)/(e − 1)  for u_i ∈ [0, 1],    and    1/(e − 1)  for u_i > 1.

²Note that in this case one can remove the assumption that ∂ψ_i ⊂ R_+, since if ỹ_{t,i} = 0 for some t and i, then x̃_{s,i} = 0 for all s ≥ t.

3.2 Computing optimal smoothing for separable functions on R^n_+

We now tackle the problem of finding the optimal smoothing for separable functions on the positive orthant, which, as we show in an example at the end of this section, is not necessarily given by Nesterov smoothing. Given a separable monotone ψ(u) = Σ_{i=1}^n ψ_i(u_i) and ψ_S(u) = Σ_{i=1}^n ψ_{S,i}(u_i) on R^n_+, we have α_{ψ,ψ_S} ≥ min_i α_{ψ_i,ψ_{S,i}}. To simplify the notation, we drop the index i and consider ψ : R_+ ↦ R. We formulate the problem of finding ψ_S to maximize α_{ψ,ψ_S} as an optimization problem. In Section 4 we discuss the relation between this optimization method and the optimal algorithm presented in [8].
We set ψ_S(u) = ∫_0^u y(s) ds with y a continuous function (y ∈ C[0, ∞)), and state the infinite dimensional convex optimization problem with y as a variable:

    minimize    β
    subject to  ∫_0^u y(s) ds − ψ*(y(u)) ≤ βψ(u)   ∀u ∈ [0, ∞),
                y ∈ C[0, ∞),        (8)

where β = 1 − α_{ψ,ψ_S} (Theorem 1 describes the dependence of the competitive ratios on this parameter). Note that we have not imposed any condition requiring y to be non-increasing (i.e., the corresponding ψ_S to be concave). The next lemma establishes that every feasible solution to problem (8) can be turned into a non-increasing solution.

Lemma 2. Let (y, β) be a feasible solution for problem (8) and define ȳ(t) = inf_{s≤t} y(s). Then (ȳ, β) is also a feasible solution for problem (8).

In particular, if (y, β) is an optimal solution, then so is (ȳ, β). The proof is given in the supplement. Revisiting the adwords problem, we observe that the optimal solution is given by y(u) = ((e − exp(u))/(e − 1))_+, which is the derivative of the smooth function we derived using Nesterov smoothing in Section 3.1. The optimality of this y can be established by providing a dual certificate: a measure ν corresponding to the inequality constraint that together with y satisfies the optimality conditions. If we set dν = exp(1 − u)/(e − 1) du, the optimality conditions are satisfied with β = (1 − 1/e)^{−1}. Also note that if ψ plateaus (e.g., as in the adwords objective), then one can replace problem (8) with a problem over a finite horizon.

Theorem 2. Suppose ψ(t) = c on [u′, ∞) (ψ plateaus).
Then problem (8) is equivalent to

    minimize    β
    subject to  ∫_0^u y(s) ds − ψ*(y(u)) ≤ βψ(u)   ∀u ∈ [0, u′],
                y(u′) = 0,  y ∈ C[0, u′].        (9)

So for a function ψ with a plateau, one can discretize problem (9) to get a finite dimensional problem:

    minimize    β
    subject to  h Σ_{s=1}^t y[s] − ψ*(y[t]) ≤ βψ(ht)   ∀t ∈ [d],
                y[d] = 0,        (10)

where h = u′/d is the discretization step. Figure 1a shows the optimal smoothing for the piecewise linear function ψ(u) = min(0.75, u, 0.5u + 0.25), obtained by solving problem (10). We point out that the optimal smoothing for this function is not given by Nesterov smoothing (even though the optimal smoothing can be derived by Nesterov smoothing for a piecewise linear function with only two pieces, like the adwords cost function). Figure 1d shows the difference between the conjugate of the optimal smoothing function and ψ* for the piecewise linear function, which we can see is not concave. We simulated the performance of the simultaneous algorithm on a dataset with n = m, F_t the simplex, and A_t diagonal. We varied m in the range from 1 to 30 and for each m calculated the smallest competitive ratio achieved by the algorithm over (10m)² random permutations of A_1, ..., A_m. Figure 1i depicts this quantity vs. m for the optimal smoothing and the Nesterov smoothing. For the Nesterov smoothing we used the function φ*(y) = (y − √e/(√e − 1)) log(√e − (√e − 1)y) − (3/2)y.

In cases where a bound u_max on Σ_{t=1}^m A_t F_t is known, we can restrict u to [0, u_max] and discretize problem (8) over this interval.
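Problem (10) is small enough to solve directly with an off-the-shelf LP solver. The sketch below (our illustration; `scipy` is an assumed dependency) discretizes the adwords case, where ψ(u) = min(u, 1), ψ*(y) = y − 1 on [0, 1], and the plateau point is u′ = 1; the optimal value should approach β = e/(e − 1) ≈ 1.582, matching the certificate discussed for problem (8):

```python
import numpy as np
from scipy.optimize import linprog

# Discretization (10) for the adwords cost psi(u) = min(u, 1) with
# plateau point u' = 1, so psi(h t) = h t and psi*(y) = y - 1 on [0, 1].
d = 400
h = 1.0 / d
t = np.arange(1, d + 1)

# Variables: y[1..d], beta.  Constraint for each t:
#   h * sum_{s<=t} y[s] - (y[t] - 1) <= beta * h * t
# rearranged so beta appears on the left-hand side.
A = np.zeros((d, d + 1))
for k in range(d):
    A[k, : k + 1] = h              # running-sum coefficients
    A[k, k] -= 1.0                 # the -y[t] term from -psi*(y[t])
    A[k, d] = -h * t[k]            # -beta * psi(h t)
b = -np.ones(d)

# Supergradients of psi lie in [0, 1]; y[d] = 0 enforces the plateau condition.
bounds = [(0.0, 1.0)] * (d - 1) + [(0.0, 0.0)] + [(0.0, None)]
res = linprog(c=[0.0] * d + [1.0], A_ub=A, b_ub=b, bounds=bounds,
              method="highs")
beta = res.fun                     # close to e/(e - 1) for fine grids
```

The recovered dual curve tracks y(u) = (e − e^u)/(e − 1), so 1/β recovers the 1 − 1/e competitive ratio.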
However, the conclusion of Lemma 2 does not hold for a finite horizon, and we need to impose the additional linear constraints y[t] ≤ y[t − 1] to ensure the monotonicity of y. We find the optimal smoothing for two examples of this kind: ψ(u) = log(1 + u) over [0, 100] (Figure 1b), and ψ(u) = √u over [0, 100] (Figure 1c). Figure 1e shows the competitive ratio achieved with the optimal smoothing of ψ(u) = log(1 + u) over [0, u_max] as a function of u_max. Figure 1f depicts this quantity for ψ(u) = √u.

3.3 Competitive ratio bound for the sequential algorithm

In this section we provide a lower bound on the competitive ratio of the sequential algorithm (Algorithm 1). We then modify problem (8) to find a smoothing function that optimizes this competitive ratio bound for the sequential algorithm.

Theorem 3. Suppose ψ_S is differentiable on an open set containing K and satisfies Assumption 1. In addition, suppose there exists c ∈ K such that A_t F_t ≤_K c for all t. Then

    P_seq ≥ (1/(1 − α_{ψ,ψ_S} + κ_{c,ψ,ψ_S})) D⋆,

where κ is given by

    κ_{c,ψ,ψ_S} = inf{ r | ⟨c, ∇ψ_S(0) − ∇ψ_S(u)⟩ ≤ rψ(u), u ∈ K }.

Proof: Since ψ_S satisfies Assumption 1, we have ŷ_{t+1} ≤_{K*} ŷ_t. Therefore, we can write

    Σ_{t=1}^m ⟨A_t x̂_t, ŷ_t − ŷ_{t+1}⟩ ≤ Σ_{t=1}^m ⟨c, ŷ_t − ŷ_{t+1}⟩ = ⟨c, ŷ_1 − ŷ_{m+1}⟩.        (11)

Now, by combining the duality gap bound of Lemma 1 with (11), we get D_seq ≤ ψ_S( Σ_{t=1}^m A_t x̂_t ) − ψ*(ŷ_{m+1}) + ⟨c, ∇ψ_S(0) − ∇ψ_S( Σ_{t=1}^m A_t x̂_t )⟩.
The conclusion follows from the definitions of α_{ψ,ψ_S} and κ_{c,ψ,ψ_S} and the fact that D_seq ≥ D⋆. □

Based on the result of the previous theorem, we can modify the optimization problem set up in Section 3.2 for separable functions on R^n_+ to maximize the lower bound on the competitive ratio of the sequential algorithm. Note that when ψ and ψ_S are separable, we have κ_{c,ψ,ψ_S} ≤ max_i κ_{c_i,ψ_i,ψ_{S,i}}. Therefore, as in the previous section, to simplify the notation we drop the index i and assume ψ is a function of a scalar variable. The optimization problem for finding ψ_S that minimizes κ_{c,ψ,ψ_S} − α_{ψ,ψ_S} is as follows:

    minimize    β
    subject to  ∫_0^u y(s) ds + c(ψ′(0) − y(u)) − ψ*(y(u)) ≤ βψ(u)   ∀u ∈ [0, ∞),
                y ∈ C[0, ∞).        (12)

For adwords, the optimal solution is given by β = (1 − exp(−1/(c+1)))^{−1} and y(u) = β(1 − exp((u−1)/(c+1)))_+, which gives a competitive ratio of 1 − exp(−1/(1+c)). In Figure 1h we have plotted the competitive ratio achieved by solving problem (12) for ψ(u) = log(1 + u) with u_max = 100 as a function of c. Figure 1g shows the competitive ratio as a function of c for the piecewise linear function ψ(u) = min(0.75, u, 0.5u + 0.25).

4 Discussion and Related Work

We discuss results and papers from two communities, computer science theory and machine learning, related to this work.

Online optimization. In [8], the authors proposed an optimal algorithm for adwords with differentiable concave returns (see examples in Section 2).
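The stated solution of (12) for adwords can be verified directly: with ψ(u) = min(u, 1), ψ′(0) = 1, and ψ*(y) = y − 1, the constraint of (12) holds with equality on [0, 1] for the stated β and y. A numerical check (our illustration, using the closed-form integral of y):

```python
import numpy as np

def check(c, u):
    """Constraint of (12) at u, for the claimed adwords solution.
    On [0, 1] the positive part in y is inactive, so it is omitted."""
    beta = 1.0 / (1.0 - np.exp(-1.0 / (c + 1.0)))
    y_u = beta * (1.0 - np.exp((u - 1.0) / (c + 1.0)))
    # int_0^u y(s) ds in closed form
    Y = beta * (u - (c + 1.0) * (np.exp((u - 1.0) / (c + 1.0))
                                 - np.exp(-1.0 / (c + 1.0))))
    lhs = Y + c * (1.0 - y_u) - (y_u - 1.0)   # c*(psi'(0) - y) - psi*(y)
    return lhs, beta * u                       # rhs = beta * psi(u) on [0, 1]

errs = [abs(l - r) for c in (0.0, 0.25, 1.0)
        for u in (0.0, 0.3, 0.7, 1.0)
        for (l, r) in [check(c, u)]]
max_err = max(errs)
```

At c = 0 this reduces to β = (1 − 1/e)^{−1}, consistent with the Nesterov smoothing of Section 3.1.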
Here, "optimal" means that they construct an instance of the problem for which the competitive ratio bound cannot be improved, hence showing the bound is tight. The algorithm is stated and analyzed for a twice differentiable, separable ψ(u). The assignment rule for the primal variables in their proposed algorithm is explained as a continuous process. A closer look reveals that this algorithm falls within the framework of Algorithm 2, with the only difference being that at each step, (x̃_t, ỹ_t) are chosen such that

    x̃_t ∈ argmax_{x∈F_t} ⟨x, A_t^T ỹ_t⟩,        ỹ_{t,i} = ∇ψ_i(v_i(u_i)),  u_i = ( Σ_{s=1}^t A_s x̃_s )_i,  ∀i ∈ [n],

Figure 1: Optimal smoothing for ψ(u) = min(0.75, u, 0.5u + 0.25) (a), ψ(u) = log(1 + u) over [0, 100] (b), and ψ(u) = √u over [0, 100] (c). ψ_S* − ψ* for the piecewise linear function (d). The competitive ratio achieved by the optimal smoothing as a function of u_max for ψ(u) = log(1 + u) (e) and ψ(u) = √u (f). The competitive ratio achieved by the optimal smoothing for the sequential algorithm as a function of c for ψ(u) = min(0.75, u, 0.5u + 0.25) (g) and ψ(u) = log(1 + u) with u_max = 100 (h). (i) Competitive ratio of the simultaneous algorithm for ψ(u) = min(0.75, u, 0.5u + 0.25) as a function of m with optimal smoothing and Nesterov smoothing (see text).

where v_i : R_+ ↦ R_+ is an increasing differentiable function given as a solution of a nonlinear differential equation that involves ψ_i and may not necessarily have a closed form. The competitive ratio is also given based on the differential equation. They prove that this gives the optimal competitive ratio for the instances where ψ_1 = ψ_2 = ... = ψ_n. Note that this is equivalent to setting ψ_{S,i}(u_i) = ψ_i(v_i(u_i)).
Since $v_i$ is nondecreasing, $\psi_{S,i}$ is a concave function. On the other hand, given a concave function $\psi_{S,i}$ with $\psi_{S,i}(\mathbb{R}_+) \subset \psi_i(\mathbb{R}_+)$, we can set $v_i : \mathbb{R}_+ \to \mathbb{R}_+$ as $v_i(u) = \inf\{z \mid \psi_i(z) \geq \psi_{S,i}(u)\}$. Our formulation in Section 3.2 provides a constructive way of finding the optimal smoothing. It also applies to non-smooth $\psi$.

Online learning. As mentioned before, the dual update in Algorithm 1 is the same as in the Follow-the-Regularized-Leader (FTRL) algorithm with $-\psi^*$ as the regularizer. This primal-dual perspective has been used in [20] for the design and analysis of online learning algorithms. In the online learning literature, the goal is to derive a bound on regret that optimally depends on the horizon $m$. The goal in the current paper is to provide a competitive ratio for the algorithm that depends on the function $\psi$. Regret provides a bound on the duality gap, and in order to obtain a competitive ratio the regularization function should be crafted based on $\psi$. A general choice of regularization that yields an optimal regret bound in terms of $m$ is not enough for a competitive ratio argument; therefore, existing results in online learning do not address our aim.

References

[1] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, pages 263–274, 2008.

[2] Shipra Agrawal and Nikhil R Devanur. Fast algorithms for online stochastic convex programming.
arXiv preprint arXiv:1410.7596, 2014.

[3] Yossi Azar, Ilan Reuven Cohen, and Debmalya Panigrahi. Online covering with convex objectives and applications. arXiv preprint arXiv:1412.3507, 2014.

[4] Niv Buchbinder, Shahar Chen, Anupam Gupta, Viswanath Nagarajan, et al. Online packing and covering framework with convex objectives. arXiv preprint arXiv:1412.8347, 2014.

[5] Niv Buchbinder, Kamal Jain, and Joseph Seffi Naor. Online primal-dual algorithms for maximizing ad-auctions revenue. In Algorithms–ESA 2007, pages 253–264. Springer, 2007.

[6] Niv Buchbinder and Joseph Naor. Online primal-dual algorithms for covering and packing. Mathematics of Operations Research, 34(2):270–286, 2009.

[7] TH Chan, Zhiyi Huang, and Ning Kang. Online convex covering and packing problems. arXiv preprint arXiv:1502.01802, 2015.

[8] Nikhil R Devanur and Kamal Jain. Online matching with concave returns. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pages 137–144. ACM, 2012.

[9] Nikhil R Devanur, Kamal Jain, Balasubramanian Sivan, and Christopher A Wilkens. Near optimal online algorithms and fast approximation algorithms for resource allocation problems. In Proceedings of the 12th ACM Conference on Electronic Commerce, pages 29–38. ACM, 2011.

[10] R. Eghbali, M. Fazel, and M. Mesbahi. Worst case competitive analysis for online conic optimization. In 55th IEEE Conference on Decision and Control (CDC). IEEE, 2016.

[11] Reza Eghbali, Jon Swenson, and Maryam Fazel. Exponentiated subgradient algorithm for online optimization under the random permutation model. arXiv preprint arXiv:1410.7171, 2014.

[12] Anupam Gupta and Marco Molinaro. How the experts algorithm can help solve LPs online. arXiv preprint arXiv:1407.5298, 2014.

[13] Bala Kalyanasundaram and Kirk R Pruhs. An optimal deterministic algorithm for online b-matching.
Theoretical Computer Science, 233(1):319–325, 2000.

[14] Richard M Karp, Umesh V Vazirani, and Vijay V Vazirani. An optimal algorithm for on-line bipartite matching. In Proceedings of the Twenty-Second Annual ACM Symposium on Theory of Computing, pages 352–358. ACM, 1990.

[15] Robert Kleinberg. A multiple-choice secretary algorithm with applications to online auctions. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 630–631. Society for Industrial and Applied Mathematics, 2005.

[16] Aranyak Mehta, Amin Saberi, Umesh Vazirani, and Vijay Vazirani. Adwords and generalized online matching. Journal of the ACM (JACM), 54(5):22, 2007.

[17] Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[18] Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.

[19] Shai Shalev-Shwartz and Yoram Singer. Online learning: Theory, algorithms, and applications. 2007.

[20] Shai Shalev-Shwartz and Yoram Singer. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2-3):115–142, 2007.