{"title": "Online convex optimization for cumulative constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 6137, "page_last": 6146, "abstract": "We propose the algorithms for online convex\n  optimization which lead to cumulative squared constraint violations\n  of the form\n  $\\sum\\limits_{t=1}^T\\big([g(x_t)]_+\\big)^2=O(T^{1-\\beta})$, where\n  $\\beta\\in(0,1)$.  Previous literature has\n  focused on long-term constraints of the form\n  $\\sum\\limits_{t=1}^Tg(x_t)$. There, strictly feasible solutions\n  can cancel out the effects of violated constraints.\n  In contrast, the new form heavily penalizes large constraint\n  violations and cancellation effects cannot occur. \n  Furthermore, useful bounds on the single step constraint violation\n  $[g(x_t)]_+$ are derived.\n  For convex objectives, our regret bounds generalize\n  existing bounds, and for strongly convex objectives we give improved\n  regret bounds.\n  In numerical experiments, we show that our algorithm closely follows\n  the constraint boundary leading to low cumulative violation.", "full_text": "Online convex optimization for cumulative constraints\n\nJianjun Yuan\nDepartment of Electrical and Computer Engineering\nUniversity of Minnesota\nMinneapolis, MN, 55455\nyuanx270@umn.edu\n\nAndrew Lamperski\nDepartment of Electrical and Computer Engineering\nUniversity of Minnesota\nMinneapolis, MN, 55455\nalampers@umn.edu\n\nAbstract\n\nWe propose algorithms for online convex optimization which lead to cumulative squared constraint violations of the form $\\sum_{t=1}^T \\big([g(x_t)]_+\\big)^2 = O(T^{1-\\beta})$, where $\\beta \\in (0,1)$. Previous literature has focused on long-term constraints of the form $\\sum_{t=1}^T g(x_t)$. There, strictly feasible solutions can cancel out the effects of violated constraints. In contrast, the new form heavily penalizes large constraint violations and cancellation effects cannot occur. 
Furthermore, useful bounds on the single step constraint violation $[g(x_t)]_+$ are derived. For convex objectives, our regret bounds generalize existing bounds, and for strongly convex objectives we give improved regret bounds. In numerical experiments, we show that our algorithm closely follows the constraint boundary, leading to low cumulative violation.\n\n1 Introduction\n\nOnline optimization is a popular framework for machine learning, with applications such as dictionary learning [14], auctions [1], classification, and regression [3]. It has also been influential in the development of algorithms in deep learning such as convolutional neural networks [11], deep Q-networks [15], and reinforcement learning [8, 20].\n\nThe general formulation for online convex optimization (OCO) is as follows: At each time $t$, we choose a vector $x_t$ in a convex set $S = \\{x : g(x) \\le 0\\}$. Then we receive a loss function $f_t : S \\to \\mathbb{R}$ drawn from a family of convex functions and we obtain the loss $f_t(x_t)$. In this general setting, there is no constraint on how the sequence of loss functions $f_t$ is generated. See [21] for more details.\n\nThe goal is to generate a sequence of $x_t \\in S$ for $t = 1, 2, \\ldots, T$ to minimize the cumulative regret, which is defined by:\n\n$$\\mathrm{Regret}_T(x^*) = \\sum_{t=1}^T f_t(x_t) - \\sum_{t=1}^T f_t(x^*) \\qquad (1)$$\n\nwhere $x^*$ is the optimal solution to the following problem: $\\min_{x \\in S} \\sum_{t=1}^T f_t(x)$. According to [2], the solution to Problem (1) is called Hannan consistent if $\\mathrm{Regret}_T(x^*)$ is sublinear in $T$.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFor online convex optimization with constraints, a projection operator is typically applied to the updated variables in order to make them feasible at each time step [21, 6, 7]. 
However, when the constraints are complex, the computational burden of the projection may be too high for online computation. To circumvent this dilemma, [13] proposed an algorithm which approximates the true desired projection with a simpler closed-form projection. The algorithm gives a cumulative regret $\\mathrm{Regret}_T(x^*)$ which is upper bounded by $O(\\sqrt{T})$, but the constraint $g(x_t) \\le 0$ may not be satisfied in every time step. Instead, the long-term constraint violation satisfies $\\sum_{t=1}^T g(x_t) \\le O(T^{3/4})$, which is useful when we only require the constraint violation to be non-positive on average: $\\lim_{T \\to \\infty} \\sum_{t=1}^T g(x_t)/T \\le 0$.\n\nMore recently, [10] proposed an adaptive stepsize version of this algorithm which can make $\\mathrm{Regret}_T(x^*) \\le O(T^{\\max\\{\\beta, 1-\\beta\\}})$ and $\\sum_{t=1}^T g(x_t) \\le O(T^{1-\\beta/2})$. Here $\\beta \\in (0,1)$ is a user-determined trade-off parameter. In related work, [19] provides another algorithm which achieves $O(\\sqrt{T})$ regret and a bound of $O(\\sqrt{T})$ on the long-term constraint violation.\n\nIn this paper, we propose two algorithms for the following two different cases:\n\nConvex Case: The first algorithm is for the convex case, which also has the user-determined trade-off as in [10], while the constraint violation measure is stricter. Specifically, we have $\\mathrm{Regret}_T(x^*) \\le O(T^{\\max\\{\\beta, 1-\\beta\\}})$ and $\\sum_{t=1}^T \\big([g(x_t)]_+\\big)^2 \\le O(T^{1-\\beta})$, where $[g(x_t)]_+ = \\max\\{0, g(x_t)\\}$ and $\\beta \\in (0,1)$. Note that the square term heavily penalizes large constraint violations, and constraint violations from one step cannot be canceled out by strictly feasible steps. Additionally, we give a bound on the cumulative constraint violation $\\sum_{t=1}^T [g(x_t)]_+ \\le O(T^{1-\\beta/2})$, which generalizes the bounds from [13, 10].\n\nIn the case of $\\beta = 0.5$, which we call \"balanced\", both $\\mathrm{Regret}_T(x^*)$ and $\\sum_{t=1}^T ([g(x_t)]_+)^2$ have the same upper bound of $O(\\sqrt{T})$. More importantly, our algorithm guarantees that at each time step, the clipped constraint term $[g(x_t)]_+$ is upper bounded by $O(1/T^{1/6})$, which does not follow from the results of [13, 10]. However, our results currently cannot generalize those of [19], which has $\\sum_{t=1}^T g(x_t) \\le O(\\sqrt{T})$. As discussed below, it is unclear how to extend the work of [19] to the clipped constraints, $[g(x_t)]_+$.\n\nStrongly Convex Case: Our second algorithm, for strongly convex functions $f_t(x)$, gives improved upper bounds compared with the previous work in [10]. Specifically, we have $\\mathrm{Regret}_T(x^*) \\le O(\\log(T))$ and $\\sum_{t=1}^T [g(x_t)]_+ \\le O(\\sqrt{\\log(T) T})$. The improved bounds match the regret order of standard OCO from [9], while maintaining a constraint violation of reasonable order.\n\nWe show numerical experiments on three problems. A toy example is used to compare trajectories of our algorithm with those of [10, 13], and we see that our algorithm tightly follows the constraints. The algorithms are also compared on a doubly-stochastic matrix approximation problem [10] and an economic dispatch problem from power systems. In these, our algorithms lead to reasonable objective regret and low cumulative constraint violation.\n\n2 Problem Formulation\n\nThe basic projected gradient algorithm for Problem (1) was defined in [21]. 
At each step $t$, the algorithm takes a gradient step with respect to $f_t$ and then projects onto the feasible set. With some assumptions on $S$ and $f_t$, this algorithm achieves a regret of $O(\\sqrt{T})$.\n\nAlthough the algorithm is simple, it needs to solve a constrained optimization problem at every time step, which might be too time-consuming for online implementation when the constraints are complex. Specifically, in [21], at each iteration $t$, the update rule is:\n\n$$x_{t+1} = \\Pi_S(x_t - \\eta \\nabla f_t(x_t)) = \\arg\\min_{y \\in S} \\|y - (x_t - \\eta \\nabla f_t(x_t))\\|^2 \\qquad (2)$$\n\nwhere $\\Pi_S$ is the projection operation to the set $S$ and $\\|\\cdot\\|$ is the $\\ell_2$ norm.\n\nIn order to lower the computational complexity and accelerate the online processing speed, the work of [13] avoids the convex optimization by projecting the variable onto a fixed ball $B \\supseteq S$, for which the projection always has a closed-form solution. That paper gives an online solution for the following problem:\n\n$$\\min_{x_1, \\ldots, x_T \\in B} \\; \\sum_{t=1}^T f_t(x_t) - \\min_{x \\in S} \\sum_{t=1}^T f_t(x) \\quad \\text{s.t.} \\quad \\sum_{t=1}^T g_i(x_t) \\le 0, \\; i = 1, 2, \\ldots, m \\qquad (3)$$\n\nwhere $S = \\{x : g_i(x) \\le 0, i = 1, 2, \\ldots, m\\} \\subseteq B$. It is assumed that there exist constants $R > 0$ and $r < 1$ such that $rK \\subseteq S \\subseteq RK$, with $K$ being the unit $\\ell_2$ ball centered at the origin and $B = RK$.\n\nCompared to Problem (1), which requires that $x_t \\in S$ for all $t$, Problem (3) only requires the sum of the constraints over time to be non-positive. This sum of constraints is known as the long-term constraint.\n\nTo solve this new problem, [13] considers the following augmented Lagrangian function at each iteration $t$:\n\n$$L_t(x, \\lambda) = f_t(x) + \\sum_{i=1}^m \\Big\\{ \\lambda_i g_i(x) - \\frac{\\sigma \\eta}{2} \\lambda_i^2 \\Big\\} \\qquad (4)$$\n\nThe update rule is as follows:\n\n$$x_{t+1} = \\Pi_B(x_t - \\eta \\nabla_x L_t(x_t, \\lambda_t)), \\quad \\lambda_{t+1} = \\Pi_{[0,+\\infty)^m}(\\lambda_t + \\eta \\nabla_\\lambda L_t(x_t, \\lambda_t)) \\qquad (5)$$\n\nwhere $\\eta$ and $\\sigma$ are the pre-determined stepsize and some constant, respectively.\n\nMore recently, an adaptive version was developed in [10], which has a user-defined trade-off parameter. The algorithm proposed by [10] utilizes two different stepsize sequences to update $x$ and $\\lambda$, respectively, instead of using a single stepsize $\\eta$.\n\nIn both algorithms of [13] and [10], the bound for the violation of the long-term constraint is that $\\forall i$, $\\sum_{t=1}^T g_i(x_t) \\le O(T^\\gamma)$ for some $\\gamma \\in (0,1)$. However, as argued in the previous section, this bound does not enforce that the violation of the constraint $x_t \\in S$ gets small. A situation can arise in which strictly satisfied constraints at one time step can cancel out violations of the constraints at other time steps. This problem can be rectified by considering the clipped constraint, $[g_i(x_t)]_+$, in place of $g_i(x_t)$.\n\nFor convex problems, our goal is to bound the term $\\sum_{t=1}^T \\big([g_i(x_t)]_+\\big)^2$, which, as discussed in the previous section, is more useful for enforcing small constraint violations, and also recovers the existing bounds for both $\\sum_{t=1}^T [g_i(x_t)]_+$ and $\\sum_{t=1}^T g_i(x_t)$. For strongly convex problems, we also show the improvement on the upper bounds compared to the results in [10].\n\nIn sum, in this paper, we want to solve the following problem for the general convex condition:\n\n$$\\min_{x_1, x_2, \\ldots, x_T \\in B} \\; \\sum_{t=1}^T f_t(x_t) - \\min_{x \\in S} \\sum_{t=1}^T f_t(x) \\quad \\text{s.t.} \\quad \\sum_{t=1}^T \\big([g_i(x_t)]_+\\big)^2 \\le O(T^\\gamma), \\; \\forall i \\qquad (6)$$\n\nwhere $\\gamma \\in (0,1)$. The new constraint from (6) is called the square-clipped long-term constraint (since it is a square-clipped version of the long-term constraint) or square-cumulative constraint (since it encodes the square-cumulative violation of the constraints).\n\nTo solve Problem (6), we change the augmented Lagrangian function $L_t$ as follows:\n\n$$L_t(x, \\lambda) = f_t(x) + \\sum_{i=1}^m \\Big\\{ \\lambda_i [g_i(x)]_+ - \\frac{\\theta_t}{2} \\lambda_i^2 \\Big\\} \\qquad (7)$$\n\nAlgorithm 1 Generalized Online Convex Optimization with Long-term Constraint\n1: Input: constraints $g_i(x) \\le 0$, $i = 1, 2, \\ldots, m$, stepsize $\\eta$, time horizon $T$, and constant $\\sigma > 0$.\n2: Initialization: $x_1$ is in the center of $B$.\n3: for $t = 1$ to $T$ do\n4:   Input the prediction result $x_t$.\n5:   Obtain the convex loss function $f_t(x)$ and the loss value $f_t(x_t)$.\n6:   Calculate a subgradient $\\partial_x L_t(x_t, \\lambda_t)$, where $\\partial_x L_t(x_t, \\lambda_t) = \\partial_x f_t(x_t) + \\sum_{i=1}^m \\lambda_t^i \\partial_x([g_i(x_t)]_+)$, with $\\partial_x([g_i(x_t)]_+) = 0$ if $g_i(x_t) \\le 0$ and $\\partial_x([g_i(x_t)]_+) = \\nabla_x g_i(x_t)$ otherwise.\n7:   Update $x_t$ and $\\lambda_t$ as below: $x_{t+1} = \\Pi_B(x_t - \\eta \\partial_x L_t(x_t, \\lambda_t))$, $\\lambda_{t+1} = \\frac{[g(x_{t+1})]_+}{\\sigma \\eta}$.\n8: end for\n\nIn this paper, we will use the following assumptions as in [13]:\n\n1. The convex set $S$ is non-empty, closed, bounded, and can be described by $m$ convex functions as $S = \\{x : g_i(x) \\le 0, i = 1, 2, \\ldots, m\\}$.\n\n2. Both the loss functions $f_t(x)$, $\\forall t$, and constraint functions $g_i(x)$, $\\forall i$, are Lipschitz continuous in the set $B$. 
That is, $\\|f_t(x) - f_t(y)\\| \\le L_f \\|x - y\\|$ and $\\|g_i(x) - g_i(y)\\| \\le L_g \\|x - y\\|$, $\\forall x, y \\in B$ and $\\forall t, i$. We define $G = \\max\\{L_f, L_g\\}$, and\n\n$$F = \\max_{t=1,2,\\ldots,T} \\max_{x,y \\in B} f_t(x) - f_t(y) \\le 2 L_f R, \\qquad D = \\max_{i=1,2,\\ldots,m} \\max_{x \\in B} g_i(x) \\le L_g R$$\n\n3 Algorithm\n\n3.1 Convex Case:\n\nThe main algorithm for this paper is shown in Algorithm 1. For simplicity, we abuse the subgradient notation, denoting a single element of the subgradient by $\\partial_x L_t(x_t, \\lambda_t)$. Comparing our algorithm with Eq.(5), we can see that the gradient projection step for $x_{t+1}$ is similar, while the update rule for $\\lambda_{t+1}$ is different. Instead of a projected gradient step, we explicitly maximize $L_{t+1}(x_{t+1}, \\lambda)$ over $\\lambda$. This explicit projection-free update for $\\lambda_{t+1}$ is possible because the constraint clipping guarantees that the maximizer is non-negative. Furthermore, this constraint-violation-dependent update helps to enforce small cumulative and individual constraint violations. Specific bounds on constraint violation are given in Theorem 1 and Lemma 1 below.\n\nBased on the update rule in Algorithm 1, the following theorem gives the upper bounds for both the regret on the loss and the squared-cumulative constraint violation, $\\sum_{t=1}^T \\big([g_i(x_t)]_+\\big)^2$, in Problem (6). For space purposes, all proofs are contained in the supplementary material.\n\nTheorem 1. Set $\\sigma = \\frac{(m+1)G^2}{2(1-\\alpha)}$, $\\eta = \\frac{1}{G\\sqrt{(m+1)RT}}$. If we follow the update rule in Algorithm 1 with $\\alpha \\in (0,1)$ and $x^*$ being the optimal solution for $\\min_{x \\in S} \\sum_{t=1}^T f_t(x)$, we have\n\n$$\\sum_{t=1}^T \\big(f_t(x_t) - f_t(x^*)\\big) \\le O(\\sqrt{T}), \\qquad \\sum_{t=1}^T \\big([g_i(x_t)]_+\\big)^2 \\le O(\\sqrt{T}), \\; \\forall i \\in \\{1, 2, \\ldots, m\\}$$\n\nFrom Theorem 1, we can see that by setting an appropriate stepsize, $\\eta$, and constant, $\\sigma$, we can obtain an upper bound of $O(\\sqrt{T})$ for the regret of the loss function, which is also shown in [13] [10]. The main difference of Theorem 1 is that previous results of [13] [10] all obtain the upper bound for the long-term constraint $\\sum_{t=1}^T g_i(x_t)$, while here the upper bound for the constraint violation of the form $\\sum_{t=1}^T \\big([g_i(x_t)]_+\\big)^2$ is achieved. Also note that the stepsize depends on $T$, which may not be available. In this case, we can use the 'doubling trick' described in the book [2] to transform our $T$-dependent algorithm into a $T$-free one with a worsening factor of $\\sqrt{2}/(\\sqrt{2} - 1)$.\n\nThe proposed algorithm and the resulting bound are useful for two reasons: 1. The square-cumulative constraint implies a bound on the cumulative constraint violation, $\\sum_{t=1}^T [g_i(x_t)]_+$, while enforcing larger penalties for large violations. 2. The proposed algorithm can also upper bound the constraint violation for each single step $[g_i(x_t)]_+$, which is not bounded in the previous literature.\n\nThe next results show how to bound constraint violations at each step.\n\nLemma 1. If there is only one differentiable constraint function $g(x)$ with Lipschitz continuous gradient parameter $L$, and we run Algorithm 1 with the parameters in Theorem 1 and large enough $T$, we have\n\n$$[g(x_t)]_+ \\le O\\big(\\tfrac{1}{T^{1/6}}\\big), \\; \\forall t \\in \\{1, 2, \\ldots, T\\}, \\quad \\text{if} \\quad [g(x_1)]_+ \\le O\\big(\\tfrac{1}{T^{1/6}}\\big).$$\n\nLemma 1 only considers the single constraint case. For the case of multiple differentiable constraints, we have the following:\n\nProposition 1. 
For multiple differentiable constraint functions $g_i(x)$, $i \\in \\{1, 2, \\ldots, m\\}$, with Lipschitz continuous gradient parameters $L_i$, if we use $\\bar{g}(x) = \\log\\big(\\sum_{i=1}^m \\exp g_i(x)\\big)$ as the constraint function in Algorithm 1, then for large enough $T$, we have\n\n$$[g_i(x_t)]_+ \\le O\\big(\\tfrac{1}{T^{1/6}}\\big), \\; \\forall i, t, \\quad \\text{if} \\quad [\\bar{g}(x_1)]_+ \\le O\\big(\\tfrac{1}{T^{1/6}}\\big).$$\n\nClearly, both Lemma 1 and Proposition 1 only deal with differentiable functions. For a non-differentiable function $g(x)$, we can first use a differentiable function $\\bar{g}(x)$ to approximate $g(x)$ with $\\bar{g}(x) \\ge g(x)$, and then apply Lemma 1 and Proposition 1 to upper bound each individual $g_i(x_t)$. Many non-smooth convex functions can be approximated in this way, as shown in [16].\n\n3.2 Strongly Convex Case:\n\nWhen $f_t(x)$ is strongly convex, Algorithm 1 is still valid. But in order to have lower upper bounds for both the objective regret and the clipped long-term constraint $\\sum_{t=1}^T [g_i(x_t)]_+$ compared with Proposition 3 in the next section, we need to use a time-varying stepsize like the one used in [9]. Thus, we modify the update rule of $x_t$, $\\lambda_t$ to have a time-varying stepsize as below:\n\n$$x_{t+1} = \\Pi_B(x_t - \\eta_t \\partial_x L_t(x_t, \\lambda_t)), \\quad \\lambda_{t+1} = \\frac{[g(x_{t+1})]_+}{\\theta_{t+1}}. \\qquad (8)$$\n\nIf we replace the update rule in Algorithm 1 with Eq.(8), we can obtain the following theorem:\n\nTheorem 2. Assume $f_t(x)$ has strong convexity parameter $H_1$. 
If we set $\\eta_t = \\frac{H_1}{t+1}$, $\\theta_t = \\eta_t (m+1) G^2$, follow the new update rule in Eq.(8), and $x^*$ being the optimal solution for $\\min_{x \\in S} \\sum_{t=1}^T f_t(x)$, for $\\forall i \\in \\{1, 2, \\ldots, m\\}$, we have\n\n$$\\sum_{t=1}^T \\big(f_t(x_t) - f_t(x^*)\\big) \\le O(\\log(T)), \\qquad \\sum_{t=1}^T g_i(x_t) \\le \\sum_{t=1}^T [g_i(x_t)]_+ \\le O(\\sqrt{\\log(T) T}).$$\n\nThe paper [10] also has a discussion of strongly convex functions, but only provides a bound similar to the convex one. Theorem 2 shows the improved bounds for both objective regret and the constraint violation. On one hand the objective regret is consistent with the standard OCO result in [9], and on the other the constraint violation is further reduced compared with the result in [10].\n\n4 Relation with Previous Results\n\nIn this section, we extend Theorem 1 to enable direct comparison with the results from [13] [10]. In particular, it is shown how Algorithm 1 recovers the existing regret bounds, while the use of the new augmented Lagrangian (7) in the previous algorithms also provides regret bounds for the clipped constraint case.\n\nThe first result puts a bound on the clipped long-term constraint, rather than the sum-of-squares that appears in Theorem 1. This will allow more direct comparisons with the existing results.\n\nProposition 2. 
If $\\sigma = \\frac{(m+1)G^2}{2(1-\\alpha)}$, $\\eta = O(\\frac{1}{\\sqrt{T}})$, $\\alpha \\in (0,1)$, and $x^* = \\arg\\min_{x \\in S} \\sum_{t=1}^T f_t(x)$, then the result of Algorithm 1 satisfies\n\n$$\\sum_{t=1}^T \\big(f_t(x_t) - f_t(x^*)\\big) \\le O(\\sqrt{T}), \\qquad \\sum_{t=1}^T g_i(x_t) \\le \\sum_{t=1}^T [g_i(x_t)]_+ \\le O(T^{3/4}), \\; \\forall i \\in \\{1, 2, \\ldots, m\\}$$\n\nThis result shows that our algorithm generalizes the regret and long-term constraint bounds of [13].\n\nThe next result shows that by changing our constant stepsize accordingly, with Algorithm 1, we can achieve the user-defined trade-off from [10]. Furthermore, we also include the squared version and clipped constraint violations.\n\nProposition 3. If $\\sigma = \\frac{(m+1)G^2}{2(1-\\alpha)}$, $\\eta = O(\\frac{1}{T^\\beta})$, $\\alpha \\in (0,1)$, $\\beta \\in (0,1)$, and $x^* = \\arg\\min_{x \\in S} \\sum_{t=1}^T f_t(x)$, then the result of Algorithm 1 satisfies\n\n$$\\sum_{t=1}^T \\big(f_t(x_t) - f_t(x^*)\\big) \\le O(T^{\\max\\{\\beta, 1-\\beta\\}}), \\qquad \\sum_{t=1}^T g_i(x_t) \\le \\sum_{t=1}^T [g_i(x_t)]_+ \\le O(T^{1-\\beta/2}), \\qquad \\sum_{t=1}^T \\big([g_i(x_t)]_+\\big)^2 \\le O(T^{1-\\beta}), \\; \\forall i \\in \\{1, 2, \\ldots, m\\}$$\n\nProposition 3 provides a systematic way to balance the regret of the objective and the constraint violation. Next, we will show that previous algorithms can use our proposed augmented Lagrangian function to have their own clipped long-term constraint bound.\n\nProposition 4. 
If we run Algorithm 1 in [13] with the augmented Lagrangian formula defined in Eq.(7), the result satisfies\n\n$$\\sum_{t=1}^T \\big(f_t(x_t) - f_t(x^*)\\big) \\le O(\\sqrt{T}), \\qquad \\sum_{t=1}^T g_i(x_t) \\le \\sum_{t=1}^T [g_i(x_t)]_+ \\le O(T^{3/4}), \\; \\forall i \\in \\{1, 2, \\ldots, m\\}.$$\n\nFor the update rule proposed in [10], we need to change $L_t(x, \\lambda)$ to the following one:\n\n$$L_t(x, \\lambda) = f_t(x) + \\lambda [g(x)]_+ - \\frac{\\theta_t}{2} \\lambda^2 \\qquad (9)$$\n\nwhere $g(x) = \\max_{i \\in \\{1, \\ldots, m\\}} g_i(x)$.\n\nProposition 5. If we use the update rule and the parameter choices in [10] with the augmented Lagrangian in Eq.(9), then $\\forall i \\in \\{1, \\ldots, m\\}$, we have\n\n$$\\sum_{t=1}^T \\big(f_t(x_t) - f_t(x^*)\\big) \\le O(T^{\\max\\{\\beta, 1-\\beta\\}}), \\qquad \\sum_{t=1}^T g_i(x_t) \\le \\sum_{t=1}^T [g_i(x_t)]_+ \\le O(T^{1-\\beta/2}).$$\n\nPropositions 4 and 5 show that clipped long-term constraints can be bounded by combining the algorithms of [13, 10] with our augmented Lagrangian. Although these results are similar in part to our Propositions 2 and 3, they do not imply the results in Theorems 1 and 2 as well as the new single step constraint violation bound in Lemma 1, which are our key contributions. Based on Propositions 4 and 5, it is natural to ask whether we could apply our new augmented Lagrangian formula (7) to the recent work in [19]. Unfortunately, we have not found a way to do so.\n\nFurthermore, since $\\big([g_i(x_t)]_+\\big)^2$ is also convex, we could define $\\tilde{g}_i(x_t) = \\big([g_i(x_t)]_+\\big)^2$ and apply the previous algorithms [13] [10] and [19]. This will result in the upper bounds of $O(T^{3/4})$ [13] and $O(T^{1-\\beta/2})$ [10], which are worse than our upper bounds of $O(T^{1/2})$ (Theorem 1) and $O(T^{1-\\beta})$ (Proposition 3). 
Note that the algorithm in [19] cannot be applied since the clipped constraints do not satisfy the required Slater condition.\n\nFigure 1: Toy Example Results: Trajectories generated by different algorithms. Note how trajectories generated by Clipped-OGD follow the desired constraints tightly. In contrast, OGD oscillates around the true constraints, and A-OGD closely follows the boundary of the outer ball. [Two panels with axes $x_1$ and $x_2$, comparing Clipped-OGD, A-OGD, and OGD for $\\beta = 0.5$ and $\\beta = 2/3$.]\n\nFigure 2: Doubly-Stochastic Matrices. Fig.2(a): Clipped Long-term Constraint Violation. Fig.2(b): Long-term Constraint Violation. Fig.2(c): Cumulative Regret of the Loss function. [Panels plot Clipped Constraint Regret, Constraint Regret, and Objective Regret against Different Iterations T for Clipped-OGD and A-OGD with $\\beta = 0.5$ and $\\beta = 2/3$, OGD, and Our-Strong.]\n\nFigure 3: Economic Dispatch. Fig.3(a): Power Demand Trajectory. Fig.3(b): Constraint Violation for each time step. All of the previous algorithms incurred substantial constraint violations. The figure on the right shows the violations of our algorithm, which are significantly smaller. Fig.3(c): Running Average of the Objective Loss. [Panels plot Demand, Constraint Violation %, and Running Average Objective Cost against Time Slots (each 5 min); the objective cost panel also shows the best fixed strategy in hindsight.]\n\n5 Experiments\n\nIn this section, we test the performance of the algorithms including OGD [13], A-OGD [10], Clipped-OGD (this paper), and our proposed algorithm for the strongly convex case (Our-Strong). Throughout the experiments, our algorithm has the following fixed parameters: $\\alpha = 0.5$, $\\sigma = \\frac{(m+1)G^2}{2(1-\\alpha)}$, $\\eta = \\frac{1}{T^\\beta G \\sqrt{R(m+1)}}$. In order to better show the result of the constraint violation trajectories, we aggregate all the constraints as a single one by using $g(x_t) = \\max_{i \\in \\{1, \\ldots, m\\}} g_i(x_t)$ as done in [13].\n\n5.1 A Toy Experiment\n\nFor illustration purposes, we solve the following 2-D toy experiment with $x = [x_1, x_2]^T$:\n\n$$\\min \\sum_{t=1}^T c_t^T x, \\quad \\text{s.t.} \\quad |x_1| + |x_2| - 1 \\le 0. \\qquad (10)$$\n\nwhere the constraint is the $\\ell_1$-norm constraint. The vector $c_t$ is generated from a uniform random vector over $[0, 1.2] \\times [0, 1]$ which is rescaled to have norm 1. This leads to a slightly larger average cost on the first coordinate. 
The offline solutions for different $T$ are obtained by CVXPY [5].\n\nAll algorithms are run up to $T = 20000$ and are averaged over 10 random sequences of $\\{c_t\\}_{t=1}^T$. Since the main goal here is to compare the variables' trajectories generated by different algorithms, the results for different $T$ are in the supplementary material for space purposes. Fig.1 shows these trajectories for one realization with $T = 8000$. The blue star is the optimal point's position.\n\nFrom Fig.1 we can see that the trajectories generated by Clipped-OGD follow the boundary very tightly until reaching the optimal point. This can be explained by Lemma 1, which shows that the constraint violation for a single step is also upper bounded. For OGD, the trajectory oscillates widely around the boundary of the true constraint. For A-OGD, the trajectory in Fig.1 violates the constraint most of the time, and this violation actually contributes to the lower objective regret shown in the supplementary material.\n\n5.2 Doubly-Stochastic Matrices\n\nWe also test the algorithms for approximation by doubly-stochastic matrices, as in [10]:\n\n$$\\min \\sum_{t=1}^T \\frac{1}{2} \\|Y_t - X\\|_F^2 \\quad \\text{s.t.} \\quad X\\mathbf{1} = \\mathbf{1}, \\; X^T \\mathbf{1} = \\mathbf{1}, \\; X_{ij} \\ge 0. \\qquad (11)$$\n\nwhere $X \\in \\mathbb{R}^{d \\times d}$ is the matrix variable, $\\mathbf{1}$ is the vector whose elements are all 1, and the matrix $Y_t$ is a randomly generated permutation matrix.\n\nAfter changing the equality constraints into inequality ones (e.g., $X\\mathbf{1} = \\mathbf{1}$ into $X\\mathbf{1} \\ge \\mathbf{1}$ and $X\\mathbf{1} \\le \\mathbf{1}$), we run the algorithms with different $T$ up to $T = 20000$ for 10 different random sequences of $\\{Y_t\\}_{t=1}^T$. Since the objective function $f_t(x)$ is strongly convex with parameter $H_1 = 1$, we also include our designed strongly convex algorithm as another comparison. The offline optimal solutions are obtained by CVXPY [5].\n\nThe mean results for both constraint violation and objective regret are shown in Fig.2. 
From the result we can see that our designed strongly convex algorithm, Our-Strong, is around the best not only in the clipped constraint violation, but also in the objective regret. For our most-balanced convex case algorithm, Clipped-OGD with $\\beta = 0.5$, although its clipped constraint violation is relatively bigger than that of A-OGD, it also becomes quite flat quickly, which means the algorithm quickly converges to a feasible solution.\n\n5.3 Economic Dispatch in Power Systems\n\nThis example is adapted from [12] and [18], which consider the problem of power dispatch. That is, at each time step $t$, we try to minimize the power generation cost $c_i(x_{t,i})$ for each generator $i$ while maintaining the power balance $\\sum_{i=1}^n x_{t,i} = d_t$, where $d_t$ is the power demand at time $t$. Also, each power generator produces an emission level $E_i(x_{t,i})$. To bound the emissions, we impose the constraint $\\sum_{i=1}^n E_i(x_{t,i}) \\le E_{\\max}$. In addition to requiring this constraint to be satisfied on average, we also require bounded constraint violations at each timestep. The problem is formally stated as:\n\n$$\\min \\sum_{t=1}^T \\Big( \\sum_{i=1}^n c_i(x_{t,i}) + \\xi \\big( \\sum_{i=1}^n x_{t,i} - d_t \\big)^2 \\Big), \\quad \\text{s.t.} \\quad \\sum_{i=1}^n E_i(x_{t,i}) \\le E_{\\max}, \\; 0 \\le x_{t,i} \\le x_{i,\\max}. \\qquad (12)$$\n\nwhere the second constraint is from the fact that each generator has a power generation limit.\n\nIn this example, we use three generators. We define the cost and emission functions according to [18] and [12] as $c_i(x_{t,i}) = 0.5 a_i x_{t,i}^2 + b_i x_{t,i}$, and $E_i = d_i x_{t,i}^2 + e_i x_{t,i}$, respectively. The parameters are: $a_1 = 0.2$, $a_2 = 0.12$, $a_3 = 0.14$, $b_1 = 1.5$, $b_2 = 1$, $b_3 = 0.6$, $d_1 = 0.26$, $d_2 = 0.38$, $d_3 = 0.37$, $E_{\\max} = 100$, $\\xi = 0.5$, and $x_{1,\\max} = 20$, $x_{2,\\max} = 15$, $x_{3,\\max} = 18$. The demand $d_t$ is adapted from real-world 5-minute interval demand data between 04/24/2018 and 05/03/2018 (footnote 1), which is shown in Fig.3(a). 
The offline optimal solution or best fixed strategy in hindsight is obtained by an implementation of SAGA [4]. The constraint violation for each time step is shown in Fig.3(b), and the running average objective cost is shown in Fig.3(c). From these results we can see that our algorithm has very small constraint violation for each time step, as the problem requires. Furthermore, our objective costs are very close to the best fixed strategy.\n\n6 Conclusion\n\nIn this paper, we propose two algorithms for OCO with both convex and strongly convex objective functions. By applying different update strategies that utilize a modified augmented Lagrangian function, they can solve OCO with a squared/clipped long-term constraint requirement. The algorithm for the general convex case provides useful bounds for both the long-term constraint violation and the constraint violation at each timestep. Furthermore, the bounds for the strongly convex case are an improvement compared with the previous efforts in the literature. Experiments show that our algorithms can follow the constraint boundary tightly and have relatively smaller clipped long-term constraint violation with reasonably low objective regret. It would be useful if future work could explore noisy versions of the constraints and obtain similar upper bounds.\n\nAcknowledgments\n\nThanks to Tianyi Chen for valuable discussions about the algorithm's properties.\n\nReferences\n\n[1] Avrim Blum, Vijay Kumar, Atri Rudra, and Felix Wu. Online learning in online auctions. Theoretical Computer Science, 324(2-3):137\u2013146, 2004.\n\n[2] Nicolo Cesa-Bianchi and G\u00e1bor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.\n\n[3] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. 
Journal of Machine Learning Research, 7(Mar):551\u2013585, 2006.\n\n[4] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646\u20131654, 2014.\n\n[5] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1\u20135, 2016.\n\n[6] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the $\\ell_1$-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272\u2013279. ACM, 2008.\n\nFootnote 1: https://www.iso-ne.com/isoexpress/web/reports/load-and-demand\n\n[7] John C Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, pages 14\u201326, 2010.\n\n[8] Maryam Fazel, Rong Ge, Sham M Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for linearized control problems. arXiv preprint arXiv:1801.05039, 2018.\n\n[9] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169\u2013192, 2007.\n\n[10] Rodolphe Jenatton, Jim Huang, and C\u00e9dric Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In International Conference on Machine Learning, pages 402\u2013411, 2016.\n\n[11] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[12] Yingying Li, Guannan Qu, and Na Li. Online optimization with predictions and switching costs: Fast algorithms and the fundamental limit. arXiv preprint arXiv:1801.07780, 2018.\n\n[13] Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. 
Trading regret for efficiency: online convex optimization with long term constraints. Journal of Machine Learning Research, 13(Sep):2503\u20132528, 2012.\n\n[14] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 689\u2013696. ACM, 2009.\n\n[15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529\u2013533, 2015.\n\n[16] Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127\u2013152, 2005.\n\n[17] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.\n\n[18] K Senthil and K Manikandan. Economic thermal power dispatch with emission constraint and valve point effect loading using improved tabu search algorithm. International Journal of Computer Applications, 2010.\n\n[19] Hao Yu, Michael Neely, and Xiaohan Wei. Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems, pages 1427\u20131437, 2017.\n\n[20] Jianjun Yuan and Andrew Lamperski. Online control basis selection by a regularized actor critic algorithm. In American Control Conference (ACC), 2017, pages 4448\u20134453. IEEE, 2017.\n\n[21] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928\u2013936, 2003.", "award": [], "sourceid": 3021, "authors": [{"given_name": "Jianjun", "family_name": "Yuan", "institution": "University of Minnesota"}, {"given_name": "Andrew", "family_name": "Lamperski", "institution": "University of Minnesota"}]}