{"title": "Stochastic Convex Optimization with Multiple Objectives", "book": "Advances in Neural Information Processing Systems", "page_first": 1115, "page_last": 1123, "abstract": "In this paper, we are interested in the development of efficient algorithms for convex optimization problems in the simultaneous presence of multiple objectives and stochasticity in the first-order information. We cast the stochastic multiple objective optimization problem into a constrained optimization problem by choosing one function as the objective and try to bound other objectives by appropriate thresholds. We first examine a two stages exploration-exploitation based algorithm which first approximates the stochastic objectives by sampling and then solves a constrained stochastic optimization problem by projected gradient method. This method attains a suboptimal convergence rate even under strong assumption on the objectives. Our second approach is an efficient primal-dual stochastic algorithm. It leverages on the theory of Lagrangian method in constrained optimization and attains the optimal convergence rate of $[O(1/ \\sqrt{T})]$ in high probability for general Lipschitz continuous objectives.", "full_text": "Stochastic Convex Optimization with\n\nMultiple Objectives\n\nMehrdad Mahdavi\n\nMichigan State University\nmahdavim@cse.msu.edu\n\nTianbao Yang\n\nRong Jin\n\nNEC Labs America, Inc\ntyang@nec-labs.com\n\nMichigan State University\nrongjin@cse.msu.edu\n\nAbstract\n\nIn this paper, we are interested in the development of ef\ufb01cient algorithms for con-\nvex optimization problems in the simultaneous presence of multiple objectives\nand stochasticity in the \ufb01rst-order information. We cast the stochastic multi-\nple objective optimization problem into a constrained optimization problem by\nchoosing one function as the objective and try to bound other objectives by appro-\npriate thresholds. 
We first examine a two-stage exploration-exploitation based algorithm, which first approximates the stochastic objectives by sampling and then solves a constrained stochastic optimization problem by the projected gradient method. This method attains a suboptimal convergence rate even under strong assumptions on the objectives. Our second approach is an efficient primal-dual stochastic algorithm. It leverages the theory of the Lagrangian method in constrained optimization and attains the optimal convergence rate of $O(1/\sqrt{T})$ in high probability for general Lipschitz continuous objectives.

1 Introduction

Although both stochastic optimization [17, 4, 18, 10, 26, 20, 22] and multiple objective optimization [9] are well studied subjects in Operations Research and Machine Learning [11, 12, 24], much less is developed for stochastic multiple objective optimization, which is the focus of this work. Unlike multiple objective optimization, where we have access to the complete objective functions, in stochastic multiple objective optimization only stochastic samples of the objective functions are available for optimization. Compared to the standard setup of stochastic optimization, the fundamental challenge of stochastic multiple objective optimization is how to make an appropriate tradeoff between different objectives given that we only have access to stochastic oracles for the different objectives. In particular, an algorithm for this setting has to weigh conflicting objective functions and accommodate the uncertainty in the objectives.
A simple approach toward stochastic multiple objective optimization is to linearly combine the multiple objectives with a fixed weight assigned to each objective. This converts stochastic multiple objective optimization into a standard stochastic optimization problem, and is guaranteed to produce Pareto efficient solutions.
The main difficulty with this approach is how to decide an appropriate weight for each objective, which is particularly challenging when the complete objective functions are unavailable. In this work, we consider an alternative formulation that casts multiple objective optimization into a constrained optimization problem. More specifically, we choose one of the objectives as the target to be optimized, and use the rest of the objectives as constraints in order to ensure that each of these objectives is below a specified level. Our assumption is that although the full objective functions are unknown, their desirable levels can be provided due to prior knowledge of the domain. Below, we provide a few examples that demonstrate the application of stochastic multiple objective optimization in the form of stochastic constrained optimization.
Robust Investment. Let $r \in \mathbb{R}^n$ denote the random returns of the $n$ risky assets, and let $w \in W \equiv \{w \in \mathbb{R}^n_+ : \sum_i w_i = 1\}$ denote the distribution of an investor's wealth over all assets. The return for an investment distribution is defined as $\langle w, r\rangle$. The investor needs to consider conflicting objectives such as rate of return, liquidity, and risk in maximizing his wealth [2]. Suppose that $r$ has an unknown probability distribution with mean vector $\mu$ and covariance matrix $\Sigma$. Then the target of the investor is to choose an optimal portfolio $w$ that lies on the mean-risk efficient frontier.
In mean-variance theory [15], which trades off between the expected return (mean) and risk (variance) of a portfolio, one is interested in minimizing the variance subject to budget constraints, which leads to a formulation like:
$$\min_{w \in \mathbb{R}^n_+,\ \sum_i w_i = 1} \ \left\langle w, \mathrm{E}[r r^\top] w \right\rangle \quad \text{subject to} \quad \mathrm{E}[\langle r, w\rangle] \geq \gamma.$$
Neyman-Pearson Classification. In the Neyman-Pearson (NP) classification paradigm (see e.g. [19]), the goal is to learn a classifier from labeled training data such that the probability of a false negative is minimized while the probability of a false positive is below a user-specified level $\gamma \in (0, 1)$. Let the hypothesis class be a parametrized convex set $W = \{w \mapsto \langle w, x\rangle : w \in \mathbb{R}^d, \|w\| \leq R\}$, and for all $(x, y) \in \Xi \equiv \mathbb{R}^d \times \{-1, +1\}$ let the loss function $\ell : W \times \Xi \mapsto \mathbb{R}_+$ be a non-negative convex function. While the goal of the classical binary classification problem is to minimize the risk as $\min_{w \in W} [L(w) = \mathrm{E}[\ell(w; (x, y))]]$, the Neyman-Pearson paradigm targets
$$\min_{w \in W} \ L^+(w) \quad \text{subject to} \quad L^-(w) \leq \gamma,$$
where $L^+(w) = \mathrm{E}[\ell(w; (x, y)) \mid y = +1]$ and $L^-(w) = \mathrm{E}[\ell(w; (x, y)) \mid y = -1]$.
Linear Optimization with Stochastic Constraints. In many applications in economics, most notably in welfare and utility theory and in management, parameters are known only stochastically, and it is unreasonable to assume that the objective functions and the solution domain are deterministically fixed. These situations involve the challenging task of weighing both conflicting goals and random data concerning the uncertain parameters of the problem. Mathematically, the goal in
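To make the NP constraint concrete, the two class-conditional risks can be estimated from a labeled sample. The sketch below is our own minimal illustration (hinge loss, synthetic Gaussian data; the names `hinge_loss`, `np_risks`, and the budget `gamma` are assumptions of ours, not from the paper):

```python
import numpy as np

def hinge_loss(w, X, y):
    """Elementwise hinge loss l(w; (x, y)) = max(0, 1 - y <w, x>)."""
    return np.maximum(0.0, 1.0 - y * (X @ w))

def np_risks(w, X, y):
    """Empirical estimates of L+(w) = E[l | y = +1] and L-(w) = E[l | y = -1]."""
    losses = hinge_loss(w, X, y)
    return losses[y == +1].mean(), losses[y == -1].mean()

# Toy data: two shifted Gaussian clouds.
rng = np.random.default_rng(0)
X_pos = rng.normal(+1.0, 1.0, size=(200, 2))
X_neg = rng.normal(-1.0, 1.0, size=(200, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(200), -np.ones(200)])

w = np.array([0.5, 0.5])
L_plus, L_minus = np_risks(w, X, y)
gamma = 1.5                      # user-specified false-positive budget
feasible = L_minus <= gamma      # does w satisfy the NP constraint?
```

A candidate $w$ is admissible for the NP problem only when `feasible` holds; among admissible candidates one then prefers the smallest `L_plus`.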
Mathematically, the goal in\nmulti-objective linear programming with stochastic information is to solve:\nsubject to w 2 W = fw 2 Rd\n\n[\u27e8c1((cid:24)); w\u27e9 ;(cid:1)(cid:1)(cid:1) ;\u27e8cK((cid:24)); w\u27e9]\n\nwhere (cid:24) is the randomness in the parameters, ci; i 2 [K] are the objective functions, and A and b\nformulate the stochastic constraints on the solution where randomness is captured by (cid:24).\nIn this paper, we \ufb01rst examine two methods that try to eliminate the multi-objective aspect or the\nstochastic nature of stochastic multiple objective optimization and reduce the problem to a standard\nconvex optimization problem. We show that both methods fail to tackle the problem of stochastic\nmultiple objective optimization in general and require strong assumptions on the stochastic objec-\ntives, which limits their applications to real world problems. Having discussed these negative results,\np\nwe propose an algorithm that can solve the problem optimally and ef\ufb01ciently. We achieve this by\nT ) con-\nan ef\ufb01cient primal-dual stochastic gradient descent method that is able to attain an O(1=\nvergence rate for all the objectives under the standard assumption of the Lipschitz continuity of\nobjectives which is known to be optimal (see for instance [3]). We note that there is a \ufb02urry of re-\nsearch on heuristics-based methods to address the multi-objective stochastic optimization problem\n(see e.g., [8] and [1] for a recent survey on existing methods). However, in contrast to this study,\nmost of these approaches do not have theoretical guarantees.\nFinally, we would like to distinguish our work from robust optimization [5] and online learning\nwith long term constraint [13]. Robust optimization was designed to deal with uncertainty within\nthe optimization systems. 
Although it provides a principled framework for dealing with stochastic constraints, it often ends up with non-convex optimization problems that are not computationally tractable. Online learning with long-term constraints generalizes online learning. Instead of requiring the constraints to be satisfied by every solution generated by online learning, it allows the constraints to be satisfied by the entire sequence of solutions. However, unlike stochastic multiple objective optimization, in online learning with long-term constraints the constraint functions are fixed and known before the start of online learning.
Outline. The remainder of the paper is organized as follows. In Section 2 we establish the necessary notation and introduce the problem under consideration. Section 3 introduces the problem reduction methods and elaborates their disadvantages. Section 4 presents our efficient primal-dual stochastic optimization algorithm. Finally, we conclude the paper with open questions in Section 5.

2 Preliminaries

Notation. Throughout this paper, we use the following notation. We use bold-face letters to denote vectors. We denote the inner product between two vectors $w, w' \in W$ by $\langle w, w'\rangle$, where $W \subseteq \mathbb{R}^d$ is a compact closed domain. For $m \in \mathbb{N}$, we denote by $[m]$ the set $\{1, 2, \cdots, m\}$. We only consider the $\ell_2$ norm throughout the paper. The ball with radius $R$ is denoted by $B = \{w \in \mathbb{R}^d : \|w\| \leq R\}$.
Statement of the Problem. In this work, we generalize online stochastic convex optimization to the case of multiple objectives. In particular, at each iteration, the learner is asked to present a solution $w_t$, which will be evaluated by multiple loss functions $f^0_t(w), f^1_t(w), \ldots, f^m_t(w)$. A fundamental difference between single- and multi-objective optimization is that for the latter it is not obvious how to evaluate the optimization quality.
Since it is impossible to simultaneously minimize multiple loss functions, and in order to avoid the complications caused by handling more than one objective, we choose one function as the objective and try to bound the other objectives by appropriate thresholds. Specifically, the goal of OCO with multiple objectives becomes to minimize $\sum_{t=1}^T f^0_t(w_t)$ and at the same time keep the other objective functions below a given threshold, i.e.
$$\frac{1}{T} \sum_{t=1}^T f^i_t(w_t) \leq \gamma_i, \quad i \in [m],$$
where $w_1, \ldots, w_T$ are the solutions generated by the online learner and $\gamma_i$ specifies the level of loss that is acceptable to the $i$th objective function. Since the general setup (i.e., the full adversarial setup) is challenging for online convex optimization even with two objectives [14], in this work we consider a simple scenario where all the loss functions $f^i_t(w), i \in [m]$ are i.i.d. samples from an unknown distribution [21]. We also note that our goal is NOT to find a Pareto efficient solution (a solution is Pareto efficient if it is not dominated by any solution in the decision space). Instead, we aim to find a solution that (i) optimizes one selected objective, and (ii) satisfies all the other objectives with respect to the specified thresholds.
We denote by $\bar f^i(w) = \mathrm{E}_t[f^i_t(w)], i = 0, 1, \ldots, m$ the expected loss function of the sampled function $f^i_t(w)$. In stochastic multiple objective optimization, we assume that we do not have direct access to the expected loss functions, and the only information available to the solver is through a stochastic oracle that returns a stochastic realization of the expected loss function at each call. We assume that there exists a solution $w$ strictly satisfying all the constraints, i.e. $\bar f^i(w) < \gamma_i, i \in [m]$.
We denote by $w_*$ the optimal solution to multiple objective optimization, i.e.,
$$w_* = \arg\min \left\{ \bar f^0(w) : \bar f^i(w) \leq \gamma_i, \ i \in [m] \right\}. \quad (1)$$
Our goal is to efficiently compute a solution $\widehat{w}_T$ after $T$ trials that (i) obeys all the constraints, i.e. $\bar f^i(\widehat{w}_T) \leq \gamma_i, i \in [m]$, and (ii) minimizes the objective $\bar f^0$ with respect to the optimal solution $w_*$, i.e. $\bar f^0(\widehat{w}_T) - \bar f^0(w_*)$. For the convenience of discussion, we refer to $f^0_t(w)$ and $\bar f^0(w)$ as the objective function, and to $f^i_t(w)$ and $\bar f^i(w)$ for all $i \in [m]$ as the constraint functions.
Before discussing the algorithms, we first mention a few assumptions made in our analysis. We assume that the optimal solution $w_*$ belongs to $B$. We also make the standard assumption that all the loss functions, including both the objective function and the constraint functions, are Lipschitz continuous, i.e., $|f^i_t(w) - f^i_t(w')| \leq L \|w - w'\|$ for any $w, w' \in B$.

3 Problem Reduction and its Limitations

Here we examine two algorithms to cope with the complexity of stochastic optimization with multiple objectives and discuss some negative results which motivate the primal-dual algorithm presented in Section 4. The first method transforms the stochastic multi-objective problem into a stochastic single-objective optimization problem and then solves the latter problem by any stochastic programming approach.
Alternatively, one can eliminate the randomness of the problem by estimating the stochastic objectives and transform the problem into a deterministic multi-objective problem.

3.1 Linear Scalarization with Stochastic Optimization

A simple approach to solve a stochastic optimization problem with multiple objectives is to eliminate the multi-objective aspect of the problem by aggregating the $m + 1$ objectives into a single objective $\sum_{i=0}^m \alpha_i f^i_t(w_t)$, where $\alpha_i, i \in \{0, 1, \cdots, m\}$ is the weight of the $i$th objective, and then solving the resulting single objective stochastic problem by stochastic optimization methods. This approach is in general known as the weighted-sum or scalarization method [1]. Although this naive idea considerably reduces the computational challenge of the problem, unfortunately, it is difficult to decide the weight for each objective such that the specified levels for the different objectives are obeyed. Beyond the hardness of optimally determining the weights of the individual functions, it is also unclear how to bound the sub-optimality of the final solution for the individual objective functions.

3.2 Projected Gradient Descent with Estimated Objective Functions

The main challenge of the proposed problem is that the expected constraint functions $\bar f^i(w)$ are not given. Instead, only a sampled function is provided at each trial $t$. Our naive approach is to replace the expected constraint function $\bar f^i(w)$ with its empirical estimate based on the sampled objective functions. This approach turns the problem of stochastically optimizing multiple objectives into the original online convex optimization problem with complex projections, and therefore can be solved by projected gradient descent.
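The scalarization reduction can be sketched in a few lines. The example below is our own toy setup (two noisy quadratic objectives; `scalarized_sgd` and all parameter values are illustrative assumptions, not from the paper):

```python
import numpy as np

def scalarized_sgd(grad_fns, alphas, w0, eta, T, radius, rng):
    """Weighted-sum reduction: run SGD on sum_i alpha_i f^i_t(w), projecting
    onto the l2 ball of the given radius after each step.  grad_fns[i](w, rng)
    returns a stochastic gradient of the i-th objective."""
    w = w0.copy()
    for _ in range(T):
        g = sum(a * gf(w, rng) for a, gf in zip(alphas, grad_fns))
        w = w - eta * g
        nrm = np.linalg.norm(w)
        if nrm > radius:
            w = w * radius / nrm   # projection onto the ball
    return w

# Two noisy quadratic objectives pulling toward different points (+1 and -1);
# with equal weights the scalarized minimizer is w = 0.
g0 = lambda w, rng: 2 * (w - 1.0) + 0.1 * rng.standard_normal(w.shape)
g1 = lambda w, rng: 2 * (w + 1.0) + 0.1 * rng.standard_normal(w.shape)
rng = np.random.default_rng(0)
w_hat = scalarized_sgd([g0, g1], alphas=[0.5, 0.5], w0=np.zeros(2),
                       eta=0.05, T=500, radius=5.0, rng=rng)
```

The difficulty discussed above shows up directly here: nothing in this reduction tells us how to pick `alphas` so that each objective ends up below its threshold $\gamma_i$.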
More specifically, at trial $t$, given the current solution $w_t$ and the received loss functions $f^i_t(w), i = 0, 1, \ldots, m$, we first estimate the constraint functions as
$$\widehat{f}^i_t(w) = \frac{1}{t} \sum_{k=1}^t f^i_k(w), \quad i \in [m],$$
and then update the solution by $w_{t+1} = \Pi_{W_t}(w_t - \eta \nabla f^0_t(w_t))$, where $\eta > 0$ is the step size, $\Pi_W(w) = \arg\min_{z \in W} \|z - w\|$ projects a solution $w$ into the domain $W$, and $W_t$ is an approximate domain given by $W_t = \{w : \widehat{f}^i_t(w) \leq \gamma_i, i \in [m]\}$.
One problem with the above approach is that although it is feasible to satisfy all the constraints based on the true expected constraint functions, there is no guarantee that the approximate domain $W_t$ is not empty. One way to address this issue is to estimate the expected constraint functions by burning the first $bT$ trials, where $b \in (0, 1)$ is a constant that needs to be adjusted to obtain the optimal performance, and to keep the estimated constraint functions unchanged afterwards. Given the sampled functions $f^i_1, \ldots, f^i_{bT}$ received in the first $bT$ trials, we compute the approximate domain $W'$ as
$$\widehat{f}^i(w) = \frac{1}{bT} \sum_{t=1}^{bT} f^i_t(w), \quad i \in [m]; \qquad W' = \left\{ w : \widehat{f}^i(w) \leq \gamma_i + \hat\gamma_i, \ i = 1, \ldots, m \right\},$$
where $\hat\gamma_i > 0$ is a relaxation constant introduced to ensure that, with a high probability, the approximate domain is not empty provided that the original domain $W$ is not empty.
To ensure the correctness of the above approach, we need to establish some kind of uniform (strong) convergence assumption to make sure that the solutions obtained by projection onto the estimated domain $W'$ will be close to the true domain $W$ with high probability. It turns out that the following assumption ensures the desired property.
Assumption 1 (Uniform Convergence). Let $\widehat{f}^i(w), i = 0, 1, \cdots, m$ be the estimated functions obtained by averaging over $bT$ i.i.d. samples for $\bar f^i(w), i \in [m]$. We assume that, with a high probability,
$$\sup_{w \in W} \left| \widehat{f}^i(w) - \bar f^i(w) \right| \leq O([bT]^{-q}), \quad i = 0, 1, \cdots, m,$$
where $q > 0$ decides the convergence rate.
It is straightforward to show that under Assumption 1, with a high probability, for any $w \in W$ we have $w \in W'$, with appropriately chosen relaxation constants $\hat\gamma_i, i \in [m]$. Using the estimated domain $W'$, for trial $t \in [bT + 1, T]$, we update the solution by $w_{t+1} = \Pi_{W'}(w_t - \eta \nabla f^0_t(w_t))$.
There are, however, several drawbacks with this naive approach. Since the first $bT$ trials are used for estimating the constraint functions, only the last $(1-b)T$ trials are used for searching for the optimal solution. The total amount of violation of the individual constraint functions over the last $(1-b)T$ trials, given by $\sum_{t=bT+1}^T \bar f^i(w_t)$, is $O((1-b) b^{-q} T^{1-q})$, where each of the $(1-b)T$ trials receives a violation of $O([bT]^{-q})$. Similarly, following the conventional analysis of online learning [26], we have $\sum_{t=bT+1}^T (f^0_t(w_t) - f^0_t(w_*)) \leq O(\sqrt{(1-b)T})$. Using the same trick as in [13], to obtain a solution with zero violation of the constraints, we will have a regret bound of $O((1-b) b^{-q} T^{1-q} + \sqrt{(1-b)T})$, which yields a convergence rate of $O(T^{-1/2} + T^{-q})$; this could be worse than the optimal rate $O(T^{-1/2})$ when $q < 1/2$. Additionally, this approach requires memorizing the constraint functions of the first $bT$ trials. This is in contrast to the typical assumption of online learning, where only the solution is memorized.
Remark 1.
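For intuition, the burn-in scheme above can be sketched in the special case of a single *linear* constraint, where the projection onto the estimated domain has a closed form. The function `burn_in_then_project` and the toy instance below are our own illustration, not the paper's code:

```python
import numpy as np

def burn_in_then_project(f0_grad, c_samples, gamma, slack, eta, radius, rng, b=0.2):
    """Two-phase sketch: burn the first b*T samples to build the empirical
    estimate c_hat of a linear constraint <c, w> <= gamma, then run projected
    gradient descent on f^0 over the approximate domain
    W' = ball  intersected with  {w : <c_hat, w> <= gamma + slack}."""
    T = len(c_samples)
    bT = int(b * T)
    c_hat = np.mean(c_samples[:bT], axis=0)   # estimated constraint direction
    w = np.zeros_like(c_hat)
    for t in range(bT, T):
        w = w - eta * f0_grad(w, rng)
        excess = c_hat @ w - (gamma + slack)
        if excess > 0:                        # project onto the half-space
            w = w - excess * c_hat / (c_hat @ c_hat)
        nrm = np.linalg.norm(w)
        if nrm > radius:                      # project onto the ball
            w = w * radius / nrm
    return w, c_hat

rng = np.random.default_rng(0)
c_samples = rng.normal(loc=[1.0, 0.0], scale=0.05, size=(500, 2))
f0_grad = lambda w, rng: 2 * (w - np.array([2.0, 0.0])) + 0.1 * rng.standard_normal(2)
w_hat, c_hat = burn_in_then_project(f0_grad, c_samples, gamma=1.0, slack=0.05,
                                    eta=0.05, radius=5.0, rng=rng)
# The unconstrained minimizer (2, 0) is infeasible, so w_hat settles near the
# estimated boundary <c_hat, w> ≈ gamma + slack.
```

Note that the closed-form half-space projection is exactly what is lost for general convex constraints, and the `slack` parameter plays the role of the relaxation constant $\hat\gamma_i$.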
We finally remark on the uniform convergence assumption, which holds when the constraint functions are linear [25] but unfortunately does not hold for general convex Lipschitz functions. In particular, one can give simple examples where there is no uniform convergence for stochastic convex Lipschitz functions in infinite dimensional spaces [21]. Without the uniform convergence assumption, the approximate domain $W'$ may depart from the true $W$ significantly at some unknown point, which causes the above approach to fail for general convex objectives.

To address these limitations, and in particular the dependence on the uniform convergence assumption, we present an algorithm that does not require projection when updating the solution and does not require imposing any additional assumption on the stochastic functions except for the standard Lipschitz continuity assumption. We note that our result is closely related to the recent studies of learning from the viewpoint of optimization [23], which state that solutions found by stochastic gradient descent can be statistically consistent even when the uniform convergence theorem does not hold.

Algorithm 1 Stochastic Primal-Dual Optimization with Multiple Objectives
1: INPUT: step size $\eta$, $\lambda_0 = (\lambda^1_0, \cdots, \lambda^m_0)$ with $\lambda^i_0 > 0, i \in [m]$, and total iterations $T$
2: $w_1 = \lambda_1 = 0$
3: for $t = 1, \ldots, T$ do
4:   Submit the solution $w_t$
5:   Receive loss functions $f^i_t, i = 0, 1, \ldots, m$
6:   Compute the gradients $\nabla f^i_t(w_t), i = 0, 1, \ldots, m$
7:   Update the solution $w$ and $\lambda$ by
$$w_{t+1} = \Pi_B\left( w_t - \eta \nabla_w L_t(w_t, \lambda_t) \right) = \Pi_B\left( w_t - \eta \left[ \nabla f^0_t(w_t) + \sum_{i=1}^m \lambda^i_t \nabla f^i_t(w_t) \right] \right),$$
$$\lambda^i_{t+1} = \Pi_{[0, \lambda^i_0]}\left( \lambda^i_t + \eta \nabla_{\lambda^i} L_t(w_t, \lambda_t) \right) = \Pi_{[0, \lambda^i_0]}\left( \lambda^i_t + \eta \left[ f^i_t(w_t) - \gamma_i \right] \right).$$
8: end for
9: Return $\widehat{w}_T = \sum_{t=1}^T w_t / T$

4 An Efficient Stochastic Primal-Dual Algorithm

We now turn to devising a tractable formulation of the problem, followed by an efficient primal-dual optimization algorithm and the statements of our main results. We show that, with a high probability, the solution found by the proposed algorithm will exactly satisfy the expected constraints and achieves a regret bound of $O(\sqrt{T})$. The main idea of the proposed algorithm is to design an appropriate objective that combines the loss function $\bar f^0(w)$ with $\bar f^i(w), i \in [m]$. As mentioned before, owing to the presence of conflicting goals and the random nature of the objective functions, we seek a solution that satisfies all the objectives instead of an optimal one. To this end, we define the following objective function
$$\bar L(w, \lambda) = \bar f^0(w) + \sum_{i=1}^m \lambda_i \left( \bar f^i(w) - \gamma_i \right).$$
Note that the objective function consists of both the primal variable $w \in W$ and the dual variable $\lambda = (\lambda_1, \ldots, \lambda_m)^\top \in \Lambda$, where $\Lambda \subseteq \mathbb{R}^m_+$ is a compact convex set that bounds the set of dual variables and will be discussed later. In the proposed algorithm, we simultaneously update solutions for both $w$ and $\lambda$. By exploiting convex-concave optimization theory [16], we will show that, with a high probability, the solution of regret $O(\sqrt{T})$ exactly obeys the constraints.
As the first step, we consider a simple scenario where the obtained solution is allowed to violate the constraints. The detailed steps of our primal-dual algorithm are presented in Algorithm 1. It follows the same procedure as convex-concave optimization. Since at each iteration we only observe randomly sampled loss functions
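The updates of Algorithm 1 can be sketched in a few lines of code. The following is our own minimal rendering on a toy instance, not the authors' code; the function `primal_dual_mo`, the toy objectives, and all parameter values are assumptions for illustration:

```python
import numpy as np

def primal_dual_mo(obj_grad, cons_grads, cons_vals, gammas, lam0, eta, T, radius, w0, rng):
    """Sketch of Algorithm 1: stochastic gradient descent on w and projected
    gradient ascent on the duals for
        L_t(w, lam) = f^0_t(w) + sum_i lam_i (f^i_t(w) - gamma_i),
    with w projected onto the ball B of the given radius and each lam_i clipped
    to [0, lam0_i].  Returns the averaged solution w_hat_T."""
    w = np.asarray(w0, dtype=float).copy()
    lam = np.zeros(len(gammas))
    w_sum = np.zeros_like(w)
    for _ in range(T):
        # stochastic gradients of L_t at the current pair (w_t, lam_t)
        g_w = obj_grad(w, rng) + sum(l * g(w, rng) for l, g in zip(lam, cons_grads))
        g_lam = np.array([v(w, rng) for v in cons_vals]) - np.asarray(gammas)
        # primal descent step, projected onto B
        w = w - eta * g_w
        nrm = np.linalg.norm(w)
        if nrm > radius:
            w = w * radius / nrm
        # dual ascent step, projected onto [0, lam0_i]
        lam = np.clip(lam + eta * g_lam, 0.0, lam0)
        w_sum += w
    return w_sum / T

# Toy instance: minimize E||w - (2, 0)||^2 subject to E[w_0] <= 1; optimum (1, 0).
rng = np.random.default_rng(0)
obj_grad = lambda w, rng: 2 * (w - np.array([2.0, 0.0])) + 0.1 * rng.standard_normal(2)
cons_grads = [lambda w, rng: np.array([1.0, 0.0])]
cons_vals = [lambda w, rng: w[0] + 0.1 * rng.standard_normal()]
w_hat = primal_dual_mo(obj_grad, cons_grads, cons_vals, gammas=[1.0],
                       lam0=np.array([5.0]), eta=0.05, T=2000,
                       radius=5.0, w0=np.zeros(2), rng=rng)
```

The dual variable rises while the constraint is violated and pushes the primal iterate back toward the feasible region; no projection onto the (unknown) expected-constraint set is ever needed, which is the point of the method.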
Since at each iteration, we only observed a\nrandomly sampled loss functions f i\n\nt (w); i = 0; 1; : : : ; m, the objective function given by\n\nT ) exactly obeyes the constraints.\n\nLt(w; (cid:21)) = f 0\n\nt (w) +\n\n(cid:21)i(f i\n\nt (w) (cid:0) (cid:13)i)\n\nprovides an unbiased estimate of (cid:22)L(w; (cid:21)). Given the approximate objective Lt(w; (cid:21)), the pro-\nposed algorithm tries to minimize the objective Lt(w; (cid:21)) with respect to the primal variable w and\nmaximize the objective with respect to the dual variable (cid:21).\nTo facilitate the analysis, we \ufb01rst rewrite the the constrained optimization problem\n\ni=1\n\n{\nw : (cid:22)f i(w) (cid:20) (cid:13)i; i = 1; : : : m\n\nw2B\\W\n\n(cid:22)f 0(w)\n\nmin\n\n}\n\nm\u2211\n\nwhere W is de\ufb01ned as W =\n\nin the following equivalent form:\n\nWe denote by w(cid:3) and (cid:21)(cid:3) = ((cid:21)1(cid:3); : : : ; (cid:21)m(cid:3) )\nconvex-concave optimization problem, respectively, i.e.,\n\nw2B max\nmin\n(cid:21)2Rm\n\n+\n\n(cid:22)f 0(w) +\n\n(2)\n\u22a4 as the optimal primal and dual solutions to the above\n\ni=1\n\n(cid:21)i( (cid:22)f i(w) (cid:0) (cid:13)i):\nm\u2211\n(cid:21)i(cid:3)( (cid:22)f i(w) (cid:0) (cid:13)i);\nm\u2211\n\ni=1\n\n(cid:21)i( (cid:22)f i(w(cid:3)) (cid:0) (cid:13)i):\n\nw(cid:3) = arg min\n\nw2B\n\n(cid:22)f 0(w) +\n\n(cid:21)(cid:3) = arg max\n(cid:21)2Rm\n\n+\n\n(cid:22)f 0(w(cid:3)) +\n\ni=1\n\n5\n\n(3)\n\n(4)\n\n\fThe following assumption establishes upper bound on the gradients of L(w; (cid:21)) with respect to w\nand (cid:21). We later show that this assumption holds under a mild condition on the objective functions.\nAssumption 2 (Gradient Boundedness). 
The gradients $\nabla_w L(w, \lambda)$ and $\nabla_\lambda L(w, \lambda)$ are uniformly bounded, i.e., there exists a constant $G > 0$ such that
$$\max\left( \|\nabla_w L(w, \lambda)\|, \|\nabla_\lambda L(w, \lambda)\| \right) \leq G \quad \text{for any } w \in B \text{ and } \lambda \in \Lambda.$$
Under the preceding assumption, in the following theorem we show that, under appropriate conditions, the average solution $\widehat{w}_T$ generated by Algorithm 1 attains a convergence rate of $O(1/\sqrt{T})$ for both the regret and the violation of the constraints.
Theorem 1. Set $\lambda^i_0 \geq \lambda^i_* + \theta, i \in [m]$, where $\theta > 0$ is a constant. Let $\widehat{w}_T$ be the solution obtained by Algorithm 1 after $T$ iterations. Then, with a probability $1 - (2m + 1)\delta$, we have
$$\bar f^0(\widehat{w}_T) - \bar f^0(w_*) \leq \frac{\mu(\delta)}{\sqrt{T}} \quad \text{and} \quad \bar f^i(\widehat{w}_T) - \gamma_i \leq \frac{\mu(\delta)}{\theta \sqrt{T}}, \ i \in [m],$$
where $D^2 = \sum_{i=1}^m [\lambda^i_0]^2$, $\eta = \sqrt{(R^2 + D^2)/(2T)}/G$, and
$$\mu(\delta) = \sqrt{2}\, G \sqrt{R^2 + D^2} + 2G(R + D)\sqrt{2 \ln \frac{1}{\delta}}. \quad (5)$$
Remark 2. The parameter $\theta \in \mathbb{R}_+$ is a quantity that may be set to obtain a sharper upper bound on the violation of the constraints and may be chosen arbitrarily. In particular, a larger value for $\theta$ imposes a larger penalty on the violation of the constraints and results in a smaller violation for the objectives.

We can also develop an algorithm that allows the solution to exactly satisfy all the constraints. To this end, we define $\hat\gamma_i = \gamma_i - \mu'(\delta)/(\theta\sqrt{T})$. We will run Algorithm 1 but with $\gamma_i$ replaced by $\hat\gamma_i$. Let $G'$ denote the upper bound in Assumption 2 for $\nabla_\lambda L(w, \lambda)$ with $\gamma_i$ replaced by $\hat\gamma_i, i \in [m]$. The following theorem shows the property of the obtained average solution $\widehat{w}_T$.
Theorem 2.
Let $\widehat{w}_T$ be the solution obtained by Algorithm 1 with $\gamma_i$ replaced by $\hat\gamma_i$ and $\lambda^i_0 = \lambda^i_* + \theta, i \in [m]$. Then, with a probability $1 - (2m + 1)\delta$, we have
$$\bar f^0(\widehat{w}_T) - \bar f^0(w_*) \leq \left( 1 + \frac{1}{\theta} \sum_{i=1}^m \lambda^i_0 \right) \frac{\mu'(\delta)}{\sqrt{T}} \quad \text{and} \quad \bar f^i(\widehat{w}_T) \leq \gamma_i, \ i \in [m],$$
where $\mu'(\delta)$ is the same as (5) with $G$ replaced by $G'$, and $\eta = \sqrt{(R^2 + D^2)/(2T)}/G'$.

4.1 Convergence Analysis

Here we provide the proofs of the main theorems stated above. We start by proving Theorem 1 and then extend it to prove Theorem 2.

Proof. (of Theorem 1) Using the standard analysis of convex-concave optimization, from the convexity of $\bar L(\cdot, \lambda)$ with respect to $w$ and the concavity of $\bar L(w, \cdot)$ with respect to $\lambda$, for any $w \in B$ and $\lambda_i \in [0, \lambda^i_0], i \in [m]$, we have
$$\begin{aligned}
\bar L(w_t, \lambda) - \bar L(w, \lambda_t)
&\leq \left\langle w_t - w, \nabla_w \bar L(w_t, \lambda_t)\right\rangle - \left\langle \lambda_t - \lambda, \nabla_\lambda \bar L(w_t, \lambda_t)\right\rangle \\
&= \left\langle w_t - w, \nabla_w L_t(w_t, \lambda_t)\right\rangle - \left\langle \lambda_t - \lambda, \nabla_\lambda L_t(w_t, \lambda_t)\right\rangle \\
&\quad + \left\langle w_t - w, \nabla_w \bar L(w_t, \lambda_t) - \nabla_w L_t(w_t, \lambda_t)\right\rangle - \left\langle \lambda_t - \lambda, \nabla_\lambda \bar L(w_t, \lambda_t) - \nabla_\lambda L_t(w_t, \lambda_t)\right\rangle \\
&\leq \frac{\|w_t - w\|^2 - \|w_{t+1} - w\|^2}{2\eta} + \frac{\|\lambda_t - \lambda\|^2 - \|\lambda_{t+1} - \lambda\|^2}{2\eta} + \frac{\eta}{2}\left( \|\nabla_w L_t(w_t, \lambda_t)\|^2 + \|\nabla_\lambda L_t(w_t, \lambda_t)\|^2 \right) \\
&\quad + \left\langle w_t - w, \nabla_w \bar L(w_t, \lambda_t) - \nabla_w L_t(w_t, \lambda_t)\right\rangle - \left\langle \lambda_t - \lambda, \nabla_\lambda \bar L(w_t, \lambda_t) - \nabla_\lambda L_t(w_t, \lambda_t)\right\rangle,
\end{aligned}$$
where in the first inequality we have added and subtracted the stochastic gradients used for updating the solutions, and the last inequality follows from the updating rules for $w_{t+1}$ and $\lambda_{t+1}$ and the non-expansiveness property of the orthogonal projection onto a convex domain.
By adding all the inequalities together, we get
$$\begin{aligned}
\sum_{t=1}^T \bar L(w_t, \lambda) - \bar L(w, \lambda_t)
&\leq \frac{\|w - w_1\|^2 + \|\lambda - \lambda_1\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^T \left( \|\nabla_w L_t(w_t, \lambda_t)\|^2 + \|\nabla_\lambda L_t(w_t, \lambda_t)\|^2 \right) \\
&\quad + \sum_{t=1}^T \left\langle w_t - w, \nabla_w \bar L(w_t, \lambda_t) - \nabla_w L_t(w_t, \lambda_t)\right\rangle - \sum_{t=1}^T \left\langle \lambda_t - \lambda, \nabla_\lambda \bar L(w_t, \lambda_t) - \nabla_\lambda L_t(w_t, \lambda_t)\right\rangle \\
&\leq \frac{R^2 + D^2}{2\eta} + \eta G^2 T + \sum_{t=1}^T \left\langle w_t - w, \nabla_w \bar L(w_t, \lambda_t) - \nabla_w L_t(w_t, \lambda_t)\right\rangle - \sum_{t=1}^T \left\langle \lambda_t - \lambda, \nabla_\lambda \bar L(w_t, \lambda_t) - \nabla_\lambda L_t(w_t, \lambda_t)\right\rangle \\
&\leq \frac{R^2 + D^2}{2\eta} + \eta G^2 T + 2G(R + D)\sqrt{2T \ln \frac{1}{\delta}} \quad \text{(w.p. } 1 - \delta\text{)}, \quad (6)
\end{aligned}$$
where the last inequality follows from the Hoeffding inequality for martingales [6]. By expanding the left hand side, substituting the stated value of $\eta$, and applying Jensen's inequality to the average solutions $\widehat{w}_T = \sum_{t=1}^T w_t / T$ and $\widehat{\lambda}_T = \sum_{t=1}^T \lambda_t / T$, for any fixed $\lambda_i \in [0, \lambda^i_0], i \in [m]$ and $w \in B$, with a probability $1 - \delta$, we have
$$\bar f^0(\widehat{w}_T) + \sum_{i=1}^m \lambda_i \left( \bar f^i(\widehat{w}_T) - \gamma_i \right) - \bar f^0(w) - \sum_{i=1}^m \widehat{\lambda}^i_T \left( \bar f^i(w) - \gamma_i \right) \leq \sqrt{2}\, G \sqrt{\frac{R^2 + D^2}{T}} + 2G(R + D)\sqrt{\frac{2}{T} \ln \frac{1}{\delta}}.$$
By fixing $w = w_*$ and $\lambda = 0$ in (6), and using $\bar f^i(w_*) \leq \gamma_i, i \in [m]$, we have, with a probability $1 - \delta$,
$$\bar f^0(\widehat{w}_T) \leq \bar f^0(w_*) + \sqrt{2}\, G \sqrt{\frac{R^2 + D^2}{T}} + 2G(R + D)\sqrt{\frac{2}{T} \ln \frac{1}{\delta}}.$$
To bound the violation of the constraints, we set $w = w_*$, $\lambda_i = \lambda^i_0$ for a fixed $i \in [m]$, and $\lambda_j = \lambda^j_*, j \neq i$ in (6). We have
$$\begin{aligned}
&\bar f^0(\widehat{w}_T) + \lambda^i_0 \left( \bar f^i(\widehat{w}_T) - \gamma_i \right) + \sum_{j \neq i} \lambda^j_* \left( \bar f^j(\widehat{w}_T) - \gamma_j \right) - \bar f^0(w_*) - \sum_{i=1}^m \widehat{\lambda}^i_T \left( \bar f^i(w_*) - \gamma_i \right) \\
&\geq \bar f^0(\widehat{w}_T) + \lambda^i_0 \left( \bar f^i(\widehat{w}_T) - \gamma_i \right) + \sum_{j \neq i} \lambda^j_* \left( \bar f^j(\widehat{w}_T) - \gamma_j \right) - \bar f^0(w_*) - \sum_{i=1}^m \lambda^i_* \left( \bar f^i(w_*) - \gamma_i \right) \\
&\geq \theta \left( \bar f^i(\widehat{w}_T) - \gamma_i \right),
\end{aligned}$$
where the first inequality utilizes (4) and the second inequality utilizes (3). We thus have, with a probability $1 - \delta$,
$$\bar f^i(\widehat{w}_T) - \gamma_i \leq \frac{\sqrt{2}\, G}{\theta} \sqrt{\frac{R^2 + D^2}{T}} + \frac{2G(R + D)}{\theta} \sqrt{\frac{2}{T} \ln \frac{1}{\delta}}, \quad i \in [m].$$
We complete the proof by taking a union bound over all the random events.

We now turn to the proof of Theorem 2, which gives a high probability bound on the convergence of the modified algorithm that obeys all the constraints.

Proof.
(of Theorem 2) Following the proof of Theorem 1, with a probability $1 - \delta$, we have
\[
\bar{f}_0(\widehat{w}_T) + \sum_{i=1}^m \lambda_i \left( \bar{f}_i(\widehat{w}_T) - \widehat{\gamma}_i \right) - \bar{f}_0(w) - \sum_{i=1}^m \widehat{\lambda}^i_T \left( \bar{f}_i(w) - \widehat{\gamma}_i \right)
\le \frac{2G'\sqrt{R^2 + D^2}}{\sqrt{T}} + 2G'(R + D)\sqrt{\frac{2}{T} \ln \frac{1}{\delta}}.
\]
Define $\widetilde{w}_*$ and $\widetilde{\lambda}_*$ to be the saddle point of the following minimax optimization problem:
\[
\min_{w \in \mathcal{B}} \max_{\lambda \in \mathbb{R}_+^m} \ \bar{f}_0(w) + \sum_{i=1}^m \lambda_i \left( \bar{f}_i(w) - \widehat{\gamma}_i \right).
\]
Following the same analysis as in Theorem 1, for each $i \in [m]$, by setting $w = \widetilde{w}_*$, $\lambda_i = \lambda_i^0$, and $\lambda_j = \widetilde{\lambda}_*^j$, and using the fact that $\widetilde{\lambda}_*^j \le \lambda_*^j$, we have, with a probability $1 - \delta$,
\[
\theta \left( \bar{f}_i(\widehat{w}_T) - \gamma_i \right) \le \frac{2G'\sqrt{R^2 + D^2}}{\sqrt{T}} + 2G'(R + D)\sqrt{\frac{2}{T} \ln \frac{1}{\delta}} - \frac{\mu(\delta)}{\sqrt{T}} \le 0,
\]
which completes the proof.

4.2 Implementation Issues

In order to run Algorithm 1, we need to estimate the parameters $\lambda_i^0$, $i \in [m]$, which requires determining the set $\Lambda$ by estimating an upper bound on the optimal dual variables $\lambda_*^i$, $i \in [m]$. To this end, we consider an alternative to the convex-concave optimization problem in (2), i.e.,
\[
\min_{w \in \mathcal{B}} \max_{\lambda \ge 0} \ \bar{f}_0(w) + \lambda \max_{1 \le i \le m} \left( \bar{f}_i(w) - \gamma_i \right). \tag{7}
\]
Evidently $w_*$ is the optimal primal solution to (7). Let $\lambda_a$ be the optimal dual solution to the problem in (7). The following proposition links $\lambda_*^i$, $i \in [m]$, the optimal dual solution to (2), with $\lambda_a$, the optimal dual solution to (7).

Proposition 1.
Let $\lambda_a$ be the optimal dual solution to (7) and $\lambda_*^i$, $i \in [m]$, be the optimal dual solution to (2). We have $\lambda_a = \sum_{i=1}^m \lambda_*^i$.

Proof. We can rewrite (7) as $\min_{w \in \mathcal{B}} \max_{\lambda \ge 0, \, p \in \Delta_m} \bar{f}_0(w) + \sum_{i=1}^m p_i \lambda \left( \bar{f}_i(w) - \gamma_i \right)$, where the domain $\Delta_m$ is defined as $\Delta_m = \{\alpha \in \mathbb{R}_+^m : \sum_{i=1}^m \alpha_i = 1\}$. By redefining $\lambda_i = p_i \lambda$, the problem in (7) becomes equivalent to (2), and consequently $\lambda_a = \sum_{i=1}^m \lambda_i$ as claimed.

Given the result of Proposition 1, it is sufficient to bound $\lambda_a$. In order to do so, we need to make a certain assumption about $\bar{f}_i(w)$, $i \in [m]$; the purpose of this assumption is to ensure that the optimal dual variable is well bounded from above.

Assumption 3. We assume $\min_{\alpha \in \Delta_m} \left\| \sum_{i=1}^m \alpha_i \nabla \bar{f}_i(w) \right\| \ge \tau$, where $\tau > 0$ is a constant.

Equipped with Assumption 3, we are able to bound $\lambda_a$ by $L/\tau$. Using the first-order optimality condition of (2) [7], we have $\lambda_a = \|\nabla \bar{f}_0(w_*)\| / \|\partial g(w)\|$, where $g(w) = \max_{1 \le i \le m} \bar{f}_i(w)$. Since $\partial g(w) \in \left\{ \sum_{i=1}^m \alpha_i \nabla \bar{f}_i(w) : \alpha \in \Delta_m \right\}$, under Assumption 3 we have $\lambda_a \le L/\tau$. By combining Proposition 1 with this upper bound on $\lambda_a$, we obtain $\lambda_*^i \le L/\tau$, $i \in [m]$, as desired.

Finally, we note that once $\lambda_*$ is bounded, Assumption 2 is guaranteed by setting
\[
G^2 = \max\left( L^2 \Big( 1 + \sum_{i=1}^m \lambda_i^0 \Big)^2, \ \max_{w \in \mathcal{B}} \sum_{i=1}^m \left( \bar{f}_i(w) - \gamma_i \right)^2 \right),
\]
which follows from the Lipschitz continuity of the objective functions.
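To make the role of the cap $\lambda_i^0 = L/\tau$ concrete, the following is a minimal numerical sketch of primal-dual stochastic updates of the kind analyzed above: projected gradient descent on $w$ over the ball $\mathcal{B}$ and projected gradient ascent on $\lambda$ over $\Lambda = [0, \lambda_i^0]$. The problem instance, noise model, and all constants (`L`, `tau`, `eta`, `R`, the objectives `f0`, `f1`) are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative instance: minimize f0(w) = ||w||^2 / 2 subject to
# f1(w) = w[0] <= gamma_1, with noisy first-order information.
rng = np.random.default_rng(0)

R = 1.0                  # radius of the domain B = {w : ||w|| <= R}
gamma = np.array([0.2])  # thresholds gamma_i
L, tau = 1.0, 0.5        # Lipschitz constant and the constant tau in Assumption 3
lam_max = L / tau        # cap lambda_i^0 = L / tau on the dual variables
T = 2000
eta = 1.0 / np.sqrt(T)   # step size of order 1/sqrt(T)

w = np.zeros(2)
lam = np.zeros(1)
w_sum = np.zeros(2)

for t in range(T):
    # Stochastic gradients: true gradients plus zero-mean noise.
    g0 = w + 0.1 * rng.standard_normal(2)                     # grad of f0 at w
    g1 = np.array([1.0, 0.0]) + 0.1 * rng.standard_normal(2)  # grad of f1 at w
    f1_val = w[0] + 0.1 * rng.standard_normal()               # noisy f1(w)

    # Primal descent on the Lagrangian, projected back onto the ball B.
    w = w - eta * (g0 + lam[0] * g1)
    nrm = np.linalg.norm(w)
    if nrm > R:
        w = w * (R / nrm)

    # Dual ascent on lambda, clipped to Lambda = [0, lambda^0].
    lam = np.clip(lam + eta * (f1_val - gamma), 0.0, lam_max)

    w_sum += w

w_avg = w_sum / T  # the averaged solution, playing the role of \hat{w}_T
print(w_avg)
```

Without the cap `lam_max`, a single badly violated noisy constraint evaluation could drive the dual variable arbitrarily high and destabilize the primal step; the bound $L/\tau$ derived above is what licenses the clipping.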
In a similar way, we can set $G'$ in Theorem 2 by replacing $\gamma_i$ with $\widehat{\gamma}_i$.

5 Conclusions and Open Questions

In this paper we have addressed the problem of stochastic convex optimization with multiple objectives, which underlies many applications in machine learning. We first examined a simple problem-reduction technique that eliminates the stochastic aspect of the constraint functions by approximating them with the functions sampled at each iteration. We showed that this simple idea fails to attain the optimal convergence rate and requires imposing a strong assumption, i.e., uniform convergence, on the objective functions. We then presented a novel, efficient primal-dual algorithm that attains the optimal convergence rate $O(1/\sqrt{T})$ for all the objectives, relying only on the Lipschitz continuity of the objective functions. This work leaves a few directions for further elaboration. In particular, it would be interesting to see whether or not making stronger assumptions on the analytical properties of the objective functions, such as smoothness or strong convexity, may yield improved convergence rates.

Acknowledgments. The authors would like to thank the anonymous reviewers for their helpful and insightful comments. The work of M. Mahdavi and R. Jin was supported in part by ONR Award N000141210431 and NSF (IIS-1251031).

References
[1] F. B. Abdelaziz. Solution approaches for the multiobjective stochastic programming. European Journal of Operational Research, 216(1):1–16, 2012.
[2] F. B. Abdelaziz, B. Aouni, and R. E. Fayedh.
Multi-objective stochastic programming for portfolio selection. European Journal of Operational Research, 177(3):1811–1823, 2007.
[3] A. Agarwal, P. L. Bartlett, P. D. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.
[4] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In NIPS, pages 451–459, 2011.
[5] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
[6] S. Boucheron, G. Lugosi, and O. Bousquet. Concentration inequalities. In Advanced Lectures on Machine Learning, pages 208–240, 2003.
[7] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[8] R. Caballero, E. Cerdá, M. del Mar Muñoz, and L. Rey. Stochastic approach versus multiobjective approach for obtaining efficient solutions in stochastic multiobjective programming problems. European Journal of Operational Research, 158(3):633–648, 2004.
[9] M. Ehrgott. Multicriteria Optimization. Springer, 2005.
[10] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. Journal of Machine Learning Research - Proceedings Track, 19:421–436, 2011.
[11] K.-J. Hsiao, K. S. Xu, J. Calder, and A. O. Hero III. Multi-criteria anomaly detection using Pareto depth analysis. In NIPS, pages 854–862, 2012.
[12] Y. Jin and B. Sendhoff. Pareto-based multiobjective machine learning: An overview and case studies. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 38(3):397–415, 2008.
[13] M. Mahdavi, R. Jin, and T. Yang. Trading regret for efficiency: online convex optimization with long term constraints. JMLR, 13:2465–2490, 2012.
[14] S. Mannor, J. N.
Tsitsiklis, and J. Y. Yu. Online learning with sample path constraints. Journal of Machine Learning Research, 10:569–590, 2009.
[15] H. Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.
[16] A. Nemirovski. Efficient methods in convex programming. Lecture Notes, available at http://www2.isye.gatech.edu/~nemirovs, 1994.
[17] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. on Optimization, 19:1574–1609, 2009.
[18] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
[19] P. Rigollet and X. Tong. Neyman-Pearson classification, convexity and stochastic constraints. The Journal of Machine Learning Research, 12:2831–2855, 2011.
[20] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
[21] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In COLT, 2009.
[22] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In ICML, pages 807–814, 2007.
[23] K. Sridharan. Learning from an optimization viewpoint. PhD thesis, 2012.
[24] K. M. Svore, M. N. Volkovs, and C. J. Burges. Learning to rank with multiple objective functions. In WWW, pages 367–376. ACM, 2011.
[25] H. Xu and F. Meng. Convergence analysis of sample average approximation methods for a class of stochastic mathematical programs with equality constraints. Mathematics of Operations Research, 32(3):648–668, 2007.
[26] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent.
In\n\nICML, pages 928\u2013936, 2003.\n\n9\n\n\f", "award": [], "sourceid": 591, "authors": [{"given_name": "Mehrdad", "family_name": "Mahdavi", "institution": "Michigan State University (MSU)"}, {"given_name": "Tianbao", "family_name": "Yang", "institution": "NEC Labs America"}, {"given_name": "Rong", "family_name": "Jin", "institution": "Michigan State University (MSU)"}]}