{"title": "Robust Optimization for Non-Convex Objectives", "book": "Advances in Neural Information Processing Systems", "page_first": 4705, "page_last": 4714, "abstract": "We consider robust optimization problems, where the goal is to optimize in the worst case over a class of objective functions. We develop a reduction from robust improper optimization to stochastic optimization: given an oracle that returns $\\alpha$-approximate solutions for distributions over objectives, we compute a distribution over solutions that is $\\alpha$-approximate in the worst case. We show that derandomizing this solution is NP-hard in general, but can be done for a broad class of statistical learning tasks. We apply our results to robust neural network training and submodular optimization. We evaluate our approach experimentally on corrupted character classification and robust influence maximization in networks.", "full_text": "Robust Optimization for Non-Convex Objectives\n\nRobert Chen\nComputer Science\nHarvard University\n\nBrendan Lucier\nMicrosoft Research New England\n\nYaron Singer\nComputer Science\nHarvard University\n\nVasilis Syrgkanis\nMicrosoft Research New England\n\nAbstract\n\nWe consider robust optimization problems, where the goal is to optimize in the\nworst case over a class of objective functions. We develop a reduction from\nrobust improper optimization to stochastic optimization: given an oracle that\nreturns $\\alpha$-approximate solutions for distributions over objectives, we compute a\ndistribution over solutions that is $\\alpha$-approximate in the worst case. We show that\nderandomizing this solution is NP-hard in general, but can be done for a broad\nclass of statistical learning tasks. We apply our results to robust neural network\ntraining and submodular optimization. 
We evaluate our approach experimentally on\ncorrupted character classification and robust influence maximization in networks.\n\n1 Introduction\n\nIn many learning tasks we face uncertainty about the loss we aim to optimize. Consider, for example,\na classification task such as character recognition, required to perform well under various types of\ndistortion. In some environments, such as recognizing characters in photos, the classifier must handle\nrotation and patterned backgrounds. In a different environment, such as low-resolution images, it\nis more likely to encounter noisy pixelation artifacts. Instead of training a separate classifier for\neach possible scenario, one seeks to optimize performance in the worst case over different forms of\ncorruption (or combinations thereof) made available to the trainer as black-boxes.\n\nMore generally, our goal is to find a minimax solution that optimizes in the worst case over a given\nfamily of functions. Even if each individual function can be optimized effectively, it is not clear such\nsolutions would perform well in the worst case. In many cases of interest, individual objectives are\nnon-convex and hence state-of-the-art methods are only approximate. In stochastic optimization,\nwhere one must optimize a distribution over loss functions, approximate stochastic optimization is\noften straightforward, since loss functions are commonly closed under convex combination. Can\napproximately optimal stochastic solutions yield an approximately optimal robust solution?\n\nIn this paper we develop a reduction from robust optimization to stochastic optimization. Given an\n$\\alpha$-approximate oracle for stochastic optimization we show how to implement an $\\alpha$-approximate solution\nfor robust optimization under a necessary extension, and illustrate its effectiveness in applications.\n\nMain Results. 
Given an $\\alpha$-approximate stochastic oracle for distributions over (potentially non-convex)\nloss functions, we show how to solve $\\alpha$-approximate robust optimization in a convexified\nsolution space. This outcome is \u201cimproper\u201d in the sense that it may lie outside the original solution\nspace, if the space is non-convex. This can be interpreted as computing a distribution over solutions.\nWe show that the relaxation to improper learning is necessary in general: It is NP-hard to achieve\nrobust optimization with respect to the original outcome space, even if stochastic optimization can be\nsolved exactly, and even if there are only polynomially many loss functions. We complement this\nby showing that in any statistical learning scenario where loss is convex in the predicted dependent\nvariable, we can find a single (deterministic) solution with matching performance guarantees.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nTechnical overview. Our approach employs an execution of no-regret dynamics on a zero-sum\ngame, played between a learner equipped with an $\\alpha$-approximate stochastic oracle, and an adversary\nwho aims to find a distribution over loss functions that maximizes the learner\u2019s loss. This game\nconverges to an approximately robust solution, in which the learner and adversary settle upon an\n$\\alpha$-approximate minimax solution. This convergence is subject to an additive regret term that converges\nat a rate of $T^{-1/2}$ over $T$ rounds of the learning dynamics.\n\nApplications. We illustrate the power of our reduction through two main examples. We first\nconsider statistical learning via neural networks. Given an arbitrary training method, our reduction\ngenerates a net that optimizes robustly over a given class of loss functions. 
We evaluate our method\nexperimentally on a character recognition task, where the loss functions correspond to different\ncorruption models made available to the learner as black boxes. We verify experimentally that our\napproach significantly outperforms various baselines, including optimizing for average performance\nand optimizing for each loss separately. We also apply our reduction to influence maximization,\nwhere the goal is to maximize a concave function (the independent cascade model of influence\n[11]) over a non-convex space (subsets of vertices in a network). Previous work has studied robust\ninfluence maximization directly [9, 5, 15], focusing on particular, natural classes of functions (e.g.,\nedge weights chosen within a given range) and establishing hardness and approximation results.\nIn comparison, our method is agnostic to the particular class of functions, and achieves a strong\napproximation result by returning a distribution over solutions. We evaluate our method on real and\nsynthetic datasets, with the goal of robustly optimizing a suite of random influence instantiations. We\nverify experimentally that our approach significantly outperforms natural baselines.\n\nRelated work. There has recently been a great deal of interest in robust optimization in machine\nlearning [20, 4, 17, 21, 16]. For continuous optimization, the work that is closest to ours is perhaps\nthat by Shalev-Shwartz and Wexler [20] and Namkoong and Duchi [17], who use robust optimization\nto train against convex loss functions. The main difference is that we assume a more general setting\nin which the loss functions are non-convex and one is only given access to the stochastic oracle.\nHence, the proof techniques and general results from these papers do not apply to our setting. We\nnote that our result generalizes these works, as they can be considered as the special case in which\nwe have a distributional oracle whose approximation is optimal. 
In particular, [20, Theorem 1]\napplies to the realizable statistical learning setting where the oracle has small mistake bound $C$. Our\napplications require a more general framing that holds for any optimization setting with access to\nan approximate oracle, and approximation is in the multiplicative sense with respect to the optimal\nvalue. In submodular optimization there has been a great deal of interest in robust optimization as\nwell [12, 13, 10, 6]. The work closest to ours is that by He and Kempe [10] who consider a slightly\ndifferent objective than ours. He and Kempe\u2019s results apply to influence but do not extend to general\nsubmodular functions. Finally, we note that unlike recent work on non-convex optimization [7, 1, 8]\nour goal in this paper is not to optimize a non-convex function. Rather, we abstract the non-convex\nguarantees via the approximate stochastic oracle.\n\n2 Robust Optimization with Approximate Stochastic Oracles\n\nWe consider the following model of optimization that is robust to objective uncertainty. There is a\nspace $\\mathcal{X}$ over which to optimize, and a finite set of loss functions$^1$ $\\mathcal{L} = \\{L_1, \\ldots, L_m\\}$ where each\n$L_i \\in \\mathcal{L}$ is a function from $\\mathcal{X}$ to $[0, 1]$. Intuitively, our goal is to find some $x \\in \\mathcal{X}$ that achieves low\nloss in the worst-case over loss functions in $\\mathcal{L}$. For $x \\in \\mathcal{X}$, write $g(x) = \\max_{i \\in [m]} L_i(x)$ for the\nworst-case loss of $x$. The minimax optimum $\\tau$ is given by\n\n$$\\tau = \\min_{x \\in \\mathcal{X}} g(x) = \\min_{x \\in \\mathcal{X}} \\max_{i \\in [m]} L_i(x). \\qquad (1)$$\n\nThe goal of $\\alpha$-approximate robust optimization is to find $x$ such that $g(x) \\le \\alpha\\tau$.$^2$\n\n$^1$We describe an extension to infinite sets of loss functions in the full version of the paper. Our results also\nextend naturally to the goal of maximizing the minimum of a class of reward functions.\n\n$^2$This oracle framework is similar to that used by Ben-Tal et al. 
[3], but where the approximation is\nmultiplicative rather than additive.\n\nAlgorithm 1 Oracle Efficient Improper Robust Optimization\n\nInput: Objectives $\\mathcal{L} = \\{L_1, \\ldots, L_m\\}$, Apx stochastic oracle $M$, parameters $T, \\eta$\nfor each time step $t \\in [T]$ do\n  Set $w_t[i] \\propto \\exp\\left(\\eta \\sum_{\\tau=1}^{t-1} L_i(x_\\tau)\\right)$ (3)\n  Set $x_t = M(w_t)$\nend for\nOutput: the uniform distribution over $\\{x_1, \\ldots, x_T\\}$\n\nGiven a distribution $P$ over solutions $\\mathcal{X}$, write $g(P) = \\max_{i \\in [m]} \\mathbb{E}_{x \\sim P}[L_i(x)]$ for the worst-case\nexpected loss of a solution drawn from $P$. A weaker version of robust approximation is improper\nrobust optimization: find a distribution $P$ over $\\mathcal{X}$ such that $g(P) \\le \\alpha\\tau$.\nOur results take the form of reductions to an approximate stochastic oracle, which finds a solution\n$x \\in \\mathcal{X}$ that approximately minimizes a given distribution over loss functions.$^3$\n\nDefinition 1 ($\\alpha$-Approximate Stochastic Oracle). Given a distribution $D$ over $\\mathcal{L}$, an $\\alpha$-approximate\nstochastic oracle $M(D)$ computes $x^* \\in \\mathcal{X}$ such that\n\n$$\\mathbb{E}_{L \\sim D}[L(x^*)] \\le \\alpha \\min_{x \\in \\mathcal{X}} \\mathbb{E}_{L \\sim D}[L(x)]. \\qquad (2)$$\n\n2.1 Improper Robust Optimization with Oracles\n\nWe first show that, given access to an $\\alpha$-approximate stochastic oracle, it is possible to efficiently\nimplement improper $\\alpha$-approximate robust optimization, subject to a vanishing additive loss term.\n\nTheorem 1. Given access to an $\\alpha$-approximate stochastic oracle, Algorithm 1 with $\\eta = \\sqrt{\\frac{\\log(m)}{2T}}$\ncomputes a distribution $P$ over solutions, defined as a uniform distribution over a set $\\{x_1, \\ldots, x_T\\}$,\nso that\n\n$$\\max_{i \\in [m]} \\mathbb{E}_{x \\sim P}[L_i(x)] \\le \\alpha\\tau + \\sqrt{\\frac{2\\log(m)}{T}}. \\qquad (4)$$\n\nMoreover, for any $\\eta$ the distribution $P$ computed by Algorithm 1 satisfies:\n\n$$\\max_{i \\in [m]} \\mathbb{E}_{x \\sim P}[L_i(x)] \\le \\alpha(1 + \\eta)\\tau + \\frac{\\log(m)}{\\eta T}. \\qquad (5)$$\n\nProof. 
We give the proof of the first result and defer the second result to the full version of the paper.\nWe can interpret Algorithm 1 in the following way. We define a zero-sum game between a learner\nand an adversary. The learner\u2019s action set is equal to $\\mathcal{X}$ and the adversary\u2019s action set is equal to $[m]$.\nThe loss of the learner when he picks $x \\in \\mathcal{X}$ and the adversary picks $i \\in [m]$ is defined as $L_i(x)$.\nThe corresponding payoff of the adversary is $L_i(x)$.\nWe will run no-regret dynamics on this zero-sum game, where at every iteration $t = 1, \\ldots, T$, the\nadversary will pick a distribution over functions and subsequently the learner picks a solution $x_t$.\nFor simpler notation we will denote with $w_t$ the probability mass function on $[m]$ associated with\nthe distribution of the adversary. That is, $w_t[i]$ is the probability of picking function $L_i \\in \\mathcal{L}$. The\nadversary picks a distribution $w_t$ based on some arbitrary no-regret learning algorithm on the $m$\nactions in $\\mathcal{L}$. For concreteness consider the case where the adversary picks a distribution based on the\nmultiplicative weight updates algorithm, i.e.,\n\n$$w_t[i] \\propto \\exp\\left(\\sqrt{\\frac{\\log(m)}{2T}} \\sum_{\\tau=1}^{t-1} L_i(x_\\tau)\\right). \\qquad (6)$$\n\n$^3$All our results easily extend to the case where the oracle computes a solution that is approximately optimal\nup to an additive error, rather than a multiplicative one. For simplicity of exposition we present the multiplicative\nerror case as it is more in line with the literature on approximation algorithms.\n\nSubsequently the learner picks a solution $x_t$ that is the output of the $\\alpha$-approximate stochastic oracle\non the distribution selected by the adversary at time-step $t$. That is,\n\n$$x_t = M(w_t). \\qquad (7)$$\n\nWrite $\\epsilon(T) = \\sqrt{\\frac{2\\log(m)}{T}}$. 
By the guarantees of the no-regret algorithm for the adversary, we have\n\n$$\\frac{1}{T} \\sum_{t=1}^{T} \\mathbb{E}_{I \\sim w_t}[L_I(x_t)] \\ge \\max_{i \\in [m]} \\frac{1}{T} \\sum_{t=1}^{T} L_i(x_t) - \\epsilon(T). \\qquad (8)$$\n\nCombining the above with the guarantee of the stochastic oracle we have\n\n$$\\tau = \\min_{x \\in \\mathcal{X}} \\max_{i \\in [m]} L_i(x) \\ge \\min_{x \\in \\mathcal{X}} \\frac{1}{T} \\sum_{t=1}^{T} \\mathbb{E}_{I \\sim w_t}[L_I(x)]$$\n$$\\ge \\frac{1}{T} \\sum_{t=1}^{T} \\min_{x \\in \\mathcal{X}} \\mathbb{E}_{I \\sim w_t}[L_I(x)]$$\n$$\\ge \\frac{1}{T} \\sum_{t=1}^{T} \\frac{1}{\\alpha} \\cdot \\mathbb{E}_{I \\sim w_t}[L_I(x_t)] \\qquad \\text{(By oracle guarantee for each $t$)}$$\n$$\\ge \\frac{1}{\\alpha} \\cdot \\left(\\max_{i \\in [m]} \\frac{1}{T} \\sum_{t=1}^{T} L_i(x_t) - \\epsilon(T)\\right). \\qquad \\text{(By no-regret of adversary)}$$\n\nThus, if we define $P$ to be the uniform distribution over $\\{x_1, \\ldots, x_T\\}$, then we have derived\n\n$$\\max_{i \\in [m]} \\mathbb{E}_{x \\sim P}[L_i(x)] \\le \\alpha\\tau + \\epsilon(T) \\qquad (9)$$\n\nas required.\n\nA corollary of Theorem 1 is that if the solution space $\\mathcal{X}$ is convex and the objective functions $L_i \\in \\mathcal{L}$\nare all convex functions, then we can compute a single solution $x^*$ that is approximately minimax\noptimal. Of course, in this setting one can calculate and optimize the maximum loss directly in time\nproportional to $|\\mathcal{L}|$; this result therefore has the most bite when the set of functions is large.\n\nCorollary 2. If the space $\\mathcal{X}$ is a convex space and each loss function $L_i \\in \\mathcal{L}$ is a convex function,\nthen the point $x^* = \\frac{1}{T} \\sum_{t=1}^{T} x_t \\in \\mathcal{X}$, where $\\{x_1, \\ldots, x_T\\}$ are the output of Algorithm 1, satisfies:\n\n$$\\max_{i \\in [m]} L_i(x^*) \\le \\alpha\\tau + \\sqrt{\\frac{2\\log(m)}{T}}. \\qquad (10)$$\n\nProof. By Theorem 1, we get that if $P$ is the uniform distribution over $\\{x_1, \\ldots, x_T\\}$ then\n\n$$\\max_{i \\in [m]} \\mathbb{E}_{x \\sim P}[L_i(x)] \\le \\alpha\\tau + \\sqrt{\\frac{2\\log(m)}{T}}.$$\n\nSince $\\mathcal{X}$ is convex, the solution $x^* = \\mathbb{E}_{x \\sim P}[x]$ is also part of $\\mathcal{X}$. Moreover, since each $L_i \\in \\mathcal{L}$ is\nconvex, we have that $\\mathbb{E}_{x \\sim P}[L_i(x)] \\ge L_i(\\mathbb{E}_{x \\sim P}[x]) = L_i(x^*)$. 
We therefore conclude\n\n$$\\max_{i \\in [m]} L_i(x^*) \\le \\max_{i \\in [m]} \\mathbb{E}_{x \\sim P}[L_i(x)] \\le \\alpha\\tau + \\sqrt{\\frac{2\\log(m)}{T}}$$\n\nas required.\n\n2.2 Robust Statistical Learning\n\nNext we apply our main theorem to statistical learning. Consider regression or classification settings\nwhere data points are pairs $(z, y)$, $z \\in \\mathcal{Z}$ is a vector of features, and $y \\in \\mathcal{Y}$ is the dependent variable.\nThe solution space $\\mathcal{X}$ is then a space of hypotheses $\\mathcal{H}$, with each $h \\in \\mathcal{H}$ a function from $\\mathcal{Z}$ to $\\mathcal{Y}$. We\nalso assume that $\\mathcal{Y}$ is a convex subset of a finite-dimensional vector space.\nWe are given a set of loss functions $\\mathcal{L} = \\{L_1, \\ldots, L_m\\}$, where each $L_i \\in \\mathcal{L}$ is a functional\n$L_i : \\mathcal{H} \\to [0, 1]$. Theorem 1 implies that, given an $\\alpha$-approximate stochastic optimization oracle,\nwe can compute a distribution over $T$ hypotheses from $\\mathcal{H}$ that achieves an $\\alpha$-approximate minimax\nguarantee. If the loss functionals are convex over hypotheses, then we can compute a single ensemble\nhypothesis $h^*$ (possibly from a larger space of hypotheses, if $\\mathcal{H}$ is non-convex) that achieves this\nguarantee.\n\nTheorem 3. Suppose that $\\mathcal{L} = \\{L_1, \\ldots, L_m\\}$ are convex functionals. Then the ensemble hypothesis\n$h^* = \\frac{1}{T} \\sum_{t=1}^{T} h_t$, where $\\{h_1, \\ldots, h_T\\}$ are the hypotheses output by Algorithm 1 given an\n$\\alpha$-approximate stochastic oracle, satisfies\n\n$$\\max_{i \\in [m]} L_i(h^*) \\le \\alpha \\min_{h \\in \\mathcal{H}} \\max_{i \\in [m]} L_i(h) + \\sqrt{\\frac{2\\log(m)}{T}}. \\qquad (11)$$\n\nProof. The proof is similar to the proof of Corollary 2.\n\nWe emphasize that the convexity condition in Theorem 3 is over the class of hypotheses, rather than\nover features or any natural parameterization of $\\mathcal{H}$ (such as weights in a neural network). This is a\nmild condition that applies to many examples in statistical learning theory. 
For instance, consider the\ncase where each loss $L_i(h)$ is the expected value of some ex-post loss function $\\ell_i(h(z), y)$ given a\ndistribution $D_i$ over $\\mathcal{Z} \\times \\mathcal{Y}$:\n\n$$L_i(h) = \\mathbb{E}_{(z,y) \\sim D_i}[\\ell_i(h(z), y)]. \\qquad (12)$$\n\nIn this case, it is enough for the function $\\ell_i(\\cdot, \\cdot)$ to be convex with respect to its first argument (i.e.,\nthe predicted dependent variable). This is satisfied by most loss functions used in machine learning,\nsuch as multinomial logistic loss (cross-entropy loss) $\\ell(\\hat{y}, y) = -\\sum_{c \\in [k]} y_c \\log(\\hat{y}_c)$ from multi-class\nclassification or squared loss $\\ell(\\hat{y}, y) = \\|\\hat{y} - y\\|^2$ as used in regression. For all these settings, Theorem\n3 provides a tool for improper robust learning, where the final hypothesis $h^*$ is an ensemble of $T$ base\nhypotheses from $\\mathcal{H}$. Again, the underlying optimization problem can be arbitrarily non-convex in the\nnatural parameters of the hypothesis space; in Section 3.1 we will show how to apply this approach to\nrobust training of neural networks, where the stochastic oracle is simply a standard network training\nmethod. For neural networks, the fact that we achieve improper learning (as opposed to standard\nlearning) corresponds to training a neural network with a single extra layer relative to the networks\ngenerated by the oracle.\n\n2.3 Robust Submodular Maximization\n\nIn robust submodular maximization we are given a family of reward functions $\\mathcal{F} = \\{f_1, \\ldots, f_m\\}$,\nwhere each $f_i \\in \\mathcal{F}$ is a monotone submodular function from subsets of a ground set $N$ of $n$ elements to $[0, 1]$.\nEach function is assumed to be monotone and submodular, i.e., for any $S \\subseteq T \\subseteq N$, $f_i(S) \\le f_i(T)$;\nand for any $S, T \\subseteq N$, $f_i(S \\cup T) + f_i(S \\cap T) \\le f_i(S) + f_i(T)$. 
The goal is to select a set $S \\subseteq N$\nof size $k$ whose worst-case value over $i$, i.e., $g(S) = \\min_{i \\in [m]} f_i(S)$, is at least a $1/\\alpha$ factor of the\nminimax optimum $\\tau = \\max_{T : |T| \\le k} \\min_{i \\in [m]} f_i(T)$.\nThis setting is a special case of our general robust optimization setting (phrased in terms of rewards\nrather than losses). The solution space $\\mathcal{X}$ is equal to the set of subsets of size $k$ among all elements in\n$N$ and the set $\\mathcal{F}$ is the set of possible objective functions. The stochastic oracle of Definition 1, instantiated in\nthis setting, asks for the following: given a convex combination of submodular functions\n$F(S) = \\sum_{i=1}^{m} w[i] \\cdot f_i(S)$, compute a set $S^*$ such that $F(S^*) \\ge \\frac{1}{\\alpha} \\max_{S : |S| \\le k} F(S)$.\nComputing the maximum value set of size $k$ is NP-hard even for a single submodular function. The\nfollowing very simple greedy algorithm computes a $(1 - 1/e)$-approximate solution [19]: begin with\n$S_{cur} = \\emptyset$, and at each iteration add to the current solution $S_{cur}$ the element $j \\in N \\setminus S_{cur}$ that has\nthe largest marginal contribution: $f(\\{j\\} \\cup S_{cur}) - f(S_{cur})$. Moreover, this approximation ratio is\nknown to be the best possible in polynomial time [18]. Since a convex combination of monotone\nsubmodular functions is also a monotone submodular function, we immediately get that there exists a\n$(1 - 1/e)$-approximate stochastic oracle that can be computed in polynomial time. The algorithm is\nformally given in Algorithm 2. Combining the above with Theorem 1 we get the following corollary.\n\nCorollary 4. Algorithm 1, with stochastic oracle $M_{greedy}$, computes in time $\\mathrm{poly}(T, n)$ a distribution\n$P$ over sets of size $k$, defined as a uniform distribution over a set $\\{S_1, \\ldots, S_T\\}$, such that\n\n$$\\min_{i \\in [m]} \\mathbb{E}_{S \\sim P}[f_i(S)] \\ge \\left(1 - \\frac{1}{e}\\right)(1 - \\eta)\\tau - \\frac{\\log(m)}{\\eta T}. \\qquad (13)$$\n\nAlgorithm 2 Greedy Stochastic Oracle for Submodular Maximization $M_{greedy}$\n\nInput: Set of elements $N$, objectives $\\mathcal{F} = \\{f_1, \\ldots, f_m\\}$, distribution over objectives $w$\nSet $S_{cur} = \\emptyset$\nfor $j = 1$ to $k$ do\n  Let $j^* = \\arg\\max_{j \\in N \\setminus S_{cur}} \\sum_{i=1}^{m} w[i] \\left(f_i(\\{j\\} \\cup S_{cur}) - f_i(S_{cur})\\right)$\n  Set $S_{cur} = \\{j^*\\} \\cup S_{cur}$\nend for\n\nFigure 1: Sample MNIST image with each of the corruptions applied to it. Background Corruption\nSet & Shrink Corruption Set (top). Pixel Corruption Set & Mixed Corruption Set (bottom).\n\nWe show in the full version of the paper that computing a single set $S$ that achieves a $(1 - 1/e)$-approximation\nto $\\tau$ is also NP-hard. This is true even if the functions $f_i$ are additive. However, by\nallowing a randomized solution over sets we can achieve a constant factor approximation to $\\tau$ in\npolynomial time.\nSince the functions are monotone, the above result implies a simple way of constructing a single set\n$S^*$ that is of larger size than $k$, which deterministically achieves a constant factor approximation to $\\tau$.\nThe latter holds by simply taking the union of the sets $\\{S_1, \\ldots, S_T\\}$ in the support of the distribution\nreturned by Algorithm 1. We get the following bi-criterion approximation scheme.\n\nCorollary 5. Suppose that we run the reward version of Algorithm 1, with $\\eta = \\epsilon$ and for $T = \\frac{\\log(m)}{\\tau \\epsilon^2}$,\nreturning $\\{S_1, \\ldots, S_T\\}$. Then the set $S^* = S_1 \\cup \\ldots \\cup S_T$, which is of size at most $k \\frac{\\log(m)}{\\tau \\epsilon^2}$, satisfies\n\n$$\\min_{i \\in [m]} f_i(S^*) \\ge \\left(1 - \\frac{1}{e} - 2\\epsilon\\right) \\tau. \\qquad (14)$$\n\n3 Experiments$^4$\n\n3.1 Robust Classification with Neural Networks\n\nA classic application of our robust optimization framework is classification with neural networks\nfor corrupted or perturbed datasets. We have a data set $Z$ of pairs $(z, y)$ of an image $z \\in \\mathcal{Z}$ and\nlabel $y \\in \\mathcal{Y}$ that can be corrupted in $m$ different ways which produces data sets $Z_1, \\ldots, Z_m$. The\nhypothesis space $\\mathcal{H}$ is the set of all neural nets of some fixed architecture and for each possible\nassignment of weights. 
We denote each such hypothesis with $h(\\cdot; \\theta) : \\mathcal{Z} \\to \\mathcal{Y}$ for $\\theta \\in \\mathbb{R}^d$, with $d$\nbeing the number of parameters (weights) of the neural net. If we let $D_i$ be the uniform distribution\nover each corrupted data set $Z_i$, then we are interested in minimizing the empirical cross-entropy\n(aka multinomial logistic) loss in the worst case over these different distributions $D_i$. The latter is a\nspecial case of our robust statistical learning framework from Section 2.2.\nTraining a neural network is a non-convex optimization problem and we have no guarantees on its\nperformance. We instead assume that for any given distribution $D$ over pairs $(z, y)$ of images and\nlabels and for any loss function $\\ell(h(z; \\theta), y)$, training a neural net with stochastic gradient descent\nrun on images drawn from $D$ can achieve an $\\alpha$-approximation to the optimal expected loss, i.e.\n$\\min_{\\theta \\in \\mathbb{R}^d} \\mathbb{E}_{(z,y) \\sim D}[\\ell(h(z; \\theta), y)]$. Notice that this implies an $\\alpha$-approximate stochastic oracle for the\ncorrupted dataset robust training problem: for any distribution $w$ over the different corruptions $[m]$,\nthe stochastic oracle asks to give an $\\alpha$-approximation to the minimization problem:\n\n$$\\min_{\\theta \\in \\mathbb{R}^d} \\sum_{i=1}^{m} w[i] \\cdot \\mathbb{E}_{(z,y) \\sim D_i}[\\ell(h(z; \\theta), y)] \\qquad (15)$$\n\nThe latter is simply another expected loss problem with distribution over images being the mixture\ndistribution defined by first drawing a corruption index $i$ from $w$ and then drawing a corrupted\nimage from distribution $D_i$. Hence, our oracle assumption implies that SGD on this mixture is an\n$\\alpha$-approximation. By linearity of expectation, an alternative way of viewing the stochastic oracle\nproblem is that we are training a neural net on the original distribution of images, but with loss\nfunction being the weighted combination of loss functions $\\sum_{i=1}^{m} w[i] \\cdot \\ell(h(c_i(z); \\theta), y)$, where\n$c_i(z)$ is the $i$-th corrupted version of image $z$. In our experiments we implemented both of these\ninterpretations of the stochastic oracle, which we call the Hybrid Method and Composite Method,\nrespectively, when designing our neural network training scheme (see the full version of the paper\nfor further details). Finally, because we use the cross-entropy loss, which is convex in the prediction\nof the neural net, we can also apply Theorem 3 to get that the ensemble neural net, which takes the\naverage of the predictions of the neural nets created at each iteration of the robust optimization, will\nalso achieve good worst-case loss (we refer to this as Ensemble Bottleneck Loss).\n\n$^4$Code used to implement the algorithms and run the experiments is available at https://github.com/12degrees/Robust-Classification/.\n\nExperiment Setup. We use the MNIST handwritten digits data set containing 55000 training\nimages, 5000 validation images, and 10000 test images, each image being a $28 \\times 28$ pixel grayscale\nimage. The intensities of these 784 pixels (ranging from 0 to 1) are used as input to a neural network\nthat has 1024 nodes in its one hidden layer. The output layer uses the softmax function to give a\ndistribution over digits 0 to 9. The activation function is ReLU and the network is trained using\nGradient Descent with learning parameter 0.5 through 500 iterations of mini-batches of size 100.\nIn general, the corruptions can be any black-box corruption of the image. In our experiments, we\nconsider four types of corruption ($m = 4$). See the full version of the paper for further details about\ncorruptions.\n\nBaselines. 
We consider three baselines: (i) Individual Corruption: for each corruption type $i \\in [m]$,\nwe construct an oracle that trains a neural network using the training data perturbed by corruption $i$,\nand then returns the trained network weights as $\\theta_t$, for every $t = 1, \\ldots, T$. This gives $m$ baselines,\none for each corruption type; (ii) Even Split: this baseline alternates between training with different\ncorruption types between iterations. In particular, call the previous $m$ baseline oracles $O_1, \\ldots, O_m$.\nThen this new baseline oracle will produce $\\theta_t$ with $O_{i+1}$, where $i \\equiv t \\bmod m$, for every $t = 1, \\ldots, T$;\n(iii) Uniform Distribution: This more advanced baseline runs the robust optimization scheme with the\nHybrid Method (see Appendix), but without the distribution updates. Instead, the distribution over\ncorruption types is fixed as the discrete uniform $[\\frac{1}{m}, \\ldots, \\frac{1}{m}]$ over all $T$ iterations. This allows us to\ncheck if the multiplicative weight updates in the robust optimization algorithm are providing benefit.\n\nResults. The Hybrid and Composite Methods produce results far superior to all three baseline\ntypes, with differences both substantial in magnitude and statistically significant, as shown in Figure\n2. The more sophisticated Composite Method outperforms the Hybrid Method. Increasing $T$\nimproves performance, but with diminishing returns, largely because for sufficiently large $T$, the\ndistribution over corruption types has moved from the initial uniform distribution to some more\noptimal stable distribution (see the full version for details). All these effects are consistent across\nthe 4 different corruption sets tested. The Ensemble Bottleneck Loss is empirically much smaller\nthan Individual Bottleneck Loss. 
For the best performing algorithm, the Composite Method, the\nmean Ensemble Bottleneck Loss (mean Individual Bottleneck Loss) with $T = 50$ was 0.34 (1.31)\nfor Background Set, 0.28 (1.30) for Shrink Set, 0.19 (1.25) for Pixel Set, and 0.33 (1.25) for Mixed\nSet. Thus combining the $T$ classifiers obtained from robust optimization is practical for making\npredictions on new data.\n\n3.2 Robust Influence Maximization\n\nWe apply the results of Section 2.3 to the robust influence maximization problem. Given a directed\ngraph $G = (V, E)$, the goal is to pick a seed set $S$ of $k$ nodes that maximize an influence function\n$f_G(S)$, where $f_G(S)$ is the expected number of individuals influenced by the opinion of the members of\n$S$. We took $f_G(S)$ to be the number of nodes reachable from $S$ (our results extend to other models).\nIn robust influence maximization, the goal is to maximize influence in the worst-case (Bottleneck\nInfluence) over $m$ functions $\\{f_1, \\ldots, f_m\\}$, corresponding to $m$ graphs $\\{G_1, \\ldots, G_m\\}$, for some fixed\nseed set of size $k$. This is a special case of robust submodular maximization after rescaling to $[0, 1]$.\n\nFigure 2: Comparison of methods, showing mean of 10 independent runs and a 95% confidence band. The\ncriterion is Individual Bottleneck Loss: $\\min_{[m]} \\mathbb{E}_{\\theta \\sim P}[\\ell(h(z; \\theta), y)]$, where $P$ is uniform over all solutions $\\theta_i$\nfor that method. Baselines (i) and (ii) are not shown as they produce significantly higher loss.\n\nExperiment Setup. Given a base directed graph $G = (V, E)$, we produce $m$ graphs $G_i = (V, E_i)$ by\nrandomly including each edge $e \\in E$ with some probability $p$. We consider two base graphs and two\nsets of parameters for each: (i) The Wikipedia Vote Graph [14]. In Experiment A, the parameters are\n$|V| = 7115$, $|E| = 103689$, $m = 10$, $p = 0.01$ and $k = 10$. In Experiment B, we change to $p = 0.015$ and\n$k = 3$. (ii) The Complete Directed Graph on $|V| = 100$ vertices. 
In Experiment A, the parameters\nare $m = 50$, $p = 0.015$ and $k = 2$. In Experiment B, we change to $p = 0.01$ and $k = 4$.\n\nBaselines. We compared our algorithm (Section 2.3) to three baselines: (i) Uniform over Individual\nGreedy Solutions: Apply greedy maximization (Algorithm 2) on each graph separately, to get\nsolutions $\\{S^g_1, \\ldots, S^g_m\\}$. Return the uniform distribution over these solutions; (ii) Greedy on Uniform\nDistribution over Graphs: Return the output of greedy submodular maximization (Algorithm 2)\non the uniform distribution over influence functions. This can be viewed as maximizing expected\ninfluence; (iii) Uniform over Greedy Solutions on Multiple Perturbed Distributions: Generate $T$\ndistributions $\\{w^*_1, \\ldots, w^*_T\\}$ over the $m$ functions, by randomly perturbing the uniform distribution.\nPerturbation magnitudes were chosen s.t. $w^*_t$ has the same expected $\\ell_1$ distance from uniform as the\ndistribution returned by robust optimization at iteration $t$.\n\nResults. For both graph experiments, robust optimization outperforms all baselines on Bottleneck\nInfluence; the difference is statistically significant as well as large in magnitude for all $T > 50$ (see\nFigure 3). Moreover, the individual seed sets generated at each iteration $t$ of robust optimization\nthemselves achieve empirically good influence as well; see the full version for further details.\n\nFigure 3: Comparison for various $T$, showing mean Bottleneck Influence and 95% confidence on 10 runs.\n\nReferences\n[1] Zeyuan Allen Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In\nProceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York\nCity, NY, USA, June 19-24, 2016, pages 699\u2013707, 2016.\n\n[2] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a\nmeta-algorithm and applications. 
Theory of Computing, 8(6):121\u2013164, 2012.\n\n[3] Aharon Ben-Tal, Elad Hazan, Tomer Koren, and Shie Mannor. Oracle-based robust optimization\nvia online learning. Operations Research, 63(3):628\u2013638, 2015.\n\n[4] Sabyasachi Chatterjee, John C. Duchi, John D. Lafferty, and Yuancheng Zhu. Local minimax\ncomplexity of stochastic convex optimization. In Advances in Neural Information Processing\nSystems 29: Annual Conference on Neural Information Processing Systems 2016, December\n5-10, 2016, Barcelona, Spain, pages 3423\u20133431, 2016.\n\n[5] Wei Chen, Tian Lin, Zihan Tan, Mingfei Zhao, and Xuren Zhou. Robust influence maximization.\nIn Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery\nand Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 795\u2013804, 2016.\n\n[6] Wei Chen, Tian Lin, Zihan Tan, Mingfei Zhao, and Xuren Zhou. Robust influence maximization.\nIn Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery\nand Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 795\u2013804, 2016.\n\n[7] Elad Hazan, Kfir Y. Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex\noptimization. In Advances in Neural Information Processing Systems 28: Annual Conference\non Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec,\nCanada, pages 1594\u20131602, 2015.\n\n[8] Elad Hazan, Kfir Yehuda Levy, and Shai Shalev-Shwartz. On graduated optimization for\nstochastic non-convex problems. In Proceedings of the 33rd International Conference on\nMachine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1833\u20131841,\n2016.\n\n[9] Xinran He and David Kempe. Robust influence maximization. In Proceedings of the 22nd ACM\nSIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco,\nCA, USA, August 13-17, 2016, pages 885\u2013894, 2016.\n\n[10] Xinran He and David Kempe. 
Robust influence maximization. In Proceedings of the 22nd ACM\nSIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco,\nCA, USA, August 13-17, 2016, pages 885\u2013894, 2016.\n\n[11] David Kempe, Jon Kleinberg, and \u00c9va Tardos. Maximizing the spread of influence through\na social network. In Proceedings of the Ninth ACM SIGKDD International Conference on\nKnowledge Discovery and Data Mining, KDD \u201903, pages 137\u2013146, New York, NY, USA, 2003.\nACM.\n\n[12] Andreas Krause, H. Brendan McMahan, Carlos Guestrin, and Anupam Gupta. Selecting observations\nagainst adversarial objectives. In Advances in Neural Information Processing Systems\n20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing\nSystems, Vancouver, British Columbia, Canada, December 3-6, 2007, pages 777\u2013784, 2007.\n\n[13] Andreas Krause, Alex Roper, and Daniel Golovin. Randomized sensing in adversarial environments.\nIn Proceedings of the 22nd International Joint Conference On Artificial Intelligence,\nBarcelona, Catalonia, Spain, July 16-22, 2011, pages 2133\u20132139, 2011.\n\n[14] Jure Leskovec. Wikipedia vote network. Stanford Network Analysis Project.\n\n[15] Meghna Lowalekar, Pradeep Varakantham, and Akshat Kumar. Robust influence maximization:\n(extended abstract). In Proceedings of the 2016 International Conference on Autonomous\nAgents & Multiagent Systems, Singapore, May 9-13, 2016, pages 1395\u20131396, 2016.\n\n[16] Yishay Mansour, Aviad Rubinstein, and Moshe Tennenholtz. Robust probabilistic inference. In\nProceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA\n2015, San Diego, CA, USA, January 4-6, 2015, pages 449\u2013460, 2015.\n\n[17] Hongseok Namkoong and John C. Duchi. Stochastic gradient methods for distributionally\nrobust optimization with f-divergences. 
In Advances in Neural Information Processing Systems\n29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016,\nBarcelona, Spain, pages 2208\u20132216, 2016.\n\n[18] G. L. Nemhauser and L. A. Wolsey. Best algorithms for approximating the maximum of a\nsubmodular set function. Mathematics of Operations Research, 3(3):177\u2013188, 1978.\n\n[19] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing\nsubmodular set functions\u2014I. Mathematical Programming, 14(1):265\u2013294, 1978.\n\n[20] Shai Shalev-Shwartz and Yonatan Wexler. Minimizing the maximal loss: How and why. In\nProceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York\nCity, NY, USA, June 19-24, 2016, pages 793\u2013801, 2016.\n\n[21] Jacob Steinhardt and John C. Duchi. Minimax rates for memory-bounded sparse linear regression.\nIn Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France,\nJuly 3-6, 2015, pages 1564\u20131587, 2015.", "award": [], "sourceid": 2462, "authors": [{"given_name": "Robert", "family_name": "Chen", "institution": "Harvard University"}, {"given_name": "Brendan", "family_name": "Lucier", "institution": "Microsoft Research"}, {"given_name": "Yaron", "family_name": "Singer", "institution": "Harvard University"}, {"given_name": "Vasilis", "family_name": "Syrgkanis", "institution": "Microsoft Research"}]}