{"title": "Generalization Bounds in the Predict-then-Optimize Framework", "book": "Advances in Neural Information Processing Systems", "page_first": 14412, "page_last": 14421, "abstract": "The predict-then-optimize framework is fundamental in many practical settings: predict the unknown parameters of an optimization problem, and then solve the problem using the predicted values of the parameters. A natural loss function in this environment is to consider the cost of the decisions induced by the predicted parameters, in contrast to the prediction error of the parameters. This loss function was recently introduced in [Elmachtoub and Grigas, 2017], which called it the Smart Predict-then-Optimize (SPO) loss. Since the SPO loss is nonconvex and noncontinuous, standard results for deriving generalization bounds do not apply. In this work, we provide an assortment of generalization bounds for the SPO loss function. In particular, we derive bounds based on the Natarajan dimension that, in the case of a polyhedral feasible region, scale at most logarithmically in the number of extreme points, but, in the case of a general convex set, have poor dependence on the dimension. By exploiting the structure of the SPO loss function and an additional strong convexity assumption on the feasible region, we can dramatically improve the dependence on the dimension via an analysis and corresponding bounds that are akin to the margin guarantees in classification problems.", "full_text": "Generalization Bounds in the\n\nPredict-then-Optimize Framework\n\nOthman El Balghiti\n\nRayens Capital\n\nChicago, IL 60606\n\noe2161@columbia.edu\n\nPaul Grigas\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\npgrigas@berkeley.edu\n\nAdam N. 
Elmachtoub
Columbia University
New York, NY 10027
adam@ieor.columbia.edu

Ambuj Tewari
University of Michigan
Ann Arbor, MI 48109
tewaria@umich.edu

Abstract

The predict-then-optimize framework is fundamental in many practical settings: predict the unknown parameters of an optimization problem, and then solve the problem using the predicted values of the parameters. A natural loss function in this environment is to consider the cost of the decisions induced by the predicted parameters, in contrast to the prediction error of the parameters. This loss function was recently introduced in [7] and christened the Smart Predict-then-Optimize (SPO) loss. Since the SPO loss is nonconvex and noncontinuous, standard results for deriving generalization bounds do not apply. In this work, we provide an assortment of generalization bounds for the SPO loss function. In particular, we derive bounds based on the Natarajan dimension that, in the case of a polyhedral feasible region, scale at most logarithmically in the number of extreme points, but, in the case of a general convex set, have poor dependence on the dimension. By exploiting the structure of the SPO loss function and an additional strong convexity assumption on the feasible region, we can dramatically improve the dependence on the dimension via an analysis and corresponding bounds that are akin to the margin guarantees in classification problems.

1 Introduction

A common application of machine learning is to predict-then-optimize, i.e., predict unknown parameters of an optimization problem and then solve the optimization problem using the predictions. For instance, consider a navigation task that requires solving a shortest path problem. The key inputs into this problem are the travel times on each edge, typically called edge costs. Since the exact costs are not known at the time the problem is solved, the edge costs are predicted using a machine learning model trained on historical data consisting of features (time of day, weather, etc.) and edge costs (collected from app data). Fundamentally, a good model induces the optimization problem to find good shortest paths, as measured by the true edge costs. In fact, recent work has considered how to solve problems in similar environments [3, 13, 6]. In particular, recent work [7] developed the Smart Predict-then-Optimize (SPO) loss function, which exactly measures the quality of a prediction by the decision error, in contrast to the prediction error as measured by standard loss functions such as squared error. In this work, we seek to provide an assortment of generalization bounds for the SPO loss function.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Specifically, we shall assume that our optimization task is to minimize a linear objective over a convex feasible region. In the shortest path example, the feasible region is a polyhedron. We assume the objective cost vector is not known at the time the optimization problem is solved, but rather predicted from data. A decision is made with respect to the predicted cost vector, and the SPO loss is computed by evaluating the decision on the true cost vector and then subtracting the optimal cost assuming knowledge of the true cost vector. Unfortunately, the SPO loss is nonconvex and non-Lipschitz, and therefore proving generalization bounds is not immediate.
Our results consider two cases, depending on whether the feasible region is a polyhedron or a strongly convex body. In all cases, we achieve a dependency of 1/√n up to logarithmic terms, where n is the number of samples.
In the polyhedral case, our generalization bound is formed by considering the Rademacher complexity of the class obtained by composing the SPO loss with our predict-then-optimize models. This in turn can be bounded by a term on the order of the square root of the Natarajan dimension times the logarithm of the number of extreme points in the feasible region. Since the number of extreme points is typically exponential in the dimension, this logarithm is essential so that the bound is at most linear in the dimension. When our cost vector prediction models are restricted to be linear, we show that the Natarajan dimension of the predict-then-optimize hypothesis class is simply bounded by the product of the two relevant dimensions of the linear hypothesis class, the feature dimension and the cost vector dimension. Using this polyhedral approach, we show that a generalization bound is possible for any convex set by considering a covering of the feasible region, although the dependence on the dimension is at least linear.
Fortunately, we show that when the feasible region is strongly convex, tighter generalization bounds can be obtained using margin-based methods. The proof relies on constructing an upper bound on the SPO loss function and showing that it is Lipschitz. Our margin-based bounds have no explicit dependence on the dimensions of the input features and of the cost vectors; they are expressed as a function of the multivariate Rademacher complexity of the vector-valued hypothesis class being used. We show that for suitably constrained linear hypothesis classes, we get a much improved dependence on the problem dimensions. Since the SPO loss generalizes the 0-1 multiclass loss from multiclass classification (see Example 3), our work can be seen as extending classic Natarajan-dimension based [20, Ch. 29] and margin-based generalization bounds [14] to the predict-then-optimize framework.
We note that one can generally construct an instance of a multiclass classification problem from an instance of an SPO problem by considering the "label" of each observed cost vector to be the corresponding optimal solution, which is without loss of generality an extreme point of the feasible set of solutions. The number of classes in the resulting multiclass problem is the number of extreme points of the feasible set. It is therefore important to use those generalization bounds from the multiclass classification literature that are not too large in the number of classes. For data-independent worst-case bounds, the dependency is at best square root in the number of classes [9, 4]. In contrast, we provide data-independent bounds that grow only logarithmically in the number of extreme points. Using data-dependent (margin-based) approaches, [15, 16] successfully decreased this complexity to logarithmic in the number of classes.
However, the reduction of an SPO problem to multiclass classification throws away potentially important information, namely the numerical values of the cost vectors. Our margin-based approach removes any explicit dependency on the number of classes by exploiting the structure of the SPO loss. In Section 4, we make an important assumption that the feasible set is strongly convex (which necessarily implies that the number of extreme points is infinite) and also heavily use the structure of the SPO loss via the construction of the γ-margin SPO loss. This refined analysis allows us to circumvent a naive bound that depends on the infinite number of classes, which would be vacuous. Even though we construct a Lipschitz upper bound on the SPO loss in a general norm setting (Theorem 3), our margin bounds (Theorem 4) are stated in the ℓ2-norm setting.
This is because the most general contraction-type lemma for vector-valued Lipschitz functions that we know of only works for the ℓ2-norm [17]. Similar results are available in the infinity norm setting [3], but our understanding of general norms appears limited at present. Our work will hopefully provide the motivation to develop contraction inequalities for vector-valued Lipschitz functions in a general norm setting.

2 Predict-then-optimize framework and preliminaries

We now describe the predict-then-optimize framework, which is central to many applications of optimization in practice. Specifically, we assume that there is a nominal optimization problem of interest which models our downstream decision-making task. Furthermore, we assume that the nominal problem has a linear objective and that the decision variable w ∈ R^d and feasible region S ⊆ R^d are well-defined and known with certainty. However, the cost vector of the objective, c ∈ R^d, is not observed directly, and rather an associated feature vector x ∈ R^p is observed. Let D be the underlying joint distribution of (x, c) and let D_x be the conditional distribution of c given x. Then the goal for the decision maker is to solve

    min_{w ∈ S} E_{c∼D_x}[c^T w | x] = min_{w ∈ S} E_{c∼D_x}[c | x]^T w .    (1)

The predict-then-optimize framework relies on using a prediction/estimate for E_{c∼D_x}[c | x], which we denote by ĉ, and solving the deterministic version of the optimization problem based on ĉ. We define P(ĉ) to be the optimization task with objective cost vector ĉ, namely

    P(ĉ) :  min_w  ĉ^T w   s.t.  w ∈ S .    (2)

We assume S ⊆ R^d is a nonempty, compact, and convex set representing the feasible region. We let w*(·) : R^d → S denote any oracle for solving P(·). That is, w*(·) is a fixed deterministic mapping such that w*(c) ∈ arg min_{w ∈ S} {c^T w} for all c ∈ R^d. For instance, if (2) corresponds to a linear, conic, or even a particular combinatorial or mixed-integer optimization problem (in which case S can be implicitly described as a convex set), then a commercial optimization solver or a specialized algorithm suffices for w*(c).
In this framework, we assume that predictions are made from a model that is learned on a training data set. Specifically, the sample training data (x_1, c_1), ..., (x_n, c_n) is drawn i.i.d. from the joint distribution D, where x_i ∈ X is a feature vector representing auxiliary information associated with the cost vector c_i. We denote by H our hypothesis class of cost vector prediction models; thus, for a function f ∈ H, we have that f : X → R^d. Most approaches for learning a model f ∈ H from the training data are based on specifying a loss function that quantifies the error in making prediction ĉ when the realized (true) cost vector is actually c. Following prior work [7], our primary loss function of interest is the "smart predict-then-optimize" loss function that directly takes the nominal optimization problem P(·) into account when measuring errors in predictions. Namely, we consider the SPO loss function (relative to the optimization oracle w*(·)) defined by:

    ℓ_SPO(ĉ, c) := c^T (w*(ĉ) − w*(c)) ,

where ĉ ∈ R^d is the predicted cost vector and c ∈ C ⊆ R^d is the true realized cost vector. Notice that ℓ_SPO(ĉ, c) exactly measures the excess cost incurred when making a suboptimal decision due to an imprecise cost vector prediction.
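To make the definition concrete, here is a minimal sketch (our own illustration, not code from the paper) for the special case where S is the unit simplex, as in the multiclass setting of Example 3 below; there the oracle w*(c) simply returns a vertex e_i with i minimizing c_i:

```python
import numpy as np

def oracle_simplex(c):
    """w*(c) for S = unit simplex: an optimal vertex is e_i with i = argmin_i c_i.
    Using argmin gives a fixed deterministic tie-breaking rule."""
    w = np.zeros_like(c, dtype=float)
    w[np.argmin(c)] = 1.0
    return w

def spo_loss(c_hat, c, oracle=oracle_simplex):
    """SPO loss: excess true cost of the decision induced by the prediction c_hat."""
    return float(c @ (oracle(c_hat) - oracle(c)))

c_true = np.array([3.0, 1.0, 2.0])
c_good = np.array([2.5, 0.5, 4.0])  # large prediction error, but induces the optimal decision
c_bad = np.array([0.5, 2.5, 4.0])   # induces a suboptimal decision

print(spo_loss(c_good, c_true))  # 0.0
print(spo_loss(c_bad, c_true))   # 2.0, i.e., 3.0 - 1.0: cost of picking vertex 0 over vertex 1
```

Note that scaling a prediction by any positive constant leaves the loss unchanged, and that the loss jumps discontinuously when the argmin of the prediction flips, a first glimpse of the nonconvexity and discontinuity discussed next.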
Also, note that the SPO loss is non-negative and bounded above by ω_S(C) for all ĉ ∈ R^d and c ∈ C, where ω_S(C) is a diameter-like quantity that we will define shortly.
Let us now present several examples to illustrate the applicability and generality of the SPO loss function and framework.
Example 1. In the shortest path problem, the feature vector x may include features such as weather and time information that may be used to predict the cost vector c representing the travel times along each edge of the network. In this case, the network is assumed to be given (e.g., the road network of a city) and the feasible region S is a network flow polytope that represents flow conservation and capacity constraints on the underlying network.
Example 2. In portfolio optimization, the returns of potential investments can depend on many features, which typically include historical returns, news, economic factors, social media, and others. We presume that these auxiliary features may be used to predict the vector of returns r of d different assets, but that the covariance matrix of the asset returns does not depend on the auxiliary features. Here we are interested in maximizing returns, so we let the cost vector c be defined by c = −r̃, where r̃ = r − r_RF e, r represents the vector of asset returns, r_RF is the risk-free rate, and e is the vector of all ones. If Σ ∈ R^{d×d} denotes the (positive semidefinite) covariance matrix of the asset returns and γ ≥ 0 is a desired bound on the overall variance (risk level) of the portfolio, then we may define the feasible region by S := {w : w^T Σ w ≤ γ, e^T w ≤ 1, w ≥ 0}.
Example 3. Our setting also captures multiclass (and binary) classification by the following characterization: S is the d-dimensional simplex, where d is the number of classes, and C = {−e_i : i = 1, ..., d}, where e_i is the i-th unit vector in R^d. It is easy to see that each vertex of the simplex corresponds to a label, and a correct/incorrect prediction has a loss of 0/1.

As pointed out before [7], the SPO loss function is generally non-convex, may even be discontinuous, and is in fact a strict generalization of the 0-1 loss function in binary classification. Thus, optimizing the SPO loss via empirical risk minimization may be intractable even when H is a linear hypothesis class. To circumvent these difficulties, one approach is to optimize a convex surrogate loss [7]. Our focus is on deriving generalization bounds that hold uniformly over the class H, and thus are valid for any training approach, including using a surrogate or other loss function within the framework of empirical risk minimization. Notice that a generalization bound for the SPO loss directly translates to an upper bound guarantee for problem (1) that holds "on average" over the distribution.
Useful notation. We will make use of a generic given norm ‖·‖ on w ∈ R^d, as well as the ℓq-norm denoted by ‖·‖_q for q ∈ [1, ∞]. For the given norm ‖·‖ on R^d, ‖·‖_* denotes the dual norm defined by ‖c‖_* := max_{w : ‖w‖ ≤ 1} c^T w. Let B(w̄, r) := {w : ‖w − w̄‖ ≤ r} denote the ball of radius r centered at w̄, and we analogously define B_q(w̄, r) for the ℓq-norm and B_*(c, r) for the dual norm. For a set S ⊆ R^d, we define the size of S in the norm ‖·‖ by ρ(S) := sup_{w ∈ S} ‖w‖. We analogously define ρ_q(·) for the ℓq-norm and ρ_*(·) for the dual norm.
We define the "linear optimization gap" of S with respect to c by ω_S(c) := max_{w ∈ S} {c^T w} − min_{w ∈ S} {c^T w}, and for a set C ⊆ R^d we slightly abuse notation by defining ω_S(C) := sup_{c ∈ C} ω_S(c). Define w*(H) := {x ↦ w*(f(x)) : f ∈ H}.
Rademacher complexity. Let us now briefly review the notion of Rademacher complexity and its application in our framework. Recall that H is a hypothesis class of functions mapping from the feature space X to R^d. Given a fixed sample (x_1, c_1), ..., (x_n, c_n), we define the empirical risk with respect to the SPO loss of a function f ∈ H as

    R̂_SPO(f) = (1/n) Σ_{i=1}^n ℓ_SPO(f(x_i), c_i) ,

and the expected risk as R_SPO(f) = E_{(x,c)∼D}[ℓ_SPO(f(x), c)]. We also define the empirical Rademacher complexity of H with respect to the SPO loss, i.e., the empirical Rademacher complexity of the function class obtained by composing ℓ_SPO with H, by

    R̂^n_SPO(H) := E_σ[ sup_{f ∈ H} (1/n) Σ_{i=1}^n σ_i ℓ_SPO(f(x_i), c_i) ] ,

where σ_i are i.i.d. Rademacher random variables for i = 1, ..., n. The expected version of the Rademacher complexity is defined as R^n_SPO(H) := E[R̂^n_SPO(H)], where the expectation is w.r.t. an i.i.d. sample drawn from the underlying distribution D. The following theorem is an application of the classical generalization bounds based on Rademacher complexity due to [1] to our setting.
Theorem 1 (Bartlett and Mendelson [1]). Let H be a family of functions mapping from X to R^d. Then, for any δ > 0, with probability at least 1 − δ over an i.i.d. sample drawn from the distribution D, each of the following holds for all f ∈ H:

    R_SPO(f) ≤ R̂_SPO(f) + 2 R^n_SPO(H) + ω_S(C) √(log(1/δ) / (2n)) , and
    R_SPO(f) ≤ R̂_SPO(f) + 2 R̂^n_SPO(H) + 3 ω_S(C) √(log(2/δ) / (2n)) .

3 Combinatorial dimension based generalization bounds

In this section, we consider the case where S is a polyhedron and derive generalization bounds based on bounding the Rademacher complexity of the SPO loss and applying Theorem 1. Since S is polyhedral, the optimal solution of (2) can be found by considering only the finite set of extreme points of S, which we denote by the set S. Since the number of extreme points may be exponential in d, our goal is to provide bounds that are logarithmic in |S|. At the end of the section, we extend our analysis to any compact and convex feasible region S by extending the polyhedral analysis with a covering number argument.
In order to derive a bound on the Rademacher complexity, we will critically rely on the notion of the Natarajan dimension [19], which is an extension of the VC-dimension to the multiclass classification setting and is defined in our setting as follows.
Definition 1 (Natarajan dimension). Suppose that S is a polyhedron and S is the set of its extreme points. Let F ⊆ S^X be a hypothesis space of functions mapping from X to S, and let X ⊆ X be given. We say that F N-shatters X if there exist g_1, g_2 ∈ F such that

  • g_1(x) ≠ g_2(x) for all x ∈ X;
  • for all T ⊆ X, there exists g ∈ F such that (i) for all x ∈ T, g(x) = g_1(x), and (ii) for all x ∈ X∖T, g(x) = g_2(x).

The Natarajan dimension of F, denoted d_N(F), is the maximal cardinality of a set N-shattered by F.
The Natarajan dimension is a measure of the richness of a hypothesis class.
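Definition 1 can be checked by brute force on small finite classes. The following sketch (our own illustrative code; the toy class at the bottom is hypothetical) tests whether a finite family F, with functions represented as dictionaries, N-shatters a set X:

```python
from itertools import chain, combinations

def n_shatters(F, X):
    """Brute-force test of the N-shattering condition in Definition 1."""
    def realizable(g1, g2, T):
        # Is there g in F with g = g1 on T and g = g2 on X \ T?
        target = {x: (g1[x] if x in T else g2[x]) for x in X}
        return any(all(g[x] == target[x] for x in X) for g in F)

    # All subsets T of X (the powerset), as sets for fast membership tests
    subsets = [set(T) for T in chain.from_iterable(combinations(X, r) for r in range(len(X) + 1))]
    for g1 in F:
        for g2 in F:
            if all(g1[x] != g2[x] for x in X):  # witnesses must disagree everywhere on X
                if all(realizable(g1, g2, T) for T in subsets):
                    return True
    return False

# Toy class: functions from X = {0, 1} into "extreme point" labels {'a', 'b'}
X = [0, 1]
F = [{0: 'a', 1: 'a'}, {0: 'a', 1: 'b'}, {0: 'b', 1: 'a'}, {0: 'b', 1: 'b'}]
print(n_shatters(F, X))      # True: g1 = {0:'a',1:'a'} and g2 = {0:'b',1:'b'} work
print(n_shatters(F[:2], X))  # False: no pair of functions disagrees on all of X
```

The Natarajan dimension of F is then the largest |X| for which such a check succeeds; for w*(H), the role of the labels is played by the extreme points selected by the oracle.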
In Theorem 2, we show that the Rademacher complexity for the SPO loss can be bounded as a function of the Natarajan dimension of w*(H) := {x ↦ w*(f(x)) : f ∈ H}. The proof follows a classical argument and makes strong use of Massart's lemma and the Natarajan lemma.
Theorem 2. Suppose that S is a polyhedron and S is the set of its extreme points. Let H be a family of functions mapping from X to R^d. Then we have that

    R^n_SPO(H) ≤ ω_S(C) √( 2 d_N(w*(H)) log(n|S|²) / n ) .

Furthermore, for any δ > 0, with probability at least 1 − δ over an i.i.d. sample (x_1, c_1), ..., (x_n, c_n) drawn from the distribution D, for all f ∈ H we have

    R_SPO(f) ≤ R̂_SPO(f) + 2 ω_S(C) √( 2 d_N(w*(H)) log(n|S|²) / n ) + ω_S(C) √( log(1/δ) / (2n) ) .

Next, we show that when H is restricted to the linear hypothesis class H_lin = {x ↦ Bx : B ∈ R^{d×p}}, the Natarajan dimension of w*(H_lin) can be bounded by dp. The proof relies on translating our problem to an instance of a linear multiclass prediction problem and using a result of [5].
Corollary 1. Suppose that S is a polyhedron and S is the set of its extreme points. Let H_lin be the hypothesis class of all linear functions, i.e., H_lin = {x ↦ Bx : B ∈ R^{d×p}}. Then we have

    d_N(w*(H_lin)) ≤ dp .

Furthermore, for any δ > 0, with probability at least 1 − δ over an i.i.d. sample (x_1, c_1), ..., (x_n, c_n) drawn from the distribution D, for all f ∈ H_lin we have

    R_SPO(f) ≤ R̂_SPO(f) + 2 ω_S(C) √( 2 dp log(n|S|²) / n ) + ω_S(C) √( log(1/δ) / (2n) ) .

Next, we build off the previous results to prove generalization bounds in the case where S is a general compact convex set. The arguments made earlier made extensive use of the extreme points of the polyhedron. Nevertheless, this combinatorial argument can be modified in order to derive similar results for general S. The approach is to approximate S by a grid of points corresponding to the smallest cardinality ε-covering of S. To optimize over this grid of points, we first find the optimal solution in S and then round to the nearest point in the grid. Both the grid representation and the rounding procedure can fortunately be handled by arguments similar to those in Theorem 2 and Corollary 1, yielding the generalization bound below.
Corollary 2. Let S be any compact and convex set, and let H_lin be the hypothesis class of all linear functions. Then, for any δ > 0, with probability at least 1 − δ over an i.i.d. sample (x_1, c_1), ..., (x_n, c_n) drawn from the distribution D, for all f ∈ H_lin we have

    R_SPO(f) ≤ R̂_SPO(f) + 4 d ω_S(C) √( 2p log(2n ρ₂(S) d) / n ) + 3 ω_S(C) √( log(2/δ) / (2n) ) + O(1/n) .

Although the dependence on the sample size n in the above bound is favorable, the dependence on the number of features p and the dimension of the feasible region d is relatively weak. Given that the proofs of Corollary 2 and Theorem 2 are purely combinatorial and hold for worst-case distributions, this is not surprising. In the next section, we demonstrate how to exploit the structure of the SPO loss function and additional convexity properties of S in order to develop improved bounds.

4 Margin-based generalization bounds under strong convexity

In this section, we develop improved generalization bounds for the SPO loss function under the additional assumption that the feasible region S is strongly convex. Our developments are akin to, and in fact are a strict generalization of, the margin guarantees for binary classification based on Rademacher complexity developed in [14].
We adopt the definition of strongly convex sets presented in [10, 8], which is reviewed in Definition 2 below. Recall that ‖·‖ is a generic given norm on R^d and B(w̄, r) := {w : ‖w − w̄‖ ≤ r} denotes the ball of radius r centered at w̄.
Definition 2. We say that a convex set S ⊆ R^d is μ-strongly convex with respect to the norm ‖·‖ if, for any w_1, w_2 ∈ S and for any λ ∈ [0, 1], it holds that:

    B( λw_1 + (1 − λ)w_2 , (μ/2) λ(1 − λ) ‖w_1 − w_2‖² ) ⊆ S .

Informally, Definition 2 says that, around every convex combination of points in S, a ball of appropriate radius also lies in S. Several examples of strongly convex sets are presented in [10, 8], including ℓq and Schatten ℓq balls for q ∈ (1, 2], certain group norm balls, and generally any level set of a smooth and strongly convex function.
Our analysis herein relies on the following proposition, which strengthens the first-order general optimality condition for differentiable convex optimization problems under the additional assumption of strong convexity. Proposition 1 may be of independent interest and, to the best of our knowledge, has not appeared previously in the literature.
Proposition 1. Let S ⊆ R^d be a non-empty μ-strongly convex set and let F(·) : R^d → R be a convex and differentiable function. Consider the convex optimization problem:

    min_w  F(w)   s.t.  w ∈ S .    (3)

Then, w̄ ∈ S is an optimal solution of (3) if and only if:

    ∇F(w̄)^T (w − w̄) ≥ (μ/2) ‖∇F(w̄)‖_* ‖w − w̄‖²  for all w ∈ S .    (4)

In fact, we prove a slightly more general version of the proposition where the function F need only be defined on an open set containing S. In the case of linear optimization with F(w) = ĉ^T w, the inequality (4) implies that w*(ĉ) is the unique optimal solution of P(ĉ) whenever ĉ ≠ 0 and μ > 0. Hence, in the context of the SPO loss function with a strongly convex feasible region, ‖ĉ‖_* provides a degree of "confidence" regarding the decision w*(ĉ) implied by the cost vector prediction ĉ. This intuition motivates us to define the "γ-margin SPO loss", which places a greater penalty on cost vector predictions near 0.
Definition 3. For a fixed parameter γ > 0, given a cost vector prediction ĉ and a realized cost vector c, the γ-margin SPO loss ℓ^γ_SPO(ĉ, c) is defined as:

    ℓ^γ_SPO(ĉ, c) :=  ℓ_SPO(ĉ, c)                                      if ‖ĉ‖_* > γ ,
                      (‖ĉ‖_*/γ) ℓ_SPO(ĉ, c) + (1 − ‖ĉ‖_*/γ) ω_S(c)     if ‖ĉ‖_* ≤ γ .

Recall that, for any ĉ, c ∈ R^d, it holds that ℓ_SPO(ĉ, c) ≤ ω_S(c). Hence, we also have that ℓ_SPO(ĉ, c) ≤ ℓ^γ_SPO(ĉ, c); that is, the γ-margin SPO loss provides an upper bound on the SPO loss.
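For a concrete feasible region (our own illustration, not an example from the paper), take S to be the unit ℓ2 ball, which is strongly convex; there the oracle has the closed form w*(c) = −c/‖c‖₂, the linear optimization gap is ω_S(c) = 2‖c‖₂, and the ℓ2-norm is its own dual, so Definition 3 can be evaluated directly:

```python
import numpy as np

def oracle_ball(c):
    """w*(c) for S = unit l2 ball: min of c^T w over ||w|| <= 1 is attained at -c/||c||."""
    return -c / np.linalg.norm(c)

def spo_loss(c_hat, c):
    return float(c @ (oracle_ball(c_hat) - oracle_ball(c)))

def omega(c):
    """Linear optimization gap of the unit ball: max c^T w - min c^T w = 2 ||c||_2."""
    return 2.0 * np.linalg.norm(c)

def gamma_margin_spo(c_hat, c, gamma):
    """gamma-margin SPO loss of Definition 3 (the l2-norm is its own dual)."""
    nrm = np.linalg.norm(c_hat)
    if nrm > gamma:
        return spo_loss(c_hat, c)
    if nrm == 0.0:
        return omega(c)  # full penalty for a zero (totally uninformative) prediction
    t = nrm / gamma
    return t * spo_loss(c_hat, c) + (1.0 - t) * omega(c)

c = np.array([0.0, 2.0])
c_hat = np.array([0.3, 0.4])             # ||c_hat||_2 = 0.5
print(spo_loss(c_hat, c))                # ~0.4
print(gamma_margin_spo(c_hat, c, 1.0))   # ~2.2 = 0.5 * 0.4 + 0.5 * 4.0
print(gamma_margin_spo(c_hat, c, 0.25))  # ~0.4: no extra penalty since ||c_hat|| > gamma
```

The middle call shows the interpolation: the prediction induces a near-optimal decision (small SPO loss), but its small dual norm relative to γ pulls the margin loss toward the worst-case value ω_S(c) = 4.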
Notice that the γ-margin SPO loss interpolates between the SPO loss and the upper bound ω_S(c) whenever ‖ĉ‖_* ≤ γ. The γ-margin SPO loss also satisfies a simple monotonicity property whereby ℓ^γ_SPO(ĉ, c) ≤ ℓ^γ̄_SPO(ĉ, c) for any ĉ, c ∈ R^d and γ̄ ≥ γ > 0. We can also define a "hard γ-margin SPO loss" that simply returns the upper bound ω_S(c) whenever ‖ĉ‖_* ≤ γ.

Definition 4. For a fixed parameter γ ≥ 0, given a cost vector prediction ĉ and a realized cost vector c, the hard γ-margin SPO loss ℓ̄^γ_SPO(ĉ, c) is defined as:

    ℓ̄^γ_SPO(ĉ, c) :=  ℓ_SPO(ĉ, c)   if ‖ĉ‖_* > γ ,
                       ω_S(c)        if ‖ĉ‖_* ≤ γ .

It is simple to see that ℓ_SPO(ĉ, c) ≤ ℓ^γ_SPO(ĉ, c) ≤ ℓ̄^γ_SPO(ĉ, c) ≤ ω_S(c) for all ĉ, c ∈ R^d and γ > 0. Due to this additional upper bound, in all of the subsequent generalization bound results, the empirical γ-margin SPO loss can be replaced by its hard-margin counterpart.
We are now ready to state a theorem concerning the Lipschitz properties of the optimization oracle w*(·) and the γ-margin SPO loss, which will then be used to derive margin-based generalization bounds. Theorem 3 below first demonstrates that the optimization oracle w*(·) satisfies a "Lipschitz-like" property away from zero. Subsequently, this Lipschitz-like property is a key ingredient in demonstrating that the γ-margin SPO loss is Lipschitz.
Theorem 3. Suppose that the feasible region S is μ-strongly convex with μ > 0. Then, the optimization oracle w*(·) satisfies the following "Lipschitz-like" property: for any ĉ_1, ĉ_2 ∈ R^d, it holds that:

    ‖w*(ĉ_1) − w*(ĉ_2)‖ ≤ ( 1 / (μ · min{‖ĉ_1‖_*, ‖ĉ_2‖_*}) ) ‖ĉ_1 − ĉ_2‖_* .    (5)

Moreover, for any fixed c ∈ R^d and γ > 0, the γ-margin SPO loss is (5‖c‖_*/(γμ))-Lipschitz with respect to the dual norm ‖·‖_*, i.e., it holds that:

    |ℓ^γ_SPO(ĉ_1, c) − ℓ^γ_SPO(ĉ_2, c)| ≤ (5‖c‖_*/(γμ)) ‖ĉ_1 − ĉ_2‖_*  for all ĉ_1, ĉ_2 ∈ R^d .    (6)

Proof. We present here only the proof of (5) and defer the proof of (6), which relies crucially on (5), to the supplementary materials. Let τ := min{‖ĉ_1‖_*, ‖ĉ_2‖_*}. We assume without loss of generality that τ > 0 (otherwise the right-hand side of (5) is equal to +∞ by convention). Applying Proposition 1 twice yields:

    ĉ_1^T (w*(ĉ_2) − w*(ĉ_1)) ≥ (μ/2) τ ‖w*(ĉ_1) − w*(ĉ_2)‖² , and
    ĉ_2^T (w*(ĉ_1) − w*(ĉ_2)) ≥ (μ/2) τ ‖w*(ĉ_1) − w*(ĉ_2)‖² .

Adding the above two inequalities together yields:

    μτ ‖w*(ĉ_1) − w*(ĉ_2)‖² ≤ (ĉ_2 − ĉ_1)^T (w*(ĉ_1) − w*(ĉ_2)) ≤ ‖ĉ_1 − ĉ_2‖_* ‖w*(ĉ_1) − w*(ĉ_2)‖ ,

where the second inequality is Hölder's inequality.
Dividing both sides of the above by μτ ‖w*(ĉ_1) − w*(ĉ_2)‖ yields the desired result.

Margin-based generalization bounds. We are now ready to present our main generalization bounds of interest in the strongly convex case. Our results are based on combining Theorem 3 with the Lipschitz vector-contraction inequality for Rademacher complexities developed in [17], as well as the results of [1]. Following [3, 17], given a fixed sample (x_1, c_1), ..., (x_n, c_n), we define the multivariate empirical Rademacher complexity of H as

    R̂^n(H) := E_σ[ sup_{f ∈ H} (1/n) Σ_{i=1}^n Σ_{j=1}^d σ_{ij} f_j(x_i) ] = E_σ[ sup_{f ∈ H} (1/n) Σ_{i=1}^n σ_i^T f(x_i) ] ,    (7)

where σ_{ij} are i.i.d. Rademacher random variables for i = 1, ..., n and j = 1, ..., d, and σ_i = (σ_{i1}, ..., σ_{id})^T. The expected version of the multivariate Rademacher complexity is defined as R^n(H) := E[R̂^n(H)], where the expectation is w.r.t. the i.i.d. sample drawn from the underlying distribution D.
Let us also define the empirical γ-margin SPO loss and the empirical Rademacher complexity of H with respect to the γ-margin SPO loss as follows:

    R̂^γ_SPO(f) := (1/n) Σ_{i=1}^n ℓ^γ_SPO(f(x_i), c_i) , and
    R̂^n_γSPO(H) := E_σ[ sup_{f ∈ H} (1/n) Σ_{i=1}^n σ_i ℓ^γ_SPO(f(x_i), c_i) ] ,

where f ∈ H on the left side above and σ_i are i.i.d. Rademacher random variables for i = 1, ..., n.
In the following two theorems, we focus only on the case of the ℓ2-norm set-up, i.e., the norm on the space of w variables as well as the norm on the space of cost vectors c are both the ℓ2-norm.
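The multivariate quantity in (7) is straightforward to estimate numerically. As an illustration (our own sketch with made-up sizes, not an experiment from the paper), for the Frobenius-norm-constrained linear class {x ↦ Bx : ‖B‖_F ≤ β}, the inner supremum has the closed form sup_{‖B‖_F ≤ β} (1/n) Σ_i σ_i^T B x_i = (β/n) ‖Σ_i σ_i x_i^T‖_F, so a Monte Carlo estimate only needs to average that norm over random sign draws:

```python
import numpy as np

rng = np.random.default_rng(0)

def mv_rademacher_frobenius(X, d, beta, trials=2000):
    """Monte Carlo estimate of the multivariate empirical Rademacher complexity (7)
    for H = {x -> Bx : ||B||_F <= beta}, using the Frobenius duality
    sup_{||B||_F <= beta} <B, M> = beta * ||M||_F with M = sum_i sigma_i x_i^T."""
    n = X.shape[0]
    vals = []
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=(n, d))       # i.i.d. Rademacher signs
        vals.append(beta / n * np.linalg.norm(sigma.T @ X))  # beta/n * ||M||_F
    return float(np.mean(vals))

n, p, d, beta = 50, 3, 4, 1.0
X = rng.normal(size=(n, p))  # hypothetical feature sample
est = mv_rademacher_frobenius(X, d, beta)
# A crude upper bound via Jensen: E||M||_F <= sqrt(E||M||_F^2) = sqrt(d * sum_i ||x_i||^2)
crude = beta * np.sqrt(d) * np.linalg.norm(X) / n
print(est <= crude)  # True
```

Note that the crude bound scales as 1/√n, matching the overall rate of the generalization bounds in this section.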
To the best of our knowledge, extending the vector-contraction inequality of [17] to an arbitrary norm setting (or even the case of general $\ell_q$-norms) remains an open question that would have interesting applications to our framework. Theorem 4 below presents our margin-based generalization bounds for a fixed $\gamma > 0$. Recall that $\mathcal{C}$ denotes the domain of the true cost vectors $c$, $\rho_2(\mathcal{C}) = \sup_{c \in \mathcal{C}} \|c\|_2$, and $\omega_S(\mathcal{C}) := \sup_{c \in \mathcal{C}} \omega_S(c)$.

Theorem 4. Suppose that the feasible region $S$ is $\mu$-strongly convex with respect to the $\ell_2$-norm with $\mu > 0$, and let $\gamma > 0$ be fixed. Let $\mathcal{H}$ be a family of functions mapping from $\mathcal{X}$ to $\mathbb{R}^d$. Then, for any fixed sample $((x_1, c_1), \ldots, (x_n, c_n))$ we have that
$$\hat{\mathfrak{R}}_n^{\gamma\mathrm{SPO}}(\mathcal{H}) \;\le\; \frac{5\sqrt{2}\, \rho_2(\mathcal{C})\, \hat{\mathfrak{R}}_n(\mathcal{H})}{\gamma\mu} \,.$$
Furthermore, for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. sample $S_n$ drawn from the distribution $\mathcal{D}$, each of the following holds for all $f \in \mathcal{H}$:
$$R_{\mathrm{SPO}}(f) \;\le\; \hat{R}^\gamma_{\mathrm{SPO}}(f) + \frac{10\sqrt{2}\, \rho_2(\mathcal{C})\, \mathfrak{R}_n(\mathcal{H})}{\gamma\mu} + \omega_S(\mathcal{C}) \sqrt{\frac{\log(1/\delta)}{2n}} \,, \quad \text{and}$$
$$R_{\mathrm{SPO}}(f) \;\le\; \hat{R}^\gamma_{\mathrm{SPO}}(f) + \frac{10\sqrt{2}\, \rho_2(\mathcal{C})\, \hat{\mathfrak{R}}_n(\mathcal{H})}{\gamma\mu} + 3\,\omega_S(\mathcal{C}) \sqrt{\frac{\log(2/\delta)}{2n}} \,.$$

Proof. The bound on $\hat{\mathfrak{R}}_n^{\gamma\mathrm{SPO}}(\mathcal{H})$ follows simply by combining Theorem 3, particularly (6), with equation (1) of [17]. The subsequent generalization bounds then simply follow since $R_{\mathrm{SPO}}(f) \le R^\gamma_{\mathrm{SPO}}(f)$ for all $f \in \mathcal{H}$ and by applying the version of Theorem 1 for the $\gamma$-margin SPO loss.

It is often the case that the structure of the hypothesis class $\mathcal{H}$ naturally leads to a bound on $\mathfrak{R}_n(\mathcal{H})$ that can have mild, even logarithmic, dependence on the dimensions $p$ and $d$.
For example, let us consider the general setting of a constrained linear function class, namely $\mathcal{H} = \mathcal{H}_\mathcal{B} := \{f : f(x) = Bx \text{ for some } B \in \mathbb{R}^{d \times p}, B \in \mathcal{B}\}$, where $\mathcal{B} \subseteq \mathbb{R}^{d \times p}$. In Section A.2.4 of the supplementary materials, we derive a result that extends Theorem 3 of [12] to the multivariate Rademacher complexity and provides a convenient way to bound $\mathfrak{R}_n(\mathcal{H}_\mathcal{B})$ in the case when $\mathcal{B}$ corresponds to the level set of a strongly convex function. When $\mathcal{B} = \{B : \|B\|_F \le \beta\}$ (where $\|B\|_F$ denotes the Frobenius norm of $B$), this result implies that $\mathfrak{R}_n(\mathcal{H}_\mathcal{B}) \le \rho_2(\mathcal{X})\beta\sqrt{\tfrac{2d}{n}}$, and when $\mathcal{B} = \{B : \|B\|_1 \le \beta\}$ (where $\|B\|_1$ denotes the $\ell_1$-norm of the vectorized matrix $B$), this result implies that $\mathfrak{R}_n(\mathcal{H}_\mathcal{B}) \le \rho_\infty(\mathcal{X})\beta\sqrt{\tfrac{6\log(pd)}{n}}$. Note the absence of any explicit dependence on $p$ in the first bound and only logarithmic dependence on $p, d$ in the second. We discuss the details of these and additional examples, including the "group-lasso" norm, in Section A.2.4.

Theorem 4 may also be extended to bounds that hold uniformly over all values of $\gamma \in (0, \bar\gamma]$, where $\bar\gamma > 0$ is a fixed parameter. This extension is presented below in Theorem 5.

Theorem 5. Suppose that the feasible region $S$ is $\mu$-strongly convex with respect to the $\ell_2$-norm with $\mu > 0$, and let $\bar\gamma > 0$ be fixed. Let $\mathcal{H}$ be a family of functions mapping from $\mathcal{X}$ to $\mathbb{R}^d$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. sample drawn from the distribution $\mathcal{D}$, each of the following holds for all $f \in \mathcal{H}$ and for all $\gamma \in (0, \bar\gamma]$:
$$R_{\mathrm{SPO}}(f) \;\le\; \hat{R}^\gamma_{\mathrm{SPO}}(f) + \frac{20\sqrt{2}\, \rho_2(\mathcal{C})\, \mathfrak{R}_n(\mathcal{H})}{\gamma\mu} + \omega_S(\mathcal{C}) \left( \sqrt{\frac{\log(\log_2(2\bar\gamma/\gamma))}{n}} + \sqrt{\frac{\log(2/\delta)}{2n}} \right) , \quad \text{and}$$
$$R_{\mathrm{SPO}}(f) \;\le\; \hat{R}^\gamma_{\mathrm{SPO}}(f) + \frac{20\sqrt{2}\, \rho_2(\mathcal{C})\, \hat{\mathfrak{R}}_n(\mathcal{H})}{\gamma\mu} + \omega_S(\mathcal{C}) \left( \sqrt{\frac{\log(\log_2(2\bar\gamma/\gamma))}{n}} + 3\sqrt{\frac{\log(4/\delta)}{2n}} \right) .$$

Note that a natural choice for $\bar\gamma$ in Theorem 5 is $\bar\gamma \leftarrow \sup_{f \in \mathcal{H}, x \in \mathcal{X}} \|f(x)\|_2$, presuming that one can bound this quantity based on the properties of $\mathcal{H}$ and $\mathcal{X}$. Example 4 below discusses how Theorems 4 and 5 relate to known results in binary classification.

Example 4. In [7], it is shown that the SPO loss corresponds exactly to the 0-1 loss in binary classification when $d = 1$, $S = [-1/2, +1/2]$, and $\mathcal{C} = \{-1, +1\}$. In this case, using our notation, the margin value of a prediction $\hat c$ is $c\hat c$. It is also easily seen that $\omega_S(\mathcal{C}) = \rho_2(\mathcal{C}) = 1$, the $\gamma$-margin SPO loss corresponds exactly to the margin loss (or ramp loss) that interpolates between 1 and 0 when $c\hat c \in [0, \gamma]$, and the hard $\gamma$-margin SPO loss corresponds exactly to the margin loss that returns 1 when $c\hat c \le \gamma$ and 0 otherwise. Furthermore, note that the interval $S = [-\tfrac{1}{2}, +\tfrac{1}{2}]$ is 2-strongly convex [8].
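This correspondence is easy to check numerically. Below is a small sketch (the tie-breaking at $\hat c = 0$ and the explicit ramp formula are illustrative conventions) that computes the SPO loss directly from the optimization oracle over $S = [-1/2, +1/2]$ and compares it to the 0-1 loss:

```python
# Example 4 setting: d = 1, S = [-1/2, +1/2], true costs c in {-1, +1}.

def oracle(c):
    """w*(c) = argmin_{w in [-1/2, 1/2]} c * w (tie at c = 0 broken arbitrarily)."""
    return -0.5 if c > 0 else 0.5

def spo_loss(c_hat, c):
    """SPO loss: excess cost of the decision induced by c_hat versus the true c."""
    return c * oracle(c_hat) - c * oracle(c)

def ramp_loss(c_hat, c, gamma):
    """Ramp loss: 1 for margin c*c_hat <= 0, 0 for margin >= gamma, and a
    linear interpolation in between -- the shape Example 4 identifies with
    the gamma-margin SPO loss."""
    return min(1.0, max(0.0, 1.0 - c * c_hat / gamma))

# The SPO loss reduces to the 0-1 loss on the sign of the prediction.
for c in (-1.0, 1.0):
    for c_hat in (-2.3, -0.4, 0.7, 1.9):
        assert spo_loss(c_hat, c) == (1.0 if c * c_hat < 0 else 0.0)

# The ramp interpolates between 1 and 0 as the margin sweeps [0, gamma].
assert ramp_loss(-0.3, 1.0, 0.5) == 1.0   # margin <= 0
assert ramp_loss(0.25, 1.0, 0.5) == 0.5   # margin halfway through [0, gamma]
assert ramp_loss(0.9, 1.0, 0.5) == 0.0    # margin >= gamma
```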
Thus, except for some worse absolute constants, Theorems 4 and 5 generalize the well-known results on margin guarantees based on Rademacher complexity for binary classification [14].

As in the case of binary classification, the utility of Theorems 4 and 5 is strengthened when the underlying distribution $\mathcal{D}$ has a "favorable margin property." Namely, the bounds in Theorems 4 and 5 can be much stronger than those of Corollary 2 when the distribution $\mathcal{D}$ and the sample are such that there exists a relatively large value of $\gamma$ for which the empirical $\gamma$-margin SPO loss is small. One is thus motivated to choose the value of $\gamma$ in a data-driven way so that, given a prediction function $\hat f$ trained on the data $S_n$, the upper bound on $R_{\mathrm{SPO}}(\hat f)$ is minimized. Since Theorem 5 is a uniform result over $\gamma \in (0, \bar\gamma]$, this data-driven procedure for choosing $\gamma$ is indeed valid.

5 Conclusions and Future Directions

Our work extends learning theory, as developed for binary and multiclass classification, to predict-then-optimize problems in two very significant directions: (i) obtaining worst-case generalization bounds using combinatorial parameters that measure the capacity of function classes, and (ii) exploiting special structure in data by deriving margin-based generalization bounds that scale more gracefully w.r.t. problem dimensions. It also motivates several interesting avenues for future work. Beyond the margin theory, other aspects of the problem that lead to improvements over worst-case rates should be studied. In this respect, developing a theory of local Rademacher complexity for predict-then-optimize problems would be a promising approach. It would also be valuable to use minimax constructions to provide matching lower bounds for our upper bounds.
Extending the margin theory for strongly convex sets, where the SPO loss is ill-behaved only near 0, to polyhedral sets, where it can be much more ill-behaved, is a challenging but fascinating direction. Developing a theory of surrogate losses, especially convex ones, that are calibrated w.r.t. the non-convex SPO loss will also be extremely important. Finally, the assumption that the optimization objective is linear could be relaxed to include non-linear objectives.

Acknowledgments

OB thanks Rayens Capital for their support. AE acknowledges the support of NSF via grant CMMI-1763000. PG acknowledges the support of NSF Awards CCF-1755705 and CMMI-1762744. AT acknowledges the support of NSF via CAREER grant IIS-1452099 and of a Sloan Research Fellowship.

References

[1] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[2] D. P. Bertsekas. Convex Optimization Algorithms. Athena Scientific, Belmont, MA, 2015.

[3] D. Bertsimas and N. Kallus. From predictive to prescriptive analytics. arXiv preprint arXiv:1402.5481, 2014.

[4] A. Daniely, S. Sabato, S. Ben-David, and S. Shalev-Shwartz. Multiclass learnability and the ERM principle. Journal of Machine Learning Research, 16(1):2377–2404, 2015.

[5] A. Daniely and S. Shalev-Shwartz. Optimal learners for multiclass problems. In Conference on Learning Theory, pages 287–316, 2014.

[6] P. Donti, B. Amos, and J. Z. Kolter. Task-based end-to-end model learning in stochastic optimization. In Advances in Neural Information Processing Systems, pages 5484–5494, 2017.

[7] A. N. Elmachtoub and P. Grigas. Smart "predict, then optimize". arXiv preprint arXiv:1710.08005, 2017.

[8] D. Garber and E. Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets.
In 32nd International Conference on Machine Learning (ICML 2015), 2015.

[9] Y. Guermeur. VC theory of large margin multi-category classifiers. Journal of Machine Learning Research, 8(Nov):2551–2594, 2007.

[10] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11(Feb):517–553, 2010.

[11] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13:1865–1890, June 2012.

[12] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pages 793–800, 2009.

[13] Y.-H. Kao, B. Van Roy, and X. Yan. Directed regression. In Advances in Neural Information Processing Systems, pages 889–897, 2009.

[14] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.

[15] Y. Lei, U. Dogan, A. Binder, and M. Kloft. Multi-class SVMs: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems, pages 2035–2043, 2015.

[16] J. Li, Y. Liu, R. Yin, H. Zhang, L. Ding, and W. Wang. Multi-class learning: From theory to algorithm. In Advances in Neural Information Processing Systems 31, pages 1586–1595, 2018.

[17] A. Maurer. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 3–17. Springer, 2016.

[18] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning.
MIT Press, 2018.

[19] B. K. Natarajan. On learning sets and functions. Machine Learning, 4(1):67–97, 1989.

[20] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[21] R. Tibshirani, M. Wainwright, and T. Hastie. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.