{"title": "Satisfying Real-world Goals with Dataset Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 2415, "page_last": 2423, "abstract": "The goal of minimizing misclassification error on a training set is often just one of several real-world goals that might be defined on different datasets. For example, one may require a classifier to also make positive predictions at some specified rate for some subpopulation (fairness), or to achieve a specified empirical recall. Other real-world goals include reducing churn with respect to a previously deployed model, or stabilizing online training. In this paper we propose handling multiple goals on multiple datasets by training with dataset constraints, using the ramp penalty to accurately quantify costs, and present an efficient algorithm to approximately optimize the resulting non-convex constrained optimization problem. Experiments on both benchmark and real-world industry datasets demonstrate the effectiveness of our approach.", "full_text": "Satisfying Real-world Goals with Dataset Constraints\n\nGabriel Goh\n\nAndrew Cotter, Maya Gupta\n\nDept. of Mathematics\n\nUC Davis\n\nDavis, CA 95616\n\nggoh@math.ucdavis.edu\n\nGoogle Inc.\n\n1600 Amphitheatre Parkway\nMountain View, CA 94043\n\nacotter@google.com\nmayagupta@google.com\n\nMichael Friedlander\n\nDept. of Computer Science\n\nUniversity of British Columbia\n\nVancouver, B.C. V6T 1Z4\n\nmpf@cs.ubc.ca\n\nAbstract\n\nThe goal of minimizing misclassi\ufb01cation error on a training set is often just one of\nseveral real-world goals that might be de\ufb01ned on different datasets. For example,\none may require a classi\ufb01er to also make positive predictions at some speci\ufb01ed\nrate for some subpopulation (fairness), or to achieve a speci\ufb01ed empirical recall.\nOther real-world goals include reducing churn with respect to a previously de-\nployed model, or stabilizing online training. 
In this paper we propose handling\nmultiple goals on multiple datasets by training with dataset constraints, using the\nramp penalty to accurately quantify costs, and present an ef\ufb01cient algorithm to\napproximately optimize the resulting non-convex constrained optimization problem.\nExperiments on both benchmark and real-world industry datasets demonstrate the\neffectiveness of our approach.\n\n1 Real-world goals\nWe consider a broad set of design goals important for making classi\ufb01ers work well in real-world\napplications, and discuss how metrics quantifying many of these goals can be represented in a\nparticular optimization framework. The key theme is that these metrics, which range from the\nstandard precision and recall, to less well-known examples such as coverage and fairness [17, 27, 15],\nand including some new proposals, can be expressed in terms of the positive and negative classi\ufb01cation\nrates on multiple datasets.\nCoverage: One may wish to control how often a classi\ufb01er predicts the positive (or negative) class.\nFor example, one may want to ensure that only 10% of customers are selected to receive a printed\ncatalog due to budget constraints, or perhaps to compensate for a biased training set. In practice,\nconstraining the \u201ccoverage rate\u201d (the expected proportion of positive predictions) is often easier than\nmeasuring e.g. accuracy or precision because coverage can be computed on unlabeled data\u2014labeling\ndata can be expensive, but acquiring a large number of unlabeled examples is often very easy.\nCoverage was also considered by Mann and McCallum [17], who proposed what they call \u201clabel\nregularization\u201d, in which one adds a regularizer penalizing the relative entropy between the mean\nscore for each class and the desired distribution, with an additional correction to avoid degeneracies.\nChurn: Work does not stop once a machine learning model has been adopted. 
There will be new\ntraining data, improved features, and potentially new model structures. Hence, in practice, one will\ndeploy a series of models, each improving slightly upon the last. In this setting, determining whether\neach candidate should be deployed is surprisingly challenging: if we evaluate on the same held-out\ntesting set every time a new candidate is proposed, and deploy it if it outperforms its predecessor, then\nevery compare-and-deploy decision will increase the statistical dependence between the deployed\nmodel and the testing dataset, causing the model sequence to \ufb01t the originally-independent testing\ndata. This problem is magni\ufb01ed if, as is typical, the candidate models tend to disagree only on a\nrelatively small number of examples near the true decision boundary.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fA simple and safe solution is to draw a fresh testing sample every time one wishes to compare two\nmodels in the sequence, only considering examples on which the two models disagree. Because\nlabeling data is expensive, one would like these freshly sampled testing datasets to be as small as\npossible. It is here that the problem of \u201cchurn\u201d arises. Imagine that model A, our deployed model,\nis 70% accurate, and that model B, our candidate, is 75% accurate. In the best case, only 5% of\ntest samples would be labeled differently, and all differences would be \u201cwins\u201d for classi\ufb01er B. Then\nonly a dozen or so examples would need to be labeled in order to establish that B is the statistically\nsigni\ufb01cantly better classi\ufb01er with 95% con\ufb01dence. In the worst case, model A would be correct and\nmodel B incorrect 25% of the time, model B correct and model A incorrect 30% of the time, and\nboth models correct the remaining 45% of the time. 
Then 55% of testing examples will be labeled\ndifferently, and closer to 1000 examples would need to be labeled to determine that model B is better.\nWe de\ufb01ne the \u201cchurn rate\u201d as the expected proportion of examples on which the prediction of the\nmodel being considered (model B above) differs from that of the currently-deployed model (model A).\nDuring training, we propose constraining the empirical churn rate with respect to a given deployed\nmodel on a large unlabeled dataset (see also Fard et al. [12] for an alternative approach).\nStability: A special case of minimizing churn is to ensure stability of an online classi\ufb01er as it\nevolves, by constraining it to not deviate too far from a trusted classi\ufb01er on a large held-out unlabeled\ndataset.\nFairness: A practitioner may be required to guarantee fairness of a learned classi\ufb01er, in the sense\nthat it makes positive predictions on different subgroups at certain rates. For example, one might\nrequire that housing loans be given equally to people of different genders. Hardt et al. [15] identify\nthree types of fairness: (i) demographic parity, in which positive predictions are made at the same\nrate on each subgroup, (ii) equal opportunity, in which only the true positive rates must match, and\n(iii) equalized odds, in which both the true positive rates and false positive rates must match. Fairness\ncan also be speci\ufb01ed by a proportion, such as the 80% rule in US law that certain decisions must be\nin favor of group B individuals at least 80% as often as group A individuals [e.g. 3, 26, 27, 15].\nZafar et al. [27] propose learning fair classi\ufb01ers by imposing linear constraints on the covariance\nbetween the predicted labels and the values of certain features, while Hardt et al. [15] propose \ufb01rst\nlearning an \u201cunfair\u201d classi\ufb01er, and then choosing population-dependent thresholds to satisfy the\ndesired fairness criterion. 
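As a concrete illustration of the kind of rate comparison involved, checking the 80% rule reduces to comparing two empirical positive-prediction rates (a toy sketch with made-up predictions; the helper names are ours, not from the paper):

```python
import numpy as np

def positive_rate(predictions):
    """Empirical positive classification rate: the fraction of +1 predictions."""
    return float(np.mean(np.asarray(predictions) == 1))

def satisfies_fairness_rule(preds_a, preds_b, kappa=0.8):
    """Check that group B receives positive decisions at least kappa times
    as often as group A (kappa = 0.8 corresponds to the 80% rule)."""
    return positive_rate(preds_b) >= kappa * positive_rate(preds_a)

# Toy predictions: group A is classified positive 50% of the time,
# group B only 30% of the time, so the 80% rule (threshold 0.4) is violated.
group_a = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]
group_b = [1, 1, 1, -1, -1, -1, -1, -1, -1, -1]
```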
In our framework, rate constraints such as those mentioned above can be\nimposed directly, at training time.\nRecall and Precision: Requirements of real-world classi\ufb01ers are often expressed in terms of\nprecision and recall, especially when examples are highly imbalanced between positives and negatives.\nIn our framework, we can handle this problem via Neyman-Pearson classi\ufb01cation [e.g. 23, 9], in\nwhich one seeks to minimize the false negative rate subject to a constraint on the false positive rate.\nIndeed, our ramp-loss formulation is equivalent to that of Gasso et al. [13] in this setting.\nEgregious Examples: For certain classi\ufb01cation applications, examples may be discovered that are\nparticularly embarrassing if classi\ufb01ed incorrectly. One standard approach to handling such examples\nis to increase their weights during training, but this is dif\ufb01cult to get right: too large a weight may\ndistort the classi\ufb01er too much in the surrounding feature space, whereas too small a weight may not\n\ufb01x the problem. Worse, over time the dataset will often be augmented with new training examples\nand new features, causing the ideal weights to drift. We propose instead simply adding a constraint\nensuring that some proportion of a set of such egregious examples is correctly classi\ufb01ed. Such\nconstraints should be used with extreme care, since they can cause the problem to become infeasible.\n2 Optimization problem\nA key aspect of many of the goals of Section 1 is that they are de\ufb01ned on different datasets. 
For example, we might seek to maximize the accuracy on a set of labeled examples drawn in some biased manner, require that its recall be at least 90% on 50 small datasets sampled in an unbiased manner from 50 different countries, desire low churn relative to a deployed classifier on a large unbiased unlabeled dataset, and require that 100 given egregious examples be classified correctly.

Another characteristic common to the metrics of Section 1 is that they can be expressed in terms of the positive and negative classification rates on various datasets. We consider only unlabeled datasets, as described in Table 1—a dataset with binary labels, for example, would be handled by partitioning it into the two unlabeled datasets D+ and D− containing the positive and negative examples, respectively.

Table 1: Dataset notation.

Notation: Dataset
D: Any dataset
D+, D−: Sets of examples labeled positive/negative, respectively
D++, D+−, D−+, D−−: Sets of examples with ground-truth positive/negative labels, and for which a baseline classifier makes positive/negative predictions
D^A, D^B: Sets of examples belonging to subpopulation A and B, respectively

Table 2: The quantities discussed in Section 1, expressed in the notation used in Problem 1, with the dependence on w and b dropped for notational simplicity, and using the dataset notation of Table 1.

Metric: Expression
Coverage rate: s_p(D)
#TP, #TN, #FP, #FN: |D+| s_p(D+), |D−| s_n(D−), |D−| s_p(D−), |D+| s_n(D+)
#Errors: #FP + #FN
Error rate: #Errors / (|D+| + |D−|)
Recall: #TP / (#TP + #FN) = #TP / |D+|
#Changes: |D++| s_n(D++) + |D+−| s_p(D+−) + |D−+| s_n(D−+) + |D−−| s_p(D−−)
Churn rate: #Changes / (|D++| + |D+−| + |D−+| + |D−−|)
Fairness constraint: s_p(D^A) ≥ κ s_p(D^B), where κ > 0
Equal opportunity constraint: s_p(D^A ∩ D+) ≥ κ s_p(D^B ∩ D+), where κ > 0
Egregious example constraint: s_p(D+) ≥ κ and/or s_n(D−) ≥ κ for a dataset D of egregious examples, where κ ∈ [0, 1]

We wish to learn a linear classification function f(x) = ⟨w, x⟩ − b parameterized by a weight vector w ∈ R^d and bias b ∈ R, for which the positive and negative classification rates are:

    s_p(D; w, b) = (1/|D|) Σ_{x ∈ D} 1(⟨w, x⟩ − b),    s_n(D; w, b) = s_p(D; −w, −b),    (1)

where 1 is an indicator function that is 1 if its argument is positive, 0 otherwise. In words, s_p(D; w, b) and s_n(D; w, b) denote the proportion of positive or negative predictions, respectively, that f makes on D. Table 2 specifies how the metrics of Section 1 can be expressed in terms of the s_ps and s_ns.

We propose handling these goals by minimizing an ℓ2-regularized positive linear combination of prediction rates on different datasets, subject to upper-bound constraints on other positive linear combinations of such prediction rates:

Problem 1. Starting point: discontinuous constrained problem

    minimize_{w ∈ R^d, b ∈ R}    Σ_{i=1}^k (α_i^(0) s_p(D_i; w, b) + β_i^(0) s_n(D_i; w, b)) + (λ/2) ‖w‖_2^2
    s.t.    Σ_{i=1}^k (α_i^(j) s_p(D_i; w, b) + β_i^(j) s_n(D_i; w, b)) ≤ γ^(j),    j ∈ {1, . . . , m}.

Here, λ is the parameter on the ℓ2 regularizer, there are k unlabeled datasets D_1, . . . , D_k, and there are m constraints. The metrics minimized by the objective and bounded by the constraints are specified via the choices of the nonnegative coefficients α_i^(0), β_i^(0), α_i^(j), β_i^(j) and upper bounds γ^(j) for the ith dataset and, where applicable, the jth constraint—a user should base these choices on Table 2. Note that because s_p + s_n = 1, it is possible to transform any linear combination of rates into an equivalent positive linear combination, plus a constant (see Appendix B¹ for an example).

¹Appendices may be found in the supplementary material

Algorithm 1 Proposed majorization-minimization procedure for (approximately) optimizing Problem 2. Starting from an initial feasible solution w^(0), b_0, we repeatedly find a convex upper bound problem that is tight at the current candidate solution, and optimize it to yield the next candidate. See Section 2.1 for details, and Section 2.2 for how one can perform the inner optimizations on line 3.

MajorizationMinimization(w^(0), b_0, T)
1    For t ∈ {1, 2, . . . , T}
2        Construct an instance of Problem 3 with w′ = w^(t−1) and b′ = b_{t−1}
3        Optimize this convex optimization problem to yield w^(t) and b_t
4    Return w^(T), b_T

We cannot optimize Problem 1 directly because the rate functions s_p and s_n are discontinuous. We can, however, work around this difficulty by training a classifier that makes randomized predictions based on the ramp function [7]:

    σ(z) = max{0, min{1, 1/2 + z}},    (2)

where the randomized classifier parameterized by w and b will make a positive prediction on x with probability σ(⟨w, x⟩ − b), and a negative prediction otherwise (see Appendix A for more on this randomized classification rule).
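To make the randomized prediction rule concrete, here is a minimal NumPy sketch (the function names and toy usage are ours, not taken from the paper's implementation):

```python
import numpy as np

def ramp(z):
    """The ramp function sigma(z) = max{0, min{1, 1/2 + z}} of Equation (2)."""
    return np.clip(0.5 + np.asarray(z, dtype=float), 0.0, 1.0)

def predict_random(w, b, X, rng):
    """Randomized classifier: predict +1 on each row x of X with
    probability sigma(<w, x> - b), and -1 otherwise."""
    p = ramp(X @ w - b)
    return np.where(rng.random(len(X)) < p, 1, -1)

# Scores at least 1/2 above the threshold are always classified positive,
# scores at least 1/2 below always negative; in between, the prediction
# is positive with probability 1/2 + score.
rng = np.random.default_rng(0)
X = np.array([[2.0], [-2.0], [0.1]])
preds = predict_random(np.array([1.0]), 0.0, X, rng)
```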
For this randomized classifier, the expected positive and negative rates will be:

    r_p(D; w, b) = (1/|D|) Σ_{x ∈ D} σ(⟨w, x⟩ − b),    r_n(D; w, b) = r_p(D; −w, −b).    (3)

Using these expected rates yields a continuous (but non-convex) analogue of Problem 1:

Problem 2. Ramp version of Problem 1

    minimize_{w ∈ R^d, b ∈ R}    Σ_{i=1}^k (α_i^(0) r_p(D_i; w, b) + β_i^(0) r_n(D_i; w, b)) + (λ/2) ‖w‖_2^2
    s.t.    Σ_{i=1}^k (α_i^(j) r_p(D_i; w, b) + β_i^(j) r_n(D_i; w, b)) ≤ γ^(j),    j ∈ {1, . . . , m}.

Efficient optimization of this problem is the ultimate goal of this section. In Section 2.1, we will propose a majorization-minimization approach that sequentially minimizes convex upper bounds on Problem 2, and, in Section 2.2, will discuss how these convex upper bounds may themselves be efficiently optimized.

2.1 Optimizing the ramp problem

To address the non-convexity of Problem 2, we will iteratively optimize approximations: starting from a feasible initial candidate solution, we construct a convex optimization problem upper-bounding Problem 2 that is tight at the current candidate, optimize this convex problem to yield the next candidate, and repeat.

Our choice of a ramp for σ makes finding such tight convex upper bounds easy: both the hinge function max{0, 1/2 + z} and the constant-1 function are upper bounds on σ, with the former being tight for all z ≤ 1/2, and the latter for all z ≥ 1/2 (see Figure 1). We'll therefore define the following upper bounds on σ and 1 − σ, with the additional parameter z′ determining which of the two bounds (hinge or constant) will be used, such that the bounds will always be tight for z = z′:

    σ̌_p(z; z′) = max{0, 1/2 + z} if z′ ≤ 1/2, and 1 otherwise;    σ̌_n(z; z′) = σ̌_p(−z; −z′).    (4)

Figure 1: Convex upper bounds on the ramp function σ(z) = max{0, min{1, 1/2 + z}}. Notice that the hinge bound (red) is tight for all z ≤ 1/2, and the constant bound (blue) is tight for all z ≥ 1/2.

Based upon these we define the following upper bounds on the expected rates:

    ř_p(D; w, b; w′, b′) = (1/|D|) Σ_{x ∈ D} σ̌_p(⟨w, x⟩ − b; ⟨w′, x⟩ − b′),
    ř_n(D; w, b; w′, b′) = (1/|D|) Σ_{x ∈ D} σ̌_n(⟨w, x⟩ − b; ⟨w′, x⟩ − b′),    (5)

which have the properties that both ř_p and ř_n are convex in w and b, are upper bounds on the original ramp-based rates:

    ř_p(D; w, b; w′, b′) ≥ r_p(D; w, b)    and    ř_n(D; w, b; w′, b′) ≥ r_n(D; w, b),

and are tight at w′, b′:

    ř_p(D; w′, b′; w′, b′) = r_p(D; w′, b′)    and    ř_n(D; w′, b′; w′, b′) = r_n(D; w′, b′).

Algorithm 2 Skeleton of a cutting-plane algorithm that optimizes Equation 6 to within ε for v ∈ V, where V ⊆ R^m is compact and convex. Here, l_0, u_0 ∈ R are finite with l_0 ≤ max_{v ∈ V} z(v) ≤ u_0. There are several options for the CutChooser function on line 8—please see Appendix E for details. The SVMOptimizer function returns w^(t) and b_t approximately minimizing Ψ(w, b, v^(t); w′, b′), and a lower bound l_t ≤ z(v^(t)) for which u_t − l_t ≤ ε_t for u_t as defined on line 10.

CuttingPlane(l_0, u_0, V, ε)
1    Initialize g^(0) ∈ R^m to the all-zero vector
2    For t ∈ {1, 2, . . .}
3        Let h_t(v) = min_{s ∈ {0, 1, . . . , t−1}} (u_s + ⟨g^(s), v − v^(s)⟩)
4        Let L_t = max_{s ∈ {0, 1, . . . , t−1}} l_s and U_t = max_{v ∈ V} h_t(v)
5        If U_t − L_t ≤ ε then
6            Let s ∈ {1, . . . , t − 1} be an index maximizing l_s
7            Return w^(s), b_s, v^(s)
8        Let v^(t), ε_t = CutChooser(h_t, L_t)
9        Let w^(t), b_t, l_t = SVMOptimizer(v^(t), h_t(v^(t)), ε_t)
10       Let u_t = Ψ(w^(t), b_t, v^(t); w′, b′) and g^(t) = ∇_v Ψ(w^(t), b_t, v^(t); w′, b′)

Substituting these bounds into Problem 2 yields:

Problem 3. Convex upper bound on Problem 2

    minimize_{w ∈ R^d, b ∈ R}    Σ_{i=1}^k (α_i^(0) ř_p(D_i; w, b; w′, b′) + β_i^(0) ř_n(D_i; w, b; w′, b′)) + (λ/2) ‖w‖_2^2
    s.t.    Σ_{i=1}^k (α_i^(j) ř_p(D_i; w, b; w′, b′) + β_i^(j) ř_n(D_i; w, b; w′, b′)) ≤ γ^(j),    j ∈ {1, . . . , m}.

As desired, this problem upper bounds Problem 2, is tight at w′, b′, and is convex (because any positive linear combination of convex functions is convex).

Algorithm 1 contains our proposed procedure for approximately solving Problem 2. Given an initial feasible solution, it's straightforward to verify inductively, using the fact that we construct tight convex upper bounds at every step, that every convex subproblem will have a feasible solution, every (w^(t), b_t) pair will be feasible w.r.t. Problem 2, and every (w^(t+1), b_{t+1}) will have an objective function value that is no larger than that of (w^(t), b_t). 
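The hinge/constant majorizer of the ramp is easy to sanity-check numerically. The following sketch (our own illustration, not the paper's code) verifies the two properties that make the majorization-minimization step sound: the bound dominates the ramp everywhere, and matches it at z = z′:

```python
import numpy as np

def ramp(z):
    # sigma(z) = max{0, min{1, 1/2 + z}}, Equation (2).
    return np.clip(0.5 + np.asarray(z, dtype=float), 0.0, 1.0)

def ramp_upper(z, z_prime):
    """Convex majorizer of the ramp, Equation (4): the hinge max{0, 1/2 + z}
    when z' <= 1/2, and the constant 1 otherwise."""
    if z_prime <= 0.5:
        return np.maximum(0.0, 0.5 + np.asarray(z, dtype=float))
    return np.ones_like(np.asarray(z, dtype=float))

# Check the two defining properties on a grid of (z, z') pairs:
zs = np.linspace(-3.0, 3.0, 121)
for z_prime in (-2.0, 0.0, 0.3, 0.5, 0.7, 2.0):
    bound = ramp_upper(zs, z_prime)
    assert np.all(bound >= ramp(zs) - 1e-12)  # upper bound everywhere
    assert abs(float(ramp_upper(z_prime, z_prime)) - float(ramp(z_prime))) < 1e-12  # tight at z'
```

Because each subproblem is built from these bounds, its objective dominates the ramp objective yet agrees with it at the current candidate, which is exactly why each iteration's solution can only improve (or match) the previous one.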
In other words, no iteration can make negative progress. The non-convexity of Problem 2, however, will cause Algorithm 1 to arrive at a suboptimal solution that depends on the initial (w^(0), b_0).

2.2 Optimizing the convex subproblems

The first step in optimizing Problem 3 is to add Lagrange multipliers v over the constraints, yielding the equivalent unconstrained problem:

    maximize_{v ⪰ 0} z(v),    where    z(v) = min_{w, b} Ψ(w, b, v; w′, b′),    (6)

where the function:

    Ψ(w, b, v; w′, b′) = Σ_{i=1}^k ((α_i^(0) + Σ_{j=1}^m v_j α_i^(j)) ř_p(D_i; w, b; w′, b′) + (β_i^(0) + Σ_{j=1}^m v_j β_i^(j)) ř_n(D_i; w, b; w′, b′)) + (λ/2) ‖w‖_2^2 − Σ_{j=1}^m v_j γ^(j)    (7)

is convex in w and b, and concave in the multipliers v. For the purposes of this section, w′ and b′, which were found in the previous iteration of Algorithm 1, are fixed constants.

Because this is a convex-concave saddle point problem, there are a large number of optimization techniques that could be successfully applied. For example, in settings similar to our own, Eban et al. [10] simply perform SGD jointly over all parameters (including v), while Gasso et al. [13] use the Uzawa algorithm, which would alternate between (i) optimizing exactly over w and b, and (ii) taking gradient steps on v.

We instead propose an approach for which, in our setting, it is particularly easy to create an efficient implementation. The key insight is that evaluating z(v) is, thanks to our use of hinge and constant upper-bounds on our ramp σ, equivalent to optimization of a support vector machine (SVM) with per-example weights—see Appendix F for details. This observation enables us to solve the saddle system in an inside-out manner. On the "inside", we optimize over (w, b) for fixed v using an off-the-shelf SVM solver [e.g. 6]. On the "outside", the resulting (w, b)-optimizer is used as a component in a cutting-plane optimization over v. Notice that this outer optimization is very low-dimensional, since v ∈ R^m, where m is the number of constraints.

Algorithm 2 contains a skeleton of the cutting-plane algorithm that we use for this outer optimization over v. Because this algorithm is intended to be used as an outer loop in a nested optimization routine, it does not expect that z(v) can be evaluated or differentiated exactly. Rather, it's based upon the idea of possibly making "shallow" cuts [4] by choosing a desired accuracy ε_t at each iteration, and expecting the SVMOptimizer to return a solution with suboptimality ε_t. More precisely, the SVMOptimizer function approximately evaluates z(v^(t)) for a given fixed v^(t) by constructing the corresponding SVM problem and finding a (w^(t), b_t) for which the primal and dual objective function values differ by at most ε_t.

After finding (w^(t), b_t), the SVMOptimizer then evaluates the dual objective function value of the SVM to determine l_t. The primal objective function value u_t and its gradient g^(t) w.r.t. v (calculated on line 10 of Algorithm 2) define the cut u_t + ⟨g^(t), v − v^(t)⟩. Notice that since Ψ(w^(t), b_t, v; w′, b′) is a linear function of v, it is equal to this cut function, which therefore upper-bounds min_{w, b} Ψ(w, b, v; w′, b′).

One advantage of this cutting-plane formulation is that typical CutChooser implementations will choose ε_t to be large in the early iterations, and will only shrink it to be ε or smaller once we're close to convergence. We leave the details of the analysis to Appendices E and F—a summary can be found in Appendix G.

3 Related work

The problem of finding optimal trade-offs in the presence of multiple objectives has been studied generically in the field of multi-objective optimization [18]. Two common approaches are (i) linear scalarization [18, Section 3.1], and (ii) the method of ε-constraints [18, Section 3.2]. Linear scalarization reduces to the common heuristic of reweighting groups of examples. The method of ε-constraints puts hard bounds on the magnitudes of secondary objectives, like our dataset constraints. Notice that, in our formulation, the Lagrange multipliers v play the role of the weights in the linear scalarization approach, with the difference being that, rather than being provided directly by the user, they are dynamically chosen to satisfy constraints. The user controls the problem through these constraint choices, which have concrete real-world meanings.

While the hinge loss is one of the most commonly-used convex upper bounds on the 0/1 loss [22], we use the ramp loss, trading off convexity for tightness. For our purposes, the main disadvantage of the hinge loss is that it is unbounded, and therefore cannot distinguish a single very bad example from, say, 10 slightly bad ones, making it ill-suited for constraints on rates. In contrast, for the ramp loss the contribution of any single datum is bounded, no matter how far it is from the decision boundary. The ramp loss has also been investigated in Collobert et al. 
[7] (without constraints). Gasso et al. [13] use the ramp loss both in the objective and constraints, but their algorithm only tackles the Neyman-Pearson problem. They compared their classifier to that of Davenport et al. [9], which differs in that it uses a hinge relaxation instead of the ramp loss, and found with the ramp loss they achieved similar or slightly better results with up to 10× less computation (our approach does not enjoy this computational speedup).

Narasimhan et al. [19] considered optimizing the F-measure and other quantities that can be written as concave functions of the TP and TN rates. Their proposed stochastic dual solver adaptively linearizes concave functions of the rate functions (Equation 1). Joachims [16] indirectly optimizes upper-bounds on functions of s_p(D+), s_p(D−), s_n(D+), s_n(D−) using a hinge loss approximation. Finally, for some simple problems (particularly when there is only one constraint), the goals in Section 1 can be coarsely handled by simple bias-shifting, i.e. first training an unconstrained classifier, and then attempting to adjust the decision threshold to satisfy the constraints as a second step.

Figure 2: Blue dots: our proposal, with the classification functions' predictions being deterministically thresholded at zero. Red dots: same, but using the randomized classification rule described in Section 2. Green dots: Zafar et al. [27]. Green line: unconstrained SVM. (Left) Test set error plotted vs. observed test set fairness ratio s_p(D^M)/s_p(D^F). (Right) The 1/κ hyper-parameter used to specify the desired fairness in the proposed method, and the observed fairness ratios of our classifiers on the test data. All points are averaged over 100 runs.

4 Experiments

We evaluate the performance of the proposed approach in two experiments, the first using a benchmark dataset for fairness, and the second on a real-world problem with churn and recall constraints.

4.1 Fairness

We compare training for fairness on the Adult dataset², the same dataset used by Zafar et al. [27]. The 32 561 training and 16 281 testing examples, derived from the 1994 Census, are 123-dimensional and sparse. Each feature contains categorical attributes such as race, gender, education levels and relationship status. A positive class label means that individual's income exceeds 50k. Let D^M and D^F denote the sets of male and female examples. The number of positive labels in D^M is roughly six times that of D^F. The goal is to train a classifier that respects the fairness constraint s_p(D^M) ≤ s_p(D^F)/κ for a parameter κ ∈ (0, 1] (where κ = 0.8 corresponds to the 80% rule mentioned in Section 1).

Our publicly-available Julia implementation³ for these experiments uses LIBLINEAR [11] with the default parameters (most notably λ = 1/n ≈ 3 × 10^−5) to implement the SVMOptimizer function, and does not include an unregularized bias b. The outer optimization over v does not use the m-dimensional cutting plane algorithm of Algorithm 2, instead using a simpler one-dimensional variant (observe that these experiments involve only one constraint). The majorization-minimization procedure starts from the all-zeros vector (w^(0) in Algorithm 1).

We compare to the method of Zafar et al. [27], which proposed handling fairness with the constraint:

    ⟨w, x̄⟩ ≤ c,    where    x̄ = |D^M|^−1 Σ_{x ∈ D^M} x − |D^F|^−1 Σ_{x ∈ D^F} x.    (8)

An SVM subject to this constraint (see Appendix D for details), for a range of c values, is our baseline. Results in Figure 2 show the proposed method is much more accurate for any desired fairness, and achieves fairness ratios not reachable with the approach of Zafar et al. [27] for any choice of c. It is also easier to control: the values of c in Zafar et al. [27] do not have a clear interpretation, whereas κ is an effective proxy for the fairness ratio.

²"a9a" from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
³https://github.com/gabgoh/svmc.jl

4.2 Churn

Our second set of experiments demonstrates meeting real-world requirements on a proprietary problem from Google: predicting whether a user interface element should be shown to a user, based

Figure 3: Blue: our proposal, with the classification functions' predictions being deterministically thresholded at zero. Red: same, but using the randomized classification rule described in Section 2. Green: unconstrained SVM trained on D1 ∪ D2, then thresholded (by shifting the bias b) to satisfy the recall constraint on D2. Dashed and dotted curves denote results on the testing and training datasets, respectively. (Left) Observed churn (vertical axis) vs. the churn target used during training (horizontal axis), on the unlabeled dataset D3. (Right) Empirical error rates (vertical axis) vs. 
the\nchurn target, on the union D1 \u222a D2 of the two labeled datasets. All curves are averaged over 10 runs.\n\non a 31-dimensional vector of informative features, which is mapped to a roughly 30 000-dimensional\nfeature vector via a \ufb01xed kernel function \u03a6. We train classi\ufb01ers that are linear with respect to \u03a6(x).\nWe are given the currently-deployed model, and seek to train a classi\ufb01er that (i) has high accuracy,\n(ii) has no worse recall than the deployed model, and (iii) has low churn w.r.t. the deployed model.\nWe are given three datasets, D1, D2 and D3, consisting of 131 840, 53 877 and 68 892 examples,\nrespectively. The datasets D1 and D2 are hand-labeled, while D3 is unlabeled. In addition, D1 was\nchosen via active sampling, while D2 and D3 are sampled i.i.d. from the underlying data distribution.\nFor all three datasets, we split out 80% for training and reserved 20% for testing. We address the three\ngoals in the proposed framework by simultaneously training the classi\ufb01er to minimize the number of\nerrors on D1 plus the number of false positives on D2, subject to the constraints that the recall on\nD2 be at least as high as the deployed model\u2019s recall (we\u2019re essentially performing Neyman-Pearson\nclassi\ufb01cation on D2), and that the churn w.r.t. the deployed model on D3 be no larger than a given\ntarget parameter.\nThese experiments use a proprietary C++ implementation of Algorithm 2, using the combined SDCA\nand cutting plane approach of Appendix F to implement the inner optimizations over w and b, with\nthe CutChooser helper functions being as described in Appendices E.1 and F.2.1. We performed 5\niterations of the majorization-minimization procedure of Algorithm 1.\nOur baseline is an unconstrained SVM that is thresholded after training to achieve the desired recall,\nbut makes no effort to minimize churn. 
We chose the regularization parameter λ using a power-of-10 grid search, found that 10⁻⁷ was best for this baseline, and then used λ = 10⁻⁷ for all experiments.

The plots in Figure 3 show the achieved churn and error rates on the training and testing sets for a range of churn constraint values (red and blue curves), compared to the baseline thresholded SVM (green lines). When using deterministic thresholding of the learned classifier (the blue curves, which significantly outperformed randomized classification, the red curves), the proposed method achieves lower churn and better accuracy for all targeted churn rates, while also meeting the recall constraint.

As expected, the empirical churn is extremely close to the targeted churn on the training set when using randomized classification (red dotted curve, left plot), but less so on the 20% held-out test set (red dashed curve). We hypothesize that this disparity is due to overfitting, since the classifier has 30,000 parameters and D3 is rather small (see Appendix C for a discussion of the generalization performance of our approach). However, except for the lowest targeted churn, the actual classifier churn (blue dashed curves) is substantially lower than the targeted churn. Compared to the thresholded SVM baseline, our approach significantly reduces churn without paying an accuracy cost.

References

[1] K. Ball. An elementary introduction to modern convex geometry. Flavors of Geometry, 31:1-58, 1997.
[2] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463-482, 2002.
[3] D. Biddle. Adverse Impact and Test Validation: A Practitioner's Guide to Valid and Defensible Employment Testing. Gower, 2005.
[4] R. G. Bland, D. Goldfarb, and M. J. Todd. Feature article: the ellipsoid method: a survey. Operations Research, 29(6):1039-1091, November 1981.
[5] S. Boyd and L. Vandenberghe. Localization and cutting-plane methods, April 2011. Stanford EE 364b lecture notes.
[6] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[7] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In ICML, 2006.
[8] A. Cotter, S. Shalev-Shwartz, and N. Srebro. Learning optimally sparse support vector machines. In ICML, pages 266-274, 2013.
[9] M. Davenport, R. G. Baraniuk, and C. D. Scott. Tuning support vector machines for minimax and Neyman-Pearson classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[10] E. E. Eban, M. Schain, A. Gordon, R. A. Saurous, and G. Elidan. Large-scale learning with global non-decomposable objectives, 2016. URL https://arxiv.org/abs/1608.04802.
[11] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871-1874, 2008.
[12] M. M. Fard, Q. Cormier, K. Canini, and M. Gupta. Launch and iterate: Reducing prediction churn. In NIPS, 2016.
[13] G. Gasso, A. Pappaionannou, M. Spivak, and L. Bottou. Batch and online learning algorithms for nonconvex Neyman-Pearson classification. ACM Transactions on Intelligent Systems and Technology, 2011.
[14] B. Grünbaum. Partitions of mass-distributions and convex bodies by hyperplanes. Pacific Journal of Mathematics, 10(4):1257-1261, December 1960.
[15] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In NIPS, 2016.
[16] T. Joachims. A support vector method for multivariate performance measures. In ICML, 2005.
[17] G. S. Mann and A. McCallum. Simple, robust, scalable semi-supervised learning with expectation regularization. In ICML, 2007.
[18] K. Miettinen. Nonlinear Multiobjective Optimization, volume 12. Springer Science & Business Media, 2012.
[19] H. Narasimhan, P. Kar, and P. Jain. Optimizing non-decomposable performance measures: a tale of two classes. In ICML, 2015.
[20] A. Nemirovski. Lecture notes: Efficient methods in convex programming. 1994. URL http://www2.isye.gatech.edu/~nemirovs/Lect_EMCO.pdf.
[21] L. Rademacher. Approximating the centroid is hard. In SoCG, pages 302-305, 2007.
[22] R. T. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21-42, 2000.
[23] C. D. Scott and R. D. Nowak. A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory, 2005.
[24] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. JMLR, 14(1):567-599, February 2013.
[25] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. Mathematical Programming, 127(1):3-30, March 2011.
[26] M. S. Vuolo and N. B. Levy. Disparate impact doctrine in fair housing. New York Law Journal, 2013.
[27] M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness constraints: A mechanism for fair classification. In ICML Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2015.