{"title": "On Human-Aligned Risk Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 15055, "page_last": 15064, "abstract": "The statistical decision theoretic foundations of modern machine learning have largely focused on the minimization of the expectation of some loss function for a given task. However, seminal results in behavioral economics have shown that human decision-making is based on different risk measures than the expectation of any given loss function. In this paper, we pose the following simple question: in contrast to minimizing expected loss, could we minimize a better human-aligned risk measure? While this might not seem natural at first glance, we analyze the properties of such a revised risk measure, and surprisingly show that it might also better align with additional desiderata like fairness that have attracted considerable recent attention. We focus in particular on a class of human-aligned risk measures inspired by cumulative prospect theory. We empirically study these risk measures, and demonstrate their improved performance on desiderata such as fairness, in contrast to the traditional workhorse of expected loss minimization.", "full_text": "On Human-Aligned Risk Minimization\n\nLiu Leqi\nCarnegie Mellon University\nPittsburgh, PA 15213\nleqil@cs.cmu.edu\n\nAdarsh Prasad\nCarnegie Mellon University\nPittsburgh, PA 15213\nadarshp@cs.cmu.edu\n\nPradeep Ravikumar\nCarnegie Mellon University\nPittsburgh, PA 15213\npradeepr@cs.cmu.edu\n\nAbstract\n\nThe statistical decision theoretic foundations of modern machine learning have largely focused on the minimization of the expectation of some loss function for a given task. However, seminal results in behavioral economics have shown that human decision-making is based on different risk measures than the expectation of any given loss function. 
In this paper, we pose the following simple question: in contrast to minimizing expected loss, could we minimize a better human-aligned risk measure? While this might not seem natural at first glance, we analyze the properties of such a revised risk measure, and surprisingly show that it might also better align with additional desiderata like fairness that have attracted considerable recent attention. We focus in particular on a class of human-aligned risk measures inspired by cumulative prospect theory. We empirically study these risk measures, and demonstrate their improved performance on desiderata such as fairness, in contrast to the traditional workhorse of expected loss minimization.\n\n1 Introduction\n\nThe decision-theoretic foundations of modern machine learning models have largely focused on estimating model parameters that minimize the expectation of some loss function. This ensures that the resulting models have high average-case performance, which, loosely, is what is meant by good generalization performance. However, as ML models are increasingly deployed in broader societal settings, and in particular, to assist humans in decision-making, it is clear that humans want models to have not just good average performance but also properties like fairness. Due to the importance of these additional desiderata, there has been burgeoning interest in capturing these properties via appropriate constraints and modifications of the classical objective of expected loss [8, 12, 18, 27]. In this work, we posit a very natural, if simple, solution for addressing these varied desiderata that are driven in large part by human considerations. 
Specifically, we suggest that, in contrast to using the standard workhorse of expected loss, we draw from theories of human cognition in psychology and behavioral economics to consider a “human-aligned” risk instead.\n\nAlternatives to expected-loss-based risk measures have a long history in decision-making [16], with earlier efforts focusing on percentile risk criteria [11]. In machine learning, instead of minimizing expected loss, various risk measures have been considered in different settings. In risk-sensitive reinforcement learning, conditional value-at-risk (CVaR), a percentile risk measure that quantifies the tail performance of a model, has been connected to robustness to modeling errors [5, 21]. Recently, human-aligned risk measures have also been explored in bandit [14] and reinforcement learning [23] settings, where the goal of the agent is to produce long-term returns aligned with the preferences of one or more humans.\n\nContributions. In this work, we introduce a novel notion of human risk minimization (Section 3) by bringing ideas from cumulative prospect theory (Section 2) into supervised learning. We explore various salient characteristics of our objective, such as diminishing sensitivity, decision-making based on higher-order moments, and an information-content or “surprisal” viewpoint of human risk (Section 4). We also study the implications of minimizing our objective in the context of subpopulation performance (Section 5). In particular, our empirical results illustrate that human risk minimization inherently avoids drastic losses across all subgroups.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f2 Background\n\n2.1 Cumulative Prospect Theory\n\nAs a seminal work in behavioral economics, cumulative prospect theory (CPT) [28] provides a framework to emulate human decision-making under uncertainty. 
In particular, CPT points out that humans overweight extreme events that occur with low probability, rather than treating all events equally, which is the assumption of expected utility theory (EUT). As an alternative to EUT, CPT has three important components [25]:\n\n• Outcomes are considered as gains or losses relative to a reference point;\n• Value functions are concave for gains, convex for losses, and flatter for gains than for losses;\n• An inverse S-shaped (first concave, then convex) probability weighting function (Figure 1 (a)) is used to transform the cumulative distribution function so that small probabilities are inflated and large probabilities are deflated [30].\n\nCurrent machine learning follows the EUT framework in the sense that expected losses are minimized. However, as CPT has pointed out, a human evaluates risk differently. For example, given two models {M1, M2} such that M1 has zero loss with probability .95 and loses 100 with probability .05, while M2 loses 5.01 all the time, EUT will choose M1. The reasoning is that the expected loss of M1 is 5, which is smaller than the expected loss of M2, which is 5.01. However, from the CPT perspective, M2 will be chosen, because CPT inflates the probability .05 and M1 ends up having a larger risk. In this case, because of this innate human probability weighting, we end up choosing a model that avoids drastic losses instead of the one with a better average performance.\n\nThe inverse S-shaped CPT probability weighting function captures the fact that humans over-weight extreme events with low probability while under-weighting “average” events that are more probable but less extreme. Many parametric forms of the probability weighting function have been proposed [24, 25, 28, 30]. To start with, we formally define a class of weighting functions called WCPT.\n\nDefinition 1. Let w : [0, 1] → [0, 1] be a differentiable function. 
Then, w ∈ WCPT if and only if\n\n1. w(0) = 0 and w(1) = 1;\n2. there exists a ∈ (0, 1) such that w(a) = a;\n3. w′(x) is monotonically decreasing on x ∈ [0, a) and w′(x) is monotonically increasing on x ∈ (a, 1].\n\nTraditional CPT probability weighting functions fall into this class, including the original weighting function (for losses) w(x) = x^{.69} / (x^{.69} + (1 − x)^{.69})^{1/.69} [28]. For a real-valued continuous random variable X with cumulative distribution function F(x) and a CPT probability weighting function w ∈ WCPT, the CPT subjective utility is defined as [25, 28]:\n\nUCPT(X) = ∫_{−∞}^{+∞} v(x) dg(F(x))    (1)\n\nwhere (1) v : R → R is a value function; (2) g(F(x)) = w(F(x)) when x < 0 and g(F(x)) = −w(1 − F(x)) when x ≥ 0.\n\nRank-dependent Utility. As pointed out in [28], CPT subjective utility is a rank-dependent utility, since the decision weight on x depends on the “rank” of x, which is given by F(x). When F(x) is weighted by w ∈ WCPT [7] and v(x) = x, the CPT-weighted rank-dependent utility is:\n\n2\n\n\fUCPT-RD(X) = ∫_{−∞}^{+∞} x dw(F(x)).    (2)\n\nWe focus on studying the effect of using the CPT-weighted cumulative distribution function w(F(x)) on training an ML model. Hence, analyzing the effect of using a reference point and a value function v is out of the scope of this paper. As one may have noticed, if w(F(x)) = F(x), then UCPT-RD(X) = E[X].\n\nTo have a finite CPT subjective utility [25] for a real-valued continuous random variable X with w ∈ WCPT, it is sufficient to ensure that w is strictly increasing and continuously differentiable on [0, 1], i.e., that w′(0) and w′(1) are finite. 
As proposed by [25], the simplest polynomial that satisfies the above conditions is\n\nwPOLY(F(x)) = (3 − 3b)/(a² − a + 1) · [F(x)³ − (a + 1)F(x)² + aF(x)] + F(x)    (3)\n\nwhere a ∈ (0, 1) is the fixed point, i.e., wPOLY(a) = a, and b ∈ (0, 1) controls the curvature of wPOLY(·) ∈ WCPT. As b approaches 1, wPOLY(F(x)) converges to the linear function w(F(x)) = F(x). One could interpret b as controlling the sensitivity of the probability weighting function to a unit difference in probability changes [13]. We use wPOLY( · ; a, b) to denote the polynomial-form CPT probability weighting function with fixed point a and curvature b.\n\nFigure 1: (a) The inverse S-shaped probability weighting function wPOLY(F(x)) is steepest near the endpoints 0 and 1. The parametric form of wPOLY( · ; a, b) is shown in Equation 3. (b) The U-shaped CPT probability weighting function derivative w′POLY(F(x)) up-weights the tails of the original distribution.\n\nProposition 1. Given any cumulative distribution function F(x), if a non-decreasing continuous function w : [0, 1] → [0, 1] satisfies w(0) = 0 and w(1) = 1, then w(F(x)) is a cumulative distribution function of some random variable.\n\nFor any w ∈ WCPT that is non-decreasing (e.g., wPOLY), and for a real-valued continuous random variable X with cumulative distribution function F(x) and density f(x), one can think of f(x)w′(F(x)) as a CPT-weighted density, with w(F(x)) the corresponding CPT-weighted cumulative distribution function. UCPT-RD is the expectation of the random variable that has this CPT-weighted cumulative distribution function. We denote the set of non-decreasing functions in WCPT by W̄CPT.\n\n2.2 Empirical Risk Minimization\n\nThe canonical way of learning an ML model is through empirical risk minimization (ERM). Given n i.i.d. samples Z1, . . .
, Zn ∈ Z, and a loss function ℓ : Θ × Z → R, the population risk (expected loss) for model θ is defined to be:\n\nR(θ) = E[ℓ(θ; Z)].\n\nERM minimizes the empirical risk (1/n) ∑_{i=1}^n ℓ(θ; Zi). However, expectation is only one of many risk measures. For example, value-at-risk and conditional value-at-risk [26] are popular risk measures for evaluating the risks of portfolios of financial instruments. CPT defines another way of measuring risk, one which aligns with humans' preferences. We want to study whether minimizing a human-aligned risk will give us ML models that have properties beyond a low population risk.\n\n3\n\n\f3 Human Risk Minimization\n\nDefinition 2. Given a real-valued random variable Z ∈ Z, a loss function ℓ : Θ × Z → R and a CPT probability weighting function w ∈ WCPT, the human risk is defined to be\n\nRH(θ; w) := E[ℓ(θ; Z) w′(F(ℓ(θ; Z)))],    (4)\n\nwhere F(ℓ(θ; Z)) is the cumulative distribution function of the loss.\n\nComparing Equation 4 with the CPT-weighted rank-dependent utility in Equation 2, we see that RH(θ) = UCPT-RD(ℓ(θ; Z)). Given n i.i.d. samples Z1, . . . , Zn ∈ Z, we define empirical human risk minimization (EHRM) as\n\nθ* = arg min_θ (1/n) ∑_{i=1}^n ℓ(θ; Zi) w′(Fn(ℓ(θ; Zi))),    (5)\n\nwhere Fn is the empirical CDF of the loss.\n\nOptimization. When ℓ is differentiable, we use the following iterative update rule to minimize the empirical human risk:\n\nθ_{t+1} = θ_t − (η_t/n) ∑_{i=1}^n w_i^t ∇θ ℓ(θ_t; Zi) for all t ∈ {0, . . . , T − 1},\n\nwhere w_i^t = w′(Fn(ℓ(θ_t; Zi))) and η_t is the learning rate. 
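To make the update rule concrete, here is a minimal NumPy sketch of this iterative reweighting scheme for the squared loss; the function names, the full-batch gradient-descent loop, and the choice of wPOLY( · ; a, b) with a = .5, b = .3 as the weighting function are illustrative, not the authors' released implementation:

```python
import numpy as np

def w_poly_prime(p, a=0.5, b=0.3):
    # Derivative of the polynomial CPT weighting function in Equation (3):
    # w'(p) = (3 - 3b)/(a^2 - a + 1) * (3p^2 - 2(a+1)p + a) + 1
    c = (3 - 3 * b) / (a ** 2 - a + 1)
    return c * (3 * p ** 2 - 2 * (a + 1) * p + a) + 1.0

def ehrm_gradient_descent(X, y, steps=500, lr=0.1, a=0.5, b=0.3):
    # Minimize the empirical human risk for the squared loss 0.5*(x'theta - y)^2.
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        residuals = X @ theta - y
        losses = 0.5 * residuals ** 2
        # Empirical CDF via within-sample ranks: F_n(l_i) = rank(l_i)/n.
        ranks = np.argsort(np.argsort(losses)) + 1
        weights = w_poly_prime(ranks / n, a, b)
        # Weighted-gradient step: theta <- theta - (lr/n) * sum_i w_i * grad_i.
        theta -= lr * np.mean((weights * residuals)[:, None] * X, axis=0)
    return theta
```

Samples whose losses sit in either tail of the empirical loss distribution receive weights above 1, while samples near the median loss are down-weighted.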
Note that this heuristic approach relies crucially on the assumption that minor perturbations in θ do not change w′(Fn(ℓ(θ; Z))) drastically. We empirically show that such a heuristic approach performs quite well in practice (see Appendix B). Deriving provably optimal optimization algorithms for EHRM is an interesting open problem.\n\nRemarks. In general, there are two levels of decision-making in supervised learning: model selection when training a model, and instance prediction when using a model. These two kinds of decisions are closely related. In traditional ERM, a model is selected over others when its per-instance predictions are more accurate on average. We explore the consequences of EHRM in both settings: Sections 2.1 and 4.2 discuss the model-selection consequences of EHRM, while Section 5 explores its consequences for per-instance predictions. Unlike the traditional settings in which CPT is considered, supervised learning only evaluates losses. Humans tend to be risk-averse when facing possibilities of large loss; this property distinguishes EHRM from ERM. When training machine learning models, surrogate losses are used (e.g., the hinge loss is used in place of the 0/1 loss). Most of the time, such surrogate losses are upper bounds for the original losses. 
In such cases, the risk-aversion towards possible drastic losses carries through when the surrogate loss is used instead of the true loss.\n\nTo further understand how adding the weights w′(F(ℓ(θ; Z))) to the expected loss influences the learned model, we provide a psychological interpretation (Section 4.1), an analytical illustration of how the skewness of the loss distribution may influence the choices of people with different risk preferences (Section 4.2), and an information-weighting viewpoint on the probability weighting function (Section 4.3).\n\n4 Characteristic Properties of Human Risk Minimization\n\nWe next review some characteristic properties of human risk minimization, contrasting it with the standard machine learning objective of expected loss minimization. To simplify notation, we will denote the CDF F(ℓ(θ; Z)) of the loss random variable L as F(ℓ) in the subsequent analysis.\n\n4.1 Diminishing Sensitivity to Probability Changes\n\nRecall from Equation (3) that we work with the following polynomial form of the CPT probability weighting function:\n\nwPOLY(F(x)) = (3 − 3b)/(a² − a + 1) · [F(x)³ − (a + 1)F(x)² + aF(x)] + F(x),\n\nwhere a ∈ (0, 1) is the fixed point of the function, and b ∈ (0, 1) controls the curvature.\n\n4\n\n\fFor any event E with probability P(E) ∈ (0, 1), given a probability change Δ, we define\n\ng(P(E)) = [wPOLY(P(E) + Δ; a, b) − wPOLY(P(E); a, b)]/Δ,\n\nwhich is the ratio between the human-perceived probability change and the original probability change. Intuitively, g(P(E)) represents humans' sensitivity to probability changes.\n\nLemma 1. 
For any event E with probability P(E) ∈ (0, 1), lim_{Δ→0} g(P(E)) is a monotonically increasing function of |P(E) − (a + 1)/3|.\n\nThe above result can be seen as quantitative evidence of how the CPT probability weighting function captures humans' diminishing sensitivity, which has long been studied in behavioral economics [28]. Humans are sensitive to probability changes for extreme events, and this sensitivity diminishes as the events become less extreme. When using the CPT probability weighting function to weight F(ℓ), the event we are considering is E = I{L ≤ ℓ}, i.e., whether the loss L is less than or equal to a threshold ℓ. In this case, P(E) = F(ℓ). Diminishing sensitivity states that, for a given amount of probability change, the human-perceived probability change depends on where the probability change happens. The perceived change diminishes as the distance between where it happens and the boundary (impossibility, F(ℓ) = 0, and certainty, F(ℓ) = 1) grows. Probability changes that happen close to the boundary are up-weighted, while changes in between are down-weighted. As shown in Lemma 1, for wPOLY, the sensitivity to a probability change Δ diminishes as P(E) moves away from 0 (impossibility) and 1 (certainty).\n\n4.2 Responsiveness to Skewness of the Loss Distribution\n\nSince the inverse S-shaped probability weighting function exaggerates the small probabilities of both good and bad extreme outcomes, its overall impact on evaluating a model intuitively depends on higher-order moments of the loss distribution. We highlight this phenomenon by considering a family of Bernoulli distributions with the same mean and variance. In particular, consider a family of models {Mθ | θ ∈ [0, 1]} whose losses {ℓ(θ) | θ ∈ [0, 1]} are parameterized by θ. 
For all θ ∈ [0, 1], suppose that ℓ(θ) follows a Bernoulli distribution [15, 20]:\n\nP[ℓ(θ) = 1 − ((1 − θ)/θ)^{1/2}] = θ,  P[ℓ(θ) = 1 + (θ/(1 − θ))^{1/2}] = 1 − θ.\n\nIn the above setup, the mean and variance of the losses are independent of θ. In particular, we have that\n\nE[ℓ(θ)] = 1 and Var(ℓ(θ)) = 1 for all θ ∈ [0, 1].\n\nHence, in this setting, empirical risk minimization will treat all the models equally. However, the third central moment (skewness) of ℓ(θ) is given by\n\nSkewness(ℓ(θ)) = (2θ − 1)/√(θ(1 − θ)).\n\nObserve that Skewness(ℓ(θ)) is a monotonically increasing function of θ, with Skewness(ℓ(θ)) = 0 for θ = 1/2. Hence, θ < 0.5 corresponds to models with negatively skewed loss distributions, while θ > 0.5 corresponds to models with positively skewed loss distributions. Then, in this setting, we have the following result:\n\nLemma 2. Consider the human risk objective in Equation (4) instantiated with wPOLY having fixed point a = 1/2. Then, we have the following:\n\n1. For θ < 0.5, RH(θ; wPOLY( · ; a, b)) is a monotonically increasing function of b.\n2. For θ > 0.5, RH(θ; wPOLY( · ; a, b)) is a monotonically decreasing function of b.\n\nRemarks. The above result shows that for models with negatively skewed loss distributions, the expected loss is higher than any human risk, while the opposite is true for positively skewed loss distributions. 
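As a sanity check (ours, not the paper's), the mean, variance, and skewness claims for this two-point loss family can be verified exactly:

```python
import math

def loss_distribution(theta):
    # Two-point loss distribution of Section 4.2: (value, probability) pairs.
    v1 = 1 - math.sqrt((1 - theta) / theta)   # taken with probability theta
    v2 = 1 + math.sqrt(theta / (1 - theta))   # taken with probability 1 - theta
    return [(v1, theta), (v2, 1 - theta)]

def central_moment(dist, k):
    mean = sum(v * p for v, p in dist)
    return sum((v - mean) ** k * p for v, p in dist)

for theta in (0.2, 0.5, 0.8):
    dist = loss_distribution(theta)
    mean = sum(v * p for v, p in dist)
    # Mean and variance equal 1 for every theta; the third central moment
    # equals (2*theta - 1) / sqrt(theta * (1 - theta)).
    print(f"theta={theta}: mean={mean:.6f}, "
          f"var={central_moment(dist, 2):.6f}, skew={central_moment(dist, 3):.6f}")
```

Only the skewness varies with θ, which is exactly the asymmetry that the CPT weighting reacts to.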
While empirical risk minimization will treat all the models equally, human risk minimization will distinguish the models through higher-order moments of the loss distribution.\n\n5\n\n\f4.3 Weighting by Information Content\n\nThe information content or “surprisal” of an event is the amount of information gained when the event is observed, and is defined as follows.\n\nDefinition 3. The information content of an event E with probability P(E) is defined as\n\nI(E) := − ln[P(E)],\n\nwhere e is used as the base of the logarithm.\n\nNote that the above definition captures the intuition that the observation of a rare event provides more information than that of a common one. In our setting, the rare event corresponds to the loss L taking extreme values.\n\nNext, we construct a special weighting function wIT(·) using the information content of the events E1 = I{L ≤ ℓ} and E2 = I{L > ℓ}. Observe that the information content of the events E1 and E2 is given by − ln F(ℓ) and − ln(1 − F(ℓ)), respectively. Moreover, it is easy to see that as ℓ gets smaller, the information content of the left-tail event E1 increases; and as ℓ gets larger, the information content of the right-tail event E2 increases.\n\nWe can then use the information content of E1 and E2 to weight the density of L, and define the corresponding weighting function. In particular, the information-weighted density is defined to be:\n\nw′IT(F(ℓ)) f(ℓ) = (1/2)(I(E1) + I(E2)) f(ℓ) = −(1/2) f(ℓ) ln(F(ℓ) · (1 − F(ℓ))),\n\nand the corresponding information content weighting function is given by:\n\nwIT(F(ℓ)) = ∫_0^{F(ℓ)} w′IT(u) du = (1/2)[(1 − F(ℓ)) · ln(1 − F(ℓ)) − F(ℓ) · ln(F(ℓ))] + F(ℓ).\n\nLemma 3. 
The information content weighting function wIT belongs to WCPT.\n\nInterestingly, weighting a distribution by wIT is also of interest in information theory, where the result is known as the two-sided information-weighted distribution [6]. Moreover, it is easy to see that wPOLY( · ; a, b) with fixed point a = 1/2 and curvature b = ln 2 is approximately equal to the third-order Taylor approximation of wIT. To the best of our knowledge, this is the first time that the information weighting function and CPT probability weighting functions have been connected. Uncertainty-aversion in human preferences has also been studied in behavioral economics [3, 4]. In the context of human risk minimization, using wIT, we can define the information-weighted human risk to be\n\nRH(θ; wIT) = E[ℓ(θ; Z) w′IT(F(ℓ(θ; Z)))].\n\nAs studied in [6], the information-weighted distribution wIT(F(ℓ)) will be heavy-tailed when the CDF satisfies F(ℓ) = e^{−κ|ℓ|} for some κ > 0. Such heavy-tailedness of the loss distribution may make the human risk hard to estimate. We believe that deriving statistically and computationally optimal procedures for minimizing RH(θ; wIT) is an interesting direction for future work.\n\n5 Implications for Performance over Subgroups\n\nMachine learning models are being increasingly deployed to automate a variety of day-to-day tasks. Employers use such models to select job applicants, and such models also provide credit scores and predict insurance premiums. With such high stakes, ensuring that learned models are non-discriminatory, or fair with respect to sensitive features such as gender and race, is of utmost importance [10, 18, 19, 22]. In this section, we explore the implications of HRM for subgroup performance. In particular, since HRM up-weights possible extreme events, we expect it to avoid drastic losses for all subpopulations. We test this hypothesis on both synthetic and real-world datasets and use 
We test this hypothesis on both synthetic and real-world datasets and use\nwPOLY( \u00b7 ; a, b) speci\ufb01ed in Equation 3. We have chosen a = .5 so that for all b \u2208 (0, 1), w\ufffdPOLY(F (\ufffd))\nis symmetric about the line F (\ufffd) = a.\n\n6\n\n\f(a) Majority Group Performance\n\n(b) Minority Group Performance\n\nFigure 2: Majority and minority performance of ERM, EHRM and CVaR on the synthetic dataset.\nNote that when b = 1, the CPT probability weighting function is the identity function. Hence, EHRM\nis the same as ERM. 2000 training and 20000 testing data points are used in the experiment. Solid\nlines and shaded area represent the means and one standard derivations of the risks.\n\n5.1 Synthetic Experiment\n\nSetup.\nIn this experiment, we create a synthetic regression task to test the performance of EHRM\non the minority subgroup. We follow the setup of [9] and draw our covariates (features) from an\nisotropic Gaussian X \u223c N (0, I5) in R5. The noise distribution is \ufb01xed as \ufffd \u223c N (0, .01). We draw\nour response variable Y as,\n\nY =\ufffdX\ufffd\u03b8\u2217 + \ufffd\n\nX\ufffd\u03b8\u2217 + X (1) + \ufffd\n\nif X (1) \u2264 1.645\notherwise\n\nwhere \u03b8\u2217 = [1, 1, 1, 1, 1] and X (1)\n\nis the \ufb01rst coordinate of X. Observe that since\n\nP\ufffdX (1) > 1.645\ufffd = .05, {X | X (1) > 1.645} represents our minority subgroup. We \ufb01x the\n\n2 (y \u2212 xT \u03b8)2 as our loss function.\n\nsquared error \ufffd(\u03b8; (x, y)) = 1\n\nResults. Figure 2 plots the risk of minority and majority groups for EHRM and ERM. The empirical\nrisk minimizer is denoted by OLS, the solution of this ordinary least square problem. We see that for\ndifferent values of b < 1, EHRM has a lower minority risk than ERM. Moreover, as b approaches 1,\nEHRM becomes more similar to ERM. 
This validates our hypothesis: because the inverse S-shaped probability weighting function inflates the small probabilities of extreme losses, drastic losses of the minority group are exaggerated, and human risk minimization trades a low population risk for better minority performance. The optimization performance of EHRM is shown in Figure 4 (Appendix B). In addition to comparing with ERM, we have also compared EHRM with conditional value-at-risk (CVaR), a risk measure that has been used to measure worst-case subgroup performance [8, 29]. CVaR_α(ℓ(θ; (x, y))) is the expectation of the worst α proportion of the losses. As shown in Figure 2, when α is small, CVaR has a lower minority risk than EHRM and ERM, at the cost of a higher majority risk. As α approaches 1, the minority risk increases drastically.\n\n5.2 Recidivism Prediction: Similar Subpopulation Performance\n\nSetup. We follow the experimental setup in [8]. Using the fairML toolkit version of the COMPAS recidivism dataset [1], we study the performance of EHRM and ERM on different demographic subgroups. With a 90%/10% train-test split, ERM and EHRM are used to train a logistic regression model with L2-regularization. To study subgroup performance, we report the misclassification rates of different demographic groups on the test set. In particular, out of the 10 binary features in the dataset, we have chosen the 7 that have more than 10 samples to group the population. For each chosen (binary) attribute, the dataset can be divided into two subgroups. For EHRM, we have chosen b to be .3 so that the EHRM probability weighting function is close to the median estimate of the CPT probability weighting function for high-rank losses in [28]. 
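To give a feel for what this parameter choice does, the following sketch (illustrative; not part of the paper's experiments) prints the per-example weights w′POLY(Fn(ℓi); .5, .3) that EHRM assigns within a small batch of losses:

```python
import numpy as np

def cpt_weights(losses, a=0.5, b=0.3):
    # Per-example EHRM weights: w_i = w'_POLY(F_n(l_i); a, b), where the
    # empirical CDF F_n assigns rank(l_i)/n to the i-th smallest loss.
    losses = np.asarray(losses, dtype=float)
    n = losses.size
    p = (np.argsort(np.argsort(losses)) + 1) / n
    c = (3 - 3 * b) / (a ** 2 - a + 1)
    return c * (3 * p ** 2 - 2 * (a + 1) * p + a) + 1.0

losses = np.array([1.0, 2.0, 3.0, 4.0])
w = cpt_weights(losses)
print(w)                    # extreme losses get the largest weights
print(np.mean(w * losses))  # CPT-weighted batch loss vs. the plain mean 2.5
```

With a = .5 and b = .3, the largest loss in the batch carries roughly eight times the weight of a median loss, which is what drives the risk-aversion studied here.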
In practice, a and b are application-dependent and user-dependent.\n\n7\n\n\fResults. In Figure 3, for each attribute, we report the maximum misclassification rate of the two subgroups at test time. Compared to ERM, EHRM has a higher misclassification rate but a more similar worst-case performance across different subgroups. This observation aligns with our hypothesis that EHRM avoids extremely bad performance for all demographic groups, and hence sacrifices average performance for similar subpopulation performance.\n\nFigure 3: Test misclassification rates for the recidivism prediction task. Each box represents the worst performance among the subpopulations grouped by the attribute listed below it. EHRM has a more similar performance across subgroups.\n\n5.3 Gender Classification based on Facial Images: Fairness Metrics Comparison\n\nSetup. To study EHRM performance on standard fairness metrics, we use the AI Fairness 360 toolkit [2]. In particular, we use the UTKFace dataset [31] to train a neural network^1 for predicting gender based on facial images (male = 0, female = 1). As suggested by [2], we use race as an indicator to divide the population into two groups, G1 (white) and G2 (other race). The fairness metrics we use include statistical parity difference, disparate impact, equal opportunity difference, average odds difference, Theil index and false negative rate difference.^2 We train the model with ERM and EHRM over 10 random seeds. For EHRM, b is chosen to be .3 for the same reason mentioned in Section 5.2. To minimize the empirical human risk, we use a variant of mini-batch stochastic gradient descent: at each step t, θ_{t+1} = θ_t − (η_t/B) ∑_{i=1}^B w_i^t ∇θ ℓ(θ_t; Zi), where w_i^t = w′POLY(Fn(ℓ(θ_t; Zi))), Fn(·) is the empirical CDF of the mini-batch losses, B is the mini-batch size, and η_t is the learning rate. As shown in Figure 5 (Appendix B), the empirical human risk of the entire training dataset decreases as training proceeds. We have also compared EHRM with a data pre-processing algorithm named reweighing [17], which re-weights the samples so that the statistical dependence between the protected attribute and the label is mitigated.\n\nResults. Table 1 shows the mean and standard deviation of the test-time performance of EHRM, and of ERM with and without reweighing pre-processing. ERM performs the best in terms of accuracy at test time. However, reweighing with ERM, and EHRM, do better in terms of the fairness metrics. This empirical result suggests that the innate human risk-aversion towards the possibility of extreme losses promotes similar performance across different subgroups.\n\nTable 1: Mean and standard deviation of the accuracy and fairness metrics of models learned by EHRM with wPOLY( · ; .5, .3), and by ERM with and without reweighing pre-processing. For each metric, the best performing algorithm is highlighted.\n\nMetric | EHRM(.5, .3) | Reweighing [17] | ERM\nAccuracy | .8751 ±.0052 | .8767 ±.0067 | .8767 ±.0060\nStat. Parity Diff. | -.0825 ±.0220 | -.0875 ±.0212 | -.0881 ±.0208\nDisparate Impact | .8475 ±.0411 | .8396 ±.0390 | .8368 ±.0386\nEqual Opp. Diff. | -.0440 ±.0261 | -.0518 ±.0253 | -.0502 ±.0263\nAvg. Odds Diff. | -.0116 ±.0202 | -.0165 ±.0177 | -.0173 ±.0188\nTheil Index | .0859 ±.0058 | .0824 ±.0038 | .0855 ±.0071\nFNR Diff. | .0440 ±.0261 | .0518 ±.0253 | .0502 ±.0263\n\n^1 Appendix D contains details of the model configuration.\n^2 Appendix C contains definitions of these metrics in terms of G1 and G2.\n\n8\n\n\f6 Conclusion\n\nIn this work, we have studied alternatives to empirical risk minimization, and in particular proposed alternate formulations which are better aligned with human risk measures. 
We have analyzed several characteristics of human risk minimization, such as diminishing sensitivity, model selection based on higher-order moments, and information-weighted loss distributions. Further, our empirical analysis has shown that such risk measures have implications for fairness, and in particular trade average performance for similar subgroup performance. Our empirical analysis raises several interesting future directions. Fairness is only one of the desiderata that the ML community has come to care about; we would like to study which other desiderata HRM promotes. Meanwhile, many risk measures, such as conditional value-at-risk [26], can be expressed in a dual form; however, it is not immediately clear whether HRM has an equivalent formulation.\n\nAcknowledgement. We thank the reviewers for providing thoughtful and constructive feedback for the paper. We thank Hongseok Namkoong for providing the method to optimize conditional value-at-risk. We acknowledge the support of ONR via N000141812861.\n\nReferences\n\n[1] Julius A Adebayo et al. FairML: ToolBox for diagnosing bias in predictive modeling. PhD thesis, Massachusetts Institute of Technology, 2016.\n\n[2] Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, and Yunfeng Zhang. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias, October 2018.\n\n[3] Colin Camerer and Martin Weber. Recent developments in modeling preferences: Uncertainty and ambiguity. Journal of Risk and Uncertainty, 5(4):325–370, 1992.\n\n[4] Colin F Camerer, George Loewenstein, and Matthew Rabin. 
Advances in behavioral economics. Princeton University Press, 2004.

[5] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.

[6] Hélio M de Oliveira and Renato J Cintra. A new information theoretical concept: Information-weighted heavy-tailed distributions. arXiv preprint arXiv:1601.06412, 2016.

[7] Enrico Diecidue and Peter P Wakker. On the intuition of rank-dependent utility. Journal of Risk and Uncertainty, 23(3):281–298, 2001.

[8] John C Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses against mixture covariate shifts.

[9] Rick Durrett. Probability: Theory and examples, volume 49. Cambridge University Press, 2019.

[10] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. ACM, 2012.

[11] Jerzy A Filar, Dmitry Krass, and Keith W Ross. Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2–10, 1995.

[12] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

[13] Richard Gonzalez and George Wu. On the shape of the probability weighting function. Cognitive Psychology, 38(1):129–166, 1999.

[14] Aditya Gopalan, LA Prashanth, Michael Fu, and Steve Marcus. Weighted bandits or: How bandits learn distorted values that are not expected. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[15] Xue Dong He, Roy Kouwenberg, and Xun Yu Zhou. Inverse S-shaped probability weighting and its impact on investment.
Mathematical Control & Related Fields, Sep 2018.

[16] Ronald A Howard and James E Matheson. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.

[17] Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, 2012.

[18] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pages 643–650. IEEE, 2011.

[19] Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4066–4076, 2017.

[20] Cheekiat Low, Dessislava Pachamanova, and Melvyn Sim. Skewness-aware asset allocation: A new theoretical framework and empirical evidence. Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics, 22(2):379–410, 2012.

[21] Takayuki Osogami. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems, pages 233–241, 2012.

[22] Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 560–568. ACM, 2008.

[23] LA Prashanth, Cheng Jie, Michael Fu, Steve Marcus, and Csaba Szepesvári. Cumulative prospect theory meets reinforcement learning: Prediction and control. In International Conference on Machine Learning, pages 1406–1415, 2016.

[24] Drazen Prelec et al. The probability weighting function. Econometrica, 66:497–528, 1998.

[25] Marc Oliver Rieger and Mei Wang. Cumulative prospect theory and the St. Petersburg paradox. Economic Theory, 28(3):665–679, 2006.

[26] R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.

[27] Suvrit Sra, Sebastian Nowozin, and Stephen J Wright. Optimization for machine learning. MIT Press, 2012.

[28] Amos Tversky and Daniel Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297–323, 1992.

[29] Robert C Williamson and Aditya Krishna Menon. Fairness risk measures. arXiv preprint arXiv:1901.08665, 2019.

[30] George Wu and Richard Gonzalez. Curvature of the probability weighting function. Management Science, 42(12):1676–1690, 1996.

[31] Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.