{"title": "Online Classification with Specificity Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 190, "page_last": 198, "abstract": "We consider the online binary classification problem, where we are given m classifiers. At each stage, the classifiers map the input to the probability that the input belongs to the positive class. An online classification meta-algorithm is an algorithm that combines the outputs of the classifiers in order to attain a certain goal, without having prior knowledge on the form and statistics of the input, and without prior knowledge on the performance of the given classifiers. In this paper, we use sensitivity and specificity as the performance metrics of the meta-algorithm. In particular, our goal is to design an algorithm which satisfies the following two properties (asymptotically): (i) its average false positive rate (fp-rate) is under some given threshold, and (ii) its average true positive rate (tp-rate) is not worse than the tp-rate of the best convex combination of the m given classifiers that satisfies fp-rate constraint, in hindsight. We show that this problem is in fact a special case of the regret minimization problem with constraints, and therefore the above goal is not attainable. Hence, we pose a relaxed goal and propose a corresponding practical online learning meta-algorithm that attains it. In the case of two classifiers, we show that this algorithm takes a very simple form. 
To our best knowledge, this is the first algorithm that addresses the problem of the average tp-rate maximization under average fp-rate constraints in the online setting.", "full_text": "Online Classification with Specificity Constraints

Andrey Bernstein
Department of Electrical Engineering
Technion - Israel Institute of Technology
Haifa, 32000, Israel
andreyb@tx.technion.ac.il

Shie Mannor
Department of Electrical Engineering
Technion - Israel Institute of Technology
Haifa, 32000, Israel
shie@ee.technion.ac.il

Nahum Shimkin
Department of Electrical Engineering
Technion - Israel Institute of Technology
Haifa, 32000, Israel
shimkin@ee.technion.ac.il

Abstract

We consider the online binary classification problem, where we are given m classifiers. At each stage, the classifiers map the input to the probability that the input belongs to the positive class. An online classification meta-algorithm is an algorithm that combines the outputs of the classifiers in order to attain a certain goal, without having prior knowledge on the form and statistics of the input, and without prior knowledge on the performance of the given classifiers. In this paper, we use sensitivity and specificity as the performance metrics of the meta-algorithm. In particular, our goal is to design an algorithm that satisfies the following two properties (asymptotically): (i) its average false positive rate (fp-rate) is under some given threshold; and (ii) its average true positive rate (tp-rate) is not worse than the tp-rate of the best convex combination of the m given classifiers that satisfies the fp-rate constraint, in hindsight. We show that this problem is in fact a special case of the regret minimization problem with constraints, and therefore the above goal is not attainable.
Hence, we pose a relaxed goal and propose a corresponding practical online learning meta-algorithm that attains it. In the case of two classifiers, we show that this algorithm takes a very simple form. To our best knowledge, this is the first algorithm that addresses the problem of the average tp-rate maximization under average fp-rate constraints in the online setting.

1 Introduction

Consider the binary classification problem, where each input is classified into +1 or -1. A classifier is an algorithm which, for every input, produces such a classification. In general, classifiers may produce the probability of the input to belong to class 1. There are several metrics for the performance of the classifier in the offline setting, where a training set is given in advance. These include error (or mistake) count, true positive rate, and false positive rate; see [6] for a discussion. In particular, the true positive rate (tp-rate) is given by the fraction of the number of positive instances correctly classified out of the total number of the positive instances, while the false positive rate (fp-rate) is given by the fraction of the number of negative instances incorrectly classified out of the total number of the negative instances. A receiver operating characteristics (ROC) graph then depicts different classifiers using their tp-rate on the Y axis and their fp-rate on the X axis (see [6]). We note that there are alternative names for these metrics in the literature. In particular, the tp-rate is also called sensitivity, while one minus the fp-rate is usually called specificity. In what follows, we prefer to use the terms tp-rate and fp-rate, as we think that they are self-explanatory.

In this paper we focus on the online classification problem, where no training set is given in advance. We are given m classifiers, which at each stage n = 1, 2, ...
map the input instance to the probability that the instance belongs to the positive class. An online classification meta-algorithm (or a selection algorithm) is an algorithm that combines the outputs of the given classifiers in order to attain a certain goal, without prior knowledge on the form and statistics of the input, and without prior knowledge on the performance of the given classifiers. The assumption is that the observed sequence of classification probabilities and labels comes from some unknown source and, thus, can be arbitrary. Therefore, it is convenient to formulate the online classification problem as a repeated game between an agent and some abstract opponent that stands for the collective behavior of the classifiers and the realized labels. We note that, in this formulation, we can identify the agent with a corresponding online classification meta-algorithm.

There is a rich literature that deals with the online classification problem in the competitive-ratio framework, such as [5, 1]. In these works, the performance guarantees are usually expressed in terms of the mistake bound of the algorithm. In this paper, we take a different approach. Our performance metrics will be the average tp-rate and fp-rate of the meta-algorithm, while the performance guarantees will be expressed in the regret minimization framework. In a seminal paper, Hannan [8] introduced the optimal reward-in-hindsight $r_n^*$ with respect to the empirical distribution of the opponent's actions as a performance goal of an online algorithm. In our case, $r_n^*$ is in fact the maximal tp-rate the agent could get at time n by knowing the classification probabilities and actual labels beforehand, using the best convex combination of the classifiers. The regret is then defined as the difference between $r_n^*$ and the actual average tp-rate obtained by the agent. Hannan showed in [8] that there exist online algorithms whose regret converges to zero (or below) as time progresses, regardless of the opponent's actions, at a $1/\sqrt{n}$ rate. Such algorithms are often called no-regret, Hannan-consistent, or universally consistent algorithms. Additional no-regret algorithms were proposed in the literature over the years, such as Blackwell's approachability-based algorithm [2] and weighted majority schemes [10, 7] (see [4] for an overview of these and other related algorithms). These algorithms can be directly applied to the problem of online classification when the goal is only to obtain no regret with respect to the optimal tp-rate in hindsight.

However, in addition to tp-rate maximization, some performance guarantees in terms of the fp-rate are usually required. In particular, it is reasonable to require (following the Neyman-Pearson approach) that, in the long term, the average fp-rate of the agent be below some given threshold $0 < \gamma < 1$. In this case the tp-rate can be considered as the average reward obtained by the agent, and the fp-rate as the average cost. This is in fact a special case of the regret minimization problem with constraints, whose study was initiated by Mannor et al. in [11]. They defined the constrained reward-in-hindsight with respect to the empirical distribution of the opponent's actions as a performance goal of an online algorithm. This quantity is the maximal average reward the agent could get in hindsight, had he known the opponent's actions beforehand, by using any fixed (mixed) action while satisfying the average cost constraints.
The desired online algorithm then has to satisfy two requirements: (i) it should have a vanishing regret (with respect to the constrained reward-in-hindsight); and (ii) it should asymptotically satisfy the average cost constraints. It is shown in [11] that such algorithms do not exist in general. The positive result is that a relaxed goal, which is defined in terms of the convex hull of the constrained reward-in-hindsight over an appropriate space, is attainable. The two no-regret algorithms proposed in [11] explicitly involve either the convex hull or a calibrated forecast of the opponent's actions. Both of these algorithms may not be computationally feasible, since there are no efficient (polynomial time) procedures for the computation of both the convex hull and a calibrated forecast.

In this paper, we take an alternative approach to that of [11]. Instead of examining the constrained tp-rate in hindsight (or its convex hull), our starting point is the "standard" regret with respect to the optimal (unconstrained) tp-rate, and we consider a certain relaxation thereof. In particular, we define a simple relaxed form of the optimal tp-rate in hindsight, by subtracting a positive constant from the latter. We then find the minimal constant needed in order to have a vanishing regret (with respect to this relaxed goal) while asymptotically satisfying the average fp-rate constraint. The motivation for this approach is as follows. We know that if the constraints are always satisfied, then the optimal tp-rate in hindsight is attainable (using relatively simple no-regret algorithms). On the other hand, when the constraints need to be actively satisfied, we should "pay" some penalty in terms of the attainability of the tp-rate in hindsight. In our case, we express this penalty in terms of the relaxation constant mentioned above.
One of the main contributions of this paper is a computationally feasible online algorithm, the Constrained Regret Matching (CRM) algorithm, that attains the posed performance goal. We note that although we focus in this paper on the online classification problem, our algorithm can be easily extended to the general case of regret minimization under average cost constraints.

The paper is structured as follows. In Section 2 we formally define the online classification problem and the goal of the meta-algorithm. In Section 3 we present the general problem of constrained regret minimization, and show that the online classification problem is a special case of it. In Section 4 we define our relaxed goal in terms of the unconstrained optimal tp-rate in hindsight, propose the CRM algorithm, and show that it can be implemented efficiently. Section 5 discusses the special case of two classifiers and corresponding experimental results. We conclude in Section 6 with some final remarks.

2 Online Classification

We consider the online binary classification problem from an abstract space to {1, -1}. We are given m classifiers that map an input instance to the probability that the instance belongs to the positive class. We denote by A = {1, ..., m} the set of indices of the classifiers. An online classification meta-algorithm is an algorithm that combines the outputs of the given classifiers in order to attain a certain goal, without prior knowledge on the form and statistics of the input, and without prior knowledge on the performance of the given classifiers. In what follows, we identify the meta-algorithm with an agent, and use both these notions interchangeably.
The time axis is discrete, with index n = 1, 2, .... At stage n, the following events occur: (i) the input instance is presented to the classifiers (but not to the agent); (ii) each classifier $a \in A$ outputs $f_n(a) \in [0, 1]$, which is the probability of the input to belong to class 1, and simultaneously the agent chooses a classifier $a_n$; and (iii) the correct label of the instance, $b_n \in \{1, -1\}$, is revealed.

There are several standard performance metrics of classifiers. These include error count, true-positive rate (which is also termed recall or sensitivity), and false-positive rate (one minus the fp-rate is usually termed specificity). As discussed in [6], the tp-rate and fp-rate metrics have some attractive properties, such as being insensitive to changes in the class distribution, and thus we focus on these metrics in this paper. In the online setting, no training set is given in advance, and therefore these rates have to be updated online, using the data obtained at each stage. Observe that this data is expressed in terms of the vector $z_n \triangleq (\{f_n(a)\}_{a \in A}, b_n) \in [0,1]^m \times \{-1, 1\}$. We let $r_n = r(a_n, z_n) \triangleq f_n(a_n)\,\mathbb{I}\{b_n = 1\}$ and $c_n = c(a_n, z_n) \triangleq f_n(a_n)\,\mathbb{I}\{b_n = -1\}$ denote the reward and the cost of the agent at time n. Note that $r_n$ is the probability that the instance with positive label at time n will be classified correctly by the agent, while $c_n$ is the probability that the instance with negative label will be classified incorrectly. Then, $\bar\beta_{tp}(n) \triangleq \sum_{k=1}^n r_k / \sum_{k=1}^n \mathbb{I}\{b_k = 1\}$ and $\bar\beta_{fp}(n) \triangleq \sum_{k=1}^n c_k / \sum_{k=1}^n \mathbb{I}\{b_k = -1\}$ are the average tp-rate and fp-rate of the agent at time n, respectively.

Our aim is to design a meta-algorithm that will have $\bar\beta_{tp}(n)$ not worse than the tp-rate of the best convex combination of the m given classifiers (in hindsight), while satisfying $\bar\beta_{fp}(n) \le \gamma$, for some $0 < \gamma < 1$ (asymptotically, almost surely, for any possible sequence $z_1, z_2, \ldots$). In fact, this problem is a special case of the regret minimization problem with constraints. In the next section we thus present the general constrained regret minimization framework, and discuss its applicability to the case of online classification.

3 Constrained Regret Minimization

3.1 Model Definition

We consider the problem of an agent facing an arbitrarily varying environment. We identify the environment with some abstract opponent, and therefore obtain a repeated game formulation between the agent and the opponent. The constrained game is defined by a tuple $(A, Z, r, c, \Gamma)$, where A denotes the finite set of possible actions of the agent; Z denotes the compact set of possible outcomes (or actions) of the environment; $r : A \times Z \to \mathbb{R}$ is the reward function; $c : A \times Z \to \mathbb{R}^\ell$ is the vector-valued cost function; and $\Gamma \subseteq \mathbb{R}^\ell$ is a convex and closed set within which the average cost vector should lie in order to satisfy the constraints. An important special case is that of linear constraints, that is, $\Gamma = \{c \in \mathbb{R}^\ell : c_i \le \gamma_i,\ i = 1, \ldots, \ell\}$ for some vector $\gamma \in \mathbb{R}^\ell$.

The time axis is discrete, with index n = 1, 2, ....
At time step n, the following events occur: (i) the agent chooses an action $a_n$, and the opponent chooses an action $z_n$, simultaneously; (ii) the agent observes $z_n$; and (iii) the agent receives a reward $r_n = r(a_n, z_n) \in \mathbb{R}$ and a cost $c_n = c(a_n, z_n) \in \mathbb{R}^\ell$. We let $\bar r_n \triangleq \frac{1}{n}\sum_{k=1}^n r_k$ and $\bar c_n \triangleq \frac{1}{n}\sum_{k=1}^n c_k$ denote the average reward and cost of the agent at time n, respectively. Let $H_n \triangleq Z^{n-1} \times A^{n-1}$ denote the set of all possible histories of actions till time n. At time n, the agent chooses an action $a_n$ according to the decision rule $\pi_n : H_n \to \Delta(A)$, where $\Delta(A)$ is the set of probability distributions over the set A. The collection $\pi = \{\pi_n\}_{n=1}^\infty$ is the strategy of the agent. That is, at each time step, a strategy prescribes some mixed action $p \in \Delta(A)$, based on the observed history. A strategy for the opponent is defined similarly. We denote the mixed action of the opponent by $q \in \Delta(Z)$, which is the probability density over Z.

In what follows, we will use the shorthand notation $r(p, q) \triangleq \sum_{a \in A} p(a) \int_{z \in Z} q(z)\, r(a, z)$ for the expected reward under mixed actions $p \in \Delta(A)$ and $q \in \Delta(Z)$. The notation $r(a, q)$, $c(p, q)$, $c(p, z)$, $c(a, q)$ will be interpreted similarly. We make the following assumption, namely that the agent can satisfy the constraints in expectation against any mixed action of the opponent.

Assumption 3.1 (Satisfiability of Constraints). For every $q \in \Delta(Z)$, there exists $p \in \Delta(A)$, such that $c(p, q) \in \Gamma$.

Assumption 3.1 is essential, since otherwise the opponent can violate the average-cost constraints simply by playing the corresponding stationary strategy q.

Let $\bar q_n(z) \triangleq \sum_{k=1}^n \delta\{z - z_k\}/n$ denote the empirical density of the opponent's actions at time n, so that $\bar q_n \in \Delta(Z)$.
The optimal reward-in-hindsight is then given by

$$r_n^*(z_1, \ldots, z_n) \triangleq \max_{a \in A} \frac{1}{n}\sum_{k=1}^n r(a, z_k) = \max_{a \in A} \int_{z \in Z} r(a, z)\, \frac{1}{n}\sum_{k=1}^n \delta\{z - z_k\} = \max_{a \in A} r(a, \bar q_n),$$

implying that $r_n^* = r^*(\bar q_n)$. In what follows, we will use the term "reward envelope" in order to refer to functions $\rho : \Delta(Z) \to \mathbb{R}$. The simplest reward envelope is the (unconstrained) best-response envelope (BE) $\rho = r^*$. The n-stage regret of the algorithm (with respect to the BE) is then $r^*(\bar q_n) - \bar r_n$. A no-regret algorithm must ensure that the regret vanishes as $n \to \infty$ regardless of the opponent's actions. However, in our case, in addition to vanishing regret, we need to satisfy the cost constraints. Obviously, the BE need not be attainable in the presence of constraints, and therefore other reward envelopes should be considered. Hence, we use the following definition (introduced in [11]) in order to assess the online performance of the agent.

Definition 3.1 (Attainability and No-Regret). A reward envelope $\rho : \Delta(Z) \to \mathbb{R}$ is $\Gamma$-attainable if there exists a strategy $\pi$ for the agent such that, almost surely, (i) $\limsup_{n \to \infty} \left(\rho(\bar q_n) - \bar r_n\right) \le 0$, and (ii) $\lim_{n \to \infty} d(\bar c_n, \Gamma) = 0$, for every strategy of the opponent. Here, $d(\cdot, \Gamma)$ is the Euclidean set-to-point distance.
Such a strategy $\pi$ is called a constrained no-regret strategy with respect to $\rho$.

A natural extension of the BE to the constrained setting was defined in [11], by noting that if the agent knew in advance that the empirical distribution of the opponent's actions is $\bar q_n = q$, he could choose the constrained best-response mixed action p, which is a solution of the corresponding optimization problem:

$$r^*_\Gamma(q) \triangleq \max_{p \in \Delta(A)} \left\{ r(p, q) : \text{ s.t. } c(p, q) \in \Gamma \right\}. \qquad (1)$$

We refer to $r^*_\Gamma$ as the constrained best-response envelope (CBE).

The first positive result that appeared in the literature was that of Shimkin [12], who showed that the value $v_\Gamma \triangleq \min_{q \in \Delta(Z)} r^*_\Gamma(q)$ of the constrained game is attainable by the agent. The algorithm which attains the value is based on Blackwell's approachability theory [3], and is computationally efficient provided that $v_\Gamma$ can be computed offline. Unfortunately, it was shown in [11] that $r^*_\Gamma(q)$ itself is not attainable in general. However, the (lower) convex hull of $r^*_\Gamma(q)$, $\mathrm{conv}(r^*_\Gamma)$, is attainable¹. Two no-regret algorithms with respect to $\mathrm{conv}(r^*_\Gamma)$ are suggested in [11].
To our best knowledge, these algorithms are inefficient (i.e., not polynomial); these are the only existing constrained no-regret algorithms in the literature.

¹The (lower) convex hull of a function $f : X \to \mathbb{R}$ is the largest convex function which is nowhere larger than f.

It should be noted that the problem that is considered here cannot be formulated as an instance of online convex optimization [13, 9]; see [11] for a discussion on this issue.

3.2 Application to the Online Classification Problem

For the model described in Section 2, A = {1, ..., m} denotes the set of possible classifiers and Z denotes the set of possible outputs of the classifiers and the true labels, that is: $z = (\{f(a)\}_{a \in A}, b) \in [0,1]^m \times \{-1, 1\} \triangleq Z$. The reward at time n is $r_n = r(a_n, z_n) = f_n(a_n)\,\mathbb{I}\{b_n = 1\}$ and the cost is $c_n = c(a_n, z_n) = f_n(a_n)\,\mathbb{I}\{b_n = -1\}$. Note that in this case, the mixed action of the opponent $q \in \Delta(Z)$ is $q(f, b) = q(f|b)q(b)$, where $q(f|b)$ is the conditional density of the predictions of the classifiers and $q(b)$ is the probability of the label b. It is easy to check that

$$r(p, q) = q(1) \sum_{a \in A} p(a)\, \beta_{tp}(q; a), \qquad (2)$$

where $\beta_{tp}(q; a) \triangleq \int_f f(a)\, q(f|1)$ is the tp-rate of classifier a under distribution q. Regarding the cost, the goal is to keep it under a given threshold $0 < \gamma < 1$. Since the regret minimization framework requires additive rewards and costs, we define the following modified cost function: $c_\gamma(a, z) \triangleq c(a, z) - \gamma\, \mathbb{I}\{b = -1\}$, and similarly to the reward above, we have that

$$c_\gamma(p, q) = q(-1) \left( \sum_{a \in A} p(a)\, \beta_{fp}(q; a) - \gamma \right), \qquad (3)$$

where $\beta_{fp}(q; a) \triangleq \int_f q(f|-1)\, f(a)$ is the fp-rate of classifier a under distribution q.
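The quantities in (2) and (3) have direct empirical analogues: under the empirical distribution of the data, the tp-rate and fp-rate of a convex combination p of the classifiers are just the mean mixture output over the positive and negative instances, respectively. A minimal sketch (the function name and data layout are illustrative, not from the paper):

```python
def rates_of_mixture(p, outputs, labels):
    """Empirical tp-rate and fp-rate of a convex combination p of m classifiers.

    outputs: list of per-instance classifier outputs, each a list of m values in [0, 1]
    labels:  list of labels in {+1, -1}
    Returns the empirical analogues of sum_a p(a) beta_tp(q; a) / beta_fp(q; a)
    under the empirical distribution of (outputs, labels).
    """
    # Mixture output for each instance: sum_a p(a) * f(a)
    mix = [sum(pa * fa for pa, fa in zip(p, f)) for f in outputs]
    pos = [m_ for m_, b in zip(mix, labels) if b == 1]
    neg = [m_ for m_, b in zip(mix, labels) if b == -1]
    tp_rate = sum(pos) / len(pos) if pos else 0.0
    fp_rate = sum(neg) / len(neg) if neg else 0.0
    return tp_rate, fp_rate
```

For instance, an even mixture p = (0.5, 0.5) over two classifiers simply averages their outputs instance by instance before the rates are computed.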
We note that keeping the average fp-rate of the agent $\bar\beta_{fp}(n) \le \gamma$ is equivalent to keeping $(1/n)\sum_{k=1}^n c_\gamma(a_k, z_k) \le 0$.

Since our goal is to keep the fp-rate below $\gamma$, some assumption on the classifiers should be imposed in order to satisfy Assumption 3.1. We assume here that the classifiers' single-stage false-positive probability is such that it allows satisfying the constraint. In particular, we redefine² $Z \triangleq \{z = (f, b) \in [0,1]^m \times \{-1, 1\} : \text{if } b = -1,\ f(a) \le \gamma_a\}$, where $0 \le \gamma_a \le 1$, and there exists $a^*$ such that $\gamma_{a^*} < \gamma$. Under this assumption, it is clear that for every $q \in \Delta(Z)$, there exists $p \in \Delta(A)$ such that $c_\gamma(p, q) \le 0$; in fact this p is the probability mass concentrated on $a^*$. If additional prior information is available on the single-stage performance of the given classifiers, it may be used to further restrict the set Z. For example, we can also restrict $z = (f, 1)$ by $f(a) \ge \kappa_a$ for some $0 < \kappa_a < 1$. Such additional restrictions will generally contribute to reducing the value of the optimal relaxation parameter $\alpha^*$ (see (7) below). This effect will be explicitly demonstrated in Section 5.

We proceed to compute the BE and CBE. Using (2), the BE is

$$r^*(q) \triangleq \max_{a \in A} r(a, q) = q(1) \max_{a \in \{1, \ldots, m\}} \{\beta_{tp}(q; a)\} \triangleq q(1)\, \beta^*(q), \qquad (4)$$

where $\beta^*(q)$ is the optimal (unconstrained) tp-rate in hindsight under distribution q. Now, using (1), (2), and (3) we have that $r^*_\gamma(q) = q(1)\, \beta^*_\gamma(q)$, where

$$\beta^*_\gamma(q) \triangleq \max_{p \in \Delta(A)} \left\{ \sum_{a \in A} p(a)\, \beta_{tp}(q; a) : \text{ s.t. } \sum_{a \in A} p(a)\, \beta_{fp}(q; a) \le \gamma \right\}, \qquad (5)$$

is the optimal constrained tp-rate in hindsight under distribution q.
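Problem (5) is a small linear program over the simplex with a single inequality constraint, so its optimum mixes at most two classifiers; enumerating the feasible pure strategies and the constraint-tight pairs therefore solves it exactly. The following sketch (function name and interface are ours, not the paper's) assumes the per-classifier rates $\beta_{tp}(q; \cdot)$ and $\beta_{fp}(q; \cdot)$ are already given:

```python
def constrained_tp_rate(beta_tp, beta_fp, gamma):
    """Solve (5): maximize sum_a p(a)*beta_tp[a] over the simplex, subject to
    sum_a p(a)*beta_fp[a] <= gamma.

    With one linear inequality over the simplex, some optimal p is supported on
    at most two classifiers, so it suffices to check (a) every feasible pure
    strategy, and (b) every pair mixed so the fp-constraint holds with equality.
    Returns (optimal value, optimal p), or None if the problem is infeasible.
    """
    m = len(beta_tp)
    best = None
    for i in range(m):  # (a) feasible pure strategies
        if beta_fp[i] <= gamma and (best is None or beta_tp[i] > best[0]):
            p = [0.0] * m
            p[i] = 1.0
            best = (beta_tp[i], p)
    for i in range(m):  # (b) constraint-tight pairs (i infeasible, j feasible)
        for j in range(m):
            if beta_fp[i] > gamma >= beta_fp[j]:
                t = (gamma - beta_fp[j]) / (beta_fp[i] - beta_fp[j])
                v = t * beta_tp[i] + (1.0 - t) * beta_tp[j]
                if best is None or v > best[0]:
                    p = [0.0] * m
                    p[i], p[j] = t, 1.0 - t
                    best = (v, p)
    return best
```

For m = 2 this reproduces the closed-form CBE of Section 5: when classifier 1 is better on positives but violates the constraint, the optimum mixes the two classifiers with weight $(\gamma - \beta_{fp}(q;2))/(\beta_{fp}(q;1) - \beta_{fp}(q;2))$ on classifier 1.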
Finally, note that the value of the constrained game $v_\gamma \triangleq \min_{q \in \Delta(Z)} r^*_\gamma(q)$ is equal to 0 in this case (since rewards are nonnegative and the opponent may present only negative labels, i.e., play q with q(1) = 0).

As a consequence of this formulation, the algorithms proposed in [11] can in principle be used in order to attain the convex hull of $r^*_\gamma$. However, given the implementation difficulties associated with these algorithms, we are motivated to examine more carefully the problem of regret minimization with constraints and provide more practical no-regret algorithms with formal guarantees.

²This assumption can always be satisfied by adding a fictitious classifier $a_0$ that always outputs a fixed $f(a_0) < \gamma$, irrespective of the data. However, such an addition might adversely affect the value of the optimal relaxation parameter $\alpha^*$ (see (7) below), and should be avoided if possible.

4 Constrained Regret Matching

We next define a relaxed reward envelope for the online classification problem. The proposed relaxation is in fact applicable to the problem of constrained regret minimization in general. However, due to space limitations, we present it directly for our classification problem.

Our starting point here in defining an attainable reward envelope will be the BE $r^*(q) = q(1)\beta^*(q)$. Clearly, $r^*$ is in general not attainable in the presence of fp-constraints, and we thus consider a relaxed version thereof. For $\alpha \ge 0$, set $r^*_\alpha(q) \triangleq q(1)(\beta^*(q) - \alpha)$. Obviously, $r^*_\alpha$ is a convex function, and we can always pick $\alpha \ge 0$ large enough such that $r^*_\alpha$ is attainable. Furthermore, recall that the value $v_\gamma$ of the constrained game is attainable by the agent. Observe that, generally, $r^*_\alpha(q)$ can be smaller than $v_\gamma = 0$. We thus introduce the following modification, which we refer to as the scalar-relaxed best-response envelope (SR-BE), $r^{SR}_\alpha$:
$$r^{SR}_\alpha(q) \triangleq q(1)\, \max\{0,\ \beta^*(q) - \alpha\}. \qquad (6)$$

Now, let³

$$\alpha^* \triangleq \max_{q \in \Delta(Z)} \left( \beta^*(q) - \beta^*_\gamma(q) \right). \qquad (7)$$

We note that $r^{SR}_{\alpha^*}(q)$ is strictly above 0 at some point, unless the game is in some sense trivial (see the supplementary material for a proof). According to Definition 3.1, we are seeking a strategy $\pi$ that: (i) is an $\alpha$-relaxed no-regret strategy for the average reward; and (ii) ensures that the cost constraints are asymptotically satisfied. Thus, at each time step, we need to balance between the need of maximizing the average tp-rate and satisfying the average fp-rate constraint. Below we propose an algorithm which solves this trade-off for $\alpha \ge \alpha^*$.

We introduce some further notation. Let

$$R^\alpha_k(a) \triangleq \left[ f_k(a) - f_k(a_k) - \alpha \right] \mathbb{I}\{b_k = 1\},\ a \in A; \qquad L_k \triangleq c_\gamma(a_k, z_k), \qquad (8)$$

denote the instantaneous $\alpha$-regret and the instantaneous constraint violation (respectively) at time k. We have that the average $\alpha$-regret and constraint violation at time n are

$$\bar R^\alpha_n(a) = \bar q_n(1) \left[ \beta_{tp}(\bar q_n; a) - \bar\beta_{tp}(n) - \alpha \right],\ a \in A; \qquad \bar L_n = \bar q_n(-1) \left[ \bar\beta_{fp}(n) - \gamma \right]. \qquad (9)$$

Using this notation, the Constrained Regret Matching (CRM) algorithm is given in Algorithm 1. We then have the following result.

Theorem 4.1. Suppose that the CRM algorithm is applied with parameter $\alpha \ge \alpha^*$, where $\alpha^*$ is given in (7). Then, under Assumption 3.1, it attains $r^{SR}_\alpha$ (6) in the sense of Definition 3.1. That is, (i) $\liminf_{n \to \infty} \left( \bar\beta_{tp}(n) - \max\{0,\ \max_{a \in A} \beta_{tp}(\bar q_n; a) - \alpha\} \right) \ge 0$, and (ii) $\limsup_{n \to \infty} \left( \bar\beta_{fp}(n) - \gamma \right) \le 0$, for every strategy of the opponent, almost surely.

The proof of this theorem is based on Blackwell's approachability theory [3], and is given in the supplementary material. We note that the mixed action required by the CRM algorithm always exists provided that $\alpha \ge \alpha^*$. It can be easily shown (see the supplementary material) that whenever $\sum_{a \in A} \left[ \bar R^\alpha_{n-1}(a) \right]^+ > 0$, this action can be computed by solving the following linear program:

$$\min_{p \in B_n} \sum_{a \in A :\ p^\alpha_n(a) > p(a)} \left( p^\alpha_n(a) - p(a) \right), \qquad (10)$$

where $B_n \triangleq \left\{ p \in \Delta(A) : \left[ \bar L_{n-1} \right]^+ \left( \sum_{a' \in A} p(a') f(a') - \gamma \right) \le 0,\ \forall z = (f, -1) \in Z \right\}$ and $p^\alpha_n(a) = \left[ \bar R^\alpha_{n-1}(a) \right]^+ / \sum_{a' \in A} \left[ \bar R^\alpha_{n-1}(a') \right]^+$ is the $\alpha$-regret matching strategy. Note also that when the average constraint violation $\bar L_{n-1}$ is non-positive, the minimum in (10) is obtained by $p = p^\alpha_n$. Finally, when $\sum_{a \in A} \left[ \bar R^\alpha_{n-1}(a) \right]^+ = 0$, any action $p \in B_n$ can be chosen. It is worth mentioning that our algorithm, and in particular the program (10), cannot be formulated in the Online Convex Programming (OCP) framework [13, 9], since the equivalent reward functions in our case are trajectory-dependent, while in the OCP it is assumed that these functions are arbitrary but fixed (i.e., they should not depend on the agent's actions).

³In general, the parameter $\alpha^*$ may be difficult to compute analytically. See the supplementary material for a discussion on computational aspects. Also, in the supplementary material we propose an adaptive algorithm which avoids this computation (see a remark at the end of Section 4).
Finally, in Section 5 we show that in the case of two classifiers this computation is trivial.

Algorithm 1 CRM Algorithm
Parameter: $\alpha \ge 0$.
Initialization: At time n = 0 use an arbitrary action $a_0$.
At times n = 1, 2, ..., find a mixed action $p \in \Delta(A)$ such that

$$\begin{cases} \sum_{a \in A} \left[ \bar R^\alpha_{n-1}(a) \right]^+ \left( f(a) - \sum_{a' \in A} p(a') f(a') - \alpha \right) \le 0, & \forall z = (f, 1) \in Z, \\ \left[ \bar L_{n-1} \right]^+ \left( \sum_{a' \in A} p(a') f(a') - \gamma \right) \le 0, & \forall z = (f, -1) \in Z, \end{cases} \qquad (11)$$

where $\bar R^\alpha_n(a)$ and $\bar L_n$ are given in (9). Draw classifier $a_n$ from p.

Remark. In practice, it may be possible to attain $r^{SR}_\alpha$ with $\alpha < \alpha^*$ if the opponent is not entirely adversarial. In order to capitalize on this possibility, an adaptive algorithm can be used that adjusts the value of $\alpha$ online. The idea is to start from some small initial value $\alpha_0 \ge 0$ (possibly $\alpha_0 = 0$). At each time step n, we would like to use a parameter $\alpha = \alpha_n$ for which inequality (11) can be satisfied. This inequality is always satisfied when $\alpha \ge \alpha^*$. If however $\alpha < \alpha^*$, the inequality may or may not be satisfied. In the latter case, $\alpha$ can be increased so that the condition is satisfied. In addition, once in a while, $\alpha$ can be reset to $\alpha_0$, in order to obtain better results. In the supplementary material we further discuss the adaptive scheme, and prove a convergence rate for it. We note that the adaptive scheme does not require the computation of the optimal $\alpha^*$, as it discovers it online.

5 The Special Case of Two Classifiers

If m = 2, we can obtain explicit expressions for the reward envelopes and for the algorithm.
In particular, we have two classifiers, and we assume that the outputs of these classifiers lie in the set
$$Z \triangleq \bigl\{z = (f, b) \in [0,1]^2 \times \{-1, 1\} : \text{if } b = -1,\ f(1) \le \gamma_1,\ f(2) \le \gamma_2;\ \text{if } b = 1,\ f(2) \ge \lambda\bigr\}$$
such that $\gamma_1 > \gamma$, $\gamma_2 < \gamma$, and $\lambda \ge 0$. Observe that under this assumption, classifier 2 has one-stage performance guarantees that will allow us to obtain better guarantees for the meta-algorithm. By computing the CBE explicitly, we obtain

$$r^*_\gamma(q) = \begin{cases} \dfrac{\gamma - \beta_{fp}(q;2)}{\beta_{fp}(q;1) - \beta_{fp}(q;2)}\,\beta_{tp}(q;1) + \dfrac{\beta_{fp}(q;1) - \gamma}{\beta_{fp}(q;1) - \beta_{fp}(q;2)}\,\beta_{tp}(q;2), & \text{if } \beta_{tp}(q;1) > \beta_{tp}(q;2) \text{ and } \beta_{fp}(q;1) > \gamma, \\ \beta_{tp}(q;1), & \text{if } \beta_{tp}(q;1) > \beta_{tp}(q;2) \text{ and } \beta_{fp}(q;1) \le \gamma, \\ \beta_{tp}(q;2), & \text{otherwise.} \end{cases}$$

Therefore, the relaxation parameter is

$$\alpha = \max_{q:\ \beta_{tp}(q;1) > \beta_{tp}(q;2),\ \beta_{fp}(q;1) > \gamma}\left\{\frac{\beta_{tp}(q;1) - \beta_{tp}(q;2)}{\beta_{fp}(q;1) - \beta_{fp}(q;2)}\bigl(\beta_{fp}(q;1) - \gamma\bigr)\right\} = \frac{(1-\lambda)(\gamma_1 - \gamma)}{\gamma_1 - \gamma_2}.$$

Finally, it is easy to check using (10) that Algorithm 1 reduces in this case to the following simple rule: (i) if $\sum_{a\in A}\bigl[R^\alpha_{n-1}(a)\bigr]^+ > 0$, choose $p(1) = \min\bigl\{p^\alpha_n(1),\ \frac{\gamma - \gamma_2}{\gamma_1 - \gamma_2}\bigr\}$, where $p^\alpha_n(1) = \bigl[R^\alpha_{n-1}(1)\bigr]^+ / \sum_{a\in A}\bigl[R^\alpha_{n-1}(a)\bigr]^+$ denotes the $\alpha$-regret matching strategy; (ii) otherwise, choose an arbitrary action with $p(1) \le \frac{\gamma - \gamma_2}{\gamma_1 - \gamma_2}$.

We simulated the CRM algorithm with the following parameters: $\gamma = 0.3$, $\gamma_1 = 0.4$, $\gamma_2 = 0.2$, $\lambda = 0.7$. This gives a relaxation parameter of $\alpha = 0.15$.
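The closed-form relaxation parameter and the simple rule above can be checked numerically; in this sketch the function and variable names are ours, not the paper's:

```python
def relaxation_parameter(gamma, gamma1, gamma2, lam):
    """alpha = (1 - lambda) * (gamma1 - gamma) / (gamma1 - gamma2)."""
    return (1.0 - lam) * (gamma1 - gamma) / (gamma1 - gamma2)

def crm_two_classifiers(R1, R2, gamma, gamma1, gamma2):
    """Probability p(1) of picking classifier 1 under the simple rule:
    capped alpha-regret matching when some alpha-regret is positive,
    otherwise any p(1) below the cap (here: the cap itself)."""
    cap = (gamma - gamma2) / (gamma1 - gamma2)   # feasibility cap on p(1)
    pos1, pos2 = max(R1, 0.0), max(R2, 0.0)
    if pos1 + pos2 > 0.0:
        return min(pos1 / (pos1 + pos2), cap)    # capped alpha-regret matching
    return cap                                   # any p(1) <= cap is valid

# Parameters of the simulation in the text: gives alpha = 0.15.
alpha_star = relaxation_parameter(0.3, 0.4, 0.2, 0.7)
```

With these parameters the cap on $p(1)$ equals $(0.3-0.2)/(0.4-0.2) = 0.5$, so even when classifier 1 accumulates large $\alpha$-regret it is played at most half the time.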
Half of the input instances were positives and the other half were negatives (on average). The time was divided into episodes with exponentially growing lengths. In each odd episode, both classifiers had a similar tp-rate and both of them satisfied the constraints, while in each even episode, classifier 1 classified positives perfectly but did not satisfy the constraints. The results are shown in Figure 1. We compared the performance of the CRM algorithm to a simple unconstrained no-regret algorithm that treats the true-positive and false-positive probabilities similarly, but with different weights. In particular, the reward at stage $n$ of this algorithm is $g_n(w) = f_n(a_n)\,\mathbb{I}\{b_n = 1\} - w f_n(a_n)\,\mathbb{I}\{b_n = -1\}$ for some weight parameter $w \ge 0$. Given $w$, this is simply a no-regret algorithm with respect to $g_n(w)$. When $w = 0$, the algorithm performs tp-rate maximization, while if $w$ is large, it performs fp-rate minimization. We call this algorithm NR(w).

[Figure 1: Experimental results for the case of two classifiers. Panels: tp-rate and fp-rate versus $n$ for CRM and NR(w), $w = 1.1, 1.3, 1.33, 1.4$, with reference curves $\gamma$, $\beta^*(\bar q_n)$, and $\beta^*_\gamma(\bar q_n)$.]

As can be seen from Figure 1, the CRM algorithm outperforms NR(w) for any fixed parameter $w$. For $w = 1.1$, NR(w) has a better tp-rate, but the fp-rate constraint is violated most of the time. For $w = 1.4$, the constraints are always satisfied, but the tp-rate is always dominated by that of the CRM algorithm.
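For reference, the stage reward $g_n(w)$ of the NR(w) baseline defined above is easy to state in code; a minimal sketch (the function name is ours):

```python
def nr_reward(f_an, b_n, w):
    """Stage reward of NR(w): f_n(a_n) * I{b_n = 1} - w * f_n(a_n) * I{b_n = -1}.
    w = 0 rewards only true positives (tp-rate maximization); a large w
    heavily penalizes false positives (fp-rate minimization)."""
    if b_n == 1:
        return f_an
    return -w * f_an

r_pos = nr_reward(0.8, 1, 1.3)    # positive label: reward f_n(a_n)
r_neg = nr_reward(0.8, -1, 1.3)   # negative label: penalty -w * f_n(a_n)
```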
For $w = 1.3, 1.33$ it can be seen that the constraints are satisfied (or almost satisfied), but the tp-rate is usually dominated by that of the CRM algorithm.

6 Conclusion

We studied regret minimization with average-cost constraints, focusing on a computationally feasible algorithm for the special case of the online classification problem with specificity constraints. We defined a relaxed version of the best-response reward envelope and showed that it can be attained by the agent while satisfying the constraints, provided that the relaxation parameter is above a certain threshold. A polynomial no-regret algorithm was provided. This algorithm generally solves a linear program at each time step, while in a special case the algorithm's mixed action reduces to the simple $\alpha$-regret matching strategy. To the best of our knowledge, this is the first algorithm that addresses the problem of average tp-rate maximization under average fp-rate constraints in the online setting. In addition, an adaptive scheme that adjusts the relaxation parameter online was briefly discussed. Finally, the special case of two classifiers was discussed, and the experimental results for this case show that our algorithm outperforms a simple no-regret algorithm which takes as its reward function a weighted sum of the tp-rate and fp-rate.

Some remarks about our algorithm and results follow. First, the guaranteed convergence rate of the algorithm is $O(1/\sqrt{n})$, since it is based on Blackwell's approachability theorem⁴. Second, additional constraints can be easily incorporated in the presented framework, since the general regret minimization framework assumes a vector of constraints. Third, it seems that there is an inherent trade-off between complexity and performance in the studied problem.
In particular, in the case of a single constraint, the maximal attainable relaxed goal is the convex hull of the CBE (see [11]), but no polynomial algorithms are known that attain this goal. Our results show that, by further relaxing the goal, it is possible to devise polynomial algorithms that attain it. Finally, we note that the assumption on the single-stage fp-rates of the classifiers can be weakened by assuming that, in each sufficiently large period of time, the average fp-rate of each classifier $a$ is bounded by $\gamma_a$. Our approach and results can then be extended to this case, by treating each such period as a single stage.

⁴A straightforward application of this theorem also gives a $\sqrt{m}$ dependence of the rate on the number of classifiers. We note that it is possible to improve the dependence to $\log(m)$ by using a potential-based Blackwell approachability strategy (see, for example, [4], Chapter 7.8).

References

[1] Y. Amit, S. Shalev-Shwartz, and Y. Singer. Online classification for complex problems using simultaneous projections. In NIPS, 2006.

[2] D. Blackwell. Controlled random walks. In Proceedings of the International Congress of Mathematicians, volume III, pages 335–338, 1954.

[3] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1–8, 1956.

[4] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

[5] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.

[6] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.

[7] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1–2):79–103, 1999.

[8] J. Hannan.
Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.

[9] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2–3):169–192, 2007.

[10] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

[11] S. Mannor, J. N. Tsitsiklis, and J. Y. Yu. Online learning with sample path constraints. Journal of Machine Learning Research, 10:569–590, 2009.

[12] N. Shimkin. Stochastic games with average cost constraints. Annals of the International Society of Dynamic Games, Vol. 1: Advances in Dynamic Games and Applications, 1994.

[13] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML '03), pages 928–936, 2003.