{"title": "Regret Lower Bound and Optimal Algorithm in Finite Stochastic Partial Monitoring", "book": "Advances in Neural Information Processing Systems", "page_first": 1792, "page_last": 1800, "abstract": "Partial monitoring is a general model for sequential learning with limited feedback formalized as a game between two players. In this game, the learner chooses an action and at the same time the opponent chooses an outcome, then the learner suffers a loss and receives a feedback signal. The goal of the learner is to minimize the total loss. In this paper, we study partial monitoring with finite actions and stochastic outcomes. We derive a logarithmic distribution-dependent regret lower bound that defines the hardness of the problem. Inspired by the DMED algorithm (Honda and Takemura, 2010) for the multi-armed bandit problem, we propose PM-DMED, an algorithm that minimizes the distribution-dependent regret. PM-DMED significantly outperforms state-of-the-art algorithms in numerical experiments. To show the optimality of PM-DMED with respect to the regret bound, we slightly modify the algorithm by introducing a hinge function (PM-DMED-Hinge). Then, we derive an asymptotically optimal regret upper bound of PM-DMED-Hinge that matches the lower bound.", "full_text": "Regret Lower Bound and Optimal Algorithm in Finite Stochastic Partial Monitoring\n\nJunpei Komiyama\nThe University of Tokyo\njunpei@komiyama.info\n\nJunya Honda\nThe University of Tokyo\nhonda@stat.t.u-tokyo.ac.jp\n\nHiroshi Nakagawa\nThe University of Tokyo\nnakagawa@dl.itc.u-tokyo.ac.jp\n\nAbstract\n\nPartial monitoring is a general model for sequential learning with limited feedback formalized as a game between two players. In this game, the learner chooses an action and at the same time the opponent chooses an outcome, then the learner suffers a loss and receives a feedback signal. The goal of the learner is to minimize the total loss. 
In this paper, we study partial monitoring with finite actions and stochastic outcomes. We derive a logarithmic distribution-dependent regret lower bound that defines the hardness of the problem. Inspired by the DMED algorithm (Honda and Takemura, 2010) for the multi-armed bandit problem, we propose PM-DMED, an algorithm that minimizes the distribution-dependent regret. PM-DMED significantly outperforms state-of-the-art algorithms in numerical experiments. To show the optimality of PM-DMED with respect to the regret bound, we slightly modify the algorithm by introducing a hinge function (PM-DMED-Hinge). Then, we derive an asymptotically optimal regret upper bound of PM-DMED-Hinge that matches the lower bound.\n\n1 Introduction\n\nPartial monitoring is a general framework for sequential decision making problems with imperfect feedback. Many classes of problems, including prediction with expert advice [1], the multi-armed bandit problem [2], dynamic pricing [3], the dark pool problem [4], label efficient prediction [5], and linear and convex optimization with full or bandit feedback [6, 7] can be modeled as an instance of partial monitoring.\n\nPartial monitoring is formalized as a repeated game played by two players called a learner and an opponent. At each round, the learner chooses an action, and at the same time the opponent chooses an outcome. Then, the learner observes a feedback signal from a given set of symbols and suffers some loss, both of which are deterministic functions of the selected action and outcome.\n\nThe goal of the learner is to find the optimal action that minimizes his/her cumulative loss. 
Alternatively, we can define the regret as the difference between the cumulative losses of the learner and the single optimal action, and minimization of the loss is equivalent to minimization of the regret. A learner with a small regret balances exploration (acquisition of information about the strategy of the opponent) and exploitation (utilization of information). The rate of regret indicates how fast the learner adapts to the problem: a linear regret indicates the inability of the learner to find the optimal action, whereas a sublinear regret indicates that the learner can approach the optimal action given sufficiently large time steps.\n\nThe study of partial monitoring is classified into two settings with respect to the assumption on the outcomes. On one hand, in the stochastic setting, the opponent chooses an outcome distribution before the game starts, and an outcome at each round is an i.i.d. sample from the distribution. On the other hand, in the adversarial setting, the opponent chooses the outcomes to maximize the regret of the learner. In this paper, we study the former setting.\n\n1.1 Related work\n\nThe paper by Piccolboni and Schindelhauer [8] is one of the first to study the regret of the finite partial monitoring problem. They proposed the FeedExp3 algorithm, which attains $O(T^{3/4})$ minimax regret on some problems. This bound was later improved to $O(T^{2/3})$ by Cesa-Bianchi et al. [9], who also showed an instance in which the bound is optimal. Since then, most literature on partial monitoring has dealt with the minimax regret, which is the worst-case regret over all possible strategies of the opponent. Bart\u00f3k et al. [10] classified the partial monitoring problems into four categories in terms of the minimax regret: a trivial problem with zero regret, an easy problem with $\\tilde{\\Theta}(\\sqrt{T})$ regret$^1$, a hard problem with $\\Theta(T^{2/3})$ regret, and a hopeless problem with $\\Theta(T)$ regret. 
This shows that the class of the partial monitoring problems is not limited to the bandit sort but also includes larger classes of problems, such as dynamic pricing. Since then, several algorithms with a $\\tilde{O}(\\sqrt{T})$ regret bound for easy problems have been proposed [11, 12, 13]. Among them, the Bayes-update Partial Monitoring (BPM) algorithm [13] is state-of-the-art in the sense of empirical performance.\n\nDistribution-dependent and minimax regret: we focus on the distribution-dependent regret that depends on the strategy of the opponent. While the minimax regret in partial monitoring has been extensively studied, little is known about the distribution-dependent regret in partial monitoring. To the authors' knowledge, the only paper focusing on the distribution-dependent regret in finite discrete partial monitoring is the one by Bart\u00f3k et al. [11], which derived $O(\\log T)$ distribution-dependent regret for easy problems. In contrast to this situation, much more interest in the distribution-dependent regret has been shown in the field of multi-armed bandit problems. Upper confidence bound (UCB), the most well-known algorithm for the multi-armed bandits, has a distribution-dependent regret bound [2, 14], and algorithms that minimize the distribution-dependent regret (e.g., KL-UCB) have been shown to perform better than ones that minimize the minimax regret (e.g., MOSS), even in instances in which the distributions are hard to distinguish (e.g., Scenario 2 in Garivier et al. [15]). Therefore, in the field of partial monitoring, we can expect that an algorithm that minimizes the distribution-dependent regret would perform better than the existing ones.\n\nContribution: the contributions of this paper lie in the following three aspects. First, we derive the regret lower bound: in some special classes of partial monitoring (e.g., multi-armed bandits), an $O(\\log T)$ regret lower bound is known to be achievable. 
In this paper, we further extend this lower bound to obtain a regret lower bound for general partial monitoring problems. Second, we propose an algorithm called Partial Monitoring DMED (PM-DMED). We also introduce a slightly modified version of this algorithm (PM-DMED-Hinge) and derive its regret bound. PM-DMED-Hinge is the first algorithm with a logarithmic regret bound for hard problems. Moreover, for both easy and hard problems, it is the first algorithm with the optimal constant factor on the leading logarithmic term. Third, performances of PM-DMED and existing algorithms are compared in numerical experiments. Here, the partial monitoring problems consisted of three specific instances of varying difficulty. In all instances, PM-DMED significantly outperformed the existing methods when the number of rounds is large. The regret of PM-DMED on these problems quickly approached the theoretical lower bound.\n\n$^1$ Note that $\\tilde{\\Theta}$ ignores a polylog factor.\n\n2 Problem Setup\n\nThis paper studies the finite stochastic partial monitoring problem with $N$ actions, $M$ outcomes, and $A$ symbols. An instance of the partial monitoring game is defined by a loss matrix $L = (l_{i,j}) \\in \\mathbb{R}^{N \\times M}$ and a feedback matrix $H = (h_{i,j}) \\in [A]^{N \\times M}$, where $[A] = \\{1, 2, \\dots, A\\}$. At the beginning, the learner is informed of $L$ and $H$. At each round $t = 1, 2, \\dots, T$, a learner selects an action $i(t) \\in [N]$, and at the same time an opponent selects an outcome $j(t) \\in [M]$. The learner suffers loss $l_{i(t),j(t)}$, which he/she cannot observe: the only information the learner receives is the signal $h_{i(t),j(t)} \\in [A]$. We consider a stochastic opponent whose strategy for selecting outcomes is governed by the opponent's strategy $p^* \\in \\mathcal{P}_M$, where $\\mathcal{P}_M$ is a set of probability distributions over an $M$-ary outcome. The outcome $j(t)$ of each round is an i.i.d. sample from $p^*$.\n\nThe goal of the learner is to minimize the cumulative loss over $T$ rounds. Let the optimal action be the one that minimizes the loss in expectation, that is, $i^* = \\arg\\min_{i \\in [N]} L_i^\\top p^*$, where $L_i$ is the $i$-th row of $L$. Assume that $i^*$ is unique. Without loss of generality, we can assume that $i^* = 1$. Let $\\Delta_i = (L_i - L_1)^\\top p^* \\in [0, \\infty)$ and $N_i(t)$ be the number of rounds before the $t$-th in which action $i$ is selected. The performance of the algorithm is measured by the (pseudo) regret,\n\n$$\\mathrm{Regret}(T) = \\sum_{t=1}^{T} \\Delta_{i(t)} = \\sum_{i \\in [N]} \\Delta_i N_i(T + 1),$$\n\nwhich is the difference between the expected loss of the learner and the optimal action. It is easy to see that minimizing the loss is equivalent to minimizing the regret. The expectation of the regret measures the performance of an algorithm that the learner uses.\n\nFigure 1: Cell decomposition of a partial monitoring instance with $M = 3$.\n\nFor each action $i \\in [N]$, let $\\mathcal{C}_i$ be the set of opponent strategies for which action $i$ is optimal:\n\n$$\\mathcal{C}_i = \\{q \\in \\mathcal{P}_M : \\forall j \\neq i,\\ (L_i - L_j)^\\top q \\le 0\\}.$$\n\nWe call $\\mathcal{C}_i$ the optimality cell of action $i$. Each optimality cell is a convex closed polytope. Furthermore, we call the set of optimality cells $\\{\\mathcal{C}_1, \\dots, \\mathcal{C}_N\\}$ the cell decomposition as shown in Figure 1. Let $\\mathcal{C}_i^c = \\mathcal{P}_M \\setminus \\mathcal{C}_i$ be the set of strategies with which action $i$ is not optimal.\n\nThe signal matrix $S_i \\in \\{0, 1\\}^{A \\times M}$ of action $i$ is defined as $(S_i)_{k,j} = \\mathbb{1}[h_{i,j} = k]$, where $\\mathbb{1}[X] = 1$ if $X$ is true and 0 otherwise. The signal matrix defined here is slightly different from the one in the previous papers (e.g., Bart\u00f3k et al. [10]) in which the number of rows of $S_i$ is the number of the different symbols in the $i$-th row of $H$. 
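
As a small illustration of the signal-matrix definition above, the sketch below builds the signal matrices from a feedback matrix and checks that multiplying a signal matrix by an outcome distribution gives a distribution over symbols. It uses the feedback matrix H of the three-state game from Section 5. The helper names signal_matrix and symbol_distribution and the outcome distribution p are hypothetical choices made only for this example, not part of the paper.

```python
def signal_matrix(h_row, num_symbols):
    # Row k (for symbol k + 1) has a 1 in column j exactly when the action
    # shows symbol k + 1 under outcome j; symbols are 1-based as in H.
    return [[1 if h == k else 0 for h in h_row]
            for k in range(1, num_symbols + 1)]


def symbol_distribution(S_i, p):
    # Matrix-vector product S_i p: the law of the symbol seen under p.
    return [sum(s * pj for s, pj in zip(row, p)) for row in S_i]


# Feedback matrix of the three-state game (Section 5), with A = 2 symbols.
H = [[1, 2, 2],
     [2, 1, 2],
     [2, 2, 1]]
S = [signal_matrix(row, 2) for row in H]

# Hypothetical outcome distribution, used only for illustration.
p = [0.2, 0.3, 0.5]
dists = [symbol_distribution(S_i, p) for S_i in S]

for i, (S_i, d) in enumerate(zip(S, dists), start=1):
    print('action', i, 'signal matrix', S_i, 'symbol distribution', d)
```

Because every column of a signal matrix built this way contains exactly one 1, each product sums to one, which is the property the text relies on.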
The advantage in using the definition here is that $S_i p^* \\in \\mathbb{R}^A$ is a probability distribution over symbols that the algorithm observes when it selects an action $i$. Examples of signal matrices are shown in Section 5. An instance of partial monitoring is globally observable if for all pairs $i, j$ of actions, $L_i - L_j \\in \\oplus_{k \\in [N]} \\mathrm{Im}\\, S_k^\\top$. In this paper, we exclusively deal with globally observable instances: in view of the minimax regret, this includes trivial, easy, and hard problems.\n\n3 Regret Lower Bound\n\nA good algorithm should work well against any opponent's strategy. We extend this idea by introducing the notion of strong consistency: a partial monitoring algorithm is strongly consistent if it satisfies $\\mathbb{E}[\\mathrm{Regret}(T)] = o(T^a)$ for any $a > 0$ and $p \\in \\mathcal{P}_M$ given $L$ and $H$.\n\nIn the context of the multi-armed bandit problem, Lai and Robbins [2] derived the regret lower bound of a strongly consistent algorithm: an algorithm must select each arm $i$ until its number of draws $N_i(t)$ satisfies $\\log t \\lesssim N_i(t) d(\\theta_i \\| \\theta_1)$, where $d(\\theta_i \\| \\theta_1)$ is the KL divergence between the two one-parameter distributions from which the rewards of action $i$ and the optimal action are generated. Analogously, in the partial monitoring problem, we can define the minimum number of observations.\n\nLemma 1. For sufficiently large $T$, a strongly consistent algorithm satisfies:\n\n$$\\forall q \\in \\mathcal{C}_1^c, \\quad \\sum_{i \\in [N]} \\mathbb{E}[N_i(T)]\\, D(p_i^* \\| S_i q) \\ge \\log T - o(\\log T),$$\n\nwhere $p_i^* = S_i p^*$ and $D(p \\| q) = \\sum_i (p)_i \\log((p)_i / (q)_i)$ is the KL divergence between two discrete distributions, in which we define $0 \\log 0/0 = 0$.\n\nLemma 1 can be interpreted as follows: for each round $t$, consistency requires the algorithm to make sure that the possible risk that action $i \\neq 1$ is optimal is smaller than $1/t$. 
Large deviation principle [16] states that the probability that an opponent with strategy $q$ behaves like $p^*$ is roughly $\\exp(-\\sum_i N_i(t) D(p_i^* \\| S_i q))$. Therefore, we need to continue exploration of the actions until $\\sum_i N_i(t) D(p_i^* \\| S_i q) \\sim \\log t$ holds for any $q \\in \\mathcal{C}_1^c$ to reduce the risk to $\\exp(-\\log t) = 1/t$.\n\nThe proof of Lemma 1 is in Appendix B in the supplementary material. Based on the technique used in Lai and Robbins [2], the proof considers a modified game in which another action $i \\neq 1$ is optimal. The difficulty in proving the lower bound in partial monitoring lies in that the feedback structure can be quite complex: for example, to confirm the superiority of action 1 over 2, one might need to use the feedback from action $3 \\notin \\{1, 2\\}$. Still, we can derive the lower bound by utilizing the consistency of the algorithm in the original and modified games.\n\nWe next derive a lower bound on the regret based on Lemma 1. Note that the expectation of the regret can be expressed as $\\mathbb{E}[\\mathrm{Regret}(T)] = \\sum_{i \\neq 1} \\mathbb{E}[N_i(T + 1)] (L_i - L_1)^\\top p^*$. Let\n\n$$\\mathcal{R}_j(\\{p_i\\}) = \\Big\\{ \\{r_i\\}_{i \\neq j} \\in [0, \\infty)^{N-1} : \\inf_{q \\in \\mathrm{cl}(\\mathcal{C}_j^c) :\\, p_j = S_j q} \\sum_i r_i D(p_i \\| S_i q) \\ge 1 \\Big\\},$$\n\nwhere $\\mathrm{cl}(\\cdot)$ denotes a closure. Moreover, let\n\n$$C_j^*(p, \\{p_i\\}) = \\inf_{\\{r_i\\}_{i \\neq j} \\in \\mathcal{R}_j(\\{p_i\\})} \\sum_{i \\neq j} r_i (L_i - L_j)^\\top p,$$\n\nthe optimal solution of which is\n\n$$\\mathcal{R}_j^*(p, \\{p_i\\}) = \\Big\\{ \\{r_i\\}_{i \\neq j} \\in \\mathcal{R}_j(\\{p_i\\}) : \\sum_{i \\neq j} r_i (L_i - L_j)^\\top p = C_j^*(p, \\{p_i\\}) \\Big\\}.$$\n\nThe value $C_1^*(p^*, \\{p_i^*\\}) \\log T$ is the possible minimum regret for observations such that the minimum divergence of $p^*$ from any $q \\in \\mathcal{C}_1^c$ is larger than $\\log T$. Using Lemma 1 yields the following regret lower bound:\n\nTheorem 2. 
The regret of a strongly consistent algorithm is lower bounded as:\n\n$$\\mathbb{E}[\\mathrm{Regret}(T)] \\ge C_1^*(p^*, \\{p_i^*\\}) \\log T - o(\\log T).$$\n\nFrom this theorem, we can naturally measure the harshness of the instance by $C_1^*(p^*, \\{p_i^*\\})$, whereas the past studies (e.g., Vanchinathan et al. [13]) ambiguously define the harshness as the closeness to the boundary of the cells. Furthermore, we show in Lemma 5 in the Appendix that $C_1^*(p^*, \\{p_i^*\\}) = O(N / \\|p^* - \\mathcal{C}_1^c\\|_M^2)$: the regret bound has at most quadratic dependence on $\\|p^* - \\mathcal{C}_1^c\\|_M$, which is defined in Appendix D as the closeness of $p^*$ to the boundary of the optimal cell.\n\n4 PM-DMED Algorithm\n\nIn this section, we describe the partial monitoring deterministic minimum empirical divergence (PM-DMED) algorithm, which is inspired by DMED [17] for solving the multi-armed bandit problem. Let $\\hat{p}_i(t) \\in [0, 1]^A$ be the empirical distribution of the symbols under the selection of action $i$. Namely, the $k$-th element of $\\hat{p}_i(t)$ is $(\\sum_{t'=1}^{t-1} \\mathbb{1}[i(t') = i \\cap h_{i(t'),j(t')} = k]) / (\\sum_{t'=1}^{t-1} \\mathbb{1}[i(t') = i])$. We sometimes omit $t$ from $\\hat{p}_i$ when it is clear from the context. Let the empirical divergence of $q \\in \\mathcal{P}_M$ be $\\sum_{i \\in [N]} N_i(t) D(\\hat{p}_i(t) \\| S_i q)$, the exponential of which can be considered as a likelihood that $q$ is the opponent's strategy.\n\nThe main routine of PM-DMED is in Algorithm 1. At each loop, the actions in the current list $Z_C$ are selected once. The list for the actions in the next loop $Z_N$ is determined by the subroutine in Algorithm 2. The subroutine checks whether the empirical divergence of each point $q \\in \\mathcal{C}_1^c$ is larger than $\\log t$ or not (Eq. (3)). If it is large enough, it exploits the current information by selecting $\\hat{i}(t)$, the optimal action based on the estimation $\\hat{p}(t)$ that minimizes the empirical divergence. Otherwise, it selects the actions with the number of observations below the minimum requirement for making the empirical divergence of each suboptimal point $q \\in \\mathcal{C}_1^c$ larger than $\\log t$.\n\nUnlike the $N$-armed bandit problem in which a reward is associated with an action, in the partial monitoring problem, actions, outcomes, and feedback signals can be intricately related. Therefore, we need to solve a non-trivial optimization to run PM-DMED. Later in Section 5, we discuss a practical implementation of the optimization.\n\nAlgorithm 1 Main routine of PM-DMED and PM-DMED-Hinge\n1: Initialization: select each action once.\n2: $Z_C, Z_R \\leftarrow [N]$, $Z_N \\leftarrow \\emptyset$.\n3: while $t \\le T$ do\n4:   for $i(t) \\in Z_C$ in an arbitrarily fixed order do\n5:     Select $i(t)$, and receive feedback.\n6:     $Z_R \\leftarrow Z_R \\setminus \\{i(t)\\}$.\n7:     Add actions to $Z_N$ in accordance with Algorithm 2 (PM-DMED) or Algorithm 3 (PM-DMED-Hinge).\n8:     $t \\leftarrow t + 1$.\n9:   end for\n10:   $Z_C, Z_R \\leftarrow Z_N$, $Z_N \\leftarrow \\emptyset$.\n11: end while\n\nAlgorithm 2 PM-DMED subroutine for adding actions to $Z_N$ (without duplication).\n1: Parameter: $c > 0$.\n2: Compute an arbitrary $\\hat{p}(t)$ such that\n   $\\hat{p}(t) \\in \\arg\\min_q \\sum_i N_i(t) D(\\hat{p}_i(t) \\| S_i q)$   (1)\n   and let $\\hat{i}(t) = \\arg\\min_i L_i^\\top \\hat{p}(t)$.\n3: If $\\hat{i}(t) \\notin Z_R$ then put $\\hat{i}(t)$ into $Z_N$.\n4: If there are actions $i \\notin Z_R$ such that\n   $N_i(t) < c \\sqrt{\\log t}$   (2)\n   then put them into $Z_N$.\n5: If\n   $\\{N_i(t) / \\log t\\}_{i \\neq \\hat{i}(t)} \\notin \\mathcal{R}_{\\hat{i}(t)}(\\{\\hat{p}_i(t)\\})$   (3)\n   then compute some\n   $\\{r_i^*\\}_{i \\neq \\hat{i}(t)} \\in \\mathcal{R}_{\\hat{i}(t)}^*(\\hat{p}(t), \\{\\hat{p}_i(t)\\})$   (4)\n   and put all actions $i$ such that $i \\notin Z_R$ and $r_i^* > N_i(t) / \\log t$ into $Z_N$.\n\nNecessity of $\\sqrt{\\log T}$ exploration: PM-DMED tries to observe each action to some extent (Eq. (2)), which is necessary for the following reason: consider a four-state game characterized by\n\n$$L = \\begin{pmatrix} 0 & 1 & 1 & 0 \\\\\\\\ 10 & 1 & 0 & 0 \\\\\\\\ 10 & 0 & 1 & 0 \\\\\\\\ 11 & 11 & 11 & 11 \\end{pmatrix}, \\quad H = \\begin{pmatrix} 1 & 1 & 1 & 1 \\\\\\\\ 1 & 2 & 2 & 3 \\\\\\\\ 1 & 2 & 2 & 3 \\\\\\\\ 1 & 1 & 2 & 2 \\end{pmatrix}, \\quad \\text{and} \\quad p^* = (0.1, 0.2, 0.3, 0.4)^\\top.$$\n\nThe optimal action here is action 1, which does not yield any useful information. By using action 2, one receives three kinds of symbols from which one can estimate $(p^*)_1$, $(p^*)_2 + (p^*)_3$, and $(p^*)_4$, where $(p^*)_j$ is the $j$-th component of $p^*$. From this, an algorithm can find that $(p^*)_1$ is not very small and thus the expected loss of actions 2 and 3 is larger than that of action 1. Since the feedback of actions 2 and 3 is the same, one may also use action 3 in the same manner. However, the loss per observation is 1.2 and 1.3 for actions 2 and 3, respectively, and thus it is better to use action 2. This difference comes from the fact that $(p^*)_2 = 0.2 < 0.3 = (p^*)_3$. Since an algorithm does not know $p^*$ beforehand, it needs to observe action 4, the only source for distinguishing $(p^*)_2$ from $(p^*)_3$. Yet, an optimal algorithm cannot select it more than $\\Omega(\\log T)$ times because it affects the $O(\\log T)$ factor in the regret. In fact, $O((\\log T)^a)$ observations of action 4 with some $a > 0$ are sufficient to be convinced that $(p^*)_2 < (p^*)_3$ with probability $1 - o(1/T^{\\mathrm{poly}(a)})$. For this reason, PM-DMED selects each action $\\sqrt{\\log t}$ times.\n\n5 Experiment\n\nFollowing Bart\u00f3k et al. [11], we compared the performances of algorithms in three different games: the four-state game (Section 4), a three-state game and dynamic pricing. 
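
The loss figures quoted for the four-state game of Section 4 can be re-derived directly from its loss matrix and opponent strategy. The following is a minimal sketch rather than the authors' code: expected_losses and pseudo_regret are hypothetical helper names, and the selection counts passed to pseudo_regret are made-up numbers used only to show how the regret formula of Section 2 is evaluated.

```python
# Loss matrix and opponent strategy of the four-state game (Section 4).
L = [
    [0, 1, 1, 0],
    [10, 1, 0, 0],
    [10, 0, 1, 0],
    [11, 11, 11, 11],
]
p_star = [0.1, 0.2, 0.3, 0.4]


def expected_losses(loss_matrix, p):
    # Expected loss of each action: inner product of its loss row with p.
    return [sum(l * pj for l, pj in zip(row, p)) for row in loss_matrix]


def pseudo_regret(gaps, counts):
    # Sum over actions of (gap of action i) * (number of selections of i).
    return sum(g * n for g, n in zip(gaps, counts))


losses = expected_losses(L, p_star)
best = min(range(len(losses)), key=lambda i: losses[i])  # 0-based index
gaps = [loss - losses[best] for loss in losses]

# Hypothetical selection counts for actions 1..4, for illustration only.
counts = [90, 5, 4, 1]
regret = pseudo_regret(gaps, counts)

print('expected losses:', losses)
print('optimal action (1-based):', best + 1)
print('pseudo regret for the hypothetical counts:', regret)
```

Under these inputs the optimal action is action 1, and actions 2 and 3 have expected losses of about 1.2 and 1.3, in line with the discussion of the four-state game.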
Experiments on the $N$-armed bandit game were also done, and the result is shown in Appendix C.1.\n\nThe three-state game, which is classified as easy in terms of the minimax regret, is characterized by:\n\n$$L = \\begin{pmatrix} 1 & 1 & 0 \\\\\\\\ 0 & 1 & 1 \\\\\\\\ 1 & 0 & 1 \\end{pmatrix} \\quad \\text{and} \\quad H = \\begin{pmatrix} 1 & 2 & 2 \\\\\\\\ 2 & 1 & 2 \\\\\\\\ 2 & 2 & 1 \\end{pmatrix}.$$\n\nThe signal matrices of this game are,\n\n$$S_1 = \\begin{pmatrix} 1 & 0 & 0 \\\\\\\\ 0 & 1 & 1 \\end{pmatrix}, \\quad S_2 = \\begin{pmatrix} 0 & 1 & 0 \\\\\\\\ 1 & 0 & 1 \\end{pmatrix}, \\quad \\text{and} \\quad S_3 = \\begin{pmatrix} 0 & 0 & 1 \\\\\\\\ 1 & 1 & 0 \\end{pmatrix}.$$\n\nDynamic pricing, which is classified as hard in terms of the minimax regret, is a game that models a repeated auction between a seller (learner) and a buyer (opponent). At each round, the seller sets a price for a product, and at the same time, the buyer secretly sets a maximum price he is willing to pay. The signal is \u201cbuy\u201d or \u201cno-buy\u201d, and the seller's loss is either a given constant (no-buy) or the difference between the buyer's and the seller's prices (buy). The loss and feedback matrices are:\n\n$$L = \\begin{pmatrix} 0 & 1 & \\cdots & N-1 \\\\\\\\ c & 0 & \\cdots & N-2 \\\\\\\\ \\vdots & \\ddots & \\ddots & \\vdots \\\\\\\\ c & \\cdots & c & 0 \\end{pmatrix} \\quad \\text{and} \\quad H = \\begin{pmatrix} 2 & 2 & \\cdots & 2 \\\\\\\\ 1 & 2 & \\cdots & 2 \\\\\\\\ \\vdots & \\ddots & \\ddots & \\vdots \\\\\\\\ 1 & \\cdots & 1 & 2 \\end{pmatrix},$$\n\nwhere signals 1 and 2 correspond to no-buy and buy. The signal matrix of action $i$ is\n\n$$S_i = \\begin{pmatrix} \\overbrace{1 \\ \\cdots \\ 1}^{i-1} & \\overbrace{0 \\ \\cdots \\ 0}^{M-i+1} \\\\\\\\ 0 \\ \\cdots \\ 0 & 1 \\ \\cdots \\ 1 \\end{pmatrix}.$$\n\nFollowing Bart\u00f3k et al. [11], we set $N = 5$, $M = 5$, and $c = 2$.\n\nIn our experiments with the three-state game and dynamic pricing, we tested three settings regarding the harshness of the opponent: at the beginning of a simulation, we sampled 1,000 points uniformly at random from $\\mathcal{P}_M$, then sorted them by $C_1^*(p^*, \\{p_i^*\\})$. We chose the top 10%, 50%, and 90% harshest ones as the opponent's strategy in the harsh, intermediate, and benign settings, respectively.\n\nWe compared Random, FeedExp3 [8], CBP [11] with $\\alpha = 1.01$, BPM-LEAST, BPM-TS [13], and PM-DMED with $c = 1$. Random is a naive algorithm that selects an action uniformly at random. FeedExp3 requires a matrix $G$ such that $H^\\top G = L^\\top$, and thus one cannot apply it to the four-state game. CBP is an algorithm of logarithmic regret for easy games. The parameters $\\eta$ and $f(t)$ of CBP were set in accordance with Theorem 1 in their paper. BPM-LEAST is a Bayesian algorithm with $\\tilde{O}(\\sqrt{T})$ regret for easy games, and BPM-TS is a heuristic of state-of-the-art performance. The priors of two BPMs were set to be uninformative to avoid a misspecification, as recommended in their paper.\n\nFigure 2: Regret-round semilog plots of algorithms (horizontal axis: round $t$; vertical axis: regret $R(t)$; curves: Random, FeedExp3, CBP, BPM-LEAST, BPM-TS, PM-DMED, and LB). Panels: (a) three-states, benign; (b) three-states, intermediate; (c) three-states, harsh; (d) dynamic pricing, benign; (e) dynamic pricing, intermediate; (f) dynamic pricing, harsh; (g) four-states. The regrets are averaged over 100 runs. LB is the asymptotic regret lower bound of Theorem 2.\n\nAlgorithm 3 PM-DMED-Hinge subroutine for adding actions to $Z_N$ (without duplication).\n1: Parameters: $c > 0$, $f(n) = b n^{-1/2}$ for $b > 0$, $\\alpha(t) = a / (\\log \\log t)$ for $a > 0$.\n2: Compute arbitrary $\\hat{p}(t)$ which satisfies\n   $\\hat{p}(t) \\in \\arg\\min_q \\sum_i N_i(t) (D(\\hat{p}_i(t) \\| S_i q) - f(N_i(t)))_+$   (5)\n   and let $\\hat{i}(t) = \\arg\\min_i L_i^\\top \\hat{p}(t)$.\n3: If $\\hat{i}(t) \\notin Z_R$ then put $\\hat{i}(t)$ into $Z_N$.\n4: If\n   $\\hat{p}(t) \\notin \\mathcal{C}_{\\hat{i}(t), \\alpha(t)}$   (6)\n   or there exists an action $i$ such that\n   $D(\\hat{p}_i(t) \\| S_i \\hat{p}(t)) > f(N_i(t))$   (7)\n   then put all actions $i \\notin Z_R$ into $Z_N$.\n5: If there are actions $i$ such that\n   $N_i(t) < c \\sqrt{\\log t}$   (8)\n   then put the actions not in $Z_R$ into $Z_N$.\n6: If\n   $\\{N_i(t) / \\log t\\}_{i \\neq \\hat{i}(t)} \\notin \\mathcal{R}_{\\hat{i}(t)}(\\{\\hat{p}_i(t), f(N_i(t))\\})$   (9)\n   then compute some\n   $\\{r_i^*\\}_{i \\neq \\hat{i}(t)} \\in \\mathcal{R}_{\\hat{i}(t)}^*(\\hat{p}(t), \\{\\hat{p}_i(t), f(N_i(t))\\})$   (10)\n   and put all actions such that $i \\notin Z_R$ and $r_i^* > N_i(t) / \\log t$ into $Z_N$. If such $r_i^*$ is infeasible then put all actions $i \\notin Z_R$ into $Z_N$.\n\nThe computation of $\\hat{p}(t)$ in (1) and the evaluation of the condition in (3) involve convex optimizations, which were done with Ipopt [18]. Moreover, obtaining $\\{r_i^*\\}$ in (4) is classified as a linear semi-infinite programming (LSIP) problem, a linear programming (LP) with finitely many variables and infinitely many constraints. Following the optimization of BPM-LEAST [13], we resorted to a finite sample approximation and used the Gurobi LP solver [19] in computing $\\{r_i^*\\}$: at each round, we sampled 1,000 points from $\\mathcal{P}_M$, and relaxed the constraints on the samples. To speed up the computation, we skipped these optimizations in most rounds with large $t$ and used the result of the last computation. The computation of the coefficient $C_1^*(p^*, \\{p_i^*\\})$ of the regret lower bound (Theorem 2) is also an LSIP, which was approximated by 100,000 sample points from $\\mathcal{C}_1^c$.\n\nThe experimental results are shown in Figure 2. In the four-state game and the other two games with a benign or intermediate opponent, PM-DMED outperforms the other algorithms when the number of rounds is large. In particular, in the dynamic pricing game with an intermediate opponent, the regret of PM-DMED at $T = 10^6$ is ten times smaller than those of the other algorithms. Even in the harsh setting in which the minimax regret matters, PM-DMED has some advantage over all algorithms except for BPM-TS. With sufficiently large $T$, the slope of an optimal algorithm should converge to LB. In all games and settings, the slope of PM-DMED converges to LB, which is empirical evidence of the optimality of PM-DMED.\n\n6 Theoretical Analysis\n\nSection 5 shows that the empirical performance of PM-DMED is very close to the regret lower bound in Theorem 2. 
Although the authors conjecture that PM-DMED is optimal, it is hard to analyze PM-DMED. The technically hardest part arises from the case in which the divergence of each action is small but not yet fully converged. To circumvent this difficulty, we can introduce a discount factor. Let\n\n$$\\mathcal{R}_j(\\{p_i, \\delta_i\\}) = \\Big\\{ \\{r_i\\}_{i \\neq j} \\in [0, \\infty)^{N-1} : \\inf_{q \\in \\mathrm{cl}(\\mathcal{C}_j^c) :\\, D(p_j \\| S_j q) \\le \\delta_j} \\sum_i r_i (D(p_i \\| S_i q) - \\delta_i)_+ \\ge 1 \\Big\\}, \\quad (11)$$\n\nwhere $(X)_+ = \\max(X, 0)$. Note that $\\mathcal{R}_j(\\{p_i, \\delta_i\\})$ in (11) is a natural generalization of $\\mathcal{R}_j(\\{p_i\\})$ in Section 3 in the sense that $\\mathcal{R}_j(\\{p_i, 0\\}) = \\mathcal{R}_j(\\{p_i\\})$. Event $\\{N_i(t) / \\log t\\}_{i \\neq 1} \\in \\mathcal{R}_1(\\{\\hat{p}_i(t), \\delta_i\\})$ means that the number of observations $\\{N_i(t)\\}$ is enough to ensure that the \u201c$\\{\\delta_i\\}$-discounted\u201d empirical divergence of each $q \\in \\mathcal{C}_1^c$ is larger than $\\log t$. Analogous to $\\mathcal{R}_j(\\{p_i, \\delta_i\\})$, we define\n\n$$C_j^*(p, \\{p_i, \\delta_i\\}) = \\inf_{\\{r_i\\}_{i \\neq j} \\in \\mathcal{R}_j(\\{p_i, \\delta_i\\})} \\sum_{i \\neq j} r_i (L_i - L_j)^\\top p$$\n\nand its optimal solution by\n\n$$\\mathcal{R}_j^*(p, \\{p_i, \\delta_i\\}) = \\Big\\{ \\{r_i\\}_{i \\neq j} \\in \\mathcal{R}_j(\\{p_i, \\delta_i\\}) : \\sum_{i \\neq j} r_i (L_i - L_j)^\\top p = C_j^*(p, \\{p_i, \\delta_i\\}) \\Big\\}.$$\n\nWe also define $\\mathcal{C}_{i,\\alpha} = \\{p \\in \\mathcal{P}_M : L_i^\\top p + \\alpha \\le \\min_{j \\neq i} L_j^\\top p\\}$, the optimal region of action $i$ with margin. PM-DMED-Hinge shares the main routine of Algorithm 1 with PM-DMED and lists the next actions by Algorithm 3. Unlike PM-DMED, it (i) discounts $f(N_i(t))$ from the empirical divergence $D(\\hat{p}_i(t) \\| S_i q)$. Moreover, (ii) when $\\hat{p}(t)$ is close to the cell boundary, it encourages more exploration to identify the cell it belongs to by Eq. (6).\n\nTheorem 3. Assume that the following regularity conditions hold for $p^*$. (1) $\\mathcal{R}_1^*(p, \\{p_i, \\delta_i\\})$ is unique at $p = p^*$, $p_i = S_i p^*$, $\\delta_i = 0$. Moreover, (2) for $\\mathcal{S}_\\delta = \\{q : D(p_1^* \\| S_1 q) \\le \\delta\\}$, it holds that $\\mathrm{cl}(\\mathrm{int}(\\mathcal{C}_1^c) \\cap \\mathcal{S}_\\delta) = \\mathrm{cl}(\\mathrm{cl}(\\mathcal{C}_1^c) \\cap \\mathcal{S}_\\delta)$ for all $\\delta \\ge 0$ in some neighborhood of $\\delta = 0$, where $\\mathrm{cl}(\\cdot)$ and $\\mathrm{int}(\\cdot)$ denote the closure and the interior, respectively. Then,\n\n$$\\mathbb{E}[\\mathrm{Regret}(T)] \\le C_1^*(p^*, \\{p_i^*\\}) \\log T + o(\\log T).$$\n\nWe prove this theorem in Appendix D. Recall that $\\mathcal{R}_1^*(p, \\{\\hat{p}_i(t), \\delta_i\\})$ is the set of optimal solutions of an LSIP. In this problem, KKT conditions and the duality theorem apply as in the case of finite constraints; thus, we can check whether Condition 1 holds or not for each $p^*$ (see, e.g., Ito et al. [20] and references therein). Condition 2 holds in most cases, and an example of an exceptional case is shown in Appendix A.\n\nTheorem 3 states that PM-DMED-Hinge has a regret upper bound that matches the lower bound of Theorem 2.\n\nCorollary 4. (Optimality in the $N$-armed bandit problem) In the $N$-armed Bernoulli bandit problem, the regularity conditions in Theorem 3 always hold. Moreover, the coefficient of the leading logarithmic term in the regret bound of the partial monitoring problem is equal to the bound given in Lai and Robbins [2]. Namely, $C_1^*(p^*, \\{p_i^*\\}) = \\sum_{i \\neq 1} (\\Delta_i / d(\\mu_i \\| \\mu_1))$, where $d(p \\| q) = p \\log(p/q) + (1 - p) \\log((1 - p)/(1 - q))$ is the KL-divergence between Bernoulli distributions.\n\nCorollary 4, which is proven in Appendix C, states that PM-DMED-Hinge attains the optimal regret of the $N$-armed bandit if we run it on an $N$-armed bandit game represented as partial monitoring.\n\nAsymptotic analysis: it is Theorem 6 where we lose the finite-time property. 
This theorem shows the continuity of the optimal solution set $\\mathcal{R}_1^*(p, \\{p_i, \\delta_i\\})$ of $C_j^*(p, \\{p_j\\})$, which does not mention how close $\\mathcal{R}_1^*(p, \\{p_i, \\delta_i\\})$ is to $\\mathcal{R}_1^*(p^*, \\{p_i^*, 0\\})$ if $\\max\\{\\|p - p^*\\|_M, \\max_i \\|p_i - p_i^*\\|_M, \\max_i \\delta_i\\} \\le \\delta$ for given $\\delta$. To obtain an explicit bound, we need sensitivity analysis, the theory of the robustness of the optimal value and the solution for small deviations of its parameters (see e.g., Fiacco [21]). In particular, the optimal solution of partial monitoring involves an infinite number of constraints, which makes the analysis quite hard. For this reason, we will not perform a finite-time analysis. Note that the $N$-armed bandit problem is a special instance in which we can avoid solving the above optimization and a finite-time optimal bound is known.\n\nNecessity of the discount factor: we are not sure whether discount factor $f(n)$ in PM-DMED-Hinge is necessary or not. We also empirically tested PM-DMED-Hinge: although it is better than the other algorithms in many settings, such as dynamic pricing with an intermediate opponent, it is far worse than PM-DMED. We found that our implementation, which uses the Ipopt nonlinear optimization solver, was sometimes inaccurate at optimizing (5): there were some cases in which the true $p^*$ satisfies $\\forall i \\in [N],\\ (D(\\hat{p}_i(t) \\| S_i p^*) - f(N_i(t)))_+ = 0$, while the solution $\\hat{p}(t)$ we obtained had non-zero hinge values. In this case, the algorithm lists all actions from (7), which degrades performance. Determining whether the discount factor is essential or not is our future work.\n\nAcknowledgements\n\nThe authors gratefully acknowledge the advice of Kentaro Minami and sincerely thank the anonymous reviewers for their useful comments. This work was supported in part by JSPS KAKENHI Grant Numbers 15J09850 and 26106506.\n\nReferences\n\n[1] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Inf. Comput., 108(2):212\u2013261, February 1994.\n\n[2] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4\u201322, 1985.\n\n[3] Robert D. Kleinberg and Frank Thomson Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In FOCS, pages 594\u2013605, 2003.\n\n[4] Alekh Agarwal, Peter L. Bartlett, and Max Dama. Optimal allocation strategies for the dark pool problem. In AISTATS, pages 9\u201316, 2010.\n\n[5] Nicol\u00f2 Cesa-Bianchi, G\u00e1bor Lugosi, and Gilles Stoltz. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51(6):2152\u20132162, 2005.\n\n[6] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928\u2013936, 2003.\n\n[7] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355\u2013366, 2008.\n\n[8] Antonio Piccolboni and Christian Schindelhauer. Discrete prediction games with arbitrary feedback and loss. In COLT, pages 208\u2013223, 2001.\n\n[9] Nicol\u00f2 Cesa-Bianchi, G\u00e1bor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Math. Oper. Res., 31(3):562\u2013580, 2006.\n\n[10] G\u00e1bor Bart\u00f3k, D\u00e1vid P\u00e1l, and Csaba Szepesv\u00e1ri. Minimax regret of finite partial-monitoring games in stochastic environments. In COLT, pages 133\u2013154, 2011.\n\n[11] G\u00e1bor Bart\u00f3k, Navid Zolghadr, and Csaba Szepesv\u00e1ri. 
An adaptive algorithm for finite stochastic partial monitoring. In ICML, 2012.\n\n[12] Gábor Bartók. A near-optimal algorithm for finite partial-monitoring games against adversarial opponents. In COLT, pages 696–710, 2013.\n\n[13] Hastagiri P. Vanchinathan, Gábor Bartók, and Andreas Krause. Efficient partial monitoring with prior information. In NIPS, pages 1691–1699, 2014.\n\n[14] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.\n\n[15] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT, pages 359–376, 2011.\n\n[16] Amir Dembo and Ofer Zeitouni. Large deviations techniques and applications. Applications of mathematics. Springer, New York, Berlin, Heidelberg, 1998.\n\n[17] Junya Honda and Akimichi Takemura. An asymptotically optimal bandit algorithm for bounded support models. In COLT, pages 67–79, 2010.\n\n[18] Andreas Wächter and Carl D. Laird. Interior point optimizer (IPOPT).\n\n[19] Gurobi Optimization Inc. Gurobi optimizer.\n\n[20] S. Ito, Y. Liu, and K. L. Teo. A dual parametrization method for convex semi-infinite programming. Annals of Operations Research, 98(1-4):189–213, 2000.\n\n[21] Anthony V. Fiacco. Introduction to sensitivity and stability analysis in nonlinear programming. Academic Press, New York, 1983.\n", "award": [], "sourceid": 1062, "authors": [{"given_name": "Junpei", "family_name": "Komiyama", "institution": "The University of Tokyo"}, {"given_name": "Junya", "family_name": "Honda", "institution": "The University of Tokyo"}, {"given_name": "Hiroshi", "family_name": "Nakagawa", "institution": "The University of Tokyo"}]}