{"title": "Efficient Partial Monitoring with Prior Information", "book": "Advances in Neural Information Processing Systems", "page_first": 1691, "page_last": 1699, "abstract": "Partial monitoring is a general model for online learning with limited feedback: a learner chooses actions in a sequential manner while an opponent chooses outcomes. In every round, the learner suffers some loss and receives some feedback based on the action and the outcome. The goal of the learner is to minimize her cumulative loss. Applications range from dynamic pricing to label-efficient prediction to dueling bandits. In this paper, we assume that we are given some prior information about the distribution based on which the opponent generates the outcomes. We propose BPM, a family of new efficient algorithms whose core is to track the outcome distribution with an ellipsoid centered around the estimated distribution. We show that our algorithm provably enjoys near-optimal regret rate for locally observable partial-monitoring problems against stochastic opponents. As demonstrated with experiments on synthetic as well as real-world data, the algorithm outperforms previous approaches, even for very uninformed priors, with an order of magnitude smaller regret and lower running time.", "full_text": "Efficient Partial Monitoring with Prior Information\n\nHastagiri P Vanchinathan\nDept. of Computer Science\nETH Z\u00a8urich, Switzerland\n\nhastagiri@inf.ethz.ch\n\nG\u00b4abor Bart\u00b4ok\n\nDept. of Computer Science\nETH Z\u00a8urich, Switzerland\nbartok@inf.ethz.ch\n\nAndreas Krause\n\nDept. of Computer Science\nETH Z\u00a8urich, Switzerland\nkrausea@ethz.ch\n\nAbstract\n\nPartial monitoring is a general model for online learning with limited feedback: a\nlearner chooses actions in a sequential manner while an opponent chooses outcomes.\nIn every round, the learner suffers some loss and receives some feedback based on\nthe action and the outcome. 
The goal of the learner is to minimize her cumulative\nloss. Applications range from dynamic pricing to label-efficient prediction to dueling\nbandits. In this paper, we assume that we are given some prior information about the\ndistribution based on which the opponent generates the outcomes. We propose BPM, a\nfamily of new efficient algorithms whose core is to track the outcome distribution with\nan ellipsoid centered around the estimated distribution. We show that our algorithm\nprovably enjoys near-optimal regret rate for locally observable partial-monitoring\nproblems against stochastic opponents. As demonstrated with experiments on synthetic\nas well as real-world data, the algorithm outperforms previous approaches, even for very\nuninformed priors, with an order of magnitude smaller regret and lower running time.\n\n1 Introduction\n\nWe consider Partial Monitoring, a repeated game where in every time step a learner chooses an action while,\nsimultaneously, an opponent chooses an outcome. Then the player receives a loss based on the action and\noutcome chosen. The learner also receives some feedback based on which she can make better decisions in\nsubsequent time steps. The goal of the learner is to minimize her cumulative loss over some time horizon.\nThe performance of the learner is measured by the regret, the excess cumulative loss of the learner\ncompared to that of the best fixed constant action. If the regret scales linearly with the time horizon, it\nmeans that the learner does not approach the performance of the best action, that is, the learner fails to learn\nthe problem. On the other hand, sublinear regret indicates that the disadvantage of the learner compared\nto the best fixed strategy fades with time.\nGames in which the learner receives the outcome as feedback after every time step are called online\nlearning with full information. 
This special case of partial monitoring has been addressed by (among others) Vovk [1] and Littlestone and Warmuth [2], who designed the randomized algorithm Exponentially Weighted Averages (EWA) as a learner strategy. This algorithm achieves Θ(√(T log N)) expected regret against any opponent, where N is the number of actions and T is the time horizon. This regret growth rate is also proven to be optimal.
Another well-studied special case is the so-called multi-armed bandit problem. In this feedback model, the learner gets to observe the loss she suffered in every time step. That is, the learner does not receive any information about losses of actions she did not choose. Asymptotically optimal results were obtained by Audibert and Bubeck [3], who designed the Implicitly Normalized Forecaster (INF) that achieves the minimax optimal Θ(√(TN)) regret growth rate.¹

¹The algorithm Exp3 due to Auer et al. [4] achieves the same rate up to a logarithmic factor.

However, not all online learning problems have one of the above feedback structures. An important example of a problem that fits neither the full-information nor the bandit setting is dynamic pricing. Consider the problem of a vendor who wants to sell his products to customers at the best possible price. When a customer comes in, she (secretly) decides on the maximum price she is willing to pay for his product, while the vendor has to set a price without knowing the customer's preferences. The loss of the vendor is some preset constant if the customer did not buy the product, and an "opportunity loss" when the product was sold cheaper than the customer's maximum. The feedback, on the other hand, is merely an indicator of whether the transaction happened or not.
Dynamic pricing is just one of the practical applications of partial monitoring. 
Label efficient prediction, in its simplest form, has three actions: the first two actions are guesses of a binary outcome but provide no information, while the third action provides information about the outcome at the price of some unit loss. This can be thought of as an abstract form of spam filtering: the first two actions correspond to putting an email in the inbox or the spam folder, and the third action corresponds to asking the user whether the email is spam. Another problem that can be cast as partial monitoring is that of dueling bandits [5, 6], in which the learner chooses a pair of actions in every time step, the loss she suffers is the average loss of the two actions, and the feedback is which action was "better".
In online learning, we distinguish different models of how the opponent generates the outcomes. In the mildest version, called stochastic or stationary memoryless, the opponent chooses an outcome distribution before the game starts and then draws outcomes in an iid random manner from the chosen distribution. The oblivious adversarial opponent chooses the outcomes arbitrarily, but without observing the actions of the learner; this selection method is equivalent to choosing an outcome sequence ahead of time. Finally, the non-oblivious or adaptive adversarial opponent chooses outcomes arbitrarily with the possibility of looking at past actions of the learner. In this work, we focus on strategies against stochastic opponents.

Related work Partial monitoring was first addressed in the seminal paper of Piccolboni and Schindelhauer [7], who designed and analyzed the algorithm FeedExp3. The algorithm's main idea is to maintain an unbiased estimate for the loss of each action in every time step, and then use these estimates to run the full-information algorithm (EWA). 
Piccolboni and Schindelhauer [7] proved an O(T^{3/4}) upper bound on the regret (not taking into account the number of actions) for games for which learning is at all possible. This bound was later improved by Cesa-Bianchi et al. [8] to O(T^{2/3}); they also constructed an example of a problem for which this bound is optimal.
From the above bounds it can be seen that not all partial-monitoring problems have the same level of difficulty: while bandit problems enjoy an O(√T) regret rate, some partial-monitoring problems have Ω(T^{2/3}) regret. To this end, Bartók et al. [9] showed that partial-monitoring problems with finitely many actions and outcomes can be classified into four groups: trivial with zero regret, easy with Θ̃(√T) regret, hard with Θ(T^{2/3}) regret, and hopeless with linear regret. The distinguishing feature between easy and hard problems is the local observability condition, an algebraic condition on the feedback structure that can be efficiently verified for any problem. Bartók et al. [9] showed the above classification against stochastic opponents with the help of the algorithm BALATON. This algorithm keeps track of estimates of the loss difference of "neighboring" action pairs and eliminates actions that are highly likely to be suboptimal. Since then, several algorithms have been proposed that achieve the Õ(√T) regret bound for easy games [10, 11]. All these algorithms rely on the core idea of estimating the expected loss difference between pairs of actions.

Our contributions In this paper, we introduce BPM (Bayes-update Partial Monitoring), a new family of algorithms against iid stochastic opponents that rely on a novel use of past observations. Our algorithms maintain a confidence ellipsoid in the space of outcome distributions, and update the ellipsoid based on observations following a Bayes-like update. 
Our approach enjoys better empirical performance and lower computational overhead; another crucial advantage is that we can incorporate prior information about the outcome distribution by means of an initial confidence ellipsoid. We prove near-optimal minimax expected regret bounds for our algorithm, and demonstrate its effectiveness on several partial-monitoring problems on synthetic and real data.

2 Problem setup

Partial monitoring is a repeated game where in every round, a learner chooses an action while the opponent chooses an outcome from some finite action and outcome sets. Then, the learner observes a feedback signal (from some given set of symbols) and suffers some loss, both of which are deterministic functions of the action and outcome chosen. In this paper we assume that the opponent chooses the outcomes in an iid stochastic manner. The goal of the learner is to minimize her cumulative loss.
The following definitions and concepts are mostly taken from Bartók et al. [9]. An instance of partial monitoring is defined by the loss matrix L ∈ R^{N×M} and the feedback table H ∈ Σ^{N×M}, where N and M are the cardinalities of the action set and the outcome set, respectively, while Σ is some alphabet of symbols. That is, if the learner chooses action i while the outcome is j, the loss suffered by the learner is L[i,j], and the feedback received is H[i,j].
For an action 1 ≤ i ≤ N, let ℓ_i denote the column vector given by the ith row of L. Let ∆_M denote the M-dimensional probability simplex. 
It is easy to see that for any p ∈ ∆_M, if we assume that the opponent uses p to draw the outcomes (that is, p is the opponent strategy), the expected loss of action i can be expressed as ℓ_i^⊤ p.
We measure the performance of an algorithm with its expected regret, defined as the expected difference of the cumulative loss of the algorithm and that of the best fixed action in hindsight:

    R_T = max_{1 ≤ i ≤ N} Σ_{t=1}^{T} (ℓ_{I_t} − ℓ_i)^⊤ p,

where T is some time horizon, I_t (t = 1,...,T) is the action chosen in time step t, and p is the outcome distribution the opponent uses.
In this paper, we also assume we have some prior knowledge about the outcome distribution in the form of a confidence ellipsoid: we are given a distribution p_0 ∈ ∆_M and a symmetric positive semidefinite covariance matrix Σ_0 ∈ R^{M×M} such that the true outcome distribution p* satisfies

    ‖p_0 − p*‖_{Σ_0^{-1}} = √((p_0 − p*)^⊤ Σ_0^{-1} (p_0 − p*)) ≤ 1.

We use the term "confidence ellipsoid" even though our condition is not probabilistic; we do not assume that p* is drawn from a Gaussian distribution before the game starts. On the other hand, the way we track p* is derived from Bayes updates with a Gaussian conjugate prior, hence the name. We would also like to note that having the above prior knowledge is without loss of generality: for "large enough" Σ_0, the whole probability simplex is contained in the confidence ellipsoid, and thus partial monitoring without any prior information reduces to our setting.
The following definition reveals how we use the loss matrix to recover the structure of a game.
Definition 1 (Cell decomposition, Bartók et al. [9, Definition 2]). 
For any action 1 ≤ i ≤ N, let C_i denote the set of opponent strategies for which action i is optimal:

    C_i = { p ∈ ∆_M : ∀ 1 ≤ j ≤ N, (ℓ_i − ℓ_j)^⊤ p ≤ 0 }.

We call the set C_i the optimality cell of action i. Furthermore, we call the set of optimality cells {C_1,...,C_N} the cell decomposition of the game.
Every cell C_i is a convex closed polytope, as it is defined by a linear inequality system. Normally, a cell has dimension M−1, which is the same as the dimensionality of the probability simplex. It might happen, however, that a cell is of lower dimensionality. Another possible degeneracy is when two actions share the same cell. In this paper, for ease of presentation, we assume that these degeneracies do not appear. For an illustration of cell decomposition, see Figure 1(a).
Now that we know the regions of optimality, we can define when two actions are neighbors. Intuitively, two actions are neighbors if their optimality cells are neighbors in the strong sense that they not only meet in "one corner".
Definition 2 (Neighbors, Bartók et al. [9, page 4]). Two actions i and j are neighbors if the intersection of their optimality cells C_i ∩ C_j is an (M−2)-dimensional convex polytope.

(a) Cell decomposition  (b) Before the update  (c) After the update

Figure 1: (a) An example of a cell decomposition with M = 3 outcomes. Under the true outcome distribution p*, action 3 is optimal. Cells C_1 and C_3 are neighbors, but C_2 and C_5 are not. (b) The current estimate p_{t−1} is far away from the true distribution; the confidence ellipsoid is large. (c) After updating, p_t is closer to the truth; the confidence ellipsoid shrinks.

To optimize performance, the learner's primary goal is to find out which cell the opponent strategy lies in. Then, the learner can choose the action associated with that cell to play optimally. 
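As a concrete illustration of the cell decomposition, the following NumPy sketch (our own helper functions, not code from the paper) checks which optimality cell a given opponent strategy lies in; the toy numbers are purely illustrative:

```python
import numpy as np

def optimal_action(L, p):
    """Index of an action whose optimality cell contains p:
    the minimizer of the expected losses l_i^T p over actions i."""
    return int(np.argmin(L @ p))

def in_cell(L, i, p, tol=1e-12):
    """Membership test for the optimality cell C_i:
    (l_i - l_j)^T p <= 0 for all j, i.e. action i attains the minimum."""
    losses = L @ p
    return bool(losses[i] <= losses.min() + tol)

# Toy game with N = 3 actions and M = 3 outcomes (illustrative numbers only).
L = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
p = np.array([0.6, 0.3, 0.1])   # a point in the simplex Delta_3

assert optimal_action(L, p) == 0   # expected losses are (0.4, 0.7, 0.9)
assert in_cell(L, 0, p) and not in_cell(L, 2, p)
```

Enumerating the full cells as polytopes requires a linear-programming or vertex-enumeration step; the point membership test above is all the algorithm below needs.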
Since the feedback the learner receives is limited, this task of finding the optimal cell may be challenging. The next definition enables us to utilize the feedback table H.
Definition 3 (Signal matrix, Bartók et al. [9, Definition 1]). Let {α_1, α_2, ..., α_{σ_i}} ⊆ Σ be the set of symbols appearing in row i of the feedback table H. We define the signal matrix S_i ∈ {0,1}^{σ_i×M} of action i as

    S_i[k,j] = I(H[i,j] = α_k).

In words, S_i is the indicator table of observing symbols α_1,...,α_{σ_i} under outcomes 1,...,M given that the action chosen is i. For an example, consider the case when the ith row of H is (a b a c). Then,

    S_i = ( 1 0 1 0
            0 1 0 0
            0 0 0 1 ).

A very useful property of the signal matrix is that if we represent outcomes with M-dimensional unit vectors, then S_i can be used as a linear transformation to arrive at the unit-vector representation of the observation.
The following condition is key in distinguishing easy and hard games:
Definition 4 (Local observability, Bartók et al. [9, Definition 3]). Let actions i and j be neighbors. These actions are said to be locally observable if ℓ_i − ℓ_j ∈ Im S_i^⊤ ⊕ Im S_j^⊤. Furthermore, a game is locally observable if all of its neighboring action pairs are locally observable.
Bartók et al. [9] showed that finite stochastic partial-monitoring problems that admit local observability have Θ̃(√T) minimax expected regret. In the following, we present our new algorithm family that achieves the same regret rate for locally observable games against stochastic opponents.

3 BPM: New algorithms for Partial Monitoring based on Bayes updates

The algorithms we propose can be decomposed into two main building blocks: the first one keeps track of a belief about the true outcome distribution and provides us with a set of feasible actions in every round. The second one is responsible for selecting the action to play from this action set. Pseudocode for the algorithm family is shown in Algorithm 1.

Algorithm 1 BPM
input: L, H, p_0, Σ_0
initialization: Calculate signal matrices S_i
for t = 1 to T do
    Use selection rule (cf. Sec. 3.2) to choose an action I_t
    Observe feedback Y_t
    Update posterior: Σ_t^{-1} = Σ_{t−1}^{-1} + P_{I_t} and p_t = Σ_t (Σ_{t−1}^{-1} p_{t−1} + S_{I_t}^⊤ (S_{I_t} S_{I_t}^⊤)^{-1} Y_t)
end for

3.1 Update Rule

The method of updating the belief about the true outcome distribution (p*) is based on the idea that we pretend that the outcomes are generated from a Gaussian distribution with covariance Σ = I_M and unknown mean. We also pretend we have a Gaussian prior for tracking the mean. The parameters of this prior are denoted by p_0 (mean) and Σ_0 (covariance). In every time step, we perform a Gaussian Bayes-update using the observation received.

Full-information case As a gentle start, we explain what the update rule would look like if we had full information about the outcome in each time step. 
The update in this case is identical to the standard Gaussian one-step update:

    Σ_t = Σ_{t−1} − Σ_{t−1} (Σ_{t−1} + I)^{-1} Σ_{t−1},   or equivalently   Σ_t^{-1} = Σ_{t−1}^{-1} + I,
    µ_t = Σ_t (Σ_{t−1}^{-1} µ_{t−1} + X_t),               or equivalently   µ_t = µ_{t−1} + Σ_t (X_t − µ_{t−1}).

Here we use subindex t−1 for the prior parameters and t for the posterior parameters in time step t, and denote by X_t the outcome (observed in this case), encoded by an M-dimensional unit vector.

General case Moving away from the full-information case, we face the problem of not observing the outcome, only some symbol that is governed by the signal matrix of the action we chose and the outcome itself. If we denote, as above, the outcome at time step t by an M-dimensional unit vector X_t, then the observation symbol can be thought of as a unit vector given by Y_t = S_i X_t, provided the chosen action is i. It follows that what we observe is a linear transformation of the sample from the outcome distribution.
Following the Bayes update rule and assuming we chose action i at time step t, we derive the corresponding Gaussian posterior given that the likelihood of the observation is π(Y|p) ∼ N(S_i p, S_i S_i^⊤). After some algebraic manipulations we get that the posterior distribution is Gaussian with covariance Σ_t = (Σ_{t−1}^{-1} + P_i)^{-1} and mean p_t = Σ_t (Σ_{t−1}^{-1} p_{t−1} + P_i X_t), where P_i = S_i^⊤ (S_i S_i^⊤)^{-1} S_i is the orthogonal projection onto the image space of S_i^⊤. Note that even though X_t is not observed, the update can be performed, since P_i X_t = S_i^⊤ (S_i S_i^⊤)^{-1} S_i X_t = S_i^⊤ (S_i S_i^⊤)^{-1} Y_t.
A significant advantage of this method of tracking the outcome distribution, as opposed to keeping track of loss-difference estimates (as done in previous works), is that feedback from one action can provide information about losses across all the actions. 
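As a concrete sketch, Definition 3 and the posterior update can be written in a few lines of NumPy (the function names `signal_matrix` and `bpm_update` are ours, not the authors' code):

```python
import numpy as np

def signal_matrix(H_row):
    """Definition 3: indicator matrix S_i of the symbols in row i of H."""
    symbols = sorted(set(H_row))   # {alpha_1, ..., alpha_sigma_i}
    return np.array([[1.0 if h == a else 0.0 for h in H_row] for a in symbols])

def bpm_update(p_prev, Sigma_prev, S_i, y):
    """One Gaussian-style posterior step after playing action i and seeing
    the symbol encoded as unit vector y.  P_i = S_i^T (S_i S_i^T)^{-1} S_i
    projects onto Im S_i^T, so the unobserved outcome X_t is never needed."""
    G = np.linalg.inv(S_i @ S_i.T)
    P_i = S_i.T @ G @ S_i
    Sigma_inv_prev = np.linalg.inv(Sigma_prev)
    Sigma = np.linalg.inv(Sigma_inv_prev + P_i)      # Sigma_t^{-1} = Sigma_{t-1}^{-1} + P_i
    p = Sigma @ (Sigma_inv_prev @ p_prev + S_i.T @ G @ y)
    return p, Sigma

# The feedback row (a b a c) from the text gives the 3 x 4 signal matrix above.
S = signal_matrix(['a', 'b', 'a', 'c'])
assert S.tolist() == [[1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
```

Note that `S_i S_i^⊤` is diagonal with the symbol counts on the diagonal, so the inverse is cheap; in the full-information case `S_i` is the identity and the update reduces to the one-step formulas above.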
We believe that this property has a major role in the empirical performance improvement over existing methods.
An important part of analyzing our algorithm is to show that, despite the fact that the outcome distribution is not Gaussian, the update tracks the true outcome distribution well. For an illustration of tracking the true outcome distribution with the above update, see Figures 1(b) and 1(c).

3.2 Selection rules

For selecting actions given the posterior parameters, we propose two versions of the selection rule:

1. Draw a random sample p from the distribution N(p_{t−1}, Σ_{t−1}), project the sample to the probability simplex, then choose the action that minimizes the loss for outcome distribution p. This rule is a close relative of Thompson sampling. We call this version of the algorithm BPM-TS.
2. Use p_{t−1} and Σ_{t−1} to build a confidence ellipsoid for p*, enumerate all actions whose cells intersect with this ellipsoid, then choose the action that was chosen the fewest times so far (called BPM-LEAST).

Our experiments demonstrate the performance of both versions. We analyze version BPM-LEAST.

4 Analysis

We now analyze BPM-LEAST, which uses the Gaussian updates and considers a set of feasible actions based on the criterion that an action is feasible if its optimality cell intersects with the ellipsoid

    { p : ‖p − p_t‖_{Σ_t^{-1}} ≤ 1 + √((1/2) N log(MT)) }.

From these feasible actions, it picks the one that has been chosen the fewest times up to time step t. For this version of the algorithm, the following regret bound holds.
Theorem 1. 
Given a locally observable partial-monitoring problem (L, H) with prior information p_0, Σ_0, the algorithm BPM-LEAST achieves expected regret

    R_T ≤ C √(T N log(MT)),

where C is some problem-dependent constant.
The constant C depends on two main factors, both of them related to the feedback structure. The first one is the sum of the smallest eigenvalues of S_i S_i^⊤ over all actions i. The second is related to the local observability condition. As the condition says, for every neighboring action pair i and j, ℓ_i − ℓ_j ∈ Im S_i^⊤ ⊕ Im S_j^⊤. This means that there exist vectors v_{ij} and v_{ji} such that ℓ_i − ℓ_j = S_i^⊤ v_{ij} − S_j^⊤ v_{ji}. The constant depends on the maximum 2-norm of these vectors v_{ij}.
The proof of the theorem is deferred to the supplementary material. In a nutshell, the proof is divided into two main parts. First we need to show that the update rule (even though the underlying distribution is not Gaussian) serves as a good tool for tracking the true outcome distribution. After some algebraic manipulations, the problem reduces to finding a high-probability upper bound for norms of weighted sums of noise vectors. To this end, we use the martingale version of the matrix Hoeffding inequality [12, Theorem 1.3]. Then, we need to show that the confidence ellipsoid shrinks fast enough that if we only choose actions whose cells intersect with the ellipsoid, we do not suffer a large regret. In the core of proving this, we arrive 
In the core of proving this, we arrive\nat a term where we need to upper bound (cid:107)(cid:96)i\u2212 (cid:96)j(cid:107)\u03a3t, for some neighboring action pairs (i,j), and we\nshow that due to local observability and the speed at which the posterior covariance shrinks, this term\ncan be upper bounded by roughly 1/\u221at.\n\ni vij\u2212S(cid:62)\n\n5 Experiments\n\nFirst, we run extensive evaluations of BPM on various synthetic datasets and compare the performance\nagainst CBP [10] and FeedExp3 [7]. The datasets used in the simulated experiments are identical to the\nones used by Bart\u00b4ok et al. [10] and thus allow us to benchmark against the current state of the art. We also\nprovide results of BPM on a dataset that was collected by Singla and Krause [13] from real interactions\nwith many users on the Amazon Mechanical Turk (AMT) [14] crowdsourcing platform. We present the\ndetails of the datasets used and the summarize our results and findings in this section.\n\n5.1 Implementation Details\n\nIn order to implement BPM, we made the following implementation choices:\n\n1. To use BPM-LEAST (see Section 3.2), we need to recover the current feasible actions. We\ndo so by sampling multiple (10000) times from concentric Gaussian ellipsoids centred at the\ncurrent mean (pt) and collect feasible actions based on which cells the samples lie in. We resort\nto sampling for ease of implementation because otherwise we deal with the problem of finding\nthe intersection between an ellipsoid and a simplex in M-dimensional space.\n\n2. To implement BPM-TS, we draw p from the distribution N(pt\u22121,\u03a3t\u22121). We then project it\n\nback to the simplex to obtain a probability distribution on the outcome space.\n\nPrimarily due to sampling, both our algorithms are computationally more efficient than the existing\napproaches. 
In particular, BPM-TS is ideally suited for real-world tasks, as it was several orders of magnitude faster than existing algorithms in all our experiments. In each iteration, BPM-TS only needs to draw one sample from a multivariate Gaussian and does not need any cell decompositions or expensive computations to obtain high-dimensional intersections.

(a) Minimax (easy)  (b) Minimax (hard)  (c) Effects of priors
(d) Single opponent (easy)  (e) Single opponent (hard)  (f) Real data (dynamic pricing)

Figure 2: (a,b,d,e) Comparing BPM on the locally non-observable game ((a,d) benign opponent; (b,e) hard opponent). Hereby, (a,b) show the pointwise maximal regret over 15 scenarios, and (d,e) show the regret against a single opponent strategy. (c) shows the effect of a misspecified prior. (f) is the performance of the algorithms on the real dynamic pricing dataset.

5.2 Simulated dynamic pricing games

Dynamic pricing is a classic example of partial monitoring (see the introduction), and we compare the performance of the algorithms on the small but not locally observable game described in Bartók et al. [10]. The loss matrix and feedback table for the dynamic pricing game are given by

    L = ( 0  1  ···  N−1        H = ( y  y  ···  y
          c  0  ···  N−2              n  y  ···  y
          ⋮  ⋱   ⋱   ⋮                ⋮  ⋱   ⋱   ⋮
          c  ···  c   0 );            n  ···  n   y ).

Recall that the game models a repeated interaction of a seller with buyers in a market. Each buyer can either buy the product at the price (signal "y") or deny the offer (signal "n"). The corresponding loss to the seller is either a known constant c (representing opportunity cost) or the difference between the offered price and the outcome of the customer's latent valuation of the product (willingness-to-pay). A similar game models procurement processes as well. Note that this game does not satisfy local observability. While our theoretical results require this condition, in practice, if the opponent does not intentionally select harsh regions of the simplex, BPM remains applicable. 
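The loss and feedback structure just described can be generated programmatically. A sketch (our own helper, assuming 0-indexed prices and valuations, with a sale occurring iff the buyer's valuation is at least the offered price):

```python
import numpy as np

def dynamic_pricing_game(N, c):
    """Loss matrix L and feedback table H of the simulated dynamic pricing
    game: action i is the offered price, outcome j the buyer's maximum
    price.  A sale happens iff j >= i, yielding opportunity loss j - i and
    signal 'y'; otherwise the seller loses the constant c and sees 'n'."""
    L = np.empty((N, N))
    H = np.empty((N, N), dtype=object)
    for i in range(N):           # offered price
        for j in range(N):       # buyer's latent valuation
            if j >= i:
                L[i, j] = j - i  # sold below the buyer's maximum
                H[i, j] = 'y'
            else:
                L[i, j] = c      # no sale: fixed opportunity cost
                H[i, j] = 'n'
    return L, H

L, H = dynamic_pricing_game(5, 2.0)
assert L[0, 4] == 4 and L[4, 0] == 2.0      # first row 0..N-1; c below diagonal
assert H[0, 0] == 'y' and H[4, 0] == 'n'    # upper triangle 'y', below 'n'
```

Each row of H contains at most two symbols, so every signal matrix S_i has at most two rows here.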
Under this setting, expected individual regret is a reasonable measure of performance. That is, we measure the expected regret for fixed opponent strategies. We also consider the minimax expected regret, which measures worst-case performance (pointwise maximum) against multiple opponent strategies.

Benign opponent While the dynamic pricing game is not locally observable in general, certain opponent strategies are easier to compete with than others. Specifically, if the stochastic opponent chooses an outcome distribution that is away from the intersection of the cells that do not have local observability, the learning happens in "non-dangerous" or benign regions. We present results under this setting for simulated dynamic pricing with N = M = 5. The results shown in Figures 2(a) and 2(d) illustrate the benefits of both variants of BPM over previous approaches. We achieve an order of magnitude reduction in the regret suffered w.r.t. both the minimax and the individual regret.
Harsh opponent For the same problem, when the opponent chooses close to the boundary of the cells of two non-locally observable actions, the problem becomes harder. Still, BPM dramatically outperforms the baselines and suffers very little regret, as shown in Figures 2(b) and 2(e).

Effect of the prior We study the effects of a misspecified prior in Figure 2(c). As long as the initial confidence interval specified by the prior covariance is large enough to contain the opponent's distribution, an incorrectly specified prior mean does not have an adverse effect on the performance of BPM. If the prior confidence ellipse used by BPM does not contain the opponent's outcome distribution, however, the regret grows linearly in time, as expected. Further, if the prior is very informative (accurately specified prior mean and tight confidence ellipse), very little regret is suffered.

5.3 Results on real data

Dataset description We simulate a procurement game based on real data. Parameter estimation was done by posting a Human Intelligence Task (HIT) on the Amazon Mechanical Turk (AMT) platform. Motivated by an application in viral marketing, users were asked about the price they would accept for (hypothetically) letting us post promotional material to their friends on a social networking site. The survey also collected features like age, geographic region, number of friends in the social network, and activity levels (year of joining, time spent per day, etc.). Note that since the HIT was just a survey and the questions were about a hypothetical scenario, participants had no incentives to misreport their responses. Complete
Complete\nresponses were collected from approx. 800 participants. See [13] for more details.\n\nThe procurement game We simulate a procurement auction by playing back these responses offline.\nThe game is very similar in structure to dynamic pricing, with the optimal action being the best fixed price\nthat maximized the marketer\u2019s value or equivalently, minimized the loss. We sampled iid from the survey\ndata and perturbed the samples slightly to simulate a stream of 300000 potential users. At each iteration,\nwe simulate a user with a private valuation generated as a function of her attributes. We discretized the\noffer prices and the private valuations to be one of 11 values and set the opportunity cost of losing a user\ndue to low pricing to be 0.5. Thus we recover a partial monitoring game with 11 actions and 11 outcomes\nwith a 0/1 feedback matrix.\n\nResults We present the results of our evaluation on this dataset in Figure 2(f). Notice that although the\ngame is not locally observable, the outcome distribution does not seem to be in a difficult region of the cell\ndecomposition as the adaptive algorithms (CBP and both versions of BPM) perform well. We note that\nthe total regret suffered by BPM-LEAST is a factor of 10 lower than the regret achieved by CBP on this\ndataset. The plots are averaged over 30 runs of the competing algorithms on the stream. To the best of our\nknowledge, this is the first time partial monitoring has been evaluated on a real world problem of this size.\n\n6 Conclusions and future work\n\nWe introduced a new family of algorithms for locally observable partial-monitoring problems against\nstochastic opponents. We also enriched the model of partial monitoring with the possibility of incorporating\nprior information about the outcome distribution in the form of a confidence ellipsoid. 
The new insight of our approach is that instead of tracking loss differences, we explicitly track the true outcome distribution. This approach not only reduces computational overhead but also helps achieve low regret by being able to transfer information between actions. In particular, BPM-TS runs orders of magnitude faster than any existing algorithm, opening the path for the model of partial monitoring to be applied in realistic settings involving large numbers of actions and outcomes.
Future work includes extending our method to adversarial opponents. Bartók [11] already uses the idea of tracking the true outcome distribution with the help of a confidence parallelotope, which is rather close to our approach, but it has the same shortcoming as other algorithms that track loss differences: it cannot transfer information between actions. Extending our results to problems with large action and outcome spaces is also an important direction: if we have some prior information about the similarities between outcomes and/or actions, we can still hope for reasonable regret.

Acknowledgments This research was supported in part by SNSF grant 200021 137971, ERC StG 307036 and a Microsoft Research Faculty Fellowship.

References
[1] V. G. Vovk. Aggregating strategies. In COLT, pages 371–386, 1990.
[2] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Inf. Comput., 108(2):212–261, 1994.
[3] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, 2009.
[4] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002.
[5] Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The K-armed dueling bandits problem.
Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
[6] Nir Ailon, Thorsten Joachims, and Zohar Karnin. Reducing dueling bandits to cardinal bandits. arXiv preprint arXiv:1405.3396, 2014.
[7] Antonio Piccolboni and Christian Schindelhauer. Discrete prediction games with arbitrary feedback and loss. In COLT/EuroCOLT, pages 208–223, 2001.
[8] Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Math. Oper. Res., 31(3):562–580, 2006.
[9] Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Minimax regret of finite partial-monitoring games in stochastic environments. Journal of Machine Learning Research - Proceedings Track (COLT), 19:133–154, 2011.
[10] Gábor Bartók, Navid Zolghadr, and Csaba Szepesvári. An adaptive algorithm for finite stochastic partial monitoring. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
[11] Gábor Bartók. A near-optimal algorithm for finite partial-monitoring games against adversarial opponents. In COLT, pages 696–710, 2013.
[12] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
[13] Adish Singla and Andreas Krause. Truthful incentives in crowdsourcing tasks using regret minimization mechanisms. In International World Wide Web Conference (WWW), 2013.
[14] Amazon Mechanical Turk platform.
URL https://www.mturk.com.