{"title": "Learning Prices for Repeated Auctions with Strategic Buyers", "book": "Advances in Neural Information Processing Systems", "page_first": 1169, "page_last": 1177, "abstract": "Inspired by real-time ad exchanges for online display advertising, we consider the problem of inferring a buyer's value distribution for a good when the buyer is repeatedly interacting with a seller through a posted-price mechanism. We model the buyer as a strategic agent, whose goal is to maximize her long-term surplus, and we are interested in mechanisms that maximize the seller's long-term revenue. We present seller algorithms that are no-regret when the buyer discounts her future surplus --- i.e. the buyer prefers showing advertisements to users sooner rather than later. We also give a lower bound on regret that increases as the buyer's discounting weakens and shows, in particular, that any seller algorithm will suffer linear regret if there is no discounting.", "full_text": "Learning Prices for Repeated Auctions\n\nwith Strategic Buyers\n\nKareem Amin\n\nUniversity of Pennsylvania\n\nakareem@cis.upenn.edu\n\nAfshin Rostamizadeh\nrostami@google.com\n\nGoogle Research\n\nAbstract\n\nUmar Syed\n\nGoogle Research\n\nusyed@google.com\n\nInspired by real-time ad exchanges for online display advertising, we consider the\nproblem of inferring a buyer\u2019s value distribution for a good when the buyer is\nrepeatedly interacting with a seller through a posted-price mechanism. We model\nthe buyer as a strategic agent, whose goal is to maximize her long-term surplus,\nand we are interested in mechanisms that maximize the seller\u2019s long-term revenue.\nWe de\ufb01ne the natural notion of strategic regret \u2014 the lost revenue as measured\nagainst a truthful (non-strategic) buyer. We present seller algorithms that are no-\n(strategic)-regret when the buyer discounts her future surplus \u2014 i.e.\nthe buyer\nprefers showing advertisements to users sooner rather than later. 
We also give a\nlower bound on strategic regret that increases as the buyer\u2019s discounting weakens\nand shows, in particular, that any seller algorithm will suffer linear strategic regret\nif there is no discounting.\n\nIntroduction\n\n1\nOnline display advertising inventory \u2014 e.g., space for banner ads on web pages \u2014 is often sold via\nautomated transactions on real-time ad exchanges. When a user visits a web page whose advertising\ninventory is managed by an ad exchange, a description of the web page, the user, and other relevant\nproperties of the impression, along with a reserve price for the impression, is transmitted to bidding\nservers operating on behalf of advertisers. These servers process the data about the impression and\nrespond to the exchange with a bid. The highest bidder wins the right to display an advertisement\non the web page to the user, provided that the bid is above the reserve price. The amount charged\nthe winner, if there is one, is settled according to a second-price auction. The winner is charged the\nmaximum of the second-highest bid and the reserve price.\nAd exchanges have been a boon for advertisers, since rich and real-time data about impressions\nallow them to target their bids to only those impressions that they value. However, this precise\ntargeting has an unfortunate side effect for web page publishers. A nontrivial fraction of ad exchange\nauctions involve only a single bidder. Without competitive pressure from other bidders, the task of\nmaximizing the publisher\u2019s revenue falls entirely to the reserve price setting mechanism. Second-\nprice auctions with a single bidder are equivalent to posted-price auctions. 
The seller offers a price\nfor a good, and a buyer decides whether to accept or reject the price (i.e., whether to bid above or\nbelow the reserve price).\nIn this paper, we consider online learning algorithms for setting prices in posted-price auctions where\nthe seller repeatedly interacts with the same buyer over a number of rounds, a common occurrence\nin ad exchanges where the same buyer might be interested in buying thousands of user impressions\ndaily. In each round t, the seller offers a good to a buyer for price pt. The buyer\u2019s value vt for the\ngood is drawn independently from a \ufb01xed value distribution. Both vt and the value distribution are\nknown to the buyer, but neither is observed by the seller. If the buyer accepts price pt, the seller\nreceives revenue pt, and the buyer receives surplus vt \u2212 pt. Since the same buyer participates in\n\n1\n\n\fthe auction in each round, the seller has the opportunity to learn about the buyer\u2019s value distribution\nand set prices accordingly. Notice that in worst-case repeated auctions there is no such opportunity\nto learn, while standard Bayesian auctions assume knowledge of a value distribution, but avoid\naddressing how or why the auctioneer was ever able to estimate this distribution.\nTaken as an online learning problem, we can view this as a \u2018bandit\u2019 problem [18, 16], since the\nrevenue for any price not offered is not observed (e.g., even if a buyer rejects a price, she may\nwell have accepted a lower price). The seller\u2019s goal is to maximize his expected revenue over all\nT rounds. One straightforward way for the seller to set prices would therefore be to use a no-\nregret bandit algorithm, which minimizes the difference between seller\u2019s revenue and the revenue\nthat would have been earned by offering the best \ufb01xed price p\u2217 in hindsight for all T rounds; for\na no-regret algorithm (such as UCB [3] or EXP3 [4]), this difference is o(T ). 
However, we argue that traditional no-regret algorithms are inadequate for this problem. Consider the motivations of a buyer interacting with an ad exchange where the prices are set by a no-regret algorithm, and suppose for simplicity that the buyer has a fixed value v_t = v for all t. The goal of the buyer is to acquire the most valuable advertising inventory for the least total cost, i.e., to maximize her total surplus Σ_t (v − p_t), where the sum is over rounds where the buyer accepts the seller's price. A naive buyer might simply accept the seller's price p_t if and only if v_t ≥ p_t; a buyer who behaves this way is called truthful. Against a truthful buyer any no-regret algorithm will eventually learn to offer prices p_t ≈ v on nearly all rounds. But a more savvy buyer will notice that if she rejects prices in earlier rounds, then she will tend to see lower prices in later rounds. Indeed, suppose the buyer only accepts prices below some small amount ε. Then any no-regret algorithm will learn that offering prices above ε results in zero revenue, and will eventually offer prices below that threshold on nearly all rounds. In fact, the smaller the learner's regret, the faster this convergence occurs. If v ≫ ε then the deceptive buyer strategy results in a large gain in total surplus for the buyer, and a large loss in total revenue for the seller, relative to the truthful buyer. While the no-regret guarantee certainly holds --- in hindsight, the best price is indeed ε --- it seems fairly useless.
In this paper, we propose a definition of strategic regret that accounts for the buyer's incentives, and give algorithms that are no-regret with respect to this definition. In our setting, the seller chooses a learning algorithm for selecting prices and announces this algorithm to the buyer.
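The manipulation described above is easy to reproduce. The sketch below (our illustration; the naive "empirical best price" seller and all parameters are arbitrary choices) compares a truthful buyer with value v = 0.9 against a deceptive buyer who accepts only prices at most ε = 0.1:

```python
def greedy_seller(T, accept, prices):
    """Naive seller: offer each price once, then forever offer the empirically best one."""
    emp = {}
    total = 0.0
    for p in prices:                 # one exploration pass over the grid
        a = accept(p)
        emp[p] = a * p               # observed revenue at price p
        total += a * p
    best = max(prices, key=lambda q: emp[q])
    for _ in range(T - len(prices)): # exploit the empirical best price from then on
        total += accept(best) * best
    return total

prices = [0.1 * k for k in range(1, 10)]
T, v, eps = 1000, 0.9, 0.1

truthful = greedy_seller(T, lambda p: int(p <= v), prices)    # accepts iff p <= value
deceptive = greedy_seller(T, lambda p: int(p <= eps), prices) # rejects everything above eps
```

Here the truthful buyer yields roughly 0.9 per round, while the deceptive buyer drives the seller down to roughly 0.1 per round and pockets surplus of about v − ε on every accepted round, exactly the failure mode described above.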
We assume that the buyer will examine this algorithm and adopt whatever strategy maximizes her expected surplus over all T rounds. We define the seller's strategic regret to be the difference between his expected revenue and the expected revenue he would have earned if, rather than using his chosen algorithm to set prices, he had instead offered the best fixed price p* on all rounds and the buyer had been truthful. As we have seen, this revenue can be much higher than the revenue of the best fixed price in hindsight (in the example above, p* = v). Unless noted otherwise, throughout the remainder of the paper the term "regret" will refer to strategic regret.
We make one further assumption about buyer behavior, which is based on the observation that in many important real-world markets --- and particularly in online advertising --- sellers are far more willing to wait for revenue than buyers are willing to wait for goods. For example, advertisers are often interested in showing ads to users who have recently viewed their products online (this practice is called 'retargeting'), and the value of these user impressions decays rapidly over time. Or consider an advertising campaign that is tied to a product launch. A user impression that is purchased long after the launch (such as the release of a movie) is almost worthless. To model this phenomenon we multiply the buyer's surplus in each round by a discount factor: if the buyer accepts the seller's price p_t in round t, she receives surplus γ_t(v_t − p_t), where {γ_t} is a nonincreasing sequence contained in the interval (0, 1]. We call T_γ = Σ_{t=1}^T γ_t the buyer's 'horizon', since it is analogous to the seller's horizon T.
The buyer's horizon plays a central role in our analysis.
Summary of results: In Sections 4 and 5 we assume that discount rates decrease geometrically: γ_t = γ^{t−1} for some γ ∈ (0, 1]. In Section 4 we consider the special case that the buyer has a fixed value v_t = v for all rounds t, and give an algorithm with regret at most O(T_γ √T). In Section 5 we allow the v_t to be drawn from any distribution that satisfies a certain smoothness assumption, and give an algorithm with regret at most Õ(T^α + T_γ^{1/α}), where α ∈ (0, 1) is a user-selected parameter. Note that for either algorithm to be no-regret (i.e., for regret to be o(T)), we need that T_γ = o(T). In Section 6 we prove that this requirement is necessary for no-regret: any seller algorithm has regret at least Ω(T_γ). The lower bound is proved via a reduction to a non-repeated, or 'single-shot', auction. That our regret bounds should depend so crucially on T_γ is foreshadowed by the example above, in which a deceptive buyer foregoes surplus in early rounds to obtain even more surplus in later rounds. A buyer with a short horizon T_γ will be unable to execute this strategy, as she will not be capable of bearing the short-term costs required to manipulate the seller.
2 Related work
Kleinberg and Leighton study a posted price repeated auction with goods sold sequentially to T bidders who either all have the same fixed private value, private values drawn from a fixed distribution, or private values that are chosen by an oblivious adversary (an adversary that acts independently of observed seller behavior) [15] (see also [7, 8, 14]). Cesa-Bianchi et al. study a related problem of setting the reserve price in a second price auction with multiple (but not repeated) bidders at each round [9].
Note that none of these previous works allow for the possibility of a strategic buyer, i.e.\none that acts non-truthfully in order to maximize its surplus. This is because a new buyer is consid-\nered at each time step and if the seller behavior depends only on previous buyers, then the setting\nimmediately becomes strategyproof.\nContrary to what is studied in these previous theoretical settings, electronic exchanges in practice see\nthe same buyer appearing in multiple auctions and, thus, the buyer has incentive to act strategically.\nIn fact, [12] \ufb01nds empirical evidence of buyers\u2019 strategic behavior in sponsored search auctions,\nwhich in turn negatively affects the seller\u2019s revenue. In the economics literature, \u2018intertemporal price\ndiscrimination\u2019 refers to the practice of using a buyer\u2019s past purchasing behavior to set future prices.\nPrevious work [1, 13] has shown, as we do in Section 6, that a seller cannot bene\ufb01t from conditioning\nprices on past behavior if the buyer is not myopic and can respond strategically. However, in contrast\nto our work, these results assume that the seller knows the buyer\u2019s value distribution.\nOur setting can be modeled as a nonzero sum repeated game of incomplete information, and there is\nextensive literature on this topic. However, most previous work has focused only on characterizing\nthe equilibria of these games. Further, our game has a particular structure that allows us to design\nseller algorithms that are much more ef\ufb01cient than generic algorithms for solving repeated games.\nTwo settings that are distinct from what we consider in this paper, but where mechanism design and\nlearning are combined, are the multi-armed bandit mechanism design problem [6, 5, 11] and the\nincentive compatible regression/classi\ufb01cation problem [10, 17]. 
The former problem is motivated\nby sponsored search auctions, where the challenge is to elicit truthful values from multiple bidding\nadvertisers while also ef\ufb01ciently estimating the click-through rate of the set of ads that are to be\nallocated. The latter problem involves learning a discriminative classi\ufb01er or regression function\nin the batch setting with training examples that are labeled by sel\ufb01sh agents. The goal is then to\nminimize error with respect to the truthful labels.\nFinally, Arora et al. proposed a notion of regret for online learning algorithms, called policy regret,\nthat accounts for the possibility that the adversary may adapt to the learning algorithm\u2019s behavior\n[2]. This resembles the ability, in our setting, of a strategic buyer to adapt to the seller algorithm\u2019s\nbehavior. However, even this stronger de\ufb01nition of regret is inadequate for our setting. This is\nbecause policy regret is equivalent to standard regret when the adversary is oblivious, and as we\nexplained in the previous section, there is an oblivious buyer strategy such that the seller\u2019s standard\nregret is small, but his regret with respect to the best \ufb01xed price against a truthful buyer is large.\n3 Preliminaries and Model\nWe consider a posted-price model for a single buyer repeatedly purchasing items from a single seller.\nAssociated with the buyer is a \ufb01xed distribution D over the interval [0, 1], which is known only to\nthe buyer. On each round t, the buyer receives a value vt \u2208V\u2286 [0, 1] from the distribution D. The\nseller, without observing this value, then posts a price pt \u2208P\u2286 [0, 1]. Finally, the buyer selects\nan allocation decision at \u2208{ 0, 1}. 
On each round t, the buyer receives an instantaneous surplus of a_t(v_t − p_t), and the seller receives an instantaneous revenue of a_t p_t.
We will be primarily interested in designing the seller's learning algorithm, which we will denote A. Let v_{1:t} denote the sequence of values observed on the first t rounds, (v_1, ..., v_t), defining p_{1:t} and a_{1:t} analogously. A is an algorithm that selects each price p_t as a (possibly randomized) function of (p_{1:t−1}, a_{1:t−1}). As is common in mechanism design, we assume that the seller announces his choice of algorithm A in advance. The buyer then selects her allocation strategy in response. The buyer's allocation strategy B generates allocation decisions a_t as a (possibly randomized) function of (D, v_{1:t}, p_{1:t}, a_{1:t−1}).
Notice that a choice of A, B and D fixes a distribution over the sequences a_{1:T} and p_{1:T}. This in turn defines the seller's total expected revenue:

SellerRevenue(A, B, D, T) = E[ Σ_{t=1}^T a_t p_t | A, B, D ].

In the most general setting, we will consider a buyer whose surplus may be discounted through time. In fact, our lower bounds will demonstrate that a sufficiently decaying discount rate is necessary for a no-regret learning algorithm. We will imagine therefore that there exists a nonincreasing sequence {γ_t ∈ (0, 1]} for the buyer. For a choice of T, we will define the effective "time-horizon" for the buyer as T_γ = Σ_{t=1}^T γ_t. The buyer's expected total discounted surplus is given by:

BuyerSurplus(A, B, D, T) = E[ Σ_{t=1}^T γ_t a_t (v_t − p_t) | A, B, D ].

We assume that the seller is faced with a strategic buyer who adapts to the choice of A. Thus, let B*(A, D) be a surplus-maximizing buyer for seller algorithm A and value distribution D. In other words, for all strategies B we have

BuyerSurplus(A, B*(A, D), D, T) ≥ BuyerSurplus(A, B, D, T).

We are now prepared to define the seller's regret. Let p* = arg max_{p∈P} p Pr_D[v ≥ p], the revenue-maximizing choice of price for a seller that knows the distribution D, and simply posts a price of p* on every round. Against such a pricing strategy, it is in the buyer's best interest to be truthful, accepting if and only if v_t ≥ p*, and the seller would receive a revenue of T p* Pr_{v∼D}[v ≥ p*]. Informally, a no-regret algorithm is able to learn D from previous interactions with the buyer, and converge to selecting a price close to p*. We therefore define regret as:

Regret(A, D, T) = T p* Pr_{v∼D}[v ≥ p*] − SellerRevenue(A, B*(A, D), D, T).

Finally, we will be interested in algorithms that attain o(T) regret (meaning the averaged regret goes to zero as T → ∞) for the worst-case D. In other words, we say A is no-regret if sup_D Regret(A, D, T) = o(T). Note that this definition of worst-case regret only assumes that Nature's behavior (i.e., the value distribution) is worst-case; the buyer's behavior is always presumed to be surplus maximizing.
4 Fixed Value Setting
In this section we consider the case of a single unknown fixed buyer value, that is V = {v} for some v ∈ (0, 1]. We show that in this setting a very simple pricing algorithm with monotonically decreasing price offerings is able to achieve regret O(T_γ √T) when the buyer discount is γ_t = γ^{t−1}. Due to space constraints many of the proofs for this section appear in Appendix A.

Monotone algorithm: Choose parameter β ∈ (0, 1), and initialize a_0 = 1 and p_0 = 1.
In each round t ≥ 1 let p_t = β^{1−a_{t−1}} p_{t−1}.

In the Monotone algorithm, the seller starts at the maximum price of 1, and decreases the price by a factor of β whenever the buyer rejects the price, and otherwise leaves it unchanged. Since Monotone is deterministic and the buyer's value v is fixed, the surplus-maximizing buyer algorithm B*(Monotone, v) is characterized by a deterministic allocation sequence a*_{1:T} ∈ {0, 1}^T.¹
The following lemma partially characterizes the optimal buyer allocation sequence.
Lemma 1. The sequence a*_1, ..., a*_T is monotonically nondecreasing.

¹If there are multiple optimal sequences, the buyer can then choose to randomize over the set of sequences. In such a case, the worst case distribution (for the seller) is the one that always selects the revenue-minimizing optimal sequence. In that case, let a*_{1:T} denote the revenue-minimizing buyer-optimal sequence.

In other words, once a buyer decides to start accepting the offered price at a certain time step, she will keep accepting from that point on. The main idea behind the proof is to show that if there does exist some time step t′ where a*_{t′} = 1 and a*_{t′+1} = 0, then swapping the values so that a*_{t′} = 0 and a*_{t′+1} = 1 (as well as potentially swapping another pair of values) will result in a sequence with strictly better surplus, thereby contradicting the optimality of a*_{1:T}. The full proof is shown in Section A.1.
Now, to finish characterizing the optimal allocation sequence, we provide the following lemma, which describes time steps where the buyer has with certainty begun to accept the offered price.
Lemma 2. Let c_{β,γ} = 1 + (1 − β)T_γ and d_{β,γ} = log(c_{β,γ}/v) / log(1/β); then for any t > d_{β,γ} we have a*_{t+1} = 1.
A detailed proof is presented in Section A.2. These lemmas imply the following regret bound.
Theorem 1. Regret(Monotone, v, T) ≤ vT(1 − β/c_{β,γ}) + vβ(d_{β,γ} + 1)/c_{β,γ}.
Proof. By Lemmas 1 and 2 we receive no revenue until at most round ⌈d_{β,γ}⌉ + 1, and from that round onwards we receive at least revenue β^{⌈d_{β,γ}⌉} per round. Thus

Regret(Monotone, v, T) = vT − Σ_{t=⌈d_{β,γ}⌉+1}^T β^{⌈d_{β,γ}⌉} ≤ vT − (T − d_{β,γ} − 1) β^{d_{β,γ}+1}.

Noting that β^{d_{β,γ}} = v/c_{β,γ} and rearranging proves the theorem.
Tuning the learning parameter simplifies the bound further and provides a O(T_γ √T) regret bound. Note that this tuning parameter does not assume knowledge of the buyer's discount parameter γ.
Corollary 1. If β = √T/(1 + √T) then Regret(Monotone, v, T) ≤ √T (4vT_γ + 2v log(1/v)) + v.
The computations used to derive this corollary are found in Section A.3. This corollary shows that it is indeed possible to achieve no-regret against a strategic buyer with an unknown fixed value as long as T_γ = o(√T). That is, the effective buyer horizon must be more than a constant factor smaller than the square-root of the game's finite horizon.
5 Stochastic Value Setting
We next give a seller algorithm that attains no-regret when the set of prices P is finite, the buyer's discount is γ_t = γ^{t−1}, and the buyer's value v_t for each round is drawn from a fixed distribution D that satisfies a certain continuity assumption, detailed below.

Phased algorithm: Choose parameter α ∈ (0, 1). Define T_i ≡ 2^i and S_i ≡ min(T_i/|P|, T_i^α). For each phase i = 1, 2, 3, ... of length T_i rounds: Offer each price p ∈ P for S_i rounds, in some fixed order; these are the explore rounds. Let A_{p,i} = number of explore rounds in phase i where price p was offered and the buyer accepted. For the remaining T_i − |P|S_i rounds of phase i, offer price p̃_i = arg max_{p∈P} p A_{p,i}/S_i in each round; these are the exploit rounds.

The Phased algorithm proceeds across a number of phases. Each phase consists of explore rounds followed by exploit rounds. During explore rounds, the algorithm selects each price in some fixed order. During exploit rounds, the algorithm repeatedly selects the price that realized the greatest revenue during the immediately preceding explore rounds.
First notice that a strategic buyer has no incentive to lie during exploit rounds (i.e. she will accept any price p_t < v_t and reject any price p_t > v_t), since her decisions there do not affect any of her future prices. Thus, the exploit rounds are the time at which the seller can exploit what he has learned from the buyer during exploration. Alternatively, if the buyer has successfully manipulated the seller into offering a low price, we can view the buyer as "exploiting" the seller.
During explore rounds, on the other hand, the strategic buyer can benefit by telling lies which will cause her to witness better prices during the corresponding exploit rounds. However, the value of these lies to the buyer will depend on the fraction of the phase consisting of explore rounds. Taken to the extreme, if the entire phase consists of explore rounds, the buyer is not interested in lying. In general, the more explore rounds, the more revenue has to be sacrificed by a buyer that is lying during the explore rounds.
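For concreteness, both pricing rules can be rendered as short procedures. This is a simplified sketch under stated assumptions: a truthful fixed-value buyer stands in for the strategic buyer, and the price grid, horizon, and parameter choices are arbitrary.

```python
import math

def monotone_seller(T, accept, beta):
    """Monotone: start at price 1; multiply the price by beta after each rejection."""
    p, total = 1.0, 0.0
    for _ in range(T):
        a = accept(p)
        total += a * p
        if not a:
            p *= beta  # prices only ever decrease
    return total

def phased_seller(T, accept, prices, alpha):
    """Phased: phase i has 2^i rounds; S_i explore rounds per price, the rest exploit."""
    total, t, i = 0.0, 0, 1
    while t < T:
        Ti = 2 ** i
        # S_i = min(T_i/|P|, T_i^alpha), floored to at least 1 for tiny early phases
        Si = max(1, min(Ti // len(prices), int(Ti ** alpha)))
        acc = {p: 0 for p in prices}
        for p in prices:                        # explore rounds
            for _ in range(Si):
                if t >= T:
                    return total
                a = accept(p)
                acc[p] += a
                total += a * p
                t += 1
        best = max(prices, key=lambda q: q * acc[q] / Si)
        for _ in range(Ti - len(prices) * Si):  # exploit rounds
            if t >= T:
                return total
            total += accept(best) * best
            t += 1
        i += 1
    return total

v, T = 0.7, 4096
beta = math.sqrt(T) / (1 + math.sqrt(T))        # the tuning from Corollary 1
m = monotone_seller(T, lambda p: int(p <= v), beta)
ph = phased_seller(T, lambda p: int(p <= v), [0.25, 0.5, 0.75], 0.5)
```

With a truthful buyer, Monotone's average revenue settles just below v, and Phased locks onto the best grid price (0.5 here) during exploit rounds; the surrounding text explains why the explore/exploit split also disciplines a strategic buyer.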
For the myopic buyer, the loss of enough immediate revenue at some point ceases to justify her potential gains in the future exploit rounds.
Thus, while traditional algorithms like UCB balance exploration and exploitation to ensure confidence in the observed payoffs of sampled arms, our Phased algorithm explores for two purposes: to ensure accurate estimates, and to dampen the buyer's incentive to mislead the seller. The seller's balancing act is to explore for long enough to learn the buyer's value distribution, but leave enough exploit rounds to benefit from the knowledge.
Continuity of the value distribution. The preceding argument required that the distribution D does not exhibit a certain pathology. There cannot be two prices p, p′ that are very close but p Pr_{v∼D}[v ≥ p] and p′ Pr_{v∼D}[v ≥ p′] are very different. Otherwise, the buyer is largely indifferent to being offered prices p or p′, but distinguishing between the two prices is essential for the seller during exploit rounds. Thus, we assume that the value distribution D is K-Lipschitz, which eliminates this problem: defining F(p) ≡ Pr_{v∼D}[v ≥ p], we assume there exists K > 0 such that |F(p) − F(p′)| ≤ K|p − p′| for all p, p′ ∈ [0, 1]. This assumption is quite mild, as our Phased algorithm does not need to know K, and the dependence of the regret rate on K will be logarithmic.
Theorem 2. Assume F(p) ≡ Pr_{v∼D}[v ≥ p] is K-Lipschitz. Let ∆ = min_{p∈P\{p*}} p*F(p*) − pF(p), where p* = arg max_{p∈P} pF(p). For any parameter α ∈ (0, 1) of the Phased algorithm there exist constants c_1, c_2, c_3, c_4 such that

Regret(Phased, D, T) ≤ c_1 |P| T^α + c_2 (|P|/∆^{1/α}) T_γ^{1/α} + c_3 (|P|/∆^{2/α}) (log T)^{1/α} + c_4 |P| (log T + log(K/∆))^{1/α} = Õ(T^α + T_γ^{1/α}).

The complete proof of Theorem 2 is rather technical, and is provided in Appendix B.
To gain further intuition about the upper bounds proved in this section and the previous section, it helps to parametrize the buyer's horizon T_γ as a function of T, e.g. T_γ = T^c for 0 ≤ c ≤ 1. Writing it in this fashion, we see that the Monotone algorithm has regret at most O(T^{c+1/2}), and the Phased algorithm has regret at most Õ(T^{√c}) if we choose α = √c. The lower bound proved in the next section states that, in the worst case, any seller algorithm will incur a regret of at least Ω(T^c).
6 Lower Bound
In this section we state the main lower bound, which establishes a connection between the regret of any seller algorithm and the buyer's discounting. Specifically, we prove that the regret of any seller algorithm is Ω(T_γ). Note that when T = T_γ --- i.e., the buyer does not discount her future surplus --- our lower bound proves that no-regret seller algorithms do not exist, and thus it is impossible for the seller to take advantage of learned information. For example, consider the seller algorithm that uniformly selects prices p_t from [0, 1]. The optimal buyer algorithm is truthful, accepting if p_t < v_t, as the seller algorithm is non-adaptive, and the buyer does not gain any advantage by being more strategic. In such a scenario the seller would quickly learn a good estimate of the value distribution D. What is surprising is that a seller cannot use this information if the buyer does not discount her future surplus.
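The uniform-pricing example can be simulated directly: non-adaptive uniform prices plus truthful responses give the seller a consistent estimate of F(p) = Pr_{v∼D}[v ≥ p]. This is our sketch; the Beta(2, 2) value distribution and the grid are arbitrary choices.

```python
import random

random.seed(0)

def estimate_F(T, draw_value, grid):
    """Offer i.i.d. uniform prices to a truthful buyer; estimate F(p) = Pr[v >= p]."""
    offers = {g: 0 for g in grid}
    accepts = {g: 0 for g in grid}
    for _ in range(T):
        p = random.random()              # non-adaptive uniform price
        v = draw_value()
        a = int(v >= p)                  # truthful accept/reject
        g = min(grid, key=lambda x: abs(x - p))  # bucket observation by nearest grid point
        offers[g] += 1
        accepts[g] += a
    return {g: accepts[g] / max(1, offers[g]) for g in grid}

grid = [0.1, 0.3, 0.5, 0.7, 0.9]
Fhat = estimate_F(20000, lambda: random.betavariate(2, 2), grid)
# Fhat is a decreasing curve approximating F over the grid.
```

The point of the lower bound is that although D is learnable this way, a non-discounting strategic buyer prevents the seller from converting the estimate into extra revenue.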
If the seller attempts to leverage information learned through interactions with the buyer, the buyer can react accordingly to negate this advantage.
The lower bound further relates regret in the repeated setting to regret in a particular single-shot game between the buyer and the seller. This demonstrates that, against a non-discounted buyer, the seller is no better off in the repeated setting than he would be by repeatedly implementing such a single-shot mechanism (ignoring previous interactions with the buyer). In the following section we describe the simple single-shot game.
6.1 Single-Shot Auction
We call the following game the single-shot auction. A seller selects a family of distributions S indexed by b ∈ [0, 1], where each S_b is a distribution on [0, 1] × {0, 1}. The family S is revealed to a buyer with unknown value v ∈ [0, 1], who then must select a bid b ∈ [0, 1], and then (p, a) ∼ S_b is drawn from the corresponding distribution.
As usual, the buyer gets a surplus of a(v − p), while the seller enjoys a revenue of ap. We restrict the set of seller strategies to distributions that are incentive compatible and rational. S is incentive compatible if for all b, v ∈ [0, 1], E_{(p,a)∼S_b}[a(v − p)] ≤ E_{(p,a)∼S_v}[a(v − p)]. It is rational if for all v, E_{(p,a)∼S_v}[a(v − p)] ≥ 0 (i.e. any buyer maximizing expected surplus is actually incentivized to play the game). Incentive compatible and rational strategies exist: drawing p from a fixed distribution (i.e.
all S_b are the same), and letting a = 1{b ≥ p} suffices.²
We define the regret in the single-shot setting of any incentive-compatible and rational strategy S with respect to value v as

SSRegret(S, v) = v − E_{(p,a)∼S_v}[ap].

The following loose lower bound on SSRegret(S, v) is straightforward, and establishes that a seller's revenue cannot be a constant fraction of the buyer's value for all v. The full proof is provided in the appendix (Section C.1).
Lemma 3. For any incentive compatible and rational strategy S there exists v ∈ [0, 1] such that SSRegret(S, v) ≥ 1/12.
6.2 Repeated Auction
Returning to the repeated setting, our main lower bound will make use of the following technical lemma, the full proof of which is provided in the appendix (Section C.1). Informally, the lemma states that the surplus enjoyed by an optimal buyer algorithm would only increase if this surplus were viewed without discounting.
Lemma 4. Let the buyer's discount sequence {γ_t} be positive and nonincreasing. For any seller algorithm A, value distribution D, and surplus-maximizing buyer algorithm B*(A, D),

E[ Σ_{t=1}^T γ_t a_t(v_t − p_t) ] ≤ E[ Σ_{t=1}^T a_t(v_t − p_t) ].

Notice if a_t(v_t − p_t) ≥ 0 for all t, then Lemma 4 is trivial. This would occur if the buyer only ever accepts prices less than her value (a_t = 1 only if p_t ≤ v_t). However, Lemma 4 is interesting in that it holds for any seller algorithm A. It's easy to imagine a seller algorithm that incentivizes the buyer to sometimes accept a price p_t > v_t with the promise that this will generate better prices in the future (e.g. setting p_{t′} = 1 and offering p_t = 0 for all t > t′ only if a_{t′} = 1, and otherwise setting p_t = 1 for all t > t′).
Lemmas 3 and 4 let us prove our main lower bound.
Theorem 3. Fix a positive, nonincreasing, discount sequence {γ_t}. Let A be any seller algorithm for the repeated setting.
There exists a buyer value distribution D such that Regret(A,D, T ) \u2265\n12 T\u03b3. In particular, if T\u03b3 =\u2126( T ), no-regret is impossible.\nProof. Let {ab,t, pb,t} be the sequence of prices and allocations generated by playing B\u2217(A, b)\nagainst A. For each b \u2208 [0, 1] and p \u2208 [0, 1) \u00d7{ 0, 1}, let \u00b5b(p, a) = 1\nt=1 \u03b3t1{ab,t =\na}1{pb,t = p}. Notice that \u00b5b(p, a) > 0 for countably many (p, a) and let \u2126b = {(p, a) \u2208\n[0, 1] \u00d7{ 0, 1} : \u00b5b(p, a) > 0}. We think of \u00b5b as being a distribution. It\u2019s in fact a random measure\nsince the {ab,t, pb,t} are themselves random. One could imagine generating \u00b5b by playing B\u2217(A, b)\nagainst A and observing the sequence {ab,t, pb,t}. Every time we observe a price pb,t = p and\nallocation ab,t = a, we assign 1\n\u03b3t additional mass to (p, a) in \u00b5b. This is impossible in practice,\nbut the random measure \u00b5b has a well-de\ufb01ned distribution.\nNow consider the following strategy S for the single-shot setting. Sb is induced by drawing a \u00b5b,\nthen drawing (p, a) \u223c \u00b5b. Note that for any b \u2208 [0, 1] and any measurable function f\n\nT\u03b3 !T\n\n1\n\nT\u03b3\n\n2This subclass of auctions is even ex post rational.\n\n7\n\n\fE(p,a)\u223cSb[f (a, p)] = E\u00b5b\u223cSb*E(p,a)\u223c\u00b5b[f (a, b) | \u00b5b]+ = 1\nThus the strategy S is incentive compatible, since for any b, v \u2208 [0, 1]\n\nT\u03b3\n\nE\"!T\n\nt=1 \u03b3tf (ab,t, pb,t)$.\n\n1\nT\u03b3\n\nE, T\n\u2019t=1\n\nE(p,a)\u223cSb[a(v \u2212 p)] =\n\nBuyerSurplus(A,B\u2217(A, b), v, T )\n\u03b3tav,t(v \u2212 pv,t)- = E(p,a)\u223cSv [a(v \u2212 p)]\nwhere the inequality follows from the fact that B\u2217(A, v) is a surplus-maximizing algorithm for a\nbuyer whose value is v. 
The strategy S is also rational, since for any v ∈ [0, 1],

E_{(p,a)∼S_v}[a(v − p)] = (1/T_γ) E[∑_{t=1}^T γ_t a_{v,t}(v − p_{v,t})] = (1/T_γ) BuyerSurplus(A, B*(A, v), v, T) ≥ 0,

where the inequality follows from the fact that a surplus-maximizing buyer algorithm cannot earn negative surplus, as a buyer can always reject every price and earn zero surplus.

Let r_t = 1 − γ_t and T_r = ∑_{t=1}^T r_t. Note that r_t ≥ 0. We have the following for any v ∈ [0, 1]:

T_γ SSRegret(S, v) = T_γ (v − E_{(p,a)∼S_v}[ap]) = T_γ v − E[∑_{t=1}^T γ_t a_{v,t} p_{v,t}]
  = (T − T_r) v − E[∑_{t=1}^T (1 − r_t) a_{v,t} p_{v,t}]
  = T v − E[∑_{t=1}^T a_{v,t} p_{v,t}] + E[∑_{t=1}^T r_t a_{v,t} p_{v,t}] − T_r v
  = Regret(A, v, T) + E[∑_{t=1}^T r_t a_{v,t} p_{v,t}] − T_r v
  = Regret(A, v, T) + E[∑_{t=1}^T r_t (a_{v,t} p_{v,t} − v)].

A closer look at the quantity E[∑_{t=1}^T r_t (a_{v,t} p_{v,t} − v)] tells us that

E[∑_{t=1}^T r_t (a_{v,t} p_{v,t} − v)] ≤ E[∑_{t=1}^T r_t a_{v,t} (p_{v,t} − v)] = −E[∑_{t=1}^T (1 − γ_t) a_{v,t} (v − p_{v,t})] ≤ 0,

where the first inequality holds because a_{v,t} ≤ 1 and v ≥ 0, and the last inequality follows from Lemma 4. Therefore T_γ SSRegret(S, v) ≤ Regret(A, v, T), and taking D to be the point mass on the value v ∈ [0, 1] which realizes Lemma 3 proves the statement of the theorem.

7 Conclusion

In this work, we have analyzed the performance of revenue-maximizing algorithms in the setting of a repeated posted-price auction with a strategic buyer. We show that if the buyer values inventory in the present more than in the far future, no-regret (with respect to revenue gained against a truthful buyer) learning is possible. Furthermore, we provide lower bounds that show such an assumption is in fact necessary.
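The chain of equalities in the proof of Theorem 3 can be sanity-checked numerically for fixed (rather than random) sequences; this is a minimal sketch with arbitrary illustrative choices of {γ_t}, {a_t}, {p_t}, writing Regret(A, v, T) = Tv − ∑_t a_t p_t as in the point-mass case of the proof:

```python
import random

random.seed(0)
T, v = 20, 0.7
gamma = [0.9 ** t for t in range(T)]           # positive, nonincreasing
a = [random.randint(0, 1) for _ in range(T)]   # arbitrary allocations
p = [random.random() for _ in range(T)]        # arbitrary prices in [0, 1]
r = [1 - g for g in gamma]                     # r_t = 1 - gamma_t >= 0

T_gamma = sum(gamma)
# Left side: T_gamma * SSRegret = T_gamma * v - sum_t gamma_t a_t p_t
lhs = T_gamma * v - sum(g * at * pt for g, at, pt in zip(gamma, a, p))
# Right side: Regret + sum_t r_t (a_t p_t - v), with Regret = T v - sum_t a_t p_t
regret = T * v - sum(at * pt for at, pt in zip(a, p))
rhs = regret + sum(rt * (at * pt - v) for rt, at, pt in zip(r, a, p))

assert abs(lhs - rhs) < 1e-9   # the identity holds term by term
```

The check confirms the purely algebraic part of the argument; the two inequalities that follow it are where rationality (a_{v,t} ≤ 1) and Lemma 4 are actually used.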
These are the first bounds of this type for the presented setting. Future directions of study include studying buyer behavior under weaker polynomial discounting rates, as well as understanding when existing "off-the-shelf" bandit algorithms (UCB or EXP3), perhaps with slight modifications, are able to perform well against strategic buyers.

Acknowledgements
We thank Corinna Cortes, Gagan Goel, Yishay Mansour, Hamid Nazerzadeh and Noam Nisan for early comments on this work and pointers to relevant literature.

References
[1] Alessandro Acquisti and Hal R. Varian. Conditioning prices on purchase history. Marketing Science, 24(3):367–381, 2005.
[2] Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In ICML, 2012.
[3] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[4] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[5] Moshe Babaioff, Robert D. Kleinberg, and Aleksandrs Slivkins. Truthful mechanisms with implicit payment computation. In Proceedings of the Conference on Electronic Commerce, pages 43–52. ACM, 2010.
[6] Moshe Babaioff, Yogeshwer Sharma, and Aleksandrs Slivkins. Characterizing truthful multi-armed bandit mechanisms. In Proceedings of the Conference on Electronic Commerce, pages 79–88. ACM, 2009.
[7] Ziv Bar-Yossef, Kirsten Hildrum, and Felix Wu. Incentive-compatible online auctions for digital goods. In Proceedings of the Symposium on Discrete Algorithms, pages 964–970. SIAM, 2002.
[8] Avrim Blum, Vijay Kumar, Atri Rudra, and Felix Wu. Online learning in online auctions.
In Proceedings of the Symposium on Discrete Algorithms, pages 202–204. SIAM, 2003.
[9] Nicolò Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Regret minimization for reserve prices in second-price auctions. In Proceedings of the Symposium on Discrete Algorithms. SIAM, 2013.
[10] Ofer Dekel, Felix Fischer, and Ariel D. Procaccia. Incentive compatible regression learning. Journal of Computer and System Sciences, 76(8):759–777, 2010.
[11] Nikhil R. Devanur and Sham M. Kakade. The price of truthfulness for pay-per-click auctions. In Proceedings of the Conference on Electronic Commerce, pages 99–106. ACM, 2009.
[12] Benjamin Edelman and Michael Ostrovsky. Strategic bidder behavior in sponsored search auctions. Decision Support Systems, 43(1):192–198, 2007.
[13] Drew Fudenberg and J. Miguel Villas-Boas. Behavior-Based Price Discrimination and Customer Recognition. Elsevier Science, Oxford, 2007.
[14] Jason Hartline. Dynamic posted price mechanisms, 2001.
[15] Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Symposium on Foundations of Computer Science, pages 594–605. IEEE, 2003.
[16] Volodymyr Kuleshov and Doina Precup. Algorithms for the multi-armed bandit problem. Journal of Machine Learning, 2010.
[17] Reshef Meir, Ariel D. Procaccia, and Jeffrey S. Rosenschein. Strategyproof classification with shared inputs. In Proceedings of the 21st IJCAI, pages 220–225, 2009.
[18] Herbert Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169–177.
Springer, 1985.