{"title": "Semi-Parametric Dynamic Contextual Pricing", "book": "Advances in Neural Information Processing Systems", "page_first": 2363, "page_last": 2373, "abstract": "Motivated by the application of real-time pricing in e-commerce platforms, we consider the problem of revenue-maximization in a setting where the seller can leverage contextual information describing the customer's history and the product's type to predict her valuation of the product. However, her true valuation is unobservable to the seller, only binary outcome in the form of success-failure of a transaction is observed. Unlike in usual contextual bandit settings, the optimal price/arm given a covariate in our setting is sensitive to the detailed characteristics of the residual uncertainty distribution. We develop a semi-parametric model in which the residual distribution is non-parametric and provide the first algorithm which learns both regression parameters and residual distribution with $\\tilde O(\\sqrt{n})$ regret. We empirically test a scalable implementation of our algorithm and observe good performance.", "full_text": "Semi-Parametric Dynamic Contextual Pricing\n\nManagement Science and Engineering\n\nManagement Science and Engineering\n\nVirag Shah\n\nStanford University\n\nCalifornia, USA 94305\nvirag@stanford.edu\n\nJose Blanchet\n\nStanford University\n\nCalifornia, USA 94305\n\njblanche@stanford.edu\n\nRamesh Johari\n\nManagement Science and Engineering\n\nStanford University\n\nCalifornia, USA 94305\nrjohari@stanford.edu\n\nAbstract\n\nMotivated by the application of real-time pricing in e-commerce platforms, we\nconsider the problem of revenue-maximization in a setting where the seller can\nleverage contextual information describing the customer\u2019s history and the prod-\nuct\u2019s type to predict her valuation of the product. However, her true valuation is\nunobservable to the seller, only binary outcome in the form of success-failure of\na transaction is observed. 
Unlike in usual contextual bandit settings, the optimal price/arm given a covariate in our setting is sensitive to the detailed characteristics of the residual uncertainty distribution. We develop a semi-parametric model in which the residual distribution is non-parametric and provide the first algorithm which learns both regression parameters and residual distribution with Õ(√n) regret. We empirically test a scalable implementation of our algorithm and observe good performance.

1 Introduction

Many e-commerce platforms are experimenting with approaches to personalized dynamic pricing based on the customer's context (i.e., the customer's prior search/purchase history and the product's type). However, the mapping from context to optimal price needs to be learned. Our paper develops a bandit learning approach towards solving this problem, motivated by practical considerations faced by online platforms. In our model, customers arrive sequentially, and each customer is interested in buying one product. The customer purchases the product if her valuation (unobserved by the platform) for the product exceeds the price set by the seller. The platform observes the covariate vector corresponding to the context, and chooses a price. The customer buys the item if and only if the price is lower than her valuation.

We emphasize three salient features of this model; taken together, these are the features that distinguish our work. First, feedback is only binary: either the customer buys the item, or she does not. In other words, the platform must learn from censored feedback. This type of binary feedback is a common feature of practical demand estimation problems, since typically exact observation of the valuation of a customer is not possible.

Second, the platform must learn the functional form of the relationship between the covariates and the expected valuation. In our work, we assume a parametric model for this relationship. 
In particular, we presume that the expected value of the logarithm of the valuation is linear in the covariates. Among other things, this formulation has the benefit that it ensures valuations are always nonnegative. Further, from a technical standpoint, we demonstrate that this formulation also admits efficient estimation of the parametric model.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Third, the platform must also learn the distribution of residual uncertainty that determines the actual valuation given the covariates; in other words, the distribution of the error between the expected logarithm of the valuation and the actual logarithm of the valuation, given covariates. In our work we make minimal assumptions about the distribution of this residual uncertainty. Thus while the functional relationship between covariates and the expected logarithm of the valuation is parametric (i.e., linear), the distribution of the error is nonparametric; for this reason, we refer to our model as a semi-parametric dynamic pricing model.

The challenge is to ensure that we can efficiently learn both the coefficients in the parametric model, as well as the distribution of the error. A key observation we leverage is that our model exhibits free exploration: testing a single covariate-vector-to-price mapping at a given time can simultaneously provide information about several such mappings. We develop an arm elimination approach which maintains a set of active prices at each time, where the set depends on the covariate vector of the current customer. The set is reduced over time by eliminating empirically suboptimal choices.

We analyze our approach both theoretically and empirically. 
We analyze regret against the following standard oracle: the policy that optimally chooses prices given the true coefficients in the parametric linear model, as well as the distribution of the error, but without knowledge of the exact valuation of each arriving customer. Regret of our policy scales as Õ(√n) with respect to the time horizon n, which is optimal. Further, it scales polynomially in the covariate dimension d, as well as in two smoothness parameters κ₁ and κ₂ defined as part of our model. In addition, we develop a scalable implementation of our approach which leverages a semi-parametric regression technique based on convex optimization. Our simulations show that this scalable policy performs well.

1.1 Related work

Non-contextual dynamic pricing. There is a significant literature on regret analysis of the dynamic pricing problem without covariates; see den Boer (2015) for a detailed survey. For example, the works Le Guen (2008); Broder and Rusmevichientong (2012); den Boer and Zwart (2013); den Boer (2014); Keskin and Zeevi (2014) consider a parametric model, whereas Kleinberg and Leighton (2003) consider a non-parametric model for the unknown demand function. Our methodology is most closely aligned with that of Kleinberg and Leighton (2003), in that we extend their techniques to incorporate side-information from the covariates.

Contextual dynamic pricing. Recently, the problem of dynamic pricing with high-dimensional covariates has garnered significant interest among researchers; see, e.g., Javanmard and Nazerzadeh (2019); Ban and Keskin (2019); Cohen et al. (2016b); Mao et al. (2018); Qiang and Bayati (2019); Nambiar et al. (2019). In summary, in contrast to the prior works in dynamic pricing with covariates, ours is the first work to address a setting where the only feedback from each transaction is binary and the residual uncertainty given covariates is non-parametric; see Table 1. 
We believe that these features are relevant to several online platforms implementing dynamic pricing with high-dimensional covariates, and thus our work bridges a gap between the state of the art in the academic literature and practical considerations.

Learning techniques: There is extensive prior work on high-dimensional contextual bandits, e.g., Langford and Zhang (2008); Slivkins (2011); Perchet and Rigollet (2013); Greenewald et al. (2017); Krishnamurthy et al. (2018); however, their techniques do not directly apply to our setup (in part due to the censored nature of feedback). Our work is also loosely related to work on learning and auctions, e.g., Amin et al. (2014); Morgenstern and Roughgarden (2016). We leverage a semi-parametric regression technique with binary feedback from Plan and Vershynin (2013) to reduce the computational complexity of our algorithm.

There are some similarities between our work and the literature on bandits with side information, e.g., Mannor and Shamir (2011); Alon et al. (2013); Caron et al. (2012); Cohen et al. (2016a); Lykouris et al. (2018). For example, in their work too there is free exploration, where testing one arm reveals the reward information for a subset of arms, where the subset may be a function of the chosen action. However, there are some crucial differences. In particular, these works assume (a) a discrete set of arms, (b) the existence of a sequence of graphs indexed by time (possibly fixed) with the arms

Table 1 (works compared): Kleinberg and Leighton (2003); Javanmard and Nazerzadeh (2019); Qiang and Bayati (2019); Cohen et al. (2016b); Mao et al. (2018); Ban and Keskin (2019); Nambiar et al. 
(2019); and our work, which is the only one to satisfy all three criteria below.

Table 1: This table compares our results with prior work along three dimensions: (1) incorporating contextual information; (2) modeling the distribution of residual uncertainty (given the context, where appropriate) as non-parametric; and (3) receiving only binary success/failure feedback from each transaction.

as its nodes, (c) the action involves pulling an arm, and at each time the reward at each neighbor of the pulled arm is revealed. However, in our setting, it is important to model the set of prices, and thus the set of covariate-vector-to-price mappings as described above, as a continuous set, since a constant error in price leads to linear regret. While in our DEEP-C policy we discretize the set of covariate-vector-to-price mappings into a finite set of arms (which scales with the time horizon), the above assumptions are still not met, due to the following. Each arm in our setting corresponds to a subset of prices/actions. The subset of arms for which the reward is revealed at time t depends on the covariate x_t and the exact price p_t from the above subset. Thus, the assumption of a pre-defined graph structure is not satisfied.

2 Preliminaries

In this section we first describe our model and then our objective, which is to minimize regret relative to a natural oracle policy.

2.1 Model

At each time t ∈ {1, 2, . . . , n}, we have a new user arrival with covariate vector X_t taking values in R^d for d ≥ 1. Throughout the paper all vectors are encoded as column vectors. The platform observes X_t upon the arrival of the user. The user's reservation value V_t ∈ R is modeled as

ln V_t = θ₀ᵀ X_t + Z′_t,    (1)

where θ₀ ∈ R^d is a fixed unknown parameter vector, and Z′_t for t ∈ {1, 2, . . . 
, n} captures the residual uncertainty in demand given covariates.

Similar to the linear model V_t = θ₀ᵀ X_t + Z′_t, this model is quite flexible in that linearity is a restriction only on the parameters, while the predictor variables themselves can be arbitrarily transformed. However, our formulation additionally has the feature that it ensures V_t > 0 for each t, a key practical consideration. We conjecture that unlike our model, the linear model V_t = θ₀ᵀ X_t + Z′_t does not admit a learning algorithm with Õ(√n) regret. This is due to the censored nature of feedback, the structure of revenue as a function of price, and our non-parametric assumption on the distribution of Z′_t as described below. Also, exponential sensitivity of the valuation with respect to covariate magnitudes can be avoided by using a logarithmic transformation of the covariates themselves. More generally, one may augment our approach with a machine learning algorithm which learns an appropriate transformation to fit the data well. In this paper, however, we focus on the valuation model given by (1).

Equivalently to (1), we have

V_t = e^{θ₀ᵀ X_t} Z_t,

where Z_t = e^{Z′_t}. Thus, Z_t > 0 for each t.

The platform sets price p_t, upon which the user buys the product if V_t ≥ p_t. Without loss of generality, we will assume the setting where users buy the product; one can equivalently derive exactly the same results in a setting where users are sellers, and sell the product if V_t ≤ p_t. The revenue/reward at time t is p_t Y_t, where Y_t = 1{V_t ≥ p_t}. We assume that p_t is (X_1, . . . , X_{t−1}, X_t, Y_1, . . . , Y_{t−1}, U_1, . . . , U_t)-measurable, where U_t for each t ≥ 1 is an auxiliary Uniform[0, 1] random variable independent of the sources of randomness in the past. 
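As a concrete illustration, one round of this transaction model can be sketched in code. This is a minimal sketch: the helper name is hypothetical, and any particular values of the covariate, noise, and price below are placeholders rather than assumptions of the model.

```python
import math

def simulate_round(theta0, x, z, price):
    """One transaction of the model: valuation V = exp(theta0^T x) * Z,
    binary feedback Y = 1{V >= price}, realized revenue = price * Y."""
    v = math.exp(sum(a * b for a, b in zip(theta0, x))) * z
    y = 1 if v >= price else 0
    return y, price * y

# With theta0 = (0,) and x = (0,), the valuation is V = z exactly,
# so a price at or below z results in a sale.
y, rev = simulate_round([0.0], [0.0], 0.5, 0.4)
```

Only the pair (Y_t, p_t Y_t) is revealed to the seller here; the draw z itself stays hidden, which is exactly the censored-feedback structure described above.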
In other words, the platform does not know the future, but it can use randomized algorithms which may leverage past covariates, the current covariate, and binary feedback from the past.

The goal of the platform is to design a pricing policy {p_t}_{t ∈ {1,...,n}} to maximize the total reward

Λ_n = Σ_{t=1}^n Y_t p_t.

In this paper we are interested in the performance characterization of optimal pricing policies as the time horizon n grows large.

We make the following assumption on the statistics of X_t and Z_t.

A1 We assume that {X_t}_t and {Z_t}_t are i.i.d. and mutually independent. Their distributions are unknown to the platform. Their supports X and Z are compact and known. In particular, we assume that X ⊂ [−1/2, 1/2]^d and Z is an interval in [0, 1].

A1 can be significantly relaxed, as we discuss in Appendix E (both in terms of the i.i.d. distribution of random variables, and the compactness of their supports).

A2 The unknown parameter vector θ₀ lies within a known, connected, compact set Θ ⊂ R^d. In particular, Θ ⊂ [0, 1]^d.

It follows from A1 and A2 that we can compute reals 0 < α₁ < α₂ such that for all (z, x, θ) ∈ Z × X × Θ we have

α₁ ≤ z e^{θᵀx} ≤ α₂.

Thus, the valuation at each time is known to be in the set [α₁, α₂], and in turn the platform may always choose price from this set. Note also that, since Z ⊂ [0, 1], for each (x, θ) ∈ X × Θ, we have that α₁ ≤ e^{θᵀx}.

2.2 The oracle and regret

It is common in multiarmed bandit problems to measure the performance of an algorithm against a benchmark, or Oracle, which may have more information than the platform, and for which the optimal policy is easier to characterize. 
Likewise, we measure the performance of our algorithm against the following Oracle; recall that α₁ ≤ z e^{θᵀx} ≤ α₂, so prices may always be chosen from [α₁, α₂].

Definition 1 The Oracle knows the true value of θ₀ and the distribution of Z_t.

Now, let

F(z) = z P(Z₁ ≥ z).

The following proposition is easy to show, so the proof is omitted.

Proposition 1 The following pricing policy is optimal for the Oracle: At each time t set price p_t = z⋆ e^{θ₀ᵀ X_t}, where z⋆ = arg sup_z F(z).

Clearly, the total reward obtained by the Oracle with this policy, denoted Λ⋆_n, satisfies E[Λ⋆_n] = n z⋆ E[e^{θ₀ᵀ X₁}].

Our goal: Regret minimization. Given a feasible policy, define the regret against the Oracle as

R_n = Λ⋆_n − Λ_n.

Our goal in this paper is to design a pricing policy which minimizes E[R_n] asymptotically to leading order in n.

2.3 Smoothness Assumption

In addition to A1 and A2, we make a smoothness assumption described below. Let

r(z, θ) = z E[ e^{θᵀX₁} 1{ Z₁ e^{θ₀ᵀX₁} > z e^{θᵀX₁} } ],

which can be thought of as the expected revenue of a single transaction when the platform sets price p = z e^{θᵀx} after observing a covariate X = x. We impose the following assumption on r(z, θ).

A3 Let θ^(l) be the lth component of θ, i.e., θ = (θ^(l) : 1 ≤ l ≤ d). We assume that there exist κ₁, κ₂ > 0 such that for each z ∈ Z and θ ∈ Θ we have

κ₁ max{ (z⋆ − z)², max_{1≤l≤d} (θ₀^(l) − θ^(l))² } ≤ r(z⋆, θ₀) − r(z, θ) ≤ (κ₂ / (d + 1)) ‖(z⋆ − z, θ₀ − θ)‖²,

where ‖(z, θ)‖² = z² + Σ_{l=1}^d (θ^(l))².

Recall that F(z) = z P(Z₁ ≥ z). 
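As a numerical illustration of Proposition 1, z⋆ can be approximated by maximizing F over a grid whenever the distribution of Z is known; the helper name and the Uniform[0, 1] survival function below are purely illustrative choices, not assumptions of the model.

```python
def oracle_z_star(survival, grid):
    """Approximate z* = arg max_z F(z), where F(z) = z * P(Z >= z),
    by grid search; `survival` is the survival function of Z."""
    return max(grid, key=lambda z: z * survival(z))

# For Z ~ Uniform[0, 1]: F(z) = z (1 - z), maximized at z* = 1/2.
grid = [i / 1000 for i in range(1001)]
z_star = oracle_z_star(lambda z: 1.0 - z, grid)
# The Oracle would then post p_t = z_star * exp(theta0^T X_t) at each time t.
```

In the model itself the distribution of Z is unknown, which is precisely why z⋆ must be learned alongside θ₀.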
It follows from A1 and conditioning on X₁ that

r(z, θ) = E[ e^{θ₀ᵀX₁} F( e^{(θ − θ₀)ᵀX₁} z ) ].

We will use this representation throughout our development.

Note that A3 subsumes that (z⋆, θ₀) is the unique optimizer of r(z, θ). This is true if z⋆ is the unique maximizer of F(z) and θ₀ is identifiable in the parameter space Θ.

Below we will also provide sufficient conditions for A3 to hold. In particular, we develop sufficient conditions which are a natural analog of the assumptions made in Kleinberg and Leighton (2003).

2.4 Connection to assumptions in Kleinberg and Leighton (2003)

The 'stochastic valuations' model considered in Kleinberg and Leighton (2003) is equivalent to our model with no covariates, i.e., with d = 0. When d = 0, the revenue function r(z, θ) is equal to F(z). In Kleinberg and Leighton (2003) it is assumed that {Z_t} are i.i.d., and that F(z) has bounded support. Clearly A1 and A2 are a natural analog to these assumptions. They also assume that F(z) has a unique optimizer, and is locally concave at the optimal value, i.e., F″(z⋆) < 0. We show below that a natural analog of these conditions is sufficient for A3 to hold.

Suppose that (z⋆, θ₀) is the unique optimizer of r(z, θ). Also suppose that A1 and A2 hold. Then A3 holds if r(z, θ) is strictly locally concave at (z⋆, θ₀), i.e., if the Hessian of r(z, θ) at (z⋆, θ₀) exists and is negative definite. To see why this is the case, note that strict local concavity at (z⋆, θ₀) implies that there exists an ε > 0 such that the assumption holds for each (z, θ) ∈ B_ε(z⋆, θ₀), where B_ε(z⋆, θ₀) is the (d + 1)-dimensional ball with center (z⋆, θ₀) and radius ε. 
This, together with compactness of X and Θ, implies A3.

It is somewhat surprising that to incorporate covariates in a setting where F is non-parametric, only minor modifications are needed relative to the assumptions in Kleinberg and Leighton (2003). For completeness, in the Appendix we provide a class of examples for which it is easy to check that the Hessian is indeed negative definite and that all our assumptions are satisfied.

3 Pricing policies

Any successful algorithm must set prices to balance price exploration to learn (θ₀, z⋆) with exploitation to maximize revenue. Because prices are adaptively controlled, the outputs (Y_t : t = 1, 2, . . . , n) will not be conditionally independent given the covariates (X_t : t = 1, 2, . . . , n), as is typically assumed in semi-parametric regression with binary outputs (e.g., see Plan and Vershynin (2013)). This issue is referred to as price endogeneity in the pricing literature.

We address this problem by first designing our own bandit-learning policy, Dynamic Experimentation and Elimination of Prices with Covariates (DEEP-C), which uses only a basic statistical learning technique that dynamically eliminates sub-optimal values of (θ, z) by employing confidence intervals. At first glance, such a learning approach seems to suffer from the curse of dimensionality, in terms of both sample complexity and computational complexity. As we will see, our DEEP-C algorithm yields low sample complexity by cleverly exploiting the structure of our semi-parametric model. We then address computational complexity by presenting a variant of our policy which incorporates sparse semi-parametric regression techniques.

The rest of the section is organized as follows. We first present the DEEP-C policy. 
We then discuss three variants: (a) DEEP-C with Rounds, a slight variant of DEEP-C which is a bit more complex to implement but simpler to analyze theoretically, and thus enables us to obtain Õ(√n) regret bounds; (b) Decoupled DEEP-C, which decouples the estimation of θ₀ and z⋆ and thus allows us to leverage low-complexity sparse semi-parametric regression to estimate θ₀, but at the cost of O(n^{2/3}) regret; and (c) Sparse DEEP-C, which combines DEEP-C and sparse semi-parametric regression to achieve low complexity without decoupling, achieving the best of both worlds. We provide a theoretical analysis of the first variant, and use simulation to study the others.

While we discuss below the key ideas behind these three variants, their formal definitions are provided in Appendix B to save space.

3.1 DEEP-C policy

We now describe DEEP-C. As noted in Proposition 1, the Oracle achieves optimal performance by choosing at each time a price p_t = z⋆ e^{θ₀ᵀX_t}, where z⋆ is the maximizer of F(z). We view the problem as a multi-armed bandit in the space Z × Θ. Viewed this way, before the context at time t arrives, the decision maker must choose a value z ∈ Z and a θ ∈ Θ. Once X_t arrives, the price p_t = z e^{θᵀX_t} is set, and revenue is realized. Through this lens, we can see that the Oracle is equivalent to pulling the arm (z⋆, θ₀) at every t in the new multi-armed bandit we have defined. DEEP-C is an arm-elimination algorithm for this multi-armed bandit.

From a learning standpoint, the goal is to learn the optimal (z⋆, θ₀), which at first sight seems to suffer from the curse of dimensionality. 
However, we observe that in fact our problem allows for "free exploration" that lets us learn efficiently in this setting; in particular, given X_t, for each choice of price p_t we simultaneously obtain information about the expected revenue for a range of pairs (z, θ). This is specifically because we observe the context X_t, and because of the particular structure of demand that we consider. However, to ensure that each candidate arm (z, θ) has sufficiently high probability of being pulled at any time step, DEEP-C selects prices at random from a set of active prices, and ensures that this set is kept small via arm elimination. The speedup in learning thus afforded enables us to obtain low regret.

Formally, our procedure is defined as follows. We partition the support of Z₁ into intervals of length n^{−1/4}. If the boundary sets are smaller, we enlarge the support slightly (by an amount less than n^{−1/4}) so that each interval is of equal length n^{−1/4}. Let the corresponding intervals be Z₁, . . . , Z_k, and their centroids be ζ₁, . . . , ζ_k, where k is less than or equal to n^{1/4}. Similarly, for l = 1, 2, . . . , d, we partition the projection of the support of θ₀ onto the lth dimension into k_l intervals of equal length n^{−1/4}, with sets Θ^(l)_1, . . . , Θ^(l)_{k_l} and centroids θ^(l)_1, . . . , θ^(l)_{k_l}. Again, if the boundary sets are smaller, we enlarge the support so that each interval is of equal length n^{−1/4}.

Our algorithm keeps a set of active (z, θ) ⊂ Z × Θ and eliminates those for which we have sufficient evidence of being far from (z⋆, θ₀). We let A(t) ⊂ {1, . . . , k}^{d+1} represent a set of active cells, where a cell represents a tuple (i, j₁, . . . , j_d). Then ∪_{(i,j₁,...,j_d) ∈ A(t)} Z_i × Π_{l=1}^d Θ^(l)_{j_l} represents the set of active (z, θ) pairs. 
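The free-exploration effect can be made concrete with a small sketch: given a covariate x, each grid cell induces an interval of log-prices ln p = ln z + θᵀx, and a single posted price yields feedback for every cell whose interval contains it. The cell encoding below (a z-interval plus per-coordinate θ-intervals) is a hypothetical illustration, not the paper's notation.

```python
import math

def cell_log_price_interval(z_iv, theta_ivs, x):
    """Range of ln p = ln z + theta^T x over a cell with z in z_iv = (z_lo, z_hi)
    and theta_l in theta_ivs[l] = (lo_l, hi_l), for a fixed covariate x."""
    lo, hi = math.log(z_iv[0]), math.log(z_iv[1])
    for (t_lo, t_hi), xl in zip(theta_ivs, x):
        lo += min(t_lo * xl, t_hi * xl)  # covariate components may be negative
        hi += max(t_lo * xl, t_hi * xl)
    return lo, hi

def checked_cells(cells, x, price):
    """Names of cells (name -> (z_iv, theta_ivs)) whose log-price interval
    at covariate x contains the posted price."""
    lp = math.log(price)
    checked = []
    for name, (z_iv, theta_ivs) in cells.items():
        lo, hi = cell_log_price_interval(z_iv, theta_ivs, x)
        if lo <= lp <= hi:
            checked.append(name)
    return checked
```

A single posted price can thus fall in the price sets of many cells simultaneously, which is exactly the free exploration the policy exploits.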
Here, A(1) contains all cells.

At each time t we have a set of active prices, which depends on X_t and A(t), i.e.,

P(t) = { p : ∃ (z, θ) ∈ ∪_{(i,j₁,...,j_d) ∈ A(t)} Z_i × Π_{l=1}^d Θ^(l)_{j_l} s.t. ln p = ln z + θᵀX_t }.

At time t we pick a price p_t from P(t) uniformly at random. We say that cell (i, j₁, . . . , j_d) is checked if p_t ∈ P_{i,j₁,...,j_d}(t), where

P_{i,j₁,...,j_d}(t) = { p : ∃ z ∈ Z_i, ∃ θ ∈ Π_{l=1}^d Θ^(l)_{j_l} s.t. ln p = ln z + θᵀX_t }.

Each price selection checks one or more cells (i, j₁, . . . , j_d).

Recall that the reward generated at time t is Y_t p_t. Let T_t(i, j₁, . . . , j_d) be the number of times cell (i, j₁, . . . , j_d) is checked until time t, and let S_t(i, j₁, . . . , j_d) be the total reward obtained at these times. Let

µ̂_t(i, j₁, . . . , j_d) = S_t(i, j₁, . . . , j_d) / T_t(i, j₁, . . . , j_d).

We also compute confidence bounds for µ̂_t(i, j₁, . . . , j_d), as follows. Fix γ > 0. For each active (i, j₁, . . . , j_d), let

u_t(i, j₁, . . . , j_d) = µ̂_t(i, j₁, . . . , j_d) + √( γ / T_t(i, j₁, . . . , j_d) )

and

l_t(i, j₁, . . . , j_d) = µ̂_t(i, j₁, . . . , j_d) − √( γ / T_t(i, j₁, . . . , j_d) ).

These represent the upper and lower confidence bounds, respectively.

We eliminate (i, j₁, . . . , j_d) ∈ A(t) from A(t + 1) if there exists (i′, j′₁, . . . , j′_d) ∈ A(t) such that

u_t(i, j₁, . . . , j_d) < l_t(i′, j′₁, . . . , j′_d).

3.2 Variants of DEEP-C

DEEP-C with Rounds: Theoretical analysis of regret for arm-elimination algorithms typically involves tracking the number of times each sub-optimal arm is pulled before being eliminated. However, this is challenging in our setting, since the set of arms which get "pulled" at an offered price depends on the covariate vector at that time. 
To resolve this challenge, we consider a variant where the algorithm operates in rounds, as follows.

Within a round the set of active cells remains unchanged. Further, we ensure that within each round each arm in the active set is pulled at least once. For our analysis, we keep track of only the first time an arm is pulled in each round, and ignore the rest. While this may seem wasteful, a surprising aspect of our analysis is that the regret cost incurred by this form of exploration is only poly-logarithmic in n. Further, since the number of times each arm is "explored" in each round is exactly one, theoretical analysis now becomes tractable. For formal definitions of this policy and also of the policies below, we refer the reader to Appendix B.

Decoupled DEEP-C: We now present a policy which has low computational complexity under sparsity and which does not suffer from price endogeneity, but may incur higher regret. At times t = 1, 2, . . . , τ, the price is set independently and uniformly at random from a compact set. This ensures that the outputs (Y_t : t = 1, 2, . . . , τ) are conditionally independent given the covariates (X_t : t = 1, 2, . . . , τ), i.e., there is no price endogeneity. We then use a low-complexity semi-parametric regression technique from Plan and Vershynin (2013) to estimate θ₀ under a sparsity assumption. With an estimate of θ₀ in place, at times t = τ + 1, . . . , n, we use a one-dimensional version of DEEP-C to simultaneously estimate z⋆ and maximize revenue. The best possible regret achievable with this policy is Õ(n^{2/3}), achieved when τ is O(n^{2/3}) (Plan and Vershynin, 2013).

Sparse DEEP-C: This policy also leverages sparsity, but without decoupling estimation of θ₀ from estimation of z⋆ and revenue maximization. 
At each time t, using the data collected in the past, we estimate θ₀ via the semi-parametric regression technique from Plan and Vershynin (2013). Using this estimate of θ₀, the estimates of rewards for different values of z from samples collected in the past, and the corresponding confidence bounds, we obtain a set of active prices at each time, similar to that of DEEP-C, from which the price is picked at random.

While Sparse DEEP-C suffers from price endogeneity, with an appropriate choice of γ we conjecture that its cost in terms of expected regret can be made poly-logarithmic in n; proving this result remains an important open direction. The intuition for this comes from our theoretical analysis of DEEP-C with Rounds and the following observation: even though the set of active prices may be different at different times, we still choose prices at random, and prices are eliminated only upon reception of sufficient evidence of suboptimality. We conjecture that these features are sufficient to ensure that the error in the estimate of θ₀ is kept small with high probability. Our simulation results indeed show that this algorithm performs relatively well.

4 Regret analysis

The main theoretical result of this paper is the following. The regret bound below is achieved by DEEP-C with Rounds as defined in Section 3.2. For its proof see Appendix C.

Theorem 1 Under A1, A2, and A3, the expected regret under policy DEEP-C with Rounds with γ = max( 10 α₂² log n, 4 (κ₂² / κ₁²) log n ) satisfies

E[R_n] ≤ 16000 (α₂² κ₂^{3/2} / (α₁² κ₁²)) γ^{3/4} d^{11/4} n^{1/2} log^{7/4} n + 5 α₂.

First, note that the above scaling is optimal w.r.t. n (up to polylogarithmic factors), as even for the case where X_t = 0 w.p. 1 
it is known that achieving o(√n) expected regret is not possible (see Kleinberg and Leighton (2003)).

Second, we state our results with explicit dependence on various parameters discussed in our assumptions in order for the reader to track the ultimate dependence on the dimension d. Note that, as d scales, the supports Θ and X, and the distribution of X, may change. In turn, the parameters α₁, α₂, κ₁, and κ₂, which are constants for a given d, may scale as d scales. These scalings need to be computed case by case, as they depend on how one models the changes in Θ and X. Below we discuss briefly how these may scale in practice.

Recall that α₁ and α₂ are bounds on z e^{θᵀx}, namely, the user valuations. Thus, it is meaningful to postulate that α₁ and α₂ do not scale with covariate dimension, as the role of covariates is to aid prediction of user valuations and not to change them. For example, one may postulate that θ₀ is "sparse", i.e., the number of non-zero coordinates of θ₀ is bounded from above by a known constant, in which case α₁ and α₂ do not scale with d. Dependence of κ₁ and κ₂ on d is more subtle, as they may depend on the details of the modeling assumptions. For example, their scaling may depend on the scaling of the difference between the largest and second-largest values of r(z, θ). One of the virtues of Theorem 1 is that it succinctly characterizes the scaling of regret via a small set of parameters.

Finally, the above result can be viewed through the lens of sample complexity. The arguments used in Lemma 1 and in the derivation of equation (4) imply that the sample complexity is "roughly" O(log(1/δ)/ε²). More precisely, suppose that at a covariate vector x, we set the price p(x). 
We say the mapping p is probably approximately revenue optimal if for any x the difference between the achieved revenue and the optimal revenue is at most ε with probability at least 1 − δ. The number of samples m required to learn such a policy satisfies m polylog(m) ≤ (log(1/δ)/ε²) f(d, α₁, α₂, κ₁, κ₂), where f(·) is a polynomial function.

5 Simulation Results

Simulation setup: First, we simulate our model with covariate dimension d = 2, where covariate vectors are i.i.d. d-dimensional standard normal random vectors, the parameter space is Θ = [0, 1]^d, the parameter vector is θ₀ = (1/√2, 1/√2), the noise support is Z = [0, 1], and the noise distribution is Z ∼ Uniform([0, 1]).

(a) DEEP-C, d = 2. (b) DEEP-C variants, d = 2. (c) DEEP-C variants, d = 100.
Figure 1: Regret comparison of the policies.

Note that even though we assumed that the covariate distribution has bounded support for ease of analysis, our policies do not assume that. Hence, we are able to use a covariate distribution with unbounded support in our simulations. In this setting, we simulate the policies DEEP-C, Decoupled DEEP-C, and Sparse DEEP-C for time horizon n = 10,000 and for different values of the parameter γ. Each policy is simulated 5,000 times for each set of parameters.

Next, we also simulate our model for d = 100 with s = 4 non-zero entries in θ₀, with each non-zero entry equal to 1/√s; each policy is simulated 1,500 times for each set of parameters, with the rest of the setup being the same as earlier. For this setup, we only simulate Decoupled DEEP-C and Sparse DEEP-C, as the computational complexity of DEEP-C does not scale well with d.

Main findings: First, we find that the performance of each policy is sensitive to the choice of γ, and that the range of γ where expected regret is low may be different for different policies. 
The expected regret typically increases as the tuning parameter increases; however, its variability typically decreases. This is similar to the usual bias-variance tradeoff in learning problems. For our setup with d = 2, the reward of the Oracle concentrates at around 4,150. As Figure 1 shows, each policy performs well in the plotted range of the tuning parameter.
We find that the main metric on which the performance of the policies is differentiated is in fact the high quantiles of the regret distribution. For example, while the expected regret of DEEP-C at parameter value 2.2 and those of Decoupled DEEP-C and Sparse DEEP-C at parameter value 7 are all roughly the same, the 98th percentile of the regret distribution under DEEP-C and under Sparse DEEP-C is 13% and 24% lower, respectively, than under Decoupled DEEP-C.
For our setup with d = 100, while Decoupled DEEP-C and Sparse DEEP-C perform similarly in average regret, we find that Sparse DEEP-C significantly outperforms Decoupled DEEP-C in standard deviation and in the 95th percentile. In particular, the 95th percentile of regret under Sparse DEEP-C is 33% lower than that under Decoupled DEEP-C.

6 Acknowledgments

This work was supported in part by National Science Foundation Grants DMS-1820942, DMS-1838576, CNS-1544548, and CNS-1343253. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We would like to thank Linjia Wu for reading and checking our proofs.

References

Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y. (2013). From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems 26, pages 1610–1618.

Amin, K., Rostamizadeh, A., and Syed, U. (2014). Repeated contextual auctions with strategic buyers. In Advances in Neural Information Processing Systems, pages 622–630.

Ban, G.-Y. and Keskin, N. B. (2019).
Personalized dynamic pricing with machine learning.

Broder, J. and Rusmevichientong, P. (2012). Dynamic pricing under a general parametric choice model. Operations Research, 60(4):965–980.

Caron, S., Kveton, B., Lelarge, M., and Bhagat, S. (2012). Leveraging side observations in stochastic bandits. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI'12.

Cohen, A., Hazan, T., and Koren, T. (2016a). Online learning with feedback graphs without the graphs. In Proceedings of The 33rd International Conference on Machine Learning, pages 811–819.

Cohen, M. C., Lobel, I., and Paes Leme, R. (2016b). Feature-based dynamic pricing. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC '16.

den Boer, A. V. (2014). Dynamic pricing with multiple products and partially specified demand distribution. Mathematics of Operations Research, 39(3):863–888.

den Boer, A. V. (2015). Dynamic pricing and learning: Historical origins, current research, and new directions.

den Boer, A. V. and Zwart, B. (2013). Simultaneously learning and optimizing using controlled variance pricing. Management Science, 60(3):770–783.

Frahm, G. (2004). Generalized elliptical distributions: theory and applications. PhD thesis, Universität zu Köln.

Greenewald, K., Tewari, A., Murphy, S., and Klasnja, P. (2017). Action centered contextual bandits. In Advances in Neural Information Processing Systems, pages 5977–5985.

Javanmard, A. and Nazerzadeh, H. (2019). Dynamic pricing in high-dimensions. Journal of Machine Learning Research.

Keskin, N. B. and Zeevi, A. (2014). Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Operations Research, 62(5):1142–1167.

Kleinberg, R. and Leighton, T. (2003). The value of knowing a demand curve: Bounds on regret for online posted-price auctions.
In IEEE Symposium on Foundations of Computer Science.

Krishnamurthy, A., Wu, Z. S., and Syrgkanis, V. (2018). Semiparametric contextual bandits. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR.

Langford, J. and Zhang, T. (2008). The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems.

Le Guen, T. (2008). Data-driven pricing. Master's thesis, Massachusetts Institute of Technology.

Lykouris, T., Sridharan, K., and Tardos, É. (2018). Small-loss bounds for online learning with partial information. In Proceedings of the 31st Conference on Learning Theory, pages 979–986.

Mannor, S. and Shamir, O. (2011). From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems 24, pages 684–692.

Mao, J., Leme, R., and Schneider, J. (2018). Contextual pricing for Lipschitz buyers. In Advances in Neural Information Processing Systems, pages 5643–5651.

Morgenstern, J. and Roughgarden, T. (2016). Learning simple auctions. In Annual Conference on Learning Theory, pages 1298–1318.

Nambiar, M., Simchi-Levi, D., and Wang, H. (2019). Dynamic learning and pricing with model misspecification. Management Science.

Perchet, V. and Rigollet, P. (2013). The multi-armed bandit problem with covariates. The Annals of Statistics, pages 693–721.

Plan, Y. and Vershynin, R. (2013). Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494.

Qiang, S. and Bayati, M. (2019). Dynamic pricing with demand covariates.

Slivkins, A. (2011). Contextual bandits with similarity information.
In Annual Conference on Learning Theory.