{"title": "Low-Rank Bandit Methods for High-Dimensional Dynamic Pricing", "book": "Advances in Neural Information Processing Systems", "page_first": 15468, "page_last": 15478, "abstract": "We consider dynamic pricing with many products under an evolving but low-dimensional demand model. Assuming the temporal variation in cross-elasticities exhibits low-rank structure based on fixed (latent) features of the products, we show that the revenue maximization problem reduces to an online bandit convex optimization with side information given by the observed demands. We design dynamic pricing algorithms whose revenue approaches that of the best fixed price vector in hindsight, at a rate that only depends on the intrinsic rank of the demand model and not the number of products. Our approach applies a bandit convex optimization algorithm in a projected low-dimensional space spanned by the latent product features, while simultaneously learning this span via online singular value decomposition of a carefully-crafted matrix containing the observed demands.", "full_text": "Low-Rank Bandit Methods for High-Dimensional\n\nDynamic Pricing\n\nJonas Mueller\nMIT CSAIL\n\nVasilis Syrgkanis\nMicrosoft Research\n\nMatt Taddy\nChicago Booth\n\njonasmueller@csail.mit.edu\n\nvasy@microsoft.com\n\ntaddy@chicagobooth.edu\n\nAbstract\n\nWe consider dynamic pricing with many products under an evolving but low-\ndimensional demand model. Assuming the temporal variation in cross-elasticities\nexhibits low-rank structure based on \ufb01xed (latent) features of the products, we\nshow that the revenue maximization problem reduces to an online bandit convex\noptimization with side information given by the observed demands. We design\ndynamic pricing algorithms whose revenue approaches that of the best \ufb01xed price\nvector in hindsight, at a rate that only depends on the intrinsic rank of the demand\nmodel and not the number of products. 
Our approach applies a bandit convex optimization algorithm in a projected low-dimensional space spanned by the latent product features, while simultaneously learning this span via online singular value decomposition of a carefully-crafted matrix containing the observed demands.\n\n1 Introduction\n\nIn this work, we consider a seller offering $N$ products, where $N$ is large, and the pricing of certain products may influence the demand for others in unknown ways. We let $p_t \\in \\mathbb{R}^N$ denote the vector of selected prices at which each product is sold during time period $t \\in \\{1, \\dots, T\\}$, which results in total demands for the products over this period represented in the vector $q_t \\in \\mathbb{R}^N$. Note that $q_t$ represents a (noisy) evaluation of the aggregate demand curve at the chosen prices $p_t$, but we never observe the counterfactual demand that would have resulted had we selected a different price-point. This is referred to as bandit feedback in the online optimization literature [Dani et al., 2007]. Our goal is to find a setting of the prices for each time period to maximize the total revenue of the seller (over all rounds). This is equivalent to minimizing the negative revenue over time:\n\n$$R(p_1, \\dots, p_T) = \\sum_{t=1}^{T} R_t(p_t) \\quad \\text{where } R_t(p_t) = -\\langle q_t, p_t \\rangle$$\n\nWe can alternatively maximize total profits instead of revenue by simply redefining $p_t$ as the difference between the product-prices and the cost of each product-unit. In practice, the seller can only consider prices within some constraint set $S \\subset \\mathbb{R}^N$, which we assume is convex throughout. To find the optimal prices, we introduce the following linear model of the aggregate demands, which is allowed to change over time in a nonstationary fashion:\n\n$$q_t = c_t - B_t p_t + \\epsilon_t \\tag{1}$$\n\nHere, $c_t \\in \\mathbb{R}^N$ denotes the baseline demand for each product in round $t$. $B_t \\in \\mathbb{R}^{N \\times N}$ is an asymmetric matrix of demand elasticities which represents how changing the price of one product may affect the demand of not only this product, but also demand for other products as well. By conventional economic wisdom, $B_t$ will have the largest entries along its diagonal because demand for a product is primarily driven by its price rather than the price of other possibly unrelated products. Since a price increase usually leads to falling demand, it is reasonable to assume all $B_t \\succeq 0$ are positive-definite (but not necessarily Hermitian), which implies that at each round: $R_t$ is a convex function of $p_t$. The observed aggregate demands over each time period are additionally subject to random fluctuations driven by the noise term $\\epsilon_t \\in \\mathbb{R}^N$. Throughout, we suppose the noise in each round $\\epsilon_t$ is sampled i.i.d. from some mean-zero distribution with finite variance. The classic analysis of Houthakker and Taylor [1970] established that historical demand data often nicely fit a linear relationship. A wealth of past work on dynamic pricing has also posited linear demand models, although most prior research has not considered settings where the underlying model is changing over time [Keskin and Zeevi, 2014, Besbes and Zeevi, 2015, Cohen et al., 2016, Javanmard and Nazerzadeh, 2016, Javanmard, 2017].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nUnlike standard statistical approaches to this problem which rely on stationarity, we suppose $c_t$, $B_t$ may change every round and are possibly chosen adversarially. This consideration is particularly important in dynamic markets where the seller faces new competitors and consumers with ever-changing preferences who are actively seeking out the cheapest prices for products [Witt, 1986]. Our goal is to select prices $p_1, \\dots, p_T$ which minimize the expected regret $\\mathbb{E}[R(p_1, \\dots, p_T) - R(p^*, \\dots, p^*)]$ compared to always selecting the single best configuration of prices $p^* = \\operatorname{argmin}_{p \\in S} \\mathbb{E} \\sum_{t=1}^{T} R_t(p)$ chosen in hindsight after the revenue functions $R_t$ have all been revealed.\n\nLow regret algorithms ensure that in the case of a stationary underlying model, our chosen prices quickly converge to the optimal choice, and in nonstationary settings, our pricing procedure will naturally adapt to the intrinsic difficulty of the dynamic revenue-optimization problem [Shalev-Shwartz, 2011]. While low (i.e. $o(T)$) regret is achievable using algorithms for online convex optimization with bandit feedback, the regret of existing methods is bounded below by $\\Omega(\\sqrt{N})$, which is undesirably large when one is dealing with a vast number of products [Dani et al., 2007, Shalev-Shwartz, 2011, Flaxman et al., 2005]. To attain better bounds, we adopt a low-rank structural assumption that the variation in demands changes over time only due to $d \\ll N$ underlying factors. Under this setting, we develop algorithms whose regret depends only on $d$ rather than $N$ by combining existing bandit methods with low-dimensional projections selected via online singular value decomposition. As far as we are aware, our main result (Theorem 3) is the first online bandit optimization algorithm whose regret provably does not scale with the action-space dimensionality.\n\nAppendix D provides a glossary of notation used in this paper, and all proofs of our theorems are relegated to Appendix A. 
Throughout, $C$ denotes a universal constant, whose value may change from line to line (but never depends on problem-specific constants such as $T$, $d$, $r$).\n\n2 Related Work\n\nWhile bandit optimization has been successfully applied to dynamic pricing, research in this area has been primarily restricted to stationary settings [Kleinberg and Leighton, 2003, Besbes and Zeevi, 2009, den Boer and Bert, 2013, Keskin and Zeevi, 2014, Cohen et al., 2016, Misra et al., 2017]. Most similar to our work, Javanmard [2017] recently developed a bandit pricing strategy that presumes demand depends linearly on prices and product-specific features. High-dimensional dynamic pricing was also addressed by Javanmard and Nazerzadeh [2016] using sparse maximum likelihood. However, due to their reliance on stationarity, these approaches are less robust under evolving/adversarial environments compared with online optimization [Bubeck and Slivkins, 2012].\n\nBeyond pricing, existing algorithms that combine bandits with subspace estimation [Gopalan et al., 2016, Djolonga et al., 2013, Sen et al., 2017] are solely designed for stationary (stochastic) settings rather than general online optimization (where the reward functions can vary adversarially over time). While the field of online bandit optimization has seen many advances since the pioneering work of Flaxman et al. [2005], none of the recent improvements guarantees regret that is independent of the action-space dimension [Hazan and Levy, 2014, Bubeck et al., 2017]. To our knowledge, Hazan et al. [2016a] is the only prior work to present online optimization algorithms whose regret depends on an intrinsic low rank structure rather than the ambient dimensionality. However, their approach for online learning with experts is not suited for dynamic pricing since it is restricted to settings with: full-information (rather than bandit feedback), linear and noise-free (or stationary) reward functions, and actions that are specially constrained within the probability-simplex.\n\n3 Low Rank Demand Model\n\nWe now introduce a special case of model (1) in which both $c_t$ and $B_t$ display low-rank changes over time. In practice, each product $i$ may be described by some vector of features $u_i \\in \\mathbb{R}^d$ (with $d \\ll N$), which determine the similarity between products as well as their baseline demands. A natural method to gauge similarity between products $i$ and $j$ is via their inner product $\\langle u_i, u_j \\rangle_V = u_i^T V u_j$ under some linear transformation of the feature-space given by $V \\succeq 0$. For example, $u_i$ might be a binary vector indicating that product $i$ falls into certain product-categories (where the number of categories $d$ is far less than the number of products $N$), and $V$ might be a diagonal matrix specifying the cross-elasticity of demand within each product category. In this example, $u_i^T V u_j \\cdot p_j$ would thus be the marginal effect on the demand for product $i$ that results from selecting $p_j$ as the price for product $j$. Many recommender systems also assume products can be described using low-dimensional latent features that govern their desirability to consumers [Zhao et al., 2016, Sen et al., 2017].\n\nBy introducing time-varying metric transformations $V_t$, our model allows these product-similarities to evolve over time. 
Encoding the features $u_i$ that represent each product as rows in a matrix $U \\in \\mathbb{R}^{N \\times d}$, we assume the following demand model, in which the temporal variation naturally exhibits low-rank structure:\n\n$$q_t = U z_t - U V_t U^T p_t + \\epsilon_t \\tag{2}$$\n\nHere, the $\\epsilon_t \\in \\mathbb{R}^N$ again reflect statistical noise in the observed demands, the $z_t \\in \\mathbb{R}^d$ explain the variation in baseline demand over time, and the (asymmetric) matrices $V_t \\in \\mathbb{R}^{d \\times d}$ specify latent changes in the demand-price relationship over time. Under this model, the aggregate demand for product $i$ at time $t$ is governed by the prices of all products, weighted by their current feature-similarity to product $i$. To ensure our revenue-optimization remains convex, we restrict the adversary to choices that satisfy $V_t \\succeq 0$ for all $t$. Note that while the structural variation in our model is assumed to be low-rank, the noise in the observed demands may be intrinsically $N$-dimensional. In each round, $p_t$ and $q_t$ are the only quantities observed, while $\\epsilon_t$, $z_t$, $V_t$ all remain unknown (and we consider both cases where the product features $U$ are known or unknown). In \u00a75.5, we verify that our low-rank assumption accurately describes real historical demand data.\n\n4 Methods\n\nOur basic dynamic pricing strategy is to employ the gradient-descent without a gradient (GDG) online bandit optimization technique of Flaxman et al. [2005]. While a naive application of this algorithm produces regret dependent on the number of products $N$, we ensure the updates of this method are only applied in the $d$-dimensional subspace spanned by $U$, which leads to regret bounds that depend only on $d$ rather than $N$. When $U$ is unknown, this subspace is simultaneously estimated online, in a somewhat similar fashion to the approach of Hazan et al. [2016a] for online learning with low-rank experts. 
If we define $x = U^T p \\in \\mathbb{R}^d$, then under the low-rank model in (2) with $\\mathbb{E}[\\epsilon_t] = 0$, the expected value of our revenue-objective in round $t$ can be expressed as:\n\n$$\\mathbb{E}_{\\epsilon}[R_t(p)] = p^T U V_t U^T p - p^T U z_t = x^T V_t x - x^T z_t := f_t(x) \\tag{3}$$\n\nAs this problem's intrinsic dimensionality is only $d$, we can maximize expected revenues by merely considering a restricted set of $d$-dimensional actions $x$ and functions $f_t$ over the projected constraint set:\n\n$$U^T(S) = \\{x \\in \\mathbb{R}^d : x = U^T p \\text{ for some } p \\in S\\} \\tag{4}$$\n\n4.1 Products with Known Features\n\nIn certain markets, it is clear how to featurize products [Cohen et al., 2016]. Under the low-rank model in (2) when $U$ is given, we can apply the OPOK method (Algorithm 1) to select prices. This algorithm employs subroutines FINDPRICE and PROJECTION which both solve convex optimization problems in order to compute certain projections. Here, $B_d = \\mathrm{Unif}(\\{x \\in \\mathbb{R}^d : \\|x\\|_2 = 1\\})$ denotes a uniform distribution over the surface of the unit sphere in $\\mathbb{R}^d$.\n\nAlgorithm 1 OPOK (Online Pricing Optimization with Known Features)\nInput: $\\eta, \\delta, \\alpha > 0$, $U \\in \\mathbb{R}^{N \\times d}$, initial prices $p_0 \\in S$\nOutput: Prices $p_1, \\dots, p_T$ to maximize revenue\n1: Set prices to $p_0 \\in S$ and observe $q_0(p_0)$, $R_0(p_0)$\n2: Define $x_1 = U^T p_0$\n3: for $t = 1, \\dots, T$:\n4:   $\\xi_t \\sim \\mathrm{Unif}(\\{x \\in \\mathbb{R}^d : \\|x\\|_2 = 1\\})$\n5:   $\\tilde{x}_t := x_t + \\delta \\xi_t$\n6:   Set prices: $p_t = \\text{FINDPRICE}(\\tilde{x}_t, U, S, p_{t-1})$ and observe $q_t(p_t)$, $R_t(p_t)$\n7:   $x_{t+1} = \\text{PROJECTION}(x_t - \\eta R_t(p_t) \\xi_t, \\alpha, U, S)$\n\nAlgorithm 2 FINDPRICE($x$; $U$, $S$, $p_{t-1}$)\nInput: $x \\in \\mathbb{R}^d$, $U \\in \\mathbb{R}^{N \\times d}$, convex $S \\subset \\mathbb{R}^N$, $p_{t-1} \\in \\mathbb{R}^N$\nOutput: $\\operatorname{argmin}_{p \\in S} \\|p - p_{t-1}\\|_2$ subject to: $U^T p = x$\n\nAlgorithm 3 PROJECTION($x$, $\\alpha$, $U$, $S$)\nInput: $x \\in \\mathbb{R}^d$, $\\alpha > 0$, $U \\in \\mathbb{R}^{N \\times d}$, convex set $S \\subset \\mathbb{R}^N$\nOutput: $(1 - \\alpha) U^T \\hat{p}$ with $\\hat{p} := \\operatorname{argmin}_{p \\in S} \\|(1 - \\alpha) U^T p - x\\|_2$\n\nIntuitively, our algorithm adapts GDG to select low-dimensional actions $x_t \\in \\mathbb{R}^d$ at each time point, and then seeks out a feasible price vector $p_t$ corresponding to the chosen $x_t$. Note that when $d \\ll N$, there are potentially many price-vectors $p \\in \\mathbb{R}^N$ that map to the same low-dimensional vector $x \\in \\mathbb{R}^d$ via $U^T$. Out of these, we select the one that is closest to our previously-chosen prices (via FINDPRICE), ensuring additional stability in our dynamic pricing procedure. In practice, the initial prices $p_0$ should be selected based on external knowledge or historical demand data.\n\nUnder mild conditions, Theorem 1 below states that the OPOK algorithm incurs $O(T^{3/4}\\sqrt{d})$ regret when product features are a priori known. This result is derived from Lemma A.1 which shows that Step 7 of our algorithm corresponds (in expectation) to online projected gradient descent on a smoothed version of our objective defined as:\n\n$$\\hat{f}_t(x) = \\mathbb{E}_{\\zeta}[f_t(x + \\delta \\zeta)] \\tag{5}$$\n\nwhere $\\zeta$ is sampled uniformly from within the unit sphere in $\\mathbb{R}^d$, and $f_t$ is defined in (3). We bound the regret of our pricing algorithm under the following assumptions (which ensure the revenue functions are bounded/smooth and the set of feasible prices is bounded/well-scaled):\n\n(A1) $\\|z_t\\|_2 \\le b$ for $t = 1, \\dots$
, $T$\n(A2) $\\|V_t\\|_{op} \\le b$ for all $t$ ($\\|\\cdot\\|_{op}$ denotes spectral norm)\n(A3) $T > \\frac{9}{4} d^2$\n(A4) $U$ is an orthogonal matrix such that $U^T U = I_{d \\times d}$\n(A5) $S = \\{p \\in \\mathbb{R}^N : \\|p\\|_2 \\le r\\}$ (with $r \\ge 1$)\n\nRequiring that the columns of $U$ form an orthonormal basis for $\\mathbb{R}^d$, condition (A4) can be easily enforced (when $d < N$) by first orthonormalizing the product features. Note that this orthogonality condition does not restrict the overall class of models specified in (2), and describes the case where the features used to encode each product are uncorrelated between products (i.e. a minimally-redundant encoding) and have been normalized across all products. To see why (A4) does not limit the allowed price-demand relationships, consider that we can re-express any (non-orthogonal) $U = OP$ in terms of orthogonal $O \\in \\mathbb{R}^{N \\times d}$. The demand model in (2) can then be equivalently expressed in terms of $z'_t = P z_t$, $V'_t = P V_t P^T$ (after appropriately redefining the constant $b$ in (A1)-(A2)), since: $U z_t - U V_t U^T p_t = O z'_t - O V'_t O^T p_t$. To further simplify our analysis, we also from now on adopt (A5), presuming the constraint set of feasible product-prices is a centered Euclidean ball (implying our $p_t$, $q_t$ vectors now represent appropriately shifted/scaled prices and demands).\n\nTheorem 1. Under assumptions (A1)-(A5), if we choose $\\eta = \\frac{1}{b(1+d)\\sqrt{T}}$, $\\delta = T^{-1/4} \\sqrt{\\frac{d r^2 (1+r)}{9r+6}}$, $\\alpha = \\frac{\\delta}{r}$, then there exists $C > 0$ such that for any $p \\in S$:\n\n$$\\mathbb{E}_{\\epsilon,\\xi}\\left[\\sum_{t=1}^{T} R_t(p_t) - \\sum_{t=1}^{T} R_t(p)\\right] \\le C b r (r+1) T^{3/4} d^{1/2}$$\n\nfor the prices $p_1, \\dots$
, $p_T$ selected by the OPOK algorithm.\n\nTheorem A.2 shows the same $O(T^{3/4}\\sqrt{d})$ regret bound holds for the OPOK algorithm under relaxed conditions solely based on the revenue functions and feasible prices rather than the specific properties of our low-rank structure assumed in (A1)-(A5).\n\n4.2 Products with Latent Features\n\nIn many settings, it is not clear how to best represent products as feature-vectors. Once again adopting the low-rank demand model in (2), we now consider the case where $U$ is unknown and must be estimated. We presume the orthogonality condition (A4) holds throughout this section (recall this does not restrict the class of allowed models), which implies $U$ is both an isometry as well as the right-inverse of $U^T$. Thus, given any low-dimensional action $x \\in U^T(S)$, we can set the corresponding prices as $p = Ux$ such that $U^T p = x$. Lemma 1 shows that this price selection-method is feasible and corresponds to changing Step 6 in the OPOK algorithm to $p_t = \\text{FINDPRICE}(\\tilde{x}_t, U, S, 0)$, where the next price is regularized toward the origin rather than the previous price $p_{t-1}$. Because prices $p_t$ are multiplied by the noise term $\\epsilon_t$ within each revenue-function $R_t$, choosing minimum-norm prices can help reduce variance in the total revenue generated by our approach. As $U$ is unknown, we instead employ an estimate $\\hat{U} \\in \\mathbb{R}^{N \\times d}$, which is always restricted to be an orthogonal matrix.\n\nLemma 1. For any orthogonal matrix $\\hat{U}$ and any $x \\in \\hat{U}^T(S)$, define $\\hat{p} = \\hat{U} x \\in \\mathbb{R}^N$. Under (A5): $\\hat{p} \\in S$ and $\\hat{p} = \\text{FINDPRICE}(x, \\hat{U}, S, 0)$.\n\nProduct Features with Known Span. In Theorem 2, we consider a slightly modified OPOK algorithm where price-selection in Step 6 is done using $p_t = \\hat{U} \\tilde{x}_t$ rather than being regularized toward the previous price $p_{t-1}$. Even without knowing the true latent features, this result implies that the regret of our modified OPOK algorithm may still be bounded independently of the number of products $N$, as long as $\\hat{U}$ accurately estimates the column span of $U$.\n\nTheorem 2. Suppose $\\mathrm{span}(\\hat{U}) = \\mathrm{span}(U)$, i.e. our orthogonal estimate has the same column-span as the underlying (rank $d$) latent product-feature matrix. Let $p_1, \\dots, p_T \\in S$ denote the prices selected by our modified OPOK algorithm with $\\hat{U}$ used in place of the underlying $U$ and $\\eta, \\delta, \\alpha$ chosen as in Theorem 1. Under conditions (A1)-(A5), there exists $C > 0$ such that for any $p \\in S$:\n\n$$\\mathbb{E}_{\\epsilon,\\xi}\\left[\\sum_{t=1}^{T} R_t(p_t) - \\sum_{t=1}^{T} R_t(p)\\right] \\le C b r (r+1) T^{3/4} d^{1/2}$$\n\nFeatures with Unknown Span and Noise-free Demands. In practice, $\\mathrm{span}(U)$ may be entirely unknown. If we assume the adversary is restricted to strictly positive-definite $V_t \\succ 0$ for all $t$ and there is no statistical noise in the observed demands (i.e. $q_t = U z_t - U V_t U^T p_t$ in each round), then Lemma 2 below shows we can ensure $\\mathrm{span}(U)$ is revealed within the first $d$ observed demand vectors by simply adding a minuscule random perturbation to all of our initial prices selected in the first $d$ rounds. Thus, even without knowing the latent product feature subspace, an absence of noise in the observed demands enables us to realize a low regret pricing strategy via the same modified OPOK algorithm (applied after the first $d$ rounds).\n\nLemma 2. Suppose that for $t = 1, \\dots, T$: $\\epsilon_t = 0$ and $V_t \\succ 0$. If each $p_t$ is independently uniformly distributed within some (uncentered) Euclidean ball of strictly positive radius, then $\\mathrm{span}(q_1, \\dots, q_d) = \\mathrm{span}(U)$ almost surely.\n\nFeatures with Unknown Span and Noisy Demands. When the observed demands are noisy and $\\mathrm{span}(U)$ is unknown, we select prices using the OPOL algorithm (Algorithm 4). 
The approach is similar to our previous OPOK algorithm, except we now additionally maintain a changing estimate of the latent product features' span. Our estimate is updated in an online fashion via an averaged singular value decomposition (SVD) of the previously observed demands.\n\nStep 9 in our OPOL algorithm corresponds to online averaging of the currently observed demand vector $q_t$ with the historical observations stored in the $j$th column of matrix $\\hat{Q}$. After computing the singular value decomposition of $\\hat{Q} = \\tilde{U} \\tilde{S} \\tilde{V}^T$, Step 10 is performed by setting $\\hat{U}$ equal to the first $d$ columns of $\\tilde{U}$ (presumed to be the indices corresponding to the largest singular values in $\\tilde{S}$). Since $\\hat{Q}$ is only minorly changed within each round, the update operation in Step 10 can be computed more efficiently by leveraging existing fast SVD-update procedures [Brand, 2006, Stange, 2008]. Note that by their definition as singular vectors, the columns of $\\hat{U}$ remain orthonormal throughout the execution of our algorithm.\n\nAlgorithm 4 OPOL (Online Pricing Optimization with Latent Features)\nInput: $\\eta, \\delta, \\alpha > 0$, rank $d \\in [1, N]$, initial prices $p_0 \\in S$\nOutput: Prices $p_1, \\dots, p_T$ to maximize overall revenue\n1: Initialize $\\hat{Q}$ as $N \\times d$ matrix of zeros\n2: Initialize $\\hat{U}$ as random $N \\times d$ orthogonal matrix\n3: Set prices to $p_0 \\in S$ and observe $q_0(p_0)$, $R_0(p_0)$\n4: Define $x_1 = \\hat{U}^T p_0$\n5: for $t = 1, \\dots, T$:\n6:   $\\tilde{x}_t := x_t + \\delta \\xi_t$, $\\xi_t \\sim \\mathrm{Unif}(\\{x \\in \\mathbb{R}^d : \\|x\\|_2 = 1\\})$\n7:   Set prices: $p_t = \\hat{U} \\tilde{x}_t$ and observe $q_t(p_t)$, $R_t(p_t)$\n8:   $x_{t+1} = \\text{PROJECTION}(x_t - \\eta R_t(p_t) \\xi_t, \\alpha, \\hat{U}, S)$\n9:   With $j = 1 + [(t-1) \\bmod d]$, $k = \\mathrm{floor}(t/d)$, update: $\\hat{Q}_{*,j} \\leftarrow \\frac{1}{k} q_t + \\frac{k-1}{k} \\hat{Q}_{*,j}$\n10:  Set columns of $\\hat{U}$ as top $d$ left singular vectors of $\\hat{Q}$\n\nTo quantify the regret incurred by this algorithm, we assume the noise vectors $\\epsilon_t$ follow a sub-Gaussian distribution for each $t = 1, \\dots, T$. The assumption of sub-Gaussian noise is quite general, covering common settings where the noise is Gaussian, bounded, of strictly log-concave density, or any finite mixture of sub-Gaussian variables [Mueller et al., 2018]. Intuitively, the averaging in Step 9 of our OPOL algorithm ensures statistical concentration of the noise in our observed demands, such that the true column span of the underlying $U$ may be better revealed. More concretely, if we let $s_t = z_t - V_t U^T p_t$ and $q^*_t = U s_t$, then the observed demands can be written as: $q_t = q^*_t + \\epsilon_t$, where $q^*_t$ are the (unobserved) expected demands at our chosen prices. Thus, the $j$th column of $\\hat{Q}$ at round $T$ is given by:\n\n$$\\hat{Q}_{*,j} = \\bar{Q}^*_{*,j} + \\frac{1}{|I_j|} \\sum_{i \\in I_j} \\epsilon_i, \\quad \\text{with } \\bar{Q}^*_{*,j} = \\frac{1}{|I_j|} U \\sum_{i \\in I_j} s_i \\tag{6}$$\n\nwhere we assume for notational simplicity that $T$ is divisible by $d$ and define $I_j = \\{j + d(i-1) : i = 1, \\dots, \\frac{T}{d}\\}$ (so $|I_j| = \\frac{T}{d}$). Because the average $\\frac{1}{|I_j|} \\sum_{i \\in I_j} \\epsilon_i$ exhibits concentration of measure, results from random matrix theory imply that the span-estimator obtained from the first $d$ singular vectors of $\\hat{Q}$ in Step 10 of our OPOL algorithm will rapidly converge to the column span of $\\bar{Q}^* \\in \\mathbb{R}^{N \\times d}$, a matrix of averaged underlying expected demands. 
This is useful since $\\bar{Q}^*$ shares the same span as the underlying $U$.\n\nTheorem 3 below shows that our OPOL algorithm achieves low-regret in the setting of unknown product features with noisy demands, and the regret again depends only on the intrinsic rank $d$ (rather than the number of products $N$).\n\nTheorem 3. For unknown $U$, let $p_1, \\dots, p_T$ be the prices selected by the OPOL algorithm with $\\eta, \\delta, \\alpha$ set as in Theorem 1. Suppose $\\epsilon_t$ follows a sub-Gaussian($\\sigma^2$) distribution and has statistically i.i.d. dimensions for each $t$. If (A1)-(A5) hold, then there exists $C > 0$ such that for any $p \\in S$:\n\n$$\\mathbb{E}_{\\epsilon,\\xi}\\left[\\sum_{t=1}^{T} R_t(p_t) - \\sum_{t=1}^{T} R_t(p)\\right] \\le C Q r b (4r+1) d T^{3/4}$$\n\nHere, $Q = \\max\\left\\{1, \\sigma^2 \\left(\\frac{2\\sigma_1 + 1}{\\sigma_d^2}\\right)\\right\\}$ with $\\sigma_1$ (and $\\sigma_d$) defined as the largest (and smallest) nonzero singular values of the underlying rank $d$ matrix $\\bar{Q}^*$ defined in (6).\n\nOur proof of this result relies on standard random matrix concentration inequalities [Vershynin, 2012] and Theorem A.3, a useful variant of the Davis-Kahan theory introduced by Yu et al. [2015]. Intuitively, we show that $\\mathrm{span}(U)$ can be estimated to sufficient accuracy within sufficiently few rounds, and then follow similar reasoning to the proof of Theorem 2. Note that the regret in Theorem 3 depends on the constant $Q$ whose value is determined by the noise-level $\\sigma$ and the extreme singular values of $\\bar{Q}^*$ defined in (6). In general, these quantities thus measure just how adversarial of an environment the seller is faced with. For example, when the underlying low-rank variation is of much smaller magnitude than the noise in our observations, it will be difficult to accurately estimate the span of the latent product features. 
In control theory, a signal-to-noise expression similar to $Q$ has also been recently proposed to quantify the intrinsic difficulty of system identification for the linear quadratic regulator [Dean et al., 2017]. A basic setting in which $Q$ can be explicitly bounded is illustrated in Appendix B, where we suppose the underlying demand model parameters can only be imprecisely controlled by an adversary over time.\n\n5 Experiments\n\nWe evaluate the performance of our methodology in settings where noisy demands are generated according to equation (2), and the underlying structural parameters of the demand curves are randomly sampled from Gaussian distributions (details in Appendix C.2). Throughout, $p_t$ and $q_t$ represent rescaled rather than absolute prices/demands, such that the feasible set $S$ can be simply fixed as a centered sphere of radius $r = 20$. Noise in the (rescaled) demands for each individual product is always sampled as: $\\epsilon_t \\sim N(0, 10)$.\n\nOur proposed algorithms are compared against the GDG online bandit algorithm of Flaxman et al. [2005], as well as a simple explore-then-exploit ($\\text{Explo}^{\\text{re}}_{\\text{it}}$) technique. The latter method randomly samples $p_t$ during the first $T^{3/4}$ rounds (uniformly over $S$) and for all remaining rounds, $p_t$ is fixed at the best price vector found during exploration. $\\text{Explo}^{\\text{re}}_{\\text{it}}$ reflects a standard pricing technique: initially experiment with prices and eventually settle on those that previously produced the most profit.\n\n5.1 Stationary Demand Model\n\nFirst, we consider a stationary setting where underlying structural parameters $z_t = z$, $V_t = V$ remain fixed. Before each experiment, we sample the entries of $z$, $V$ independently as $z_{ij} \\sim N(100, 20)$, $V_{ij} \\sim N(0, 2)$, and $U$ is fixed as a random sparse binary matrix that reflects which of $d$ possible categories each product belongs to. 
Subsequently, we orthogonalize the columns of $U$ and project $V$ into $\\mathcal{V} = \\{V : V^T + V \\succeq \\lambda I\\}$ with $\\lambda = 10$ to ensure positive definite cross-product price elasticities. Here, $z$, $V$, $\\lambda$ are chosen to reflect properties of real-world demand curves: different products' baseline demands and elasticities should be highly diverse (wide range of $z$), and prices should significantly influence demands such that price-increases severely decrease demand and affect demand for the same product more than other products (large value of $\\lambda$, which in turn induces large values for certain entries of $V$). We find the optimal price vector does not lie near the boundary of $S$ ($\\|p^*\\|_2 \\approx 8$ rather than 20), which shows that prices strongly influence demands under our setup.\n\nFigures 1A and 1B show that our OPOK and OPOL algorithms are greatly superior to GDG when the dimensionality $N$ exceeds the intrinsic rank $d$. When $N = d$ (no low-rank structure to exploit), our OPOK/OPOL algorithms closely match GDG (blue, green, and red curves overlap). Note that in this case: GDG and OPOK are nearly mathematically equivalent (same regret bound applies to both, but their empirical performance slightly differs in this case due to the internal stochasticity of each bandit algorithm), as are OPOL and OPOK (since $d = N$ implies $\\hat{U}$ is an orthogonal $N \\times N$ matrix and hence invertible). For small $N$, all online bandit optimization techniques outperform $\\text{Explo}^{\\text{re}}_{\\text{it}}$, but GDG scales poorly to large $N$ unlike our methods. Interestingly, OPOL (which must infer latent product features alongside the pricing strategy) performs slightly better than the OPOK approach, which has access to the ground-truth features. This is because in the presence of noise, our SVD-computed features can more robustly represent the subspace where projected pricing variation can maximally impact the overall observed demands. 
In contrast, the dimensionality-reduction in OPOK does not lead to any denoising.\n\n5.2 Model with Demand Shocks\n\nNext, we study a non-stationary setting where the underlying demand model changes drastically at times $T/3$ and $2T/3$. At the start of each period $[0, T/3]$, $[T/3, 2T/3]$, $[2T/3, T]$: we simply redraw the underlying structural parameters $z_t$, $V_t$ from the same Gaussian distributions used for the stationary setting. Figures 1C and 1D show that our bandit techniques quickly adapt to the changes in the underlying demand curves. The regret of the bandit algorithms decreases over time, indicating they begin to outperform the optimal fixed price chosen in hindsight (recall that our bandits may vary price over time, whereas regret is measured against the best fixed price-configuration which may fare much worse than a dynamic schedule in nonstationary environments). Once again, our low-rank methods achieve low regret for a large number of products unlike the existing approaches, while retaining the same strong performance as the GDG algorithm in the absence of low-rank structure.\n\nFigure 1: Average cumulative regret (over 10 repetitions with standard-deviations shaded) of various pricing strategies when underlying demand model is: (A)-(B) stationary over time, (C)-(D) altered by structural shocks at times $T/3$ and $2T/3$, (E)-(F) drifting over time. Panels: (A) $N = 10$, $d = 10$, Model = \u00a75.1; (B) $N = 100$, $d = 10$, Model = \u00a75.1; (C) $N = 10$, $d = 10$, Model = \u00a75.2; (D) $N = 100$, $d = 10$, Model = \u00a75.2; (E) $N = 10$, $d = 10$, Model = \u00a75.3; (F) $N = 100$, $d = 10$, Model = \u00a75.3.\n\n5.3 Drifting Demand Model\n\nFinally, we consider another non-stationary setting where underlying demand curves slowly change over time. 
Here, the underlying structural parameters z_t, V_t are initially drawn from the same previously used Gaussian distributions at t = 0, but then begin to stochastically drift over time according to: z_{t+1} = z_t + w, V_{t+1} = \u03a0_\ud835\udcb1(V_t + W). Here, the entries of w and W are i.i.d. samples from \ud835\udca9(0, 1) and \ud835\udca9(0, 0.1) distributions, respectively, and \u03a0_\ud835\udcb1 denotes the projection of a matrix into the strongly positive-definite set \ud835\udcb1 we previously defined. Figures 1E and 1F illustrate how our bandit pricing approach can adapt to ever-changing demand curves. Again, our low-rank methods exhibit much stronger performance than GDG and Explore_it in the settings with many products.\n\nFigure 2: Regret of pricing strategies (for N = 100) when underlying demand model has no low-rank structure (see Appendix C.1) and is: (A) stationary, (B) altered by shocks at T/3 and 2T/3 as in \u00a75.2. Panels: (A) Model (1) without temporal change; (B) Model (1) with demand shocks.\n\n5.4 Misspecified Demand Model\n\nAppendix C.1 investigates the robustness of our algorithms in misspecified settings with full-rank or log-linear demands, where the assumptions of our demand model are explicitly violated. Even in the absence of explicit low-rank structure, running the OPOL algorithm with low values of d substantially outperforms other pricing strategies (Figure 2). These empirical results suggest that our OPOL algorithm is practically useful for various high-dimensional pricing problems, beyond those that exactly satisfy the low-rank/linearity assumptions in (2).\n\n5.5 Rank of Historical Demand Data\n\nWhile the aforementioned robustness analysis indicates our approach works well even when key assumptions are violated, it remains of interest whether our assumptions accurately describe actual demand variation for real products. One key implication of our assumptions in (2) is that the N \u00d7 T matrix Q = [q_1; q_2; ...; q_T], whose columns contain the observed demands in each round, should be approximately low-rank when there is limited noise in the demand-price relationship. This is because under our assumptions, q_1, ..., q_T only span a d-dimensional subspace in the absence of noise (see proof of Lemma 2).\nHere, we study historical demand data\u00b9 for 1,340 products sold at various prices over 7 weeks by the baking company Grupo Bimbo. Using this data, we form a matrix Q whose columns contain the total weekly demands for each product across all stores. The SVD of Q reveals the following percentages of variation in the observed demands are captured within the top k singular vectors: k = 1: 97.1%, k = 2: 99.1%, k = 3: 99.9%. This empirical analysis thus suggests that our low-rank assumption on the expected demand variation remains reasonable in practice.\n\n6 Discussion\n\nBy exploiting a low-rank structural condition that naturally emerges in dynamic pricing problems, this work introduces an online bandit optimization algorithm whose regret provably depends only on the intrinsic rank of the problem rather than the ambient dimensionality of the action space. Our low-rank bandit approach to dynamic pricing scales to a large number of products with intercorrelated demand curves, even if the underlying demand model varies over time in an adversarial fashion. When applied to various high-dimensional dynamic pricing systems involving stationary, fluctuating, and misspecified demand curves, our approach empirically outperforms standard bandit methods. Future extensions of this work could include adaptations for predictable sequences in which future demands can be partially forecasted [Rakhlin and Sridharan, 2013], or generalizing our convex formulation and linear demand model to more general subspace structures [Hazan et al., 2016b].\n\n\u00b9 Historical demand data obtained from: www.kaggle.com/c/grupo-bimbo-inventory-demand/\n\nReferences\nO. 
Besbes and A. Zeevi. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57:1407\u201320, 2009.\n\nO. Besbes and A. Zeevi. On the surprising sufficiency of linear models for dynamic pricing with demand learning. Management Science, 61:723\u201339, 2015.\n\nM. Brand. Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 415:20\u201330, 2006.\n\nS. Bubeck and A. Slivkins. The best of both worlds: Stochastic and adversarial bandits. Conference on Learning Theory, 2012.\n\nS. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex optimization. Proceedings of the 49th Annual ACM SIGACT Symposium on the Theory of Computing, 2017.\n\nM. Cohen, I. Lobel, and R. P. Leme. Feature-based dynamic pricing. ACM Conference on Economics and Computation, 2016.\n\nV. Dani, T. P. Hayes, and S. M. Kakade. The price of bandit information for online optimization. Neural Information Processing Systems, 2007.\n\nS. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. arXiv:1710.01688, 2017.\n\nA. V. den Boer and B. Zwart. Simultaneously learning and optimizing using controlled variance pricing. Management Science, 60:770\u201383, 2013.\n\nJ. Djolonga, A. Krause, and V. Cevher. High-dimensional gaussian process bandits. Neural Information Processing Systems, 2013.\n\nA. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, 2005.\n\nA. Gopalan, O. Maillard, and M. Zaki. Low-rank bandits with latent mixtures. arXiv:1609.01508, 2016.\n\nE. Hazan and K. Y. Levy. Bandit convex optimization: Towards tight bounds. Neural Information Processing Systems, 2014.\n\nE. Hazan, T. Koren, R. Livni, and Y. Mansour. 
Online learning with low rank experts. Conference on Learning Theory, 2016a.\n\nE. Hazan, K. Y. Levy, and S. Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. International Conference on Machine Learning, 2016b.\n\nH. S. Houthakker and L. D. Taylor. Consumer demand in the United States. Harvard University Press, 1970.\n\nA. Javanmard. Perishability of data: Dynamic pricing under varying-coefficient models. Journal of Machine Learning Research, 18:1\u201331, 2017.\n\nA. Javanmard and H. Nazerzadeh. Dynamic pricing in high-dimensions. arXiv:1609.07574, 2016.\n\nN. B. Keskin and A. Zeevi. Dynamic pricing with an unknown demand model: asymptotically optimal semi-myopic policies. Operations Research, 62:1142\u201367, 2014.\n\nR. Kleinberg and T. Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, 2003.\n\nK. Misra, E. M. Schwartz, and J. Abernethy. Dynamic online pricing with incomplete information using multi-armed bandit experiments. Available at SSRN: http://ssrn.com/abstract=2981814, 2017.\n\nJ. Mueller, T. Jaakkola, and D. Gifford. Modeling persistent trends in distributions. Journal of the American Statistical Association, 113:1296\u20131310, 2018.\n\nA. Rakhlin and K. Sridharan. Online learning with predictable sequences. Conference on Learning Theory, 2013.\n\nP. Rigollet. High dimensional statistics, 2015. MIT Opencourseware: ocw.mit.edu/courses/mathematics/18-s997-high-dimensional-statistics-spring-2015/lecture-notes/.\n\nM. Rudelson and R. Vershynin. The Littlewood-Offord problem and invertibility of random matrices. Advances in Mathematics, 218:600\u201333, 2008.\n\nR. Sen, K. Shanmugam, M. Kocaoglu, A. Dimakis, and S. Shakkottai. Contextual bandits with latent confounders: An NMF approach. 
Artificial Intelligence and Statistics, 2017.\n\nS. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4:107\u2013194, 2011.\n\nP. Stange. On the efficient update of the singular value decomposition. Proceedings in Applied Mathematics and Mechanics, 8:10827\u201328, 2008.\n\nR. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, Theory and Applications, pages 210\u2013268. Cambridge University Press, 2012.\n\nU. Witt. How can complex economical behavior be investigated? The example of the ignorant monopolist revisited. Behavioral Science, 31:173\u2013188, 1986.\n\nY. Yu, T. Wang, and R. Samworth. A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102:315\u2013323, 2015.\n\nF. Zhao, M. Xiao, and Y. Guo. Predictive collaborative filtering with side information. International Joint Conference on Artificial Intelligence, 2016.", "award": [], "sourceid": 8962, "authors": [{"given_name": "Jonas", "family_name": "Mueller", "institution": "Amazon Web Services"}, {"given_name": "Vasilis", "family_name": "Syrgkanis", "institution": "Microsoft Research"}, {"given_name": "Matt", "family_name": "Taddy", "institution": "Chicago Booth"}]}