{"title": "Personalizing Many Decisions with High-Dimensional Covariates", "book": "Advances in Neural Information Processing Systems", "page_first": 11473, "page_last": 11484, "abstract": "We consider the k-armed stochastic contextual bandit problem with d-dimensional features, when both k and d can be large. To the best of our knowledge, all existing algorithms for this problem have regret bounds that scale as polynomials of degree at least two in k and d. The main contribution of this paper is to introduce and theoretically analyze a new algorithm (REAL-Bandit) whose regret scales as r^2(k+d), where r is the rank of the k by d matrix of unknown parameters. REAL-Bandit relies on ideas from the low-rank matrix estimation literature and a new row-enhancement subroutine that yields sharper bounds for estimating each row of the parameter matrix, which may be of independent interest.", "full_text": "Personalizing Many Decisions with High-Dimensional Covariates\n\nNima Hamidi∗ Mohsen Bayati† Kapil Gupta‡\n\nAbstract\n\nWe consider the k-armed stochastic contextual bandit problem with d-dimensional features, when both k and d can be large. To the best of our knowledge, all existing algorithms for this problem have regret bounds that scale as polynomials of degree at least two in k and d. The main contribution of this paper is to introduce and theoretically analyze a new algorithm (REAL-Bandit) whose regret scales as r^2(k + d), where r is the rank of the k × d matrix of unknown parameters. REAL-Bandit relies on ideas from the low-rank matrix estimation literature and a new row-enhancement subroutine that yields sharper bounds for estimating each row of the parameter matrix, which may be of independent interest. 
We also show via simulations that the REAL-Bandit algorithm outperforms existing algorithms that do not leverage the low-rank structure of the problem.\n\n1 Introduction\n\nRunning online experiments has recently become a popular approach in data-centric enterprises. However, running an experiment involves an opportunity cost, or regret (e.g., exposing some users to potentially inferior experiences). To reduce this opportunity cost, a growing number of companies leverage multi-armed bandit (MAB) experiments [38, 39, 19], which were initially motivated by the cost of experimentation in clinical trials [41, 27]. Another common feature of online experiments is personalization: users have heterogeneous preferences, which means that optimal decisions depend on user or product characteristics (also known as context). The MAB approach to personalizing decisions is therefore called contextual MAB (or contextual bandit) [29]. For example, [30] used contextual bandits to propose a personalized news article recommender system.\nThere is a large body of literature on algorithms with theoretical guarantees for contextual bandits with linear reward functions. An admittedly incomplete list is [5, 13, 2, 12, 14, 4, 34, 37, 42, 25, 6], and we defer to [7] for additional references. While these papers study the problem under a variety of different assumptions, they can be divided into two groups: (A) when context vectors are arbitrary and can potentially be selected by an adversary, and (B) when context vectors are i.i.d. samples from a fixed (but unknown) probability distribution. Our focus in this paper is the latter group (first studied by [14]). As the number of decisions T (the time horizon) grows, the regret bounds for the algorithms in group (A) grow with √T. But the algorithms in group (B) take advantage of the i.i.d. 
assumption and have a significantly lower (logarithmic) dependence on T.\nTwo other important parameters are the number of arms k and the dimension of the context vectors d. For example, when d grows, the regret bound of [14] grows as d^3, which can dominate the dependence on T. [6] tackled this difficulty by imposing a sparsity assumption, replacing d^3 with s^2 (up to logarithmic factors), where s is the sparsity of the parameter vectors of the reward functions. On the other hand, a careful inspection of the bounds in [14, 6] reveals that their regret bounds can grow as k^3 (in the worst case), which can be very large in applications such as assortment optimization [21].\n\n∗Department of Statistics, Stanford University, hamidi@stanford.edu\n†Graduate School of Business, Stanford University, bayati@stanford.edu\n‡Airbnb, kapil.gupta@airbnb.com\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nMoreover, the lower bounds found in [13] and [14] imply that, if there is no structural assumption, the lowest possible regret grows at least as kd. The aim of this paper is to reduce this dependence under a low-rank assumption on the k × d matrix of parameters of the reward functions. Specifically, each of the k reward functions is represented by a d-vector of coefficients (one coefficient per covariate) that is a row of the parameter matrix. This can also be interpreted as imposing a similarity between the reward functions of the k different arms, as in the multi-task learning literature [10]. We propose a new algorithm (called REAL-Bandit) and prove that its regret grows as r^2(k + d), where r is the rank of the parameter matrix.\nContributions. Our main technical contributions in the design and analysis of REAL-Bandit are as follows. 
(1) Stronger row-wise guarantees: To prove regret guarantees for REAL-Bandit, we need bounds for estimating every single row of the matrix. However, existing matrix estimation results provide a bound on the estimation error of the whole matrix, which would be a crude upper bound on the estimation error of each single row. Therefore, REAL-Bandit includes a subroutine (called Row Enhancement) that refines the estimates in order to establish stronger row-wise guarantees; this may be of independent interest (see §3 for details). Very recently, [11] provided bounds for the matrix completion problem that are also sharp at the row or entry level; however, their results are only for the matrix completion case and do not apply to our setting. (2) Implementation: Our theoretical analysis does not require that each matrix estimation phase in REAL-Bandit be solved to completion. In other words, REAL-Bandit does not need to find a global minimum of the relevant estimator's optimization (penalized maximum likelihood) problem. REAL-Bandit only needs a solution with cost below a certain threshold, which can be used to significantly speed up the implementation of REAL-Bandit. (3) Estimator independence: Over the last decade, several types of estimators have been introduced for recovering low-rank matrices from noisy observations, with varying assumptions and theoretical guarantees. Two of the most common approaches are based on convex optimization [8, 9, 31, 15, 35, 36, 26, 24] or non-convex optimization [40, 22, 23]. Unlike [14, 6], which work with a fixed estimator, REAL-Bandit is designed to be estimator agnostic and works with any matrix estimation algorithm with theoretical guarantees (see §3 for details).\nOther related literature. A class of decision-making problems with a large number of arms are assortment optimization problems, in which a small subset from a potentially large set of items should be selected. 
Among the rich literature on this topic, [21, 3] are most related to our paper, since they consider dynamic allocation of assortments via multi-armed bandit ideas. [21] (like us) uses low-rank matrix estimation methods for the learning part. However, the problems and models they study are very different. Specifically, they assume that a decision-maker shows a subset of products to a user. The user's selection is then modeled via a multinomial logit (MNL) model [32], where the parameters of the MNL model form a low-rank matrix with rows representing customer types and columns representing products.\nTwo other relevant papers are [42, 25], since they too tackle bandit problems with many actions. They introduce algorithms with regrets that scale with the spectral dimension of the Laplacian of a graph that has the arms as its vertices. These papers are in group (A) of the aforementioned classification of bandit papers, which is inherently different. Specifically, they assume actions have known feature vectors (with low spectral dimension) that, together with a single unknown parameter vector, define the linear reward functions. There is a reduction from this setting to our problem only when the action set is allowed to change (see [1]), which is not the case in [42, 25]. Another recent paper in this category is [28]. The main difference between all of these papers and ours, as discussed above, is that we consider i.i.d. context vectors, which allows regret bounds that scale logarithmically in T instead of scaling with √T. Finally, the recent paper [20] studies a bandit problem where each action is a pair of arms and the reward function is a bilinear function of the feature vectors of each arm, with a low-rank parameter matrix.\nOrganization. We introduce additional notation in §2. Then the REAL-estimator and REAL-Bandit algorithm are introduced in §3, followed by simulations in §4. 
In §5 we present our assumptions, the statement of the main theorem, and its proof. Proofs of the lemmas and additional details are deferred to the extended version of the paper [17].\n\n2 Setting and notation\n\nLet B⋆ be a k × d matrix with real-valued entries. We further assume that B⋆ is of rank r with r ≪ min(k, d). At time t = 1, 2, . . . , a context vector X_t ∈ R^d is drawn from a fixed probability distribution P, independently from X_s for s < t. Then, by choosing arm 1 ≤ κ ≤ k, the reward y_t := ⟨B⋆_κ, X_t⟩ + ε_{κ,t} is generated, where (B⋆_κ)^⊤ is the κ-th row of the matrix B⋆ and the ε_{κ,t} are independent σ_ε^2-sub-Gaussian random variables. In addition, ⟨U, V⟩ refers to the inner product of vectors U and V, and U^⊤ refers to the transpose of U. Throughout, we use bold capital letters for matrices, and use the notation [n] for the set {1, 2, . . . , n} when n is an integer. For any two matrices Y_1 and Y_2 with d columns, by Y_1 ⊑ Y_2 we mean that all rows of Y_1 are also rows of Y_2. Also, for any subset U of R^d, the notation P_U refers to the conditional distribution P(·|U) of the contexts.\nA policy π is a sequential decision-making algorithm that, at each time t, chooses the arm π_t ∈ [k] given the previous observations and the revealed context X_t. We will evaluate the performance of a policy by its cumulative regret, defined as R_T = Σ_{t=1}^T r_t, where r_t = E[ max_{κ∈[k]} ⟨X_t, B⋆_κ⟩ − ⟨X_t, B⋆_{π_t}⟩ ] and the expectation is with respect to the randomness of X_t, ε_t, and any randomness introduced by the policy π. 
Our goal is to find policies with low R_T.\nIn order to avoid dealing with unnecessary subscripts, for each context vector X_t ∈ R^d, we define X^π_t to be the k × d matrix with all entries equal to zero, except for the π_t-th row, which is equal to X_t^⊤. Using this notation, we have that\n\ny_t = ⟨B⋆, X^π_t⟩ + ε_t ,    (1)\n\nwhere the inner product for matrices is defined as ⟨U, V⟩ := tr(U V^⊤). Note that ε_t is actually ε_{π_t,t}, but since all noise values are i.i.d., we will drop the dependence on π.\nFor a given subset I = {t_1, . . . , t_n} of [T], consider the set of corresponding context matrices {X^π_{t_i} | i ∈ I}. We define an associated sampling operator X^π_I : R^{k×d} → R^n as follows. For any matrix B ∈ R^{k×d}, X^π_I(B) is a vector of length n whose i-th entry (i ∈ [n]) is given by [X^π_I(B)]_i := ⟨B, X^π_{t_i}⟩. Therefore, the vector form of (1) is\n\nY = X^π_I(B⋆) + E ,\n\nwhere E is the n-vector of all noise values ε_{t_1}, . . . , ε_{t_n}. In the remainder, we use the simpler notation X(·) instead of X^π_I(·) when I and π are clear from the context.\nWe also use several norms in our algorithm and analysis. ‖·‖_2 refers to the usual ℓ2 norm of a vector. The nuclear norm (or trace norm) of a matrix is denoted by ‖·‖_*, and ‖·‖_F and ‖·‖_∞ refer to the Frobenius and infinity norms of a matrix. Also, for a given distribution P over R^{k×d}, we can define the norm ‖B‖_P := E[⟨B, Z⟩^2] for all B ∈ R^{k×d}, where Z is drawn from P. 
Finally, ‖Γ‖_{∞,2} is the maximum of ‖Γ_κ‖_2 over κ ∈ [k] (recall that Γ_κ^⊤ is the κ-th row of Γ). In fact, one of our assumptions, stated explicitly later, is that the matrix B⋆ belongs to the following set:\n\nS = {B ∈ R^{k×d} | ‖B‖_{∞,2} ≤ b⋆} ,\n\nfor a positive constant b⋆. Also, for a k × d matrix B of rank r with singular value decomposition B = U D V^⊤, we define the row-incoherence parameter as\n\nμ(B) = √(k/r) · ‖B‖_{∞,2} / D_{r,r} .    (2)\n\n3 Algorithm\n\nIn this section, we describe the Row-Enhanced and Low-Rank Bandit (REAL-Bandit) algorithm. The algorithm combines ideas from the existing literature [14, 6] with a new row-enhancement procedure to obtain a sharper convergence rate when k is very large. The REAL-Bandit algorithm has two disjoint phases for exploration and exploitation, similar to [14, 6]. In the exploration phase, all arms are given an equal chance to be explored, enabling the algorithm to obtain an estimate of their corresponding parameters. These forced-sampling estimates are not sufficiently accurate to pick the best arm with high probability; however, they are accurate enough to rule out all of the arms that are substantially inferior to the optimal arm. At each time t, these estimates are used as a proxy for the actual arm parameters to form a set of candidate arms. 
In order to choose among these candidates, we need more accurate estimates, and so the algorithm uses the all-sampling estimates, obtained from all observations made thus far, to pick the best arm.\nHowever, unlike [14, 6], which estimate each of the arm parameters {B⋆_κ}_{κ∈[k]} separately, our forced-sampling and all-sampling estimates utilize the low-rank assumption on the matrix B⋆ and estimate all parameters simultaneously (as in the multi-task learning literature).\nThe Estimators. REAL-Bandit is designed to work with any matrix estimation method that has theoretical guarantees. Two such estimators (developed in the matrix completion literature) are: (1) estimators based on convex optimization and (2) estimators based on non-convex optimization. Before presenting these two classes of estimators, we assume that a set of time periods J = {t_1, . . . , t_n}, their associated observations (X^π_{t_1}, y_{t_1}), . . . , (X^π_{t_n}, y_{t_n}), and a positive constant λ are available. We use the notations B̄(J) and B̂(J) for estimators of B⋆ that use observations from the time periods in J. When J is clear, we use the simpler notations B̄ and B̂.\n\n(1) Convex optimization. 
In this approach, introduced by [8], the approximation to B⋆ is the minimizer of the following convex program:\n\nminimize n^{-1} ‖Y − X(B)‖^2 + λ ‖B‖_* .    (3)\n\nIn fact, as [16] shows, one just needs a feasible solution B̄ = B̄(J, λ) that satisfies:\n\nn^{-1} ‖Y − X(B̄)‖^2 + λ ‖B̄‖_* ≤ n^{-1} ‖Y − X(B⋆)‖^2 + λ ‖B⋆‖_* .    (4)\n\nThis brings additional flexibility in the choice of the optimizer and has computational advantages.\n(2) Non-convex optimization. Another approach is to explicitly impose the low-rank constraint by writing B as UV^⊤, where U ∈ R^{k×r} and V ∈ R^{d×r}. The optimization problem would be:\n\nminimize_{U,V} n^{-1} ‖Y − X(UV^⊤)‖^2 + λ (‖U‖_F^2 + ‖V‖_F^2)/2 .    (5)\n\nOne challenge is that this is not a convex program, but it has been shown that, under certain conditions, alternating minimization can be an effective algorithm [18]. It can also be shown (e.g., see [22, 31]) that minimizing the above loss function is equivalent to solving the following optimization problem:\n\nminimize n^{-1} ‖Y − X(B)‖^2 + λ ‖B‖_*  subject to  rank(B) ≤ r .    (6)\n\nIf B̄ is a solution to (6), then since B⋆ is also a feasible solution, (4) must hold.\nREAL-estimator. The existing theory of matrix estimation provides error bounds for ‖B̄ − B⋆‖_F [9, 26, 33, 24, 16]. However, these results do not characterize how this error is distributed across different rows. 
On the other hand, in order to obtain a regret bound, we need to control ‖B̄_κ − B⋆_κ‖_2 for all κ ∈ [k], and the trivial inequality ‖B̄_κ − B⋆_κ‖_2 ≤ ‖B̄ − B⋆‖_F would introduce an unnecessary √k factor. To remedy this, we introduce the REAL-estimator, which uses a set of almost independent observations to improve the row-wise error bound.\nAs before, let J = {t_1, . . . , t_n} be a set of n time periods with t_1 < t_2 < · · · < t_n. We split J into J_1 := {t_1, . . . , t_{n/2}} and J_2 := {t_{n/2+1}, . . . , t_n}. For any K ⊆ [k], let J^K be the subset of J in which an arm in K is pulled, i.e., π_{t_i} ∈ K. Moreover, for ℓ ∈ {1, 2} and K ⊆ [k], let J^K_ℓ := J^K ∩ J_ℓ. For κ ∈ [k], when K = {κ}, we use the superscript κ rather than {κ} for simplicity. Next, for any low-rank matrix estimator B̄, Algorithm 1 performs the row-enhancement procedure. We call the output of this algorithm the REAL-estimator and denote it by B̂(J). The difficulty of analyzing this estimator arises from the fact that the observations are generated in an adaptive fashion, and thus results that require an independence assumption are not applicable in our case. However, in the analysis we will show that these observations can be approximated by i.i.d. samples, and as a result theoretical guarantees can be obtained. In the following, we state the assumptions formally, and then verify that they continue to hold throughout the analysis.\nBefore we define the notion of approximate independence, we need a few more notations. For κ ∈ [k], X^κ is the matrix constructed by stacking the context vectors of the observations for arm κ as its rows. We define X^κ_ℓ for ℓ ∈ {1, 2} and J_ℓ similarly. 
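Before stating the procedure formally, the row-enhancement idea can be sketched as follows. This is a simplified illustration under our own naming, not the paper's implementation: the observations are i.i.d. rather than adaptively collected, and a truncated SVD of a row-wise least-squares estimate stands in for an arbitrary low-rank estimator B̄.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, r, n_per_arm = 6, 30, 2, 200

B_star = rng.standard_normal((k, r)) @ rng.standard_normal((d, r)).T  # true rank-r matrix

# Per-arm observations: contexts X[kappa] and rewards y[kappa] (i.i.d. for simplicity).
X = {kp: rng.standard_normal((n_per_arm, d)) for kp in range(k)}
y = {kp: X[kp] @ B_star[kp] + 0.1 * rng.standard_normal(n_per_arm) for kp in range(k)}

half = n_per_arm // 2  # first half plays the role of J_1, second half of J_2

# Stage 1: a crude low-rank estimate from the first half of the data
# (row-wise OLS, of which we only keep the top-r right singular subspace V_r).
B_ols = np.vstack([np.linalg.lstsq(X[kp][:half], y[kp][:half], rcond=None)[0]
                   for kp in range(k)])
_, S, Vt = np.linalg.svd(B_ols, full_matrices=False)
Vr = Vt[:r].T  # d x r

# Stage 2 (row enhancement): refit each row on the second half of the data,
# restricted to the span of V_r -- an r-dimensional least-squares problem per arm.
B_hat = np.zeros((k, d))
for kp in range(k):
    Z = X[kp][half:] @ Vr                       # contexts projected to R^r
    theta, *_ = np.linalg.lstsq(Z, y[kp][half:], rcond=None)
    B_hat[kp] = Vr @ theta                      # row estimate (V_r theta)^T

worst_row_err = max(np.linalg.norm(B_hat[kp] - B_star[kp]) for kp in range(k))
assert worst_row_err < 0.5  # every single row is recovered well, not just the matrix
```

The point of the second stage is that refitting inside the r-dimensional subspace needs only on the order of r samples per arm instead of d, which is what makes sharper row-wise guarantees possible.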
Recall that, for matrices Y_1 and Y_2 with d columns, Y_1 ⊑ Y_2 means that all rows of Y_1 are also rows of Y_2.\n\nAlgorithm 1 Row-enhancement procedure\nInput: Low-rank matrix estimator B̄ ∈ R^{k×d}, J = {t_1, . . . , t_n}, and observations (X^π_{t_1}, y_{t_1}), . . . , (X^π_{t_n}, y_{t_n}).\n1: Initialize B̂ ∈ R^{k×d},\n2: Split J into J_1 := {t_1, . . . , t_{n/2}} and J_2 := {t_{n/2+1}, . . . , t_n},\n3: Compute the SVD B̄(J_1) = U D V^⊤,\n4: Let V_r be the matrix containing the first r columns of V,\n5: for κ = 1, 2, · · · , k do\n6: Let ϑ̂_κ = arg min_{ϑ ∈ R^r} Σ_{t_i ∈ J^κ_2} (y_{t_i} − ⟨V_r ϑ, X_{t_i}⟩)^2,\n7: Set row κ of B̂ to (V_r ϑ̂_κ)^⊤.\n8: end for\n9: Return B̂.\n\nDefinition 3.1 (Approximate independence). Let J be a given set of n time periods, and let P, P_U, and P_V be three distributions. Then, for κ ∈ [k], we say that J^κ is an (n_U, n_V)-approximately independent set of observations if there exist random matrices X^κ_U and X^κ_V such that\n\n1. X^κ_U ⊑ X^κ and X^κ ⊑ X^κ_V,\n2. All rows of X^κ_U are independent samples from P_U,\n3. All rows of X^κ_V are independent samples from either P or P_V,\n4. X^κ_U and X^κ_V have n_U and n_V rows, respectively.\n\nThis definition requires the observations for a row κ to lie between two sets of i.i.d. samples. This notion becomes extremely useful whenever one can prove that n_U and n_V are of the same order. Next, we specify the conditions that P, P_U, and P_V need to meet so that we can prove error bounds.\nDefinition 3.2. 
We say that a distribution P(·) on R^d is (γ_min, γ_max, σ_X)-diverse if\n\nγ_min ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ γ_max ,\n\nwhere Σ = E[XX^⊤], and Σ^{−1/2}X is σ_X^2-sub-Gaussian (i.e., for any deterministic unit vector u ∈ R^d, the real-valued random variable u^⊤ Σ^{−1/2} X is σ_X^2-sub-Gaussian).\nWe will treat σ_X as a constant. Note that, for instance, when X follows a multivariate Gaussian distribution, then σ_X = 1.\nIn our proofs, we will show that whenever J can be split into two almost independent halves, the row-enhancement procedure gives sharper per-row guarantees than the raw matrix estimator B̄.\nThe REAL-Bandit algorithm. Here, we describe the REAL-Bandit algorithm, presented in Algorithm 2. As mentioned earlier, this algorithm has disjoint exploration and exploitation phases, which are specified by a force-sampling rule f : N → [k] ∪ {∅}. At time t, the force-sampling rule decides between forcing the arm f_t ∈ [k] to be pulled and exploiting the past data, indicated by f_t = ∅. By F_t, we denote the time periods at which an arm was forced to be pulled, i.e., F_t := {τ ≤ t : f_τ ∈ [k]}. For simplicity, we also use A_t := [t] to refer to all time periods up to time t. The force-sampling rule that we use is a randomized function that picks an arm κ ∈ [k] with probability\n\nP(f_t = κ) = 1/k if t ≤ 2ρ log(ρ), and P(f_t = κ) = ρ / (k [t − ρ log(ρ) + 1]) if t > 2ρ log(ρ),    (7)\n\nand f_t = ∅ otherwise. We will specify the hyper-parameter ρ in §5. As we will see in our analysis, this force-sampling rule ensures that F^κ_t grows as O(log t) for all κ ∈ [k]. One can alternatively use any force-sampling rule that has this rate of exploration.\nRemark 1. 
The algorithms proposed in [14, 6] are similar to REAL-Bandit. They, however, use a deterministic force-sampling rule (which could be used here as well). Our randomized rule brings practical advantages in exchange for a slightly more complex theoretical analysis.\n\nNow, let B̄^F and B̄^A be two low-rank matrix estimators (obtained from the observations of the forced-sampling rounds and of all rounds, respectively) and denote by B̂^F and B̂^A their corresponding REAL-estimators, introduced above. These estimators serve different purposes in our algorithm. We will show that the forced-samples estimator B̂^F satisfies ‖B̂^F_κ − B⋆_κ‖_2 ≤ O(1) with probability at least 1 − O(1/t) for all arms κ ∈ [k]. The key idea is that O(log t) i.i.d. samples are enough to obtain such a guarantee. These estimates are then only used to rule out arms that are very far from the optimal arm. The threshold for eliminating sub-optimal arms is determined by a hyper-parameter h that is given to the algorithm. This parameter can be thought of as the average gap of the problem.\nThe remaining arms are candidates for being the optimal arm. Then, the all-samples estimator B̂^A comes into play. This estimator is used to pick the best arm among these candidate arms. We will show that B̂^A enjoys the sharper bound ‖B̂^A_κ − B⋆_κ‖_2 ≤ O(1/√t) for all optimal arms κ ∈ K_opt ⊆ [k], with probability at least 1 − O(1/t), where K_opt is defined formally in Assumption 3 of §5. 
This sharper rate significantly improves the accuracy of the decisions made by the algorithm.\n\nAlgorithm 2 REAL-Bandit algorithm\nInput: Force-sampling rule f, gap h.\n1: for t = 1, 2, · · · do\n2: Observe X_t ∼ P,\n3: if f_t ≠ ∅ then\n4: π_t ← f_t\n5: else\n6: C = { κ ∈ [k] | ⟨X_t, B̂^F_κ(F_{t−1})⟩ ≥ max_{ℓ∈[k]} ⟨X_t, B̂^F_ℓ(F_{t−1})⟩ − (h/2) · ‖X_t‖_2 }\n7: π_t ← arg max_{κ∈C} ⟨X_t, B̂^A_κ(A_{t−1})⟩\n8: end if\n9: end for\n\n4 Simulations\n\nWe compared the REAL-Bandit algorithm with four other algorithms: the OLS-Bandit of [14] (we use the improved version from [6] that filters sub-optimal arms), the LASSO-Bandit of [6], OFUL of [2], which is based on the Upper Confidence Bound (UCB) idea, and Thompson sampling (the version from [37]). Taking k = 201, d = 200, and r = 3, we generated the matrix B⋆ as UV^⊤, where the rows of U ∈ R^{201×3} and V ∈ R^{200×3} are drawn independently and uniformly from the unit sphere in R^3. The noise variance is 1 and the features are i.i.d. N(0, I_d). We gave Thompson sampling the true prior mean and variance of the arm parameters, and the true noise variance. Similarly, OFUL had access to the true noise variance. The other parameters of OLS-Bandit, LASSO-Bandit, and OFUL are selected as in [6]. We generated 10 data sets and executed all algorithms over a time horizon of length T = 40,000. Figure 1 shows the average cumulative regret (with 1 SE error bars) for all algorithms across these 10 runs. 
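The parameter matrix used in this simulation can be generated as follows. This is a sketch under our own variable names (the random seed is arbitrary), reproducing only the construction of B⋆ described above.

```python
import numpy as np

rng = np.random.default_rng(3)
k, d, r = 201, 200, 3  # dimensions used in the simulation

# Rows of U and V are independent uniform draws from the unit sphere in R^3:
# normalize i.i.d. Gaussian vectors row by row.
U = rng.standard_normal((k, r))
U /= np.linalg.norm(U, axis=1, keepdims=True)
V = rng.standard_normal((d, r))
V /= np.linalg.norm(V, axis=1, keepdims=True)
B_star = U @ V.T  # k x d parameter matrix of rank r

assert B_star.shape == (k, d)
assert np.linalg.matrix_rank(B_star) == r
# Each entry is an inner product of two unit vectors, hence lies in [-1, 1],
# so every row of B* has l2-norm at most sqrt(d).
assert np.all(np.abs(B_star) <= 1 + 1e-12)
```

Rewards for the runs above are then y_t = ⟨B⋆_κ, X_t⟩ + ε_t with X_t ∼ N(0, I_d) and unit-variance noise, as described in the text.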
The results of this simulation support our theoretical analysis: REAL-Bandit takes advantage of the low-rank structure of the problem parameters and significantly outperforms the other benchmarks, which do not leverage this structure.\n\n5 Analysis\n\nThis section is dedicated to the analysis of REAL-Bandit. We first state the assumptions underlying the analysis and then state the main theorem of this section. A discussion of some of these assumptions can be found in [6].\nAssumption 1 (Parameter set). Assume the rank of B⋆ is r, ‖B⋆‖_{∞,2} ≤ b⋆, and μ⋆ := μ(B⋆), where μ(·) is defined in (2).\nAssumption 2 (Margin condition). For any a > 0, there is a constant c_0 > 0 such that E[N_a] ≤ k c_0 a, where the random variable N_a is defined by N_a := Σ_{κ=1}^k I( ⟨X, B⋆_{κ∗} − B⋆_κ⟩ ≤ a · b⋆ · ‖X‖_2 ) and κ∗ = κ∗(X) is the optimal arm given the context vector X.\n\nFigure 1: Cumulative regret of REAL-Bandit versus LASSO-Bandit, OFUL, OLS-Bandit, and Thompson sampling for (k, d, r) = (201, 200, 3).\n\nAssumption 3 (Arm optimality). Let K_opt and K_sub be a partitioning of [k]. 
Then, for some h > 0, the following conditions hold:\n1) For any sub-optimal arm κ ∈ K_sub, ⟨X, B⋆_κ⟩ ≤ max_{κ′} ⟨X, B⋆_{κ′}⟩ − h · ‖X‖_2 for any context X.\n2) For each arm κ ∈ K_opt, P(X ∈ U_κ) ≥ p∗ / |K_opt|, where U_κ is defined by\n\nU_κ := { X ∈ R^d | ⟨X, B⋆_κ⟩ > max_{κ′≠κ} ⟨X, B⋆_{κ′}⟩ + h · ‖X‖_2 } .\n\n3) There exists a constant q∗ > 0 such that max_{κ∈K_opt} P(X ∈ V_κ) ≤ q∗ / |K_opt|, where the set V_κ is defined as\n\nV_κ := { X ∈ R^d | ⟨X, B⋆_κ⟩ > max_{κ′≠κ} ⟨X, B⋆_{κ′}⟩ − h · ‖X‖_2 } .\n\nAssumption 4 (Diversity). For all κ ∈ K_opt, the distributions P, P_{U_κ}, and P_{V_κ} are (γ_min, γ_max, σ_X)-diverse.\nAssumption 5 (Low-rank estimators). Assume the following tail bounds hold for B̄^F and B̄^A:\n1) Let J be a set of n time periods such that, for each arm κ ∈ [k], the matrix X^κ of the context vectors associated with arm κ has i.i.d. rows sampled from P. Then,\n\nP( ‖B̄^F(J) − B⋆‖_F ≥ δ ) ≤ exp( −c_1 δ^2 n / ((k + d) r) )\n\nholds when n / log(n) ≥ c_2 (1 + 1/δ^2) r (k + d), where δ := √(γ_min/γ_max) · √(k/r) · h / (64 μ⋆), and c_1, c_2 are positive constants.\n2) Let J be a set of n observations such that, for all κ ∈ K_opt, J^κ is a set of (n p∗/2, 2 n q∗)-approximately independent observations. 
Then, we get\n\nP( ‖Π_opt[B̄^A(J) − B⋆]‖_F ≥ ((√d σ_ε ∨ b⋆) / √(d γ_min)) · √( c_3 r (k + d) log(n) / (n p∗) ) ) ≤ 1/n ,\n\nprovided that n / log(n) ≥ c_2 r (k + d), for some constant c_3 > 0, where Π_opt : R^{k×d} → R^{k×d} denotes the linear map that sets the rows corresponding to the sub-optimal arms to zero and keeps the rest unchanged.\n\nNow, we are prepared to state our main theoretical result.\n\nTheorem 1. If Assumptions 1–5 hold, then the cumulative regret of Algorithm 2 is bounded above by\n\nR_T / x_n ≤ C [ c_2 b⋆ (1 + 1/δ^2) r (k + d) log(T) ] + C′ [ c_0 r^2 (k + d) log(T)^2 / (b⋆ p∗) ] ,\n\nwhere C > 0 is a constant, the forced-sampling parameter ρ is set to 2 c_2 (1 + δ^{−2}) r (k + d), and\n\nC′ := c_3 μ⋆^2 · (γ_max^2 / γ_min^2) · (q∗ / p∗) · ( (d σ_ε^2 ∨ b⋆^2) / (d γ_min) ) ,  x_n := sup_{‖V‖_2 = 1} E[ ‖X‖_2 | X = ‖X‖_2 V ] .\n\nBefore describing the proof of Theorem 1, we state four key lemmas that will be used in the proof. Due to space limitations, we defer the proofs of the lemmas to the extended version of the paper [17].\nLemma 1. The forced-sampling sets created by the force-sampling rule (7) satisfy the following inequalities, for all t ≥ 2ρ log(ρ), provided that ρ ≥ 24:\n\nP( |F_t| ≥ 6ρ log t ) ≤ t^{−1}  and  P( |F_t| ≤ [ρ/2] log t ) ≤ t^{−3} .\n\nLemma 2. Let I be a (deterministic) subset of the forced-sampling observations, and by I^κ ⊆ I we denote the observations corresponding to arm κ ∈ [k]. 
Then, the following inequality holds:
$$\mathbb{P}\left( \frac{|\mathcal{I}^\kappa|}{|\mathcal{I}|} \le \frac{1}{2k} \,\middle|\, |\mathcal{I}| \right) \le \exp\left( -\frac{|\mathcal{I}|}{8k} \right).$$

Lemma 3. For all $t \ge 10c_2(1 + \delta^{-2})r(k+d)\log(kd)$ and $\kappa \in [k]$, with probability at least $1 - 10t^{-3}$, the following inequality holds:
$$\left\| \widehat{B}^F_\kappa - B^\star_\kappa \right\|_2 \le h/4\,.$$

Lemma 4. For all $t > 10c_2 r(k+d)\log(kd)$ and $\kappa \in \mathcal{K}_{\mathrm{opt}}$, with probability at least $1 - 100r(k+d)t^{-1}$, we have
$$\left\| \widehat{B}^A_\kappa - B^\star_\kappa \right\|_2 \le 10\sqrt{\frac{C' r^2(k+d)\log(t)}{k p^\star t}}\,.$$

Proof of Theorem 1. Following the lines of the proof of Theorem 1 in [6], we define $G(\cdot)$ as
$$G(F_t) := \begin{cases} 1 & \text{if } \left\|\widehat{B}^F_\kappa(F_t) - B^\star_\kappa\right\|_2 \le h/4 \text{ for all } \kappa \in [k], \\ 0 & \text{otherwise}. \end{cases}$$
Define $c_4 := 10c_2(1 + \delta^{-2})r(k+d)\log(kd)$. Then, we split the regret of the algorithm into the following three cases and bound each case separately:
(a) Initialization (i.e., when $t \le c_4$) and forced-sampling rounds.
(b) When $t > c_4$ and $G(F_{t-1}) = 0$.
(c) When $t > c_4$ and $G(F_{t-1}) = 1$, but a suboptimal arm is chosen due to inaccurate all-sampling estimates.
Let $R^{(a)}_T$, $R^{(b)}_T$, and $R^{(c)}_T$ denote the regret incurred in the above cases, respectively.
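For intuition, the event $G(F_t)$ that drives the case split can be sketched as a simple row-wise test; this is a hypothetical helper assuming plain numpy arrays in place of the paper's estimators $\widehat{B}^F_\kappa$.

```python
import numpy as np

def good_event(B_hat_F, B_star, h):
    """Return 1 iff every row of the forced-sampling estimate is within h/4
    of the corresponding row of B_star in Euclidean norm (the indicator G)."""
    row_errors = np.linalg.norm(B_hat_F - B_star, axis=1)
    return int(np.all(row_errors <= h / 4.0))

B_star = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
h = 0.4  # margin parameter; the row-error threshold is h/4 = 0.1

# Each row perturbed by 0.05 per coordinate: row error ~ 0.071 <= 0.1, so G = 1.
assert good_event(B_star + 0.05, B_star, h) == 1
# Perturbation of 0.2 per coordinate: row error ~ 0.283 > 0.1, so G = 0.
assert good_event(B_star + 0.2, B_star, h) == 0
```

Rounds with $G(F_{t-1}) = 0$ fall into case (b); Lemma 3 shows such rounds are rare, which is what keeps the case-(b) regret logarithmic in $T$.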
Clearly, we have $R_T = R^{(a)}_T + R^{(b)}_T + R^{(c)}_T$.

Before proving the upper bounds, note that, for each suboptimal choice, the regret incurred at each step is at most $\langle X, B^\star_\kappa - B^\star_{\kappa'}\rangle$ for some $\kappa, \kappa'$, which in turn is bounded above by
$$\left| \langle X, B^\star_\kappa - B^\star_{\kappa'}\rangle \right| \le \|X\|_2 \cdot \left\| B^\star_\kappa - B^\star_{\kappa'} \right\|_2 \le 2b^\star \cdot \|X\|_2\,.$$
This fact can be used to obtain regret bounds by bounding the number of times that each suboptimal arm is pulled. Clearly, for part (a), this number is at most $c_4 + |F_T|$. Using Lemma 1,
$$\mathbb{E}\left[ R^{(a)}_T \right] \le 2b^\star x_n\, \mathbb{E}\left[ c_4 + |F_T| \right] \le 2b^\star x_n \left( c_4 + 6\rho\log T \right).$$

Next, it follows from the definition of $G(F_{t-1})$ and Lemma 3 that the number of times that $G(F_{t-1}) = 0$ is controlled by
$$\mathbb{E}\left[ \sum_{t=c_4+1}^T \big[1 - G(F_{t-1})\big] \right] = \sum_{t=c_4+1}^T \mathbb{P}\big( G(F_{t-1}) = 0 \big) \le \sum_{t=c_4+1}^T 10t^{-3} \le \sum_{t=c_4+1}^T 10t^{-1} \le 10\log(T)\,.$$
Now, since $R^{(b)}_T \le 2b^\star\, \mathbb{E}\big[ \sum_{t=c_4+1}^T \|X_t\|_2 \cdot [1 - G(F_{t-1})] \big]$, we have
$$R^{(b)}_T \le 2b^\star x_n\, \mathbb{E}\left[ \sum_{t=c_4+1}^T \big(1 - G(F_{t-1})\big) \right] \le 20\, b^\star x_n \log(T)\,.$$

Finally, we need to find an upper bound for $R^{(c)}_T$. It follows from a slightly modified version of Lemma EC.18 in [6] that, whenever $G(F_{t-1}) = 1$, the set $\mathcal{C}$ contains the optimal arm and no suboptimal arm. In particular, if the best arm is $\kappa^\star$, we get the following inequality for all $\kappa \in \mathcal{C}$:
$$0 < \langle X_t, B^\star_{\kappa^\star} - B^\star_\kappa \rangle < h \cdot \|X_t\|_2\,. \tag{8}$$
Therefore, whenever $G(F_{t-1}) = 1$, for any $\kappa \in \mathcal{C}$ we have $X \in V_\kappa$, and if $X \in U_\kappa$, then $\mathcal{C} = \{\kappa\}$. Now, we are ready to use Lemma 4 to bound the probability of pulling an incorrect arm. Let
$$E^A_t := \left\{ \exists \kappa \in \mathcal{K}_{\mathrm{opt}} : \left\| \widehat{B}^A_\kappa(A_t) - B^\star_\kappa \right\|_2 > \sqrt{\frac{c_5}{t}} \right\}, \quad \text{where } c_5 := \frac{100\, C' r^2(k+d)\log(T)}{k p^\star}\,.$$
Then, using Lemma 4, we have for all $t > c_4$,
$$\mathbb{P}\left( E^A_t \right) \le \frac{100(k+d)r}{t}\,. \tag{9}$$
Now, recall that $\kappa^\star$ denotes the optimal arm and $\pi_t$ is the pulled arm. For (random variable) $\kappa \in [k]$, define $D_\kappa := \big\{ \langle X_t, B^\star_{\kappa^\star} - B^\star_\kappa \rangle \ge 2\sqrt{c_5/t} \cdot \|X_t\|_2 \big\}$. It follows from the definition of $r_t$ that
$$r_t = \mathbb{E}\left[ \langle X_t, B^\star_{\kappa^\star} - B^\star_{\pi_t} \rangle \right] \le \mathbb{E}\left[ \langle X_t, B^\star_{\kappa^\star} - B^\star_{\pi_t} \rangle\, \mathbb{I}\big( D_{\pi_t} \cup E^A_t \big) \right] + \mathbb{E}\left[ \langle X_t, B^\star_{\kappa^\star} - B^\star_{\pi_t} \rangle\, \mathbb{I}\big( D^c_{\pi_t} \cap E^{A\,c}_t \big) \right]$$
$$\le \mathbb{E}\left[ \langle X_t, B^\star_{\kappa^\star} - B^\star_{\pi_t} \rangle\, \mathbb{I}\Big( \big\{ \langle X_t, \widehat{B}^A_{\pi_t} \rangle \ge \langle X_t, \widehat{B}^A_{\kappa^\star} \rangle \big\} \cap \big( D_{\pi_t} \cup E^A_t \big) \Big) \right] + \mathbb{E}\left[ \langle X_t, B^\star_{\kappa^\star} - B^\star_{\pi_t} \rangle\, \mathbb{I}\big( D^c_{\pi_t} \cap E^{A\,c}_t \big) \right]$$
$$\le 2x_n \left[ b^\star\, \mathbb{P}\Big( \big\{ \langle X_t, \widehat{B}^A_{\pi_t} \rangle \ge \langle X_t, \widehat{B}^A_{\kappa^\star} \rangle \big\} \cap \big( D_{\pi_t} \cup E^A_t \big) \Big) + 2\sqrt{\frac{c_5}{t}}\, \mathbb{P}\big( D^c_{\pi_t} \cap E^{A\,c}_t \big) \right].$$
Note that $\langle X_t, \widehat{B}^A_{\pi_t} \rangle \ge \langle X_t, \widehat{B}^A_{\kappa^\star} \rangle$, in combination with the definition of $D_{\pi_t}$, implies that
$$0 \ge \big\langle X_t, \widehat{B}^A_{\pi_t} - \widehat{B}^A_{\kappa^\star} \big\rangle = \big\langle X_t, \widehat{B}^A_{\pi_t} - B^\star_{\pi_t} \big\rangle + \big\langle X_t, B^\star_{\pi_t} - B^\star_{\kappa^\star} \big\rangle + \big\langle X_t, B^\star_{\kappa^\star} - \widehat{B}^A_{\kappa^\star} \big\rangle\,,$$
and this entails that at least one of the following inequalities holds:
$$\|X_t\|_2 \cdot \big\| \widehat{B}^A_{\pi_t} - B^\star_{\pi_t} \big\|_2 \ge \Big| \big\langle X_t, \widehat{B}^A_{\pi_t} - B^\star_{\pi_t} \big\rangle \Big| \ge \sqrt{\frac{c_5}{t}} \cdot \|X_t\|_2 \quad \text{or} \quad \|X_t\|_2 \cdot \big\| \widehat{B}^A_{\kappa^\star} - B^\star_{\kappa^\star} \big\|_2 \ge \Big| \big\langle X_t, B^\star_{\kappa^\star} - \widehat{B}^A_{\kappa^\star} \big\rangle \Big| \ge \sqrt{\frac{c_5}{t}} \cdot \|X_t\|_2\,.$$
In other words,
$$\big\{ \langle X_t, \widehat{B}^A_{\pi_t} \rangle \ge \langle X_t, \widehat{B}^A_{\kappa^\star} \rangle \big\} \cap D_{\pi_t} \subseteq E^A_t\,.$$
This fact, combined with (9), means that for all $t > c_4$, the following holds:
$$\mathbb{P}\Big( \big\{ \langle X_t, \widehat{B}^A_{\pi_t} \rangle \ge \langle X_t, \widehat{B}^A_{\kappa^\star} \rangle \big\} \cap \big( D_{\pi_t} \cup E^A_t \big) \Big) \le \frac{100(k+d)r}{t}\,. \tag{10}$$
Since $\mathcal{C}$ does not contain any suboptimal arm, we have that
$$D^c_{\pi_t} \cap E^{A\,c}_t \subseteq \left\{ 0 < \langle X_t, B^\star_{\kappa^\star} - B^\star_{\pi_t} \rangle < 2\sqrt{\frac{c_5}{t}} \cdot \|X_t\|_2 \right\}.$$
Finally, by using the margin condition, we get that
$$\mathbb{P}\big( D^c_{\pi_t} \cap E^{A\,c}_t \big) \le \mathbb{P}\big( D^c_{\pi_t} \big) \le \frac{2kc_0}{b^\star} \sqrt{\frac{c_5}{t}}\,.$$
Therefore, using (10), we have
$$r_t \le \frac{2x_n}{t} \left[ 100 \cdot r(k+d) \cdot b^\star + \frac{4kc_0 c_5}{b^\star} \right],$$
and hence
$$R^{(c)}_T \le \sum_{t=c_4+1}^T \frac{2x_n}{t} \left[ 100 \cdot r(k+d) \cdot b^\star + \frac{4kc_0 c_5}{b^\star} \right] \le 2x_n \left[ 100 \cdot r(k+d) \cdot b^\star + \frac{4kc_0 c_5}{b^\star} \right] \log(T)\,.$$

Acknowledgments

The authors gratefully acknowledge
support of the National Science Foundation (CAREER award CMMI: 1554140), the Stanford Data Science Initiative, and the Human-Centered AI Initiative.

References

[1] Yasin Abbasi-Yadkori. Online Learning for Linearly Parametrized Control Problems. PhD thesis, 2012.

[2] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312-2320, 2011.

[3] S. Agrawal, V. Avadhanula, V. Goyal, and A. Zeevi. MNL-Bandit: A dynamic learning approach to assortment selection. arXiv e-prints, June 2017.

[4] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML (3), pages 127-135, 2013.

[5] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397-422, 2003.

[6] Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates, 2015. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2661896.

[7] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

[8] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111-119, 2009.

[9] Emmanuel J. Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925-936, 2010.

[10] Rich Caruana. Multitask learning. Machine Learning, 28(1):41-75, 1997.

[11] Yuxin Chen, Yuejie Chi, Jianqing Fan, Cong Ma, and Yuling Yan. Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. arXiv e-prints, arXiv:1902.07698, Feb 2019.

[12] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208-214, 2011.

[13] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. 2008.

[14] Alexander Goldenshluger and Assaf Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230-261, 2013.

[15] David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548-1566, 2011.

[16] Nima Hamidi and Mohsen Bayati. On low-rank trace regression under general sampling distribution. arXiv preprint arXiv:1904.08576, 2019.

[17] Nima Hamidi, Mohsen Bayati, and Kapil Gupta. Personalizing many decisions with high-dimensional covariates. arXiv preprint, 2019. Extended version.

[18] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 665-674. ACM, 2013.

[19] Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. Peeking at A/B tests: Why it matters, and what to do about it. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1517-1525, New York, NY, USA, 2017. ACM.

[20] Kwang-Sung Jun, Rebecca Willett, Stephen Wright, and Robert Nowak. Bilinear bandits with low-rank structure. arXiv e-prints, arXiv:1901.02470, January 2019.

[21] N. Kallus and M. Udell. Dynamic assortment personalization in high dimensions. arXiv e-prints, October 2016.

[22] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980-2998, 2009.

[23] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11(Jul):2057-2078, 2010.

[24] Olga Klopp et al. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1):282-303, 2014.

[25] Tomáš Kocák, Michal Valko, Rémi Munos, and Shipra Agrawal. Spectral Thompson sampling. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI'14, pages 1911-1917. AAAI Press, 2014.

[26] Vladimir Koltchinskii, Karim Lounici, Alexandre B. Tsybakov, et al. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302-2329, 2011.

[27] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

[28] Sahin Lale, Kamyar Azizzadenesheli, Anima Anandkumar, and Babak Hassibi. Stochastic linear bandits with hidden low rank structure. arXiv e-prints, arXiv:1901.09490, Jan 2019.

[29] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 817-824. Curran Associates, Inc., 2008.

[30] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661-670. ACM, 2010.

[31] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug):2287-2322, 2010.

[32] Daniel McFadden. Econometric models for probabilistic choice among products. The Journal of Business, 53(3):S13-S29, 1980.

[33] Sahand Negahban and Martin J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13(May):1665-1697, 2012.

[34] Ian Osband, Dan Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003-3011, 2013.

[35] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(Dec):3413-3430, 2011.

[36] Angelika Rohde, Alexandre B. Tsybakov, et al. Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2):887-930, 2011.

[37] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221-1243, 2014.

[38] Steven L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639-658, 2010.

[39] Steven L. Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37-45, 2015.

[40] Nathan Srebro, Noga Alon, and Tommi S. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1321-1328. 2005.

[41] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285-294, 1933.

[42] Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, pages II-46-II-54.
JMLR.org, 2014.