{"title": "Bandit Learning with Implicit Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 7276, "page_last": 7286, "abstract": "Implicit feedback, such as user clicks, although abundant in online information service systems, does not provide substantial evidence on users' evaluation of system's output. Without proper modeling, such incomplete supervision inevitably misleads model estimation, especially in a bandit learning setting where the feedback is acquired on the fly. In this work, we perform contextual bandit learning with implicit feedback by modeling the feedback as a composition of user result examination and relevance judgment. Since users' examination behavior is unobserved, we introduce latent variables to model it. We perform Thompson sampling on top of variational Bayesian inference for arm selection and model update. Our upper regret bound analysis of the proposed algorithm proves its feasibility of learning from implicit feedback in a bandit setting; and extensive empirical evaluations on click logs collected from a major MOOC platform further demonstrate its learning effectiveness in practice.", "full_text": "Bandit Learning with Implicit Feedback\n\nYi Qi1, Qingyun Wu2, Hongning Wang2, Jie Tang1, Maosong Sun1\n\n1 State Key Lab of Intell. Tech. & Sys.,\nInstitution for Arti\ufb01cial Intelligence,\n\nDept. of Comp. Sci. & Tech., Tsinghua University, Beijing, China\n\n2 Department of Computer Science, University of Virginia\n\nqi-y16@mails.tsinghua.edu.cn, {jietang, sms}@tsinghua.edu.cn\n\n{qw2ky,hw5x}@virginia.edu\n\nAbstract\n\nImplicit feedback, such as user clicks, although abundant in online information ser-\nvice systems, does not provide substantial evidence on users\u2019 evaluation of system\u2019s\noutput. Without proper modeling, such incomplete supervision inevitably misleads\nmodel estimation, especially in a bandit learning setting where the feedback is ac-\nquired on the \ufb02y. 
In this work, we perform contextual bandit learning with implicit feedback by modeling the feedback as a composition of user result examination and relevance judgment. Since users' examination behavior is unobserved, we introduce latent variables to model it. We perform Thompson sampling on top of variational Bayesian inference for arm selection and model update. Our upper regret bound analysis of the proposed algorithm proves its feasibility of learning from implicit feedback in a bandit setting; and extensive empirical evaluations on click logs collected from a major MOOC platform further demonstrate its learning effectiveness in practice.

1 Introduction

Contextual bandit algorithms [4, 20, 19] provide modern information service systems an effective solution to adaptively find good mappings between available items and users. This family of algorithms sequentially selects items to serve users based on side information about the user and items, while adapting its selection strategy according to the immediate user feedback so as to maximize users' long-term satisfaction. They have been widely deployed in practical systems for content recommendation [20, 5, 26] and display advertising [6, 22].

However, the dominant form of user feedback in such systems is implicit feedback, such as clicks, which is known to be a biased and incomplete reflection of users' evaluation of the system's output [16, 11]. For example, a user may skip a recommended item not because he/she dislikes the item, but simply because he/she does not examine that display position, i.e., position bias [13]. Unfortunately, a common practice in contextual bandit applications is to simply treat no click as a form of negative feedback [20, 25, 6].
This introduces inconsistency to model update, since the skipped items might\nnot be truly irrelevant, and it inevitably leads to suboptimal outputs of bandit algorithms over time.\nIn this work, we focus on learning contextual bandits with user click feedback, and model such\nimplicit feedback as a composition of user result examination and relevance judgment. Examination\nhypothesis [8], which is a fundamental assumption in click modeling, postulates that a user clicks on\na system\u2019s returned result if and only if that result has been examined by the user and it is relevant to\nthe user\u2019s information need at the moment. Because a user\u2019s examination behavior is unobserved,\nwe model it as a latent variable, and realize the examination hypothesis in a probabilistic model.\nWe de\ufb01ne the conditional probabilities of result examination and relevance judgment via logistic\nfunctions over the corresponding contextual features. To perform model update, we take a variational\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fBayesian approach to develop a closed form approximation to the posterior distribution of model\nparameters on the \ufb02y. This approximation also paves the way for an ef\ufb01cient Thompson sampling\nstrategy for arm selection in bandit learning. Our \ufb01nite time analysis proves that, despite the increased\ncomplexity in parameter estimation introduced by the latent variables, our Thompson sampling\npolicy based on the true posterior is guaranteed to achieve a sub-linear Bayesian regret with a high\nprobability. We also demonstrate that the regret of Thompson sampling based on the approximated\nposterior is well-bounded. 
In addition, we prove that when one fails to model result examination in click feedback, a linearly increasing regret is possible, as the model cannot differentiate examination-driven skips from relevance-driven skips in the negative feedback.

We tested the algorithm in XuetangX1, a major Massive Open Online Course (MOOC) platform in China, for personalized education. To personalize students' learning experience on this platform, we recommend quiz-like questions in the form of banners on top of the lecture videos while students are watching them. The algorithm needs to decide where in a video to display which question to a target student. If the student feels the displayed question is helpful for him/her to understand the lecture content, he/she can click on the banner to read the answer and more related online content about the question. Therefore, our goal is to maximize the click-through rate (CTR) on the selected questions. Several properties of this application amplify the bias and incompleteness of click feedback. First, out of consideration for user experience, to minimize the risk of annoying any student, the display time of a banner is limited to a few seconds. Second, as this feature is newly introduced to the platform, many users might not realize that they can click on the question to read more related content about it. As a result, no click on a question does not necessarily indicate its irrelevance. We tested the algorithm in this application over a four-month period, where a total of 69 questions were manually compiled for the algorithm to select from, over 20 major videos with more than 100 thousand student video-watching sessions.
Based on the unbiased offline evaluation policy [21], our algorithm achieved an 8.9% CTR lift compared to standard contextual bandits [20, 9] which do not model users' examination behavior.

2 Related Works

As has been extensively studied in click modeling of user search results [7], various factors affect users' click decisions; among them, result examination plays a central role [13, 8]. Unfortunately, most applications of bandit algorithms simply treat user clicks as explicit feedback for model update [20, 25, 6, 26], where no click on a selected result is considered negative feedback. This inevitably leads to inaccurate model updates and sub-optimal arm selection. There is a line of related research that develops click-model-based bandit algorithms for learning-to-rank problems. For example, by assuming that skipped documents are less attractive than subsequently clicked ones in a ranked list, Kveton et al. [17] developed a cascading bandit model to learn from both clicks and skips in search results. To enable learning from multiple clicks in the same result ranking list, they adopted the dependent click model [10] to infer user satisfaction after a sequence of clicks [14], and later further extended it to broader types of click models [27]. However, such algorithms aim at estimating the best ranking of results on a per-query basis, without specifying any specific ranking function. Hence, it is hard for them to generalize to unseen queries. This directly limits their application scenarios in practice. The solution developed in Lagrée et al. [18] is the closest to ours; it exploits the bias in reward distributions induced by different examination probabilities at different display positions. Yet they assumed the examination probability only depends on position, while we allow any reasonable feature to be a determinant.
Besides, they postulated that the probability of examination at each position is either heuristically set or empirically estimated, and thereafter fixed; in contrast, we estimate it on the fly from the observations obtained by interacting with users.

Another line of related research is bandit learning with latent variables. Maillard and Mannor studied the problem of latent bandits [23], which assumes reward distributions are clustered and the clusters are determined by some latent variables. They only studied the problem in a context-free setting, and only a very weak performance guarantee is provided when the reward distribution in those clusters is unknown. Kawale et al. developed a Thompson sampling scheme for online matrix factorization [15]. Latent features are extracted via online low-rank matrix completion based on samples selected by Thompson sampling on the fly. Due to the ad-hoc combination of the factorization method and the bandit method, little theoretical analysis was provided. Wang et al. studied the problem of latent feature learning for contextual bandits [25]. They extended arms' context vectors with latent features under a linear reward structure, and applied the upper confidence bound principle over coordinate descent to iteratively estimate the hidden features and model parameters. The linear reward structure prohibits it from recognizing the nonlinear dependency between result examination and relevance judgment in click feedback. Moreover, their regret analysis depends heavily on the initialization of the algorithm, which could be hard to achieve in practice.

1 http://www.xuetangx.com/

3 Problem Setup

We consider a contextual bandit problem with a finite, but possibly large, number of arms. Denote the arm set as A.
At each trial t = 1, ..., T, the learner observes a subset of candidate arms A_t with A_t ⊆ A, where each arm a is associated with a context vector x^a summarizing the side information about the arm. Once an arm a_t ∈ A_t is chosen according to some policy π, corresponding implicit binary feedback C_{a_t}, e.g., user click, will be given to the learner as the reward. The learner's goal is to adjust its arm selection strategy to maximize its cumulative reward over time. What makes this problem unique and challenging is that C_{a_t} does not truly reflect users' evaluation of the selected arm a_t. Based on the examination hypothesis [13, 8], when C_{a_t} = 1, the chosen a_t must be relevant to the user's information need at time t; but when C_{a_t} = 0, a_t might still be relevant but the user just does not examine it. Unfortunately, the result examination condition is unobserved to the learner.

We model a user's result examination via a binary latent variable E_{a_t}, and assume that the context vector x^a_t of arm a can be decomposed into (x^a_{C,t}, x^a_{E,t}), where the dimensions of x^a_{C,t} and x^a_{E,t} are d_C and d_E respectively. Accordingly, users' result examination and relevance judgment decisions are assumed to be jointly governed by (x^a_{C,t}, x^a_{E,t}) and the corresponding bandit parameter θ* = (θ*_C, θ*_E). In the rest of this paper, when no ambiguity is introduced, we drop the index a to simplify the notations. As a result, we make the following generative assumption about an observed click C_t on arm a_t,

P(C_t = 1 | E_t = 0, x_{C,t}) = 0
P(C_t = 1 | E_t = 1, x_{C,t}) = ρ(x_{C,t}^T θ*_C)
P(E_t = 1 | x_{E,t}) = ρ(x_{E,t}^T θ*_E)

where ρ(x) = 1/(1 + e^{-x}). Based on this assumption, we have E[C_t | x_t] = ρ(x_{C,t}^T θ*_C) ρ(x_{E,t}^T θ*_E). As a result, the observed click feedback C_t is a sample from this generative process. Define f_θ(x) := E[C | x, θ] = ρ(x_C^T θ_C) ρ(x_E^T θ_E). The accumulated regret of a policy π up to time T is formally defined as,

Regret(T, π, θ*) = Σ_{t=1}^T [ max_{a ∈ A_t} f_{θ*}(x^a) − f_{θ*}(x^{a_t}) ]

where x^{a_t} := (x^{a_t}_C, x^{a_t}_E) is the context vector of the arm a_t ∈ A_t selected by the policy π at time t based on the history H_t := {(A_i, x_i, C_i)}_{i=1}^{t−1}. The Bayesian regret is defined by E[Regret(T, π, θ*)], where the expectation is taken with respect to the prior distribution over θ*, and it can be written as,

BayesRegret(T, π) = Σ_{t=1}^T E[ max_{a ∈ A_t} f_{θ*}(x^a) − f_{θ*}(x^{a_t}) ]

In our online learning setting, the objective is to find the policy π that minimizes the accumulated regret over T.

4 Algorithm

The learner needs to estimate the bandit parameters θ*_C and θ*_E based on its interactively obtained click feedback {x_i, C_i}_{i=1}^t over time. Ideally, this estimation can be obtained by maximizing the data likelihood with respect to the bandit model parameters. However, the inclusion of examination as a latent variable in our bandit learning setting poses serious challenges to both parameter estimation and arm selection. Neither the conventional least square estimator nor the maximum likelihood estimator can be easily obtained, let alone computational efficiency, due to the non-convexity of the corresponding optimization problem. To make things even worse, the two popular bandit learning paradigms, the upper confidence bound principle [1] and Thompson sampling [3], both demand an accurate estimation of the bandit parameters and their uncertainty.
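To make the generative assumption concrete, the following sketch simulates clicks under the examination hypothesis and checks that the empirical click rate matches f_θ(x) = ρ(x_C^T θ_C) ρ(x_E^T θ_E); the parameter and feature values below are illustrative, not taken from the paper.

```python
import math
import random

def rho(x):
    """Logistic function rho(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_click(x_C, x_E, theta_C, theta_E, rng):
    """Draw one click under the examination hypothesis:
    C = 1 iff the result is examined AND judged relevant."""
    examined = rng.random() < rho(dot(x_E, theta_E))   # P(E = 1 | x_E)
    relevant = rng.random() < rho(dot(x_C, theta_C))   # P(C = 1 | E = 1, x_C)
    return int(examined and relevant)

def expected_ctr(x_C, x_E, theta_C, theta_E):
    """f_theta(x) = rho(x_C^T theta_C) * rho(x_E^T theta_E)."""
    return rho(dot(x_C, theta_C)) * rho(dot(x_E, theta_E))

rng = random.Random(0)
theta_C, theta_E = [0.5, -0.2], [0.8, 0.1]   # illustrative parameters
x_C, x_E = [1.0, 0.3], [0.4, 1.0]            # illustrative features
clicks = sum(sample_click(x_C, x_E, theta_C, theta_E, rng) for _ in range(200000))
emp_ctr = clicks / 200000.0   # should be close to expected_ctr(x_C, x_E, ...)
```

Note that the same expected CTR can arise from a relevant-but-rarely-examined arm and an irrelevant-but-always-examined arm, which is exactly the ambiguity a no-click observation carries.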
In this section, we present an efficient new solution to tackle these two challenges, which makes use of the variational Bayesian inference technique to learn the parameters approximately on the fly, as well as to bridge parameter estimation and arm selection policy design.

4.1 Variational Bayesian inference for parameter estimation

To complete the generative process defined in Section 3, we further assume θ_C and θ_E follow the Gaussian distributions N(θ̂_C, Σ_C) and N(θ̂_E, Σ_E) respectively. We are interested in developing a closed form approximation to their posteriors when a newly obtained observation (x_C, x_E, C) becomes available. By applying Bayes' rule in the log space, we have,

log P(θ_C, θ_E | x_C, x_E, C)
= log P(C | θ_C, θ_E, x_C, x_E) + log P(θ_C, θ_E) + const
= C log[ρ(x_C^T θ_C) ρ(x_E^T θ_E)] + (1 − C) log[1 − ρ(x_C^T θ_C) ρ(x_E^T θ_E)]
  − (1/2)(θ_C − θ̂_C)^T Σ_C^{−1} (θ_C − θ̂_C) − (1/2)(θ_E − θ̂_E)^T Σ_E^{−1} (θ_E − θ̂_E) + const

The key idea is to develop a variational lower bound in the quadratic form of θ_C and θ_E for the log-likelihood function. Because of the convexity of log ρ(x) − x/2 with respect to x² (see Appendix B.1) and Jensen's inequality for log x (see Appendix B.2), a lower bound of the required form is achievable. When C = 1, by Eq (16) in Appendix B.3, we have,

l_{C=1}(x_C, x_E, θ) := log[ρ(x_C^T θ_C) ρ(x_E^T θ_E)] ≥ g(x_C^T θ_C, ξ_C) + g(x_E^T θ_E, ξ_E)   (1)

where g(x, ξ) := (x − ξ)/2 + log ρ(ξ) − λ(ξ)(x² − ξ²), λ(ξ) = tanh(ξ/2)/(4ξ), x, ξ ∈ R. More specifically, g(x, ξ) is a polynomial of degree 2 with respect to x.
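As a quick sanity check on the shape of this bound (our own sketch, not the paper's code): g(x, ξ) lower-bounds log ρ(x) for every choice of the variational parameter ξ, and is tight at ξ = ±x.

```python
import math

def rho(x):
    return 1.0 / (1.0 + math.exp(-x))

def lam(xi):
    """lambda(xi) = tanh(xi/2) / (4*xi); its limit at xi = 0 is 1/8."""
    return 0.125 if xi == 0.0 else math.tanh(xi / 2.0) / (4.0 * xi)

def g(x, xi):
    """Quadratic-in-x lower bound on log rho(x) with variational parameter xi."""
    return (x - xi) / 2.0 + math.log(rho(xi)) - lam(xi) * (x * x - xi * xi)

# g(x, xi) <= log rho(x) everywhere, with equality at xi = +/- x
for x in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    for xi in [-2.0, 0.0, 0.7, 3.0]:
        assert g(x, xi) <= math.log(rho(x)) + 1e-12
    assert abs(g(x, x) - math.log(rho(x))) < 1e-12
```

Because g is quadratic in x, substituting x = x^T θ keeps the bound quadratic in θ, which is what allows a Gaussian approximation of the posterior below.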
When C = 0, by Eq (17) in Appendix B.3, we have,

l_{C=0}(x_C, x_E, θ) := log[1 − ρ(x_C^T θ_C) ρ(x_E^T θ_E)]
≥ H(q) + q g(−x_C^T θ_C, ξ_C) + q g(x_E^T θ_E, ξ_{E,1}) + (1 − q) g(−x_E^T θ_E, ξ_{E,2})   (2)

where H(q) := −q log q − (1 − q) log(1 − q). Once the lower bound in the quadratic form is established, we can use a Gaussian distribution to approximate our target posterior, whose mean and covariance matrix are determined by the following equations,

Σ_{C,post}^{−1} = Σ_C^{−1} + 2 q^{1−C} λ(ξ_C) x_C x_C^T   (3)
θ̂_{C,post} = Σ_{C,post} (Σ_C^{−1} θ̂_C + (1/2)(−q)^{1−C} x_C)   (4)
Σ_{E,post}^{−1} = Σ_E^{−1} + 2 λ(ξ_E) x_E x_E^T   (5)
θ̂_{E,post} = Σ_{E,post} (Σ_E^{−1} θ̂_E + (1/2)(2q − 1)^{1−C} x_E)   (6)

where the subscript "post" denotes the parameters of the Gaussian distributions that approximate the desired posteriors. Consecutive observations can be incorporated into the approximated posteriors sequentially. There is one thing left to decide, i.e., the choice of the variational parameters (ξ_C, ξ_E, q). A typical criterion is to choose the values such that the likelihood on the observations is maximized. Similar to the choice made by [12], we choose the closed form update formulas of those variational parameters as,

ξ_C = sqrt(E_{θ_C}[(x_C^T θ_C)²]),   ξ_E = sqrt(E_{θ_E}[(x_E^T θ_E)²])

q = exp(g(−x_C^T θ_C, ξ_C) + g(x_E^T θ_E, ξ_E) − g(−x_E^T θ_E, ξ_E)) / (1 + exp(g(−x_C^T θ_C, ξ_C) + g(x_E^T θ_E, ξ_E) − g(−x_E^T θ_E, ξ_E)))

where all the expectations are taken under the approximated posteriors. Empirically, we found that the iterative update of the approximated posterior and the variational parameters converges quite rapidly, such that it usually only needs a few rounds of iterations to reach a satisfactory local maximum in our experiments.

Algorithm 1 Thompson sampling for E-C Bandit
1: Initialize Σ_C = I, Σ_E = I, θ̂_C = θ_{C,0}, θ̂_E = θ_{E,0}.
2: for t = 0, 1, 2, ... do
3:   Observe the available arm set A_t ⊆ A and its corresponding context set X_t := {(x^a_C, x^a_E) : a ∈ A_t}.
4:   Randomly sample θ̃_C ~ N(θ̂_C, Σ_C), θ̃_E ~ N(θ̂_E, Σ_E).
5:   Select: a_t = argmax_{a ∈ A_t} ρ((x^a_C)^T θ̃_C) ρ((x^a_E)^T θ̃_E)
6:   Play the selected arm a_t and observe the reward C_t.
7:   Update Σ_C, θ̂_C, Σ_E, θ̂_E according to Eq (3) to Eq (6) respectively.
8: end for

4.2 Thompson sampling with approximated lower bound

Thompson sampling, also known as probability matching, is widely used in bandit learning to balance exploration and exploitation, and it shows great empirical performance [6]. Thompson sampling requires a distribution of the model parameters to sample from. In standard Thompson sampling [3], one is required to sample from the true posterior of the model parameters. But as logistic regression does not have a conjugate prior, the model defined in our problem does not have an exact posterior. We decide to sample from the approximated posterior as derived in Eq (3) to Eq (6). Later we will demonstrate this is a very tight posterior approximation. Once the sampling of (θ̃_C, θ̃_E) is complete,
we can select the corresponding arm a_t ∈ A_t which maximizes ρ(x_C^T θ̃_C) ρ(x_E^T θ̃_E). We name the resulting bandit algorithm the examination-click bandit, or E-C Bandit in short, and summarize it in Algorithm 1.

5 Regret Analysis

Recall that our objective is to find the policy that minimizes the Bayesian regret,

BayesRegret(T, π) = Σ_{t=1}^T E[ max_{a ∈ A_t} f_{θ*}(x^a) − f_{θ*}(x^{a_t}) ]

where f_θ(x) := E[C | x, θ] = ρ(x_C^T θ_C) ρ(x_E^T θ_E). Our algorithm, which is based on a maximum likelihood estimator, is equivalent to an estimator that minimizes a log-loss with binary random variables. In this section, we first bound the aggregate empirical discrepancy of the log-loss estimator used in our model in Proposition 1. This prepares for the upper bound of the generic Bayesian regret under a log-loss estimator with a Thompson sampling policy in Theorem 1. Based on this generic Bayesian regret bound, we study the upper bound of the Bayesian regret for our proposed E-C Bandit. Due to the space limit, we provide all the detailed proofs in the Appendix.

To further simplify our notations, we use f for f_θ, which is the reward function based on the estimated bandit parameter θ, and f_k for f_θ(x^{a_k}), i.e., the reward for arm a_k. We use f* for f_{θ*}, which is the reward function based on the ground-truth bandit parameter, and correspondingly f*_k for f_{θ*}(x^{a_k}). We assume that f* lies in a known function space F, where any f ∈ F is a function mapping from the arm set A to the range (0, 1). Define the log-loss estimator by f̂_t^LOGLOSS ∈ argmin_{f ∈ F} L_{2,t}(f), where L_{2,t}(f) is the aggregate log-loss written as −Σ_{k=1}^{t−1} l_k(f), with l_k(f) = C_k log f_k + (1 − C_k) log(1 − f_k). We have the following proposition,

Proposition 1. Denote the aggregate empirical discrepancy Σ_{k=1}^t (f_k − f*_k)² by ||f − f*||²_{E,t}. For all δ > 0 and α > 0, if F_t = {f ∈ F : ||f − f̂_t^LOGLOSS||_{E,t} ≤ sqrt(β*_t(F, δ, α))} for all t ∈ N, then

P(f* ∈ ∩_{t=1}^∞ F_t) > 1 − δ,   (7)

where β*_t(F, δ, α) is an appropriately constructed confidence parameter. In particular, it is defined as β*_t(F, δ, α) := (2/γ₀) log(N(F, α, ||·||_∞)/δ) + 2αη_t, where N(F, α, ||·||_∞) denotes the α-covering number of F, and γ₀ and η_t are quantities determined by m_f, M_f and t (see Appendix C for their exact forms), in which m_f, M_f ∈ R are lower and upper bounds of f such that 0 < m_f ≤ f ≤ M_f < 1 for any f ∈ F.

Remark 1. The proof is provided in Appendix C. Here we discuss two important details related to our later proof of E-C Bandit's regret. First, the precise optimization of f̂_t^LOGLOSS ∈ argmin_{f ∈ F} L_{2,t}(f) could be hard in some instances of F, for example, when F is a set of non-convex functions. Nevertheless, we can always resort to approximation methods to solve the optimization problem as long as the approximation error can be bounded. Indeed, in our E-C Bandit, we resort to variational inference to estimate f̂_t^LOGLOSS on the fly and find it works quite well in practice. Second, when f* ∉ F, this corresponds to the problem of model mis-specification. In this situation, the regret bound could be very poor, as the real regret could be linear with respect to time. To show this clearly in our case, in Appendix F we construct a situation in which the regret is inevitably linear if one fails to model the examination condition in click feedback and simply treats no click as negative feedback.

With Proposition 1, we have the following theorem, which bounds the Bayesian regret of the Thompson sampling strategy under a log-loss estimator.

Theorem 1.
For all T ∈ N, α > 0 and δ < 1/(2T), if π_TS denotes the policy derived from the log-loss estimator and a Thompson sampling strategy along the time steps, then

BayesRegret(T, π_TS) ≤ 1 + [dim_E(F, T^{−1}) + 1] C + 4 sqrt(dim_E(F, T^{−1}) β*_T(F, δ, α) T)   (8)

where C = sup_{f ∈ F} {sup f}, and dim_E(F, T^{−1}) is the eluder dimension (see Definition 3 in Russo and Van Roy [24]) of F with respect to A.

Remark 2. We can choose C = 1 in our click feedback case since f ∈ (0, 1). C is kept in the theorem to show the same form as Proposition 8 in Russo and Van Roy [24]. In fact, the proof is almost the same once we have Proposition 1; hence, we omit it in our paper.

Now we turn to providing an upper regret bound for our E-C Bandit, based on the above generic Bayesian regret analysis under a log-loss estimator. We add the following two assumptions, which are standard in the literature of contextual bandits.

Assumption 1. The optimal θ* lies in B_s := {θ ∈ R^d : ||θ||₂ ≤ s}, and s is known a priori.

Assumption 2. The norms of the context vectors are bounded by x̄, i.e., (x_C, x_E) ∈ B_x, where B_x := {x ∈ R^d : ||x||₂ ≤ x̄} and x̄ is known a priori.

Based on these two assumptions, it is straightforward to verify that ρ(x_C^T θ_C), ρ(x_E^T θ_E) and f_θ(x) are bounded. Let M_ρ = max_{θ ∈ B_s, x ∈ B_x} ρ(x^T θ) and m_ρ = min_{θ ∈ B_s, x ∈ B_x} ρ(x^T θ). Hence, 0 < m_ρ ≤ M_ρ < 1. Similarly, denoting the maximum of f_θ(x) by M_f and the minimum by m_f, we have 0 < m_f ≤ M_f < 1. Once the arm set is restricted to a finite cardinality, we have dim_E(F, T^{−1}) ≤ |A| by Appendix C.1 in Russo and Van Roy [24].
Choosing the function class as that in our E-C Bandit, i.e., F = {f : B_x → R | f = ρ(x_C^T θ_C) ρ(x_E^T θ_E), θ ∈ B_s}, by Lemma 8 (see the Appendix for its proof), we have N(F, α, ||·||_∞) = (γ/α)^d, where γ = 2 M_ρ k_ρ x̄ (k_ρ is the Lipschitz constant of ρ, see Lemma 4). Hence, choosing α = 1/t² and δ = 1/t leads to

β*_t(F, 1/t, 1/t²) = (2/γ₀) d log(γ t³) + (1/t)(4 M_f + 1/m_f).   (9)

Therefore, the upper bound of the Bayesian regret of our proposed E-C Bandit takes the following form,

BayesRegret(T, π_TS) = O(|A| + sqrt(d |A| T log T)).   (10)

When T ≫ |A| and T ≫ d, which is the typical case in practice, it becomes O(sqrt(T log T)).

6 Experiment

We perform empirical evaluations in simulation and on online student click logs collected from our MOOC platform to verify the effectiveness of the proposed algorithm. In particular, we compare with algorithms that fail to model the examination condition and directly use clicks as feedback.

6.1 Algorithms for comparison

We list the models used for empirical comparison below, and explain how we adjust them in our evaluations.

Logistic Bandit. This model has been extensively used for online advertisement CTR optimization. In [6, 21], the authors model user clicks by a regularized logistic regression model over observed context features and make decisions by applying Thompson sampling over the learnt model. In particular, no click is treated as negative feedback. Following their setting in [6], we used the Laplace approximation and Gaussian prior presented therein to update the model parameters on the fly. We also want to highlight that despite the large body of work on generalized linear bandits, most of them are not truly online algorithms, because the estimation of their parameters at each iteration has to involve all historical observations iteratively.
This incurs a space complexity of at least O(T) and a time complexity of at least O(T²) (e.g., Filippi et al. [9] requires the exact optimum of logistic regression on all historical observations at each round).

hLinUCB. This is an algorithm proposed by Wang et al. [25] for bandit learning with latent variables. It is related to our model in the sense that both models estimate hidden features. In particular, hLinUCB extends the linear contextual bandit by the inclusion of hidden features and operates under a UCB-like strategy. However, it still treats clicks as direct feedback, and aims at learning more expressive features to describe the observed clicks.

E-C Bandit. This is the algorithm we present in Algorithm 1. We should note that in the experiments on real-world data, the manual separation of the examination feature x_E and the click feature x_C in the context vector x offers a principled approach to incorporate one's domain knowledge about what affects user examination and what affects result relevance. We explain in detail which features are chosen for which component in Appendix G. Thanks to the tight approximation achieved by the variational Bayesian inference presented in Section 4, truly online model update is feasible in this algorithm. This provides both computational and storage efficiency.

6.2 Experiments on simulations

First we demonstrate the effectiveness of our algorithm by experimenting with simulated data. The experiment is performed as follows. The context vector dimensions d_C and d_E are both set to 5, and thus d = d_C + d_E = 10. We set |A| = 100, where each arm is associated with a unique context vector (x_C, x_E). The ground-truth parameter (θ*_C, θ*_E) and the specific values of (x_C, x_E) are all randomly sampled from the unit ball B = {x ∈ R^d : ||x||₂ ≤ 1}.
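The environment construction and one simulated trial can be sketched as follows. This is a minimal illustration with our own variable names; the variational posterior update of Eq (3) to Eq (6) is deliberately left out, and a parameter sample is passed in instead of being drawn from the approximated posterior.

```python
import math
import random

rng = random.Random(7)
d_C, d_E = 5, 5               # context dimensions, d = d_C + d_E = 10
n_arms, n_offered = 100, 10   # |A| = 100, 10 candidate arms per trial

def rho(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_unit_ball(dim):
    """Uniform sample from {x in R^dim : ||x||_2 <= 1}."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    r = rng.random() ** (1.0 / dim)   # radius with the correct density
    return [r * x / norm for x in v]

theta_C_star = sample_unit_ball(d_C)   # ground-truth parameters
theta_E_star = sample_unit_ball(d_E)
arms = [(sample_unit_ball(d_C), sample_unit_ball(d_E)) for _ in range(n_arms)]

def ctr(arm, th_C, th_E):
    """Expected reward rho(x_C^T th_C) * rho(x_E^T th_E)."""
    x_C, x_E = arm
    return rho(dot(x_C, th_C)) * rho(dot(x_E, th_E))

def trial(th_C_sample, th_E_sample):
    """One trial: offer 10 random arms, pick the argmax under the sampled
    parameters, observe Bernoulli clicks, return regret(t) = C_{a*} - C_{a_t}."""
    offered = rng.sample(range(n_arms), n_offered)
    a_t = max(offered, key=lambda i: ctr(arms[i], th_C_sample, th_E_sample))
    a_star = max(offered, key=lambda i: ctr(arms[i], theta_C_star, theta_E_star))
    click = lambda i: int(rng.random() < ctr(arms[i], theta_C_star, theta_E_star))
    return click(a_star) - click(a_t)
```

In the full algorithm, th_C_sample and th_E_sample would be re-drawn from N(θ̂_C, Σ_C) and N(θ̂_E, Σ_E) at every trial, and the approximated posteriors updated after each observed click.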
Since (θ*_C, θ*_E) and (x_C, x_E) are both sampled from B, m_f and M_f can be obtained by taking the minimum and maximum of ρ(x_C^T θ_C) ρ(x_E^T θ_E) on B, respectively, i.e., m_f = 1/(1 + e)² and M_f = 1/(1 + e^{−1})².

At each time t, an arm set Ã_t is randomly sampled from A such that |Ã_t| = 10, i.e., each time we offer 10 randomly selected arms for the algorithm to choose from. An algorithm selects an arm from A_t and observes the corresponding reward C_t^{alg} generated by the Bernoulli distribution B(ρ(x_{C,t}^T θ*_C) ρ(x_{E,t}^T θ*_E)). The regret of this algorithm at time t is defined by its received click reward, i.e., regret(t) = C_{a*_t} − C_{a_t}, where a*_t is the optimal arm to be chosen based on the ground-truth bandit parameters (θ*_C, θ*_E).

Figure 1: Comparison of cumulative regret over 100 runs of simulation.

Figure 2: Comparison of the discrepancy bound provided by Proposition 1.

We repeat the experiment 100 times using the same simulation setting, each containing 10,000 iterations. The average cumulative regret over the 100 runs and the corresponding standard deviation (plotted per thousand iterations) are illustrated in Figure 1. One can clearly notice that the Logistic Bandit suffers from a linear regret with respect to time t, as it mistakenly treats no click as negative feedback. Our E-C Bandit achieves a fast converging sub-linear regret. The result that hLinUCB performs the worst is expected, since it assumes a linear relation between clicks and context feature vectors. We further investigate how the aggregate empirical discrepancy of E-C Bandit and Logistic Bandit increases with respect to time. Figure 2 illustrates that the aggregate empirical discrepancy of E-C Bandit is well bounded by the upper bound provided by Proposition 1, while the Logistic Bandit's aggregate empirical discrepancy increases linearly.
This directly explains their cumulative regret in this experimental comparison.

6.3 Experiments on MOOC video watching data

The MOOC data we used for evaluation was collected from a single course over a 4-month period. The course has 503 lecture videos in total. About 500 high-quality quiz-like questions have been manually crafted, and each video is assigned a subset of them based on human-judged relatedness. We selected 21 videos, whose accumulated watching times rank in the top 21, for this evaluation. Over the selected videos, a video is assigned 5.5 questions on average, each of which is associated with 6 possible display positions within the video, leading to a total of 33 arms on average (as each question can be placed at any position). The data set with our manually crafted features and our model implementation has been made publicly available at: https://github.com/qy7171/ec_bandit.

We picked one video as an example to analyze students' click behavior. 9 arms are picked and projected by a random Gaussian matrix onto a two-dimensional plane in Figure 3, so their relative distances are preserved. The number in the parentheses indicates the empirical CTR of the corresponding arm. It can be clearly seen that while arm c and arm f have the same empirical CTR, the arms between them, such as arms a and d, have lower CTRs. The Logistic Bandit is never able to capture this non-monotonic relation, since its reward prediction increases monotonically with respect to a linear predictor. We construct a more general case in Appendix F to illustrate the scenario where failing to model the examination condition leads to a linear regret. Mapping the illustration back to the MOOC data set, arm a and arm f are two different questions displayed at the same position in the video, while arm a and arm c are the same question displayed at different positions.
This phenomenon strongly suggests bias in users' implicit feedback, which again justifies our decomposition of click feedback into examination and relevance.
We followed [21] to develop the online data collection policy on our MOOC platform, so as to prepare our offline evaluation data set. In particular, any question related to a video has an equal probability of being selected and displayed at any position in this video. We name this policy Similarity. We create an instance of a bandit model for each video to learn its own optimal question placing policy. See Appendix G for a detailed explanation of our examination feature choice. We also added a new baseline, PBMUCB [18], which assumes a position-based examination probability in any ranking result. To adapt it to our setting, we assumed that the examination probability of any question chosen in a video is determined only by its position. The key difference between our model and PBMUCB is thus that ours utilizes the available contextual information to estimate the examination probability, while PBMUCB is context-free. Another important difference is that PBMUCB assumes the examination probability at each position is known; here it is estimated from offline data.

Figure 3: 9 arms' feature vectors projected onto a two-dimensional plane, such that the relative distances between points are preserved. The number in parentheses is the arm's empirical CTR.
Figure 4: Performance comparison of different bandit algorithms on MOOC video data.

Li et al. [21] proposed a method to compute a near-unbiased estimate of the CTR of any bandit algorithm from the collected history data, so that offline evaluation and performance comparison are possible. We adopt their offline evaluation protocol here and report the estimated CTR, averaged over 100 runs, in Figure 4.
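The core counting step of this replay-style offline evaluation can be sketched as follows. This is a simplified illustration with toy data and hypothetical arm ids: an event logged under the uniformly-random Similarity policy counts only when the evaluated policy would have chosen the same arm that was actually shown. In the full protocol of [21], the bandit also updates its state on each matched event, which this static-policy sketch omits.

```python
def replay_ctr(log, policy):
    """Near-unbiased offline CTR estimate via the replay method of Li et al. [21].

    Each logged event is (candidate_arms, shown_arm, click). An event is
    counted only if `policy` picks the same arm that was logged.
    """
    matched, clicks = 0, 0
    for candidate_arms, shown_arm, click in log:
        if policy(candidate_arms) == shown_arm:
            matched += 1
            clicks += click
    return clicks / matched if matched else 0.0

def smallest_id_policy(arms):
    """Toy deterministic policy: always pick the smallest arm id."""
    return min(arms)

# Toy log: four events over three hypothetical arms {0, 1, 2}.
log = [
    ([0, 1, 2], 0, 1),   # matched (policy picks 0), clicked
    ([0, 1, 2], 1, 0),   # not matched, discarded
    ([0, 1, 2], 0, 0),   # matched, not clicked
    ([0, 1, 2], 2, 1),   # not matched, discarded
]
print(replay_ctr(log, smallest_id_policy))  # 2 matched events, 1 click -> 0.5
```

Because the logging policy shows each arm with equal probability, the matched subset is an unbiased sample of what the evaluated policy would have faced online, which is what makes the normalized CTRs in Figure 4 comparable across algorithms.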
To avoid disclosing any proprietary information about the platform, all algorithms' CTRs are normalized by that of the Similarity policy. As shown in the figure, independently estimating E-C Bandits across videos achieves an average 40.6% increase in CTR over the Similarity baseline. Meanwhile, E-C Bandit consistently outperforms the other three baseline bandits, i.e., hLinUCB, Logistic Bandit and PBMUCB. The improvement of our model over Logistic Bandit clearly suggests the necessity of modeling the examination condition in user clicks for improving online recommendation performance, and the improvement over PBMUCB provides strong evidence of the importance of modeling examination with the available contextual information. In addition, the standard errors of the relative CTR performance of E-C Bandit, hLinUCB, Logistic Bandit and PBMUCB over the 100 trials are 0.032, 0.031, 0.030 and 0.041, respectively. The variance of our offline evaluation is therefore small, and the improvements of our solution over the baselines are statistically significant.

7 Conclusion

Motivated by the examination hypothesis in user click modeling, in this paper we developed E-C Bandit, which differentiates result examination and content relevance in user clicks and actively learns from such implicit feedback. We developed an efficient and effective learning algorithm based on variational inference and demonstrated its effectiveness on both simulated and real-world datasets. We proved that, despite the complexity of the underlying reward generation assumption and the resulting parameter estimation procedure, the proposed learning algorithm enjoys a sub-linear regret bound. Currently we have only studied click feedback on single items; it is important to study it in a more general setting, e.g., a list of ranked items, where sequential result examination and relevance judgment introduce richer inter-dependency.
In addition, our current regret analysis does not account for the additional discrepancy introduced by the variational inference. Abeille et al. [2] suggest that an exact posterior is not a necessary condition for a Thompson sampling policy to be optimal. It is important to study a tighter upper regret bound under our approximate posterior in general.

Acknowledgements. We thank the anonymous reviewers for their insightful comments. This paper is based upon work supported by a research fund from XuetangX.com and the National Science Foundation under grants IIS-1553568 and IIS-1618948.

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, pages 2312–2320, 2011.

[2] Marc Abeille, Alessandro Lazaric, et al. Linear Thompson sampling revisited. Electronic Journal of Statistics, 11(2):5165–5197, 2017.

[3] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.

[4] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002. URL http://www.jmlr.org/papers/v3/auer02a.html.

[5] Djallel Bouneffouf, Amel Bouzeghoub, and Alda Lopes Gançarski. A contextual-bandit algorithm for mobile context-aware recommender system. In Neural Information Processing, pages 324–331, 2012.

[6] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.

[7] Aleksandr Chuklin, Ilya Markov, and Maarten de Rijke. Click models for web search. Synthesis Lectures on Information Concepts, Retrieval, and Services, 7(3):1–115, 2015.

[8] Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey.
An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 87–94. ACM, 2008.

[9] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.

[10] Fan Guo, Chao Liu, and Yi Min Wang. Efficient multiple-click models in web search. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 124–131. ACM, 2009.

[11] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Eighth IEEE International Conference on Data Mining (ICDM '08), pages 263–272. IEEE, 2008.

[12] Tommi S Jaakkola and Michael I Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10(1):25–37, 2000.

[13] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. Accurately interpreting clickthrough data as implicit feedback. In ACM SIGIR Forum, volume 51, pages 4–11. ACM, 2017.

[14] Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, and Zheng Wen. DCM bandits: Learning to rank with multiple clicks. In International Conference on Machine Learning, pages 1215–1224, 2016.

[15] Jaya Kawale, Hung H Bui, Branislav Kveton, Long Tran-Thanh, and Sanjay Chawla. Efficient Thompson sampling for online matrix-factorization recommendation. In NIPS, pages 1297–1305, 2015.

[16] Diane Kelly and Jaime Teevan. Implicit feedback for inferring user preference: a bibliography. In ACM SIGIR Forum, volume 37, pages 18–28. ACM, 2003.

[17] Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model.
In International Conference on Machine Learning, pages 767–776, 2015.

[18] Paul Lagrée, Claire Vernade, and Olivier Cappe. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems, pages 1597–1605, 2016.

[19] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, pages 817–824, 2008.

[20] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.

[21] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306. ACM, 2011.

[22] Wei Li, Xuerui Wang, Ruofei Zhang, Ying Cui, Jianchang Mao, and Rong Jin. Exploitation and exploration in a performance based contextual advertising system. In Proceedings of the 16th SIGKDD, pages 27–36. ACM, 2010.

[23] Odalric-Ambrym Maillard and Shie Mannor. Latent bandits. In International Conference on Machine Learning, pages 136–144, 2014.

[24] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

[25] Huazheng Wang, Qingyun Wu, and Hongning Wang. Learning hidden features for contextual bandits. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 1633–1642. ACM, 2016.

[26] Yisong Yue and Carlos Guestrin. Linear submodular bandits and their application to diversified retrieval. In NIPS, pages 2483–2491, 2011.

[27] Masrour Zoghi, Tomas Tunys, Mohammad Ghavamzadeh, Branislav Kveton, Csaba Szepesvari, and Zheng Wen.
Online learning to rank in stochastic click models. In International Conference on Machine Learning, pages 4199–4208, 2017.