{"title": "A Latent Source Model for Online Collaborative Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 3347, "page_last": 3355, "abstract": "Despite the prevalence of collaborative filtering in recommendation systems, there has been little theoretical development on why and how well it works, especially in the ``online'' setting, where items are recommended to users over time. We address this theoretical gap by introducing a model for online recommendation systems, cast item recommendation under the model as a learning problem, and analyze the performance of a cosine-similarity collaborative filtering method. In our model, each of $n$ users either likes or dislikes each of $m$ items. We assume there to be $k$ types of users, and all the users of a given type share a common string of probabilities determining the chance of liking each item. At each time step, we recommend an item to each user, where a key distinction from related bandit literature is that once a user consumes an item (e.g., watches a movie), then that item cannot be recommended to the same user again. The goal is to maximize the number of likable items recommended to users over time. Our main result establishes that after nearly $\\log(km)$ initial learning time steps, a simple collaborative filtering algorithm achieves essentially optimal performance without knowing $k$. The algorithm has an exploitation step that uses cosine similarity and two types of exploration steps, one to explore the space of items (standard in the literature) and the other to explore similarity between users (novel to this work).", "full_text": "A Latent Source Model for\n\nOnline Collaborative Filtering\n\nGuy Bresler\n\nDevavrat Shah\nDepartment of Electrical Engineering and Computer Science\n\nGeorge H. 
Chen\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\n{gbresler,georgehc,devavrat}@mit.edu\n\nAbstract\n\nDespite the prevalence of collaborative \ufb01ltering in recommendation systems, there\nhas been little theoretical development on why and how well it works, especially\nin the \u201conline\u201d setting, where items are recommended to users over time. We\naddress this theoretical gap by introducing a model for online recommendation\nsystems, cast item recommendation under the model as a learning problem, and\nanalyze the performance of a cosine-similarity collaborative \ufb01ltering method. In\nour model, each of n users either likes or dislikes each of m items. We assume\nthere to be k types of users, and all the users of a given type share a common\nstring of probabilities determining the chance of liking each item. At each time\nstep, we recommend an item to each user, where a key distinction from related\nbandit literature is that once a user consumes an item (e.g., watches a movie),\nthen that item cannot be recommended to the same user again. The goal is to\nmaximize the number of likable items recommended to users over time. Our main\nresult establishes that after nearly log(km) initial learning time steps, a simple\ncollaborative \ufb01ltering algorithm achieves essentially optimal performance without\nknowing k. The algorithm has an exploitation step that uses cosine similarity and\ntwo types of exploration steps, one to explore the space of items (standard in the\nliterature) and the other to explore similarity between users (novel to this work).\n\n1\n\nIntroduction\n\nRecommendation systems have become ubiquitous in our lives, helping us \ufb01lter the vast expanse of\ninformation we encounter into small selections tailored to our personal tastes. Prominent examples\ninclude Amazon recommending items to buy, Net\ufb02ix recommending movies, and LinkedIn recom-\nmending jobs. 
In practice, recommendations are often made via collaborative filtering, which boils down to recommending an item to a user by considering items that other similar or "nearby" users liked. Collaborative filtering has been used extensively for decades now, including in the GroupLens news recommendation system [20], Amazon's item recommendation system [17], the Netflix Prize winning algorithm by BellKor's Pragmatic Chaos [16, 24, 19], and a recent song recommendation system [1] that won the Million Song Dataset Challenge [6].

Most such systems operate in the "online" setting, where items are constantly recommended to users over time. In many scenarios, it does not make sense to recommend an item that has already been consumed. For example, once Alice watches a movie, there is little point in recommending the same movie to her again, at least not immediately, and one could argue that recommending unwatched movies and already watched movies should be handled as separate cases. Finally, what matters is whether a likable item is recommended to a user rather than an unlikable one. In short, a good online recommendation system should recommend different likable items continually over time.

Despite the success of collaborative filtering, there has been little theoretical development to justify its effectiveness in the online setting. We address this theoretical gap with our two main contributions in this paper. First, we frame online recommendation as a learning problem that fuses the lines of work on sleeping bandits and clustered bandits. We impose the constraint that once an item is consumed by a user, the system cannot recommend the item to the same user again. Our second main contribution is to analyze a cosine-similarity collaborative filtering algorithm.
The key insight is our inclusion of two types of exploration in the algorithm: (1) the standard random exploration for probing the space of items, and (2) a novel "joint" exploration for finding different user types. Under our learning problem setup, after nearly log(km) initial time steps, the proposed algorithm achieves near-optimal performance relative to an oracle algorithm that recommends all likable items first. The nearly logarithmic dependence is a result of using the two different exploration types. We note that the algorithm does not know k.

Outline. We present our model and learning problem for online recommendation systems in Section 2, provide a collaborative filtering algorithm and its performance guarantee in Section 3, and give the proof idea for the performance guarantee in Section 4. An overview of experimental results is given in Section 5. We discuss our work in the context of prior work in Section 6.

2 A Model and Learning Problem for Online Recommendations

We consider a system with n users and m items. At each time step, each user is recommended an item that she or he hasn't consumed yet, upon which, for simplicity, we assume that the user immediately consumes the item and rates it +1 (like) or −1 (dislike).¹ The reward earned by the recommendation system up to any time step is the total number of liked items that have been recommended so far across all users. Formally, index time by t ∈ {1, 2, . . .}, and users by u ∈ [n] := {1, . . . , n}. Let π_ut ∈ [m] := {1, . . . , m} be the item recommended to user u at time t. Let Y_ui^(t) ∈ {−1, 0, +1} be the rating provided by user u for item i up to and including time t, where 0 indicates that no rating has been given yet.
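For concreteness, the following toy Python sketch (our own illustration; the sizes, the random placeholder policy, and all variable names are ours, not the paper's) tracks this bookkeeping: the rating matrix Y, the recommended items, and the running reward.

```python
import numpy as np

# Toy sketch of the model's bookkeeping (sizes and policy are our own choices).
rng = np.random.default_rng(0)
n, m, T = 5, 8, 4                       # users, items, time horizon (toy values)
p = rng.uniform(0.1, 0.9, size=(n, m))  # latent preference vectors p_u (arbitrary)

Y = np.zeros((n, m), dtype=int)         # Y[u, i] in {-1, 0, +1}; 0 = not yet rated
reward = 0                              # running sum of revealed ratings
for t in range(1, T + 1):
    for u in range(n):
        unconsumed = np.flatnonzero(Y[u] == 0)
        pi_ut = rng.choice(unconsumed)  # placeholder policy: pick at random
        # The user immediately consumes the item and rates +1 w.p. p[u, pi_ut].
        Y[u, pi_ut] = 1 if rng.random() < p[u, pi_ut] else -1
        reward += Y[u, pi_ut]
# Each user has now rated exactly T items; the reward is the sum of all ratings.
```

Note that once Y[u, i] becomes nonzero, item i is never offered to user u again, which is the consumed-items constraint distinguishing this setup from standard bandits.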
A reasonable objective is to maximize the expected reward r^(T) up to time T:

    r^(T) := Σ_{t=1}^{T} Σ_{u=1}^{n} E[Y_{uπ_ut}^(T)] = Σ_{i=1}^{m} Σ_{u=1}^{n} E[Y_ui^(T)].

The ratings are noisy: the latent item preferences for user u are represented by a length-m vector p_u ∈ [0, 1]^m, where user u likes item i with probability p_ui, independently across items. For a user u, we say that item i is likable if p_ui > 1/2 and unlikable if p_ui < 1/2. To maximize the expected reward r^(T), clearly likable items for the user should be recommended before unlikable ones.

In this paper, we focus on recommending likable items. Thus, instead of maximizing the expected reward r^(T), we aim to maximize the expected number of likable items recommended up to time T:

    r_+^(T) := Σ_{t=1}^{T} Σ_{u=1}^{n} E[X_ut],    (1)

where X_ut is the indicator random variable for whether the item recommended to user u at time t is likable, i.e., X_ut = 1 if p_{uπ_ut} > 1/2 and X_ut = 0 otherwise. Maximizing r^(T) and maximizing r_+^(T) differ in that the former asks that we prioritize items according to their probability of being liked.

Recommending likable items for a user in an arbitrary order suffices for many real recommendation systems, such as those for movies and music. For example, we suspect that users would not actually prefer to listen to music ordered from the songs that their user type likes with highest probability down to the ones their user type likes with lowest probability; instead, each user would listen to songs that she or he finds likable, ordered so that there is sufficient diversity in the playlist to keep the user experience interesting. We target the modest goal of merely recommending likable items, in any order.
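To make the latent source model and the objective r_+ concrete, here is a toy simulation sketch (ours, not the paper's; the sizes, the gap Δ, and the oracle policy are illustrative assumptions). It draws k preference vectors with entries 1/2 ± Δ, assigns each user a type uniformly at random, and checks that an oracle recommending each user's likable items first earns only likable recommendations up to the horizon µm (the minimum number of likable items over users).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k, Delta = 6, 10, 2, 0.3          # toy sizes and noise gap (our choices)

# k latent sources: each entry is 1/2 + Delta (likable) or 1/2 - Delta (unlikable).
sources = 0.5 + Delta * rng.choice([-1.0, 1.0], size=(k, m))
z = rng.integers(k, size=n)             # each user drawn from a type w.p. 1/k
p = sources[z]                          # user u's preference vector p_u

likable = p > 0.5                       # indicator {p_ui > 1/2}
min_likable = likable.sum(axis=1).min()
mu = min_likable / m                    # minimum proportion of likable items

# Oracle: recommend each user's likable items first (any order within them).
T = min_likable                         # largest horizon, T = mu * m
r_plus = sum(min(T, likable[u].sum()) for u in range(n))
# Up to time T <= mu * m, every recommendation the oracle makes is likable.
```

Past T = µm some user runs out of likable items, which is why the guarantee in Section 3 is stated only for horizons T ≤ µm.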
Of course, if all likable items have the same probability of being liked, and similarly for all unlikable items, then maximizing r^(T) and r_+^(T) are equivalent.

¹In practice, a user could ignore the recommendation. To keep our exposition simple, however, we stick to this setting, which resembles song recommendation systems like Pandora that continually recommend a single item at a time to each user. For example, if a user rates a song as "thumbs down" then we assign a rating of −1 (dislike), and any other action corresponds to +1 (like).

The fundamental challenge is that to learn about a user's preference for an item, we need the user to rate (and thus consume) the item. But then we cannot recommend that item to the user again! Thus, the only way to learn about a user's preferences is through collaboration, i.e., inferring from other users' ratings. Broadly, such inference is possible if the users' preferences are somehow related.

In this paper, we assume a simple structure for shared user preferences. We posit that there are k < n different types of users, where users of the same type have identical item preference vectors. The number of types k represents the heterogeneity in the population. For ease of exposition, in this paper we assume that a user belongs to each user type with probability 1/k. We refer to this model as a latent source model, where each user type corresponds to a latent source of users. We remark that there is evidence suggesting that real movie recommendation data are well modeled by clustering of both users and items [21]. Our model only assumes clustering over users.

Our problem setup relates to some versions of the multi-armed bandit problem. A fundamental difference between our setup and that of the standard stochastic multi-armed bandit problem [23, 8] is that the latter allows each item to be recommended an infinite number of times.
Thus, the solution concept for the stochastic multi-armed bandit problem is to determine the best item (arm) and keep choosing it [3]. This observation also applies to "clustered bandits" [9], which, like our work, seek to capture collaboration between users. On the other hand, sleeping bandits [15] allow the set of available items to vary across time steps, but the analysis is worst-case in terms of which items are available over time. In our setup, the sequence of items that are available is not adversarial. Our model combines the collaborative aspect of clustered bandits with the dynamic item availability of sleeping bandits, where we impose a strict structure on how items become unavailable.

3 A Collaborative Filtering Algorithm and Its Performance Guarantee

This section presents our algorithm COLLABORATIVE-GREEDY and its accompanying theoretical performance guarantee. The algorithm is syntactically similar to the ε-greedy algorithm for multi-armed bandits [22], which explores items with probability ε and otherwise greedily chooses the best item seen so far based on a plurality vote. In our algorithm, the greedy choice, or exploitation, uses the standard cosine-similarity measure. The exploration, on the other hand, is split into two types: a standard item exploration, in which a user is recommended an item chosen uniformly at random among those she or he hasn't consumed yet, and a joint exploration, in which all users are asked to provide a rating for the next item in a shared, randomly chosen sequence of items. Let's fill in the details.

Algorithm. At each time step t, either all the users are asked to explore, or an item is recommended to each user by choosing the item with the highest score for that user. The pseudocode is described in Algorithm 1. There are two types of exploration: random exploration, which is for exploring the space of items, and joint exploration, which helps to learn about similarity between users.
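A minimal Python sketch of one time step of this scheme follows (our own simplified rendering; the parameter defaults and the toy data are illustrative, and the precise score and neighborhood definitions are the ones the section goes on to state). A single shared coin decides whether the whole population explores randomly, explores jointly, or exploits via cosine-similarity neighborhoods.

```python
import numpy as np

def collaborative_greedy_step(Y, Yjoint, sigma, t, alpha=0.5, theta=0.5, rng=None):
    """One time step of a simplified COLLABORATIVE-GREEDY (our sketch).

    Y      : (n, m) revealed ratings in {-1, 0, +1}, 0 = unrated
    Yjoint : ratings restricted to jointly explored items (0 elsewhere)
    sigma  : shared random ordering of the items for joint exploration
    """
    rng = rng or np.random.default_rng()
    n, m = Y.shape
    eps_R, eps_J = 1.0 / n**alpha, 1.0 / t**alpha  # assumes eps_R + eps_J <= 1
    coin = rng.random()                  # one coin: all users explore together
    recs = np.empty(n, dtype=int)
    for u in range(n):
        unrated = np.flatnonzero(Y[u] == 0)
        if coin < eps_R:                 # random exploration
            recs[u] = rng.choice(unrated)
        elif coin < eps_R + eps_J:       # joint exploration: next unrated in sigma
            recs[u] = next(i for i in sigma if Y[u, i] == 0)
        else:                            # exploitation via cosine-similar neighbors
            overlap = (Yjoint[u] != 0) & (Yjoint != 0)   # support overlaps with u
            agree = (Yjoint[u] * Yjoint * overlap).sum(axis=1)
            neighbors = agree >= theta * overlap.sum(axis=1)
            scores = np.full(m, 0.5)     # default score 1/2 when no votes exist
            for i in unrated:
                votes = Y[neighbors, i]
                if (votes != 0).any():
                    scores[i] = (votes == 1).sum() / (votes != 0).sum()
            recs[u] = unrated[np.argmax(scores[unrated])]
    return recs

# Tiny demo with hand-made ratings (items 0-2 treated as jointly explored).
Y = np.array([[ 1, -1,  0,  0,  1,  0],
              [ 0,  1,  1, -1,  0,  0],
              [ 1,  0, -1,  0,  0,  1],
              [-1, -1,  1,  0,  0,  0]])
Yjoint = Y.copy()
Yjoint[:, 3:] = 0
recs = collaborative_greedy_step(Y, Yjoint, sigma=np.arange(6), t=10,
                                 rng=np.random.default_rng(0))
```

Whichever branch the shared coin selects, each user receives an item she or he has not yet rated, matching the consumed-items constraint of Section 2.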
For a pre-specified rate α ∈ (0, 4/7], we set the probability of random exploration to ε_R(n) = 1/n^α (decaying with the number of users), and the probability of joint exploration to ε_J(t) = 1/t^α (decaying with time).²

²For ease of presentation, we set the two explorations to have the same decay rate α, but our proof easily extends to encompass different decay rates for the two exploration types. Furthermore, the upper bound of 4/7 on α is not special. It could be different and only affects another constant in our proof.

Algorithm 1: COLLABORATIVE-GREEDY
  Input: Parameters θ ∈ [0, 1], α ∈ (0, 4/7].
  Select a random ordering σ of the items [m]. Define ε_R(n) = 1/n^α and ε_J(t) = 1/t^α.
  for time step t = 1, 2, . . . , T do
    With prob. ε_R(n): (random exploration) for each user, recommend a random item that the user has not rated.
    With prob. ε_J(t): (joint exploration) for each user, recommend the first item in σ that the user has not rated.
    With prob. 1 − ε_J(t) − ε_R(n): (exploitation) for each user u, recommend an item j that the user has not rated and that maximizes the score p̃_uj^(t), which depends on the threshold θ.
  end

Recall that we observe Y_ui^(t) ∈ {−1, 0, +1} as user u's rating for item i up to time t, where 0 indicates that no rating has been given yet. We define

    Ỹ_ui^(t) := Y_ui^(t) if item i is jointly explored by time t, and Ỹ_ui^(t) := 0 otherwise.

In other words, Ỹ_u^(t) consists of the revealed ratings of user u restricted to items that have been jointly explored. Next, we define user u's score p̃_ui^(t) for item i at time t:

    p̃_ui^(t) := ( Σ_{v ∈ Ñ_u^(t)} 1{Y_vi^(t) = +1} ) / ( Σ_{v ∈ Ñ_u^(t)} 1{Y_vi^(t) ≠ 0} )   if Σ_{v ∈ Ñ_u^(t)} 1{Y_vi^(t) ≠ 0} > 0,
    p̃_ui^(t) := 1/2   otherwise,

where the neighborhood of user u is given by

    Ñ_u^(t) := { v ∈ [n] : ⟨Ỹ_u^(t), Ỹ_v^(t)⟩ ≥ θ |supp(Ỹ_u^(t)) ∩ supp(Ỹ_v^(t))| }.

The neighborhoods are defined precisely by cosine similarity with respect to jointly explored items. To see this, for users u and v with revealed ratings Ỹ_u^(t) and Ỹ_v^(t), let Ω_uv := supp(Ỹ_u^(t)) ∩ supp(Ỹ_v^(t)) be the support overlap of Ỹ_u^(t) and Ỹ_v^(t), and let ⟨·,·⟩_{Ω_uv} be the dot product restricted to entries in Ω_uv. Then

    ⟨Ỹ_u^(t), Ỹ_v^(t)⟩ / ( √⟨Ỹ_u^(t), Ỹ_u^(t)⟩_{Ω_uv} · √⟨Ỹ_v^(t), Ỹ_v^(t)⟩_{Ω_uv} ) = ⟨Ỹ_u^(t), Ỹ_v^(t)⟩_{Ω_uv} / |Ω_uv|,

which is the cosine similarity of the revealed rating vectors Ỹ_u^(t) and Ỹ_v^(t) restricted to the overlap of their supports. Thus, users u and v are neighbors if and only if their cosine similarity is at least θ.

Theoretical performance guarantee. We now state our main result on the proposed collaborative filtering algorithm's performance with respect to the objective stated in equation (1). We begin with two reasonable, and seemingly necessary, conditions under which the results will be established.

A1 (No Δ-ambiguous items). There exists a constant Δ > 0 such that |p_ui − 1/2| ≥ Δ for all users u and items i. (Smaller Δ corresponds to more noise.)

A2 (γ-incoherence). There exists a constant γ ∈ [0, 1) such that if users u and v are of different types, then their item preference vectors p_u and p_v satisfy

    (1/m) ⟨2p_u − 1, 2p_v − 1⟩ ≤ 4γΔ²,

where 1 is the all-ones vector. Note that a different way to write the left-hand side is E[(1/m)⟨Y*_u, Y*_v⟩], where Y*_u and Y*_v are fully-revealed rating vectors of users u and v, and the expectation is over the random ratings of items.

The first condition is a low-noise condition ensuring that, with a finite number of samples, we can correctly classify each item as either likable or unlikable. The incoherence condition asks that the different user types be well separated so that cosine similarity can tease apart the users of different types over time. We provide some examples after the statement of the main theorem that suggest the incoherence condition to be reasonable, allowing E[⟨Y*_u, Y*_v⟩] to scale as Θ(m) rather than o(m).

We assume that the number of users satisfies n = O(m^C) for some constant C > 1. This is without loss of generality since otherwise, we can randomly divide the n users into separate population pools, each of size O(m^C), and run the recommendation algorithm independently for each pool to achieve the same overall performance guarantee.

Finally, we define µ, the minimum proportion of likable items for any user (and thus any user type):

    µ := min_{u ∈ [n]} (1/m) Σ_{i=1}^{m} 1{p_ui > 1/2}.

Theorem 1. Let δ ∈ (0, 1) be some pre-specified tolerance. Take as input to COLLABORATIVE-GREEDY θ = 2Δ²(1 + γ), where γ ∈ [0, 1) is as defined in A2, and α ∈ (0, 4/7]. Under the latent source model and assumptions A1 and A2, if the number of users n = O(m^C) satisfies

    n = Ω( km log(1/δ) + (4/δ)^{1/α} ),

then for any Tlearn ≤ T ≤ µm, the expected proportion of likable items recommended by COLLABORATIVE-GREEDY up until time T satisfies

    r_+^(T) / (Tn) ≥ (1 − Tlearn/T)(1 − δ),

where

    Tlearn = Θ( ( log(km/(Δδ)) / (Δ^4 (1 − γ)²) )^{1/(1−α)} + (4/δ)^{1/α} ).

Theorem 1 says that there are Tlearn initial time steps for which the algorithm may give poor recommendations. Afterward, for Tlearn < T < µm, the algorithm becomes near-optimal, recommending a fraction of likable items 1 − δ close to what an optimal oracle algorithm (that recommends all likable items first) would achieve. Then for time horizon T > µm, we can no longer guarantee that there are likable items left to recommend.
Indeed, if the user types each have the same fraction of likable items, then even an oracle recommender would use up the µm likable items by this time. Meanwhile, to give a sense of how long the learning period Tlearn is, note that when α = 1/2, Tlearn scales as log²(km), and if we choose α close to 0, then Tlearn becomes nearly log(km). In summary, after Tlearn initial time steps, the simple algorithm proposed is essentially optimal.

To gain intuition for incoherence condition A2, we calculate the parameter γ for three examples.

Example 1. Consider when there is no noise, i.e., Δ = 1/2. Then users' ratings are deterministic given their user type. Produce k vectors of probabilities by drawing m independent Bernoulli(1/2) random variables (0 or 1 with probability 1/2 each) for each user type. For any item i and pair of users u and v of different types, Y*_ui · Y*_vi is a Rademacher random variable (±1 with probability 1/2 each), and thus the inner product of two user rating vectors is equal to the sum of m Rademacher random variables. Standard concentration inequalities show that one may take γ = Θ(√(log m / m)) to satisfy γ-incoherence with probability 1 − 1/poly(m).

Example 2. We expand on the previous example by choosing an arbitrary Δ > 0 and making all latent source probability vectors have entries equal to 1/2 ± Δ with probability 1/2 each. As before, let users u and v be of different types. Now

    E[Y*_ui · Y*_vi] = (1/2 + Δ)² + (1/2 − Δ)² − 2(1/4 − Δ²) = 4Δ²   if p_ui = p_vi,
    E[Y*_ui · Y*_vi] = 2(1/4 − Δ²) − (1/2 + Δ)² − (1/2 − Δ)² = −4Δ²   if p_ui = 1 − p_vi.

The expected inner product E[⟨Y*_u, Y*_v⟩] is thus again a sum of m Rademacher random variables, but this time scaled by 4Δ². For similar reasons as before, γ = Θ(√(log m / m)) suffices to satisfy γ-incoherence with probability 1 − 1/poly(m).

Example 3. Continuing with the previous example, now suppose each entry is 1/2 + Δ with probability µ ∈ (0, 1/2) and 1/2 − Δ with probability 1 − µ. Then for two users u and v of different types, p_ui = p_vi with probability µ² + (1 − µ)². This implies that E[⟨Y*_u, Y*_v⟩] = 4mΔ²(1 − 2µ)². Again using standard concentration, γ = (1 − 2µ)² + Θ(√(log m / m)) suffices to satisfy γ-incoherence with probability 1 − 1/poly(m).

4 Proof of Theorem 1

Recall that X_ut is the indicator random variable for whether the item π_ut recommended to user u at time t is likable, i.e., p_{uπ_ut} > 1/2. Given assumption A1, this is equivalent to the event that p_{uπ_ut} ≥ 1/2 + Δ. The expected proportion of likable items is

    r_+^(T) / (Tn) = (1/(Tn)) Σ_{t=1}^{T} Σ_{u=1}^{n} E[X_ut] = (1/(Tn)) Σ_{t=1}^{T} Σ_{u=1}^{n} P(X_ut = 1).

Our proof focuses on lower-bounding P(X_ut = 1).
The key idea is to condition on what we call the "good neighborhood" event E_good(u, t):

    E_good(u, t) = { at time t, user u has ≥ n/(5k) neighbors from the same user type ("good neighbors"), and ≤ Δtn^{1−α}/(10km) neighbors from other user types ("bad neighbors") }.

This good neighborhood event will enable us to argue that after an initial learning time, with high probability there are at most Δ times as many ratings from bad neighbors as there are from good neighbors.

The proof of Theorem 1 consists of two parts. The first part uses joint exploration to show that after a sufficient amount of time, the good neighborhood event E_good(u, t) holds with high probability.

Lemma 1. For user u, after

    t ≥ ( 2 log(10kmn^α/Δ) / (Δ^4 (1 − γ)²) )^{1/(1−α)}

time steps,

    P(E_good(u, t)) ≥ 1 − exp(−n/(8k)) − 12 exp(−Δ^4 (1 − γ)² t^{1−α} / 20).

In the above lower bound, the first exponentially decaying term can be thought of as the penalty for not having enough users in the system from the k user types, and the second decaying term as the penalty for not yet having clustered the users correctly.

The second part of our proof of Theorem 1 shows that, with high probability, the good neighborhoods have, through random exploration, accurately estimated the probability of liking each item. Thus, we correctly classify each item as likable or not with high probability, which leads to a lower bound on P(X_ut = 1).

Lemma 2. For user u at time t, if the good neighborhood event E_good(u, t) holds and t ≤ µm, then

    P(X_ut = 1) ≥ 1 − 2m exp(−Δ² t n^{1−α} / (40km)) − 1/t^α − 1/n^α.

Here, the first exponentially decaying term can be thought of as the cost of not classifying items correctly as likable or unlikable, and the last two decaying terms together as the cost of exploration (we explore with probability ε_J(t) + ε_R(n) = 1/t^α + 1/n^α).

We defer the proofs of Lemmas 1 and 2 to the supplementary material. Combining these lemmas and choosing appropriate constraints on the numbers of users and items, we produce the following lemma.

Lemma 3. Let δ ∈ (0, 1) be some pre-specified tolerance. If the numbers of users n and items m satisfy

    n ≥ max{ 8k log(4/δ), (4/δ)^{1/α} },
    µm ≥ t ≥ max{ ( 2 log(10kmn^α/Δ) / (Δ^4 (1 − γ)²) )^{1/(1−α)}, ( 20 log(96/δ) / (Δ^4 (1 − γ)²) )^{1/(1−α)}, (4/δ)^{1/α} },
    t n^{1−α} ≥ (40km/Δ²) log(16m/δ),

then P(X_ut = 1) ≥ 1 − δ.

Proof. With the above conditions on n and t satisfied, we combine Lemmas 1 and 2 to obtain

    P(X_ut = 1) ≥ 1 − exp(−n/(8k)) − 12 exp(−Δ^4 (1 − γ)² t^{1−α} / 20) − 2m exp(−Δ² t n^{1−α} / (40km)) − 1/t^α − 1/n^α
                ≥ 1 − δ/4 − δ/8 − δ/8 − δ/4 − δ/4 = 1 − δ.

Theorem 1 follows as a corollary to Lemma 3.
As previously mentioned, without loss of generality, we take n = O(m^C). Then with the number of users n satisfying

    O(m^C) = n = Ω( km log(1/δ) + (4/δ)^{1/α} ),

and for any time step t satisfying

    µm ≥ t ≥ Θ( ( log(km/(Δδ)) / (Δ^4 (1 − γ)²) )^{1/(1−α)} + (4/δ)^{1/α} ) =: Tlearn,

we simultaneously meet all of the conditions of Lemma 3. Note that the upper bound on the number of users n appears because without it, Tlearn would depend on n (observe that in Lemma 3, we ask that t be greater than a quantity that depends on n). Provided that the time horizon satisfies T ≤ µm, then

    r_+^(T) / (Tn) ≥ (1/(Tn)) Σ_{t=Tlearn}^{T} Σ_{u=1}^{n} P(X_ut = 1) ≥ (1/(Tn)) Σ_{t=Tlearn}^{T} Σ_{u=1}^{n} (1 − δ) = (T − Tlearn)(1 − δ) / T,

yielding the theorem statement.

5 Experimental Results

We provide only a summary of our experimental results here, deferring full details to the supplementary material. We simulate an online recommendation system based on movie ratings from the Movielens10m and Netflix datasets, each of which provides a sparsely filled user-by-movie rating matrix with ratings out of 5 stars. Unfortunately, existing collaborative filtering datasets such as the two we consider don't offer the interactivity of a real online recommendation system, nor do they allow us to reveal the rating for an item that a user didn't actually rate. For simulating an online system, the former issue can be dealt with by simply revealing entries in the user-by-item rating matrix over time. We address the latter issue by only considering a dense "top users vs. top items" subset of each dataset.
In particular, we consider only the "top" users who have rated the largest number of items, and the "top" items that have received the largest number of ratings. While this dense part of the dataset is unrepresentative of the rest of the dataset, it does allow us to use actual ratings provided by users without synthesizing any ratings. A rigorous validation would require an implementation of an actual interactive online recommendation system, which is beyond the scope of our paper.

First, we validate that our latent source model is reasonable for the dense parts of the two datasets we consider by looking for clustering behavior across users. We find that the dense top users vs. top movies matrices do in fact exhibit clustering behavior of users and also of movies, as shown in Figure 1(a). The clustering was found via Bayesian clustered tensor factorization, which was previously shown to model real movie ratings data well [21].

Next, we demonstrate our algorithm COLLABORATIVE-GREEDY on the two simulated online movie recommendation systems, showing that it outperforms two existing recommendation algorithms: Popularity Amongst Friends (PAF) [4] and a method by Deshpande and Montanari (DM) [12]. Following the experimental setup of [4], we quantize a rating of 4 stars or more to +1 (likable), and a rating of less than 4 stars to −1 (unlikable). While we look at a dense subset of each dataset, there are still missing entries. If a user u hasn't rated item j in the dataset, then we set the corresponding true rating to 0, meaning that in our simulation, upon recommending item j to user u, we receive 0 reward, but we still mark that user u has consumed item j; thus, item j can no longer be recommended to user u. For both the Movielens10m and Netflix datasets, we consider the top n = 200 users and the top m = 500 movies. For Movielens10m, the resulting user-by-movie rating matrix has 80.7% nonzero entries.
For Net\ufb02ix, the resulting matrix has 86.0% nonzero entries. For an algorithm that\n\n7\n\n\f(cid:80)T\n\nt=1\n\n(cid:80)n\n\n(a)\n\n(b)\n\nFigure 1: Movielens10m dataset: (a) Top users by top movies matrix with rows and columns re-\nordered to show clustering of users and items. (b) Average cumulative rewards over time.\n\nu=1 Y (T )\n\nrecommends item \u03c0ut to user u at time t, we measure the algorithm\u2019s average cumulative reward up\nto time T as 1\nu\u03c0ut, where we average over users. For all four methods, we recom-\nn\nmend items until we reach time T = 500, i.e., we make movie recommendations until each user has\nseen all m = 500 movies. We disallow the matrix completion step for DM to see the users that we\nactually test on, but we allow it to see the the same items as what is in the simulated online recom-\nmendation system in order to compute these items\u2019 feature vectors (using the rest of the users in the\ndataset). Furthermore, when a rating is revealed, we provide DM both the thresholded rating and the\nnon-thresholded rating, the latter of which DM uses to estimate user feature vectors over time. We\ndiscuss choice of algorithm parameters in the supplementary material. In short, parameters \u03b8 and\n\u03b1 of our algorithm are chosen based on training data, whereas we allow the other algorithms to use\nwhichever parameters give the best results on the test data. Despite giving the two competing algo-\nrithms this advantage, COLLABORATIVE-GREEDY outperforms the two, as shown in Figure 1(b).\nResults on the Net\ufb02ix dataset are similar.\n\n6 Discussion and Related Work\n\nThis paper proposes a model for online recommendation systems under which we can analyze the\nperformance of recommendation algorithms. We theoretical justify when a cosine-similarity collab-\norative \ufb01ltering method works well, with a key insight of using two exploration types.\nThe closest related work is by Biau et al. 
[7], who study the asymptotic consistency of a cosine-similarity nearest-neighbor collaborative filtering method. Their goal is to predict the rating of the next unseen item. Barman and Dabeer [4] study the performance of an algorithm called Popularity Amongst Friends, examining its ability to predict binary ratings in an asymptotic information-theoretic setting. In contrast, we seek to understand the finite-time performance of such systems. Dabeer [11] uses a model similar to ours and studies online collaborative filtering with a moving horizon cost in the limit of small noise, using an algorithm that knows the numbers of user types and item types. We do not model different item types, our algorithm is oblivious to the number of user types, and our performance metric is different. Another related work is by Deshpande and Montanari [12], who study online recommendations as a linear bandit problem; their method, however, does not actually use any collaboration beyond a pre-processing step in which offline collaborative filtering (specifically matrix completion) is solved to compute feature vectors for items.
Our work also relates to the problem of learning mixture distributions (cf. [10, 18, 5, 2]), where one observes samples from a mixture distribution and the goal is to learn the mixture components and weights. Existing results assume that one has access to the entire high-dimensional sample or that the samples are produced in an exogenous manner (not chosen by the algorithm). Neither assumption holds in our setting, as we only see each user’s revealed ratings thus far and not the user’s entire preference vector, and the recommendation algorithm affects which samples are observed (by choosing which item ratings are revealed for each user). These two aspects make our setting more challenging than the standard setting for learning mixture distributions. However, our goal is more modest.
Rather than learning the k item preference vectors, we settle for classifying them as likable or unlikable. Despite this, we suspect that having two types of exploration is useful in general for efficiently learning mixture distributions in the active learning setting.
Acknowledgements. This work was supported in part by NSF grant CNS-1161964 and by Army Research Office MURI Award W911NF-11-1-0036. GHC was supported by an NDSEG fellowship.

References
[1] Fabio Aiolli. A preliminary study on a recommender system for the million songs dataset challenge. In Proceedings of the Italian Information Retrieval Workshop, pages 73–83, 2013.
[2] Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models, 2012. arXiv:1210.7559.
[3] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, May 2002.
[4] Kishor Barman and Onkar Dabeer. Analysis of a collaborative filter based on popularity amongst neighbors. IEEE Transactions on Information Theory, 58(12):7110–7134, 2012.
[5] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 103–112. IEEE, 2010.
[6] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.
[7] Gérard Biau, Benoît Cadre, and Laurent Rouvière. Statistical analysis of k-nearest neighbor collaborative recommendation.
The Annals of Statistics, 38(3):1568–1592, 2010.
[8] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[9] Loc Bui, Ramesh Johari, and Shie Mannor. Clustered bandits, 2012. arXiv:1206.4169.
[10] Kamalika Chaudhuri and Satish Rao. Learning mixtures of product distributions using correlations and independence. In Conference on Learning Theory, pages 9–20, 2008.
[11] Onkar Dabeer. Adaptive collaborating filtering: The low noise regime. In IEEE International Symposium on Information Theory, pages 1197–1201, 2013.
[12] Yash Deshpande and Andrea Montanari. Linear bandits in high dimension and recommendation systems, 2013. arXiv:1301.1722.
[13] Roger B. Grosse, Ruslan Salakhutdinov, William T. Freeman, and Joshua B. Tenenbaum. Exploiting compositionality to explore a large space of model structures. In Uncertainty in Artificial Intelligence, pages 306–315, 2012.
[14] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[15] Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2-3):245–272, 2010.
[16] Yehuda Koren. The BellKor solution to the Netflix grand prize. http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf, August 2009.
[17] Greg Linden, Brent Smith, and Jeremy York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003.
[18] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of gaussians. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, 2010.
[19] Martin Piotte and Martin Chabbert.
The pragmatic theory solution to the Netflix grand prize. http://www.netflixprize.com/assets/GrandPrize2009_BPC_PragmaticTheory.pdf, August 2009.
[20] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, CSCW ’94, pages 175–186, New York, NY, USA, 1994. ACM.
[21] Ilya Sutskever, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Modelling relational data using Bayesian clustered tensor factorization. In NIPS, pages 1821–1828, 2009.
[22] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[23] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
[24] Andreas Töscher and Michael Jahrer. The BigChaos solution to the Netflix grand prize. http://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf, September 2009.