{"title": "Optimal Tagging with Markov Chain Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1307, "page_last": 1315, "abstract": "Many information systems use tags and keywords to describe and annotate content. These allow for efficient organization and categorization of items, as well as facilitate relevant search queries. As such, the selected set of tags for an item can have a considerable effect on the volume of traffic that eventually reaches an item. In tagging systems where tags are exclusively chosen by an item's owner, who in turn is interested in maximizing traffic, a principled approach for assigning tags can prove valuable. In this paper we introduce the problem of optimal tagging, where the task is to choose a subset of tags for a new item such that the probability of browsing users reaching that item is maximized. We formulate the problem by modeling traffic using a Markov chain, and asking how transitions in this chain should be modified to maximize traffic into a certain state of interest. The resulting optimization problem involves maximizing a certain function over subsets, under a cardinality constraint. We show that the optimization problem is NP-hard, but has a (1-1/e)-approximation via a simple greedy algorithm due to monotonicity and submodularity. Furthermore, the structure of the problem allows for an efficient computation of the greedy step. 
To demonstrate the effectiveness of our method, we perform experiments on three tagging datasets, and show that the greedy algorithm outperforms other baselines.", "full_text": "Optimal Tagging with Markov Chain Optimization

Nir Rosenfeld
School of Computer Science and Engineering
Hebrew University of Jerusalem
nir.rosenfeld@mail.huji.ac.il

Amir Globerson
The Blavatnik School of Computer Science
Tel Aviv University
gamir@post.tau.ac.il

Abstract

Many information systems use tags and keywords to describe and annotate content. These allow for efficient organization and categorization of items, as well as facilitate relevant search queries. As such, the selected set of tags for an item can have a considerable effect on the volume of traffic that eventually reaches an item. In tagging systems where tags are exclusively chosen by an item's owner, who in turn is interested in maximizing traffic, a principled approach for assigning tags can prove valuable. In this paper we introduce the problem of optimal tagging, where the task is to choose a subset of tags for a new item such that the probability of browsing users reaching that item is maximized.
We formulate the problem by modeling traffic using a Markov chain, and asking how transitions in this chain should be modified to maximize traffic into a certain state of interest. The resulting optimization problem involves maximizing a certain function over subsets, under a cardinality constraint.
We show that the optimization problem is NP-hard, but has a (1 − 1/e)-approximation via a simple greedy algorithm due to monotonicity and submodularity.
Furthermore,\nthe structure of the problem allows for an ef\ufb01cient computation of the greedy step.\nTo demonstrate the effectiveness of our method, we perform experiments on three\ntagging datasets, and show that the greedy algorithm outperforms other baselines.\n\n1\n\nIntroduction\n\nTo allow for ef\ufb01cient navigation and search, modern information systems rely on the usage of non-\nhierarchical tags, keywords, or labels to describe items and content. These tags are then used either\nexplicitly by users when searching for content, or implicitly by the system to recommend related\nitems or to augment search results.\nMany online systems where users can create or upload content support tagging. Examples of such\nsystems are media-sharing platforms, social bookmarking websites, and consumer to consumer\nauctioning services. While in some systems any user can tag any item, in many ad-hoc systems tags\nare exclusively set by the item\u2019s owner alone. She, in turn, is free to select any set of tags or keywords\nwhich she believes best describe the item. Typically, the only concrete limitation is on the number\nof tags, words, or characters used. Tags are often chosen on a basis of their ability to best describe,\nclassify, or categorize items and content. By choosing relevant tags, users aid in creating a more\norganized information system. However, content owners may have their own individual objective,\nsuch as maximizing the exposure of their items to other browsing users. This is true for many artists,\nartisans, content creators, and merchants whose services and items are provided online.\nThis suggests that choosing tags should in fact be done strategically. For instance, for a user uploading\na new song, tagging it as \u2018Rock\u2019 may be informative, but will probably only contribute marginally to\nthe song\u2019s traf\ufb01c, as the competition for popularity under this tag can be \ufb01erce. 
On the other hand,
choosing a unique or obscure tag may be appealing, but will not help much either.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Strategic tagging or keyword selection is clearly exhibited in search systems, where keywords are explicitly used for filtering and ordering search results or ad placements, and users have a clear incentive of maximizing an item's exposure. Nonetheless, the selection process is typically heuristic.
Recent years have seen an abundance of work on methods for user-specific tag recommendations [8, 10, 5]. Such methods aim to support collaborative tagging systems, where any user can tag any item in the repository. In contrast, in this paper we take a complementary perspective and focus on taxonomic tagging systems, where only the owner of an item can determine its tags. We formalize the task of optimal tagging and suggest an efficient, provably-approximate algorithm. While the problem is shown to be NP-hard, we prove that the objective is in fact monotone and submodular, which suggests a straightforward greedy (1 − 1/e)-approximation algorithm [13]. We also show how the greedy step, which consists of solving a set of linear equations, can be greatly simplified. This results in a significant improvement of runtime.
We begin by modeling a user browsing a tagged information system as a random walk. Items and tags act as states in a Markov chain, whose transition probabilities describe the probability of users transitioning between items and tags. Given a new item, our task is to choose a subset of k tags for this item. When an item is tagged, positive probabilities are assigned to transitioning both from the tag to the item and from the item to the tag. Our objective is to choose the subset of k tags which will maximize traffic to that item, namely the probability of a random walk reaching the item.
Intuitively,\ntagging an item causes probability to \ufb02ow from the tag to the item, on account of other items with\nthis tag. Our goal is to take as much probability mass as possible from the system as a whole.\nOur method shares some similarities with other PageRank (PR, [2]) based methods, which optimize\nmeasures based on the stationary distribution [14, 4, 6, 15]. Here we argue that our approach, which\nfocuses on maximizing the probability of a random walk reaching a new item\u2019s state, is better suited\nto the task of optimal tagging. First, an item\u2019s popularity should only increase when assigned a new\ntag. Since tagging an item creates bidirectional links, its stationary probability may undesirably\ndecrease. Hence, optimizing the PR of an item will lead to an undesired non-monotone objective [1].\nSecond, PR considers a single Markov chain for all items, and is thus not item-centric. In contrast,\nour method considers a unique instance of the transition system for every item we consider. While an\nitem-speci\ufb01c Personalized PR based objective can be constructed, it would consider random walks\nfrom a given state, not to it. Third, a stationary distribution does not always exist, and hence may\nrequire modi\ufb01cations of the Markov chain. Finally, optimizing PR is known to be hard. While some\napproximations exist, our method provides superior guarantees and potentially better runtime [16].\nAlthough the Markov chain model we propose for optimal tagging is bipartite, our results apply to\ngeneral Markov chains. We therefore \ufb01rst formulate a general problem in Sec. 3, where the task is\nto choose k states to link a new state to such that the probability of reaching that state is maximal.\nThen, in Sec. 4 we prove that this problem is NP-hard by a reduction from vertex cover. In Sec. 5 we\nprove that for a general Markov chain the optimal objective is both monotonically non-decreasing\nand submodular. Based on this, in Sec. 
6 we suggest a basic greedy (1 − 1/e)-approximation algorithm, and describe a method for significantly improving its runtime. In Sec. 7 we revisit the optimal tagging problem and show how to construct a bipartite Markov chain for a given tag-supporting information system. In Sec. 8 we present experimental results on three tagging datasets (musical artists in Last.fm, bookmarks in Delicious, and movies in Movielens) and show that our algorithm outperforms other baselines. Concluding remarks are given in Sec. 9.

2 Related Work

One of the main roles of tags is to aid in the categorization and classification of content. An active line of research in tagging systems focuses on the task of tag recommendations, where the goal is to predict the tags a given user may assign an item. This task is applicable in collaborative tagging systems and folksonomies, where any user can tag any item. Methods for this task are based on random walks [8, 10] and tensor factorization [5]. While the goal in tag recommendation is also to output a set of tags, our task is very different in nature. Tag recommendation is a prediction task for item-user pairs, is based on ground-truth evaluation, and is applied in collaborative tagging systems. In contrast, ours is an item-centric optimization task for tag-based taxonomies, and is counterfactual in nature, as the selection of tags is assumed to affect future outcomes.
A line of work similar to ours is optimizing the PageRank of web pages in different settings. In [4] the authors consider the problem of computing the maximal and minimal PageRank value for a set of "fragile" links. The authors of [1] analyze the effects of additional outgoing links on the PageRank value. Perhaps the works most closely related to ours are [16, 14], where an approximation algorithm is given for the problem of maximizing the PageRank value by adding at most k incoming links. The authors prove that the probability of reaching a web page is submodular and monotone in a fashion similar to ours (but with a different parameterization), and use it as a proxy for PageRank.
Our framework uses absorbing Markov chains, whose relation to submodular optimization has been explored in [6] for opinion maximization and in [12] for computing centrality measures in networks. Following the classic work of Nemhauser [13], submodular optimization is now a very active line of research. Many interesting optimization problems across diverse domains have been shown to be submodular. Examples include sensor placement [11] and social influence maximization [9].

3 Problem Formulation

Before presenting our approach to optimal tagging, we first describe a general combinatorial optimization task over Markov chains, of which optimal tagging is a special case. Consider a Markov chain over n + 1 states. Assume there is a state σ to which we would like to add k new incoming transitions, where w.l.o.g. σ = n + 1. In the tagging problem, σ will be an item (e.g., song or product) and the incoming transitions will be from possible tags for the item, or from related items.
The optimization problem is then to choose a subset S ⊆ [n] of k states so as to maximize the probability of visiting σ at some point in time. Formally, let X_t ∈ [n + 1] be the random variable corresponding to the state of the Markov chain at time t. Then the optimal tagging problem is:

max_{S ⊆ [n], |S| ≤ k}  P_S[X_t = σ for some t ≥ 0]    (1)

At first glance, it is not clear how to compute the objective function in Eq. (1). However, with a slight modification of the Markov chain, the objective function can be expressed as a simple function of the Markov chain parameters, as explained next.
In general, σ may have outgoing edges, and random walks reaching σ may continue to other states afterward.
Nonetheless, as we are only interested in the probability of reaching σ, the states visited after σ have no effect on our objective. Hence, the edges out of σ can be safely replaced with a single self-edge without affecting the probability of reaching σ. This essentially makes σ an absorbing state, and our task becomes to maximize the probability of the Markov chain being absorbed in σ. In the remainder of the paper we consider this equivalent formulation.
When the Markov chain includes other absorbing states, optimizing over S can be intuitively thought of as trying to transfer as much probability mass from the competing absorbing states to σ, under a budget on the number of states that can be connected to σ.¹ As we discuss in Section 7, having competing absorbing states arises naturally in optimal tagging.
To fully specify the problem, we need the Markov chain parameters. Denote the initial distribution by π. For the transition probabilities, each node i will have two sets of transitions: one when it is allowed to transition to σ (i.e., i ∈ S) and one when no transition is allowed. Using two distinct sets is necessary since in both cases outgoing probabilities must sum to one. We use q_ij to denote the transition probability from state i to j when transition from i to σ is not allowed, and q⁺_ij when it is. We also denote the corresponding transition matrices by Q and Q⁺.
It is natural to assume that when adding a link from i to σ, transition into σ will become more likely, and transition to other states can only be less likely. Thus, we add the assumptions that:

∀i : 0 = q_iσ ≤ q⁺_iσ,    ∀i, ∀j ≠ σ : q⁺_ij ≤ q_ij    (2)

Given a subset S of states from which transitions to σ are allowed, we construct a new transition matrix, taking corresponding rows from Q and Q⁺. We denote this matrix by ρ(S), with

ρ_ij(S) = q⁺_ij if i ∈ S,  q_ij if i ∉ S    (3)

¹ In an ergodic chain with one absorbing state, all walks reach σ w.p. 1, and the problem becomes trivial.

4 NP-Hardness

We now show that for a general Markov chain, the optimal tagging problem in Eq. (1) is NP-hard by a reduction from vertex cover. Given an undirected graph G = (V, E) with n nodes as input to the vertex cover problem, we construct an instance of optimal tagging such that there exists a vertex cover S ⊆ V of size at most k iff the probability of reaching σ reaches some threshold.
To construct the absorbing Markov chain, we create a transient state i for every node i ∈ V, and add two absorbing states ∅ and σ. We set the initial distribution to be uniform, and for some 0 < ε < 1 set the transitions for transient states i as follows:

q⁺_ij = 1 if j = σ, 0 if j ≠ σ;    q_ij = 0 if j = σ,  ε if j = ∅,  (1 − ε)/deg(i) otherwise    (4)

Let U ⊆ V of size k, and S(U) the set of states corresponding to the nodes in U. We claim that U is a vertex cover in G iff the probability of reaching σ when S(U) is chosen is 1 − ((n − k)/n)ε.
Assume U is a vertex cover. For every i ∈ S(U), a walk starting in i will reach σ with probability 1 in one step. For every i ∉ S(U), with probability ε a walk will reach ∅ in one step, and with probability 1 − ε it will visit one of its neighbors j. Since U is a vertex cover, it will then reach σ in one step with probability 1. Hence, in total it will reach σ with probability 1 − ε. Overall, the probability of reaching σ is (k + (n − k)(1 − ε))/n = 1 − ((n − k)/n)ε as needed.
Note that this is the maximal possible probability of reaching σ for any subset of V of size k.
Assume now that U is not a vertex cover. Then there exists an edge (i, j) ∈ E such that both i ∉ S(U) and j ∉ S(U). A walk starting in i will reach ∅ in one step with probability ε, and in two steps (via j) with probability ε · q_ij > 0. Hence, it will reach σ with probability strictly smaller than 1 − ε, and the overall probability of reaching σ will be strictly smaller than 1 − ((n − k)/n)ε.

5 Proof of Monotonicity and Submodularity

Denote by P_S[A] the probability of event A when transitions from S to σ are allowed. We define:

c_i^(k)(S) = P_S[X_t = σ for some t ≤ k | X_0 = i]    (5)
c_i(S) = P_S[X_t = σ for some t | X_0 = i] = lim_{k→∞} c_i^(k)(S)    (6)

Using c(S) = (c_1(S), ..., c_n(S)), the objective in Eq. (1) now becomes:

max_{S ⊆ [n], |S| ≤ k} f(S),    f(S) = ⟨π, c(S)⟩ = P_S[X_t = σ for some t]    (7)

We now prove that f(S) is both monotonically non-decreasing and submodular.

5.1 Monotonicity

When a link is created from i to σ, the probability of reaching σ directly from i increases. However, due to the renormalization constraints, the probability of reaching σ via longer paths may decrease. Trying to prove that f is monotone for every individual random walk and using additive closure is bound to fail. Nonetheless, our proof of monotonicity shows that the overall probability cannot decrease.

Theorem 5.1. For every k ≥ 0 and i ∈ [n], c_i^(k) is non-decreasing. Namely, for all S ⊆ [n] and z ∈ [n] \ S, it holds that c_i^(k)(S) ≤ c_i^(k)(S ∪ {z}).

Proof. We prove by induction on k. For k = 0, as π is independent of S and z, we have:

c_i^(0)(S) = π_σ 1{i=σ} = c_i^(0)(S ∪ {z})

Assume now that the claim holds for some k ≥ 0. For any T ⊆ [n], it holds that:

c_i^(k+1)(T) = Σ_{j=1}^n ρ_ij(T) c_j^(k)(T) + ρ_iσ 1{i ∈ T}    (8)

We separate into cases. When i ≠ z, we have:

i ∈ S:  c_i^(k+1)(S) = Σ_{j=1}^n q⁺_ij c_j^(k)(S) + q⁺_iσ ≤ Σ_{j=1}^n q⁺_ij c_j^(k)(S ∪ z) + q⁺_iσ = c_i^(k+1)(S ∪ z)    (9)
i ∉ S:  c_i^(k+1)(S) = Σ_{j=1}^n q_ij c_j^(k)(S) ≤ Σ_{j=1}^n q_ij c_j^(k)(S ∪ z) = c_i^(k+1)(S ∪ z)    (10)

using the inductive assumption and Eq. (8). When i = z, we have:

c_i^(k+1)(S) = Σ_{j=1}^n q_ij c_j^(k)(S) ≤ Σ_{j=1}^n q_ij c_j^(k)(S ∪ z)
= Σ_{j=1}^n q⁺_ij c_j^(k)(S ∪ z) + Σ_{j=1}^n (q_ij − q⁺_ij) c_j^(k)(S ∪ z)
≤ Σ_{j=1}^n q⁺_ij c_j^(k)(S ∪ z) + Σ_{j=1}^n (q_ij − q⁺_ij)
= Σ_{j=1}^n q⁺_ij c_j^(k)(S ∪ z) + q⁺_zσ = c_i^(k+1)(S ∪ z)

due to q_ij ≥ q⁺_ij, c ≤ 1, Σ_{j=1}^n q_ij = 1, and Σ_{j=1}^n q⁺_ij = 1 − q⁺_iσ.

Corollary 5.2. ∀i ∈ [n], c_i(S) is non-decreasing, hence f(S) = ⟨π, c(S)⟩ is non-decreasing.

5.2 Submodularity

Submodularity captures the principle of diminishing returns. A function f(S) is submodular if:

∀X ⊆ Y ⊆ [n], z ∉ Y:  f(X ∪ {z}) − f(X) ≥ f(Y ∪ {z}) − f(Y)

In what follows we will use the following equivalent definition:

∀S ⊆ [n], z_1, z_2 ∈ [n] \ S:  f(S ∪ {z_1}) + f(S ∪ {z_2}) ≥ f(S ∪ {z_1, z_2}) + f(S)    (11)

Using this formulation, we now show that f(S) as defined in Eq. (7) is submodular.

Theorem 5.3. For every k ≥ 0 and i ∈ [n], c_i^(k)(S) is a submodular function.

Proof. We prove by induction on k. For k = 0, once again π is independent of S and hence c_i^(0) is modular. Assume now that the claim holds for some k ≥ 0. For brevity we define:

c_i^(k) = c_i^(k)(S),  c_{i,1}^(k) = c_i^(k)(S ∪ {z_1}),  c_{i,2}^(k) = c_i^(k)(S ∪ {z_2}),  c_{i,12}^(k) = c_i^(k)(S ∪ {z_1, z_2})

We'd like to show that c_{i,1}^(k+1) + c_{i,2}^(k+1) ≥ c_{i,12}^(k+1) + c_i^(k+1). For every j ∈ [n], we'll prove that:

ρ_ij(S ∪ {z_1}) c_{j,1}^(k) + ρ_ij(S ∪ {z_2}) c_{j,2}^(k) ≥ ρ_ij(S ∪ {z_1, z_2}) c_{j,12}^(k) + ρ_ij(S) c_j^(k)    (12)

By summing over all j ∈ [n] and adding the ρ_iσ 1{i ∈ T} terms as in Eq. (8), this concludes our proof.
We separate into different cases for i. If i ∈ S, then we have ρ_ij(S ∪ {z_1, z_2}) = ρ_ij(S ∪ {z_1}) = ρ_ij(S ∪ {z_2}) = ρ_ij(S) = q⁺_ij. Similarly, if i ∉ S ∪ {z_1, z_2}, then all terms equal q_ij. Eq. (12) then follows from the inductive assumption.
Assume i = z_1 (and analogously for i = z_2). From the assumption in Eq. (2) we can write q_ij = (1 + α) q⁺_ij for some α ≥ 0. Then Eq. (12) becomes:

q⁺_ij c_{j,1}^(k) + (1 + α) q⁺_ij c_{j,2}^(k) ≥ q⁺_ij c_{j,12}^(k) + (1 + α) q⁺_ij c_j^(k)    (13)

Divide by q⁺_ij > 0 if needed and reorder to get:

(c_{j,1}^(k) + c_{j,2}^(k) − c_{j,12}^(k) − c_j^(k)) + α (c_{j,2}^(k) − c_j^(k)) ≥ 0    (14)

This indeed holds since the first term is non-negative from the inductive assumption, and the second term is non-negative because of monotonicity and α ≥ 0.

Corollary 5.4. 
∀i ∈ [n], c_i(S) is submodular, hence f(S) = ⟨π, c(S)⟩ is submodular.

Algorithm 1
1: function SimpleGreedyTagOpt(Q, Q⁺, π, k)    ▷ See supp. for efficient implementation
2:   Initialize S = ∅
3:   for i ← 1 to k do
4:     for z ∈ [n] \ S do
5:       c = (I − A(S ∪ {z})) \ b(S ∪ {z})    ▷ A, b are set by Q, Q⁺ using Eqs. (3), (15)
6:       v(z) = ⟨π, c⟩
7:     S ← S ∪ argmax_z v(z)
8:   Return S

6 Optimization

Maximizing submodular functions is hard in general. However, a classic result by Nemhauser [13] shows that a non-decreasing submodular set function, such as our f(S), can be efficiently optimized via a simple greedy algorithm, with a guaranteed (1 − 1/e)-approximation of the optimum. The greedy algorithm initializes S = ∅, and then sequentially adds elements to S. For a given S, the algorithm iterates over all z ∈ [n] \ S and computes f(S ∪ {z}). Then, it adds the highest scoring z to S, and continues to the next step. We now discuss its implementation for our problem.
Computing f(S) for a given S reduces to solving a set of linear equations. For transient states {1, ..., n − r} and absorbing states {n − r + 1, ..., n + 1 = σ}, the transition matrix ρ(S) becomes:

ρ(S) = [ A(S)  B(S) ; 0  I ]    (15)

where A(S) are the transition probabilities between transient states, B(S) are the transition probabilities from transient states to absorbing states, and I is the identity matrix. When clear from context we will drop the dependence of A, B on S. Note that ρ(S) has at least one absorbing state (namely σ). We denote by b the column of B corresponding to state σ (i.e., B's rightmost column).
We would like to calculate f(S). By Eq. (6), the probability of reaching σ given an initial state i is:

c_i(S) = Σ_{t=0}^∞ Σ_{j ∈ [n−r]} P_S[X_t = σ | X_{t−1} = j] P_S[X_{t−1} = j | X_0 = i] = (Σ_{t=0}^∞ Aᵗ b)_i

The above series has a closed-form solution:

Σ_{t=0}^∞ Aᵗ = (I − A)⁻¹  ⇒  c = (I − A)⁻¹ b

Thus, c(S) is the solution of the set of linear equations, which readily gives us f(S):

f(S) = ⟨π, c⟩  s.t.  (I − A(S)) c = b(S)    (16)

The greedy algorithm can thus be implemented by sequentially considering candidate sets S of increasing size, and for each z calculating f(S ∪ {z}) by solving a set of linear equations (see Algorithm 1). Though parallelizable, this naïve implementation may be costly, as it requires solving O(n²) sets of n − r linear equations, one for every candidate addition of z to S. Fast submodular solvers [7] can reduce the number of f(S) evaluations by an order of magnitude. In addition, we now show how a significant speedup in computing f(S) itself can be achieved using certain properties of f(S).
A standard method for solving the set of linear equations (I − A)c = b is to first compute an LUP decomposition for (I − A), namely find lower and upper triangular matrices L, U and a permutation matrix P such that LU = P(I − A). Then, it suffices to solve Ly = Pb and Uc = y. Since L and U are triangular, solving these equations can be performed efficiently. The costly operation is computing the decomposition in the first place.
Recall that ρ(S) is composed of rows from Q⁺ corresponding to S and rows from Q corresponding to [n] \ S. This means that ρ(S) and ρ(S ∪ {z}) differ only in one row, or equivalently, that ρ(S ∪ {z}) can be obtained from ρ(S) by adding a rank-1 matrix.
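To make the greedy procedure concrete, the following minimal sketch implements the greedy selection and the linear-system computation of f(S) on a small instance of the vertex-cover construction from Sec. 4 (a triangle graph with ε = 0.5); the state layout, variable names, and toy numbers are our own illustrations, not taken from the paper:

```python
import numpy as np

# Toy instance of the reduction in Sec. 4: a triangle graph with eps = 0.5.
# States 0..2 are transient (graph nodes), state 3 is the competing
# absorbing state (the exit), and state 4 is the target sigma.
n, eps = 3, 0.5
EXIT, SIGMA = 3, 4

Q = np.zeros((5, 5))   # row i: transitions when i is NOT linked to sigma
Qp = np.zeros((5, 5))  # row i: transitions when i IS linked to sigma (q+)
for i in range(n):
    Qp[i, SIGMA] = 1.0                  # linked: jump to sigma w.p. 1 (Eq. (4))
    Q[i, EXIT] = eps                    # unlinked: leave w.p. eps...
    for j in range(n):
        if j != i:                      # ...else move to a uniform neighbor
            Q[i, j] = (1 - eps) / (n - 1)
for a in (EXIT, SIGMA):                 # absorbing self-loops
    Q[a, a] = Qp[a, a] = 1.0

pi = np.ones(5) / n
pi[[EXIT, SIGMA]] = 0.0                 # walks start at transient states only

def f_value(S):
    """f(S) = <pi, c> where (I - A(S)) c = b(S), as in Eq. (16)."""
    idx = sorted(S)
    rho = Q.copy()
    rho[idx] = Qp[idx]                  # Eq. (3): rows from Q+ for i in S
    T = list(range(n))                  # transient states
    A = rho[np.ix_(T, T)]
    b = rho[T, SIGMA]
    c = np.linalg.solve(np.eye(n) - A, b)
    return float(pi[T] @ c)

def greedy(k):
    """Greedy step of Algorithm 1: repeatedly add the largest-gain state."""
    S = set()
    for _ in range(k):
        S.add(max(set(range(n)) - S, key=lambda z: f_value(S | {z})))
    return S
```

On this instance any two nodes form a vertex cover of the triangle, so greedy(2) attains f(S) = 1 − ((n − k)/n)ε = 5/6, matching the analysis of Sec. 4.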
Given an LUP decomposition of ρ(S), we can efficiently compute f(S ∪ {z}) (and the corresponding decomposition) using efficient rank-1-update techniques such as Bartels-Golub-Reid [17], which are especially efficient for sparse matrices. As a result, it suffices to compute only a single LUP decomposition once at the beginning, and perform cheap updates at every step. We give an efficient implementation in the supp. material.

7 Optimal Tagging

In this section we return to the task of optimal tagging and show how the Markov chain optimization framework described above can be applied. We use a random surfer model, where a browsing user hops between items and tags in a bipartite Markov chain. In its explicit form, our model captures the activity of browsing users who, when viewing an item, are presented with the item's tags and may choose to click on them (and similarly when viewing tags).
In reality, many systems also include direct links between related items, often in the form of a ranked list of item recommendations. The relatedness of two items is in many cases, at least to some extent, based on their mutual tags. Our model captures this notion of similarity by indirect transitions via tag states. This allows us to encode tags as variables in the objective. Furthermore, adding direct transitions between items is straightforward, as our results apply to general Markov chains. Note that in contrast to models for tag recommendation, we do not need to explicitly model users, as our setup defines only one distinct optimization task per item.
In what follows we formalize the above notions. Consider a system of m items Ω = {ω_1, ..., ω_m} and n tags T = {τ_1, ..., τ_n}. Each item ω_i has a set of tags T_i ⊆ T, and each tag τ_j has a set of items Ω_j ⊆ Ω.
The items and tags constitute the states of a bipartite Markov chain, where users hop between items and tags. Specifically, the transition matrix ρ can have non-zero entries ρ_ij and ρ_ji for items ω_i tagged by τ_j. To model the fact that browsing users eventually leave the system, we add a global absorbing state ∅ and add transition probabilities ρ_i∅ = ε_i > 0 for all items ω_i. For simplicity we assume that ε_i = ε for all i, and that π can be non-zero only for tag states.
In our setting, when a new item σ is uploaded, its owner may choose a set S ⊆ T of at most k tags for σ. Her goal is to choose S such that the probability of an arbitrary browsing user reaching (or equivalently, being absorbed in) σ while browsing the system is maximal. As in the general case, the choice of S affects the transition matrix ρ(S). Denote by P_ij the transition probability from item ω_i to tag τ_j, by R_ji(S) the transition probability from τ_j to ω_i under S, and let r_j(S) = R_jσ(S). Using Eq. (15), ρ can be written as:

ρ(S) = [ A  B ; 0  I₂ ],    A = [ 0  R(S) ; P  0 ],    B = [ 0  r(S) ; 1·ε  0 ],    I₂ = [ 1  0 ; 0  1 ]

where 0 and 1 are appropriately sized vectors or matrices. Since we are only interested in selecting tags, we may consider a chain that includes only the tag states, with the item states marginalized out. The transition matrix between tags is given by ρ₂(S) = R(S)P. The transition probabilities from tags to σ remain r(S). Our objective of maximizing the probability of reaching σ under S is then:

f(S) = ⟨π, c⟩  s.t.  (I − R(S)P) c = r(S)    (17)

which is a special case of the general objective presented in Eq. (16), and hence can be optimized efficiently.
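As an illustration of this construction, here is a minimal sketch that builds P, R(S), and r(S) for a toy item-tag system and evaluates the tag-level objective of Eq. (17). Tag-to-item transitions are taken proportional to item weights, as in the experiments of Sec. 8; the toy data and helper names are our own assumptions, not specified by the paper:

```python
import numpy as np

# Toy item-tag system for the construction in Sec. 7: three existing
# items, two tags. Each item leaks probability eps to the exit state,
# and the new item sigma receives mass from every tag in S.
m, n, eps = 3, 2, 0.1
tags_of = [[0], [0, 1], [1]]           # item i -> its tag set T_i
weights = np.array([3.0, 1.0, 2.0])    # e.g. listen counts per item
sigma_weight = 1.0                     # assumed weight of the new item

def tag_objective(S):
    """f(S) = <pi, c> with (I - R(S)P) c = r(S), as in Eq. (17)."""
    # P: items -> tags, uniform over an item's tags (eps mass exits)
    P = np.zeros((m, n))
    for i, ts in enumerate(tags_of):
        P[i, ts] = (1 - eps) / len(ts)
    # R(S): tags -> existing items, proportional to item weight;
    # r(S): tags -> sigma, non-zero only for tags chosen in S
    R = np.zeros((n, m))
    r = np.zeros(n)
    for j in range(n):
        items = [i for i in range(m) if j in tags_of[i]]
        tot = weights[items].sum() + (sigma_weight if j in S else 0.0)
        R[j, items] = weights[items] / tot
        if j in S:
            r[j] = sigma_weight / tot
    pi = np.ones(n) / n                # uniform start over tag states
    c = np.linalg.solve(np.eye(n) - R @ P, r)
    return float(pi @ c)
```

Note that tagging σ with τ_j rescales τ_j's outgoing probabilities, so q⁺ ≤ q holds elementwise as required by Eq. (2), and the objective behaves monotonically and submodularly on such instances.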
In the supplementary material we prove that this special case is still NP-hard.

8 Experiments

To demonstrate the effectiveness of our approach, we perform experiments on optimal tagging in data collected from Last.fm, Delicious, and Movielens by the HetRec 2011 workshop [3]. The datasets include all items (between 10,197 and 59,226) and tags (between 11,946 and 53,388) reached by crawling a set of about 2,000 users in each system, as well as some metadata.
For each dataset, we first created a bipartite graph of items and tags. Next, we generated 100 different instances of our problem per dataset by expanding each of the 100 highest-degree tags and creating a Markov chain for their items and their tags. We discarded nodes with fewer than 10 edges.
To create an interesting tag selection setup, for each item in each instance we augmented its true tags with up to 100 similar tags (based on [18]). These served as the set of candidate tags for which transitions to the item were allowed. We focused on items which were ranked first in at least 10 of their 100 candidate tags, giving a total of 18,167 focal items for comparison. For each such item, our task was to choose the k tags which maximize the probability of reaching the focal item.

Figure 1: The probability of reaching a focal item σ under a budget of k tags for various methods.

Transition probabilities from tags to items were set to be proportional to the item weights: number of listens for artists in Last.fm, tag counts for bookmarks in Delicious, and averaged ratings for movies in Movielens. As the datasets do not include explicit weights for tags, we used uniform transition probabilities from items to tags.
The initial distribution was set to be uniform over the set of candidate tags, and the transition probability from items to ∅ was set to ε = 0.1.

We compared the performance of our greedy algorithm with several baselines. Random-walk based methods included PageRank and an adaptation² of BiFolkRank [10], a state-of-the-art tag recommendation method that operates on item-tag relations. Heuristics included choosing the tags with the highest and lowest degree, the true labels (for relevant values of k) sorted by weight, and random selection. To measure the added value of long random walks, we also display the probability of reaching σ in one step.

Results for all three datasets are provided in Fig. 1, which shows the average probability of reaching the focal item for values of k ∈ {1, . . . , 25}. As can be seen, the greedy method clearly outperforms the other baselines. Considering paths of all lengths improves results by a considerable 20-30% for k = 1, and by roughly 5% for k = 25. An interesting observation is that the performance of the true tags is rather poor. A plausible explanation is that our data are taken from collaborative tagging systems, where items can be tagged by any user. In such systems, tags typically play a categorical or hierarchical role, and as such are probably not optimal for promoting item popularity. The supplementary material includes an interesting case analysis.

9 Conclusions

In this paper we introduced the problem of optimal tagging, along with the general problem of optimizing probability mass in Markov chains by adding links. We proved that the problem is NP-hard, but can be (1 − 1/e)-approximated due to the submodularity and monotonicity of the objective. Our efficient greedy algorithm can be used in practice for choosing optimal tags or keywords in various domains.
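As an illustrative outline (not the paper's optimized implementation, which exploits the problem structure to speed up each greedy step), the greedy procedure simply adds the tag with the largest marginal gain until the budget k is spent; the oracle `eval_f` stands in for the absorption-probability objective of Eq. (17), and the toy coverage objective below is a stand-in chosen only because it is monotone and submodular:

```python
def greedy_tags(candidates, k, eval_f):
    """Greedy (1 - 1/e)-approximation for a monotone submodular f
    under a cardinality constraint (Nemhauser et al. [13])."""
    candidates = list(candidates)
    S = frozenset()
    best_val = eval_f(S)
    for _ in range(k):
        # Marginal gain of every tag not yet selected.
        gains = {t: eval_f(S | {t}) - best_val
                 for t in candidates if t not in S}
        if not gains:
            break
        t_star = max(gains, key=gains.get)
        if gains[t_star] <= 0:  # monotone f: no tag helps anymore
            break
        S = S | {t_star}
        best_val += gains[t_star]
    return S

# Toy monotone submodular objective: coverage of hypothetical tag "audiences".
audiences = {"rock": {1, 2, 3}, "indie": {2, 3}, "pop": {3, 4}, "jazz": {5}}
f = lambda S: float(len(set().union(*[audiences[t] for t in S])))

print(sorted(greedy_tags(audiences.keys(), 2, f)))  # -> ['pop', 'rock']
```

Each greedy step above costs one objective evaluation per remaining candidate; in the paper's setting each evaluation is a linear solve, which is exactly the cost the authors' structural speedup targets.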
Our experimental results show that simple heuristics and PageRank variants underperform our principled approach, and that naïvely selecting the true tags can be suboptimal.

In our work we assumed access to the transition probabilities between tags and items and vice versa. While the transition probabilities for existing items can be easily estimated by a system's operator, estimating the probabilities from tags to new items is non-trivial, and is an interesting problem to pursue. Even so, users do not typically have access to the information required for estimation. Our results suggest that users can simply apply the greedy steps sequentially via trial and error [9].

Finally, since our task is of a counterfactual nature, it is hard to draw conclusions from the experiments as to the effectiveness of our method in real settings. It would be interesting to test it in reality, and compare it to strategies used by both lay users and experts. Especially interesting in this context are competitive domains such as ad placement and viral marketing. We leave this for future research.

Acknowledgments: This work was supported by the ISF Centers of Excellence grant 2180/15, and by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).

² To apply the method to our setting, we used a uniform prior over user-tag relations.

References

[1] Konstantin Avrachenkov and Nelly Litvak. The effect of new links on Google PageRank. Stochastic Models, 22(2):319–331, 2006.

[2] Sergey Brin and Lawrence Page. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer Networks, 56(18):3825–3833, 2012.

[3] Iván Cantador, Peter Brusilovsky, and Tsvi Kuflik.
2nd workshop on information heterogeneity and fusion in recommender systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems, RecSys 2011, New York, NY, USA, 2011. ACM.

[4] Balázs Csanád Csáji, Raphaël M. Jungers, and Vincent D. Blondel. PageRank optimization in polynomial time by stochastic shortest path reformulation. In Algorithmic Learning Theory, pages 89–103. Springer, 2010.

[5] Xiaomin Fang, Rong Pan, Guoxiang Cao, Xiuqiang He, and Wenyuan Dai. Personalized tag recommendation through nonlinear tensor factorization using Gaussian kernel. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[6] Aristides Gionis, Evimaria Terzi, and Panayiotis Tsaparas. Opinion maximization in social networks. In SDM, pages 387–395. SIAM, 2013.

[7] Amit Goyal, Wei Lu, and Laks V. S. Lakshmanan. CELF++: Optimizing the greedy algorithm for influence maximization in social networks. In Proceedings of the 20th International Conference Companion on World Wide Web, pages 47–48. ACM, 2011.

[8] Andreas Hotho, Robert Jäschke, Christoph Schmitz, Gerd Stumme, and Klaus-Dieter Althoff. FolkRank: A ranking algorithm for folksonomies. In LWA, volume 1, pages 111–114, 2006.

[9] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137–146. ACM, 2003.

[10] Heung-Nam Kim and Abdulmotaleb El Saddik. Personalized PageRank vectors for tag recommendations: Inside FolkRank. In Proceedings of the Fifth ACM Conference on Recommender Systems, pages 45–52. ACM, 2011.

[11] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies.
The Journal of Machine Learning Research, 9:235–284, 2008.

[12] Charalampos Mavroforakis, Michael Mathioudakis, and Aristides Gionis. Absorbing random-walk centrality: Theory and algorithms. arXiv preprint arXiv:1509.02533, 2015.

[13] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294, 1978.

[14] Martin Olsen. Maximizing PageRank with new backlinks. In International Conference on Algorithms and Complexity, pages 37–48. Springer, 2010.

[15] Martin Olsen and Anastasios Viglas. On the approximability of the link building problem. Theoretical Computer Science, 518:96–116, 2014.

[16] Martin Olsen, Anastasios Viglas, and Ilia Zvedeniouk. A constant-factor approximation algorithm for the link building problem. In Combinatorial Optimization and Applications, pages 87–96. Springer, 2010.

[17] John Ker Reid. A sparsity-exploiting variant of the Bartels-Golub decomposition for linear programming bases. Mathematical Programming, 24(1):55–69, 1982.

[18] Börkur Sigurbjörnsson and Roelof Van Zwol. Flickr tag recommendation based on collective knowledge. In Proceedings of the 17th International Conference on World Wide Web, pages 327–336. ACM, 2008.