{"title": "Factor Modeling for Advertisement Targeting", "book": "Advances in Neural Information Processing Systems", "page_first": 324, "page_last": 332, "abstract": "We adapt a probabilistic latent variable model, namely GaP (Gamma-Poisson), to ad targeting in the contexts of sponsored search (SS) and behaviorally targeted (BT) display advertising. We also approach the important problem of ad positional bias by formulating a one-latent-dimension GaP factorization. Learning from click-through data is intrinsically large scale, even more so for ads. We scale up the algorithm to terabytes of real-world SS and BT data that contains hundreds of millions of users and hundreds of thousands of features, by leveraging the scalability characteristics of the algorithm and the inherent structure of the problem including data sparsity and locality. Specifically, we demonstrate two somewhat orthogonal philosophies of scaling algorithms to large-scale problems, through the SS and BT implementations, respectively. Finally, we report the experimental results using Yahoos vast datasets, and show that our approach substantially outperform the state-of-the-art methods in prediction accuracy. For BT in particular, the ROC area achieved by GaP is exceeding 0.95, while one prior approach using Poisson regression yielded 0.83. For computational performance, we compare a single-node sparse implementation with a parallel implementation using Hadoop MapReduce, the results are counterintuitive yet quite interesting. We therefore provide insights into the underlying principles of large-scale learning.", "full_text": "Factor Modeling for Advertisement Targeting\n\nYe Chen\u2217\neBay Inc.\n\nyechen1@ebay.com\n\nMichael Kapralov\nStanford University\n\nkapralov@stanford.edu\n\nDmitry Pavlov\u2020\nYandex Labs\n\ndmitry-pavlov@yandex-team.ru\n\nJohn F. 
Canny
University of California, Berkeley
jfc@cs.berkeley.edu

Abstract

We adapt a probabilistic latent variable model, namely GaP (Gamma-Poisson) [6], to ad targeting in the contexts of sponsored search (SS) and behaviorally targeted (BT) display advertising. We also approach the important problem of ad positional bias by formulating a one-latent-dimension GaP factorization. Learning from click-through data is intrinsically large scale, even more so for ads. We scale up the algorithm to terabytes of real-world SS and BT data containing hundreds of millions of users and hundreds of thousands of features, by leveraging the scalability characteristics of the algorithm and the inherent structure of the problem, including data sparsity and locality. Specifically, we demonstrate two somewhat orthogonal philosophies of scaling algorithms to large-scale problems, through the SS and BT implementations, respectively. Finally, we report experimental results on Yahoo's vast datasets, and show that our approach substantially outperforms the state-of-the-art methods in prediction accuracy. For BT in particular, the ROC area achieved by GaP exceeds 0.95, while one prior approach using Poisson regression [11] yielded 0.83. For computational performance, we compare a single-node sparse implementation with a parallel implementation using Hadoop MapReduce; the results are counterintuitive yet quite interesting. We therefore provide insights into the underlying principles of large-scale learning.

1 Introduction

Online advertising has become the cornerstone of many sustainable business models in today's Internet, including search engines (e.g., Google), content providers (e.g., Yahoo!), and social networks (e.g., Facebook). One essential competitive advantage of online advertising over traditional channels is that it allows for targeting. 
The objective of ad targeting is to select the most relevant ads to present to a user, based on contextual and prior knowledge about this user. The relevance measure, or response variable, is typically the click-through rate (CTR), while the explanatory variables vary across application domains. For instance, sponsored search (SS) [17] uses the query, content match [5] relies on page content, and behavioral targeting (BT) [11] leverages historical user behavior. Nevertheless, the training data can generally be formed as a user-feature matrix of event counts, where the feature dimension contains various events such as queries, ad clicks, and ad views. This characterization of the data naturally leads to our adoption of the family of latent variable models [20, 19, 16, 18, 4, 6], which have been applied quite successfully to text and image corpora. In general, the goal of latent variable models is to discover statistical structures (factors) latent in the data, often with dimensionality reduction, and thus to generalize well to unseen examples. In particular, our choice of Gamma-Poisson (GaP) is theoretically as well as empirically motivated, as we elaborate in Section 2.2.

*† This work was conducted when the authors were at Yahoo! Labs, 701 First Ave, Sunnyvale, CA 94089.

Sponsored search involves placing textual ads related to the user query alongside the algorithmic search results. To estimate ad relevance, previous approaches include similarity search [5], logistic regression [25, 8], and classification and online learning with perceptron [13], though primarily in the original term space. 
We consider the problem of estimating CTR of the form p(click|ad, user, query), through a factorization of the user-feature matrix into a latent factor space, as derived in Section 2.1. SS adopts the keyword-based pay-per-click (PPC) advertising model [23]; hence the accuracy of CTR prediction is essential in determining an ad's ranking, placement, pricing, and filtering [21].

Behavioral targeting leverages historical user behavior to select relevant ads to display. Since BT does not primarily rely on contextual information such as the query or page content, it is an enabling technology for display (banner) advertising, where such contextual data is typically unavailable, at least from the ad's side (e.g., while a user reads an email, watches a movie, or uses instant messaging). We consider the problem of predicting CTR of the form p(click|ad, user). The question addressed by the state-of-the-art BT is instead that of predicting the CTR of an ad in a given category (e.g., Finance and Technology), or p(click|ad-category, user), by fitting a sign-constrained linear regression with categorized features [12] or a non-negative Poisson regression with granular features [11, 10, 7]. Ad categorization is done by human labeling and is thus expensive and error-prone. One of the major advantages of GaP is the ability to perform granular or per-ad prediction, which is infeasible with the previous BT technologies due to scalability issues (e.g., a regression model for each category).

2 GaP model

GaP is a generative probabilistic model, as graphically represented in Figure 1. Let F be an n × m data matrix whose element fij is the observed count of event (or feature) i by user j. Y is a matrix of expected counts with the same dimensions as F. F, element-wise, is naturally assumed to follow Poisson distributions with mean parameters in Y respectively, i.e., F ∼ Poisson(Y). 
Let X be a d × m matrix whose column vector xj is a low-dimensional representation of user j in a latent space of "topics". The element xkj encodes the "affinity" of user j to topic k, as the total number of occurrences of all events contributing to topic k. Λ is an n × d matrix whose column Λk represents the kth topic as a vector of event probabilities p(i|k), that is, a multinomial distribution of event counts conditioned on topic k. Therefore, the Poisson mean matrix Y has a linear parameterization in Λ and X, i.e., Y = ΛX. GaP essentially yields an approximate factorization of the data matrix into two matrices with a low inner dimension, F ≈ ΛX. The approximation has an appealing interpretation column-wise, f ≈ Λx: each user vector f in event space is approximated by a linear combination of the column vectors of Λ, weighted by the topical mixture x for that user. Since by design d ≪ n, m, the model matrix Λ shall capture significant statistical (topical) structure hidden in the data. Finally, xkj is given a gamma distribution as an empirical prior. The generative process of an observed event-user count fij is as follows:

1. Generate xkj ∼ Gamma(αk, βk), ∀k.
2. Generate yij occurrences of event i from a mixture of k Multinomial(p(i|k)) with outcome i, i.e., yij = Λi xj, where Λi is the ith row vector of Λ.
3. Generate fij ∼ Poisson(yij).

The starting point of the generative process is a gamma distribution of x, with pdf

p(x) = \frac{x^{\alpha-1}\exp(-x/\beta)}{\beta^{\alpha}\Gamma(\alpha)} \quad \text{for } x > 0 \text{ and } \alpha, \beta > 0.   (1)

It has a shape parameter α and a scale parameter β. 
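The three-step generative process above can be sketched in a few lines of numpy; all sizes and hyperparameter values below are hypothetical toy choices, not the paper's settings.

```python
import numpy as np

# toy sizes and hyperparameters (all values hypothetical)
n, d, m = 6, 2, 4                 # events, latent topics, users
alpha = np.array([1.45, 1.45])    # gamma shape per topic
beta = np.array([0.2, 0.2])       # gamma scale per topic
rng = np.random.default_rng(0)

# Lambda: column k is a multinomial p(i|k) over the n events
Lam = rng.random((n, d))
Lam /= Lam.sum(axis=0, keepdims=True)

# step 1: topic affinities x_kj ~ Gamma(alpha_k, beta_k)
X = rng.gamma(alpha[:, None], beta[:, None], size=(d, m))

# step 2: expected counts y_ij = Lambda_i x_j, i.e. Y = Lambda X
Y = Lam @ X

# step 3: observed counts f_ij ~ Poisson(y_ij)
F = rng.poisson(Y)
```

The resulting F plays the role of the observed user-feature count matrix.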
Next, from the latent random vector x characterizing a user, we derive the expected count vector y for that user as

y = \Lambda x.   (2)

The last stochastic process is a Poisson distribution of the observed count f with mean value y,

p(f) = \frac{y^{f}\exp(-y)}{f!} \quad \text{for } f \geq 0.   (3)

The data likelihood for a user generated as described above is

p(f) = \prod_{i=1}^{n} \frac{y_i^{f_i}\exp(-y_i)}{f_i!} \prod_{k=1}^{d} \frac{(x_k/\beta_k)^{\alpha_k-1}\exp(-x_k/\beta_k)}{\beta_k\,\Gamma(\alpha_k)},   (4)

where y_i = \Lambda_i x. And the log likelihood reads

\ell = \sum_i \left(f_i \log y_i - y_i - \log f_i!\right) + \sum_k \left[(\alpha_k - 1)\log x_k - x_k/\beta_k - \alpha_k \log(\beta_k) - \log\Gamma(\alpha_k)\right].   (5)

Figure 1: GaP graphical model (fij ∼ Poisson(yij) ← yij ∼ mixture of Multinomial(p(i|k)) ← xkj ∼ Gamma(αk, βk)).

Figure 2: GaP online prediction (query-ad and cookie inverted indices map a request to a row Λi and a user column xj, whose product gives zij).

Given a corpus of user data F = (f1, ..., fj, ..., fm), we wish to find the maximum likelihood estimates (MLE) of the model parameters (Λ, X). 
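The per-user log likelihood of Eq. (5) is straightforward to evaluate as a sanity check; the toy matrices and prior values below are hypothetical.

```python
import numpy as np
from math import lgamma

def log_likelihood(f, x, Lam, alpha, beta):
    """Eq. (5): Poisson log-likelihood of the count vector f for one user,
    plus the gamma log-prior on the user's topic vector x."""
    y = Lam @ x                                          # expected counts, Eq. (2)
    log_fact = np.array([lgamma(fi + 1.0) for fi in f])  # log f_i!
    poisson = np.sum(f * np.log(y) - y - log_fact)
    log_gam = np.array([lgamma(a) for a in alpha])       # log Gamma(alpha_k)
    prior = np.sum((alpha - 1) * np.log(x) - x / beta
                   - alpha * np.log(beta) - log_gam)
    return poisson + prior

# toy values (hypothetical)
Lam = np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]])
alpha, beta = np.array([1.45, 1.45]), np.array([0.2, 0.2])
ll = log_likelihood(np.array([1, 0, 2]), np.array([0.5, 0.3]), Lam, alpha, beta)
```

Monitoring this quantity over EM iterations is a simple convergence check.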
Based on an elegant multiplicative recurrence developed by Lee and Seung [22] for NMF, the following EM algorithm was derived in [6]:

E-step: x'_{kj} \leftarrow x_{kj}\,\frac{\sum_i (f_{ij}\Lambda_{ik}/y_{ij}) + (\alpha_k - 1)/x_{kj}}{\sum_i \Lambda_{ik} + 1/\beta_k}.   (6)

M-step: \Lambda'_{ik} \leftarrow \Lambda_{ik}\,\frac{\sum_j f_{ij}x_{kj}/y_{ij}}{\sum_j x_{kj}}.   (7)

2.1 Two variants for CTR prediction

The standard GaP model fits discrete count data. We now describe two variant derivations for predicting CTR. The first approach is to predict clicks and views independently, and then to construct the unbiased estimator of CTR, typically with Laplacian smoothing:

\widehat{CTR}_{ad(i)j} = \left(\Lambda_{click(i)}\,x_j + \delta\right) / \left(\Lambda_{view(i)}\,x_j + \eta\right),   (8)

where click(i) and view(i) are the indices corresponding to the click/view pair of ad feature i, respectively, by user j; δ and η are smoothing constants.

The second idea is to consider the relative frequency of counts, particularly the number of clicks relative to the number of views for the events of interest. Formally, let F be a matrix of observed click counts and Y be a matrix of the corresponding expected click counts. We further introduce a matrix of observed views V and a matrix of click probabilities Z, and define the link function:

F \sim Y = V . Z = V . (\Lambda X),   (9)

where '.' denotes element-wise matrix multiplication. The linear predictor Z = ΛX now estimates CTR directly, and is scaled by the observed view counts V to obtain the expected number of clicks Y. 
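A minimal vectorized sketch of the multiplicative recurrence in Eqs. (6)-(7); the toy data sizes and the small eps guard against division by zero are our own assumptions.

```python
import numpy as np

def em_step(F, Lam, X, alpha, beta, eps=1e-12):
    """One multiplicative EM iteration for standard GaP (Eqs. 6-7)."""
    Y = Lam @ X + eps
    # E-step: x'_kj = x_kj * (sum_i f_ij Lam_ik / y_ij + (alpha_k - 1)/x_kj)
    #                       / (sum_i Lam_ik + 1/beta_k)
    num = X * (Lam.T @ (F / Y)) + (alpha - 1)[:, None]
    den = (Lam.sum(axis=0) + 1.0 / beta)[:, None]
    X = num / den
    # M-step: Lam'_ik = Lam_ik * (sum_j f_ij x_kj / y_ij) / (sum_j x_kj)
    Y = Lam @ X + eps
    Lam = Lam * ((F / Y) @ X.T) / (X.sum(axis=1) + eps)
    return Lam, X

# random toy data (all sizes hypothetical)
rng = np.random.default_rng(0)
F = rng.poisson(1.0, size=(8, 5)).astype(float)
Lam = rng.random((8, 3)) + 0.1
X = rng.random((3, 5)) + 0.1
alpha, beta = np.full(3, 1.45), np.full(3, 0.2)
for _ in range(20):
    Lam, X = em_step(F, Lam, X, alpha, beta)
```

With a shape parameter above 1, the E-step numerator stays strictly positive, so the user profiles never collapse to zero.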
The Poisson assumption is only given to the click events F, with mean parameters Y. Given a number of views v and the probability of click for a single view (the CTR), a more natural stochastic model for click counts is Binomial(v, CTR). But since in ad data the number of views is sufficiently large and CTR is typically very small, the binomial converges to Poisson(v · CTR). Given the same form of log likelihood as in Eq. (5), but with the extended link function in Eq. (9), we derive the following EM recurrence:

E-step: x'_{kj} \leftarrow x_{kj}\,\frac{\sum_i (f_{ij}\Lambda_{ik}/z_{ij}) + (\alpha_k - 1)/x_{kj}}{\sum_i (v_{ij}\Lambda_{ik}) + 1/\beta_k}.   (10)

M-step: \Lambda'_{ik} \leftarrow \Lambda_{ik}\,\frac{\sum_j (f_{ij}x_{kj}/z_{ij})}{\sum_j (v_{ij}x_{kj})}.   (11)

2.2 Rationale for GaP model

GaP is a generative probabilistic model for discrete data (such as text). Similar to LDA (latent Dirichlet allocation) [4], GaP represents each sample (a document, or in this case a user) as a mixture of topics or interests. The latent factors in these models are non-negative, which has proved to have several practical advantages. First of all, texts arguably do comprise passages of prose on specific topics, whereas negative factors have no clear interpretation. Similarly, users have occasional interests in particular products or groups of products, and their click-through propensity will dramatically increase for those products. On the other hand, "temporary avoidance" of a product line is less plausible, and one clearly cannot have negative click-through counts, which would be a consequence of allowing negative factors. 
A more practical aspect of non-negative factor models is that weak factor coefficients are driven to zero, especially when the input data is itself sparse; hence the non-zeros are much more stable, and the cross-validation error much lower. This helps to avoid overfitting, and a typical LDA or GaP model can be run with high latent dimensions without overfitting, e.g., with 100 data measurements per user; by contrast, one factor of a 100-dimensional PCA model will essentially be a (reversible) linear transformation of the input data. On the choice of GaP vs. LDA, the models are very similar; however, there is a key difference. In LDA, the choice of latent factor is made independently word by word, or in the BT case, ad view by ad view. In GaP, however, it is assumed that several items are chosen from each latent factor, i.e., that interests are locally related. Hence GaP uses gamma priors, which include both shape and scale factors. The scale factors provide an estimated count of the number of items drawn from each latent factor. Another reason for our preference for GaP in this application is its simplicity. While LDA requires the application of transcendental functions across the model with each iteration (e.g., the Ψ function in Equation (8) of [4]), GaP requires only basic arithmetic. Apart from transcendentals, the numbers of arithmetic operations of the two methods on same-sized data are identical. While we did not have the resources to implement LDA at this scale in addition to GaP, small-scale experiments showed identical accuracy. So we chose GaP for its speed and simplicity.

3 Sponsored search

We apply the second variant of GaP, the CTR-based formulation, to SS CTR prediction, where the factorization directly yields a linear predictor of CTR, p(click|ad, user, query), as in Eq. 
(9). Based on the structure of the SS click-through data, specifically the dimensionality and the user data locality, the deployment of GaP for SS involves three processes: (1) offline training, (2) offline user profile updating, and (3) online CTR prediction, as elaborated below.

3.1 The GaP deployment for SS

Offline training. First, given the observed click counts F and view counts V obtained from a corpus of historical user data, we derive Λ and X using the CTR-based GaP algorithm in Eqs. (10) and (11). Counts are aggregated over a certain period of time (e.g., one month) and over the feature space to be considered in the model. In SS, the primary feature type is the query-ad pair (denoted QL for query-linead, where a linead is a textual ad), since it is the response variable whose CTR is predicted. Other features can also be added based on their predictive capability, such as query term, linead term, ad group, and match type. This will effectively change the per-topic feature mixture in Λ, and possibly the per-user topic mixture in X, with the objective of improving CTR prediction by adding more contextual information. In prediction, though, one focuses only on the blocks of QL features in Λ and Z. In order for the model matrix Λ to capture the corpus-wide topical structure, the entire user corpus should be used as the training set.

Offline user profile updating. Second, given the derived model matrix Λ, we update the user profiles X in a distributed and data-local fashion. This updating step is necessary for two reasons. (1) The user space is more volatile relative to the feature space, due to cookie churn (fast turnover) and changes in a user's interests over time. To ensure that the model captures the latest user behavioral patterns and has high coverage of users, one needs to refresh the model often, e.g., on a daily basis. 
(2) Retraining the model from scratch is relatively expensive, and thus impractical for frequent model refreshes. However, partial model refresh, i.e., updating X, has a very efficient and scalable solution, which works as follows. Once a model is trained on a full corpus of user data, it suffices to keep only Λ, hence the name model matrix. Λ contains the global information of latent topics in the form of feature mixtures. We then distribute Λ across servers, each randomly bucketized for a subset of users. Note that this bucketization is exactly how production ad serving works. With the global Λ and the user-local data F and V, X can be computed using the E-step recurrence only. According to Eq. (10), the update rule for a given user xj involves only the data for that user and the global Λ. Moreover, since Λ and a local X usually fit in memory, we can perform successive E-steps to converge X in an order of magnitude less time than a global E-step. Notice that the multiplicative factor in the E-step depends on xkj, the parameter being updated; thus consecutive E-steps will indeed advance convergence.

Online CTR prediction. Finally, given the global Λ and a local X learned and stored in each server, the expected CTR for a user given a QL pair, p(click|QL, user), is computed online as follows. When a user issues a query, a candidate set of lineads is retrieved by applying various matching algorithms. Taking the product of these lineads with the query gives a set of QLs to be scored. One then extracts the row vectors of Λ corresponding to the candidate QL set to form a smaller block Λmat, and looks up the column vector xj for that user in X. The predicted CTRs are obtained by a matrix-vector multiplication z_j^mat = Λmat xj. The online prediction deployment is
The online prediction deployment is\nschematically shown in Figure 2.\n\n3.2 Positional normalization\n\nOur analysis so far has been abstracted from another essential factor, that is, the position of an ad\nimpression on a search result page. It is known intuitively and empirically that ad position has a\nsigni\ufb01cant effect on CTR [24, 14]. In this section we treat the positional effect in a statistically\nsound manner.\nThe observed CTR actually represents a conditional probability p(click|position). We wish to learn\na CTR normalized by position, i.e., \u201cscaled\u201d to a same presentation position, in order to capture\nthe probability of click regardless of where the impression is shown. To achieve positional normal-\nization, we assume the following Markov chain: (1) viewing an ad given its position, and then (2)\nclicking the ad given a user actually views the ad; thus\n\np(click|position) = p(click|view)p(view|position),\n\n(12)\nwhere \u201cview\u201d is the event of a user voluntarily examining an ad, instead of an ad impression itself.\nEq. (12) suggests a factorization of a matrix of observed CTRs into two vectors. As it turns out,\nto estimate the positional prior p(view|position) we can apply a special GaP factorization with one\ninner dimension. The data matrices F and V are now feature-by-position matrices, and the inner\ndimension can be interpreted as the topic of physically viewing.\nIn both training and evaluation, one shall use the position-normalized CTR, i.e., p(click|view). First,\nthe GaP algorithm for estimating positional priors is run on the observed click and view counts of\n(feature, position) pairs. This yields a row vector of positional priors xpos. In model training, each\nad view occurrence is then normalized (multiplied) by the prior p(view|position) for the position\nwhere the ad is presented. 
For example, the a priori CTR of a noticeable position (e.g., ov-top+1 in Yahoo's terminology, meaning the North 1 position in sponsored results) is typically higher than that of an obscure position (e.g., ov-bottom+2), by a factor of up to 10. An observed count of views placed in ov-top+1 thus has a greater normalized count than one in ov-bottom+2. This normalization effectively asserts that, given the same observed (unnormalized) CTR, an ad shown in an inferior position has a higher click probability per se than one placed in a more obvious position. The same view-count normalization should also be applied during offline evaluation. In online prediction, however, we need CTR estimates unbiased by the positional effect, in order for the matching ads to be ranked based on their qualities (clickabilities). The linear predictor Z = ΛX learned from a position-normalized training dataset gives exactly this position-unbiased CTR estimation. In other words, we hypothesize that all candidate ads are to be presented in the same imaginary position. For an intuitive interpretation, if we scale the positional priors so that the top position has a prior of 1, i.e., x^pos_{ov-top+1} = 1, all ads are normalized to that top position.

Another view of the positional prior model we use is as an examination model [25]: the probability of clicking on an ad is the product of a positional probability and a relevance-based probability that is independent of position. This model is simple and easy to solve for using maximum likelihood, as explained above. It does not depend on the content of ads higher up on the search page, as do, for example, the cascade [14] or DBN [9] models. Those models are appropriate for search results, where users have a high probability of clicking on one of the links. For ads, however, the probability of clicking on ad links is extremely low, usually a fraction of a percent. 
Thus the effect of the ads higher up is a product of factors that are all extremely close to one. In this case, the DBN positional prior reduces to a negative exponential function, which is a good fit to the empirical distribution found with the examination model.

3.3 Large-scale implementation

Data locality. Recall that updating X after a global training is distributed and involves only E-steps on user-local data. In fact, this data locality can also be leveraged in training. More precisely, Eq. (10) shows that updating a user profile vector xj via the E-step requires only that user's data fj and vj, together with the model matrix Λ. This computation has a very small memory footprint and typically fits in L1 cache. On the other hand, updating each single value in Λ in the M-step, as in Eq. (11), requires a full pass over the corpus (all users' data) and is hence more expensive. To better exploit the data locality present in the E-step, we alternate 3 to 10 successive E-steps with one M-step. We also observe that the M-step involves summations over the j ≤ m users, for both the numerator and the denominator in Eq. (11). Both summand terms (fij xkj / zij and vij xkj) require only data that is available locally (in memory) right after the E-step for user j. Thus the summations for the M-step can be computed incrementally along with the E-step recurrence for each user. Arranged this way, an iteration of 3-10 E-steps combined with one M-step requires only a single pass over the user corpus.

Data sparsity. The multiplicative recurrence exploits data sparsity very well. Note that the inner loops of both the E-step and the M-step involve calculating the ratio fij/zij. Since f is a count of very rare click events, one only needs to compute z when the corresponding f is non-zero. Let Nc be the total number of non-zero f terms, or distinct click events, over all users. For each non-zero fij, computing the dot product zij = Λi xj takes d multiplications. 
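The sparsity argument can be illustrated directly: accumulate the E-step click-numerator term only over the non-zero entries of F (toy data hypothetical).

```python
import numpy as np

def estep_click_numerator(rows, cols, vals, Lam, X):
    """Accumulate the E-step numerator sums sum_i f_ij * Lam_ik / z_ij,
    touching only the Nc non-zero click entries (i, j, f_ij): the inner
    dot product z_ij = Lam_i . x_j costs d multiplications per non-zero,
    for O(Nc d) total, instead of forming the full dense Z."""
    d, m = X.shape
    num = np.zeros((d, m))
    for i, j, f in zip(rows, cols, vals):
        z = Lam[i] @ X[:, j]
        num[:, j] += f * Lam[i] / z
    return num

# toy dense model with a very sparse click matrix (hypothetical)
rng = np.random.default_rng(2)
Lam = rng.random((10, 3)) + 0.1
X = rng.random((3, 6)) + 0.1
F = np.zeros((10, 6))
F[1, 2], F[4, 0], F[7, 5] = 3.0, 1.0, 2.0
rows, cols = np.nonzero(F)
sparse_num = estep_click_numerator(rows, cols, F[rows, cols], Lam, X)
```

The result matches the dense computation Lam.T @ (F / (Lam @ X)) while doing work proportional to the number of clicks rather than to n × m.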
Thus the numerators of both the E-step and the M-step have a complexity of O(Nc d). Both denominators have a complexity of O(Nv), where Nv is the total number of non-zero v terms. The final divisions to compute the multiplicative factors in one outer loop over topics take O(d) time (the other outer loop, over m or n, has already been accounted for by both Nc and Nv). Typically, we have Nv ≫ Nc ≫ m > n ≫ d. Thus the smoothed complexity [26] of offline training is O(Nv d r), where r is the number of EM iterations; r = 20 suffices for convergence.

Scalability. We have now arrived at an algorithm of linear complexity O(Nv d r), using the various implementation techniques just described. We illustrate the scalability of our algorithm by the following run-time analysis. The constant factor of the complexity is 4, the number of division terms in the recurrence formulae. Suppose the entire Yahoo! user base for SS contains about 200 million users. A 1/16 sample (32 out of 512 buckets) gives around 10 million users. Further assume 100 distinct ad views on average per user and an inner dimension of 10; the total number of operations is then 4 × 10^10 for one iteration. The model converges after 15-20 iterations. Our single-machine implementation with sparse matrix operations (which are readily available in MATLAB [2] and LAPACK [3]) achieves above 100 Mflops; hence it takes 1.6-2.2 hours to train a model.

So far, we have demonstrated one paradigm of scaling up, which focuses on optimizing arithmetic operations, such as using sparse matrix multiplication in the innermost loop. Another paradigm is large-scale parallelization, for example on a Hadoop [1] cluster, as we illustrate with the BT implementation in Section 4.1.

3.4 Experiments

We have experimented with different feature types, and found empirically that the best combination is query-linead (QL), query term (QT), and linead term (LT). 
A QL feature is the product of a query and a linead. For QTs, queries are tokenized, with stemming and stopwords removed. For LTs, we first concatenate the title, short description, and description of a linead's text, and then extract up to the 8 foremost terms. The dataset was obtained from 32 buckets of users and covers a one-month period, where the first three weeks form the training set and the last week was held out for testing. For feature selection, we set the minimum frequency to 30 for all three feature types, which yielded slightly above 1M features, comprising 700K QLs, 175K QTs, and 135K LTs. We also filtered out users with a total event count below 10, which gave 1.6M users. We used a latent dimension of 10, which was empirically among the best while computationally favorable. For the gamma prior on X, we fixed the shape parameter α to 1.45 and the scale parameter β to 0.2 across all latent topics for model training, and used a near-zero prior for positional prior estimation.

We benchmarked our GaP model against two simple baseline predictors: (1) the Panama score (historical COEC, defined as the ratio of observed clicks to expected clicks [9]), and (2) the historical QL CTR normalized by position. The experimental results are plotted in Figure 3 and numerically summarized in Tables 1 and 2. A click-view ROC curve plots the click recall vs. the view recall, over the testing examples ranked in descending order of predicted CTR. A CTR lift curve plots the relative CTR lift vs. the view recall. As the results show, historical QL CTR is a fair predictor relative to the Panama score. 
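The click-view ROC area just described can be computed with a short plain-Python routine; the two-example figures in the checks below are hypothetical.

```python
def click_view_roc_area(ctr_pred, clicks, views):
    """Area under the click-view ROC curve: rank examples by predicted
    CTR (descending), then trace cumulative click recall (y-axis) against
    cumulative view recall (x-axis), integrating by trapezoids."""
    order = sorted(range(len(ctr_pred)), key=lambda i: -ctr_pred[i])
    total_c, total_v = float(sum(clicks)), float(sum(views))
    area, c_recall = 0.0, 0.0
    for i in order:
        dv, dc = views[i] / total_v, clicks[i] / total_c
        area += dv * (c_recall + dc / 2.0)   # trapezoid slice
        c_recall += dc
    return area
```

A predictor that ranks the high-CTR example first scores above 0.5; reversing the ranking mirrors the area around 0.5.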
The GaP model yielded a ROC area of 0.82, a 2% improvement over historical QL CTR, and a 68% average CTR lift over the Panama score in the 5-20% view recall range.

Figure 3: Model performance comparison among (1) GaP using QL-QT-LT, (2) the Panama score predictor, and (3) the historical QL-CTR predictor: (a) ROC plots; (b) pairwise CTR lift.

Table 1: Areas under ROC curves
GaP: 0.82 | Panama: 0.72 | QL-CTR: 0.80

Table 2: CTR lift of GaP over Panama
View recall: 1% | 1-5% avg. | 5% | 5-20% avg.
CTR lift:   0.96 | 0.86     | 0.93 | 0.68

4 Behavioral targeting

For the BT application, we adopt the first approach to CTR prediction, as described in Section 2.1. The numbers of clicks and views for a given ad are predicted separately, and a CTR estimator is constructed as in Eq. (8). Moreover, the granular nature of GaP allows for significant flexibility in the way prediction can be done, as we describe next.

4.1 Prediction with different granularity

We form the data matrix F from historical user behavioral data at the granular level, including click and view counts for individual ads, as well as other explanatory-variable features such as page views. This setup allows for per-ad CTR prediction, i.e., p(click|ad, user), given by Eq. (8). 
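The smoothed estimator of Eq. (8), and its per-category marginalization of Eq. (13) below, can be sketched as follows; the row-index layout and the smoothing constants delta = 0.05, eta = 10 are hypothetical illustrations, not the paper's values.

```python
import numpy as np

def ad_ctr(Lam, x, click_i, view_i, delta=0.05, eta=10.0):
    """Eq. (8): smoothed per-ad CTR from separately predicted click and
    view counts; delta and eta are hypothetical smoothing constants."""
    return (Lam[click_i] @ x + delta) / (Lam[view_i] @ x + eta)

def category_ctr(Lam, x, click_rows, view_rows, delta=0.05, eta=10.0):
    """Eq. (13): per-category CTR by first summing the rows of Lambda
    over the ads in the category, then smoothing as above."""
    return (Lam[click_rows].sum(axis=0) @ x + delta) / \
           (Lam[view_rows].sum(axis=0) @ x + eta)

# toy model: rows 0-1 are click features, rows 2-3 view features of two ads
Lam = np.array([[0.2, 0.1],
                [0.1, 0.2],
                [0.5, 0.4],
                [0.4, 0.5]])
x = np.array([1.0, 1.0])
ctr_ad = ad_ctr(Lam, x, click_i=0, view_i=2)
ctr_cat = category_ctr(Lam, x, [0, 1], [2, 3])
```

The per-category version needs no separate model; it only re-aggregates rows of the same Λ.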
Per-category CTR prediction, as done in previous BT systems, i.e., p(click|ad-category, user), can also be performed in this setup by marginalizing Λ over categories:

\widehat{CTR}_{cj} = \Big(\big(\textstyle\sum_{i \in c} \Lambda_{click(i)}\big)\, x_j + \delta\Big) \Big/ \Big(\big(\textstyle\sum_{i \in c} \Lambda_{view(i)}\big)\, x_j + \eta\Big),   (13)

where c denotes a category and i ∈ c is defined by the ad categorization.

The modeling was implemented in a distributed fashion using Hadoop. As discussed in Section 3.3, the EM algorithm can be parallelized efficiently by exploiting user data locality, particularly in the MapReduce [15] framework. However, compared with the scaling approach adopted by the SS implementation, the large-scale parallelization paradigm typically cannot support complex operations as efficiently, such as performing sparse matrix multiplication via three-level nested loops in Java.

4.2 Experiments

The data matrix F was formed to contain rows for all ad clicks and views, as well as page views with frequency above a threshold of 100. The counts were aggregated over a two-week period and from 32 buckets of users. 
This setup resulted in 170K features, comprising 120K ad clicks or views and 50K page views, which allows the model matrix Λ to fit well in memory. The number of users was about 40M. We set the latent inner dimension d = 20. We ran 13 EM iterations, where each iteration alternated 3 E-steps with one M-step. Prediction accuracy was evaluated using data from the day following the training period, and measured by the area under the ROC curve. We first compared per-ad prediction (Eq. (8)) with per-category prediction (Eq. (13)), and obtained ROC areas of 95% and 70%, respectively. One recent technology used Poisson regression for per-category modeling and yielded an average ROC area of 83% [11]. This shows that capturing intra-category structure by factor modeling can yield substantial improvement over the state of the art in BT. We also measured the effect of the latent dimension on model performance by varying d from 10 to 100, and observed that per-ad prediction is insensitive to the latent dimension, with all ROC areas in the range [95%, 96%], whereas per-category prediction benefits from larger inner dimensions. Finally, to verify the scalability of our parallel implementation, we increased the size of the training data from 32 to 512 user buckets. The experiments were run on a 250-node Hadoop cluster. As shown in Table 3, the running time scales sub-linearly with the number of users.

Table 3: Run-time vs. number of user buckets

    Number of buckets   32     64     128    512
    Run-time (hours)    11.2   18.6   31.7   79.8

Surprisingly though, the running time for 32 buckets on a 250-node cluster is no less than that of the single-node yet highly efficient implementation analyzed in Section 3.3 (after accounting for the differing factors of users 4x, latent dimension 2x, and EM iterations 13/15), with a similar 100 Mflops. Actually, the same pattern has been found in one previous large-scale learning task [11]. We argue that large-scale parallelization is not necessarily the best way, nor the only way, to deal with scaling; in fact, implementation issues (such as cache efficiency, number of references, and data encapsulation) still cause orders-of-magnitude differences in performance and can more than overwhelm the additional nodes. The right principle of scaling up is to start with a single node and achieve above 100 Mflops with sparse arithmetic operations.

5 Discussion

GaP is a dimensionality reduction algorithm. The low-dimensional latent space allows scalable and efficient learning and prediction, making the algorithm practically appealing for web-scale data such as in SS and BT. GaP is also a smoothing algorithm, which yields smoothed click prediction. This addresses the data sparseness typically present in click-through data. Moreover, GaP builds personalization into ad targeting by profiling a user as a vector of latent variables. The latent dimensions are inferred purely from data, with the objective of maximizing the data likelihood or the capability to predict target events. Furthermore, the position of an ad impression has a significant impact on CTR. GaP factorization with one inner dimension gives a statistically sound approach to estimating the positional prior. Finally, the GaP-derived low-dimensional latent representation of a user can serve as a valuable input to other applications and products, such as user clustering, collaborative filtering, content match, and algorithmic search.

References

[1] http://hadoop.apache.org/.
[2] http://www.mathworks.com/products/matlab/.
[3] http://www.netlib.org/lapack/.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, 2003.
[5] A. Broder, M. Fontoura, V.
Josifovski, and L. Riedel. A semantic approach to contextual advertising. ACM Conference on Information Retrieval (SIGIR 2007), pages 559-566, 2007.
[6] J. F. Canny. GaP: a factor model for discrete data. ACM Conference on Information Retrieval (SIGIR 2004), pages 122-129, 2004.
[7] J. F. Canny, S. Zhong, S. Gaffney, C. Brower, P. Berkhin, and G. H. John. Granular data for behavioral targeting. U.S. Patent Application 20090006363.
[8] D. Chakrabarti, D. Agarwal, and V. Josifovski. Contextual advertising by combining relevance with click feedback. International World Wide Web Conference (WWW 2008), pages 417-426, 2008.
[9] O. Chapelle and Y. Zhang. A dynamic Bayesian network click model for web search ranking. International World Wide Web Conference (WWW 2009), pages 1-10, 2009.
[10] Y. Chen, D. Pavlov, P. Berkhin, and J. F. Canny. Large-scale behavioral targeting for advertising over a network. U.S. Patent Application 12/351,749, filed: Jan 09, 2009.
[11] Y. Chen, D. Pavlov, and J. F. Canny. Large-scale behavioral targeting. ACM Conference on Knowledge Discovery and Data Mining (KDD 2009), 2009.
[12] C. Y. Chung, J. M. Koran, L.-J. Lin, and H. Yin. Model for generating user profiles in a behavioral targeting system. U.S. Patent 11/394,374, filed: Mar 29, 2006.
[13] M. Ciaramita, V. Murdock, and V. Plachouras. Online learning from click data for sponsored search. International World Wide Web Conference (WWW 2008), pages 227-236, 2008.
[14] N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey. An experimental comparison of click position-bias models. Web Search and Web Data Mining (WSDM 2008), pages 87-94, 2008.
[15] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.
[16] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[17] D. C. Fain and J. O. Pedersen. Sponsored search: a brief history. Bulletin of the American Society for Information Science and Technology, 32(2):12-13, 2006.
[18] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2):177-196, 2001.
[19] A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626-634, 1999.
[20] I. T. Jolliffe. Principal Component Analysis. Springer, 2002.
[21] A. Lacerda, M. Cristo, M. A. Gonçalves, W. Fan, N. Ziviani, and B. Ribeiro-Neto. Learning to advertise. ACM Conference on Information Retrieval (SIGIR 2006), pages 549-556, 2006.
[22] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems (NIPS 2000), 13:556-562, 2000.
[23] S. Pandey and C. Olston. Handling advertisements of unknown quality in search advertising. Advances in Neural Information Processing Systems (NIPS 2006), 19:1065-1072, 2006.
[24] F. Radlinski and T. Joachims. Minimally invasive randomization for collecting unbiased preferences from clickthrough logs. National Conference on Artificial Intelligence (AAAI 2006), pages 1406-1412, 2006.
[25] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. International World Wide Web Conference (WWW 2007), pages 521-530, 2007.
[26] D. A. Spielman and S.-H. Teng. Smoothed analysis of algorithms: why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3):385-463, 2004.