{"title": "A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains", "book": "Advances in Neural Information Processing Systems", "page_first": 1465, "page_last": 1472, "abstract": null, "full_text": "A Maximum Entropy Approach To\n\nCollaborative Filtering in Dynamic, Sparse,\n\nHigh-Dimensional Domains\n\nDmitry Y. Pavlov\n\nNEC Laboratories America\n\n4 Independence Way\nPrinceton, NJ 08540,\n\nDavid M. Pennock\n\nOverture Services, Inc.\n\n74 N. Pasadena Ave., 3rd \ufb02oor\n\nPasadena, CA 91103,\n\ndpavlov@nec-labs.com\n\ndavid.pennock@overture.com\n\nAbstract\n\nWe develop a maximum entropy (maxent) approach to generating recom-\nmendations in the context of a user\u2019s current navigation stream, suitable\nfor environments where data is sparse, high-dimensional, and dynamic\u2014\nconditions typical of many recommendation applications. We address\nsparsity and dimensionality reduction by \ufb01rst clustering items based on\nuser access patterns so as to attempt to minimize the apriori probabil-\nity that recommendations will cross cluster boundaries and then recom-\nmending only within clusters. We address the inherent dynamic nature\nof the problem by explicitly modeling the data as a time series; we show\nhow this representational expressivity \ufb01ts naturally into a maxent frame-\nwork. We conduct experiments on data from ResearchIndex, a popu-\nlar online repository of over 470,000 computer science documents. We\nshow that our maxent formulation outperforms several competing algo-\nrithms in of\ufb02ine tests simulating the recommendation of documents to\nResearchIndex users.\n\n1 Introduction\n\nRecommender systems attempt to automate the process of \u201cword of mouth\u201d recommenda-\ntions within a community. Typical application environments are dynamic in many respects:\nusers come and go, users preferences and goals change, items are added and removed, and\nuser navigation itself is a dynamic process. 
Recommendation domains are also often high-dimensional and sparse, with tens or hundreds of thousands of items, among which very few are known to any particular user.

Consider, for instance, the problem of generating recommendations within ResearchIndex (a.k.a. CiteSeer),1 an online digital library of computer science papers receiving thousands of user accesses per hour. The site automatically locates computer science papers found on the Web, indexes their full text, allows browsing via the literature citation graph, and isolates the text around citations, among other services [8]. The archive contains over 470,000 documents, including the full text of each document, citation links between documents, and a wealth of user access data. With so many documents, and only seven accesses per user on average, the user-document data matrix is exceedingly sparse and thus challenging to model. In this paper, we work with the ResearchIndex data, since it is an interesting application domain and is typical of many recommendation application areas [14].

1 http://www.researchindex.com

There are two conceptually different ways of making recommendations. A content filtering approach is to recommend solely based on the features of a document d (e.g., showing documents written by the same author(s), or documents textually similar to d). These methods have been shown to be good predictors [3]. Another possibility is to perform collaborative filtering [13] by assessing the similarities between the documents requested by the current user and the users who interacted with ResearchIndex in the past. 
Once the users with browsing histories similar to that of a given user are identified, an assumption is made that the future browsing patterns will be similar as well, and the prediction is made accordingly. Common measures of similarity between users include the Pearson correlation coefficient [13], mean squared error [16], and vector similarity [1]. More recent work includes applications of statistical machine learning techniques, such as Bayesian networks [1], dependency networks [6], singular value decomposition [14], and latent class models [7, 12]. Most of these recommendation algorithms are context and order independent: that is, the rank of recommendations does not depend on the context of the user's current navigation or on recency effects (past viewed items receive as much weight as recently viewed items).

Currently, ResearchIndex mostly employs fairly simple content-based recommenders. Our objective was to design a superior (or at least complementary) model-based recommendation algorithm that (1) is tuned for the particular user at hand, and (2) takes into account the identity of the currently viewed document d, so as not to lead the user too far astray from his or her current search goal.

To overcome the sparsity and high dimensionality of the data, we cluster the documents with the objective of maximizing the likelihood that recommendable items co-occur in the same cluster. By marrying the clustering technique with the end goal of recommendation, our approach appears to do a good job of maintaining high recall (sensitivity). Similar ideas in the context of maxent were proposed recently by Goodman [5].

We explicitly model time: each user is associated with a set of sessions, and each session is modeled as a time sequence of document accesses. 
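Returning briefly to the memory-based similarity measures mentioned above: the Pearson correlation between two users can be sketched as follows. This is a generic illustration over co-rated items (function and variable names are our own, not from any of the cited systems).

```python
import math

def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation between two users, computed over co-rated items.

    ratings_a, ratings_b: dicts mapping item id -> rating.
    Returns 0.0 when fewer than two co-rated items exist or a variance is zero.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    a = [ratings_a[i] for i in common]
    b = [ratings_b[i] for i in common]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

Users with proportional ratings score +1, users with opposed ratings score -1; predictions are then weighted by these similarities.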
We present a maxent model that effectively estimates the probability of the next visited document ID (DID) given the most recently visited DID ("bigrams") and past indicative DIDs ("triggers"). To our knowledge, this is the first application of maxent to collaborative filtering, and one of the few published formulations that makes accurate recommendations in the context of a dynamic user session [3, 15]. We perform offline empirical tests of our recommender and compare it to competing models. The comparison shows our method is quite accurate, outperforming several other less-expressive models.

The rest of the paper is organized as follows. In Section 2, we describe the log data from ResearchIndex and how we preprocessed it. Section 3 presents the greedy algorithm for clustering the documents and discusses how the clustering helps to decompose the original prediction task. In Section 4, we give a high-level description of our maxent model and the features we used for its learning. Experimental results and comparisons with other models are discussed in Section 5. In Section 6, we draw conclusions and describe directions for future work.

2 Preprocessing the ResearchIndex data

Each document indexed in ResearchIndex is assigned a unique document ID (DID). Whenever a user accesses the site with a cookie-enabled browser, (s)he is identified as a new or returning user, and all activity is recorded on the server side with a unique user ID (UID) and a time stamp (TID). We obtained a log file recording approximately three months' worth of ResearchIndex data, which can roughly be viewed as a series of requests, each a (UID, DID, TID) triple.

In the first processing step, we aggregated the requests by UID and broke them into sessions. For a fixed UID, a session is defined as a sequence of document requests with no two consecutive requests more than a threshold number of seconds apart. In our experiments we chose a threshold of 300 seconds, so that if a user was inactive for more than 300 seconds, his next request was considered to mark the start of a new session.

The next processing step included heuristics, such as identifying and discarding the sessions belonging to robots (they obviously contaminate the browsing patterns of human users), collapsing all identical consecutive DID accesses into a single instance of that DID (our objective was to predict what interests the user beyond the currently requested document), discarding all DIDs that occurred fewer than two times in the log (for two or fewer occurrences, it is hard to reliably train the system to predict them and evaluate performance), and finally discarding sessions containing only one document.

3 Dimensionality Reduction Via Clustering

Even after the log is processed, the data remains high-dimensional (62,240 documents) and sparse, and hence still hard to model. To solve these problems we clustered the documents. Since our objective was to predict the instantaneous user interests, among the many possible ways of performing the clustering we chose to cluster based on user navigation patterns.

We scanned the processed log once and, for each ordered pair of documents (i, j), accumulated the number of times N(i, j) that document j was requested immediately after document i; in other words, we computed the first-order Markov statistics, or bigrams. 
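The sessionization step above can be sketched in a few lines. This is an illustrative reimplementation under the stated rules (300-second timeout, collapsing repeated DIDs, dropping singleton sessions); the function and field names are our own.

```python
def split_into_sessions(requests, timeout=300):
    """Split one user's requests into sessions.

    requests: list of (did, tid) pairs sorted by time stamp tid (in seconds).
    A gap of more than `timeout` seconds starts a new session.
    Returns a list of sessions, each a list of DIDs.
    """
    sessions = []
    current = []
    last_tid = None
    for did, tid in requests:
        if last_tid is not None and tid - last_tid > timeout:
            sessions.append(current)
            current = []
        # collapse identical consecutive DID accesses, as in the preprocessing
        if not current or current[-1] != did:
            current.append(did)
        last_tid = tid
    if current:
        sessions.append(current)
    # discard sessions containing only one document
    return [s for s in sessions if len(s) > 1]
```

For example, a 350-second pause between two requests splits the stream into two sessions, and a session that ends up with a single distinct document is dropped.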
Based on the user navigation patterns encoded in the bigram counts N(i, j), the greedy clustering is done as shown in the following pseudocode, where c(i) denotes the cluster ID assigned to document i (nil if unassigned):

Input: Bigrams N(i, j); Number of Clusters K.
Output: Set S of K Clusters.

Algorithm:
0.  k = 0;
1.  set n = max_{i,j} N(i, j);                  // max number of transitions
2.  for all docs i, j such that N(i, j) = n do  // all pairs with n transitions
3.      if (c(i) = nil and c(j) = nil and k < K)
4.          S = S + {new cluster containing i and j};  // new cluster for i and j
5.          c(i) = k;
6.          c(j) = k;
7.          k = k + 1;
8.      else if (c(i) != nil and c(j) = nil)
9.          add j to cluster c(i);              // j goes to cluster of i
10.         c(j) = c(i);
11.     else if (c(i) = nil and c(j) != nil)
12.         add i to cluster c(j);              // i goes to cluster of j
13.         c(i) = c(j);
14.     end if
15.     N(i, j) = 0;
16. end for
17. if (n > 0) goto 1;
18. Return S.

Table 1: Top features for some of the clusters (topical keyword features for Clusters 1 through 8).

The algorithm starts with empty clusters and then cycles through all documents, picking the pairs of documents that have the current highest joint visitation frequency as given by the bigram counts (lines 1 and 2). If both documents in the selected pair are unassigned, a new cluster is allocated for them (lines 3 through 7). If one of the documents in the selected pair has already been assigned to one of the previous clusters, the second document is assigned to the same cluster (lines 8 through 14). The algorithm repeats for successively lower frequencies n, as long as n > 0.

After the clustering, we can assume that if the user requests a document from the i-th cluster C_i, he is considerably more likely to prefer a next document from C_i than from C_j, j != i. This assumption is reasonable because, by construction, clusters represent densely connected (in terms of traffic) components, and the traffic across the clusters is small compared to the traffic within each cluster. In view of this observation, we broke individual user sessions down into subsessions, where each subsession consisted of documents belonging to the same cluster. The problem was thus reduced to a series of prediction problems for each cluster.

We studied the clusters by trying to find out whether the documents within a cluster are topically related. We ran code previously developed at NEC Labs [4] that uses information gain to find the top features that distinguish each cluster from the rest. Table 1 shows the top features for some of the created clusters. 
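Before moving on, the greedy clustering just described can be condensed into an executable sketch. This is our own reimplementation of the pseudocode: sorting the bigram pairs by descending count replaces the outer loop over decreasing frequencies n, and dictionaries hold the counts and cluster assignments.

```python
def greedy_cluster(bigrams, max_clusters):
    """Greedy clustering of documents by bigram (co-visitation) counts.

    bigrams: dict mapping (i, j) -> number of times j was requested right after i.
    Returns a dict mapping document id -> cluster id.
    """
    cluster_of = {}
    next_cluster = 0
    # process document pairs from the most- to the least-frequent transition
    for (i, j), count in sorted(bigrams.items(), key=lambda kv: -kv[1]):
        if count <= 0:
            break
        ci, cj = cluster_of.get(i), cluster_of.get(j)
        if ci is None and cj is None and next_cluster < max_clusters:
            cluster_of[i] = cluster_of[j] = next_cluster  # new cluster for i and j
            next_cluster += 1
        elif ci is not None and cj is None:
            cluster_of[j] = ci  # j goes to the cluster of i
        elif ci is None and cj is not None:
            cluster_of[i] = cj  # i goes to the cluster of j
    return cluster_of
```

As in the pseudocode, a pair whose two documents already sit in different clusters is left alone: only unassigned documents are attached to existing clusters or used to seed new ones.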
The top features are quite consistent descriptors, suggesting that in one session a ResearchIndex user is typically interested in searching among topically-related documents.

4 Trigger MaxEnt

In this paper, we model p(d | h) as a maxent distribution, where d is the identity of the document that will be requested next by the user, given the history h of the current user session and the session data available for all other users. This choice of model is natural, since our intuition is that all of the previously requested documents in the user session influence the identity of d. It is also clear that we cannot afford to build a high-order model, because of the sparsity and high dimensionality of the data, so we need to restrict ourselves to models that can be reliably estimated from low-order statistics.

Bigrams provide one type of such statistics. In order to introduce a long-term dependence of d on the documents that occurred earlier in the history of the session, we define a trigger as a pair of documents (a, d) in a given cluster such that the probability of requesting d given that a occurred earlier in the session is substantially different from the unconditional probability of requesting d. To measure the quality of triggers, and in order to rank them, we computed the mutual information between the events "a occurred in the session history" and "d is the next requested document".

The set of features, together with maxent as an objective function, can be shown to lead to the following form of the conditional maxent model:

    p(d | h) = (1 / Z(h)) exp( sum_i lambda_i f_i(h, d) ),    (1)

where Z(h) is a normalization constant ensuring that the distribution sums to 1. The set of parameters lambda = (lambda_1, ..., lambda_n) needs to be found from the following set of equations, which restrict the distribution p(d | h) to have the same expected value for each feature as seen in the training data:

    sum_h sum_d p(d | h) f_i(h, d) = sum_h sum_d p~(h, d) f_i(h, d),  i = 1, ..., n,    (2)

where p~(h, d) denotes the empirical distribution, the LHS represents the expectation (up to a normalization factor) of the feature f_i with respect to the distribution p(d | h), and the RHS is the actual frequency (up to the same normalization factor) of this feature in the training data. There exist efficient algorithms for finding the parameters lambda (e.g., improved iterative scaling [11]) that are known to converge if the constraints imposed on p(d | h) are consistent.

Under fairly general assumptions, the maxent model can also be shown to be a maximum likelihood model [11]. Employing a Gaussian prior with zero mean on the parameters lambda yields a maximum a posteriori solution that has been shown to be more accurate than the related maximum likelihood solution and other smoothing techniques for maxent models [2]. We use Gaussian smoothing in our experiments with the maxent model.

5 Experimental Results and Comparisons

We compared the trigger maxent model with the following models: a mixture of Markov models (1 and 25 components), a mixture of multinomials (1 and 25 components), and the correlation method [1].

Table 2: Average number of hits and average height of predictions across the clusters, for five height bins of increasing range, using various models. Each cell shows the number of hits, with the average height of predictions in parentheses.

Model            Bin 1           Bin 2           Bin 3           Bin 4           Bin 5
Mult., 1 c.      48.78 (1.437)   67.94 (2.947)   80.94 (4.390)   90.93 (5.773)   98.54 (7.026)
Mult., 25 c.     95.49 (1.421)   120.52 (2.503)  132.07 (3.312)  138.89 (3.975)  143.33 (4.528)
Markov, 1 c.     91.39 (1.959)   115.68 (3.007)  123.44 (3.571)  126.26 (3.875)  127.57 (4.063)
Markov, 25 c.    89.75 (1.959)   114.49 (3.047)  122.57 (3.646)  125.61 (3.972)  127.14 (4.191)
Maxent, no sm.   111.95 (1.510)  130.35 (2.296)  138.18 (2.858)  142.56 (3.303)  145.55 (3.694)
Maxent, w. sm.   112.68 (1.476)  130.86 (2.258)  138.53 (2.810)  142.85 (3.248)  145.78 (3.633)
Corr.            111.02 (1.973)  132.87 (2.801)  140.96 (3.340)  144.99 (3.726)  147.34 (4.021)

Table 3: Average time per 1000 predictions and average memory used by various models across 1000 clusters.

Model            Time, s   Memory, KBytes
Mult., 1         0.0049    0.5038
Mult., 25        0.0559    12.58
Markov, 1        0.0024    1.53
Markov, 25       0.0311    68.23
Maxent, no sm.   0.0746    90.12
Maxent, w. sm.   0.0696    90.12
Correlation      7.2013    17.26

The definitions of the models can be found in [9]. The maxent model came in two flavors: unsmoothed, and smoothed with a Gaussian prior with 0 mean and fixed variance 2. 
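As a sketch of the conditional maxent model of equation (1), the function below computes p(d | h) over candidate next documents in a cluster, using indicator features for the bigram with the most recent document and for triggers fired by earlier documents in the session. The feature design and all names here are illustrative assumptions, not the paper's exact feature set or trained weights.

```python
import math

def maxent_next_doc_probs(history, docs, bigram_w, trigger_w):
    """Conditional maxent p(d | h) = exp(sum_i lambda_i f_i(h, d)) / Z(h).

    history: list of DIDs in the current (sub)session, most recent last.
    docs: candidate next DIDs within the cluster.
    bigram_w: dict (prev_did, did) -> weight for bigram indicator features.
    trigger_w: dict (past_did, did) -> weight for trigger indicator features.
    """
    prev = history[-1]
    scores = {}
    for d in docs:
        s = bigram_w.get((prev, d), 0.0)
        # trigger features: any earlier document in the history may fire
        for a in set(history[:-1]):
            s += trigger_w.get((a, d), 0.0)
        scores[d] = s
    z = sum(math.exp(s) for s in scores.values())  # normalization Z(h)
    return {d: math.exp(s) / z for d, s in scores.items()}
```

Ranking the candidates by these probabilities is exactly what the height-based evaluation below measures: the position of the actually requested document in the sorted list.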
We did not optimize the adjustable parameters of the models (such as the number of components for the mixtures, or the variance of the prior for the maxent models), nor the number of clusters (1000).

We chronologically partitioned the log into roughly 8 million training requests (covering 82 days) and 2 million test requests (covering 17 days). We used the average height of predictions on the test data as the main evaluation criterion. The height of a prediction is defined as follows. Assuming that the probability estimates p(d | h) are available from a model for a fixed history h and all possible values of d, we first sort them in descending order of p(d | h) and then find the distance, in terms of the number of documents, from the top of this sorted list to the actually requested document (which we know from the test data). The height tells us how deep into the list the user must go in order to see the document that actually interests him. The height of a perfect prediction is 0; the maximum (worst) height for a given cluster equals the number of documents in that cluster. Since heights greater than 20 are of little practical interest, we binned the heights of predictions for each cluster.

For binning purposes we used height ranges [0, h) for increasing values of h (up to 20). Within each bin we also computed the average height of predictions. Thus, the best performing model would place most of its predictions inside the bin(s) with low values of h, and within those bins the averages would be as low as possible.

Table 2 reports the average number of hits each model makes in each of the bins, as well as the average height of predictions within the bin. The smoothed maxent model has the best average height of predictions across the bins and scores roughly the same number of hits in each of the bins as the correlation method. 
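The height metric can be stated concretely. The helper below (our own illustration) ranks the candidates by a model's probability estimates and returns how many documents sit above the actually requested one.

```python
def prediction_height(probs, actual_did):
    """Height of a prediction: the number of documents ranked above the
    actually requested document when candidates are sorted by descending
    probability. A perfect prediction has height 0.

    probs: dict mapping candidate DID -> probability estimate p(d | h).
    """
    ranked = sorted(probs, key=lambda d: -probs[d])
    return ranked.index(actual_did)
```

Averaging this quantity over test requests, per bin, yields the numbers reported in Table 2.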
The mixture of Markov models with 25 components evidently overfits the training data and fails to outperform the 1-component mixture. The mixture of multinomials is quite close in quality to, but still not as good as, the maxent model with respect to both the number of hits and the height of predictions in each of the bins.

In Table 3, we present a comparison of the various models with respect to the average time taken and memory required to make a prediction. The table clearly illustrates that the maxent model (i.e., the model-based approach) is substantially more time-efficient than the correlation method (i.e., the memory-based approach), despite the fact that the model takes more memory on average. In particular, our maxent approach is roughly two orders of magnitude faster than the correlation method.

6 Conclusions and Future Work

We have described a maxent approach to generating document recommendations in ResearchIndex. We addressed the problem of sparse, high-dimensional data by introducing a clustering of the documents based on user navigation patterns. A particular advantage of our clustering is that, by its definition, the traffic across the clusters is small compared to the traffic within each cluster. This advantage allowed us to decompose the original prediction problem into a set of problems corresponding to the clusters. We also demonstrated that our clustering produces highly interpretable clusters: each cluster can be assigned a topical name based on the top extracted features.

We presented a number of models that can be used to solve the document prediction problem within a cluster. We showed that the maxent model, which combines zero- and first-order Markov terms as well as triggers with high information content, provides the best average out-of-sample performance. 
Gaussian smoothing improved the results even further.

There are several important directions in which to extend the work described in this paper. First, we plan to perform "live" testing of the clustering approach and the various models in ResearchIndex. Second, our recent work [10] suggests that for difficult prediction problems, improvement beyond plain maxent models can be sought by employing mixtures of maxent models. We also plan to look at different clustering methods for documents (e.g., based on content or link structure) and to try to combine prediction results for different clusterings. Our expectation is that such combining could yield better accuracy at the expense of longer running times. Finally, one could think of a (quite involved) EM algorithm that performs the clustering of the documents in a manner that would make prediction within the resulting clusters easier.

Acknowledgements

We would like to thank Steve Lawrence for making available the ResearchIndex log data, Eric Glover for running his naming code on our clusters, Kostas Tsioutsiouliklis and Darya Chudova for many useful discussions, and the anonymous reviewers for helpful suggestions.

References

[1] J. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of UAI-1998, pages 43-52. San Francisco, CA: Morgan Kaufmann Publishers, 1998.

[2] S. Chen and R. Rosenfeld. A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Carnegie Mellon University, 1999.

[3] D. Cosley, S. Lawrence, and D. Pennock. An open framework for practical testing of recommender systems using ResearchIndex. In International Conference on Very Large Databases (VLDB'02), 2002.

[4] E. Glover, D. Pennock, S. Lawrence, and R. Krovetz. Inferring hierarchical descriptions. Technical Report NECI TR 2002-035, NEC Research Institute, 2002.

[5] J. 
Goodman. Classes for fast maximum entropy training. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001.

[6] D. Heckerman, D. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for density estimation, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1:49-75, 2000.

[7] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 688-693, 1999.

[8] S. Lawrence, C. L. Giles, and K. Bollacker. Digital libraries and Autonomous Citation Indexing. IEEE Computer, 32(6):67-71, 1999.

[9] D. Pavlov and D. Pennock. A maximum entropy approach to collaborative filtering in dynamic, sparse, high-dimensional domains. Technical Report NECI TR, NEC Research Institute, 2002.

[10] D. Pavlov, A. Popescul, D. Pennock, and L. Ungar. Mixtures of conditional maximum entropy models. Technical Report NECI TR, NEC Research Institute, 2002.

[11] S. D. Pietra, V. D. Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, April 1997.

[12] A. Popescul, L. Ungar, D. Pennock, and S. Lawrence. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 437-444, 2001.

[13] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In Proceedings of the ACM 1994 Conference on Computer Supported Cooperative Work, pages 175-186, Chapel Hill, North Carolina, 1994. ACM.

[14] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Analysis of recommender algorithms for e-commerce. 
In Proceedings of the 2nd ACM Conference on Electronic Commerce, pages 158-167, 2000.

[15] G. Shani, R. Brafman, and D. Heckerman. An MDP-based recommender system. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 453-460, 2002.

[16] U. Shardanand and P. Maes. Social information filtering: Algorithms for automating \"word of mouth\". In Proceedings of the ACM CHI'95 Conference on Human Factors in Computing Systems, volume 1, pages 210-217, 1995.
", "award": [], "sourceid": 2278, "authors": [{"given_name": "Dmitry", "family_name": "Pavlov", "institution": null}, {"given_name": "David", "family_name": "Pennock", "institution": null}]}