{"title": "Optimal Web-Scale Tiering as a Flow Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 1333, "page_last": 1341, "abstract": "We present a fast online solver for large scale maximum-flow problems as they occur in portfolio optimization, inventory management, computer vision, and logistics. Our algorithm solves an integer linear program in an online fashion. It exploits total unimodularity of the constraint matrix and a Lagrangian relaxation to solve the problem as a convex online game. The algorithm generates approximate solutions of max-flow problems by performing stochastic gradient descent on a set of flows. We apply the algorithm to optimize tier arrangement of over 80 Million web pages on a layered set of caches to serve an incoming query stream optimally. We provide an empirical demonstration of the effectiveness of our method on real query-pages data.", "full_text": "Optimal Web-scale Tiering as a Flow Problem\n\nGilbert Leung\n\neBay, Inc.\n\nSan Jose, CA, USA\ngleung@alum.mit.edu\n\nAlexander J. Smola\n\nYahoo! Research\n\nSanta Clara, CA, USA\n\nalex@smola.org\n\nNovi Quadrianto\n\nSML-NICTA & RSISE-ANU\n\nCanberra, ACT, Australia\nnovi.quad@gmail.com\n\nKostas Tsioutsiouliklis\n\nYahoo! Labs\n\nSunnyvale, CA, USA\nkostas@yahoo-inc.com\n\nAbstract\n\nWe present a fast online solver for large scale parametric max-\ufb02ow problems as\nthey occur in portfolio optimization, inventory management, computer vision, and\nlogistics. Our algorithm solves an integer linear program in an online fashion. It\nexploits total unimodularity of the constraint matrix and a Lagrangian relaxation to\nsolve the problem as a convex online game. The algorithm generates approximate\nsolutions of max-\ufb02ow problems by performing stochastic gradient descent on a set\nof \ufb02ows. 
We apply the algorithm to optimize the tier arrangement of over 84 million web pages on a layered set of caches to serve an incoming query stream optimally.\n\n1 Introduction\n\nParametric flow problems have been well studied in operations research [7]. They have received a significant number of contributions and have been applied in many problem areas such as database record segmentation [2], energy minimization for computer vision [10], critical load factor determination in two-processor systems [16], end-of-season baseball elimination [6], and most recently by [19, 18, 20] in product portfolio selection. In other words, parametric flow is a key technique for many estimation and assignment problems. Unfortunately, many algorithms proposed in the literature are geared towards thousands to millions of objects rather than billions, as is common in web-scale problems.\n\nOur motivation for solving parametric flow is the problem of webpage tiering for search engine indices. While our methods are entirely general and could be applied to a range of other machine learning and optimization problems, we focus on webpage tiering as the illustrative example in this paper. The rationale for choosing this application is threefold: firstly, it is a real problem in search engines. Secondly, it provides very large datasets. Thirdly, in doing so we introduce a new problem to the machine learning community. That said, our approach would also be readily applicable to very large scale versions of the problems described in [2, 16, 6, 19].\n\nThe specific problem that will provide our running example is that of assigning webpages to several tiers of a search engine cache such that the time to serve a query is minimized. For a given query, a search engine returns a number of documents (typically 10). The time it takes to serve a query depends on where the documents are located. The first tier (or cache) is the fastest (using premium hardware, etc. 
thus also often the smallest) and retrieves its documents with little latency. If even just a single document is located in a back tier, the delay increases considerably, since we now need to search the larger (and slower) tiers until the desired document is found. Hence it is our goal to assign the most popular documents to the fastest tiers while taking the interactions between documents into account.\n\n\f2 The Tiering Problem\n\nWe would like to allocate documents d ∈ D into the k tiers of storage at our disposal. Moreover, let q ∈ Q be the queries arriving at a search engine, with finite values vq > 0 (e.g. the probability of the query, possibly weighted by the relevance of the retrieved results), and a set of documents Dq retrieved for the query. This input structure is stored in a bipartite graph G with vertices V = D ∪ Q and edges (d, q) ∈ E whenever document d should be retrieved for query q.\n\nThe k tiers, with tier 1 as the most desirable and k the least (most costly for retrieval), form an increasing sequence of cumulative capacities Ct, with Ct indicating how many pages can be stored by tiers t' ≤ t together. Without loss of generality, assume Ck−1 < |D| (that is, the last tier is required to hold all documents, or the problem can be reduced). Finally, for each t ≥ 2 we assume that there is a penalty pt−1 > 0 incurred by a tier-miss at level t (known as "fallthrough" from tier t − 1 to tier t). And since we have to access tier 1 regardless, we set p0 = 0 for convenience. For instance, retrieving a page in tier 3 incurs a total penalty of p1 + p2.\n\n2.1 Background\n\nOptimization of index structures and data storage is a key problem in building an efficient search engine. Much work has been invested into building efficient inverted indices which are optimized for query processing [17, 3]. 
These papers all deal with the issue of optimizing the data representation for a given query and how an inverted index should be stored and managed for general queries. In particular, [3, 14] address the problem of computing the top-k results without scanning over the entire inverted lists. Recently, machine learning algorithms have been proposed [5] to improve the ordering within a given collection beyond the basic inverted indexing setup [3].\n\nA somewhat orthogonal strategy is to decompose the collection of webpages into a number of disjoint tiers [15] ordered by decreasing level of relevance. That is, documents are partitioned according to their relevance for answering queries into different tiers of (typically) increasing size. This leads to putting the most frequently retrieved or the most relevant (according to the value of the query, the market, or other operational parameters) pages into the top tier with the smallest latency, and relegating the less frequently retrieved or less relevant pages into bottom tiers. Since queries are often carried out by sequentially searching this hierarchy of tiers, an improved ordering minimizes latency, improves user satisfaction, and reduces computation.\n\nA naive implementation of this approach would simply assign a value to each page in the index and arrange the pages such that the most frequently accessed ones reside in the highest levels of the cache. Unfortunately this approach is suboptimal: in order to answer a given query well, a search engine typically does not return just a single page as a result but rather a list of r (typically r = 10) pages. This means that if even just one of these pages is found at a much lower tier, we either need to search the back tiers to retrieve this page or alternatively we need to sacrifice result relevance.\n\nAt first glance, the problem is daunting: we need to take all correlations among pages induced by user queries into account. 
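The suboptimality of the naive per-page assignment can be illustrated on a toy instance; the queries, values, and capacity below are hypothetical, and the exhaustive search is only feasible at this scale:

```python
# Toy illustration (hypothetical data, not from the paper): ranking pages by
# individual popularity is suboptimal, because a query only avoids a
# fallthrough when *all* of its r result pages sit in the fast tier.

from itertools import combinations

# Hypothetical instance: query -> (value v_q, result set D_q).
queries = {
    "q1": (3.0, {"a", "b"}),
    "q2": (2.0, {"c", "d"}),
    "q3": (2.0, {"c", "e"}),
}
docs = {"a", "b", "c", "d", "e"}
C = 2  # capacity of the fast tier

def miss_cost(cache):
    # A query pays its value v_q unless all of its pages are cached.
    return sum(v for v, dq in queries.values() if not dq <= cache)

# Naive heuristic: cache the individually most popular pages.
popularity = {d: sum(v for v, dq in queries.values() if d in dq) for d in docs}
greedy = set(sorted(docs, key=lambda d: (-popularity[d], d))[:C])

# Exhaustive search over all caches of size C (feasible only at toy scale).
best = min((set(s) for s in combinations(sorted(docs), C)), key=miss_cost)

print(greedy, miss_cost(greedy))  # greedy caches {a, c} and pays 7.0
print(best, miss_cost(best))      # the optimum caches {a, b} and pays 4.0
```

The greedy choice caches page c because it appears in two result sets, yet no query is then served entirely from the fast tier; the optimum instead completes one full result set.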
Moreover, for reasons of practicality we need to design an algorithm which is linear in the amount of data presented (i.e. the number of queries) and whose storage requirements are only linear in the number of pages. Finally, we would like to obtain guarantees in terms of performance for the assignment that we obtain from the algorithm. Our problem, even for r = 2, is closely related to the weighted k-densest subgraph problem, which is NP hard [13].\n\n2.2 Optimization Problem\n\nSince the problem we study is somewhat more general than the parametric flow problem, we give a self-contained derivation of the problem and derive the more general version beyond [7]. For brevity, we relegate all proofs to the Appendix.\n\nWe denote the result set for query q by Dq := {d : (d, q) ∈ G}, and similarly, the set of queries seeking a document d by Qd := {q : (d, q) ∈ G}. For a document d we denote by zd ∈ {1, . . . , k} the tier storing d. Define\n\nuq := max_{d ∈ Dq} zd    (1)\n\nas the number of cache levels we need to traverse to answer query q. In other words, it is the document found in the worst tier which determines the cost of access. Integrating the optimization over uq we may formulate the tiering problem as an integer program:\n\nminimize_{z,u} Σ_{q ∈ Q} vq Σ_{t=1}^{uq−1} pt  subject to  zd ≤ uq ≤ k for all (q, d) ∈ G  and  Σ_{d ∈ D} {zd ≤ t} ≤ Ct for all t.    (2)\n\nNote that we replaced the maximization condition (1) by a linear inequality in preparation for a reformulation as an integer linear program. Obviously, the optimal uq for a given z will satisfy (1).\n\nLemma 1 Assume that Ck ≥ |D| > Ck−1. Then there exists an optimal solution of (2) such that Σ_d {zd ≤ t} = Ct for all 1 ≤ t < k.\n\nIn the following we address several issues associated with the optimization problem: A) Eq. 
(2) is an integer program and consequently it is discrete and nonconvex. We show that there exists a convex reformulation of the problem. B) It is at a formidable scale (often |D| > 10^9). Section 3.4 presents a stochastic gradient descent procedure to solve the problem in a few passes through the database. C) We have insufficient data for an accurate tier assignment for pages associated with tail queries. This can be addressed by a smoothing estimator for the tier index of a page.\n\n2.3 Integer Linear Program\n\nWe now replace the selector variables zd and uq by binary variables via a "thermometer" code. Let\n\nx ∈ {0; 1}^{D×(k−1)} subject to xdt ≥ xd,t+1 for all d, t    (3a)\ny ∈ {0; 1}^{Q×(k−1)} subject to yqt ≥ yq,t+1 for all q, t    (3b)\n\nbe index variables. Thus we have the one-to-one mapping zd = 1 + Σ_t xdt and xdt = {zd > t} between z and x. For instance, for k = 5, a middle tier z = 3 maps into x = (1, 1, 0, 0) (requiring two fallthroughs), and the best tier z = 1 corresponds to x = (0, 0, 0, 0). The mapping between u and y is analogous. The constraint uq ≥ zd can simply be rewritten coordinate-wise as yqt ≥ xdt.\n\nFinally, the capacity constraints assume the form Σ_d xdt ≥ |D| − Ct. That is, the number of pages allocated to higher tiers is at least |D| − Ct. Define remaining capacities C̄t := |D| − Ct; using the variable transformation (1) we obtain the following integer linear program:\n\nminimize_{x,y} v⊤yp    (4a)\nsubject to Σ_d xdt ≥ C̄t for all 1 ≤ t ≤ k − 1    (4b)\nxdt ≥ xd,t+1 and yqt ≥ yq,t+1 and yqt ≥ xdt for all (q, d) ∈ G    (4c)\nx ∈ {0; 1}^{D×(k−1)}; y ∈ {0; 1}^{Q×(k−1)}    (4d)\n\nwhere p = (p1, . . . , pk−1)⊤ and v = (v1, . . . , v|Q|)⊤ are column vectors, and y a matrix (yqt). The advantage of (4) is that while still discrete, we now have linear constraints and a linear objective function. The only problem is that the variables x and y need to be binary.\n\nLemma 2 The solutions of (2) and (4) are equivalent.\n\n2.4 Hardness\n\nBefore discussing convex relaxations and approximation algorithms it is worthwhile to review the hardness of the problem: consider only two tiers, and a case where we retrieve only two pages per query. The corresponding graph has vertices D and edges (d, d') ∈ E whenever d and d' are displayed together to answer a query. In this case the tiering problem reduces to one of finding a subset of vertices D' ⊂ D such that the induced subgraph has the largest number (possibly weighted) of edges subject to the capacity constraint |D'| ≤ C.\n\nFor the case of k pages per query, simply assume that k − 2 of the pages are always the same. Hence the problem of finding the best subset reduces to the case of 2 pages per query. This problem is identical to the k-densest subgraph problem, which is known to be NP hard [13].\n\n\fFigure 1: k-densest subgraph reduction. Vertices correspond to URLs and queries correspond to edges. Queries can be served whenever the corresponding URLs are in the cache. This is the case whenever the induced subgraph contains the edge.\n\n3 Convex Programming\n\nThe key idea in solving (4) is to relax the capacity constraints for the tiers. This renders the problem totally unimodular and therefore amenable to a solution by a linear program. We replace the capacity constraint by a partial Lagrangian. This does not ensure that we will be able to meet the capacity constraints exactly anymore. Instead, we will only be able to state ex-post that the relaxed solution is optimal for the observed capacity distribution. 
Moreover, we are still able to control capacity by a suitable choice of the associated Lagrange multipliers.\n\n3.1 Linear Program\n\nInstead of solving (4) we study the linear program:\n\nminimize_{x,y} v⊤yp − 1⊤xλ subject to xdt ≥ xd,t+1 and yqt ≥ yq,t+1 and yqt ≥ xdt for (q, d) ∈ G and xdt, yqt ∈ [0, 1]    (5)\n\nHere λ = (λ1, . . . , λk−1)⊤ acts as a vector of Lagrange multipliers λt ≥ 0 for enforcing capacity constraints, and 1 denotes a column of |D| ones. We now relate the solution of (5) to that of (4).\n\nLemma 3 For any choice of λ with λt ≥ 0 the linear program (5) has an integral solution, i.e. there exist some x*, y* satisfying x*dt, y*qt ∈ {0; 1} which minimize (5). Moreover, for C̄t = Σ_d x*dt the solution (x*, y*) also solves (4).\n\nWe have succeeded in reducing the complexity of the problem to that of a linear program, yet it is still formidable and it needs to be solved to optimality for an accurate caching prescription. 
Moreover, we need to adjust λ such that we satisfy the desired capacity constraints (approximately).\n\nLemma 4 Denote by L*(λ) the value of (5) at the solution of (5) and let L(λ) := L*(λ) + Σ_t C̄t λt. Hence L(λ) is concave in λ and moreover, L(λ) is maximized for a choice of λ where the solution of (5) satisfies the constraints of (4).\n\nNote that while the above two lemmas provide us with a guarantee that for every λ and for every associated integral solution of (5) there exists a set of capacity constraints for which this is optimal and that such a capacity satisfying constraint can be found efficiently by concave maximization, they do not guarantee the converse: not every capacity constraint can be satisfied by the convex relaxation, as the following example demonstrates.\n\nExample 1 Consider the case of 2 tiers (hence we drop the index t), a single query q and 3 documents d. Set the capacity constraint of the first tier to 1. In this case it is impossible to avoid a cache miss in the ILP. In the LP relaxation of (4), however, the optimal (non-integral) solution is to set all xd = 1/3 and yq = 1/3. The partial Lagrangian L(λ) is maximized for λ = −p/3. Moreover, for λ < −p/3 the optimization problem (5) has as its solution x = y = 1; whereas for λ > −p/3 the solution is x = y = 0. For the critical value any convex combination of those two values is valid.\n\nThis example shows why the optimal tiering problem is NP hard: it is possible to design cases where the tier assignment for a page is highly ambiguous. 
Note that for the integer programming problem with capacity constraint C = 2 we could allocate an arbitrary pair of pages to the cache. This does not change the objective function (total cache miss) or feasibility.\n\n\fFigure 2: Left: maximum flow problem for a problem of 4 pages and 3 queries. The minimum cut of the directed graph needs to sever all pages leading to a query, or alternatively it needs to sever the corresponding query, incurring a penalty of (1 − vq). This is precisely the tiering objective function for the case of two tiers. Right: the same query graph for three tiers. Here the black nodes and dashed edges represent a copy of the original graph; additionally, each page in the original graph also has an infinite-capacity link to the corresponding query in the additional graph.\n\n3.2 Graph Cut Equivalence\n\nIt is well known that the case of two tiers (k = 2) can be relaxed to a min-cut, max-flow problem [7, 4]. The transformation works by designing a bipartite graph between queries q and documents d. All documents are connected to the source s by edges with capacity λ, and queries are connected to the sink t with capacity (1 − vq). Documents d retrieved for a query q are connected to q with capacity ∞.\n\nFigure 2 provides an example of such a maximum-flow, minimum-cut graph from source s to sink t. The conversion to several tiers is slightly more involved. Denote by vdi vertices associated with document d and tier i, and moreover denote by wqi vertices associated with a query q and tier i. Then the graph is given by edges (s, vdi) with capacities λi; edges (vdi, wqi') for all (document, query) pairs and for all i ≤ i', endowed with infinite capacity; and edges (wqi, t) with capacity (1 − vq). As with the simple caching problem, we need to impose a cut on any query edge for which not all incoming page edges have been cut. 
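On a toy two-tier instance, this construction can be sketched with a plain Edmonds-Karp solver; the instance, weights, and function names below are illustrative assumptions rather than the paper's implementation, and a generic query weight stands in for the 1 − vq capacity on the query edges:

```python
# Sketch of the two-tier min-cut construction (toy data, not from the paper).
# Edges: source->page with capacity lam, page->query with infinite capacity,
# query->sink with the query weight.

from collections import defaultdict, deque

def max_flow(cap, s, t):
    """Edmonds-Karp: augment along shortest paths in the residual graph."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # The visited nodes form the source side of a minimum cut.
            return flow, set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= aug
            cap[v][u] += aug
        flow += aug

def tiering_cut(pages, queries, lam):
    """queries: {q: (weight, result set)}. Returns (cut value, cached pages)."""
    cap = defaultdict(lambda: defaultdict(float))
    for d in pages:
        cap["s"][d] = lam
    for q, (w, dq) in queries.items():
        cap[q]["t"] = w
        for d in dq:
            cap[d][q] = float("inf")
    value, source_side = max_flow(cap, "s", "t")
    # A severed source->page edge (page on the sink side) means the page is
    # cached; a query escapes its miss penalty only if all its pages are.
    cached = {d for d in pages if d not in source_side}
    return value, cached

pages = ["a", "b", "c"]
queries = {"q1": (3.0, {"a", "b"}), "q2": (1.0, {"b", "c"})}
print(tiering_cut(pages, queries, lam=1.0))  # cheap caching: cache all three
print(tiering_cut(pages, queries, lam=2.0))  # caching too dear: miss both queries
```

Raising lam makes caching more expensive, so fewer pages end up on the cached side of the cut, which is exactly how the Lagrange multiplier trades off capacity against misses.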
The key difference is that in order to benefit from storing pages in a better tier we need to guarantee that the page is contained in the lower tier, too.\n\n3.3 Variable Reduction\n\nWe now simplify the relaxed problem (5) further by reducing the number of variables, without sacrificing integrality of the solution. A first step is to substitute yqt = max_{d ∈ Dq} xdt, to obtain an optimization problem over the documents alone:\n\nminimize_x v⊤(max_{d ∈ Dq} xdt) p − 1⊤xλ subject to xdt ≥ xdt' for t' > t and xdt ∈ [0, 1]    (6)\n\nNote that the monotonicity condition yqt ≥ yqt' for t' > t is automatically inherited from that of x. The solution of (6) is still integral since the problem is equivalent to one with an integral solution.\n\nLemma 5 We may scale pt and λt together by constants βt > 0, such that p't/pt = βt = λ't/λt. The resulting solution of this new problem (6) with (p', λ') is unchanged.\n\nEssentially, problem (5) as parameterized by (p, λ) yields solutions which form equivalence classes. Consequently, for the convenience of solving (5) we may assume p't = 1 for t ≥ 1. We only need to consider the original p for evaluating the objective using solution z (thus, same observed capacities Ct).\n\nSince (5) is a relaxation of (4) this reformulation can be extended to the integer linear program, too. Moreover, under reasonable conditions on the capacity constraints, there is more structure in λ.\n\nLemma 6 Assume that C̄t is monotonically decreasing and that pt = 1 for t ≥ 1. 
Then any choice of λ satisfying the capacity constraints is monotonically non-increasing.\n\nAlgorithm 1 Tiering Optimization\nInitialize all zd = 0\nInitialize n = 100\nfor i = 1 to MAXITER do\n  for all q ∈ Q do\n    η = 1/√n (learning rate)\n    n ← n + 1 (increment counter)\n    Update z ← z − η ∂z ℓq(z)\n    Project z to [1, k]^D via zd ← max(1, min(k, zd))\n  end for\nend for\n\nAlgorithm 2 Deferred updates\nObserve current time n'\nRead timestamp n for document d\nCompute update steps δ = δ(n', n)\nrepeat\n  j = ⌊zd + 1⌋ (next largest tier)\n  t = (j − zd)/λj (change needed to reach next tier)\n  if t > δ then\n    zd ← zd + λj δ and δ = 0 (partial step; we are done)\n  else\n    zd ← zd + 1 and δ ← δ − t (full step; next tier)\n  end if\nuntil δ = 0 (no more updates) or zd = k − 1 (bottom tier)\n\nOne interpretation of this is that, unless the tiers are increasingly inexpensive, the optimal solution would assign pages in a fashion yielding empty middle tiers (the remaining capacities C̄t not strictly decreasing). This monotonicity simplifies the problem. Consequently, we exploit this fact to complete the variable reduction. Define δλi := λi − λi+1 for i ≥ 1 (all non-negative by virtue of Lemma 6) and\n\nfλ(χ) := −λ1 χ + Σ_{i=1}^{k−2} δλi max(0, χ − i) for χ ∈ [0, k−1].    (7)\n\nNote that by construction ∂χ fλ(χ) = −λi whenever χ ∈ (i − 1, i). 
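The inner loop of Algorithm 1 can be sketched as follows; the tier count, λ values, and query stream are hypothetical toy data, and the subgradient uses the slopes −λi of fλ stated above:

```python
# A minimal sketch of Algorithm 1 on hypothetical toy data. Each query pulls
# its worst-placed result toward tier 1, while f_lambda pushes every page
# toward the cheaper tiers; a projection keeps each z_d inside [1, k].

import math

k = 3                       # number of tiers
lam = [0.3, 0.1]            # lambda_1 .. lambda_{k-1}, non-increasing (Lemma 6)
queries = [                 # stream of (v_q, result set D_q)
    (3.0, ["a", "b"]),
    (2.0, ["b", "c"]),
    (1.0, ["c", "d"]),
]
docs = ["a", "b", "c", "d"]
z = {d: 1.0 for d in docs}  # tier scores z_d in [1, k]

def f_slope(chi):
    # Subgradient of the piecewise-linear f_lambda: -lambda_i on (i-1, i).
    i = min(len(lam), max(1, math.ceil(chi)))
    return -lam[i - 1]

n = 100
for _ in range(30):                      # a few passes over the query stream
    for vq, dq in queries:
        eta = 1.0 / math.sqrt(n)         # learning rate
        n += 1
        worst = max(dq, key=z.get)       # bottleneck document of eq. (1)
        z[worst] -= eta * vq             # promote it toward tier 1
        for d in docs:                   # demotion pressure from f_lambda
            z[d] -= eta * f_slope(z[d] - 1.0) / len(queries)
        for d in docs:                   # projection onto [1, k]
            z[d] = min(float(k), max(1.0, z[d]))

print({d: round(z[d], 2) for d in docs})
```

Touching every page on every query is what makes this naive version infeasible at scale; the deferred updates of Section 3.5 exploit the fact that, away from its own queries, a page's score only drifts at the piecewise-constant rate λi.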
The function fλ is clearly convex, which helps describe our tiering problem via the following convex program:\n\nminimize_z Σ_{q ∈ Q} vq max_{d ∈ Dq} zd + Σ_d fλ(zd − 1) for zd ∈ [1, k]    (8)\n\nWe now use only one variable per document. Moreover, the convex constraints are simple box constraints. This simplifies convex projections, as needed for online programming.\n\nLemma 7 The solution of (8) is equivalent to that of (5).\n\n3.4 Online Algorithm\n\nWe now turn our attention to a fast algorithm for minimizing (8). While greatly simplified relative to (2), it still remains a problem of billions of variables. The key observation is that the objective function of (8) can be written as a sum over the following loss functions\n\nℓq(z) := vq max_{d ∈ Dq} zd + (1/|Q|) Σ_d fλ(zd − 1)    (9)\n\nwhere |Q| denotes the cardinality of the query set. The transformation suggests a simple stochastic gradient descent optimization algorithm: traverse the input stream by queries, and update the values of xd of all those documents d that would need to move into the next tier in order to reduce service time for a query. Subsequently, perform a projection of the page vectors to the set [1, k] to ensure that we do not assign pages to non-existent tiers.\n\nAlgorithm 1 proceeds by processing the input query-result records (q, vq, Dq) as a stream comprising the set of pages that need to be displayed to answer a given query. More specifically, it updates the tier preferences of the pages that have the lowest tier scores for each level and it decrements the preferences for all other pages. We may apply results for online optimization algorithms [1] to show that a small number of passes through the dataset suffice.\n\nLemma 8 The solution obtained by Algorithm 1 converges at rate O(√((log T)/T)) to its minimum value. 
Here T is the number of queries processed.\n\n3.5 Deferred and Approximate Updates\n\nThe naive implementation of Algorithm 1 is infeasible as it would require us to update all |D| coordinates of xd for each query q. However, it is possible to defer the updates until we need to inspect zd directly. The key idea is to exploit that for all zd with d ∉ Dq the updates only depend on the value of zd at update time (Section A.1) and that fλ is piecewise linear and monotonically decreasing.\n\n3.6 Path Following\n\nThe tiering problem has the appealing property [19] that the solutions for increasing λ form a nested subset. In other words, relaxing capacity constraints never demotes but only promotes pages. This fact can be used to design specialized solvers which work well at determining the entire solution path at once for moderate-sized problems [19]. Alternatively, we can simply take advantage of solutions for successive values of λ in determining an approximate solution path by using the solution for λ as initialization for λ'. This strategy is well known as path-following in numerical optimization.\n\nIn this context it is undesirable to solve the optimization for a particular value of λ to optimality. Instead, we simply solve it approximately (using a small number of passes) and readjust λ. Due to the nesting property [19] and the fact that the optimal solutions are binary (via total unimodularity), the average over solutions on the entire path provides an ordering of pages into tiers. Thus,\n\nLemma 9 Denote by xd(λ) the solution of the two-tier optimization problem for a given value of λ. Moreover, denote by ζd := (λ' − λ)^{−1} ∫_λ^{λ'} xd(λ) dλ the average value over a range of Lagrange multipliers. 
Then ζd provides an order for sorting documents into tiers for the entire range [λ, λ'].\n\nIn practice¹, we only choose a finite number of steps for near-optimal solutions. This yields\n\nAlgorithm 3 Path Following\nInitialize all (xdt) = zd ∈ [1, k]\nfor each λ ∈ Λ do\n  Refine variables xdt(λ) by Algorithm 1 using a small number of iterations.\nend for\nAverage the variables xdt = Σ_{λ ∈ Λ} xdt(λ)/|Λ|\nSort the documents with the resulting total scores zd\nFill the ordered documents into tier 1, then tier 2, etc.\n\nExperiments show that using synthetic data (where it was feasible to compute and compare with the optimal LP solution pointwise) even |Λ| = 5 values of λ produce near-optimal results in the two-tier case. Moreover, we may carry out the optimization procedure for several parameters simultaneously. This is advantageous since the main cost is sequential RAM read-write access rather than CPU speed.\n\n4 Experiments\n\nTo examine the efficacy of our algorithm at web scale we tested it with real data from a major search engine. The results of our proposed methods are compared to those of the max and sum heuristics in Section A.2. We also performed experiments on small synthetic data (2-tier and 3-tier), where we were able to show that our algorithm converges to the exact solution given by an LP solver (Appendix C). However, since LP solvers are very slow, this is not feasible for web-scale problems.\n\nWe processed the logs for one week of September 2009 containing results from the top geographic regions, which include a majority of the search engine's user base. 
To simplify the heavy processing involved for collecting such a massive data set, we only record whether a particular result, defined as a (query, document) pair, appears in the top 10 (first result page) for a given session, and we aggregate the view counts of such results, which will be used for the session value vq once. In its entirety this subset contains about 10^8 viewed documents and 1.6 · 10^7 distinct queries. We excluded results viewed only once, yielding a final data set of 8.4 · 10^7 documents.²\n\n¹This result can be readily extended to k > 2, and any probability measure over a set of Lagrangian values λ ∈ Λ ⊆ R^{k−1}_+ so long as there are positive weights around the values yielding all the nested solutions.\n\n²The search results for any fixed query vary for a variety of reasons, e.g. database updates. We approximate the session graph by treating queries with different result sets as if they were different. This does not change\n\n\fFigure 3: Left: Experimental results for real web-search data with 8.4 · 10^7 pages and 1.6 · 10^7 queries. Session miss rate for the online procedure, the max and sum heuristics (A.2). (The y-axis is normalized such that SUM-tier's first point is at 1.) As seen, the max heuristic cannot be optimal for any but small cache sizes, but it performs comparably well to Online. Right: "Online" outperforms MAX for cache sizes larger than 60%, sometimes more than twofold.\n\nFor simplicity, our experiments are carried out for a two-tier (single cache) system such that the only design parameter is the relative size of the prime tier (the cache). 
The ranking variant of our online Algorithm 3 (30 passes over the data) consistently outperforms the max and sum heuristics over a large span of cache sizes (Figure 3). Direct comparison can now be made between our online procedure and the max and sum heuristics since each one induces a ranking on the set of documents. We then calculate the session miss rate of each procedure at any cache size, and report the relative improvement of our online algorithm as ratios of miss rates in Figure 3 (Right).\n\nThe optimizer fits well in a desktop's RAM since 5 values of λ only amount to about 2GB of single-precision x(λ). We measure a throughput of approximately 0.5 million query-sessions per second (qps) for this version, and about 2 million qps for smaller problems (as they incur fewer memory page faults). Billion-scale problems can readily fit in 24GB of RAM by serializing the computation one λ value at a time. We also implemented a multi-threaded version utilizing 4 CPU cores, although its performance did not improve since memory and disk bandwidth limits had already been reached.\n\n5 Discussion\n\nWe showed that very large tiering and densest subset optimization problems can be solved efficiently by a relatively simple online optimization procedure (some extensions are in Appendix B). It came somewhat as a surprise that the max heuristic often works nearly as well as the optimal tiering solution. Since we experienced this correlation on both synthetic and real data, we believe that it might be possible to prove approximation guarantees for this strategy whenever the bipartite graphs satisfy certain power-law properties.\n\nSome readers may question the need for a static tiering solution, given that data could, in theory, be reassigned between different caching tiers on the fly. 
The problem is that in production systems of a search engine, such reassignment of large amounts of data may not always be efficient for operational reasons (e.g. different versions of the ranking algorithm, different versions of the index, different service levels, constraints on transfer bandwidth). In addition, tiering is a problem not restricted to the provision of webpages. It occurs in product portfolio optimization and other resource-constrained settings. We showed that it is possible to solve such problems at several orders of magnitude larger scale than what was previously considered feasible.\n\nAcknowledgments We thank Kelvin Fong for providing computer facilities. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program. This work was carried out while GL and NQ were with Yahoo! Labs.\n\n²(continued) the optimization problem and keeps the model accurate. Moreover, we remove rare results by maintaining that the lowest count of a document is at least as large as the square root of the highest within the same session.\n\n\fReferences\n\n[1] P. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, NIPS 20, Cambridge, MA, 2008.\n\n[2] M. J. Eisner and D. G. Severance. Mathematical techniques for efficient record segmentation in large shared databases. J. ACM, 23(4):619–635, 1976.\n\n[3] R. Fagin. Combining fuzzy information from multiple systems. In Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 216–226, Montreal, Canada, 1996.\n\n[4] L. R. Ford and D. R. Fulkerson. Maximal flow through a network. Canadian Journal of Mathematics, 8:399–404, 1956.\n\n[5] S. Goel, J. Langford, and A. Strehl. Predictive indexing for fast search. In D. Koller, D. 
Schuurmans, Y. Bengio, and L. Bottou, editors, NIPS, pages 505–512. MIT Press, 2008.\n\n[6] D. Gusfield and C. U. Martel. A fast algorithm for the generalized parametric minimum cut problem and applications. Algorithmica, 7(5&6):499–519, 1992.\n\n[7] D. Gusfield and É. Tardos. A faster parametric minimum-cut algorithm. Algorithmica, 11(3):278–290, 1994.\n\n[8] I. Heller and C. Tompkins. An extension of a theorem of Dantzig's. In H. Kuhn and A. Tucker, editors, Linear Inequalities and Related Systems, volume 38 of Annals of Mathematics Studies. AMS, 1956.\n\n[9] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.\n\n[10] V. Kolmogorov, Y. Boykov, and C. Rother. Applications of parametric maxflow in computer vision. ICCV, 1–8, 2007.\n\n[11] Y. Nesterov and J.-P. Vial. Confidence level solutions for stochastic programming. Technical Report 2000/13, Université Catholique de Louvain - Center for Operations Research and Economics, 2000.\n\n[12] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, Stanford, CA, USA, Nov. 1998.\n\n[13] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, New Jersey, 1982.\n\n[14] M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. JASIS, 47(10):749–764, 1996.\n\n[15] K. M. Risvik, Y. Aasheim, and M. Lidal. Multi-tier architecture for web search engines. In LA-WEB, pages 132–143. IEEE Computer Society, 2003.\n\n[16] H. S. Stone. Critical load factors in two-processor distributed systems. IEEE Trans. Softw. Eng., 4(3):254–258, 1978.\n\n[17] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. 
In J. Quemada, G. León, Y. Maarek, and W. Nejdl, editors, 18th International Conference on World Wide Web, Madrid, Spain, pages 401–410. ACM, 2009.\n\n[18] B. Zhang, J. Ward, and A. Feng. A simultaneous maximum flow algorithm for the selection model. Technical Report HPL-2005-91, Hewlett Packard Laboratories, 2005.\n\n[19] B. Zhang, J. Ward, and Q. Feng. A simultaneous parametric maximum-flow algorithm for finding the complete chain of solutions. Technical Report HPL-2004-189, Hewlett Packard Laboratories, 2004.\n\n[20] B. Zhang, J. Ward, and Q. Feng. Simultaneous parametric maximum flow algorithm with vertex balancing. Technical Report HPL-2005-121, Hewlett Packard Laboratories, 2005.\n", "award": [], "sourceid": 638, "authors": [{"given_name": "Gilbert", "family_name": "Leung", "institution": null}, {"given_name": "Novi", "family_name": "Quadrianto", "institution": null}, {"given_name": "Kostas", "family_name": "Tsioutsiouliklis", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}