{"title": "Exponential Family Graph Matching and Ranking", "book": "Advances in Neural Information Processing Systems", "page_first": 1455, "page_last": 1463, "abstract": "We present a method for learning max-weight matching predictors in bipartite graphs. The method consists of performing maximum a posteriori estimation in exponential families with sufficient statistics that encode permutations and data features. Although inference is in general hard, we show that for one very relevant application - document ranking - exact inference is efficient. For general model instances, an appropriate sampler is readily available. Contrary to existing max-margin matching models, our approach is statistically consistent and, in addition, experiments with increasing sample sizes indicate superior improvement over such models. We apply the method to graph matching in computer vision as well as to a standard benchmark dataset for learning document ranking, in which we obtain state-of-the-art results, in particular improving on max-margin variants. The drawback of this method with respect to max-margin alternatives is its runtime for large graphs, which is comparatively high.", "full_text": "Exponential Family Graph Matching and Ranking\n\nJames Petterson, Tib\u00e9rio S. Caetano, Julian J. McAuley and Jin Yu\n\nNICTA, Australian National University\n\nCanberra, Australia\n\nAbstract\n\nWe present a method for learning max-weight matching predictors in bipartite\ngraphs. The method consists of performing maximum a posteriori estimation\nin exponential families with sufficient statistics that encode permutations and\ndata features. Although inference is in general hard, we show that for one very\nrelevant application\u2013document ranking\u2013exact inference is efficient. For general\nmodel instances, an appropriate sampler is readily available. 
Contrary to existing\nmax-margin matching models, our approach is statistically consistent and, in addition,\nexperiments with increasing sample sizes indicate superior improvement\nover such models. We apply the method to graph matching in computer vision as\nwell as to a standard benchmark dataset for learning document ranking, in which\nwe obtain state-of-the-art results, in particular improving on max-margin variants.\nThe drawback of this method with respect to max-margin alternatives is its runtime\nfor large graphs, which is comparatively high.\n\n1 Introduction\n\nThe Maximum-Weight Bipartite Matching Problem (henceforth \u2018matching problem\u2019) is a fundamental\nproblem in combinatorial optimization [22]. This is the problem of finding the \u2018heaviest\u2019\nperfect match in a weighted bipartite graph. An exact optimal solution can be found in cubic time\nby standard methods such as the Hungarian algorithm.\nThis problem is of practical interest because it can nicely model real-world applications. For example,\nin computer vision the crucial problem of finding a correspondence between sets of image\nfeatures is often modeled as a matching problem [2, 4]. Ranking algorithms can be based on a\nmatching framework [13], as can clustering algorithms [8].\nWhen modeling a problem as one of matching, one central question is the choice of the weight\nmatrix. The problem is that in real applications we typically observe edge feature vectors, not edge\nweights. Consider a concrete example in computer vision: it is difficult to tell what the \u2018similarity\nscore\u2019 is between two image feature points, but it is straightforward to extract feature vectors\n(e.g. 
SIFT) associated with those points.\nIn this setting, it is natural to ask whether we could parameterize the features, and use labeled\nmatches in order to estimate the parameters such that, given graphs with \u2018similar\u2019 features, their\nresulting max-weight matches are also \u2018similar\u2019. This idea of \u2018parameterizing algorithms\u2019 and then\noptimizing for agreement with data is called structured estimation [27, 29].\n[27] and [4] describe max-margin structured estimation formalisms for this problem. Max-margin\nstructured estimators are appealing in that they try to minimize the loss that one really cares about\n(\u2018structured losses\u2019, of which the Hamming loss is an example). However structured losses are typically\npiecewise constant in the parameters, which eliminates any hope of using smooth optimization\ndirectly. Max-margin estimators instead minimize a surrogate loss which is easier to optimize,\nnamely a convex upper bound on the structured loss [29]. In practice the results are often good,\nbut known convex relaxations produce estimators which are statistically inconsistent [18], i.e., the\nalgorithm in general fails to obtain the best attainable model in the limit of infinite training data. The\ninconsistency of multiclass support vector machines is a well-known issue in the literature that has\nreceived careful examination recently [16, 15].\nMotivated by the inconsistency issues of max-margin structured estimators as well as by the well-known\nbenefits of having a full probabilistic model, in this paper we present a maximum a posteriori\n(MAP) estimator for the matching problem. The observed data are the edge feature vectors and\nthe labeled matches provided for training. We then maximize the conditional posterior probability\nof matches given the observed data. 
We build an exponential family model where the sufficient\nstatistics are such that the mode of the distribution (the prediction) is the solution of a max-weight\nmatching problem. The resulting partition function is #P-complete to compute exactly. However,\nwe show that for learning to rank applications the model instance is tractable. We then compare\nthe performance of our model instance against a large number of state-of-the-art ranking methods,\nincluding DORM [13], an approach that only differs from our model instance by using max-margin\ninstead of a MAP formulation. We show very competitive results on standard document ranking\ndatasets, and in particular we show that our model performs better than or on par with DORM. For\nintractable model instances, we show that the problem can be approximately solved using sampling\nand we provide experiments from the computer vision domain. However the fastest suitable sampler\nis still quite slow for large models, in which case max-margin matching estimators like those of [4]\nand [27] are likely to be preferable even in spite of their potentially inferior accuracy.\n2 Background\n2.1 Structured Prediction\nIn recent years, great attention has been devoted in Machine Learning to so-called structured predictors,\nwhich are predictors of the kind\n\n$g_\\theta : X \\mapsto Y,$ (1)\n\nwhere X is an arbitrary input space and Y is an arbitrary discrete space, typically exponentially\nlarge. Y may be, for example, a space of matrices, trees, graphs, sequences, strings, matches, etc.\nThis structured nature of Y is what structured prediction refers to. In the setting of this paper, X is the\nset of vector-weighted bipartite graphs (i.e., each edge has a feature vector associated with it), and\nY is the set of perfect matches induced by X. 
If N graphs are available, along with corresponding\nannotated matches (i.e., a set $\\{(x^n, y^n)\\}_{n=1}^{N}$), our task will be to estimate \u03b8 such that when we apply\nthe predictor $g_\\theta$ to a new graph it produces a match that is similar to matches of similar graphs from\nthe annotated set. Structured learning or structured estimation refers to the process of estimating a\nvector \u03b8 for predictor $g_\\theta$ when data $\\{(x^1, y^1), \\ldots, (x^N, y^N)\\} \\in (X \\times Y)^N$ are available. Structured\nprediction for input x means computing y = g(x; \u03b8) using the estimated \u03b8.\nTwo generic estimation strategies have been popular in producing structured predictors. One is based\non max-margin estimators [29, 27], and the other on maximum-likelihood (ML) or MAP estimators\nin exponential family models [12].\nThe first approach is a generalization of support vector machines to the case where the set Y is\nstructured. However the resulting estimators are known to be inconsistent in general: in the limit\nof infinite training data the algorithm fails to recover the best model in the model class [18, 16, 15].\nMcAllester recently provided an interesting analysis on this issue, where he proposed new upper\nbounds whose minimization results in consistent estimators, but no such bounds are convex [18].\nThe other approach uses ML or MAP estimation in conditional exponential families with \u2018structured\u2019\nsufficient statistics, such as in probabilistic graphical models, where they are decomposed\nover the cliques of the graph (in which case they are called Conditional Random Fields, or CRFs\n[12]). In the case of tractable graphical models, dynamic programming can be used to efficiently\nperform inference. ML and MAP estimators in exponential families not only amount to solving an\nunconstrained and convex optimization problem; in addition they are statistically consistent. 
The\nmain problem with these types of models is that often the partition function is intractable. This has\nmotivated the use of max-margin methods in many scenarios where such intractability arises.\n\n2.2 The Matching Problem\nConsider a weighted bipartite graph with m nodes in each part, G = (V, E, w), where V is the set\nof vertices, E is the set of edges and w : E \u21a6 R is a set of real-valued weights associated with\nthe edges. G can be simply represented by a matrix (wij) where the entry wij is the weight of the\nedge ij. Consider also a bijection y : {1, 2, . . . , m} \u21a6 {1, 2, . . . , m}, i.e., a permutation. Then the\nmatching problem consists of computing\n\n$y^* = \\operatorname{argmax}_y \\sum_{i=1}^{m} w_{iy(i)}.$ (2)\n\nThis is a well-studied problem; it is tractable and can be solved in O(m^3) time [22]. This model can\nbe used to match features in images [4], improve classification algorithms [8] and rank documents\n[13], to cite a few applications. The typical setting consists of engineering the score matrix wij\naccording to domain knowledge and subsequently solving the combinatorial problem.\n\n3 The Model\n3.1 Basic Goal\nIn this paper we assume that the weights wij are instead to be estimated from training data. More\nprecisely, the weight wij associated with the edge ij in a graph will be the result of an appropriate\ncomposition of a feature vector xij (observed) and a parameter vector \u03b8 (estimated from training\ndata). Therefore, in practice, our input is a vector-weighted bipartite graph Gx = (V, E, x) (x :\nE \u21a6 R^n), which is \u2018evaluated\u2019 at a particular \u03b8 (obtained from previous training) so as to attain the\ngraph G = (V, E, w). See Figure 1 for an illustration. More formally, assume that a training set\n$\\{X, Y\\} = \\{(x^n, y^n)\\}_{n=1}^{N}$ is available, where $x^n := (x^n_{11}, x^n_{12}, \\ldots, x^n_{M(n)M(n)})$. Here M(n) is the\nnumber of nodes in each part of the vector-weighted bipartite graph $x^n$. We then parameterize xij\nas $w_{iy(i)} = f(x_{iy(i)}; \\theta)$, and the goal is to find the \u03b8 which maximizes the posterior probability of\nthe observed data. We will assume f to be bilinear, i.e., $f(x_{iy(i)}; \\theta) = \\langle x_{iy(i)}, \\theta \\rangle$.\n\nFigure 1: Left: Illustration of an input vector-weighted bipartite graph Gx with 3 \u00d7 3 edges. There\nis a vector xe associated with each edge e (for clarity only xij is shown, corresponding to the solid\nedge). Right: weighted bipartite graph G obtained by evaluating Gx on the learned vector \u03b8 (again\nonly edge ij is shown).\n\n3.2 Exponential Family Model\nWe assume an exponential family model, where the probability model is\n\n$p(y|x; \\theta) = \\exp(\\langle \\phi(x, y), \\theta \\rangle - g(x; \\theta)),$ (3)\n\nwhere\n\n$g(x; \\theta) = \\log \\sum_y \\exp \\langle \\phi(x, y), \\theta \\rangle$ (4)\n\nis the log-partition function, which is a convex and differentiable function of \u03b8 [31].\nThe prediction in this model is the most likely y, i.e.,\n\n$y^* = \\operatorname{argmax}_y p(y|x; \\theta) = \\operatorname{argmax}_y \\langle \\phi(x, y), \\theta \\rangle$ (5)\n\nand ML estimation amounts to maximizing the conditional likelihood of the training set {X, Y},\ni.e., computing $\\operatorname{argmax}_\\theta p(Y|X; \\theta)$. In practice we will in general introduce a prior on \u03b8 and perform\nMAP estimation:\n\n$\\theta^* = \\operatorname{argmax}_\\theta p(Y|X; \\theta) p(\\theta) = \\operatorname{argmax}_\\theta p(\\theta|Y, X).$ (6)\n\nAssuming iid sampling, we have $p(Y|X; \\theta) = \\prod_{n=1}^{N} p(y^n|x^n; \\theta)$. Therefore,\n\n$p(\\theta|Y, X) \\propto \\exp\\left( \\log p(\\theta) + \\sum_{n=1}^{N} (\\langle \\phi(x^n, y^n), \\theta \\rangle - g(x^n; \\theta)) \\right).$ (7)\n\nWe impose a Gaussian prior on \u03b8. 
Instead of maximizing the posterior we can instead minimize the\nnegative log-posterior \u2113(Y|X; \u03b8), which becomes our loss function (we suppress the constant term):\n\n$\\ell(Y|X; \\theta) = \\frac{\\lambda}{2} \\|\\theta\\|^2 + \\frac{1}{N} \\sum_{n=1}^{N} (g(x^n; \\theta) - \\langle \\phi(x^n, y^n), \\theta \\rangle)$ (8)\n\nwhere \u03bb is a regularization constant. \u2113(Y|X; \u03b8) is a convex function of \u03b8 since the log-partition\nfunction g(\u03b8) is a convex function of \u03b8 [31] and the other terms are clearly convex in \u03b8.\n3.3 Feature Parameterization\nThe critical observation now is that we equate the solution of the matching problem (2) to the prediction\nof the exponential family model (5), i.e., $\\sum_i w_{iy(i)} = \\langle \\phi(x, y), \\theta \\rangle$. Since our goal is to\nparameterize features of individual pairs of nodes (so as to produce the weight of an edge), the most\nnatural model is\n\n$\\phi(x, y) = \\sum_{i=1}^{M} x_{iy(i)},$ which gives (9)\n\n$w_{iy(i)} = \\langle x_{iy(i)}, \\theta \\rangle,$ (10)\n\ni.e., linear in both x and \u03b8 (see Figure 1, right). The specific form for xij will be discussed in the\nexperimental section. In light of (10), (2) now clearly means a prediction of the best match for Gx\nunder the model \u03b8.\n4 Learning the Model\n4.1 Basics\nWe need to solve $\\theta^* = \\operatorname{argmin}_\\theta \\ell(Y|X; \\theta)$. \u2113(Y|X; \u03b8) is a convex and differentiable function of\n\u03b8 [31], therefore gradient descent will find the global optimum. In order to compute $\\nabla_\\theta \\ell(Y|X; \\theta)$,\nwe need to compute $\\nabla_\\theta g(\\theta)$. It is a standard result of exponential families that the gradient of the\nlog-partition function is the expectation of the sufficient statistics:\n\n$\\nabla_\\theta g(x; \\theta) = \\mathbb{E}_{y \\sim p(y|x;\\theta)}[\\phi(x, y)].$ (11)\n\nTherefore in order to perform gradient descent we need to compute the above expectation. Opening\nthe above expression gives\n\n$\\mathbb{E}_{y \\sim p(y|x;\\theta)}[\\phi(x, y)] = \\sum_y \\phi(x, y) p(y|x; \\theta)$ (12)\n\n$= \\frac{1}{Z(x; \\theta)} \\sum_y \\phi(x, y) \\prod_{i=1}^{M} \\exp(\\langle x_{iy(i)}, \\theta \\rangle),$ (13)\n\nwhich reveals that the partition function Z(x; \u03b8) needs to be computed. The partition function is:\n\n$Z(x; \\theta) = \\sum_y \\prod_{i=1}^{M} \\underbrace{\\exp(\\langle x_{iy(i)}, \\theta \\rangle)}_{=: B_{iy(i)}}.$ (14)\n\nNote that the above is the expression for the permanent of matrix B [19]. The permanent is similar\nin definition to the determinant, the difference being that for the latter sgn(y) comes before the\nproduct. However, unlike the determinant, which is computable efficiently and exactly by standard\nlinear algebra manipulations, computing the permanent is a #P-complete problem [30]. Therefore\nwe have no realistic hope of computing (11) exactly for general problems.\n\n4.2 Exact Expectation\n\nThe exact partition function itself can be efficiently computed for up to about M = 30 using the\n$O(M 2^M)$ algorithm by Ryser [25]. However for arbitrary expectations we are not aware of any\nexact algorithm which is more efficient than full enumeration (which would constrain tractability\nto very small graphs). However we will see that even in the case of very small graphs we find a\nvery important application: learning to rank. In our experiments, we successfully apply a tractable\ninstance of our model to benchmark document ranking datasets, obtaining very competitive results.\nFor larger graphs, we have alternative options as indicated below.\n\n4.3 Approximate Expectation\nIf we have a situation in which the set of feasible permutations is too large to be fully enumerated\nefficiently, we need to resort to some approximation for the expectation of the sufficient statistics.\nThe best solution we are aware of is one by Huber and Law, who recently presented an algorithm to\napproximate the permanent of dense non-negative matrices [10]. The algorithm works by producing\nexact samples from the distribution of perfect matches on weighted bipartite graphs. 
This is in\nprecisely the same form as the distribution we have here, p(y|x; \u03b8) [10]. We can use this algorithm\nfor applications that involve larger graphs. We generate K samples from the distribution p(y|x; \u03b8),\nand directly approximate (12) with a Monte Carlo estimate\n\n$\\mathbb{E}_{y \\sim p(y|x;\\theta)}[\\phi(x, y)] \\approx \\frac{1}{K} \\sum_{i=1}^{K} \\phi(x, y_i).$ (15)\n\nIn our experiments, we apply this algorithm to an image matching application.\n\n5 Experiments\n5.1 Ranking\nHere we apply the general matching model introduced in previous sections to the task of learning\nto rank. Ranking is a fundamental problem with applications in diverse areas such as document\nretrieval, recommender systems, product rating and others. Early learning to rank methods applied a\npairwise approach, where pairs of documents were used as instances in learning [7, 6, 3]. Recently\nthere has been interest in listwise approaches, where document lists are used as instances, as in our\nmethod. In this paper we focus, without loss of generality, on document ranking.\nWe are given a set of queries $\\{q_k\\}$ and, for each query $q_k$, a list of D(k) documents $\\{d^k_1, \\ldots, d^k_{D(k)}\\}$\nwith corresponding ratings $\\{r^k_1, \\ldots, r^k_{D(k)}\\}$ (assigned by a human editor), measuring the relevance\ndegree of each document with respect to query $q_k$. A rating or relevance degree is usually a nominal\nvalue in the list {1, . . . , R}, where R is typically between 2 and 5. We are also given, for every\nretrieved document $d^k_i$, a joint feature vector $\\psi^k_i$ for that document and the query $q_k$.\nTraining At training time, we model each query $q_k$ as a vector-weighted bipartite graph (Figure\n1) where the nodes on one side correspond to a subset of cardinality M of all D(k) documents\nretrieved by the query, and the nodes on the other side correspond to all possible ranking positions for\nthese documents (1, . . . , M). The subset itself is chosen randomly, provided at least one exemplar\ndocument of every rating is present. Therefore M must be such that M \u2265 R.\nThe process is then repeated in a bootstrap manner: we resample (with replacement) from the set\nof documents $\\{d^k_1, \\ldots, d^k_{D(k)}\\}$, M documents at a time (conditioned on the fact that at least one\nexemplar of every rating is present, but otherwise randomly). This effectively boosts the number of\ntraining examples since each query $q_k$ ends up being selected many times, each time with a different\nsubset of M documents from the original set of D(k) documents.\nIn the following we drop the query index k to examine a single query. Here we follow the construction\nused in [13] to map matching problems to ranking problems (indeed the only difference\nbetween our ranking model and that of [13] is that they use a max-margin estimator and we use MAP\nin an exponential family.) Our edge feature vector xij will be the product of the feature vector \u03c8i\nassociated with document i, and a scalar cj (the choice of which will be explained below) associated\nwith ranking position j\n\n$x_{ij} = \\psi_i c_j.$ (16)\n\n\u03c8i is dataset specific (see details below). From (10) and (16), we have wij = cj \u27e8\u03c8i, \u03b8\u27e9, and training\nproceeds as explained in Section 4.\nTesting At test time, we are given a query q and its corresponding list of D associated documents.\nWe then have to solve the prediction problem, i.e.,\n\n$y^* = \\operatorname{argmax}_y \\sum_{i=1}^{D} \\langle x_{iy(i)}, \\theta \\rangle = \\operatorname{argmax}_y \\sum_{i=1}^{D} c_{y(i)} \\langle \\psi_i, \\theta \\rangle.$ (17)\n\n[Figure 2: three plots of NDCG (vertical axis) against k = 1, . . . , 10 (horizontal axis), one each for TD2004, TD2003 and OHSUMED. Curves: RankMatch (our method; M=2 on TD2004, M=2 (all) on TD2003, M=3 on OHSUMED), DORM, RankBoost, RankSVM, FRank, ListNet, AdaRank-MAP, AdaRank-NDCG, QBRank, IsoRank, SortNet MAP and P@10 (TD2004 and TD2003 only), StructRank and C-CRF (TD2004 and OHSUMED only).]\n\nFigure 2: Results of NDCG@k for state-of-the-art methods on TD2004 (left), TD2003 (middle) and\nOHSUMED (right). This is best viewed in color.\n\nWe now notice that if the scalar cj = c(j), where c is a non-increasing function of rank position\nj, then (17) can be solved simply by sorting the values of \u27e8\u03c8i, \u03b8\u27e9 in decreasing order.1 In other\nwords, the matching problem becomes one of ranking the values \u27e8\u03c8i, \u03b8\u27e9. Inference in our model is\ntherefore very fast (linear time).2 In this setting it makes sense to interpret the quantity \u27e8\u03c8i, \u03b8\u27e9 as a\nscore of document di for query q. 
This leaves open the question of which non-increasing function c\nshould be used. We do not solve this problem in this paper, and instead choose a fixed c. In theory\nit is possible to optimize over c during learning, but in that case the optimization problem would no\nlonger be convex. We describe the results of our method on LETOR 2.0 [14], a publicly available\nbenchmark data collection for comparing learning to rank algorithms. It comprises three data\nsets: OHSUMED, TD2003 and TD2004.\nData sets OHSUMED contains features extracted from query-document pairs in the OHSUMED\ncollection, a subset of MEDLINE, a database of medical publications. It contains 106 queries. For\neach query there are a number of associated documents, with relevance degrees judged by humans\non three levels: definitely, possibly or not relevant. Each query-document pair is associated with a\n25-dimensional feature vector, \u03c8i. The total number of query-document pairs is 16,140. TD2003\nand TD2004 contain features extracted from the topic distillation tasks of TREC 2003 and TREC\n2004, with 50 and 75 queries, respectively. Again, for each query there are a number of associated\ndocuments, with relevance degrees judged by humans, but in this case only two levels are provided:\nrelevant or not relevant. Each query-document pair is associated with a 44-dimensional feature\nvector, \u03c8i. The total number of query-document pairs is 49,171 for TD2003 and 74,170 for TD2004.\nAll datasets are already partitioned for 5-fold cross-validation. 
See [14] for more details.\nEvaluation Metrics In order to measure the effectiveness of our method we use the normalized\ndiscounted cumulative gain (NDCG) measure [11] at rank position k, which is defined as\n\n$\\mathrm{NDCG}@k = \\frac{1}{Z} \\sum_{j=1}^{k} \\frac{2^{r(j)} - 1}{\\log(1 + j)},$ (18)\n\nwhere r(j) is the relevance of the jth document in the list, and Z is a normalization constant so that\na perfect ranking yields an NDCG score of 1.\n\n1 If r(v) denotes the vector of ranks of entries of vector v, then \u27e8a, \u03c0(b)\u27e9 is maximized by the permutation\n\u03c0* such that r(a) = r(\u03c0*(b)), a theorem due to Polya, Littlewood, Hardy and Blackwell [26].\n2 Sorting the top k items of a list of D items takes O(k log k + D) time [17].\n\nTable 1: Training times (per observation, in seconds, Intel Core2 2.4GHz) for the exponential model\nand max-margin. Runtimes for M = 3, 4, 5 are from the ranking experiments, computed by full\nenumeration; M = 20 corresponds to the image matching experiments, which use the sampler from\n[10]. A problem of size 20 cannot be practically solved by full enumeration.\n\nM     exponential model    max margin\n3     0.0006661            0.0008965\n4     0.0011277            0.0016086\n5     0.0030187            0.0015328\n20    36.0300000           0.9334556\n\nExternal Parameters The regularization constant \u03bb is chosen by 5-fold cross-validation, with the\npartition provided by the LETOR package. All experiments are repeated 5 times to account for the\nrandomness of the sampling of the training data. We use c(j) = M \u2212 j on all experiments.\nOptimization To optimize (8) we use a standard BFGS Quasi-Newton method with a backtracking\nline search, as described in [21].\nResults For the first experiment training was done on subsets sampled as described above, where\nfor each query qk we sampled 0.4 \u00b7 D(k) \u00b7 M subsets, therefore increasing the number of samples\nlinearly with M. 
For TD2003 we also trained with all possible subsets (M = 2(all) in the plots).\nIn Figure 2 we plot the results of our method (named RankMatch), for M = R, compared to those\nachieved by a number of state-of-the-art methods which have published NDCG scores in at least two\nof the datasets: RankBoost [6], RankSVM [7], FRank [28], ListNet [5], AdaRank [32], QBRank\n[34], IsoRank [33], SortNet [24], StructRank [9] and C-CRF [23]. We also included a plot of our\nimplementation of DORM [13], using precisely the same resampling methodology and data for a\nfair comparison. RankMatch performs among the best methods on both TD2004 and OHSUMED,\nwhile on TD2003 it performs poorly (for low k) or fairly well (for high k).\nWe notice that there are four methods which only report results in two of the three datasets: the\ntwo SortNet versions are only reported on TD2003 and TD2004, while StructRank and C-CRF\nare only reported on TD2004 and OHSUMED. RankMatch compares similarly with SortNet and\nStructRank on TD2004, similarly to C-CRF and StructRank on OHSUMED and similarly to the\ntwo versions of SortNet on TD2003. This exhausts all the comparisons against the methods which\nhave results reported in only two datasets. A fairer comparison could be made if these methods had\ntheir performance published for the respective missing dataset.\nWhen compared to the methods which report results in all datasets, RankMatch entirely dominates\ntheir performance on TD2004 and is second only to IsoRank on OHSUMED.\nThese results should be interpreted cautiously; [20] presents an interesting discussion about issues\nwith these datasets. Also, benchmarking of ranking algorithms is still in its infancy and we don\u2019t yet\nhave publicly available code for all of the competitive methods. 
We expect this situation to change\nin the near future so that we are able to compare them on a fair and transparent basis.\nConsistency In a second experiment we trained RankMatch with different training subset sizes,\nstarting with 0.03\u00b7D(k)\u00b7M and going up to 1.0\u00b7D(k)\u00b7M. Once again, we repeated the experiments\nwith DORM using precisely the same training subsets. The purpose here is to see whether we\nobserve a practical advantage of our method with increasing sample size, since statistical consistency\nonly provides an asymptotic indication. The results are plotted in Figure 3-right, where we can see\nthat, as more training data is available, RankMatch improves more saliently than DORM.\nRuntime The runtime of our algorithm is competitive with that of max-margin for small graphs, such\nas those that arise from the ranking application. For larger graphs, the use of the sampling algorithm\nwill result in much slower runtimes than those typically obtained in the max-margin framework.\nThis is certainly the bene\ufb01t of the max-margin matching formulations of [4, 13]: it is much faster\nfor large graphs. Table 1 shows the runtimes for graphs of different sizes, for both estimators.\n\nImage Matching\n\n5.2\nFor our computer vision application we used a silhouette image from the Mythological Creatures\n2D database3. We randomly selected 20 points on the silhouette as our interest points and applied\n\n3http://tosca.cs.technion.ac.il\n\n7\n\n\fshear to the image creating 200 different images. 
We then randomly selected N pairs of images for\ntraining, N for validation and 500 for testing, and trained our model to match the interest points\nin the pairs \u2013 that is, given two images with corresponding points, we computed descriptors for\neach pair i, j of points (one from each image) and learned \u03b8 such that the solution to the matching\nproblem (2) with the weights set to w_ij = \u27e8x_ij, \u03b8\u27e9 best matches the expected solution that a human\nwould manually provide. In this setup,\n\nx_ij = |\u03c8_i \u2212 \u03c8_j|^2,    (19)\n\nwhere | \u00b7 | denotes the elementwise difference and \u03c8_i is the Shape Context feature vector [1] for point i.\nFor a graph of this size computing the exact expectation is not feasible, so we used the sampling\nmethod described in Section 4.3. Once again, the regularization constant \u03bb was chosen by cross-validation.\nGiven the fact that the MAP estimator is consistent while the max-margin estimator is\nnot, one is tempted to investigate the practical performance of both estimators as the sample size\ngrows. However, since consistency is only an asymptotic property, and also since the Hamming\nloss is not the criterion optimized by either estimator, this does not imply a better large-sample\nperformance of MAP in real experiments. In any case, we present results with varying training set\nsizes in Figure 3-left. The max-margin method is that of [4]. After a sufficiently large training set\nsize, our model seems to enjoy a slight advantage.\n\n[Figure 3 plots. Left: error versus number of training pairs, for the exponential model and max margin. Right: NDCG@1 on OHSUMED versus sample size (x M D), for RankMatch and DORM.]\n\nFigure 3: Performance with increasing sample size. 
Left: Hamming loss for different numbers of\ntraining pairs in the image matching problem (test set size fixed to 500 pairs). Right: results of\nNDCG@1 on the ranking dataset OHSUMED. This evidence is in agreement with the fact that our\nestimator is consistent, while max-margin is not.\n\n6 Conclusion and Discussion\n\nWe presented a method for learning max-weight bipartite matching predictors, and applied it extensively\nto well-known document ranking datasets, obtaining state-of-the-art results. We also\nillustrated, with an image matching application, that larger problems can also be solved, albeit\nslowly, with a recently developed sampler. The method has a number of convenient features. First,\nit consists of performing maximum-a-posteriori estimation in an exponential family model, which\nresults in a simple unconstrained convex optimization problem solvable by standard algorithms such\nas BFGS. Second, the estimator is not only statistically consistent but also seems in practice to\nbenefit more from increasing sample sizes than its max-margin alternative. Finally, being fully probabilistic,\nthe model can be easily integrated as a module in a Bayesian framework, for example. The\nmain direction for future research consists of finding more efficient ways to solve large problems.\nThis will most likely arise from appropriate exploitation of data sparsity in the permutation group.\n\nFigure 2: Learning image matching. Left: Hamming loss for different numbers of training pairs (test\nset size fixed to 500 pairs). Right: an example match from the test set (blue are correct and red\nincorrect matches).\n\nretrieval, recommender systems, product rating and others. 
We are going to focus on web page\n\nFor this problem we are given a set of queries {q_k} and, for each query q_k, a list of D(k) documents\n{d^k_1, . . . , d^k_D(k)} with corresponding ratings {r^k_1, . . . , r^k_D(k)} (assigned by a human editor), measuring\nthe relevance degree of each document with respect to query q_k. A rating or relevance degree is\nusually a nominal value in the list {1, . . . , R}, where R is typically between 2 and 5. We are also\ngiven, for every retrieved document d^k_i, a joint feature vector \u03c8^k_i for that document and the query\n\nTraining At training time, we model each query q_k as a vector-weighted bipartite graph (Figure\n1) where the nodes on one side correspond to a subset of cardinality M of all D(k) documents\n\nReferences\n[1] Belongie, S., & Malik, J. (2000). Matching with shape contexts. CBAIVL00.\n[2] Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape contexts. IEEE Trans. on PAMI, 24, 509\u2013521.\n[3] Burges, C. J. C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005). Learning to rank using gradient descent. ICML.\n[4] Caetano, T. S., Cheng, L., Le, Q. V., & Smola, A. J. (2009). Learning graph matching. IEEE Trans. on PAMI, 31, 1048\u20131058.\n[5] Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., & Li, H. (2007). Learning to rank: from pairwise approach to listwise approach. ICML.\n[6] Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res., 4, 933\u2013969.\n[7] Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers.\n[8] Huang, B., & Jebara, T. (2007). Loopy belief propagation for bipartite maximum weight b-matching. AISTATS.\n[9] Huang, J. C., & Frey, B. J. (2008). Structured ranking learning using cumulative distribution networks. 
In NIPS.\n[10] Huber, M., & Law, J. (2008). Fast approximation of the permanent for very dense problems. SODA.\n[11] Jarvelin, K., & Kekalainen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20, 2002.\n[12] Lafferty, J. D., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic modeling for segmenting and labeling sequence data. ICML.\n[13] Le, Q., & Smola, A. (2007). Direct optimization of ranking measures. http://arxiv.org/abs/0704.3359.\n[14] Liu, T.-Y., Xu, J., Qin, T., Xiong, W., & Li, H. (2007). Letor: Benchmark dataset for research on learning to rank for information retrieval. LR4IR.\n[15] Liu, Y., & Shen, X. (2005). Multicategory \u03c8-learning and support vector machine: Computational tools. J. Computational and Graphical Statistics, 14, 219\u2013236.\n[16] Liu, Y., & Shen, X. (2006). Multicategory \u03c8-learning. JASA, 101, 500\u2013509.\n[17] Martinez, C. (2004). Partial quicksort. SIAM.\n[18] McAllester, D. (2007). Generalization bounds and consistency for structured labeling. Predicting Structured Data.\n[19] Minc, H. (1978). Permanents. Addison-Wesley.\n[20] Minka, T., & Robertson, S. (2008). Selection bias in the letor datasets. LR4IR.\n[21] Nocedal, J., & Wright, S. J. (1999). Numerical optimization. Springer Series in Operations Research. Springer.\n[22] Papadimitriou, C. H., & Steiglitz, K. (1982). Combinatorial optimization: Algorithms and complexity. New Jersey: Prentice-Hall.\n[23] Qin, T., Liu, T.-Y., Zhang, X.-D., Wang, D.-S., & Li, H. (2009). Global ranking using continuous conditional random fields. NIPS.\n[24] Rigutini, L., Papini, T., Maggini, M., & Scarselli, F. (2008). Sortnet: Learning to rank by a neural-based sorting algorithm. LR4IR.\n[25] Ryser, H. J. (1963). Combinatorial mathematics. The Carus Mathematical Monographs, No. 14, Mathematical Association of America.\n[26] Sherman, S. (1951). 
On a Theorem of Hardy, Littlewood, Polya, and Blackwell. Proceedings of the National Academy of Sciences, 37, 826\u2013831.\n[27] Taskar, B. (2004). Learning structured prediction models: a large-margin approach. Doctoral dissertation, Stanford University.\n[28] Tsai, M., Liu, T., Qin, T., Chen, H., & Ma, W. (2007). Frank: A ranking method with fidelity loss. SIGIR.\n[29] Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. JMLR, 6, 1453\u20131484.\n[30] Valiant, L. G. (1979). The complexity of computing the permanent. Theor. Comput. Sci. (pp. 189\u2013201).\n[31] Wainwright, M. J., & Jordan, M. I. (2003). Graphical models, exponential families, and variational inference (Technical Report 649). UC Berkeley, Department of Statistics.\n[32] Xu, J., & Li, H. (2007). Adarank: a boosting algorithm for information retrieval. SIGIR.\n[33] Zheng, Z., Zha, H., & Sun, G. (2008a). Query-level learning to rank using isotonic regression. LR4IR.\n[34] Zheng, Z., Zha, H., Zhang, T., Chapelle, O., Chen, K., & Sun, G. (2008b). A general boosting method and its application to learning ranking functions for web search. NIPS.\n", "award": [], "sourceid": 197, "authors": [{"given_name": "James", "family_name": "Petterson", "institution": null}, {"given_name": "Jin", "family_name": "Yu", "institution": null}, {"given_name": "Julian", "family_name": "Mcauley", "institution": null}, {"given_name": "Tib\u00e9rio", "family_name": "Caetano", "institution": null}]}