{"title": "Max-Margin Markov Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 25, "page_last": 32, "abstract": "", "full_text": "Max-Margin Markov Networks\n\nBen Taskar Carlos Guestrin Daphne Koller\n\nfbtaskar,guestrin,kollerg@cs.stanford.edu\n\nStanford University\n\nAbstract\n\nIn typical classi\ufb01cation tasks, we seek a function which assigns a label to a sin-\ngle object. Kernel-based approaches, such as support vector machines (SVMs),\nwhich maximize the margin of con\ufb01dence of the classi\ufb01er, are the method of\nchoice for many such tasks. Their popularity stems both from the ability to\nuse high-dimensional feature spaces, and from their strong theoretical guaran-\ntees. However, many real-world tasks involve sequential, spatial, or structured\ndata, where multiple labels must be assigned. Existing kernel-based methods ig-\nnore structure in the problem, assigning labels independently to each object, los-\ning much useful information. Conversely, probabilistic graphical models, such\nas Markov networks, can represent correlations between labels, by exploiting\nproblem structure, but cannot handle high-dimensional feature spaces, and lack\nstrong theoretical generalization guarantees.\nIn this paper, we present a new\nframework that combines the advantages of both approaches: Maximum mar-\ngin Markov (M3) networks incorporate both kernels, which ef\ufb01ciently deal with\nhigh-dimensional features, and the ability to capture correlations in structured\ndata. We present an ef\ufb01cient algorithm for learning M3 networks based on a\ncompact quadratic program formulation. We provide a new theoretical bound\nfor generalization in structured domains. 
Experiments on the tasks of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches.\n\n1 Introduction\n\nIn supervised classification, our goal is to classify instances into some set of discrete categories. Recently, support vector machines (SVMs) have demonstrated impressive successes on a broad range of tasks, including document categorization, character recognition, image classification, and many more. SVMs owe a great part of their success to their ability to use kernels, allowing the classifier to exploit a very high-dimensional (possibly even infinite-dimensional) feature space. In addition to their empirical success, SVMs are also appealing due to the existence of strong generalization guarantees, derived from the margin-maximizing properties of the learning algorithm.\n\nHowever, many supervised learning tasks exhibit much richer structure than a simple categorization of instances into one of a small number of classes. In some cases, we might need to label a set of inter-related instances. For example: optical character recognition (OCR) and part-of-speech tagging both involve labeling every element of an entire sequence with one of some number of classes; image segmentation involves labeling all of the pixels in an image; and collective webpage classification involves labeling an entire set of interlinked webpages. In other cases, we might want to label an instance (e.g., a news article) with multiple non-exclusive labels. In both of these cases, we need to assign multiple labels simultaneously, leading to a classification problem that has an exponentially large set of joint labels. A common solution is to treat such problems as a set of independent classification tasks, dealing with each instance in isolation. 
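The independent-classification baseline just described can be sketched in a few lines: each position is labeled by its own best class, with no regard for its neighbors. The scores and the helper name `predict_independently` are hypothetical toy values for illustration, not anything from the paper:

```python
def predict_independently(scores):
    """Label each position by its own highest-scoring class.

    scores: list over positions of per-class score lists (toy values,
    e.g. from separately trained per-position classifiers).
    Neighboring labels never influence each other here.
    """
    return [max(range(len(s)), key=lambda c: s[c]) for s in scores]

# Three positions, two classes; each decision is made in isolation.
labels = predict_independently([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]])
```

This is exactly the setup that discards correlation information between neighboring labels, which motivates the joint models that follow.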
However, it is well-known that this approach\nfails to exploit signi\ufb01cant amounts of correlation information [7].\n\nAn alternative approach is offered by the probabilistic framework, and speci\ufb01cally by\nprobabilistic graphical models. In this case, we can de\ufb01ne and learn a joint probabilistic\nmodel over the set of label variables. For example, we can learn a hidden Markov model,\nor a conditional random \ufb01eld (CRF) [7] over the labels and features of a sequence, and\nthen use a probabilistic inference algorithm (such as the Viterbi algorithm) to classify these\ninstances collectively, \ufb01nding the most likely joint assignment to all of the labels simultane-\nously. This approach has the advantage of exploiting the correlations between the different\nlabels, often resulting in signi\ufb01cant improvements in accuracy over approaches that classify\ninstances independently [7, 10]. The use of graphical models also allows problem structure\nto be exploited very effectively. Unfortunately, even probabilistic graphical models that are\ntrained discriminatively do not usually achieve the same level of generalization accuracy\nas SVMs, especially when kernel features are used. Moreover, they are not (yet) associated\nwith generalization bounds comparable to those of margin-based classi\ufb01ers.\n\nClearly, the frameworks of kernel-based and probabilistic classi\ufb01ers offer complemen-\ntary strengths and weaknesses. In this paper, we present maximum margin Markov (M3)\nnetworks, which unify the two frameworks, and combine the advantages of both. Our ap-\nproach de\ufb01nes a log-linear Markov network over a set of label variables (e.g., the labels\nof the letters in an OCR problem); this network allows us to represent the correlations be-\ntween these label variables. We then de\ufb01ne a margin-based optimization problem for the\nparameters of this model. 
For Markov networks that can be triangulated tractably, the resulting quadratic program (QP) has an equivalent polynomial-size formulation (e.g., linear for sequences) that allows a very effective solution. By contrast, previous margin-based formulations for sequence labeling [3, 1] require an exponential number of constraints. For non-triangulated networks, we provide an approximate reformulation based on the relaxation used by belief propagation algorithms [8, 12]. Importantly, the resulting QP supports the same kernel trick as do SVMs, allowing probabilistic graphical models to inherit the important benefits of kernels. We also show a generalization bound for such margin-based classifiers. Unlike previous results [3], our bound grows logarithmically rather than linearly with the number of label variables. Our experimental results on character recognition and on hypertext classification demonstrate dramatic improvements in accuracy over both kernel-based instance-by-instance classification and probabilistic models.\n\n2 Structure in classification problems\n\nIn supervised classification, the task is to learn a function $h : \mathcal{X} \mapsto \mathcal{Y}$ from a set of $m$ i.i.d. instances $S = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)} = t(\mathbf{x}^{(i)}))\}_{i=1}^{m}$, drawn from a fixed distribution $D_{\mathcal{X} \times \mathcal{Y}}$. The classification function $h$ is typically selected from some parametric family $\mathcal{H}$. A common choice is the linear family: given $n$ real-valued basis functions $f_j : \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}$, a hypothesis $h_{\mathbf{w}} \in \mathcal{H}$ is defined by a set of $n$ coefficients $w_j$ such that:\n\n$$h_{\mathbf{w}}(\mathbf{x}) = \arg\max_{\mathbf{y}} \sum_{j=1}^{n} w_j f_j(\mathbf{x}, \mathbf{y}) = \arg\max_{\mathbf{y}} \mathbf{w}^{\top} \mathbf{f}(\mathbf{x}, \mathbf{y}), \qquad (1)$$\n\nwhere the $f(\mathbf{x}, \mathbf{y})$ are features or basis functions.\n\nThe most common classification setting, single-label classification, takes $\mathcal{Y} = \{y_1, \ldots, y_k\}$. In this paper, we consider the much more general setting of multi-label classification, where $\mathcal{Y} = \mathcal{Y}_1 \times \ldots \times \mathcal{Y}_l$ with $\mathcal{Y}_i = \{y_1, \ldots, y_k\}$. In an OCR task, for example, each $\mathcal{Y}_i$ is a character, while $\mathcal{Y}$ is a full word. In a webpage collective classification task [10], each $\mathcal{Y}_i$ is a webpage label, whereas $\mathcal{Y}$ is a joint label for an entire website. In these cases, the number of possible assignments to $\mathcal{Y}$ is exponential in the number of labels $l$. Thus, both representing the basis functions $f_j(\mathbf{x}, \mathbf{y})$ in (1) and computing the maximization $\arg\max_{\mathbf{y}}$ are infeasible.\n\nAn alternative approach is based on the framework of probabilistic graphical models. In this case, the model defines (directly or indirectly) a conditional distribution $P(\mathcal{Y} \mid \mathcal{X})$. We can then select the label $\arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x})$. The advantage of the probabilistic framework is that it can exploit sparseness in the correlations between labels $Y_i$. For example, in the OCR task, we might use a Markov model, where $Y_i$ is conditionally independent of the rest of the labels given $Y_{i-1}, Y_{i+1}$.\n\nWe can encode this structure using a Markov network. In this paper, purely for simplicity of presentation, we focus on the case of pairwise interactions between labels. We emphasize that our results extend easily to the general case. A pairwise Markov network is defined as a graph $G = (\mathcal{Y}, E)$, where each edge $(i, j)$ is associated with a potential function $\psi_{ij}(\mathbf{x}, y_i, y_j)$. The network encodes a joint conditional probability distribution as $P(\mathbf{y} \mid \mathbf{x}) \propto \prod_{(i,j) \in E} \psi_{ij}(\mathbf{x}, y_i, y_j)$. 
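The product of potentials above is easiest to see in log-space: the unnormalized log-probability of a labeling is just a sum of edge terms. A minimal sketch for a chain, where each edge $(i, i+1)$ carries a potential table; the table values are made up purely for illustration:

```python
import math

def joint_log_score(psi, y):
    """Unnormalized log P(y|x) for a chain-structured pairwise network.

    psi[i][(a, b)] is the potential psi_{i,i+1}(x, y_i=a, y_{i+1}=b);
    the values are toy numbers, not learned potentials.
    """
    return sum(math.log(psi[i][(y[i], y[i + 1])]) for i in range(len(y) - 1))

# One edge over two binary labels: agreeing labels get potential 2.0.
psi = [{(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}]
score = joint_log_score(psi, (0, 0))
```

Normalizing these scores over all labelings would recover $P(\mathbf{y} \mid \mathbf{x})$, but for classification only the $\arg\max$ matters, so the normalizer can be ignored.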
These networks exploit the interaction structure to parameterize a classifier very compactly. In many cases (e.g., tree-structured networks), we can use effective dynamic programming algorithms (such as the Viterbi algorithm) to find the highest probability label $\mathbf{y}$; in others, we can use approximate inference algorithms that also exploit the structure [12].\n\nThe Markov network distribution is simply a log-linear model, with the pairwise potential $\psi_{ij}(\mathbf{x}, y_i, y_j)$ representing (in log-space) a sum of basis functions over $\mathbf{x}, y_i, y_j$. We can therefore parameterize such a model using a set of pairwise basis functions $\mathbf{f}(\mathbf{x}, y_i, y_j)$ for $(i, j) \in E$. We assume for simplicity of notation that all edges in the graph denote the same type of interaction, so that we can define a set of features\n\n$$f_k(\mathbf{x}, \mathbf{y}) = \sum_{(i,j) \in E} f_k(\mathbf{x}, y_i, y_j). \qquad (2)$$\n\nThe network potentials are then $\psi_{ij}(\mathbf{x}, y_i, y_j) = \exp\left[\sum_{k=1}^{n} w_k f_k(\mathbf{x}, y_i, y_j)\right] = \exp\left[\mathbf{w}^{\top} \mathbf{f}(\mathbf{x}, y_i, y_j)\right]$.\n\nThe parameters $\mathbf{w}$ in a log-linear model can be trained to fit the data, typically by maximizing the likelihood or conditional likelihood (e.g., [7, 10]). This paper presents an algorithm for selecting $\mathbf{w}$ that maximizes the margin, gaining all of the advantages of SVMs.\n\n3 Margin-based structured classification\n\nFor a single-label binary classification problem, support vector machines (SVMs) [11] provide an effective method of learning a maximum-margin decision boundary. For single-label multi-class classification, Crammer and Singer [5] provide a natural extension of this framework by maximizing the margin $\gamma$ subject to constraints:\n\n$$\text{maximize } \gamma \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1; \; \mathbf{w}^{\top} \Delta \mathbf{f}_{\mathbf{x}}(\mathbf{y}) \ge \gamma, \; \forall \mathbf{x} \in S, \; \forall \mathbf{y} \ne t(\mathbf{x}), \qquad (3)$$\n\nwhere $\Delta \mathbf{f}_{\mathbf{x}}(\mathbf{y}) = \mathbf{f}(\mathbf{x}, t(\mathbf{x})) - \mathbf{f}(\mathbf{x}, \mathbf{y})$. The constraints in this formulation ensure that $\arg\max_{\mathbf{y}} \mathbf{w}^{\top} \mathbf{f}(\mathbf{x}, \mathbf{y}) = t(\mathbf{x})$. 
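For chain-structured networks, the $\arg\max_{\mathbf{y}}$ appearing in these constraints is computed by the Viterbi-style max-sum dynamic program mentioned above. A minimal sketch, with toy node and edge scores standing in for the learned quantities $\mathbf{w}^{\top}\mathbf{f}$ (all score values are invented for illustration):

```python
def viterbi(node_scores, edge_scores):
    """Max-sum dynamic programming for the chain argmax.

    node_scores: list over positions of {label: score};
    edge_scores: {(prev_label, label): score}, shared across edges.
    Both are toy stand-ins for w'f terms.
    """
    # best[i][y]: score of the best prefix ending with label y at position i.
    best = [dict(node_scores[0])]
    back = []
    for i in range(1, len(node_scores)):
        cur, bp = {}, {}
        for y, s in node_scores[i].items():
            prev = max(best[-1], key=lambda p: best[-1][p] + edge_scores[(p, y)])
            cur[y] = best[-1][prev] + edge_scores[(prev, y)] + s
            bp[y] = prev
        best.append(cur)
        back.append(bp)
    # Trace back the highest-scoring labeling.
    y = max(best[-1], key=best[-1].get)
    path = [y]
    for bp in reversed(back):
        y = bp[y]
        path.append(y)
    return list(reversed(path))
```

The recursion visits each position once, so decoding is linear in the sequence length and quadratic in the number of classes per label.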
Maximizing $\gamma$ magnifies the difference between the value of the true label and the best runner-up, increasing the \u201cconfidence\u201d of the classification.\n\nIn structured problems, where we are predicting multiple labels, the loss function is usually not the simple 0-1 loss $\mathbf{I}(\arg\max_{\mathbf{y}} \mathbf{w}^{\top} \mathbf{f}_{\mathbf{x}}(\mathbf{y}) \ne t(\mathbf{x}))$, but a per-label loss, such as the proportion of incorrect labels predicted. In order to extend the margin-based framework to the multi-label setting, we must generalize the notion of margin to take into account the number of labels in $\mathbf{y}$ that are misclassified. In particular, we would like the margin between $t(\mathbf{x})$ and $\mathbf{y}$ to scale linearly with the number of wrong labels in $\mathbf{y}$, $\Delta t_{\mathbf{x}}(\mathbf{y})$:\n\n$$\text{maximize } \gamma \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1; \; \mathbf{w}^{\top} \Delta \mathbf{f}_{\mathbf{x}}(\mathbf{y}) \ge \gamma \, \Delta t_{\mathbf{x}}(\mathbf{y}), \; \forall \mathbf{x} \in S, \; \forall \mathbf{y}, \qquad (4)$$\n\nwhere $\Delta t_{\mathbf{x}}(\mathbf{y}) = \sum_{i=1}^{l} \Delta t_{\mathbf{x}}(y_i)$ and $\Delta t_{\mathbf{x}}(y_i) \equiv \mathbf{I}(y_i \ne (t(\mathbf{x}))_i)$. Now, using a standard transformation to eliminate $\gamma$, we get a quadratic program (QP):\n\n$$\text{minimize } \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{s.t.} \quad \mathbf{w}^{\top} \Delta \mathbf{f}_{\mathbf{x}}(\mathbf{y}) \ge \Delta t_{\mathbf{x}}(\mathbf{y}), \; \forall \mathbf{x} \in S, \; \forall \mathbf{y}. \qquad (5)$$\n\nUnfortunately, the data is often not separable by a hyperplane defined over the space of the given set of features. In such cases, we need to introduce slack variables $\xi_{\mathbf{x}}$ to allow some constraints to be violated. 
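The Hamming-style count $\Delta t_{\mathbf{x}}(\mathbf{y})$ and the slack a competing labeling would require under the constraints in (5) are both simple to compute. A small sketch; the score numbers are toy values and `margin_violation` is a hypothetical helper, not part of the paper's training algorithm:

```python
def hamming(y_true, y_pred):
    """Per-label error Delta t_x(y): number of positions labeled wrong."""
    return sum(a != b for a, b in zip(y_true, y_pred))

def margin_violation(score_true, score_y, y_true, y):
    """Slack needed so that w'f(x, t(x)) - w'f(x, y) >= Delta t_x(y).

    score_true and score_y stand in for w'f values (toy numbers).
    Returns 0 when the Hamming-scaled margin constraint already holds.
    """
    return max(0.0, hamming(y_true, y) - (score_true - score_y))

# A labeling wrong in two positions must trail the truth by at least 2.
needed_slack = margin_violation(5.0, 4.0, "abc", "axd")
```

Summing such violations over competing labelings mirrors the role the slack variables play once they enter the objective.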
We can now present the complete form of our optimization problem, as well as the equivalent dual problem [2]:\n\nPrimal formulation:\n\n$$\min \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{\mathbf{x}} \xi_{\mathbf{x}} \quad \text{s.t.} \quad \mathbf{w}^{\top} \Delta \mathbf{f}_{\mathbf{x}}(\mathbf{y}) \ge \Delta t_{\mathbf{x}}(\mathbf{y}) - \xi_{\mathbf{x}}, \; \forall \mathbf{x}, \mathbf{y}. \qquad (6)$$\n\nDual formulation:\n\n$$\max \; \sum_{\mathbf{x}, \mathbf{y}} \alpha_{\mathbf{x}}(\mathbf{y}) \Delta t_{\mathbf{x}}(\mathbf{y}) - \frac{1}{2} \left\| \sum_{\mathbf{x}, \mathbf{y}} \alpha_{\mathbf{x}}(\mathbf{y}) \Delta \mathbf{f}_{\mathbf{x}}(\mathbf{y}) \right\|^2 \quad \text{s.t.} \quad \sum_{\mathbf{y}} \alpha_{\mathbf{x}}(\mathbf{y}) = C, \; \forall \mathbf{x}; \quad \alpha_{\mathbf{x}}(\mathbf{y}) \ge 0, \; \forall \mathbf{x}, \mathbf{y}. \qquad (7)$$\n\n(Note: for each $\mathbf{x}$, we add an extra dual variable $\alpha_{\mathbf{x}}(t(\mathbf{x}))$, with no effect on the solution.)\n\n4 Exploiting structure in M3 networks\n\nUnfortunately, both the number of constraints in the primal QP in (6), and the number of variables in the dual QP in (7), are exponential in the number of labels $l$. In this section, we present an equivalent, polynomially-sized, formulation.\n\nOur main insight is that the variables $\alpha_{\mathbf{x}}(\mathbf{y})$ in the dual formulation (7) can be interpreted as a density function over $\mathbf{y}$ conditional on $\mathbf{x}$, as $\sum_{\mathbf{y}} \alpha_{\mathbf{x}}(\mathbf{y}) = C$ and $\alpha_{\mathbf{x}}(\mathbf{y}) \ge 0$. The dual objective is a function of expectations of $\Delta t_{\mathbf{x}}(\mathbf{y})$ and $\Delta \mathbf{f}_{\mathbf{x}}(\mathbf{y})$ with respect to $\alpha_{\mathbf{x}}(\mathbf{y})$. Since both $\Delta t_{\mathbf{x}}(\mathbf{y}) = \sum_{i} \Delta t_{\mathbf{x}}(y_i)$ and $\Delta \mathbf{f}_{\mathbf{x}}(\mathbf{y}) = \sum_{(i,j)} \Delta \mathbf{f}_{\mathbf{x}}(y_i, y_j)$ are sums of functions over nodes and edges, we only need node and edge marginals of the measure $\alpha_{\mathbf{x}}(\mathbf{y})$ to compute their expectations. We define the marginal dual variables as follows:\n\n$$\mu_{\mathbf{x}}(y_i, y_j) = \sum_{\mathbf{y} \sim [y_i, y_j]} \alpha_{\mathbf{x}}(\mathbf{y}), \;\; \forall (i,j) \in E, \; \forall y_i, y_j, \; \forall \mathbf{x}; \qquad \mu_{\mathbf{x}}(y_i) = \sum_{\mathbf{y} \sim [y_i]} \alpha_{\mathbf{x}}(\mathbf{y}), \;\; \forall i, \; \forall y_i, \; \forall \mathbf{x}; \qquad (8)$$\n\nwhere $\mathbf{y} \sim [y_i, y_j]$ denotes a full assignment $\mathbf{y}$ consistent with the partial assignment $y_i, y_j$.\n\nNow we can reformulate our entire QP (7) in terms of these dual variables. 
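The marginal dual variables in (8) can be illustrated directly: given a toy joint measure $\alpha_{\mathbf{x}}$ over full labelings, the sketch below computes node and edge marginals by summing mass over consistent full assignments, and checks that summing an edge marginal over one endpoint recovers the corresponding node marginal. All numbers are made up for illustration:

```python
from itertools import product

def node_marginal(alpha, i, yi):
    """mu_x(y_i): total dual mass on full labelings consistent with y_i.

    alpha maps full labelings (tuples) to their dual mass; toy values.
    """
    return sum(a for y, a in alpha.items() if y[i] == yi)

def edge_marginal(alpha, i, j, yi, yj):
    """mu_x(y_i, y_j): mass on labelings consistent with the pair (y_i, y_j)."""
    return sum(a for y, a in alpha.items() if y[i] == yi and y[j] == yj)

# Toy alpha over two binary labels, total mass C = 1.0.
alpha = {y: w for y, w in zip(product([0, 1], repeat=2), [0.5, 0.2, 0.2, 0.1])}
# Summing the edge marginal over y_0 recovers the node marginal of y_1.
lhs = sum(edge_marginal(alpha, 0, 1, y0, 1) for y0 in [0, 1])
consistent = abs(lhs - node_marginal(alpha, 1, 1)) < 1e-9
```

Marginals obtained from a genuine measure satisfy this consistency automatically; the point of the reformulation below is to impose it as an explicit constraint.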
Consider, for example, the first term in the objective function:\n\n$$\sum_{\mathbf{y}} \alpha_{\mathbf{x}}(\mathbf{y}) \Delta t_{\mathbf{x}}(\mathbf{y}) = \sum_{\mathbf{y}} \sum_{i} \alpha_{\mathbf{x}}(\mathbf{y}) \Delta t_{\mathbf{x}}(y_i) = \sum_{i, y_i} \Delta t_{\mathbf{x}}(y_i) \sum_{\mathbf{y} \sim [y_i]} \alpha_{\mathbf{x}}(\mathbf{y}) = \sum_{i, y_i} \mu_{\mathbf{x}}(y_i) \Delta t_{\mathbf{x}}(y_i).$$\n\nThe decomposition of the second term in the objective uses edge marginals $\mu_{\mathbf{x}}(y_i, y_j)$. In order to produce an equivalent QP, however, we must also ensure that the dual variables $\mu_{\mathbf{x}}(y_i, y_j), \mu_{\mathbf{x}}(y_i)$ are the marginals resulting from a legal density $\alpha(\mathbf{y})$; that is, that they belong to the marginal polytope [4]. In particular, we must enforce consistency between the pairwise and singleton marginals (and hence between overlapping pairwise marginals):\n\n$$\sum_{y_i} \mu_{\mathbf{x}}(y_i, y_j) = \mu_{\mathbf{x}}(y_j), \; \forall y_j, \; \forall (i,j) \in E, \; \forall \mathbf{x}. \qquad (9)$$\n\nIf the Markov network for our basis functions is a forest (singly connected), these constraints are equivalent to the requirement that the $\mu$ variables arise from a density. Therefore, the following factored dual QP is equivalent to the original dual QP:\n\n$$\max \; \sum_{\mathbf{x}} \sum_{i, y_i} \mu_{\mathbf{x}}(y_i) \Delta t_{\mathbf{x}}(y_i) - \frac{1}{2} \sum_{\mathbf{x}, \hat{\mathbf{x}}} \sum_{(i,j), \, y_i, y_j} \sum_{(r,s), \, y_r, y_s} \mu_{\mathbf{x}}(y_i, y_j) \, \mu_{\hat{\mathbf{x}}}(y_r, y_s) \, \mathbf{f}_{\mathbf{x}}(y_i, y_j)^{\top} \mathbf{f}_{\hat{\mathbf{x}}}(y_r, y_s)$$\n$$\text{s.t.} \quad \sum_{y_i} \mu_{\mathbf{x}}(y_i, y_j) = \mu_{\mathbf{x}}(y_j); \quad \sum_{y_i} \mu_{\mathbf{x}}(y_i) = C; \quad \mu_{\mathbf{x}}(y_i, y_j) \ge 0. \qquad (10)$$\n\nSimilarly, the original primal can be factored as follows:\n\n$$\min \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{\mathbf{x}} \sum_{i} \xi_{\mathbf{x},i} + C \sum_{\mathbf{x}} \sum_{(i,j)} \xi_{\mathbf{x},ij}$$\n$$\text{s.t.} \quad \mathbf{w}^{\top} \Delta \mathbf{f}_{\mathbf{x}}(y_i, y_j) + \sum_{(i',j):\, i' \ne i} m_{\mathbf{x},i'}(y_j) + \sum_{(j',i):\, j' \ne j} m_{\mathbf{x},j'}(y_i) \ge -\xi_{\mathbf{x},ij}; \quad \sum_{(i,j)} m_{\mathbf{x},j}(y_i) \ge \Delta t_{\mathbf{x}}(y_i) - \xi_{\mathbf{x},i}; \quad \xi_{\mathbf{x},ij} \ge 0; \; \xi_{\mathbf{x},i} \ge 0. \qquad (11)$$\n\nThe solution to the factored dual gives us: $\mathbf{w} = \sum_{\mathbf{x}} \sum_{(i,j)} \sum_{y_i, y_j} \mu_{\mathbf{x}}(y_i, y_j) \Delta \mathbf{f}_{\mathbf{x}}(y_i, y_j)$.\n\nTheorem 4.1 If for each $\mathbf{x}$ the edges $E$ form a forest, then a set of weights $\mathbf{w}$ will be optimal for the QP in (6) if and only if it is optimal for the factored QP in (11).\n\nIf the underlying Markov net is not a forest, then the constraints in (9) are not sufficient to enforce the fact that the $\mu$'s are in the marginal polytope. We can address this problem by triangulating the graph, and introducing new $\eta$ LP variables that now span larger subsets of the $Y_i$'s. For example, if our graph is a 4-cycle $Y_1$-$Y_2$-$Y_3$-$Y_4$-$Y_1$, we might triangulate the graph by adding an arc $Y_1$-$Y_3$, and introducing $\eta$ variables over joint instantiations of the cliques $Y_1, Y_2, Y_3$ and $Y_1, Y_3, Y_4$. These new $\eta$ variables are used in linear equalities that constrain the original $\mu$ variables to be consistent with a density. The $\eta$ variables appear only in the constraints; they do not add any new basis functions nor change the objective function. The number of constraints introduced is exponential in the number of variables in the new cliques. Nevertheless, in many classification problems, such as sequences and other graphs with low tree-width [4], the extended QP can be solved efficiently.\n\nUnfortunately, triangulation is not feasible in highly connected problems. However, we can still solve the QP in (10) defined by an untriangulated graph with loops. Such a procedure, which enforces only local consistency of marginals, optimizes our objective only over a relaxation of the marginal polytope. In this way, our approximation is analogous to the approximate belief propagation (BP) algorithm for inference in graphical models [8]. In fact, BP makes an additional approximation, using not only the relaxed marginal polytope but also an approximate objective (Bethe free-energy) [12]. Although the approximate QP does not offer the theoretical guarantee in Theorem 4.1, the solutions are often very accurate in practice, as we demonstrate below.\n\nAs with SVMs [11], the factored dual formulation in (10) uses only dot products between basis functions. 
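Because the quadratic term in (10) touches the basis functions only through dot products $\mathbf{f}_{\mathbf{x}}(y_i, y_j)^{\top} \mathbf{f}_{\hat{\mathbf{x}}}(y_r, y_s)$, each entry can be assembled from a label-dependent scalar and a single kernel evaluation. A generic sketch; the RBF form, the bandwidth, and all inputs are illustrative assumptions, not the paper's specific feature choice:

```python
import math

def rbf(u, v, gamma=0.5):
    """A standard RBF kernel K(u, v) on raw edge inputs (toy choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def pair_term(rho1, rho2, x1, x2):
    """One dot-product entry of the quadratic term, computed as a
    label-selector product times a kernel value; rho1, rho2 are the
    label-dependent scalars and x1, x2 the raw inputs for two edges."""
    return rho1 * rho2 * rbf(x1, x2)
```

The selector contributes nothing when it is zero, so only edge pairs with compatible label assignments ever require a kernel evaluation.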
This allows us to use a kernel to define a very large (and even infinite) set of features. In particular, we define our basis functions by $\mathbf{f}_{\mathbf{x}}(y_i, y_j) = \rho(y_i, y_j) \phi_{ij}(\mathbf{x})$, i.e., the product of a selector function $\rho(y_i, y_j)$ with a possibly infinite feature vector $\phi_{ij}(\mathbf{x})$. For example, in the OCR task, $\rho(y_i, y_j)$ could be an indicator function over the class of two adjacent characters $i$ and $j$, and $\phi_{ij}(\mathbf{x})$ could be an RBF kernel on the images of these two characters. The operation $\mathbf{f}_{\mathbf{x}}(y_i, y_j)^{\top} \mathbf{f}_{\hat{\mathbf{x}}}(y_r, y_s)$ used in the objective function of the factored dual QP is now $\rho(y_i, y_j) \rho(y_r, y_s) K_{\phi}(\mathbf{x}, i, j, \hat{\mathbf{x}}, r, s)$, where $K_{\phi}(\mathbf{x}, i, j, \hat{\mathbf{x}}, r, s) = \phi_{ij}(\mathbf{x}) \cdot \phi_{rs}(\hat{\mathbf{x}})$ is the kernel function for the feature $\phi$. Even for some very complex functions $\phi$, the dot-product required to compute $K_{\phi}$ can be executed efficiently [11].\n\n5 SMO learning of M3 networks\n\nAlthough the number of variables and constraints in the factored dual in (10) is polynomial in the size of the data, the number of coefficients in the quadratic term (kernel matrix) in the objective is quadratic in the number of examples and edges in the network. Unfortunately, this matrix is often too large for standard QP solvers. Instead, we use a coordinate descent method analogous to the sequential minimal optimization (SMO) used for SVMs [9].\n\nLet us begin by considering the original dual problem (7). The SMO approach solves this QP by analytically optimizing two-variable subproblems. Recall that $\sum_{\mathbf{y}} \alpha_{\mathbf{x}}(\mathbf{y}) = C$. We can therefore take any two variables $\alpha_{\mathbf{x}}(\mathbf{y}^1), \alpha_{\mathbf{x}}(\mathbf{y}^2)$ and \u201cmove weight\u201d from one to the other, keeping the values of all other variables fixed. More precisely, we optimize for $\alpha'_{\mathbf{x}}(\mathbf{y}^1), \alpha'_{\mathbf{x}}(\mathbf{y}^2)$ such that $\alpha'_{\mathbf{x}}(\mathbf{y}^1) + \alpha'_{\mathbf{x}}(\mathbf{y}^2) = \alpha_{\mathbf{x}}(\mathbf{y}^1) + \alpha_{\mathbf{x}}(\mathbf{y}^2)$.\n\nClearly, however, we cannot perform this optimization in terms of the original dual, which is exponentially large. Fortunately, we can perform precisely the same optimization in terms of the marginal dual variables. Let $\lambda = \alpha'_{\mathbf{x}}(\mathbf{y}^1) - \alpha_{\mathbf{x}}(\mathbf{y}^1) = \alpha_{\mathbf{x}}(\mathbf{y}^2) - \alpha'_{\mathbf{x}}(\mathbf{y}^2)$. Consider a dual variable $\mu_{\mathbf{x}}(y_i, y_j)$. It is easy to see that a change from $\alpha_{\mathbf{x}}(\mathbf{y}^1), \alpha_{\mathbf{x}}(\mathbf{y}^2)$ to $\alpha'_{\mathbf{x}}(\mathbf{y}^1), \alpha'_{\mathbf{x}}(\mathbf{y}^2)$ has the following effect on $\mu_{\mathbf{x}}(y_i, y_j)$:\n\n$$\mu'_{\mathbf{x}}(y_i, y_j) = \mu_{\mathbf{x}}(y_i, y_j) + \lambda \, \mathbf{I}(y_i = y_i^1, y_j = y_j^1) - \lambda \, \mathbf{I}(y_i = y_i^2, y_j = y_j^2). \qquad (12)$$\n\nWe can solve the one-variable quadratic subproblem in $\lambda$ analytically and update the appropriate $\mu$ variables. We use inference in the network to test for optimality of the current solution (the KKT conditions [2]) and use violations from optimality as a heuristic to select the next pair $\mathbf{y}^1, \mathbf{y}^2$. We omit the details for lack of space.\n\n6 Generalization bound\n\nIn this section, we show a generalization bound for the task of multi-label classification that allows us to relate the error rate on the training set to the generalization error. As we shall see, this bound is significantly stronger than previous bounds for this problem.\n\nOur goal in multi-label classification is to maximize the number of correctly classified labels. Thus an appropriate error function is the average per-label loss $\mathcal{L}(\mathbf{w}, \mathbf{x}) = \frac{1}{l} \Delta t_{\mathbf{x}}(\arg\max_{\mathbf{y}} \mathbf{w}^{\top} \mathbf{f}_{\mathbf{x}}(\mathbf{y}))$. As in other generalization bounds for margin-based classifiers, we relate the generalization error to the margin of the classifier. In Sec. 3, we defined the notion of per-label margin, which grows with the number of mistakes between the correct assignment and the best runner-up. 
We can now define a $\gamma$-margin per-label loss:\n\n$$\mathcal{L}_{\gamma}(\mathbf{w}, \mathbf{x}) = \sup_{z: \; |z(\mathbf{y}) - \mathbf{w}^{\top} \mathbf{f}_{\mathbf{x}}(\mathbf{y})| \le \gamma \Delta t_{\mathbf{x}}(\mathbf{y}), \; \forall \mathbf{y}} \; \frac{1}{l} \Delta t_{\mathbf{x}}(\arg\max_{\mathbf{y}} z(\mathbf{y})).$$\n\nThis loss function measures the worst per-label loss on $\mathbf{x}$ made by any classifier $z$ which is perturbed from $\mathbf{w}^{\top} \mathbf{f}_{\mathbf{x}}$ by at most a $\gamma$-margin per-label. We can now prove that the generalization accuracy of any classifier is bounded by its expected $\gamma$-margin per-label loss on the training data, plus a term that grows inversely with the margin. Intuitively, the first term corresponds to the \u201cbias\u201d, as the margin $\gamma$ decreases the complexity of our hypothesis class by considering a $\gamma$-per-label margin ball around $\mathbf{w}^{\top} \mathbf{f}_{\mathbf{x}}$ and selecting one (the worst) classifier within this ball. As $\gamma$ shrinks, our hypothesis class becomes more complex, and the first term becomes smaller, but at the cost of increasing the second term, which intuitively corresponds to the \u201cvariance\u201d. Thus, the result provides a bound on the generalization error that trades off the effective complexity of the hypothesis space with the training error.\n\nTheorem 6.1 If the edge features have bounded 2-norm, $\max_{(i,j), y_i, y_j} \|\mathbf{f}_{\mathbf{x}}(y_i, y_j)\|_2 \le R_{edge}$, then for a family of hyperplanes parameterized by $\mathbf{w}$, and any $\delta > 0$, there exists a constant $K$ such that for any $\gamma > 0$ per-label margin, and $m > 1$ samples, the per-label loss is bounded by:\n\n$$E_{\mathbf{x}} \mathcal{L}(\mathbf{w}, \mathbf{x}) \le E_S \mathcal{L}_{\gamma}(\mathbf{w}, \mathbf{x}) + \sqrt{\frac{K}{m} \left[ \frac{R_{edge}^2 \|\mathbf{w}\|_2^2 \, q^2}{\gamma^2} \left[\ln m + \ln l + \ln q + \ln k\right] + \ln \frac{1}{\delta} \right]},$$\n\nwith probability at least $1 - \delta$, where $q = \max_i |\{(i,j) \in E\}|$ is the maximum edge degree in the network, $k$ is the number of classes in a label, and $l$ is the number of labels.\n\nUnfortunately, we omit the proof due to lack of space. (See a longer version of the paper at http://cs.stanford.edu/~btaskar/.) 
The proof uses a covering number argument analogous to previous results in SVMs [13]. However, we propose a novel method for covering structured problems by constructing a cover of the loss function from a cover of the individual edge basis function differences $\Delta \mathbf{f}_{\mathbf{x}}(y_i, y_j)$. This new type of cover is polynomial in the number of edges, yielding significant improvements in the bound.\n\nSpecifically, our bound has a logarithmic dependence on the number of labels ($\ln l$) and depends only on the 2-norm of the basis functions per-edge ($R_{edge}$). This is a significant gain over the previous result of Collins [3], which has a linear dependence on the number of labels ($l$), and depends on the joint 2-norm of all of the features (which is roughly $l R_{edge}$, unless each sequence is normalized separately, which is often ineffective in practice). Finally, note that if $l/m = O(1)$ (for example, in OCR, if the number of instances is at least a constant times the length of a word), then our bound is independent of the number of labels $l$. 
Such a result was, until now, an open problem for margin-based sequence classification [3].\n\n7 Experiments\n\nWe evaluate our approach on two very different tasks: a sequence model for handwriting recognition and an arbitrary topology Markov network for hypertext classification.\n\nFigure 1: (a) 3 example words from the OCR data set; (b) OCR: Average per-character test error for logistic regression, CRFs, multiclass SVMs, and M3Ns, using linear, quadratic, and cubic kernels; (c) Hypertext: Test error for multiclass SVMs, RMNs and M3Ns, by school and average.\n\nHandwriting Recognition. We selected a subset of approximately 6100 handwritten words, with an average length of approximately 8 characters, from 150 human subjects, from the data set collected by Kassel [6]. Each word was divided into characters, and each character was rasterized into an image of 16 by 8 binary pixels. (See Fig. 1(a).) In our framework, the image for each word corresponds to $\mathbf{x}$, a label of an individual character to $Y_i$, and a labeling for a complete word to $\mathbf{Y}$. Each label $Y_i$ takes values from one of 26 classes $\{a, \ldots, z\}$.\n\nThe data set is divided into 10 folds of approximately 600 training and approximately 5500 testing examples. The accuracy results, summarized in Fig. 1(b), are averages over the 10 folds. 
We im-\nplemented a selection of state-of-the-art classi\ufb01cation algorithms:\nindependent label ap-\nproaches, which do not consider the correlation between neighboring characters \u2014 logistic\nregression, multi-class SVMs as described in (3), and one-against-all SVMs (whose perfor-\nmance was slightly lower than multi-class SVMs); and sequence approaches \u2014 CRFs, and\nour proposed M3 networks. Logistic regression and CRFs are both trained by maximiz-\ning the conditional likelihood of the labels given the features, using a zero-mean diagonal\nGaussian prior over the parameters, with a standard deviation between 0.1 and 1. The other\nmethods are trained by margin maximization. Our features for each label Yi are the corre-\nsponding image of ith character. For the sequence approaches (CRFs and M3), we used an\nindicator basis function to represent the correlation between Yi and Yi+1. For margin-based\nmethods (SVMs and M3), we were able to use kernels (both quadratic and cubic were eval-\nuated) to increase the dimensionality of the feature space. Using these high-dimensional\nfeature spaces in CRFs is not feasible because of the enormous number of parameters.\n\nFig. 1(b) shows two types of gains in accuracy: First, by using kernels, margin-based\nmethods achieve a very signi\ufb01cant gain over the respective likelihood maximizing methods.\nSecond, by using sequences, we obtain another signi\ufb01cant gain in accuracy. Interestingly,\nthe error rate of our method using linear features is 16% lower than that of CRFs, and\nabout the same as multi-class SVMs with cubic kernels. Once we use cubic kernels our\nerror rate is 45% lower than CRFs and about 33% lower than the best previous approach.\nFor comparison, the previously published results, although using a different setup (e.g., a\nlarger training set), are about comparable to those of multiclass SVMs.\nHypertext. 
We also tested our approach on collective hypertext classi\ufb01cation, using the\ndata set in [10], which contains web pages from four different Computer Science depart-\nments. Each page is labeled as one of course, faculty, student, project, other. In all of our\nexperiments, we learn a model from three schools, and test on the remaining school. The\ntext content of the web page and anchor text of incoming links is represented using a set\nof binary attributes that indicate the presence of different words. The baseline model is a\nsimple linear multi-class SVM that uses only words to predict the category of the page. The\nsecond model is a relational Markov network (RMN) of Taskar et al. [10], which in addi-\ntion to word-label dependence, has an edge with a potential over the labels of two pages\nthat are hyper-linked to each other. This model de\ufb01nes a Markov network over each web\nsite that was trained to maximize the conditional probability of the labels given the words\n\n\fand the links. The third model is a M3 net with the same features but trained by maximizing\nthe margin using the relaxed dual formulation and loopy BP for inference.\n\nFig. 1(c) shows a gain in accuracy from SVMs to RMNs by using the correlations between\nlabels of linked web pages, and a very signi\ufb01cant additional gain by using maximum margin\ntraining. The error rate of M3Ns is 40% lower than that of RMNs, and 51% lower than\nmulti-class SVMs.\n8 Discussion\nWe present a discriminative framework for labeling and segmentation of structured data\nsuch as sequences, images, etc. Our approach seamlessly integrates state-of-the-art kernel\nmethods developed for classi\ufb01cation of independent instances with the rich language of\ngraphical models that can exploit the structure of complex data. 
In our experiments with the OCR task, for example, our sequence model significantly outperforms other approaches by incorporating high-dimensional decision boundaries of polynomial kernels over character images while capturing correlations between consecutive characters. We construct our models by solving a convex quadratic program that maximizes the per-label margin. Although the number of variables and constraints of our QP formulation is polynomial in the example size (e.g., sequence length), we also address its quadratic growth using an effective optimization procedure inspired by SMO. We provide theoretical guarantees on the average per-label generalization error of our models in terms of the training set margin. Our generalization bound significantly tightens previous results of Collins [3] and suggests possibilities for analyzing per-label generalization properties of graphical models.\n\nFor brevity, we simplified our presentation of graphical models to only pairwise Markov networks. Our formulation and generalization bound easily extend to interaction patterns involving more than two labels (e.g., higher-order Markov models). Overall, we believe that M3 networks will significantly further the applicability of high accuracy margin-based methods to real-world structured data.\n\nAcknowledgments. This work was supported by ONR Contract F3060-01-2-0564-P00002 under DARPA's EELD program.\n\nReferences\n\n[1] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In Proc. ICML, 2003.\n\n[2] D. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.\n\n[3] M. Collins. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In IWPT, 2001.\n\n[4] R.G. Cowell, A.P. Dawid, S.L. Lauritzen, and D.J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, New York, 1999.\n\n[5] K. Crammer and Y. Singer. 
On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(5):265-292, 2001.\n\n[6] R. Kassel. A Comparison of Approaches to On-line Handwritten Character Recognition. PhD thesis, MIT Spoken Language Systems Group, 1995.\n\n[7] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, 2001.\n\n[8] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.\n\n[9] J. Platt. Using sparseness and analytic QP to speed training of support vector machines. In NIPS, 1999.\n\n[10] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proc. UAI, Edmonton, Canada, 2002.\n\n[11] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.\n\n[12] J. Yedidia, W. Freeman, and Y. Weiss. Generalized belief propagation. In NIPS, 2000.\n\n[13] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527-550, 2002.\n", "award": [], "sourceid": 2397, "authors": [{"given_name": "Ben", "family_name": "Taskar", "institution": null}, {"given_name": "Carlos", "family_name": "Guestrin", "institution": null}, {"given_name": "Daphne", "family_name": "Koller", "institution": null}]}