{"title": "Exponentiated Gradient Algorithms for Large-margin Structured Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 113, "page_last": 120, "abstract": null, "full_text": "Exponentiated Gradient Algorithms for Large-margin Structured Classification\n\nPeter L. Bartlett\nU.C. Berkeley\nbartlett@stat.berkeley.edu\n\nMichael Collins\nMIT CSAIL\nmcollins@csail.mit.edu\n\nBen Taskar\nStanford University\nbtaskar@cs.stanford.edu\n\nDavid McAllester\nTTI at Chicago\nmcallester@tti-c.org\n\nAbstract\n\nWe consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient, even in cases where the number of labels y is exponential in size, provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs.\n\n1 Introduction\n\nStructured classification is the problem of predicting y from x in the case where y has meaningful internal structure. For example, x might be a word string and y a sequence of part-of-speech labels, or x might be a Markov random field and y a labeling of x, or x might be a word string and y a parse of x. In these examples the number of possible labels y is exponential in the size of x. 
This paper presents a training algorithm for a general definition of structured classification covering both Markov random fields and parsing.\n\nWe restrict our attention to linear discriminative classification. We assume that pairs $\\langle x, y \\rangle$ can be embedded in a linear feature space $\\Phi(x, y)$, and that a predictive rule is determined by a direction (weight vector) w in that feature space. In linear discriminative prediction we select the y that has the greatest value for the inner product $\\langle \\Phi(x, y), w \\rangle$. Linear discrimination has been widely studied in the binary and multiclass setting [6, 4]. However, the case of structured labels has only recently been considered [2, 12, 3, 13]. The structured-label case takes into account the internal structure of y in the assignment of feature vectors, the computation of loss, and the definition and use of margins.\n\nWe focus on a formulation where each label y is represented as a set of \u201cparts\u201d, or equivalently, as a bit-vector. Moreover, we assume that the feature vector for y and the loss for y are both linear in the individual bits of y. This formulation has the advantage that it naturally covers both simple labeling problems, such as part-of-speech tagging, and more complex problems such as parsing.\n\nWe consider the large-margin optimization problem defined in [12] for selecting the classification direction w given a training sample. The starting point for these methods is a primal problem that has one constraint for each possible labeling y, or equivalently a dual problem where each y has an associated dual variable. We give a new training algorithm that relies on an exponential-family (Gibbs distribution) representation of structured objects. 
The algorithm is efficient, even in cases where the number of labels y is exponential in size, provided that certain expectations under Gibbs distributions can be calculated efficiently. The computation of these expectations appears to be a natural computational problem for structured problems, and has specific polynomial-time dynamic programming algorithms for some important examples: the clique-tree belief propagation algorithm can be used in Markov random fields, and the inside-outside algorithm can be used in the case of weighted context-free grammars.\n\nThe optimization method for structured labels relies on a more general result, specifically the application of exponentiated gradient (EG) updates [7, 8] to quadratic programs (QPs). We describe a method for solving QPs based on EG updates, and give bounds on its rate of convergence. The algorithm uses multiplicative updates on dual parameters in the problem. In addition to their application to the structured-labels task, the EG updates lead to simple algorithms for optimizing \u201cconventional\u201d binary or multiclass SVM problems.\n\nRelated work. [2, 12, 3, 13] consider large-margin methods for Markov random fields and (weighted) context-free grammars. We consider the optimization problem defined in [12]. [12] use a row-generation approach based on Viterbi decoding combined with an SMO optimization method. [5] describe exponentiated gradient algorithms for SVMs, but for binary classification in the \u201chard-margin\u201d case, without slack variables. We show that the EG-QP algorithm converges significantly faster than the rates shown in [5]. Multiplicative updates for SVMs are also described in [11], but unlike our method, the updates in [11] do not appear to factor in a way that allows algorithms for MRFs and WCFGs based on Gibbs-distribution representations. 
Our algorithms are related to those for conditional random fields (CRFs) [9]. CRFs define a linear model for structured problems, in a similar way to the models in our work, and also rely on the efficient computation of marginals in the training phase. Finally, see [1] for a longer version of the current paper, which includes more complete derivations and proofs.\n\n2 The General Setting\n\nWe consider the problem of learning a function $f : \\mathcal{X} \\to \\mathcal{Y}$, where $\\mathcal{X}$ is a set and $\\mathcal{Y}$ is a countable set. We assume a loss function $L : \\mathcal{X} \\times \\mathcal{Y} \\times \\mathcal{Y} \\to \\mathbb{R}^+$. The function $L(x, y, \\hat{y})$ measures the loss when y is the true label for x, and $\\hat{y}$ is a predicted label; typically, $\\hat{y}$ is the label proposed by some function f(x). In general we will assume that $L(x, y, \\hat{y}) = 0$ for $y = \\hat{y}$. Given some distribution over examples (X, Y) in $\\mathcal{X} \\times \\mathcal{Y}$, our aim is to find a function with low expected loss, or risk, $\\mathbf{E} L(X, Y, f(X))$.\n\nWe consider functions f which take a linear form. First, we assume a fixed function G which maps an input x to a set of candidates G(x). For all x, we assume that $G(x) \\subseteq \\mathcal{Y}$, and that G(x) is finite. A second component to the model is a feature-vector representation $\\Phi : \\mathcal{X} \\times \\mathcal{Y} \\to \\mathbb{R}^d$. Given a parameter vector $w \\in \\mathbb{R}^d$, we consider functions of the form\n\n$$f_w(x) = \\arg\\max_{y \\in G(x)} \\langle \\Phi(x, y), w \\rangle.$$\n\nGiven n independent training examples $(x_i, y_i)$ with the same distribution as (X, Y), we will formalize a large-margin optimization problem that is a generalization of support vector methods for binary classifiers, and is essentially the same as the formulation in [12]. The optimal parameters are taken to minimize the following regularized empirical risk function:\n\n$$\\frac{1}{2} \\|w\\|^2 + C \\sum_i \\left( \\max_y \\left( L(x_i, y_i, y) - m_{i,y}(w) \\right) \\right)_+$$\n\nwhere $m_{i,y}(w) = \\langle w, \\Phi(x_i, y_i) \\rangle - \\langle w, \\Phi(x_i, y) \\rangle$ is the \u201cmargin\u201d on (i, y) and $(z)_+ = \\max\\{z, 0\\}$. This optimization can be expressed as the primal problem in Figure 1. Following [12], the dual of this problem is also shown in Figure 1. The dual is a quadratic program $F(\\bar{\\alpha})$ in the dual variables $\\alpha_{i,y}$ for all $i = 1 \\ldots n$, $y \\in G(x_i)$. The dual variables for each example are constrained to form a probability distribution over $\\mathcal{Y}$.\n\nPrimal problem: $\\min_{w, \\bar{\\epsilon}} \\left( \\frac{1}{2} \\|w\\|^2 + C \\sum_i \\epsilon_i \\right)$\nSubject to the constraints:\n$\\forall i, \\forall y \\in G(x_i): \\langle w, \\Phi_{i,y} \\rangle \\geq L_{i,y} - \\epsilon_i$\n$\\forall i: \\epsilon_i \\geq 0$\n\nDual problem: $\\max_{\\bar{\\alpha}} F(\\bar{\\alpha})$, where\n$F(\\bar{\\alpha}) = C \\sum_{i,y} \\alpha_{i,y} L_{i,y} - \\frac{1}{2} C^2 \\sum_{i,y} \\sum_{j,z} \\alpha_{i,y} \\alpha_{j,z} \\langle \\Phi_{i,y}, \\Phi_{j,z} \\rangle$\nSubject to the constraints:\n$\\forall i: \\sum_y \\alpha_{i,y} = 1$; $\\forall i, y: \\alpha_{i,y} \\geq 0$\n\nRelationship between optimal values: $w^* = C \\sum_{i,y} \\alpha^*_{i,y} \\Phi_{i,y}$, where $w^*$ is the arg min of the primal problem, and $\\bar{\\alpha}^*$ is the arg max of the dual problem.\n\nFigure 1: The primal and dual problems. We use the definitions $L_{i,y} = L(x_i, y_i, y)$, and $\\Phi_{i,y} = \\Phi(x_i, y_i) - \\Phi(x_i, y)$. We assume that for all i, $L_{i,y} = 0$ for $y = y_i$. The constant C dictates the relative penalty for values of the slack variables $\\epsilon_i$ which are greater than 0.\n\n2.1 Models for structured classification\n\nThe problems we are interested in concern structured labels, which have a natural decomposition into \u201cparts\u201d. Formally, we assume some countable set of parts, $\\mathcal{R}$. We also assume a function R which maps each object $(x, y) \\in \\mathcal{X} \\times \\mathcal{Y}$ to a finite subset of $\\mathcal{R}$. Thus R(x, y) is the set of parts belonging to a particular object. In addition we assume a feature-vector representation $\\phi$ of parts: this is a function $\\phi : \\mathcal{X} \\times \\mathcal{R} \\to \\mathbb{R}^d$. 
The feature vector for an object (x, y) is then a sum of the feature vectors for its parts, and we also assume that the loss function $L(x, y, \\hat{y})$ decomposes into a sum over parts:\n\n$$\\Phi(x, y) = \\sum_{r \\in R(x, y)} \\phi(x, r) \\qquad L(x, y, \\hat{y}) = \\sum_{r \\in R(x, \\hat{y})} l(x, y, r)$$\n\nHere $\\phi(x, r)$ is a \u201clocal\u201d feature vector for part r paired with input x, and l(x, y, r) is a \u201clocal\u201d loss for part r when proposed for the pair (x, y). For convenience we define indicator variables I(x, y, r) which are 1 if $r \\in R(x, y)$, 0 otherwise. We also define sets $R(x_i) = \\cup_{y \\in G(x_i)} R(x_i, y)$ for all $i = 1 \\ldots n$.\n\nExample 1: Markov Random Fields (MRFs). In an MRF the space of labels G(x), and their underlying structure, can be represented by a graph. The graph G = (V, E) is a collection of vertices $V = \\{v_1, v_2, \\ldots, v_l\\}$ and edges E. Each vertex $v_i \\in V$ has a set of possible labels, $\\mathcal{Y}_i$. The set G(x) is then defined as $\\mathcal{Y}_1 \\times \\mathcal{Y}_2 \\times \\cdots \\times \\mathcal{Y}_l$. Each clique in the graph has a set of possible configurations: for example, if a particular clique contains vertices $\\{v_3, v_5, v_6\\}$, the set of possible configurations of this clique is $\\mathcal{Y}_3 \\times \\mathcal{Y}_5 \\times \\mathcal{Y}_6$. We define $\\mathcal{C}$ to be the set of cliques in the graph, and for any $c \\in \\mathcal{C}$ we define $\\mathcal{Y}(c)$ to be the set of possible configurations for that clique. We decompose each $y \\in G(x)$ into a set of parts, by defining $R(x, y) = \\{(c, a) \\in \\mathcal{R} : c \\in \\mathcal{C}, a \\in \\mathcal{Y}(c), (c, a) \\text{ is consistent with } y\\}$. The feature vector representation $\\phi(x, c, a)$ for each part can essentially track any characteristics of the assignment a for clique c, together with any features of the input x. A number of choices for the loss function l(x, y, (c, a)) are possible. For example, consider the Hamming loss used in [12], defined as $L(x, y, \\hat{y}) = \\sum_i I_{y_i \\neq \\hat{y}_i}$. To achieve this, first assign each vertex $v_i$ to a single one of the cliques in which it appears. 
Second, define l(x, y, (c, a)) to be the number of labels in the assignment (c, a) which are both incorrect and correspond to vertices which have been assigned to the clique c (note that assigning each vertex to a single clique avoids \u201cdouble counting\u201d of label errors).\n\nInputs: A learning rate $\\eta$.\nData structures: A vector $\\bar{\\theta}$ of variables, $\\theta_{i,r}$, $\\forall i, \\forall r \\in R(x_i)$.\nDefinitions: $\\alpha_{i,y}(\\bar{\\theta}) = \\exp\\left(\\sum_{r \\in R(x_i, y)} \\theta_{i,r}\\right) / Z_i$ where $Z_i$ is a normalization term.\nAlgorithm:\n\u2022 Choose initial values $\\bar{\\theta}^1$ for the $\\theta_{i,r}$ variables (these values can be arbitrary).\n\u2022 For $t = 1 \\ldots T + 1$:\n  \u2013 For $i = 1 \\ldots n$, $r \\in R(x_i)$, calculate $\\mu^t_{i,r} = \\sum_y \\alpha_{i,y}(\\bar{\\theta}^t) I(x_i, y, r)$.\n  \u2013 Set $w^t = C \\left( \\sum_{i, r \\in R(x_i, y_i)} \\phi_{i,r} - \\sum_{i, r \\in R(x_i)} \\mu^t_{i,r} \\phi_{i,r} \\right)$.\n  \u2013 For $i = 1 \\ldots n$, $r \\in R(x_i)$, calculate updates $\\theta^{t+1}_{i,r} = \\theta^t_{i,r} + \\eta C \\left( l_{i,r} + \\langle w^t, \\phi_{i,r} \\rangle \\right)$.\nOutput: Parameter values $w^{T+1}$.\n\nFigure 2: The EG algorithm for structured problems. We use $\\phi_{i,r} = \\phi(x_i, r)$ and $l_{i,r} = l(x_i, y_i, r)$.\n\nExample 2: Weighted Context-Free Grammars (WCFGs). In this example x is an input string, and y is a \u201cparse tree\u201d for that string, i.e., a left-most derivation for x under some context-free grammar. The set G(x) is the set of all left-most derivations for x under the grammar. For convenience, we restrict the grammar to be in Chomsky normal form, where all rules in the grammar are of the form $\\langle A \\to B\\ C \\rangle$ or $\\langle A \\to a \\rangle$, where A, B, C are non-terminal symbols, and a is some terminal symbol. We take a part r to be a CF-rule-tuple $\\langle A \\to B\\ C, s, m, e \\rangle$. Under this representation A spans words $s \\ldots e$ inclusive in x; B spans words $s \\ldots m$; and C spans words $(m + 1) \\ldots e$. The function R(x, y) maps a derivation y to the set of parts which it includes. 
In WCFGs $\\phi(x, r)$ can be any function mapping a rule production and its position in the sentence x to a feature vector. One example of a loss function would be to define l(x, y, r) to be 1 only if r\u2019s non-terminal A is not seen spanning words $s \\ldots e$ in the derivation y. This would lead to $L(x, y, \\hat{y})$ tracking the number of \u201cconstituent errors\u201d in $\\hat{y}$, where a constituent is a (non-terminal, start-point, end-point) tuple such as (A, s, e).\n\n3 EG updates for structured objects\n\nWe now consider an algorithm for computing $\\bar{\\alpha}^* = \\arg\\max_{\\bar{\\alpha} \\in \\Delta} F(\\bar{\\alpha})$, where $F(\\bar{\\alpha})$ is the dual form of the maximum margin problem, as in Figure 1. In particular, we are interested in the optimal values of the primal form parameters, which are related to $\\bar{\\alpha}^*$ by $w^* = C \\sum_{i,y} \\alpha^*_{i,y} \\Phi_{i,y}$. A key problem is that in many of our examples, the number of dual variables $\\alpha_{i,y}$ precludes dealing with these variables directly. For example, in the MRF or WCFG cases, the set G(x) is exponential in size, and the number of dual variables $\\alpha_{i,y}$ is therefore also exponential.\n\nWe describe an algorithm that is efficient for certain examples of structured objects such as MRFs or WCFGs. Instead of representing the $\\alpha_{i,y}$ variables explicitly, we manipulate a vector $\\bar{\\theta}$ of variables $\\theta_{i,r}$ for $i = 1 \\ldots n$, $r \\in R(x_i)$. Thus we have one of these \u201cmini-dual\u201d variables for each part seen in the training data. Each of the variables $\\theta_{i,r}$ can take any value in the reals. We now define the dual variables $\\alpha_{i,y}$ as a function of the vector $\\bar{\\theta}$, which takes the form of a Gibbs distribution:\n\n$$\\alpha_{i,y}(\\bar{\\theta}) = \\frac{\\exp\\left(\\sum_{r \\in R(x_i, y)} \\theta_{i,r}\\right)}{\\sum_{y'} \\exp\\left(\\sum_{r \\in R(x_i, y')} \\theta_{i,r}\\right)}.$$\n\nFigure 2 shows an algorithm for maximizing $F(\\bar{\\alpha})$. 
The algorithm defines a sequence of values $\\bar{\\theta}^1, \\bar{\\theta}^2, \\ldots$. In the next section we prove that the sequence $F(\\bar{\\alpha}(\\bar{\\theta}^1)), F(\\bar{\\alpha}(\\bar{\\theta}^2)), \\ldots$ converges to $\\max_{\\bar{\\alpha}} F(\\bar{\\alpha})$. The algorithm can be implemented efficiently, independently of the dimensionality of $\\bar{\\alpha}$, provided that there is an efficient algorithm for computing marginal terms $\\mu_{i,r} = \\sum_y \\alpha_{i,y}(\\bar{\\theta}) I(x_i, y, r)$ for all $i = 1 \\ldots n$, $r \\in R(x_i)$, and all $\\bar{\\theta}$. A key property is that the primal parameters $w = C \\sum_{i,y} \\alpha_{i,y}(\\bar{\\theta}) \\Phi_{i,y} = C \\sum_i \\Phi(x_i, y_i) - C \\sum_{i,y} \\alpha_{i,y}(\\bar{\\theta}) \\Phi(x_i, y)$ can be expressed in terms of the marginal terms, because:\n\n$$\\sum_{i,y} \\alpha_{i,y}(\\bar{\\theta}) \\Phi(x_i, y) = \\sum_{i,y} \\alpha_{i,y}(\\bar{\\theta}) \\sum_{r \\in R(x_i, y)} \\phi(x_i, r) = \\sum_{i, r \\in R(x_i)} \\mu_{i,r} \\phi(x_i, r)$$\n\nand hence $w = C \\sum_i \\Phi(x_i, y_i) - C \\sum_{i, r \\in R(x_i)} \\mu_{i,r} \\phi(x_i, r)$. The $\\mu_{i,r}$ values can be calculated for MRFs and WCFGs in many cases, using standard algorithms. For example, in the WCFG case, the inside-outside algorithm can be used, provided that each part r is a context-free rule production, as described in Example 2 above. In the MRF case, the $\\mu_{i,r}$ values can be calculated efficiently if the tree-width of the underlying graph is small.\n\nNote that the main storage requirements of the algorithm in Figure 2 concern the vector $\\bar{\\theta}$. This is a vector which has as many components as there are parts in the training set. In practice, the number of parts in the training data can become extremely large. Fortunately, an alternative, \u201cprimal form\u201d algorithm is possible. Rather than explicitly storing the $\\theta_{i,r}$ variables, we can store a vector $z^t$ of the same dimensionality as $w^t$. The $\\theta_{i,r}$ values can be computed from $z^t$. 
More explicitly, the main body of the algorithm in Figure 2 can be replaced with the following:\n\u2022 Set $z^1$ to some initial value. For $t = 1 \\ldots T + 1$:\n  \u2013 Set $w^t = 0$.\n  \u2013 For $i = 1 \\ldots n$: Compute $\\mu^t_{i,r}$ for $r \\in R(x_i)$, using $\\theta^t_{i,r} = \\eta C \\left( (t - 1) l_{i,r} + \\langle z^t, \\phi_{i,r} \\rangle \\right)$; set $w^t = w^t + C \\left( \\sum_{r \\in R(x_i, y_i)} \\phi_{i,r} - \\sum_{r \\in R(x_i)} \\mu^t_{i,r} \\phi_{i,r} \\right)$.\n  \u2013 Set $z^{t+1} = z^t + w^t$.\nIt can be verified that if $\\forall i, r$, $\\theta^1_{i,r} = \\eta C \\langle \\phi_{i,r}, z^1 \\rangle$, then this alternative algorithm defines the same sequence of (implicit) $\\theta^t_{i,r}$ values, and (explicit) $w^t$ values, as the original algorithm. In the next section we show that the original algorithm converges for any choice of initial values $\\bar{\\theta}^1$, so this restriction on $\\theta^1_{i,r}$ should not be significant.\n\n4 Exponentiated gradient (EG) updates for quadratic programs\n\nWe now prove convergence properties of the algorithm in Figure 2. We show that it is an instantiation of a general algorithm for optimizing quadratic programs (QPs), which relies on Exponentiated Gradient (EG) updates [7, 8]. In the general problem we assume a positive semi-definite matrix $A \\in \\mathbb{R}^{m \\times m}$, and a vector $b \\in \\mathbb{R}^m$, specifying a loss function $Q(\\bar{\\alpha}) = b'\\bar{\\alpha} + \\frac{1}{2} \\bar{\\alpha}' A \\bar{\\alpha}$. Here $\\bar{\\alpha}$ is an m-dimensional vector of reals. 
We assume that $\\bar{\\alpha}$ is formed by the concatenation of n vectors $\\bar{\\alpha}_i \\in \\mathbb{R}^{m_i}$ for $i = 1 \\ldots n$, where $\\sum_i m_i = m$. We assume that each $\\bar{\\alpha}_i$ lies in a simplex of dimension $m_i$, so that the feasible set is\n\n$$\\Delta = \\left\\{ \\bar{\\alpha} : \\bar{\\alpha} \\in \\mathbb{R}^m; \\text{ for } i = 1 \\ldots n, \\sum_{j=1}^{m_i} \\alpha_{i,j} = 1; \\text{ for all } i, j, \\ \\alpha_{i,j} \\geq 0 \\right\\}. \\qquad (1)$$\n\nOur aim is to find $\\arg\\min_{\\bar{\\alpha} \\in \\Delta} Q(\\bar{\\alpha})$. Figure 3 gives an algorithm, the \u201cEG-QP\u201d algorithm, for finding the minimum. In the next section we give a proof of its convergence properties.\n\nThe EG-QP algorithm can be used to find the minimum of $-F(\\bar{\\alpha})$, and hence the maximum of the dual objective $F(\\bar{\\alpha})$. We justify the algorithm in Figure 2 by showing that it is equivalent to minimization of $-F(\\bar{\\alpha})$ using the EG-QP algorithm. We give the following theorem:\n\nTheorem 1 Define $F(\\bar{\\alpha}) = C \\sum_{i,y} \\alpha_{i,y} L_{i,y} - \\frac{1}{2} C^2 \\sum_{i,y} \\sum_{j,z} \\alpha_{i,y} \\alpha_{j,z} \\langle \\Phi_{i,y}, \\Phi_{j,z} \\rangle$, and assume as in section 2 that $L_{i,y} = \\sum_{r \\in R(x_i, y)} l(x_i, y_i, r)$ and $\\Phi(x_i, y) = \\sum_{r \\in R(x_i, y)} \\phi(x_i, r)$. Consider the sequence $\\bar{\\alpha}(\\bar{\\theta}^1) \\ldots \\bar{\\alpha}(\\bar{\\theta}^{T+1})$ defined by the algorithm in Figure 2, and the sequence $\\bar{\\alpha}^1 \\ldots \\bar{\\alpha}^{T+1}$ defined by the EG-QP algorithm when applied to $Q(\\bar{\\alpha}) = -F(\\bar{\\alpha})$. Then under the assumption that $\\bar{\\alpha}(\\bar{\\theta}^1) = \\bar{\\alpha}^1$, it follows that $\\bar{\\alpha}(\\bar{\\theta}^t) = \\bar{\\alpha}^t$ for $t = 1 \\ldots (T + 1)$.\n\nInputs: A positive semi-definite matrix A, and a vector b, specifying a loss function $Q(\\bar{\\alpha}) = b \\cdot \\bar{\\alpha} + \\frac{1}{2} \\bar{\\alpha}' A \\bar{\\alpha}$. Each vector $\\bar{\\alpha}$ is in $\\Delta$, where $\\Delta$ is defined in Eq. 1.\nAlgorithm:\n\u2022 Initialize $\\bar{\\alpha}^1$ to a point in the interior of $\\Delta$. Choose a learning rate $\\eta > 0$.\n\u2022 For $t = 1 \\ldots T$:\n  \u2013 Calculate $\\bar{s}^t = \\nabla Q(\\bar{\\alpha}^t) = b + A \\bar{\\alpha}^t$.\n  \u2013 Calculate $\\bar{\\alpha}^{t+1}$ as: $\\forall i, j$, $\\alpha^{t+1}_{i,j} = \\alpha^t_{i,j} \\exp\\{-\\eta s^t_{i,j}\\} / \\sum_k \\alpha^t_{i,k} \\exp\\{-\\eta s^t_{i,k}\\}$.\nOutput: Return $\\bar{\\alpha}^{T+1}$.\n\nFigure 3: The EG-QP algorithm for quadratic programs.\n\nProof. We can write $F(\\bar{\\alpha}) = C \\sum_{i,y} \\alpha_{i,y} L_{i,y} - \\frac{1}{2} C^2 \\| \\sum_i \\Phi(x_i, y_i) - \\sum_{i,y} \\alpha_{i,y} \\Phi(x_i, y) \\|^2$. It follows that $\\frac{\\partial F(\\bar{\\alpha}^t)}{\\partial \\alpha_{i,y}} = C L_{i,y} + C \\langle \\Phi(x_i, y), w^t \\rangle = C \\sum_{r \\in R(x_i, y)} \\left( l_{i,r} + \\langle \\phi_{i,r}, w^t \\rangle \\right)$ where as before $w^t = C \\left( \\sum_i \\Phi(x_i, y_i) - \\sum_{i,y} \\alpha^t_{i,y} \\Phi(x_i, y) \\right)$. The rest of the proof proceeds by induction; due to space constraints we give a sketch of the proof here. The idea is to show that $\\bar{\\alpha}(\\bar{\\theta}^{t+1}) = \\bar{\\alpha}^{t+1}$ under the inductive hypothesis that $\\bar{\\alpha}(\\bar{\\theta}^t) = \\bar{\\alpha}^t$. This follows immediately from the definitions of the mappings $\\bar{\\alpha}(\\bar{\\theta}^t) \\to \\bar{\\alpha}(\\bar{\\theta}^{t+1})$ and $\\bar{\\alpha}^t \\to \\bar{\\alpha}^{t+1}$ in the two algorithms, together with the identities $s^t_{i,y} = -\\frac{\\partial F(\\bar{\\alpha}^t)}{\\partial \\alpha_{i,y}} = -C \\sum_{r \\in R(x_i, y)} \\left( l_{i,r} + \\langle \\phi_{i,r}, w^t \\rangle \\right)$ and $\\theta^{t+1}_{i,r} - \\theta^t_{i,r} = \\eta C \\left( l_{i,r} + \\langle \\phi_{i,r}, w^t \\rangle \\right)$.\n\n4.1 Convergence of the exponentiated gradient QP algorithm\n\nThe following theorem shows how the optimization algorithm converges to an optimal solution. 
The theorem compares the value of the objective function for the algorithm\u2019s vector $\\bar{\\alpha}^t$ to the value for a comparison vector $u \\in \\Delta$. (For example, consider u as the solution of the QP.) The convergence result is in terms of several properties of the algorithm and the comparison vector u. The distance between u and $\\bar{\\alpha}^1$ is measured using the Kullback-Leibler (KL) divergence. Recall that the KL-divergence between two probability vectors $\\bar{u}, \\bar{v}$ is defined as $D(\\bar{u}, \\bar{v}) = \\sum_i u_i \\log \\frac{u_i}{v_i}$. For sequences of probability vectors, $\\bar{u} \\in \\Delta$ with $\\bar{u} = (\\bar{u}_1, \\ldots, \\bar{u}_n)$ and $\\bar{u}_i = (u_{i,1}, \\ldots, u_{i,m_i})$, we can define a divergence as the sum of KL-divergences: for $\\bar{u}, \\bar{v} \\in \\Delta$, $\\bar{D}(\\bar{u}, \\bar{v}) = \\sum_{i=1}^n D(\\bar{u}_i, \\bar{v}_i)$. Two other key parameters are $\\lambda$, the largest eigenvalue of the positive semidefinite symmetric matrix A, and\n\n$$B = \\max_{\\bar{\\alpha} \\in \\Delta} \\left( \\max_i (\\nabla Q(\\bar{\\alpha}))_i - \\min_i (\\nabla Q(\\bar{\\alpha}))_i \\right) \\leq 2 \\left( n \\max_{ij} |A_{ij}| + \\max_i |b_i| \\right).$$\n\nTheorem 2 For all $\\bar{u} \\in \\Delta$,\n\n$$\\frac{1}{T} \\sum_{t=1}^T Q(\\bar{\\alpha}^t) \\leq Q(\\bar{u}) + \\frac{\\bar{D}(\\bar{u}, \\bar{\\alpha}^1)}{\\eta T} + \\frac{e^{\\eta B} - 1 - \\eta B}{\\eta^2 B^2 \\left( 1 - \\eta (B + \\lambda) e^{\\eta B} \\right)} \\cdot \\frac{Q(\\bar{\\alpha}^1) - Q(\\bar{\\alpha}^{T+1})}{T}.$$\n\nChoosing $\\eta = 0.4 / (B + \\lambda)$ ensures that\n\n$$Q(\\bar{\\alpha}^{T+1}) \\leq \\frac{1}{T} \\sum_{t=1}^T Q(\\bar{\\alpha}^t) \\leq Q(\\bar{u}) + 2.5 (B + \\lambda) \\frac{\\bar{D}(\\bar{u}, \\bar{\\alpha}^1)}{T} + 1.5 \\frac{Q(\\bar{\\alpha}^1) - Q(\\bar{\\alpha}^{T+1})}{T}.$$\n\nThe first lemma we require is due to Kivinen and Warmuth [8].\n\nLemma 1 For any $\\bar{u} \\in \\Delta$,\n\n$$\\eta Q(\\bar{\\alpha}^t) - \\eta Q(\\bar{u}) \\leq \\bar{D}(\\bar{u}, \\bar{\\alpha}^t) - \\bar{D}(\\bar{u}, \\bar{\\alpha}^{t+1}) + \\bar{D}(\\bar{\\alpha}^t, \\bar{\\alpha}^{t+1}).$$\n\nWe focus on the third term. Define $\\nabla_{(i)} Q(\\bar{\\alpha})$ as the segment of the gradient vector corresponding to the component $\\bar{\\alpha}_i$ of $\\bar{\\alpha}$, and define the random variable $X_{i,t}$, satisfying $\\Pr\\left( X_{i,t} = -\\left( \\nabla_{(i)} Q(\\bar{\\alpha}^t) \\right)_j \\right) = \\alpha^t_{i,j}$.\n\nLemma 2\n\n$$\\bar{D}(\\bar{\\alpha}^t, \\bar{\\alpha}^{t+1}) = \\sum_{i=1}^n \\log \\mathbf{E}\\left[ e^{\\eta (X_{i,t} - \\mathbf{E} X_{i,t})} \\right] \\leq \\left( \\frac{e^{\\eta B} - 1 - \\eta B}{B^2} \\right) \\sum_{i=1}^n \\mathrm{var}(X_{i,t}).$$\n\nProof. Writing $\\nabla_{i,k}$ for $\\left( \\nabla_{(i)} Q(\\bar{\\alpha}^t) \\right)_k$ and $\\nabla_i$ for $\\nabla_{(i)} Q(\\bar{\\alpha}^t)$,\n\n$$\\bar{D}(\\bar{\\alpha}^t, \\bar{\\alpha}^{t+1}) = \\sum_{i=1}^n \\sum_j \\alpha^t_{ij} \\log \\frac{\\alpha^t_{ij}}{\\alpha^{t+1}_{ij}}$$\n$$= \\sum_{i=1}^n \\sum_j \\alpha^t_{ij} \\left( \\log \\left( \\sum_k \\alpha^t_{ik} \\exp(-\\eta \\nabla_{i,k}) \\right) + \\eta \\nabla_{i,j} \\right)$$\n$$= \\sum_{i=1}^n \\log \\left( \\sum_k \\alpha^t_{ik} \\exp\\left( -\\eta \\nabla_{i,k} + \\eta \\bar{\\alpha}^t_i \\cdot \\nabla_i \\right) \\right)$$\n$$= \\sum_{i=1}^n \\log \\left( \\mathbf{E}\\left[ e^{\\eta (X_{i,t} - \\mathbf{E} X_{i,t})} \\right] \\right) \\leq \\frac{e^{\\eta B} - 1 - \\eta B}{B^2} \\sum_{i=1}^n \\mathrm{var}(X_{i,t}).$$\n\nThis last inequality is at the heart of the proof of Bernstein\u2019s inequality; e.g., see [10].\n\nThe second part of the proof of the theorem involves bounding this variance in terms of the loss. The following lemma relies on the fact that this variance is, to first order, the decrease in the quadratic loss, and that the second order term in the Taylor series expansion of the loss is small compared to the variance, provided the steps are not too large. The lemma and its proof require several definitions. For any d, let $\\sigma : \\mathbb{R}^d \\to (0, 1)^d$ be the softmax function, $\\sigma(\\bar{\\theta})_i = \\exp(\\theta_i) / \\sum_{j=1}^d \\exp(\\theta_j)$, for $\\bar{\\theta} \\in \\mathbb{R}^d$. We shall work in the exponential parameter space: let $\\bar{\\theta}^t$ be the exponential parameters at step t, so that the updates are $\\bar{\\theta}^{t+1} = \\bar{\\theta}^t - \\eta \\nabla Q(\\bar{\\alpha}^t)$, and the QP variables satisfy $\\bar{\\alpha}^t_i = \\sigma(\\bar{\\theta}^t_i)$. Define the random variables $X_{i,t,\\bar{\\theta}}$, satisfying $\\Pr\\left( X_{i,t,\\bar{\\theta}} = -\\left( \\nabla_{(i)} Q(\\bar{\\alpha}^t) \\right)_j \\right) = \\left( \\sigma(\\bar{\\theta}_i) \\right)_j$. This takes the same values as $X_{i,t}$, but its distribution is given by a different exponential parameter ($\\bar{\\theta}_i$ instead of $\\bar{\\theta}^t_i$). Define $[\\bar{\\theta}^t, \\bar{\\theta}^{t+1}] = \\{ a \\bar{\\theta}^t + (1 - a) \\bar{\\theta}^{t+1} : a \\in [0, 1] \\}$.\n\nLemma 3 For some $\\bar{\\theta} \\in [\\bar{\\theta}^t, \\bar{\\theta}^{t+1}]$,\n\n$$\\eta \\sum_{i=1}^n \\mathrm{var}(X_{i,t}) - \\eta^2 (B + \\lambda) \\sum_{i=1}^n \\mathrm{var}(X_{i,t,\\bar{\\theta}}) \\leq Q(\\bar{\\alpha}^t) - Q(\\bar{\\alpha}^{t+1}),$$\n\nbut for all $\\bar{\\theta} \\in [\\bar{\\theta}^t, \\bar{\\theta}^{t+1}]$, $\\mathrm{var}(X_{i,t,\\bar{\\theta}}) \\leq e^{\\eta B} \\mathrm{var}(X_{i,t})$. Hence,\n\n$$\\sum_{i=1}^n \\mathrm{var}(X_{i,t}) \\leq \\frac{1}{\\eta \\left( 1 - \\eta (B + \\lambda) e^{\\eta B} \\right)} \\left( Q(\\bar{\\alpha}^t) - Q(\\bar{\\alpha}^{t+1}) \\right).$$\n\nThus, for $\\eta < 0.567 / (B + \\lambda)$, $Q(\\bar{\\alpha}^t)$ is non-increasing in t.\n\nThe proof is in [1]. Theorem 2 follows from an easy calculation.\n\n5 Experiments\n\nWe compared an online$^1$ version of the Exponentiated Gradient algorithm with the factored Sequential Minimal Optimization (SMO) algorithm in [12] on a sequence segmentation task. We selected the first 1000 sentences (12K words) from the CoNLL-2003 named entity recognition challenge data set for our experiment. 
The goal is to extract (multi-word) entity names of people, organizations, locations and miscellaneous entities. Each word is labelled with one of 9 possible tags (beginning of one of the four entity types, continuation of one of the types, or not-an-entity). We trained a first-order Markov chain over the tags, where our cliques are just the nodes for the tag of each word and edges between tags of consecutive words. The feature vector for each node assignment consists of the word itself, its capitalization and morphological features, etc., as well as the previous and consecutive words and their features. Likewise, the feature vector for each edge assignment consists of the two words and their features as well as surrounding words.\n\n[Figure 4 plots, panels (a) and (b). Panel (a) legend: SMO, EG (eta .5), EG (eta 1). Panel (b) legend: SMO, EG (th -2.7), EG (th -3), EG (th -4.5).]\n\nFigure 4: Number of iterations over training set vs. dual objective for the SMO and EG algorithms. (a) Comparison with different $\\eta$ values; (b) Comparison with $\\eta = 1$ and different initial $\\theta$ values.\n\nFootnote 1: In the online algorithm we calculate marginal terms, and updates to the $w^t$ parameters, one training example at a time. As yet we do not have convergence bounds for this method, but we have found that it works well in practice.\n\nFigure 4 shows the growth of the dual objective function after each pass through the data for SMO and EG, for several settings of the learning rate $\\eta$ and the initial setting of the $\\theta$ parameters. 
Note that SMO starts up very quickly but slows down in a suboptimal region, while EG lags at the start, but overtakes SMO and achieves a larger than 10% increase in the value of the objective. These preliminary results suggest that a hybrid algorithm could get the benefits of both, by starting out with several SMO updates and then switching to EG. The key issue is to switch from the marginal $\\mu$ representation SMO maintains to the Gibbs $\\theta$ representation that EG uses. We can find $\\theta$ that produces $\\mu$ by first computing conditional \u201cprobabilities\u201d that correspond to our marginals (e.g. dividing edge marginals by node marginals in this case) and then letting the $\\theta$ values be the logs of the conditional probabilities.\n\nReferences\n\n[1] Long version of this paper. Available at http://www.ai.mit.edu/people/mcollins.\n[2] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In ICML, 2003.\n[3] Michael Collins. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In Harry Bunt, John Carroll, and Giorgio Satta, editors, New Developments in Parsing Technology. Kluwer, 2004.\n[4] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(5):265\u2013292, 2001.\n[5] N. Cristianini, C. Campbell, and J. Shawe-Taylor. Multiplicative updatings for support-vector learning. Technical report, NeuroCOLT2, 1998.\n[6] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.\n[7] J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1\u201363, 1997.\n[8] J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Machine Learning, 45(3):301\u2013329, 2001.\n[9] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML-01, 2001.\n[10] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.\n[11] F. Sha, L. Saul, and D. Lee. Multiplicative updates for large margin classifiers. In COLT, 2003.\n[12] B. Taskar, C. Guestrin, and D. Koller. Max margin Markov networks. In NIPS, 2003.\n[13] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004 (To appear).\n", "award": [], "sourceid": 2677, "authors": [{"given_name": "Peter", "family_name": "Bartlett", "institution": null}, {"given_name": "Michael", "family_name": "Collins", "institution": null}, {"given_name": "Ben", "family_name": "Taskar", "institution": null}, {"given_name": "David", "family_name": "McAllester", "institution": null}]}