{"title": "Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss", "book": "Advances in Neural Information Processing Systems", "page_first": 2205, "page_last": 2212, "abstract": "We consider latent structural versions of probit loss and ramp loss. We show that  these surrogate loss functions are consistent in the strong sense that for any feature map  (finite or infinite dimensional) they yield predictors approaching the infimum  task loss achievable by any linear predictor over the given features.  We also give  finite sample generalization bounds (convergence rates) for these loss functions.  These bounds suggest that probit loss converges more rapidly.  However, ramp loss is more easily optimized and may ultimately  be more practical.", "full_text": "Generalization Bounds and Consistency for Latent\n\nStructural Probit and Ramp Loss\n\nDavid McAllester\n\nTTI-Chicago\n\nmcallester@ttic.edu\n\nJoseph Keshet\nTTI-Chicago\n\njkeshet@ttic.edu\n\nAbstract\n\nWe consider latent structural versions of probit loss and ramp loss. We show that\nthese surrogate loss functions are consistent in the strong sense that for any feature\nmap (\ufb01nite or in\ufb01nite dimensional) they yield predictors approaching the in\ufb01mum\ntask loss achievable by any linear predictor over the given features. We also give\n\ufb01nite sample generalization bounds (convergence rates) for these loss functions.\nThese bounds suggest that probit loss converges more rapidly. However, ramp\nloss is more easily optimized on a given sample.\n\n1 Introduction\n\nMachine learning has become a central tool in areas such as speech recognition, natural language\ntranslation, machine question answering, and visual object detection. In modern approaches to these\napplications systems are evaluated with quantitative performance metrics. In speech recognition one\ntypically measures performance by the word error rate. In machine translation one typically uses\nthe BLEU score. 
Recently the IBM deep question answering system was trained to optimize the\nJeopardy game show score. The PASCAL visual object detection challenge is scored by average\nprecision in recovering object bounding boxes. No metric is perfect and any metric is controversial,\nbut quantitative metrics provide a basis for quantitative experimentation, and quantitative experimentation\nhas led to real progress. Here we adopt the convention that a performance metric is given as a\ntask loss \u2014 a measure of a quantity of error or cost such as the word error rate in speech recognition.\nWe consider general methods for minimizing task loss at evaluation time.\nAlthough the goal is to minimize task loss, most systems are trained by minimizing a surrogate loss\ndifferent from task loss. A surrogate loss is necessary when using scale-sensitive regularization in\ntraining a linear classifier. A linear classifier selects the output that maximizes an inner product of a\nfeature vector and a weight vector. The output of a linear classifier does not change when the weight\nvector is scaled down. But for most regularizers of interest, such as a norm of the weight vector,\nscaling down the weight vector drives the regularizer to zero. So directly regularizing the task loss\nof a linear classifier is meaningless.\nFor binary classification standard surrogate loss functions include log loss, hinge loss, probit loss,\nand ramp loss. Unlike binary classification, however, the applications mentioned above involve\ncomplex (or structured) outputs. The standard surrogate loss functions for binary classification have\ngeneralizations to the structured output setting. Structural log loss is used in conditional random\nfields (CRFs) [7]. Structural hinge loss is used in structural SVMs [13, 14]. Structural probit loss is\ndefined and empirically evaluated in [6]. 
A version of structural ramp loss is defined and empirically\nevaluated in [3] (but see also [12] for a treatment of the fundamental motivation for ramp loss). All\nfour of these structural surrogate loss functions are defined formally in section 2 (see footnote 1).\n\nFootnote 1: The definition of ramp loss used here is slightly different from that in [3].\n\nThis paper is concerned with developing a better theoretical understanding of the relationship between\nsurrogate loss training and task loss testing for structured labels. Structural ramp loss is\njustified in [3] as being a tight upper bound on task loss. But of course the tightest upper bound on\ntask loss is the task loss itself. Here we focus on generalization bounds and consistency. A finite\nsample generalization bound for probit loss was stated implicitly in [9] and an explicit probit loss\nbound is given in [6]. Here we review the finite sample bounds for probit loss and prove a finite\nsample bound for ramp loss. Using these bounds we show that probit loss and ramp loss are both\nconsistent in the sense that for an arbitrary feature map (possibly infinite dimensional) optimizing\nthese surrogate loss functions with appropriately weighted regularization approaches, in the limit of\ninfinite training data, the minimum loss achievable by a linear predictor over the given features. No\nconvex surrogate loss function, such as log loss or hinge loss, can be consistent in this sense \u2014 for\nany nontrivial convex surrogate loss function one can give examples (a single feature suffices) where\nthe learned weight vector is perturbed by outliers but where the outliers do not actually influence the\noptimal task loss.\nBoth probit loss and ramp loss can be optimized in practice by stochastic gradient descent. Ramp\nloss is simpler and easier to implement. 
The subgradient update for ramp loss is similar to a perceptron\nupdate \u2014 the update is a difference between a \u201cgood\u201d feature vector and a \u201cbad\u201d feature\nvector. Ramp loss updates are closely related to updates derived from n-best lists in training machine\ntranslation systems [8, 2]. Ramp loss updates regularized by early stopping have been shown to be\neffective in phoneme alignment [10]. It is also shown in [10] that in the limit of large weight vectors\nthe expected ramp loss update converges to the true gradient of task loss. This result suggests consistency\nfor ramp loss, a suggestion confirmed here. A practical stochastic gradient descent algorithm\nfor structural probit loss is given in [6] where it is also shown that probit loss can be effective for\nphoneme recognition. Although the generalization bounds suggest that probit loss converges faster\nthan ramp loss, ramp loss seems easier to optimize.\nWe formulate all the notions of loss in the presence of latent structure as well as structured labels.\nLatent structure is information that is not given in the labeled data but is constructed by the prediction\nalgorithm. For example, in natural language translation the alignment between the words in the\nsource and the words in the target is not explicitly given in a translation pair. Grammatical structures\nare also not given in a translation pair but may be constructed as part of the translation process. In\nvisual object detection the position of object parts is not typically annotated in the labeled data but\npart position estimates may be used as part of the recognition algorithm. Although the presence of\nlatent structure makes log loss and hinge loss non-convex, latent structure seems essential in many\napplications. Latent structural log loss, and the notion of a hidden CRF, is formulated in [11]. 
Latent\nstructural hinge loss, and the notion of a latent structural SVM, is formulated in [15].\n\n2 Formal Setting and Review\nWe consider an arbitrary input space X and a finite label space Y. We assume a source probability\ndistribution over labeled data, i.e., a distribution over pairs (x, y), where we write $E_{x,y}[f(x, y)]$ for\nthe expectation of f(x, y). We assume a loss function L such that for any two labels $y$ and $\\hat{y}$ we have\nthat $L(y, \\hat{y}) \\in [0, 1]$ is the loss (or cost) when the true label is $y$ and we predict $\\hat{y}$. We will work with\ninfinite dimensional feature vectors. We let $\\ell_2$ be the set of finite-norm infinite-dimensional vectors\n\u2014 the set of all square-summable infinite sequences of real numbers. We will be interested in linear\npredictors involving latent structure. We assume a finite set Z of \u201clatent labels\u201d. For example, we\nmight take Z to be the set of all parse trees of source and target sentences in a machine translation\nsystem. In machine translation the label y is typically a sentence with no parse tree specified. We can\nrecover the pure structural case, with no latent information, by taking Z to be a singleton set. It will\nbe convenient to define S to be the set of pairs of a label and a latent label. An element s of S will\nbe called an augmented label and we define $L(y, s)$ by $L(y, (\\hat{y}, z)) = L(y, \\hat{y})$. 
We assume a feature\nmap $\\phi$ such that for an input x and augmented label s we have $\\phi(x, s) \\in \\ell_2$ with $\\|\\phi(x, s)\\| \\le 1$ (see footnote 2).\nGiven an input x and a weight vector $w \\in \\ell_2$ we define the prediction $\\hat{s}_w(x)$ as follows.\n\n$$\\hat{s}_w(x) = \\operatorname{argmax}_s \\; w^\\top \\phi(x, s)$$\n\nFootnote 2: We note that this setting covers the finite dimensional case because the range of the feature map can be\ntaken to be a finite dimensional subset of $\\ell_2$ \u2014 we are not assuming a universal feature map.\n\nOur goal is to use the training data to learn a weight vector w so as to minimize the expected loss\non newly drawn labeled data, $E_{x,y}[L(y, \\hat{s}_w(x))]$. We will assume an infinite sequence of training\ndata $(x_1, y_1), (x_2, y_2), (x_3, y_3), \\ldots$ drawn IID from the source distribution and use the following\nnotations.\n\n$$L(w, x, y) = L(y, \\hat{s}_w(x)) \\qquad L(w) = E_{x,y}[L(w, x, y)]$$\n\n$$L^* = \\inf_{w \\in \\ell_2} L(w) \\qquad \\hat{L}^n(w) = \\frac{1}{n} \\sum_{i=1}^n L(w, x_i, y_i)$$\n\nWe adopt the convention that in the definition of $L(w, x, y)$ we break ties in the definition of $\\hat{s}_w(x)$ in\nfavor of augmented labels of larger loss. 
We will refer to this as pessimistic tie breaking.\nHere we define latent structural log loss, hinge loss, ramp loss and probit loss as follows.\n\n$$L_{\\log}(w, x, y) = \\ln \\frac{1}{P_w(y|x)} = \\ln Z_w(x) - \\ln Z_w(x, y)$$\n\n$$Z_w(x) = \\sum_s \\exp(w^\\top \\phi(x, s)) \\qquad Z_w(x, y) = \\sum_z \\exp(w^\\top \\phi(x, (y, z)))$$\n\n$$L_{\\mathrm{hinge}}(w, x, y) = \\Big( \\max_s \\; w^\\top \\phi(x, s) + L(y, s) \\Big) - \\Big( \\max_z \\; w^\\top \\phi(x, (y, z)) \\Big)$$\n\n$$L_{\\mathrm{ramp}}(w, x, y) = \\Big( \\max_s \\; w^\\top \\phi(x, s) + L(y, s) \\Big) - \\Big( \\max_s \\; w^\\top \\phi(x, s) \\Big) = \\Big( \\max_s \\; w^\\top \\phi(x, s) + L(y, s) \\Big) - w^\\top \\phi(x, \\hat{s}_w(x))$$\n\n$$L_{\\mathrm{probit}}(w, x, y) = E_\\epsilon[L(y, \\hat{s}_{w+\\epsilon}(x))]$$\n\nIn the definition of probit loss we take $\\epsilon$ to be zero-mean unit-variance isotropic Gaussian noise \u2014\nfor each feature dimension j we have that $\\epsilon_j$ is an independent zero-mean unit-variance Gaussian\nvariable (see footnote 3). More generally we will write $E_\\epsilon[f(\\epsilon)]$ for the expectation of $f(\\epsilon)$ where $\\epsilon$ is Gaussian\nnoise. It is interesting to note that $L_{\\log}$, $L_{\\mathrm{hinge}}$, and $L_{\\mathrm{ramp}}$ are all naturally differences of convex\nfunctions and hence can be optimized by CCCP.\nIn the case of binary classification we have $S = Y = \\{-1, 1\\}$, $\\phi(x, y) = \\frac{1}{2} y \\phi(x)$, $L(y, y') = 1_{y \\ne y'}$\nand we define the margin $m = y w^\\top \\phi(x)$. We then have the following where the expression for\n$L_{\\mathrm{probit}}(w, x, y)$ assumes $\\|\\phi(x)\\| = 1$.\n\n$$L_{\\log}(w, x, y) = \\ln(1 + e^{-m}) \\qquad L_{\\mathrm{hinge}}(w, x, y) = \\max(0, 1 - m)$$\n\n$$L_{\\mathrm{ramp}}(w, x, y) = \\min(1, \\max(0, 1 - m)) \\qquad L_{\\mathrm{probit}}(w, x, y) = P_{\\epsilon \\sim N(0,1)}[\\epsilon \\ge m]$$\n\nReturning to the general case we consider the relationship between hinge and ramp loss. First we\nconsider the case where Z is a singleton set \u2014 the case of no latent structure. 
In this case hinge\nloss is convex in w \u2014 the hinge loss becomes a maximum of linear functions. Ramp loss, however,\nremains a difference of nonlinear convex functions even for Z singleton. Also, in the case where Z\nis singleton one can easily see that hinge loss is unbounded \u2014 wrong labels may score arbitrarily\nbetter than the given label. Hinge loss remains unbounded in the case of non-singleton Z. Ramp loss,\non the other hand, is bounded by 1 as follows.\n\n$$L_{\\mathrm{ramp}}(w, x, y) = \\Big( \\max_s \\; w^\\top \\phi(x, s) + L(y, s) \\Big) - w^\\top \\phi(x, \\hat{s}_w(x)) \\le \\Big( \\max_s \\; w^\\top \\phi(x, s) + 1 \\Big) - w^\\top \\phi(x, \\hat{s}_w(x)) = 1$$\n\nNext, as is emphasized in [3], we note that ramp loss is a tighter upper bound on task loss than\nis hinge loss. To see this we first note that it is immediate that $L_{\\mathrm{hinge}}(w, x, y) \\ge L_{\\mathrm{ramp}}(w, x, y)$.\n\nFootnote 3: In infinite dimension we have that with probability one $\\|\\epsilon\\| = \\infty$ and hence $w + \\epsilon$ is not in $\\ell_2$. The measure\nunderlying $E_\\epsilon[f(\\epsilon)]$ is a Gaussian process. 
However, we still have that for any unit-norm feature vector $\\phi$ the\ninner product $\\epsilon^\\top \\phi$ is distributed as a zero-mean unit-variance scalar Gaussian and $L_{\\mathrm{probit}}(w, x, y)$ is therefore\nwell defined.\n\nFurthermore, the following derivation shows $L_{\\mathrm{ramp}}(w, x, y) \\ge L(w, x, y)$ where we assume pessimistic\ntie breaking in the definition of $\\hat{s}_w(x)$.\n\n$$L_{\\mathrm{ramp}}(w, x, y) = \\Big( \\max_s \\; w^\\top \\phi(x, s) + L(y, s) \\Big) - w^\\top \\phi(x, \\hat{s}_w(x)) \\ge w^\\top \\phi(x, \\hat{s}_w(x)) + L(y, \\hat{s}_w(x)) - w^\\top \\phi(x, \\hat{s}_w(x)) = L(y, \\hat{s}_w(x))$$\n\nBut perhaps the most important property of ramp loss is the following.\n\n$$\\lim_{\\alpha \\to \\infty} L_{\\mathrm{ramp}}(\\alpha w, x, y) = L(w, x, y) \\tag{1}$$\n\nThis can be verified by noting that as \u03b1 goes to infinity the maximum of the first term in ramp loss\nmust occur at $s = \\hat{s}_w(x)$.\nNext we note that optimizing $L_{\\mathrm{ramp}}$ through subgradient descent (rather than CCCP) yields the\nfollowing update rule (here we ignore regularization).\n\n$$\\Delta w \\propto \\phi(x, \\hat{s}_w(x)) - \\phi(x, \\hat{s}^+_w(x, y)) \\tag{2}$$\n\n$$\\hat{s}^+_w(x, y) = \\operatorname{argmax}_s \\; w^\\top \\phi(x, s) + L(y, s)$$\n\nWe will refer to (2) as the ramp loss update rule. The following is proved in [10] under mild\nconditions on the probability distribution over pairs (x, y).\n\n$$\\nabla_w L(w) = \\lim_{\\alpha \\to \\infty} \\alpha \\, E_{x,y}\\big[ \\phi(x, \\hat{s}^+_{\\alpha w}(x, y)) - \\phi(x, \\hat{s}_w(x)) \\big] \\tag{3}$$\n\nEquation (3) expresses a relationship between the expected ramp loss update and the gradient of\ngeneralization loss. Significant empirical success has been achieved with the ramp loss update rule\nusing early stopping regularization [10]. 
But both (1) and (3) suggest that regularized ramp loss\nshould be consistent, as is confirmed here.\nFinally it is worth noting that $L_{\\mathrm{ramp}}$ and $L_{\\mathrm{probit}}$ are meaningful for an arbitrary prediction space S,\nlabel space Y, and loss function L(y, s) between a label and a prediction. Log loss and hinge loss\ncan be generalized to arbitrary prediction and label spaces provided that we assume a compatibility\nrelation between predictions and labels. The framework of independent prediction and label spaces\nis explored more fully in [5] where a notion of weak-label SVM is defined subsuming both ramp\nand hinge loss as special cases.\n\n3 Consistency of Probit Loss\n\nWe start with the consistency of probit loss which is easier to prove. We consider the following\nlearning rule where the regularization parameter $\\lambda_n$ is some given function of n.\n\n$$\\hat{w}_n = \\operatorname{argmin}_w \\; \\hat{L}^n_{\\mathrm{probit}}(w) + \\frac{\\lambda_n}{2n} \\|w\\|^2 \\tag{4}$$\n\nWe now prove the following fairly straightforward consequence of a generalization bound appearing\nin [6].\nTheorem 1 (Consistency of Probit loss). For $\\hat{w}_n$ defined by (4), if the sequence $\\lambda_n$ increases without\nbound, and $\\lambda_n \\ln n / n$ converges to zero, then with probability one over the draw of the infinite\nsample we have $\\lim_{n \\to \\infty} L_{\\mathrm{probit}}(\\hat{w}_n) = L^*$.\nUnfortunately, and in contrast to simple binary SVMs, for a latent binary SVM (an LSVM) there\nexists an infinite sequence $w_1, w_2, w_3, \\ldots$ such that $L_{\\mathrm{probit}}(w_n)$ approaches $L^*$ but $L(w_n)$ remains\nbounded away from $L^*$ (we omit the example here). However, the learning algorithm (4) achieves\nconsistency in the sense that the stochastic predictor defined by $\\hat{w}_n + \\epsilon$ where $\\epsilon$ is Gaussian noise\nhas a loss which converges to $L^*$.\nTo prove theorem 1 we start by reviewing the generalization bound of [6]. 
The departure point\nfor this generalization bound is the following PAC-Bayesian theorem where P and Q range over\nprobability measures on a given space of predictors and $L(Q)$ and $\\hat{L}^n(Q)$ are defined as expectations\nover selecting a predictor from Q.\n\nTheorem 2 (from [1], see also [4]). For any fixed prior distribution P and fixed \u03bb > 1/2 we\nhave that with probability at least 1 \u2212 \u03b4 over the draw of the training data the following holds\nsimultaneously for all Q.\n\n$$L(Q) \\le \\frac{1}{1 - \\frac{1}{2\\lambda}} \\left( \\hat{L}^n(Q) + \\lambda \\left( \\frac{KL(Q, P) + \\ln \\frac{1}{\\delta}}{n} \\right) \\right) \\tag{5}$$\n\nFor the space of linear predictors we take the prior P to be the zero-mean unit-variance Gaussian\ndistribution and for $w \\in \\ell_2$ we define the distribution $Q_w$ to be the unit-variance Gaussian centered\nat w. Since $KL(Q_w, P) = \\frac{1}{2}\\|w\\|^2$, this gives the following corollary of (5).\nCorollary 1 (from [6]). For fixed $\\lambda_n > 1/2$ we have that with probability at least 1 \u2212 \u03b4 over the\ndraw of the training data the following holds simultaneously for all $w \\in \\ell_2$.\n\n$$L_{\\mathrm{probit}}(w) \\le \\frac{1}{1 - \\frac{1}{2\\lambda_n}} \\left( \\hat{L}^n_{\\mathrm{probit}}(w) + \\lambda_n \\left( \\frac{\\frac{1}{2}\\|w\\|^2 + \\ln \\frac{1}{\\delta}}{n} \\right) \\right) \\tag{6}$$\n\nTo prove theorem 1 from (6) we consider an arbitrary unit-norm weight vector $w^*$ and an arbitrary\nscalar \u03b1 > 0. 
Setting \u03b4 to $1/n^2$, and noting that $\\hat{w}_n$ is the minimizer of the right hand side of (6),\nwe have the following with probability at least $1 - 1/n^2$ over the draw of the sample.\n\n$$L_{\\mathrm{probit}}(\\hat{w}_n) \\le \\frac{1}{1 - \\frac{1}{2\\lambda_n}} \\left( \\hat{L}^n_{\\mathrm{probit}}(\\alpha w^*) + \\lambda_n \\left( \\frac{\\frac{1}{2}\\alpha^2 + 2 \\ln n}{n} \\right) \\right) \\tag{7}$$\n\nA standard Chernoff bound argument yields that for $w^*$ and \u03b1 > 0 selected prior to drawing the\nsample, we have the following with probability at least $1 - 1/n^2$ over the choice of the sample.\n\n$$\\hat{L}^n_{\\mathrm{probit}}(\\alpha w^*) \\le L_{\\mathrm{probit}}(\\alpha w^*) + \\sqrt{\\frac{\\ln n}{n}} \\tag{8}$$\n\nCombining (7) and (8) with a union bound yields that with probability at least $1 - 2/n^2$ we have the\nfollowing.\n\n$$L_{\\mathrm{probit}}(\\hat{w}_n) \\le \\frac{1}{1 - \\frac{1}{2\\lambda_n}} \\left( L_{\\mathrm{probit}}(\\alpha w^*) + \\sqrt{\\frac{\\ln n}{n}} + \\lambda_n \\left( \\frac{\\frac{1}{2}\\alpha^2 + 2 \\ln n}{n} \\right) \\right)$$\n\nBecause the probability that the above inequality is violated goes as $1/n^2$, which is summable in n, the Borel-Cantelli lemma gives that with probability one over\nthe draw of the sample we have the following.\n\n$$\\lim_{n \\to \\infty} L_{\\mathrm{probit}}(\\hat{w}_n) \\le \\lim_{n \\to \\infty} \\frac{1}{1 - \\frac{1}{2\\lambda_n}} \\left( L_{\\mathrm{probit}}(\\alpha w^*) + \\sqrt{\\frac{\\ln n}{n}} + \\lambda_n \\left( \\frac{\\frac{1}{2}\\alpha^2 + 2 \\ln n}{n} \\right) \\right)$$\n\nUnder the conditions on $\\lambda_n$ given in the statement of theorem 1 we then have\n\n$$\\lim_{n \\to \\infty} L_{\\mathrm{probit}}(\\hat{w}_n) \\le L_{\\mathrm{probit}}(\\alpha w^*).$$\n\nBecause this holds with probability one for any \u03b1, the following must also hold with probability one.\n\n$$\\lim_{n \\to \\infty} L_{\\mathrm{probit}}(\\hat{w}_n) \\le \\lim_{\\alpha \\to \\infty} L_{\\mathrm{probit}}(\\alpha w^*) \\tag{9}$$\n\nNow consider\n\n$$\\lim_{\\alpha \\to \\infty} L_{\\mathrm{probit}}(\\alpha w, x, y) = \\lim_{\\alpha \\to \\infty} E_\\epsilon[L(\\alpha w + \\epsilon, x, y)] = \\lim_{\\sigma \\to 0} E_\\epsilon[L(w + \\sigma \\epsilon, x, y)].$$\n\nWe have that $\\lim_{\\sigma \\to 0} E_\\epsilon[L(w + \\sigma \\epsilon, x, y)]$ is determined by the augmented labels s that are tied for\nthe maximum value of $w^\\top \\phi(x, s)$. 
There is some probability distribution over these tied values that\noccurs in the limit of small \u03c3. Under the pessimistic tie breaking in the definition of $L(w, x, y)$ we\nthen get $\\lim_{\\alpha \\to \\infty} L_{\\mathrm{probit}}(\\alpha w, x, y) \\le L(w, x, y)$. This in turn gives the following.\n\n$$\\lim_{\\alpha \\to \\infty} L_{\\mathrm{probit}}(\\alpha w) = E_{x,y}\\Big[ \\lim_{\\alpha \\to \\infty} L_{\\mathrm{probit}}(\\alpha w, x, y) \\Big] \\le E_{x,y}[L(w, x, y)] = L(w) \\tag{10}$$\n\nCombining (9) and (10) yields $\\lim_{n \\to \\infty} L_{\\mathrm{probit}}(\\hat{w}_n) \\le L(w^*)$. Since for any $w^*$ this holds with\nprobability one, with probability one we also have $\\lim_{n \\to \\infty} L_{\\mathrm{probit}}(\\hat{w}_n) \\le L^*$. Finally we note\n$L_{\\mathrm{probit}}(w) = E_\\epsilon[L(w + \\epsilon)] \\ge L^*$ which then gives theorem 1.\n\n4 Consistency of Ramp Loss\n\nNow we consider the following ramp loss training equation.\n\n$$\\hat{w}_n = \\operatorname{argmin}_w \\; \\hat{L}^n_{\\mathrm{ramp}}(w) + \\frac{\\gamma_n}{2n} \\|w\\|^2 \\tag{11}$$\n\nThe main result of this paper is the following.\nTheorem 3 (Consistency of Ramp Loss). For $\\hat{w}_n$ defined by (11), if the sequence $\\gamma_n / \\ln^2 n$ increases\nwithout bound, and the sequence $\\gamma_n / (n \\ln n)$ converges to zero, then with probability one over the\ndraw of the infinite sample we have $\\lim_{n \\to \\infty} L_{\\mathrm{probit}}((\\ln n) \\hat{w}_n) = L^*$.\nAs with theorem 1, theorem 3 is derived from a finite sample generalization bound. The bound\nis derived from (6) by upper bounding $\\hat{L}^n_{\\mathrm{probit}}(w/\\sigma)$ in terms of $\\hat{L}^n_{\\mathrm{ramp}}(w)$. From section 3 we\nhave that $\\lim_{\\sigma \\to 0} L_{\\mathrm{probit}}(w/\\sigma, x, y) \\le L(w, x, y) \\le L_{\\mathrm{ramp}}(w, x, y)$. 
This can be converted to the\nfollowing lemma for finite \u03c3 where we recall that S is the set of augmented labels s = (y, z).\nLemma 1.\n\n$$L_{\\mathrm{probit}}\\left( \\frac{w}{\\sigma}, x, y \\right) \\le L_{\\mathrm{ramp}}(w, x, y) + \\sigma + \\sigma \\sqrt{8 \\ln \\frac{|S|}{\\sigma}}$$\n\nProof. We first prove that for any \u03c3 > 0 we have\n\n$$L_{\\mathrm{probit}}\\left( \\frac{w}{\\sigma}, x, y \\right) \\le \\sigma + \\max_{s:\\, m(s) \\le M} L(y, s) \\tag{12}$$\n\nwhere\n\n$$m(s) = w^\\top \\Delta\\phi(s) \\qquad \\Delta\\phi(s) = \\phi(x, \\hat{s}_w(x)) - \\phi(x, s) \\qquad M = \\sigma \\sqrt{8 \\ln \\frac{|S|}{\\sigma}}.$$\n\nTo prove (12) we note that for $m(s) > M$ we have the following where $P_\\epsilon[\\Phi(\\epsilon)]$ abbreviates\n$E_\\epsilon[1_{\\Phi(\\epsilon)}]$ (the third step uses $\\|\\Delta\\phi(s)\\| \\le 2$).\n\n$$P_\\epsilon[\\hat{s}_{w + \\sigma\\epsilon}(x) = s] \\le P_\\epsilon[(w + \\sigma\\epsilon)^\\top \\Delta\\phi(s) \\le 0] = P_\\epsilon\\left[ -\\epsilon^\\top \\Delta\\phi(s) \\ge \\frac{m(s)}{\\sigma} \\right] \\le P_{\\epsilon \\sim N(0,1)}\\left[ \\epsilon \\ge \\frac{M}{2\\sigma} \\right] \\le \\exp\\left( -\\frac{M^2}{8\\sigma^2} \\right) = \\frac{\\sigma}{|S|}$$\n\nSince $\\hat{s}_{w/\\sigma + \\epsilon}(x) = \\hat{s}_{w + \\sigma\\epsilon}(x)$, a union bound over the at most |S| augmented labels with $m(s) > M$ then gives (12) as follows.\n\n$$E_\\epsilon[L(y, \\hat{s}_{w + \\sigma\\epsilon}(x))] \\le P_\\epsilon[\\exists s :\\, m(s) > M, \\; \\hat{s}_{w + \\sigma\\epsilon}(x) = s] + \\max_{s:\\, m(s) \\le M} L(y, s) \\le \\sigma + \\max_{s:\\, m(s) \\le M} L(y, s)$$\n\nThe following calculation shows that (12) implies the lemma.\n\n$$L_{\\mathrm{probit}}\\left( \\frac{w}{\\sigma}, x, y \\right) \\le \\sigma + \\max_{s:\\, m(s) \\le M} L(y, s) \\le \\sigma + \\max_{s:\\, m(s) \\le M} \\big( L(y, s) - m(s) \\big) + M \\le \\sigma + \\max_s \\big( L(y, s) - m(s) \\big) + M = \\sigma + L_{\\mathrm{ramp}}(w, x, y) + M$$\n\nInserting lemma 1 into (6) we get the following.\nTheorem 4. 
For $\\lambda_n > 1/2$ we have that with probability at least 1 \u2212 \u03b4 over the draw of the training\ndata the following holds simultaneously for all w and \u03c3 > 0.\n\n$$L_{\\mathrm{probit}}\\left( \\frac{w}{\\sigma} \\right) \\le \\frac{1}{1 - \\frac{1}{2\\lambda_n}} \\left( \\hat{L}^n_{\\mathrm{ramp}}(w) + \\sigma + \\sigma \\sqrt{8 \\ln \\frac{|S|}{\\sigma}} + \\lambda_n \\left( \\frac{\\frac{\\|w\\|^2}{2\\sigma^2} + \\ln \\frac{1}{\\delta}}{n} \\right) \\right) \\tag{13}$$\n\nTo prove theorem 3 we now take $\\sigma_n = 1/\\ln n$ and $\\lambda_n = \\gamma_n / \\ln^2 n$. We then have that $\\hat{w}_n$ is the\nminimizer of the right hand side of (13). This observation yields the following for any unit-norm\nvector $w^*$ and scalar \u03b1 > 0 where we have set $\\delta = 1/n^2$.\n\n$$L_{\\mathrm{probit}}((\\ln n) \\hat{w}_n) \\le \\frac{1}{1 - \\frac{\\ln^2 n}{2\\gamma_n}} \\left( \\hat{L}^n_{\\mathrm{ramp}}(\\alpha w^*) + \\frac{1 + \\sqrt{8 \\ln(|S| \\ln n)}}{\\ln n} + \\frac{\\gamma_n \\alpha^2}{2n} + \\frac{2\\gamma_n}{n \\ln n} \\right) \\tag{14}$$\n\nAs in section 3, we use a Chernoff bound for the single vector $w^*$ and scalar \u03b1 to bound $\\hat{L}^n_{\\mathrm{ramp}}(\\alpha w^*)$\nin terms of $L_{\\mathrm{ramp}}(\\alpha w^*)$ and then take the limit as n \u2192 \u221e to get the following with probability one.\n\n$$\\lim_{n \\to \\infty} L_{\\mathrm{probit}}((\\ln n) \\hat{w}_n) \\le L_{\\mathrm{ramp}}(\\alpha w^*)$$\n\nThe remainder of the proof is the same as in section 3 but where we now use $\\lim_{\\alpha \\to \\infty} L_{\\mathrm{ramp}}(\\alpha w^*) = L(w^*)$, whose proof we omit.\n\n5 A Comparison of Convergence Rates\n\nTo compare the convergence rates implicit in (6) and (13) we note that in (13) we can optimize \u03c3 as a\nfunction of other quantities in the bound (see footnote 4). An approximately optimal value for \u03c3 is $(\\lambda_n \\|w\\|^2 / n)^{1/3}$\nwhich gives the following.\n\n$$L_{\\mathrm{probit}}\\left( \\frac{w}{\\sigma} \\right) \\le \\frac{1}{1 - \\frac{1}{2\\lambda_n}} \\left( \\hat{L}^n_{\\mathrm{ramp}}(w) + \\left( \\frac{\\lambda_n \\|w\\|^2}{n} \\right)^{1/3} \\left( \\frac{3}{2} + \\sqrt{8 \\ln \\frac{|S|}{\\sigma}} \\right) + \\frac{\\lambda_n \\ln \\frac{1}{\\delta}}{n} \\right) \\tag{15}$$\n\nWe have that (15) gives $\\tilde{O}((\\|\\hat{w}_n\\|^2 / n)^{1/3})$ as opposed to (6) which gives $O(\\|\\hat{w}_n\\|^2 / n)$. 
This\nsuggests that while probit loss and ramp loss are both consistent, ramp loss may converge more\nslowly.\n\n6 Discussion and Open Problems\n\nThe contributions of this paper are a consistency theorem for latent structural probit loss and both\na generalization bound and a consistency theorem for latent structural ramp loss. These bounds\nsuggest that probit loss converges more rapidly. However, we have only proved upper bounds on\ngeneralization loss and it remains possible that these upper bounds, while sufficient to show consistency,\nare not accurate characterizations of the actual generalization loss. Finding more definitive\nstatements, such as matching lower bounds, remains an open problem.\nThe definition of ramp loss used here is not the only one possible. In particular we can consider the\nfollowing variant.\n\n$$L'_{\\mathrm{ramp}}(w, x, y) = \\Big( \\max_s \\; w^\\top \\phi(x, s) \\Big) - \\Big( \\max_s \\; w^\\top \\phi(x, s) - L(y, s) \\Big)$$\n\nRelations (1) and (3) both hold for $L'_{\\mathrm{ramp}}$ as well as $L_{\\mathrm{ramp}}$. Experiments indicate that $L'_{\\mathrm{ramp}}$ performs\nsomewhat better than $L_{\\mathrm{ramp}}$ under early stopping of subgradient descent. However it seems\nthat it is not possible to prove a bound of the form of (15) for $L'_{\\mathrm{ramp}}$. A frustrating observation is\nthat $L'_{\\mathrm{ramp}}(0, x, y) = 0$. Finding a meaningful finite-sample statement for $L'_{\\mathrm{ramp}}$ remains an open\nproblem.\nThe isotropic Gaussian noise distribution used in the definition of $L_{\\mathrm{probit}}$ is not optimal. A uniformly\ntighter upper bound on generalization loss is achieved by optimizing the posterior in the\nPAC-Bayesian theorem. Finding a practical more optimal use of the PAC-Bayesian theorem also\nremains an open problem.\n\nFootnote 4: In the consistency proof it was more convenient to set \u03c3 = 1/ln n which is plausibly nearly optimal\nanyway.\n\nReferences\n[1] Olivier Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical\nLearning. Institute of Mathematical Statistics Lecture Notes Monograph Series,\n2007.\n\n[2] D. Chiang, K. Knight, and W. Wang. 11,001 new features for statistical machine translation.\nIn Proc. NAACL, 2009.\n\n[3] Chuong B. Do, Quoc Le, Choon Hui Teo, Olivier Chapelle, and Alex Smola. Tighter bounds\nfor structured estimation. In NIPS, 2008.\n\n[4] Pascal Germain, Alexandre Lacasse, Francois Laviolette, and Mario Marchand. PAC-Bayesian\nlearning of linear classifiers. In ICML, 2009.\n\n[5] Ross Girshick, Pedro Felzenszwalb, and David McAllester. Object detection with grammar\nmodels. In NIPS, 2011.\n\n[6] Joseph Keshet, David McAllester, and Tamir Hazan. PAC-Bayesian approach for minimization\nof phoneme error rate. In International Conference on Acoustics, Speech, and Signal Processing\n(ICASSP), 2011.\n\n[7] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models\nfor segmenting and labeling sequence data. In Proceedings of the Eighteenth International\nConference on Machine Learning, pages 282\u2013289, 2001.\n\n[8] P. Liang, A. Bouchard-C\u00f4t\u00e9, D. Klein, and B. Taskar. An end-to-end discriminative approach to\nmachine translation. In International Conference on Computational Linguistics and Association\nfor Computational Linguistics (COLING/ACL), 2006.\n\n[9] David McAllester. Generalization bounds and consistency for structured labeling. In G. Bakir,\nT. Hofmann, B. Scholkopf, A. Smola, B. Taskar, and S. V. N. Vishwanathan, editors, Predicting\nStructured Data. MIT Press, 2007.\n\n[10] David A. 
McAllester, Tamir Hazan, and Joseph Keshet. Direct loss minimization for structured\nprediction. In Advances in Neural Information Processing Systems 24, 2010.\n\n[11] A. Quattoni, S. Wang, L. P. Morency, M. Collins, and T. Darrell. Hidden conditional random\nfields. PAMI, 29, 2007.\n\n[12] R. Collobert, F. H. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In ICML,\n2006.\n\n[13] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural\nInformation Processing Systems 17, 2003.\n\n[14] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for\ninterdependent and structured output spaces. In Proceedings of the Twenty-First International\nConference on Machine Learning, 2004.\n\n[15] Chun-Nam John Yu and T. Joachims. Learning structural SVMs with latent variables. In International\nConference on Machine Learning (ICML), 2009.\n", "award": [], "sourceid": 1222, "authors": [{"given_name": "Joseph", "family_name": "Keshet", "institution": null}, {"given_name": "David", "family_name": "McAllester", "institution": null}]}