{"title": "A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 838, "page_last": 846, "abstract": "In this paper we propose an approximated learning framework for large scale graphical models and derive message passing algorithms for learning their parameters efficiently. We first relate CRFs and structured SVMs and show that in the CRF's primal a variant of the log-partition function, known as soft-max, smoothly approximates the hinge loss function of structured SVMs. We then propose an intuitive approximation for structured prediction problems using Fenchel duality based on a local entropy approximation that computes the exact gradients of the approximated problem and is guaranteed to converge. Unlike existing approaches, this allow us to learn graphical models with cycles and very large number of parameters efficiently. We demonstrate the effectiveness of our approach in an image denoising task. This task was previously solved by sharing parameters across cliques. In contrast, our algorithm is able to efficiently learn large number of parameters resulting in orders of magnitude better prediction.", "full_text": "A Primal-Dual Message-Passing Algorithm for\nApproximated Large Scale Structured Prediction\n\nTamir Hazan\nTTI Chicago\n\nhazan@ttic.edu\n\nAbstract\n\nRaquel Urtasun\n\nTTI Chicago\n\nrurtasun@ttic.edu\n\nIn this paper we propose an approximated structured prediction framework for\nlarge scale graphical models and derive message-passing algorithms for learn-\ning their parameters ef\ufb01ciently. We \ufb01rst relate CRFs and structured SVMs and\nshow that in CRFs a variant of the log-partition function, known as the soft-max,\nsmoothly approximates the hinge loss function of structured SVMs. 
We then propose an intuitive approximation for the structured prediction problem, using duality, based on a local entropy approximation, and derive an efficient message-passing algorithm that is guaranteed to converge. Unlike existing approaches, this allows us to efficiently learn graphical models with cycles and a very large number of parameters.\n\n1 Introduction\n\nUnlike standard supervised learning problems, which involve simple scalar outputs, structured prediction deals with structured outputs such as sequences, grids, or more general graphs. Ideally, one would want to make joint predictions on the structured labels instead of simply predicting each element independently, as this additionally accounts for the statistical correlations between label elements, as well as between training examples and their labels. These properties make structured prediction appealing for a wide range of applications such as image segmentation, image denoising, sequence labeling and natural language parsing.\n\nSeveral structured prediction models have been recently proposed, including log-likelihood models such as conditional random fields (CRFs, [10]) and structured support vector machines (structured SVMs) such as maximum-margin Markov networks (M3Ns [21]). For CRFs, learning is done by minimizing a convex function composed of a negative log-likelihood loss and a regularization term. Learning structured SVMs is done by minimizing the convex regularized structured hinge loss.\n\nDespite the convexity of the objective functions, finding the optimal parameters of these models can be computationally expensive since it involves exponentially many labels. When the label structure corresponds to a tree, learning can be done efficiently by using belief propagation as a subroutine; the sum-product algorithm is typically used in CRFs and the max-product algorithm in structured SVMs. 
In general, when the label structure corresponds to a general graph, one can compute neither the objective nor the gradient exactly, except for some special cases in structured SVMs, such as matchings and sub-modular functions [22]. Therefore, one usually resorts to approximate inference algorithms, cf. [2] for structured SVMs and [20, 12] for CRFs. However, approximate inference algorithms are computationally too expensive to be used as a subroutine of the learning algorithm, and therefore cannot be applied efficiently to large scale structured prediction problems. Also, it is not clear how to define a stopping criterion for these approaches, as the objective does not monotonically decrease: since the objective and the gradient are both approximated, this might result in poor approximations.\n\nIn this paper we propose an approximated structured prediction framework for large scale graphical models and derive message-passing algorithms for learning their parameters efficiently. We relate CRFs and structured SVMs, and show that in CRFs a variant of the log-partition function, known as the soft-max, smoothly approximates the hinge loss function of structured SVMs. We then propose an intuitive approximation for the structured prediction problem, using duality, based on a local entropy approximation, and derive an efficient message-passing algorithm that is guaranteed to converge. Unlike existing approaches, this allows us to efficiently learn graphical models with cycles and a very large number of parameters. We demonstrate the effectiveness of our approach in an image denoising task. This task was previously solved by sharing parameters across cliques. 
In contrast, our algorithm is able to efficiently learn a large number of parameters, resulting in orders of magnitude better prediction.\n\nIn the remainder of the paper, we first relate CRFs and structured SVMs in Section 3, present our approximate prediction framework in Section 4, derive a message-passing algorithm to solve the approximated problem efficiently in Section 5, and show our experimental evaluation.\n\n2 Regularized Structured Loss Minimization\n\nConsider a supervised learning setting with objects x ∈ X and labels y ∈ Y. In structured prediction the labels may be sequences, trees, grids, or other high-dimensional objects with internal structure. Consider a function Φ : X × Y → R^d that maps (x, y) pairs to feature vectors. Our goal is to construct a linear prediction rule\n\ny_θ(x) = argmax_{y ∈ Y} θ⊤Φ(x, y)\n\nwith parameters θ ∈ R^d, such that y_θ(x) is a good approximation to the true label of x. Intuitively, one would like to minimize the loss ℓ(y, y_θ) incurred by using θ to predict the label of x, given that the true label is y. However, since the prediction is norm-insensitive, this method can lead to overfitting. Therefore the parameters θ are typically learned by minimizing the norm-dependent loss\n\n∑_{(x,y) ∈ S} ℓ̄(θ, x, y) + (C/p) ‖θ‖_p^p,    (1)\n\ndefined over a training set S. The function ℓ̄ is a surrogate loss of the true loss ℓ(y, ŷ). 
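As a toy illustration of the linear prediction rule and the regularizer in (1) — this sketch is not from the paper, and the label space and features here are hypothetical — one can enumerate a small structured label space explicitly:

```python
import itertools
import numpy as np

def predict(theta, phi, x, labels):
    # Linear prediction rule: y_theta(x) = argmax_y theta . Phi(x, y).
    scores = [float(theta @ phi(x, y)) for y in labels]
    return labels[int(np.argmax(scores))]

# Hypothetical structured label space: binary 3-tuples (|Y| = 8).
labels = list(itertools.product([0, 1], repeat=3))

def phi(x, y):
    # Hypothetical features: one unary term per label element plus a
    # pairwise agreement count, giving a d = 4 dimensional feature vector.
    unary = [x[i] * y[i] for i in range(3)]
    pairwise = sum(int(y[i] == y[i + 1]) for i in range(2))
    return np.array(unary + [pairwise], dtype=float)

theta = np.array([1.0, -1.0, 1.0, 0.8])
x = (1.0, 1.0, 1.0)
y_hat = predict(theta, phi, x, labels)  # joint prediction over all 8 labels

# The regularization term of (1), here with p = 2 and C = 1:
C, p = 1.0, 2
reg = (C / p) * np.sum(np.abs(theta) ** p)
```

For real problems the label space is exponentially large, which is exactly why the paper replaces this brute-force argmax with message passing on the factor graph.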
In this paper we focus on structured SVMs and CRFs, which are the most common structured prediction models.\n\nThe first definition of structured SVMs used the structured hinge loss [21]\n\nℓ̄_hinge(θ, x, y) = max_{ŷ ∈ Y} { ℓ(y, ŷ) + θ⊤Φ(x, ŷ) } − θ⊤Φ(x, y).\n\nThe structured hinge loss upper bounds the true loss function, and corresponds to a maximum-margin approach that explicitly penalizes training examples (x, y) for which θ⊤Φ(x, y) < ℓ(y, y_θ(x)) + θ⊤Φ(x, y_θ(x)).\n\nThe second loss function that we consider is based on log-linear models, and is commonly used in CRFs [10]. Let the conditional distribution be\n\np(ŷ | x, y; θ) = (1 / Z(x, y)) exp( ℓ(y, ŷ) + θ⊤Φ(x, ŷ) ),    Z(x, y) = ∑_{ŷ ∈ Y} exp( ℓ(y, ŷ) + θ⊤Φ(x, ŷ) ),\n\nwhere ℓ(y, ŷ) is a prior distribution and Z(x, y) is the partition function. The surrogate loss function is then the negative log-likelihood under the parameters θ,\n\nℓ̄_log(θ, x, y) = ln ( 1 / p(y | x, y; θ) ).\n\nIn structured SVMs and CRFs a convex loss function and a convex regularization are minimized.\n\n3 One-parameter extension of CRFs and Structured SVMs\n\nIn CRFs one aims to minimize the regularized negative log-likelihood of the conditional distribution p(ŷ | x, y; θ), which decomposes into the log-partition function and the linear term θ⊤Φ(x, y). 
Hence the problem of minimizing the regularized loss in (1) with the loss function ℓ̄_log can be written as\n\n(CRF)    min_θ { ∑_{(x,y) ∈ S} ln Z(x, y) − d⊤θ + (C/p) ‖θ‖_p^p },\n\nwhere (x, y) ∈ S ranges over training pairs and d = ∑_{(x,y) ∈ S} Φ(x, y) is the vector of empirical means.\n\nStructured SVMs aim at minimizing the regularized hinge loss ℓ̄_hinge(θ, x, y), which measures the loss of the label y_θ(x) that most violates the training pair (x, y) ∈ S by more than ℓ(y, y_θ(x)). Since y_θ(x) is independent of the training label y, the structured SVM program takes the form\n\n(structured SVM)    min_θ { ∑_{(x,y) ∈ S} max_{ŷ ∈ Y} { ℓ(y, ŷ) + θ⊤Φ(x, ŷ) } − d⊤θ + (C/p) ‖θ‖_p^p },\n\nwhere (x, y) ∈ S ranges over the training pairs, and d is the vector of empirical means.\n\nIn the following we deal with both structured prediction tasks (i.e., structured SVMs and CRFs) as two instances of the same framework, by extending the partition function to norms, namely Z_ε(x, y) = ‖ exp( ℓ(y, ŷ) + θ⊤Φ(x, ŷ) ) ‖_{1/ε}, where the norm is computed over the vector ranging over ŷ ∈ Y. Using the norm formulation we move from the partition function, for ε = 1, to the maximum over the exponential function for ε = 0. 
Equivalently, we relate the log-partition and the max-function by the soft-max function\n\nln Z_ε(x, y) = ε ln ∑_{ŷ ∈ Y} exp( ( ℓ(y, ŷ) + θ⊤Φ(x, ŷ) ) / ε ).    (2)\n\nFor ε = 1 the soft-max function reduces to the log-partition function, and for ε = 0 it reduces to the max-function. Moreover, when ε → 0 the soft-max function is a smooth approximation of the max-function, in the same way the ℓ_{1/ε}-norm is a smooth approximation of the ℓ_∞-norm. This smooth approximation of the max-function is used in different areas of research [8]. We thus define the structured prediction problem as\n\n(structured-prediction)    min_θ { ∑_{(x,y) ∈ S} ln Z_ε(x, y) − d⊤θ + (C/p) ‖θ‖_p^p },    (3)\n\nwhich is a one-parameter extension of CRFs and structured SVMs, i.e., ε = 1 and ε = 0 respectively. Similarly to CRFs and structured SVMs [11, 16], one can use gradient methods to optimize the structured prediction objective. 
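The interpolation in (2) is easy to check numerically. The following sketch (illustrative only, not the paper's code) implements the soft-max with the usual max-shift stabilization, recovering the log-sum-exp at ε = 1 and approaching the max as ε → 0:

```python
import numpy as np

def soft_max(scores, eps):
    # eps * ln sum_y exp(scores[y] / eps), stabilized so that large
    # scores / eps do not overflow; eps = 0 gives the max-function limit.
    if eps == 0.0:
        return float(np.max(scores))
    s = np.asarray(scores, dtype=float) / eps
    m = s.max()
    return float(eps * (m + np.log(np.exp(s - m).sum())))

scores = [1.0, 2.0, 3.5]
log_partition = soft_max(scores, 1.0)  # eps = 1: ordinary ln Z
smooth_max = soft_max(scores, 0.01)    # eps -> 0: smooth approximation of max
hard_max = soft_max(scores, 0.0)       # eps = 0: the max itself
```

For any ε > 0 the soft-max lies between max and max + ε ln |Y|, which is the precise sense in which it smoothly approximates the max-function as ε → 0.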
The gradient with respect to θ_r takes the form\n\n∑_{(x,y) ∈ S} ∑_ŷ p_ε(ŷ | x, y; θ) φ_r(x, ŷ) − d_r + C |θ_r|^{p−1} sign(θ_r),    (4)\n\nwhere\n\np_ε(ŷ | x, y; θ) = (1 / Z_ε(x, y)^{1/ε}) exp( ( ℓ(y, ŷ) + θ⊤Φ(x, ŷ) ) / ε )    (5)\n\nis a probability distribution over the possible labels ŷ ∈ Y. When ε → 0 this probability distribution becomes concentrated around its maximal values, since all its elements are raised to the power of a very large number (i.e., 1/ε). Therefore for ε = 0 we obtain a structured SVM subgradient.\n\nIn many real-life applications the labels y ∈ Y are n-tuples, y = (y_1, ..., y_n), hence there are exponentially many labels in Y. The feature maps usually describe relations between subsets of label elements y_α ⊂ {y_1, ..., y_n}, and local interactions on single label elements y_v, namely\n\nφ_r(x, ŷ_1, ..., ŷ_n) = ∑_{v ∈ V_{r,x}} φ_{r,v}(x, ŷ_v) + ∑_{α ∈ E_{r,x}} φ_{r,α}(x, ŷ_α).    (6)\n\nEach feature φ_r(x, ŷ) can be described by its factor graph G_{r,x}, a bipartite graph with one set of nodes corresponding to V_{r,x} and the other set corresponding to E_{r,x}. An edge connects a single label node v ∈ V_{r,x} with a subset of label nodes α ∈ E_{r,x} if and only if y_v ∈ y_α. In the following we consider the factor graph G = ∪_r G_r, which is the union of all factor graphs. We denote by N(v) and N(α) the sets of neighbors of v and α, respectively, in the factor graph G. 
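The concentration of p_ε as ε → 0 can be seen with a few labels; this is a minimal illustrative sketch (not the paper's code), with three hypothetical loss-augmented scores:

```python
import numpy as np

def p_eps(scores, eps):
    # Distribution of eq. (5): proportional to exp(score / eps),
    # computed with the usual max-shift for numerical stability.
    s = np.asarray(scores, dtype=float) / eps
    w = np.exp(s - s.max())
    return w / w.sum()

scores = [1.0, 2.0, 3.5]       # l(y, yhat) + theta . Phi(x, yhat) per label
p_crf = p_eps(scores, 1.0)     # eps = 1: the CRF conditional distribution
p_sharp = p_eps(scores, 0.01)  # eps -> 0: nearly all mass on the maximizer
```

Under p_ε the expected features in (4) thus interpolate between CRF marginal statistics (ε = 1) and the loss-augmented argmax features of the structured SVM subgradient (ε → 0).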
For clarity in the presentation we consider a fully factorized loss, ℓ(y, ŷ) = ∑_{v=1}^n ℓ_v(y_v, ŷ_v), although our derivation naturally extends to any graphical model representing the interactions ℓ(y, ŷ).\n\nTo compute the soft-max and the marginal probabilities, p_ε(ŷ_v | x, y; θ) and p_ε(ŷ_α | x, y; θ), exponentially many labels have to be considered. This is in general computationally prohibitive, and thus one has to rely on inference and message-passing algorithms. When the factor graph has no cycles, inference can be computed efficiently using belief propagation, but in the presence of cycles inference can only be approximated [25, 26, 7, 5, 13]. There are two main problems when dealing with graphs with cycles and approximate inference: efficiency and accuracy. For graphs with cycles there are no guarantees on the number of steps the message-passing algorithm requires until convergence, therefore it is computationally costly to run it as a subroutine. Moreover, as these message-passing algorithms have no guarantees on the quality of their solution, the gradient and the objective function can only be approximated, and one cannot know whether the update rule decreased or increased the structured prediction objective. In contrast, in this work we propose to approximate the structured prediction problem and to efficiently solve the approximated problem exactly using message passing. Intuitively, we suggest a principled way to run the approximate inference updates for a few steps, while re-using the messages of previous steps to extract intermediate beliefs. These beliefs are used to update θ_r, although the intermediate beliefs may not agree on their marginal probabilities. 
This allows us to efficiently learn graphical models with a large number of parameters.\n\n4 Approximate Structured Prediction\n\nThe structured prediction objective in (3) and its gradient defined in (4) cannot be computed efficiently for general graphs, since both involve computing the soft-max function, ln Z_ε(x, y), and the marginal probabilities, p_ε(ŷ_v | x, y; θ) and p_ε(ŷ_α | x, y; θ), which take into account exponentially many elements ŷ ∈ Y. In the following we suggest an intuitive approximation for structured prediction, based on its dual formulation.\n\nSince the dual of the soft-max is the entropy barrier, it follows that the dual program for structured prediction is governed by the entropy function of the probabilities p_{x,y}(ŷ). The following duality formulation is known for CRFs when ε = 1 with ℓ_2^2 regularization, and for structured SVMs when ε = 0 with ℓ_2^2 regularization [11, 21, 1]. Here we derive the dual program for every ε and every ℓ_p^p regularization using conjugate duality:\n\nClaim 1 The dual program of the structured prediction program in (3) takes the form\n\nmax_{p_{x,y}(ŷ) ∈ ΔY}  ∑_{(x,y) ∈ S} ( ε H(p_{x,y}) + ∑_ŷ p_{x,y}(ŷ) ℓ(y, ŷ) ) − (C^{1−q}/q) ‖ ∑_{(x,y) ∈ S} ∑_{ŷ ∈ Y} p_{x,y}(ŷ) Φ(x, ŷ) − d ‖_q^q,\n\nwhere ΔY is the probability simplex over Y, H(p_{x,y}) = −∑_ŷ p_{x,y}(ŷ) ln p_{x,y}(ŷ) is the entropy, and q is the conjugate exponent of p (1/p + 1/q = 1).\n\nProof: In [6].\n\nWhen ε = 1 the CRF dual program reduces to the well-known duality relation between the log-likelihood and the entropy. 
When ε = 0 we obtain the dual formulation of structured SVMs, which emphasizes the duality relation between the max-function and the probability simplex. In general, Claim 1 describes the relation between the soft-max function and the entropy barrier over the probability simplex.\n\nThe dual program in Claim 1 considers the probabilities p_{x,y}(ŷ) over exponentially many labels ŷ ∈ Y, as well as their entropies H(p_{x,y}). However, when we take into account the graphical model imposed by the features, G_{r,x}, we observe that the linear terms in the dual formulation consider the marginal probabilities p_{x,y}(ŷ_v) and p_{x,y}(ŷ_α). We thus propose to replace the marginal probabilities with their corresponding beliefs, and to replace the entropy term by the local entropies ∑_α c_α H(b_{x,y,α}) + ∑_v c_v H(b_{x,y,v}) over the beliefs. Whenever ε, c_v, c_α ≥ 0, the approximated dual is concave and it corresponds to a convex primal program. By deriving its dual we obtain our approximated structured prediction, for which we construct an efficient algorithm in Section 5.\n\nFigure 1: Gaussian and bimodal noise: comparison of our approach to loopy belief propagation and mean field approximations when optimizing using BFGS, SGD and SMD. Test error (%):\n\n              Gaussian noise                          Bimodal noise\n            I1       I2       I3       I4          I1       I2       I3       I4\nLBP-SGD     2.7344   2.4707   3.2275   2.3193      7.2510   5.2905   4.4751   6.8164\nLBP-SMD     2.7344   2.4731   3.2324   2.3145      7.2583   5.2954   4.4678   6.7578\nLBP-BFGS    2.7417   2.4194   3.1299   2.4023      6.6211   5.2148   4.3994   6.0278\nMF-SGD      3.0469   3.0762   4.1382   2.9053      53.6035  10.0488  41.0718  29.6338\nMF-SMD      2.9688   3.0640   3.8721   14.4360     --       --       --       --\nMF-BFGS     3.0005   2.7783   3.6157   2.4780      7.2510   5.2661   4.6167   6.4624\nOurs        0.0488   0.0073   0.1294   0.1318      0.9277   0.0537   0.0244   0.1221\n\n
Note that our approach significantly outperforms all the baselines. MF-SMD did not work for the bimodal noise.\n\nTheorem 1 The approximation of the structured prediction program in (3) takes the form\n\nmin_{λ_{x,y,v→α}, θ}  ∑_{(x,y) ∈ S, v} ε c_v ln ∑_{ŷ_v} exp( ( ℓ_v(y_v, ŷ_v) + ∑_{r: v ∈ V_{r,x}} θ_r φ_{r,v}(x, ŷ_v) − ∑_{α ∈ N(v)} λ_{x,y,v→α}(ŷ_v) ) / (ε c_v) )\n+ ∑_{(x,y) ∈ S, α} ε c_α ln ∑_{ŷ_α} exp( ( ∑_{r: α ∈ E_{r,x}} θ_r φ_{r,α}(x, ŷ_α) + ∑_{v ∈ N(α)} λ_{x,y,v→α}(ŷ_v) ) / (ε c_α) ) − d⊤θ + (C/p) ‖θ‖_p^p\n\nProof: In [6].\n\n5 Message-Passing Algorithm for Approximated Structured Prediction\n\nIn the following we describe a block coordinate descent algorithm for the approximated structured prediction program of Theorem 1. Coordinate descent methods are appealing as they optimize a small number of variables while holding the rest fixed, therefore they are efficient and can be easily parallelized. Since the primal program is lower bounded by the dual program, the primal objective function is guaranteed to converge. We begin by describing how to find the optimal set of variables related to a node v in the graphical model, namely λ_{x,y,v→α}(ŷ_v) for every α ∈ N(v), every ŷ_v and every (x, y) ∈ S.\n\nLemma 1 Given a vertex v in the graphical model, the optimal λ_{x,y,v→α}(ŷ_v) for every α ∈ N(v), ŷ_v ∈ Y_v, (x, y) ∈ S in the approximated program of Theorem 1 satisfies\n\nμ_{x,y,α→v}(ŷ_v) = ε c_α ln ∑_{ŷ_α ∖ ŷ_v} exp( ( ∑_{r: α ∈ E_{r,x}} θ_r φ_{r,α}(x, ŷ_α) + ∑_{u ∈ N(α) ∖ v} λ_{x,y,u→α}(ŷ_u) ) / (ε c_α) )\n\nλ_{x,y,v→α}(ŷ_v) = (c_α / ĉ_v) ( ℓ_v(y_v, ŷ_v) + ∑_{r: v ∈ V_{r,x}} θ_r φ_{r,v}(x, ŷ_v) + ∑_{β ∈ N(v)} μ_{x,y,β→v}(ŷ_v) ) − μ_{x,y,α→v}(ŷ_v) + c_{x,y,v→α}\n\nfor every constant c_{x,y,v→α} (1), where ĉ_v = c_v + ∑_{α ∈ N(v)} c_α. In particular, if either ε and/or c_α are zero then μ_{x,y,α→v} corresponds to the ℓ_∞ norm and can be computed by the max-function. Moreover, if either ε and/or c_α are zero in the objective, then the optimal λ_{x,y,v→α} can be computed for any arbitrary c_α > 0, and similarly for c_v > 0.\n\nProof: In [6].\n\nIt is computationally appealing to find the optimal λ_{x,y,v→α}(ŷ_v). When the optimal value cannot be found, one usually takes a step in the direction of the negative gradient, and the objective function needs to be computed to ensure that the chosen step size reduces the objective. Obviously, computing the objective function at every iteration significantly slows the algorithm. When the optimal λ_{x,y,v→α}(ŷ_v) can be found, the block coordinate descent algorithm can be executed efficiently in a distributed manner, since every λ_{x,y,v→α}(ŷ_v) can be computed independently. The only interactions occur when computing the normalization step c_{x,y,v→α}. This allows for easy computation on GPUs.\n\nWe now turn to describe how to change θ in order to improve the approximated structured prediction. Since we cannot find the optimal θ_r while holding the rest fixed, we perform a step in the direction of the negative gradient when ε, c_α, c_v are positive, or in the direction of the subgradient otherwise. We choose the step size η to guarantee a descent on the objective.\n\nLemma 2 The gradient of the approximated structured prediction program in Theorem 1 with respect to θ_r equals\n\n∑_{(x,y) ∈ S, v ∈ V_{r,x}, ŷ_v} b_{x,y,v}(ŷ_v) φ_{r,v}(x, ŷ_v) + ∑_{(x,y) ∈ S, α ∈ E_{r,x}, ŷ_α} b_{x,y,α}(ŷ_α) φ_{r,α}(x, ŷ_α) − d_r + C · |θ_r|^{p−1} · sign(θ_r),\n\nwhere\n\nb_{x,y,v}(ŷ_v) ∝ exp( ( ℓ_v(y_v, ŷ_v) + ∑_{r: v ∈ V_{r,x}} θ_r φ_{r,v}(x, ŷ_v) − ∑_{α ∈ N(v)} λ_{x,y,v→α}(ŷ_v) ) / (ε c_v) )\n\nb_{x,y,α}(ŷ_α) ∝ exp( ( ∑_{r: α ∈ E_{r,x}} θ_r φ_{r,α}(x, ŷ_α) + ∑_{v ∈ N(α)} λ_{x,y,v→α}(ŷ_v) ) / (ε c_α) ).\n\nHowever, if either ε and/or c_α equal zero, then the beliefs b_{x,y,α}(ŷ_α) can be taken from the set of probability distributions over the support of the max-beliefs, namely b_{x,y,α}(ŷ*_α) > 0 only if ŷ*_α ∈ argmax_{ŷ_α} { ∑_{r: α ∈ E_{r,x}} θ_r φ_{r,α}(x, ŷ_α) + ∑_{v ∈ N(α)} λ_{x,y,v→α}(ŷ_v) }. Similarly for b_{x,y,v}(ŷ*_v) whenever ε and/or c_v equal zero.\n\nProof: In [6].\n\n(1) For numerical stability in our algorithm we set c_{x,y,v→α} such that ∑_{ŷ_v} λ_{x,y,v→α}(ŷ_v) = 0.\n\nLemmas 1 and 2 describe the coordinate descent algorithm for the approximated structured prediction in Theorem 1. We refer the reader to [6] for a summary of our algorithm.\n\nThe coordinate descent algorithm is guaranteed to converge, as it monotonically decreases the approximated structured prediction objective in Theorem 1, which is lower bounded by its dual program. However, convergence to the global minimum cannot be guaranteed in all cases. In particular, for ε = 0 the coordinate descent on the approximated structured SVM is not guaranteed to converge to its global minimum, unless one uses subgradient methods, which are not monotonically decreasing. Moreover, even when we are guaranteed to converge to the global minimum, i.e., ε, c_α, c_v > 0, the sequence of variables λ_{x,y,v→α}(ŷ_v) generated by the algorithm is not guaranteed to converge to an optimal solution, nor to be bounded. As a trivial example, adding an arbitrary constant to the variables, λ_{x,y,v→α}(ŷ_v) + c, does not change the objective value, hence the algorithm can generate non-decreasing unbounded sequences. 
However, the beliefs generated by the algorithm are bounded and guaranteed to converge to the solution of the dual approximated structured prediction problem.\n\nClaim 2 The block coordinate descent algorithm in Lemmas 1 and 2 monotonically reduces the approximated structured prediction objective in Theorem 1, therefore the value of its objective is guaranteed to converge. Moreover, if ε, c_α, c_v > 0, the objective is guaranteed to converge to the global minimum, and its sequence of beliefs is guaranteed to converge to the unique solution of the approximated structured prediction dual.\n\nProof: In [6].\n\nThe convergence result has a practical implication, describing the ways we can estimate the convergence of the algorithm: either by the primal objective, the dual objective or the beliefs. The approximated structured prediction can also be used for non-concave entropy approximations, such as the Bethe entropy, where c_α > 0 and c_v < 0. In this case the algorithm is well defined, and its stationary points correspond to the stationary points of the approximated structured prediction and its dual. Intuitively, this statement holds since the coordinate descent algorithm iterates over points λ_{x,y,v→α}(ŷ_v), θ_r with vanishing gradients. Equivalently, the algorithm iterates over saddle points λ_{x,y,v→α}(ŷ_v), b_{x,y,v}(ŷ_v), b_{x,y,α}(ŷ_α) and (θ_r, z_r) of the Lagrangian defined in Theorem 1. Whenever the dual program is concave these saddle points are optimal points of the convex primal, but for a non-concave dual the algorithm iterates over saddle points. This is summarized in the claim below:\n\nClaim 3 Whenever the approximated structured prediction is non-convex, i.e., ε, c_α > 0 and c_v < 0, the algorithm in Lemmas 1 and 2 is not guaranteed to converge, but whenever it converges it reaches a stationary point of the primal and dual approximated structured prediction programs.\n\nProof: In [6].\n\nFigure 2: Denoising results: Gaussian (left) and bimodal (right) noise.\n\n6 Experimental evaluation\n\nWe performed experiments on 2D grids since they are widely used to represent images, and have many cycles. We first investigate the role of ε in the accuracy and running time of our algorithm, for fixed c_α, c_v = 1. We used a 10 × 10 binary image and randomly generated 10 corrupted samples, flipping every bit with 0.2 probability. We trained the model using CRF, structured SVM and our approach for ε = {1, 0.5, 0.01, 0}, ranging from approximated CRFs (ε = 1) to approximated structured SVMs (ε = 0) and their smooth version (ε = 0.01). The runtime for CRF and structured SVM is orders of magnitude larger than for our method, since they require exact inference for every training example and every iteration of the algorithm. For the approximated structured prediction, the runtimes are 323, 324, 326 and 294 seconds for ε = {1, 0.5, 0.01, 0}, respectively. As ε gets smaller the runtime slightly increases, but it decreases for ε = 0 since the ℓ_∞ norm is computed efficiently using the max function. However, ε = 0 is less accurate than ε = 0.01: when the approximated structured SVM converges, the gap between the primal and dual objectives was 1.3, and only 10^−5 for ε > 0. 
This is to be expected since the approximated structured SVM is non-smooth (Claim 2), and we did not use subgradient methods to ensure convergence to the optimal solution.\n\nWe generated test images in a similar fashion, using the same ε for training and testing. In this setting both CRF and structured SVM performed well, with 2 misclassifications. For the approximated structured prediction, we obtained 2 misclassifications for ε > 0. We also evaluated the quality of the solution using different values of ε for training and inference [24]. When predicting with a smaller ε than the one used for learning, the results are marginally worse than when predicting with the same ε. However, when predicting with a larger ε, the results get significantly worse, e.g., learning with ε = 0.01 and predicting with ε = 1 results in 10 errors, and only 2 when ε = 0.01.\n\nThe main advantage of our algorithm is that it can efficiently learn many parameters. We compared, on a 5 × 5 dataset, a model learned with different parameters for every edge and vertex (≈ 300 parameters) against a model learned with parameters shared among the vertices and edges (2 parameters for edges and 2 for vertices) [9]. Using a large number of parameters increases performance: sharing parameters resulted in 16 misclassifications, while optimizing over the 300 parameters resulted in 2 errors. Our algorithm avoids overfitting in this case; we conjecture this is due to the regularization.\n\nWe now compare our approach to state-of-the-art CRF solvers on the binary image dataset of [9], which consists of 4 different 64 × 64 base images. Each base image was corrupted 50 times with each type of noise. Following [23], we trained different models to denoise each individual image, using 40 examples for training and 10 for test. 
We compare our approach to approximating the conditional likelihood using loopy belief propagation (LBP) and the mean field approximation (MF). For each of these approximations, we use stochastic gradient descent (SGD), stochastic meta-descent (SMD) and BFGS to learn the parameters. We do not report pseudo-likelihood (PL) results since it did not work; the same behavior of PL was noticed by [23]. To reduce the computational complexity and improve the chances of convergence, [9, 23] forced their parameters to be shared across all nodes, such that ∀i, θ_i = θ_n and ∀i, ∀j ∈ N(i), θ_ij = θ_e. In contrast, since our approach is efficient, we can exploit the full flexibility of the graph and learn more than 10,000 parameters. This is computationally prohibitive with the baselines. We use the pixel values as node potentials and an Ising model with only bias for the edge potentials, i.e., φ_{i,j} = [1, −1; −1, 1]. For all experiments we use ε = 1 and p = 2. For the baselines, we use the code, features and optimal parameters of [23].\n\nUnder the first noise model, each pixel was corrupted via i.i.d. Gaussian noise with mean 0 and standard deviation 0.3. Fig. 1 depicts test error (in %) for the different base images (i.e., I1, ..., I4). Note that our approach considerably outperforms the loopy belief propagation and mean field approximations for all optimization criteria (BFGS, SGD, SMD). For example, for the first base image the error of our approach is 0.0488%, which is equivalent to a 2 pixel error on average. In contrast, the best baseline gets 112 pixels wrong on average. Fig. 2 (left) depicts test examples as well as our denoising results.\n\nFigure 3: Convergence. Primal and dual train errors for I1: Gaussian (left) and bimodal (right) noise.\n\n
Note that our approach is able to cope with large amounts of noise.
Under the second noise model, each pixel was corrupted with an independent mixture of Gaussians. For each class, a mixture of 2 Gaussians with equal mixing weights was used, yielding bimodal noise. The mixture parameters were (0.08, 0.03) and (0.46, 0.03) for the first class, and (0.55, 0.02) and (0.42, 0.10) for the second class, where (a, b) denotes a Gaussian with mean a and standard deviation b. Fig. 1 depicts the test error (in %) for the different base images. As before, our approach outperforms all the baselines. We do not report MF-SMD results since it did not work. Denoised images are shown in Fig. 2 (right).
We now show that our algorithm converges in a few iterations. Fig. 3 depicts the primal and dual training errors as a function of the number of iterations. Note that our algorithm converges, and that the dual and primal values are very tight after a few iterations.
7 Related Work
For the special case of CRFs, the idea of approximating the entropy function with local entropies appears in [24, 3]. In particular, [24] proved that using a concave entropy approximation gives robust prediction. [3] optimized the non-concave Bethe entropy, cα = 1, cv = 1 − |N(v)|, by repeatedly maximizing a concave approximation of it, thus converging in a few concave iterations. Our work differs from these works in two aspects: we derive an efficient algorithm in Section 5 for the concave approximated program (cα, cv > 0), and our framework and algorithm include structured SVMs, as well as their smooth approximation when ε → 0.
Some forms of approximated structured prediction were investigated for the special case of CRFs. In [18] a similar program was used, but without the Lagrange multipliers λx,y,v→α(ŷv) and with no regularization, i.e., C = 0.
As a result, the local log-partition functions are unrelated, and efficient counting algorithms can be used for learning. In [3] a different approximated program was derived for cα = 1, cv = 0, which was solved with the BFGS convex solver. Our work is different as it considers efficient algorithms for approximated structured prediction that take advantage of the graphical model by sending messages along its edges. We show in the experiments that this significantly improves the run-time of the algorithm. Also, our approximated structured prediction includes as special cases the approximated CRF, for ε = 1, and the approximated structured SVM, for ε = 0. Moreover, we describe how to smoothly approximate the structured SVM to avoid the shortcomings of subgradient methods, by simply setting ε → 0.
Some forms of approximated structured SVMs were addressed in [19] with the structured SMO algorithm. Independently, [14] presented an approximated structured SVM program and a message-passing algorithm, which reduce to Theorem 1 and Lemma 1 with ε = 0 and cα = 1, cv = 1. However, in this algorithm the messages are not guaranteed to be bounded. The main difference of [14] from our work is that they lack the dual formulation, which we use to prove that the smooth structured SVM approximation, with ε → 0, is guaranteed to converge to the optimum, and that the dual variables, i.e., the beliefs, are guaranteed to converge to the optimal beliefs. The relation between the margin and the soft-max is similar to the one used in [17]. Independently, [4, 15] described the connection between the structured SVM loss and the CRF loss. [15] also presented the one-parameter extension of CRFs and structured SVMs described in (3).
8 Conclusion and Discussion
In this paper we have related CRFs and structured SVMs and shown that the soft-max, a variant of the log-partition function, smoothly approximates the structured SVM hinge loss.
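The ε → 0 behavior underlying this relation can be seen directly on the soft-max itself: ε log Σ exp(s/ε) decreases to the maximum score as ε shrinks. A minimal numerical sketch, with an illustrative score vector rather than any quantity from the paper:

```python
import math

def soft_max(scores, eps):
    """eps * log(sum(exp(s / eps))): a smoothed maximum.

    Shifting by the true max keeps the exponentials stable for small eps."""
    m = max(scores)
    return m + eps * math.log(sum(math.exp((s - m) / eps) for s in scores))

scores = [1.0, 2.5, 2.4]        # illustrative scores, max = 2.5
for eps in (1.0, 0.1, 0.01):
    print(eps, soft_max(scores, eps))
# the values decrease toward max(scores) = 2.5 as eps -> 0
```

This is the sense in which ε interpolates between the CRF-style log-partition (ε = 1) and the structured SVM maximum (ε = 0).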
We have also proposed an approximation for structured prediction problems based on local entropy approximations, and derived an efficient message-passing algorithm that is guaranteed to converge, even for general graphs. We have demonstrated the effectiveness of our approach for learning graphs with a large number of parameters. We plan to investigate other domains of application, such as image segmentation.
References
[1] M. Collins, A. Globerson, T. Koo, X. Carreras, and P.L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. JMLR, 9:1775–1822, 2008.
[2] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In ICML, pages 304–311. ACM, 2008.
[3] V. Ganapathi, D. Vickrey, J. Duchi, and D. Koller. Constrained approximate maximum entropy learning of Markov random fields. In UAI, 2008.
[4] K. Gimpel and N.A. Smith. Softmax-margin CRFs: Training log-linear models with cost functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 733–736. Association for Computational Linguistics, 2010.
[5] T. Hazan and A. Shashua. Norm-product belief propagation: Primal-dual message-passing for approximate inference. arXiv preprint arXiv:0903.3127, 2009.
[6] T. Hazan and R. Urtasun. Approximated structured prediction for learning large scale graphical models. arXiv preprint arXiv:1006.2899, 2010.
[7] T. Heskes. Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. Journal of Artificial Intelligence Research, 26(1):153–190, 2006.
[8] J.K. Johnson, D.M. Malioutov, and A.S.
Willsky. Lagrangian relaxation for MAP estimation in graphical models. In Proceedings of the Allerton Conference on Control, Communication and Computing, 2007.
[9] S. Kumar and M. Hebert. Discriminative fields for modeling spatial dependencies in natural images. In NIPS. MIT Press, 2003.
[10] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.
[11] G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. NIPS, 1:447–454, 2002.
[12] A. Levin and Y. Weiss. Learning to combine bottom-up and top-down segmentation. In ECCV, 2006.
[13] T. Meltzer, A. Globerson, and Y. Weiss. Convergent message passing algorithms: a unifying view. In UAI, 2009.
[14] O. Meshi, D. Sontag, T. Jaakkola, and A. Globerson. Learning efficiently with approximate inference via dual losses. In ICML, 2010.
[15] P. Pletscher, C. Ong, and J. Buhmann. Entropy and margin maximization for structured output learning. In Machine Learning and Knowledge Discovery in Databases, pages 83–98, 2010.
[16] N. Ratliff, J.A. Bagnell, and M. Zinkevich. Subgradient methods for maximum margin structured learning. In ICML Workshop on Learning in Structured Output Spaces, 2006.
[17] F. Sha and L.K. Saul. Large margin hidden Markov models for automatic speech recognition. NIPS, 19:1249, 2007.
[18] C. Sutton and A. McCallum. Piecewise training for structured prediction. Machine Learning, 77(2):165–194, 2009.
[19] B. Taskar. Learning structured prediction models: a large margin approach. PhD thesis, Stanford University, 2005. Advisor: Daphne Koller.
[20] B. Taskar, P. Abbeel, and D. Koller.
Discriminative probabilistic models for relational data. In UAI, pages 895–902, 2002.
[21] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. NIPS, 16:51, 2004.
[22] B. Taskar, S. Lacoste-Julien, and M.I. Jordan. Structured prediction, dual extragradient and Bregman projections. JMLR, 7:1653–1684, 2006.
[23] S. Vishwanathan, N. Schraudolph, M. Schmidt, and K. Murphy. Accelerated training of conditional random fields with stochastic meta-descent. In ICML, 2006.
[24] M.J. Wainwright. Estimating the wrong graphical model: Benefits in the computation-limited setting. JMLR, 7:1829–1859, 2006.
[25] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
[26] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282–2312, 2005.