{"title": "Inference by Learning: Speeding-up Graphical Model Optimization via a Coarse-to-Fine Cascade of Pruning Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 2105, "page_last": 2113, "abstract": "We propose a general and versatile framework that significantly speeds-up graphical model optimization while maintaining an excellent solution accuracy. The proposed approach, refereed as Inference by Learning or IbyL, relies on a multi-scale pruning scheme that progressively reduces the solution space by use of a coarse-to-fine cascade of learnt classifiers. We thoroughly experiment with classic computer vision related MRF problems, where our novel framework constantly yields a significant time speed-up (with respect to the most efficient inference methods) and obtains a more accurate solution than directly optimizing the MRF. We make our code available on-line.", "full_text": "Inference by Learning: Speeding-up Graphical\n\nModel Optimization via a Coarse-to-Fine Cascade of\n\nPruning Classi\ufb01ers\n\nBruno Conejo\u2217\n\nGPS Division, California Institute of Technology, Pasadena, CA, USA\nUniversite Paris-Est, Ecole des Ponts ParisTech, Marne-la-Vallee, France\n\nbconejo@caltech.edu\n\nUniversite Paris-Est, Ecole des Ponts ParisTech, Marne-la-Vallee, France\n\nnikos.komodakis@enpc.fr\n\nNikos Komodakis\n\nSebastien Leprince & Jean Philippe Avouac\n\nGPS Division, California Institute of Technology, Pasadena, CA, USA\n\nleprincs@caltech.edu avouac@gps.caltech.edu\n\nAbstract\n\nWe propose a general and versatile framework that signi\ufb01cantly speeds-up graph-\nical model optimization while maintaining an excellent solution accuracy. The\nproposed approach, refereed as Inference by Learning or in short as IbyL, relies\non a multi-scale pruning scheme that progressively reduces the solution space by\nuse of a coarse-to-\ufb01ne cascade of learnt classi\ufb01ers. 
We thoroughly experiment with classic computer vision related MRF problems, where our novel framework consistently yields a significant time speed-up (with respect to the most efficient inference methods) and obtains a more accurate solution than directly optimizing the MRF. We make our code available on-line [4].\n\n1 Introduction\nGraphical models in computer vision Optimization of undirected graphical models such as Markov Random Fields (MRF) or Conditional Random Fields (CRF) is of fundamental importance in computer vision. Currently, a wide spectrum of problems, including stereo matching [25, 13], optical flow estimation [27, 16], image segmentation [23, 14], image completion and denoising [10], and object recognition [8, 2], rely on finding the mode of the distribution associated with the random field, i.e., the Maximum A Posteriori (MAP) solution. MAP estimation, often referred to as the labeling problem, is posed as an energy minimization task. While this task is NP-hard, strong local optima or even the optimal solutions can be obtained [3]. Over the past 20 years, tremendous progress has been made in terms of computational cost, and many different techniques have been developed, such as move making approaches [3, 19, 22, 21, 28] and message passing methods [9, 32, 18, 20]. A review of their effectiveness has been published in [31, 12]. 
Nevertheless, the ever increasing dimensionality of the problems and the need for larger solution space greatly challenge these techniques as even the best ones have a highly super-linear computational cost and memory requirement relative to the dimensionality of the problem.\n\n\u2217This work was supported by USGS through the Measurements of surface ruptures produced by continental earthquakes from optical imagery and LiDAR project (USGS Award G13AP00037), the Terrestrial Hazard Observation and Reporting Center of Caltech, and the Moore foundation through the Advanced Earth Surface Observation Project (AESOP Grant 2808).\n\nOur goal in this work is to develop a general MRF optimization framework that can provide a significant speed-up for such methods while maintaining the accuracy of the estimated solutions. Our strategy for accomplishing this goal will be to gradually reduce (by a significant amount) the size of the discrete state space via exploiting the fact that an optimal labeling is typically far from being random. Indeed, most MRF optimization problems favor solutions that are piecewise smooth. In fact, this spatial structure of the MAP solution has already been exploited in prior work to reduce the dimensionality of the solution space.\n\nRelated work A first set of methods of this type, referred to here for short as the super-pixel approach [30], defines a grouping heuristic to merge many random variables together in super-pixels. The grouping heuristic can be energy-aware if it is based on the energy to minimize as in [15], or energy-agnostic otherwise as in [7, 30]. All random variables belonging to the same super-pixel are forced to take the same label. This restricts the solution space and results in an optimization speed-up as a smaller number of variables needs to be optimized. The super-pixel approach has been applied to segmentation, stereo and object recognition [15]. 
However, if the grouping heuristic merges\nvariables that should have a different label in the MAP solution, only an approximate labeling is\ncomputed. In practice, de\ufb01ning general yet ef\ufb01cient grouping heuristics is dif\ufb01cult. This represents\nthe key limitation of super-pixel approaches.\nOne way to overcome this limitation is to mimic the multi-scale scheme used in continuous opti-\nmization by building a coarse to \ufb01ne representation of the graphical model. Similarly to the super-\npixel approach, such a multi-scale method, relies again on a grouping of variables for building the\nrequired coarse to \ufb01ne representation [17, 24, 26]. However, contrary to the super-pixel approach,\nif the grouping merges variables that should have a different label in the MAP solution, there al-\nways exists a scale at which these variables are not grouped. This property thus ensures that the\nMAP solution can still be recovered. Nevertheless, in order to manage a signi\ufb01cant speed-up of\nthe optimization, the multi-scale approach also needs to progressively reduce the number of labels\nper random variable (i.e., the solution space). Typically, this is achieved by use, for instance, of a\nheuristic that keeps only a small \ufb01xed number of labels around the optimal label of each node found\nat the current scale, while pruning all other labels, which are therefore not considered thereafter [5].\nThis strategy, however, may not be optimal or even valid for all types of problems. Furthermore,\nsuch a pruning heuristic is totally inappropriate (and can thus lead to errors) for nodes located along\ndiscontinuity boundaries of an optimal solution, where such boundaries are always expected to exist\nin practice. 
An alternative strategy followed by some other methods relies on selecting a subset of the MRF nodes at each scale (based on some criterion) and then fixing their labels according to the optimal solution estimated at the current scale (essentially, such methods contract the entire label set of a node to a single label). However, such a fixing strategy may be too aggressive and can also easily lead to eliminating good labels.\n\nProposed approach Our method simultaneously makes use of the following two strategies for speeding-up the MRF optimization process:\n\n(i) it solves the problem through a multi-scale approach that gradually refines the MAP estimation based on a coarse-to-fine representation of the graphical model,\n\n(ii) and, at the same time, it progressively reduces the label space of each variable by cleverly utilizing the information computed during the above coarse-to-fine process.\n\nTo achieve that, we propose to significantly revisit the way that the pruning of the solution space takes place. More specifically:\n\n(i) we make use of and incorporate into the above process a fine-grained pruning scheme that allows an arbitrary subset of labels to be discarded, where this subset can be different for each node,\n\n(ii) additionally, and most importantly, instead of trying to manually come up with some criteria for deciding what labels to prune or keep, we introduce the idea of relying entirely on a sequence of trained classifiers for taking such decisions, where different classifiers per scale are used.\n\nWe name such an approach Inference by Learning, and show that it is particularly efficient and effective in reducing the label space while omitting very few correct labels. 
Furthermore, we demonstrate\nthat the training of these classi\ufb01ers can be done based on features that are not application speci\ufb01c\nbut depend solely on the energy function, which thus makes our approach generic and applicable\nto any MRF problem. The end result of this process is to obtain both an important speed-up and\na signi\ufb01cant decrease in memory consumption as the solution space is progressively reduced. Fur-\nthermore, as each scale re\ufb01nes the MAP estimation, a further speed-up is obtained as a result of a\nwarm-start initialization that can be used when transitioning between different scales.\nBefore proceeding, it is worth also noting that there exists a body of prior work [29] that focuses on\n\ufb01xing the labels of a subset of nodes of the graphical model by searching for a partial labeling with\nthe so-called persistency property (which means that this labeling is provably guaranteed to be part\nof an optimal solution). However, \ufb01nding such a set of persistent variables is typically very time\nconsuming. Furthermore, in many cases only a limited number of these variables can be detected.\nAs a result, the focus of these works is entirely different from ours, since the main motivation in our\ncase is how to obtain a signi\ufb01cant speed-up for the optimization.\nHereafter, we assume without loss of generality that the graphical model is a discrete pairwise\nCRF/MRF. However, one can straightforwardly apply our approach to higher order models.\n\nOutline of the paper We brie\ufb02y review the optimization problem related to a discrete pairwise\nMRF and introduce the necessary notations in section 2. We describe our general multi-scale pruning\nframework in section 3. We explain how classi\ufb01ers are trained in section 4. Experimental results\nand their analysis are presented in 5. 
Finally, we conclude the paper in section 6.\n2 Notation and preliminaries\nTo represent a discrete MRF model $\\mathcal{M}$, we use the following notation\n\n$$\\mathcal{M} = \\left(\\mathcal{V}, \\mathcal{E}, \\mathcal{L}, \\{\\phi_i\\}_{i \\in \\mathcal{V}}, \\{\\phi_{ij}\\}_{(i,j) \\in \\mathcal{E}}\\right) . \\qquad (1)$$\n\nHere $\\mathcal{V}$ and $\\mathcal{E}$ represent respectively the nodes and edges of a graph, and $\\mathcal{L}$ represents a discrete label set. Furthermore, for every $i \\in \\mathcal{V}$ and $(i, j) \\in \\mathcal{E}$, the functions $\\phi_i : \\mathcal{L} \\to \\mathbb{R}$ and $\\phi_{ij} : \\mathcal{L}^2 \\to \\mathbb{R}$ represent respectively unary and pairwise costs (that are also known collectively as MRF potentials $\\phi = \\{\\{\\phi_i\\}_{i \\in \\mathcal{V}}, \\{\\phi_{ij}\\}_{(i,j) \\in \\mathcal{E}}\\}$). A solution $x = (x_i)_{i \\in \\mathcal{V}}$ of this model consists of one variable per vertex $i$, taking values in the label set $\\mathcal{L}$, and the total cost (energy) $E(x|\\mathcal{M})$ of such a solution is\n\n$$E(x|\\mathcal{M}) = \\sum_{i \\in \\mathcal{V}} \\phi_i(x_i) + \\sum_{(i,j) \\in \\mathcal{E}} \\phi_{ij}(x_i, x_j) .$$\n\nThe goal of MAP estimation is to find a solution that has minimum energy, i.e., to compute\n\n$$x_{\\mathrm{MAP}} = \\arg\\min_{x \\in \\mathcal{L}^{|\\mathcal{V}|}} E(x|\\mathcal{M}) .$$\n\nThe above minimization takes place over the full solution space of model $\\mathcal{M}$, which is $\\mathcal{L}^{|\\mathcal{V}|}$. Here we will also make use of a pruned solution space $\\mathcal{S}(\\mathcal{M}, A)$, which is defined based on a binary function $A : \\mathcal{V} \\times \\mathcal{L} \\to \\{0, 1\\}$ (referred to as the pruning matrix hereafter) that specifies the status (active or pruned) of a label for a given vertex, i.e.,\n\n$$A(i, l) = \\begin{cases} 1 & \\text{if label } l \\text{ is active at vertex } i \\\\ 0 & \\text{if label } l \\text{ is pruned at vertex } i \\end{cases} \\qquad (2)$$\n\nDuring optimization, active labels are retained while pruned labels are discarded. 
Based on a given $A$, the corresponding pruned solution space of model $\\mathcal{M}$ is defined as\n\n$$\\mathcal{S}(\\mathcal{M}, A) = \\left\\{ x \\in \\mathcal{L}^{|\\mathcal{V}|} \\mid (\\forall i),\\ A(i, x_i) = 1 \\right\\} .$$\n\n3 Multiscale Inference by Learning\nIn this section we describe the overall structure of our MAP estimation framework, beginning by explaining how to construct the coarse-to-fine representation of the input graphical model.\n\n3.1 Model coarsening\nGiven a model $\\mathcal{M}$ (defined as in (1)), we wish to create a \u201ccoarser\u201d version of this model $\\mathcal{M}' = \\left(\\mathcal{V}', \\mathcal{E}', \\mathcal{L}, \\{\\phi'_i\\}_{i \\in \\mathcal{V}'}, \\{\\phi'_{ij}\\}_{(i,j) \\in \\mathcal{E}'}\\right)$. Intuitively, we want to partition the nodes of $\\mathcal{M}$ into groups, and treat each group as a single node of the coarser model $\\mathcal{M}'$ (the implicit assumption is that nodes of $\\mathcal{M}$ that are grouped together are assigned the same label). To that end, we will make use of a grouping function $g : \\mathcal{V} \\to \\mathbb{N}$. The nodes and edges of the coarser model are then defined as follows\n\n$$\\mathcal{V}' = \\{ i' \\mid \\exists i \\in \\mathcal{V},\\ i' = g(i) \\} , \\qquad (3)$$\n$$\\mathcal{E}' = \\{ (i', j') \\mid \\exists (i, j) \\in \\mathcal{E},\\ i' = g(i),\\ j' = g(j),\\ i' \\neq j' \\} . \\qquad (4)$$\n\nFurthermore, the unary and pairwise potentials of $\\mathcal{M}'$ are given by\n\n$$(\\forall i' \\in \\mathcal{V}'),\\quad \\phi'_{i'}(l) = \\sum_{i \\in \\mathcal{V} \\mid i' = g(i)} \\phi_i(l) + \\sum_{(i,j) \\in \\mathcal{E} \\mid i' = g(i) = g(j)} \\phi_{ij}(l, l) , \\qquad (5)$$\n$$(\\forall (i', j') \\in \\mathcal{E}'),\\quad \\phi'_{i'j'}(l_0, l_1) = \\sum_{(i,j) \\in \\mathcal{E} \\mid i' = g(i),\\ j' = g(j)} \\phi_{ij}(l_0, l_1) . \\qquad (6)$$\n\nFigure 1: Black circles: $\\mathcal{V}$, Black lines: $\\mathcal{E}$, Red squares: $\\mathcal{V}'$, Blue lines: $\\mathcal{E}'$.\n\nWith a slight abuse of notation, we will hereafter use $g(\\mathcal{M})$ to denote the coarser model resulting from $\\mathcal{M}$ when using the grouping function $g$, i.e., we define $g(\\mathcal{M}) = \\mathcal{M}'$. 
Also, given a solution $x'$ of $\\mathcal{M}'$, we can \u201cupsample\u201d it into a solution $x$ of $\\mathcal{M}$ by setting $x_i = x'_{g(i)}$ for each $i \\in \\mathcal{V}$. We will use the following notation in this case: $g^{-1}(x') = x$. We provide a toy example in supplementary materials.\n3.2 Coarse-to-fine optimization and label pruning\nTo estimate the MAP of an input model $\\mathcal{M}$, we first construct a series of $N+1$ progressively coarser models $(\\mathcal{M}^{(s)})_{0 \\le s \\le N}$ by use of a sequence of $N$ grouping functions $(g^{(s)})_{0 \\le s < N}$, where\n\n$$\\mathcal{M}^{(0)} = \\mathcal{M} \\quad \\text{and} \\quad (\\forall s),\\ \\mathcal{M}^{(s+1)} = g^{(s)}(\\mathcal{M}^{(s)}) .$$\n\nThis provides a multiscale (coarse-to-fine) representation of the original model, where the elements of the resulting models are denoted as follows:\n\n$$\\mathcal{M}^{(s)} = \\left(\\mathcal{V}^{(s)}, \\mathcal{E}^{(s)}, \\mathcal{L}, \\{\\phi^{(s)}_i\\}_{i \\in \\mathcal{V}^{(s)}}, \\{\\phi^{(s)}_{ij}\\}_{(i,j) \\in \\mathcal{E}^{(s)}}\\right)$$\n\nIn our framework, MAP estimation proceeds from the coarsest to the finest scale (i.e., from model $\\mathcal{M}^{(N)}$ to $\\mathcal{M}^{(0)}$). During this process, a pruning matrix $A^{(s)}$ is computed at each scale $s$, which is used for defining a restricted solution space $\\mathcal{S}(\\mathcal{M}^{(s)}, A^{(s)})$. The elements of the matrix $A^{(N)}$ at the coarsest scale are all set equal to 1 (i.e., no label pruning is used in this case), whereas in all other scales $A^{(s)}$ is computed by use of a trained classifier $f^{(s)}$.\nMore specifically, at any given scale $s$, the following steps take place:\n\ni. We approximately minimize (via any existing MRF optimization method) the energy of the model $\\mathcal{M}^{(s)}$ over the restricted solution space $\\mathcal{S}(\\mathcal{M}^{(s)}, A^{(s)})$, i.e., we compute\n\n$$x^{(s)} \\approx \\arg\\min_{x \\in \\mathcal{S}(\\mathcal{M}^{(s)}, A^{(s)})} E(x|\\mathcal{M}^{(s)}) .$$\n\nii. 
Given the estimated solution $x^{(s)}$, a feature map $z^{(s)} : \\mathcal{V}^{(s)} \\times \\mathcal{L} \\to \\mathbb{R}^K$ is computed at the current scale, and a trained classifier $f^{(s)} : \\mathbb{R}^K \\to \\{0, 1\\}$ uses this feature map $z^{(s)}$ to construct the pruning matrix $A^{(s-1)}$ for the next scale as follows\n\n$$(\\forall i \\in \\mathcal{V}^{(s-1)},\\ \\forall l \\in \\mathcal{L}),\\quad A^{(s-1)}(i, l) = f^{(s)}\\left(z^{(s)}(g^{(s-1)}(i), l)\\right) .$$\n\niii. Solution $x^{(s)}$ is \u201cupsampled\u201d into $x^{(s-1)} = [g^{(s-1)}]^{-1}(x^{(s)})$ and used as the initialization for the optimization at the next scale $s - 1$. Note that, due to (5) and (6), it holds $E(x^{(s-1)}|\\mathcal{M}^{(s-1)}) = E(x^{(s)}|\\mathcal{M}^{(s)})$. Therefore, this initialization ensures that energy will continually decrease if the same is true for the optimization applied per scale. Furthermore, it can allow for a warm-starting strategy when transitioning between scales.\n\nThe pseudocode of the resulting algorithm appears in Algo. 1.\n\nAlgorithm 1: Inference by learning framework\nData: Model $\\mathcal{M}$, grouping functions $(g^{(s)})_{0 \\le s < N}$, classifiers $(f^{(s)})_{0 < s \\le N}$\nResult: $x^{(0)}$\nCompute the coarse to fine sequence of MRFs:\n  $\\mathcal{M}^{(0)} \\leftarrow \\mathcal{M}$\n  for $s = [0 \\ldots N - 1]$ do\n    $\\mathcal{M}^{(s+1)} \\leftarrow g^{(s)}(\\mathcal{M}^{(s)})$\nOptimize the coarse to fine sequence of MRFs over pruned solution spaces:\n  $(\\forall i \\in \\mathcal{V}^{(N)},\\ \\forall l \\in \\mathcal{L}),\\ A^{(N)}(i, l) \\leftarrow 1$\n  Initialize $x^{(N)}$\n  for $s = [N \\ldots 0]$ do\n    Update $x^{(s)}$ by iterative minimization: $x^{(s)} \\approx \\arg\\min_{x \\in \\mathcal{S}(\\mathcal{M}^{(s)}, A^{(s)})} E(x|\\mathcal{M}^{(s)})$\n    if $s \\neq 0$ then\n      Compute feature map $z^{(s)}$\n      Update pruning matrix for next finer scale: $A^{(s-1)}(i, l) = f^{(s)}(z^{(s)}(g^{(s-1)}(i), l))$\n      Upsample $x^{(s)}$ for initializing solution $x^{(s-1)}$ at next scale: $x^{(s-1)} \\leftarrow [g^{(s-1)}]^{-1}(x^{(s)})$\n\n4 Features and classifier for label pruning\nFor each scale $s$, we explain how the set of features comprising the feature map $z^{(s)}$ is computed and how we train (off-line) the classifier $f^{(s)}$. This is a crucial step for our approach. 
Indeed, if the classifier wrongly prunes labels that belong to the MAP solution, then only an approximate labeling might be found at the finest scale. Moreover, keeping too many active labels will result in a poor speed-up for MAP estimation.\n4.1 Features\nThe feature map $z^{(s)} : \\mathcal{V}^{(s)} \\times \\mathcal{L} \\to \\mathbb{R}^K$ is formed by stacking $K$ individual real-valued features defined on $\\mathcal{V}^{(s)} \\times \\mathcal{L}$. We propose to compute features that are not application specific but depend solely on the energy function and the current solution $x^{(s)}$. This makes our approach generic and applicable to any MRF problem. However, as we establish a general framework, specific application features can be straightforwardly added in future work.\n\nPresence of strong discontinuity This binary feature, $\\mathrm{PSD}^{(s)}$, accounts for the existence of discontinuity in solution $x^{(s)}$ when a strong link (i.e., $\\phi_{ij}(x^{(s)}_i, x^{(s)}_j) > \\rho$) exists between neighbors. Its definition follows for any vertex $i \\in \\mathcal{V}^{(s)}$ and any label $l \\in \\mathcal{L}$:\n\n$$\\mathrm{PSD}^{(s)}(i, l) = \\begin{cases} 1 & \\exists (i, j) \\in \\mathcal{E}^{(s)} \\mid \\phi_{ij}(x^{(s)}_i, x^{(s)}_j) > \\rho \\\\ 0 & \\text{otherwise} \\end{cases} \\qquad (7)$$\n\nLocal energy variation This feature represents the local variation of the energy around the current solution $x^{(s)}$. It accounts for both the unary and pairwise terms associated to a vertex and a label. As in [11], we remove the local energy of the current solution as it leads to a higher discriminative power. 
The local energy variation feature, $\\mathrm{LEV}^{(s)}$, is defined for any $i \\in \\mathcal{V}^{(s)}$ and $l \\in \\mathcal{L}$ as follows:\n\n$$\\mathrm{LEV}^{(s)}(i, l) = \\frac{\\phi^{(s)}_i(l) - \\phi^{(s)}_i(x^{(s)}_i)}{N^{(s)}_{\\mathcal{V}}(i)} + \\sum_{j : (i,j) \\in \\mathcal{E}^{(s)}} \\frac{\\phi^{(s)}_{ij}(l, x^{(s)}_j) - \\phi^{(s)}_{ij}(x^{(s)}_i, x^{(s)}_j)}{N^{(s)}_{\\mathcal{E}}(i)} \\qquad (8)$$\n\nwith $N^{(s)}_{\\mathcal{V}}(i) = \\mathrm{card}\\{i' \\in \\mathcal{V}^{(s-1)} : g^{(s-1)}(i') = i\\}$ and $N^{(s)}_{\\mathcal{E}}(i) = \\mathrm{card}\\{(i', j') \\in \\mathcal{E}^{(s-1)} : g^{(s-1)}(i') = i,\\ g^{(s-1)}(j') = j\\}$.\n\nUnary \u201ccoarsening\u201d This feature, $\\mathrm{UC}^{(s)}$, aims to estimate an approximation of the coarsening induced in the MRF unary terms when going from model $\\mathcal{M}^{(s-1)}$ to model $\\mathcal{M}^{(s)}$, i.e., as a result of applying the grouping function $g^{(s-1)}$. It is defined for any $i \\in \\mathcal{V}^{(s)}$ and $l \\in \\mathcal{L}$ as follows\n\n$$\\mathrm{UC}^{(s)}(i, l) = \\sum_{i' \\in \\mathcal{V}^{(s-1)} \\mid g^{(s-1)}(i') = i} \\frac{\\left| \\phi^{(s-1)}_{i'}(l) - \\frac{\\phi^{(s)}_i(l)}{N^{(s)}_{\\mathcal{V}}(i)} \\right|}{N^{(s)}_{\\mathcal{V}}(i)} \\qquad (9)$$\n\nFeature normalization The features are by design insensitive to any additive term applied on all the unary and pairwise terms. However, we still need to apply a normalization to the $\\mathrm{LEV}^{(s)}$ and $\\mathrm{UC}^{(s)}$ features to make them insensitive to any positive global scaling factor applied on both the unary and pairwise terms (such scaling variations are commonly used in computer vision). Hence, we simply divide each group of features, $\\mathrm{LEV}^{(s)}$ and $\\mathrm{UC}^{(s)}$, by its mean value.\n4.2 Classifier\nTo train the classifiers, we are given as input a set of MRF instances (all of the same class, e.g., stereo-matching) along with the ground truth MAP solutions. We extract a subset of MRFs for off-line learning and a subset for on-line testing. 
For each MRF instance in the training set, we apply Algorithm 1 without any pruning (i.e., $A^{(s)} \\equiv 1$) and, at each scale, we keep track of the features $z^{(s)}$ and also compute the binary function $X^{(s)}_{\\mathrm{MAP}} : \\mathcal{V}^{(s)} \\times \\mathcal{L} \\to \\{0, 1\\}$ defined as follows:\n\n$$(\\forall i \\in \\mathcal{V},\\ \\forall l \\in \\mathcal{L}),\\quad X^{(0)}_{\\mathrm{MAP}}(i, l) = \\begin{cases} 1, & \\text{if } l \\text{ is the ground truth label for node } i \\\\ 0, & \\text{otherwise} \\end{cases}$$\n$$(\\forall s > 0)(\\forall i \\in \\mathcal{V}^{(s)},\\ \\forall l \\in \\mathcal{L}),\\quad X^{(s)}_{\\mathrm{MAP}}(i, l) = \\bigvee_{i' \\in \\mathcal{V}^{(s-1)} : g^{(s-1)}(i') = i} X^{(s-1)}_{\\mathrm{MAP}}(i', l) ,$$\n\nwhere $\\bigvee$ denotes the binary OR operator. The values 0 and 1 in $X^{(s)}_{\\mathrm{MAP}}$ define respectively the two classes $c_0$ and $c_1$ when training the classifier $f^{(s)}$, where $c_0$ means that the label can be pruned and $c_1$ that the label should not be pruned.\nTo treat separately the nodes that are on the border of a strong discontinuity, we split the feature map $z^{(s)}$ into two groups $z^{(s)}_0$ and $z^{(s)}_1$, where $z^{(s)}_0$ contains only features where $\\mathrm{PSD}^{(s)} = 0$ and $z^{(s)}_1$ contains only features where $\\mathrm{PSD}^{(s)} = 1$ (strong discontinuity). For each group, we train a standard linear C-SVM classifier with $l_2$-norm regularization (the regularization parameter was set to $C = 10$). The linear classifiers give good enough accuracy during training while also being fast to evaluate at test time.\nDuring training (and for each group), we also introduce weights to balance the different number of elements in each class ($c_0$ is much larger than $c_1$), and to also strongly penalize misclassification in $c_1$ (as such misclassification can have a more drastic impact on the accuracy of MAP estimation). To accomplish that, we set the weight for class $c_0$ to 1, and the weight for class $c_1$ to $\\lambda \\frac{\\mathrm{card}(c_0)}{\\mathrm{card}(c_1)}$, where $\\mathrm{card}(\\cdot)$ counts the number of training samples in each class. 
Parameter $\\lambda$ is a positive scalar (common to both groups) used for tuning the penalization of misclassification in $c_1$ (it will be referred to as the pruning aggressiveness factor hereafter as it affects the number of labels that get pruned). During on-line testing, depending on the value of the PSD feature, $f^{(s)}$ applies the linear classifier learned on group $z^{(s)}_0$ if $\\mathrm{PSD}^{(s)} = 0$, or the linear classifier learned on group $z^{(s)}_1$ if $\\mathrm{PSD}^{(s)} = 1$.\n5 Experimental results\nWe evaluate our framework on pairwise MRFs from stereo-matching, image restoration, and optical flow estimation problems. The corresponding MRF graphs consist of regular 4-connected grids in this case. At each scale, the grouping function merges together vertices of $2 \\times 2$ subgrids. We leave more advanced grouping functions [15] for future work. As MRF optimization subroutine, we use the Fast-PD algorithm [21]. We make our code available on-line [4].\n\nFigure 2: Performance of our Inference by Learning framework: (Top row) stereo matching, (Middle row) optical flow, (Bottom row) image restoration. Columns: (a) Speed-up, (b) Active label ratio, (c) Energy ratio, (d) Label agreement. For stereo matching, the Average Middlebury curve represents the average value of the statistic for the entire Middlebury dataset [6] (2001, 2003, 2005 and 2006) (37 stereo-pairs).\n\nExperimental setup For the stereo matching problem, we estimate the disparity map from images $I_R$ and $I_L$ where each label encodes a potential disparity $d$ (discretized at quarter of a pixel precision), with MRF potentials $\\phi_p(d) = \\|I_L(y_p, x_p) - I_R(y_p, x_p - d)\\|_1$ and $\\phi_{pq}(d_0, d_1) = w_{pq}|d_0 - d_1|$, with the weight $w_{pq}$ varying based on the image gradient (parameters are adjusted for each sequence). We train the classifier on the well-known Tsukuba stereo-pair (61 labels), and use all other stereo-pairs of [6] (2001, 2003, 2005 and 2006) for testing. 
For image restoration, we estimate the pixel intensity of a noisy and incomplete image $I$ with MRF potentials $\\phi_p(l) = \\|I(y_p, x_p) - l\\|_2^2$ and $\\phi(l_0, l_1) = 25 \\min(\\|l_0 - l_1\\|_2^2, 200)$. We train the classifier on the Penguin image (256 labels), and use House (256 labels) for testing (dataset [31]). For the optical flow estimation, we estimate a subpixel-accurate 2D displacement field between two frames by extending the stereo matching formulation to 2D. Using the dataset of [1], we train the classifier on Army (1116 labels), and test on RubberWhale (625 labels) and Dimetrodon (483 labels). For all experiments, we use 5 scales and set in (7) $\\rho = 5 \\bar{w}_{pq}$ with $\\bar{w}_{pq}$ being the mean value of edge weights.\n\nEvaluations We evaluate three optimization strategies: the direct optimization (i.e., optimizing the full MRF at the finest scale), the multi-scale optimization ($\\lambda = 0$, i.e., our framework without any pruning), and our Inference by Learning optimization, where we experiment with different aggressiveness factors $\\lambda$ that range between 0.001 and 1.\nWe assess the performance by computing: the energy ratio, i.e., the ratio between the current energy and the energy computed by the direct optimization; the best label agreement, i.e., the proportion of labels that coincide with the labels of the lowest computed energy; the speed-up factor, i.e., the ratio of computation time between the direct optimization and the current optimization strategy; and the active label ratio, i.e., the percentage of active labels at the finest scale.\n\nResults and discussion For all problems, we present in Fig. 2 the performance of our Inference by Learning approach for all tested aggressiveness factors and show in Fig. 3 estimated results for $\\lambda = 0.01$. 
We present additional results in the supplementary material.\nFor every problem and for aggressiveness factors up to $\\lambda = 0.1$, our pruning-based optimization obtains a lower energy (column (c) of Fig. 2) in less computation time, achieving a speed-up factor (column (a) of Fig. 2) close to 5 for stereo matching, above 10 for optical flow and up to 3 for image restoration (note that these speed-up factors are with respect to an algorithm, Fast-PD, that was the most efficient one in recent comparisons [12]). The percentage of active labels (Fig. 2 column (b)) strongly correlates with the speed-up factor. The best labeling agreement (Fig. 2 column (d)) is never worse than 97% (except for the image restoration problems because of the in-painted area) and is always above 99% for $\\lambda \\le 0.1$. As expected, less pruning happens near label discontinuities, as illustrated in columns (d,e,f) of Fig. 3, justifying the use of a dedicated linear classifier. Moreover, large homogeneously labeled regions are pruned earlier in the coarse to fine scale.\n\nFigure 3: Results of our Inference by Learning framework for $\\lambda = 0.1$. Each row is a different MRF problem (row labels: Tsukuba, Venus, Teddy, Army, Dimetrodon, House). (a) original image, (b) ground truth, (c) solution of the pruning framework, (d,e,f) percentage of active labels per vertex for scale 0, 1 and 2 (black 0%, white 100%).\n\n6 Conclusion and future work\nOur Inference by Learning approach consistently speeds up the graphical model optimization by a significant amount while maintaining an excellent accuracy of the labeling estimation. 
On most experiments, it even obtains a lower energy than direct optimization.\nIn future work, we plan to experiment with problems that require general pairwise potentials where message-passing techniques can be more effective than graph-cut based methods but are at the same time much slower. Our framework is guaranteed to provide an even more dramatic speed-up in this case since the computational complexity of message-passing methods is quadratic with respect to the number of labels while being linear for graph-cut based methods used in our experiments. We also intend to explore the use of application specific features, learn the grouping functions used in the coarse-to-fine scheme, jointly train the cascade of classifiers, and apply our framework to high order graphical models.\nReferences\n[1] S. Baker, S. Roth, D. Scharstein, M.J. Black, J. P. Lewis, and R. Szeliski. A database and evaluation methodology for optical flow. In ICCV, 2007.\n[2] Martin Bergtholdt, J\u00f6rg Kappes, Stefan Schmidt, and Christoph Schn\u00f6rr. A study of parts-based object class detection using complete graphs. IJCV, 2010.\n[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. PAMI, 2001.\n[4] B. Conejo. http://imagine.enpc.fr/~conejob/ibyl/.\n[5] B. Conejo, S. Leprince, F. Ayoub, and J. P. Avouac. Fast global stereo matching via energy pyramid minimization. ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci., 2014.\n[6] Middlebury Stereo Datasets.\n[7] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 2004.\n[8] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Pictorial structures for object recognition. IJCV, 2005.\n[9] P.F. Felzenszwalb and D.P. Huttenlocher. Efficient belief propagation for early vision. In CVPR, 2004.\n[10] W.T. Freeman and E.C. Pasztor. Learning low-level vision. 
In ICCV, 1999.\n[11] Xiaoyan Hu and P. Mordohai. A quantitative evaluation of confidence measures for stereo vision. PAMI, 2012.\n[12] J.H. Kappes, B. Andres, F.A. Hamprecht, C. Schnorr, S. Nowozin, D. Batra, Sungwoong Kim, B.X. Kausler, J. Lellmann, N. Komodakis, and C. Rother. A comparative study of modern inference techniques for discrete energy minimization problems. In CVPR, 2013.\n[13] Junhwan Kim, V. Kolmogorov, and R. Zabih. Visual correspondence using energy minimization and mutual information. In ICCV, 2003.\n[14] S. Kim, C. Yoo, S. Nowozin, and P. Kohli. Image segmentation using higher-order correlation clustering, 2014.\n[15] Taesup Kim, S. Nowozin, P. Kohli, and C.D. Yoo. Variable grouping for energy minimization. In CVPR, 2011.\n[16] T. Kohlberger, C. Schnorr, A. Bruhn, and J. Weickert. Domain decomposition for variational optical-flow computation. IEEE Transactions on Information Theory/Image Processing, 2005.\n[17] Pushmeet Kohli, Victor S. Lempitsky, and Carsten Rother. Uncertainty driven multi-scale optimization. In DAGM-Symposium, 2010.\n[18] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. PAMI, 2006.\n[19] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? PAMI, 2004.\n[20] N. Komodakis, N. Paragios, and G. Tziritas. Mrf optimization via dual decomposition: Message-passing revisited. In CVPR, 2007.\n[21] N. Komodakis, G. Tziritas, and N. Paragios. Fast, approximately optimal solutions for single and dynamic mrfs. In CVPR, 2007.\n[22] M. Pawan Kumar and Daphne Koller. Map estimation of semi-metric mrfs via hierarchical graph cuts. In UAI, 2009.\n[23] M.P. Kumar, P.H.S. Torr, and A. Zisserman. Obj cut. In CVPR, 2005.\n[24] H. Lombaert, Yiyong Sun, L. Grady, and Chenyang Xu. A multilevel banded graph cuts method for fast image segmentation. In ICCV, 2005.\n[25] T. Meltzer, C. Yanover, and Y. Weiss. 
Globally optimal solutions for energy minimization in stereo vision\n\nusing reweighted belief propagation. In ICCV, 2005.\n\n[26] P. Perez and F. Heitz. Restriction of a markov random \ufb01eld on a graph and multiresolution statistical\n\nimage modeling. IEEE Transactions on Information Theory/Image Processing, 1996.\n\n[27] S. Roth and M.J. Black. Fields of experts: a framework for learning image priors. In CVPR, 2005.\n[28] C. Rother, V. Kolmogorov, V. Lempitsky, and M. Szummer. Optimizing binary mrfs via extended roof\n\nduality. In CVPR, 2007.\n\n[29] Alexander Shekhovtsov. Maximum persistency in energy minimization. In CVPR, 2014.\n[30] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. PAMI., 2000.\n[31] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, Aseem Agarwala, M. Tappen, and\nC. Rother. A comparative study of energy minimization methods for markov random \ufb01elds with\nsmoothness-based priors. PAMI, 2008.\n\n[32] M.J. Wainwright, T.S. Jaakkola, and A.S. Willsky. Map estimation via agreement on trees: message-\n\npassing and linear programming. IEEE Transactions on Information Theory/Image Processing, 2005.\n\n9\n\n\f", "award": [], "sourceid": 1124, "authors": [{"given_name": "Bruno", "family_name": "Conejo", "institution": "Caltech / Ecole des Ponts ParisTech"}, {"given_name": "Nikos", "family_name": "Komodakis", "institution": "Ecole des Ponts ParisTech"}, {"given_name": "Sebastien", "family_name": "Leprince", "institution": "Caltech"}, {"given_name": "Jean Philippe", "family_name": "Avouac", "institution": "Caltech"}]}