{"title": "Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 82, "page_last": 90, "abstract": "We study the problem of maximum marginal prediction (MMP) in probabilistic graphical models, a task that occurs, for example, as the Bayes optimal decision rule under a Hamming loss. MMP is typically performed as a two-stage procedure: one estimates each variable's marginal probability and then forms a prediction from the states of maximal probability. In this work we propose a simple yet effective technique for accelerating MMP when inference is sampling-based: instead of the above two-stage procedure we directly estimate the posterior probability of each decision variable. This allows us to identify the point of time when we are sufficiently certain about any individual decision. Whenever this is the case, we dynamically prune the variable we are confident about from the underlying factor graph. Consequently, at any time only samples of variable whose decision is still uncertain need to be created. Experiments in two prototypical scenarios, multi-label classification and image inpainting, shows that adaptive sampling can drastically accelerate MMP without sacrificing prediction accuracy.", "full_text": "Dynamic Pruning of Factor Graphs\nfor Maximum Marginal Prediction\n\nChristoph H. Lampert\n\nIST Austria (Institute of Science and Technology Austria)\n\nAm Campus 1, 3400 Klosterneuburg, Austria\n\nhttp://www.ist.ac.at/\u223cchl\n\nchl@ist.ac.at\n\nAbstract\n\nWe study the problem of maximum marginal prediction (MMP) in probabilistic\ngraphical models, a task that occurs, for example, as the Bayes optimal decision\nrule under a Hamming loss. 
MMP is typically performed as a two-stage procedure: one estimates each variable's marginal probability and then forms a prediction from the states of maximal probability. In this work we propose a simple yet effective technique for accelerating MMP when inference is sampling-based: instead of the above two-stage procedure we directly estimate the posterior probability of each decision variable. This allows us to identify the point in time when we are sufficiently certain about any individual decision. Whenever this is the case, we dynamically prune the variables we are confident about from the underlying factor graph. Consequently, at any time only samples of variables whose decision is still uncertain need to be created. Experiments in two prototypical scenarios, multi-label classification and image inpainting, show that adaptive sampling can drastically accelerate MMP without sacrificing prediction accuracy.

1 Introduction

Probabilistic graphical models (PGMs) have become useful tools for classical machine learning tasks, such as multi-label classification [1] or semi-supervised learning [2], as well as for many real-world applications, for example image processing [3], natural language processing [4], bioinformatics [5], and computational neuroscience [6]. Despite their popularity, the question of how to best perform (approximate) inference in any given graphical model is still far from solved. While variational approximations and related message passing algorithms have proven useful for certain classes of models (see [7] for an overview), there is still a large number of cases for which sampling-based approaches are the safest choice.
Unfortunately, inference by sampling is often computationally costly: many samples are required to reach a confident result, and generating the individual samples can be a complex task in itself, in particular if the underlying graphical model is large and highly connected.

In this work we study a particular inference problem: maximum marginal prediction (MMP) in binary-valued PGMs, i.e. the task of determining for each variable in the graphical model which of its states has highest marginal probability. MMP occurs naturally as the Bayes optimal decision rule under Hamming loss [8], and it has also found use as a building block for more complex prediction tasks, such as M-best MAP prediction [9]. The standard approach to sampling-based MMP is to estimate each variable's marginal probability distribution from a set of samples from the joint probability, and for each variable pick the state of highest estimated marginal probability. In this work, we propose an almost as simple, but more efficient way. We introduce one binary indicator variable for each decision we need to make, and keep estimates of the posterior probabilities of each of these during the process of sampling. As soon as we are confident enough about any of the decisions, we remove it from the factor graph that underlies the sampling process, so no more samples are generated for it.
Consequently, the factor graph shrinks over time, and later steps in the sampling procedure are accelerated, often drastically so.

Our main contribution lies in the combination of two relatively elementary components that we will introduce in the following section: an estimate for the posterior distributions of the decision variables, and a mean-field-like construction for removing individual variables from a factor graph.

2 Adaptive Sampling for Maximum Marginal Prediction

Let p(x) be a fixed probability distribution over the set X = {0,1}^V of binary labelings of a vertex set V = {1, . . . , n}. We assume that p is given to us by means of a factor graph, G = (V, F), with factor set F = {F_1, . . . , F_k}. Each factor, F_j ⊂ V, has an associated log-potential, ψ_j, which is a real-valued function of only the variables occurring in F_j. Writing x_{F_j} = (x_i)_{i∈F_j} we have

p(x) ∝ exp(−E(x))   with   E(x) = Σ_{F∈F} ψ_F(x_F),   (1)

for any x ∈ {0,1}^V. Our goal is maximum marginal prediction, i.e. to infer the values of decision variables (z_i)_{i∈V} that are defined by z_i := 0 if µ_i ≤ 0.5, and z_i := 1 otherwise, where µ_i := p(x_i = 1) is the marginal probability of the i-th variable taking the value 1. Computing the marginals µ_i in a loopy graphical model is in general #P-complete [10], so one has to settle for approximate marginals and approximate predictions. In this work, we assume access to a suitably constructed sampler based on the Markov chain Monte Carlo (MCMC) principle [11, 12], e.g. a Gibbs sampler [3]. It produces a chain of states S_N = {x^(1), . . . , x^(N)}, where each x^(j) is a random sample from the joint distribution p(x). From the set of samples we can compute an estimate, ˆµ_i = (1/N) Σ_{j=1}^{N} x_i^(j), of the true marginal, µ_i, and make approximate decisions: ˆz_i := 1 if and only if ˆµ_i ≥ 0.5. Under mild conditions on the sampling procedure the law of large numbers guarantees that lim_{N→∞} ˆµ_i = µ_i, and the decisions will become correct almost surely.

The main problem with sampling-based inference is when to stop sampling [13]. The more samples we have, the lower the variance of the estimates, so the more confident we can be about our decisions. However, each sample we generate increases the computational cost at least proportionally to the number of factors and variables involved. At the same time, the variance of the estimators ˆµ_i is reduced only proportionally to the square root of the sample size. In combination, this means that often one spends a large amount of computational resources on a small win in predictive accuracy. In the rest of this section, we explain our proposed idea of adaptive sampling in graphical models, which reduces the number of variables and factors during the course of the sampling procedure. As an illustrative example we start with the classical situation of adaptive sampling in the case of a single binary variable. This is a special case of Bayesian hypothesis selection, and –for the case of i.i.d. data– has recently also been rediscovered in the pattern recognition literature, for example for evaluating decision trees [14]. We then introduce our proposed extensions to correlated samples, and show how the per-variable decisions can be applied in the graphical model situation with potentially many variables and dependencies between them.

2.1 Adaptive Sampling of Binary Variables

Let x be a single binary variable, for which we have a set of samples, S = {x^(1), . . . , x^(N)}, available. The main insight lies in the fact that even though samples are used to empirically estimate the (marginal) probability µ, the latter is not the actual quantity of interest to us.
Ultimately, we are only interested in the value of the associated decision variable z.

Independent samples. Assuming for the moment that the samples are independent (i.i.d.), we can derive an analytic expression for the posterior probability of z given the observed samples,

p(z = 0 | S) = ∫_0^{1/2} p(q | S) dq,   (2)

where p(q | S) is the conditional probability density for µ having the value q. Applying Bayes' rule with likelihood p(x|q) = q^x (1−q)^{1−x} and uniform prior, p(q) = 1, results in

p(z = 0 | S) = (1 / B(m+1, N−m+1)) ∫_0^{1/2} q^m (1−q)^{N−m} dq = I_{1/2}(m+1, N−m+1),   (3)

where m = Σ_{j=1}^{N} x^(j). The normalization factor B(α, β) = Γ(α)Γ(β)/Γ(α+β) is the beta function; the integral is called the incomplete beta function (here evaluated at 1/2). In combination, they form the regularized incomplete beta function I_x(α, β) [15].

From the above derivation we obtain a stopping criterion of ε-confidence: given any number of samples we compute p(z = 0 | S) using Equation (3). If its value is above 1 − ε, we are ε-confident that the correct decision is z = 0. If it is below ε, we are equally confident that the correct decision is z = 1. Only if it lies in between do we need to continue sampling. An analogous derivation to the above leads to a confidence bound for estimates of the marginal probability, ˆµ = m/N, itself:

p(|ˆµ − µ| ≤ δ | S) = I_{ˆµ+δ}(m+1, N−m+1) − I_{ˆµ−δ}(m+1, N−m+1).   (4)

Note that both tests are computable fast enough to be done after each sample, or small batches of samples.
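The ε-confidence test of Equation (3) amounts to a single special-function evaluation. A minimal sketch using SciPy, whose `scipy.special.betainc(a, b, x)` computes the regularized incomplete beta function I_x(a, b); the function names below are ours, not the paper's:

```python
from scipy.special import betainc

def decision_confidence(samples):
    """Posterior probabilities of the decisions z=0 and z=1 for one binary
    variable given i.i.d. 0/1 samples, via Eq. (3):
    p(z=0|S) = I_{1/2}(m+1, N-m+1)."""
    N = len(samples)
    m = sum(samples)                       # number of observed x=1
    p_z0 = betainc(m + 1, N - m + 1, 0.5)  # regularized incomplete beta at 1/2
    return p_z0, 1.0 - p_z0

def epsilon_confident(samples, eps=1e-5):
    """Stopping test: return the decision (0 or 1) once it is eps-confident,
    or None if sampling should continue."""
    p_z0, p_z1 = decision_confidence(samples)
    if p_z0 > 1.0 - eps:
        return 0
    if p_z1 > 1.0 - eps:
        return 1
    return None
```

With 90 ones out of 100 samples the test already commits to z = 1 at ε = 10⁻⁵, while a perfectly balanced sample sequence stays undecided, matching the discussion below.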
Evaluating the regularized incomplete beta function does not require numeric integration, and for fixed parameter ε the values N and m that bound the regions of confidence can also be tabulated [16]. A figure illustrating the difference between confidence in the MMP and confidence in the estimated marginals can be found in the supplemental material. It shows that only relatively few independent samples (tens to hundreds) are sufficient to get a very confident MMP decision if the actual marginals are close to 0 or 1. Intuitively, this makes sense, since in this situation even a coarse estimate of the marginal is sufficient to make a decision with low error probability. Only if the true marginal lies inside a relatively narrow interval around 0.5 does the MMP decision become hard, and a large number of samples will be necessary to make a confident decision. Our experiments in Section 4 will show that in practical problems where the probability distribution is learned from data, the regions close to 0 and 1 are in fact the most relevant ones.

Dependent samples. Practical sampling procedures, such as MCMC, do not create i.i.d. samples, but dependent ones. Using the above bounds directly with these would make the tests overconfident. We overcome this problem, approximately, by borrowing the concept of effective sample size (ESS) from the statistics literature. Intuitively, the ESS reflects how many independent samples, N′, a set of N correlated samples is equivalent to. In first order¹, one estimates the effective sample size as N′ = ((1−r)/(1+r)) N, where r is the first-order autocorrelation coefficient, r = (1/((N−1)σ²)) Σ_{j=1}^{N−1} (x^(j) − ˆµ)(x^(j+1) − ˆµ), and σ² is the estimated variance of the sample sequence.
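The first-order ESS estimate above can be sketched as follows; a minimal NumPy version in which the clipping of r to [0, 1) is our own conservative guard (it prevents N′ from exceeding N for anti-correlated chains), not part of the paper's formula:

```python
import numpy as np

def effective_sample_size(samples):
    """First-order ESS estimate N' = (1-r)/(1+r) * N, where r is the lag-1
    autocorrelation coefficient of the 0/1 sample sequence."""
    x = np.asarray(samples, dtype=float)
    N = len(x)
    mu = x.mean()
    var = x.var()
    if var == 0.0:                        # constant chain: r is undefined
        return float(N)
    r = np.sum((x[:-1] - mu) * (x[1:] - mu)) / ((N - 1) * var)
    r = min(max(r, 0.0), 1.0 - 1e-12)     # our guard: keep N' in (0, N]
    return (1.0 - r) / (1.0 + r) * N
```

A chain of 50 zeros followed by 50 ones is almost perfectly correlated and is worth only about one independent sample, while an uncorrelated-looking sequence keeps its full length.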
Consequently, we can adjust the confidence tests defined above to correlated data: we first collect a small number of samples, N₀, which we use to estimate initial values of σ² and r. Subsequently, we estimate the confidence of a decision by

p(z = 0 | S) = I_{1/2}(ˆµN′ + 1, (1 − ˆµ)N′ + 1),   (5)

i.e. we replace the sample size N by the effective sample size N′ and the raw count m by its adjusted value ˆµN′.

2.2 Adaptive Sampling in Graphical Models

In this section we extend the above confidence criterion from single binary decisions to the situation of joint sampling from the joint probability of multiple binary variables. Note that we are only interested in per-variable decisions, so we can treat the value of each variable x_i^(j) in a joint sample x^(j) as a separate sample from the marginal probability p(x_i). We will have to take the dependence between different samples x_i^(j) and x_i^(k) into account, but between-variable dependencies within a sample do not pose problems. Consequently, estimating the confidence of any decision variable z_i is straightforward from Equation (5), applied separately to the binary sample set S_i = {x_i^(1), . . . , x_i^(N)}. Note that all quantities defined above for the single-variable case need to be computed separately for each decision. For example, each variable has its own autocorrelation estimate and effective sample size.

¹Many more involved methods for estimating the effective sample size exist, see, for example, [17], but in our experiments the first-order method proved sufficient for our purposes.

The difference to the binary situation lies in what we do when we are confident enough about the decision of some subset of variables, V^c ⊂ V. Simply stopping all sampling would be too risky, since we are still uncertain about the decisions of V^u := V \ V^c. Continuing to sample until we are certain about all decisions would be wasteful, since we know that variables with marginals close to 0.5 require many more samples than others for a confident decision. We therefore propose to continue sampling, but only for the variables about which we are still uncertain. This requires us to derive an expression for p(x^u), the marginal probability of all variables that we are still uncertain about.

Computing p(x^u) = Σ_{x̄^c ∈ {0,1}^{V^c}} p(x̄^c, x^u) exactly is almost always infeasible; otherwise, we would not have needed to resort to sampling-based inference in the first place. An alternative idea would be to continue using the original factor graph, but to clamp all variables we are certain about to their MMP values. This is computationally feasible, but it results in samples from a conditional distribution, p(x^u | x^c = z^c), not from the desired marginal one. The new construction that we introduce combines advantages of both previous ideas: it is computationally as efficient as the value clamping, but it uses a distribution that approximates the marginal distribution as closely as possible. Similar to mean-field methods [7], the main step consists of finding distributions q and q′ such that p(x) ≈ q(x^u) q′(x^c). Subsequently, q(x^u) can be used as an approximate replacement for p(x^u), because p(x^u) = Σ_{x̄^c ∈ {0,1}^{V^c}} p(x) ≈ Σ_{x̄^c ∈ {0,1}^{V^c}} q′(x̄^c) q(x^u) = q(x^u).
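Identifying the confident subset V^c by the per-variable test of Equation (5) can be sketched as follows; this combines the first-order ESS estimate with SciPy's `betainc` (which evaluates I_x(a, b)), and all names as well as the clipping of r are our own illustrative choices:

```python
import numpy as np
from scipy.special import betainc

def confident_decisions(samples, eps=1e-5):
    """samples: (N, V) array of joint 0/1 samples.  For each variable i,
    evaluate Eq. (5) with the first-order effective sample size N' and
    return {i: decision} for all eps-confident variables (the set V^c)."""
    X = np.asarray(samples, dtype=float)
    N, V = X.shape
    decided = {}
    for i in range(V):
        x = X[:, i]
        mu, var = x.mean(), x.var()
        if var > 0.0:
            r = np.sum((x[:-1] - mu) * (x[1:] - mu)) / ((N - 1) * var)
            r = min(max(r, 0.0), 1.0 - 1e-12)   # guard: keep N' in (0, N]
        else:
            r = 0.0                             # constant chain
        n_eff = (1.0 - r) / (1.0 + r) * N
        # Eq. (5): replace N by N' and the raw count m by mu * N'
        p_z0 = betainc(mu * n_eff + 1.0, (1.0 - mu) * n_eff + 1.0, 0.5)
        if p_z0 > 1.0 - eps:
            decided[i] = 0
        elif p_z0 < eps:
            decided[i] = 1
    return decided
```

Variables whose sample column is clearly skewed are decided quickly, while a variable hovering around ˆµ = 0.5 stays in V^u and continues to be sampled.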
The main difference to mean-field inference lies in the fact that q and q′ have different roles in our construction. For q′ we prefer a distribution that factorizes over the variables that we are confident about. Because we also want q′ to respect the marginal probabilities, ˆµ_i for i ∈ V^c, as estimated from the sampling process so far, we set q′(x^c) = Π_{i∈V^c} ˆµ_i^{x_i} (1 − ˆµ_i)^{1−x_i}. The distribution q contains all variables that we are not yet confident about, so we want to avoid making any limiting assumptions about its potential values or structure. Instead, we define it as the solution of minimizing KL(p | q q′) over all distributions q, which yields the solution

q(x^u) ∝ exp( −E_{x̄^c∼q′}{ E(x̄^c, x^u) } ).   (6)

What remains is to define factors F′ and log-potentials ψ′, such that q(x^u) ∝ exp( −Σ_{F∈F′} ψ′_F(x_F) ) while also allowing for efficient sampling from q. For this we partition the original factor set into three disjoint sets, F = F^c ∪ F^u ∪ F₀, with F^c := {F ∈ F : F ⊂ V^c}, F^u := {F ∈ F : F ⊂ V^u}, and F₀ := F \ (F^c ∪ F^u). Each factor F₀ ∈ F₀ we split further into its certain and uncertain components, F₀^c ⊂ V^c and F₀^u ⊂ V^u, respectively. With this we obtain a decomposition of the exponent in Equation (6):

E_{x̄^c∼q′}{E(x̄^c, x^u)} = Σ_{F^c∈F^c} Σ_{x̄_{F^c}} q′(x̄_{F^c}) ψ_{F^c}(x̄_{F^c}) + Σ_{F^u∈F^u} ψ_{F^u}(x_{F^u}) + Σ_{F₀∈F₀} Σ_{x̄_{F₀^c}} q′(x̄_{F₀^c}) ψ_{F₀}(x̄_{F₀^c}, x_{F₀^u}).

The first sum is a constant with respect to x^u, so we can disregard it in the construction of F′. The factors and log-potentials in the second sum already depend only on V^u, so we can re-use them in unmodified form for F′: we set ψ′_F = ψ_F for every F ∈ F^u. The third sum we rewrite as Σ_{{F^u = F∩V^u : F∈F₀}} ψ′_{F^u}(x_{F^u}), with

ψ′_{F^u}(x^u) := Σ_{x̄^c ∈ {0,1}^{F∩V^c}} [ Π_{i∈F∩V^c} ˆµ_i^{x̄_i} (1 − ˆµ_i)^{1−x̄_i} ] ψ_F(x̄^c, x^u)   (7)

for any F ∈ F₀, where we have made use of the explicit form of q′. If factors with identical variable set occur during this construction, we merge them by summing their log-potentials. Ultimately, we obtain a new factor set F′ := F^u ∪ {F ∩ V^u : F ∈ F₀}, and probability distribution

q(x^u) ∝ exp( −Σ_{F∈F′} ψ′_F(x_F) )   for x^u ∈ {0,1}^{V^u}.   (8)

Note that during this process not only is the number of variables reduced; the number of factors and the size of each factor also never grow.
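For dense log-potential tables, the per-factor marginalization of Equation (7) is a weighted average of the table's slices over the confident variables. A minimal sketch under that tabular assumption; the function and argument names are ours:

```python
import itertools
import numpy as np

def prune_factor(psi, variables, confident_mu):
    """Eq. (7): average the log-potential table psi over a factor's confident
    variables, weighting each assignment by the Bernoulli product
    q'(xbar) = prod_i mu_i^{xbar_i} (1-mu_i)^{1-xbar_i}.
    psi has one 0/1 axis per entry of `variables`; confident_mu maps a
    confident variable to its estimated marginal mu_hat."""
    keep = [v for v in variables if v not in confident_mu]
    drop = [v for v in variables if v in confident_mu]
    psi = np.asarray(psi, dtype=float)
    psi_new = np.zeros((2,) * len(keep))
    for assign in itertools.product([0, 1], repeat=len(drop)):
        # weight of this assignment of the confident variables under q'
        w = 1.0
        for v, b in zip(drop, assign):
            w *= confident_mu[v] if b == 1 else 1.0 - confident_mu[v]
        # index that fixes the dropped axes and keeps the uncertain ones
        idx = tuple(assign[drop.index(v)] if v in confident_mu else slice(None)
                    for v in variables)
        psi_new += w * psi[idx]
    return keep, psi_new
```

For a pairwise factor over variables (0, 1) with ˆµ₀ = 0.9, the surviving table over variable 1 is the 0.1/0.9-weighted mixture of the two rows; a fully confident factor collapses to a constant, which matches the first, discarded sum in the decomposition above.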
Consequently, if sampling was feasible for the original distribution p, it will also be feasible for q, and potentially more efficient.

3 Related Work

Sequential sampling with the option of early stopping has a long tradition in Bayesian statistics. First introduced by Wald in 1945 [18], the ability to continuously accumulate information until a decision can be made with sufficient confidence was one of the key factors that contributed to the success of Bayesian reasoning for decision making. Today, it is a standard technique in areas as diverse as clinical medicine (e.g. for early stopping of drug trials [19]), social sciences (e.g. for designing and evaluating experiments [20]), and economics (e.g. in modelling stock market behavior [21]).

In current machine learning research, sequential sampling is used less frequently for making individual decisions, but in the form of MCMC it has become one of the most successful techniques for statistical inference in probability distributions with many dependent variables [12, 22]. Nevertheless, to the best of our knowledge, the method we propose is the first one that performs early stopping of subsets of variables in this context. Many other approaches to reducing the complexity of sampling iterations exist, however, for example approximating complex graphical models by simpler ones, such as trees [23], or loopy models of low treewidth [24]. These fall into a different category than the proposed method, though, as they are usually applied statically and prior to the actual inference step, so they cannot dynamically assign computational resources where they are needed most. Beam search [25] and related techniques take an orthogonal approach to ours. They dynamically exclude low-likelihood label combinations from the inference process, but they keep the size and topology of the factor graph fixed.
Select and sample [26] disregards a data-dependent subset of variables during each sampling iteration. It is not directly applicable in our situation, though, since it requires that the underlying graphical model is bipartite, such that the individual variables are conditionally independent of each other. Given their complementary nature, we believe that the idea of combining adaptive MMP with beam search and/or select and sample could be a promising direction for future work.

4 Experimental Evaluation

To demonstrate the effect of adaptive MMP compared to naive MMP, we performed experiments in two prototypical applications: multi-label classification and binary image inpainting. In both tasks, performance is typically measured by the Hamming loss, so MMP is the preferred method of test-time prediction.

4.1 Multi-Label Classification

In multi-label classification, the task is to predict for each input y ∈ Y which labels out of a label set L = {1, . . . , K} are correct. The difference to multi-class classification is that several labels can be correct simultaneously, or potentially none at all. Multi-label classification can be formulated as the simultaneous prediction of K binary labels (x_i)_{i=1,...,K}, where x_i = 1 indicates that the label i is part of the prediction, and x_i = 0 indicates that it is not. Even though multi-label classification can in principle be solved by training K independent predictors, several studies have shown that by making use of dependencies between labels, the accuracy of the individual predictions can be improved [1, 27, 28].

For our experiments we follow [1] in using a fully-connected conditional random field model. Given an input y, each label variable i has a unary factor F_i = {i} with log-linear potential ψ_i(x_i) = ⟨w_i, y⟩ x_i, where w_i is a label-specific weight vector that was learned from training data.
Additionally there are K(K − 1)/2 pairwise factors, F_ij = {i, j}, with log-potentials ψ_ij(x_i, x_j) = η_ij x_i x_j. Their free parameters η_ij are learned as well. The resulting conditional joint distribution has the form of a Boltzmann machine, p(x|y) ∝ exp(−E_y(x)), with energy function E_y(x) = Σ_{i=1}^{K} η_i x_i + Σ_{i=1}^{K} Σ_{j=i+1}^{K} η_ij x_i x_j in minimal representation, where η_i and η_ij depend on y. We downloaded several standard datasets and trained the CRF on each of them using a stochastic gradient descent procedure based on the sgd² package. The necessary gradients are computed using a junction tree algorithm for problems with 20 variables or less, and by Gibbs sampling otherwise. For model selection, when required, we used 10-fold cross-validation on the training set. Note that our goal in this experiment is not to advocate a new model for multi-label classification, but to create probability distributions as they would appear in real problems. Nevertheless, we also report classification accuracy in Table 1 to show that a) the learned models have similar characteristics as earlier work, in particular to [29], where an identical model was trained using structured SVM learning, and b) adaptive MMP can achieve as high prediction accuracy as ordinary Gibbs sampling, as long as the confidence parameter ε is not chosen overly optimistically. In fact, in many cases even a relatively large value, such as ε = 0.01, results in a smaller loss of accuracy than the potential 1%, but overall, a value of 10^−5 or less seems advisable.

² http://leon.bottou.org/projects/sgd

[Table 1 body. Datasets: SYNTH1 [29] (6 labels), SYNTH2 [29] (10), SCENE (6), RCV1-10 [29] (10), MEDIAMILL-10 [29] (10), YEAST (14), TMC2007 (22), AWA [30] (85), MEDIAMILL (101), RCV1 (103); the remaining cells (train/test set sizes and error rates) could not be reliably realigned from the extracted text.]

Table 1: Multi-label classification. Dataset characteristics (number of labels, number of training examples, number of test examples) and classification error rate in percent. [29] used the same model as we do, but trained it using a structured SVM framework and predicted using MAP. [28] compared 12 different multi-label classification techniques; we report their mean and standard deviation. The remaining columns give MMP prediction accuracy of the trained CRF models: Exact computes the exact marginal values by a junction tree, Gibbs and Proposed perform ordinary Gibbs sampling or the proposed adaptive version with ε = 10^−2/10^−5/10^−8, both run for up to 500 iterations.

Figure 1: Results of adaptive pruning on RCV1 dataset for ε = 10^−2, 10^−5, 10^−8 (left to right). x-axis: regularization parameter C used for training, y-axis: ratio of iterations/variables/factors/runtime used by adaptive sampling relative to Gibbs sampling.

Figures 1 and 2 show in more detail how the adaptive sampling behaves on two exemplary datasets with respect to four aspects: the number of iterations, the number of variables, the number of factors, and the overall runtime. For each aspect we show a box plot of the corresponding relative quantity compared to the Gibbs sampler. For example, a value of 0.5 in iterations means that the adaptive sampler terminated after 250 iterations instead of the maximum of 500, because it was confident about all decisions. Values of 0.2 in variables and factors mean that the number of variable states sampled by the adaptive sampler, and the number of factors in the corresponding factor graphs, were 20% of the corresponding quantities for the Gibbs sampler.
Within each plot, we report results for the complete range of regularization parameters in order to illustrate the effect that regularization has on the distribution of marginals.

Figure 2: Results of adaptive pruning on YEAST dataset for ε = 10^−2, 10^−5, 10^−8 (left to right). x-axis: regularization parameter C used for training, y-axis: ratio of iterations/variables/factors/runtime used by adaptive sampling relative to Gibbs sampling. Note that the scaling of the y-axis differs between columns.

Figure 1 shows results for the relatively simple RCV1 dataset. As one can see, a large number of variables and factors are removed quickly from the factor graph, leading to a large speedup compared to the ordinary Gibbs sampler. In fact, as the first row shows, it was possible to make a confident decision for all variables far before the 500th iteration, such that the adaptive method terminated early. As a general trend, the weaker the regularization (larger C value in the plot), the earlier the adaptive sampler is able to remove variables and factors, presumably because more extreme values of the energy function result in more marginal probabilities close to 0 or 1. A second insight is that despite the exponential scaling of the confidence parameter between the columns, the runtime grows only roughly linearly. This indicates that we can choose ε conservatively without taking a large performance hit.
On the hard YEAST dataset (Figure 2), in the majority of cases the adaptive sampling does not terminate early, indicating that some of the variables have marginal probabilities close to 0.5. Nevertheless, a clear gain in speed can be observed, in particular in the weakly regularized case, indicating that many tests for confidence are nevertheless successful early during the sampling.

4.2 Binary Image Inpainting

Inpainting is a classical image processing task: given an image (in our case black-and-white) in which some of the pixels are occluded or have missing values, the goal is to predict a completed image in which the missing pixels are set to their correct value, or at least in a visually pleasing way. Image inpainting has been tackled successfully by grid-shaped Markov random field models, where each pixel is represented by a random variable, unary factors encode local evidence extracted from the image, and pairwise terms encode the co-occurrence of pixel values. For our experiment, we use the Hard Energies from Chinese Characters (HECC) dataset [31], for which the authors provide pre-computed energy functions. The dataset has 100 images, each with between 4992 and 17856 pixels, i.e. binary variables. Each variable has one unary and up to 64 pairwise factors, leading to an overall factor count of 146224 to 553726. Because many of the pairwise factors act repulsively, the underlying energy function is highly non-submodular, and sampling has proven a more successful means of inference than, for example, message passing [31].

Figure 3 shows exemplary results of the task. The complete set can be found in the supplemental material.
Figure 3: Example results of binary image inpainting on the HECC dataset (columns: input, Gibbs, ε = 10^−2, ε = 10^−5, ε = 10^−8). From left to right: image to be inpainted, result of Gibbs sampling, result of adaptive sampling, where each method was run for up to 30 seconds per image. The left plot of each result shows the marginal probabilities, the right plot shows how often each pixel was sampled on a log scale from 10 (dark blue) to 100000 (bright red). Gibbs sampling treats all pixels uniformly, reaching around 100 sampling sweeps within the given time budget. Adaptive sampling stops early for parts of the image that it is certain about, and concentrates its samples in the uncertain regions, i.e. pixels with marginal probability close to 0.5. The larger ε, the more pronounced this effect is.

In each case, we ran an ordinary Gibbs sampler and the adaptive sampler for 30 seconds, and we visualize the resulting marginal probabilities as well as the number of samples created for each of the pixels. One can see that adaptive sampling comes to a more confident prediction within the given time budget. The larger the ε parameter, the earlier it stops sampling the 'easy' pixels, spending more time on the difficult cases, i.e.
pixels with marginal probability close to 0.5.

5 Summary and Outlook

In this paper we derived an analytic expression for how confident one can be about the maximum marginal predictions (MMPs) of a binary graphical model after a certain number of samples, and we presented a method for pruning factor graphs when we want to stop sampling for a subset of the variables. In combination, this allows us to infer the MMPs more efficiently: starting from the whole factor graph, we sample sequentially, and whenever we are sufficiently certain about a prediction, we prune the corresponding variable from the factor graph before continuing to sample. Experiments on multi-label classification and image inpainting show a clear increase in speed at virtually no loss in accuracy, unless the confidence is chosen too optimistically.

Despite the promising results, there are two main limitations that we plan to address. On the one hand, the multi-label experiments showed that sometimes a conservative estimate of the confidence is required to achieve the highest accuracy. This is likely a consequence of the fact that our pruning uses the estimated marginals to build a new factor graph, and even if the decision confidence is high, the marginals can still vary considerably. We plan to tackle this problem by also integrating bounds on the marginals with data-dependent confidence into our framework. A second limitation is that we can currently only handle binary-valued labelings. This is sufficient for multi-label classification and many problems in image processing, but ultimately one would hope to derive similar early stopping criteria also for graphical models with larger label sets. Our pruning method would be readily applicable to this situation, but an open challenge lies in finding a suitable criterion for when to prune variables.
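For the binary variables handled here, such a criterion has a simple closed form. The sketch below is a hypothetical variant assuming a uniform Beta(1, 1) prior over each marginal (the exact expression derived in the paper may differ); for integer sample counts, the posterior probability that the majority decision is correct reduces to a binomial tail sum:

```python
import math

def decision_confidence(k, n):
    """Posterior probability that the true marginal p(x_i = 1) lies on the
    same side of 1/2 as the empirical estimate k/n, given k ones observed
    in n samples, under a uniform Beta(1, 1) prior on the marginal.

    Uses the identity I_{1/2}(k+1, n-k+1) = P(Binomial(n+1, 1/2) >= k+1)
    relating the regularized incomplete beta function to a binomial tail."""
    p_greater = sum(math.comb(n + 1, j) for j in range(k + 1)) / 2 ** (n + 1)
    return max(p_greater, 1.0 - p_greater)

def can_prune(k, n, eps):
    """A variable may be pruned once the probability that its maximum
    marginal decision is wrong drops below the tolerance eps."""
    return decision_confidence(k, n) >= 1.0 - eps
```

Under this rule, with eps = 10⁻² a variable observed in state 1 in 95 of 100 samples would be pruned, while a 5-of-10 split would not; larger eps prunes earlier, matching the behaviour seen in Figure 3.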
This will require a deeper understanding of tail probabilities of multinomial decision variables, but we are confident it will be achievable, for example based on existing prior work for the case of i.i.d. samples [14, 32].

References
[1] N. Ghamrawi and A. McCallum. Collective multi-label classification. In CIKM, 2005.
[2] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.
[3] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. PAMI, 6(6), 1984.
[4] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. PAMI, 19(4), 1997.
[5] C. Yanover and Y. Weiss. Approximate inference and protein folding. In NIPS, volume 15, 2002.
[6] E. Schneidman, M. J. Berry, R. Segev, and W. Bialek. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087), 2006.
[7] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2), 2008.
[8] J. Marroquin, S. Mitter, and T. Poggio. Probabilistic solution of ill-posed problems in computational vision. Journal of the American Statistical Association, 82(397), 1987.
[9] C. Yanover and Y. Weiss. Finding the M most probable configurations using loopy belief propagation. In NIPS, volume 16, 2004.
[10] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22, 1993.
[11] R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.
[12] D. J. C. MacKay. Introduction to Monte Carlo methods. In Proceedings of the NATO Advanced Study Institute on Learning in Graphical Models, 1998.
[13] A. E.
Raftery and S. Lewis. How many iterations in the Gibbs sampler? Bayesian Statistics, 4(2), 1992.
[14] A. G. Schwing, C. Zach, Y. Zheng, and M. Pollefeys. Adaptive random forest – how many "experts" to ask before making a decision? In CVPR, 2011.
[15] H. Weiler. The use of incomplete beta functions for prior distributions in binomial sampling. Technometrics, 1965.
[16] C. M. Thompson, E. S. Pearson, L. J. Comrie, and H. O. Hartley. Tables of percentage points of the incomplete beta-function. Biometrika, 1941.
[17] R. V. Lenth. Some practical guidelines for effective sample size determination. The American Statistician, 55(3), 2001.
[18] A. Wald. Sequential tests of hypotheses. Annals of Mathematical Statistics, 16, 1945.
[19] D. A. Berry. Bayesian clinical trials. Nature Reviews Drug Discovery, 5(1), 2006.
[20] A. E. Raftery. Bayesian model selection in social research. Sociological Methodology, 25, 1995.
[21] D. Easley, N. M. Kiefer, M. O'Hara, and J. B. Paperman. Liquidity, information, and infrequently traded stocks. Journal of Finance, 1996.
[22] C. J. Geyer. Practical Markov chain Monte Carlo. Statistical Science, 7(4), 1992.
[23] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3), 1968.
[24] F. Bach and M. I. Jordan. Thin junction trees. In NIPS, volume 14, 2002.
[25] M. J. Collins. A new statistical parser based on bigram lexical dependencies. In ACL, 1996.
[26] J. A. Shelton, J. Bornschein, A. S. Sheikh, P. Berkes, and J. Lücke. Select and sample – a model of efficient neural inference and learning. In NIPS, volume 24, 2011.
[27] Y. Guo and S. Gu. Multi-label classification using conditional dependency networks. In IJCAI, 2011.
[28] G. Madjarov, D. Kocev, D. Gjorgjevikj, and S. Dzeroski. An extensive experimental comparison of methods for multi-label learning.
Pattern Recognition, 2012.
[29] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In ICML, 2008.
[30] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[31] S. Nowozin, C. Rother, S. Bagon, T. Sharp, B. Yao, and P. Kohli. Decision tree fields. In ICCV, 2011.
[32] D. Chafaï and D. Concordet. Confidence regions for the multinomial parameter with small sample size. Journal of the American Statistical Association, 104(487), 2009.