{"title": "Latent Structured Active Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 728, "page_last": 736, "abstract": "In this paper we present active learning algorithms in the context of structured prediction problems. To reduce the amount of labeling necessary to learn good models, our algorithms only label subsets of the output. To this end, we query examples using entropies of local marginals, which are a good surrogate for uncertainty. We demonstrate the effectiveness of our approach in the task of 3D layout prediction from single images, and show that good models are learned when labeling only a handful of random variables. In particular, the same performance as using the full training set can be obtained while only labeling ~10\\% of the random variables.", "full_text": "Latent Structured Active Learning\n\nWenjie Luo\nTTI Chicago\n\nwenjie.luo@ttic.edu\n\nAlexander G. Schwing\n\nETH Zurich\n\naschwing@inf.ethz.ch\n\nRaquel Urtasun\n\nTTI Chicago\n\nrurtasun@ttic.edu\n\nAbstract\n\nIn this paper we present active learning algorithms in the context of structured\nprediction problems. To reduce the amount of labeling necessary to learn good\nmodels, our algorithms operate with weakly labeled data and we query additional\nexamples based on entropies of local marginals, which are a good surrogate for\nuncertainty. We demonstrate the effectiveness of our approach in the task of 3D\nlayout prediction from single images, and show that good models are learned when\nlabeling only a handful of random variables. In particular, the same performance\nas using the full training set can be obtained while only labeling \u223c10% of the\nrandom variables.\n\n1\n\nIntroduction\n\nMost real-world applications are structured, i.e., they are composed of multiple random variables\nwhich are related. For example, in natural language processing, we might be interested in parsing\nsentences syntactically. 
In computer vision, we might want to predict the depth of each pixel, or its semantic category. In computational biology, given a sequence of proteins (e.g., lethal and edema factors, protective antigen) we might want to predict the 3D docking of the anthrax toxin. While individual variables could be considered independently, it has been demonstrated that taking relations into account improves prediction performance significantly.

Prediction in structured models is typically performed by maximizing a scoring function over the space of all possible outcomes, an NP-hard task for most graphical models. Traditional learning algorithms for structured problems tackle the supervised setting [16, 33, 11], where input-output pairs are given and each structured output is fully labeled. Obtaining fully labeled examples might, however, be very cumbersome as structured models often involve a large number of random variables, e.g., in semantic segmentation, we have to label several million random variables, one for each pixel. Furthermore, obtaining ground truth is sometimes difficult as it potentially requires accessing extra sensors, e.g., laser scanners in the case of stereo. This is even more extreme in the medical domain, where obtaining extra labels is sometimes not even possible, e.g., when tests are not available. Thus, reducing the amount of labeled examples required for learning the scoring function is key for the success of structured prediction in real-world applications.

The active learning setting is particularly beneficial as it has the potential to considerably reduce the amount of supervision required to learn a good model, by querying only the most informative examples. 
In the structured case, active learning can be generalized to query only subparts of the graph for each example, reducing the amount of necessary labeling even further.

While a variety of active learning approaches exists for the case of classification and regression, the structured case has been less popular, perhaps because of its intrinsic computational difficulties, as we have to deal with exponentially sized output spaces. Existing approaches typically consider the case where exact inference is possible [7], label the full output space [7, 22], or rely on computationally expensive processes that require inference for each possible outcome of each random variable [34]. The latter is computationally infeasible for most graphical models.

In contrast, in this paper we present efficient approximate approaches for general graphical models where exact inference is intractable. In particular, we propose to select which parts to label based on the entropy of the local marginal distributions. Our active learning algorithms exploit recently developed weakly supervised methods for structured prediction [28], showing that we can benefit from unlabeled examples and exploit the marginal distributions computed during learning. Furthermore, computation is re-used at each active learning iteration, improving efficiency significantly. We demonstrate the effectiveness of our approach in the context of 3D room layout estimation from single images, and show that state-of-the-art results are achieved while requiring far fewer manual interactions (i.e., labels). In particular, we match the performance of the state-of-the-art in this task [27] while only labeling ∼10% of the random variables.

In the remainder of the paper we first review learning methods for structured prediction. 
We then propose our active learning algorithms, and show our experimental evaluation followed by a discussion on related work and conclusions.

2 Maximum Likelihood Structure Prediction

We begin by reviewing structured prediction approaches that employ fully labeled training sets as well as those that handle latent variables. Of particular interest to us are probabilistic formulations, since we employ entropies of local probability distributions as our criterion for deciding which parts of the graph to label during each active learning step.

Let x ∈ X be an element of the input space (e.g., an image or a sentence), and let s ∈ S be an element of the structured label space that we are interested in predicting (e.g., an image segmentation or a parse tree). We define φ : X × S → R^F to be a mapping from input and label space to an F-dimensional feature space. Here we consider log-linear distributions p_w(s|x) describing the probability over a structured label space S given an object x ∈ X as

p_w(s|x) ∝ exp(w^⊤φ(x, s)).   (1)

During learning, we are interested in estimating the parameters w ∈ R^F of the log-linear distribution such that the score w^⊤φ(x, s) is high if s ∈ S is a "good" label for x ∈ X.

2.1 Supervised Setting

To define "good," in the supervised setting we are given a training set D = {(x_i, s_i)}_{i=1}^N containing N pairs, each composed of an input x ∈ X and some fully labeled data s ∈ S. In addition, we are often able to compare the fitness of an estimate ŝ ∈ S for a training sample (x, s) ∈ D via what we refer to as the task-loss function ℓ_{(x,s)}(ŝ). Its purpose is very much like enforcing a distance between the hyperplane defined by the parameters and the respective sample when considering the popular max-margin setting. 
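For intuition, the log-linear distribution of Eq. 1 can be written down directly for a toy problem whose output space is small enough to enumerate; the feature vectors and labels below are hypothetical placeholders, not the paper's model:

```python
import numpy as np

# Minimal sketch of the log-linear model p_w(s|x) ∝ exp(w^T φ(x, s)) of Eq. 1,
# for a toy problem whose output space S is small enough to enumerate.

def log_linear(w, phi_all):
    """Normalized distribution over all enumerated labels s ∈ S.

    w       : (F,) parameter vector
    phi_all : (|S|, F) matrix stacking φ(x, s) for every s ∈ S
    """
    scores = phi_all @ w          # w^T φ(x, s) for each label s
    scores -= scores.max()        # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()            # divide by the partition function

# Toy example: three candidate labels, two features (illustrative values).
w = np.array([1.0, -0.5])
phi_all = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])
p = log_linear(w, phi_all)        # the first label has the highest score
```

Learning (Sec. 2.1) then amounts to choosing w so that the ground-truth label of each training pair receives high probability under this distribution.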
We incorporate this loss function into the learning process by considering the loss-augmented distribution

p_{(x,s)}(ŝ|w) ∝ exp(w^⊤φ(x, ŝ) + ℓ_{(x,s)}(ŝ)).   (2)

Intuitively, it places more probability mass on those parts of the output space S that have a high loss, forcing the model to adapt to a more difficult setting than the one encountered at inference, where the loss is not present.

Maximum likelihood learning aims at finding model parameters w which assign highest probability to the training set D. Assuming the data to be independent and identically distributed (i.i.d.), our goal is to minimize the negative log-posterior −ln[p(w) Π_{(x,s)∈D} p_{(x,s)}(s|w)], with p(w) ∝ e^{−‖w‖_p^p} being a prior on the model parameters. The cost function is therefore given by

C/p · ‖w‖_p^p + Σ_{(x,s)∈D} ( ε ln Σ_{ŝ∈S} exp( (w^⊤φ(x, ŝ) + ℓ_{(x,s)}(ŝ)) / ε ) − w^⊤φ(x, s) ),   (3)

where we have included a parameter ε to yield a soft-max function. Although it is a convex function, the difficulty arises from the sum over exponentially many label configurations ŝ.

Different algorithms have been proposed to solve this task. While efficient computation over tree-structured models is required for convergence guarantees [16], approximations were suggested to achieve convergence even when working with loopy models [11].

2.2 Dealing with Latent Variables

In the weakly supervised setting, we are given a training set D = {(x_i, y_i)}_{i=1}^N containing N pairs, each composed of an input x ∈ X and some partially labeled data y ∈ Y ⊆ S. 
For every training pair, the label space S = Y × H is divided into two non-intersecting subspaces Y and H. We refer to the missing information h ∈ H as hidden or latent. As before, we incorporate a task-loss function, and define the loss-augmented likelihood of a prediction ŷ ∈ Y when observing the pair (x, y) as

p_{(x,y)}(ŷ|w) ∝ Σ_{ĥ∈H} p_{(x,y)}(ŷ, ĥ|w) = Σ_{ĥ∈H} p_{(x,y)}(ŝ|w),   (4)

with p_{(x,y)}(ŝ|w) defined as in Eq. 2. The minimization of the negative log-posterior results in the difference of two convex terms as follows:

C/p · ‖w‖_p^p + Σ_{(x,y)∈D} ( ε ln Σ_{ŝ∈S} exp( (w^⊤φ(x, ŝ) + ℓ_{(x,y)}(ŝ)) / ε ) − ε ln Σ_{ĥ∈H} exp( (w^⊤φ(x, y, ĥ) + ℓ^c_{(x,y)}(y, ĥ)) / ε ) ),

with the first two terms being the sum of the log-prior and the logarithm of the partition function. For generality we allow different task-losses ℓ, ℓ^c, while noting that ℓ^c ≡ 0 in our experiments.

Besides the previously outlined difficulty of exponentially sized product spaces, the cost function is no longer convex. Hence we generally employ expectation maximization (EM) or concave-convex procedure (CCCP) [37] type approaches, i.e., we linearize the non-convex part at the current iterate before taking a step in the direction of the gradient of a convex function. More specifically, we follow Schwing et al. 
[28] and upper-bound the concave part via a minimization over a set of dual variables subsequently referred to as q_{(x,y)}(h):

C/p · ‖w‖_p^p + Σ_{(x,y)} ( ε ln Σ_{ŝ∈S} exp( (w^⊤φ(x, ŝ) + ℓ_{(x,y)}(ŝ)) / ε ) − ε H(q_{(x,y)}) − E_{q_{(x,y)}}[ w^⊤φ(x, y, ĥ) + ℓ^c(x, y, ĥ) ] ).

To deal with the exponential complexity, we notice that frequently the k-th element of the feature vector decomposes into local terms, i.e., φ_k(x, s) = Σ_{i∈V_{k,x}} φ_{k,i}(x, s_i) + Σ_{α∈E_{k,x}} φ_{k,α}(x, s_α). V_{k,x} represents the set indexing the unary potentials for the k-th feature of example (x, y). Similarly, E_{k,x} denotes the set of all high-order variable interaction sets α in the k-th feature of example (x, y). All variable indexes which are not observed are subsumed within the set H. Similarly, all factors α that contain variable i are summarized within the set N(i).

We leverage the decomposition within the features to also approximate the entropy over the joint distribution q_{(x,y)}(h) by local ones ranging over marginals. Furthermore, we approximate the marginal polytope by the local polytope. We deal with the summation over the output space objects ŝ ∈ S in the convex part in a similar manner. To this end we change to the dual space, employ the entropy approximations and transform the resulting surrogate function back to the primal space, where we obtain Lagrange multipliers λ which enforce the marginalization constraints. Altogether we obtain an approximate primal program having the following form:

min_{d,λ,w}  f1(w, d, λ) + f2(d) + f3(d)   (5)

s.t. 
Σ_{h_α\h_i} d_{(x,y),α}(h_α) = d_{(x,y),i}(h_i)   ∀(x, y), i ∈ H, α ∈ N(i), h_i ∈ S_i,
d_{(x,y),i}, d_{(x,y),α} ∈ Δ,

with Δ denoting probability simplexes. We refer the reader to [28] for the specific forms of these functions.

Following EM or CCCP, this program is optimized by alternately minimizing w.r.t. the local beliefs d to solve the latent variable prediction problem, and performing a gradient step w.r.t. the weights as well as block-coordinate descent steps to update the Lagrange multipliers λ. The latter is equivalent to solving a supervised conditional random field problem given the distribution over latent variables inferred in the preceding latent variable prediction step.

We augment [28], and return not only the weights but also the local beliefs d which represent the joint distribution q_{(x,y)}(h), i.e., a distribution over the latent space only. We summarize this process in Alg. 1. Note that only a local minimum is obtained as we are solving a non-convex problem.

Algorithm 1 Latent structured prediction
  Input: data D, initial weights w
  repeat
    repeat
      // solve latent variable prediction problem
      min_d f2 + f3  s.t. ∀(x, y): d_{(x,y)} ∈ D_{(x,y)}
    until convergence
    // message passing update
    ∀(x, y), i ∈ S: set λ_{(x,y),i} such that ∇_{λ_{(x,y),i}}(f1 + f2) = 0
    // gradient step with step size η
    w ← w − η∇_w(f1 + f2)
  until convergence
  Output: weights w, beliefs d

3 Active Learning

In the previous section, we defined the maximum likelihood estimators for learning in the supervised and weakly supervised setting. We now derive our active learning approaches. In the active learning setting, we assume a given training set D_S = {(x_i, y_i)}_{i=1}^{N_L} containing N_L pairs, each composed of an input x ∈ X and some partially labeled data y ∈ Y ⊆ S. 
As before, for every training pair, we divide the label space S = Y × H into two non-intersecting subspaces Y and H, and refer to the missing information h ∈ H as hidden or latent. Additionally, we are given a set of unlabeled examples D_U = {x_i}_{i=1}^{N_u}.

We are interested in answering the following question: which part of the graph for which example should we label in order to learn the best model with the least amount of supervision? Towards this goal, we derive iterative algorithms which select the random variables to be labeled based on the local entropies. This is intuitive, as entropy is a surrogate for uncertainty, and it is suitable for the considered application since the cost of labeling a random variable is independent of the selection. Our algorithms iteratively query the labels of the random variables of highest uncertainty, update the model parameters w, and again ask for the next most uncertain set of variables.

Towards this goal, we need to compute the entropies of the marginal distributions over each latent variable, as well as the entropy over each random variable of the unlabeled examples. This is in general NP-hard, as we are interested in dealing with graphical models with general potentials and connectivity. In this paper we derive two active learning algorithms, each with a different trade-off between accuracy and computational complexity.

Separate active: Our first algorithm utilizes the labeled and weakly labeled examples to learn at each iteration. Once the parameters are learned, it performs inference over the unlabeled and partially labeled examples to query for the next random variable to label. Thus, it requires a separate inference step for each active learning iteration. As shown in our experiments, this can be done efficiently using convex belief propagation [10, 26]. The corresponding algorithm is summarized in Alg. 
2.

Algorithm 2 Separate active
  Input: data D_S, D_U, initial weights w
  repeat
    (w, d_S) ← Alg. 1(D_S, w)
    d_U ← Inference(D_U)
    i* ← arg max_i H(d_i)
    D_S ← D_S ∪ {(x_{i*}, y_{i*})}, D_U ← D_U \ x_{i*}
  until sufficiently certain
  Output: weights w

Joint active: Our second active learning algorithm takes advantage of unlabeled examples during learning, and no extra effort is required to compute the most informative random variable. Note that this contrasts with active learning algorithms which typically do not exploit unlabeled data during learning and require very expensive computations in order to select the next example or random variable to be labeled. Let D_1 = D_S ∪ D_U be the set of all training examples containing fully labeled, partially labeled and unlabeled examples. At each iteration we obtain D_t by querying the label of a random variable not yet labeled in D_{t−1}. Thus, at each iteration, we learn using a weakly supervised structured prediction task that solves

C/p · ‖w_t‖_p^p + Σ_{(x,y)∈D_t} ( ε ln Σ_{ŝ∈S} exp( (w_t^⊤φ(x, ŝ) + ℓ_{(x,y)}(ŝ)) / ε ) − ε ln Σ_{ĥ∈H_t} exp( (w_t^⊤φ(x, y, ĥ) + ℓ^c_{(x,y)}(y, ĥ)) / ε ) ),

with w_t the weights for the t-th iteration.

Algorithm 3 Joint active
  Input: data D_S, D_U, initial weights w
  repeat
    (w, d) ← Alg. 1(D_S ∪ D_U, w)
    i* ← arg max_i H(d_i)
    D_S ← D_S ∪ {(x_{i*}, y_{i*})}, D_U ← D_U \ x_{i*}
  until sufficiently certain
  Output: weights w

We resort to the approximated problem given in Eq. 5 to solve this optimization task. 
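The query-selection step shared by Alg. 2 and Alg. 3, i* ← arg max_i H(d_i), can be sketched as follows; the marginals below are illustrative placeholders for the beliefs d produced during learning or inference:

```python
import numpy as np

# Sketch of the query-selection step i* ← arg max_i H(d_i) shared by
# Alg. 2 and Alg. 3. Each belief d_i is a marginal over the states of one
# unlabeled random variable; the numeric values are illustrative placeholders.

def local_entropy(d_i):
    """H(d_i) = -sum_h d_i(h) log d_i(h), with the convention 0 log 0 = 0."""
    d_i = np.asarray(d_i, dtype=float)
    nz = d_i > 0                  # skip zero entries to avoid log(0)
    return -np.sum(d_i[nz] * np.log(d_i[nz]))

def select_queries(beliefs, k=1):
    """Indices of the k most uncertain variables (k > 1 selects a batch)."""
    entropies = np.array([local_entropy(d) for d in beliefs])
    return np.argsort(-entropies)[:k]

# A near-uniform marginal (uncertain) and a peaked one (confident):
beliefs = [np.array([0.25, 0.25, 0.25, 0.25]),
           np.array([0.97, 0.01, 0.01, 0.01])]
queried = select_queries(beliefs, k=1)   # the uniform marginal is queried first
```

Note that, as stated below, this computation is linear in the number of unlabeled random variables and in the number of states.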
The entropies are readily computable in closed form, as the local beliefs d are computed during learning. Thus, no extra inference step is necessary. The local entropies are given by H(d_i) = −Σ_{h_i=1}^{|H_i|} d_i(h_i) log d_i(h_i), and we query the variable that has the highest entropy, i.e., the highest uncertainty. Note that this computation is linear in the number of unlabeled random variables and linear in the number of states. We summarize our approach in Alg. 3. Note that this algorithm is more expensive than the previous one as learning employs the fully labeled, weakly labeled and unlabeled examples. This is particularly the case when the pool of unlabeled examples is large. However, as shown in our experimental evaluation, it can dramatically reduce the amount of labeling required to learn a good model.

Batch mode: The two previously defined active learning approaches are computationally expensive, as for each sequential active learning step a new model has to be learned and inference has to be performed over all latent variables. We therefore also investigate batch algorithms which label k random variables at each step of the algorithm. Towards this goal, we simply label the top k most uncertain variables. Note that this is an approximation of what the sequential algorithm would do, as the estimates of the parameters and the entropies are not updated when selecting the i-th variable.

Re-using computation: Warm starting the learning algorithm after each active learning query is important in order to reduce the number of iterations required for convergence. Since (almost) the same samples are involved at each step, we can extract a lot of information from previous iterations. To this end we re-use both the weights w as well as the messages λ and beliefs. More specifically, for Alg. 2 we first perform inference on only the newly selected examples to update the corresponding messages λ. 
Only afterwards, together with the Lagrange multipliers from the other training images and the current weights, do we perform the next iteration and another active step. On the other hand, since we take advantage of all the unlabeled data during the joint active learning algorithm (Alg. 3), we already know the Lagrange multipliers λ for every image. Without any further updates we directly start a new active step. In our experimental evaluation we show that this choice results in dramatic speed-ups when compared to randomly initializing the weights and messages during every active learning iteration. Note that the joint approach (Alg. 3) requires a larger number of iterations to converge as it employs large amounts of unlabeled data. After a few iterations, convergence for the following active learning steps improves significantly, requiring about as much time as the separate approach (Alg. 2) does.

4 Experimental Evaluation

We demonstrate the performance of our algorithms on the task of predicting the 3D layout of rooms from a single image. Existing approaches formulate the task as a structured prediction problem focusing on estimating the 3D box which best describes the layout. Taking advantage of the Manhattan world assumption (i.e., the existence of three dominant vanishing points which are orthonormal), and given the vanishing points, the problem can be formulated as inference in a pairwise graphical model composed of four random variables [27]. As shown in Fig. 1, these variables represent the angles encoding the rays that originate from the respective vanishing points. Following existing approaches [12, 17], we employ F = 55 features based on geometric context (GC) [13] and orientation maps (OM) [18] as image cues. 
Our features φ count, for each face in the cuboid (given a particular configuration of the layout), the number of pixels with a certain label for OM and the probability that such a label exists for GC; the task-loss ℓ denotes the pixel-wise prediction error.

Figure 1: Parameterization and factor graph for the 3D layout prediction task.

Figure 2: Test set error as a function of the number of random variables labeled, when using joint vs. separate active learning. The different plots reflect scenarios where the top k random variables are labeled at each iteration (i.e., batch setting). From left to right: (a) k = 1, (b) k = 4, (c) k = 8, (d) k = 12.

Performance is measured as the percentage of pixels that have been correctly labeled as left-wall, right-wall, front-wall, ceiling or floor. Unless otherwise stated, all experiments are performed by averaging over 20 runs of the algorithm, where the initial seed of 10 fully labeled images is selected at random.

Active learning: We begin our experimentation by comparing the two proposed active learning algorithms, i.e., separate (Alg. 2) and joint (Alg. 3). As shown in Fig. 2(a), both active learning algorithms achieve much lower test error than an algorithm that selects which variables to label at random. Also, note that the joint algorithm takes advantage of unlabeled data and achieves good performance after labeling only a few variables, improving significantly over the separate algorithm.

Batch active learning: Fig. 2 shows the performance of both active learning algorithms when labeling a batch of k random variables before re-learning. Note that even with a batch of k = 12 random variables, our algorithms quickly outperform random selection, as illustrated in Fig. 
2(d).

Image vs. random variable: Instead of labeling one random variable at a time, we also experiment with an algorithm that labels the four variables of an image at once. Note that this setting is equivalent to labeling four random variables per image. As shown in Fig. 3(a), labeling the full image requires more labeling to achieve the same test error performance when compared to labeling random variables from possibly different examples.

Importance of ε: Fig. 3(b) and (c) show the performance of our active learning algorithms as a function of ε. Note that this parameter is fairly important. In particular, when ε = 1, the entropy of most random variables is too large to be discriminative. This is illustrated in Fig. 3(d), where we observe a fairly uniform distribution over the states of a randomly chosen variable for ε = 1. Our active learning algorithm thus prefers smaller values of ε. We hypothesize that this is due to the fact that we have a small number of random variables, each having a large number of states. Our initial tests show that in other applications where the number of states is smaller (e.g., segmentation), larger values of ε perform better. An automatic selection of ε is subject of our future research.

Complexity separate vs. joint: In Fig. 4(a) we illustrate the number of CCCP iterations as a function of the number of queried examples for both active learning algorithms. We observe that the joint algorithm requires more computation initially. But after the first few active steps, i.e., after having converged to a good solution, its computation requirements reduce drastically. 
Here we use ε = 0.01 for both algorithms.

Figure 3: Test set error as a function of the number of random variables labeled: (a) image vs. variable, (b) ε separate, (c) ε joint. The marginal distribution of a randomly chosen variable is illustrated in (d) for different ε.

Figure 4: Number of CCCP iterations as a function of the amount of queried variables in (a), and time after a specified number of active iterations in (b) (joint) and (c) (separate).

Reusing computation: Fig. 4(b) and (c) show the number of finished active learning iterations as a function of time for the joint and separate algorithm, respectively. Note that by reusing computation, a much larger number of active learning iterations finishes when given a specific time budget.

5 Related Work

Active learning approaches consider two different scenarios. In stream-based methods [5], samples are considered successively and a decision is made to discard or eventually pick the currently investigated sample. In contrast, pool-based methods [20] have access to a large set of unlabeled data. Clearly our proposed approach has a pool-based flavor. Over the years many different strategies have been proposed in the context of active learning algorithms to decide which example to label next. While we follow the uncertainty sampling scheme [20, 19] using an entropy measure, sampling schemes based on expected model change [29] have also been proposed. 
Other alternatives are expected error reduction [24], variance reduction [4, 6], the least-confident measure [7] or margin-based measures [25].

An alternative way to classify active learning algorithms is related to the information revealed after querying for a label. In the multi-armed bandit model [1, 2] the algorithm chooses an action/sample and observes the utility of only that action. Alternatively, when learning with expert advice, utilities for all possible actions are revealed [3]. Between both of the aforementioned extremes sits the coactive learning setting [30], where a subset of rewards for all possible actions is revealed by the user. Our approach resembles the multi-armed bandit setting since we only get to know the result of the newly queried sample.

Active learning approaches have been proposed in the context of Neural Networks [6], Support Vector Machines [32], Gaussian processes [14], CRFs [7] and structured max-margin formulations [22]. Contrasting many of the previously proposed approaches, we consider active learning as an extension of a latent structured prediction setting, i.e., we extend the double-loop algorithm by yet another layer. Importantly, our active learning algorithm follows the recent ideas to unify CRFs and structured SVMs. 
It employs convex approximations and is amenable to general graphical models with arbitrary topology and energy functions.

The first application of active learning in computer vision was developed by Kapoor et al. [14] to perform object recognition with minimal supervision. In the context of structured models, [8] proposed to use conditional entropies to decide which image to label next in a segmentation task. In [36], the set of frames to label in a video sequence is selected based on the cost of labeling each frame and the cost of correcting errors. Unlike our approach, [8, 36] labeled full images (not sets of random variables). As shown in our experiments, this requires more manual interactions than our approach. GrabCut [23] popularized the use of "active learning" for figure-ground segmentation, where the question of what to label next is answered by a human via an interactive segmentation system. Siddiquie et al. [31] propose to label the variable which most reduces the entropy of the entire "system," i.e., all the data, by taking into account correlations between variables using a Bethe entropy approximation. In [15], the next region to be labeled is selected based on a surrogate of uncertainty (i.e., min-marginals) which is computed efficiently via dynamic graph cuts. 
This, however, is only suitable for problems that can be solved via graph cuts (e.g., binary labeling problems with submodular energies). In contrast, in this paper we are interested in the general setting of arbitrary energies and connectivities. Entropy was also used as an active learning criterion for tree-structured models [21], where marginal probabilities can be computed exactly.

In the context of video segmentation, Fathi et al. [9] frame active learning as a semi-supervised learning problem over a graph. They utilized the entropy as a metric for selecting which superpixel to label within their graph regularization approach. In the context of holistic approaches, Vijayanarasimhan et al. [35] investigated the problem of which task to label. Towards this goal they derived a multi-label multiple-instance approach, which takes into account the task effort (i.e., the expected time to perform each labeling). Vezhnevets et al. [34] resort to the expected change as the criterion to select which parts to label in the graphical model. Unfortunately, computing this measure is computationally expensive, and their approach is only feasible for graphical models where inference can be solved via graph cuts.

6 Conclusions

We have proposed active learning algorithms in the context of structured models which utilize local entropies in order to decide which subset of the output space of which example to label. We have demonstrated the effectiveness of our approach on the problem of 3D room layout prediction given a single image, and we showed that state-of-the-art performance can be obtained while only employing ∼10% of the labelings. We will release the source code upon acceptance, as well as scripts to reproduce all experiments in the paper. In the future, we plan to apply our algorithms in the context of holistic models in order to investigate which tasks are more informative for visual parsing.

References

[1] P. Auer, N. 
Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 2002.
[2] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The non-stochastic multi-armed bandit problem. SIAM J. on Computing, 2002.
[3] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning and Games. Cambridge University Press, 2006.
[4] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 1994.
[5] D. Cohn, L. Atlas, R. Ladner, M. El-Sharkawi, R. Marks II, M. Aggoune, and D. Park. Training connectionist networks with queries and selective sampling. In Proc. NIPS, 1990.
[6] D. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. J. of Artificial Intelligence Research, 1996.
[7] A. Culotta and A. McCallum. Reducing labeling effort for structured prediction tasks. In Proc. AAAI, 2005.
[8] A. Farhangfar, R. Greiner, and C. Szepesvari. Learning to Segment from a Few Well-Selected Training Images. In Proc. ICML, 2009.
[9] A. Fathi, M. F. Balcan, X. Ren, and J. M. Rehg. Combining Self Training and Active Learning for Video Segmentation. In Proc. BMVC, 2011.
[10] T. Hazan and A. Shashua. Norm-Product Belief Propagation: Primal-Dual Message-Passing for LP-Relaxation and Approximate-Inference. Trans. Information Theory, 2010.
[11] T. Hazan and R. Urtasun. A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction. In Proc. NIPS, 2010.
[12] V. Hedau, D. Hoiem, and D. A. Forsyth. Recovering the Spatial Layout of Cluttered Rooms. In Proc. ICCV, 2009.
[13] D. Hoiem, A. A. Efros, and M. Hebert. Recovering Surface Layout from an Image. IJCV, 2007.
[14] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell. Active Learning with Gaussian Processes for Object Categorization. In Proc. ICCV, 2007.
[15] P. Kohli and P. Torr.
Measuring Uncertainty in Graph Cut Solutions - Efficiently Computing Min-marginal Energies using Dynamic Graph Cuts. In Proc. ECCV, 2006.
[16] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, 2001.
[17] D. C. Lee, A. Gupta, M. Hebert, and T. Kanade. Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces. In Proc. NIPS, 2010.
[18] D. C. Lee, M. Hebert, and T. Kanade. Geometric Reasoning for Single Image Structure Recovery. In Proc. CVPR, 2009.
[19] D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. In Proc. ICML, 1994.
[20] D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In Proc. Research and Development in Info. Retrieval, 1994.
[21] T. Mensink, J. Verbeek, and G. Csurka. Learning Structured Prediction Models for Interactive Image Labeling. In Proc. CVPR, 2011.
[22] D. Roth and K. Small. Margin-based Active Learning for Structured Output Spaces. In Proc. ECML, 2006.
[23] C. Rother, V. Kolmogorov, and A. Blake. GrabCut - Interactive Foreground Extraction using Iterated Graph Cuts. In Proc. SIGGRAPH, 2004.
[24] N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proc. ICML, 2001.
[25] T. Scheffer, C. Decomain, and S. Wrobel. Active hidden Markov models for information extraction. In Proc. Int'l Conf. Advances in Intelligent Data Analysis, 2001.
[26] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Distributed Message Passing for Large Scale Graphical Models. In Proc. CVPR, 2011.
[27] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Efficient Structured Prediction for 3D Indoor Scene Understanding. In Proc. CVPR, 2012.
[28] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun.
Efficient Structured Prediction with Latent Variables for General Graphical Models. In Proc. ICML, 2012.
[29] B. Settles, M. Craven, and S. Ray. Multiple-instance active learning. In Proc. NIPS, 2008.
[30] P. Shivaswamy and T. Joachims. Online Structured Prediction via Coactive Learning. In Proc. ICML, 2012.
[31] B. Siddiquie and A. Gupta. Beyond Active Noun Tagging: Modeling Contextual Interactions for Multi-Class Active Learning. In Proc. CVPR, 2010.
[32] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. JMLR, 2001.
[33] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large Margin Methods for Structured and Interdependent Output Variables. JMLR, 2005.
[34] A. Vezhnevets, V. Ferrari, and J. M. Buhmann. Active Learning for Semantic Segmentation with Expected Change. In Proc. CVPR, 2012.
[35] S. Vijayanarasimhan and K. Grauman. Cost-Sensitive Active Visual Category Learning. IJCV, 2010.
[36] S. Vijayanarasimhan and K. Grauman. Active Frame Selection for Label Propagation in Videos. In Proc. ECCV, 2012.
[37] A. L. Yuille and A. Rangarajan. The Concave-Convex Procedure. Neural Computation, 2003.